Facial Landmark Data Collection to Train Facial Emotion Detection Learning Models

Journal for High Schoolers, 2021

Authors

Leevi Symister, Kaiser Williams, Roshan Prabhakar, Ganesh Pimpale, Tsachy Weissman

Abstract

In-person theater performances have become difficult due to COVID-19, so video conferencing software like Zoom has become an increasingly popular way to deliver live virtual performances. Such performances require that audience feedback be delivered to performers so that they can adapt their performance. However, typical audience feedback is not viable in a virtual setting. Intuition suggests that extracting feedback from an audience preoccupied with a performance, by requiring them to redirect their attention toward providing that feedback, will result in inadequate representations of the audience's emotional state. More authentic feedback can be gained by extracting facial expressions from a live webcam feed and analyzing the audience's emotions in real time.

Existing facial emotion recognition software maps a face mesh to a subject and derives emotional states through the use of CNNs and the Facial Action Coding System (the current standard for determining emotions from a facial state). These models are trained on a variety of images from publicly available datasets as well as images scraped from the web. This project aims to develop software that performs more tangible data collection as it relates to the audience-performer feedback loop: specifically, the collection of emotional expressions and corresponding reactionary expressions (clapping, booing, etc.). We aim to determine whether there exists a relationship between face mesh data collected across frames and reactionary motions during those frames which may not be easily observable through the webcam, by building a database of such data entries which may be used to train a machine learning model. Should a relationship be found, such a model could be deployed within virtual performance frameworks to enable live collection of audience feedback information. Our code is available in a GitHub repository: https://github.com/roshanprabhakar/af-datacollection.

Introduction

Background


Audience feedback refers to any form of audience reaction during a performance that can be interpreted by the performer. Live performances such as concerts and plays involve a dynamic exchange of emotional information formally known as the Audience-Performer Feedback Loop [13]. This constant interaction between the two parties allows the performer to adapt their performance based on the reactions the audience reciprocates, such as laughing, applauding, booing, smiling, and frowning. In virtual settings, however, the audience typically has their microphones turned off, so the typical feedback loop is impractical. In addition, existing software tools use low-level GUIs that synthesize artificial feedback (such as laughter, booing, cheering, or a thumbs-up) using buttons, which reduces the naturalness of feedback by dividing the audience's attention between watching and reacting to a performance. Thus, an alternative method to gain audience feedback is to analyze emotional information through facial detection and analysis and then provide that data to the performer in real time. Facial emotion detection can be done using only a live webcam and the camera features embedded in video conferencing software such as Zoom.

Technical concepts and project specifics

Facial emotion detection is generally divided into two tasks:

  1. Recognizing and detecting a human face in the video feed
  2. Detecting emotion through facial landmark analysis.

The first task utilizes Convolutional Neural Networks (CNNs), a class of machine learning models that take an input image and analyze its features to find patterns. Passing relevant kernels across an image's pixels [3] and performing multiplication operations between the kernel and the underlying pixels creates a new matrix. This results in a convolved layer in which high-level features such as edges have been extracted from the input image [12]. Next, the goal is to analyze the face and how certain features correspond to particular emotions. This is commonly achieved using the Facial Action Coding System (FACS) together with emotion recognition software built on another CNN. FACS was originally developed by Paul Ekman and Wallace Friesen [4] and defines a set of Action Units, each corresponding to a particular facial muscle group movement [2]. With these units, emotion recognition software can identify and group certain Action Units with a particular set of emotions [6]. Emotion recognition software packages differ in how they process images and Action Units [5], but the set of emotions they derive generally consists of around seven universal human emotions: happiness, sadness, anger, surprise, fear, disgust, and contempt [11]. For example, iMotions, a software platform for human behavior analysis, defines happiness as the combination of two Action Units: one that describes a cheek raise and one that denotes a lip corner pull [5].
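As a concrete illustration of the convolution step, here is a minimal sketch (our own, not taken from any library) of sliding a small kernel over a grayscale image in JavaScript:

```js
// Minimal 2D convolution sketch: slide a k x k kernel over a grayscale image
// (a 2D array of pixel intensities) and sum the element-wise products.
function convolve(image, kernel) {
  const k = kernel.length;
  const out = [];
  for (let r = 0; r + k <= image.length; r++) {
    const row = [];
    for (let c = 0; c + k <= image[0].length; c++) {
      let sum = 0;
      for (let i = 0; i < k; i++)
        for (let j = 0; j < k; j++)
          sum += image[r + i][c + j] * kernel[i][j];
      row.push(sum);
    }
    out.push(row);
  }
  return out; // the convolved feature map
}

// Example: a simple vertical-edge kernel.
const edgeKernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]];
```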

Modern facial analytics software, including iMotions, now integrates face detection and FACS-based emotion recognition into a single package [5]. This is usually done by using computer vision algorithms to map points to facial landmarks (a face mesh), which are then tracked and analyzed using deep learning to determine an emotion [5].

In particular, facial emotion recognition models require training with a large data set [15] in order to accurately detect emotions. Our software brings an interactive element to this data collection and training. Rather than algorithm-based data collection, we create a subject-observer model that collects the observer’s perceived emotion from a subject’s facial reactions. In an integrated environment, we extract the relevant facial feature/landmark information from each frame of a face mesh video feed, while concurrently having the observer analyze the same video and record what emotions he or she detects. The collected data entries will be used to develop an aggregate data set upon which a learning model may determine the mathematical relationship between face meshes and the corresponding reaction vector (this vector is determined by the human observer in our environment). Our project will be especially relevant in determining if certain facial landmark movements correlate to certain emotions and reactionary expressions of the body such as clapping, booing, etc.

Past & Related Work

Prior to the advancement of CNN models, a viable option for facial detection was the Viola-Jones algorithm. This algorithm detects the Haar-like features of a human face using cascade classifiers to identify human faces, primarily in still images, with a detection rate of about 95% [14]. While this method works in live video feeds, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time and is thus inefficient. Nonetheless, this algorithm paved the road for many of today's neural networks, which follow similar techniques [14, 10].

Materials, Libraries, Methods

We are creating a subject-observer software that facilitates the creation of a database which may be used to train a learning model that attempts to find a mathematical relationship between reactionary expressions less observable through a webcam, such as clapping, and corresponding facial expressions. This requires two independent tasks during data collection: the constant monitoring of a human subject’s face mesh (executed by CNN-based libraries) while the subject is watching and reacting to some sort of video (Facial Landmark Data Collection Interface), and the constant monitoring of the same subject’s emotional state conducted by a separate human observer (Observer Output Data Collection Interface). Subsequently, we will develop an integrated environment to allow these tasks to occur simultaneously and software that merges the resulting data of the two tasks to create entries for our database.

For face detection and facial landmark mapping, we employed the MediaPipe TensorFlow.js library called Face Mesh. It does not require a depth sensor; it only needs access to a webcam on the device being used. The library performs face detection with the BlazeFace CNN model, which is tailored for light computational load while remaining very fast [8]. BlazeFace produces a rectangular bounding box around the face along with a few facial keypoint coordinates that help it handle face rotations [1]. The cropped face image is then passed as input to the facial landmark neural network, which maps 468 (x, y, z) coordinates back onto the original uncropped image [8]. Since no depth sensor is used, the z coordinates are scaled in accordance with the x coordinates using weak perspective projection [9].
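The snippet below is a minimal sketch of how such a face mesh model is typically accessed from JavaScript; the package name and calls follow the published TensorFlow.js facemesh model and may differ slightly from the exact version used in our code.

```js
// Minimal usage sketch of the TensorFlow.js Face Mesh model.
import * as facemesh from '@tensorflow-models/facemesh';

async function trackFace(videoElement) {
  const model = await facemesh.load();                  // BlazeFace detector + landmark network
  const predictions = await model.estimateFaces(videoElement);
  if (predictions.length > 0) {
    const { mesh, scaledMesh } = predictions[0];        // 468 [x, y, z] points each
    return { mesh, scaledMesh };
  }
  return null;                                          // no face detected in this frame
}
```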

Facial Landmark Data Collection Interface

The first task consists of collecting facial landmark data and the timestamps of our collection and storing them in a file for later use. This way, when the data of the observer and the data of the face mesh are compared, we can learn which reactions correspond to certain facial landmark movements. To do this, we first need to access the predefined objects in the MediaPipe Face Mesh library.

The most important object in this library is the faceDetection object. It contains the 468 (x, y, z) coordinates of each point mapped to a face detected through the webcam, stored in the object's Mesh and Scaled Mesh properties. The Mesh property holds the facial landmark coordinates without normalization, while the Scaled Mesh property holds the normalized coordinates. In our program, the goal is to collect 100 packets, each consisting of 10 frames of Mesh data. Because we need this data stored in a local file, and the straightforward file representation of a JavaScript array is simply the ASCII-encoded, stringified representation of that array, we instead store our data in a byte buffer where every 4 bytes corresponds to one 32-bit float value from the faceDetection object. This reduces consumption from almost 30 bytes per number (2 bytes per character of the UTF-16 string representation) to just 4 bytes (the 32-bit floating point representation). As a string, our data would consume 538,260 bytes per packet; with the floating point representation we use only 56,168 bytes per packet, almost a 90% reduction in consumption.

To implement this storage solution, each frame of Mesh data and Scaled Mesh data is temporarily stored in arrays called faceMeshArray and scaledFaceMeshArray, respectively. The faceMeshArray is then mapped into an array buffer called meshBuffer, which is opened with a view called meshBufferView.

The length of meshBuffer is determined by multiplying the number of bytes per value (4 for Float32) by the number of dimensions per point (3), the number of points (468), and the number of frames of Mesh data per packet (10): 4 × 3 × 468 × 10 = 56,160 bytes.
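A minimal sketch of this allocation, using the variable names from this section (the writeFrame helper is our own, for illustration):

```js
// Allocate one packet's worth of mesh storage:
// 4 bytes per Float32 value * 3 dimensions * 468 points * 10 frames = 56,160 bytes.
const BYTES_PER_FLOAT = 4;
const DIMS_PER_POINT = 3;
const POINTS_PER_FRAME = 468;
const FRAMES_PER_PACKET = 10;

const meshBuffer = new ArrayBuffer(
  BYTES_PER_FLOAT * DIMS_PER_POINT * POINTS_PER_FRAME * FRAMES_PER_PACKET
);
const meshBufferView = new Float32Array(meshBuffer);   // 14,040 float slots

// Write one frame of Mesh data (an array of 468 [x, y, z] points) into the view.
function writeFrame(mesh, frameIndex) {
  meshBufferView.set(mesh.flat(), frameIndex * DIMS_PER_POINT * POINTS_PER_FRAME);
}
```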

Next, we store the timestamp data in our packet. The timestamp is recorded as the number of milliseconds elapsed since the Unix epoch (January 1, 1970).

This number is in the trillions, too large to represent exactly as a 32-bit value, so we use an array buffer (timeBuffer) of length 8 bytes to ensure that our program stores it correctly. We then open it with a Float64Array view (timeBufferView), which encodes each number (including decimal values) with 8 bytes. Because timeBuffer holds only 8 bytes, the Float64Array view has exactly one index, into which we store the time at which the packet's data is collected.

Next, we create a concatenated array buffer that stores the timestamp and Mesh data together. For this we use another array buffer called meshPacketBuffer, whose length is the combined byte length of timeBuffer (8 bytes) and meshBuffer (56,160 bytes), for a total of 56,168 bytes.

The next step is to write the timestamp bytes into the packet. To do this, we open timeBuffer with an Int8Array view, which exposes its 8 bytes as 8 one-byte indices. This lets us iterate over the view and copy each of the 8 bytes into meshPacketBufferView.

This means that the first 8 bytes of the meshPacketBufferView now store the timestamp data for that individual packet. The rest of the indices (56,160) can now be used to store the meshBuffer data which again represents 10 frames of mesh data collection.

Storing the mesh data involves the same process: we open meshBuffer with an Int8Array view, which exposes its contents as one byte per index for a total of 56,160 indices. We then step through each index of this view and copy it into meshPacketBufferView, remembering to offset by the 8 bytes that already hold the timestamp information in the packet.
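Continuing the sketch above, the packet assembly described in the last few paragraphs might look like the following; TypedArray.set is used in place of the explicit byte-copying loops, but the effect is the same.

```js
// Store the packet timestamp as a single Float64 (ms since the Unix epoch).
const timeBuffer = new ArrayBuffer(8);
const timeBufferView = new Float64Array(timeBuffer);
timeBufferView[0] = Date.now();

// Concatenated packet: 8 timestamp bytes followed by 56,160 mesh bytes = 56,168 bytes.
const meshPacketBuffer = new ArrayBuffer(timeBuffer.byteLength + meshBuffer.byteLength);
const meshPacketBufferView = new Int8Array(meshPacketBuffer);

// Copy the 8 timestamp bytes into indices 0-7, then the mesh bytes starting at index 8.
meshPacketBufferView.set(new Int8Array(timeBuffer), 0);
meshPacketBufferView.set(new Int8Array(meshBuffer), timeBuffer.byteLength);
```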

Before saving the data to a file, we push meshPacketBuffer onto an initially empty array called packetArray. Each index of packetArray thus stores the 56,168 bytes of information for one packet: the timestamp (8 bytes) and the Mesh data (56,160 bytes) covering 10 frames. The packetArray collects data until its length reaches 100, at which point it contains 1,000 frames of meshBuffer data and 100 timestamps.

The final step of this process is to save the collected packets to disk. To do this, we convert each packet to a string according to the byte → character ASCII mapping of each byte in the packet buffer, then write each packet to a file. We are currently in the implementation phase of this step.
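Since this step is still being implemented, the following is only a sketch of how it might look, assuming we reuse FileSaver.js [7] from the observer interface; the file name and helper functions are hypothetical.

```js
import { saveAs } from 'file-saver';   // FileSaver.js [7]

// Map each byte of every packet to its ASCII/Latin-1 character and join the result.
function packetsToString(packetArray) {
  let out = '';
  for (const packet of packetArray) {
    for (const byte of new Uint8Array(packet)) {
      out += String.fromCharCode(byte);
    }
  }
  return out;
}

// Download the serialized packets as a single text file (the file name is hypothetical).
function savePackets(packetArray) {
  const blob = new Blob([packetsToString(packetArray)], { type: 'text/plain' });
  saveAs(blob, 'mesh-packets.txt');
}
```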

Observer Output Data Collection Interface

Next, the human observer monitors the same subject's emotional state in real time. To collect the emotions the observer perceives from the video feed, we created a simple interface with HTML, CSS, and JavaScript that can be run on a live server. A window displays the live webcam feed of the subject, and below it are eleven buttons corresponding to the seven universal emotions and four common emotional expressions: “Laughing”, “Applause”, “Booing”, and “Crying”. When the page is first run, the webcam launches automatically using the JavaScript getUserMedia() function and displays the subject's webcam feed on the observer's computer, and the initial timestamp is collected simultaneously. As the webcam feed plays, the observer records the emotion they perceive from the subject's facial expression by pressing the corresponding button. When a button is pressed, the emotion and the number of milliseconds elapsed since the initial timestamp are recorded in a JSON object with the keys “action” and “timestamp” and appended to a JSON array that contains all the data. This allows easier access to the values later in the process.
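A minimal sketch of the interface logic described above; the element id, class name, and data attribute are our own assumptions, not the exact names used in our code.

```js
// Launch the subject's webcam feed and record the initial timestamp.
const video = document.getElementById('subject-feed');        // id is an assumption
const startTime = Date.now();
navigator.mediaDevices.getUserMedia({ video: true })
  .then(stream => { video.srcObject = stream; });

// Each of the eleven buttons appends an { action, timestamp } entry when clicked.
const observations = [];
document.querySelectorAll('.reaction-button').forEach(button => {
  button.addEventListener('click', () => {
    observations.push({
      action: button.dataset.action,                           // e.g. "Happiness", "Applause"
      timestamp: Date.now() - startTime                        // ms since the feed started
    });
  });
});
```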

Once the observation is complete and the “finish” button is pressed, this data must be stored in a file that can be combined with the data from Task 1 to form a data entry. To save space, we convert this JSON array into an array buffer, similar to Task 1. Additionally, each of the eleven emotion labels takes a fair amount of space, as each letter is a byte of information. Thus, we convert each emotion to a key (1-11) that requires only 4 bits before storing it in the buffer. While it would be ideal to save a binary file (a .bin file) containing the array buffers to the local device or cloud storage, JavaScript prohibits natively saving files to prevent malware installation. Our solution is to parse through the buffer array and convert the binary values to ASCII codes: an encoding system that translates 128 specific characters to and from seven-bit integer codes. This array is finally downloaded as a text file using an external library called FileSaver.js [7].

The figure below models the entire process. Note how each emotion is translated to the binary representation of its corresponding key value (e.g., Anger -> 5 -> 101). Also note that the binary representation of a timestamp value exceeds 7 bits; in this case, we split the binary representation into 7-bit chunks and assign an ASCII code to each chunk. The intervals at which the emotion and timestamp ASCII values are stored are recorded in the background for later processing. We are currently exploring an alternative method to create an exportable file type and store binary data more effectively.

[Figure: model of the observer data encoding process, showing emotions mapped to keys and timestamps split into 7-bit ASCII chunks]
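To make the encoding concrete, the following sketch is ours and only illustrative: the key numbering is assumed (apart from Anger -> 5, taken from the example above), and the bookkeeping of where each emotion and timestamp chunk is stored is omitted.

```js
// Map each reaction to a small integer key (numbering is illustrative,
// except Anger -> 5, which matches the example above).
const EMOTION_KEYS = {
  Happiness: 1, Sadness: 2, Surprise: 3, Fear: 4, Anger: 5, Disgust: 6,
  Contempt: 7, Laughing: 8, Applause: 9, Booing: 10, Crying: 11,
};

// Split a timestamp's binary representation into 7-bit chunks so that each
// chunk fits a single ASCII code (0-127).
function toSevenBitChunks(value) {
  const chunks = [];
  while (value > 0) {
    chunks.unshift(value % 128);        // low 7 bits
    value = Math.floor(value / 128);
  }
  return chunks.length ? chunks : [0];
}

// Encode one observation as ASCII characters: one for the key, then the timestamp chunks.
function encodeObservation({ action, timestamp }) {
  const codes = [EMOTION_KEYS[action], ...toSevenBitChunks(timestamp)];
  return codes.map(c => String.fromCharCode(c)).join('');
}
```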

Integrated Environment and Combining Data Files


Lastly, we are developing an integrated environment that will allow both tasks to run in parallel. To display the live webcam feed on the observer's screen, we will use the WebRTC API, which allows real-time video connection between two peers. So far, we have been able to connect the video feed between two different browsers on a local computer. Eventually, we will be able to access the video feed from another computer by using a third-party STUN server, which helps each device discover its ICE (Interactive Connectivity Establishment) candidates so they can be shared with the other peer. In the background, the two separate data files created in each task will be merged to create data entries for our aggregate dataset. This final program will read the two data files and connect the face mesh landmark movements with the reaction data perceived by the observer according to the timestamps. This aggregate data set can be used to find a correlation between certain face movements and emotions or emotional expressions, and may be used to train a machine learning model.
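As a purely hypothetical sketch of the merging step (the real file formats are still in development, and the packet structure shown here is an assumption), observer reactions could be attached to the mesh packet whose timestamp window contains them:

```js
// Hypothetical merge: attach each observer reaction to the mesh packet whose
// 10-frame window it falls into, producing one aggregate data entry per packet.
function mergeEntries(meshPackets, observations, observerStartTime) {
  return meshPackets.map((packet, i) => {
    const windowStart = packet.timestamp;                    // ms since the epoch
    const windowEnd = meshPackets[i + 1]?.timestamp ?? Infinity;
    const reactions = observations.filter(obs => {
      const absolute = observerStartTime + obs.timestamp;    // observer times are offsets
      return absolute >= windowStart && absolute < windowEnd;
    });
    return { meshFrames: packet.frames, reactions };         // one data entry
  });
}
```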

Results

The results of our experiment consist of accurate collection of both facial landmark and reactionary data while maintaining computational efficiency. For facial landmark detection, our array buffer storage solution achieved close to a 90% reduction in the number of bytes encoded per packet of facial landmark data. This is before any truncation of the decimal values comprising the facial landmark coordinates, which would reduce precision slightly but further relieve memory usage; with truncation, we expect nearly a 95% reduction in byte usage relative to the original UTF-16 encoding (2 bytes per character in a string representation).

Our other successes lie in the storage of reactionary and timestamp information on the observer side. By mapping a set of keys to each of the 11 reactions and then converting those keys to binary and finally to ASCII, we were able to cut our byte usage by at least 94%. We also reduced strain on memory by breaking the timestamp information into smaller components of at most 7 bits, which allowed us to convert the timestamp information to ASCII as well. Using the ASCII and key codes, we can reconstruct exactly what our timestamp and reactionary information was without needing to store as much data on the local device.

Conclusion

Our research revolved around the issue of audience feedback, in particular the Audience-Performer Feedback Loop and how it has been disrupted by factors such as COVID-19 and our increased reliance on video conferencing software such as Zoom. Our goal was to aid in the training of machine learning models and other algorithms surrounding facial landmark analysis and corresponding emotional responses. By creating software for collecting data on reactionary information not easily observable through a webcam, we hope to better train these models.

The results of our research comprise the development of software that, using the input of a webcam, maps facial landmarks onto an observed subject's face and stores the coordinate and timestamp information in array buffers. These buffers are then concatenated into a larger buffer called a packet. We also developed a GUI that allows an observer to watch the subject and input emotional reaction data simultaneously. The reactionary information is encoded as a set of keys which, along with the timestamps of that data, are converted to ASCII. Finally, we developed a mapping system to convert between the predefined reaction keys and ASCII values in order to reduce the storage cost of observer information.

During our research, we found that our custom array buffer solution for collecting landmark data achieves almost a 90% reduction in the number of bytes stored in memory and on the local device. With our custom key mapping solution for observer data collection, we achieved at least a 94% reduction in the number of bytes stored on the local device.

Future Directions

Our next goal is to further truncate the Mesh data in order to reduce computational strain while maintaining high point mapping precision. After this, our goal is to integrate both Task 1 and Task 2 into a single environment that runs the two tasks simultaneously. While doing so, we will also consider the ethical implications of facial tracking and data collection and will develop a solution that notifies users of exactly what is being collected and stored.

Once we complete development of the integrated environment and write software to create the data entries, we will begin populating entries for our aggregated data set. Typical facial emotion neural networks require thousands of training images for accurate results, so it will be necessary to perform many iterations of data collection with different subjects.

If a mathematical correlation is found between face mesh landmark movements and reactionary data (emotions and emotional expressions such as clapping, booing, etc.), we will optimize data collection by creating a website platform where people online can contribute to the dataset. In addition, we will refine the architecture of the network so that it is feasible to deploy in current audience-feedback solutions. If successful, this project can be used to enhance the precision of machine learning algorithms and other neural networks for facial landmark detection.

Acknowledgements

We would like to thank everyone who supported our project, including our mentors, Roshan and Ganesh. We would like to acknowledge Professor Tsachy Weissman of Stanford’s Electrical Engineering Department and the head of the Stanford Compression Forum for his support throughout this project. In addition, we would like to acknowledge Cindy Nguyen, the STEM to SHTEM Program Coordinator, for the constant check-ins and coordinating the many insightful events during the 8 week internship period. Thank you to all of the alumni, professors, and PhD students who presented research in a variety of fields in the past eight weeks. Lastly, thank you to past researchers and innovators; your work has helped and inspired our project.

References

  1. Bazarevsky, Valentin, et al. BlazeFace: Sub-Millisecond Neural Face Detection on Mobile GPUs. Google Research, 14 July 2019, https://arxiv.org/pdf/1907.05047.pdf.
  2. Coan, James, and John Allen. Handbook of Emotion Elicitation and Assessment. Oxford University Press, 2007.
  3. Riley, Sean. Detecting Faces (Viola Jones Algorithm) – Computerphile. Computerphile, 19 Oct. 2018, http://www.youtube.com/watch?v=uEJ71VlUmMQ.
  4. Ekman, Paul, and Wallace V. Friesen. “Measuring Facial Movement.” Paul Ekman Group, 1976, http://www.paulekman.com/wp-content/uploads/2013/07/Measuring-Facial-Movement.pdf.
  5. Farnsworth, Bryn. What Is Facial Expression Analysis? (And How Does It Work?). iMotions, 2 Oct. 2018, https://imotions.com/blog/facial-expression/.
  6. Farnsworth, Bryn. “Facial Action Coding System (FACS) – A Visual Guidebook.” iMotions, 2019, https://imotions.com/blog/facial-action-coding-system/.
  7. Grey, Eli. FileSaver.js. Github, 19 Nov. 2020, https://github.com/eligrey/FileSaver.js/.
  8. MediaPipe. MediaPipe Face Detection. Google, 2020, https://google.github.io/mediapipe/solutions/face_detection.html.
  9. MediaPipe. MediaPipe Face Mesh. Google, 2020, https://google.github.io/mediapipe/solutions/face_mesh.html.
  10. OpenCV. Cascade Classifier. Open CV, 11 Aug. 2021, https://www.docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html.
  11. Paul Ekman Group. “Universal Emotions.” Paul Ekman Group, https://www.paulekman.com/universal-emotions/.
  12. Saha, Sumit. “A Comprehensive Guide to Convolutional Neural Networks — the ELI5 Way.” Towards Data Science, 15 Dec. 2018, https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.
  13. Schwartz, Rene. “Extending the Purpose of the Audience-Performer Feedback Loop (APFL).” The Story Is Everything, 18 Mar. 2021, https://renemarcel.opened.ca/2021/03/18/chapter-four-conclusion/.
  14. Viola, Paul, and Michael Jones. Rapid Object Detection Using a Boosted Cascade of Simple Features. IEEE, 2001, https://ieeexplore.ieee.org/Xplore/home.jsp.
  15. Zijderveld, Gabi. “The World’s Largest Emotion Database: 5.3 Million Faces and Counting.” Affectiva, 14 Apr. 2017, https://blog.affectiva.com/the-worlds-largest-emotion-database-5.3-million-faces-and-counting.
