Fact Check: AI Fact Checking and Claim Correcting System

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Han Lee, Sukhamrit Singh


The creation of the internet sparked an age where information is available with the click of a button. Humans all around the world have been able to use this resource to their benefit. Recipes, travelling guides, political stances, anything that can be thought of can be found in this expansive cloud. With this rise, however, misinformation runs rampant, as false information spreads with the potential to persuade users. As a response to this, our group has developed an AI Fact Checking and Claim Correcting System that checks claims for their validity. Our project makes use of the Transformers NLP, which utilizes a large database to cross reference the information that is inputted. In order to score the validity of a claim, our project uses the DistilBERT and RoBERTa AI models. The models try to predict words which fit best in the context of the claim, and the more similar the predicted claim is to the original claim, the higher the accuracy rating the original claim gets. After scoring a claim, the Sentence Correction System corrects the original claim by replacing each word in it with the top prediction from the models. The corrected claim is then passed into the Wikipedia Tokenizer System, which converts the words from the corrected claim into tokens and searches for each token on Wikipedia. The data from the Wikipedia articles is stored and used to create a knowledge graph for the corrected claim. Our AI Fact Checking and Claim Correcting System is also extended to a mobile app called “Fact Check,” where the user can input their claim as a spoken query and the accuracy results will be shown.

Keywords: NLP, DistilBERT, RoBERTa, Sentence Correction System, Wikipedia Tokenizer System


Natural Language Processing (NLP)

Natural language processing is used to emulate human language. Through the use of context and repeated patterns, NLP attempts to predict the words that fit best within the context of a sentence so that its output resembles human language as closely as possible. Research on this form of AI began in the 1940s, when Weaver and Booth began to develop Machine Translation (MT) [7]. The early approach to MT was rigid: it relied only on dictionary lookups to replace words, and it soon became clear that this produced poor results. Later, in 1957, Chomsky proposed the idea of generative grammar, which stated that all sentences have a logical reasoning behind them [7]. This opened a whole new perspective on the topic, which has continued to develop to the present day. In the past decade, work on NLP has advanced rapidly, resulting in many innovations.

Knowledge Graphs

Knowledge graphs are large collections of data that are used for making connections between interrelated subjects. These connections are formed through the use of nodes, where each node is connected to other nodes carrying information on similar topics. To create these connections, large data sources, such as Wikipedia, are searched to collect as much information as possible on a specific subject. Thus, based on what is being searched for, node A may be connected to node B because of the similarities that they share. The multiple connections that can be formed between nodes eventually create a tree data structure, and a knowledge graph can contain several of these structures.

If the sentence “Barack Obama is the 44th president of the USA” is inputted to create a basic knowledge graph, the program would extract the key components of the sentence, such as “Barack Obama”, “44th president”, and “USA.” With these tokenized components, a simple graph is created:

Fig. 1 An example of a knowledge graph
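As a rough illustration of the structure in Fig. 1, the extracted components can be stored as (subject, relation, object) triples and turned into an adjacency map. This is only a minimal sketch: the function name `build_graph` and the relation labels "is" and "of" are assumptions made for the example, not the representation used by the project.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency map: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


# Hypothetical triples for the example sentence; the relation labels are assumed.
triples = [
    ("Barack Obama", "is", "44th president"),
    ("44th president", "of", "USA"),
]

graph = build_graph(triples)
for node, edges in graph.items():
    for relation, target in edges:
        print(f"{node} --{relation}--> {target}")
```

Each key in `graph` is a node, and each stored pair is a labelled edge to another node.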

Fact Checking

Fact checking is the process in which statements are checked for their validity. Once the whole of a claim is checked, it is classified as factually accurate or not. Traditionally, humans have been the ones to execute this process. Because of the amount of work required to complete such an arduous task, the process can take anywhere from a day to weeks. In order to lighten this load, the idea of using AI in this process has been introduced. However, this technology is still in its development phases, meaning it does not have the complete knowledge needed to accurately fact check claims covering a vast array of topics. Thus, because the AI can make errors, humans must still review its output and correct it.

Fact checking is done through many different components such as quote verification, title verification, position claims, and, most importantly, triples [14]. A triple consists of a subject, a predicate, and an object. These components of a sentence are used to cross reference information from a data set to verify information. We used the method of checking with triples in our project to judge which parts of a sentence were greater in significance and which parts would provide the most accurate validity ratings.

Related Works

Due to the rapid rise in the spread of misinformation, NLP models have seen continued use in the field of fact checking. Our group used the Transformers NLP throughout the entirety of our project. The Transformers NLP makes use of one of the most well-known inputs for NLP models, called triples. Another popular input that has been targeted is the textual claim. This input was the most important because of the role it plays in spreading misinformation. However, one difficulty that both our group and other researchers have encountered is the trouble that NLP models have with proper nouns included in these claims. Names are often hard to predict around because a name can be shared by many people around the world, which makes it difficult for the model to correctly predict words.

Another aspect of our project that we shared with other researchers was the use of knowledge graphs. When using NLP models, it is essential to provide a large amount of data against which cross referencing can be done. Knowledge graphs, as seen in Vlachos and Riedel’s research [16], help in the process of retrieving information when verifying the claim at hand.

Lastly, the NLP models that we used during our project were created by others before us. They fit our project best since they were made to predict words that are taken out of a claim. We used the DistilBERT and RoBERTa-Large models to predict numerical claims, general claims, and proper nouns.

DistilBERT is a mask-filling model from the Transformers NLP. It is one of many pre-trained models used to predict words based on the context of the sentence. DistilBERT was made to be a compact version of the original BERT model while retaining most of its prediction quality (about a 3% degradation from BERT, see Fig. 3). Because of its smaller size, DistilBERT returns similar outputs at a faster rate. The DistilBERT model was one of the models used in our project.

RoBERTa-Large (a larger version of the RoBERTa model) is another pre-trained mask-filling model from the Transformers NLP. This model generally produces more accurate predictions than the original BERT model because it was pre-trained on a much larger collection of English data (160 GB, see Fig. 3). The RoBERTa-Large model was the other model used in our project.

Methods and Materials



The “mask” keyword is a crucial part of our project. It tells the program where the AI model should predict the word that most accurately fits the context of the sentence or claim.

For example, if the mask-filling model is to complete the claim “David Beckham played soccer, a sport also referred to as <mask>,” the “mask” keyword in this case tells the model to predict the last word of the sentence. Based on the context of the claim, using its pre-trained knowledge, the AI model will predict the 5 words which best fit the sentence:

Fig. 2 Actual predictions from the model (predicted word and confidence score) for the input: “David Beckham played soccer, a sport also referred to as <mask>.”

The mask-filling model outputs its predictions in order from most confident to least confident. We can see that the first prediction is accurate and is the answer we want, but the rest are either completely irrelevant or repeated words. However, the AI model deserves some credit for the rest of the predictions since they are sports related, indicating that the model used the context of the sentence to determine that David Beckham is an athlete.

If our program were to always take the prediction with the highest rating and add it to the original claim, our output would be “David Beckham played soccer, a sport also referred to as football,” which is a factually correct claim.
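The selection of the top prediction can be sketched as follows. This is a minimal sketch, not the project's code: `top_predictions` is a hypothetical helper, and `placeholder_fill_mask` is a stand-in so the example runs without downloading a model. The stand-in returns dictionaries with `token_str` and `score` keys, which is the shape returned by a Hugging Face `pipeline("fill-mask", ...)` object; apart from the top word "football", the words and all scores are invented placeholders and are not the values from Fig. 2.

```python
def top_predictions(fill_mask, text, k=5):
    """Return up to k (word, score) pairs, most confident first.

    `fill_mask` is any callable that takes masked text and returns a list of
    dicts with "token_str" and "score" keys.
    """
    results = sorted(fill_mask(text), key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), r["score"]) for r in results[:k]]


def placeholder_fill_mask(text):
    # Stand-in for a real model so the sketch runs offline.
    # Words other than "football" and every score are invented placeholders.
    return [
        {"token_str": " rugby", "score": 0.2},
        {"token_str": " football", "score": 0.5},
        {"token_str": " cricket", "score": 0.1},
    ]


claim = "David Beckham played soccer, a sport also referred to as <mask>."
predictions = top_predictions(placeholder_fill_mask, claim)
best_word = predictions[0][0]
completed = claim.replace("<mask>", best_word)
print(completed)
```

With the `transformers` package installed, an object created by `pipeline("fill-mask", model="roberta-large")` could be passed in place of the stand-in. Note that the mask token depends on the model: RoBERTa uses `<mask>`, while BERT-family models such as DistilBERT use `[MASK]`.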

“DistilBERT vs. RoBERTa”

Attribute | DistilBERT | RoBERTa
Size (millions) | Base: 66 | Base: 110, Large: 340
Training Time | 4 times less than BERT | 4-5 times more than BERT
Performance | 3% degradation from BERT | 2-20% improvement over BERT
Data | 16 GB data | 160 GB data

Fig. 3 Data from [12]

Why are both of the models being used in our AI system? When it came to testing which model was best at giving the most accurate validity ratings, each model was better at some aspect than the other. For example, DistilBERT was better at predicting numbers based on the context of the claim whereas the RoBERTa-Large model was better at predicting general information and facts. With this in mind, we thought that using the knowledge from both would be most beneficial to our project. Utilizing both the AI models will allow our program to determine a validity rating for a claim with greater accuracy.

Accuracy Rating system

In order to judge the accuracy of a claim being inputted into our program, a system had to be created where each word is graded on how well it fits within the context of the sentence. The way this was done was by replacing each word from the input with the “mask” keyword. The mask-filling-models were then called to predict which words would best fit within the context of the sentence. If any of the predicted words matched the original word, a score would then be assigned for that word.
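The masking step can be sketched as below. This is an illustrative sketch only: the function name `masked_variants` and the regular expression used to split the claim are assumptions. The split treats punctuation as separate tokens, in line with special characters counting as individual words in the scoring.

```python
import re


def masked_variants(claim, mask_token="[MASK]"):
    """Return one (original_word, masked_claim) pair per token of the claim."""
    # Words and punctuation marks become separate tokens (an assumed tokenizer).
    tokens = re.findall(r"\w+|[^\w\s]", claim)
    variants = []
    for i, token in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        variants.append((token, " ".join(masked)))
    return variants


variants = masked_variants("The 2020 Olympics were held in Japan.")
for original, masked in variants:
    print(f"{original!r:12} {masked}")
```

Each masked claim would then be passed to a mask-filling model, and the model's predictions compared with the original word.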

The maximum score any word can achieve is a rating of 1. However, this is only possible if the model predicts the exact same word and it is the highest rated prediction. If the matching prediction is the second strongest prediction instead, 0.2 is subtracted from the score of 1. Essentially, the rating is given by multiplying the number of places the prediction is below the top prediction by 0.2 and subtracting that from 1. If the word from the original claim is never predicted, it is given a rating of 0.

\text{WordRating} = 1 - (0.2 \times \text{NumberOfPlacesFromTopPrediction}) \quad (1)

Original Claim: “The 2020 Olympics were held in Japan.”

Inputted Claim: “The 2020 [MASK] were held in Japan.”

Fig. 4 Scoring representation for one word (predicted words and confidence scores, using the DistilBERT model)

The word “olympics” is 3 places below the top predicted word.

\text{WordRating} = 1 - (0.2 \times 3) = 0.4 \quad (2)
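The rating rule can be written as a short function. This is a sketch under stated assumptions: `word_rating` is a hypothetical name, the comparison is case-insensitive (assumed, since the predicted word is shown in lowercase), and the prediction list is invented for illustration, with only the position of "olympics" (3 places below the top) taken from the example.

```python
def word_rating(original_word, predictions, step=0.2):
    """Rate a word by how far below the top prediction it appears."""
    lowered = [p.lower() for p in predictions]
    target = original_word.lower()
    if target not in lowered:
        return 0.0  # the original word was never predicted
    places_from_top = lowered.index(target)
    # Rounding avoids floating-point noise such as 0.3999999999999999.
    return round(max(0.0, 1 - step * places_from_top), 2)


# Invented prediction list; only the position of "olympics" follows the text.
predictions = ["games", "championships", "events", "olympics", "finals"]
print(word_rating("Olympics", predictions))  # 0.4
```

A word matching the top prediction receives 1.0, and a word that is absent from the list receives 0.0.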

After the AI models have gone through each word, all the scores are added together and divided by the total number of words in the claim (special characters also count as individual words).

\text{ClaimAccuracy} = \frac{\text{TotalScoresOfAllWords}}{\text{TotalNumberOfWords}} \quad (3)

Once the final rating is determined, it must pass an accuracy threshold, which is 0.85. If the total score for a given claim is 0.85 or above, the claim will be deemed “factually accurate,” and if the rating is below this threshold, the claim will be deemed “factually inaccurate.”
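The averaging and threshold steps can be sketched as follows. The function names are hypothetical, and the per-word ratings in the example are invented to show one claim on each side of the 0.85 threshold.

```python
ACCURACY_THRESHOLD = 0.85  # threshold stated in the text


def claim_accuracy(word_ratings):
    """Average the per-word ratings, as in Equation (3)."""
    if not word_ratings:
        raise ValueError("a claim must contain at least one word")
    return sum(word_ratings) / len(word_ratings)


def verdict(word_ratings, threshold=ACCURACY_THRESHOLD):
    """Classify a claim from its per-word ratings."""
    if claim_accuracy(word_ratings) >= threshold:
        return "factually accurate"
    return "factually inaccurate"


# Invented ratings for two eight-token claims.
mostly_matched = [1.0, 1.0, 0.4, 1.0, 1.0, 1.0, 1.0, 1.0]
partly_matched = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.4, 1.0]
print(claim_accuracy(mostly_matched), verdict(mostly_matched))
print(claim_accuracy(partly_matched), verdict(partly_matched))
```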

Sentence Correction and Wikipedia Tokenizer Systems

After a claim receives its accuracy rating, it is corrected. This is done by masking each word in the claim and replacing it with the top predicted word from the NLP models. Even the parts of a claim that are correct are masked and replaced. However, this does not matter: if they are truly accurate, the models should predict the same words, leaving the accurate parts of the original claim unchanged. Once a claim is corrected, it is passed to the Wikipedia Tokenizer System, where the corrected claim is broken into tokens. These tokens are then searched for on Wikipedia, and the lines in the articles for each token are stored. The stored information is then used to create a knowledge graph for the entire claim.
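The correction loop can be sketched as below. This is a minimal sketch with several assumptions: `correct_claim` is a hypothetical name, each word is masked in the original claim rather than in the partially corrected one (the text does not say which), and `toy_predict_top` is a stand-in that simply "knows" a reference sentence so the example runs without a model. The Wikipedia lookup is not shown.

```python
import re

MASK = "[MASK]"


def correct_claim(claim, predict_top):
    """Replace every token with the top prediction for its masked position."""
    tokens = re.findall(r"\w+|[^\w\s]", claim)
    corrected = []
    for i in range(len(tokens)):
        masked = " ".join(tokens[:i] + [MASK] + tokens[i + 1:])
        corrected.append(predict_top(masked))
    return " ".join(corrected)


def toy_predict_top(masked_claim):
    # Stand-in for a mask-filling model: it returns the word found at the
    # masked position of a fixed reference sentence (a toy assumption).
    reference = ["The", "2020", "Olympics", "were", "held", "in", "Japan", "."]
    return reference[masked_claim.split().index(MASK)]


corrected = correct_claim("The 2020 Olympics were held in China.", toy_predict_top)
print(corrected)
```

Tokens that the stand-in predicts unchanged stay as they were, and only the mismatched word is replaced.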

Fig. 5 Actual results from the AI Fact Checking System: example of an invalid claim

Fact Check Mobile App

The Fact Check mobile app is an extension of our AI Fact Checking and Claim Correcting system. The name, “Fact Check,” has been registered on the Apple App Store and the app currently has a working prototype.

Fig. 6 Screenshots of each screen from the Fact Check Mobile App

When the app is first launched, the Launch Screen is displayed to the user. From there, the user can switch between the Home Screen and the History Screen by switching tabs on the tab bar located at the bottom of the app (see Figure 6). When on the Home Screen, tapping the microphone button will allow the user to record a claim to be fed into the AI Fact Checking and Claim Correcting System. The app uses the SiriKit NLP from Apple to convert a user’s speech into text. Once the speech is converted to text, the claim is passed to our back end hosting the AI Fact Checking and Claim Correcting System. The claim is passed into the system, and the back end then returns the accuracy results, the corrected claim, and an image of a generated knowledge graph (see Figure 7). These results are then shown to the user in the Results Screen. The knowledge graph can be zoomed into to see more details and tapping the speaker icon on the Results Screen plays back the results to the user. The user can also access the History Screen by switching to the History Tab. The History Screen contains all the previously run claims by the app. The results of these previous claims can also be accessed by tapping on one of the claims (see Figure 6).

Fig. 7 Fact Check System Block Diagram


When testing the results of our AI Fact Checking and Claim Correcting System, we compared the results of both the DistilBERT and RoBERTa AI models. We documented these results and compared them with data from an online database such as Wikipedia. We found that each AI model on its own returned only partly accurate predictions. For example, the DistilBERT model was better at predicting numerical components of a claim, such as years or sizes, whereas the RoBERTa model was better at predicting general knowledge, such as who the first president of the United States was.

Knowing that both of these aspects of a claim are equally important, our group decided to use both models. We updated our system to follow certain conditions when using these AI models: when the current word of a claim is a number, use the DistilBERT model, and when the current word is a regular part of speech, such as a noun or verb (that is, not a number), use the RoBERTa model. After implementing these conditions, we again tested, documented, and compared the results of our AI Fact Checking and Claim Correcting System with information from Wikipedia. We found that our new results were significantly closer to the data from Wikipedia, indicating that our system is providing accurate results.
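The routing condition can be sketched as a small function. This is a sketch, not the project's code: `choose_model` and `is_number` are hypothetical names, the returned strings are plain labels rather than specific model checkpoints, and the rule for what counts as a number (digits with optional sign and decimal or thousands separators) is an assumption.

```python
import re

# Assumed definition of a numeric token: digits with an optional sign and
# optional decimal or thousands separators, e.g. "2020", "3.5", "1,000".
NUMBER_PATTERN = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def is_number(token):
    return NUMBER_PATTERN.fullmatch(token) is not None


def choose_model(token):
    """Pick which mask-filling model should predict this token."""
    return "DistilBERT" if is_number(token) else "RoBERTa-Large"


for token in ["The", "2020", "Olympics", "3.5"]:
    print(token, "->", choose_model(token))
```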

In the context of the current NLP and fact checking landscape, having to use two types of AI models indicates that current NLP systems are still in need of much work. The use of two models takes up valuable resources such as memory and space, resulting in longer execution times. Current and upcoming models need to be pre-trained with larger amounts of data so that they can provide predictions with greater accuracy and confidence and perform tasks that do not require additional NLP models.


The Fact Check System is being built to combat the widespread issue of misinformation. To achieve this, we created an AI Fact Checking and Claim Correcting System. This system takes a claim from the user and scores how accurate it is using the DistilBERT and RoBERTa AI models. It then corrects the claim using the predictions from these AI models and provides a knowledge graph using the corrected claim.

Though AI fact checking models already exist, they do not provide an end-to-end solution for people to use. With the Fact Check application we are building, we are able to provide a solution that users can interact with through a mobile app. The mobile app will not only provide an accuracy rating for a claim, but will also provide a corrected claim if applicable. The Fact Check mobile app is unique in that it is the first of its kind and it gives users the power of fact checking through something as portable and compact as their phone.

Future Directions

We plan to improve our AI Fact Checking and Claim Correcting System by adding a weighting system when it comes to determining a score of accuracy for an inputted claim. Some parts of a claim, such as nouns and verbs, are worth more if they are accurate than other words, such as articles. By adding this weighting system, the accuracy score of a claim is more representative of how accurate the real content of the claim is, overall making the accuracy scores more valid.
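One possible form of such a weighting, written in the notation of Equation (3), is a weighted average. The weights w_i are hypothetical and are not specified by the current system; for instance, nouns and verbs could receive larger weights than articles.

\text{WeightedClaimAccuracy} = \frac{\sum_{i} w_i \cdot \text{WordRating}_i}{\sum_{i} w_i}

When every w_i equals 1, this reduces to Equation (3).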

Another goal for the future is to add extra features to the Fact Check mobile app to give the user more freedom when inputting a claim. Currently, the Fact Check app only has one method of receiving a claim from the user, which is as a spoken query. However, this method is not always the most convenient. Thus, we plan to add an extra option where the user can simply type in their claim to pass to the back end hosting our AI Fact Checking and Claim Correcting System. We also aim to develop and release the mobile app not only for the Apple App Store but also for the Google Play Store. Creating an app for multiple platforms will give a wider audience the ability to fact check a variety of information and will act as a solution to the spreading of misinformation.

The source code for both the Fact Check mobile app and the AI Fact Checking and Claim Correcting System can be found at: https://github.com/Sukhamrit-Singh/Fact-Check


We want to express our utmost gratitude to our mentor Aadit Trivedi for introducing us to the topic of fact checking using AI and guiding us through our journey of creating our AI Fact Checking and Claim Correcting System. Additionally, we want to thank Professor Tsachy Weissman, Cindy Nguyen, and the Stanford Compression Forum for providing our group this internship opportunity. This program is a valuable stepping stone in our education on AI and without it, the creation of our AI Fact Checking and Claim Correcting System would not have been possible.


  1. Carterart. (2016, March 16). Hand Drawn Circle Shape Set Free Vector. Vecteezy. https://www.vecteezy.com/vector-art/108435-hand-drawn-messy-circle-shape-set.
  2. Bouziane, M., Perrin, H., Cluzeau, A., Mardas, J., & Sadeq, A. (2020). Team Buster.ai at CheckThat! 2020: Insights And Recommendations To Improve Fact-Checking. DEI – Unipd. http://www.dei.unipd.it/~ferro/CLEF-WN-Drafts/CLEF2020/paper_134.pdf.
  3. Ding, Y., Guo, B., Liu, Y., Liang, Y., Shen, H., & Yu, Z. (2021). MetaDetector: Meta Event Knowledge Transfer for Fake News Detection. arXiv. https://arxiv.org/pdf/2106.11177.pdf.
  4. Edrisian, A. D. (2016, August 9). Building a Speech-to-text app using speech framework in iOS 10. AppCoda. https://www.appcoda.com/siri-speech-framework/.
  5. Jones, W. (2015, August 17). Text-to-Speech in Swift in 5 lines. Medium. https://medium.com/@WilliamJones/text-to-speech-in-swift-in-5-lines-e6f6c6139086.
  6. Lazarski, E., Al-Khassaweneh, M., & Howard, C. (2021). Using NLP for Fact Checking: A Survey. MDPI. https://www.mdpi.com/2411-9660/5/3/42/pdf.
  7. Liddy, E. (2001). Natural Language Processing. Syracuse University. https://surface.syr.edu/cgi/viewcontent.cgi?article=1019&context=cnlp.
  8. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). roberta-base. Hugging Face. https://huggingface.co/roberta-base.
  9. Nakov, P., Corney, D., Hasanain, M., Alam, F., Elsayed, T., Barron-Cedeño, A., Papotti, P., Shaar, S., & Da San Martino, G. (2021). Automated Fact-Checking for Assisting Human Fact-Checkers. arXiv. https://arxiv.org/pdf/2103.07769.pdf.
  10. Ninjaprox. (2020). Ninjaprox/Nvactivityindicatorview: A collection of awesome loading animations. GitHub. https://github.com/ninjaprox/NVActivityIndicatorView.
  11. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2021). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv. https://arxiv.org/pdf/1910.01108.pdf.
  12. Suleiman Khan, P. D. (2021, May 18). BERT, RoBERTa, distilbert, XLNET – which one to use? Medium. https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8.
  13. Thorne, J., & Vlachos, A. (2017). An Extensible Framework for Verification of Numerical Claims. ACL Anthology. https://aclanthology.org/E17-3010.pdf.
  14. Thorne, J., & Vlachos, A. (2018). Automated Fact Checking: Task formulations, methods and future directions. ACL Anthology. https://aclanthology.org/C18-1283.pdf.
  15. Thorne, J., & Vlachos, A. (2021). Evidence-based Factual Error Correction. arXiv. https://arxiv.org/pdf/2012.15788v2.pdf.
  16. Vlachos, A., & Riedel, S. (2014). Fact Checking: Task definition and dataset construction. ACL Anthology. https://aclanthology.org/W14-2508.pdf.

An analysis of Uncertainty Quantification algorithms and Bias in Recommender Systems

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Alyssa Ho, Robert Beliveau, Shynn Lawrence, Vivek Alumootil, Dr. Ali


Companies across the world use recommendation systems to serve products, media, and ads tailored to each user, so that users are more inclined to use their services. Household names such as Spotify, Amazon, Netflix, Facebook, and Google all use such recommendation systems within their products. Because such systems are used so widely, we decided to investigate different methods of quantifying uncertainty about recommendations. Furthermore, we investigated correlations in the dataset we worked with, finding instances of unequal representation that could cause unethical bias towards different groups in recommendation system implementations.


Thousands of petabytes of data are processed daily, with thousands of multibillion-dollar companies using data to improve the lives of their customers. One of the most useful techniques software companies use is to recommend items or media to users based on their previous interests. Such a system is generally classified as a Recommender System and revolves around the idea that better recommendations usually lead to more satisfied customers (and higher profits).

There are many types of Recommender Systems, and several popular ones are used by many people every day. One of the most notable examples is the system used by Netflix. Netflix recommends movies based on a user’s previous ratings of different movies, demographic data collected on users, and different features of movies (the movie genre, movie length, etc.). Another popular example of a company using a Recommender System is Spotify. Spotify uses similar data about users, songs, and ratings (likes/dislikes) to recommend more music to its users. These technologies have been popularized in the areas of entertainment and socialization (particularly by Instagram, Facebook, and YouTube); however, they can also be extremely useful in aiding humans, especially in the areas of medicine recommendations, criminal sentence recommendations, and parole release systems.

When considering all of the impacts that Recommender Systems have on society, potential discrimination and bias must be carefully investigated. As many data scientists and statisticians observe, “Garbage in – Garbage out.” In other words, the algorithms used in data science are rarely inherently discriminatory, but pre-existing bias in data usually leads to biased models. For this reason, one of the topics our project focused on was finding prominent bias in popular datasets.

Another relatively unexplored area in the Recommender System world is Uncertainty Quantification: giving bounds for recommendations and quantifying the model’s uncertainty for different recommendations. A particularly interesting aspect of Uncertainty Quantification (UQ) techniques is the resulting model’s inference bound size, coverage (similar to accuracy), and validation time, all of which depend on the type of UQ system and the method of making recommendations.

Overall, our goal in this project was to observe bias in datasets and compare different uncertainty quantification techniques paired with different recommendation algorithms. In this new age of Big Data and AI, our project hopes to help elucidate different trade-offs software engineers and data scientists will encounter and the high possibility of discriminatory ML systems stemming from the use of biased data.


We used the MovieLens 1M ratings data. It included three separate tables: ratings, users, and movies. All unknown factors were changed to a -1 and ignored during training.

The next step was to change string values to integers. We did that as follows, categorizing gender, genre, and zip code:
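The original listing is not reproduced here. One hypothetical way to perform this kind of conversion is sketched below: each distinct string receives the next unused integer code, and missing values become -1, matching the handling of unknown factors described earlier. The function name `encode_categories` and the sample values are assumptions for illustration.

```python
def encode_categories(values, unknown=-1):
    """Map each distinct string to an integer code; missing values become -1."""
    codes = {}
    encoded = []
    for value in values:
        if value is None or value == "":
            encoded.append(unknown)
        else:
            # Assign the next unused integer the first time a value is seen.
            encoded.append(codes.setdefault(value, len(codes)))
    return encoded, codes


# Hypothetical sample column; the real dataset columns are much longer.
genders = ["F", "M", "M", None, "F"]
encoded_genders, gender_codes = encode_categories(genders)
print(encoded_genders, gender_codes)
```

The same function could be applied to a genre column or a zip code column, with each column getting its own code table.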

We hypothesized our model would do better with continuous data as we were using linear regression models, so we added random noise to our ratings. To see if our hypothesis was correct, we also tried the same models on discrete data to compare their effectiveness. This did not seem to make any difference.
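Adding noise to discrete ratings can be sketched as follows. The noise distribution and its size are not stated in the text, so uniform noise in the range ±0.5 is assumed here purely for illustration; `add_noise` is a hypothetical name, and unknown ratings (-1) are left untouched.

```python
import random


def add_noise(ratings, scale=0.5, unknown=-1, seed=0):
    """Return ratings with uniform noise in [-scale, scale] added to known values."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [r if r == unknown else r + rng.uniform(-scale, scale) for r in ratings]


ratings = [5, 3, -1, 4]
noisy = add_noise(ratings)
print(noisy)
```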

Lastly, we split the data in multiple ways. For Bootstrap models, ratings were split into a training and test set. This simplified graphic depicts the two ways we split data where red is train and green is test. Splitting the data by columns or rows did not make much of a difference as can be seen in our Bootstrap Averages Results Section.

Splitting by rows to find trends in movie ratings:

Movie ID / User ID 1 2 3 4
1 rating rating null null
2 null rating null null
3 rating null null rating
4 null null rating null

Splitting by columns to find trends in user ratings:

Movie ID / User ID 1 2 3 4
1 rating rating null null
2 null rating null rating
3 rating null null null
4 null null rating null

For Conformal Inference models, ratings were split into a training, validation and test set. This simplified graphic depicts the one way we split data where red is train, yellow is validation, and green is test.

Table 3

Movie ID / User ID 1 2 3 4
1 rating rating null null
2 null rating null null
3 rating null null rating
4 null null rating null
5 null rating null null
6 null null null rating


Before we explain our models, there is an important data bias we discovered that must be taken into account. The dataset consists of 1709 female users compared to 4331 male users. This gender disparity can negatively affect the movies predicted, as the model has been trained on mostly male data. We plan to continue the project to further observe whether this is the case. Despite this imbalance, the gender of a user does not seem to significantly impact what rating they would give a movie of a certain genre. We came to this conclusion after graphing a simplified dataset of 1475 men and 539 women against all 18 genres: Action, Adventure, Animation, Children’s, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western. It is important to note that the following graphs show only normalized distributions, so they do not reveal how few female ratings there are in total compared with male ratings.

As can be seen above, the percentages for each rating are very close for Film-Noir and Romance. Some genres, such as Fantasy and Musical, show a slightly greater difference.

However, the differences in height never exceed 0.05 for any genre. More testing will have to be done to see if these slight differences actually add a bias and change the results of prediction models.

Aside from that, there are some interesting trends we discovered from just plotting these points. For example, every graph except Documentary follows a similar curve with a peak at 4. In addition, males are more likely than females to give out the highest rating of 5: in every genre, the orange (female) bar at x=5 is lower than the blue (male) bar.


Bootstrap Averages:

This resampling method creates a rating prediction for the test set by taking the average of the known ratings in a training set, then resampling the training data with replacement over a series of trials and collecting the average from every trial. The mean of these collected averages is then used to fill in unknown ratings in the test dataset. To quantify the uncertainty of this method, the 10% and 90% quantiles were calculated from each matrix column of collected averages to generate an interval. Coverage was then measured as the fraction of test ratings that fell within that interval.
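For a single movie (or user), the procedure can be sketched as below. This is a simplified sketch rather than the project's implementation: the function names, the number of trials, and the toy ratings are assumptions, and a small linear-interpolation quantile helper is used in place of a numerical library.

```python
import random


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def bootstrap_average(train, trials=1000, seed=0):
    """Resample with replacement and return (point estimate, 10% and 90% quantiles)."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        resample = [rng.choice(train) for _ in train]
        means.append(sum(resample) / len(resample))
    means.sort()
    point = sum(means) / len(means)
    return point, quantile(means, 0.10), quantile(means, 0.90)


def coverage(test, low, high):
    """Fraction of test ratings that fall inside the interval [low, high]."""
    return sum(low <= r <= high for r in test) / len(test)


# Toy ratings for one movie, invented for illustration.
train_ratings = [4, 5, 3, 4, 4, 5, 2, 4]
test_ratings = [4, 3, 5, 4]
point, low, high = bootstrap_average(train_ratings)
print(point, low, high, coverage(test_ratings, low, high))
```

Because the interval describes the spread of resampled averages rather than of individual ratings, it tends to be narrow.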

Bootstrap Linear Regression:

The Bootstrap with Linear Regression method operates on each column of the dataset, splitting the column into two sections that represent the testing and training sets. The testing set is ⅔ of the column, while the training set is the remaining third; this keeps the datasets the same as in the other methods, which rely on testing, validation, and training sets. After the data is divided, the -1 values are removed from each set so that unknown ratings are not used. Movies with fewer than a certain number of ratings are then removed; the smallest this threshold can be is 10, as anything lower results in errors from the sets being too small. Each row that is removed for being a -1 (meaning that the user has not rated the movie) is removed from the user dataset as well, to prevent the bootstrap from selecting a user who has not rated that movie.

A bootstrap training set is then created with the same size as the original training set by selecting random rows and placing them into the bootstrap set. The same random indices are applied to the user dataset, so that the user rows match the selected rating rows, and are saved in an external array. From there, a linear regression model is created and fitted with the bootstrapped training set, and the model creates a prediction. That prediction is then compared with the testing set and scored according to how many of the test points fall within the prediction’s 90th quantile: the model receives a score of 1 if all of the points fall within it and 0 if none do. After this scoring, the program repeats the process of randomly selecting indices, placing them into external arrays, predicting values using linear regression, and scoring 25 times to ensure that the first result was not due to a randomly biased sample where every row was the same value.

Conformal Inference Averages:

For Conformal Inference with Averages, the dataset is initially split into thirds. The first third is used for training. In the case of recommendations with averages, every prediction is simply the average of all the ratings for the movie in the training dataset. The second third is used for validation: predictions are made for the known data, and the error of each prediction (the difference between the prediction and the actual value) is collected. Then, the 90th quantile of this collection is taken and used as the bound for all future recommendations. For example, if the 90th quantile was 1.5, then a recommendation of 3 for a movie would mean a 90% chance of the actual rating falling within [1.5, 4.5]. The final third of the dataset is used for testing, which is similar to validation, except that the quantile taken in the previous step is used to create bounds around each prediction. If the actual value is within the prediction ± quantile, the model is awarded a 1; if it is not within the bounds, it is given a 0. The average of all these 1s and 0s, which depends on the accuracy of the model bounds, should be nearly 90%. In other words, creating bounds through this method allows the model to recommend a range of values that, on average, contains the actual rating 90% of the time.
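The three steps can be sketched for a single movie as follows. This is a minimal sketch under assumptions: the function names and toy ratings are invented, the errors are taken in absolute value so that one quantile gives a symmetric bound, and the plain empirical quantile is used without the finite-sample correction sometimes applied in conformal inference.

```python
import math


def empirical_quantile(values, level=0.9):
    """Smallest value such that at least `level` of the values are <= it."""
    ordered = sorted(values)
    # The small epsilon guards against floating-point error in level * n.
    rank = max(1, math.ceil(level * len(ordered) - 1e-9))
    return ordered[rank - 1]


def conformal_average(train, validation, test, level=0.9):
    """Return (prediction, bound, coverage) for one movie's ratings."""
    prediction = sum(train) / len(train)                   # training third
    errors = [abs(prediction - r) for r in validation]     # validation third
    bound = empirical_quantile(errors, level)
    hits = [1 if prediction - bound <= r <= prediction + bound else 0
            for r in test]                                 # testing third
    return prediction, bound, sum(hits) / len(hits)


# Toy ratings for one movie, invented for illustration.
train = [4, 5, 3, 4]
validation = [4, 3, 5, 2, 4, 4, 5, 3, 4, 1]
test = [4, 5, 2, 3, 1, 4]
prediction, bound, cov = conformal_average(train, validation, test)
print(prediction, bound, cov)
```

With so few toy ratings the measured coverage is coarse; it approaches the target level only as the validation and test sets grow.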

Conformal Inference Linear Regression:

We split the data into three sets: training, validation and testing, split by rows (users). User features were read in from the data set, and each user had a user vector containing their age and gender (one-hot encoding). In the training set, for each movie, linear regression was done on the users who had rated that movie, producing a vector that maps user feature vectors to ratings for that movie. Thus, we ended up with a set of vectors, one for each movie. In the validation set, for each movie, the 90% quantile of the array of errors (defined differently for each submethod, but generally a difference between the predicted value and true rating) for that movie was calculated and stored. In the testing set, we used the stored quantiles to form confidence intervals: if our predicted value was p and the 90% quantile for the movie was q, then our interval was (p-q, p+q). We then calculated the coverage (the fraction of intervals we created that contained the true rating). This was predicted to be 0.9, since the quantile we took was 90%. Movies with fewer than a certain number of ratings were ignored.


Bootstrap Averages With Continuous Ratings:

We decided to split the dataset into training and test with two different methods to see which one would work better. The first was by rows, so the model would need to find averages by each movie, while the second was by columns, so averages were found for each user. Although bootstrap by users does slightly better on coverage, splitting data by columns or rows does not seem to make a drastic impact.

Compared to other models, the interval sizes generated by Bootstrap averages were very small. This is because the calculated averages for each movie/user are very similar, many differing only in the decimals. The quantiles are therefore also close to each other, so the size (the difference between the quantiles) is very small. This is also why our coverage is low: the majority of the test data cannot fall in such a small interval. This is a serious problem, as we want our coverage to meet the 90% mark.

The graphs also show a large number of outliers. We believe this is due to the large number of movies with few known ratings. Because the number of ratings differs so much from movie to movie, the coverage for each movie reflects this variation, producing many outliers. To test this, we increased the variable r, the minimum number of ratings a movie needed. Movies that did not exceed r were dropped from the database. The following two graphs support this explanation: the higher the value of r, the fewer outliers appear on the plots.

Although increasing r does produce fewer outliers, it does not fix the problem of our small coverage and size. We suspected the problem was calculating the averages and how similar they were to each other. Therefore, we decided to just get rid of that step and calculate quantiles from the list of resampled, whole number data. It worked.
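The difference between the two approaches can be seen in a small sketch. It assumes synthetic ratings for a single movie and 5th/95th percentile bounds; the variable names are hypothetical and this is not the project code.

```python
import random

def percentile(values, q):
    """Simple rank-based percentile for q in [0, 1]."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]

rng = random.Random(1)
# Synthetic whole-number ratings (assumed) for one movie on a 1-5 scale.
ratings = [rng.choice([1, 2, 3, 3, 4, 4, 4, 5, 5]) for _ in range(200)]

boot_means, boot_values = [], []
for _ in range(100):
    sample = [rng.choice(ratings) for _ in ratings]  # resample with replacement
    boot_means.append(sum(sample) / len(sample))
    boot_values.extend(sample)

# Bounds from the resampled averages: the averages barely differ, so the
# interval is narrow and few individual ratings can fall inside it.
mean_size = percentile(boot_means, 0.95) - percentile(boot_means, 0.05)

# Bounds from the resampled whole-number ratings themselves: much wider.
value_size = percentile(boot_values, 0.95) - percentile(boot_values, 0.05)
```

Here `mean_size` is a fraction of one rating point while `value_size` spans several points, which matches the narrow intervals and low coverage reported for the averaged version.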

Bootstrap with Linear Regression:

Surprisingly, combining both Bootstrap and Linear Regression results in a lot fewer outliers than the other methods. This does come with the downside that every point is within the range of 0.0 and 1.0, definitely not as confident as the Bootstrap with Averages. Taking a look at the sizes now, the maximum size being a 4 and the lowest being 0, it reflects almost exactly what the coverage showed. This broad spectrum of both quantiles and coverages shows that this method of combining Bootstrap and Linear Regression is not the most optimal or accurate in any sense of the word. The average coverage and average sizes were both around the center of the spectrum, showing almost a bell-curve like representation with bootstrap and linear regression.

Even with changes in R, the number of movies dropped from the analysis for not having enough ratings, the only result that changed was the size, which dropped slightly, to around 2.1 instead of 2.34. If R was between 0 and 100, the linear regression model would not be fitted correctly, allowing for a complete misjudgement of the general trend.

Outside of the R changes and the general inaccuracies of the coverage and sizes, the runtime of the program was a separate concern, ranging from an hour to a few days. This runtime primarily comes from the CVXPY library: for each movie that had enough ratings, each user who rated it, and each bootstrap resampling, a problem would be created, fitted to a linear regression model, solved, and used for prediction, with the process taking around 10µs. CVXPY is optimized for solving complex problems one at a time, but the amount of repetition required multiplies the time taken. This can be seen by comparing two bootstrap resampling amounts, 25 and 100. When the bootstrap resampling amount was 25, meaning that there would be 25 resamples per user, the program had a runtime of around 16 hours. When the resampling amount was 100, the program had a predicted runtime of 3 days, assuming the same R values as in the 25-resample run.

Overall, this method resulted in the highest size distribution and the lowest average coverage out of the other methods mentioned, not including the runtime, storage cost, and maintenance the system requires.

Conformal Inference with Averages:

Conformal Inference Coverage And Sizes, full quantile.
Mean Coverage: 0.9
Mean Size: 2.69

Interestingly, there are many outliers at the low end of the Coverage graph. This is because there are a high number of movies with few ratings; for example, some movies have been rated by fewer than 10 people. These movies easily become outliers, as their low number of ratings pulls their average either very high or very low, and thus the recommendations for them are not as accurate. An alternative way of presenting the testing coverages and sizes is to calculate the accuracy of the model only on the movies that have more than 100 ratings in the validation dataset:

Conformal Inference Coverage And Sizes, full quantile.
Mean Coverage: 0.9
Mean Size: 2.60

Note how in this case, the mean in the Coverage box-plot is actually somewhat higher than 0.9, though the actual average coverage is 0.9. This is because matplotlib (the Python library used for plotting the data) is plotting the average coverage of each movie, not the total coverage. In other words, it is plotting the “mean of means”, which is not the mean of the data as a whole.
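A small worked example makes the distinction concrete. The counts below are hypothetical and chosen only to show how a few sparsely rated movies can pull the per-movie average away from the overall figure.

```python
# Hypothetical per-movie test results: (number of ratings, number covered).
movies = {
    "A": (1000, 880),  # many ratings, coverage 0.88
    "B": (10, 10),     # few ratings, coverage 1.00
    "C": (10, 10),     # few ratings, coverage 1.00
}

# What a per-movie box plot summarizes: each movie counts equally.
per_movie = [covered / total for total, covered in movies.values()]
mean_of_means = sum(per_movie) / len(per_movie)  # 0.96

# Overall coverage: each rating counts equally.
total_ratings = sum(total for total, _ in movies.values())
total_covered = sum(covered for _, covered in movies.values())
overall = total_covered / total_ratings  # 900 / 1020, about 0.882
```

The two numbers differ because movie A contributes 1000 of the 1020 ratings but only one third of the per-movie average.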

A different approach to conformal inference is to take the quantiles from the validation set on a per-movie basis: each movie has its own quantile, calculated from its errors in the validation set. This should, theoretically, make the resulting prediction bounds more accurate, as they are fine-tuned for each movie. Another possible addition is to widen the bounds based on the number of ratings in the validation set. For example, a movie with only one rating in the validation set would have larger bounds, because the model has less confidence when validation covered only one data point for that movie. This is done by multiplying the per-movie quantile by 1 + 1/n, with n being the number of ratings in the validation set for that movie. Using this technique, the previously mentioned coverage outliers almost completely disappear, as the bounds are better suited to movies with few ratings. Indeed, the following data corroborates this:
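The per-movie quantile with the 1 + 1/n adjustment can be sketched as below. The function names and the validation errors are hypothetical, and the percentile helper is a simple rank-based version chosen for brevity.

```python
def percentile(values, q):
    """Simple rank-based percentile for q in [0, 1]."""
    ordered = sorted(values)
    return ordered[round(q * (len(ordered) - 1))]

def inflated_bounds(errors_by_movie, q=0.9):
    """errors_by_movie: {movie_id: [absolute validation errors]}.

    Returns the per-movie quantile multiplied by (1 + 1/n), where n is the
    number of validation ratings for that movie.
    """
    bounds = {}
    for movie, errors in errors_by_movie.items():
        n = len(errors)
        bounds[movie] = percentile(errors, q) * (1 + 1 / n)
    return bounds

# Hypothetical validation errors for a well-rated and a barely-rated movie.
errors_by_movie = {
    "popular": [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2],
    "obscure": [1.0],
}
bounds = inflated_bounds(errors_by_movie)
```

For the movie with ten validation ratings the bound grows by only 10%, while the movie with a single rating has its bound doubled.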

Conformal Inference Coverage And Sizes, movie-specific quantiles.
Mean Coverage: 0.94
Mean Size: 2.63

As seen in these box and whisker charts, the number of coverage outliers is extremely low, and the overall average coverage is about 4 percentage points higher (0.94 versus 0.90) due to the “fine-tuning” (movie-specific quantiles) with this method.

Conformal Inference with Linear Regression:

We split the data by rows (by users) into the three sets: training, validation, and testing.

Most of the data below uses the following set sizes:

(m1 = 5000, m2 = 500, m3 = 500, n = 1000, a = 0.1, z = 60)

m1, m2 and m3 are the sizes of the training, validation and testing sets, respectively. n is the number of movies, and a is the alpha level. z is the cutoff level for the number of movie ratings (movies with < z ratings are ignored). We chose m1 to be much larger than m2 and m3 since empirical evidence seems to show placing most of the 6000 available users into the training set produces the lowest average sizes for a given coverage level.

Each regression takes in a user feature vector, which contains the user’s gender and age. Empirical evidence seems to show that including additional information (e.g., occupation, ZIP code) isn’t very helpful.

Each algorithm differs by the loss function of the regression done. Quantile Regression is not conformal in the same way as the other three, but it is still similar. Here are descriptions of each algorithm:

Least-Squares: This method maps a vector of user features to a rating for a particular movie by minimizing the mean of the squares of the rating errors.

Least-Abs-Mean: This method does the same as above, but it minimizes the mean absolute value of the errors instead of the mean square.

Quantile Reg: This method uses gradient boosting regression (scikit-learn) to produce a 10% and 90% quantile for the value of a rating. It does not do conformal inference in the same way as the first two methods.

Class-Conformal: This method takes in a user feature vector and outputs probabilities of a rating of 1, 2, 3, 4 and 5. In the validation set, this method found the 10% quantile for the probability of the true rating (that is, if the 10% quantile was x, then 90% of the time, the true rating will be predicted by the model to have a probability of $\geq x$). In the testing set, this was then used to make discrete sets of possible ratings (such as {1, 2, 4}).
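The Class-Conformal procedure can be sketched as follows, assuming some classifier has already produced a probability for each rating. The probability tables, the function names, and the simple rank used for the 10% quantile are all hypothetical stand-ins.

```python
def calibrate_threshold(probabilities, true_ratings, alpha=0.1):
    """Return the alpha-quantile of the probability given to each true rating."""
    scores = sorted(p[r] for p, r in zip(probabilities, true_ratings))
    return scores[int(alpha * (len(scores) - 1))]

def prediction_set(probs, threshold):
    """All ratings whose predicted probability is at least the threshold."""
    return {rating for rating, p in probs.items() if p >= threshold}

# Hypothetical validation output: the same probability table for ten
# (user, movie) pairs, paired with each pair's true rating.
table = {1: 0.05, 2: 0.10, 3: 0.25, 4: 0.40, 5: 0.20}
valid_probs = [table] * 10
valid_truth = [4, 4, 4, 3, 3, 5, 5, 4, 2, 3]

threshold = calibrate_threshold(valid_probs, valid_truth)

# Testing: keep every rating the model considers at least that probable.
test_probs = {1: 0.15, 2: 0.30, 3: 0.05, 4: 0.45, 5: 0.05}
ratings_set = prediction_set(test_probs, threshold)  # {1, 2, 4}
set_size = len(ratings_set)                          # 3
```

The resulting set need not be contiguous, which is why its size is counted in ratings rather than measured as an interval width.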

Least-Squares, Least-Abs-Mean and Class-Conformal were very similar. They all had close to an average size of 2.8 and coverage of 90%. Performance improves considerably when z is increased; however, since the model should be able to predict ratings in all movies, not just those with a large number of ratings, we chose to keep z at 60, which is already quite high.

An important note should be made about the sizes: these sizes are not directly comparable to some of the other methods since they describe discrete sets rather than continuous intervals. That is, these methods produced sets such as {1, 2, 4} with a size of 3, rather than an interval like [1.2, 3.9] with size 2.7. In fact, the former way generally resulted in much larger sizes. When these methods were done the latter way, we found sizes of around 1.9, which is more comparable to the sizes of the other methods.

The motivation for choosing other methods of linear regression was that least-squares seeks to minimize the sum of the squared error, while we actually seek to minimize the 90th percentile of the error. It wasn’t clear that least-squares regression is the best option. Moreover, we hypothesized that using other loss functions for regression would make the expansion of the user-feature vectors successful (including more information, such as their profession or ZIP code). Unfortunately, this did not happen, although further research needs to be done to investigate the relationship between information in the user-feature vectors and the success of a model.


In terms of both size and coverage, linear regression with conformal inference was the most successful method. This was expected, since both averaging methods don’t take into account user information. Coverage ended up being quite low for both bootstrap methods, and we suspect this is a result of the bootstrap sampling changing the distribution. Both bootstrap methods were also very slow compared to the conformal inference methods. Conformal inference with averaging was reasonably successful, but did not produce intervals as small as those of conformal inference with linear regression. However, averaging was significantly faster than the other methods. For five possible ratings, conformal inference with linear regression managed to give sets of size 2.8 with 90% coverage. While this is reasonably good, there is still significant room for improvement.

Future Directions

Fairness: The first step is to delve further into evaluating the bias in our data and seeing if it translates into the predictions we make. We currently only analyzed gender vs genre bias in our pre-existing database. We plan to use the same methods to analyze the predictions our linear regression models made and then compare them. Other potential biases like age, occupation, and location still need to be tested. These objectives also open social and philosophical questions that we hope to further discuss and examine such as whether predictions should even use gender as a factor. If movies were recommended solely by what the user previously rated high, a model could be unbiased to their gender but still recommend good results. The only problem is, this requires data a new user to the streaming service wouldn’t have. In addition, if we were to erase any gender bias in our dataset, would it significantly weaken our model? But one could argue that with bias, the recommended movies could contribute to gender stereotypes. Still, even if perfect predictions could be made without gender bias, wouldn’t it be helpful in some cases? For example, what if there is an educational documentary on feminine products and female empowerment. Would it be useful to add this to a female user’s list even if they don’t usually watch documentaries? One could argue that this should also be recommended to males as all genders need to be aware of diverse ideas and that society shouldn’t taboo these topics and further divide genders into two distinct categories. This leads into another problem: If you only feed people what they like, then it can get kind of one-sided and build a community of ignorance. Perhaps instead of being biased, a recommendation system can have some degree of randomness to keep the selection diverse. On the other hand, this will not attract revenue as many customers may not enjoy it as much. 
These sorts of ethical questions are quite interesting to us and we hope to look into them in the future.

Computational Statistics:

While the various methods of implementing recommender systems vary in their accuracy and strength, they also vary in their speed. In our own methods, we found that the conformal inference methods were much faster than the bootstrap methods. Further analysis needs to be done to investigate the relationship between accuracy and time with our methods and other methods, such as those involving neural networks and machine learning.

Application: We plan to continue this project after the program ends by implementing the best models we researched to produce a working recommendation system on a website or other application. We’ll first need to find some simple problem that our application could help with that will attract users to use it. Perhaps it will be a website that will find similar movies for a user to watch that also uses our uncertainty quantification methods or we could use a different database entirely.

Neural Networks: We also plan to investigate modern methods of creating recommender systems, such as using neural networks. These methods are expected to be more powerful than our current methods and should provide better predictions.


  1. https://arxiv.org/pdf/1905.03222.pdf
  2. http://www.ec.tuwien.ac.at/~dimitris/research/recsys-fairness.html
  3. https://berthuang.com/papers/yao-nips17.pdf
  4. https://netflixtechblog.com/artwork-personalization-c589f074ad76
  5. http://www.cs.columbia.edu/~jebara/6998/hw2.pdf
  6. http://www.cs.columbia.edu/~jebara/6998/dataset.txt
  7. http://www.yisongyue.com/courses/cs159/lectures/mab.pdf
  8. http://www.yisongyue.com/courses/cs159/lectures/LinUCB.pdf
  9. http://rob.schapire.net/papers/www10.pdf
  10. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/10/fskd11-1.pdf
  11. https://heartbeat.fritz.ai/recommender-systems-with-python-part-iii-collaborative-filtering-singular-value-decomposition-5b5dcb3f242b
  12. https://alyssaq.github.io/2015/20150426-simple-movie-recommender-using-svd/
  13. https://analyticsindiamag.com/singular-value-decomposition-svd-application-recommender-system/
  14. https://beckernick.github.io/matrix-factorization-recommender/

A Survey of Deep Learning Applications and Transfer Learning in Medical Image Classification

Journal for High Schoolers, Journal for High Schoolers 2021


Eugenia Druzhinina, Joyce Lu, Napoleon Vuong, Ethan Liang


Over the past few decades, artificial intelligence has become increasingly popular in the medical sector. Deep learning, a subset of artificial intelligence, has played an essential role in the formation of computer vision. This paper specifically considers convolutional neural networks and transfer learning for image classification. Currently, medical imaging modalities such as MRI and X-rays are used to detect and diagnose neurological disorders including Alzheimer’s disease, brain tumors, and other pathologies. A convolutional neural network that can identify key features of an image and differentiate between different pathologies may potentially assist clinicians and researchers in medical image interpretation and disease diagnosis. Despite significant improvements in deep learning for medical image classification, limitations in medical image dataset size hinder the development of robust networks. To better understand this issue, we investigated deep learning and its applications in medical imaging through a review of published literature. Transfer learning was then identified and explored as a possible solution to countering dataset limitations through the testing of various convolutional neural network models. We found that lowering the learning rate and increasing the epoch count in our models increased performance stability and accuracy.


Artificial intelligence (AI) seeks to imitate human intelligence and behavior through machines [1]. It is broadly applied to many sectors, including the medical field. By utilizing AI in medicine, we can potentially automate tedious and repetitive processes, lower overall workload, and increase efficiency. Machine learning, a subset of AI, is the process of acquiring knowledge through data and learning from the data to improve systems and algorithms. Deep learning is part of machine learning and features deep (multi-layered) networks. A standard deep learning model (also known as a neural network) consists of an input layer and an output layer, with hidden layers in between. Data input to each hidden layer is transformed for input into the next layer.

Convolutional Neural Networks

Neural networks that contain at least one convolutional layer are called convolutional neural networks. Other subsets of the hidden layers include activation layers and pooling layers. A convolutional layer abstracts features from small sections of the training data [2]. Such layers progressively abstract and identify increasingly specialized features. In the case of image recognition, identified features would include lines, edges, and more. In the pooling layer, the matrix size is reduced through means such as max pooling. As the matrices decrease in size, the number of parameters is reduced, which results in benefits such as increased computing speed and decreased chances of overfitting. While pooling layers are optional, they aid in finding the maximums and averages of values in each region of the feature maps. In the table below, a brief overview of key and relevant convolutional neural network architectures is provided.





LeNet
  • First successful convolutional neural network [3]
  • Uses gradient-based learning for document recognition [4]

AlexNet
  • First convolutional neural network for image recognition and classification
  • Uses parameter optimization strategies, dropout, and ReLU [3]

ZFNet
  • Improved version of AlexNet
  • Uses parameter optimization and feature visualization [3]

GoogLeNet
  • Reduces parameters from 136 million to 4 million [4]
  • Uses inception block and bottleneck layer [4]

Inception V3
  • Version 3 of GoogLeNet
  • Shrinks filter size [4]

Xception
  • Part of the Inception family
  • More efficient use of Inception V3 parameters [5]
  • Uses depth-wise separable convolutions [5]

ResNet
  • Emphasized depth for image recognition [4]
  • 3.6% classification error rate [6]
  • Uses residual blocks [3]

ResNet V2
  • Version 2 of ResNet
  • Uses skip connections [7]

DenseNet
  • Solves the vanishing gradient problem through cross-layer connectivity [3]

VGG19
  • Depth: 19 layers [8]
  • 16 convolutional layers and 3 fully-connected layers [9]
  • Smaller (1×1) filters for lowered computational complexity [3]

Table 1 Overview of Select Convolutional Neural Network Architectures
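The max pooling step described in the Convolutional Neural Networks section above can be illustrated with a short sketch. It assumes a 2×2 window with stride 2 on a small hand-written feature map; the function name and the values are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 on a list-of-lists matrix.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool_2x2(feature_map)  # [[6, 2], [2, 9]]
```

The 4×4 map shrinks to 2×2, so any layer that follows has a quarter as many inputs to process.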

Neural Network Architectures for Medical Imaging

Convolutional neural networks preserve spatial structure and are therefore commonly used in deep learning for medical imaging. Other networks employed in medical imaging include stacked autoencoders and deep belief networks. A stacked autoencoder consists of an input layer and an output layer, with hidden layers in between that encode and decode data [2]. The encoding layers use convolutional layers to compress data into a lower-dimensional representation, and the decoding layers reconstruct the compressed representation as close as possible back to the original data [10]. Due to their compression and reconstruction functions, stacked autoencoders are excellent for improving accuracy in the classification of raw data [10]. Deep belief networks have multiple restricted Boltzmann machine layers. Restricted Boltzmann machines consist of a visible (input) layer and a hidden (output) layer [11]. As opposed to a feedforward network where neurons are acyclic, neurons within restricted Boltzmann machine layers are interconnected [2]. Restricted Boltzmann machines can reduce data dimensionality and initialize weights for training [2].

Applications in Medical Imaging

Deep learning is used in the processing and analysis of medical images produced from modalities such as magnetic resonance imaging (MRI), computerized tomography (CT), and positron emission tomography (PET). Deep learning is employed in image detection, registration, segmentation, and classification [2], as feature analysis is of interest to those applications. Image detection consists of detecting lesions from tissues of interest [2]. Image registration is a part of image preprocessing and aids in clinical diagnosis by superimposing two or more images to provide a more complete and cohesive picture for diagnosis [2]. Image segmentation is the process of categorizing parts of an image into different regions based on their features (such as bone vs. tissue or gray matter vs. white matter). Image classification is essential to automated disease diagnosis and consists of learning features that are related to diseases and classifying them as such. In this report, we focus on image classification. An overview of three examples in image classification and a description of their architectures are given below. These include a deep Boltzmann machine for Alzheimer’s disease classification, a convolutional neural network for Alzheimer’s disease classification, and a deep belief network for schizophrenia classification.

In 2013, Suk et al. [12] trained a multi-modal deep Boltzmann machine using images from MRI and PET scans for the classification of Alzheimer’s disease. Using a latent feature representation and a stacked autoencoder, shared low-level features were found and combined with other non-latent features [2]. This method achieved an accuracy of 98.8% in the classification of Alzheimer’s disease and healthy controls.

Sarraf and Tofighi [13] classified Alzheimer’s disease using convolutional neural networks and fMRI data. LeNet-5, a convolutional neural network, was used due to its advantages in both feature extraction and classification. The convolutional layer performs high-quality feature extraction and discrimination, and the complex architecture enables classification. A 96.86% Alzheimer’s versus healthy control classification accuracy was achieved using LeNet-5, which is a major improvement from the support vector machine’s classification accuracy of 84%.

Deep learning was also utilized to extract MRI features and to classify schizophrenia. Pinaya et al. [14] used a multilayer network by combining a pre-trained deep belief network to find high-level latent features indicative of schizophrenia from the MR images and a softmax layer to fine-tune the network and classify the images. This deeper network was able to capture more complex information, which resulted in better classification performance. This network achieved an accuracy of 73.6%, which is significantly higher than the support vector machine’s accuracy of 68.1% for the same classification problem.

Challenges and Solutions in Medical Imaging

While significant advances in deep learning for medical imaging applications have been made, limitations in acquiring sufficiently large and comprehensive datasets present a major challenge. The size of a dataset directly influences the quality of the network that it trains. Although a sizable amount of medical imaging data is generated each year, access to the data is limited due to patient privacy concerns and regulations (such as HIPAA) [17].

Additionally, most deep learning networks employ supervised learning. In medical imaging datasets, a specialized professional (such as a radiologist) would be needed to annotate each image by hand so that the deep learning network can learn the true label. Given that datasets must be very large to properly train neural networks, the process of image acquisition is lengthy and costly.

It has also been noted that in currently available medical imaging datasets, pathological data is rare [17]. This class imbalance, in which there is a significantly larger amount of imaging from healthy controls than from pathological subjects, leads to difficulty in choosing an appropriate neural network, which ultimately results in poorer performance [2].

Three solutions have been proposed to address this dataset limitation issue. First, undersampling can be used to rebalance the pathological versus normal control distribution by deleting or merging images [15]. Second, oversampling, which is the process of generating new images from existing data, can be used to address both class imbalance and small dataset sizes [2]. Using two publicly available datasets, researchers at MGH & BWH Center for Clinical Data Science, NVIDIA, and Mayo Clinic developed a machine learning network to generate synthetic MR images with brain tumors [16]. Beyond providing additional sources of pathological data that can improve network accuracy, synthetic generation of images in oversampling can be used as an anonymization tool, addressing patient privacy concerns in datasets. Finally, transfer learning can be used to train a network despite insufficient data [17]. In transfer learning, a neural network is first trained using a large dataset such as CIFAR-10 or ImageNet. The top layers of the network are then re-trained and fine-tuned on the smaller dataset of interest. Given that medical imaging datasets tend to be small, transfer learning is one of the most popular and effective methods of training neural networks in medical applications. In the next section, we demonstrate the effectiveness of transfer learning through training multiple networks with two datasets of 75 and 251 images, respectively.
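Of the three remedies above, random undersampling is the simplest to sketch. The example below assumes one particular strategy (deleting randomly chosen majority-class images until the classes match) and uses hypothetical file names and labels; it does not cover the merging variant or synthetic oversampling.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly drop examples until every class matches the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        balanced.extend((s, label) for s in rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Hypothetical dataset: 90 healthy-control scans and 10 pathological scans.
samples = [f"scan_{i:03d}.png" for i in range(100)]
labels = ["healthy"] * 90 + ["tumor"] * 10
balanced = undersample(samples, labels)
counts = {label: sum(1 for _, l in balanced if l == label)
          for label in ("healthy", "tumor")}
```

The cost of this approach is visible in the numbers: 80 of the 100 images are discarded, which is one reason oversampling and transfer learning are attractive when data is already scarce.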

Methods and Materials


Transfer Learning

The concept of transfer learning in artificial neural networks is taking knowledge acquired from training on one particular domain and applying it to learn a separate task [18].


Epochs

The number of passes through an entire training dataset [19].

Learning Rate

A hyperparameter used in the training of neural networks that has a small positive value, often between 0.0 and 1.0. It controls how quickly or slowly a neural network model learns a problem [20].

Batch Size

The number of training examples utilized in one iteration [21].

Validation Accuracy

Accuracy of the model on unseen (validation) data after the model has been trained with the training data.

Validation Loss

Loss of the model on unseen (validation) data after the model has been trained with the training data.

Training Accuracy

Accuracy of the model on the data it was trained with.

Training Loss

Loss of the model on the data it was trained with.


Overfitting

A problem in machine learning that introduces errors in real-world situations. Noise and meaningless data are taken into account in prediction or classification. Overfitting tends to happen when training datasets are too small or include parameters and/or unrelated features correlated with a causal feature of interest [22].

Table 2 List of Relevant Keywords and Descriptions

To become familiar with deep learning architectures, we implemented and tested five convolutional neural network models (Inception V3, DenseNet201, ResNet152V2, Xception, and VGG19) using a transfer learning template [23]. The template was designed as “a high-level introduction into practical machine learning for purposes of medical image classification” [23]. We used two small datasets designed for binary classification to compare the accuracy and performance of the models as well as to identify the potential sources of inaccuracy within our results.

We used Google Colab and a collection of Python libraries to implement and evaluate the five models. TensorFlow and Keras were essential in the establishment of the architectures, while NumPy and Matplotlib were used to visualize the data from our testing. We chose our convolutional neural network models from the most up-to-date and popular models available through the Keras library.

The template provided the necessary code to classify abdominal and chest X-ray scans from a 75 image dataset (65 training, 10 validation) with preset hyperparameters. The default model used in the template is InceptionV3. After testing the default model, we experimented with four different architectures: DenseNet201, ResNet152V2, Xception, and VGG19. We adjusted the hyperparameters to reduce fluctuations in loss and accuracy over the entirety of each run (which will be shown in the Results section). Hyperparameter alterations include decreasing the template’s preset learning rate from $1 \cdot 10^{-4}$ to $1 \cdot 10^{-5}$ and increasing the number of epochs from 20 to 40.
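To make the effect of the epoch change concrete, the short calculation below counts weight updates for the 65 training images. The batch size of 5 is a hypothetical value chosen for illustration, since the template’s batch size is not stated here, and the function name is also hypothetical.

```python
import math

def updates_per_run(n_train, batch_size, epochs):
    """Return (iterations per epoch, total weight updates over the run)."""
    per_epoch = math.ceil(n_train / batch_size)
    return per_epoch, per_epoch * epochs

# 65 training images; batch size 5 is an assumed value.
before = updates_per_run(65, 5, 20)  # (13, 260)
after = updates_per_run(65, 5, 40)   # (13, 520)
```

Doubling the epochs doubles the number of updates, which gives a model trained with a ten times smaller learning rate more steps in which to converge.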

We ran each of the models with the hyperparameter adjustments listed above to identify the best-performing model. This was done by examining the highest average training and validation accuracy as well as the lowest average training and validation loss. We then trained VGG19, the best-performing model, on a larger and less ideal 251 image dataset [24], in which the images were sized differently. This 251 image dataset (221 training, 30 validation) contained MRI brain scans of healthy controls and MRI brain scans of subjects with tumors. This dataset allowed us to continue to work with a binary classification problem. The purpose of training the model on a second dataset was twofold. First, it would allow us to better gauge the performance of the model in an application closer to a real-world scenario. Second, it would help us pinpoint potential sources of inaccuracy. This was done by comparing VGG19 results to the ResNet model (a very popular deep learning model and the most consistently performing of the five models). A copy of our codebases can be found for the 75 image dataset here [25] and the 251 image dataset here [26].


We used a learning rate of 1 \cdot 10^{-4} and 20 epochs on the 75-image dataset. The results of the five models using those hyperparameters are shown on the left. An adjusted learning rate of 1 \cdot 10^{-5} and 40 epochs were then applied to the five models; the results are shown on the right.

From Figures 1 through 10, we can see that the adjustments to the hyperparameters were essential in the development of a model that learns from the data that it was trained on. This enabled us to analyze and evaluate the performance of those models. A significant improvement was seen in the VGG19 and DenseNet201 models, as high fluctuation and unpredictability across epochs were experienced before hyperparameter optimization. After the hyperparameter adjustments, we see that the VGG19 model performed the best, having the highest average training and validation accuracy as well as the lowest average training and validation loss. We continued to use this model on the larger 251 image dataset to see how well it would perform when dataset size was scaled up. The figures below show the results of the VGG19 model and the ResNet152V2 model.

Upon further examination of Figure 11 and Figure 12, we observe that VGG19 did not experience consistent performance with the increase in the dataset size. This would initially lead us to argue that an increase in dataset size causes such an inconsistency, but after seeing the contradictory performance from the ResNet152V2 model, this argument is no longer valid. Therefore, factors outside of dataset size must be affecting the results of our testing, hindering the consistency and accuracy of even our previously best-performing model.

Given our experience with hyperparameter adjustments, we believe that non-optimized hyperparameters may be the leading cause of inconsistent performances across our models. This is due to the significant impact that our hyperparameter adjustments had on the original dataset for the VGG19 and DenseNet201 models. We believe that further tuning of the hyperparameters within the template could lead to more consistent results not only for VGG19 but for the rest of the convolutional neural network models as well.


From the transfer learning template results, we conclude that factors other than dataset size cause fluctuation in convolutional neural network performance. These may include hyperparameter values and choices, among other factors. Additionally, we demonstrate that a reduction in learning rate (from 1 \cdot 10^{-4} to 1 \cdot 10^{-5}) improves performance in terms of both accuracy and loss across models. This experiment also demonstrates the efficacy of transfer learning on small datasets.

Future Directions

To further improve performance results on the models tested, we plan to optimize hyperparameters such as batch size, learning rate, epoch count, and more. The choice to focus on hyperparameter optimization comes from our results in Figures 3, 4, 9, and 10, which demonstrate the significant impact that learning rate and epoch count have on producing accurate and consistent data. Testing can be conducted through trials of different hyperparameter values and analysis of subsequent results to determine the optimal combination. These adjustments will likely produce more consistent and accurate validation results and will additionally decrease the probability of overfitting.


We would like to extend our deepest gratitude to our mentor, Ethan Liang, for his guidance, support, and time, which were essential to this project. We would also like to thank Professor Tsachy Weissman for providing us with this opportunity by founding the STEM to SHTEM program, Professor Stephen Boyd for his role in the development of this program, and Cindy Nguyen for directing and coordinating this program.


  1. W. Samek, T. Wiegand, and K.-R. Müller, “Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models,” arXiv.org, 28-Aug-2017. [Online]. Available: https://arxiv.org/abs/1708.08296. [Accessed: 12-Jul-2021].
  2. J. Liu, Y. Pan, Z. Chen, L. Tang, C. Lu, and J. Wang, “Applications of Deep Learning to MRI Images: A Survey,” IEEE Xplore Full-Text PDF: Mar-2018. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8268732. [Accessed: 13-Aug-2021].
  3. A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the recent architectures of deep convolutional neural networks.” Artificial Intelligence Review, vol. 53, no. 8, pp. 5455-5516, 2020, doi: 10.1007/s10462-020-09825-6.
  4. S. Yeung. Lecture 5 | Convolutional Neural Networks – YouTube. (2017). Accessed: Aug. 06, 2021. [Online Video]. Available: https://www.youtube.com/watch?v=bNb2fEVKeEo.
  5. F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800-1807, doi: 10.1109/CVPR.2017.195.
  6. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Xplore, 10-Dec-2015. [Online]. Available: https://ieeexplore.ieee.org/document/7780459/. [Accessed: 13-Aug-2021].
  7. S.-H. Tsang, “Review: ResNet – winner OF ILSVRC 2015 (Image Classification, Localization, Detection),” Towards Data Science, 15-Sept-201. [Online]. Available: https://towardsdatascience.com/review-resnet-winner-of-ilsvrc-2015-image-classification-localization-detection-e39402bfa5d8. [Accessed: 06-Aug-2021].
  8. D. Garcia-Gasulla, F. Parés, A. Vilalta, J. Moreno, E. Ayguadé, J. Labarta, U. Cortés, and T. Suzumura, “On the behavior of convolutional nets for feature extraction,” Journal of Artificial Intelligence Research, vol. 61, pp. 563–592, 2018.
  9. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” dblp, 2015. [Online]. Available: https://dblp.org/rec/journals/corr/SimonyanZ14a.html. [Accessed: 13-Aug-2021].
  10. V. K. Jonnalagadda, “Sparse, stacked and Variational Autoencoder,” Medium, 06-Dec-2018. [Online]. Available: https://medium.com/@venkatakrishna.jonnalagadda/sparse-stacked-and-variational-autoencoder-efe5bfe73b64. [Accessed: 13-Aug-2021].
  11. P. Canuma, “What are rbms, deep belief networks and why are they important to deep learning?,” Medium, 23-Dec-2020. [Online]. Available: https://medium.com/swlh/what-are-rbms-deep-belief-networks-and-why-are-they-important-to-deep-learning-491c7de8937a. [Accessed: 13-Aug-2021].
  12. L. HI, L. SW, and S. D, “Latent feature representation with stacked auto-encoder for AD/MCI diagnosis.,” Europe pmc, 22-Dec-2013. [Online]. Available: https://europepmc.org/article/med/24363140. [Accessed: 13-Aug-2021].
  13. S. Sarraf and G. Tofighi, “Classification of alzheimer’s disease using fmri data and deep learning convolutional neural networks,” arXiv.org, 29-Mar-2016. [Online]. Available: https://arxiv.org/abs/1603.08631. [Accessed: 14-Aug-2021].
  14. W. H. Pinaya, A. Gadelha, O. M. Doyle, C. Noto, A. Zugman, Q. Cordeiro, A. P. Jackowski, R. A. Bressan, and J. R. Sato, Using deep belief network modelling to characterize differences in brain morphometry in schizophrenia, Sci. Rep., vol. 6, p. 38897, 2016.
  15. J. Brownlee, “How to Combine oversampling and Undersampling for imbalanced classification,” Machine Learning Mastery, 10-May-2021. [Online]. Available: https://machinelearningmastery.com/combine-oversampling-and-undersampling-for-imbalanced-classification/. [Accessed: 13-Aug-2021].
  16. H. C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image synthesis for data augmentation and anonymization using generative adversarial networks,” Mayo Clinic, 01-Jan-1970. [Online]. Available: https://mayoclinic.pure.elsevier.com/en/publications/medical-image-synthesis-for-data-augmentation-and-anonymization-u. [Accessed: 13-Aug-2021].
  17. A. S. Lundervold and A. Lundervold, “An overview of deep learning in medical imaging focusing on MRI.” Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102-127, 2019, doi: 10.1016/j.zemedi.2018.11.002.
  18. E. Chmiel, “Transfer learning: Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/transfer-learning-1?lang=us. [Accessed: 13-Aug-2021].
  19. F. Gaillard, “Epoch (machine learning): Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/epoch-machine-learning?lang=us. [Accessed: 13-Aug-2021].
  20. J. Brownlee, “How to configure the learning rate when training deep learning neural networks,” Machine Learning Mastery, 06-Aug-2019. [Online]. Available: https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/. [Accessed: 13-Aug-2021].
  21. F. Gaillard, “Batch size (machine learning): Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/batch-size-machine-learning?lang=us. [Accessed: 13-Aug-2021].
  22. C. M. Moore, “Overfitting: Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/overfitting?lang=us. [Accessed: 13-Aug-2021].
  23. P. Lakhani, “Paras42/Hello_World_Deep_Learning: Hello world introduction to deep learning for medical image classification,” GitHub, 16-Apr-2018. [Online]. Available: https://github.com/paras42/Hello_World_Deep_Learning. [Accessed: 06-Aug-2021].
  24. N. Chakrabarty, “Brain mri images for brain tumor detection,” Kaggle, 14-Apr-2019. [Online]. Available: https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection. [Accessed: 06-Aug-2021].
  25. N. Vuong, “Google Collaboratory – HWD_7_Models_Data,” Google Colab, 05-Aug-2021. [Online]. Available: https://colab.research.google.com/drive/1fbx9hfLIMJyNSf6T_VoGrZISGp0BpkpN?usp=sharing. [Accessed: 14-Aug-2021].
  26. N. Vuong, “Google Colaboratory,” Google Colab, 06-Aug-2021. [Online]. Available: https://colab.research.google.com/drive/1IgqnOBHL3H_GgBlSwPLZMDpeXlaikoBn?usp=sharing. [Accessed: 14-Aug-2021].

Learning to Play Tic Tac Toe via Reinforcement Learning

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Ethan Cao, Jocelyn Ho, Sameeh Maayah, Maria Rufova, Adithya Devraj


Machine learning algorithms are implemented in a wide variety of modern applications to train computers to make efficient predictions and decisions based on their observations of past data. Reinforcement learning is a paradigm of machine learning that is concerned with making the most beneficial decision in a particular environment in order to maximize the reward. The Q-learning algorithm is a reinforcement learning algorithm that solves decision-making problems by assigning different scores to different decisions at each state of the problem. When applied to a game such as Tic-Tac-Toe, the goal of the algorithm at each board position is to assign a numerical score to each move, where the score is proportional to the chance of winning. This trains the machine to perform the most beneficial moves by simultaneously exploiting already learned information and exploring its possibilities by constantly trying out new game moves.

Using Q-learning to map board positions to numerical values through iterative training, we have trained an artificial Tic-Tac-Toe opponent to look up the Q values for all possible decisions in the current environment and choose the one with the highest value. With the example of Tic-Tac-Toe, we examine how reinforcement learning allows machines to implement dynamic programming techniques to maximize their performance with minimal human intervention and assistance.


In 2015, AlphaGo became the first computer program to beat a human professional player at Go, an ancient Chinese game with over 10^{172} possible board positions [1]. In the following years, AlphaGo continued to improve by training against both human and artificial opponents, eventually defeating numerous top-ranked players around the world [1]. AlphaGo thus shows that artificial intelligence (AI) can reach a high level of expertise in a field usually dominated by human players. Approaching problems from the distinct point of view of AI makes it possible to realize many applications that once existed only in fiction, such as autonomous robot agents, trading systems, business intelligence, and chat bots [4].

To explore a similar automated approach to solving real-world problems, we decided to look into board games. In contrast to video games, board games present better environments for training AI opponents because they have exact rules and a well-defined set of possible outcomes [4]. Because the AI knows all potential actions, it can calculate exact outcomes from the state of the board. This allows the AI to reach its goal by figuring out how to maximize utility. Out of the many board games that exist today, we looked into the well-known and easily comprehensible game of Tic-Tac-Toe. Although Tic-Tac-Toe has fewer than 360,000 possible board positions, far fewer than Go, it is still complex enough to be a worthwhile problem for deep neural networks [4].

In general, there are at least two ways for a machine to reach a certain outcome from a series of actions: decision trees and reinforcement learning [5]. A decision tree contains branches that represent specific conditions, with the outcomes located at the ends of the branches [5]. At each node of the tree, an action is taken that determines which condition is reached. However, a decision tree requires storing all patterns of a game, which takes up an enormous amount of memory. Therefore, some use an extension of the decision tree called the Min-Max algorithm, which estimates how good an action is through backtracking. In this algorithm, each player tries to make its best move (which is the worst move for the other player), and a value is assigned to the outcome [5]. The depth of the tree is set according to the number of moves the players have made. To optimize the AI’s performance, an evaluation function and the tree’s depth are used to estimate the final value of the match. Nevertheless, the Min-Max algorithm requires a massive state space, and there are no evaluation functions that are efficient enough. Hence, we used reinforcement learning, which automatically balances exploration of unknown pathways against exploitation of current knowledge, to train the machine to play Tic-Tac-Toe [3].
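For comparison with the approach we chose, the Min-Max idea can be sketched for Tic-Tac-Toe as follows. This is a minimal illustration, not part of our project code: it assumes a board stored as a 9-character string with a space for each empty cell, searches the full tree instead of using a depth limit or evaluation function, and caches results so repeated positions are evaluated once. The names `WIN_LINES`, `winner`, and `minimax` are our own for this example.

```python
from functools import lru_cache

# Index triples that form a winning line on a board stored as a 9-char string.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def minimax(board, player):
    """Value of `board` from X's point of view (+1 win, 0 draw, -1 loss)
    when `player` is about to move and both sides play optimally."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    if " " not in board:
        return 0  # board full with no winner: draw
    other = "O" if player == "X" else "X"
    values = [minimax(board[:i] + player + board[i + 1:], other)
              for i, cell in enumerate(board) if cell == " "]
    # X picks the best value for X; O picks the worst value for X.
    return max(values) if player == "X" else min(values)

empty_board_value = minimax(" " * 9, "X")
```

With optimal play from both sides the empty board evaluates to a draw, which is why a perfect Tic-Tac-Toe player can always avoid losing.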


In reinforcement learning, the machine aims to maximize its rewards by interacting with the environment and updating its policy based on its experiences. Take Q-learning as an example. Each action the agent takes comes with a reward (based on the outcome) and a value calculated from the current state and the best action the agent has previously found [7]. Using these variables, the machine calculates the new value and repeats this process until the game terminates. Over many simulations, the machine experiences a range of patterns of actions and states and becomes able to estimate the probability of winning the game [7]. Through reinforcement learning, the machine starts off playing the game terribly, but it improves gradually over many iterations and ultimately reaches a high probability of winning.

Methods and Materials

The goal of reinforcement learning is to discover how to maximize the numerical reward of a task without any prior information about the potential value of actions. Instead, the machine must explore its environment on its own and exploit what it has learned to understand which actions will yield the greatest cumulative reward at the end of the game. In the case of Tic-Tac-Toe, the learning agent conducts trial-and-error learning through repeated plays of the standard game on a 3×3 field, where a win is obtained by placing three X’s or three O’s in a row horizontally, vertically, or diagonally. To achieve the delayed reward of such a winning combination, the agent must evaluate which actions most beneficially affect not only the immediate reward but also all subsequent rewards.


In any reinforcement learning environment, the learning agent’s behavior at a given time is determined by its policy, a learned strategy that dictates the agent’s actions as a function of its current state and environment. Additionally, reward signals (numerical payoffs received after each action, which the agent strives to maximize) act as the primary basis for altering the policy. Actions with high rewards influence the policy to prioritize them in the future, thus altering the agent’s strategy. The ultimate reward in Tic-Tac-Toe comes from a winning combination of three consecutive characters, which awards the agent +1. A loss is punished with -1, which influences the policy to avoid the actions that led to the low reward.

The agent-environment interaction, with its states, actions, and rewards, is best framed as a finite Markov Decision Process. A Markov Decision Process represents a stochastic sequence of the agent’s possible actions in which the outcomes are partially random and partially under the control of the machine. A variety of actions gives the motivation to seek different rewards. At a time step t, the agent may choose an action a available in state s, thus moving the system into a new state s' and receiving the corresponding reward R. The next state depends only on the current state and the action taken, and the probability of moving into a new state s' is given by the transition function p_{a}(s,s'). The value function is given as a solution to the Bellman equation, which expresses the relationship between the value of the current state and the values of its successor states. Beginning with an initial state s, the agent can take a variety of actions a according to its policy \pi, to which the environment responds with one of the potential next states s' along with a reward r, depending on the dynamics given by the probability function p. The Bellman equation thus averages over all possible next states, weighting each by its probability of occurring. It ultimately expresses the value of taking an action a in some state s as the immediate reward plus the maximum expected reward obtainable from the next state.
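For reference, the averaging over next states described above can be written out explicitly. The following is the standard Bellman optimality equation for action values, expressed with the transition function p_{a}(s,s') from the text; the symbols Q^{*} (optimal action value), R_{a}(s,s') (reward for the transition), \gamma (discount factor), and a' (an action available in s') are notation assumed here for illustration:

Q^{*}(s, a)=\sum_{s'} p_{a}\left(s, s'\right)\left[R_{a}\left(s, s'\right)+\gamma \max _{a'} Q^{*}\left(s', a'\right)\right]

Q-learning does not need to know p_{a}(s,s') in advance: it replaces the sum over next states with transitions actually sampled during play and moves its estimate a small step toward the bracketed target each time.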

Q^{\text {new }}\left(s_{t}, a_{t}\right) \leftarrow \underbrace{Q\left(s_{t}, a_{t}\right)}_{\text {old value }}+\underbrace{\alpha}_{\text {learning rate }} \cdot \overbrace{(\underbrace{\underbrace{r_{t}}_{\text {reward }}+\underbrace{\gamma}_{\text {discount factor }} \cdot \underbrace{\max _{a} Q\left(s_{t+1}, a\right)}_{\text {estimate of optimal future value }}}_{\text{new value (temporal difference target)}}-\underbrace{Q\left(s_{t}, a_{t}\right)}_{\text {old value }})}^{\text {temporal difference }}

Our Tic-Tac-Toe project uses the Bellman equation as part of the Q-learning method, in which the machine iteratively approximates the “Q” values, the expected reward for an action in a given state. Q-learning assigns each state-action pair in a game of Tic-Tac-Toe a value, with higher Q values indicating more desirable actions. The update rule above iteratively adjusts the Q values depending on the current state of the board, potential actions, and future states. In the case of Tic-Tac-Toe, board positions are states and game moves are actions. When the end of a match is reached during training, the result of the game serves as the reward for the move that led to that result. The machine then works backward through the history of the game and updates the Q values for each action taken during the game. To make our learning agent familiar with all possible Tic-Tac-Toe moves instead of just reinforcing already high Q values, we use the epsilon-greedy strategy, which selects a random move with probability \epsilon or the best move from the Q-table with probability 1 - \epsilon. This ensures a balanced exploration-exploitation learning strategy for our agent.
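The update rule can be sketched directly in Python. This is a minimal illustration under assumptions of our own: the Q-table is a dictionary keyed by (state, action) pairs, unseen pairs default to 0, and the values of `alpha` and `gamma` are arbitrary examples rather than the settings used in our experiments.

```python
def q_update(q_table, state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new Q value.

    q_table      -- dict mapping (state, action) to a Q value
    next_actions -- actions available in next_state (empty if the game ended)
    alpha, gamma -- learning rate and discount factor (illustrative defaults)
    """
    old_value = q_table.get((state, action), 0.0)
    # Estimate of the optimal future value; 0 when the game is over.
    future = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                 default=0.0)
    td_target = reward + gamma * future
    q_table[(state, action)] = old_value + alpha * (td_target - old_value)
    return q_table[(state, action)]

q = {}
# A winning final move: reward +1 and no further actions.
last = q_update(q, "s2", 4, 1.0, "terminal", [], alpha=0.5)
# The move before it earns no reward itself but inherits discounted value.
prev = q_update(q, "s1", 0, 0.0, "s2", [4], alpha=0.5, gamma=0.9)
```

Updating the final move first and then the earlier one mirrors the backward pass through the game history described above.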


The Q-learning algorithm was created in Python. By converting each board state to a string we created a unique hash for that board state. The hash and its reward were stored as a key and value pair in a dictionary data structure. During each episode, all moves made are recorded and the reward received by the algorithm will be distributed to each board state hash via the Q-learning equation previously described.
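A minimal sketch of this hashing scheme is shown below. The board representation (a 3×3 list of lists using "X", "O", and "-" for empty cells) and the helper name `board_hash` are assumptions made for illustration and may differ from our actual code.

```python
def board_hash(board):
    """Flatten a 3x3 board into a 9-character string usable as a dict key."""
    return "".join(cell for row in board for cell in row)

# Dictionary mapping each board hash to its learned value.
state_values = {}

board = [["X", "-", "-"],
         ["-", "O", "-"],
         ["-", "-", "-"]]

key = board_hash(board)
state_values.setdefault(key, 0.0)  # unseen states start at 0
```

Because strings are immutable and hashable, identical positions reached through different move orders map to the same dictionary entry.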

Using the epsilon-greedy method, our algorithm decides whether to follow the path of greatest reward, by looking at the reward associated with future game states, or to explore through a random move. The computer picks between the two by generating a random number between 0 and 1. If this number is smaller than a predetermined epsilon value, the algorithm plays a random move; otherwise, it takes the Q-learning approach and plays the move with the highest value.
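The decision rule can be sketched as follows. The function name `choose_move`, the mapping from moves to Q values, and the default of 0 for unseen moves are assumptions for this example.

```python
import random

def choose_move(q_values, legal_moves, epsilon, rng=random):
    """Pick a move with the epsilon-greedy rule.

    q_values    -- dict mapping a move to its current Q value (assumed layout)
    legal_moves -- non-empty list of moves available in this position
    epsilon     -- probability of exploring with a random move
    """
    if rng.random() < epsilon:
        return rng.choice(legal_moves)                             # explore
    return max(legal_moves, key=lambda m: q_values.get(m, 0.0))    # exploit

greedy = choose_move({0: 0.1, 4: 0.9, 8: -0.3}, [0, 4, 8], epsilon=0.0)
explored = choose_move({0: 0.1, 4: 0.9, 8: -0.3}, [0, 4, 8], epsilon=1.0)
```

Since `random.random()` returns a value in [0, 1), setting epsilon to 0 always exploits and setting it to 1 always explores.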

After extensive testing, we decided to set epsilon to 1 for the first 1000 episodes, which ensures that the algorithm always takes a random action. Epsilon is then divided by 2 after every 1000 episodes played, enabling the algorithm to first discover a large number of the possible states and then play strategically based on the Q-values of the board. After each move taken by the agent, the Q-table (a table that holds all the state-action pairs of the board and their corresponding Q-values) is updated using the Bellman equation.
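This schedule can be expressed as a one-line function. The sketch assumes episodes are counted from 0; the name `epsilon_for` is ours.

```python
def epsilon_for(episode, start=1.0, halve_every=1000):
    """Exploration rate for a 0-indexed episode: `start` for the first
    `halve_every` episodes, then halved after each further block."""
    return start * 0.5 ** (episode // halve_every)

schedule = [epsilon_for(e) for e in (0, 999, 1000, 2000, 3500)]
```

After ten blocks (10,000 episodes) epsilon has already dropped below 0.001, so the great majority of a 300,000-episode run is spent exploiting learned values.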


To make the game as challenging for the agent as possible, the opponent uses the same Q-learning algorithm but with the main agent’s Q-table from 100 episodes earlier. The opponent’s Q-table is refreshed every 100 episodes played.
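One way to implement such a lagging opponent is to copy the agent's Q-table at fixed intervals, as in the sketch below. The class name `SelfPlayOpponent` and its method are hypothetical and only illustrate the idea; our actual implementation may be organized differently.

```python
import copy

class SelfPlayOpponent:
    """Opponent that plays from a periodically refreshed copy of the agent's Q-table."""

    def __init__(self, sync_every=100):
        self.sync_every = sync_every
        self.q_table = {}

    def maybe_sync(self, episode, agent_q_table):
        """Copy the agent's table when `episode` is a multiple of `sync_every`.
        Returns True if a copy was made."""
        if episode % self.sync_every == 0:
            # Deep copy so later agent updates do not leak into the snapshot.
            self.q_table = copy.deepcopy(agent_q_table)
            return True
        return False

agent_q = {"X---O----": 0.2}
opponent = SelfPlayOpponent()
synced = opponent.maybe_sync(100, agent_q)
agent_q["X---O----"] = 0.7           # the agent keeps learning
skipped = opponent.maybe_sync(150, agent_q)
```

Keeping the opponent slightly behind the agent gives the agent a stable target to improve against between refreshes.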

Figure 1 shows the results after the agent has played 300,000 games. The agent won about 75% of the games played, tied 15%, and lost 10%.


Our research has demonstrated that, with a sufficient amount of training, a learning agent can use reinforcement learning to master simple games such as Tic-Tac-Toe with a reasonably high win rate. Our agent used the Bellman equation and the Q-learning algorithm to track and retrace its Tic-Tac-Toe moves and improve its playing strategy. By the conclusion of our experiment, after 300,000 episodes of training, the agent won about 75% of its games and avoided losing in about 90% of them. We believe that with further improvement to the Q-learning algorithm and a longer training period we will be able to increase the winning percentage even more and completely eliminate losing scenarios.

Future Directions

Future work on the project would include optimizing the machine’s learning strategies so that the agent wins nearly every game, with perhaps some occasional ties. We also hope to find ways to speed up learning without relying on prolonged training, since increasing the number of training matches takes significantly longer for machines to complete (a separate issue we can improve upon).

We hope to conduct further research on adjusting the learning rate and the decay of the exploration rate, and on their effects on a machine’s abilities not just in Tic-Tac-Toe but in similar programs that can be improved with reinforcement learning. While the specifications of states, actions, and rewards change from game to game, the essence of the Q-learning algorithm remains the same, and thus our research can be applied to a wide variety of programs. Our work with reinforcement learning can thus be extended not only to various games but also to many important fields such as finance, business, medicine, and industrial robotics, as well as other areas where reliable learning with minimal human supervision is essential to quick operations and exemplary results.


  1. AlphaGo: The story so far. (2021). Deepmind. Retrieved August 4, 2021, from https://deepmind.com/research/case-studies/alphago-the-story-so-far
  2. Babes, M., Littman, M., & Wunder, M. (n.d.). Classes of Multiagent Q-learning Dynamics with ∊-greedy Exploitation. Rutgers University, Department of Computer Science. Retrieved August 4, 2021 from https://icml.cc/Conferences/2010/papers/191.pdf
  3. Friedrich, C. (2018, July 20). TABULAR Q learning, a Tic Tac Toe player that gets better and better. Medium. Retrieved August 4, 2021, from https://medium.com/@carsten.friedrich/part-3-tabular-q-learning-a-tic-tac-toe-player-that-gets-better-and-better-fa4da4b0892a
  4. Ritthaler, M. (2018, January 18). Using Q-Learning and Deep Learning to Solve Tic-Tac-Toe. YouTube. Clearwater Analytics. Retrieved August 4, 2021, from https://www.youtube.com/watch?v=4C133ilFm3Q
  5. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press Ltd. Retrieved August 4, 2021, from http://incompleteideas.net/book/RLbook2020.pdf
  6. Torres, J. (2020, June 11). The Bellman Equation. Medium. Towards Data Science. Retrieved August 4, 2021, from https://towardsdatascience.com/the-bellman-equation-59258a0d3fa7
  7. Watkins, J. (1992). Q-Learning. Kluwer Academic Publishers. Retrieved August 2, 2021 from https://link.springer.com/content/pdf/10.1007/BF00992698.pdf



Facial Landmark Data Collection to Train Facial Emotion Detection Learning Models

Blog, Journal for High Schoolers, Journal for High Schoolers 2021


Leevi Symister, Kaiser Williams, Roshan Prabhakar, Ganesh Pimpale, Tsachy Weissman


In-person theater performances have become difficult due to COVID-19, so video conferencing software such as Zoom has become increasingly popular for delivering live virtual performances. Such performances require audience feedback to be delivered to performers so that they can adapt their performance. However, typical audience feedback is not viable in a virtual setting. Intuition suggests that asking audience members who are absorbed in a performance to redirect their attention to giving feedback will yield an inadequate representation of the audience’s emotional state. More authentic feedback can be gained by analyzing the audience’s real-time emotions, extracting facial expressions from a live webcam feed.

Existing facial emotion recognition software maps a face mesh to a subject and derives emotional states through the use of CNNs and the Facial Action Coding System (the current standard for determining emotions from a facial state). These models are trained on a variety of images from publicly available datasets as well as images scraped from the web. This project aims to develop software that performs data collection more directly tied to the audience-performer feedback loop: specifically, the collection of emotional expressions and corresponding reactionary expressions (clapping, booing, etc.). By building a database of such entries, which may be used to train a machine learning model, this project aims to determine whether a relationship exists between face mesh data collected across frames and reactionary motions during those frames that may not be easily observable through the webcam. Should a relationship be found, such a model could be deployed within virtual performance frameworks to enable live collection of audience feedback information. Our code is available in a GitHub repository: https://github.com/roshanprabhakar/af-datacollection.




Audience feedback simply refers to any form of audience reaction during a performance that can be interpreted by the performer. Live performances such as concerts and plays have a dynamic exchange of emotional information formally known as the Audience Performer Feedback Loop [13]. This constant interaction between the two parties allows the performer to adapt their performance based on the audience’s general reactions, such as laughing, applause, booing, smiling, frowning, and clapping. However, in virtual situations, the audience typically has their microphones turned off, so the typical feedback loop is impractical. In addition, existing software tools use low-level GUIs that synthesize artificial feedback (such as laughter, booing, cheering, thumbs-up, etc.) through buttons, which makes feedback less natural by dividing the audience’s attention between watching and reacting to a performance. Thus, an alternative method of gaining audience feedback is to analyze emotional information through facial detection and analysis and then provide that data to the performer in real time. Facial emotion detection can be done using just a live webcam and the camera features embedded in virtual conference software such as Zoom.

Technical concepts and project specifics

Facial emotion detection is generally divided into two tasks:

  1. Recognizing and detecting a human face in the video feed
  2. Detecting emotion through facial landmark analysis.

The first task utilizes a Convolutional Neural Network (CNN), a machine learning model that takes an input image and analyzes its features to find patterns. Passing kernels across an image’s pixels [3] (a process similar to the figure on the left) and multiplying each kernel with the pixels beneath it creates a new matrix. This results in a convolutional layer whose output captures features such as edges extracted from the input image [12]. Next, the goal is to analyze the face and its features to see how they correspond to particular emotions. This is commonly achieved by utilizing the Facial Action Coding System (FACS) together with emotion recognition software that uses another CNN. FACS was originally developed by Paul Ekman and Wallace Friesen [4] and defines a set of Action Units, each of which corresponds to the movement of a particular facial muscle group [2]. With these units, the emotion recognition software can identify and group certain Action Units with a particular set of emotions [6]. Emotion recognition programs can differ in how they process the images and Action Units [5], but the set of emotions they derive generally consists of around seven universal human emotions: happiness, sadness, anger, surprise, fear, disgust, and contempt [11]. For example, iMotions, a software package for human behavior analysis, defines happiness as the combination of two Action Units: one that describes a cheek raise and another that denotes a lip corner pull [5].
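The kernel-passing step can be illustrated with a small pure-Python sketch. It slides a kernel over every position where it fully fits (no padding, stride 1) and sums the element-wise products, without flipping the kernel, which is the convention CNN libraries typically use. The function name, the toy image, and the edge kernel are all made up for illustration and are unrelated to the models discussed here.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (lists of lists of numbers) and return the
    matrix of summed element-wise products at each valid position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Toy image: dark left half (0), bright right half (1).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

# A 1x2 kernel that responds where brightness increases from left to right.
edge_kernel = [[-1, 1]]

feature_map = convolve2d(image, edge_kernel)
```

The resulting feature map is nonzero only in the column where the dark and bright halves meet, which is the sense in which a kernel "extracts" an edge.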

Modern facial analytics software, including iMotions, now integrates face detection and FACS emotion recognition into a single package [5]. This is usually done by using computer vision algorithms to map points to facial landmarks (a face mesh), which are then tracked and analyzed using deep learning to determine an emotion [5].

In particular, facial emotion recognition models require training with a large data set [15] in order to accurately detect emotions. Our software brings an interactive element to this data collection and training. Rather than algorithm-based data collection, we create a subject-observer model that collects the observer’s perceived emotion from a subject’s facial reactions. In an integrated environment, we extract the relevant facial feature/landmark information from each frame of a face mesh video feed, while concurrently having the observer analyze the same video and record what emotions he or she detects. The collected data entries will be used to develop an aggregate data set upon which a learning model may determine the mathematical relationship between face meshes and the corresponding reaction vector (this vector is determined by the human observer in our environment). Our project will be especially relevant in determining if certain facial landmark movements correlate to certain emotions and reactionary expressions of the body such as clapping, booing, etc.

Past & Related Work

Prior to the advancement of CNN models, a viable option for facial detection was the Viola-Jones algorithm. This algorithm uses cascade classifiers to detect the Haar-like features of a human face, identifying faces mainly in still images with a detection rate of 95% [14]. While this method is viable in live video feeds, the most straightforward technique for improving detection performance, adding features to the classifier, directly increases computation time and is thus inefficient. Nonetheless, this algorithm has paved the road for many neural networks today that follow similar techniques [14, 10].

Materials, Libraries, Methods

We are creating a subject-observer software that facilitates the creation of a database which may be used to train a learning model that attempts to find a mathematical relationship between reactionary expressions less observable through a webcam, such as clapping, and corresponding facial expressions. This requires two independent tasks during data collection: the constant monitoring of a human subject’s face mesh (executed by CNN-based libraries) while the subject is watching and reacting to some sort of video (Facial Landmark Data Collection Interface), and the constant monitoring of the same subject’s emotional state conducted by a separate human observer (Observer Output Data Collection Interface). Subsequently, we will develop an integrated environment to allow these tasks to occur simultaneously and software that merges the resulting data of the two tasks to create entries for our database.

For face detection and facial landmark mapping, we employed the MediaPipe TensorFlow.js library called Face Mesh. It does not require a depth sensor and instead only needs access to a webcam on the device being used. This library performs face detection using the BlazeFace CNN model, which is tailored for light computational load while being very fast [8]. This model produces a cropped image composed of the face and a few facial keypoint coordinates inside a rectangular bounding box, which helps it handle face rotations [1]. This cropped image is then passed as input to the facial landmark neural network, which maps 468 coordinates (x, y, z) back onto the original uncropped image [8]. As no depth sensor is used, the z coordinates are scaled in accordance with the x coordinate using Weak Perspective Projection [9].

Facial Landmark Data Collection Interface

The first task consists of collecting facial landmark data together with the timestamps of our collection and storing them in a file for later use. This way, when the data of the observer and the data of the face mesh are compared, we can learn which reactions correspond to certain facial landmark movements. To do this, we first need to access the predefined objects in the MediaPipe Face Mesh library.

The most important object in this library is the faceDetection object. It contains the (x, y, z) coordinates of each of the 468 points mapped to a face detected through the webcam. It stores these values in the Mesh and Scaled Mesh properties of the object. The Mesh property consists of the facial landmark coordinates without normalization, while the Scaled Mesh property contains the normalized coordinates. In our program, the goal is to collect 100 packets, each consisting of 10 frames of Mesh data. Because we need this data stored within a local file, and the straightforward file representation for a JavaScript array is simply the ASCII-encoded stringified representation of said array, we instead store our data in a byte buffer in which every 4 bytes correspond to one 32-bit float value from the faceDetection object. This allows us to reduce consumption from almost 30 bytes per number (2 bytes per character of the UTF-16 string representation) to just 4 bytes (32-bit floating point representation). As a string, our data would consume 538,260 bytes per packet. With the floating point representation, we would only be using 56,168 bytes per packet, a reduction of almost 90%.
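To make the size comparison concrete, here is a small sketch that contrasts the two representations and recomputes the reduction from the per-packet figures quoted above. The helper names stringBytes and float32Bytes and the three sample values are illustrative assumptions; only the 538,260 and 56,168 byte figures come from the text.

```javascript
// Bytes needed if the numbers are kept as a string at 2 bytes per character.
function stringBytes(values) {
  return JSON.stringify(values).length * 2;
}

// Bytes needed if each number is stored as one 32-bit float.
function float32Bytes(values) {
  return new Float32Array(values).byteLength;
}

// Hypothetical coordinate values with many decimal digits.
const sample = [0.123456789, 245.6789012345, -12.3456789012];
console.log(stringBytes(sample), "bytes as a string");
console.log(float32Bytes(sample), "bytes as 32-bit floats");

// Reduction implied by the per-packet sizes reported in the text.
const reportedStringBytes = 538260;
const reportedBufferBytes = 56168;
const reportedReduction = 1 - reportedBufferBytes / reportedStringBytes;
console.log((reportedReduction * 100).toFixed(1) + "% fewer bytes per packet");
```

The trade-off is precision: a 32-bit float keeps roughly seven significant decimal digits, so digits beyond that in the string form are not preserved.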

To implement this storage solution, each frame of Mesh data and Scaled Mesh data is temporarily stored in arrays called faceMeshArray and scaledFaceMeshArray, respectively. The contents of faceMeshArray are then copied into an array buffer called meshBuffer, which is accessed through a view called meshBufferView.

The length of meshBuffer is determined by multiplying the number of bytes stored per number (4 in the case of Float32) by the number of dimensions per point (3), the number of points (468), and the number of frames of Mesh data per packet (10). This comes out to 56,160 bytes.
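The buffer sizing and fill step can be sketched as follows. The names meshBuffer, meshBufferView, and faceMeshArray follow the text, but the nested shape assumed for faceMeshArray (frames, then points, then [x, y, z]) and the placeholder coordinate values are assumptions for illustration only.

```javascript
const BYTES_PER_FLOAT32 = Float32Array.BYTES_PER_ELEMENT; // 4
const DIMENSIONS_PER_POINT = 3; // x, y, z
const POINTS_PER_FRAME = 468;
const FRAMES_PER_PACKET = 10;

const meshByteLength =
  BYTES_PER_FLOAT32 * DIMENSIONS_PER_POINT * POINTS_PER_FRAME * FRAMES_PER_PACKET;

// Assumed input shape: faceMeshArray[frame][point] = [x, y, z].
function fillMeshBuffer(faceMeshArray) {
  const meshBuffer = new ArrayBuffer(meshByteLength);
  const meshBufferView = new Float32Array(meshBuffer);
  let index = 0;
  for (const frame of faceMeshArray) {
    for (const [x, y, z] of frame) {
      meshBufferView[index++] = x;
      meshBufferView[index++] = y;
      meshBufferView[index++] = z;
    }
  }
  return meshBuffer;
}

// Placeholder data standing in for real landmark coordinates.
const faceMeshArray = Array.from({ length: FRAMES_PER_PACKET }, (_, frame) =>
  Array.from({ length: POINTS_PER_FRAME }, (_, point) => [frame, point, 0.5])
);

const meshBuffer = fillMeshBuffer(faceMeshArray);
console.log(meshByteLength, meshBuffer.byteLength);
```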

Now we store the timestamp data in our packet. The timestamp is stored as the number of milliseconds elapsed since the Unix epoch (January 1, 1970).

This number is on the order of trillions (roughly 1.6 trillion in 2021), which is too large to be stored at millisecond precision in 32 bits, so in order to store it we utilize an array buffer (timeBuffer) with a length of 8 bytes. We then open this buffer using a Float64Array view (timeBufferView), which encodes each number (including decimal values) with 8 bytes. Because timeBuffer only holds 8 bytes, opening it with a Float64Array view leaves the view with exactly one index. We then store in this index the time at which the data for the packet is being collected.
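A minimal sketch of this timestamp storage is shown below, using the names timeBuffer and timeBufferView from the text. The fixed example timestamp is an assumed value from mid-2021, included only to show why a 32-bit float would not be enough.

```javascript
// An 8-byte buffer viewed as Float64 has exactly one slot.
const timeBuffer = new ArrayBuffer(8);
const timeBufferView = new Float64Array(timeBuffer);
timeBufferView[0] = Date.now(); // milliseconds since January 1, 1970

console.log(timeBufferView.length); // 1

// Assumed example value: a millisecond timestamp from mid-2021.
const exampleTimestamp = 1628700000123;

// Math.fround rounds to the nearest 32-bit float, which cannot hold this value exactly.
const asFloat32 = Math.fround(exampleTimestamp);
const asFloat64 = new Float64Array([exampleTimestamp])[0];

console.log(asFloat32 === exampleTimestamp); // false: millisecond detail is lost
console.log(asFloat64 === exampleTimestamp); // true: Float64 stores it exactly
```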

Next, we create a concatenated array buffer that stores both the timestamp and the Mesh data together. To do this we utilize another array buffer called meshPacketBuffer. Its length is the combined byte length of timeBuffer (8) and meshBuffer (56,160), for a total of 56,168 bytes.

The next step is to write the timestamp bytes to the packet. To do this we open timeBuffer with an Int8Array view, which exposes each of its 8 bytes as a separate index. This allows us to loop through this byte-level view and store each of the 8 indices into the meshPacketBufferView.

This means that the first 8 bytes of the meshPacketBufferView now store the timestamp data for that individual packet. The rest of the indices (56,160) can now be used to store the meshBuffer data which again represents 10 frames of mesh data collection.

This involves the same process of opening meshBuffer with an Int8Array view. The view again exposes the contents of meshBuffer one byte per index, for a total of 56,160 indices. To store this data in the meshPacketBufferView, we iteratively step through each index of this byte-level view and copy it to the meshPacketBufferView. We must remember, however, to skip the first 8 bytes, which already store the timestamp information in the packet.
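The packet assembly described in the preceding paragraphs can be sketched as below. This is an illustrative version, not the authors' code: it uses Uint8Array where the text uses Int8Array (for a plain byte-for-byte copy either works), and the helper names buildMeshPacket and readMeshPacket, along with the placeholder timestamp and mesh values, are assumptions.

```javascript
const TIME_BYTES = 8;
const MESH_BYTES = 4 * 3 * 468 * 10; // 56,160

// Copy the 8 timestamp bytes first, then the mesh bytes after them.
function buildMeshPacket(timeBuffer, meshBuffer) {
  const meshPacketBuffer = new ArrayBuffer(timeBuffer.byteLength + meshBuffer.byteLength);
  const meshPacketBufferView = new Uint8Array(meshPacketBuffer);
  const timeBytes = new Uint8Array(timeBuffer);
  const meshBytes = new Uint8Array(meshBuffer);
  for (let i = 0; i < timeBytes.length; i++) {
    meshPacketBufferView[i] = timeBytes[i];
  }
  for (let i = 0; i < meshBytes.length; i++) {
    // Offset by the timestamp length so the first 8 bytes are not overwritten.
    meshPacketBufferView[timeBytes.length + i] = meshBytes[i];
  }
  return meshPacketBuffer;
}

// Read the packet back: one Float64 timestamp, then Float32 mesh values.
function readMeshPacket(meshPacketBuffer) {
  const timestamp = new Float64Array(meshPacketBuffer, 0, 1)[0];
  const mesh = new Float32Array(meshPacketBuffer, TIME_BYTES);
  return { timestamp, mesh };
}

// Placeholder inputs.
const timeBuffer = new ArrayBuffer(TIME_BYTES);
new Float64Array(timeBuffer)[0] = 1628700000123;
const meshBuffer = new ArrayBuffer(MESH_BYTES);
new Float32Array(meshBuffer).fill(0.25);

const packet = buildMeshPacket(timeBuffer, meshBuffer);
const decoded = readMeshPacket(packet);
console.log(packet.byteLength, decoded.timestamp, decoded.mesh.length);
```

The two loops mirror the description in the text; the built-in TypedArray method set(source, offset) performs the same copy in a single call.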

Lastly, before saving the data to file, we need to push the meshPacketBuffer to an initially empty array called packetArray. Each index of packetArray holds one packet: the 56,168 bytes made up of the timestamp (8 bytes) and 10 frames of Mesh data (56,160 bytes). The packetArray collects data until its length reaches 100 and will therefore contain 1,000 frames of Mesh data and 100 timestamps.
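As a rough sketch of this collection loop, the snippet below pushes packets into packetArray until the limit of 100 is reached. The createPacketCollector helper is hypothetical, and empty buffers of the right size stand in for real packets.

```javascript
const PACKETS_PER_FILE = 100;
const FRAMES_PER_PACKET = 10;
const PACKET_BYTES = 56168;

// Hypothetical helper: stores packets until the limit is reached.
function createPacketCollector(limit = PACKETS_PER_FILE) {
  const packetArray = [];
  return {
    packetArray,
    // Returns true once collection is complete.
    add(meshPacketBuffer) {
      if (packetArray.length < limit) {
        packetArray.push(meshPacketBuffer);
      }
      return packetArray.length >= limit;
    },
  };
}

const collector = createPacketCollector();
let full = false;
let offered = 0;
while (!full) {
  full = collector.add(new ArrayBuffer(PACKET_BYTES)); // stand-in for a real packet
  offered++;
}

const totalFrames = collector.packetArray.length * FRAMES_PER_PACKET;
const totalBytes = collector.packetArray.reduce((sum, p) => sum + p.byteLength, 0);
console.log(collector.packetArray.length, totalFrames, totalBytes);
```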

The final step of this process is to save the collected packets to disk. In order to do this, we convert each packet to a string by mapping each byte in the packet buffer to its corresponding character, and then we write each packet to a file. We are currently in the implementation phase of this step.
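Since this step is still being implemented, the following is only one possible sketch of a byte-to-character mapping, with hypothetical helper names packetToString and stringToPacket. It maps each byte (0 to 255) to the character with the same code, so byte values above 127 fall outside the ASCII range and would need care when the string is written to a file with a multi-byte encoding such as UTF-8.

```javascript
// Map every byte of a packet to the character with the same code unit.
function packetToString(meshPacketBuffer) {
  const bytes = new Uint8Array(meshPacketBuffer);
  const chars = new Array(bytes.length);
  for (let i = 0; i < bytes.length; i++) {
    chars[i] = String.fromCharCode(bytes[i]);
  }
  return chars.join("");
}

// Reverse the mapping: one character back to one byte.
function stringToPacket(text) {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    bytes[i] = text.charCodeAt(i);
  }
  return bytes.buffer;
}

// Tiny stand-in packet: one Float64 timestamp followed by one Float32 value.
const smallPacket = new ArrayBuffer(12);
new Float64Array(smallPacket, 0, 1)[0] = 1628700000123;
new Float32Array(smallPacket, 8, 1)[0] = 0.25;

const packetText = packetToString(smallPacket);
const restored = stringToPacket(packetText);
console.log(packetText.length, new Float64Array(restored, 0, 1)[0]);
```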

Observer Output Data Collection Interface

Next, the human observer will monitor the same subject’s emotional state in real time. To collect the emotions the observer perceives from the video feed, we created a simple interface with HTML, CSS, and JavaScript that can be run on a live server. A window displays the live webcam feed of the subject, and below it are eleven buttons corresponding to the seven universal emotions and four common emotional expressions: “Laughing”, “Applause”, “Booing”, and “Crying”. When the interface is first run, the webcam launches automatically using the JavaScript getUserMedia() function and displays the subject’s webcam feed on the observer’s computer, and the initial timestamp is collected at the same moment. As the webcam feed plays, the observer notes what emotion they perceive from the subject’s facial expression and presses the corresponding button. When a button is pressed, the emotion and the number of milliseconds elapsed since the initial timestamp are recorded in a JSON object with the keys “action” and “timestamp”, which is then appended to a JSON array that contains all the data. This allows easier access to the values later in the process.
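The recording logic behind the buttons might look like the sketch below. The keys "action" and "timestamp" come from the text; the createObserverLog helper, its injectable clock, and the simulated presses are assumptions added so that the example runs outside a browser.

```javascript
// Hypothetical helper: records each button press relative to the initial timestamp.
function createObserverLog(now = Date.now) {
  const initialTimestamp = now();
  const records = [];
  return {
    initialTimestamp,
    records,
    record(action) {
      const entry = { action, timestamp: now() - initialTimestamp };
      records.push(entry);
      return entry;
    },
  };
}

// A fake clock stands in for real time so the output is predictable.
let fakeTime = 1000;
const observerLog = createObserverLog(() => fakeTime);

fakeTime = 1450; // simulated button press 450 ms after start
observerLog.record("Happiness");
fakeTime = 3200; // simulated button press 2200 ms after start
observerLog.record("Applause");

const observerJson = JSON.stringify(observerLog.records);
console.log(observerJson);
```

In the browser interface, each button's click handler would call record with that button's label, and the default clock (Date.now) would be used instead of the fake one.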

Once the observation is complete and the “finish” button is pressed, this data must be stored in a file that can be combined with the data from Task 1 to form a data entry. In order to save space, we convert this JSON array into an array buffer, similar to Task 1. Additionally, each of the eleven emotion names takes up quite a bit of space, as each letter is a byte of information. Thus, we convert each emotion to a key (1-11) that occupies only 4 bits before storing it in the buffer. While it would be ideal to create and save a binary file (a .bin file) containing the array buffers that could be exported to the local device or cloud storage, browser JavaScript cannot natively write files to disk, a restriction intended to prevent malware installation. Our solution is to parse through the array buffer and convert the binary values to ASCII codes. ASCII is an encoding system that maps 128 specific characters to seven-bit integers and vice versa. This array is finally downloaded as a text file using an external library called FileSaver.js [7].
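The emotion-to-key conversion can be sketched as a simple lookup table. The ordering of the list below is hypothetical: the only key the text specifies is Anger -> 5, so the list was arranged to respect that one mapping and the remaining assignments are placeholders.

```javascript
// Hypothetical key order; only "Anger" -> 5 is taken from the text.
const REACTIONS = [
  "Happiness", "Sadness", "Surprise", "Fear", "Anger", "Disgust",
  "Contempt", "Laughing", "Applause", "Booing", "Crying",
];

// Keys run from 1 to 11, so each one fits in 4 bits (maximum value 15).
const KEY_BY_REACTION = new Map(REACTIONS.map((name, i) => [name, i + 1]));

function reactionToKey(name) {
  const key = KEY_BY_REACTION.get(name);
  if (key === undefined) {
    throw new Error(`Unknown reaction: ${name}`);
  }
  return key;
}

function keyToReaction(key) {
  return REACTIONS[key - 1];
}

// Show the key as a 4-bit binary string.
function keyToBits(key) {
  return key.toString(2).padStart(4, "0");
}

console.log(reactionToKey("Anger"), keyToBits(reactionToKey("Anger"))); // 5 "0101"
console.log(keyToReaction(11));
```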

This is a model of the entire process. Note how each emotion is translated to the binary representation of its corresponding key value (e.g., Anger -> 5 -> 101). Also note how the binary representation of the timestamp values exceeds 7 bits. In this case, we split the binary representation into groups of 7 bits each and assign an ASCII code to each sub-value. The intervals at which the emotions and the timestamp ASCII values are stored are recorded in the background for later processing. We are currently looking for an alternative method to create an exportable file type and store binary data effectively.
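The 7-bit splitting of timestamp values can be sketched as follows. The function names toSevenBitCodes and fromSevenBitCodes are hypothetical, and the sketch uses division rather than bitwise operators so that values larger than 32 bits are handled safely.

```javascript
// Split a non-negative integer into 7-bit groups, most significant group first.
function toSevenBitCodes(value) {
  if (!Number.isInteger(value) || value < 0) {
    throw new RangeError("value must be a non-negative integer");
  }
  const codes = [];
  let rest = value;
  do {
    codes.unshift(rest % 128); // lowest 7 bits
    rest = Math.floor(rest / 128); // drop those 7 bits
  } while (rest > 0);
  return codes;
}

// Rebuild the integer from its 7-bit groups.
function fromSevenBitCodes(codes) {
  return codes.reduce((value, code) => value * 128 + code, 0);
}

// 2200 ms is 100010011000 in binary, which splits into 10001 (17) and 0011000 (24).
const timestampCodes = toSevenBitCodes(2200);
const timestampText = String.fromCharCode(...timestampCodes);
console.log(timestampCodes, timestampText.length); // [ 17, 24 ] 2
console.log(fromSevenBitCodes(timestampCodes)); // 2200
```

Because different timestamps produce different numbers of 7-bit groups, a decoder also needs to know where each value starts and ends, which is the role of the intervals the text says are recorded in the background.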


Integrated Environment and Combining Data Files


Lastly, we are developing an integrated environment which will allow both tasks to run in parallel. In order to display the live webcam feed on the observer’s screen, we will use the WebRTC API, which allows a real-time video connection between two peers. So far, we have been able to connect the video feed between two different browsers on a local computer. Eventually, we will be able to access the video feed from another computer using a third-party STUN server, which will help gather each device’s ICE (Interactive Connectivity Establishment) candidates so that they can be made available to the other peer. In the background, the two separate data files created in each task will be merged to create data entries for our aggregate dataset. This final program will read the two data files and connect the face mesh landmark movements with the reaction data perceived by the observer according to the timestamps. This aggregate data set can be used to find a correlation between certain face movements and emotions or emotional expressions, and may be used to train a machine learning model.
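One way the timestamp-based merge could work is sketched below. The text only says the two files are connected "according to the timestamps", so the matching rule used here (pair each observer record with the latest packet that started at or before the button press) is an assumption, as are the function name mergeByTimestamp and the shapes of both inputs.

```javascript
// Assumed shapes:
//   packets:  [{ timestamp: epochMs, mesh: Float32Array }], sorted by timestamp
//   observer: { initialTimestamp: epochMs, records: [{ action, timestamp: offsetMs }] }
function mergeByTimestamp(packets, observer) {
  const entries = [];
  for (const record of observer.records) {
    // Observer timestamps are offsets, so convert them to absolute time first.
    const pressedAt = observer.initialTimestamp + record.timestamp;
    let match = null;
    for (const packet of packets) {
      if (packet.timestamp <= pressedAt) {
        match = packet; // keep the latest packet that started before the press
      } else {
        break;
      }
    }
    if (match !== null) {
      entries.push({
        action: record.action,
        pressedAt,
        packetTimestamp: match.timestamp,
        mesh: match.mesh,
      });
    }
  }
  return entries;
}

// Placeholder data: three packets and two button presses.
const packets = [
  { timestamp: 1000, mesh: new Float32Array([0.1]) },
  { timestamp: 1400, mesh: new Float32Array([0.2]) },
  { timestamp: 1800, mesh: new Float32Array([0.3]) },
];
const observer = {
  initialTimestamp: 1000,
  records: [
    { action: "Happiness", timestamp: 450 },
    { action: "Applause", timestamp: 2200 },
  ],
};

const dataEntries = mergeByTimestamp(packets, observer);
console.log(dataEntries.map((e) => [e.action, e.packetTimestamp]));
```

A real merge would also need to account for clock differences between the subject's and observer's machines and for the observer's reaction delay, neither of which is modeled here.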


The results of our experiment consist of accurate collection of both facial landmark and reactionary data whilst maintaining computational efficiency. For facial landmark detection, our array buffer storage solution achieved close to a 90% reduction in the number of bytes encoded per packet of facial landmark data. This figure is also before any truncation of the decimal values comprising the facial landmark coordinates, which would reduce precision slightly but alleviate strain on memory. With truncation, we expect nearly a 95% reduction in byte usage compared with the original UTF-16 encoding solution (2 bytes per character in a string representation).

Our other successes lie in the storage of reactionary and timestamp information on the observer side. By mapping a set of keys to each of the 11 reactions and then converting those keys to binary and then finally to ASCII, we were able to cut our byte usage by at least 94%. We also reduced strain on memory by breaking down the timestamp information into smaller components of at most 7 bits. This allowed us to convert the timestamp information to ASCII as well. Now by using the ASCII and key codes we will be able to map out exactly what our timestamp and reactionary information was without needing to store as much data in the process on the local device.


Our research revolved around the issue of Audience Feedback and in particular the Audience Feedback Loop and how it has been disrupted due to factors such as COVID-19 and our increased reliance on video conferencing software such as Zoom. Our goal was to aid in the training of Machine Learning models and other algorithms surrounding facial landmark analysis and corresponding emotional responses. By creating software that is used for increased data collection in regards to reactionary information not easily observable through a webcam, we hope to better train these models.

The results of our research comprise the development of software that, using the input of a webcam, maps facial landmarks onto an observed subject’s face and subsequently stores the coordinate and timestamp information as array buffers. These buffers are then concatenated into a larger array buffer called a packet. We also developed a GUI that allows an observer to simultaneously watch the subject and input emotional reaction data. The reactionary information is later encoded as a set of keys which, along with the timestamp of that data, are converted to ASCII. Finally, we developed a mapping system that is used to convert between the predefined keys for the reactions and ASCII values in order to reduce computational strain when storing observer information.

During our research, we found that our custom array buffer solution for collecting landmark data is able to achieve almost a 90% reduction in the number of bytes stored in memory and on the local device. With our custom key mapping solution for observer data collection, we found that we were able to achieve at least a 94% reduction in the number of bytes stored on the local device.

Future Directions

Our next goal is to further truncate the Mesh data in order to reduce computational strain while still maintaining high-precision point mapping. After this, our goal will be to integrate Task 1 and Task 2 into a single environment that runs both tasks simultaneously. While doing this, we will also consider the ethical implications of facial tracking and data collection and will design a solution that notifies users of exactly what is being collected and stored.

Once we complete development of the integrated environment and write software to create the data entries, we will begin populating entries for our aggregated data set. Typical facial emotional neural networks require thousands of images for training for accurate results, so it will be necessary to perform many iterations of data collection with different subjects.

If a mathematical correlation is found between face mesh landmark movements and reactionary data (emotions and emotional expressions such as clapping, booing, etc.), we will optimize data collection by creating a website platform where people online can contribute to the dataset. In addition, we will refine the architecture of the network so that it is feasible to deploy in current audience-feedback solutions. If successful, this project can be utilized to enhance the precision of machine learning algorithms and other neural networks in regards to facial landmark detection.


We would like to thank everyone who supported our project, including our mentors, Roshan and Ganesh. We would like to acknowledge Professor Tsachy Weissman of Stanford’s Electrical Engineering Department and the head of the Stanford Compression Forum for his support throughout this project. In addition, we would like to acknowledge Cindy Nguyen, the STEM to SHTEM Program Coordinator, for the constant check-ins and coordinating the many insightful events during the 8 week internship period. Thank you to all of the alumni, professors, and PhD students who presented research in a variety of fields in the past eight weeks. Lastly, thank you to past researchers and innovators; your work has helped and inspired our project.


  1. Bazarevsky, Valentin, et al. BlazeFace: Sub-Millisecond Neural Face Detection on Mobile GPUs. Google Research, 14 July 2019, https://arxiv.org/pdf/1907.05047.pdf.
  2. Coan, James, and John Allen. Handbook of Emotion Elicitation and Assessment. Oxford University Press, 2007.
  3. Riley, Sean. Detecting Faces (Viola Jones Algorithm) – Computerphile. Computerphile, 19 Oct. 2018, http://www.youtube.com/watch?v=uEJ71VlUmMQ.
  4. Ekman, Paul, and Wallace V. Friesen. “Measuring Facial Movement.” Paulekman, 1976, http://www.paulekman.com/wp-content/uploads/2013/07/Measuring-Facial-Movement.pdf.
  5. Farnsworth, Bryn. What Is Facial Expression Analysis? (And How Does It Work?). IMotions, 2 Oct. 2018, https://imotions.com/blog/facial-expression/.
  6. Farnsworth, Bryn. “Facial Action Coding System (FACS) – A Visual Guidebook.” IMotions, 2019, https://imotions.com/blog/facial-action-coding-system/.
  7. Grey, Eli. FileSaver.js. Github, 19 Nov. 2020, https://github.com/eligrey/FileSaver.js/.
  8. MediaPipe. MediaPipe Face Detection. Google, 2020, https://google.github.io/mediapipe/solutions/face_detection.html.
  9. MediaPipe. MediaPipe Face Mesh. Google, 2020, https://google.github.io/mediapipe/solutions/face_mesh.html.
  10. OpenCV. Cascade Classifier. Open CV, 11 Aug. 2021, https://www.docs.opencv.org/3.4/db/d28/tutorial_cascade_classifier.html.
  11. Paul Ekman Group. “Universal Emotions.” Paul Ekman Group, https://www.paulekman.com/universal-emotions/.
  12. Saha, Sumit. “A Comprehensive Guide to Convolutional Neural Networks — the ELI5 Way.” Towards Data Science, 15 Dec. 2018, https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53.
  13. Schwartz, Rene. “Extending the Purpose of the Audience-Performer Feedback Loop (APFL).” The Story Is Everything, 18 Mar. 2021, https://renemarcel.opened.ca/2021/03/18/chapter-four-conclusion/.
  14. Viola, Paul, and Michael Jones. Rapid Object Detection Using a Boosted Cascade of Simple Features. IEEE, 2001, https://ieeexplore.ieee.org/Xplore/home.jsp.
  15. Zijderveld, Gabi. “The World’s Largest Emotion Database: 5.3 Million Faces and Counting.” Affectiva, 14 Apr. 2017, https://blog.affectiva.com/the-worlds-largest-emotion-database-5.3-million-faces-and-counting.