Implementing and Analyzing Three Machine Learning Models For Brain Tumor Recognition

Journal for High Schoolers, Journal for High Schoolers 2022

Vivian Chang, Ashwin Chintalapati, Chessa Park, Artem Tesov, and Keying Zhang


Magnetic resonance imaging (MRI) is a medical imaging technique that uses a magnetic field and computer-generated radio waves to create images of the organs and tissues of the human body. Radiologists use these images to assess a patient’s condition, detect tumors or other anomalies, and reach a specific diagnosis. However, some tumors are difficult to identify in a limited time frame. Over the years, scientists have focused on improving both the versatility and the accuracy of MRI machines, and one way to do this has been to make the process more automated. Machine learning allows systems to recognize patterns, improve from experience, and predict outcomes. Such models can be used to identify brain tumors from brain MRI images.

Here we implement the following three models: the K-means algorithm with image clustering, the K-means algorithm with feature clustering, and a convolutional neural network for image classification. In addition, we analyze the performance of each model using the following metrics: accuracy, ease of training and tuning, generalizability, and explainability.


K-means algorithm with image clustering

In this K-means model, each cluster is meant to represent a group of images with similar features that determine whether the images have tumors or not. For example, in a model with five clusters, three might represent groups of images with tumors and two might represent groups of images without tumors. Each data point in a cluster is a flattened one-dimensional array representing an image file, where the elements of the array are the RGB values normalized to a range of 0-1.
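The flattening and normalization described above can be sketched as follows. This is a minimal illustration, assuming 8-bit RGB values (0-255) stored as nested lists; the function name and the tiny example image are hypothetical and not taken from the study.

```python
def flatten_and_normalize(image):
    """Flatten an H x W x 3 nested list of 0-255 RGB values into one
    1-D list of floats in the range 0-1."""
    return [channel / 255.0
            for row in image
            for pixel in row
            for channel in pixel]


# Hypothetical 2 x 2 RGB image used only for illustration.
tiny_image = [
    [(0, 0, 0), (255, 255, 255)],
    [(51, 102, 153), (255, 0, 0)],
]

vector = flatten_and_normalize(tiny_image)
print(len(vector))  # 2 * 2 * 3 = 12 values, one per colour channel
print(vector)
```

Each image becomes one such vector, and the collection of vectors is what a K-means implementation would cluster.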

This algorithm is implemented in VS Code using an available K-means algorithm from the scikit-learn library [11]. After the K-means algorithm returns the cluster labels, we use the following code to determine the predicted tumor statuses of the images.

In the above code, we first find the tumor status of each image, which is given in the file names of the training set; we call this the “True” status. We then relabel every image in a cluster as either “no” or “yes” based on the majority status of that cluster; we call this relabeled status “Predicted.” The next step for the model would be to place an image of unknown status into a cluster and return a predicted status for it.
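The relabeling step can be sketched as follows. This is a minimal illustration of majority voting per cluster, not necessarily the code used in the study; the function names and the example labels are hypothetical.

```python
from collections import Counter, defaultdict


def predict_by_majority(cluster_labels, true_statuses):
    """Give every image the majority "True" status of its cluster."""
    votes = defaultdict(Counter)
    for cluster, status in zip(cluster_labels, true_statuses):
        votes[cluster][status] += 1
    # On a tie, most_common keeps the status that was counted first.
    majority = {cluster: counts.most_common(1)[0][0]
                for cluster, counts in votes.items()}
    predicted = [majority[cluster] for cluster in cluster_labels]
    return predicted, majority


def accuracy(true_statuses, predicted):
    """Fraction of images whose predicted status matches the true one."""
    correct = sum(t == p for t, p in zip(true_statuses, predicted))
    return correct / len(true_statuses)


# Hypothetical output of a K-means run with k = 3 on eight images.
cluster_labels = [0, 0, 0, 1, 1, 2, 2, 2]
true_statuses = ["yes", "yes", "no", "no", "no", "yes", "no", "yes"]

predicted, majority = predict_by_majority(cluster_labels, true_statuses)
print(majority)
print(accuracy(true_statuses, predicted))
```

Comparing the “True” and “Predicted” lists in this way gives the raw accuracy score for a given number of clusters.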

K-means algorithm with feature clustering

To implement the K-means algorithm with feature clustering, we need to isolate and iterate through the images, extracting the two specific features we want from each. We then use VS Code and Jupyter Notebooks to form and analyze the clusters and to verify the accuracy of the model. Depending on the resulting accuracy, we change dimensions, scale our values, or do both, and we use the elbow method to verify the practicality of the algorithm for this data.

The following code allows us to input image statistics into a .csv file:
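A minimal sketch of such a routine is shown here, assuming grayscale images given as rows of 0-255 integers. The column names, the file names, and the reading of “black+white pixels” as a count of pixels that are exactly 0 or 255 are assumptions made for illustration.

```python
import csv
import io


def image_statistics(pixels):
    """Return (average pixel value, pixel sum, black+white pixel count)
    for a grayscale image given as rows of 0-255 integers."""
    flat = [value for row in pixels for value in row]
    pixel_sum = sum(flat)
    average = pixel_sum / len(flat)
    # Assumption: "black+white pixels" counts pixels that are exactly 0 or 255.
    black_white = sum(1 for value in flat if value in (0, 255))
    return average, pixel_sum, black_white


def write_statistics(images, out_file):
    """Write a header row plus one CSV row per (name, status, pixels) entry."""
    writer = csv.writer(out_file)
    writer.writerow(["file", "status", "avg_pixel", "pixel_sum", "black_white"])
    for name, status, pixels in images:
        writer.writerow([name, status, *image_statistics(pixels)])


# Hypothetical 2 x 2 images standing in for decoded MRI scans.
images = [
    ("yes_1.jpg", "yes", [[0, 255], [100, 45]]),
    ("no_1.jpg", "no", [[0, 0], [10, 30]]),
]

buffer = io.StringIO()  # stands in for open("stats.csv", "w", newline="")
write_statistics(images, buffer)
print(buffer.getvalue())
```

Any pair of the resulting columns can then be used as the two-dimensional input to K-means.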

Now we can analyze the data and use the K-means algorithm to find clusters and centroids. We test for model accuracy as well as ease of training and tuning (how easily the model can be used with different parameter values). The final components we check are generalizability and explainability. Both are strengths of the K-means model, since the concept is rather easy to understand and can be applied to any two-dimensional set of data.

Convolutional Neural Network For Image Classification

Our third model takes an alternative approach to image classification: a convolutional neural network (CNN). Unlike the K-means implementations, this model trains itself to classify images based on the details they are composed of and to identify images containing components that indicate the presence of a brain tumor. For this implementation, we used the TensorFlow [3] Python library and its higher-level Keras API. Keras was used for convenience, since low-level optimization is not needed for a proof-of-concept implementation. The following code processed images for the model, created the convolutional neural network, and trained and tested the model, using the common MRI dataset for training and an additional MRI dataset for testing accuracy (see Materials):

First, testing and training images are retrieved and formatted, with batch size matching the number of images in each directory. Then, a model is created with three convolutional layers, each with its respective pooling layer and a dropout layer. Afterwards, the model is compiled using the categorical cross-entropy loss function provided by Keras, an optimizer, and a metric. Finally, the model is trained and tested over 300 epochs at a batch size of 2. It should be noted that GPU limitations required us to balance the image resolution, image batch size, and kernel count. The GPU used in this implementation was an NVIDIA RTX 2080 with 8GB VRAM. Though the batch size was low, this had negligible effects on the test accuracy when compensated for with a high epoch count. Finally, Matplotlib was used to chart the accuracy data produced per epoch. This allows us to visualize the accuracy of the model, as well as analyze the significance that epoch count has on model accuracy.
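To make the loss function mentioned above concrete, here is a hand-written sketch of softmax followed by categorical cross-entropy for a single example. Keras computes this internally; the version below is only an illustration, and the class ordering (no, yes) and the example scores are assumptions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot_target, probabilities, eps=1e-12):
    """Loss for one example: -sum(target_c * log(prob_c))."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(one_hot_target, probabilities))


# Hypothetical network outputs for one scan; classes are ordered (no, yes).
probs = softmax([0.0, math.log(3.0)])  # roughly [0.25, 0.75]
loss_if_yes = categorical_cross_entropy([0, 1], probs)
loss_if_no = categorical_cross_entropy([1, 0], probs)
print(probs, loss_if_yes, loss_if_no)
```

The loss is small when the network assigns high probability to the true class and grows as that probability shrinks, which is the quantity the optimizer reduces during training.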


The training dataset used by all three models is the “Brain MRI Images for Brain Tumor Detection” set from Kaggle[1]. Below are two MRI scan files from the dataset.

Fig. 1: Brain with no tumor
Fig. 2: Brain with tumor

It should also be noted that the convolutional neural network implementation used an additional 192 images from the “Brain Tumor Classification (MRI)” set, also acquired from Kaggle [2]. Some images from this dataset were excluded because their framing was inconsistent with the first dataset.

Fig. 3: Brain with tumor
Fig. 4: Brain with no tumor


K-means algorithm with image clustering

After implementation, the model returns a raw accuracy score of around 75%. The score varies as the number of clusters ranges from 2 to 8 and is highest with 4 clusters, which suggests that k=4 is the optimal number of clusters for this model.

Based on this implementation, we now look at the four metrics. The accuracy metric receives a score of average, with raw accuracy around 75%. The model shows much potential, since this is only the first version of the model. With further optimization, such as removing outliers, and training on a larger data set, the accuracy score has the potential to increase significantly. The ease of training/tuning metric receives a score of average, since the K-means algorithm only returns unlabeled clusters and requires an additional algorithm to retrieve data. The generalizability metric receives a score of good, since the model scales well to other datasets, including those that are much larger. The explainability metric receives a score of good, since the K-means algorithm is a simple concept and can easily be implemented.

K-means algorithm with feature clustering

#1: Average pixel value vs pixel sum

Looking at the preliminary results for average pixel value vs. pixel sum, we face a complication. Since clusters are formed by calculating the sum of squared errors between each data point and its centroid, the x and y axes have to be on the same scale. In this case, since the y-axis values are skewed, we must scale both parameters using the MinMaxScaler.
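The effect of min-max scaling can be illustrated with a short hand-written sketch. It applies the usual formula (x - min) / (max - min) to each feature column; the function name and the example numbers are hypothetical, and a library scaler would normally be used instead.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range 0-1."""
    low, high = min(values), max(values)
    if high == low:  # constant column: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical feature columns on very different scales.
avg_pixel = [40.0, 55.0, 70.0]
pixel_sum = [2_000_000, 2_750_000, 3_500_000]

print(min_max_scale(avg_pixel))
print(min_max_scale(pixel_sum))
```

After scaling, both features contribute comparably to the squared distances that K-means minimizes, instead of the larger-valued feature dominating.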

After scaling all relevant data, we achieve the following graph:

About 72% of the data points are in the correct clusters.

Similarly, we achieve the following for average pixel value vs. black+white pixels:

This graph has far less skew and fewer outliers than the previous graph, indicating it may be more accurate. Indeed, when verifying the model, we see it has 75% accuracy, a slight improvement.

The Final Test: The Elbow Method

The elbow method is a test to determine the appropriate number of clusters, and we use it to determine the practicality of using the K-means algorithm with this data. Given our data, we want to obtain two clusters: one yes cluster and one no cluster. The elbow method is done by plotting a graph of K vs. the sum of squared errors. The “elbow” of the arm depicted in the graph marks the ideal number of clusters.
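The computation behind an elbow plot can be sketched as follows. This is a basic Lloyd-style K-means written in plain Python for illustration; the deterministic starting centroids, the function names, and the toy points are assumptions, and a library implementation would normally be used on real data.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_sse(points, k, iterations=20):
    """Run a basic K-means and return the sum of squared errors."""
    n = len(points)
    # Assumption: deterministic start from evenly spaced data points.
    centroids = [points[i * n // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return sum(min(squared_distance(p, c) for c in centroids)
               for p in points)


# Toy data with two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, sse)
```

On this toy data the error drops sharply from k=1 to k=2 and only slightly afterwards, which is the bend that the elbow method looks for.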

The elbow occurs at k=2, so two is the ideal number of clusters for the model, which confirms our initial expectation. This gives us good reason to believe that the K-means algorithm is a suitable tool for building these models and analyzing brain tumors. If this model continues to be improved and the training data increases, it could provide a long-term solution for MRI image anomaly detection.

Convolutional Neural Network For Image Classification

Running the program yielded a testing dataset accuracy rate of about 70% (see below).

Altering various parameters within the restrictions of the limited VRAM and dataset size yielded negligible benefits. It should be noted that outliers, such as the accuracy levels at ~175 and ~225 epochs, were results of the low batch size and are not present when running on hardware that allows a larger batch size.

We can now inspect these results in the context of our four metrics. In terms of accuracy, the 70% accuracy rate places the model about on par with the K-means algorithm. However, more advanced implementations of this method have yielded near 100% accuracy rates [4], indicating that significant optimization is possible if hardware and dataset limitations are removed. Tuning and generalizing this method are straightforward, since changing parameters is relatively intuitive and sub-classes of the yes and no categories can be created by altering a few lines of code. The explainability of the CNN is more complicated, especially if low-level optimization is wanted; however, this is simply the nature of convolutional neural networks and of most advanced computer vision.


All three models share similar accuracy rates of 70-75%. To be more specific, the K-means clustering by image method achieved an accuracy of 75%; the K-means clustering by image feature method achieved an accuracy of 72% for the pixel sum method and 75% for the black and white method; and the convolutional neural network achieved an accuracy rate of 70%. With these accuracy rates, we conclude that these models, if tuned further, can accurately identify tumors from MRI scans.

Future Direction

To improve our performance results, we can look to increase the data set available for testing. With more scans to test our models, we can more accurately determine specific areas of improvement. Additionally, we hope that we can improve our models with more advanced technology. For instance, by creating models with a more advanced Graphics Processing Unit (GPU), we can develop our code to be more efficient and extensive so that it can cover finer details in each MRI scan.

These methods are generally easy to train and tune and, because each program scales well, have great generalizability. In addition, with the exception of the convolutional neural network method, each model is relatively easy to explain. As a result, health care professionals can easily adapt and use these or similar models in their workplace. With the development and application of our models in the medical field, we hope to accelerate the diagnosis of brain tumors using MRI scans.


  1. Chakrabarty, N. (2019, April 14). Brain MRI images for Brain tumor detection. Kaggle. Retrieved August 4, 2022, from
  2. “Sartaj”. (2020, May 24). Brain tumor classification (MRI). Kaggle. Retrieved August 4, 2022, from
  3. Alphabet inc. (n.d.). Tensorflow. TensorFlow. Retrieved August 4, 2022, from
  4. Chattopadhyay, A., & Maitra, M. (2022, February 25). MRI-based brain tumour image detection using CNN based Deep Learning Method. Neuroscience Informatics. Retrieved July 28, 2022, from ork%2C%20CNN%20gained,the%20result%20obtained%20so%20far
  5. Bien, N. (2018, February 9). Don’t just scan this: Deep learning techniques for MRI. Medium. Retrieved August 5, 2022, from mri-52610e9b7a85
  6. Busch, H. von. (2019, June 6). Artificial Intelligence for MRI. Siemens Healthineers. Retrieved August 5, 2022, from for-mri.html
  7. Gupta, S. (2021, January 25). Image clustering using K-means. Medium. Retrieved August 5, 2022, from a0
  8. Mishra, V. (2019, January 9). K means algorithm explained with an example. Medium. Retrieved August 5, 2022, from 64ce1
  9. Piazza, G. (2018, April 17). Artificial intelligence enhances MRI scans. National Institutes of Health. Retrieved August 5, 2022, from
  10. Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011
  11. Franklin, S. J. (2020, January 2). K-means clustering for image classification. Medium. Retrieved July 28, 2022, from

Machine Learning and Truncated Monte Carlo Data Shapley to Predict Stages of Cirrhosis

Journal for High Schoolers, Journal for High Schoolers 2022

Ethan Daniel Taylor*, Joseph Chai*, Joyce Zheng*, Juliana Maria Gaitan*, Paulos Waiyaki*, Tara Maria Salli*, David Jose Florez Rodriguez**

*These authors contributed equally to this work **Mentor


Cirrhosis is a common and deadly disease that requires the time and experience of a doctor to diagnose. We hypothesized that we could use machine learning and a truncated Monte Carlo Data Shapley algorithm to diagnose the stage of cirrhosis, which would save doctors time and save hospitals and patients money. We sourced the dataset, which was collected by the Mayo Clinic, from Kaggle. We first cleaned the data and then explored potential models, settling on a TensorFlow neural network classifier. We then explored the hyperparameter space to find an ideal neural network for our task. A truncated Monte Carlo Data Shapley algorithm was then applied to our model to improve accuracy given such a small set of data. We found that this method, with further tuning and a larger data set, has the potential to aid doctors in diagnosing the stage of cirrhosis.


Cirrhosis is among the most common causes of death worldwide, and with no definitive cure, early detection is paramount for a patient’s survival. Cirrhosis affects a patient’s liver, gradually replacing healthy liver cells with scarred cells. Because of its progressive nature, cirrhosis can take several years to fully develop. This makes it extremely difficult to detect early on, significantly decreasing a patient’s chances of survival.

There are four stages of cirrhosis [1]. The first stage begins with inflammation of the bile duct and/or liver; in the second stage, the inflammation from the previous stage begins to scar; in the third stage, the liver loses its ability to function optimally because of the scarring; and finally, the fourth stage results in liver failure and a high risk of developing liver cancer.

Cirrhosis can be detected either by radiology testing or by a needle biopsy of the liver [2]. Both methods are costly and can become a significant financial burden for the patient and their family; therefore, less invasive and more cost-effective ways to triage a patient’s risk for cirrhosis using common medical information could increase the chances of early detection. With this goal in mind, our team decided to test various AI models to predict a patient’s stage of cirrhosis purely from their medical data.


In the first set of experiments, we used the website to conduct the tests. Before the tests could run, we had to deal with NaN values in the data. To make the data set usable, two versions of the data were created: ‘MedNan’, where NaN values were replaced with medians, and ‘ZeroNan’, where NaN values were replaced with zeroes. After defining these two sets of data, we explored various architectures, learning rates (LR), whether the predicted stage variable was treated as categorical or numerical, and different activation functions: Relu, Tanh, and Sigmoid. In total, we explored 180 models.
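The two ways of handling NaN values can be sketched as follows. This is a minimal illustration for a single numerical column; the function name and the example values are hypothetical and are not taken from the data set.

```python
import math
from statistics import median


def fill_nan(column, strategy):
    """Replace NaN entries with the column median ("median") or 0 ("zero")."""
    observed = [v for v in column if not math.isnan(v)]
    fill = median(observed) if strategy == "median" else 0.0
    return [fill if math.isnan(v) else v for v in column]


# Hypothetical lab-value column with two missing entries.
nan = float("nan")
column = [1.2, nan, 0.8, 3.4, nan, 1.0]

med_nan = fill_nan(column, "median")   # the 'MedNan' style
zero_nan = fill_nan(column, "zero")    # the 'ZeroNan' style
print(med_nan)
print(zero_nan)
```

In this example the median fill keeps the replaced entries inside the range of the observed values, whereas the zero fill introduces values below every observed measurement.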

As shown in figure 1, the best model from those experiments was a MedNan, categorical, Relu model with four layers, 64 cells per layer, and a 0.04 learning rate; this model had an accuracy rate of 57%. The worst model was also a MedNan, categorical, Relu model, with three layers, a 0.1 learning rate, and an 18% accuracy rate.

As shown in figure 2, the MedNan results performed better than the best and worst results from ZeroNan. Among the ZeroNan tests, a categorical Sigmoid model with LR 0.0003 and 22% accuracy was the worst, and a categorical Relu model with 0.1 LR and 51% accuracy was the best. In comparison, the worst and best results from MedNan are, respectively, a categorical Sigmoid model with 1 LR and 5% accuracy and a categorical Relu model with 0.04 LR and 57% accuracy. Additionally, regression architectures typically performed poorly overall, as they gave nonsensical numbers for the stages, including negatives and values above four. As for activation functions, figures 1 and 2 show that Relu tended to have higher accuracy rates than Sigmoid and Tanh, specifically for the MedNan results. The best Tanh model was a categorical Tanh model with a 0.003 LR, three layers, and 55% accuracy. The best Sigmoid model was a categorical Sigmoid model with a 0.04 LR, three layers, and 47% accuracy.

Focusing on the MedNan categorical results, lower learning rates and deeper models tended to perform better in experiments. This may be because deeper models have more parameters and thus a greater capacity to learn.

With a clear preference for model type and for the handling of NaN values, the next experiments explored more architectures (number of layers and number of cells per layer), hyperparameters, regularization rates, and learning rates, now written directly in Python. Each model in this set of experiments was trained for 50 epochs, using a data set that included only the patients with all the data filled out. The accuracies of the models ranged from roughly 20% to 48%. As shown in figure 3, the three best models each had an accuracy of 48.28%; all used a learning rate of 0.003 and a regularization rate of 0.03, with cell numbers (100,100,100), (60,50,40,30,20), and (50,100,50,20) respectively. The worst models had 21.84% accuracy rates and used the following settings:

  1. Learning rate (0.003), regularization (0.01), and cell numbers (100,100,100)
  2. Learning rate (0.0003), regularization (0.03), and cell numbers (60,50,40,30,20)
  3. Learning rate (0.003), regularization (0.01), and cell numbers (200,200).

After that, a data selection algorithm gave every patient a ‘value’, approximating how much it would help in training a model. The optimal architecture from previous experiments then trained on the top 50, 100, 150, 200, and 250 patients best fit for the model. These 5 models and their performances were compared to models with the same optimal architecture but trained on random sets of 50, 100, 150, 200, and 250 patients.
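The comparison between value-based subsets and random subsets can be sketched as follows. This is only an illustration: the patient ids, the values, the function names, and the fixed random seed are hypothetical.

```python
import random


def top_n_patients(values, n):
    """Return the n patient ids with the highest data value."""
    ranked = sorted(values, key=values.get, reverse=True)
    return ranked[:n]


def random_patients(values, n, seed=0):
    """Return n patient ids drawn at random without replacement."""
    return random.Random(seed).sample(sorted(values), n)


# Hypothetical mapping of patient id -> value from a data selection run.
values = {101: 0.04, 102: -0.01, 103: 0.09, 104: 0.00, 105: 0.02}

selected = top_n_patients(values, 3)
baseline = random_patients(values, 3)
print(selected)
print(baseline)
```

Training the same architecture once on `selected` and once on `baseline` gives the pairwise comparison described above for each subset size.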

Data selection results peaked at 200 patients but showed no observable pattern. Random selection, on the other hand, grew in accuracy with each increase in size from 50 to 250 patients and had the highest accuracy rate overall between the two approaches.

Data selection outperformed random selection for most of the models run, except for the models with 250 patients, where data selection had 41.4% accuracy and random selection had 47.1% accuracy. For all other comparisons, there is an average difference of 11.1% between the results from data selection and the results from random selection (the maximum being 15.9% with 50 patients and the minimum being 8% with 200 patients).

Two more experiments were run: one using a data set with the first 325 patients (including ones with missing data), and another using only patients with all the data filled out, which totaled 276 patients. Using the data set with the first 325 patients, the model yielded a 49.6% accuracy rate. In comparison, the latter yielded a 41.4% accuracy rate.


Our first attempt to construct an accurate model used a linear regression model to predict the stage. This initial experiment was unsuccessful, and the model obtained poor results. The trial likely failed because linear regression assumes that the output is a linear function of the inputs. That does not appear to be the case with our data set, which had 32 input columns representing 18 numerical and categorical variables. This led us to conclude that linear regression models are inadequate for the stage prediction task, which is likely non-linear.

The next experiments used neural network regressors and classifiers. The data was split between MedNan and ZeroNan, tested with various hyperparameters, and run through models predicting the stage as either a categorical or a numerical feature. The 180 different tests yielded three key findings.

The first key finding was that MedNan results performed better than ZeroNan results. This might be because replacing NaN values with the median of their column reinforces the median and makes the resulting feature more consistent, whereas replacing NaN values with zeroes reduces accuracy and skews the data by introducing outlier values.

The second key finding was that results from regression architectures were poor compared to results from categorical models. This might be because the stage is a categorical variable, so treating it as numerical is ineffective. In addition, many of our variables were categorical (drug administered, sex, and stage). Their presence in the input may favor a similar categorical structure in the output.

The third key finding was that the Relu activation function was the most accurate one tested; the best Tanh model came close (55% versus 57%), while the best Sigmoid model was further behind (47%). Having found the model type and NaN handling method that produced the best results (MedNan, categorical, and Relu), the research progressed to later stages.

The next step in the research was tuning hyperparameters and optimizing the architecture even further (various numbers of layers and cells per layer). We used a classifier and integrated a Data Shapley algorithm into our existing models, implemented in Python 3 through TensorFlow, to help improve our results. After numerous trials, we arrived at a model that provides predictions of the stage of cirrhosis with accuracy well above chance. A data selection algorithm employing this model then assigned values to our data. The ‘highly valuable’ subsets chosen in this way were used to train multiple models with the optimal architecture; other models were trained on random subsets of the data, and one model was trained on a subset including only patients with no missing data. In four out of five comparisons, models trained on subsets chosen by data selection outperformed models trained on random subsets of the same size, and they also outperformed the model trained on the subset with no NaNs; however, the model trained on all the data had the highest performance. The goal of constructing a model that can predict a patient’s stage of cirrhosis using data selection was complete!

Methods and Materials

Cleaning up the data

We acquired our training and validation data from a Kaggle dataset. This data was collected by the Mayo Clinic and contains 418 cirrhosis patients, each with 20 data points. The data has some issues, however. Six patients were missing a stage; they were removed from our research, since the stage is the one piece of information we must have. Some patients had ‘not a number’ (NaN) values for numerical data points.

There were also categorical values that needed to be converted to one-hot encoding for the neural network. This encoding converts each potential value of a category into its own separate column, which holds a binary value representing whether that value is present. For example, in the category “Drug Administered” there can be several values, such as ‘placebo’, ‘D-penicillamine’, or ‘N/A’. One-hot encoding makes a new column in the data for each possible value (placebo, …) and uses 1s and 0s to denote whether the patient has that value or not. These values can then be fed into the machine learning model.

Numerical NaNs are different from categorical NaNs. When a categorical data point is missing, it means that it is not present or that the patient does not fit into that category. A numerical NaN, however, should have a number value, which meant the data was partially incomplete. To resolve the issue of missing data, we ran four tests. These were run on the website for ease of use and repeatability, and all used exactly the same neural network. In one test, the dataset was exactly as downloaded apart from the one-hot encoding. Another test included only the patients with complete data. One test replaced the NaNs with zeros, and the last test replaced the NaNs with the median value of that variable. The median value replacement outperformed the other datasets, so this dataset was used for the rest of the experiment.

With our small data size, it was more valuable to have more patients with mostly complete data and filled-in NaNs than to have perfect data but fewer patients. Most of our experiments were performed using the first 325 patients as our training set and the rest as our validation set. Only when using data selection did we change this.
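The one-hot encoding step described in this section can be sketched as follows. This is a minimal illustration using plain dictionaries; the function name, the column names, and the ages are hypothetical, while the three drug values are the ones named in the text.

```python
def one_hot_encode(records, column):
    """Replace one categorical column with a 0/1 column per category."""
    categories = sorted({record[column] for record in records})
    encoded = []
    for record in records:
        row = {key: value for key, value in record.items() if key != column}
        for category in categories:
            row[f"{column}_{category}"] = 1 if record[column] == category else 0
        encoded.append(row)
    return encoded


# Hypothetical patient rows for illustration.
patients = [
    {"age": 58, "drug": "placebo"},
    {"age": 41, "drug": "D-penicillamine"},
    {"age": 66, "drug": "N/A"},
]

encoded = one_hot_encode(patients, "drug")
for row in encoded:
    print(row)
```

Each patient ends up with exactly one 1 among the new drug columns, which is the numeric form a neural network can consume.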

Exploring Models and Hyperparameter Tuning

With the newly cleaned data, we explored various potential models. The very first thing we tried was a simple linear regression: we fed in some data points, and it attempted to predict the others. This proved ineffective; with so many variables, a linear model performs very poorly. Next, the question of a neural network regressor versus a classifier came up. The stage variable is a number, so one could use a neural network regressor to output a number. What we found was that it was wildly inaccurate, very frequently leaving the one-to-four range. Regression also implies a mathematical relationship between the stages of cirrhosis, e.g., that stage one plus stage three equals stage four. This is not consistent with our current understanding of medicine and the stages of cirrhosis. Per the results, using a neural network classifier to classify patients into one of four stages proved to be more accurate. On the website, we also explored different activation functions per layer: Relu, Tanh, and Sigmoid were all tested. Nothing was able to beat the validation accuracy of the Relu activation function, so it was used for the rest of the experiments.

The rest of our experiments used a neural network classifier. These experiments were run in a Google Colab notebook, a Python 3 environment, to have more flexibility in building models and in applying the Data Shapley algorithm to our learning model. The TensorFlow Keras library was used to create the neural network classifier. To optimize the accuracy of the model, we explored the hyperparameter space: two nested ‘for’ loops tested combinations of three different learning rates and three different regularization rates. These hyperparameters were tested on 18 different architectures ranging from one to five layers, with varying numbers of cells per layer. All the activation functions were Relu, except the last activation function, which was Softmax. Learning rates ranged from .03 to .0003 and regularization rates from .1 to .01. The model that showed the most promise as a model for cirrhosis detection was selected. The goal was to obtain a model with a demonstrated capacity for learning that could then be used to quantify how much each data point in the data set (each patient) contributes to overall learning, via a truncated Monte Carlo Data Shapley algorithm.
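The nested-loop search over hyperparameters can be sketched as follows. This is an illustration under several assumptions: only four of the architectures are listed (those reported in the results), the middle learning and regularization rates are taken from the reported results, and `placeholder_evaluate` is a hypothetical stand-in that does not train any model.

```python
def grid_search(architectures, learning_rates, regularization_rates, evaluate):
    """Try every combination; return (best_score, best_config, n_tried)."""
    best_score, best_config, tried = float("-inf"), None, 0
    for architecture in architectures:
        for learning_rate in learning_rates:
            for regularization in regularization_rates:
                score = evaluate(architecture, learning_rate, regularization)
                tried += 1
                if score > best_score:
                    best_score = score
                    best_config = (architecture, learning_rate, regularization)
    return best_score, best_config, tried


# Assumed search space; the study used 18 architectures in total.
architectures = [(100, 100, 100), (60, 50, 40, 30, 20),
                 (50, 100, 50, 20), (200, 200)]
learning_rates = [0.03, 0.003, 0.0003]
regularization_rates = [0.1, 0.03, 0.01]


def placeholder_evaluate(architecture, learning_rate, regularization):
    # Hypothetical stand-in score, NOT a real model. In practice this would
    # build and train a classifier and return its validation accuracy.
    return (-abs(learning_rate - 0.003) - abs(regularization - 0.03)
            - 1e-6 * sum(architecture))


best_score, best_config, tried = grid_search(
    architectures, learning_rates, regularization_rates, placeholder_evaluate)
print(tried, best_config)
```

Passing the scoring function in as an argument keeps the search loop independent of how a model is built and trained.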

Why do we use Data Selection?

When selecting training data for a learning model, most data selection algorithms use three main parts: the data set, the model, and an evaluation metric. The idea is to use a portion of the data set to train the model to accurately predict a predefined, separate validation set. When the model is trained to its highest possible accuracy, it can then be used to predict unknown outputs from known inputs. One option is to select training points at random, but a better way is to find out how valuable each data point is to the performance of the model. The method built on this idea is called Data Shapley, a data selection algorithm designed by Assistant Professor James Zou at Stanford University.

What is Data Shapley?

Data Shapley is a data selection algorithm commonly used for supervised machine learning and commonly applied as a Truncated Monte Carlo (TMC) simulation. In our research, the algorithm randomly selects one patient from the 418 rows and records the validation loss after training on that data point. Next, another patient is randomly selected without replacement, and its Shapley value is set as the difference between the resulting validation loss and the previous one. This process is repeated until the Data Shapley values of our randomly selected points plateau. (This happens in TMC, but doesn’t happen in Data Shapley. In Data Shapley, we stop each iteration when validation loss plateaus for a single selected point rather than the randomly selected points.) Afterwards, the algorithm calculates the median Data Shapley value for every data point; median values can be used to minimize the impact of outliers on the data selection model’s accuracy. If there is a larger data set, including several dozen approximations of each data point’s Shapley value, then it is sufficient to compute the mean of the Shapley values for each data point. We used the mean, since having over 50 approximations of each Shapley value for most patients makes means preferable to medians. With a value for each data point, Data Shapley allows researchers to select the “n” best data points as the optimal training data.
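A generic sketch of a truncated Monte Carlo estimate of Data Shapley values is given here. It is not the implementation used in this work: the function names, the tolerance-based truncation rule, and the toy utility are assumptions. In a real setting, `utility` would train the model on the given subset and return a validation score.

```python
import random


def tmc_shapley(points, utility, n_permutations=200, tolerance=0.0, seed=0):
    """Approximate each point's Shapley value by averaging its marginal
    contribution over random orderings of the data."""
    rng = random.Random(seed)
    full_score = utility(points)
    totals = {p: 0.0 for p in points}
    for _ in range(n_permutations):
        order = list(points)
        rng.shuffle(order)
        subset, previous = [], utility([])
        for p in order:
            # Truncation: once the subset scores about as well as the full
            # data set, the remaining marginal contributions count as zero.
            if abs(full_score - previous) < tolerance:
                break
            subset.append(p)
            current = utility(subset)
            totals[p] += current - previous
            previous = current
    return {p: total / n_permutations for p, total in totals.items()}


# Toy additive utility: each point adds a fixed amount to the score, so the
# exact Shapley value of a point equals its weight.
weights = {"a": 0.5, "b": 0.25, "c": -0.125}


def toy_utility(subset):
    return sum(weights[p] for p in subset)


shapley_values = tmc_shapley(list(weights), toy_utility)
print(shapley_values)
```

With the default tolerance of 0.0 no truncation happens, which makes the toy result easy to check; a positive tolerance trades some accuracy for fewer model trainings per permutation.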

In the data selection stage of our research, each group member individually ran the Data Shapley selection algorithm, generating more approximations of each data point’s Shapley value by leveraging multiple computers to minimize the real time spent. We combined the individual results using a custom algorithm made for this work, which can be found in our GitHub repository. The outputs of each individual data_shapley selection run were the patient identification number, the data_shapley value, and the sample count, denoted respectively as PIN, DSV, and SC. PIN is the index number assigned to a patient in our original data set downloaded from Kaggle. DSV is the Data Shapley value assigned to that patient (the mean of all approximate Shapley values for that patient), which is used to evaluate that data point’s usefulness to the training of our model. SC is the number of times that a data point was randomly selected during the run of the Data Shapley selection algorithm.

We combined all of our Data Shapley record sheets as follows. The rows were first grouped by PIN. Next, the combiner program calculated the weighted mean of the data_shapley values, weighting each value by its sample count: $\mathrm{DSV}_{\mathrm{combined}} = \sum_i \mathrm{DSV}_i \, \mathrm{SC}_i \,/\, \sum_i \mathrm{SC}_i$. The new sample count is set to the sum of all sample counts of that PIN from all data_shapley output sheets ($\mathrm{SC}_{\mathrm{combined}} = \sum_i \mathrm{SC}_i$). The PIN remains the same, and after all calculations the combined_output sheet is reordered numerically by PIN. Next, a few ML architectures were run with our Data Shapley selected training data using TensorFlow, and the model with the highest validation accuracy was recorded.
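The combining step can be sketched as follows, assuming each sheet is a list of (PIN, DSV, SC) rows and that the weighted mean uses the sample counts as weights. The function name and the example rows are hypothetical; the actual combiner program is the one in the repository mentioned above.

```python
from collections import defaultdict


def combine_sheets(sheets):
    """Merge (PIN, DSV, SC) rows from several runs.

    The combined DSV is the mean weighted by sample count, and the
    combined SC is the sum of the sample counts for that PIN."""
    weighted_sum = defaultdict(float)
    sample_count = defaultdict(int)
    for sheet in sheets:
        for pin, dsv, sc in sheet:
            weighted_sum[pin] += dsv * sc
            sample_count[pin] += sc
    # Rows come out ordered numerically by PIN; PINs never sampled are skipped.
    return [(pin, weighted_sum[pin] / sample_count[pin], sample_count[pin])
            for pin in sorted(sample_count) if sample_count[pin] > 0]


# Hypothetical output sheets from two group members.
sheet_a = [(2, 0.10, 30), (1, 0.50, 10)]
sheet_b = [(1, 0.25, 30), (3, -0.20, 5)]

combined = combine_sheets([sheet_a, sheet_b])
for row in combined:
    print(row)
```

Weighting by sample count means that a run which sampled a patient more often has proportionally more influence on that patient's combined value.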


This research aims to attempt data selection and explore general machine learning for predicting the stage of cirrhosis from patient health records. Early experiments dealt with how to handle missing data, whether to treat the stage of cirrhosis as numerical or categorical, and what activation function to use in a neural network. For convenience, these models were run on a no-code machine learning platform. With initial results in hand, research focused on hyperparameter tuning and further architecture exploration in TensorFlow Keras. The best model here obtained 49% accuracy, while chance is 25%. Data selection with this architecture then provided subsets of the training data that could perform well. Compared to random subsets of equal size and to the subset of patients with no missing data, data selection did improve results. However, models based on data selection did not outperform models trained on the whole data set.

Future work could carefully tune the data selection algorithm used, thus improving its selection of subsets. Additionally, more patients and more features might allow the development of better models if the resources were available.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5: Linear Regression Results


  1. Karthik Kumar, MBBS. “4 Stages of Cirrhosis of the Liver: 18 Symptoms, Causes & Treatment.” MedicineNet, MedicineNet, 7 Apr. 2022
  2. “Cirrhosis.” Mayo Clinic, Mayo Foundation for Medical Education and Research, 6 Feb. 2021

Algorithms on Stock Performance

Journal for High Schoolers, Journal for High Schoolers 2022

Kevin Xiao, Anahita Vaidhya, Nathan Pao, Audrey Kuo Stanford University and New York University


One important approach of systematic trade is to make decisions based on the predicted price. To predict accurate results, statistics and machine learning models are normally used. We seek to develop different models for stock price prediction, comparing the feasibility and accuracy of each. Our trading strategy is based on prediction, and we tested it by simulating real-time trade. This simulation grants us the real-time performance of different models. Through all the comparisons, we propose the effectiveness of different models in stock price predictions and trading.


Using machine learning and financial algorithms in market trading allows traders to make investment decisions faster and more accurately. In financial algorithmic trading, a program follows an algorithm to trade. Because of the extensive data the models train on, they can often make better-informed decisions than a human. This allows investors not only to invest at a faster pace but also to base their decisions on trends in previous data. Currently, companies use algorithms like arbitrage, index fund rebalancing, mean reversion, and market timing to make trading decisions quickly, yet these are based on rigid rules and cannot account for complexities in shifting stock prices. A machine learning model, by contrast, could look at past trends and make a more informed decision before trading. To develop a more accurate prediction model, we need to analyze a diverse set of existing prediction models, along with ones still in development, to ensure the model is as efficient and accurate as possible. We analyzed prediction models ranging from linear and logistic regression to AutoRegressive Integrated Moving Average (ARIMA), Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Holt-Winters Exponential Smoothing (HWES). With these models, we ran a backtest to measure their accuracy and set up an email program that sends an automated daily report informing the investor about the daily returns from each model. To optimize our statistical models, like ARIMA and HWES, we used detrending and adjusted for seasonality. To optimize our machine learning models, like LSTM and GRU, we added sequential layers that read input from every layer before them, creating a deep learning environment where the model can properly train on historical data. To improve our performance, we continuously adjusted the parameters used in modeling until we found a set that worked best. After testing our models extensively over a couple of weeks, we found that the LSTM model had the greatest return on average.


Various methods have historically been used to employ data science in stock trading, the first of which is algorithmic trading. This refers to the automatic buying and selling of stocks based on set rules and calculated decisions. Once learning models are created, machine learning is used to train the computer to accurately predict stock prices and their fluctuations in the future based on patterns and trends present in a certain amount of data from the past. After backtesting and comparing the predictions generated by the model to the actual stock returns, we can examine the variation and difference between the two values to optimize the model’s accuracy.

This approach to algorithmic trading is not new: in fact, many different models have been used for trading. These include both time series models and classification models. The former apply deep reinforcement learning or other machine learning to past stock price data to predict future prices, while the latter build model representations of given data points. Common models currently used for this purpose include XGBoost, a gradient-boosted decision tree library, and LSTM, a type of recurrent neural network.


Data Collection

The first step in our research process was data collection. We made a request to the Yahoo Finance API to collect historical data within a specific time frame. The time frame varied based on the model we looked into, but the consensus was to collect data at fifteen-minute intervals over a ten-day window. The data was collected into a Pandas DataFrame, a two-dimensional, size-mutable table that allows data manipulation.

Data Cleaning

The next step was to clean and filter the data. Yahoo Finance provides an array of useful data, including the opening, closing, high, and low prices, the trading volume, and the timestamps of the stock data. The information our models require is the timestamp and the closing price of the stock. Our first task was therefore to reduce the data tables to those two columns. After successfully creating a data frame of the closing price and timestamp, we could move on to fitting our data into a trainable dataset.

Data Fitting and Creating Features

The general procedure for building and testing all our models requires extensive work on building features that fit data into our models. After filtering to only the closing price of the data, we needed to scale the data to a range between 0 and 1. This is known as “Min-Max Normalization” and puts all values on a common scale, so that the raw magnitude of the prices does not dominate training. We will call this dataset the scaled dataset.
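
Min-max normalization maps each value v to (v - min) / (max - min). A minimal sketch in plain Python follows; the function name min_max_scale and the sample prices are illustrative assumptions, not our actual data.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no range; map everything to 0 by convention.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical closing prices.
closing_prices = [150.0, 152.5, 149.0, 155.0, 153.0]
scaled = min_max_scale(closing_prices)
```

In practice the minimum and maximum are usually taken from the training portion only and then reused on the test portion, so that no information from the test data leaks into training.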

Fig. 1

Next, we need to split our dataset. This is one of the most crucial steps, as training and evaluating the model on the same dataset hides overfitting, the case in which the model fits too closely to its training data. An overfitted model makes poor predictions on unseen data. We used an 80/20 split, with 80% of our dataset being used for training and the other 20% used for testing.
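
An 80/20 split can be sketched as below. The sketch assumes a chronological split (the earliest 80% for training, the latest 20% for testing), which is a common choice for time series; the function name and the placeholder series are illustrative.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into an early training part and a late test part."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

# Placeholder for a scaled closing-price series of 100 points.
scaled = [i / 99 for i in range(100)]
train, test = chronological_split(scaled)
```

Keeping the time order intact ensures the model is always tested on data that comes after everything it was trained on.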

Fig. 2

With our training split, we created the training data set that will be fed into our models.

Fig. 3

To accomplish this, we must create two more datasets, x_train and y_train. The dataset x_train represents historical data. The size of each x_train entry depends on the number of data points we plan to use in our model; in this case, we built our models on the past ten data points. The dataset y_train represents the future data points that will be compared to our model’s predictions to give our model an accuracy score.
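
The windowing described above can be sketched as follows: each x_train entry holds ten consecutive points, and the matching y_train entry is the point that comes right after them. The helper name make_windows and the placeholder series are assumptions for illustration.

```python
def make_windows(series, lookback=10):
    """Build (past window, next value) pairs from a time-ordered series."""
    x, y = [], []
    for i in range(lookback, len(series)):
        x.append(series[i - lookback:i])  # the previous `lookback` points
        y.append(series[i])               # the point to be predicted
    return x, y

# Placeholder for the training split.
train = [float(i) for i in range(80)]
x_train, y_train = make_windows(train)
```

A series of 80 points with a lookback of ten yields 70 training pairs. The same helper can be applied to the test split to produce x_test and y_test.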

Fig. 4

After training our models on these two datasets, we create two datasets for testing: x_test and y_test. The models make predictions from x_test, and these predictions are compared with y_test to produce a prediction score.

Fig. 5
Fig. 6

Model-Building Phase

We built a total of five distinct models. Three are statistical models that employ regression techniques such as linear and logistic regression, and the remaining two are machine learning models. The statistical models are Autoregressive Integrated Moving Average (ARIMA), Simple Exponential Smoothing (SES), and Holt-Winters Exponential Smoothing (HWES). The machine learning models are Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU).

Daily Report

Using Python's SMTP library, we constructed an automatic email program that takes in certain inputs at a certain time of day and sends a predetermined list of recipients a message containing those inputs. We used this email program to automatically send the return rate of each model from the backtest.
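
A minimal sketch of such a report, using the standard-library smtplib and email modules, might look like the following. The function names, addresses, subject line, and return values are hypothetical, and the sending function is defined but not called, since it needs a real mail server and credentials.

```python
import smtplib
from email.message import EmailMessage

def build_report(returns, sender, recipients):
    """Compose a plain-text email listing each model's return rate."""
    msg = EmailMessage()
    msg["Subject"] = "Daily model returns"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    lines = [f"{name}: {rate:+.2%}" for name, rate in sorted(returns.items())]
    msg.set_content("\n".join(lines))
    return msg

def send_report(msg, host, port, user, password):
    """Send the message over an SSL connection (server details are assumptions)."""
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(user, password)
        server.send_message(msg)

# Hypothetical daily returns for two models.
report = build_report(
    {"LSTM": 0.012, "ARIMA": -0.004},
    sender="bot@example.com",
    recipients=["analyst@example.com"],
)
```

Running this at a fixed time of day would be left to an external scheduler such as cron.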


Consistent across all model iterations was a measurement of performance upon historical financial data—backtesting. Here, we first quantified each model’s success with regard to the accuracy of their stock market predictions. However, given variations in each model’s implementation, we later implemented a generalized daily return algorithm for a tangible, statistical comparison.

Backtest Performances

Introductory Models:

As a means of gauging generalized market trends, we implemented the following simple yet powerful models: logistic regression and linear regression.

Logistic Regression

Being a binary classification model, logistic regression proved much less flexible in analyzing the stock market relative to our other models. However, we were able to achieve a generalized gain/loss prediction with increasing return accuracies throughout extended test cycles.

Figure 7. The results of our logistic regression algorithm in table format

Here, the confusion matrix, displayed in the table above, shows prediction accuracy increasing with longer observation periods on AAPL stock. The ability to accurately predict whether a stock will rise or fall proved crucial for our later models to succeed.

Linear Regression

As our first closing-price model, we implemented a linear regression algorithm to identify general linear trends over observed market intervals.

Figure 8. This sequence of linear regressions comprises our preliminary model.

While rather rough, the first semblance of a stock market model can be observed here. Our lines of best fit spanned intervals of two days in which the overarching trends of the stock market can be observed.
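
A line of best fit over one interval can be computed with ordinary least squares. The sketch below is illustrative only: the function name fit_line and the sample prices are assumptions, and it fits a single interval rather than the sequence of intervals shown in the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical closing prices at five consecutive time steps.
xs = [0, 1, 2, 3, 4]
ys = [100.0, 101.0, 102.0, 103.0, 104.0]
slope, intercept = fit_line(xs, ys)
```

A positive slope indicates an upward trend over the interval, and a negative slope a downward one.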

Statistical Models:

With general trends in mind, we utilized the ARIMA-GARCH and HWES models to break down the stock market into statistical components before extracting our own predictions.

ARIMA-GARCH (AutoRegressive Integrated Moving Average – Generalized AutoRegressive Conditional Heteroskedasticity)

This model uses the aforementioned regression techniques to make predictions on stationary data. Given the volatility of the stock market, this made it necessary to detrend and difference our data before observing any relevant patterns. From here, we used ARIMA to weigh every data point within the stock market before calculating a future price based on these weights. We then identified each series’ error variance through the GARCH algorithm before applying it to the ARIMA prediction.
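
First-order differencing, the simplest form of the differencing step mentioned above, replaces each price with its change from the previous price. The sketch below shows that step and its inverse on made-up prices; it does not implement ARIMA or GARCH themselves.

```python
def difference(series):
    """First-order differencing: the change between consecutive points."""
    return [b - a for a, b in zip(series, series[1:])]

def undifference(first_value, diffs):
    """Invert differencing by accumulating changes from the first value."""
    out = [first_value]
    for d in diffs:
        out.append(out[-1] + d)
    return out

# Hypothetical closing prices with an upward drift.
prices = [100.0, 102.0, 105.0, 104.0, 108.0]
diffs = difference(prices)
restored = undifference(prices[0], diffs)
```

A model fitted to the differenced series predicts the next change, which is then added back to the last known price to obtain a price forecast.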

Figure 9. Graph of ARIMA-GARCH’s train and test datasets. Blue indicates the training data set, red indicates the model’s prediction, and green indicates the real closing price.

As we can see, the results of our ARIMA-GARCH algorithm did relatively well in predicting the stock market’s overall trend in the backtested time window. However, its ability to account for market volatility appears rather limited due to the model’s statistical approach. The consistency of ARIMA-GARCH’s predictions, however, is quite significant on its own.

HWES (Holt-Winters Exponential Smoothing)

Exponential smoothing is a well-established forecasting model that predicts values based on a weighted sum of previous values, placing a greater importance on recent values and an exponentially decreasing importance on older ones. HWES improves upon exponential smoothing by adding two terms to account for market volatility. The first term combines a weighted average of slopes in the data to account for a general trend, and the second uses a seasonal period to account for seasonality.
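
The weighted-sum idea can be seen in simple exponential smoothing, sketched below. This covers only the basic level update; the trend and seasonal terms that Holt-Winters adds are omitted, and the function name, the smoothing factor alpha, and the sample values are illustrative assumptions.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the one-step-ahead forecast from simple exponential smoothing.

    alpha lies between 0 and 1; a larger alpha puts more weight on recent values.
    """
    level = series[0]
    for value in series[1:]:
        # New level: a blend of the latest observation and the previous level.
        level = alpha * value + (1 - alpha) * level
    return level

forecast = simple_exponential_smoothing([10.0, 12.0, 11.0, 13.0], alpha=0.5)
```

Unrolling the update shows why the weights decay exponentially: the latest value has weight alpha, the one before it alpha(1 - alpha), the one before that alpha(1 - alpha)^2, and so on.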

Figure 10. Graph of HWES’s return rate success, graded on a scale of 0-1.

While seemingly volatile, the averaged return of the HWES algorithm proves rather promising. In the graph pictured, it averages above a 0.5 score, indicating the potential for successful implementation in the real-world stock market.

Machine Learning Models:

Using the prior statistical backtests as a baseline, we employed LSTM and GRU neural networks to optimize our market predictions beyond those observed in stationary trends.

LSTM (Long Short-Term Memory)

LSTM is a type of recurrent neural network used in machine learning to predict future data points in a sequence. It processes data through an input layer, one or more hidden layers, and an output layer. In our case, we imported an LSTM model from the Keras package before adding dense layers to account for unseen market variables. We then used the Adam optimizer to fit the model's parameters to the stock market data.

Figure 11. Graph of LSTM’s train and test datasets. Blue indicates the training data set, orange indicates the model’s prediction, and red indicates the real closing price. Note that closed market periods were included in this dataset, hence sustained linear trends.

As seen above, our LSTM model does well to account for the general trend of the market, despite a relatively short testing window. Notably, the model makes an effort to account for market volatility, as indicated in sudden spikes and shifts in the graph above.

GRU (Gated Recurrent Unit)

GRU uses gates to model sequences, making it structurally similar to LSTM. However, it has only two gates to inform its data analysis: update and reset. Unlike LSTM, GRU uses only a hidden state; it does not have a cell state. Here, we trained a GRU model to predict stock market returns on a similar scale as we did for HWES.
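
To make the two gates concrete, the sketch below implements one GRU step with a scalar input and a scalar hidden state. It is a teaching illustration rather than our trained model: the parameter names and values are made up, and the final blend follows one common convention, in which an update gate near 0 keeps the old state (some libraries use the opposite convention).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU update for scalar input x and scalar hidden state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])  # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])  # reset gate
    # Candidate state: the reset gate controls how much of the past is used.
    h_candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    # Blend old state and candidate; there is no separate cell state.
    return (1.0 - z) * h_prev + z * h_candidate

# Hypothetical parameters; in a real model these are learned during training.
params = {"wz": 0.5, "uz": 0.5, "bz": 0.0,
          "wr": 0.5, "ur": 0.5, "br": 0.0,
          "wh": 0.5, "uh": 0.5, "bh": 0.0}

h = 0.0
for x in [0.2, 0.4, 0.1]:  # a short input sequence
    h = gru_step(x, h, params)
```

The hidden state h is the only value carried from step to step, which is the structural difference from LSTM noted above.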

Figure 12. Graph of GRU’s return rate success, graded on a scale of 0-1.

Though it is difficult to directly compare our GRU model’s predictions with those of our LSTM model, its results proved comparable in success. As another point of reference, the GRU model performed similarly to the HWES model on the same scale, suggesting its feasibility for market implementation.

In short, each model was engineered to provide favorable returns on their respective training datasets. To further test the validity of these results, however, we have implemented a daily return algorithm for use on live stock market data. With greater data for analysis, we may be able to substantiate or clarify the above results.


Admittedly, our results were hindered by two key factors—model formatting and a limited testing phase. These two issues hindered our analysis of our models’ backtest performances and daily returns, respectively, due to inadequate data points for comparison.

Regarding our backtesting results, variations across performance metrics made it difficult to deduce a definitive conclusion for each model’s accuracy. As an example, many of our predictive models differed in the time-scale of our training and testing datasets. Consequently, each success cannot be accurately compared in the context of another at a brief glance.

While our daily returns algorithm was able to quantify each model’s performance under a single metric, the runtime of this test was too short to offer any definitive results either. According to previous literature, such a short testing window often hurts the results of statistical models, which derive success from relatively slow growth. At the same time, the sporadic returns of our machine learning models did not have enough time to balance out into a meaningful average rate of return.


Based on our daily report system, our best model currently is our LSTM model, with an average return rate of around 10% daily. As of yet, it has never returned a negative return rate, meaning that theoretically, our model should not lose any money while performing intra-day trading. Our other models ARIMA and Holt-Winters ES also have an average positive return rate, with an accuracy that will only improve as we continue to find the proper parameters for these statistical models to make the best predictions. We plan to continue to test and implement more features into our models to further their accuracy while expanding our dataset to contain more training data. With a few more months of testing our models, we can deploy the algorithms for real-time intra-day trading within the stock market. A concern we have for the near future is the possibility of stock market instability when we use our models. However, it is a necessary risk involved in every stock market investment.

Future Directions

All of the models analyzed have been added to the daily performance report. This way, the automatic emails allow us to review their accuracy and compare them to each other and the real returns. If the daily results indicate success in terms of accuracy and continue to follow a positive trend, the next steps will include refining the best models and rendering them more user-friendly so that financial analysts will be able to employ them for practical use in the stock market. Other future goals that we have expressed interest in pursuing include applying the Universal Portfolio Algorithm to the completed models, as well as expanding upon the scope of the research to include virtual reality in the modeling process, thereby incorporating more dimensions into the data visualization.


Algos – Guide to Algorithms Used in Trading Strategies. (2022, January 15). Corporate Finance Institute. Retrieved August 5, 2022, from orithms-algos/

Basic Research Paper Format Examples. (n.d.). Example Articles & Resources. Retrieved August 5, 2022, from

Kazem, A., Sharifi, E., Hussain, F. K., Saberi, M., & Hussain, O. K. (2021, April 19). Support vector regression with chaos-based firefly algorithm for stock market price forecasting. ScienceDirect. Retrieved August 5, 2022, from

“AI is Smart Technology!”: Analyzing Expert-Novice Conceptions of Artificial Intelligence, Machine Learning, and Mathematics

Journal for High Schoolers, Journal for High Schoolers 2022

Elly Kang, Sean Sehoon Kim, Sarah Porter, Taylor Torres


There has been widespread enthusiasm for artificial intelligence and machine learning (AIML) curricula and instruction. Yet, integrating these fields into schools remains challenging. One underexplored avenue, presented in this paper, involves integrating AIML curricula with mathematics. To explore this approach, we conducted interviews to illustrate how high school students of various mathematics backgrounds explained how AIML and mathematics work within FaceID, a well-known technology. Interviews were analyzed with the Knowledge in Pieces (KiP) framework and compared to AIML experts’ responses, who were asked the same questions. The findings showcase where students’ primitive responses took on characteristics of experts’ and where they diverged. Given our results, we highlight potential starting points for AIML curricula to be integrated with mathematics concepts.

Purpose of Study

Within the last half-decade, there have been a number of calls for youth to learn about artificial intelligence and machine learning (AIML) (Chiu & Chai, 2020; Touretsky et al., 2019). However, much remains unknown about how concepts from AIML can be operationalized in pre-collegiate classrooms. There are already several empirically-backed studies that provide examples of how students can learn about artificial intelligence successfully. Some of these curricula approach AIML with an ethical lens (Williams, Kaputsos, & Breazeal, 2021) while others build from a computer science standpoint, encouraging students to explore AIML concepts through relevant coding projects (Estevez, Garate & Graña, 2019).

Our interest in distributing AIML knowledge to youth lies in examining its mathematical underpinnings. While many mathematics concepts involved in AIML applications are beyond the scope of most pre-collegiate students’ mathematical knowledge (e.g., multivariate calculus), others are not (e.g., probability and geometry). Our overarching aim is to provide AIML curricula for students that are integrated within Common Core State Standards for Mathematics (CCSSM), so that students from diverse backgrounds may learn how AIML is supported in part by mathematics available to them.

To progress toward a “mathematics of AI” curriculum, a crucial first step is to understand what intuitions, ideas, and conjectures students offer when asked to explain how AI works. To address this aim, our research team asked students and experts to predict how FaceID, a common AIML application, operates, and what math concepts might be included. Through this study, we address where students’ conceptions mirrored those of experts’ and where they diverged. By doing so, we wish to identify common ground between experts’ and novices’ explanations of AI systems and their mathematics that may be used in service of broader AI curricula creation efforts. To that end we ask, how do novices conceive of AI and its relationship to mathematics, and, to what degree do novices’ conceptions parallel those of AI experts?

Theoretical Perspectives

Interest in studying human expertise arose from developments in artificial intelligence itself (Chi, Glaser, & Farr, 2014). Characteristics of expert thinkers were explored in cognitive psychology (Newell & Simon, 1972) and further defined throughout the 1970s-80s. Glaser (1987) noted that experts: excel in their domains and perceive large, meaningful patterns in them, solve domain problems quickly, have superior memories, represent domain-specific problems at deep levels compared to novices, qualitatively analyze problems for long periods of time, and have strong self-monitoring skills. For the present study, we will focus only on experts’ recognition of patterns, analysis of systems, and domain-specific representations.

Novices construct explanations for novel phenomena based on superficial, everyday experiences. Their knowledge of novel phenomena is diverse and dynamically cued, as characterized by diSessa’s Knowledge in Pieces (KiP) framework (diSessa, 1993). Elements in a KiP system have multiple forms and levels of complexity. The most basic unit, coined phenomenological primitives (p-prims), describes novices’ fragmented knowledge and causal explanations that often appear to be self-evident when evoked. As novices build expertise in a domain through carefully designed instruction, they learn to cue p-prims more productively and build expert-like explanations (diSessa, Gillespie, & Esterly, 2004), which can take the form of explanatory primitives (e-prims), offering more detailed explanations (Kapon & diSessa, 2012). Our work at present seeks to identify p-prims and e-prims in novices’ explanations of AIML.

Although the KiP framework was first constructed to examine expert-novice conceptions of physics (diSessa, 1993), p-prims have been applied elsewhere. For example, Southerland et al. (2001) used p-prims to investigate students’ tentative, shifting descriptions of biological phenomena, concluding that p-prims were useful in characterizing students’ formations of scientific ideas. Likewise, Izsak applied KiP to both students and pre-service teachers studying proportion, functions, and other multiplicative relationships in mathematics (Izsak, 2005; Izsak & Jacobson, 2017), and found the KiP framework productive for suggesting improvements to mathematics curriculum and instruction design.

To the best of our knowledge, KiP has not yet been used to describe students’ conceptions of AIML. Yet AIML shares many commonalities with physics. AIML is a domain that is positioned as difficult for non-experts to learn (diSessa, 1996) and difficult to access without a statistics or computing background (Sulmont et al., 2017). Like physics, AIML is omnipresent in the lives of 21st century citizens. We therefore hypothesize that novice students will have informal conceptions of AIML based on their repeated interactions with technologies that rely upon it.


Participants and Procedures

We recruited 7 experts and 36 high school students via snowball sampling and interviewed them about their conceptions of AIML and its mathematics. Students participated from 20 schools in eight states and territories from all regions of the U.S. They gave demographic information and their most recent math course completed, which varied from Algebra 1 to advanced courses such as Differential Equations. Although not explicitly asked, some students described their experiences in computer science courses such as AP Computer Science A. Experts were identified based on their position in AIML fields, such as data scientist or computer science professor. Rather than inquiring about prior coursework, we asked experts how they used mathematics in their current work.

We conducted semi-structured interviews designed to elicit explanations of how FaceID, an AIML application that uses face recognition to unlock mobile phones, works. Interviews took place on Zoom and lasted 30-45 minutes. To ensure that all participants understood what we meant by FaceID, we showed a video segment of a user setting up his iPhone by capturing photos of his face from different perspectives, then using it to successfully unlock his phone. After the clip, the team asked follow-up questions to determine what each respondent noticed, how they believed the technology worked, and what mathematics concepts might be involved in FaceID. Probing questions were asked until the interviewer felt confident that they understood the hypothesis presented by each participant.

Data Analysis

After the interview phase, each member of the research team independently coded all transcripts for similarities in participants’ explanations (Saldaña, 2014). Researchers attended to subject domains that participants drew upon when explaining AIML concepts (e.g., references to computing, human cognition), which mathematics topics participants connected to FaceID, and how participants explained the role of mathematics in the technology. For students’ descriptions, researchers noted patterns in explanations that reflected primitiveness (e.g., “AI is a smart robot”) and patterns that suggested formal theories (e.g., “A computer program that makes decisions from the data.”). After data were independently coded, the research team discussed which students’ explanations contained similar elements and should be classified together. The meeting continued until all disagreements were resolved, and a categorization hierarchy that was motivated by Southerland et al. (2001) was created to group similar student explanations (see Table 1). A second coding pass of the data was conducted to ensure that all responses fit within the coding rubric.


Experts’ Descriptions of AI Systems

The seven experts initially described AI with two standout features. First, all but one foregrounded AI’s historical roots in cognitive psychology rather than its presence in modern computational systems. They connected AI with the humanistic notions of decision-making, intelligence, and behaviors of humans, such as using past observations to make future judgments. Second, experts supplemented explanations with domain-specific examples. For instance, Expert 7, a native of California, explained how machine learning systems draw upon past climate data to make predictions about present wildfire frequencies. In doing so, experts demonstrated focused attention to multiple facets of AI as well as multiple real-world uses of it.

When asked to describe FaceID and its associated mathematics concepts, all but one expert gave both high-level descriptions and esoteric details. All experts described FaceID as a large network that took image data as input and processed it with a machine learning algorithm through pixel comparisons. Some experts elaborated further, explaining that convolutional neural networks extract relevant features from images, such as edges, facial features, and ratios between facial features, to create a distinguishable profile of the user to be used during face recognition. Mathematically, experts agreed that calculus and linear algebra comprised the core mathematics domains used in FaceID, which would have been out of scope for many high school students. However, they observed that knowing topics prerequisite for calculus and linear algebra, such as matrices, matrix operations, probability, general functions, trigonometric functions, and inequalities, also supported understanding FaceID’s operations.

Students’ Descriptions of AI Systems

A high-level summary of each student group’s hypotheses and examples is shown in Table 2. Students with anthropomorphic knowledge constructions attributed FaceID’s operations to human-like characteristics. Students with teleological knowledge constructions attributed FaceID’s recognition abilities to the iPhone’s camera. The only mathematics topics identified were geometric knowledge of angles and taking measurements. We suspect those topics may have been cued by the FaceID video in the interview protocol. A few students did not believe that mathematics was involved. Although some students’ explanations mirrored experts’ attention to humanistic features, students did not connect their conjectures to computing.

Students with mechanistic proximate or mechanistic anthropomorphic knowledge constructions connected aspects of computing and mathematics to their explanations of FaceID. They mentioned computer programs, tasks, and data in their hypotheses, although they stopped short of explaining how those elements were coordinated to accomplish AIML. Mechanistic anthropomorphic students additionally attributed humanistic properties to AIML systems. Mathematically, students’ explanations varied. Some did not know how math was involved, some identified the same mathematics concepts as students in the first level, and others drew inferences about invisible mathematics, such as probability. Student 29, for example, stated, “They must use math to analyze how well the system is working. Probability to ensure that…it is getting the right face, geometry to map out the dimensions of your face and looking for color proportions, and maybe intervals to calculate a certain area of your face.” Although a primitive explanation, this student identified many pre-linear algebra math topics identified by experts.

Students with mechanistic ultimate knowledge constructions explained AIML as a system of interconnected computing-based components that were coordinated to perform humanistic tasks. These students offered cause-and-effect inferences that explained how FaceID’s infrastructure relied on bigger ideas from AIML (e.g., training data) and what mechanism from AIML permitted FaceID to make decisions. Geometry topics were still the most prominently cited as involved with AIML. Surprisingly, their levels of explanatory detail attributed to mathematics varied greatly. Student 12, for example, explained that FaceID involved, “an absolute or local minimum…so when the algorithm keeps tuning itself, it’s hoping to move down,” whereas Student 20 offered, “I believe that math has a place in this. I can’t tell you what specific mathematical function, but it’s definitely determining values of…shapes and stuff.” While these students’ overall explanations of FaceID showed some characteristics of e-prims, their connections to mathematics sounded more primitive than we expected.

Discussions and Implications

In aggregate, students’ conceptions tended to contain one or more of the following p-prims: AI as a humanistic agent, AI as a machine that executes tasks, and AI as a data-driven system. Some students added enough explanatory power to these p-prims that they could be considered e-prims, for example, AI as a machine that executes tasks by combining data from humans with a learning algorithm. Though some students’ explanations paralleled experts’ descriptions in highlighting humanistic aspects of AIML or by offering detail on how machines are trained, not even the most detailed hypotheses mirrored experts in mathematical precision.

In some ways, this finding is not surprising. Druga, Otero, and Ko (2022) reviewed over 50 AI curricula, of which only three connected AIML to mathematics. A majority of the curricula focused on overarching concepts in AIML while abstracting away mathematics. One hypothesis is that many students who gave mechanistic ultimate knowledge constructions may have received prior AIML instruction in one such curriculum, yet were never prompted to consider mathematical connections.

However, a vast majority of students offered conjectures about mathematics’ involvement in AIML’s infrastructure even without formal knowledge. In our view, this suggests that students could learn about the mathematics of AIML, albeit at a basic level. We suggest that future curricula strongly consider designs that explicitly bridge AIML with its prerequisite mathematics.


Chi, M. T., Glaser, R., & Farr, M. J. (2014). The nature of expertise. Psychology Press.

Chiu, T. K., & Chai, C. S. (2020). Sustainable curriculum planning for artificial intelligence education: A self-determination theory perspective. Sustainability, 12(14), 5568.

diSessa, A. A., Gillespie, N. M., & Esterly, J. B. (2004). Coherence versus fragmentation in the development of the concept of force. Cognitive science, 28(6), 843-900.

diSessa, A. A. (1996). What do “just plain folk” know about physics. The handbook of education and human development, 709-730.

diSessa, A. A. (1993). Toward an epistemology of physics. Cognition and instruction, 10(2-3), 105-225.

Druga, S., Otero, N., & Ko, A. J. (2022). The Landscape of Teaching Resources for AI Education.

Estevez, J., Garate, G., & Graña, M. (2019). Gentle introduction to artificial intelligence for high-school students using scratch. IEEE access, 7, 179027-179036.

Glaser, R. (1987). Thoughts on expertise. In C. Schooler & K. W. Schaie (Eds.), Cognitive functioning and social structure over the life course (pp. 81-94). Norwood.

Izsák, A. (2005). “You have to count the squares”: Applying knowledge in pieces to learning rectangular area. The Journal of the Learning Sciences, 14(3), 361-403.

Izsák, A., & Jacobson, E. (2017). Preservice teachers’ reasoning about relationships that are and are not proportional: A knowledge-in-pieces account. Journal for Research in Mathematics Education, 48(3), 300-339.

Kapon, S., & diSessa, A. A. (2012). Reasoning through instructional analogies. Cognition and Instruction, 30(3), 261-310.

Newell, A., & Simon, H. A. (1972). Human problem solving (Vol. 104, No. 9). Englewood Cliffs, NJ: Prentice-hall.

Saldaña, J. (2014). Coding and analysis strategies. The Oxford handbook of qualitative research, 581-605.

Southerland, S. A., Abrams, E., Cummins, C. L., & Anzelmo, J. (2001). Understanding students’ explanations of biological phenomena: Conceptual frameworks or p‐prims?. Science Education, 85(4), 328-348.

Sulmont, E., Patitsas, E., & Cooperstock, J. R. (2019). What is hard about teaching machine learning to non-majors? Insights from classifying instructors’ learning goals. ACM Transactions on Computing Education (TOCE), 19(4), 1-16.

Touretzky, D., Gardner-McCune, C., Martin, F., & Seehorn, D. (2019, July). Envisioning AI for K-12: What should every child know about AI?. In Proceedings of the AAAI conference on artificial intelligence (Vol. 33, No. 01, pp. 9795-9799).

Williams, R., Kaputsos, S. P., & Breazeal, C. (2021, May). Teacher Perspectives on How To Train Your Robot: A Middle School AI and Ethics Curriculum. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 35, No. 17, pp. 15678-15686).

| Novice Students’ Explanations of AI | | Illustrative Quote |
|---|---|---|
| Anthropomorphic or Teleological | A: An explanation based on the use of human attributes as the causal agent for AI. T: An explanation where the technological ends are considered as the agent that explains AI. | “AI is computers that have brains and are running everything.” (Student 9); “AI is smart technology.” (Student 19) |
| Mechanistic Proximate | An explanation in which an aspect of computing, such as programming, is identified as the underlying determinant of AI systems. | “Technology that is programmed to respond like a human.” (Student 17) |
| Mechanistic Anthropomorphic | An explanation in which an aspect of computing, such as programming, is identified and connected to a human attribute of AI, such as behaving autonomously, reasoning, or making a decision. | “Computers that are coded to think on their own, that are coded to generate answers or do tasks that are learned by themselves without human intervention.” (Student 29) |
| Mechanistic Ultimate | An explanation in which several aspects of computing are combined together by cause-and-effect mechanisms to illustrate a goal or outcome of an AI system. | “A computer program that learns and grows depending on the data that it is given, which comes from saved answers. [AI] can be used for human-like tasks, such as social media.” (Student 13) |

Table 1: Students’ conceptions of artificial intelligence (based on Southerland et al., 2001)

| Novice Students’ Explanations of FaceID | Summary of Students’ Hypotheses | Illustrative Quote |
|---|---|---|
| Anthropomorphic or Teleological (n = 11) | A: Attributed FaceID’s operations to human-like characteristics. T: Attributed FaceID’s operations to capabilities of visible hardware. | “AI looks at the user’s face from all different perspectives.” (Student 6); “AI reads multiple points from your face to see if it matches with the given information” (Student 15); “It uses a camera that recognizes stuff.” (Student 24) |
| Mechanistic Proximate (n = 12) | Attributed FaceID’s operations to a coordinated system containing a program, data collection, and/or a task to be accomplished, without explaining how. | “To get accurate reads of what makes your face unique, data is collected that is specific to your face.” (Student 35); “AI would be shown the user’s face and be quickly trained to recognize that person’s face specifically.” (Student 11) |
| Mechanistic Anthropomorphic (n = 6) | Attributed FaceID’s operations to a coordinated system with human-like properties (e.g., ability to make a decision, reason, think) containing a program, data collection, and/or a task to be accomplished, without explaining how. | “Face ID asks for many angles so it gets all the miniscule details, and when you unlock your phone from any angle, it works. Face ID processes all the images, internalizing them to recognize faces at any angle.” (Student 32) |
| Mechanistic Ultimate (n = 8) | Attributed FaceID’s operations to a coordinated system with human-like properties and explains how each element of the system contributes to the completion of the task. | “When you move your head around, the AI is trying to combine different angles of you into what one person looks like. So you’re training the machine to produce several portraits of you. When you look at your camera, if your facial features match, your phone unlocks.” (Student 12) |

Table 2: Summary of students’ responses

Biofeedback in Performance

Journal for High Schoolers, Journal for High Schoolers 2022, Uncategorized

Jai Bhatia, Yui Hasegawa, Gabriele Muratori, Stasia Vaituulala, Farangiz Akhadova, Nikko Boling


The purpose of our research is to investigate whether there are physical, quantifiable differences between an actor’s portrayal of emotions and the real-life sensation of those emotions. There is a lack of research surrounding the physiological changes that actors undergo as a result of their performances. To tackle this problem, we ran preliminary trials collecting electrocardiography (ECG), heart rate, galvanic skin response (GSR), and electromyography (EMG) signals as indications of a subject’s physical state while experiencing emotions. This paper serves as a starting point for integrating real-time biometric data into a theatrical performance and explores the potential of providing biofeedback for actors.


There exist some universal standards of expression in theatrical performance. One of the most notable examples is the Delsarte System of Bodily Expression [1], which serves as a dictionary for outward expressions (such as gestures, facial cues, and movements) to express inner emotions. However, this model only centers around the external, visible outlook of actors and neglects the internal changes they experience while performing. One of the biggest challenges actors face is portraying the emotions of their characters in a “genuine” manner to create a realistic performance. And yet, without a way to quantify the internal changes of an actor, it remains extremely difficult to define what constitutes a “genuine” performance and how actors can better achieve it.

We wanted to test the hypothesis that our emotions are correlated with bodily changes, also known as the physiological theory of emotion [2]. In order to categorize the emotions based on physiological data, we used the circumplex model of affect, a two-dimensional framework of emotion that plots arousal (the intensity of an emotion) against valence (the extent to which an emotion is positive or negative). Prior research indicates that changes in the visceral motor system of the body are the most notable signs of emotional arousal. Hence, the actor’s heart rate and sweat gland activity are important signals to measure for our research. Though we conducted our experiment on ourselves, we outline directions for applying this research in theatrical performance.

While biofeedback has been explored in the performing arts [3] [4], integrating real-time biometric data into a theatrical performance and making it visible to the audience is a new approach to performance making in theater.



We utilized SparkFun’s RP2040 mikroBUS Development Board [5]. The Mikrobus Shuttle [6] and Shuttle Click [7] enabled us to hook up multiple sensors at a time to the Development Board. We used four Mikroe Click Boards™ [8] discussed in 1.3 to collect biometrics.


We began experimenting with MicroPython and CircuitPython through the Mu Editor, the Thonny IDE, VSCode, and the macOS terminal. We then used C++ and the Arduino IDE, extracting data from the Serial Monitor and Serial Plotter. We used Python for data visualization and analysis.

We utilized the Arduino Mbed OS RP2040 Board library [9] and the EmotiBit MAX30101 [10] library for the Heart Rate Click Board.


Data Collection

In order to mirror the emotional changes of an actor, we measured physiological data on student test subjects while they simulated different emotions.

ECG Data

The ECG (Electrocardiography) Click measures heart rate variability by picking up on the heart’s rhythm and electrical activity. We designed an experiment to find changes in ECG signals as subjects experience emotions. First, we measured the signals for a duration of ~4 minutes, which served as the control. Then, a series of short clips was played for a test subject, who was asked to identify how they felt while watching each video. Simultaneously, we measured the ECG signals of the test subject. We then analyzed the data to look for a correlation between the ECG signals and the self-reported emotions of the test subjects. The data reported here compare the “scary” video with the control. The first electrode was placed under the subject’s ribcage, below their heart. The second and third were placed near their upper shoulder and adjusted until a QRS complex appeared in the output graph.

GSR Data

The GSR (Galvanic Skin Response) Click measures the electrodermal activity in the body, or changes in sweat gland activity. We conducted the same experiment as in 1.3.1, obtaining a control measurement with the GSR Click before taking data over a course of videos. The electrodes were fastened to the subject’s finger with velcro.

EMG Data

The EMG (Electromyography) Click measures the electrical activity of muscles. We followed the same methodology as in 1.3.1. We also recorded a control and a “distress” graph, the latter taken while a subject passionately expressed distress. The electrodes were placed on the subject’s eyebrows and cheek, with the DRL electrode on one wrist.

Heart Rate Data

The heart rate sensor measures the test subject’s heartbeats per minute. The experiment consisted of a user watching a selected horror scene from each of three different movies while keeping their finger on the Heart Rate Click.


ECG Data

Normally, the heart beats in a regular, rhythmic fashion, producing a P wave, QRS complex, and T wave. The QRS complex consists of three waves that together represent ventricular depolarization [12].

“The R wave reflects depolarization of the main mass of the ventricles—hence it is the largest wave” [11]. However, “exercise-induced left ventricular hypertrophy is considered a normal physiologic adaptation to the particularly rigorous training of athletes” [12]. We addressed this confounding variable by sitting still while recording data. Note that the R-wave amplitudes of the “scary” data are higher in relation to the Q wave than those of the control data.

After collecting ECG data, we confirmed that the QRS complex was represented in our data by zooming into certain parts of the graph (Figure 1). We then plotted the data outlined in 1.3.1 (Figure 2) and its extracted R-wave amplitudes (Figure 3).
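One way the R-wave amplitudes could be extracted from a sampled ECG trace is sketched below. This is a minimal illustration and not the code used in this study: the function names, the fixed threshold, the minimum gap between peaks, and the synthetic trace are all assumptions made for the example.

```python
def find_r_peaks(samples, threshold, min_gap):
    """Return indices of local maxima above `threshold`.

    Peaks closer together than `min_gap` samples are treated as one
    heartbeat, and only the taller candidate is kept.
    """
    peaks = []
    for i in range(1, len(samples) - 1):
        is_local_max = samples[i - 1] < samples[i] >= samples[i + 1]
        if not (is_local_max and samples[i] > threshold):
            continue
        if peaks and i - peaks[-1] < min_gap:
            # Same beat as the previous candidate: keep the taller one.
            if samples[i] > samples[peaks[-1]]:
                peaks[-1] = i
            continue
        peaks.append(i)
    return peaks


def r_wave_amplitudes(samples, peaks):
    """Read the signal value at each detected R peak."""
    return [samples[i] for i in peaks]


# Synthetic trace (hypothetical values): two beats, the second one taller.
ecg = [0, 1, 0, -2, 10, -3, 0, 2, 0, 0, 1, 0, -2, 14, -3, 0, 2, 0]
peaks = find_r_peaks(ecg, threshold=5, min_gap=3)
amplitudes = r_wave_amplitudes(ecg, peaks)
print(peaks, amplitudes)
```

In practice the threshold and minimum gap would need to be tuned to the sensor’s output range and sampling rate.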

We illustrate an example of the data analysis below. During one of the video clips the test subject reported feeling “scared” and “fearful” for the entire duration of the video, as well as “shocked” at 3 specific points due to jump scares in the video. We then looked to the subject’s physiological data (Figure 3) to identify a correlation. In contrast to the control data, where the R wave amplitudes remained relatively similar, there were three distinct peaks in the data collected while the test subject watched the video. Those three peaks occurred concurrently with the self-reported “shock” of the test subject.

For future analysis, we collected data on the R-R intervals, or the distances between the R-waves. This helps us plot the heart rate variability (HRV). Note that heart rate is the average number of heartbeats in a time interval, while HRV is the variation in time between successive heartbeats. “Research and theory support the utility of HRV as a noninvasive, objective index of the brain’s ability to organize regulated emotional responses” [13], which is why “the current neurobiological evidence suggests that HRV is impacted by stress and supports its use for the objective assessment of psychological health and stress” [14].
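The distinction between heart rate and HRV can be made concrete with a short sketch. The example below assumes the R-peak times are already known in seconds, and it uses SDNN and RMSSD as HRV measures. These are common choices, but the text above does not specify which measure is intended, so treat them as assumptions; the peak times are hypothetical.

```python
import math
import statistics


def rr_intervals(peak_times_s):
    """Time between successive R peaks, in seconds."""
    return [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]


def mean_heart_rate_bpm(rr):
    """Average heart rate over the recording, in beats per minute."""
    return 60.0 / statistics.mean(rr)


def sdnn(rr):
    """Sample standard deviation of the R-R intervals."""
    return statistics.stdev(rr)


def rmssd(rr):
    """Root mean square of successive R-R interval differences."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical R-peak times in seconds.
peak_times = [0.0, 0.8, 1.6, 2.5, 3.3]
rr = rr_intervals(peak_times)
print(round(mean_heart_rate_bpm(rr), 1), round(sdnn(rr), 3), round(rmssd(rr), 3))
```

Two recordings can share the same average heart rate while differing in SDNN or RMSSD, which is why HRV is reported separately.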

Figure 1: Short interval of ECG data depicting the QRS complex
Figure 2: ECG data snippet of control and scary data
Figure 3: R waves extracted from respective Figure 2 data

GSR Data

When plotted, GSR data consists of two major components: the slowly varying tonic component, often measured as skin conductance level (SCL), and the rapid phasic component, measured as event-related skin conductance responses (ER-SCRs) and non-event-related responses (NS-SCRs) [15]. Higher frequencies of both ER-SCRs and NS-SCRs are correlated with higher emotional arousal.
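A rough sketch of this decomposition follows. It approximates the tonic level with a centered moving average and treats the remainder as the phasic component, then counts phasic peaks as candidate SCRs. This is a deliberate simplification rather than the method in [15], and the window size, amplitude cutoff, and sample values are hypothetical.

```python
def moving_average(signal, window):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def split_tonic_phasic(signal, window):
    """Approximate the tonic level by smoothing; phasic is what is left."""
    tonic = moving_average(signal, window)
    phasic = [s - t for s, t in zip(signal, tonic)]
    return tonic, phasic


def count_scr_peaks(phasic, min_amplitude):
    """Count local maxima of the phasic signal above `min_amplitude`."""
    return sum(
        1
        for i in range(1, len(phasic) - 1)
        if phasic[i] > min_amplitude and phasic[i - 1] < phasic[i] >= phasic[i + 1]
    )


# Hypothetical recording: a flat baseline with two brief responses.
gsr = [5.0] * 30
gsr[8] = 6.0
gsr[20] = 6.5
tonic, phasic = split_tonic_phasic(gsr, window=5)
n_responses = count_scr_peaks(phasic, min_amplitude=0.5)
print(n_responses)
```

Dividing the peak count by the recording duration gives a response frequency that can be compared between a control and a stimulus recording.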

We sampled the data from the same video clip mentioned in 2.1, during which the test subject reported feeling “scared” and “fearful” throughout.

There was no substantial difference between the GSR data in Figures 5 and 6, which were recorded during two separate video clips. The data in Figure 5 was taken when the test subject reported feeling “scared” and “fearful” while watching the video; the data in Figure 6 was taken while playing a separate video clip that evoked “euphoria” and “excitement” in the test subject. Both clips resulted in much higher frequencies of GSR activity than the control data in Figure 4.

This suggests that GSR activity indicated emotional arousal rather than emotional valence.

Figure 4: GSR Control
Figure 5: GSR while watching video (scary)
Figure 6: GSR while watching happiness inducing video

Heart Rate Data

In the heart rate data, we saw a direct correlation between the suspense and anxiety reported by the test subject and their heart rate.

In Figure 7, the test subject reported the feeling of shock due to a loud sound effect that corresponded with the jumpscare. The user’s heart rate spiked correspondingly, with values reaching a max of 85 beats per minute at the peak of the jumpscare.

In Figure 8, the test subject reported feeling continuously apprehensive and on the edge of their seat. Instead of a singular tall peak, the data show more frequent but shorter peaks that correlate with the self-reported anxiety of the test subject. The first reported “jumpscare” corresponded with a peak of higher values, up to a max of 106 beats per minute; after the first scare, however, the values never reached as high.

In Figure 9, the test subject reported feeling peaceful and not being too caught off guard by the jump scares, which is consistent with the low average bpm.

This suggests that the levels of anxiety and uncertainty the subject encountered throughout the experiment are demonstrated in the heart rate signals.

Figure 7: Heart rate during a visual and sound scare with little buildup
Figure 8: Heart rate during a visual and sound scare with buildup
Figure 9: Heart rate during a sound scare with buildup

EMG Data

Figure 10 shows a control and “distress” graph, comparing a time when a subject passionately expressed distress verbally to when they sat still. The control data had consistent y values between 241 and 383, as well as consistent frequency. The distress data fluctuated much more, corresponding to times when the subject was raising their eyebrows. The max y value for the distress graph is 557.
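The difference in fluctuation between a control and a “distress” recording can be summarized with a windowed root-mean-square (RMS) envelope, a common way to quantify EMG activity. The sketch below is illustrative only and is not the analysis used for Figure 10; the sample values are hypothetical and were merely chosen to sit in a similar numeric range.

```python
import math


def rms_envelope(samples, window):
    """RMS of the mean-removed signal over consecutive windows.

    Trailing samples that do not fill a whole window are ignored.
    """
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    envelope = []
    for start in range(0, len(centered) - window + 1, window):
        chunk = centered[start:start + window]
        envelope.append(math.sqrt(sum(c * c for c in chunk) / window))
    return envelope


# Hypothetical raw sensor readings.
control = [300, 320, 300, 320, 300, 320, 300, 320]
distress = [300, 320, 300, 320, 150, 550, 200, 500]

control_env = rms_envelope(control, window=2)
distress_env = rms_envelope(distress, window=2)
print(max(control_env), max(distress_env))
```

A larger envelope in the distress recording reflects stronger muscle activity, such as the raised eyebrows described above.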

Another experiment compares control data recorded while the subject reported feeling peaceful (Figure 11) with data recorded while the subject was happy and smiling (Figure 12). The data in Figure 12 showed more variation and higher values, likely from activity in the cheek muscles.

In Figure 10 and Figure 11, the subject noted that the low peaks in the control data occurred when they blinked.

Figure 10: EMG control and distress
Figure 11: EMG Peace (control)
Figure 12: EMG Happy

Experimental Errors

It is important to note that data interpretation in the context of emotional arousal is not yet standardized in all aspects. While statistical functions (e.g. standard deviation, kurtosis of skin conductance, local maxima peaks) can be used to determine arousal [16], the choice among them depends on the goals of a project and will vary accordingly. Given our limited range of datasets, the aforementioned measures serve as a starting point for further data collection and analysis.

User-related data inaccuracy could be due to different electrocardiographic artifacts along with user health conditions. Other factors that can affect data between different individuals include obesity, pregnancy, location of the heart within the chest, and exercise habits [17] [18].
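For illustration, the three statistical functions named above can be computed on a skin conductance trace as follows. This is a sketch under assumptions: the feature definitions (population standard deviation, excess kurtosis, strict local maxima) are one reasonable reading and are not taken from [16], and the input values are hypothetical.

```python
import statistics


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for a constant signal")
    return m4 / (m2 * m2) - 3.0


def count_local_maxima(values):
    """Number of interior samples strictly greater than both neighbors."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


def arousal_features(values):
    """Bundle the three summary statistics for one recording."""
    return {
        "std": statistics.pstdev(values),
        "kurtosis": excess_kurtosis(values),
        "n_peaks": count_local_maxima(values),
    }


# Hypothetical skin conductance samples.
features = arousal_features([1, 2, 1, 2, 1, 2, 1, 2])
print(features)
```

Comparing such features between control and stimulus recordings is one possible starting point; which features are informative depends on the goals of the project.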

Non-user-related factors that could have affected the collection of ECG and GSR data include high-frequency noises, high humidity, extreme temperature variations, and the vicinity of other machines.

Future Directions

Theatrical Implementation

We suggest that the same experiment be conducted with professional actors and without video stimuli. Measurements would instead be taken while the actors perform the various emotions of their characters. This would enlarge our dataset and allow a correlation to be tested using methods rooted in theater rather than methods that attempt to mirror it.

We plan on implementing our devices in theater at the University of Brasilia. Implementation could include overlaying the actors’ or the audience’s heart rates to create a soundtrack; using actors’ GSR data to control lighting color and intensity; and “limited heart rate” performances, in which actors have a certain number of heartbeats before their microphone is cut off and they have to speak louder to be heard, symbolizing the aging process.

Additionally, the visualization of the data will be formatted to be captivating for audience members rather than those with technical backgrounds, enhancing the artistic aspect of this performance. We plan on using configurations of geometric shapes and colors to represent each actor’s data.

This look into the biometrics of an actor creates an original type of performance: one in which the fourth wall is broken and the data collection process likely reflexively juxtaposes the subjective nature of theater, pushing the audience to reconsider their notions of emotional and empirical truth.

Outside of the performance itself, we can analyze the change in a performer’s technique and its correlation to the biometric data as an objective metric of feedback for actors.

Sensor Hookup to Actor

Utilizing the Mikroe phone jack ECG Cable [19] and adhesive electrode sensors [20], we are able to record data from all four click boards at once (Figure 13). We are currently working on getting the BLE Tiny Click [21] to collect this data wirelessly. The device is to be powered using the Mikroe 3.7V secondary batteries [22] (Figure 14), chosen for their weight, size, and capacity.

We propose keeping the sensors in the pocket of an actor, with their clothes covering up the EMG and ECG electrodes. The GSR electrode will be wrapped around the actor’s fingers and secured with velcro. The device would need to be placed on the arm to also take measurements from the Heart Rate Click. If this is not possible, we can use the QRS complex produced by the ECG to represent a heartbeat.

Figure 13: Sensor setup
Figure 14: Battery Calculations


Through our research, we observed some correlations between self-reported emotions and biometrics such as heart rate, GSR, ECG, and EMG signals. Further work could test whether the differences between our control and test data are statistically significant in a theatrical setting. This data will provide ways for the audience to receive sensory information about an actor’s state, opening up many possibilities for theatrical implementation.


We would like to thank Professor Michael Rau, Sreela Kodali, Deniz Yagmur Urey, Ashley Jun, and Rinni Bhansali for their technical guidance and support this summer. We would also like to express our appreciation for Professor Tsachy Weissman, Cindy Nguyen, Sylvia Chin, and the other mentors at the Stanford Compression Forum who made this opportunity possible.


  1. Kirby, E. T. “The Delsarte Method: 3 Frontiers of Actor Training.” The Drama Review: TDR, vol. 16, no. 1, 1972, pp. 55–69. JSTOR. Accessed 5 Aug. 2022.
  2. Cornelius, Randolph R. “Theoretical Approaches to Emotion.” ISCA Archive, 5 Sept. 2000.
  3. Gruzelier, John. Enhancing Creativity with Neurofeedback in the Performing Arts: Actors, Musicians, Dancers: Theory and Action in Theatre/Drama Education. Sept. 2018.
  4. Gruzelier J, Inoue A, Smart R, Steed A, Steffert T. Acting performance and flow state enhanced with sensory-motor rhythm neurofeedback comparing ecologically valid immersive VR and training screen scenarios. Neurosci Lett. 2010 Aug 16;480(2):112-6. doi: 10.1016/j.neulet.2010.06.019. Epub 2010 Jun 11. PMID: 20542087.
  5. “Sparkfun RP2040 MikroBUS Development Board.” DEV-18721 – SparkFun Electronics.
  6. “Mikrobus Shuttle: Mikroelektronika.” MIKROE.
  7. “Shuttle Click: Mikroelektronika.” MIKROE.
  8. “Click Boards.” MIKROE.
  9. “Arduino/ArduinoCore-Mbed.” ArduinoCore-Mbed.
  10. “EmotiBit_MAX30101.” GitHub.
  11. Ashley, Euan A, and Josef Niebauer. “Conquering the ECG.” National Center for Biotechnology Information, U.S. National Library of Medicine, 2004.
  12. “Is Hypertrophy a Short Term Effect of Exercise?” 11 Oct. 2020.
  13. Wei, Chuguang, et al. “Affective Emotion Increases Heart Rate Variability and Activates Left Dorsolateral Prefrontal Cortex in Post-Traumatic Growth.” Nature News, Nature Publishing Group, 30 Nov. 2017.
  14. Kim, Hye-Geum, et al. “Stress and Heart Rate Variability: A Meta-Analysis and Review of the Literature.” Psychiatry Investigation, Korean Neuropsychiatric Association, Mar. 2018.
  15. Braithwaite, Jason J, et al. “A Guide for Analysing Electrodermal Activity (EDA) & Skin Conductance Responses (SCRs) for Psychological Experiments.” 2015.
  16. Kolodziej, M., et al. “Electrodermal activity measurements for detection of emotional arousal.” Warsaw University of Technology, Institute of Theory of Electrical Engineering, Measurement and Information Systems.
  17. García-Niebla, Javier et al. “Technical mistakes during the acquisition of the electrocardiogram.” Annals of noninvasive electrocardiology : the official journal of the International Society for Holter and Noninvasive Electrocardiology, Inc vol. 14,4 (2009): 389-403. doi:10.1111/j.1542-474X.2009.00328.x
  18. Rashid, Muhammad Shihab, et al. “Emotion Recognition with Forearm-Based Electromyography.” 13 Nov. 2019.