Ethan Daniel Taylor*, Joseph Chai*, Joyce Zheng*, Juliana Maria Gaitan*, Paulos Waiyaki*, Tara Maria Salli*, David Jose Florez Rodriguez**
*These authors contributed equally to this work **Mentor
Abstract
Cirrhosis is a common and deadly disease whose diagnosis requires the time and experience of a doctor. We hypothesized that machine learning, combined with a truncated Monte Carlo Data Shapley algorithm, could diagnose the stage of cirrhosis, saving doctors time and saving hospitals and patients money. We sourced the dataset, which was collected by the Mayo Clinic, from Kaggle. We first cleaned the data, then explored potential models, settling on a TensorFlow neural network classifier. We then explored the hyperparameter space to find a suitable neural network for our task. Finally, we applied a truncated Monte Carlo Data Shapley algorithm to our model to try to improve accuracy given such a small data set. We found that this method, with further tuning and a larger data set, has the potential to aid doctors in diagnosing the stage of cirrhosis.
Background
Cirrhosis is among the most common causes of death worldwide, and with no definitive cure, early detection is paramount for a patient’s survival. Cirrhosis affects the liver, gradually replacing healthy liver tissue with scar tissue. Because it is progressive, cirrhosis can take several years to fully develop. This makes it extremely difficult to detect early on, significantly decreasing a patient’s chances of survival.
There are four stages of cirrhosis [1]. The first stage begins with inflammation of the bile duct and/or liver; in the second stage, the inflamed tissue begins to scar; in the third stage, the liver loses its ability to function optimally because of the scarring; and in the fourth stage, the liver fails and the risk of developing liver cancer is high.
Cirrhosis can be detected by either radiology testing or a needle biopsy of the liver [2]. Both methods are costly and can become a significant financial burden for the patient and their family; a less invasive and more cost-effective way to triage a patient’s risk of cirrhosis from commonly available medical data could therefore increase the chances of early detection. With this goal in mind, our team decided to test various AI models to predict a patient’s stage of cirrhosis based purely on their medical data.
Results
In the first set of experiments, we used the website superbio.ai to conduct the tests. Before the tests could run, the NaN values in the data had to be handled. To make the data set usable, we created two versions: ‘MedNan’, in which NaN values were replaced with the column median, and ‘ZeroNan’, in which NaN values were replaced with zeroes. With these two data sets defined, we varied the architecture, the learning rate (LR), whether the predicted stage variable was categorical or numerical, and the activation function (Relu, Tanh, or Sigmoid). In total, we explored 180 models.
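The two ways of handling NaN values can be sketched as follows. This is a minimal illustration in plain Python, not the superbio.ai implementation; the function names `impute_median` and `impute_zero` and the sample column values are hypothetical.

```python
import math
from statistics import median

def impute_median(column):
    """Replace NaN entries with the median of the observed values (MedNan)."""
    observed = [v for v in column if not math.isnan(v)]
    fill = median(observed)
    return [fill if math.isnan(v) else v for v in column]

def impute_zero(column):
    """Replace NaN entries with zero (ZeroNan)."""
    return [0.0 if math.isnan(v) else v for v in column]

# Hypothetical values for one numerical column with two missing entries.
column = [261.0, float("nan"), 302.0, 176.0, float("nan")]
med_nan = impute_median(column)   # missing entries become 261.0
zero_nan = impute_zero(column)    # missing entries become 0.0
```

In this toy column, the zero fill sits far below every observed value, while the median fill stays inside the observed range.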
As shown in Figure 1, the best model from those experiments was a MedNan, categorical, Relu model with four layers, 64 cells per layer, and a 0.04 learning rate; this model had an accuracy of 57%. The worst model was also a MedNan, categorical, Relu model, with three layers, a 0.1 learning rate, and an 18% accuracy.
As shown in Figure 2, MedNan models generally performed better than ZeroNan models. Among the ZeroNan tests, a categorical, Sigmoid model with a 0.0003 LR and 22% accuracy was the worst, and a categorical, Relu model with a 0.1 LR and 51% accuracy was the best. In comparison, the worst and best MedNan results were a categorical, Sigmoid model with an LR of 1 and 5% accuracy, and a categorical, Relu model with a 0.04 LR and 57% accuracy. Additionally, regression architectures typically performed poorly, as they gave nonsensical numbers for the stages, including negatives and values above four. As for activation functions, Figures 1 and 2 show that Relu tended to have higher accuracy than Sigmoid and Tanh, specifically for the MedNan results. The best Tanh model was a categorical model with a 0.003 LR, three layers, and 55% accuracy. The best Sigmoid model was a categorical model with a 0.04 LR, three layers, and 47% accuracy.
Focusing on the MedNan categorical results, lower learning rates and deeper models tended to perform better. This may be because deeper models have more parameters and thus a greater capacity to learn.
With a clear preference for model type and NaN handling established, the next experiments explored more architectures (number of layers and number of cells per layer), regularization rates, and learning rates, now in plain Python. Each model in this set of experiments was trained for 50 epochs on a data set that included only the patients with complete data. Across all the models run, accuracies ranged from roughly 20% to 48%. As shown in Figure 3, the three best models each had 48.28% accuracy; all used a learning rate of 0.003 and a regularization rate of 0.03, with cell numbers (100,100,100), (60,50,40,30,20), and (50,100,50,20) respectively. The worst models each had 21.84% accuracy, with the following settings:
- Learning rate (0.003), regularization (0.01), and cell numbers (100,100,100)
- Learning rate (0.0003), regularization (0.03), and cell numbers (60,50,40,30,20)
- Learning rate (0.003), regularization (0.01), and cell numbers (200,200)
After that, a data selection algorithm assigned every patient a ‘value’ approximating how much that patient would help in training a model. The optimal architecture from the previous experiments was then trained on the 50, 100, 150, 200, and 250 highest-valued patients. The performance of these five models was compared with that of models with the same architecture trained on random sets of 50, 100, 150, 200, and 250 patients.
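The comparison between value-based and random subsets can be sketched as below. This is only an illustration under stated assumptions: the function names and the six patient values are hypothetical, and the real experiments used subsets of 50 to 250 patients.

```python
import random

def top_n_patients(values, n):
    """Return the n patient IDs with the highest data values."""
    ranked = sorted(values, key=values.get, reverse=True)
    return ranked[:n]

def random_patients(values, n, seed=0):
    """Return n patient IDs drawn uniformly at random (the baseline)."""
    rng = random.Random(seed)
    return rng.sample(sorted(values), n)

# Hypothetical values for six patients, keyed by patient ID.
values = {0: 0.12, 1: -0.05, 2: 0.30, 3: 0.02, 4: -0.20, 5: 0.08}
selected = top_n_patients(values, 3)   # highest-valued patients first
baseline = random_patients(values, 3)  # same size, chosen at random
```

Both subsets have the same size, so any difference in the accuracy of models trained on them can be attributed to how the patients were chosen.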
Accuracy with data selection peaked at 200 patients but otherwise showed no observable pattern. Accuracy with random selection, on the other hand, rose as the subset grew from 50 to 250 patients, and random selection produced the highest accuracy overall of the two methods.
Data selection outperformed random selection for most of the models run. The exception was the models trained on 250 patients, where data selection had 41.4% accuracy and random selection had 47.1% accuracy. For all other comparisons, the average difference between data selection and random selection was 11.1% (the maximum being 15.9% with 50 patients and the minimum being 8% with 200 patients).
Two more experiments were run: one used a data set with the first 325 patients (including those with missing data), and the other used only patients with complete data, which totaled 276 patients. The model trained on the first 325 patients yielded 49.6% accuracy; the model trained on the complete-data patients yielded 41.4% accuracy.
Discussion
Our first attempt to construct an accurate model used linear regression to predict the stage. This initial experiment was unsuccessful, and the model obtained poor results. The trial likely failed because linear regression assumes the output is a linear function of the inputs. That is unlikely to hold for our data set, whose input had 32 columns representing 18 numerical and categorical variables. This led us to conclude that linear regression models are inadequate for the stage prediction task, which is likely non-linear.
The next experiments used neural network regressors and classifiers. The data was split into MedNan and ZeroNan versions, tested with various hyperparameters, and run through models predicting the stage as either a categorical or a numerical feature. The 180 different tests yielded three key findings.
The first key finding was that MedNan performed better than ZeroNan. This might be because replacing NaN values with the median of that column reinforces the median and makes the resulting feature more consistent, whereas replacing NaN values with zeroes skews the data by introducing outlier values and so reduces accuracy.
The second key finding was that regression architectures performed worse than categorical models. This might be because the stage is a categorical variable, so treating it as numerical is ineffective. In addition, many of our variables were categorical (drug administered, sex, and stage). Their presence in the input may favor a similar categorical structure in the output.
The third key finding was that the Relu activation function was the most accurate one tested: the best Relu model reached 57% accuracy, compared with 55% for the best Tanh model and 47% for the best Sigmoid model. Having found the architecture choices and NaN handling method that produced the best results (MedNan, categorical, and Relu), the research progressed to later stages.
The next step was tuning hyperparameters and optimizing the architecture further (varying the number of layers and cells per layer). We used a classifier and integrated a Data Shapley algorithm into our existing models, implemented in Python 3 through TensorFlow, to help improve our results. After numerous trials, we arrived at a model that predicted the stage of cirrhosis with about 49% accuracy, against a chance level of 25%. A data selection algorithm employing this model then assigned values to our data. We trained several models with this optimal architecture on the resulting ‘highly valuable’ subsets, others on random subsets of the data, and one on the subset of patients with no missing data. Data selection outperformed random selection for four of the five subset sizes and also outperformed the subset with no NaNs; however, the model trained on all the data had the highest performance. With this, we completed our goal of constructing a model that predicts a patient’s stage of cirrhosis using data selection.
Methods and Materials
Cleaning up the data
We acquired our training and validation data from a Kaggle dataset. The data was collected by the Mayo Clinic and contains 418 cirrhosis patients, each with 20 data points. The data has some issues, however. Six patients were missing a stage; they were removed from our research, since the stage is the one piece of information we must have. Some patients had ‘not a number’ (NaN) values for numerical data points.
There were also categorical values that needed to be converted to a one-hot encoding for the neural network. This encoding turns each possible value of a category into its own binary column. For example, the category “Drug Administered” can take values such as ‘placebo’, ‘D-penicillamine’, or ‘N/A’. One-hot encoding creates a new column for each possible value (placebo, …) and uses 1s and 0s to denote whether the patient has that value. These columns can then be fed into the machine learning model.
Numerical NaNs are different from categorical NaNs. A missing categorical data point means that the category is not present or does not apply to the patient. A numerical NaN, however, stands in for a number that should exist, so the data was partially incomplete. To resolve the issue of missing data, we ran four tests on the superbio.ai website, for ease of use and repeatability. All tests used exactly the same neural network. In one test, the dataset was exactly as downloaded apart from the one-hot encoding. Another test included only the patients with complete data. One test replaced the NaNs with zeros, and the last test replaced the NaNs with the median value of that variable. The median replacement outperformed the other datasets, so this dataset was used for the rest of the experiment.
With our small data size, it was more valuable to have more patients with mostly complete data and filled-in NaNs than to have perfect data but fewer patients. Most of our experiments used the first 325 patients as the training set and the rest as the validation set. We changed this only when using data selection.
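The one-hot encoding step described above can be sketched as follows. This is a minimal plain-Python illustration rather than the preprocessing actually used; the `one_hot` helper and the four sample entries are hypothetical.

```python
def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category names and one row of 0s and 1s per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical "Drug Administered" entries for four patients.
drug = ["placebo", "D-penicillamine", "N/A", "placebo"]
categories, encoded = one_hot(drug)
# Each row contains exactly one 1, in the column of that patient's category.
```

Each original column with k possible values becomes k binary columns. This expansion of categorical columns is how a small number of variables can turn into a larger number of input columns.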
Exploring Models and Hyperparameter Tuning
With the newly cleaned data, we explored various potential models. The first thing we tried was a simple linear regression: we fed in some data points, and the model attempted to predict the others. This proved ineffective; with so many variables, a linear model performs poorly.
Continuing on superbio.ai, we considered whether to use a neural network regressor or classifier. The stage variable is a number, so one could use a neural network regressor that outputs a number. We found that it was wildly inaccurate, very frequently leaving the one-to-four range. A regressor also implies an arithmetic relationship between the stages of cirrhosis, e.g., stage one plus stage three equals stage four, which is not consistent with the current medical understanding of the stages of cirrhosis. In our results, a neural network classifier that assigns one of the four stages proved to be more accurate. In superbio.ai, we also explored different activation functions per layer: Relu, Tanh, and Sigmoid. Nothing beat the validation accuracy of the Relu activation function, so it was used for the rest of the experiments.
The rest of our experiments used a neural network classifier. They were run in a Google Colab notebook, a Python 3 environment, to have more flexibility in building models and applying the Data Shapley algorithm to our model. The TensorFlow Keras library was used to create the neural network classifier. To optimize the accuracy of the model, we explored the hyperparameter space: two nested ‘for’ loops tested combinations of three different learning rates and three different regularization rates. These hyperparameters were tested on 18 different architectures ranging from one to five layers, with varying numbers of cells per layer. All the activation functions were Relu, except the last, which was Softmax.
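To illustrate why a Softmax output layer always yields a valid stage, while a regressor can leave the one-to-four range, here is a minimal plain-Python sketch. It is not the TensorFlow Keras code used in the experiments; the function names and the example scores are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def predict_stage(logits):
    """Return the stage (1 to 4) with the highest Softmax probability."""
    probs = softmax(logits)
    return probs.index(max(probs)) + 1

# Hypothetical final-layer scores for one patient, one score per stage.
scores = [0.2, -1.3, 2.1, 0.4]
probabilities = softmax(scores)
stage = predict_stage(scores)
```

Because the prediction is the index of the largest probability, it can only ever be one of the four stages, however extreme the raw scores are.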
Learning rates ranging from 0.03 to 0.0003 and regularization rates ranging from 0.1 to 0.01 were used. The model that showed the most promise for cirrhosis detection was selected. The goal was to obtain a model with a demonstrated capacity for learning that could then be used to quantify, via a truncated Monte Carlo Data Shapley algorithm, how much each data point (patient) in the data set contributes to overall learning.
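The hyperparameter search can be sketched as nested loops like the ones below. Several things here are assumptions: the text gives only the ranges, so the three values in each list are guesses consistent with those ranges; only three of the 18 architectures are listed; and `train_and_score` is a hypothetical stand-in for building and training the TensorFlow Keras classifier.

```python
# Assumed grid values; the text gives only the ranges and the counts.
learning_rates = [0.03, 0.003, 0.0003]
regularization_rates = [0.1, 0.03, 0.01]
# Three of the 18 architectures, written as cells per hidden layer.
architectures = [(100, 100, 100), (60, 50, 40, 30, 20), (50, 100, 50, 20)]

def train_and_score(architecture, learning_rate, regularization):
    """Hypothetical stand-in: train a classifier, return validation accuracy."""
    return 0.0  # placeholder so the sketch runs without TensorFlow

results = []
for architecture in architectures:
    for learning_rate in learning_rates:
        for regularization in regularization_rates:
            accuracy = train_and_score(architecture, learning_rate, regularization)
            results.append((accuracy, architecture, learning_rate, regularization))

# Keep the configuration with the highest validation accuracy.
best = max(results, key=lambda record: record[0])
```

With all 18 architectures, the same loops would cover 18 × 3 × 3 = 162 configurations.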
Why do we use Data Selection?
When selecting training data for a learning model, most data selection algorithms use three main parts: the data set, the model, and the evaluation metric. The idea is to use a portion of the data set to train the model to accurately predict a predefined, separate validation set. Once trained to its highest possible accuracy, the model can be used to predict unknown outputs from known inputs. One option is to select training points at random, but a better way is to find out how valuable each data point is to the performance of the model. A method built on this idea is Data Shapley, a data selection algorithm designed by Amirata Ghorbani and Assistant Professor James Zou at Stanford University.
What is Data Shapley?
Data Shapley is a data selection algorithm commonly used for supervised machine learning and commonly applied as a truncated Monte Carlo (TMC) simulation. In our research, the algorithm randomly selects one patient from the 418 rows and records the validation loss after training on that data point. Next, another patient is randomly selected without replacement, and its Shapley value approximation is set to the difference between the resulting validation loss and the previous one. This process is repeated, one patient at a time, until the validation loss plateaus; at that point the iteration is truncated and a new iteration begins. Afterwards, the algorithm aggregates the approximations for every data point. The median can be used to minimize the impact of outliers on the data selection model’s accuracy, but with a larger set of approximations, several dozen for each data point’s Shapley value, it is sufficient to compute the mean. We used the mean, since we had over 50 approximations of the Shapley value for most patients. With a value for each data point, Data Shapley allows researchers to select the “n” best data points as the optimal training data.
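The truncated Monte Carlo procedure can be sketched in plain Python as below. This is a simplified illustration, not the code used in this work. It assumes a score where higher is better (the text above uses validation loss, where lower is better), and `toy_utility` with its `helpfulness` numbers is a hypothetical stand-in for training a model on a subset and evaluating it on the validation set.

```python
import random

def tmc_shapley(points, utility, n_permutations=200, tolerance=1e-3, seed=0):
    """Approximate each point's Data Shapley value by truncated Monte Carlo.

    utility(subset) returns a performance score (higher is better) for a
    model trained on `subset`.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in points}
    full_score = utility(list(points))
    empty_score = utility([])
    for _ in range(n_permutations):
        order = list(points)
        rng.shuffle(order)  # a fresh random ordering for each iteration
        subset, previous = [], empty_score
        for p in order:
            # Truncate once the score has plateaued at the full-data score;
            # the remaining points receive zero marginal contribution.
            if abs(full_score - previous) <= tolerance:
                break
            subset.append(p)
            current = utility(subset)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    # Mean of the per-iteration approximations for every point.
    return {p: totals[p] / n_permutations for p in points}

# Hypothetical stand-in for "train on subset, evaluate on validation set".
helpfulness = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.1}

def toy_utility(subset):
    return min(1.0, sum(helpfulness[p] for p in subset))

values = tmc_shapley(list(helpfulness), toy_utility)
```

In this toy setup the score saturates at 1.0, so most iterations stop before visiting every point. Point 3 can add at most 0.1 in any ordering, so it ends up with the lowest value.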
In the data selection stage of our research, each group member individually ran the Data Shapley selection algorithm, leveraging multiple computers to generate more approximations of each data point’s Shapley value in less real time. We combined the individual results using a custom algorithm made for this work, which can be found in our GitHub repository. The outputs of each individual data_shapley run were the patient identification number, the data_shapley value, and the sample count, denoted PIN, DSV, and SC respectively. PIN is the index number assigned to a patient in the original data set downloaded from Kaggle. DSV is the Data Shapley value assigned to that patient (the mean of all approximate Shapley values for that patient), which is used to evaluate that data point’s usefulness to the training of our model. SC is the number of times that data point was randomly selected during a run of the Data Shapley selection algorithm.
We combined all of our Data Shapley record sheets as follows. The records were first grouped by PIN. Next, the combiner program calculated the weighted mean of the data_shapley values for each PIN, weighting each sheet’s value by its sample count: $\mathrm{DSV}_{\text{combined}} = \frac{\sum_i \mathrm{SC}_i \, \mathrm{DSV}_i}{\sum_i \mathrm{SC}_i}$, where $i$ indexes the output sheets. The new sample count is the sum of the sample counts for that PIN across all data_shapley output sheets ($\mathrm{SC}_{\text{combined}} = \sum_i \mathrm{SC}_i$). The PIN remains the same, and after all calculations the combined_output sheet is sorted numerically by PIN. Next, a few ML architectures were trained on our Data Shapley selected training data using TensorFlow, and the model with the highest validation accuracy was recorded.
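The combining step can be sketched as follows. This is not the combiner from the repository: it assumes the weights in the weighted mean are the sample counts, and the `combine_sheets` name and the two small sheets are hypothetical.

```python
from collections import defaultdict

def combine_sheets(sheets):
    """Merge per-run records of (PIN, DSV, SC) into one record per PIN.

    The combined DSV is the mean of the per-run DSVs weighted by sample
    count (an assumed weighting); the combined SC is the sum of the SCs.
    """
    weighted_sum = defaultdict(float)
    count_sum = defaultdict(int)
    for sheet in sheets:
        for pin, dsv, sc in sheet:
            weighted_sum[pin] += dsv * sc
            count_sum[pin] += sc
    # Sort by PIN and skip any patient that was never sampled.
    return [
        (pin, weighted_sum[pin] / count_sum[pin], count_sum[pin])
        for pin in sorted(count_sum)
        if count_sum[pin] > 0
    ]

# Hypothetical output sheets from two group members.
sheet_a = [(7, 0.25, 10), (3, -0.5, 4)]
sheet_b = [(3, 0.5, 12), (7, 0.75, 30)]
combined = combine_sheets([sheet_a, sheet_b])
```

Weighting by sample count makes the combined value equal to the mean over all individual approximations, as if every run had been pooled into one.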
Conclusion
This research attempted data selection and explored general machine learning for predicting the stage of cirrhosis from patient health records. Early experiments dealt with how to handle missing data, whether to treat the stage of cirrhosis as numerical or categorical, and which activation function to use in a neural network. A no-code platform for machine learning, superbio.ai, ran these models for convenience. With initial results in hand, the research focused on hyperparameter tuning and further architecture exploration in TensorFlow Keras. The best model here obtained 49% accuracy, while chance is 25%. Data selection with this architecture then provided subsets of the training data that could perform well. Compared to random subsets of equal size and to the subset of patients with no missing data, data selection did improve results. Data selection-based models didn’t outperform models trained on the whole data, however.
Future work could carefully tune the data selection algorithm used, thus improving its selection of subsets. Additionally, more patients and more features might allow the development of better models if the resources were available.
Bibliography
[1] Karthik Kumar, MBBS. “4 Stages of Cirrhosis of the Liver: 18 Symptoms, Causes & Treatment.” MedicineNet, 7 Apr. 2022.
[2] “Cirrhosis.” Mayo Clinic, Mayo Foundation for Medical Education and Research, 6 Feb. 2021.