Understanding Patient Preferences for Kidney Transplants

Blog, Journal for High Schoolers, Journal for High Schoolers 2023

By: Omry Bejerano, Yash Chanchani, Eugene Kwek, Anvika Renuprasad

Mentor: Itai Ashlagi


Kidney transplantation is the most effective treatment for end-stage kidney disease. In the United States, organ procurement organizations (OPOs) are responsible for recovering organs from deceased donors and offering them to patients who need transplants. However, the current kidney transplant process suffers from many frictions and inefficiencies. Most US patients wait three to five years for a kidney transplant, while an average of 3,500 kidneys are discarded each year, and around 5,000 patients per year die while waiting for a kidney transplant. One cause of these frictions is the lack of patient involvement in the organ allocation process: surgeons typically accept organs, not only kidneys, for their patients without consulting them. For patients to get involved, it is important that they understand the different factors that go into organ allocation.

We analyzed data from the Organ Procurement and Transplantation Network (OPTN) to gain insight into exactly which factors affect a patient's waiting time, specifically for kidneys. By exploiting variability in transplant centers' decisions, the allocation process, and historical data, we can provide patients with accurate waiting-time predictions that help them make informed decisions about their transplant.


The organ allocation system in the United States is confronted with numerous inefficiencies, which pose significant challenges for individuals in need of an organ transplant. Among these inefficiencies, the demand for kidneys far exceeds the supply provided by Organ Procurement Organizations (OPOs). Despite the pressing demand, a startling number of kidneys, approximately 3,500, are discarded each year. It is very possible that some of these discarded kidneys could have been transplanted if the allocation process had fewer frictions.

Extensive waiting times for organs are another issue in the allocation process. Patients awaiting kidney transplants in the US generally wait three to five years, and in places like California this number doubles to five to ten years. This problem results from a combination of factors, one of them being the lack of patient involvement in the decision of what type of kidney they will receive. Not every patient wants to wait many years for a high-quality kidney while suffering through dialysis procedures. Some patients would rather wait a short time for a lower-quality kidney so they can resume their daily lives. However, because their preferences are not voiced in the allocation process and surgeons generally choose their organs, patients are forced to wait, and many even die while doing so. Every year, around 5,000 patients lose their lives awaiting a kidney transplant.

To combat high patient mortality rates, increased kidney discards, and, most importantly, long patient waiting times, we analyzed STAR data files from the OPTN. Looking specifically into factors that affect waiting times at the Stanford and UCSF transplant centers, we found striking insights into what certain groups of patients preferred in their organs.


In this study, we conducted an extensive analysis utilizing the Standard Transplant Analysis and Research (STAR) data files from the Organ Procurement and Transplantation Network (OPTN). The dataset included over a million records of patient and donor data from the US, enabling a thorough investigation of the influences on the kidney allocation process.

To analyze characteristics of patients and donors such as age, CPRA, gender, ethnicity, KDPI, and cold ischemic time, we used Python and various packages. Widely used packages such as matplotlib, pandas, NumPy, and seaborn allowed us to work with the datasets and create visualizations comparing factors for certain groups of patients and taking a closer look at different regions' and hospitals' data. We built machine learning models using scikit-learn to investigate relationships between patients and donors, advancing our understanding of transplantation outcomes. Additionally, we conducted survival analysis using the lifelines library, which allowed us to evaluate the time-dependent factors that affect the success of organ donation.
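As an illustration of the survival-analysis idea (the lifelines library implements this and much more), a minimal Kaplan-Meier estimator can be written in plain Python. The waitlist durations below are invented for the example, not OPTN records:

```python
# Minimal Kaplan-Meier estimator: probability of still waiting at time t.
# Illustrative only; our analysis used the lifelines library on OPTN data.

def kaplan_meier(durations, events):
    """durations: days observed; events: 1 = event occurred, 0 = censored."""
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):  # walk through distinct times in order
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if deaths:
            survival *= 1 - deaths / at_risk
        curve.append((t, survival))
        at_risk -= sum(1 for d in durations if d == t)  # leave the risk set
    return curve

# Toy waitlist data: days until the event (1) or censoring (0)
days = [100, 250, 250, 400, 700, 900]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(days, events)  # e.g. S(100) = 5/6, S(400) = 4/9
```

Censored patients (those still waiting when observation ended) leave the risk set without lowering the survival estimate, which is exactly why this estimator suits waitlist data.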

Upon getting the data, we started by creating summary statistics to understand the factors that affected patient waiting times, such as age, CPRA, and KDPI. With each member of the team looking at a different portion of the data, we were able to identify anomalies and frictions in the kidney allocation system. The team focused on digging deeper into these abnormal occurrences to understand each situation and what could be changed to make kidney allocation a smoother process. From there, we utilized ML models to assist in our analysis. The next section, Results, further explains our process and includes visuals created through these methods.


The first results we found were insights from graphs based on kidney data from the OPTN; most are based on data from 2014-2018. Many past papers on this topic reveal that too many kidneys are discarded, so we created graphs looking into this issue (Figures 1-3). They revealed that more than 3,000 kidneys have been discarded since 2015 simply because recipients could not be found. This solidifies that there are great inefficiencies in the kidney allocation process, since finding a recipient on such a large waiting list should not be an issue.

Figure 1 Figure 2




Figure 3

Visualizations of waitlist times based on basic patient clinical information (blood type, age, etc.) were created (Figure 4). The main insight from these graphs is that there are large outliers (as seen in Figure 4), suggesting that waiting times are highly case-by-case, since no clear pattern was found.



Figure 4

Additional research was conducted on specific hospitals (in this case Stanford Healthcare and UCSF), with the hope of finding differences in the patterns of kidney transplants at each hospital. Insights in this area would help patients decide on the best facility for their treatment. It is evident that Stanford is more selective about its kidneys and usually transplants kidneys of higher quality (Figures 5-6).




Figure 5

Figure 6

More visualizations were created of average distances for each KDPI (kidney quality) range. UCSF Healthcare had considerable spikes in these visualizations (Figure 7), and there was clear variability from facility to facility, even though all the facilities being compared were in the same region.

Figure 7

The second part of the results was collected with machine learning models. Two ML models were created. The first model predicted patient waiting times based on patient clinical information (blood type, patient EPTS score, diabetic status, age, etc.). This model, a gradient boosting algorithm, predicted waiting times to within 100 days for 17% of patients and to within one year for about 45%. Although it was not very accurate, the results are promising. The second model was a Cox proportional-hazards model, with a graph of its predictions shown in Figure 8. It was used to develop the Estimated Waitlist Survival Score (EWLS), which helps inform patients how urgently they need a new kidney.
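The accuracy figures above count a prediction as a hit when it falls within a fixed tolerance of the actual waiting time. A sketch of that metric, with made-up numbers rather than our model's predictions:

```python
def within_tolerance(y_true, y_pred, tol_days):
    """Fraction of predictions within tol_days of the actual waiting time."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol_days)
    return hits / len(y_true)

# Hypothetical actual vs. predicted waiting times, in days
actual = [400, 900, 1500, 2000]
predicted = [480, 950, 1200, 1999]

acc_100_days = within_tolerance(actual, predicted, 100)  # 3 of 4 hits -> 0.75
acc_one_year = within_tolerance(actual, predicted, 365)  # all 4 hits -> 1.0
```

A tolerance metric like this is more informative for patients than a raw error score, since "within a year" maps directly onto planning decisions.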

Figure 8


This research aimed to provide insight into the kidney allocation process, specifically into how patients can optimize their chances of getting a kidney transplant within a reasonable time. As discussed in the background, the kidney allocation process in the United States is heavily flawed and inefficient. By tackling one of the smaller issues within this large problem, we aimed to come one step closer to improving the system.

By looking into the kidney data provided by the OPTN, we were able to make insightful graphs (as seen in Results) that shed light on the underlying problems in the current kidney allocation system. Many of these visualizations, such as the graphs of kidney discard rates, revealed concrete evidence of the current system's inefficiencies. Most of this research focused on Region 5 of the OPTN, but it is reasonable to infer that these inefficiencies are present in other regions of the United States. Furthermore, the insights into specific transplant facilities, such as Stanford Healthcare, can help patients decide where to receive treatment (for example, at a facility with stricter rules on kidney quality, or one with a high number of incoming kidneys).

The machine learning models were also a significant part of this project, and of the current kidney allocation process in general. Patients can use these models to gain a new understanding of their survival probabilities and their expected time on the waitlist. Based on these factors, they can choose a specific treatment center, or decide whether to accept a lower-quality kidney (since those are more readily available). For instance, if the model predicts a patient's waitlist time to be four years and the patient cannot wait that long, they may opt for a lower-quality kidney, which they could get in a shorter period of time. This way, they make an informed decision, rather than accepting or declining a kidney without knowing how long it would be until they get the best kidney for them.

However, the model that predicted waiting times was not very accurate, and the Cox proportional-hazards model also could have performed better. Still, the results were promising, and with more resources these models could be improved and implemented in the real world. This ties into the significance of this paper: eventually, both the ML models and the data insights can be used by transplant centers around the United States to help patients accurately define their preferences for treatment, kidney quality, and more.

Future Directions

Moving forward, building on our current findings, we aim to continue refining our machine learning models to achieve a consistent level of accuracy. The goal is for these ML models to eventually be used as rough estimates of patient health and urgency, which would help both patients and hospitals. We will continue to train the models on current data to increase their accuracy.

Future work will also include further analysis of offers, once that data is acquired. Analyzing offers at the OPO and patient level can help reveal how accepting certain patients and OPOs tend to be.

To put it all together, a website or application is atop our priorities for the future. This interface would give patients access to helpful metrics for deciding which OPO to choose and which kidneys to consider given their situation. It would make important information accessible and serve as a genuinely helpful tool for patients weighing multiple recovery options.

Finally, we believe that we can create medical and political change within the inefficient kidney allocation system. Our analysis revealed, among other insights, that an unnecessarily large number of kidneys are being discarded because a recipient could not be located or because of transportation issues. Taking these findings to executives and officials within the kidney transplantation network could change countless lives.



Investigating the Viability of Semantic Compression Techniques Relying on Image-to-Text Transformations


By: Adit Chintamaneni, Rini Khandelwal, Kayla Le, Sitara Mitragotri, Jessica Kang

Mentors: Lara Arikan, Tsachy Weissman


Data compression is a crucial technique for reducing the storage and transmission costs of data. As the amount of data consumed and produced continues to expand, it is essential to explore more efficient compression methodologies. The concept of semantics offers an interesting new approach to compression, enabled by recently developed technology. Concisely, we sought to discover whether the most important features of an image could be compressed into text, and whether this text could be reconstructed by a decompressor into a new image with a high level of semantic closeness to the original. The dataset of compressed images comprises five common image categories: single person, group of people, single object, group of objects, and landscape. Each image was compressed through the following pipeline: image-to-text conversion, text compression and file size determination, file decompression and text recovery, and text-to-image conversion. This pipeline enables any image to be compressed into a few dozen bytes. For image-to-text compression, we experimented with both human and artificial intelligence (AI) powered procedures. We selected the text-to-image model DALL-E 2 as our decompressor. We released multiple surveys to assess structural fidelity and semantic closeness between original images and reconstructed images, and included compressed JPEGs and WebPs to benchmark performance. Human and AI reconstructions received lower structural fidelity scores than WebP and JPEG images. However, images reconstructed from human captions were perceived to have higher structural fidelity and semantic closeness to the original images than those reconstructed from AI captions. Participants' textual descriptions of both human and AI reconstructions had high semantic fidelity scores relative to their descriptions of the original images. This demonstrates that the proposed pipeline is a viable semantic compression mechanism.


Images account for a large portion of all existing digital data. Conventional lossy image compression algorithms, such as the discrete cosine transform, eliminate redundant data while preserving essential image features. In recent years, researchers have developed semantic-assisted compression techniques in which important semantic features of an image are identified and preserved by the compression algorithm [1]. Modern advancements in both image-to-text and text-to-image transformations allow for the generation of text with high semantic fidelity to the original image and vice versa. Earlier this year, Salesforce released BLIP-2, a multimodal language model that can generate text descriptions of images [2]. In April of 2022, OpenAI introduced DALL-E 2, a leading generative model that converts text descriptions into images [3]. We aim to investigate the viability of a semantic compression pipeline based on such transformations.


We identified five image categories that are semantically distinct: “Single Person”, “Group of People”, “Inanimate Object”, “Multiple Inanimate Objects”, and “Landscape”.

Figure 1: one set of original images.

We assembled 5 sets of 5 images (25 total images), where each set contained all of the identified image categories.

Human Compression

Which features of an image constitute its meaning? We sought to answer this initial question while developing our methodology for human-based Image-To-Text transformations. Through polling, referencing other works [4], and intuition, we found that the most important, universal features of an image include:

  • major foreground objects
  • all background objects
  • colors, forms, and shapes of those objects
  • dispositions of those objects
  • the relationships between those objects, including actions and geometries of positions
  • temporal context
  • patterns in the image; repeating features of those objects

We compiled these features into a tight syntax for human captions to follow:

<major foreground objects>, <colors, forms, and shapes of those objects>, <dispositions of those objects>, <relationships between those objects, including actions and geometries of position>, <immediate context>, <temporal context>, <background context>

We manually captioned all 25 original images using this syntax.

a young, fluffy german shepherd with its tongue hanging out prancing merrily on a clear road in the afternoon with a forest in the distance.

Figure 2: an image captioned using our tight syntax.

AI Compression

Current artificial intelligence-driven Image-To-Text transformation tools cannot capture the semantic schema of an image as well as humans can through our syntax. However, they offer a speed advantage, making them worthwhile to explore as a potential component of our compression pipeline. We employed the Salesforce BLIP model to caption all of the collected images. At this point, we had 25 human-generated captions and 25 AI-generated captions.

Figure 3: captions generated by BLIP for an image set.


We used DALL-E to reconstruct images from 25 human-generated captions. We refer to these as “Human Reconstructions”. We followed the same procedure to reconstruct images from the 25 AI-generated captions, referring to these as “AI Reconstructions”. We now had 75 images in total: 25 original images, 25 Human Reconstructions, and 25 AI Reconstructions.

Figure 4: a visualization of the reconstructions of one of the original images.

Benchmarking Reconstructions

We benchmarked our reconstructed images against images compressed via JPEG and WebP compression. Using a Jupyter Notebook, we compressed each of our 25 original images as JPEGs and WebPs at two quality settings, 1 and 25, where quality is a whole number from 1 (lowest) to 100 (highest). This provided us with 100 more images. We also compressed each caption into a gzip file using the gzip compression algorithm [6]. The compressed human-generated captions had a mean file size of 154.8 bytes; the compressed AI-generated captions had a mean file size of 63.15 bytes.
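The caption-compression step needs only Python's standard gzip module. A sketch using the caption from Figure 2 (exact byte counts depend on the compression level and gzip's fixed header overhead):

```python
import gzip

caption = ("a young, fluffy german shepherd with its tongue hanging out "
           "prancing merrily on a clear road in the afternoon with a "
           "forest in the distance.")

raw = caption.encode("utf-8")
compressed = gzip.compress(raw, compresslevel=9)
print(len(raw), "bytes raw ->", len(compressed), "bytes gzipped")

# The text stage is lossless: decompression recovers the caption exactly.
restored = gzip.decompress(compressed).decode("utf-8")
```

All the loss in the pipeline happens in the image-to-text and text-to-image stages; the gzip stage is fully reversible.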

Figure 5: a visualization of the JPEG and WebP compressed files at Quality = 25 and Quality = 1 for two images from Figure 1.

By this stage, each original image had a corresponding AI Reconstruction, Human Reconstruction, highly compressed JPEG, highly compressed WebP, lightly compressed JPEG, and lightly compressed WebP. We refer to this as an image group. We issued three surveys for each of the 25 image groups.

  • An original_image survey to capture what information people received from the original images
    • Questions: What object(s), element(s), or person(s) stand out to you most in this image? How do the “most important” object(s) you named above relate to each other? What three adjectives best describe this image? Please describe the image in one sentence.
  • A reconstructed_image survey to record what information people received from the reconstructed images
    • human reconstructed_image section uses reconstructed human images
    • AI reconstructed_image section uses reconstructed AI images
    • Questions: same as original_image survey
  • A comparative_image survey to compare the semantic closeness between original images and reconstructed images; this also compares original images with their corresponding JPEGs and WebPs
    • human comparative_image section compares original and human reconstructed images
    • AI comparative_image section compares original and AI reconstructed images
    • Questions (for each comparison): How similar are these images in terms of their effect on you, and what they mean to you? How similar are these images in terms of their content and appearance?

Figure 6: examples of qualitative survey questions from our original_image form and reconstructed_image forms. Questions remained constant, but the original_image forms contained only the original images while the reconstructed_image forms contained reconstructed AI/human images.

Figure 7: examples of quantitative survey questions from our comparative_image forms. The top question compares an original image (left) to its corresponding AI reconstructed image (right). The bottom question compares an original image (left) to its corresponding highly compressed JPEG (right). Both questions were asked for each image pair (ex. Original: AI Reconstructed, Original: Human Reconstructed, etc), for a total of 10 questions per section.

We collected 125 survey responses from people of various backgrounds. We split the data into two categories: quantitative and qualitative. Here, quantitative data consisted of all the comparative_image survey responses, and qualitative data consisted of all the original_image and reconstructed_image survey responses.

Analyzing Quantitative Data

Initially, the question "How similar are these images in terms of their effect on you, and what they mean to you?" was meant to determine semantic closeness between images, and the question "How similar are these images in terms of their content and appearance?" was meant to determine structural fidelity. However, after further polling, we found that most respondents interpreted both questions in the latter sense, so we rendered responses to the former question invalid. When analyzing this data, we examined median ratings because the data was skewed, and used mean ratings to further distinguish data with identical medians (see Results).
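A toy example of why the median served as the primary statistic and the mean as the tie-breaker (the ratings here are invented for illustration): in skewed data, a few low outliers move the mean while leaving the median untouched.

```python
from statistics import mean, median

# Two hypothetical rating sets with identical medians...
condition_a = [10, 10, 9, 10, 2]   # one low outlier
condition_b = [10, 10, 9, 10, 8]

same_median = median(condition_a) == median(condition_b)  # both 10
# ...distinguished by their means (8.2 vs. 9.4)
means = (mean(condition_a), mean(condition_b))
```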

Analyzing Qualitative Data

The collected qualitative data was a better indicator of semantic closeness because the questions provided insights into the important semantic schema perceived by survey respondents. We focused on the following questions: "Please describe this image in one sentence.", "What three adjectives best describe this image?", and "What object(s), element(s) or person(s) stand out to you most in this image?" Based on these questions, we created six CSV datasets: AI Description vs. Original Description, Human Description vs. Original Description, AI Adjectives vs. Original Adjectives, Human Adjectives vs. Original Adjectives, AI Objects vs. Original Objects, and Human Objects vs. Original Objects. We used the similarity function in spaCy's en_core_web_lg model to evaluate semantic closeness between descriptions, adjectives, and notable elements of original and reconstructed images. The function yields an output ranging from 0, indicating no semantic closeness, to 1, indicating that the texts are identical.
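spaCy's similarity score is the cosine similarity of document vectors built from learned word embeddings. As a toy stand-in, the same comparison computed over raw word counts looks like this (actual scores from en_core_web_lg will differ, since learned vectors also capture synonyms):

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity over word counts. spaCy compares averaged
    word vectors instead, but the final cosine step is the same."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

score = cosine_similarity("a dog on a road", "a dog on a path")  # 6/7 ~ 0.857
```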

Figure 8: calculating the text similarity between reconstructed AI image descriptions and original image descriptions using a Jupyter Notebook.

Figure 9: a diagram of our compression pipeline, as described throughout the Methods section.



From the Comparative AI survey, we found that respondents perceived the greatest similarity between the original images and the lightly compressed JPEGs (median similarity rating of 10, mean of 9.7), and the least similarity between the original images and the AI reconstructed images (median of 6, mean of 5.9). In the Comparative Human survey, respondents perceived the greatest similarity between the original images and the lightly compressed WebPs (median of 10, mean of 10) and the least similarity between the original images and the human reconstructed images (median of 6.5, mean of 5.8).

Figure 10: dot plots of structural fidelity between compressed images and original images from comparative form results (scaled down by a factor).

We analyzed the quantitative data using direct comparisons of absolute values: if the absolute value of one number was greater than that of another, the difference was considered significant. We did not use statistical significance tests to draw conclusions because the general human population was the determiner of similarity between images, so the probability of a response being a chance event can be ignored. Thus, every difference, as described above, is considered significant on its own.


Respondents’ descriptions of human reconstructed images and descriptions of original images had a mean text similarity of 0.87, while respondents’ descriptions of AI reconstructed images and descriptions of the same original images had a mean text similarity of 0.84. Additionally, the three adjectives chosen to describe human reconstructed images and those chosen to describe original images had a mean text similarity of 0.98, while the three adjectives chosen to describe AI reconstructed images and those chosen to describe the same original images had a mean text similarity of 0.95. Lastly, answers to “What object(s), element(s) or person(s) stand out to you most in this image?” for human reconstructed images and corresponding original images had a mean text similarity of 0.80, while answers to the same question for AI reconstructed images and corresponding original images had a mean text similarity of 0.78. We provided a syntax for respondents to utilize when answering the first question, and a list of adjectives for respondents to select from when answering the second question. We did not design such structural rules for the third question; this may explain the lower similarity scores.


The quantitative results demonstrate that, on average, the human and AI reconstructions were less similar to the original images in terms of content and appearance (structural fidelity) than the WebP- and JPEG-compressed files. When compared against each other, human captions produced images with slightly greater structural fidelity and semantic closeness to the originals than AI captions, as shown by both the median similarity scores and the qualitative results. Furthermore, the qualitative results indicate that human textual descriptions of both types of reconstructions had high semantic fidelity to the human textual descriptions of the original images. Thus, we consider our pipeline, based on both human and AI captioning, to be a viable semantic compression mechanism. Indeed, for many images, semantic schema is more important than pixel-wise fidelity; the proposed pipeline can be integrated into the storage and sharing of such images at unprecedented compression ratios.

Future Directions

Before conducting further surveys, some adjustments to our methods are advisable. Increased variation between image categories would test our pipeline's ability to capture semantic schema across a wide range of image types. Greater variation within an image category would also provide better data; for example, many of the "Multiple Inanimate Objects" images were images of typical desk objects. Furthermore, offering survey respondents a clearer definition of semantics would enable them to answer quantitative questions about semantic closeness more accurately. Although DALL-E is among the best Text-To-Image models, others are worth exploring. Stable Diffusion's ControlNet, for example, can incorporate both image descriptions to preserve meaning and various maps to preserve structural fidelity. Although the compression ratios of this method are lower than those of our pipeline, they are still substantially better than those of existing algorithms.

Figure 11: an example of a compression pipeline using ControlNet. Here, the original image (left) was captioned “a top view of a plant in a pot, a hexagonal candle, a tube of cream, bottles of essential oils, and a book titled “A Cat’s Life” are arranged in a group, with the plant on the top left, the candle on the top right, the tube of cream on the bottom left, the bottles of essential oils in the middle, and the book on the bottom right, all on a well-lit, plain white surface.” An example of an edge map generated using cv2 is shown (middle). Together, both elements give the output shown on the right.


  1. Akbari, M., Liang, J., & Jingning H. (2019, Apr 18): DSSLIC: Deep Semantic Segmentation-based Layered Image Compression. arXiv.Org. https://arxiv.org/abs/1806.03348
  2. Li, J., Li, D., Xiong, C., & Hoi, S. (2022, Feb 15): BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. arXiv.Org. https://arxiv.org/abs/2201.12086
  3. Petsiuk, V., Siemenn A., Surbehera, S., …, & Drori, I (2022, Nov 22): Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark. arXiv.Org. https://arxiv.org/abs/2211.12112
  4. Perkins School for the Blind. (2023, Jul). How To Write Alt Text and Image Descriptions for Visually Impaired. perkins.Org. https://www.perkins.org/resource/how-write-alt-text-and-image-descriptions-visually-impaired/
  5. Mahtab V., Pimpale G., Aldama J., & Truong P. (2019, Aug). Human-Based Image Compression; Using a Deterministic Computer Algorithm to Reconstruct Pre-Segmented Images. theinformaticists.com. https://theinformaticists.com/2019/08/29/human-based-image-compression-using-a-deterministic-computer-algorithm-to-reconstruct-pre-segmented-images/

Self-Learning AI Model on Limited Biomedical Imaging Data and Labels


By: Niraj Gupta, Saniya Khalil, Jolie Li, Iris Ochoa, Elisa Torres.

Mentor: David J. Florez Rodriguez.


This research explores self-learning AI using Google Colab by pre-training a general TensorFlow-coded model to recognize patterns in limited, unlabeled biomedical image data. This allows the model to understand the basic underlying patterns and structures in the images. After self-learning, we train the model with labeled data. We hypothesize that self-learning will decrease the dependence on extensively labeled data for developing accurate AI models.


The development of reliable visual artificial intelligence (AI) models in the biomedical field usually requires a substantial quantity of high-quality imaging data accurately labeled by professionals. Unfortunately, such data is typically limited or prohibitively expensive to obtain, which poses one of the most significant challenges at the intersection of machine learning and biomedicine. It hinders the development of potentially high-performing AI models that could support effective strategies for disease prevention, early detection, and management. To combat this, AI models can first be trained on easily accessible unlabeled data, allowing the nascent model to grasp basic visual patterns in the data during its unsupervised self-learning phase. Then, labeled data can be used to improve the model and adjust its parameters during the supervised phase.
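The two-phase idea can be sketched end to end with a toy linear autoencoder pretrained on unlabeled vectors, followed by a logistic head fine-tuned on a small labeled subset. Random vectors stand in for images here, and this NumPy sketch is only an analogy for the actual TensorFlow model:

```python
import numpy as np

rng = np.random.default_rng(0)
X_unlab = rng.normal(size=(200, 16))        # plentiful unlabeled "images"
X_lab = rng.normal(size=(20, 16))           # scarce labeled subset
y_lab = (X_lab[:, 0] > 0).astype(float)     # toy binary labels

def recon_loss(X, W):
    return float(np.mean((X @ W @ W.T - X) ** 2))

# Phase 1: self-supervised pretraining of a linear encoder W by
# gradient descent on the reconstruction error ||X W W^T - X||^2.
W = rng.normal(scale=0.1, size=(16, 4))
loss_before = recon_loss(X_unlab, W)
for _ in range(200):
    R = X_unlab @ W @ W.T - X_unlab
    grad = 2 * (X_unlab.T @ R @ W + R.T @ X_unlab @ W) / X_unlab.size
    W -= 0.05 * grad
loss_after = recon_loss(X_unlab, W)         # lower: structure was learned

# Phase 2: supervised fine-tuning of a logistic head on the encodings.
Z = X_lab @ W
w, b = np.zeros(4), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(Z @ w + b)))
    w -= 0.1 * Z.T @ (p - y_lab) / len(y_lab)
    b -= 0.1 * float(np.mean(p - y_lab))
```

The key point the sketch preserves: the encoder's parameters are learned from the large unlabeled pool, so the scarce labels only have to train a small classifier on top.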

Our team developed a model that follows these criteria and analyzed its results to answer the question: “How does self-learning affect the accuracy of AI models trained on limited labeled data (primarily breast cancer histopathology images), compared to purely supervised AI models?”

This research focuses on programming an AI model trained on a limited breast cancer dataset, much of which is unlabeled. Breast cancer, a disease in which cells in the breast grow uncontrollably and form tumors, is the most common type of cancer in the world. According to the World Health Organization, over 2.3 million individuals worldwide were diagnosed with breast cancer in 2020, and over 685,000 died from the disease. Its growing prevalence makes it imperative to develop new models and technology that can classify possible tumors accurately.

Materials and Methods


Both the self-learning model and the supervised learning model were trained on a dataset titled “Breast Histopathology Images.” This dataset was taken from Kaggle, a data science and AI platform under Google LLC. It consists of 250 × 250 pixel .png images from patients with and without Invasive Ductal Carcinoma (IDC), the most common subtype of breast cancer. The full 3 GB dataset contains 198,738 IDC(-) images and 78,786 IDC(+) images. For the models in this investigation, 772 class 0 (non-IDC) images and 207 class 1 (IDC-positive) images were used for training and validation. The demographic information of the patients is not provided by the data source.

Breast cancer is important to investigate as it is the most common type of cancer in women. Accurately identifying breast cancer subtypes is an important biomedical task on which AI can save time, decrease cost, and reduce error.

Sample images:

Class 0 (non-IDC cells)

Class 1 (IDC positive cells)


The AI models were coded in Google Colab notebooks using Python. Functions were imported from TensorFlow, Matplotlib, NumPy, and pandas.

Self Supervised Learning

The Self Supervised Learning AI model uses two variables, littleNtrain and Ntrain. LittleNtrain is the quantity of labeled data used for training; this number influences fitting (overfitting versus proper learning) and the model's validation accuracy and loss. Ntrain is the quantity of unlabeled data used to train the self-learning model. In particular, the Ntrain images prepare the model to recognize visual patterns in the available data, an ability that is reused when training on labeled data later.
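The roles of the two quantities can be sketched as a simple dataset split. The function and file names below are illustrative, not the notebook's actual code; only the variable names Ntrain and littleNtrain come from the text.

```python
# Sketch: partitioning a dataset into a self-supervised pool (Ntrain
# unlabeled images) and a supervised pool (littleNtrain labeled images).
def split_for_self_learning(images, labels, Ntrain, littleNtrain):
    # First Ntrain images are used WITHOUT their labels for pretraining.
    unlabeled_pool = images[:Ntrain]
    # The next littleNtrain images keep their labels for fine-tuning.
    labeled_pool = list(zip(images[Ntrain:Ntrain + littleNtrain],
                            labels[Ntrain:Ntrain + littleNtrain]))
    return unlabeled_pool, labeled_pool

# Illustrative stand-in for the 979 histopathology images.
images = [f"img_{i}.png" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

# The largest configuration reported in Results: 800 unlabeled, 160 labeled.
unlabeled, labeled = split_for_self_learning(images, labels,
                                             Ntrain=800, littleNtrain=160)
print(len(unlabeled), len(labeled))  # 800 160
```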

The self-learning model defines augmentation functions that crop, change the colors of, remove the colors from (grayscale), and rotate pictures from the training dataset.
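The four augmentations can be sketched as follows. This version uses NumPy rather than the notebook's TensorFlow ops, and all function names are illustrative:

```python
import numpy as np

# Illustrative versions of the four augmentations described in the text:
# crop, color change, color removal (grayscale), and rotation.

def random_crop(img, size):
    # Take a random square crop of the given size.
    h, w = img.shape[:2]
    top = np.random.randint(0, h - size + 1)
    left = np.random.randint(0, w - size + 1)
    return img[top:top + size, left:left + size]

def color_jitter(img, strength=0.4):
    # Scale each color channel by a random factor, then clip to [0, 1].
    factors = 1 + np.random.uniform(-strength, strength, size=3)
    return np.clip(img * factors, 0.0, 1.0)

def to_grayscale(img):
    # Average the channels, then broadcast back to three channels.
    gray = img.mean(axis=2, keepdims=True)
    return np.repeat(gray, 3, axis=2)

def random_rotate(img):
    # Rotate by a random multiple of 90 degrees.
    return np.rot90(img, k=np.random.randint(4))

# Apply the full pipeline to a fake 250 x 250 RGB image.
img = np.random.rand(250, 250, 3)
aug = random_rotate(to_grayscale(color_jitter(random_crop(img, 224))))
print(aug.shape)  # (224, 224, 3)
```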

The model is compiled, and the randomly modified images are fed to it. A cosine similarity loss tracks the model's progress during training by producing values that measure how similar or different two inputs' representations are. The model is trained to recognize whether two images are the same, even when one version has been modified.
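The SimSiam-style objective (see the Keras example in Works Cited) maximizes cosine similarity between representations of two augmented views of the same image; a minimal NumPy sketch of that loss, with illustrative names:

```python
import numpy as np

def negative_cosine_similarity(p, z):
    # Normalize both representation vectors, then take the negative
    # dot product: minimizing this loss pushes the two views together.
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

a = np.array([1.0, 2.0, 3.0])
print(negative_cosine_similarity(a, a))   # -1.0: identical views, minimal loss
print(negative_cosine_similarity(a, -a))  # 1.0: opposite views, maximal loss
```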

The layers of the self-learning model include TensorFlow layers dropout, dense, and batch normalization.

Full code of Self Supervised Learning model: CODE – FriendlySelfSupervisedLearningFINAL.ipynb

Supervised Learning

Unlike the Self Supervised Learning AI model, the Supervised Learning AI model only uses the littleNtrain variable. This is because selfless AI models (supervised training without self-supervised learning) train only on labeled data, a difference that gives self-learning models the advantage, since the selfless model can only train on a few labeled data sources.

The selfless learning model does not train on any modified data. Instead, this model trains only on the original, labeled data source.

Full code of the Supervised Learning model:

CODE – FriendlySelfLessLearningFINAL.ipynb


Self-Supervised Learning Notebook

With the self-supervised learning notebook we ran the code with up to 800 unlabeled breast cancer images and 160 labeled images. We observed overfitting until the unlabeled sample size reached 40, meaning the model captured unimportant fluctuations rather than underlying patterns. As the sample sizes increased, our validation accuracy rose to 90% (from the original 78% under overfitting), suggesting a large improvement in pattern recognition and in the accuracy of predictions for the images.

Moreover, our validation (val) accuracy starts at 78%, suggesting that the model's predictions are accurate for an estimated 78% of data points; this is also the accuracy achieved by guessing that every image is healthy, in other words, by overfitting. It later increased to 85%, which, as with the training data, reflects improved accuracy. Although we observed an increase of several percentage points across the models, signaling that the model is learning and becoming proficient, it still does not reach an ideal performance of 100%.
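The 78% "guessing" baseline can be checked directly from the class counts given in Materials and Methods: predicting every image as healthy is right exactly as often as class 0 appears in the data.

```python
# Majority-class baseline from the class counts in Materials and Methods:
# 772 non-IDC (class 0) and 207 IDC-positive (class 1) images.
n_healthy, n_idc = 772, 207
baseline = n_healthy / (n_healthy + n_idc)
print(round(baseline, 3))  # 0.789, consistent with the ~78% reported
```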

Supervised Learning [AKA Selfless] Notebook

For our Supervised Learning notebook, 160 labeled images were run through the model. We detected that this model overfits by similar amounts to the self-supervised one: its accuracy drops to 40% and rises to 90%, possibly due to the differing amounts of diseased data in the samples.

For our val accuracy, we obtained an average of 78%, indicating that the model's predictions are off target about 22% of the time. The supervised learning model also has a larger maximum validation loss across the sample than the self-supervised learning model.

The results suggest that the model's predictions in both notebooks improved significantly, but these results are not conclusive. The val accuracy still needs to rise by several percentage points to reach ideal performance in image analysis and pattern detection. We will therefore need to run a larger number of similar models to examine the algorithms for any changes, specifically increases, in validation accuracy.

Scatterplots obtained

Our results were plotted with the littleNtrain values on the x-axis and the validation metrics on the y-axis. As shown below, two graphs present our results: the accuracy of our models (left) and the loss used to guide the optimization process (right). In the val accuracy plot, accuracy remained stable at a lower value (around 0.80) for most littleNtrain values between 0 and 14, with a sharp rise at littleNtrain = 2 and a similar increase at 12 that, this time, persisted through 14.


Training Accuracy:

The accuracy was very high at the beginning of testing, most likely due to overfitting on the smaller datasets. Since the number of images with breast cancer was significantly smaller than the number of healthy images, the model most likely labeled everything healthy at first, producing high accuracy simply because there was little disease data. Once the number of diseased images in the sample increased, the model's accuracy decreased because of this earlier overfitting. We can counter this issue by increasing the image sample sizes (>100) to expose the model to more disease images, so that it can familiarize itself with the patterns found in them.

After increasing the labeled image sample size (from 16 to 160 in steps of 8) and the unlabeled image data (from 64 to 800), we noticed a slight, gradual increase in validation accuracy. We can assume that the higher exposure to disease images allowed the model to pick up on the difference in patterns between healthy and diseased images, resulting in increased accuracy.

Our hypothesis, that training on unlabeled data before using labeled data as a supplement increases the efficiency of the model, is supported by the comparison of validation accuracies. The validation accuracy for guessing (most likely due to overfitting) is 78%, which is also the validation accuracy for supervised learning (using only labeled data), while the self-supervised learning model, as mentioned previously, was able to increase its accuracy.

Potential applications in medical diagnosis and future treatments:

If successful, this model will make it more affordable and efficient to sort diseased from healthy medical images. By using unlabeled data to train the model in pattern detection, the expense of acquiring labeled data is greatly decreased, and the efficiency of finding patterns and determining categories is greatly increased.

Our research could also be insightful for image enhancement, as noise reduction and feature modification can produce sharper images that support better disease diagnosis for patients.

Additionally, AI is currently being employed in various medical approaches to facilitate doctors’ work when detecting certain anomalies, accelerating drug development, or even when understanding complex disorders.

Future Direction

Research regarding AI models and the usage of unlabeled datasets, particularly biomedical, is critical for the development of new strategies and technology that can have major impacts in the healthcare field. In order to further develop our research and gain more substantial results, a myriad of changes can be employed in the future.

Greater computational power would allow the creation of more intricate and comprehensive models, which would allow for the production of more accurate results. Furthermore, an increased amount of data would allow the AI model created to have more reliable results with decreased margins of error.

Larger data samples create stronger results, decrease chances of common behaviors such as overfitting (a situation caused by small training data sets), allow greater training with varying Ntrain and littleNtrain combinations, and more.

Moreover, it would be imperative to test the AI model on additional, diverse datasets in order to avoid bias. Training the model on numerous datasets would allow us to create a generalizable machine learning model with an architecture that can be utilized in numerous situations. This could increase the positive impact of our model by allowing it to adapt to various circumstances.

Overall, we will continue to test and grow our AI model with various adjustments in order to ensure its efficiency and increase its performance.

Works Cited

  1. Kaggle: Your Machine Learning and Data Science Community, https://www.kaggle.com/. Accessed 20 July 2023.
  2. “Breast cancer.” World Health Organization (WHO), 12 July 2023, https://www.who.int/news-room/fact-sheets/detail/breast-cancer. Accessed 2 August 2023.
  3. “Breast Histopathology Images.” Kaggle, https://www.kaggle.com/datasets/paultimothymooney/breast-histopathology-images. Accessed 20 July 2023.
  4. “Self-supervised contrastive learning with SimSiam.” Keras, https://keras.io/examples/vision/simsiam/. Accessed 1 August 2023.

Segmented Image Compression in Healthcare

Blog, Journal for High Schoolers, Journal for High Schoolers 2023

By: Alex Nava, Cristina Bonilla Bernal, Jayden Tang, Logan Graves

Mentors: Ayushman Chakraborty, Qingxi Meng


The crossroads at which medical imaging and data compression intersect has yielded a fascinating, novel area of research, particularly pertaining to the Segment Anything Model (SAM), an AI-based image segmentation model. We researched a plethora of standard medical imaging techniques, including Computed Tomography (CT scans), Positron Emission Tomography (PET scans), Ultrasound, and Magnetic Resonance Imaging (MRI). Additionally, we analyzed more specific areas of medical imaging such as digital pathology, mammography, and photoacoustic imaging. To supplement our knowledge of the different types of imaging, we researched, specifically, how MRI scans are processed in terms of segmentation and standard storage, and compression practices. Furthermore, we studied recurrent difficulties that medical professionals face when segmenting certain areas of the body, paying particular attention to issues within the brain and the spinal cord. By speaking to a member of the Radiology Interest Group at Stanford, we also determined frequent issues surrounding storage and clinical workflow, thus narrowing our research into how the Segment Anything Model can be applied in a robust, efficient, and critical manner.

Integrating our knowledge about various kinds of medical imaging technology, we present a proof-of-concept for a novel image compression technique based on SAM, one which is especially suited to medical imaging technology. By automatically distinguishing between unimportant image aspects (such as the blank black background of an MRI) and important aspects (such as the anatomical details of the scan), we can apply lossy compression to nonessential aspects and lossless compression to essential ones, allowing much greater amounts of compression without losing details relevant to the scans. We compare this technique with existing compression methods and suggest its potential applications, as well as areas for future research.


Medical imaging maintains an invaluable role in the healthcare field, advancing the processes of diagnosis, treatment, and recovery in ways that go beyond human ability. A plethora of imaging techniques exist; however, the most prevalent prove to be X-ray imaging, Computed Tomography (CT) imaging, and Magnetic Resonance Imaging (MRI). All of these techniques, despite their inherent differences, are linked by a common thread: the necessity for digital storage. Digital storage is defined by the process of storing and retaining information through binary code, allowing for the preservation of images, videos, text, and other forms of digital data. Within the medical field, the transition from film-based imaging to digital storage has proven to be revolutionary; enhancements to image quality, accessibility, long-term retention, and interdisciplinary collaboration are significant. The coupling of medical imaging and the digital world requires a closer look into just how much storage these images utilize, particularly into the role that data compression plays. Data compression is the process by which digital information is encoded to use fewer bits to represent the same information. The compression of digital data optimizes storage space, transmission times, and transfer efficiency, reflecting its gravity within the healthcare field.

Historical Context

The transition from film-based imaging to digital imaging in the 1980s and 1990s marked the beginning of a large area of focus on data compression in the medical field. Early compression techniques, mainly lossless compression algorithms, were employed to reduce the amount of storage taken up by patient records and images. In the late 1990s and early 2000s, telemedicine saw an increase in popularity, thus emphasizing the necessity of data compression as doctors required access to medical images. The advent of cloud storage and its worldwide proliferation in the early 2000s furthered the necessity for efficient data storage, underscoring the value of data compression research in healthcare [1]. Soon after, DICOM (Digital Imaging and Communications in Medicine) became the international standard, defining specific guidelines to maximize efficiency and collaboration when analyzing medical information. These guidelines popularized JPEG and JPEG2000 compression formats, allowing for easy, efficient access and storage to medical images [2]. In recent years, however, research into machine learning and its role in data compression has become increasingly popular, placing a strong emphasis on artificial intelligence (AI) and the development of compression algorithms. Considering this historical context, our present study explored a novel compression technique that utilizes the Segment Anything Model (SAM), and by building off historical milestones in the field of data compression, we developed a proof of concept compression technique.

Segment Anything Model (SAM)

The Segment Anything Model (SAM) is a segmentation system that runs entirely off user prompts, relying on its zero-shot generalization to identify and classify unknown objects and images. Without any need for additional training, SAM is able to process a vast array of user input prompts, and when given a grid of points, it can segment out certain areas of an image. SAM's dataset consists of over 1.1 billion segmentation masks derived from ~11 million images, demonstrating its robust accuracy and efficiency. Additionally, SAM was purposefully decoupled into a one-time image encoder and a lightweight mask decoder, enabling the model to be run simply in a web browser [3]. Overall, SAM's proficiency in segmenting medical images, specifically MRI scans, played a vital role in our research.

Theory and Research Objectives

Our team theorized that by compressing segments made by SAM individually, we would achieve higher rates of compression than if we compressed the image as a whole, as is done traditionally. This essentially means that by compressing the disparate parts of an MRI scan, the background and foreground, separately, results would yield a greater rate of compression. To test this theory, we split our objectives into two areas of focus: medical imaging/SAM and the development of our model. Our team researched the intricacies and applications of medical imaging, specifically Magnetic resonance imaging (MRI). Our goal was to build a wealth of knowledge that would enable us to develop a model that could be effectively employed in the medical field; in other words, we intended to become experts in the field of medical imaging so as to apply our knowledge accurately when developing our compression algorithm. Our ultimate objective, backed by the extensive research into medical imaging and SAM, was to create a proof-of-concept image compression technique, one that utilized our original theory of individual segmentation and compression.

Significance and Application

The significance of our research goes beyond the rates of compression we achieved; it matters most for the storage costs of medical images. Maintaining an information system capable of storing medical images for a prolonged period, a minimum of seven years in California for example, comes at a high price [4]. Storage costs generally range from $25,000-$35,000 per year, a major hindrance to accessibility and affordability [5]. For well-funded hospitals in high-income areas, digital storage costs are not a large issue; for hospitals in low-income areas without major funding, however, they are a major obstacle. Hospitals in this position are forced onto a tight budget, and their services lag behind those of their well-funded counterparts. Achieving compression rates greater than 50% on MRI scans means storage costs can be reduced substantially. This reduction in cost allows more hospitals and radiology practices to be established in low-income areas, as the overall cost of running a practice decreases.


Medical imaging provides a uniquely suitable forum and use case for building and testing prototypes: not only are image storage costs a significant burden on many private practices and a source of considerable carbon emissions, but these images are also much simpler in content than most other images, such as those taken with regular cameras. Most medical imaging devices output relatively simple images: in the case of MRIs, a black-and-white image of the targeted anatomical structures set on a simple black background. Because the background and foreground layers are so distinct, they are relatively trivial to segment, eliminating the need for particular fine-tuning or prompting of SAM and allowing us to focus on the algorithmic compression aspects of the project rather than the particulars of the image segmentation technology.

We designed and coded an algorithm in Python, utilizing a number of well-known libraries: PyTorch for our model implementation, NumPy and Matplotlib for data visualization, and PIL (the Python Imaging Library) for certain image display and conversion functions. In addition to these Python packages, we wrote adaptable commands for ImageMagick, a command-line utility for image manipulation and conversion.

This algorithm consisted of the following steps:

  1. Set up SAM (download and install the open-weights model)
  2. Iterate through a given directory of images, or a single image, doing the following:
    1. Convert the files to a common format (usually lossless JPEG or PNG; can be customized to use case, but doesn’t matter much in final results)
    2. Use SAM to create image masks separating foreground from background
    3. Select the mask that correctly segments the foreground from background
    4. Split the background off from the main image, and apply extremely lossy JPEG compression to it (quality 1, i.e. maximum compression available to the format)
    5. Keep foreground in a lossless format (PNG)

Images could at this point be stored separately, which could in theory lead to more significant compression results if implemented correctly. However, we chose to instead recombine the images into a lossless format.

  3. Recombine images to obtain an image with resolution equal to the resolution of the input image
  4. Output final image
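The core split-compress-recombine steps above can be sketched as follows. SAM's mask is stubbed out with a synthetic foreground mask (running SAM itself requires downloading its model weights), and all names here are illustrative rather than the project's actual code:

```python
import io
import numpy as np
from PIL import Image

def segment_compress(img: Image.Image, mask: np.ndarray, jpeg_quality: int = 1):
    """Apply extremely lossy JPEG compression to the background, keep the
    foreground exactly (as it would be in a lossless PNG), then recombine
    at the original resolution. `mask` is True on foreground pixels."""
    arr = np.asarray(img.convert("RGB"))

    # Background layer: zero out the foreground, round-trip through
    # JPEG at quality 1 (maximum compression the format allows).
    background = arr.copy()
    background[mask] = 0
    buf = io.BytesIO()
    Image.fromarray(background).save(buf, format="JPEG", quality=jpeg_quality)
    lossy_bg = np.asarray(Image.open(buf).convert("RGB"))

    # Recombine: foreground pixels come from the untouched original.
    recombined = np.where(mask[..., None], arr, lossy_bg)
    return Image.fromarray(recombined.astype(np.uint8))

# Synthetic "MRI": black background with a bright square as anatomy.
arr = np.zeros((128, 128, 3), dtype=np.uint8)
arr[32:96, 32:96] = 200
mask = arr[..., 0] > 0  # stand-in for the SAM foreground mask
out = segment_compress(Image.fromarray(arr), mask)
print(out.size)  # (128, 128): resolution matches the input
```

Saving `out` as PNG then yields the final lossless file whose background is already near-uniform, which is what makes the subsequent lossless encoding so compact.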

The algorithm can also optionally output intermediate images, such as the background and foreground layers. Its source code is available as a Jupyter Notebook, downloadable from the Google Colab at https://colab.research.google.com/drive/1zfXycr4ULxocHKvjPtvnHDsGi3MUa-_w.

This algorithm is comfortable to use for those familiar with its basic structure and with the command-line interface. However, we expect that many doctors and other medical professionals are neither, and would find it cumbersome or even impossible to use in its basic code form. Using Streamlit, a Python web-app framework, we therefore created a web app that makes the algorithm simple and visual to use. Our web app prompts the user to upload a zipped dataset of images to be processed by our algorithm. The user is then shown a number-range slider that lets them select and compress an image of their choice, with the corresponding original image and altered image displayed side by side according to the slider's index. This accessibility makes it easy to navigate a whole dataset of images and apply our algorithm to it.

A visualization of compression results on one test dataset, a set of layers of a skull MRI.

Image demonstrating our web app, with use of the number slider and algorithm on an image of a brain MRI.


Our proof-of-concept generalized algorithm, which utilizes image segmentation technology for compression, yielded positive results. The algorithm achieved greater than 50% compression on MRI tests and greater than 10% compression on other images. Comparing original MRI images with images processed by our algorithm shows visually lossless results; in the general use case, results are not visually lossless (because non-medical images are much more complex and the background is not near-uniform to begin with). Though the inner workings of compression algorithms can be complex and unclear, we believe these results are possible because the applied JPEG compression makes the background layers extremely easy to store compared to the original image, pushing them toward almost uniform pixel values; because common compression algorithms use similar techniques (such as dictionary encoding with Huffman coding), the space savings are transferable once the lossy compression has been applied and the result converted back into a lossless image.


Based on our positive results, we conclude that SAM's application to data compression and medical imaging holds promise for future studies. Applications of our model in the medical field are vast and auspicious, showing promise in data transmission, storage, and various other areas. Additionally, our model is applicable essentially anywhere regular lossy compression is, and in some cases where lossless compression is. With more precise fine-tuning, or different image segmentation technology, we believe that savings much greater than 10% could be achieved in the general use case, though without visually lossless results. Furthermore, our model can be expanded by means of a classification system: once segmentation is complete, the model could be taught to label anatomical structures, identifying the respective parts of a medical image. Although this would require extensive research, it is a viable and invaluable asset to the medical field. Ultimately, our compression model has proven successful at reducing storage size while maintaining the original quality and data, making our proof of concept a practical compression technique that cuts costs and sets up imaging practices to be more affordable and cost effective.

Future Directions

One promising avenue for our future research involves an expansion of the compression framework to encompass general images. Although we anticipate that the compression ratios might not be as favorable and the model’s performance may exhibit some degree of variability, this direction could yield broader applications. Diverse domains could benefit from the extended versatility of such an approach.

Furthermore, an intriguing trajectory to explore centers around adapting our work for efficient data transmission. Specifically, leveraging our compressed data transmission methodology could substantially enhance bandwidth efficiency. This, in turn, holds the potential to significantly reduce the time required for transferring medical images between endpoints. While our existing work concentrated on the distinction between background and foreground, an exciting possibility lies in delving into more intricate segmentation techniques. By achieving finer-grained segmentation, we could potentially unlock even greater gains in compression efficiency.

Beyond the realm of storage optimization, another compelling avenue emerges: the application of compression techniques for purposes beyond their conventional scope. For instance, an intriguing prospect lies in harnessing compression as a tool for quantifying discrepancies within medical imaging. This direction holds the promise of not only enhancing our understanding of these discrepancies but also advancing their diagnostic potential for future studies, potentially improving access to medical treatments.

Furthermore, making medical imaging data easier to transfer and store could bring it to communities where access to medical studies is limited by a lack of information and of the specialized tools needed to analyze imaging data. Easier transfer and storage of these images increases the possibility of bringing this knowledge to places where medical information is not easily accessible, potentially changing the perspective on different medical treatments used in those places and increasing their effectiveness.

In summary, our future directions span a spectrum of possibilities: broadening the compression framework to accommodate general images, enhancing data transmission efficiency, helping other communities with their medical knowledge and access, pursuing intricate segmentation methodologies, and repurposing compression for novel analytical insights. Each avenue holds the potential to contribute significantly to both image compression and medical imaging as a whole. As we embark on these exciting trajectories, we look forward to pushing the boundaries of what is achievable and making meaningful strides in these dynamic areas of research.


  1. https://www.ncbi.nlm.nih.gov/books/NBK207141/
  2. https://pubmed.ncbi.nlm.nih.gov/8057942/
  3. https://segment-anything.com/
  4. https://www.mbc.ca.gov/FAQs/?cat=Consumer&topic=Complaint:%20Medical%20Records#:~:text=HSC%20section%20123145%20indicates%20that,following%20discharge%20of%20the%20patient.
  5. https://www.massdevice.com/medical-data-storage-adding-cost-digitizing-health-records/

SaiFETY: An Integration of Audio Protection and Ethical Data Collection Comparisons Within Txt2Vid

Blog, Journal for High Schoolers, Journal for High Schoolers 2023

By: Lucas Caldentey, Avrick Altmann, Yan Li Xiao, Fenet Geleta

Mentors: Arjun Barrett, Laura Gomezjurado, Pulkit Tandon


As a result of globalization and massive technological advancements, multimedia communication has come to run largely over internet traffic. This reliance on digital connections increased further during the recent Covid-19 pandemic. From the daily dependence on news channels and social media live streaming to peer-to-peer online meetings, the world's primary form of transmitting information is now digital. With the decline of human-to-human interaction, it is critical to have not only a stable and reliable medium for conversation but also an effective way to ensure the safety and ethicality of all users. We introduce an extended version of Txt2Vid with clearer, more developed stances toward user safety. Specifically, we focused on comparing the data collection methods of Txt2Vid and other video communication platforms. Additionally, we developed an audio key authentication system using text-dependent voice verification (Novikova) that prevents users from falsely using the voices and information of others. With these implementations, we hope to smooth the public's transition, and increase its comfort, as AI becomes more and more prevalent in our lives, and to show people all over the world that deepfakes can be used safely and positively to bring us closer together.


In an era marked by the rapid expansion of video streaming and visual media sharing, video compression technologies have become indispensable for ensuring efficient data transmission and bridging the digital divide on a global scale. Among these transformative technologies stands Txt2Vid, a cutting-edge video compression pipeline that goes beyond conventional solutions by leveraging deepfake and artificial intelligence (AI) technology. Txt2Vid opens new possibilities for video communication, promising innovative applications and immersive experiences. However, as AI and deepfake technologies continue to advance, the need to bolster user security and prioritize ethical data handling becomes increasingly pressing for platforms like Txt2Vid. By seamlessly integrating AI-generated content into videos, users can now personalize and enhance their visual narratives. This capability offers an exciting frontier for creative expression and interactive storytelling. However, for such technologies to gain widespread acceptance and adoption, addressing the critical issues of user security and data privacy is essential.

In the quest for secure and reliable video communication, this research paper sets forth a two-fold objective. Firstly, we propose the introduction of a text-dependent voice verification system designed to establish a robust metric for user authentication. This novel system aims to mitigate the risks of unauthorized access and protect user data with a new level of assurance. By confirming the identity of users through voice verification, we enhance the platform’s security measures and foster an environment of trust. Additionally, our research critically analyzes the data collection methods employed by Txt2Vid and other video-calling platforms to identify potential deficiencies and vulnerabilities. The safeguarding of user privacy is not only a legal and ethical imperative but also vital for building user trust. We endeavor to propose ethical measures that ensure data is handled responsibly, transparently, and with due regard for user consent. Through a rigorous examination of data practices, we seek to raise awareness of potential privacy concerns and provide concrete recommendations for improvements.

The significance of this research lies in its potential to elevate Txt2Vid and similar platforms into secure and trustworthy video communication solutions. By enhancing user confidence in the platform’s security and privacy protections, we anticipate a ripple effect, leading to greater trust among users and making Txt2Vid an attractive option for diverse video communication needs.

To validate our hypothesis, we conducted surveys with participants, representing diverse demographics and usage scenarios, to gather valuable insights into user perceptions of security and trust in Txt2Vid. Incorporating their feedback into our analysis, we aim to uncover valuable insights that will help shape the platform’s security enhancements and further improve its user experience.

Related Research

Text-dependent verification and text-independent verification are two different approaches to voice authentication. In text-dependent verification, the user is required to speak a specific, predetermined phrase during the verification process. This predetermined text serves as a reference for comparison, ensuring that the user’s voice is authenticated against a known and unique reference and making it difficult for unauthorized individuals to impersonate the user. Text-independent verification, on the other hand, does not require any specific phrase; it authenticates the voice based on any spoken content, without a reference text. We chose to incorporate text-dependent verification in the Txt2Vid platform because of its advantages. Text-dependent audio biometrics significantly reduces the risk of unauthorized access or voice manipulation, making it an ideal choice for protecting against deepfakes and voice-based attacks. Moreover, for a new platform with limited data, implementing text-dependent verification is more straightforward and practical, allowing us to establish a strong security foundation from the outset. As Txt2Vid evolves and accumulates more data, we may explore combining text-dependent and text-independent methods for even greater security. By doing so, Txt2Vid can generate voices that closely resemble the intended users while safeguarding against emerging threats, enhancing both the user experience and content integrity.
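The distinction above can be made concrete as two acceptance rules. The following is a minimal sketch with hypothetical function names (not code from the Txt2Vid platform): text-independent verification accepts any utterance whose voice matches the enrolled speaker, while text-dependent verification additionally requires that the transcribed speech match a specific challenge phrase.

```python
def text_independent_accept(voice_match: bool) -> bool:
    # Text-independent: any utterance is accepted as long as the
    # speaker's voice matches the enrolled voiceprint.
    return voice_match


def text_dependent_accept(voice_match: bool, transcript: str, challenge: str) -> bool:
    # Text-dependent: the voice must match AND the user must have spoken
    # the specific challenge phrase, so a replayed or cloned clip of the
    # user saying something else is rejected.
    return voice_match and transcript.strip().lower() == challenge.strip().lower()


# A stolen recording of the user saying arbitrary speech passes the
# text-independent check but fails the text-dependent one:
print(text_independent_accept(True))                                       # True
print(text_dependent_accept(True, "hello there", "the quick brown fox"))   # False
```

This is why text-dependent verification is the stronger defense against deepfaked audio: an attacker needs not just a matching voice but a matching voice saying the right phrase on demand.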


The research design encompasses a combination of exploratory and descriptive methodologies. Exploratory research is undertaken to delve into the potential benefits of implementing text-dependent voice verification, while descriptive research aims to quantify user perceptions through survey results. Furthermore, the study introduces an experimental component to thoroughly evaluate the performance of the Txt2Vid system in accurately identifying different voices. This comprehensive analysis of the text-dependent voice verification system aims to determine its overall effectiveness and user acceptance.

The target population comprises primarily individuals in the age range of 10-19 who are regular users of various platforms. Participants were recruited through an email list and directed to a dedicated website to engage in both the survey and voice verification testing. Prior to participation, individuals received email invitations with a request for informed consent. The collected qualitative data measured user perceptions and preferences regarding the voice verification system, while the voice verification testing generated audio inputs to assess the system’s accuracy and safety in verifying different voices. During testing, participants were prompted to recite diverse lines to gauge the system’s performance.

The study’s variables encompass user perceptions, preferences, and voice verification accuracy, which were measured through qualitative analysis of survey responses and comparison of voice inputs with the system’s verification results. Thematic analysis was utilized to understand user perceptions derived from survey responses, while voice verification data was analyzed to evaluate the system’s accuracy, considering the qualitative nature of the data.

It is important to recognize several limitations of this research. The age range of 10 to 19 might not fully represent all potential users of video-calling platforms, limiting the generalizability of the findings to other age groups. Moreover, relying on self-reported survey responses may introduce social desirability bias, influencing participants to provide answers they perceive as more favorable. Additionally, voice verification testing conditions may not fully simulate real-world scenarios, potentially affecting the system’s accuracy. Despite these limitations, the research endeavors to offer valuable insights into user perceptions and the feasibility of the text-dependent voice verification system for video-calling platforms. This outcome will contribute to future improvements and enhance security measures in Txt2Vid and AI-based media platforms.


Throughout this research project, we tested various methods and models for voice authentication. The voice biometric model we chose was built on a pre-trained Convolutional Neural Network (CNN), which allows the network’s hidden neurons to share the same weights and bias values within each layer. With each layer focusing on specific features (such as pitch, speed, or intonation), a CNN can increase its complexity and more accurately distinguish individual voices. We found this property important for voice biometrics, since more complex layering could accommodate more natural speech from users during testing. Our research focused on two facets of voice biometrics: an internal examination of the voice biometric’s feasibility, and an external analysis of how people receive this security measure. First, we studied the reliability of voice biometrics from a day-to-day standpoint. Sickness, sore throats, and even the time of day can all change a person’s voice suddenly, and these factors are not studied enough when developing audio networks. To ensure the voice biometric was capable of recognizing the user’s voice even when sick or sore, each group member recorded a set of 25 audio clips, creating a 100-clip database, alongside an additional 25 clips of a notably sick and sore voice. Each biometric model was trained on 5 clips from each group member’s voice and tested for accuracy on the remaining 20 clips. The CNN accurately verified 90% of the verification data generated by group members, and verified 100% of the 20 test clips for two of the group members.
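The enroll-then-verify step described above can be sketched as follows. Here a placeholder stands in for the CNN (whose architecture is not reproduced): the network maps each clip to an embedding vector, the five enrollment embeddings are averaged into a reference voiceprint, and a test clip is accepted when its cosine similarity to the reference clears a threshold. The averaging scheme and the 0.8 threshold are illustrative assumptions, not the exact values from our trained model.

```python
import numpy as np


def enroll(embeddings: list) -> np.ndarray:
    """Average the embeddings of the enrollment clips (5 per user in our
    setup) into a single reference voiceprint."""
    return np.mean(np.stack(embeddings), axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(reference: np.ndarray, test_embedding: np.ndarray,
           threshold: float = 0.8) -> bool:
    """Accept the speaker if the test clip's embedding is close enough
    to the enrolled voiceprint."""
    return cosine_similarity(reference, test_embedding) >= threshold


# Toy example with 3-dimensional "embeddings" in place of CNN outputs:
ref = enroll([np.array([1.0, 0.0, 0.1]), np.array([0.9, 0.1, 0.0])])
print(verify(ref, np.array([0.95, 0.05, 0.05])))  # same speaker: True
print(verify(ref, np.array([0.0, 1.0, 0.0])))     # different speaker: False
```

Because the decision reduces to a distance in embedding space, day-to-day variation (a sore throat, the time of day) only causes a rejection when it pushes the clip's embedding past the threshold, which is what our sick-voice clips were recorded to measure.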

To further test our voice biometric and spread awareness of the safe use of deepfakes and AI, we created a Flask web app, since a website is relatively easy for consumers to use. A link to our site can be found in our bibliography below. To create an account, users are prompted for a username and password. The password is hashed using a SHA algorithm, and both are stored in an SQL database. The user is also prompted to make five voice recordings reading the following sentences:

  1. “I’m extremely excited for the SHTEM program this year.”
  2. “The quick brown fox jumped over the lazy dog.”
  3. “The hungry purple dinosaur ate the kind, zingy fox, the jabbering crab, and the mad whale.”
  4. “With tenure, Suzie’d have all the more leisure for yachting, but her publications are no good.”
  5. “The beige hue on the waters of the loch impressed all, including the French queen.”

These sentences were chosen either for their significance to the program or for their high phonetic coverage, meaning they encompass many different sounds in the English language. Using Flask’s Python backend, these files are passed into the AI, and their weights are stored in a “.npy” file generated by NumPy. On login, users are prompted to record a random sentence, which the AI verifies and matches against their generated .npy file. The program also utilizes OpenAI’s Whisper speech-to-text (STT) model to confirm that users actually read the correct sentence. This prevents imitation using an arbitrary audio clip of the user’s voice, since the speaker must read the specific sentence. If the voice authentication succeeds and the password hash matches, the user is logged in.
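The account and login flow above can be sketched roughly as follows. The helper names, the use of SHA-256 specifically, and the in-memory dictionary are illustrative assumptions; the real site uses Flask routes, an SQL database, Whisper for transcription, and the CNN for voice matching, none of which are reproduced here.

```python
import hashlib
import secrets

users = {}  # stands in for the SQL database

SENTENCES = [
    "the quick brown fox jumped over the lazy dog",
    "the beige hue on the waters of the loch impressed all including the french queen",
]


def register(username: str, password: str, voiceprint) -> None:
    # Store only a hash of the password, never the password itself.
    # (The real app also saves the user's voiceprint as a .npy file.)
    pw_hash = hashlib.sha256(password.encode()).hexdigest()
    users[username] = {"pw_hash": pw_hash, "voiceprint": voiceprint}


def login(username: str, password: str, transcript: str,
          challenge: str, voice_matches: bool) -> bool:
    """All three checks must pass: password hash, spoken sentence
    (as transcribed by an STT model), and voice biometric match."""
    user = users.get(username)
    if user is None:
        return False
    if hashlib.sha256(password.encode()).hexdigest() != user["pw_hash"]:
        return False
    if transcript.strip().lower() != challenge:
        return False
    return voice_matches


# Example: enroll a user, then present a randomly chosen challenge sentence.
register("alice", "hunter2", voiceprint="alice.npy")
challenge = secrets.choice(SENTENCES)
```

In a production system, a salted and deliberately slow password hash such as bcrypt or Argon2 would be preferable to a plain SHA digest; SHA is used here to mirror the approach described above.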


6.1 The Demographic of Our Research Participants

For our research, we primarily focused on the feedback of participants in the age range of 10-19 to reflect the young population of video-calling platform users. There was a relatively equal number of male and female participants; people who identify as Asian made up the majority of our research group; and the amount of time participants spend on video-calling platforms per week ranged from under 1 hour to 6-10 hours. With half of participants spending 1-5 hours per week on video-calling platforms, our research participants are regular users of various platforms, giving us an understanding of how Txt2Vid compares to platforms like Zoom and Google Meet.

Figure 1: The gender identities of our research participants.
Figure 2: The race of our research participants.
Figure 3: The amount of time participants spend on video-calling platforms per week.

6.2 Performance Analysis

To evaluate the performance of our voice-verification system, we gathered participants’ feedback on how likely they would be to use Txt2Vid with the voice-verification system versus without it. Additionally, users were asked to rate the convenience of the voice-verification system, report whether it successfully verified their voice, and suggest how we could improve the system. Participants reported a 69.2% success rate; one reported that the system worked on her second attempt, and one failed on all four attempts. Although our current success rate is not viable for industry use, our results demonstrate a proof of concept for future voice biometrics.

Figure 4: Participants reported the accuracy of our voice-verification system.
Figure 5: The success rate of our initial training data.
Figure 6: Participants reported the convenience level of the voice-verification system with 1 being inconvenient and 5 being convenient.
Figure 7: Likelihood of participants using Txt2Vid without the addition of the voice-verification system with 1 being unlikely and 5 being likely.
Figure 8: Likelihood of participants using Txt2Vid with the addition of the voice-verification system

Out of the 13 people that we surveyed, 53.85% reported no change in the likelihood of using the Txt2Vid platform, 30.77% reported an increase (an average shift of 1.75 toward the “Likely” end of the scale), and 15.38% reported a decrease. As the comparison of Figure 7 with Figure 8 shows, fewer participants felt neutral about Txt2Vid after the addition of the voice-verification system, and the distribution shifted toward higher ratings, with a 15.4% increase in the number of participants rating their likelihood of using Txt2Vid as 5 (highly likely). This suggests that adding the voice-verification system to Txt2Vid decreased reluctance to use the platform widely while increasing user privacy protection. However, with 53.85% of participants reporting no change in their likelihood of using the platform and 15.38% reporting a decrease, more must be done to grow Txt2Vid’s user base and to verify users’ voices effectively.

Future Research

When evaluating the feedback from our research participants, 46.15% suggested adding a voice recorder to the website, 7.7% suggested raising awareness of the ethical concerns surrounding Txt2Vid, 23.1% suggested improving the interface, and 15.38% reported the system as inaccurate. In the future, we therefore hope to improve the user experience by adding a voice recording tool to the website, improving the user interface, increasing the accuracy of the voice-verification system, and integrating the system into the Txt2Vid platform. The addition of a voice recorder will significantly simplify voice enrollment and encourage users to opt in to voice biometric data collection. For this project, we focused on developing a highly accurate voice-verification system outside of the Txt2Vid platform rather than integrating it into the system. We aim to further improve the accuracy of our voice-verification system by increasing the CNN’s training and the quantity of data collected, and to complete the integration of this new security measure into the Txt2Vid platform.


In closing, this research underscores the critical importance of secure and ethical video communication platforms in our rapidly evolving digital age. The surge in visual content consumption, coupled with the transformative impact of the COVID-19 pandemic, highlights the need for dependable and ethically guided user experiences. By introducing an extended version of Txt2Vid enriched with text-dependent voice verification, we have not only established a strong user authentication mechanism but have also addressed concerns surrounding unauthorized access and potential voice impersonation. Furthermore, our in-depth scrutiny of data collection practices brings potential vulnerabilities to light, emphasizing the importance of responsible data handling and user consent. By embracing these security enhancements, our research seeks to foster user trust and promote the harmonious integration of AI-generated content within video communication. In a broader context, this endeavor not only augments the capabilities of Txt2Vid but also paves the way for a more secure and cohesive digital future, where AI-driven technologies can be harnessed responsibly and meaningfully. Looking ahead, we anticipate that these efforts will serve as a stepping stone for continued advances in secure video communication platforms, fostering innovation while keeping user safety a guiding principle.


The authors of this paper would like to thank Arjun Barrett, Laura Gomezjurado, and Pulkit Tandon for their continuous support and guidance throughout our research. We would also like to thank Sylvia Chin and Professor Weissman for providing us with this incredible opportunity and for overseeing the SHTEM program.


  1. Caldentey, L., Altmann, A., Xiao, Y. L., & Geleta, F. (2023, July 30). SaiFETY SHTEM project. https://drago314.pythonanywhere.com
    • This is the site created and tested for our research. Please feel free to look through it and test the voice biometric yourself.
  2. Eric-Urban. (n.d.). Speaker recognition quickstart – Speech service – Azure AI services. Microsoft Learn. https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-speaker-recognition?tabs=script&pivots=programming-language-csharp
  3. Molla, R. (2017, June 8). An explosion of online video could triple bandwidth consumption again in the next five years. Vox. https://www.vox.com/2017/6/8/15757594/future-internet-traffic-watch-live-video-facebook-google-netflix
  4. Novikova, E. (2023, January 8). What is text-dependent and text-independent voice biometrics. Neuro.net blog. https://neuro.net/en/blog/what-is-text-dependent-and-text-independent-biometrics
  5. Wallach, O. (2021, September 20). The world’s most used apps, by downstream traffic. Visual Capitalist. https://www.visualcapitalist.com/the-worlds-most-used-apps-by-downstream-traffic/
  6. Tandon, P., Chandak, S., Pataranutaporn, P., Liu, Y., Mapuranga, A. M., Maes, P., Weissman, T., & Sra, M. (2022, April 3). Txt2Vid: Ultra-low bitrate compression of talking-head videos via text. arXiv. https://arxiv.org/abs/2106.14014
  7. IBM. (n.d.). What are convolutional neural networks? Retrieved August 18, 2023, from https://www.ibm.com/topics/convolutional-neural-networks
  8. OECD. (2020, May 4). Keeping the internet up and running in times of crisis. OECD. https://www.oecd.org/coronavirus/policy-responses/keeping-the-internet-up-and-running-in-times-of-crisis-4017c4c9/