A Survey of Deep Learning Applications and Transfer Learning in Medical Image Classification

Journal for High Schoolers, 2021


Eugenia Druzhinina, Joyce Lu, Napoleon Vuong, Ethan Liang


Over the past few decades, artificial intelligence has become increasingly popular in the medical sector. Deep learning, a subset of artificial intelligence, has played an essential role in the development of computer vision. This paper specifically considers convolutional neural networks and transfer learning for image classification. Currently, medical imaging modalities such as MRI and X-rays are used to detect and diagnose neurological disorders including Alzheimer’s disease, brain tumors, and other pathologies. A convolutional neural network that can identify key features of an image and differentiate between pathologies could assist clinicians and researchers in medical image interpretation and disease diagnosis. Despite significant improvements in deep learning for medical image classification, limitations in medical image dataset size hinder the development of robust networks. To better understand this issue, we investigated deep learning and its applications in medical imaging through a review of published literature. We then identified transfer learning as a possible way to counter dataset limitations and explored it by testing various convolutional neural network models. We found that lowering the learning rate and increasing the epoch count in our models increased performance stability and accuracy.


Artificial intelligence (AI) seeks to imitate human intelligence and behavior through machines [1]. It is broadly applied to many sectors, including the medical field. By utilizing AI in medicine, we can potentially automate tedious and repetitive processes, lower overall workload, and increase efficiency. Machine learning, a subset of AI, is the process of acquiring knowledge through data and learning from the data to improve systems and algorithms. Deep learning is part of machine learning and features deep (multi-layered) networks. A standard deep learning model (also known as a neural network) consists of an input layer and an output layer, with hidden layers in between. Data input to each hidden layer is transformed for input into the next layer.
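The layer-by-layer transformation described above can be sketched in a few lines of plain Python. This is only an illustration: the layer sizes, the weights, and the choice of ReLU as the activation function are assumptions made for the example, not values taken from any network discussed in this paper.

```python
def relu(values):
    """Rectified linear unit: negative values become zero."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """One fully connected layer: a weighted sum per neuron, plus a bias."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input through each hidden layer in turn, then the output layer."""
    activations = inputs
    for weights, biases in layers[:-1]:
        # The output of one hidden layer becomes the input of the next.
        activations = relu(dense(activations, weights, biases))
    out_weights, out_biases = layers[-1]
    return dense(activations, out_weights, out_biases)


# Hypothetical toy network: 2 inputs -> 2 hidden neurons -> 1 output.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 2.0]], [0.1]),                    # output layer
]
prediction = forward([2.0, 1.0], toy_layers)
```

Training a network amounts to adjusting the numbers in `toy_layers` so that `prediction` moves closer to the desired label.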

Convolutional Neural Networks

Neural networks that contain at least one convolutional layer are called convolutional neural networks. Other types of hidden layers include activation layers and pooling layers. A convolutional layer extracts features from small sections of the training data [2]. Successive convolutional layers identify increasingly specialized features. In image recognition, the identified features include lines, edges, and more. In a pooling layer, the matrix size is reduced through operations such as max pooling, which keeps only the maximum value in each region of a feature map, or average pooling, which keeps the average. As the matrices decrease in size, the number of parameters is reduced, which increases computing speed and decreases the chance of overfitting. Pooling layers are optional. In the table below, a brief overview of key and relevant convolutional neural network architectures is provided.





LeNet

  • First successful convolutional neural network [3]
  • Uses gradient-based learning for document recognition [4]

AlexNet

  • First convolutional neural network for image recognition and classification
  • Uses parameter optimization strategies, dropout, and ReLU [3]

ZFNet

  • Improved version of AlexNet
  • Uses parameter optimization and feature visualization [3]

GoogLeNet

  • Reduces the number of parameters from 136 million to 4 million [4]
  • Uses inception block and bottleneck layer [4]

Inception V3

  • Version 3 of GoogLeNet
  • Shrinks filter size [4]

Xception

  • Part of the Inception family
  • More efficient use of Inception V3 parameters [5]
  • Uses depth-wise separable convolutions [5]

ResNet

  • Emphasized depth for image recognition [4]
  • 3.6% classification error rate [6]
  • Uses residual blocks [3]

ResNet V2

  • Version 2 of ResNet
  • Uses skip connections [7]

DenseNet

  • Solves the vanishing gradient problem through cross-layer connectivity [3]

VGG19 [8]

  • 16 convolutional layers and 3 fully-connected layers [9]
  • Smaller (1×1) filters for lowered computational complexity [3]

Table 1 Overview of Select Convolutional Neural Network Architectures
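To make the convolution and max pooling operations described in this section concrete, the following sketch applies both to a tiny image using plain Python lists. The 5×5 image and the 2×2 edge-detecting kernel are hypothetical values chosen for illustration. In a real network the kernel values are learned during training, and libraries such as Keras perform these operations far more efficiently.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    This is the cross-correlation that deep learning libraries call
    "convolution": each output value is the sum of an image patch
    multiplied element-wise by the kernel.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]


def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[r * size + i][c * size + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 5x5 image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical kernel that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, vertical_edge)  # 4x4 map, peaks at the edge
pooled = max_pool2d(feature_map)                # 2x2 map after pooling
```

The feature map is nonzero only in the column where the dark-to-bright edge sits, and pooling shrinks it from 16 values to 4 while keeping that response.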

Neural Network Architectures for Medical Imaging

Convolutional neural networks preserve spatial structure and are therefore commonly used in deep learning for medical imaging. Other networks employed in medical imaging include stacked autoencoders and deep belief networks. A stacked autoencoder consists of an input layer and an output layer, with hidden layers in between that encode and decode data [2]. The encoding layers use convolutional layers to compress data into a lower-dimensional representation, and the decoding layers reconstruct the original data from that representation as closely as possible [10]. Because of these compression and reconstruction functions, stacked autoencoders are well suited to improving accuracy in the classification of raw data [10]. Deep belief networks are built from multiple restricted Boltzmann machine layers. A restricted Boltzmann machine consists of a visible (input) layer and a hidden (output) layer [11]. Unlike a feedforward network, in which connections run in one direction and form no cycles, a restricted Boltzmann machine joins its visible and hidden layers with undirected connections, with no connections between neurons in the same layer [2]. Restricted Boltzmann machines can reduce data dimensionality and initialize weights for training [2].

Applications in Medical Imaging

Deep learning is used in the processing and analysis of medical images produced from modalities such as magnetic resonance imaging (MRI), computerized tomography (CT), and positron emission tomography (PET). Deep learning is employed in image detection, registration, segmentation, and classification [2], as feature analysis is of interest to those applications. Image detection consists of detecting lesions in tissues of interest [2]. Image registration is a part of image preprocessing and aids in clinical diagnosis by superimposing two or more images to provide a more complete and cohesive picture for diagnosis [2]. Image segmentation is the process of categorizing parts of an image into different regions based on their features (such as bone vs. tissue or gray matter vs. white matter). Image classification is essential to automated disease diagnosis and consists of learning features that are related to diseases and classifying images accordingly. In this report, we focus on image classification. An overview of three examples in image classification and a description of their architectures are given below. These include a deep Boltzmann machine for Alzheimer’s disease classification, a convolutional neural network for Alzheimer’s disease classification, and a deep belief network for schizophrenia classification.

In 2013, Suk et al. [12] trained a multi-modal deep Boltzmann machine using images from MRI and PET scans for the classification of Alzheimer’s disease. Using a latent feature representation and a stacked autoencoder, shared low-level features were found and combined with other non-latent features [2]. This method achieved an accuracy of 98.8% in the classification of Alzheimer’s disease and healthy controls.

Sarraf and Tofighi [13] classified Alzheimer’s disease using convolutional neural networks and fMRI data. LeNet-5, a convolutional neural network, was used due to its advantages in both feature extraction and classification. Its convolutional layers perform high-quality feature extraction and discrimination, and the complex architecture enables classification. A 96.86% Alzheimer’s versus healthy control classification accuracy was achieved using LeNet-5, a major improvement over the support vector machine’s classification accuracy of 84%.

Deep learning was also used to extract MRI features and to classify schizophrenia. Pinaya et al. [14] built a multilayer network that combined a pre-trained deep belief network, which found high-level latent features indicative of schizophrenia in the MR images, with a softmax layer used to fine-tune the network and classify the images. This deeper network was able to capture more complex information, which resulted in better classification performance. It achieved an accuracy of 73.6%, which is significantly higher than the support vector machine’s accuracy of 68.1% for the same classification problem.

Challenges and Solutions in Medical Imaging

While significant advances in deep learning for medical imaging applications have been made, limitations in acquiring sufficiently large and comprehensive datasets present a major challenge. The size of a dataset directly influences the quality of the network that it trains. Although a sizable amount of medical imaging data is generated each year, access to the data is limited due to patient privacy concerns and regulations (such as HIPAA) [17].

Additionally, most deep learning networks employ supervised learning. For medical imaging datasets, a specialized professional (such as a radiologist) is needed to annotate each image by hand so that the deep learning network can learn the true label. Given that datasets must be very large to properly train neural networks, acquiring and annotating the images is lengthy and costly.

It has also been noted that in currently available medical imaging datasets, pathological data is rare [17]. This class imbalance, in which there is a significantly larger amount of imaging from healthy controls than from pathological subjects, leads to difficulty in choosing an appropriate neural network, which ultimately results in poorer performance [2].

Three solutions have been proposed to address this dataset limitation issue. First, undersampling can be used to rebalance the pathological versus normal control distribution by deleting or merging images [15]. Second, oversampling, which is the process of generating new images from existing data, can be used to address both class imbalance and small dataset sizes [2]. Using two publicly available datasets, researchers at MGH & BWH Center for Clinical Data Science, NVIDIA, and Mayo Clinic developed a machine learning network to generate synthetic MR images with brain tumors [16]. Beyond providing additional sources of pathological data that can improve network accuracy, synthetic generation of images in oversampling can be used as an anonymization tool, addressing patient privacy concerns in datasets. Finally, transfer learning can be used to train a network despite insufficient data [17]. In transfer learning, a neural network is first trained using a large dataset such as CIFAR-10 or ImageNet. The top layers of the network are then re-trained and fine-tuned on the smaller dataset of interest. Given that medical imaging datasets tend to be small, transfer learning is one of the most popular and effective methods of training neural networks in medical applications. In the next section, we demonstrate the effectiveness of transfer learning through training multiple networks with two datasets of 75 and 251 images, respectively.
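The rebalancing idea behind undersampling and oversampling can be sketched as follows. This is a simplified illustration, not the method of any cited work: undersampling is done here by random deletion, and oversampling by repeating existing minority-class items, which is a much simpler stand-in for the synthetic image generation in [16]. The file names are hypothetical, and the functions assume the majority class is at least as large as the minority class.

```python
import random


def random_undersample(majority, minority, seed=0):
    """Drop randomly chosen majority-class items until the classes match in size."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), list(minority)


def random_oversample(majority, minority, seed=0):
    """Repeat randomly chosen minority-class items until the classes match in size."""
    rng = random.Random(seed)
    shortfall = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return list(majority), list(minority) + extra


# Hypothetical imbalanced dataset: 8 healthy-control scans, 2 pathological scans.
healthy = [f"healthy_{i}.png" for i in range(8)]
tumor = [f"tumor_{i}.png" for i in range(2)]

under_healthy, under_tumor = random_undersample(healthy, tumor)
over_healthy, over_tumor = random_oversample(healthy, tumor)
```

Undersampling leaves 2 images per class and discards data, while oversampling leaves 8 per class but adds no new information when items are merely repeated, which is why generating genuinely new images is attractive.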

Methods and Materials


Transfer Learning

Transfer learning in artificial neural networks takes knowledge acquired from training on one particular domain and applies it to learning a separate task [18].

Epoch

The number of passes through an entire training dataset [19].

Learning Rate

A hyperparameter used in the training of neural networks that has a small positive value, often between 0.0 and 1.0. It controls how quickly or slowly a neural network model learns a problem [20].

Batch Size

The number of training examples utilized in one iteration [21].

Validation Accuracy

Accuracy of the model on unseen data after the model has been trained with the training data.

Validation Loss

Loss of the model on unseen data after the model has been trained with the training data.

Training Accuracy

Accuracy of the model on the data that it was trained with.

Training Loss

Loss of the model on the data that it was trained with.

Overfitting

A problem in machine learning in which noise and meaningless data are taken into account in prediction or classification, introducing errors in real-world situations. Overfitting tends to happen when training datasets are too small or include parameters and/or unrelated features that happen to correlate with a feature of interest [22].

Table 2 List of Relevant Keywords and Descriptions
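The relationship between epochs, batch size, and learning rate can be made concrete with a little arithmetic. The sketch below uses the 65 training images of the 75-image dataset described later in this section. The batch size of 5 is an assumed value for illustration, and the update rule shown is plain gradient descent, which may differ from the optimizer used in any particular template.

```python
import math


def updates_per_run(num_training_examples, batch_size, epochs):
    """Return (iterations per epoch, total weight updates) for a training run."""
    iterations_per_epoch = math.ceil(num_training_examples / batch_size)
    return iterations_per_epoch, iterations_per_epoch * epochs


def gradient_step(weight, gradient, learning_rate):
    """One plain gradient descent update: move against the gradient."""
    return weight - learning_rate * gradient


# 65 training images; batch size of 5 is an assumption for this example.
short_run = updates_per_run(65, batch_size=5, epochs=20)  # (13, 260)
long_run = updates_per_run(65, batch_size=5, epochs=40)   # (13, 520)

# The same gradient moves a weight ten times less at the smaller learning rate.
big_move = gradient_step(1.0, gradient=2.0, learning_rate=1e-4)
small_move = gradient_step(1.0, gradient=2.0, learning_rate=1e-5)
```

Doubling the epoch count doubles the number of weight updates, while a tenfold smaller learning rate makes each update ten times smaller, so the two changes partly offset each other.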

To become familiar with deep learning architectures, we implemented and tested five convolutional neural network models (Inception V3, DenseNet201, ResNet152V2, Xception, and VGG19) using a transfer learning template [23]. The template was designed as “a high-level introduction into practical machine learning for purposes of medical image classification” [23]. We used two small datasets designed for binary classification to compare the accuracy and performance of the models as well as to identify the potential sources of inaccuracy within our results.

We used Google Colab and a collection of Python libraries to implement and evaluate the five models. TensorFlow and Keras were used to build the architectures, while NumPy and Matplotlib were used to visualize the data from our testing. We chose our convolutional neural network models from among the most recent and popular models available through the Keras library.

The template provided the necessary code to classify abdominal and chest X-ray scans from a 75-image dataset (65 training, 10 validation) with preset hyperparameters. The default model used in the template is InceptionV3. After testing the default model, we experimented with four other architectures: DenseNet201, ResNet152V2, Xception, and VGG19. We adjusted the hyperparameters to reduce fluctuations in loss and accuracy over the course of each run (shown in the Results section). Specifically, we decreased the template’s preset learning rate from $1 \cdot 10^{-4}$ to $1 \cdot 10^{-5}$ and increased the number of epochs from 20 to 40.
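A setup of this kind might look roughly like the sketch below, written against the public `tf.keras` API. It is not the template's actual code: the classification head (pooling, a 256-unit dense layer, dropout, a sigmoid output), the Adam optimizer, the 299×299 input size, and the function name `build_transfer_model` are all assumptions made for illustration. Only the learning rates and epoch counts come from the text above.

```python
# Settings reported in the text: the template's presets and the adjusted values.
ORIGINAL_SETTINGS = {"learning_rate": 1e-4, "epochs": 20}
ADJUSTED_SETTINGS = {"learning_rate": 1e-5, "epochs": 40}


def build_transfer_model(architecture="InceptionV3", learning_rate=1e-5,
                         input_shape=(299, 299, 3)):
    """Build a binary classifier on top of a frozen, ImageNet-pretrained base.

    The head layers, optimizer, and input size are illustrative assumptions.
    """
    import tensorflow as tf  # imported here so the settings above load without it

    # Look up the pretrained architecture by name, e.g. "VGG19" or "Xception".
    base_class = getattr(tf.keras.applications, architecture)
    base = base_class(weights="imagenet", include_top=False,
                      input_shape=input_shape)
    base.trainable = False  # keep the pretrained convolutional features fixed

    # New top layers, trained from scratch on the small medical dataset.
    x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs=base.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_transfer_model("VGG19", learning_rate=ADJUSTED_SETTINGS["learning_rate"])` and then `model.fit(...)` with `epochs=ADJUSTED_SETTINGS["epochs"]` would correspond to the adjusted configuration. Each Keras architecture also has its own expected input preprocessing, which is omitted here.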

We ran each of the models with the hyperparameter adjustments listed above to identify the best-performing model, which we defined as the one with the highest average training and validation accuracy and the lowest average training and validation loss. We then trained VGG19, the best-performing model, on a larger and less ideal 251-image dataset [24], in which the images vary in size. This dataset (221 training, 30 validation) contains MRI brain scans of healthy controls and of subjects with tumors, which allowed us to continue working with a binary classification problem. The purpose of training the model on a second dataset was twofold. First, it would allow us to better gauge the performance of the model in an application closer to a real-world scenario. Second, it would help us pinpoint potential sources of inaccuracy, which we did by comparing the VGG19 results to those of ResNet152V2 (a very popular deep learning model and the most consistently performing of the five models). Our code is available for the 75-image dataset [25] and the 251-image dataset [26].
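A comparison based on per-epoch averages could be computed along the following lines. The dictionary layout mirrors the per-epoch metric lists that Keras records during training, but the key names, the tie-breaking rule, and every number below are placeholders chosen for illustration. They are not measurements from our experiments.

```python
from statistics import mean


def average_metrics(history):
    """Average each per-epoch metric list in a history dictionary."""
    return {name: mean(values) for name, values in history.items()}


def rank_models(histories):
    """Sort model names from best to worst.

    Assumed rule: highest mean validation accuracy first, with ties
    broken by lowest mean validation loss.
    """
    def sort_key(name):
        averages = average_metrics(histories[name])
        return (-averages["val_accuracy"], averages["val_loss"])

    return sorted(histories, key=sort_key)


# Placeholder histories for two hypothetical models over three epochs.
placeholder_histories = {
    "model_a": {
        "accuracy": [0.5, 0.75, 1.0],
        "val_accuracy": [0.5, 0.75, 1.0],
        "loss": [1.0, 0.5, 0.25],
        "val_loss": [1.0, 0.5, 0.25],
    },
    "model_b": {
        "accuracy": [0.5, 0.5, 0.75],
        "val_accuracy": [0.5, 0.5, 0.5],
        "loss": [1.0, 1.0, 0.5],
        "val_loss": [1.0, 1.0, 0.75],
    },
}

ranking = rank_models(placeholder_histories)
```

Averaging over all epochs rewards models that learn quickly and stay stable, so a model with a high final accuracy but large swings between epochs can still rank poorly.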


Results

We used a learning rate of $1 \cdot 10^{-4}$ and 20 epochs on the 75-image dataset. The results of the five models using those hyperparameters are shown on the left. An adjusted learning rate of $1 \cdot 10^{-5}$ and 40 epochs were then applied to the five models; the results are shown on the right.

From Figures 1 through 10, we can see that the hyperparameter adjustments were essential in developing models that learn from the data they were trained on, which enabled us to analyze and evaluate their performance. The VGG19 and DenseNet201 models improved significantly, having shown high fluctuation and unpredictability across epochs before the adjustments. After the adjustments, the VGG19 model performed best, with the highest average training and validation accuracy and the lowest average training and validation loss. We then used this model on the larger 251-image dataset to see how well it would perform when the dataset size was scaled up. The figures below show the results of the VGG19 model and the ResNet152V2 model.

Examining Figure 11 and Figure 12, we observe that VGG19 did not perform consistently when the dataset size increased. This would initially suggest that the larger dataset caused the inconsistency, but the contrasting performance of the ResNet152V2 model on the same dataset undermines that argument. Therefore, factors other than dataset size must be affecting our results, hindering the consistency and accuracy of even our previously best-performing model.

Given our experience with hyperparameter adjustments, we believe that non-optimized hyperparameters may be the leading cause of inconsistent performances across our models. This is due to the significant impact that our hyperparameter adjustments had on the original dataset for the VGG19 and DenseNet201 models. We believe that further tuning of the hyperparameters within the template could lead to more consistent results not only for VGG19 but for the rest of the convolutional neural network models as well.


From the transfer learning template results, we conclude that factors other than dataset size cause fluctuation in convolutional neural network performance. These may include the choice of hyperparameter values, among other factors. Additionally, we show that reducing the learning rate (from $1 \cdot 10^{-4}$ to $1 \cdot 10^{-5}$) improves both accuracy and loss across models. This experiment also demonstrates the efficacy of transfer learning on small datasets.

Future Directions

To further improve the performance of the models tested, we plan to optimize hyperparameters such as batch size, learning rate, and epoch count. The choice to focus on hyperparameter optimization comes from our results in Figures 3, 4, 9, and 10, which demonstrate the significant impact that learning rate and epoch count have on producing accurate and consistent results. Testing can be conducted by trying different hyperparameter values and analyzing the results to determine the best combination. These adjustments will likely produce more consistent and accurate validation results and may also decrease the probability of overfitting.
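One simple way to run such trials is an exhaustive grid search, sketched below. The grid values are hypothetical, and `placeholder_evaluate` is an arbitrary toy scoring function standing in for the expensive step of training a model and measuring its validation accuracy. Its output says nothing about which settings are actually best.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in the grid and return (best_settings, best_score)."""
    names = list(grid)
    best_settings, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        settings = dict(zip(names, values))
        score = evaluate(**settings)
        if score > best_score:
            best_settings, best_score = settings, score
    return best_settings, best_score


def placeholder_evaluate(learning_rate, epochs, batch_size):
    """Toy stand-in for 'train a model, return validation accuracy'.

    The formula is arbitrary and exists only so the example runs.
    """
    return -abs(epochs - 30) - abs(batch_size - 8) - 1000 * learning_rate


# Hypothetical search grid: 3 x 3 x 3 = 27 combinations.
search_grid = {
    "learning_rate": [1e-3, 1e-4, 1e-5],
    "epochs": [20, 30, 40],
    "batch_size": [4, 8, 16],
}

best_settings, best_score = grid_search(placeholder_evaluate, search_grid)
```

Because every combination requires a full training run, the cost grows multiplicatively with each added hyperparameter, so in practice the grid is kept small or replaced with random search.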


We would like to extend our deepest gratitude to our mentor, Ethan Liang, for his guidance, support, and time, which were essential to this project. We would also like to thank Professor Tsachy Weissman for providing us with this opportunity by founding the STEM to SHTEM program, Professor Stephen Boyd for his role in the development of this program, and Cindy Nguyen for directing and coordinating this program.


  1. W. Samek, T. Wiegand, and K.-R. Müller, “Explainable Artificial Intelligence: Understanding, Visualizing and Interpreting Deep Learning Models,” arXiv.org, 28-Aug-2017. [Online]. Available: https://arxiv.org/abs/1708.08296. [Accessed: 12-Jul-2021].
  2. J. Liu, Y. Pan, Z. Chen, L. Tang, C. Lu, and J. Wang, “Applications of Deep Learning to MRI Images: A Survey,” IEEE Xplore Full-Text PDF: Mar-2018. [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8268732. [Accessed: 13-Aug-2021].
  3. A. Khan, A. Sohail, U. Zahoora, and A. S. Qureshi, “A survey of the recent architectures of deep convolutional neural networks.” Artificial Intelligence Review, vol. 53, no. 8, pp. 5455-5516, 2020, doi: 10.1007/s10462-020-09825-6.
  4. S. Yeung. Lecture 5 | Convolutional Neural Networks – YouTube. (2017). Accessed: Aug. 06, 2021. [Online Video]. Available: https://www.youtube.com/watch?v=bNb2fEVKeEo.
  5. F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800-1807, doi: 10.1109/CVPR.2017.195.
  6. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Xplore, 10-Dec-2015. [Online]. Available: https://ieeexplore.ieee.org/document/7780459/. [Accessed: 13-Aug-2021].
  7. S.-H. Tsang, “Review: ResNet – Winner of ILSVRC 2015 (Image Classification, Localization, Detection),” Towards Data Science, 15-Sept-201. [Online]. Available: https://towardsdatascience.com/review-resnet-winner-of-ilsvrc-2015-image-classification-localization-detection-e39402bfa5d8. [Accessed: 06-Aug-2021].
  8. D. Garcia-Gasulla, F. Parés, A. Vilalta, J. Moreno, E. Ayguadé, J. Labarta, U. Cortés, and T. Suzumura, “On the behavior of convolutional nets for feature extraction,” Journal of Artificial Intelligence Research, vol. 61, pp. 563–592, 2018.
  9. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” dblp, 2015. [Online]. Available: https://dblp.org/rec/journals/corr/SimonyanZ14a.html. [Accessed: 13-Aug-2021].
  10. V. K. Jonnalagadda, “Sparse, stacked and Variational Autoencoder,” Medium, 06-Dec-2018. [Online]. Available: https://medium.com/@venkatakrishna.jonnalagadda/sparse-stacked-and-variational-autoencoder-efe5bfe73b64. [Accessed: 13-Aug-2021].
  11. P. Canuma, “What are rbms, deep belief networks and why are they important to deep learning?,” Medium, 23-Dec-2020. [Online]. Available: https://medium.com/swlh/what-are-rbms-deep-belief-networks-and-why-are-they-important-to-deep-learning-491c7de8937a. [Accessed: 13-Aug-2021].
  12. H.-I. Suk, S.-W. Lee, and D. Shen, “Latent feature representation with stacked auto-encoder for AD/MCI diagnosis,” Europe PMC, 22-Dec-2013. [Online]. Available: https://europepmc.org/article/med/24363140. [Accessed: 13-Aug-2021].
  13. S. Sarraf and G. Tofighi, “Classification of alzheimer’s disease using fmri data and deep learning convolutional neural networks,” arXiv.org, 29-Mar-2016. [Online]. Available: https://arxiv.org/abs/1603.08631. [Accessed: 14-Aug-2021].
  14. W. H. Pinaya, A. Gadelha, O. M. Doyle, C. Noto, A. Zugman, Q. Cordeiro, A. P. Jackowski, R. A. Bressan, and J. R. Sato, “Using deep belief network modelling to characterize differences in brain morphometry in schizophrenia,” Sci. Rep., vol. 6, p. 38897, 2016.
  15. J. Brownlee, “How to Combine oversampling and Undersampling for imbalanced classification,” Machine Learning Mastery, 10-May-2021. [Online]. Available: https://machinelearningmastery.com/combine-oversampling-and-undersampling-for-imbalanced-classification/. [Accessed: 13-Aug-2021].
  16. H. C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image synthesis for data augmentation and anonymization using generative adversarial networks,” Mayo Clinic. [Online]. Available: https://mayoclinic.pure.elsevier.com/en/publications/medical-image-synthesis-for-data-augmentation-and-anonymization-u. [Accessed: 13-Aug-2021].
  17. A. S. Lundervold and A. Lundervold, “An overview of deep learning in medical imaging focusing on MRI.” Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102-127, 2019, doi: 10.1016/j.zemedi.2018.11.002.
  18. E. Chmiel, “Transfer learning: Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/transfer-learning-1?lang=us. [Accessed: 13-Aug-2021].
  19. F. Gaillard, “Epoch (machine learning): Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/epoch-machine-learning?lang=us. [Accessed: 13-Aug-2021].
  20. J. Brownlee, “How to configure the learning rate when training deep learning neural networks,” Machine Learning Mastery, 06-Aug-2019. [Online]. Available: https://machinelearningmastery.com/learning-rate-for-deep-learning-neural-networks/. [Accessed: 13-Aug-2021].
  21. F. Gaillard, “Batch size (machine learning): Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/batch-size-machine-learning?lang=us. [Accessed: 13-Aug-2021].
  22. C. M. Moore, “Overfitting: Radiology reference article,” Radiopaedia Blog RSS, 2020. [Online]. Available: https://radiopaedia.org/articles/overfitting?lang=us. [Accessed: 13-Aug-2021].
  23. P. Lakhani, “Paras42/Hello_World_Deep_Learning: Hello world introduction to deep learning for medical image classification,” GitHub, 16-Apr-2018. [Online]. Available: https://github.com/paras42/Hello_World_Deep_Learning. [Accessed: 06-Aug-2021].
  24. N. Chakrabarty, “Brain mri images for brain tumor detection,” Kaggle, 14-Apr-2019. [Online]. Available: https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection. [Accessed: 06-Aug-2021].
  25. N. Vuong, “Google Collaboratory – HWD_7_Models_Data,” Google Colab, 05-Aug-2021. [Online]. Available: https://colab.research.google.com/drive/1fbx9hfLIMJyNSf6T_VoGrZISGp0BpkpN?usp=sharing. [Accessed: 14-Aug-2021].
  26. N. Vuong, “Google Colaboratory,” Google Colab, 06-Aug-2021. [Online]. Available: https://colab.research.google.com/drive/1IgqnOBHL3H_GgBlSwPLZMDpeXlaikoBn?usp=sharing. [Accessed: 14-Aug-2021].
