Diagnosing Tuberculosis through Analysis of X-ray Scans by Neural Networks

Blog, Journal for High Schoolers, Journal for High Schoolers 2021

Authors

David J. Florez Rodriguez, Nam Dao, Kathryn Xiong, Srividya Koppolu, Teaghan Knox

Abstract

Pulmonary tuberculosis (PTB) is a potentially fatal infectious lung disease most commonly found in developing countries. Early diagnosis of PTB can help limit its spread and increase its treatability. Doctors commonly diagnose PTB using a skin test or blood test followed by an x-ray or CT scan of the lungs. In developing countries, where PTB is most prevalent, there are not always personnel with the proper training to interpret the x-rays accurately. Therefore, developing a computer model to diagnose tuberculosis that does not require intensive processing power could improve diagnosis rates in such places and decrease the cost. In this project, we first tested basic predictive models like logistic regression, then moved on to convolutional neural networks (CNNs). We explored hyperparameter tuning and various data preparation methods in an attempt to achieve an accuracy suitable for patients’ needs. The best model we tested was a Sklearn Random Forest Classifier, which correctly classified x-rays as PTB positive or negative 83.7% of the time. While not yet accurate enough for medical purposes, it required minimal processing power and is a promising start. With more tuning and an expansion of the types of models tested, a machine learning algorithm could plausibly be applied to diagnosing PTB in the field in the near future.

Introduction

Pulmonary tuberculosis is a contagious airborne disease that occurs, most often in people with weakened immune systems, when the bacterium Mycobacterium tuberculosis attacks the lungs. When left untreated, M. tuberculosis causes tissue destruction and lung lesions that are often fatal (World Health Organization, 2020). Tuberculosis is a global problem, ranking as one of the top ten causes of death worldwide, with 1.4 million deaths in 2019 alone. Developing countries have experienced the largest outbreaks, accounting for over 95% of cases and deaths (World Health Organization, 2020). With the rise of new drug-resistant strains, the problem has only become more urgent, making rapid, accurate diagnosis of tuberculosis vital both for ensuring the best treatment outcome for the patient and for public health intervention to reduce further spread in the community (Ryu, 2015).

Currently, diagnosing active pulmonary tuberculosis efficiently and accurately remains a problem. The common tests for pulmonary tuberculosis are the tuberculin skin test or an IGRA blood test, used in combination with a chest x-ray (CDC, 2016). Despite the popularity of chest x-rays as a means of diagnosis, detecting abnormalities in the lungs that are consistent with pulmonary tuberculosis is difficult because of the variety of features that can occur and their similarity to the features of other lung diseases (Pasa et al., 2019). This is especially true in developing countries, where tuberculosis is most prevalent and where there is often a lack of properly trained physicians (Melendez et al., 2016).

In such cases, the role of a well-trained physician in chest x-ray analysis could be filled by a cheaper and potentially more effective machine learning algorithm. Given these benefits, it is no surprise that numerous groups have researched the application of machine learning algorithms to medical imaging diagnoses, including chest x-rays. Chest x-rays are well suited to tuberculosis diagnosis with machine learning algorithms because of their many similarities with natural images, their relatively low cost, and the availability of data sets such as the Montgomery and Shenzhen sets.

Because of the abundance of research on tuberculosis diagnosis, a variety of machine learning techniques have been applied. One common subject of deep learning research is the application of existing powerful natural image classifiers such as ResNet, AlexNet, and GoogLeNet (Lakhani & Sundaram, 2017). While these models achieved high accuracy, they are time and memory intensive, requiring high-end equipment to run (Pasa et al., 2019). Our goal is to experiment with less complex algorithms in an attempt to achieve a high-accuracy, chest x-ray based pulmonary tuberculosis diagnosis model that requires less processing power.

To diagnose tuberculosis using x-ray scans, we used a variety of machine learning classification algorithms, each in its own model. The models were trained on data containing both X-ray images and their corresponding labels (tuberculosis positive or negative). After training, the models made predictions on a test set, and we checked those predictions against the true labels. From this, we extracted an accuracy rate for each model, which we used as a metric to compare the predictive power (and thus effectiveness) of the models. Then, we focused on the models with the greatest potential for improvement in predictive power and adjusted their hyperparameters.

Methods and Materials:

Prior to modeling with X-rays, we used smaller images from the TensorFlow CIFAR-10 dataset and importable models from the Sklearn library. This step allowed us to conserve time and memory while testing possible data conversion methods and image recognition models. Because Sklearn’s models only accept one flat array of features per example, we had to flatten each three-dimensional image array into a one-dimensional array. We first averaged the third dimension, color, to create a black-and-white image, matching X-rays, which are also black-and-white images. This reduced the dimensions of each image from height * width * rgb colors to simply height * width. We then appended each row of the resulting two-dimensional array onto the first row, creating a one-dimensional array. We tested Sklearn’s k-nearest neighbors algorithm with five neighbors, its Logistic Regression algorithm with default hyperparameters, and its MLPClassifier algorithm with default hyperparameters. The three models had accuracies of 20 percent, 28 percent, and 10 percent, respectively. While experimenting, we also built a model with the Keras API, again with default hyperparameters, which reached an accuracy of 67 percent.
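The conversion described above can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: the function names are hypothetical, and the batch is random data shaped like CIFAR-10 images rather than the real dataset.

```python
import numpy as np

def to_grayscale(images):
    """Average the color axis: (n, height, width, 3) -> (n, height, width)."""
    return images.mean(axis=-1)

def flatten_images(images):
    """Join each image's rows end to end: (n, height, width) -> (n, height * width)."""
    return images.reshape(len(images), -1)

# Stand-in batch shaped like CIFAR-10 (32 x 32 RGB); random values, not real data.
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 32, 32, 3)).astype(np.float32)

gray = to_grayscale(batch)   # shape (4, 32, 32)
flat = flatten_images(gray)  # shape (4, 1024)
print(gray.shape, flat.shape)
```

Because reshape walks the array row by row, the first 32 values of each flattened image are its first row, the next 32 are its second row, and so on.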

Our composite dataset consisted of chest X-rays from the National Library of Medicine of patients located in Shenzhen, China, and Montgomery County in Maryland, USA, although we have only applied our models to the Shenzhen dataset. While working with the X-ray images, we first created a training set for our models. We acquired X-ray scans of people’s lungs (along with labels) from Kaggle. We stored all the images in a folder on Google Drive, which allowed for collaboration using Google Colab. Using an indicator in each image’s file path, we created the output array, which records whether each patient had tuberculosis.

Because the X-ray images were not uniform in size and shape, and some contained a color dimension, we tried to make the data uniform. Having similar examples of training data helps the models generalize and develop better weights. Using the same process as for our CIFAR-10 data, we converted the images of the x-ray scans into black and white with PIL’s getbands function, and using PIL’s resize function, we shrank all the images into 1800 by 1800 arrays. The reduction in data helped alleviate some pressing concerns about the lack of computational power for processing heavier data. Furthermore, since the color of X-ray images should have no bearing on whether an image is ‘diagnosed’ to have tuberculosis or not, getting rid of extraneous data helps prevent model overfitting.

Finally, we used Sklearn’s train_test_split function to split our data into a training group and a testing group. This division let us quantify the predictive accuracy of the models later. In total, we created four groups of data: X_train (containing the image arrays used for training), y_train (containing the corresponding labels to X_train), X_test (containing the image arrays used for testing), and y_test (containing the corresponding labels to X_test). The X data consisted of 3D matrices: # of examples by 1800 by 1800. The dimensions of our X_train matrix were 496 by 1800 by 1800, and those of our X_test matrix were 166 by 1800 by 1800. Additionally, we used NumPy’s save and load functions to store the images, allowing for easier access to our data when we built our models.
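As a rough sketch of the labeling, splitting, and saving steps, the snippet below uses made-up file names and tiny blank arrays in place of the real scans. The assumption that a file name ends in _1 for tuberculosis and _0 otherwise is ours, chosen only to illustrate reading a label from a path. A 25 percent test fraction is used because it is consistent with the 496 and 166 example counts reported above.

```python
import io
import numpy as np
from sklearn.model_selection import train_test_split

def label_from_path(path):
    """Return 1 for tuberculosis, 0 otherwise (assumes names end in _1 or _0)."""
    stem = path.rsplit(".", 1)[0]
    return int(stem.endswith("_1"))

# Hypothetical file names; 16 x 16 arrays stand in for the 1800 x 1800 scans.
paths = [f"scans/xray_{i:04d}_{i % 2}.png" for i in range(40)]
X = np.zeros((len(paths), 16, 16), dtype=np.float32)
y = np.array([label_from_path(p) for p in paths])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)

# Round trip through np.save / np.load so preprocessing need not be repeated.
buffer = io.BytesIO()  # in-memory file; a path on disk works the same way
np.save(buffer, X_train)
buffer.seek(0)
restored = np.load(buffer)
```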

Before our image data was ready, however, we needed to complete one last step. Since different models require differently shaped input, we reshaped our training and testing data to fit each model’s requirements. For Convolutional Neural Networks (CNNs), which require 4D matrices, we added an extra dimension to our input (X_train and X_test) data. For all other models, which require 2D matrices, we reduced the dimension using the function numpy.reshape().
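The two reshaping cases might look like the following. This is a sketch on a small stand-in array, and it assumes the channels-last layout (examples, height, width, channels) that Keras uses by default.

```python
import numpy as np

# Stand-in for the real (496, 1800, 1800) training array.
X_train = np.zeros((6, 16, 16), dtype=np.float32)

# CNN input: append a single channel axis -> (examples, height, width, 1).
X_train_cnn = X_train[..., np.newaxis]

# All other models: one row per image -> (examples, height * width).
X_train_flat = np.reshape(X_train, (X_train.shape[0], -1))

print(X_train_cnn.shape, X_train_flat.shape)
```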

Next, we tested several models to compare their effectiveness and accuracy. The models we worked with were logistic regression, random forest classifier, extra trees, naive Bayes, CNN, and KNN. Where these models offered hyperparameters, we used them to fine-tune the models’ behavior.

Most of these models had few or no hyperparameters to set. Our Naive Bayes model, for example, relied entirely on mathematical prediction, devoid of personal input. In the case of our K-Nearest Neighbor model, the only intuitive number of neighbors was 2, as the distinction between the data is binary: the images are either labeled with tuberculosis or not. We used warm start on our logistic regression, random forest, and extra trees classifiers. The warm-start hyperparameter let us create a model the first time we fit it and then continue training it with new data. We therefore split our data into 3 batches to conserve RAM while still training efficiently. Other models, however, had many more hyperparameters. In particular, we had to decide on the optimal architecture for our CNN model. Our CNN was constructed from a series of layers imported from the Python library TensorFlow. Although we could have implemented a more sophisticated architecture that might have produced better predictions, we purposefully picked a simpler design with fewer layers (Figure 3). This is in accordance with our research goal: the aim is to create a model that doesn’t require excessive computational power and time, allowing it to be implemented on computers in developing countries. Furthermore, an overly complex architecture could cause a problem with overfitting.
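A minimal sketch of batch-wise fitting with warm start on a random forest is given below, using synthetic data rather than the X-rays. One detail is our own reading of Sklearn’s behavior, not something stated in this project: for forest models, a warm-started fit only adds new trees when n_estimators has been raised since the previous fit, so the sketch grows the forest by a fixed, arbitrarily chosen number of trees per batch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 90 examples, 20 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 20))
y = (X[:, 0] > 0).astype(int)

trees_per_batch = 10  # illustrative value, not a tuned setting
forest = RandomForestClassifier(
    n_estimators=trees_per_batch, warm_start=True, random_state=0
)

batches = zip(np.array_split(X, 3), np.array_split(y, 3))
for i, (X_batch, y_batch) in enumerate(batches):
    # Raise the tree budget so this fit adds new trees trained on this batch.
    forest.set_params(n_estimators=trees_per_batch * (i + 1))
    forest.fit(X_batch, y_batch)

print(len(forest.estimators_))  # trees accumulated across the three batches
```

In this arrangement each tree sees only the batch it was trained on, so only one batch needs to be in memory at a time. For logistic regression, warm start instead reuses the previous coefficients as the starting point of the next fit.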

Figure 3. Architecture of CNN

To maximize accuracy, we compiled our CNN with the Adam optimizer and a binary cross-entropy loss function. For our final dense layer, we used sigmoid activation. These are all standard choices for CNNs that classify binary labels. Our first model reached a high accuracy on the training data but did not perform well on the test data. Thus, we included a validation_split argument when fitting our second CNN model. Besides this change, the two models are identical.
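To make the sigmoid and binary cross-entropy pairing concrete, here is a small NumPy sketch of what the two compute. It is written by hand for illustration and is not the TensorFlow implementation; the logits and labels are made-up numbers.

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

logits = np.array([2.0, -1.0, 0.0])  # raw outputs of the final dense unit (made up)
labels = np.array([1.0, 0.0, 1.0])   # 1 = tuberculosis, 0 = no tuberculosis

probs = sigmoid(logits)
loss = binary_cross_entropy(labels, probs)
print(probs, loss)
```

The loss shrinks toward zero as the predicted probabilities approach the true labels and grows quickly when the model is confidently wrong, which is why it suits a two-class problem like this one.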

Results:

Through trial and error, our accuracy improved. For the random forest classifier model, the model with the best results, we got an accuracy of 83.7%. The first CNN model tested successfully categorized about 81.3% of the x-ray scan data, classifying them as tuberculosis positive or negative. After changes, the second iteration of the CNN was only able to categorize around 79.5% of the data.

We also tested several other models that resulted in varying levels of accuracy (Table 1). It is hard to pinpoint exactly why one kind of model outperformed another. However, the K-Nearest Neighbor, logistic regression, and Naive Bayes models likely had the lowest accuracies because of the nature of the input data: these models couldn’t detect patterns in pixel variations. Because they merely compare the values of individual pixels across images and try to detect overarching differences, these models lack the sophistication to detect the diverse forms that tuberculosis can take in an X-ray. Although Convolutional Neural Networks should theoretically be the best at image classification among all of the algorithms used, issues with overfitting likely caused their predictive accuracy to drop.

Additionally, we used confusion matrices to visually group correct and incorrect diagnoses into true positives, true negatives, false positives, and false negatives. Confusion matrices are useful for visualizing the performance of a classification model, and they can also be used to judge how useful the model is. The confusion matrix for the second CNN model showed no improvement; in fact, it had fewer correct predictions. Theoretically, the only change made was the validation_split, which should improve accuracy, but in this case the accuracy decreased. In Figures 4 and 5, rows indicate real values and columns depict predicted values. For imbalanced data, however, these results may be misleading, since a model can score well by predicting the majority class label for most points.
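The layout described above (rows for real values, columns for predicted values) can be reproduced with a short NumPy sketch. The labels below are invented for illustration and are not our results. The sketch also computes the accuracy a model would get by always predicting the majority class, which gives a baseline for judging accuracy on imbalanced data.

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are real labels, columns are predicted labels (0 = negative, 1 = positive)."""
    matrix = np.zeros((2, 2), dtype=int)
    for real, predicted in zip(y_true, y_pred):
        matrix[real, predicted] += 1
    return matrix

# Invented example: six negative and four positive patients.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

cm = confusion_matrix_2x2(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
sensitivity = tp / (tp + fn)  # share of real positives that were caught
majority_baseline = max(np.mean(y_true == 0), np.mean(y_true == 1))

print(cm)
print(accuracy, sensitivity, majority_baseline)
```

Here the model looks reasonable by accuracy alone, yet it misses half of the positive cases and is only slightly better than always answering "negative", which is the kind of gap a confusion matrix exposes.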

It seems plausible that by increasing the amount of training data, or by tuning hyperparameters on a bigger dataset, it would be possible to obtain improved results with our architecture. The use of high-capacity models is, however, out of the scope of this work.

Model                         Accuracy
K-Nearest Neighbors (KNN)     0.669
Logistic Regression           0.711
Naive Bayes                   0.783
CNN(2)                        ~0.795
CNN(1)                        ~0.813
Extra Trees Classifier        0.825
Random Forest Classifier      0.837

Table 1. Models tested and their accuracies, in order of increasing accuracy

Conclusion:

The best algorithm we tested achieved an 84% accuracy rating, which, while good, requires improvement before it could be applied in a medical setting. It is a lower accuracy than that achieved by powerful industry models, which often reach accuracies in the upper 90 percent range. We are confident that, given more resources (in particular, more computational power), our models can be fine-tuned to become better at prediction without compromising runtime. We also attribute the lower predictive accuracy of our two CNN models to overfitting due to suboptimal data: the models performed well when running on training data but experienced a sudden drop in accuracy when running on the testing data. This discrepancy in accuracy is likely the result of faulty image processing and data splitting. For example, it is an often recommended practice in deep learning to have an equal number of examples for each label in a training set. For our project, this would have meant creating training data in which one half of the images showed tuberculosis and one half did not. However, our training data had markedly more examples labeled no-tuberculosis than examples labeled tuberculosis.

Because Google Colab had limited RAM, we also had to compromise on other data-processing techniques that might have increased the efficacy of our models. Primarily, the input data couldn’t be normalized on Colab without stopping the kernel. Finally, in trying to achieve a homogeneous image size of 1800 by 1800, we sacrificed valuable data which could have gone towards improving all of our models. Since most images were much larger than 1800 by 1800 pixels (closer to 3000 by 3000), our intentional reduction got rid of a lot of useful data.
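To give a sense of the memory involved, the sketch below estimates the size of the full image array at different data types and shows one possible way to normalize in chunks so that only a small temporary array exists at any moment. It assumes 8-bit pixel values (0 to 255) and 662 images in total (496 plus 166); the function names are hypothetical, and we have not tested whether this approach would have fit within Colab’s limits.

```python
import numpy as np

def array_gigabytes(shape, dtype):
    """Approximate in-memory size of an array of this shape and dtype, in GB."""
    n_values = int(np.prod(shape, dtype=np.int64))
    return n_values * np.dtype(dtype).itemsize / 1e9

def normalize_in_chunks(images, chunk_size=64):
    """Scale uint8 pixels to [0, 1] as float32, one chunk at a time."""
    out = np.empty(images.shape, dtype=np.float32)
    for start in range(0, len(images), chunk_size):
        stop = start + chunk_size
        out[start:stop] = images[start:stop] / np.float32(255.0)
    return out

full_shape = (662, 1800, 1800)
for dtype in (np.uint8, np.float32, np.float64):
    print(np.dtype(dtype).name, round(array_gigabytes(full_shape, dtype), 1), "GB")

# Small random stand-in for the scans.
rng = np.random.default_rng(0)
small = rng.integers(0, 256, size=(10, 16, 16), dtype=np.uint8)
normalized = normalize_in_chunks(small, chunk_size=4)
```

The estimate suggests why normalization was costly: dividing integer pixels by 255 with default settings produces 64-bit floats, which take eight times the memory of the original 8-bit values.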

Another source of error in our project was running each model only once. Since training can vary from one run to another, each model’s predictive accuracy would have varied slightly had it been run multiple times. Each model should have been run a standardized number of times and had its accuracies averaged to produce the most reliable number. However, lack of time (some models took many hours to compile and fit) meant that this couldn’t be accomplished.
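The repeated-run procedure suggested above could be sketched as follows. The data here is synthetic (generated with Sklearn’s make_classification), and the number of runs and trees are arbitrary illustrative choices, so the printed numbers say nothing about our X-ray models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the flattened X-rays.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

def accuracy_for_seed(seed):
    """Train one model with the given random seed and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=25, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

n_runs = 5  # a standardized number of runs per model
scores = np.array([accuracy_for_seed(seed) for seed in range(n_runs)])

print(f"mean accuracy {scores.mean():.3f}, standard deviation {scores.std(ddof=1):.3f}")
```

Reporting the standard deviation alongside the mean also shows whether a gap between two models, such as 0.825 versus 0.837, is larger than the run-to-run variation.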

However, we did achieve our goal of using much smaller amounts of processing power. All of the models (besides our CNNs) were designed and implemented via Google Colab, a platform with a very small amount of available RAM for processing data and running models. We are confident that, since our models were able to run on Colab, they can be implemented in developing countries as we originally intended. Although they cannot yet diagnose patients with tuberculosis consistently and accurately, our models are open to a great deal of future fine-tuning to achieve higher efficacy. With these improvements, we are confident that our models will eventually be fit for wide commercial usage.

Future Directions:

For our future plans, we will work on experimenting with various hyperparameters on our models. For example, more (and different kinds of) layers can be added to our CNNs. When fitting our CNN models, we could also add validation data, which would theoretically improve the image classification process. Furthermore, we can compare the results of our models with those of well-known general purpose CNNs like Google’s Inception. These models are popular for their generalized image classification abilities but are also highly computationally demanding. Along with an emphasis on the design of new smaller-scale models, more research can be conducted into reducing the sophistication of general purpose CNNs so that they require less processing power. This alternative route to the right combination of computational power and predictive accuracy may prove more complex but also highly rewarding.

Finally, we will use metadata for better model learning. The data set we used also included extra labels for each image, including the gender and age of the patient in each photo. The data set we used originated from Shenzhen, China, but there is a similar data set from Montgomery County, MD, USA. By including data from the U.S. data set and using the labels in the metadata, we can also train our models to be more sensitive to variations across gender, age, and geography.

References:

  1. Candemir S, Jaeger S, Palaniappan K, Musco JP, Singh RK, Xue Z, Karargyris A, Antani S, Thoma G, McDonald CJ. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE Trans Med Imaging. 2014 Feb;33(2):577-90. doi: 10.1109/TMI.2013.2290491. PMID: 24239990
  2. Centers for Disease Control and Prevention (CDC). (2016, April 14). Testing & diagnosis. Tuberculosis (TB). https://www.cdc.gov/tb/topic/testing/default.htm.
  3. Jaeger S, Karargyris A, Candemir S, Folio L, Siegelman J, Callaghan F, Xue Z, Palaniappan K, Singh RK, Antani S, Thoma G, Wang YX, Lu PX, McDonald CJ. Automatic tuberculosis screening using chest radiographs. IEEE Trans Med Imaging. 2014 Feb;33(2):233-45. doi: 10.1109/TMI.2013.2284099. PMID: 24108713
  4. Lakhani, P., & Sundaram, B. (2017, April 24). Deep learning at Chest Radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology. https://pubs.rsna.org/doi/full/10.1148/radiol.2017162326.
  5. Melendez, J., Sánchez, C., Philipsen, R. et al. An automated tuberculosis screening strategy combining X-ray-based computer-aided detection and clinical information. Sci Rep 6, 25265 (2016). https://doi.org/10.1038/srep25265
  6. Pasa, F., Golkov, V., Pfeiffer, F. et al. Efficient Deep Network Architectures for Fast Chest X-Ray Tuberculosis Screening and Visualization. Sci Rep 9, 6268 (2019). https://doi.org/10.1038/s41598-019-42557-4
  7. Ryu Y. J. (2015). Diagnosis of pulmonary tuberculosis: recent advances and diagnostic algorithms. Tuberculosis and respiratory diseases, 78(2), 64–71. https://doi.org/10.4046/trd.2015.78.2.64
  8. World Health Organization. (2020, October 14). Tuberculosis. World Health Organization. https://www.who.int/en/news-room/fact-sheets/detail/tuberculosis.
