Authors: Byron Yang, Aditya Narayanan, Annie Xu, Burak Bartan

**Abstract**

Recent advances in cloud computing and the Internet of Things (IoT) have opened the door for the growth of edge computing, a computing scheme that bridges the gap between computation performed on cloud-based servers and computation performed locally. Edge computing architectures offer several promising advantages over a standard cloud-computing or offline-computing setup, such as reduced network latency and bandwidth usage.^{1}

These advantages come from introducing edge devices: internet-connected devices capable of relaying data gathered from local physical sensors to a robust cloud-based computing system.

In this research project, we harness the strengths of edge devices by constructing a machine learning (ML) model and deploying it to an Arduino Nano 33 BLE Sense board, our chosen edge device. In addition, we explore the effects that several factors (quantization size, various hyperparameter values, and different classification methods) have on the accuracy of our machine learning classifier.

**1 – Background**

TensorFlow is an open-source machine learning library well suited to building neural networks, which is why we chose it to build our ML models. As edge devices have grown more popular, TensorFlow Lite, a set of tools designed to help developers run TensorFlow models on edge devices such as microcontrollers and smartphones, was created. Because we are using an Arduino board for this project, we used both TensorFlow and TensorFlow Lite. We also used a web-based platform named Edge Impulse, both to familiarize ourselves with basic neural network architectures and to gather the data we used in this project.

Our edge device for this project is an Arduino Nano 33 BLE Sense board. It is well suited to this role because it features many built-in sensors, including a 9-axis Inertial Measurement Unit (IMU), a microphone, and temperature and humidity sensors. In addition to its sensing capabilities, the board is compatible with pre-built ML tools such as Edge Impulse and TensorFlow Lite. With a 32-bit ARM Cortex-M4 processor, 1 MB of flash memory, and 256 KB of SRAM, the board lets us perform ML tasks while also assessing our architecture’s efficiency under tight resource constraints.

The Edge Impulse application^{2} was used at times during our study because its graphical user interface simplifies the construction of binaries compatible with our Arduino edge device. Edge Impulse supports the incorporation of Python TensorFlow code, and we ultimately used it to convert our ML models into quantized, Arduino-compatible binaries. We also used it to collect audio data from the Arduino’s own microphone.

**2 – Materials and Methods**

The materials we used include: a computer (used to build, train, and run the TensorFlow models), an Arduino Nano 33 BLE Sense board (used to gauge how well the trained TensorFlow model performs using real-time data), and a USB cable (used to connect the Arduino board with the computer).

We built and trained a number of models with pre-built audio classification datasets from Edge Impulse. After downloading the data, we first split each audio file into “windows.” Then, we used the Python package Librosa to compute mel-frequency cepstral coefficients (MFCCs) and NumPy to create training and testing arrays from those coefficients. For each model trained, 60-80% of the dataset was reserved for training, and the remainder was used for validation. All data selections were randomly generated using NumPy. After preparing the data, we used TensorFlow’s Keras API to create our neural networks and tuned the hyperparameters to obtain the maximum possible accuracy. Then, after quantizing the model with TensorFlow Lite and saving it as a TensorFlow Lite file (.tflite), we integrated the model into an Arduino IDE project using C++ code. Lastly, the model was run on the Arduino board, where it made predictions based on real-time data collected by the on-device microphone.
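The windowing and random train/validation split described above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the study: the function names, the 16 kHz sample rate, the window and hop sizes, and the 80% training fraction are all assumptions, the standard-library `random` module stands in for NumPy to keep the sketch dependency-free, and the MFCC computation itself is omitted.

```python
import random


def split_into_windows(samples, window_size, hop_size):
    """Cut a 1-D sequence of audio samples into fixed-length windows.

    Trailing samples that do not fill a whole window are dropped.
    """
    if window_size <= 0 or hop_size <= 0:
        raise ValueError("window_size and hop_size must be positive")
    last_start = len(samples) - window_size
    return [samples[start:start + window_size]
            for start in range(0, last_start + 1, hop_size)]


def train_validation_split(items, train_fraction, seed=0):
    """Shuffle items reproducibly, then split them into training and validation lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical example: a silent 1-second clip at an assumed 16 kHz sample rate,
# cut into 0.25 s windows that overlap by 50%.
clip = [0.0] * 16000
windows = split_into_windows(clip, window_size=4000, hop_size=2000)
train, validation = train_validation_split(windows, train_fraction=0.8)
```

In a full pipeline, each window would be converted to MFCC features before the split, so that the training and validation arrays hold coefficients rather than raw samples.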

To build the optimum machine learning model for this dataset, we assessed the different factors that impact the accuracy of the model. For this project, we chose to focus on hyperparameters (learning rate, number of epochs, different activation functions and optimizers, etc.), quantization size, and less complex classification methods (linear classification, support vector machines). At the end of the process, a model was trained with the specific best hyperparameters found in this portion of the study and deployed to the board to test live inference.

**3 – Results**

**3.1 – Hyperparameters results:**

The following three graphs show how parameters related to model size (training epochs, number of dense layers, and number of neurons in each dense layer) impacted the accuracy of the model.

From these three graphs, we can see that generally, as the independent variable in each graph increased, the accuracy for the training dataset converged to 100% and the accuracy for the testing dataset approached 80%. More specifically, the accuracy barely changed as the number of layers increased, whereas in both the accuracy vs. number of epochs graph and the accuracy vs. number of neurons graph the data points rise toward an asymptote.

The following graph shows how the learning rate impacted the accuracy of the model.

From this graph, we see that the accuracy is linear until the learning rate is about 0.033. After that, the accuracy quickly decreased in an asymptotic fashion. Also note how small the learning rates had to be to produce a high accuracy for this dataset.
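One way to build intuition for why accuracy collapses once the learning rate passes a threshold is a toy gradient-descent problem. The sketch below is only an analogy, not the model from this study: it minimizes the hypothetical function f(w) = w², for which each step multiplies w by (1 − 2 × learning rate), so the iterates shrink for small learning rates and blow up for large ones. The threshold for a real neural network depends on the loss surface and will differ from the one in this toy problem.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w              # derivative of w**2
        w -= learning_rate * gradient   # each step scales w by (1 - 2 * learning_rate)
    return w


# A small step size converges toward the minimum at w = 0,
# while a step size that is too large overshoots further on every update.
small_step_result = gradient_descent(learning_rate=0.1)
large_step_result = gradient_descent(learning_rate=1.1)
```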

The following five graphs show how different activation functions impacted the accuracy of the model. We chose the number of epochs as the x-axis because doing so let us observe how a given activation function affected the accuracy of the model over many iterations of training rather than just a few epochs. This reduced the weight of individual data points and helped us focus on the general trend of the data.

From these graphs, we can see that all the activation functions shown (elu, relu, selu, softplus, and softsign) performed very similarly. The accuracy for the training dataset was usually between 90% and 100% while the accuracy for the testing dataset converged to 80%.

The following four graphs show how different optimizers impacted the accuracy of the model. Once again, we chose the number of epochs as the x-axis because doing so let us observe how a given optimizer affected the accuracy of the model over many iterations of training rather than just a few epochs. This reduced the weight of individual data points and helped us focus on the general trend of the data.

From these graphs, we can see that all the optimizers shown (Adam, Adamax, Nadam, and RMSprop) performed very similarly. The accuracy for the training dataset was usually between 90% and 100% while the accuracy for the testing dataset converged to 80%.

**3.2 – Model Quantization:**

One of the primary concerns when deploying neural networks or other machine learning models to microcontroller boards such as the Arduino Nano 33 BLE Sense is the quantization of model weights. Given the limited memory on devices such as the Arduino Nano, speed and memory efficiency are of the utmost importance; with large machine learning models such as neural networks, performing a large number of 32-bit floating-point computations can become taxing on the microcontroller. For this reason, it is common practice to quantize the weights and/or inputs of the model, as accepting lower precision in calculation can increase speed, decrease memory usage, and, if optimized properly, yield accuracy comparable to unquantized methods. In this section, we explore and quantify the relationship between accuracy and model size for quantized and unquantized neural networks for audio classification.

There are two common forms of quantization used for machine learning models, both of which are supported by the TensorFlow API for Python. The first and simpler method, post-training quantization, involves quantizing the weights or parameters of an already-trained neural network or other ML model to a lower-bit precision. For example, the weights and biases of a neural network can be converted from 32-bit floats to 16-bit floats. The same can be done for the input and output tensors of the model. The second method, quantization-aware training, takes weight and bias quantization into account during the training process. Quantization-aware training typically preserves accuracy better, but post-training quantization is simpler to apply and requires no retraining, so we chose to test the effects of model quantization using post-training quantization. Furthermore, we restricted our tests to quantization of model parameters rather than including model inputs and outputs, because the Arduino Nano and other microcontrollers that can run TensorFlow Lite quantized models are capable of operations involving the default input and output type (32-bit floating point).
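To make the idea of post-training weight quantization concrete, the sketch below maps a list of floating-point weights to 8-bit integers using a generic affine (scale and zero-point) scheme and then maps them back. It is an illustration under stated assumptions, not the TensorFlow Lite converter's implementation: the function names and example weights are hypothetical, and real converters add details such as calibration and per-channel scales.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-128, 127] with an affine scheme.

    Returns the quantized values plus the scale and zero point needed to
    recover approximate floats later.
    """
    # Extend the range to include 0.0 so that zero is represented exactly.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = int(round(-128 - lo / scale))
    quantized = [max(-128, min(127, int(round(w / scale)) + zero_point))
                 for w in weights]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float weights from their 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]


# Hypothetical weights from a trained layer.
weights = [-0.6, -0.1, 0.0, 0.3, 0.9]
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)

# The rounding error per weight is on the order of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now occupies 8 bits instead of 32, a fourfold reduction in storage, at the cost of a small rounding error per weight.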

In our exploration of the effect of quantization size on neural network accuracy, we selected two metrics for quantifying model accuracy: classification accuracy and “Mean Prediction Error.” Classification accuracy is the number of correct classifications divided by the total size of the testing dataset when the network output is interpreted at 95% confidence. Mean Prediction Error is the average error in predicted probability (before conversion to classifications). During our tests, we trained 30 neural networks, each consisting of two 1-dimensional convolutional layers (each with 40 neurons) and an output layer of 3 neurons. Each model was converted to the TensorFlow Lite format at 32-bit float, 16-bit float, and integer quantizations. Each converted model was tested in a TensorFlow Lite interpreter, and its classification accuracy and Mean Prediction Error were calculated. The training process is stochastic, meaning that training on the same data with the same parameters can yield different results from run to run. Reporting the mean and sample standard deviation across the 30 networks lets us account for this variability.
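The two metrics and the summary statistics can be sketched as follows. The text does not give exact formulas, so this sketch makes two assumptions that should be read as one plausible interpretation: a prediction counts as correct only when the top class matches the label and its probability is at least 0.95, and Mean Prediction Error is the mean absolute difference between the predicted probabilities and the one-hot targets. All function names and example values are hypothetical.

```python
from statistics import mean, stdev


def classification_accuracy(probabilities, labels, confidence=0.95):
    """Fraction of samples whose top class is correct and at least `confidence` probable."""
    correct = 0
    for probs, label in zip(probabilities, labels):
        top_class = max(range(len(probs)), key=probs.__getitem__)
        if top_class == label and probs[top_class] >= confidence:
            correct += 1
    return correct / len(labels)


def mean_prediction_error(probabilities, labels):
    """Mean absolute gap between predicted probabilities and one-hot targets."""
    errors = []
    for probs, label in zip(probabilities, labels):
        for index, p in enumerate(probs):
            target = 1.0 if index == label else 0.0
            errors.append(abs(p - target))
    return mean(errors)


def summarize(values):
    """Mean and sample standard deviation of a metric across repeated training runs."""
    return mean(values), stdev(values)


# Hypothetical softmax outputs for three test samples and three classes.
probabilities = [[0.97, 0.02, 0.01], [0.10, 0.85, 0.05], [0.01, 0.01, 0.98]]
labels = [0, 1, 2]
accuracy = classification_accuracy(probabilities, labels)
error = mean_prediction_error(probabilities, labels)
```

In this example the second sample is classified correctly but with only 85% confidence, so under the assumed rule it does not count toward the accuracy.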

The graphs above show the mean value of each measured quantity as well as an error bar depicting plus or minus one sample standard deviation from the mean. Because the range represented by plus or minus one standard deviation is very similar in both accuracy and error across the different quantization sizes, we find no significant evidence that TensorFlow Lite quantization affects, positively or negatively, the accuracy of a pre-trained model.

**3.3 – Other classification methods results:**

At the outset, we knew that neural network classifiers usually yield higher accuracy due to their more complex and flexible nature. However, because of this complexity, they also require more memory than their simpler counterparts.

Since our main focus was to build an optimum Neural Network, we chose to compare it to three other classifiers that support multi-class classification: Support Vector Machines, Linear Discriminant Analysis, and Gaussian Naive Bayes.

**3.3.1 – Support Vector Machine**

SVM classifiers use kernels (a set of mathematical functions) to transform data and determine hyperplanes that separate the classes^{3}.

The Support Vector Machine classifier yielded a training accuracy of 100% and a test accuracy of 81%. Out of the three simple classifiers tested, SVM had the highest training accuracy and second highest test accuracy (82% for Gaussian NB).

**3.3.2 – Linear Discriminant Analysis**

LDA is arguably the simplest machine learning algorithm for multi-class classification. LDA reduces the dimensions of the data, which allows the classes to be separated with relatively small computing costs and resources^{5}.

The Linear Discriminant Analysis classifier yielded a training accuracy of 99% and a test accuracy of 75%. This LDA classifier clearly does not perform as well as SVM, Gaussian NB, or neural networks.

**3.3.3 – Gaussian Naive Bayes Classifier**

The Naive Bayes classifier applies Bayes’ theorem with the assumption that the features are independent of one another given the class^{4}. We implemented the Gaussian variant of Naive Bayes, which models each feature within a class as normally distributed.

The Gaussian Naive Bayes classifier yielded a training accuracy of 93% and a test accuracy of 82%.
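As a rough illustration of how a Gaussian Naive Bayes classifier works, the sketch below fits a per-class mean and variance for every feature and predicts the class with the highest log-posterior. It is a from-scratch teaching sketch with hypothetical names and toy data, not the implementation used to produce the accuracies reported here; the small variance floor is an assumption added to avoid division by zero.

```python
import math
from collections import defaultdict


class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes: one independent normal per feature per class."""

    def fit(self, features, labels, var_floor=1e-9):
        grouped = defaultdict(list)
        for row, label in zip(features, labels):
            grouped[label].append(row)
        total = len(labels)
        self.stats = {}
        for label, rows in grouped.items():
            columns = list(zip(*rows))
            means = [sum(c) / len(c) for c in columns]
            # var_floor keeps the variance strictly positive for constant features.
            variances = [sum((x - m) ** 2 for x in c) / len(c) + var_floor
                         for c, m in zip(columns, means)]
            self.stats[label] = (math.log(len(rows) / total), means, variances)
        return self

    def predict(self, row):
        def log_posterior(label):
            log_prior, means, variances = self.stats[label]
            # Independence assumption: per-feature log-likelihoods simply add up.
            log_likelihood = sum(
                -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                for x, m, v in zip(row, means, variances))
            return log_prior + log_likelihood
        return max(self.stats, key=log_posterior)


# Toy two-feature data standing in for real MFCC feature vectors.
features = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
            [5.0, 5.1], [4.8, 5.2], [5.1, 4.9]]
labels = ["noise", "noise", "noise", "yes", "yes", "yes"]
model = GaussianNaiveBayes().fit(features, labels)
```

Because the model stores only a prior, a mean vector, and a variance vector per class, its memory footprint is small compared with that of a neural network.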

**4 – Deployment to Arduino Board**

Based on the results of the tests reported above (architecture, quantization, and various classifiers), as well as tests involving 1-dimensional convolutional layers with architectures similar to those reported earlier, we chose a fully-connected neural network, with relu activation for the hidden layers and softmax activation for the output layer, as the model for demonstrating that edge devices can perform live inference with pre-trained, quantized models.
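The inference step of such a network, a fully-connected hidden layer with relu followed by a softmax output layer, can be sketched in a few lines. The layer sizes, weights, and function names below are made up for illustration and are far smaller than a real model; the sketch only shows the arithmetic the deployed model performs on each feature vector.

```python
import math


def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully-connected layer; weights[j] holds the input weights of output neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """One hidden relu layer followed by a softmax output layer."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


# Hypothetical tiny network: 2 inputs, 2 hidden neurons, 3 output classes.
hidden_w, hidden_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
out_w, out_b = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]], [0.0, 0.0, 0.0]
probs = forward([1.0, -2.0], hidden_w, hidden_b, out_w, out_b)
```

The index of the largest entry in `probs` is the predicted class, which in a three-class audio task would correspond to one of the three labels.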

As the results presented above suggest, the pre-built dataset used for testing the various hyperparameters does not suit our live inference task because of the vastly different conditions under which its data was collected (different microphones, different levels of ambient noise, etc.). To ensure that we properly reflect the capabilities of machine learning on edge devices, we recorded our own dataset using the Edge Impulse tool for the task of classifying each audio sample as “yes,” “no,” or “noise.”

After training the Keras model with TensorFlow, the model weights and biases can be converted to 8-bit integers and exported as a .tflite file using the TensorFlow Lite library. These files can be loaded onto the Arduino microcontroller to perform live inference. While we tested this method and found it to be successful, this workflow does not allow us to compute MFCCs in real time for live inference. Because the models were trained on MFCC input features, it is unreasonable to expect a well-performing model without computing those same features at inference time. In order to still be able to compute the MFCCs on the Arduino Nano, we turned to Edge Impulse, which provides us with an API key for access to MFCC computation provided that the Edge Impulse compilation tool is used to deploy our model to the Arduino Nano.

While the model performed relatively well in live inference, we did encounter some issues affecting model accuracy, such as microphone quality, which are discussed in more detail in the Future Directions section.

**5 – Conclusion**

In this study, we tested the effects of various factors on the accuracy of machine learning models applied to audio classification, and demonstrated the promise of carrying out machine learning inference on a low-power, low-memory edge device such as the Arduino Nano 33 BLE Sense. We concluded that certain hyperparameters dealing with model size and architecture are better suited to the task of audio classification than others; that post-training quantization of the weights and biases of neural networks provides inference capabilities statistically equivalent to those of non-quantized models; and that certain non-neural-network classifiers, such as support vector machines, are as adept or nearly as adept at classification as neural networks. We then demonstrated that the TensorFlow Lite API can be used to deploy pre-trained models to microcontrollers, where inference can be performed in real time with little power and memory.

**6 – Future Directions**

In this project, we tested four different classification methods: artificial neural networks, support vector machines, linear discriminant analysis classifiers, and Gaussian naive Bayes classifiers. Thus, a possible future direction may be to test other classification algorithms (decision tree, random forest, k-nearest neighbors, etc.) and compare their accuracy on the same task.

In addition, our preparation of the data in this experiment mostly consisted of computing the MFCCs of the downloaded audio files and then training the model on these coefficients. A possible future direction, then, may be to instead generate the spectrograms associated with the audio files and then use a 2D convolutional neural network (CNN) to see how the accuracy of the model is affected. Training on spectrograms offers the advantage of further reducing our memory usage on the microcontroller chip, as MFCCs need not be computed in real time.

Furthermore, the poor quality of the Arduino microphone, coupled with the need to tailor window sizes for MFCC calculations to the duration of the audio signal being classified, warrants further investigation into transfer learning for our application.

Finally, because edge computing lends itself particularly well to transfer learning, another future goal is to explore the possibilities of transfer learning with our machine learning model. Rather than training another model from scratch, transfer learning would allow us and others to adapt the existing model to classify other inputs.

**7 – Acknowledgements**

First, we would like to thank Burak Bartan for his guidance and tremendous help with the project. Next, we would like to thank Mert Pilanci for supplying us with the Arduino board that was crucial to our research. Lastly, we would like to acknowledge Professor Tsachy Weissman and Cindy Nguyen for their work in leading the Stanford Compression Forum and allowing us to participate in the program and conduct this research project.

**References**

[1] W. Shi, J. Cao, Q. Zhang, Y. Li and L. Xu, “Edge Computing: Vision and Challenges,” in IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637-646, Oct. 2016, doi: 10.1109/JIOT.2016.2579198.

[2] https://www.edgeimpulse.com/

[3] Dreiseitl, S., & Ohno-Machado, L. (2003, June 07). Logistic regression and artificial neural network classification models: A methodology review. Retrieved August 05, 2020, from https://www.sciencedirect.com/science/article/pii/S1532046403000340

[4] Ray, Sunil. (2020, April 01). Learn Naive Bayes Algorithm: Naive Bayes Classifier Examples. Retrieved August 05, 2020, from https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/

[5] Writer, A. M. (2020, January 04). Everything You Need to Know About Linear Discriminant Analysis. Retrieved August 05, 2020, from https://www.digitalvidya.com/blog/linear-discriminant-analysis/