# Combining Hardware Acceleration with Low-Precision and Quantization Techniques for more Energy-Efficient, Higher-Performance Computing

EE376A (Winter 2019)

By Juan Camacho

#### Introduction:

The landscape of high-performance parallel computing has been rapidly evolving over these past few years.  With the slowdown of Moore’s Law and the end of Dennard scaling over 1 decade ago, there have not been significant improvements in general-purpose processor speeds and power performances. However, at the same time, we have been seeing massive industry adoptions of emerging applications in “big data” analytics, machine and deep learning,real-time graphics, genomics, among others. Due to a huge interest in deploying these applications in huge datacenters and power-constrained edge devices, both industry and academia practitioners have been investing substantial efforts into developing more specialized. heterogeneous computing platforms for better energy-efficiency compute gains. This, of course, includes the deployment of reconfigurable architectures for hardware acceleration, which has received renewed interest from the computer architecture community to get around the power limitations of CPUs while extracting a lot more parallelism from the applications. In this post, we look at how certain techniques such as low-precision computing and quantization (which are forms of compressing information) can help improve the performance of some data-bound kernels in reconfigurable architectures such as FPGAs. More specifically, we look at the implementation of ML and computer vision kernels such as Stochastic Gradient Descent (for spam detection) and Optical Flow in a research DSL (Domain-Specific Language) for hardware accelerators called Spatial, and how they can be efficiently mapped to hardware constructs.

#### Background:

Although FPGAs (Field Programmable Gate Arrays) has been around for decades, they have recently become quite attractive candidates for implementing hardware accelerators. FPGAs come with flexible logic blocks (in the form of Look-up Tables) that can be reconfigured to implement whatever computation is specified. The picture below shows the internal structure of an FPGA:

Despite this gate-level flexibility, one of the main obstacles to massive FPGA deployment is in its programmability: to reconfigure an FPGA, one needs to use RTL verification and synthesis tools in which the developer writes code in a Hardware-Description Language (e.g. Verilog). Such low-level HDLs are hard to use without proper knowledge of digital hardware design (e.g. one must account for timing of signals, manually handle data movement, etc.), and the whole synthesis flow is usually incredibly slow (a modestly-complex design can take multiple hours before its corresponding FPGA bitstream is generated).  There has been a lot of work into developing higher-level tools that make programming FPGAs easier, one of which is Spatial.

Spatial is a Scala-based DSL that has been developed at Stanford for programming accelerators. The language gives high-level constructs  that improve programmer productivity such as nested-loop pipelining, parameterization of parallelization factors and a simple interface for specifying memory transfers between off-chip and on-chip storage. More importantly, the language allows the programmer to implement parts of the application that require hardware acceleration at a level of abstraction higher than other HDL tools while the compiler performs additional optimizations and handles low-level hardware issues such as register timing and taking into account physical constraints. For those interested, more information can be found here: https://spatial-lang.org/

#### Motivation and Description of Applications:

As already alluded to above, implementing these kinds of applications in reconfigurable hardware can lead to huge boosts in performance per watt when compared to traditional CPUs. For starters, all of the extra hardware overhead required for issuing and executing instructions in CPUs is simply not needed in a reconfigurable setting because in the latter form, specialized hardware logic can be instantiated to perform the desired computation. Also, more complicated parallel patterns and more fine-grain pipelining can be better exploited in hardware, and memory layout can be customized per application for better performance. But more importantly, at least from a information theoretic point of view, the ideas of low-precision computation and quantization can play more significant roles in this context. We show this by going over a hardware-based implementation of SGD for spam detection and optical flow.

Stochastic Gradient Descent is an iterative method for finding a global or local optimal point in a differentiable objective function by using one training example to calculate the current gradient. More formally, for a loss function $J$ and learning rate $\alpha$, the update equation for every sample $z$ is:

$W = W - \alpha \nabla J(W, b, x^{(z)}, y^{(z)})$

In many machine learning applications such as spam detection, SGD with logistic regression as the cost function are commonly used. Here, for our implementation of the SGD update equation over 5 training epochs, we used an input dataset of 5000 emails, with each email being represented by a 1024-dimensional vector. Although the Spatial implementation of SGD plus logistic regression is mostly standard, there are 2 main aspects that differentiate it from a less energy-efficient implementation: our choice of a customizable datatype for low-precision arithmetic (in the form of fixed-point), and quantization of the logistic regression function.

For many ML-based  and other probabilistic-based kernels, there’s been published research indicating that compressing the bit representation of the weights does not usually affect the overall accuracy of the results. Thus, hardware accelerators can take advantage of it because reducing the bit width of ALU engines does reduce the overall cost of computation from both a runtime and power perspective. In our case, we switched to fixed-point computation, which is a numerical format for representing real numbers. Fixed-point representation is similar to Floating-point representation, except that it covers a less range of fractional numbers, so it’s not as accurate. But it’s much cheaper to implement in hardware, and incurs in much less overall latency.

There is also the concern in how to implement logistic regression in hardware, which can be an expensive logic block (since it requires division and exponentiation). Recall that the logistic regression equation is the following:

However, by looking at the logistic regression plot, one key insight comes up: just the middle region of the graph needs to stored. We can quantize this middle region by precomputing a finite number (e.g. around 2000) of values in this range (between $-4 \leq x \leq 4$) and store them in an on-chip LUT (Look-up Table). Therefore, the actual logistic regression block would look like this:

$sig(x) = \begin{cases} 0 &\quad x < -4 \\ 1 &\quad x > 4 \\ SigLUT(x) &\quad -4 \leq x \leq4 \end{cases}$

Although this will not be as accurate as the actual logistic regression equation, this is much more hardware-friendly.

For our second application, we implemented Optical Flow, which is an object detection-like algorithm that finds motion patterns of objects between consecutive frames (e.g. people in a surveillance video). Below is a diagram of the whole image processing pipeline for optical flow, where each individual block is simply a convolution engine and the whole pipeline is implemented in a streaming fashion (i.e each engine is enabled every time there is new data coming in).

Just like with SGD, we took advantage of low-precision computing here by leveraging fixed-point arithmetic instead of floating-point arithmetic where it was applicable without losing too much accuracy. In fact, all of the compute engines in optical flow with the exception of the last stage (“Compute Flow”) could make use of fixed-point ALUs, and we were able to get significant performance gains in runtime while getting output values close enough to the original ones.

The choice of these applications was inspired by a HLS benchmark suite for software-programmable FPGAs published last year in the FPGA’ 18 conference [2].

#### Results:

For our results, we’re mostly concentrating on overall runtime for each applications. We’ve tested both SGD and optical flow on a Xilinx FPGA running at a 100 Mhz clock frequency. Here are, for example, the runtime values for spam detection on actual hardware (one for a CPU implementation, another one for a High-level Synthesis implementation reported in [2], and one for a Spatial implementation):

The Spatial implementation achieves a ~92% test accuracy, which is close to what’s expected for this dataset. We also see that Spatial achieves the lowest runtime for training using SGD plus logistic regression. This is desirable since reducing overall latency reduces energy footprint.

Now, here are the results for optical flow:

The HLS implementation uses Floating-point arithmetic for all of its computation, but we use Fixed-point arithmetic for most of the stages. The above graph shows how switching to a more compressed datatype representation for decimal numbers can help in reducing the overall latency of an accelerator.

For future work, we plan to measure performance per watt numbers (using a multimeter on the FPGA device) for each application.

#### Conclusion:

Compression and quantization techniques inspired by ideas from information theory have been incredibly useful in the design of more energy-efficient hardware accelerators, and it has quite recently become a commonly studied topic in the field of reconfigurable computing. There’s still a lot of other work in academia and industry that’s being done on this space, including other kinds of applications such as binarized neural networks (where the weights are represented by just 1 bit each). For me, it’s interesting to see how information theory concepts are being applied to make better heterogeneous computing systems.

#### Outreach Activity:

My outreach activity was actually not related to the project described above. For my outreach activity, I had a list of simple English sentences that were intentionally jumbled up (e.g swapped letters, missing letters, letters replaced by other similarly-looking symbols). I’ve had the kids try to read them and see if they could instead understand it despite the intentional misspellings. I believed this would be a fun way of showing how humans can still process and understand information correctly even if that information was not presented in its original form, which is something that has been pointed out in various other studies about the human brain. It’s also a fun way to teach how this is analogous to correcting information that was transmitted through noisy channels.

Here is the list of sentences I brought to the event:

• Ths sentence looks mostly fine, but there is one type. Spot it!
• Thiz sentnce haz intntional mizpelingz, but yu can stil read ti.
• Yuo cna porbalby raed tihs esaliy desptie teh msispeillgns.
• Alth0ugh this sentence l00ks like it doesn’t h@ve any mispellings, there’s 0ne pr0blem with it. Wh@t’s the c0mm0n err0r th@t you see here?

The outreach event was overall a fun experience, and the kids were really into the activities.

#### References:

[1] Koeplinger et al. Spatial: a language and compiler for application accelerators, Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, June 18-22, 2018, Philadelphia, PA, USA

[2] Zhou et al. Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs, Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 25-27, 2018, Monterey, CALIFORNIA, USA

[3] De Sa et al. 2018. High-accuracy Low-Precision Training. Arxiv, 1803.03383.

[4] John L. Hennessy , David A. Patterson, Computer Architecture, Sixth Edition: A Quantitative Approach, Morgan Kaufmann Publishers Inc., San Francisco, CA, 2017