Deep Learning Model Optimization for Low-Compute Environments

Abhilash Gedela
Feb 17, 2022

While training usually takes place on a server or desktop computer, deep learning models are deployed on a variety of frontend devices, such as mobile phones, self-driving cars, and Internet-of-Things (IoT) devices. With the limited computing power available there, performance optimization becomes paramount.

Optimizing a model for speed may allow it to run in real time, opening up many new uses. Improving a model’s accuracy by even a few percent may make the difference between a toy model and a real-life application. Another important characteristic is size, which impacts how much storage the model will use and how long it will take to download it. For some platforms, such as mobile phones or web browsers, the size of the model matters to the end user.

Fortunately, modern machine learning frameworks such as TensorFlow attempt to help machine learning engineers here. Through extensions such as the TensorFlow Model Optimization Toolkit, the following methods can be used for model optimization:

  • Weight Pruning
  • Quantization
  • Weight Clustering

Weight Pruning:

(Figure source: https://arxiv.org/abs/1611.06440)

Magnitude-based weight pruning gradually zeroes out model weights during the training process to achieve model sparsity. Sparse models are easier to compress, and we can skip the zeroes during inference for latency improvements. In practice, this yields substantial model compression with minimal loss of accuracy.
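As a minimal sketch of how this looks with the toolkit's Keras API (assuming a trained Keras model `model` and training data `x_train`/`y_train`, which are placeholders here):

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Gradually ramp sparsity from 0% to 50% over the first 1,000 steps,
# zeroing out the lowest-magnitude weights along the way.
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=1000,
    )
}
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])

# UpdatePruningStep advances the pruning schedule each training step.
pruned_model.fit(x_train, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export; the zeroed weights remain.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```

The stripped model has the same architecture as the original, but with many weights set exactly to zero, so standard compression (e.g. gzip on the saved file) shrinks it considerably.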

Quantization:

Quantization is a method to bring a neural network down to a reasonable size while still achieving high accuracy. This is especially important for on-device applications, where memory and the number of computations are necessarily limited. Quantization for deep learning is the process of approximating a neural network that uses floating-point numbers with one that uses low bit-width numbers. This dramatically reduces both the memory requirement and the computational cost of running the network.
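To build intuition, here is a tiny standalone sketch of an affine (asymmetric) quantization scheme, where real_value ≈ (quantized_value − zero_point) × scale. The weight values are made up for illustration:

```python
import numpy as np

# Hypothetical float weights to be quantized to int8.
weights = np.array([-1.2, 0.0, 0.37, 2.5], dtype=np.float32)

# Derive scale and zero point from the observed value range.
qmin, qmax = -128, 127
scale = (weights.max() - weights.min()) / (qmax - qmin)
zero_point = int(round(qmin - weights.min() / scale))

# Quantize: float -> int8, then dequantize to see the approximation error.
q = np.clip(np.round(weights / scale) + zero_point, qmin, qmax).astype(np.int8)
dq = (q.astype(np.float32) - zero_point) * scale

print(q)   # int8 representation
print(dq)  # close to the original weights, within one scale step
```

Each float now occupies a single byte instead of four, and integer arithmetic is typically much cheaper than floating point on mobile hardware.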

Post-training quantization: Post-training quantization includes general techniques to reduce CPU and hardware accelerator latency, processing, power, and model size with little degradation in model accuracy. These techniques can be performed on an already-trained float TensorFlow model and applied during TensorFlow Lite conversion. These techniques are enabled as options in the TensorFlow Lite converter.
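With TensorFlow Lite, the basic form of post-training quantization is just a converter option (assuming a trained Keras model `model`, a placeholder here):

```python
import tensorflow as tf

# Convert an already-trained float Keras model to TensorFlow Lite,
# with post-training quantization enabled as a converter option.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```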

Quantization-aware training: Quantization-aware training emulates inference-time quantization during training, creating a model that downstream tools then use to produce actually quantized models. The quantized models use lower precision (e.g. 8-bit integers instead of 32-bit floats), leading to benefits during deployment.
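A minimal sketch with the toolkit (again assuming placeholder `model`, `x_train`, and `y_train`):

```python
import tensorflow_model_optimization as tfmot

# Wrap the model with fake-quantization ops so that training
# experiences (and adapts to) the rounding error of int8 inference.
qat_model = tfmot.quantization.keras.quantize_model(model)

qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(x_train, y_train, epochs=1)
```

The fine-tuned model can then be passed through the TensorFlow Lite converter shown above to obtain an actually quantized model.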

Weight Clustering:

If weights are represented numerically, it is possible to apply clustering techniques to them in order to identify groups of similar weights. This is precisely how weight clustering for model optimization works. By applying a clustering technique, it is possible to reduce the number of unique weights that are present in a machine learning model.

How this works is as follows. First of all, you need a trained model, i.e. a system of weights that can successfully generate predictions. Applying weight-clustering-based optimization to this model involves grouping the weights of each layer into N clusters, where N is configurable by the machine learning engineer. This grouping is performed using a clustering algorithm.

Given a cluster of weights, it is possible to compute a value that represents the middle of the cluster. This value is called a centroid, and it plays a big role in clustering-based model optimization. Here's why: the centroid is effectively the 'average value' of all the weights in that particular cluster. If you nudge one weight in the cluster towards the centroid, and nudge another weight in the same cluster away by a similar amount, one could argue that, holistically, i.e. from a systems perspective, the model shouldn't lose too much of its predictive power. The small sketch below makes this concrete.
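Here is a toy, framework-free illustration of the idea, with made-up weight values (this is not the toolkit's actual implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

# Eight hypothetical weights that roughly form three groups.
weights = np.array([0.02, -0.45, 0.51, 0.49,
                    -0.50, 0.01, 0.48, -0.47]).reshape(-1, 1)

# Group the weights into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(weights)

# Replace every weight with the centroid of its assigned cluster.
clustered = kmeans.cluster_centers_[kmeans.labels_].flatten()
print(np.unique(clustered))  # only three distinct values remain
```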

And that's precisely what weight-clustering-based optimization does. Once clusters are computed, all weights in a cluster are set to the cluster's centroid value. This brings benefits in terms of model compression: repeated values compress much better, without losing much predictive performance in the machine learning model.
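With the toolkit, the same idea looks roughly like this (assuming the placeholder `model`, `x_train`, and `y_train` again; 16 clusters is an arbitrary choice):

```python
import tensorflow_model_optimization as tfmot

# Cluster each supported layer's weights into 16 groups; every weight
# is replaced by its cluster's centroid, leaving 16 unique values per layer.
clustering_params = {
    "number_of_clusters": 16,
    "cluster_centroids_init":
        tfmot.clustering.keras.CentroidInitialization.LINEAR,
}
clustered_model = tfmot.clustering.keras.cluster_weights(model, **clustering_params)

clustered_model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
# A short fine-tuning pass lets the centroids adjust.
clustered_model.fit(x_train, y_train, epochs=1)

# Remove the clustering wrappers before export.
final_model = tfmot.clustering.keras.strip_clustering(clustered_model)
```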

Conclusion:

This article demonstrated how models can be optimized so that they can be stored and run more efficiently. I hope you have learnt a lot about model optimization from this post. If you have any questions or remarks, please feel free to leave a comment below.

References:

  1. https://www.tensorflow.org/model_optimization
  2. https://ai.googleblog.com/2021/03/accelerating-neural-networks-on-mobile.html
  3. https://blog.tensorflow.org/2020/07/accelerating-tensorflow-lite-xnnpack-integration.html
