Latency-Critical Inference Serving for Deep Learning

Author : Vinod Vijay Nigade
Promotor(s) : Prof.dr.ir. H.E. Bal / Dr. W. Lin
University : Delft University of Technology
Year of publication : 2023
Link to repository : Link to thesis

Abstract

Deep learning (DL) technology has made remarkable strides in accuracy through increasingly sophisticated and large deep neural networks (DNNs). Yet its adoption in real-world applications remains challenging. DL-based applications, in particular video analytics, impose stringent low-latency and high-accuracy requirements on the DNN inference serving system that manages the deployment of DNNs and their inference. These requirements are hard for inference serving systems to satisfy because they are typically conflicting objectives. The challenge is exacerbated by the ever-increasing size of DNNs and by the diverse set of computing platforms (on-device, edge, and cloud), none of which is best suited for serving DNNs on its own. Exploring the design space of inference serving systems for solutions that satisfy the application requirements, by exploiting large DNNs and the useful features of the various computing platforms, has therefore emerged as an important problem, and it is the one we study in this thesis.

This dissertation presents our study of how to design a latency-critical inference serving system for deep learning that meets the low-latency and high-accuracy requirements of DL-based video analytics, a workload that manifests itself in a large class of modern DL-based applications. The study is organized into four parts, each a step toward a comprehensive solution.

First, we present the design of a hybrid networked system named Clownfish, which deploys a small yet real-time DNN on the end device and a large, accurate DNN in the cloud. By leveraging the temporal correlation in video data, Clownfish enhances on-device analytics with delayed and intermittent feedback from the cloud. Importantly, Clownfish removes the cloud from the critical path of the DNN inference pipeline: the system always operates in real time, bounded only by the latency of the small DNN deployed on the end device.

Second, we drop Clownfish's assumption of temporal correlation and present an edge inference serving system that is more broadly applicable. Specifically, we offload DNN inference to edge servers by streaming video data over the communication network, which can be highly dynamic. A dynamic network leads to variable data transfer times, which ultimately affect the latency of inference requests. To overcome this variability, we adapt the data and the DNN jointly and present a feedback control mechanism that makes these adaptation decisions efficiently. This system applies only to single-client, single-server setups, a limitation we remove in the next part.

Third, we design a scalable edge inference serving system named Jellyfish that serves DNN inference for multiple clients on multiple edge resources. To this end, we propose a collective adaptation technique that performs data and DNN adaptation jointly across all clients.

Finally, we optimize this scalable serving system and propose leveraging dynamic DNNs to avoid the DNN adaptation overhead and to improve batched inference efficiency.

Overall, to enable the adoption of DL technology in broader video analytics applications, we propose solutions for designing DNN inference serving systems that deliver analytics results to multiple users with high accuracy while satisfying their latency requirements.
Based on these explorations, we also present a list of key takeaways and design implications that we hope will be helpful to designers and researchers of DNN inference serving systems.
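Illustrative sketches

To make the ideas above concrete, the sketches below illustrate them in simplified form; they are our illustrations, not code from the thesis. The first sketches Clownfish-style fusion: the small on-device DNN answers every frame in real time, while delayed, intermittent feedback from the large cloud DNN refines those answers with an influence that decays as it grows stale, exploiting temporal correlation between nearby frames. The class names, weights, and decay scheme are hypothetical.

```python
# A minimal sketch (not the actual Clownfish algorithm): fuse real-time
# on-device predictions with delayed, more accurate cloud feedback by
# exploiting temporal correlation between nearby video frames.
import numpy as np

CLOUD_WEIGHT = 0.7  # trust in cloud feedback when fresh (assumed value)
DECAY = 0.9         # per-frame decay of the cloud feedback's influence (assumed)

class FusedClassifier:
    def __init__(self, num_classes):
        self.cloud_probs = np.full(num_classes, 1.0 / num_classes)
        self.staleness = float("inf")  # frames since the last cloud feedback

    def on_cloud_feedback(self, probs):
        # Delayed result from the large cloud DNN; off the critical path.
        self.cloud_probs = np.asarray(probs)
        self.staleness = 0

    def predict(self, local_probs):
        # Only the small on-device DNN is on the critical path, so this
        # always runs in real time; cloud feedback merely refines the
        # result, and its weight decays as it grows stale.
        w = CLOUD_WEIGHT * (DECAY ** self.staleness)
        self.staleness += 1
        fused = (1 - w) * np.asarray(local_probs) + w * self.cloud_probs
        return fused / fused.sum()
```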
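The second sketch shows joint data and DNN adaptation under a latency budget, in the spirit of the feedback control mechanism described above: the controller re-estimates the available bandwidth from recent transfers and picks the most accurate (input resolution, DNN variant) pair whose estimated transfer-plus-inference time still fits the latency SLO. The configuration table, frame-size estimate, and profiled numbers are hypothetical.

```python
# A minimal sketch of joint data/DNN adaptation under a latency SLO.
# Candidate (input resolution, DNN variant) pairs, ordered from cheapest
# and least accurate to most expensive and most accurate; inference_ms
# and accuracy are assumed profiling numbers.
CONFIGS = [
    {"resolution": 224, "dnn": "small",  "inference_ms": 8,  "accuracy": 0.62},
    {"resolution": 384, "dnn": "medium", "inference_ms": 18, "accuracy": 0.71},
    {"resolution": 512, "dnn": "large",  "inference_ms": 35, "accuracy": 0.78},
]

def bytes_per_frame(resolution, bytes_per_pixel=1.5):
    # Rough compressed-frame size estimate; purely illustrative.
    return resolution * resolution * bytes_per_pixel

def choose_config(slo_ms, bandwidth_bps):
    # Pick the most accurate config whose estimated end-to-end latency
    # (network transfer + inference) still fits the SLO; otherwise fall
    # back to the cheapest config.
    best = CONFIGS[0]
    for cfg in CONFIGS:
        transfer_ms = 8000.0 * bytes_per_frame(cfg["resolution"]) / bandwidth_bps
        if transfer_ms + cfg["inference_ms"] <= slo_ms:
            best = max(best, cfg, key=lambda c: c["accuracy"])
    return best

def control_step(slo_ms, recent_bytes, recent_seconds):
    # Feedback loop: estimate bandwidth from recently transferred bytes
    # and adapt both the data (resolution) and the DNN in one decision.
    bandwidth_bps = 8.0 * sum(recent_bytes) / max(recent_seconds, 1e-6)
    return choose_config(slo_ms, bandwidth_bps)
```

A Jellyfish-style system would make this decision collectively rather than per client, matching many clients' latency budgets to DNN variants on shared edge resources so that requests with similar deadlines can be batched together.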
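The last sketch shows one way a dynamic DNN can remove the adaptation overhead: with an early-exit network (PyTorch-style, with illustrative layer sizes and threshold), switching between a cheap and an accurate operating point becomes a per-request branch inside one model rather than a model swap.

```python
# A minimal sketch of a dynamic (early-exit) DNN: "DNN adaptation"
# becomes a runtime knob instead of loading a different model.
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.exit1 = nn.Linear(16, num_classes)   # cheap, less accurate head
        self.block2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.exit2 = nn.Linear(32, num_classes)   # costly, more accurate head

    def forward(self, x, budget_ms):
        h = self.block1(x)
        if budget_ms < 10:  # illustrative threshold for a tight budget
            return self.exit1(h.mean(dim=(2, 3)))  # exit early
        h = self.block2(h)
        return self.exit2(h.mean(dim=(2, 3)))
```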