Keypoint-Centric Video Processing for Reducing Net Latency in Video Streaming

Journal for High Schoolers, 2020

Authors

Renee Liang, Huong Nguyen, Carina Chiu

Abstract

COVID-19 has made video communications an important mode of information exchange. Consequently, research advancement in this field will contribute to eventual societal normalcy by promoting virtual collaboration while maintaining social distancing. The streaming process, however, still suffers from latency problems. Our group focuses on latencies related to live video conferencing. Specifically, the purpose of our research is to reduce latency in video streams by exploring different configurations of the streaming pipeline and different video encoding schemes. This research presents an opportunity to examine existing industry-level video communication tools, as well as the technical workings behind image segmentation as it relates to the condensation of a video stream. By extracting from each frame only the key points needed to reconstruct an animation on the client end, we achieve a drastic condensation of transmitted data. Under test conditions, we show that this mode of video encoding can function at a much lower network bandwidth than the conventional model.

Our results, as well as a demo, can be found here: https://github.com/roshanprabhakar/pose-animator/tree/master

Background

The typical video codec presents a schema for the compression of video data as a whole, without biasing toward any one portion of the data feed. The compressed data is sent over a network to a receiving end, where a decoding algorithm reconstructs a representation of the original feed from the streamed data. It is important to note that the reconstruction process does not necessarily recreate the original feed; in most cases, it is more practical to create a similar representation of the original, because not all information contained within a video frame is equally important. Video codecs take advantage of this property and prioritize the preservation of the more important aspects of a feed over others in the compression/decompression process.

The video codec is a necessity in the streaming pipeline, a pipeline which has long been standardized. Data is read and encoded from a video feed, then streamed across a network to another computer, where that data is decoded and rendered. Numerous codecs and optimizations have been developed to reduce the amount of data that must be sent, for the purpose of reducing net latency in streaming while maintaining as much information as possible in the data stream. This project focuses on minimizing the latencies inherent in this process by implementing a novel type of video codec which shifts net latency from data streaming to client/server-end computation.

By increasing the codec computations, a drastic decrease in net streamed data is achieved. This idea relies on the assumption that some information contained within a video frame is significantly more important than everything else. In our implementation, this corresponds to the keypoint locations of a human presence. By extracting only the information which pertains to the human figure, we can represent the important content of the original frame with drastically less data. Thus, the encoding process becomes the scraping of human-related data, and the decoding process becomes a reconstruction of this figure, in the form of an animation, from the scraped data.

Inherent in this methodology is the possibility that the scraping of human-related data and the reconstruction of an animation pose a steep addition to the net latency. Industry applications of face-oriented animation demonstrate, however, that the deconstruction/animation process can be achieved in real time, and that the only latencies to affect performance would be those induced by the streaming process.

Methods and Materials

This project features the integration of two main technologies: WebRTC and Pose-Animator. 

WebRTC (Web Real-Time Communications) is a web framework developed to allow real-time peer-to-peer communication within the browser. Upon the completion of a signaling process, which connects two RTCPeerConnection nodes, WebRTC enables the streaming of data straight from a transmitting node to a receiving node without a pipeline intermediary.
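
The signaling exchange can be sketched as follows. This is a minimal TypeScript illustration, not the project's exact code: sendToPeer and onSignal are hypothetical stand-ins for whatever out-of-band transport (for example, a WebSocket server) relays the offer, answer, and ICE candidates before the direct connection exists, and the STUN server shown is only an illustrative choice.

```typescript
// Hypothetical signaling transport: anything that can relay small messages
// between the two peers before the direct connection is established.
declare function sendToPeer(msg: object): void;
declare function onSignal(handler: (msg: any) => void): void;

const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

// Forward locally discovered ICE candidates to the remote peer.
pc.onicecandidate = (event) => {
  if (event.candidate) sendToPeer({ candidate: event.candidate });
};

// Caller: create an SDP offer and send it through the signaling channel.
async function call(): Promise<void> {
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  sendToPeer({ sdp: pc.localDescription });
}

// Both peers: apply whatever the other side sends back.
onSignal(async (msg) => {
  if (msg.sdp) {
    await pc.setRemoteDescription(new RTCSessionDescription(msg.sdp));
    if (msg.sdp.type === "offer") {
      const answer = await pc.createAnswer();
      await pc.setLocalDescription(answer);
      sendToPeer({ sdp: pc.localDescription });
    }
  } else if (msg.candidate) {
    await pc.addIceCandidate(new RTCIceCandidate(msg.candidate));
  }
});
```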

WebRTC provides three major services:

  • getUserMedia: provides application access to user media devices (e.g. microphone, webcam)
  • RTCPeerConnection: a connection API which allows for the transmission of audio and video data. This object handles signal processing, codec management, and bandwidth management.
  • RTCDataChannel: a connection API which allows for the transmission of arbitrary data, with no WebRTC-managed encoding or manipulation of that data.

In our project, we use an RTCPeerConnection media stream to simulate the conventional video stream, which serves as our comparative control, and an RTCDataChannel to handle all transmissions of video key points, which serves as our experiment.
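
The two configurations can be sketched roughly as below (again in TypeScript, with pc being an already-signaled RTCPeerConnection as in the previous sketch; the data-channel options shown are our own assumption suited to real-time delivery, not a documented project setting).

```typescript
// Control: the conventional stream. The raw webcam track is handed to
// RTCPeerConnection, which applies its own codec and bandwidth management.
async function startControlStream(pc: RTCPeerConnection): Promise<void> {
  const media = await navigator.mediaDevices.getUserMedia({ video: true });
  media.getTracks().forEach((track) => pc.addTrack(track, media));
}

// Experiment: a data channel that carries only the key-point payloads.
// WebRTC performs no encoding or interpretation of this data.
function openKeypointChannel(pc: RTCPeerConnection): RTCDataChannel {
  // Unordered, no retransmits: stale pose frames are better dropped than delayed.
  return pc.createDataChannel("keypoints", { ordered: false, maxRetransmits: 0 });
}
```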

Pose-Animator is an open-source project developed for the animation of a human video feed. The animator works by extracting the locations of key points within the feed (‘nose’, ‘leftEye’, ‘rightEye’, ‘leftEar’, ‘rightEar’, ‘leftShoulder’, ‘rightShoulder’, ‘leftElbow’, ‘rightElbow’, ‘leftWrist’, ‘rightWrist’, ‘leftHip’, ‘rightHip’, ‘leftKnee’, ‘rightKnee’, ‘leftAnkle’, ‘rightAnkle’) and then projecting these locations onto an animation. The process of extracting these key points for each frame in a packet of frames is TensorFlow-enabled and requires considerably more computation than the encoding of a video packet by any standardized video codec today.

In this project, the process of extracting key-point information becomes the video encoding process: for the intended use cases, only those key points are needed to fully represent the video feed. The process of projecting these key points onto an animation becomes the decoding process.

By streaming Pose-Animator key points through an RTCDataChannel, we effectively simulate the implementation of this type of video codec.
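
The encode/decode loop can be illustrated as below. This is a simplified sketch that assumes the PoseNet model underlying Pose-Animator's key-point extraction; drawAnimationFrame is a hypothetical stand-in for Pose-Animator's rendering step, and JSON is used here only for readability, so the demo's actual wire format may differ.

```typescript
import * as posenet from "@tensorflow-models/posenet";

// "Encoding": run pose estimation on each webcam frame and transmit only
// the resulting key points over the data channel.
async function streamKeypoints(
  video: HTMLVideoElement,
  channel: RTCDataChannel
): Promise<void> {
  const net = await posenet.load();
  const tick = async () => {
    const pose = await net.estimateSinglePose(video, { flipHorizontal: true });
    if (channel.readyState === "open") {
      channel.send(
        JSON.stringify({
          score: pose.score,
          keypoints: pose.keypoints.map((k) => ({
            part: k.part,
            x: k.position.x,
            y: k.position.y,
            score: k.score,
          })),
        })
      );
    }
    requestAnimationFrame(tick);
  };
  tick();
}

// "Decoding": each received key-point set is re-projected onto the animation.
declare function drawAnimationFrame(pose: object): void; // stand-in for the rendering step
function receiveKeypoints(channel: RTCDataChannel): void {
  channel.onmessage = (event) => drawAnimationFrame(JSON.parse(event.data));
}
```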

[Figure: the respective pipelines of the conventional and key-point-based streaming configurations.]

Results

The demo can be found here: https://github.com/roshanprabhakar/pose-animator/tree/master

Conventional video streaming configuration, unbounded bandwidth

Streaming enabled by RTCPeerConnection streams. SDP Candidates, video codec, video metadata managed entirely by the stream nodes. 

Maximum non-outlying bitrate: 1752 kilobits/second

Conventional video streaming configuration, bounded bandwidth

Streaming enabled by RTCPeerConnection streams. SDP Candidates, video codec, video metadata managed by the stream nodes. SDP bandwidth limit introduced at 20 kilobits/second.

Maximum non-outlying bitrate: 18.5 kilobits/second

Pose-Animator stream configuration, unbounded bandwidth

Pose-Animator data structures deconstructed and transmitted through an RTCDataChannel. Transmission bandwidth is unbounded.

Maximum non-outlying bitrate: 15.81 kilobits/second

Discussion and Conclusions

The novel streaming pipeline clearly performs well with much lower bandwidth requirements than conventional video streaming methods. By enabling a high level of stream performance over a channel with poor bandwidth, this method of streaming could support real-time communication around the world, assuming a certain level of computational capacity at the local ends of the channel.

For every non-redundant pre-processed frame, the Pose Animator library generates a list of key points depicting the location of key features of the human skeleton. This data structure contains a list of 17 key points, each associated with 2 position values and a single confidence value. With the positions representable by 16-bit integers and the confidences represented by 32-bit IEEE floating point numbers (along with a 32-bit IEEE floating point representing the general confidence of the cumulative key point set), each encoded frame can be represented with 1.12 kilobits of information. At a frame rate of roughly 15 frames per second (fluctuation dependent on the speed of the pose-animator’s deconstruction/reconstruction of frame information), this comes out to around 16.8 kb/s of bandwidth consumption. As is clear from the demonstration, this rate fluctuates with the changing encoding/decoding speed but maintains a general bitrate of about 16.8 kb/s, a great improvement on the ~1700 kb/s consumed by the conventional stream.
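
The byte budget above can be made concrete with a small packing sketch. This is our own illustration of one possible binary layout consistent with the figures quoted, not necessarily the demo's exact format.

```typescript
interface Keypoint { x: number; y: number; score: number; }

// Pack one encoded frame: a 32-bit overall confidence, then 17 key points,
// each as two 16-bit positions and a 32-bit confidence.
// 32 + 17 * (2 * 16 + 32) = 1,120 bits = 140 bytes per frame.
function packFrame(poseScore: number, keypoints: Keypoint[]): ArrayBuffer {
  const buffer = new ArrayBuffer(4 + keypoints.length * 8); // 140 bytes for 17 points
  const view = new DataView(buffer);
  view.setFloat32(0, poseScore);
  keypoints.forEach((k, i) => {
    const offset = 4 + i * 8;
    view.setUint16(offset, Math.round(k.x));     // 16-bit x position
    view.setUint16(offset + 2, Math.round(k.y)); // 16-bit y position
    view.setFloat32(offset + 4, k.score);        // 32-bit confidence
  });
  return buffer;
}

// At roughly 15 frames per second: 140 bytes * 8 bits * 15 = 16,800 bits/s ≈ 16.8 kb/s.
```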

It is also important to note that the encoded representation of each frame produced by Pose-Animator does not depend on the resolution of the frame feed: for any frame dimensions, Pose-Animator generates a constant encoded frame size. Implicitly, this means that the network bandwidth required to maintain a real-time transmission with Pose-Animator does not depend on the amount of captured data.

In comparison, the conventional stream requires a bandwidth which fluctuates with the resolution of the incoming video feed. Due to the nature of temporal codecs, the size of transmitted delta-frames depends on the amount of captured data, so the required bandwidth is directly dependent on the size of the incoming data.

There is a disadvantage inherent in this lack of relationship between the size of the incoming data and the size of the encoded data in the Pose-Animator channel. Specifically, there exists an input data size below which the conventional encoding schema generates an encoded byte stream smaller than the Pose-Animator byte stream, at which point the conventional stream becomes the preferable pipeline channel. However, it is clear from the demonstrations that this crossover point occurs at extremely small feed dimensions, dimensions at which it would be impractical to transmit data about the human figure anyway.

Our project clearly cannot replace the conventional standard for video streaming: after all, not every use case's important frame content depends on a human presence. Nevertheless, by proving that drastically lower bitrates can be achieved when the encoded frame representation consists only of keypoint positions and confidences, this project proposes a considerable alternative to the conventional streaming method.

Future Directions 

Through our project, we aim to contribute to the theory of streaming services in which animation synchronization is used to improve the quality and efficiency of video communication. Specifically, we propose a configuration of encoding, transmitting, and decoding which shifts the net required computational resources from the transmission of data to the pre- and post-transmission processing of data, with the ultimate goal of drastically reducing the size of the transmitted information by streaming key points.

Our contributions to live streaming and conferencing will hopefully make an impact on various fields affected by the COVID-19 pandemic. An animator of this kind could be greatly useful to people in the performing arts; for example, it could drive animated puppetry in a theatre performance. This project demonstrates that such network communication can be achieved with low bandwidths and therefore in near real time. Furthermore, this type of conferencing could benefit business and general work productivity, as it proposes a channel that can reliably operate in real time under stressed network conditions.

Acknowledgements 

We would like to acknowledge Professor Tsachy Weissman for his guidance in providing the necessary feedback and materials. We would also like to thank Roshan Prabhakar for his tremendous support and mentorship to our team. Also, we would like to thank Shubham Chandak for helping us, giving us much beneficial feedback, and guiding us. Many thanks to Zachary Hoffman and Kedar Tatwawadi for the help and the resources they have provided us. 

References

[1] Yemount. “Yemount/Pose-Animator.” GitHub, github.com/yemount/pose-animator.

[2] “Pose Animator – An Open Source Tool to Bring SVG Characters to Life in the Browser via Motion Capture.” The TensorFlow Blog, blog.tensorflow.org/2020/05/pose-animator-open-source-tool-to-bring-svg-characters-to-life.html.

[3] Saragih, Jason M, et al. “Real-Time Avatar Animation from a Single Image.” Proceedings of the … International Conference on Automatic Face and Gesture Recognition. IEEE International Conference on Automatic Face & Gesture Recognition, U.S. National Library of Medicine, 2011, www.ncbi.nlm.nih.gov/pmc/articles/PMC3935737/.

[4] Ververica. The Significance of Stream Processing, www.ververica.com/what-is-stream-processing.

[5] “List of Interface Bit Rates.” Wikipedia, Wikimedia Foundation, 16 July 2020, en.wikipedia.org/wiki/List_of_interface_bit_rates.
