Arjun Barrett, Arz Bshara, Laura Gomezjurado González, Shuvam Mukherjee
Abstract
In the wake of the SARS-CoV-2 pandemic, video conferencing has become a critical part of daily life around the world and now represents a major proportion of all internet traffic. Popular commercial video conferencing services like Zoom, Google Meet, and Microsoft Teams enable virtual company meetings, interactive online education, and a plethora of other applications. However, these existing platforms typically require consistent internet connections with several megabits per second of bandwidth per user, despite the use of state-of-the-art video compression techniques like H.264 and VP9. These requirements pose a major challenge to global multimedia access, particularly in underserved regions with limited, inconsistent internet connectivity. We implement an optimized version of the Txt2Vid video compression pipeline, which synthesizes video by lip-syncing existing footage to text-to-speech results from deep-fake voice clones. Our implementation is available as an accessible web application for omnidirectional, ultra-low bandwidth (~100 bits per second) video conferencing. We employ a novel architecture utilizing ONNX Runtime Web, an efficient neural network inference engine, along with WebGL-based GPU acceleration and fine-tuned preprocessing logic to enable real-time video decoding on typical consumer devices without specialized CUDA acceleration hardware. We evaluate the perceived quality-of-experience (QoE) of the platform versus traditional video compression techniques and alternative video conferencing programs via a subjective study involving real-world contexts and a wide range of subject demographics, including participants from potential markets such as Colombia. The promising QoE results and low hardware requirements of our platform indicate its real-world applicability as a means of bringing high-quality video conferencing to developing regions with poor internet connectivity.
Introduction
In the last decade, the growing success of the internet and the mobile electronics industry has revolutionized global communications. Alternatives to the telephone are skyrocketing in popularity, and brand new solutions are introducing video as a means of improving communication (Fernández et al, 2014). In fact, mobile video traffic now accounts for more than half of all mobile data traffic, and it was predicted that nearly four-fifths of the world's mobile data traffic would be video by 2022 (Cisco, 2017).
Indeed, video conferencing solutions have opened the door to new applications within a wide variety of fields: from remote expertise applications such as medicine or law, to corporate, work, and home environments (Biello, 2009). With the advent of global lockdowns caused by the COVID-19 pandemic, video conferencing traffic increased by approximately 2-3 times (Silva et al, 2021). The daily use of traditional video conferencing services and applications such as Zoom, Microsoft Teams, and Google Meet experienced dramatic growth, as shown in Figure 1.

In the midst of the lockdown, the implementation of video conferencing tools became more critical than ever by directly influencing basic human needs such as access to health, work and education. For millions of people around the world, video conferencing tools became the only way to attend classes.
While a switch to online life has worked well for developed nations, it has not been so successful in the developing nations of the world. Two thirds of the world’s school-age children experience poor connectivity and lack of internet access, which has impeded their education especially during the pandemic (UNICEF, 2020). Even before the COVID-19 lockdown, the lack of infrastructure and transportation in low-income regions made it difficult for millions of people to access health and education services in person, as doing so would require them to walk many miles and cross dangerous geographical barriers (Portafolio, 2022). Even beyond the pandemic, accessibility to video conferencing solutions would make it possible to bring remote education, health and work opportunities to historically disconnected areas around the world.
However, traditional video conferencing services such as Zoom and Google Meet generally require stable internet connections, often with a consistent bandwidth of hundreds of kilobits per second per user (100,000+ bps), despite the use of state-of-the-art video compression techniques such as H.264, VP9, and AV1. The lack of global access to high-quality internet connectivity is exacerbating pre-existing social and economic inequalities worldwide: without a low-bandwidth alternative, video conferencing will remain inaccessible to many underserved regions.
These factors indicate that the implementation and real-world evaluation of an ultra-low bandwidth video conferencing platform would represent a new world of possibilities for video conferencing communications and could help reduce inequity in health, education, and work opportunities worldwide.
Background and Literature Review
Video compression is a core technology that enables delivery and consumption of video. The term "bandwidth" characterizes the amount of data that a network can transfer per unit of time; in other words, the maximum amount of data an internet connection can handle (R. Prasad et al, 2003). Typically, the maximum data rate transmitted is measured in bits per second (bps). The amount of bandwidth required for raw uncompressed video is so high that without compression, delivery and consumption over even a high-end network connection would not be possible. For example, an uncompressed full HD video stream at 30 frames per second with 8-bit 4:2:0 encoding (12 bits per pixel) requires roughly 746 Mbps, which is impractical to deliver even over the best networks. Most broadband connections to homes in the United States and other developed regions have lower bandwidth than that, and the number is substantially lower for developing and under-developed regions. Video streamed today is often compressed at ratios of at least 100:1 using sophisticated compression technologies.
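As a concrete check of this figure, the raw bitrate follows directly from the frame dimensions, frame rate, and bits per pixel (8-bit 4:2:0 video carries 12 bits per pixel on average):

```latex
R_{\text{raw}} = W \times H \times f \times b
             = 1920 \times 1080 \times 30\,\text{fps} \times 12\,\text{bits/pixel}
             \approx 7.46 \times 10^{8}\ \text{bits/s} \approx 746\ \text{Mbps}
```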
Codecs are compression technologies with two primary components: an encoder that compresses a file or video stream, and a decoder that decompresses it. Codecs are used extensively in commercial video conferencing platforms and are therefore the main technology behind video conferencing communications. A variety of state-of-the-art video codecs exist, including VP8, VP9, and AV1; they are used when streaming, sending, or uploading videos. The purpose of these codecs is to compress videos to reduce the required bitrate while keeping quality high (THEOplayer). VP8 is an open-source, royalty-free video codec released in 2010 as part of the WebM project. VP9, the next-generation codec from the WebM project, was released in 2013 and achieves roughly 40% to 50% greater compression than VP8 at similar quality. Today, the vast majority of YouTube videos are compressed with VP9, and video conferencing platforms such as Google Meet use both VP8 and VP9 extensively. Building on the success of VP8 and VP9, an industry consortium known as the Alliance for Open Media was established in 2015 to further advance royalty-free video codec development and deployment. The first codec developed by this consortium, released in 2018, was AV1 (Y. Chen et al, 2018; J. Han et al, 2021). AV1 achieves about 30% to 35% more compression than VP9 at similar quality, and today many videos on popular streaming platforms are likely to use AV1. MPEG released its latest codec, Versatile Video Coding (VVC), in 2020, which surpassed AV1 by about 15% in compression efficiency according to most estimates. While VVC is extremely sophisticated, its higher encoding complexity makes its usage in video conferencing applications somewhat limited.
Even with the latest advances in video codec technology over the last decade, video and audio streamed for videoconferencing applications often require hundreds of kilobits per second (kbps) of bandwidth for acceptable quality of experience. This is of course much higher than what can be supported over the internet infrastructure in developing and under-developed regions in the world. Therefore, we created a platform that would not even need to send a live video feed while still providing a high quality of experience.
Recent advances in artificial intelligence (AI) and video compression have opened the door to new video conferencing techniques such as Txt2Vid, which utilizes a custom compression pipeline that decreases the conventionally required bandwidth for video conferencing.

Txt2Vid, originally developed by Tandon et al. (2021), is a novel video compression pipeline that dramatically reduces data transmission rates by compressing webcam videos to a text transcript. Essentially, Txt2Vid synthesizes video by lip-syncing existing footage to text-to-speech results from deep-fake voice clones, as shown in Figure 2.

The pipeline extracts a driving video and audio at the encoder, assigns them to a user identifier (ID), and transforms the audio into text to be sent to the decoder. The decoder uses a voice cloning text-to-speech ("TTS") engine (Resemble.ai) to convert the text to speech, and applies a lip-syncing model called Wav2Lip to the synthesized speech and the driving video to generate the reconstructed video, as shown in Figure 3. Effectively, the text is transmitted and decoded into a realistic recreation of the original video using a variety of deep learning models. While conventional approaches generally require over 100 kilobits per second (kbps) of bandwidth, Txt2Vid requires only 100 bits per second (bps) beyond the initial driving video for similar audiovisual quality. Therefore, Txt2Vid achieves roughly 1000 times better compression than state-of-the-art audio-video codecs.
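The following sketch summarizes this decode path; synthesizeSpeech() and lipSync() are placeholder stages standing in for the voice-cloning TTS service and the Wav2Lip model, not real APIs.

```typescript
// Placeholder declarations for the two deep-learning stages; in practice these
// call a voice-cloning TTS service and a Wav2Lip inference session.
declare function synthesizeSpeech(text: string, speakerId: string): Promise<Float32Array>;
declare function lipSync(drivingFrames: ImageData[], speech: Float32Array): Promise<ImageData[]>;

interface DrivingVideo {
  frames: ImageData[]; // short clip of the speaker cached at the start of the call
  speakerId: string;   // identifies the cloned voice associated with this peer
}

// Reconstruct audio and video for one utterance from its text transcript,
// which is the only per-utterance data actually transmitted (~100 bps).
async function decodeUtterance(
  transcript: string,
  driver: DrivingVideo
): Promise<{ speech: Float32Array; video: ImageData[] }> {
  const speech = await synthesizeSpeech(transcript, driver.speakerId); // TTS with cloned voice
  const video = await lipSync(driver.frames, speech);                  // lip-sync the driving video
  return { speech, video };
}
```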
Although this technique has proven to be highly effective for ultra-low bandwidth contexts, it required a software development toolkit to run and was not integrated into an easy-to-use interface. It also required an expensive NVIDIA graphics processing unit (GPU) with the CUDA toolkit. A more accessible solution with lower computational complexity was needed to bring the final product to the consumer market.
We therefore investigated a means of integrating the Txt2Vid pipeline into an optimized user-facing application.
Methodology
Our research focuses on both integrating Txt2Vid into a program with an accessible user interface and evaluating the result. For our implementation, we engineered a custom WebRTC-based web application. Our evaluation consisted of a subjective study that measures the real-world applicability of our web application by qualitatively comparing our implementation to a traditional video conferencing platform based on the AV1 video codec.
Implementation
Although traditional video codecs such as VP8, VP9, and AV1 offer high quality video with a high-bandwidth connection, they have proven to perform poorly when bandwidth is limited. In order to enable anyone to take advantage of the quality and bandwidth improvements that the Txt2Vid methodology offers, we implemented Txt2Vid into a web application and utilized a variety of technologies to make it accessible even in low-income regions.
First, we utilize WebRTC to establish our peer-to-peer data channels and share driving videos. WebRTC enables us to use a high-quality compressed video stream that dynamically responds to changes in network conditions when sufficient bandwidth is available. Alternatively, if the website is opened on a device with a poor network connection, our application detects that audiovisual quality will suffer and seamlessly switches from WebRTC's RTP-based video exchange to a Txt2Vid scheme backed by efficient WebRTC data channels. The new Txt2Vid session uses a driving video recorded from the last 5 seconds of the RTP session, practically eliminating the initial driving video overhead. With a WebRTC connection established, the application can share text transcripts of speech instead of the full video with all the other peers in a call, reducing the required bandwidth from over 100 kbps to 100 bps.
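The sketch below illustrates this fallback logic under assumed names and thresholds (the data channel label, the 100 kbps cutoff, and the message format are placeholders, not the application's actual code); it polls the WebRTC statistics API and switches to transcript-only transmission when the estimated outgoing bandwidth drops too low.

```typescript
// Minimal sketch of bandwidth-based fallback from live video to a Txt2Vid
// text channel. Not the actual application code.
const pc = new RTCPeerConnection();
const textChannel = pc.createDataChannel('txt2vid-transcripts');

const LOW_BANDWIDTH_BPS = 100_000; // assumed cutoff for acceptable live video

async function checkBandwidthAndMaybeSwitch(): Promise<void> {
  const stats = await pc.getStats();
  stats.forEach((report) => {
    // 'availableOutgoingBitrate' is reported on the active candidate pair.
    if (report.type === 'candidate-pair' && report.availableOutgoingBitrate) {
      if (report.availableOutgoingBitrate < LOW_BANDWIDTH_BPS) {
        // Stop sending encoded video frames...
        pc.getSenders()
          .filter((sender) => sender.track?.kind === 'video')
          .forEach((sender) => {
            if (sender.track) sender.track.enabled = false;
          });
        // ...and signal peers to expect speech transcripts instead.
        if (textChannel.readyState === 'open') {
          textChannel.send(JSON.stringify({ type: 'mode', value: 'txt2vid' }));
        }
      }
    }
  });
}

// Poll periodically; the real application also reacts to network events.
setInterval(checkBandwidthAndMaybeSwitch, 2000);
```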
Although WebRTC helped create peer-to-peer connections, we implemented the full Txt2Vid scheme from scratch on the next layer of the application stack. After receiving a text transcript from a peer,
our application uses the Resemble.ai speech synthesis engine to create a text-to-speech result mimicking the phonetic qualities of the original speaker. It then lip-syncs the driving video to the text-to-speech output via Wav2Lip. Since the lip-syncing model typically requires a high-performance computer to run in real time, we enabled GPU acceleration via WebGL, a graphics pipeline for web applications based on OpenGL. We created a novel WebGL shader for ConvTranspose, a component of the Wav2Lip neural network that had not previously been implemented in GLSL, and contributed it back to the ONNX Runtime Web neural network inference engine as open-source software. By utilizing OpenGL instead of CUDA (a GPU compute framework exclusive to NVIDIA graphics cards), we dramatically improve Wav2Lip's performance with GPU acceleration on low-end devices and thereby reduce the hardware requirements for using the application. We also created a face tracking algorithm based on a performance-optimized version of the Pico algorithm (Koskela et al, 2021) to pass high-quality face crops to the Wav2Lip model and optimize the resulting lip-sync quality while minimizing CPU load during inference.
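As an illustration of this GPU-accelerated inference setup, the following sketch shows how a Wav2Lip ONNX model could be loaded and run with ONNX Runtime Web's WebGL execution provider; the model path, tensor names, and shapes are assumptions for illustration rather than the exact values used in our application.

```typescript
import * as ort from 'onnxruntime-web';

// Load the lip-syncing model with the WebGL backend so inference runs on the
// GPU through GLSL shaders (including the ConvTranspose shader) instead of CUDA.
async function createWav2LipSession(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create('/models/wav2lip.onnx', {
    executionProviders: ['webgl'],
  });
}

// Run one lip-sync step. Input names and shapes here are hypothetical.
async function runLipSync(
  session: ort.InferenceSession,
  faceCrop: Float32Array,   // preprocessed face crop from the driving video
  melChunk: Float32Array    // mel-spectrogram slice of the synthesized speech
): Promise<ort.Tensor> {
  const feeds = {
    face: new ort.Tensor('float32', faceCrop, [1, 6, 96, 96]),
    mel: new ort.Tensor('float32', melChunk, [1, 1, 80, 16]),
  };
  const results = await session.run(feeds);
  return results[session.outputNames[0]] as ort.Tensor;
}
```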
Resemble.ai is intended to be replaced by a free, open-source voice cloning service in the future; nonetheless, we have integrated support for custom Resemble voices securely and efficiently to maximize usability. Resemble is integrated via a suite of REST API wrappers in both the backend and frontend code, resulting in efficient network bandwidth usage and easy future replacement with an alternative service or platform. Since each member of a video call may create a custom voice on a different Resemble.ai plan or account, we securely exchange credentials to enable realistic speech synthesis for each peer in the call. We protect user security by asymmetrically encrypting the Resemble.ai API key on the frontend with an RSA-4096 public key (the corresponding private key is stored securely on our custom-built backend). To generate speech, the application uses the peer's encrypted API key to make a request to our backend, which decrypts the credentials and forwards the request to Resemble.ai. As only the encrypted API key is saved in users' browsers and sent to untrusted peers, potential attackers can never recover users' Resemble.ai credentials.
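A minimal sketch of the client-side credential encryption is shown below, assuming the RSA-4096 public key is shipped to the frontend in SPKI format and using the standard Web Crypto API; variable names are illustrative.

```typescript
// Encrypt a Resemble.ai API key so that only the backend (holder of the
// matching RSA-4096 private key) can ever recover the plaintext credential.
async function encryptApiKey(
  apiKey: string,
  publicKeySpki: ArrayBuffer // server public key, assumed to ship with the frontend
): Promise<ArrayBuffer> {
  // Import the RSA-OAEP public key for encryption only.
  const publicKey = await crypto.subtle.importKey(
    'spki',
    publicKeySpki,
    { name: 'RSA-OAEP', hash: 'SHA-256' },
    false,
    ['encrypt']
  );

  // The resulting ciphertext is what gets stored locally and shared with peers.
  return crypto.subtle.encrypt(
    { name: 'RSA-OAEP' },
    publicKey,
    new TextEncoder().encode(apiKey)
  );
}
```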
Finally, our app employs Progressive Web App technologies to minimize bandwidth concerns in developing regions. After the website code and pre-trained Wav2Lip model are downloaded, the site uses a custom script called a service worker to automatically save the files to the user’s computer such that opening the site in the future will not require re-downloading any data unless the site is updated.
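A minimal service worker sketch of this cache-first behavior is shown below; the cache name and asset list are placeholders rather than the application's actual file names.

```typescript
// Cache-first service worker sketch: after the first visit, the site shell and
// the pre-trained model are served from the local cache, so nothing is
// re-downloaded unless the site is updated (and the cache name changes).
const CACHE_NAME = 'txt2vid-v1';
const ASSETS = ['/', '/app.js', '/models/wav2lip.onnx']; // assumed asset list

self.addEventListener('install', (event: any) => {
  // Pre-cache the application shell and the pre-trained Wav2Lip model.
  event.waitUntil(caches.open(CACHE_NAME).then((cache) => cache.addAll(ASSETS)));
});

self.addEventListener('fetch', (event: any) => {
  // Serve from cache when possible; fall back to the network otherwise.
  event.respondWith(
    caches.match(event.request).then((cached) => cached ?? fetch(event.request))
  );
});
```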
The source code for the website is on GitHub as open-source software at the following link: https://github.com/tpulkit/txt2vid/tree/arjun-browser. A live demo is also available at https://txt2vid.eastus2.cloudapp.azure.com. As of now, it requires the user to set up a Resemble.ai account, create a custom voice, and enter the API key into the site settings before usage.
Evaluation
Quality-of-Experience (QoE) is the overall subjective acceptability perceived by the end-user of a service or application (Kuipers et al, 2010), so evaluating QoE requires metrics that can objectively assess user satisfaction. There are multiple QoE evaluation techniques available to test the applicability and acceptability of our web platform implementation. Nonetheless, QoE evaluation by definition involves subjective complexity, given the presence of factors not necessarily related to the service’s performance, such as the user’s mood (Serral-Gracià et al, 2010). The available literature can be largely categorized into two main techniques to evaluate QoE: the measurement of numerical metrics that tend to be associated with human visual and auditory perception (Kuipers et al, 2010), or the deployment of traditional subjective studies involving voluntary human subjects and statistical analysis of their responses. For the purpose of our research, both techniques were explored and it was decided that a subjective study better suits the QoE evaluation of our web platform implementation.
Objective Evaluation
The video conferencing experience involves multiple aspects, and QoE evaluation therefore covers diverse qualities within this service, including video quality, audio and speech quality, and audio-video synchronization. Given the underlying technology of our video conferencing web platform, the available objective measurements would involve the quality of the video artificially generated from the driving video, and the latency introduced by the Wav2Lip model relative to the requirement for audio-video synchronization.
Out of the existing objective techniques, Video Multi-method Assessment Fusion (VMAF) is the most promising full-reference objective video quality assessment model. VMAF was developed by Netflix together with Professor C.C. Jay Kuo's group at the University of Southern California. The VMAF method seemed especially convenient since it fuses existing image quality metrics (e.g., visual information fidelity and the detail loss metric) to predict video quality, emphasizing metrics that are attuned to human visual preferences (García et al, 2019). VMAF uses a supervised learning regression model that provides a single VMAF score per video frame.
A related metric commonly reported alongside VMAF is Peak Signal-to-Noise Ratio (PSNR), which expresses the ratio between the maximum possible power of a signal and the power of the distorting noise that affects the quality of its representation (NI, 2020). Similarly, PSNR-Human Vision System modified (PSNR-HVS) is a variant of PSNR that additionally considers contrast sensitivity (NI, 2020). The Structural Similarity Index Measure (SSIM) predicts the perceived quality of digital television and cinematic pictures by measuring the similarity between two images against a reference image (Math Works, 2021). Multi-scale SSIM (MS-SSIM) and the color image quality metric CIEDE2000 are other metrics commonly used alongside VMAF.
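For reference, PSNR is computed from the mean squared error (MSE) between the reference and distorted frames, where MAX is the maximum possible pixel value (255 for 8-bit video):

```latex
\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)
             = 20 \log_{10}(\mathrm{MAX}) - 10 \log_{10}(\mathrm{MSE})
```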
Although VMAF was explored and tested as a means of evaluation using the publicly available software (GitHub – Netflix/vmaf, 2018), the metrics that VMAF uses are focused primarily on video quality for streaming services rather than quality-of-experience for end-users in a video call. The PSNR metric, for example, mostly reflects characteristics like clarity and sharpness that are relevant to the user's experience in television and streaming settings but are poor measures of user satisfaction in video conferencing. Even traditional high-bandwidth platforms experience low clarity and momentary image distortions, since image acceptability has a lower threshold when other factors like communication play a larger role (Serral-Gracià et al, 2010).
Another potential point of objective evaluation was audio-video synchronization, which refers to the relative timing of the sound and image portions of a television program or movie. In Txt2Vid, it would refer to the synchronization between the artificially generated voice and the image of the user in the video call. Audio-video synchronization can be measured using artificially generated video test samples by separating the video component from the audio component and assigning them markers that allow numerical detection and analysis of any desynchronization (Serral-Gracià et al, 2010). Standard recommendations state that viewer detection thresholds for audio/video lag are about +45 ms to −125 ms, and that acceptance thresholds are about +90 ms to −185 ms for video broadcasting (Lu, Y. et al, 2010). However, as with the VMAF metrics, desynchronization plays a larger role for streaming services than for video conferences, where tolerances are significantly more flexible. Therefore, desynchronization measurements do not accurately reflect the real QoE of our implementation.
Based on this analysis, it was concluded that video conferencing is a complete experience that involves dynamic interactions beyond the scope of any convenient audiovisual metrics or numeric parameters. Therefore, objective evaluation techniques alone were not suitable for conducting a real-world QoE study.
Subjective Evaluation
Once we established a usable web video conferencing platform, we were able to use a demo version in the browser. Rather than conducting an objective study focused on metrics that humans do not normally perceive in the video conferencing context, we opted to conduct a subjective study to prioritize user experience and obtain more accurate results.
In general terms, prior literature presents multiple techniques for deploying subjective evaluations. In the field of QoE, the final goal is to analyze the acceptability of a service based on a person's real preferences, so it requires a sample of human subjects from the target demographic. Surveys are the best means to collect data that is later analyzed statistically to determine the level of acceptability of the service being tested. In order to keep the study valid and generalizable from sample to population, the survey must be designed so that it primarily addresses general human perceptions without any prompting. While it is true that subjective evaluations can be influenced by variables unrelated to the quality per se (e.g., personal preferences, external lighting, use of headphones, mood, etc.), we concluded that these are still factors that play a role in the user's experience when using similar applications. Hence, they still add value to our research purpose: implementing our web solution in real-world contexts.
Accordingly, we conducted a subjective study with the objective of comparing our web implementation with traditional codecs, particularly AV1 (a state-of-the-art codec increasingly used by commercial video calling platforms).
As shown in Figure 4 and Figure 5, the demo browser version of our implementation was designed such that we could choose to use Txt2Vid or instead disable it and use the AV1 codec. We also had a slider to control the precise bandwidth available to the codec.



Since our goal was to evaluate whether our implementation offers better QoE than traditional platforms in low-bandwidth contexts, we used our demo version to simulate academic and educational lectures. These simulations within our platform were recorded, yielding a total of 6 videos grouped into 3 pairs. Each pair consists of two videos with the same content; however, one video contains a lecture delivered using AV1 at the minimum possible bandwidth of ~10 kbps (~10,000 bps), and the other contains the same lecture delivered using Txt2Vid at only ~100 bps. This allowed us to simulate low-bandwidth conditions within our own platform.
As a result, each pair of videos consisted of Video 1, recorded without using Txt2Vid, and Video 2, recorded using Txt2Vid. The 3 pairs were integrated into an online survey designed to determine each subject's preference between the two videos. Questions were kept general and provided space for each subject to explain their reason for preferring one video over the other, as well as to rate each video independently. We manually verified all responses to guarantee quality standards, ultimately removing ~31% of submissions that failed our quality thresholds due to incomplete answers or surveys completed in under 5 minutes. It is worth noting that our subjects spanned a wide range of backgrounds and demographics, with ages ranging from 13 to 60+ and participants from both the United States and Colombia, two of our potential target markets.
With health and education being the principal fields we considered as potential use cases for our implementation, a significant number of our results were obtained by partnering with one of the largest Colombian diagnostic institutions, the Medical Diagnostic Institute (IDIME). IDIME deployed our survey within its organization, helping ensure reliable and trustworthy results. Another significant proportion of our subjects were high school students from a Colombian public educational institution. The rest of the results were obtained from crowdsourcing and public contributions, yielding a total of 188 submissions.
A preview of the full survey, video samples, and anonymized raw results is available at: https://drive.google.com/drive/folders/1Mhe3BgU2K6jjgV-pRaL4jzma_r5U_5mg?usp=sharing
Evaluation Results
In total, 125 complete survey responses were considered. Respondents were asked to compare recordings from two video calls with the same content: one using Txt2Vid (~100 bps stream) and one using AV1 (~10 kbps stream). They were then asked to rate each individually on a scale of 0 to 5.






As demonstrated in Figure 6 and Figure 7, in all three pairs over two thirds of respondents preferred the Txt2Vid generation to the AV1-compressed video call, despite the Txt2Vid implementation using over 100x less bandwidth.
Respondents also rated each video in every pair on a scale from 0 (terrible) to 5 (excellent). The arithmetic means and standard deviations of their responses are shown in Table 1.
Table 1. Arithmetic mean and standard deviation of QoE ratings (scale of 0 to 5) for each video pair.

                            Video Pair 1        Video Pair 2        Video Pair 3
                            Txt2Vid    AV1      Txt2Vid    AV1      Txt2Vid    AV1
Rating (arithmetic mean)    3.44       2.77     3.29       2.36     3.83       2.50
Standard deviation          1.31       1.31     1.30       1.36     0.99       1.36
In general terms, ratings for our Txt2Vid implementation are visibly higher across all three video pairs, with Video Pair 3 receiving the highest mean rating of 3.83.



Data Analysis
Respondents showed a significant preference for videos where Txt2Vid was used over those with AV1. This trend was expected, since conventional codecs usually require high data transmission rates to achieve good video quality, and AV1 was tested under poor simulated bandwidth conditions, which is precisely when Txt2Vid would be most useful. As seen in Figure 6, however, there is a notable difference between video pairs 1, 2, and 3. While the proportion of respondents who preferred AV1 in the first pair is ~33.3%, it is ~22.8% for the second pair, and only ~6.5% in the third. Although participants preferred Txt2Vid in general, Txt2Vid was more favorable relative to AV1 in the third video pair than in the first. It is worth mentioning that each pair was recorded with different lighting, background noise, and speakers in order to reflect a real-world context. The more pronounced preference for Txt2Vid in the third pair was therefore likely due to the fact that the driving video in the first pair appeared more static and emotionless than the one in the third pair, making the Txt2Vid output quality worse overall. This result indicates the importance of a good driving video for improving QoE in our implementation.
Additionally, QoE ratings for Txt2Vid are concentrated mainly between 3 and 4 (on a scale from 0 to 5), with fewer scores of 0 and 1 compared to AV1. In fact, as shown in Table 1, Txt2Vid's mean rating ranges from 3.44 to 3.83 while AV1's mean rating ranges from 2.36 to 2.77. In addition, Txt2Vid's highest mean rating was also given in video pair 3, where the driving video was considerably better than in the other pairs.
The standard deviation of the Txt2Vid video from the third pair is substantially lower than that of all the other videos. Respondents' ratings were therefore more homogeneous and closer to the mean (3.83), which is also the highest mean rating across all videos; in other words, not only did more people prefer Txt2Vid in the third pair, but the majority of them also rated it favorably. This consistency is a positive indicator of our implementation's applicability.
Respondents were also asked to explain their preference. Respondents choosing AV1 over Txt2Vid in the first pair mainly attributed their choice to the video quality. Notably, open responses agreed that AV1 seemed more natural and realistic than Txt2Vid, but only in the first pair, again likely due to the driving video recorded for Txt2Vid in that pair. In contrast, the overwhelming majority of respondents who chose Txt2Vid over AV1 in the third pair attributed their choice to the audio quality.
These insights allowed us to identify not only that Txt2Vid is a significantly preferred option under low-bandwidth conditions compared to existing commercial codecs, but also the weaknesses and strengths of our platform, as well as key factors to consider in order to guarantee a higher QoE.
One of the main strengths Txt2Vid presented over AV1 is the speech generation of our implementation. Not only did the Resemble.ai voice cloning prove realistic enough for users, but respondents also showed a preference for the audio quality in Txt2Vid. Since Txt2Vid generates the speech in the decoder, it avoids background noise and distortions: a clear advantage over commercial codecs, where speech experiences quality loss. Conversely, although Txt2Vid's video quality was still preferred by most respondents, those who preferred AV1 attributed their choice to the unnaturalness that can occur when the driving video is not ideal. This means that improving video realism and guaranteeing good driving videos are key factors for improving Txt2Vid's QoE.
Overall, the subjective study was favorable for our Txt2Vid web implementation and its applicability.
Conclusions
As demonstrated in this paper, it was possible to use WebGL to enable GPU acceleration and to create a novel WebGL shader, which allowed us to reduce the high-performance hardware usually required to run the Txt2Vid lip-syncing model. As a result, we could implement an in-browser Txt2Vid-based platform that shows promise as an ultra-low bandwidth alternative to traditional video conferencing platforms. The videos from our Txt2Vid platform received higher QoE scores than state-of-the-art video codecs while utilizing 100 times less bandwidth. The several performance and bandwidth optimizations within our web application meant that standard consumer devices could run the lip-syncing inference nearly in real time over a poor internet connection, making the platform suitable for older devices in low-income regions.
The present paper focuses mainly on the applicability of our platform as a tool to bring connectivity and multimedia access to underserved regions that generally have poor internet connections. Even so, the favorable results from the subjective study and the promising advances in web acceleration in our implementation open the possibility for Txt2Vid to be deployed as an AI-based solution that transforms the manner in which video conferencing takes place. The same technology, applied in accessible in-browser alternatives like our platform, breaks what used to be the greatest limitation for video conferencing: the internet connection. Therefore, the applicability of this research goes beyond the original scope and can be further considered for other low-bandwidth communication contexts. For example, it could accelerate development in areas such as marine or space exploration, where poor internet connectivity hinders scientific progress, or even serve commercial purposes as an add-on for existing high-bandwidth platforms.
Future Research
The primary point of potential future research is an alternative realistic speech synthesis solution to Resemble.ai. The use of Resemble requires all users of the platform to have previously created an account and trained a custom voice on Resemble's limited free plan. It also dramatically increases the real-world bandwidth requirements beyond the theoretical 100 bps (though even counting Resemble.ai, data usage is still substantially lower than a traditional video codec at similar quality). Moreover, Resemble adds several seconds of latency to the video call as the browser session waits for the API call to resolve, whereas the platform would otherwise have latency comparable to standard video codecs.
An open source voice cloning tool that can operate on the few seconds of audio recorded during the initial driver video transfer over WebRTC would eliminate both the bandwidth and latency overheads caused by Resemble.ai.
Another area for continued research is computational complexity. Although we addressed this point by utilizing OpenGL instead of CUDA to improve Wav2Lip's performance with GPU acceleration, these deep-learning models still constitute the majority of our application's execution time. Although we considered other web acceleration techniques, some merit further research, particularly MIL WebDNN, an open-source software framework for fast execution of pre-trained deep neural network (DNN) models in the web browser (MIL, 2022) that presents novel approaches. Optimizing our models to make them more lightweight could also be investigated. As AI and related technologies continue to develop over time, we expect performance to improve overall.
In terms of QoE, it would be possible to expand the subjective study in order to identify with more precision the usability of our platform among a greater population, compare it against more codecs under different bandwidth conditions, and even allow respondents to experience the whole video conferencing process instead of showing them pre-recorded calls. One of the most promising resources for expanding the study is Amazon Mechanical Turk, a crowdsourcing marketplace that outsources processes to a distributed workforce who can perform these tasks virtually and that is widely used for survey participation (AMTurk, 2022). These resources could bring new insights from global workforces, augment data collection and analysis, and accelerate machine learning development (AMTurk, 2022). In this way, we could identify with higher accuracy how to improve our web implementation and bring it to the real market.
Acknowledgments
We thank our mentor, Sahasrajit Sarmasarkar, for his continued guidance throughout the project and for his help in testing and planning the design for our web application and QoE evaluation. We would also like to thank Pulkit Tandon and the other authors of the original Txt2Vid paper for their groundbreaking research in the field of low-bandwidth video conferencing, which served as the basis for our project.
References
- Tandon, P. (2021, June 26). Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text. arXiv.Org. https://arxiv.org/abs/2106.14014
- Fernández, C., Saldana, J., Fernández-Navajas, J., Sequeira, L., & Casadesus, L. (2014). Video Conferences through the Internet: How to Survive in a Hostile Environment. The Scientific World Journal, 2014, 1–13. https://doi.org/10.1155/2014/860170
- Gladović, P., Deretić, N., & Drašković, D. (2020). Video Conferencing and its Application in Education. JTTTP – JOURNAL OF TRAFFIC AND TRANSPORT THEORY AND PRACTICE, 5(1). https://doi.org/10.7251/jtttp2001045g
- Biello, D. (2009, March 18). Can Videoconferencing Replace Travel? Scientific American. https://www.scientificamerican.com/article/can-videoconferencing-replace-travel/
- Silva, D. C. A. G. (2021, May 10). The Behavior of Internet Traffic for Internet Services during. . . arXiv.Org. https://arxiv.org/abs/2105.04083
- Cisco, “Cisco visual networking index: global mobile data traffic forecast update, 2017–2022.”, accessed 2021. [Online]. Available: https://s3.amazonaws.com/media.mediapost.com/uploads/CiscoForecast.pdf
- Koeze, E., & Popper, N. (2020, April 8). The Virus Changed the Way We Internet. The New York Times. https://www.nytimes.com/interactive/2020/04/07/technology/coronavirus-internet-use.html
- Portafolio, R. (2022, May 16). Solo 6 % de niños del país usa la ruta escolar para ir a clase. Portafolio.co. https://www.portafolio.co/economia/finanzas/solo-6-de-ninos-del-pais-usa-la-ruta-escolar-para-ir-a-clase-565461
- Two-thirds of the world’s school-age children do not have Internet access at home. (2020). UNICEF. https://www.unicef.org/press-releases/two-thirds-worlds-school-age-children-have-no-internet-access-home-new-unicef-itu
- Kuipers, F., Kooij, R., de Vleeschauwer, D., & Brunnström, K. (2010). Techniques for Measuring Quality of Experience. Lecture Notes in Computer Science, 216–227. https://doi.org/10.1007/978-3-642-13315-2_18
- Serral-Gracià, R., Cerqueira, E., Curado, M., Yannuzzi, M., Monteiro, E., & Masip-Bruin, X. (2010). An Overview of Quality of Experience Measurement Challenges for Video Applications in IP Networks. Lecture Notes in Computer Science, 252–263. https://doi.org/10.1007/978-3-642-13315-2_21
- García, B., López-Fernández, L., Gortázar, F., & Gallego, M. (2019). Practical Evaluation of VMAF Perceptual Video Quality for WebRTC Applications. Electronics, 8(8), 854. https://doi.org/10.3390/electronics8080854
- Junho Park, & Hanseok Ko. (2008). Real-Time Continuous Phoneme Recognition System Using Class-Dependent Tied-Mixture HMM With HBT Structure for Speech-Driven Lip-Sync. IEEE Transactions on Multimedia, 10(7), 1299–1306. https://doi.org/10.1109/tmm.2008.2004908
- Resemble, RESEMBLE.AI: Create AI Voices that sound real., accessed 2021. [Online]. Available: https://www.resemble.ai
- R. Prasad, C. Dovrolis, M. Murray and K. Claffy, “Bandwidth estimation: metrics, measurement techniques, and tools,” in IEEE Network, vol. 17, no. 6, pp. 27-35, Nov.-Dec. 2003, doi: 10.1109/MNET.2003.1248658.
- Lu, Y., Zhao, Y., Kuipers, F., & van Mieghem, P. (2010). Measurement Study of Multi-party Video Conferencing. NETWORKING 2010, 96–108. https://doi.org/10.1007/978-3-642-12963-6_8
- Peak Signal-to-Noise Ratio as an Image Quality Metric. (2020). NI. https://www.ni.com/en-us/innovations/white-papers/11/peak-signal-to-noise-ratio-as-an-image-quality-metric.html
- Multiscale structural similarity (MS-SSIM) index for image quality – MATLAB multissim. (2021). Math Works. https://www.mathworks.com/help/images/ref/multissim.html
- THEOplayer. (n.d.). Basics of video encoding: Everything you need to know. Retrieved August 4, 2022, from https://www.theoplayer.com/blog/basics-of-video-encoding#:~:text=Codecs%20are%20essentially%20standards%20of,DECoder%2C%20hence%20the%20name%20codec.
- Chen, Y., Mukherjee, D., Han, J., Grange, A., Xu, Y., Parker, S., . . . Liu, Z. (2020). An Overview of Coding Tools in AV1: The First Video Codec from the Alliance for Open Media. APSIPA Transactions on Signal and Information Processing, 9, E6. doi:10.1017/ATSIP.2020.2
- Y. Chen et al., “An Overview of Core Coding Tools in the AV1 Video Codec,” 2018 Picture Coding Symposium (PCS), 2018, pp. 41-45, doi: 10.1109/PCS.2018.8456249.
- J. Han et al., “A Technical Overview of AV1,” in Proceedings of the IEEE, vol. 109, no. 9, pp. 1435-1462, Sept. 2021, doi: 10.1109/JPROC.2021.3058584.
- N. (2018). GitHub – Netflix/vmaf: Perceptual video quality assessment based on multi-method fusion. GitHub. https://github.com/Netflix/vmaf
- Koskela, A. et al (2021). GitHub – nenadmarkus/picojs: A face detection library in 200 lines of JavaScript. GitHub. https://github.com/nenadmarkus/picojs
- MIL WebDNN. (2022). WebDNN. https://mil-tokyo.github.io/webdnn/
- AMTurk. (2022). Amazon Mechanical Turk. https://www.mturk.com/