What you will learn:
- Applying AI to video conferencing equipment.
- What is super-resolution image enhancement?
- The impact of deep learning networks and the specter of generative adversarial networks (GANs).
Video conferencing for virtual meetings, distance learning, or socializing has exploded with the outbreak of the coronavirus pandemic. Some experts suggest that our reliance on virtual gatherings will remain part of our new normal even after the virus recedes. If so, the enormous hunger for bandwidth caused by ubiquitous video conferencing on the Internet, from the core to the thinnest branches, will persist.
Even with modern video codecs, a video conference can require substantial bandwidth: 1 to 2 Mbit/s per participant just to keep the thumbnails on the screen. And there is growing evidence that, with experience, users become more critical of image quality and yearn to see the fine details of facial expressions, gestures, and posture that carry so much information in a face-to-face meeting. This trend limits the ability of apps to use higher compression ratios to reduce bandwidth requirements. The fine detail that the compression algorithm discards contains exactly the cues that an experienced negotiator needs most.
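To put the per-participant figure in perspective, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each client receives one separate stream from every other participant; real conferencing services may mix or selectively forward streams, so the function name and the model are hypothetical.

```python
def downlink_mbps(participants: int, per_stream_mbps: float) -> float:
    """Aggregate downlink rate for one client, in Mbit/s.

    Assumption (not from the article): the client receives one stream
    from each of the other participants in the call.
    """
    if participants < 1:
        raise ValueError("participants must be at least 1")
    return (participants - 1) * per_stream_mbps


if __name__ == "__main__":
    # The article quotes 1 to 2 Mbit/s per participant.
    for rate in (1.0, 2.0):
        total = downlink_mbps(9, rate)
        print(f"9 participants at {rate} Mbit/s each: {total} Mbit/s downlink")
```

Under that assumption, the required downlink grows linearly with the number of participants, which is why any per-stream saving matters.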
AI to the rescue?
Help with this dilemma can come from a surprising source: artificial intelligence (AI) or, more specifically, a branch of AI known as deep learning networks. Today, a machine learning application called Super-Resolution Image Enhancement, which is already being explored for delivering high quality video to 4K UHD TV screens, can work with existing video conferencing apps to significantly reduce the bit rates required.
Here is how it works. Each videoconferencing device in the conference that wants to display high-resolution images would maintain two machine-learning “inference models”. Each model is a block of code and data that has previously been trained through an extensive process in a data center to perform a specific function. One of the models processes the video from a user’s HD camera before it is sent to the conference app, and the other processes the video that comes from the conference app before it is displayed (Fig. 1).
The first model captures video from the camera frame by frame and isolates the user’s image from the background, reducing the number of pixels that later stages must deal with. This simplified video stream then flows to a standard video conferencing app, where it is downsampled to a lower resolution such as 480p and then compressed using an industry-standard algorithm such as H.264. The compressed video is then transmitted. With the exception of the first step of isolating the user’s image, everything has gone as it would in any other video conferencing scenario.
On the receiving end, the video conferencing app receives the compressed bit stream, decompresses it into a low-resolution video, and sends the video stream to the display subsystem. If the picture is to be displayed as a thumbnail, for example in a conference with several participants, the video is displayed directly. The picture quality is good enough for the small size on the screen. However, if the image is to be larger, the decompressed video is redirected to the second deep learning model, the super-resolution expander.
The super-resolution model was trained on a variety of images with different faces, lighting, and poses in order to selectively add back the information that was lost to downsampling and compression. The result is a high-quality image of the user that is very similar to their image in the original HD camera video.
Note that this is not the same as decompression. The AI model adds features to the low-resolution image that are not there but are expected by human viewers, and completes the high-resolution image, frame by frame, in real time.
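The sender-side steps described above can be sketched in a few lines of NumPy. This is only an illustration: in the real pipeline the foreground mask would come from the trained user-extraction model, whereas here it is simply passed in, and the function names and the integer-factor box filter are assumptions rather than what any particular conferencing app does.

```python
import numpy as np


def isolate_user(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Zero out background pixels.

    frame: H x W x C image. mask: H x W boolean array, True where the user is.
    (Hypothetical stand-in for the output of the extraction model.)
    """
    return frame * mask[..., np.newaxis].astype(frame.dtype)


def downsample(frame: np.ndarray, factor: int) -> np.ndarray:
    """Reduce resolution by averaging each factor x factor block of pixels."""
    h, w, c = frame.shape
    h2, w2 = h // factor, w // factor
    # Crop so both dimensions are exact multiples of the factor.
    cropped = frame[: h2 * factor, : w2 * factor].astype(np.float32)
    blocks = cropped.reshape(h2, factor, w2, factor, c)
    return blocks.mean(axis=(1, 3)).round().astype(frame.dtype)


if __name__ == "__main__":
    frame = np.full((8, 8, 3), 200, dtype=np.uint8)  # toy "camera" frame
    mask = np.zeros((8, 8), dtype=bool)
    mask[2:6, 2:6] = True                            # toy "user" region
    small = downsample(isolate_user(frame, mask), factor=2)
    print(small.shape)  # low-resolution frame handed to the codec
```

The low-resolution frame would then go to the app's standard encoder (H.264 in the article's example), which this sketch does not attempt to model.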
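The receiver-side routing decision can be sketched as follows. The rule used here (show the decoded frame directly when the target display height is no larger than the frame itself) is an assumption made for illustration, and `nearest_neighbor_upscale` is only a placeholder for the trained super-resolution expander, which would synthesize detail rather than repeat pixels.

```python
from typing import Callable

import numpy as np

Frame = np.ndarray


def nearest_neighbor_upscale(frame: Frame, factor: int = 2) -> Frame:
    """Placeholder for the super-resolution model: repeat each pixel."""
    return frame.repeat(factor, axis=0).repeat(factor, axis=1)


def route_frame(
    frame: Frame,
    display_height: int,
    expander: Callable[[Frame], Frame] = nearest_neighbor_upscale,
) -> Frame:
    """Pass thumbnails through untouched; send larger views to the expander."""
    if display_height <= frame.shape[0]:
        return frame            # thumbnail: decoded video is good enough
    return expander(frame)      # large view: hand off to the expander


if __name__ == "__main__":
    decoded = np.zeros((480, 640, 3), dtype=np.uint8)  # decoded 480p frame
    print(route_frame(decoded, display_height=240).shape)
    print(route_frame(decoded, display_height=1080).shape)
```

Keeping the expander behind a plain callable means the routing logic does not need to know which model, or which accelerator, actually performs the upscaling.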
What it takes
Deep learning networks, like most types of AI, are notorious for their huge computing appetites. Fortunately, most of the computing is invested in training the models, a task that is performed in a data center before the model is shipped to users. Once a deep learning model is trained, it’s just a relatively compact block of code and a few data files. Both the user extraction model and the super-resolution expander model can run comfortably on a GPU or a relatively fast notebook computer.
However, as video conferencing becomes more prevalent, the need to use much more modest devices such as dedicated conference equipment, tablets, smart TVs or set-top boxes increases. Work on special hardware accelerators for deep learning – chips that greatly increase the number of calculations performed at the same time while greatly reducing power consumption – has brought these deep learning models into the realm of low-cost, low-power devices.
An example of this work is the Synaptics VS680 System-on-Chip (SoC). This multimedia processor SoC combines Arm CPU cores, a GPU, video and audio processing subsystems, extensive security and privacy provisions, and a deep learning accelerator called the Neural Processing Unit. This latter block can run both the user extraction and super-resolution expander models simultaneously at full video frame rates.
The result is a single chip that significantly reduces the bandwidth requirements for video conferencing while delivering high-quality images at a price and power consumption suitable for even low-cost displays, streamers and set-top boxes. The approach is compatible with existing video conferencing apps.
As video conferencing becomes more common and reaches more people in areas with poor broadband access, often people without expensive notebooks, the ability to significantly reduce bandwidth requirements without sacrificing picture quality, and to do so on inexpensive devices, is becoming increasingly important.
The many faces of deep learning
A deep learning network model, once designed and trained, can only do what it has been trained to do: identify flowers, say, or in our case select a person from their surroundings in a video frame. However, the underlying software and hardware that runs the trained model can often handle a variety of different types of machine learning models that have been trained in different ways to perform very different tasks.
For example, the firmware and hardware of the neural processing unit in the Synaptics VS680 can perform a variety of tasks in a multimedia system. This includes detecting objects, detecting the location and environment of the user, or detecting unwanted content or malware in incoming data streams.
The calculations performed by a deep neural network are massive, but fundamentally simple. Figure 2 shows the structure of one of the most popular neural networks: MobileNet. It contains a number of convolutions that require a large number of multiply-accumulate operations.
This makes the problem very amenable to optimization by custom hardware implementations. MobileNet is a typical network that can be used for multiple vision applications. Networks for other tasks are created with similar basic elements. Because of this, the dedicated neural processing unit in the Synaptics VS680 can deliver high performance for any deep learning AI task in video, audio or analytics applications, to name a few.
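To give a feel for the multiply-accumulate (MAC) counts involved, the sketch below compares a standard convolution layer with the depthwise separable form that MobileNet is built on. The layer dimensions in the example are illustrative choices, not taken from Figure 2, and the function names are hypothetical.

```python
def standard_conv_macs(out_h: int, out_w: int, kernel: int,
                       in_ch: int, out_ch: int) -> int:
    """MACs for a standard convolution: every output pixel of every output
    channel sums over a kernel x kernel window across all input channels."""
    return out_h * out_w * kernel * kernel * in_ch * out_ch


def depthwise_separable_macs(out_h: int, out_w: int, kernel: int,
                             in_ch: int, out_ch: int) -> int:
    """MACs for a depthwise convolution followed by a 1x1 pointwise one."""
    depthwise = out_h * out_w * kernel * kernel * in_ch  # one filter per channel
    pointwise = out_h * out_w * in_ch * out_ch           # 1x1 channel mixing
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer: 112x112 output, 3x3 kernel, 32 -> 64 channels.
    std = standard_conv_macs(112, 112, 3, 32, 64)
    sep = depthwise_separable_macs(112, 112, 3, 32, 64)
    print(f"standard:  {std:,} MACs")
    print(f"separable: {sep:,} MACs ({std / sep:.1f}x fewer)")
```

Either way, the work reduces to millions of identical multiply-accumulate steps per layer per frame, which is the kind of regular workload a dedicated accelerator can parallelize.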
A recent proposal by a GPU vendor shows the less-than-desirable lengths to which this flexibility can be taken. There is a category of deep learning networks called Generative Adversarial Networks, or GANs. They are widely known for their use in creating deepfake videos.
Using a detailed photo of a person and a set of parameters that indicate the location and orientation of key facial features and body parts, a well-trained GAN creates a photo-realistic image of the person. This image may be in an environment that does not exist in the original photo, and gestures and expressions may be different from those in the original. When you string together a sequence of such generated images, you have a video of the person doing or saying things they never did or said in a place they may never have been.
Training a GAN involves two neural networks: a generator and a discriminator (Fig. 3). The generator produces images from random input, and the discriminator tries to tell them apart from real images. The discrepancy the discriminator finds between generated and real images is reported back to the generator during training. Eventually, the generator can produce images that the discriminator cannot distinguish from the real ones. The discriminator network performs image classification and could be based on MobileNet or some other network.
Despite the unfortunate obvious use of this technology, it can also be used to reduce the bandwidth consumed in video conferencing. When using a GAN to generate a picture of the user on the receiving end of the connection, all you need to do is send a first static picture and then a stream of data indicating the location and shape of the main features. This data stream can be significantly smaller than the original high-resolution compressed video stream.
There are practical problems. For one, the technology is incompatible with existing video conferencing apps because it sends a stream of abstract data instead of a stream of standard compressed video. Second, the security risks of running a video conferencing network full of GANs, any of which could be hijacked to create deepfake images rather than reconstructed ones, would need to be carefully considered. However, the idea shows that once a video conferencing device can run deep learning models, the only limit to the functions it can perform is imagination.
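The feedback between the two networks is usually expressed as a pair of loss functions. The sketch below uses binary cross-entropy with the common "non-saturating" generator loss; this is one standard formulation chosen for illustration, not necessarily the one behind Fig. 3, and the function names are hypothetical. The inputs are the discriminator's output probabilities that an image is real.

```python
import numpy as np


def bce(pred: np.ndarray, target: np.ndarray) -> float:
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    pred = np.clip(pred, 1e-7, 1 - 1e-7)  # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1 - target) * np.log(1 - pred)))


def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Low when real images score near 1 and generated images score near 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))


def generator_loss(d_fake: np.ndarray) -> float:
    """Low when the discriminator scores generated images as real (near 1)."""
    return bce(d_fake, np.ones_like(d_fake))


if __name__ == "__main__":
    # Early in training: the discriminator easily spots the fakes.
    print(generator_loss(np.array([0.05, 0.10])))
    # Late in training: the discriminator can only guess (0.5).
    print(generator_loss(np.array([0.50, 0.50])))
```

During training, each network's weights are updated to lower its own loss, so the generator improves precisely by exploiting whatever the discriminator currently gets wrong.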
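A rough estimate shows why such a feature stream can be so much smaller than compressed video. Every number below is a hypothetical assumption chosen for illustration (the count of facial keypoints, two coordinates per point, two bytes per coordinate, 30 frames per second, no further compression); the article does not specify the actual data format.

```python
def keypoint_stream_kbps(num_keypoints: int, coords_per_point: int,
                         bytes_per_coord: int, fps: int) -> float:
    """Uncompressed bit rate of a per-frame keypoint stream, in kbit/s."""
    bits_per_frame = num_keypoints * coords_per_point * bytes_per_coord * 8
    return bits_per_frame * fps / 1000


if __name__ == "__main__":
    # Hypothetical: 68 keypoints, (x, y) each, 16-bit values, 30 frames/s.
    kbps = keypoint_stream_kbps(68, 2, 2, 30)
    video_kbps = 1000.0  # low end of the 1 to 2 Mbit/s quoted earlier
    print(f"keypoints: {kbps:.2f} kbit/s, "
          f"about {video_kbps / kbps:.0f}x below a 1 Mbit/s video stream")
```

The one-time cost of sending the initial static picture is ignored here, since it is amortized over the length of the call.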
Deep learning inference acceleration hardware like the VS680’s Neural Processing Unit can use AI to enable this bandwidth reduction. Such a solution can work with existing conferencing services and fits within the cost and power budgets of inexpensive consumer equipment. Distance learning and working from home need not force a choice between users learning to accept terrible picture quality and service providers investing ever more heavily in network bandwidth. With intelligence, we can have our cake and eat it too.