YouTube Designed Its Own Video Transcoding Hardware

The amount of video on the internet has grown rapidly year after year, as has the number of videos YouTube serves. Unfortunately, CPUs and GPUs no longer deliver the kind of yearly performance improvements they once did. Faced with a slowing rate of silicon improvement and rapidly increasing amounts of video, YouTube decided to build its own video transcoding unit, or VCU, codenamed Argos.

The company has disclosed its Argos effort in both a blog post and a paper, depending on how deep into the details you feel like digging. According to YouTube, moving workloads to the VCU has improved efficiency by 20-33x depending on the exact particulars of the stream. YouTube’s new chip is designed to be capable of transcoding to one resolution target at a time, or of targeting multiple resolutions simultaneously.

A key component of YouTube’s power savings is the fact that the software and hardware stacks are explicitly designed to work with each other. The physical architecture of the system is shown below:

There are more encode than decode cores on each iteration of the ASIC, and more than one ASIC on each VCU card. This solution has been designed for dense scaling. Transcoding a video to multiple output resolutions simultaneously is part of how YouTube achieves its power efficiency improvements, as it “allows efficient sharing of control parameters obtained by analysis of the source (e.g., detection of fades/flashes),” according to the company. Handling these transcodes in parallel (multiple-output transcoding, or MOT) is generally preferred to doing them one at a time (single-output transcoding, or SOT), as it avoids redundant decodes for the same group of outputs. At least some of the claimed power efficiency improvements will come from avoiding that redundant work.

In MOT, the video is decoded once, scaled to all target resolutions, and then encoded at all relevant targets. YouTube notes that it also designed the ASIC to be able to process multiple MOTs and SOTs simultaneously to further boost efficiency. The actual encoder is designed to encode H.264 and VP9 in hardware while searching three reference frames. It has a pipelined architecture, local reference stores for motion estimation, and can accelerate entropy encoding, but Google notes the chip is “optimized for power/performance/area targets.” Each encoder core is capable of encoding 4K at 60fps in real time, with 10 cores per ASIC, and multiple ASICs per card.
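To make the difference between SOT and MOT concrete, here is a minimal Python sketch that only counts the stages each approach would run. It is an illustration, not YouTube's implementation: the output ladder, the WorkCount class, and the sot() and mot() functions are all hypothetical names invented for this example, and the sketch ignores the parameter sharing YouTube describes.

```python
from dataclasses import dataclass

# Hypothetical output ladder (vertical resolutions). The article does not
# list the targets YouTube actually uses; these are assumptions.
LADDER = [2160, 1440, 1080, 720, 480, 360]


@dataclass
class WorkCount:
    """Tally of how many times each pipeline stage runs."""
    decodes: int = 0
    scales: int = 0
    encodes: int = 0


def sot(targets):
    """Single-output transcoding: a full decode/scale/encode pass per target."""
    work = WorkCount()
    for _ in targets:
        work.decodes += 1  # the source is decoded again for every output
        work.scales += 1
        work.encodes += 1
    return work


def mot(targets):
    """Multiple-output transcoding: decode once, then scale and encode each target."""
    work = WorkCount()
    work.decodes += 1      # the source is decoded a single time
    for _ in targets:
        work.scales += 1
        work.encodes += 1
    return work


if __name__ == "__main__":
    print("SOT:", sot(LADDER))
    print("MOT:", mot(LADDER))
```

With six assumed targets, the SOT path decodes the source six times and the MOT path decodes it once, while the number of scale and encode steps stays the same. The decode stage is where the redundant work disappears.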

YouTube is already drawing up plans for a next-generation accelerator that would also be capable of decoding AV1 in hardware. VP9 is generally considered to be the open-source competitor for HEVC, while AV1 is a more advanced follow-up expected to deliver greater bandwidth savings.

Argos represents the kind of company-specific project we’ve seen more of in recent years as Intel has struggled to improve its CPU performance, but this isn’t strictly a CPU issue. The GPU decode blocks built into an Ampere or RDNA2 GPU clearly weren’t specialized for the task YouTube had in mind. This is the kind of semi-custom work one could theoretically see AMD taking on, but AMD doesn’t appear to have pursued outside manufacturing deals for its IP all that aggressively. We know the company is working on a deal with Samsung for a mobile graphics solution based on Radeon IP, and it partners with Sony and Microsoft for console gaming, but not much beyond that — at least, not publicly.

Ten years ago, Google, Facebook, and Amazon began to quietly revolutionize the server market by paying ODMs to build servers for them directly rather than buying off-the-shelf standardized hardware from the likes of Dell or HPE. Today, these same companies are designing their own custom silicon to fill various cloud industry use-cases. CPUs and GPUs still dominate the consumer space, but specialized accelerators and purpose-built chips are creeping into the enterprise in ever-increasing numbers. It’s also interesting to see YouTube rather pointedly not backing HEVC or even discussing future support for VVC / H.266. Any avoidance of these standards would likely be due to royalty entanglements and licensing fees.

Feature image by YouTube.
