YouTube Designed Its Own Video Transcoding Hardware
The volume of video on the internet has been exploding year after year, as has the number of videos YouTube serves. Unfortunately, CPUs and GPUs no longer deliver the kind of yearly performance improvements they once did. Faced with a slowing rate of silicon improvement and rapidly increasing amounts of video, YouTube decided to build its own video transcoding unit, or VCU, codenamed Argos.
The company has disclosed its Argos effort in both a blog post and a paper, depending on how deep into the details you feel like digging. According to YouTube, moving workloads to the VCU has improved efficiency by 20-33x depending on the exact particulars of the stream. YouTube’s new chip is designed to be capable of transcoding to one resolution target at a time, or of targeting multiple resolutions simultaneously.
A key component of YouTube’s power savings is the fact that the software and hardware stacks are explicitly designed to work with each other. The physical architecture of the system is shown below:
There are more encode cores than decode cores on each ASIC, and more than one ASIC on each VCU card. This solution has been designed for dense scaling. Transcoding a video to multiple output resolutions simultaneously is part of how YouTube achieves its power efficiency improvements, as it “allows efficient sharing of control parameters obtained by analysis of the source (e.g., detection of fades/flashes),” according to the company. Handling these transcodes in parallel (multiple-output transcoding, or MOT) is generally preferred to doing them one at a time (single-output transcoding, or SOT), as it avoids redundant decodes for the same group of outputs. At least some of the claimed power efficiency improvements come from avoiding this redundant work.
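The saving from MOT can be illustrated with a toy count of pipeline stages. This is only a sketch: the output ladder in TARGETS and the names StageCounter, transcode_sot, and transcode_mot are hypothetical, not anything YouTube has published, and each "stage" is just a counter increment rather than real video work.

```python
from dataclasses import dataclass

# Hypothetical output ladder; the article does not list YouTube's actual targets.
TARGETS = ["1080p", "720p", "480p", "360p"]


@dataclass
class StageCounter:
    """Counts how often each pipeline stage runs (a stand-in for real work)."""
    decodes: int = 0
    scales: int = 0
    encodes: int = 0


def transcode_sot(targets, counter):
    """Single-output mode: every target gets its own decode, scale, and encode."""
    for _ in targets:
        counter.decodes += 1
        counter.scales += 1
        counter.encodes += 1
    return counter


def transcode_mot(targets, counter):
    """Multiple-output mode: decode once, then scale and encode per target."""
    counter.decodes += 1
    for _ in targets:
        counter.scales += 1
        counter.encodes += 1
    return counter


sot = transcode_sot(TARGETS, StageCounter())
mot = transcode_mot(TARGETS, StageCounter())
print(f"SOT: {sot.decodes} decodes, {sot.encodes} encodes")
print(f"MOT: {mot.decodes} decodes, {mot.encodes} encodes")
```

With this four-rung ladder, both modes run four encodes, but the single-output mode repeats the decode four times while the multiple-output mode performs it once. The sketch does not model the shared source analysis (such as fade or flash detection) that YouTube also cites as a benefit.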
In MOT, the video is decoded once, scaled to all target resolutions, and then encoded at all relevant targets. YouTube notes that it also designed the ASIC to be able to process multiple MOTs and SOTs simultaneously to further boost efficiency. The actual encoder is designed to encode H.264 and VP9 in hardware while searching three reference frames. It has a pipelined architecture, local reference stores for motion estimation, and can accelerate entropy encoding, but Google notes the chip is “optimized for power/performance/area targets.” Each encoder core is capable of encoding 4K at 60fps in real time, with 10 cores per ASIC, and multiple ASICs per card.
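To put the encoder figures in perspective, here is a rough back-of-the-envelope calculation of raw pixel rate. It assumes "4K" means UHD at 3840 x 2160, which the article does not spell out, and it treats the quoted per-core figure as a peak rate; it says nothing about sustained throughput or the number of ASICs per card.

```python
# Assumption: "4K" is UHD, 3840 x 2160 pixels (the article gives no exact dimensions).
WIDTH, HEIGHT = 3840, 2160
FPS = 60                     # from the article: 4K at 60fps per encoder core
ENCODER_CORES_PER_ASIC = 10  # from the article


def pixels_per_second(width, height, fps):
    """Raw pixel rate for a stream of the given frame size and frame rate."""
    return width * height * fps


per_core = pixels_per_second(WIDTH, HEIGHT, FPS)
per_asic = per_core * ENCODER_CORES_PER_ASIC

print(f"Per encoder core: {per_core:,} pixels/s")
print(f"Per ASIC (10 cores): {per_asic:,} pixels/s")
```

Under those assumptions, each core handles roughly half a billion pixels per second, and a full ASIC about five billion.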
YouTube is already drawing up plans for a next-generation accelerator that would also be capable of decoding AV1 in hardware. VP9 is generally considered to be the open-source competitor for HEVC, while AV1 is a more advanced follow-up expected to deliver greater bandwidth savings.
Argos represents the kind of company-specific project we’ve seen more of in recent years as Intel has struggled to improve its CPU performance, but this isn’t strictly a CPU issue. The GPU decode blocks built into an Ampere or RDNA2 GPU clearly weren’t specialized for the task YouTube had in mind. This is the kind of semi-custom work one could theoretically see AMD taking on, but AMD doesn’t appear to have pursued outside manufacturing deals for its IP all that aggressively. We know the company is working on a deal with Samsung for a mobile graphics solution based on Radeon IP, and it partners with Sony and Microsoft for console gaming, but not much beyond that — at least, not publicly.
Ten years ago, Google, Facebook, and Amazon began to quietly revolutionize the server market by paying ODMs to build servers for them directly rather than buying off-the-shelf standardized hardware from the likes of Dell or HPE. Today, these same companies are designing their own custom silicon to fill various cloud industry use-cases. CPUs and GPUs still dominate the consumer space, but specialized accelerators and purpose-built chips are creeping into the enterprise in ever-increasing numbers. It’s also interesting to see YouTube rather pointedly not backing HEVC or even discussing future support for VVC / H.266. Any avoidance of these standards would likely be due to royalty entanglements and licensing fees.
Feature image by YouTube.