Introducing TRACBench, a New AI-Powered Transcoding Benchmark
A bit over a year ago, I started experimenting with video restoration and AI upscaling for my Deep Space Nine Upscale Project. Today, I’d like to talk about the benchmark I’ve built as part of those efforts and what sort of interesting things it can tell us about ultra-high-end workstation performance. Such discussions aren’t much fun without practical hardware to play with, so we’ll also be examining how performance in our new test scales between an AMD Ryzen Threadripper 3990X with 64 cores and four RAM channels, and a Ryzen Threadripper Pro 3995WX-equipped Lenovo ThinkStation P620 workstation with the same 64 cores and eight RAM channels.
Spoiler Alert: One of the reasons I’ve written this article is to demonstrate just how much firepower a modern top-end x86 system can bring to media transcoding workloads in the first place. The overall quality of AI upscaling continues to improve and fans of my Deep Space Nine Upscale Project should know I’ll have more to say about it in the near future.
In the past, I’ve relied on Handbrake to capture transcoding performance, but there are more flexible tools available with a wider range of features. I experimented with using Handbrake as a processing step in my research over the last 15 months before deciding other tools were a better fit for what I wanted to do. TRACBench’s design — the first four letters stand for TRanscoding, Ai, and Conversion — reflects what I’ve learned about scaling these workloads across a large array of cores.
TRACBench 0.1 uses SD-quality interlaced footage as an initial source. While AI scaling applications like Topaz are capable of upscaling 720p or 1080p footage, 360p and 480p footage are more easily processed in a reasonable amount of time.
Transcoding: This step uses StaxRip as a front-end for AviSynth and deinterlaces the footage using QTGMC. TRACBench 0.1 uses the same settings published here and is built around StaxRip 2.1.3.0 with AviSynth+ 3.6.1. StaxRip is run in parallel using multiple instances of the same application. StaxRip is configured to allow up to eight parallel processes per application instance and Prefetch(8) was used in each AviSynth script. We test up to 16 simultaneous encodes to load all 128 threads of the Ryzen Threadripper 3990X and Threadripper Pro 3995WX. The Ryzen 9 5950X cannot sustain so many parallel encodes and tops out at a much lower maximum.
AI Upscaling: In Version 0.1, this step is handled by Topaz 1.5.3. This is an older version of the application that doesn’t support RTX 3000 or RDNA2 GPUs. That’s not a problem for us today, because the Quadro RTX 6000 cards inside the Lenovo ThinkStation P620 are Turing-based. Future versions of the test will update to the latest version of Topaz. Multi-GPU testing on the ThinkStation P620 was handled by running one application instance on each GPU.
Conversion: The final step converts the upscaled frames and the original audio back into a finished video. Outputting frames and then recombining them using a tool like FFmpeg yields superior quality to just outputting an MP4 file via Topaz. TRACBench 0.1 uses FFmpeg git-2020-08-28-ccc7120 and libx264 for H.264 encoding. Future versions will include testing in H.265.
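To make the conversion step concrete, here is a minimal sketch of how a frames-plus-audio recombination could be assembled for FFmpeg. It is not the exact command TRACBench uses: the file names, the frame rate, and the CRF value are all assumptions chosen for illustration, and only standard FFmpeg options appear.

```python
from pathlib import Path


def build_ffmpeg_command(frame_pattern, audio_source, output,
                         fps="24000/1001", crf=18):
    """Return an FFmpeg argument list that muxes numbered frames with audio.

    fps and crf are illustrative defaults, not TRACBench settings.
    """
    return [
        "ffmpeg",
        "-framerate", str(fps),     # frame rate of the image sequence (assumed)
        "-i", str(frame_pattern),   # input 0: numbered upscaled frames
        "-i", str(audio_source),    # input 1: original file holding the audio
        "-map", "0:v:0",            # take video from the frame sequence
        "-map", "1:a:0",            # take audio from the original file
        "-c:v", "libx264",          # H.264 encode, as in TRACBench 0.1
        "-crf", str(crf),           # quality target (hypothetical value)
        "-pix_fmt", "yuv420p",      # widely compatible pixel format
        "-c:a", "copy",             # pass the audio through untouched
        str(output),
    ]


if __name__ == "__main__":
    # Hypothetical paths, used only to show the shape of the command.
    cmd = build_ffmpeg_command(Path("frames") / "%06d.png",
                               "episode_source.mkv",
                               "episode_upscaled.mkv")
    print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed. Several such commands could be launched side by side to mimic the multi-instance runs in the benchmark.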
We may continue to use Handbrake for simple testing, but Handbrake isn’t as useful for front-end video processing as AviSynth. AviSynth is a script-driven video processing tool that offers a wide range of filters for transforming and editing video in various ways. StaxRip serves as a front-end for it.
The Lenovo ThinkStation P620 was a perfect testbed for building this benchmark. The 3995WX inside the system is AMD’s top-end Ryzen Threadripper Pro CPU. It has slightly lower clocks than the 3990X, but it offers twice the maximum memory bandwidth. The 3990X has just one memory channel per 16 cores, while the 3995WX has two.
That is the tradeoff between the Threadripper Pro 3995WX and the Threadripper 3990X: the latter offers very slightly more clock speed, but dramatically less memory bandwidth. We’ll see if the difference is enough to matter in our tests, and we’ve got a few additional results between the two systems outside of this test as well.
Rather than attempting to make these three systems as alike as possible, I’ve deliberately allowed their configurations to differ. We’re looking at three different efforts to build a high-end workstation, essentially. The Ryzen 9 5950X balances a new 16-core CPU against an older GPU from 2018. The Ryzen Threadripper 3990X keeps the same GPU but increases the number of cores and overall memory bandwidth dramatically. Both of these systems opt for larger, less expensive 2TB M.2 SSDs, where the P620 uses the faster 1TB Samsung PM981 Polaris drive. Finally, the Lenovo ThinkStation P620 doubles memory bandwidth again and adds a second GPU. Each one of these systems could fairly be called a workstation-class system, but they make different tradeoffs. We’ll see how those tradeoffs impact performance.
Incidentally, the 3990X is running DDR4-2666 because my CPU, which once ran DDR4-3600 with no problem, now refuses to run memory above DDR4-2666 at all. Repeatedly reseating both the RAM and CPU had no effect on this limitation, and relaxing RAM timings to a ridiculous degree did not help the system POST at a higher RAM clock.
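For a sense of scale, theoretical peak DDR4 bandwidth is the transfer rate times 8 bytes per 64-bit channel times the number of channels. The sketch below applies that formula to four and eight channels. Using DDR4-2666 for both configurations is an assumption made purely to isolate the effect of channel count; the P620’s actual memory speed may differ, and real-world bandwidth is lower than these peaks.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR4 channel moves 8 bytes per transfer


def peak_bandwidth_gbs(mega_transfers_per_s, channels):
    """Theoretical peak bandwidth in GB/s (nominal MT/s, decimal gigabytes)."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * channels / 1000


def per_core_gbs(mega_transfers_per_s, channels, cores):
    """Peak bandwidth divided evenly across cores, in GB/s."""
    return peak_bandwidth_gbs(mega_transfers_per_s, channels) / cores


if __name__ == "__main__":
    # Same nominal speed for both rows: an illustrative assumption.
    for label, channels in (("quad-channel", 4), ("eight-channel", 8)):
        total = peak_bandwidth_gbs(2666, channels)
        share = per_core_gbs(2666, channels, cores=64)
        print(f"{label}: {total:.1f} GB/s peak, {share:.2f} GB/s per core")
```

At equal memory speed, doubling the channel count doubles both the total and the per-core figure, which is the whole of the 3995WX’s theoretical advantage here.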
The Lenovo ThinkStation P620 Workstation
The Lenovo ThinkStation P620 is a genuinely nice piece of kit with a few odd habits. It has a very long boot time (~81 seconds) and it emits two long beeps followed by three short beeps just before the monitor comes on. This may be related to some aspect of the dual Nvidia Quadro RTX 6000 configuration because the display doesn’t initialize until Windows 10 is pulling up the desktop. System stability was excellent at all times.
The case panel is hinged and lifts directly away from the system. The ThinkStation P620’s internal layout is well designed, though removing the second GPU can be difficult depending on how large one’s hand is. The front panel modules are designed to be adaptable to various types of devices, depending on what you need to connect.
I’m going to borrow a photo from our sister site PCMag’s review of the ThinkStation P620 because it shows the inside of the chassis without graphics cards installed:
Here’s a tighter angle of our ThinkStation P620, with its graphics cards installed.
The power supply is remarkable. It’s easily the smallest 1kW power supply I’ve ever seen, and it’s rated 80 Plus Platinum. It plugs directly into the motherboard using an edge connector, visible below:
I’m torn on this aspect of the ThinkStation P620’s design. The power supply is a well-built unit and it hooks directly to the motherboard with no need for a clunky 24-pin ATX cable. There are secondary PCIe power cables mounted on the edge of the motherboard that travel from the motherboard to the GPUs. It’s objectively a better system for power delivery, but if your power supply dies you’ll be talking to Lenovo about a replacement.
Active cooling for the RAM slots. Probably not the worst idea, given how tightly packed things are.
The cooling system is a bit unusual but it keeps the system stable, even under sustained full load. We stress-tested the system by running 16 transcoding workloads and two AI upscaling workloads simultaneously. Power consumption at the wall hit 800W, but the system remained stable under an eight-hour load test. Fan noise from both GPUs and the CPU simultaneously was significant — I wouldn’t want to run the tower all-out if it sat next to my head — but not enough to be bothersome if the machine sat under a desk.
Test Notes
The Lenovo ThinkStation P620’s dual RTX 6000 GPUs guarantee that it will win the AI upscaling test. The point of this comparison is to show the potential performance gain when stepping from an upper-end consumer card from 2018 to a pair of higher-end workstation cards. The entire point of TRACBench is that it can scale from ordinary consumer hardware to high-end workstations, so it makes sense to capture a range of data points (and price tags).
Results today are presented only for AMD systems. TRACBench 0.1 was designed on AMD hardware and I lack access to the kind of dual-socket Xeon systems that compete with the Lenovo P620 on core count. Future iterations of the benchmark will also include scaling data for Intel platforms such as Rocket Lake and Cascade Lake, as well as lower-core-count AMD systems.
TRACBench Results
The transcoding, AI, and conversion steps each show different performance patterns, so we’ll discuss them separately.
Transcoding is a huge win for the ThinkStation P620 and shows the benefits of eight memory channels as opposed to four. At just one instance, the Ryzen 9 5950X is actually faster than either Threadripper and AMD’s Zen 3 architecture keeps a good pace with the P620 and 3990X at the 2x level as well. At 4x, the Threadrippers pull decisively away. The small gain between 2x and 4x for the 5950X shows that 4x is the realistic limit for the consumer CPU. StaxRip crashes when configured with 8 threads per instance if you run more than four instances on the 5950X. The Threadrippers are not affected by this issue.
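The instance counts in this test line up with a simple thread budget. The sketch below assumes each StaxRip instance occupies exactly its eight Prefetch threads, which is a simplification, since real encoders spawn additional helper threads. The function name is hypothetical.

```python
import math


def instances_to_fill(logical_threads, threads_per_instance=8):
    """Number of encode instances needed to occupy every logical thread.

    Assumes each instance uses exactly threads_per_instance threads.
    """
    return math.ceil(logical_threads / threads_per_instance)


if __name__ == "__main__":
    # 64 cores / 128 threads on the Threadrippers, 16 cores / 32 threads on the 5950X.
    for name, threads in (("64-core Threadripper", 128), ("Ryzen 9 5950X", 32)):
        print(f"{name}: {instances_to_fill(threads)} instances")
```

Under that assumption, 16 instances cover the 128 threads of the Threadrippers and four instances cover the 32 threads of the 5950X, which matches the point where the consumer chip stops scaling. This arithmetic does not by itself explain the StaxRip crash.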
From 4x to 8x, the 3990X picks up just 1.25x performance, while the Lenovo ThinkStation P620 gains 1.51x. Eight memory channels allow the 3995WX to continue scaling when even the mighty 3990X runs out of gas. I want to note that the Ryzen Threadripper 3990X actually maintains higher clocks in this test than the Threadripper Pro 3995WX in the Lenovo ThinkStation P620. It’s not clock speed making the difference, it’s memory bandwidth.
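One way to read those gains is as scaling efficiency: the observed speedup divided by the increase in instance count. This is a minimal sketch using only the two figures quoted above. Treating a 2x gain as the ideal is an assumption that ignores clock-speed changes under load.

```python
def scaling_efficiency(observed_gain, instance_ratio):
    """Fraction of ideal linear scaling achieved when instances increase."""
    return observed_gain / instance_ratio


if __name__ == "__main__":
    # Going from 4 to 8 instances doubles the work offered to the CPU.
    ratio = 8 / 4
    for name, gain in (("Threadripper 3990X", 1.25), ("ThinkStation P620", 1.51)):
        print(f"{name}: {scaling_efficiency(gain, ratio):.1%} of ideal")
```

By this measure the 3990X captures about 62.5 percent of the ideal doubling and the P620 about 75.5 percent.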
The AI test is measured in frames per minute. We expected performance to be entirely determined by GPU choice, so imagine our surprise when the Ryzen 9 5950X outperformed the Threadripper 3990X when both were equipped with an RTX 2080. Topaz has been updated multiple times since we began developing this test, and TRACBench 0.2 will use an updated app version, but this was an interesting and unexpected development. The Lenovo ThinkStation P620, as expected, easily wins this test.
Finally, the FFmpeg conversion test merges frames and audio back into a single video file. The P620 outperforms both the Threadripper 3990X and the 5950X at the single-instance mark and keeps that lead thereafter. Unlike in transcoding, the falloff between the 5950X and the other AMD CPUs is immediate.
Scaling between the two Threadrippers is identical at every measured point. At eight encodes, both 64-core CPUs report ~95 percent load, and the lack of improvement between 6x and 8x instances indicates there’s not much headroom left to scrape out. The fact that the two systems scale identically, however, indicates that memory bandwidth isn’t a limiting factor. It’s interesting to see that the Ryzen 9 5950X still scales upwards, even if it isn’t by very much. Shifting from 4x to 8x improves performance by 7 percent.
The ThinkStation P620 is a giant when it comes to transcoding, where it’s no less than 1.84x faster than the 3990X and 3.37x faster than the Ryzen 9 5950X. It maintains a 2.6x lead in AI upscaling over the 5950X, courtesy of the brace of Quadro RTX 6000 cards it carries. FFmpeg performance showed the smallest advantage for the Ryzen Threadripper Pro 3995WX.
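The two transcoding leads also imply how the 3990X compares with the 5950X. The sketch below divides one ratio by the other, on the assumption that both leads are measured against the same P620 result (for example, each system’s best run). If they come from different instance counts, the derived figure does not hold.

```python
def implied_ratio(lead_over_slowest, lead_over_middle):
    """Speed of the middle system relative to the slowest one.

    Both leads must be relative to the same reference result.
    """
    return lead_over_slowest / lead_over_middle


if __name__ == "__main__":
    # P620 is 3.37x the 5950X and 1.84x the 3990X in transcoding.
    ratio = implied_ratio(3.37, 1.84)
    print(f"3990X vs. 5950X: about {ratio:.2f}x")
```

Under that assumption, the 3990X lands at roughly 1.83x the 5950X, so four times the cores buys less than twice the transcoding throughput on the quad-channel part.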
In addition to TRACBench, we’ve also compared the two systems in SPECworkstation 3.1.0.
SPECworkstation is designed to measure performance in workstation applications, including GPU tests. This accounts for some of the gaps between the Threadripper 3990X and Threadripper Pro 3995WX in the graph above, but not all of them.
The enormous performance gap in Life Sciences cannot be explained solely by the 3995WX’s additional memory channels. There may have been a subtlety in our 3990X’s configuration, or a peculiarity of running a four-channel Threadripper, that resulted in the 3995WX testing much, much better in the lammps subtests, where it was no less than 6.5x faster than the 3990X. The gaps in the other categories are generally explained by the Lenovo ThinkStation P620 fielding faster storage, GPUs, or an additional four memory channels, but the Life Sciences category gap dwarfs them all.
If we remove the disparate impact of this subtest and examine the 3990X versus the 3995WX subtest by subtest, the 3995WX turns in scores that range from 0.92x to 2.15x those of the 3990X. While it narrowly loses a few tests due to the 3990X’s faster clock, it wins far more than it loses thanks to its additional memory bandwidth.
When we look at storage tests and remove the namd storage results for being skewed in a similar fashion to the CPU test, the Samsung PM981 SSD in the Lenovo P620 is 1.28x faster, in aggregate, than the Mushkin Pilot-E we used for our Threadripper 3990X comparison. With the namd results included, the P620 is 1.37x faster. Both systems are using PCIe 3.0 drives, so we’re seeing the impact of the SSD controller, not the additional bandwidth available via PCIe 4.0.
The Lenovo ThinkStation P620 Hits the Pinnacle of Workstation Performance
The Ryzen Threadripper 3990X is still one of the most fun CPUs I’ve ever reviewed, partly for the absurd joy of pushing it to an all-core 4.3GHz outside during the polar vortex, and partly because watching 64 cores rip through rendering workloads in minutes that can take an hour or more on an eight-core chip is fun.
If watching the Ryzen Threadripper 3990X is fun, watching the Lenovo ThinkStation P620 and the Ryzen Threadripper Pro 3995WX is an absolute party. The 3995WX isn’t always faster than the 3990X (there are a handful of places where it’s 4-6 percent slower), but you trade that handful of small slowdowns for 1.4x to 2x performance improvements in specific applications. The results we’ve shown here illustrate the importance of knowing your workload: under the right circumstances, the Ryzen Threadripper Pro 3995WX is capable of nearly doubling the Ryzen Threadripper 3990X’s performance. Under the wrong ones, the 3990X is 5-6 percent faster than its more expensive sibling.
As for TRACBench, expect to see it pop up again the next time we have CPUs to review. The ThinkStation P620’s performance in TRACBench’s transcoding workload was amazing. The Ryzen Threadripper Pro 3995WX eats transcode workloads for breakfast, far beyond anything even the Ryzen Threadripper 3990X is capable of.
I think we’re going to see real-time AI upscaling at or above the quality Topaz Video Enhance AI (TVEI) currently offers within the next five years. Currently, two Turing GPUs combined produce ~5.5fps, but one can imagine Ampere doubling per-card throughput and hitting 5.5fps with one card. At that point, we need a further 5x performance improvement to reach real-time frame rates (I’m rounding up to put some padding on the margin). Given how rapidly AI performance has improved, that’s not a crazy idea. The ThinkStation P620 isn’t showing off a future we’ll never get to see; it’s just accelerating its arrival a bit.
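The real-time estimate above comes down to simple arithmetic, sketched here. The 5.5fps baseline is the figure quoted in the text. The target frame rates are assumptions, since the article does not state which playback rate counts as real time.

```python
import math


def required_speedup(current_fps, target_fps):
    """How many times faster upscaling must get to reach target_fps."""
    return target_fps / current_fps


def frames_per_minute(fps):
    """Convert frames per second to the frames-per-minute unit used by the AI test."""
    return fps * 60


if __name__ == "__main__":
    baseline = 5.5  # fps, from the text
    print(f"baseline: {frames_per_minute(baseline):.0f} frames per minute")
    # Assumed targets: film-rate NTSC video and a round 30fps.
    for target in (24000 / 1001, 30.0):
        need = required_speedup(baseline, target)
        print(f"{target:.3f} fps needs {need:.2f}x (rounded up: {math.ceil(need)}x)")
```

A 5x improvement over 5.5fps yields 27.5fps, which clears film-rate playback with some margin but falls slightly short of a full 30fps.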
The Lenovo ThinkStation P620 is one of the most powerful air-cooled workstations money can buy, and it offers a fascinating glimpse into the future of content restoration and upscaling. If you’ve looked at the Ryzen Threadripper 3990X but were concerned its quad-channel design limited the chip, the Ryzen Threadripper Pro 3995WX may be exactly what you’re looking for.