Intel Details Faster Gen 11 Graphics Architecture

Intel has released a new design document detailing its Gen 11 GPUs and how they’ll differ from its previous family. Until now, we’ve gotten only modest details on the new uarch at events like Intel’s Architecture Day, but this new data backfills some of the expected technical details. Intel’s Gen 11 graphics architecture is expected to be the basis for its upcoming Xe discrete GPU architecture, so the advances debuted here are a preview of at least some of the features those cards should deploy.

We’re not going to make performance projections until we’ve seen more from the underlying hardware, but by all appearances, Intel will at least be able to mount a far more effective challenge to AMD than it ever has before. Historically, the midrange “GT2” GPUs that Intel builds into its desktop and some mobile chips have been weak compared with AMD’s integrated graphics. Intel’s advantage in these comparisons was its CPU strength relative to AMD’s Bulldozer-derived APUs. Now that Ryzen has a much more effective CPU core, AMD’s Ryzen Mobile processors compete far more effectively against their Intel counterparts.

The new GPU core by the numbers. Compute performance is roughly 2.67x higher, as is texture sampling throughput. ROP throughput has doubled, as has the number of hierarchical-Z (HiZ) tests per clock. The L3 cache is 4x larger, and the bandwidth available for GPU writes has doubled, to 64 bytes per clock. Memory bandwidth with DDR4 should be unchanged, but support for LPDDR4 allows for theoretically higher RAM clocks when paired with that memory type. The last-level cache is shared between the GPU and CPU to reduce data movement. Video decoder blocks have been improved to reduce bitrate, allow simultaneous decoding of multiple 4K and 8K streams, add Adaptive Sync support, and improve HD video decode.
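
As a rough sanity check on the 2.67x compute figure, the sketch below assumes a comparison that the text above does not spell out: a 24-EU Gen 9 GT2 part against a 64-EU Gen 11 GT2 part, with each execution unit retiring 16 FP32 operations per clock. The function name and both constants are illustrative assumptions, not figures quoted in this article.

```python
# Assumption: each EU has two SIMD-4 FPUs and counts a fused multiply-add
# as two operations, giving 2 * 4 * 2 = 16 FP32 operations per clock.
FP32_OPS_PER_EU_PER_CLOCK = 16

# Assumption: EU counts for the GT2 configurations being compared.
GEN9_GT2_EUS = 24
GEN11_GT2_EUS = 64


def peak_fp32_ops_per_clock(eu_count: int) -> int:
    """Peak FP32 operations per clock for a GPU with eu_count execution units."""
    return eu_count * FP32_OPS_PER_EU_PER_CLOCK


gen9_peak = peak_fp32_ops_per_clock(GEN9_GT2_EUS)
gen11_peak = peak_fp32_ops_per_clock(GEN11_GT2_EUS)
compute_ratio = gen11_peak / gen9_peak

print(f"Gen 9 GT2:  {gen9_peak} FP32 ops/clock")
print(f"Gen 11 GT2: {gen11_peak} FP32 ops/clock")
print(f"Ratio:      {compute_ratio:.2f}x")
```

Under those assumptions the ratio is simply 64/24, so the per-clock gain would come from a wider GPU rather than from faster individual execution units.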

The GPU now has a shared local memory that doesn’t block L3 access when read. Intel claims this provides an increase in effectiveness for local and global atomics.

The entire structure, above. Intel claims that it has improved overall memory bandwidth efficiency substantially with Gen 11. It’ll be interesting to see how true this proves and whether it changes the overall characteristics of Intel’s iGPU solutions. Historically, AMD iGPUs have been quite sensitive to memory bandwidth, while Intel chips have been less affected by RAM clock. These changes could make RAM clocks matter more on the Intel side of things as well.

Coarse Pixel Shading and The Lost Moon of POSH

The two major features of Gen 11 are coarse pixel shading and POSH, which obviously stands for Position Only SHading and not some dubious reference to a British sci-fi program.

Coarse pixel shading decreases the workload on GPUs by reducing the number of color samples used to render an image. Other details, like geometry, are not scaled in this fashion, to maintain scene detail. Reducing the number of times the pixel shader executes can save power and improve performance. The uplift from CPS shown at Intel’s Architecture Day was in the 20-40 percent range, depending on how aggressively the feature was applied: 2×2 had a minimal impact on visuals and improved performance by a moderate amount, while 4×4 was much more visible but also offered more uplift.
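
To make the savings concrete, here is a minimal sketch that counts pixel shader invocations for one full-screen pass at different coarse rates. It assumes a 1920×1080 target and that every pixel is shaded at the same rate; the function name is hypothetical, and this models only the invocation count, not how Intel's hardware schedules the work.

```python
def pixel_shader_invocations(width: int, height: int,
                             coarse_w: int = 1, coarse_h: int = 1) -> int:
    """Count shader invocations when one shaded result covers a
    coarse_w x coarse_h block of pixels."""
    # Ceiling division so partial blocks at the edges still get shaded.
    blocks_x = -(-width // coarse_w)
    blocks_y = -(-height // coarse_h)
    return blocks_x * blocks_y


WIDTH, HEIGHT = 1920, 1080  # assumed render target size
full_rate = pixel_shader_invocations(WIDTH, HEIGHT)

for rate in (1, 2, 4):
    count = pixel_shader_invocations(WIDTH, HEIGHT, rate, rate)
    print(f"{rate}x{rate}: {count} invocations "
          f"({count / full_rate:.2%} of full rate)")
```

At 2×2 the pass needs a quarter of the invocations, and at 4×4 one sixteenth. Measured frame-rate gains are far smaller than those ratios because pixel shading is only one part of a frame and the coarse rate is typically applied to selected regions or draws rather than to everything.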

The POSH pipe we referred to above is part of Intel’s Position Only Tile-Based Rendering (PTBR) system, which deploys two geometry pipelines: a standard rendering pipe and the POSH pipeline. Intel states that:

The POSH pipe executes the position shader in parallel with the main application, but typically generates results much faster as it only shades position attributes and avoids rendering of pixels. The POSH pipe runs ahead and uses the shaded position attribute to compute visibility information for triangles to gauge whether they are culled or not. Object visibility recording unit of the POSH pipe calculates the visibility, compresses the information and records it in memory.

In theory, POSH should be a faster, more power-efficient way to handle certain types of geometry processing. Overall performance and applicability to workloads will likely depend on the type of rendering mode used by games. Still, Intel is clearly thinking about how to maximize memory bandwidth and introducing more advanced features around the idea than we’ve seen previously.
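
The run-ahead idea in Intel's description can be sketched in a few lines. This is a toy model under stated assumptions, not Intel's implementation: triangles are 2D, the "position shader" is a simple translation, visibility means front-facing and overlapping the viewport, and the recorded visibility is "compressed" into a single integer bitmask. All function names are hypothetical.

```python
def position_shader(vertex, offset):
    """Hypothetical position-only shader: translate a 2D vertex."""
    return (vertex[0] + offset[0], vertex[1] + offset[1])


def signed_area(a, b, c):
    """Positive for counter-clockwise (front-facing) triangles."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def is_visible(tri, width, height):
    """Cull back-facing, degenerate, and fully off-screen triangles."""
    if signed_area(*tri) <= 0:
        return False
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    if max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height:
        return False
    return True


def posh_pass(triangles, offset, width, height):
    """Run ahead: shade positions only and record one visibility bit per triangle."""
    mask = 0
    for index, tri in enumerate(triangles):
        shaded = tuple(position_shader(v, offset) for v in tri)
        if is_visible(shaded, width, height):
            mask |= 1 << index
    return mask


def render_pass(triangles, mask):
    """Main pipe: return indices of triangles that still need full shading."""
    return [i for i in range(len(triangles)) if (mask >> i) & 1]


triangles = [
    ((0, 0), (10, 0), (0, 10)),            # front-facing, on screen
    ((0, 0), (0, 10), (10, 0)),            # back-facing
    ((200, 200), (210, 200), (200, 210)),  # outside a 100x100 viewport
    ((50, 50), (60, 50), (50, 60)),        # front-facing, on screen
]

visibility = posh_pass(triangles, offset=(0, 0), width=100, height=100)
to_render = render_pass(triangles, visibility)
print(f"visibility mask: {visibility:04b}")
print(f"triangles sent to the render pipe: {to_render}")
```

In this toy scene only two of the four triangles reach the full render pass. The point of the split is that the cheap position-only pass decides what gets culled, so the expensive pass never fetches or shades attributes for geometry that would be discarded anyway.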

Overall, Gen 11 is shaping up to be a significant update for Santa Clara. AMD’s first two generations of Ryzen Mobile have gone up against reheated Skylake graphics. The third-generation Ryzen Mobile APU, whenever it launches, will have to compete against something with a little more oomph. The full document is available here.
