A Brief History of Intel CPUs, Part 2: Pentium II Through Coffee Lake
In Part 1 of this guide, we discussed the various Intel CPUs from the beginning of the company through to the Pentium Pro. Before we dive into the other CPUs in Intel’s overall history, let’s take a moment to further discuss the Pentium Pro. In a very real sense, the PPro is the core that revolutionized the x86 architecture and Intel’s microprocessor design, and can be thought of as the “father” from which modern CPUs are descended (the original 8086 itself, in this context, is more of a grandfather).
The Pentium Pro was, in many ways, a true watershed of CPU design. Up until its debut, Intel’s x86 CPUs executed programs in the order in which program instructions were received. This meant that performance optimizations relied heavily on the expertise of the programmer in question. Instruction caches and the use of pipelines improved performance in hardware, but neither of these technologies changed the order in which instructions were executed. Not only did the Pentium Pro implement out-of-order execution to improve performance by allowing the CPU to re-order instructions for optimal execution, it also began the now-standard process of decoding x86 instructions into RISC-like micro-ops for more efficient execution. While Intel didn’t invent either capability out of whole cloth, it took a substantial risk when it built the Pentium Pro around them. Needless to say, that risk paid off.
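To make the in-order versus out-of-order distinction concrete, here is a minimal toy sketch in Python. It is not a model of the P6 scheduler: it assumes a machine that issues at most one instruction per cycle, gives every instruction a unique destination register (as if registers had already been renamed), and ignores in-order retirement. The names `Instr`, `run`, and `PROGRAM`, along with the latencies, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str      # register written (unique per instruction in this toy)
    srcs: tuple    # registers read
    latency: int   # cycles until the result is available


def run(program, out_of_order):
    """Issue at most one instruction per cycle; return (cycles, issue order)."""
    produced = {ins.dest for ins in program}
    ready_at = {}              # register -> cycle its value becomes available
    pending = list(program)
    order = []
    cycle = 0
    finish = 0

    def is_ready(reg):
        if reg not in produced:
            return True        # program input, available from the start
        return reg in ready_at and ready_at[reg] <= cycle

    while pending:
        # In-order: only the oldest pending instruction may issue.
        # Out-of-order: the oldest pending instruction with ready inputs may issue.
        candidates = list(pending) if out_of_order else pending[:1]
        for ins in candidates:
            if all(is_ready(s) for s in ins.srcs):
                pending.remove(ins)
                ready_at[ins.dest] = cycle + ins.latency
                finish = max(finish, cycle + ins.latency)
                order.append(ins.name)
                break
        cycle += 1
    return finish, order


# A slow load, one instruction that depends on it, and two independent ones.
PROGRAM = [
    Instr("load", "r1", ("addr",), 3),
    Instr("add", "r2", ("r1", "x"), 1),
    Instr("mul", "r3", ("y", "z"), 1),
    Instr("sub", "r4", ("y", "x"), 1),
]

print(run(PROGRAM, out_of_order=False))
print(run(PROGRAM, out_of_order=True))
```

With these made-up latencies, the in-order run stalls behind the `add` that is waiting on the load and finishes in 6 cycles, while the out-of-order run slips the independent `mul` and `sub` into the gap and finishes in 4.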
The Pentium Pro’s P6 microarchitecture would be used for the Pentium II and III (all forms) before being replaced by the Pentium 4 “Netburst” — at least for a little while. It resurfaced with the Pentium M (Intel’s mobile CPU family) and Core 2 Duo family. Modern Intel CPUs are still considered to have descended from the Pentium Pro, despite the numerous architectural revisions between then and now. The original Pentium Pro, however, didn’t perform particularly well with 16-bit legacy code and was therefore mostly restricted to Intel’s workstation and server product families. The Pentium II was intended to change that — so let’s pick up our history from there.
Pentium II
The Pentium II made a number of subtle changes to the Pentium Pro's design and one big honkin' shift. To improve 16-bit performance, it restored the segment register cache that previous x86 CPUs had used but the Pentium Pro had dropped. It also doubled the L1 cache to 32K (16K each for instructions and data), added MMX support to the execution core, and, of course, moved from a socket configuration to Intel's Slot 1. The Pentium II used an onboard L2 cache that was connected to the primary CPU by a dedicated bus, but the cache itself only ran at half clock. The Pentium Pro's cache, in contrast, had run at full CPU clock. This design was a huge success for Intel overall, and most of the company's remaining x86 competitors were on their last legs by this time.
Pentium III (Katmai)
The first Pentium III, built around the Katmai core, was a marginal upgrade to the Pentium II. It offered support for SSE, Intel's first floating-point SIMD (single instruction, multiple data) instruction set, and higher clock speeds, along with a CPU identification mechanism (the processor serial number) that kicked off tremendous controversy at the time.
Pentium III (Coppermine)
The Pentium III Coppermine is the CPU most people would remember as the "real" Pentium III. Architecturally, it retained the same basic structure as the Pentium II, but it added a fully integrated L2 cache running at full CPU speed. This tremendously improved CPU performance, while the "Gigahertz race" with AMD led Intel to quickly introduce faster and faster chips. The Pentium III suffered from limited availability at its highest clock frequencies and some high DRAM costs depending on motherboard platform, but offered excellent performance compared with AMD's Athlon equivalent at the time.
Pentium III (Tualatin)
The Pentium III Tualatin is a bit of an unusual core that deserves a mention here. It barely came to the desktop, since Intel was mostly concerned with marketing the Pentium 4 at the time, but it offered a die-shrunk Pentium III with lower power consumption and a larger overall L2 cache (the Celeron chips bumped up to 256KB of onboard cache, while the Pentium III-S offered 512KB, up from 256KB). Tualatin became the basis for the Pentium M product line, which would continue through Banias, Dothan, and Yonah before returning to the desktop as the Core 2 Duo.
Pentium 4 (Willamette)
The first iteration of Intel's Pentium 4 was tremendously controversial. The new "Netburst" architecture was disdainfully dubbed a "marchitecture" (marketing + architecture) and accurately accused of emphasizing high clock speeds without noting that the CPU did much less work per clock cycle. Willamette's small L2 cache (256KB), expensive RDRAM, and limited clock speeds made the first P4 a mediocre product at the best of times, and it was regularly outperformed by both the Tualatin-based Pentium III and competitive parts from AMD.
Pentium 4 (Northwood)
The 0.13-micron Northwood core was the P4 die shrink that proved the part could work. A larger L2 cache improved performance, and the addition of Hyper-Threading offered the creamy smoothness of multithreading responsiveness on a single-core CPU. Clock speeds skyrocketed, racing from 2GHz at introduction to as high as 3.4GHz by the time the next iteration of the CPU arrived. Northwood was also a strong competitor against AMD and generally outperformed its rival. While the Athlon 64 was faster in many respects, Northwood mostly competed against the Athlon XP and did so very well. Of the three Pentium 4 iterations, Northwood was the variant that hit its goals and delivered expected performance.
Pentium 4 (Prescott)
Prescott was an unmitigated disaster. While it sold extremely well, its weak performance and power efficiency left the CPU at a distinct disadvantage against AMD. The P4's longer pipeline was intended to allow it to hit high clock speeds, but Prescott's pipeline was too long and its 90nm process had significant power leakage problems. The net result was a CPU that struggled to outperform Northwood and that couldn't hit the higher clock speeds required to demonstrate performance leadership. The period from 2004 to 2006 is considered the golden age for AMD performance for precisely this reason. Once dual-core CPUs became available, the Athlon 64 decisively outperformed the P4 in virtually every test and market.
Core 2 Duo
If Prescott was a disaster, the Core 2 Duo was the triumphant return to form. Intel abandoned the Netburst architecture, which emphasized clock speed, and returned to building higher-efficiency cores at lower clocks. The P4's faster bus speeds were retained, but matched with larger, shared L2 caches, support for additional SIMD instructions (SSSE3, with SSE4.1 arriving in the 45nm parts), and a wider front end to support more aggressive execution. Core 2 Duo added macro-op fusion (combining certain adjacent instruction pairs, such as a compare followed by a conditional jump, into a single operation to reduce the number of operations executed and to improve out-of-order execution). The C2D architecture was more efficient and generally outperformed AMD's Athlon 64 X2 CPUs, setting up a trend that would continue until the launch of Ryzen in 2017. Core 2 Duo CPUs were launched in two flavors (65nm and 45nm) and were one of Intel's most successful CPU families. Intel's Tick-Tock manufacturing method debuted with the Core family.
Nehalem
Intel's Nehalem was a major leap forward for the company. It integrated the memory controller on-die, re-added Hyper-Threading support, replaced the legacy FSB with the new QuickPath Interconnect, began the process of integrating motherboard components on-die, increased macro-op fusion support in 64-bit mode, added SSE 4.2 support, and added L3 cache as standard. Unlike C2D, Nehalem parts share a common L3 cache but have a private L2 for each core (Core 2 Duo CPUs shared the L2). Nehalem firmly established Intel as the overall performance leader.
Westmere
Westmere was the 32nm die shrink of Nehalem, with better AES encryption / decryption performance, better virtualization support, and more CPU cores. Six-core and 10-core CPUs were available, and the first mobile devices with onboard (though not fully integrated) graphics were based on the Westmere die shrink.
Sandy Bridge (2000 Series)
Sandy Bridge was a major architectural overhaul of Nehalem and debuted in 2011. New Turbo Boost capabilities, fully integrated on-die graphics, improved load/store capabilities, a new uop cache, better branch prediction, AVX support, and support for up to eight CPU cores were all major features of this refresh. Modern Core CPUs, including the latest Coffee Lake chips, are direct descendants of Sandy Bridge. Sandy Bridge was also the first Intel CPU to focus on low-power operation and the first chip deployed in what would become the ultrabook market. It was also the first Intel CPU to be equipped with QuickSync, Intel's dedicated video encoder.
Ivy Bridge (3000 Series)
Ivy Bridge is well known for its use of 22nm FinFET transistors, PCIe 3.0 support, improved integrated GPU, and overall higher maximum core counts in server configurations. Performance only modestly improved over Sandy Bridge, starting a trend of small year-on-year improvements that has continued to the present day.
Haswell (4000 Series)
Haswell was a significant architectural refresh that widened the CPU core, added AVX2 support, boosted clock speeds, and integrated a new voltage regulator on-die (Intel abandoned this in a later refresh). Branch execution was also improved, but the performance improvements were still fairly modest. Power consumption in top-end parts rose, though GPU performance was much improved compared with Ivy Bridge. The first Intel mobile chips to integrate a large off-die eDRAM cache for graphics and use as an L4 (in some parts) were built around the Haswell CPU core.
Skylake (6000 Series)
Skylake was Intel's most recent architectural overhaul. All CPUs since have been fundamentally based on this architecture. Skylake added DirectX feature level 12_1 support, deeper out-of-order buffers, improved overall execution, improved AES encryption, support for Intel SpeedShift, and an improved video decode/encode unit. Performance compared with Haswell was test- and part-dependent: some Skylake chips were faster than Haswell clock-for-clock, but ran at lower clock speeds and thus ended up more-or-less equivalent. The largest performance improvements were concentrated in the mobile parts.
Kaby Lake (7000 Series)
Kaby Lake was built under Intel's PAO (Process-Architecture-Optimization) method rather than tick-tock and represents an optimization of Skylake. Clock speeds were generally higher and SpeedShift was adjusted, with some additional improvements to the GPU's video decode / encode engine. Kaby Lake also supported the first Optane cache drives. This release was generally overshadowed by AMD's Ryzen, which debuted just a few months later.
Coffee Lake (8000 Series)
Intel's Coffee Lake is built on a further refinement of the 14nm process and made significant changes to Intel's overall product line. Top-end CPU core counts were expanded to six cores / 12 threads (up from 4C/8T) and the Core i5 and Core i3 families were both overhauled, as were the Celeron and Pentium lines. Maximum clock speeds also generally improved, though this is somewhat dependent on the part in question. Coffee Lake CPUs were a substantial improvement in mobile, where performance per watt and power consumption are now much improved.
To trace the history of Intel CPU cores is to trace the history of various epochs in the evolution of CPU performance. In the 1980s and 1990s, clock speed improvements and architectural enhancements went hand in hand. From 2005 forward, it was the era of multi-core chips and higher-efficiency parts. Since 2011, Intel has focused on improving the performance of its low-power CPUs more than other capabilities. This focus has paid real dividends: laptops today have far better battery life and overall performance than they did a decade ago.
Compare the Core i7-8650U with the Core i7-2677M to see what we mean. The Sandy Bridge-era CPU had a maximum clock of 2.9GHz, supported just 8GB of RAM, and offered half the cores and L3 cache of the modern 8th-generation SKU. Overall mobile performance per watt has improved dramatically. At the same time, Intel faces real challenges, both from AMD and ARM. The company recently announced that its 10nm process would be delayed, which leaves Intel facing further competitive risk.
But regardless of what the future holds, Intel CPUs have driven consistent performance improvements over the past four decades, revolutionizing the personal computer in the process. We hope you’ve enjoyed the trip down memory lane.