Google’s Cloud TPU Matches Volta in Machine Learning at Much Lower Prices

Over the past few years, Nvidia has established itself as a leader in machine learning and artificial intelligence processing. The GPU designer dove into the HPC market over a decade ago when it launched the G80 and CUDA, its parallel computing platform and API. That early leadership has paid off: the company holds 87 spots on the TOP500 list of supercomputers, compared with just 10 for Intel. But as machine learning and artificial intelligence workloads proliferate, rival vendors are emerging to give Nvidia a run for its money, including Google with its new Cloud TPU. New benchmarks from RiseML put Nvidia's Volta and Google's TPU head-to-head, and the cost curve strongly favors Google.

Because ML and AI are still emerging fields, it's important that tests be conducted fairly and that benchmark runs don't favor one architecture over the other simply because the optimal testing parameters aren't yet well known. To that end, RiseML allowed both Nvidia and Google engineers to review drafts of its test results and to offer comments and suggestions. The company also states that its figures have been reviewed by an additional panel of outside experts in the field.

The comparison pits four Google TPUv2 chips (which form one Cloud TPU) against four Nvidia Volta GPUs. Both configurations have 64GB of total RAM, and the models were trained on the same data in the same fashion. RiseML tested the ResNet-50 model (exact configuration details are available in the blog post), measuring raw performance (throughput), accuracy, and convergence (an algorithm converges when its output comes closer and closer to a specific value).

The suggested batch size for TPUs is 1024, but other batch sizes were tested at reader request, and Nvidia does perform better at those lower batch sizes. In accuracy and convergence, the TPU solution comes out slightly ahead (76.4 percent top-1 accuracy for the Cloud TPU, compared with 75.7 percent for Volta). Improvements at the top end of the accuracy range are difficult to come by, and the RiseML team argues that this small gap matters more than it might appear. But where Google's Cloud TPU really wins, at least right now, is on pricing.

RiseML writes:

Ultimately, what matters is the time and cost it takes to reach a certain accuracy. If we assume an acceptable solution at 75.7 percent (the best accuracy achieved by the GPU implementation), we can calculate the cost to achieve this accuracy based on required epochs and training speed in images per second. This excludes time to evaluate the model in-between epochs and training startup time.

As shown above, the current pricing of the Cloud TPU allows to train a model to 75.7 percent on ImageNet from scratch for $55 in less than 9 hours! Training to convergence at 76.4 percent costs $73. While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.
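The arithmetic behind RiseML's cost-to-solution figure is straightforward: time is epochs times images per epoch divided by throughput, and cost is time multiplied by the hourly price. The sketch below illustrates that calculation; the throughput, epoch count, and hourly price used here are hypothetical placeholders, not the benchmark's actual measurements.

```python
# Sketch of the cost-to-solution arithmetic described in the RiseML quote.
# Like the original, it ignores evaluation time between epochs and startup time.

def cost_to_solution(epochs, images_per_epoch, images_per_sec, price_per_hour):
    """Return (hours, dollars) to train for `epochs` passes over the dataset."""
    seconds = epochs * images_per_epoch / images_per_sec
    hours = seconds / 3600
    return hours, hours * price_per_hour

# ImageNet has ~1.28M training images; the other numbers below are
# illustrative assumptions, not figures from the benchmark.
hours, dollars = cost_to_solution(
    epochs=90,                  # a common ResNet-50 schedule (assumption)
    images_per_epoch=1_281_167, # ImageNet-1k training set size
    images_per_sec=3000,        # hypothetical accelerator throughput
    price_per_hour=6.50,        # hypothetical hourly instance price
)
print(f"{hours:.1f} hours, ${dollars:.2f}")
```

Plugging in a real accelerator's measured throughput and its cloud price per hour reproduces the kind of dollars-to-accuracy comparison RiseML reports.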

Google may be subsidizing its cloud processor pricing, and the exact performance characteristics of ML chips will vary depending on implementation and programmer skill. This is far from the final word on Volta's performance, or on how Volta compares with Google's Cloud TPU. But at least for now, in ResNet-50, Google's Cloud TPU appears to offer nearly identical performance at substantially lower prices.
