Google’s Cloud TPU Matches Volta in Machine Learning at Much Lower Prices
Over the past few years, Nvidia has established itself as a leader in machine learning and artificial intelligence processing. The GPU designer dove into the HPC market over a decade ago when it launched the G80 and its parallel compute platform and API, CUDA. That early lead has paid off for Nvidia; the company holds 87 spots on the TOP500 list of supercomputers, compared with just 10 for Intel. But as machine learning and artificial intelligence workloads proliferate, vendors are emerging to give Nvidia a run for its money, and Google’s new Cloud TPU is one of them. New benchmarks from RiseML pit Nvidia’s Volta against Google’s TPU head-to-head, and the cost curve strongly favors Google.
Because ML and AI are new and emerging fields, it’s important that tests be conducted fairly and that benchmark runs don’t favor one architecture over the other simply because the best testing parameters aren’t well known. To that end, RiseML allowed both Nvidia and Google engineers to review drafts of its test results and to offer comments and suggestions. The company also states its figures have been reviewed by an additional panel of outside experts in the field.
The comparison pits four Google TPUv2 chips (which together form one Cloud TPU) against four Nvidia Volta GPUs. Both configurations have 64GB of total RAM, and training was run in the same fashion on each. RiseML tested the ResNet-50 model (exact configuration details are available in the blog post), and the team investigated raw performance (throughput), accuracy, and convergence (an algorithm converges when its output comes closer and closer to a specific value).
The suggested batch size for TPUs is 1024, but other batch sizes were tested at reader request. Nvidia does perform better at those lower batch sizes. In accuracy and convergence, the TPU solution is somewhat better (76.4 percent top-1 accuracy for Cloud TPU, compared with 75.7 percent for Volta). Improvements to top-end accuracy are difficult to come by, and the RiseML team argues that the small gap between the two solutions matters more than you might think. But where Google’s Cloud TPU really wins, at least right now, is on pricing.
RiseML writes:
Ultimately, what matters is the time and cost it takes to reach a certain accuracy. If we assume an acceptable solution at 75.7 percent (the best accuracy achieved by the GPU implementation), we can calculate the cost to achieve this accuracy based on required epochs and training speed in images per second. This excludes time to evaluate the model in-between epochs and training startup time.
As shown above, the current pricing of the Cloud TPU allows to train a model to 75.7 percent on ImageNet from scratch for $55 in less than 9 hours! Training to convergence at 76.4 percent costs $73. While the V100s perform similarly fast, the higher price and slower convergence of the implementation results in a considerably higher cost-to-solution.
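The arithmetic behind a cost-to-solution figure like this is simple enough to sketch. The Python below is an illustration of the calculation RiseML describes (epochs needed, training speed in images per second, hourly price), not RiseML’s own code. The function name and parameters are made up for this example, and the numbers in the usage lines are placeholders, not measured throughput or real cloud pricing. As in RiseML’s figures, it ignores evaluation time between epochs and training startup time.

```python
# Size of the ImageNet-1k (ILSVRC 2012) training set.
IMAGENET_TRAIN_IMAGES = 1_281_167


def cost_to_solution(epochs, images_per_second, price_per_hour,
                     images_per_epoch=IMAGENET_TRAIN_IMAGES):
    """Return (hours, cost) to train for `epochs` passes over the data.

    Hypothetical helper: ignores evaluation between epochs and startup
    time, and assumes throughput stays constant for the whole run.
    """
    if images_per_second <= 0:
        raise ValueError("images_per_second must be positive")
    total_images = epochs * images_per_epoch
    hours = total_images / images_per_second / 3600
    return hours, hours * price_per_hour


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not benchmark results.
    hours, cost = cost_to_solution(epochs=10, images_per_second=1000,
                                   price_per_hour=2.0,
                                   images_per_epoch=360_000)
    print(f"{hours:.1f} hours, ${cost:.2f}")
```

Because cost scales with both the hourly price and the number of epochs needed, a chip with similar raw throughput can still come out well behind if it costs more per hour or needs more epochs to hit the target accuracy.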
Google may be subsidizing its cloud processor pricing, and the exact performance characteristics of ML chips will vary depending on implementation and programmer skill. This is far from the final word on Volta’s performance, or even on how Volta compares with Google’s Cloud TPU. But at least for now, in ResNet-50, Google’s Cloud TPU appears to offer nearly identical performance at substantially lower prices.