MLPerf Watch: AI chips are getting faster

By micohuang

Recently, the artificial-intelligence industry consortium MLCommons released MLPerf version 1.1, a new round of AI inference benchmark results. It follows the first set of official benchmarks from five months ago and includes more than 1,800 results from more than 20 organizations, along with 350 energy-efficiency measurements. According to MLCommons, most systems improved performance by 5 to 30 percent compared with earlier this year, and some more than doubled their figures. The results arrive shortly after the announcement of a new machine learning benchmark called TPCx-AI.

In MLPerf’s inference benchmarks, systems consisting of combinations of CPUs and GPUs or other accelerator chips are tested on up to six neural networks that perform a variety of common functions: image classification, object detection, speech recognition, 3D medical imaging, natural language processing, and recommendation. Commercial data-center-based systems are tested under two conditions: one simulating real data-center activity, where queries arrive in bursts, and an “offline” condition, where all the data is available at once. Computers meant to work in the field rather than in a data center (what MLPerf calls the edge) are measured in the offline condition and as if they were receiving a single stream of data, such as the feed from a security camera.
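The difference between the server and offline conditions can be sketched in a small simulation. This is a hedged illustration, not MLPerf's actual LoadGen harness: in the server scenario, queries arrive over time (modeled here as a Poisson process, so they come in bursts), while in the offline scenario every sample is handed over at once and only raw throughput matters. All function names and numbers below are illustrative.

```python
import random

def offline_throughput(num_samples: int, per_sample_seconds: float) -> float:
    """Offline scenario: all samples are available at once, so the
    score is simply raw throughput (samples processed per second)."""
    total_time = num_samples * per_sample_seconds
    return num_samples / total_time

def server_arrivals(rate_qps: float, duration_s: float, seed: int = 0) -> list:
    """Server scenario: queries arrive in bursts over the run, modeled
    here as a Poisson process (exponential gaps between arrivals)."""
    random.seed(seed)
    t, arrivals = 0.0, []
    while True:
        t += random.expovariate(rate_qps)
        if t >= duration_s:
            break
        arrivals.append(t)
    return arrivals
```

In the server case a system is judged on whether it answers each bursty query within a latency bound; in the offline case only the throughput number matters, which is why the two conditions can rank the same hardware differently.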

Although there were data-center-class submissions from Dell, HPE, Inspur, Intel, LTech Korea, Lenovo, Nvidia, Neuchips, Qualcomm, and others, all but Qualcomm, Neuchips, and Intel used Nvidia AI accelerator chips. Intel used no accelerator chips at all, instead showing off the performance of its CPUs alone. Neuchips participated only in the recommendation benchmark, because its accelerator, RecAccel, is specifically designed to speed up the recommender systems that suggest e-commerce items and rank search results.


MLPerf tests six common AI workloads under several conditions. NVIDIA

For the results Nvidia submitted, the company attributed 50 percent of its performance improvement over the past year to software improvements alone. The systems tested typically consisted of one or two CPUs and up to eight accelerators. On a per-accelerator basis, systems using the Nvidia A100 showed twice the performance or more of systems using the lower-power Nvidia A30. The A30-based computers in turn outperformed systems based on the Qualcomm Cloud AI 100 in four of the six tests in the server scenario.

However, John Kehrli, senior director of product management at Qualcomm, pointed out that Qualcomm’s accelerators are deliberately limited to a data-center-friendly 75-watt power envelope per chip, and that in the offline image-recognition task they still managed to beat some computers equipped with Nvidia A100 accelerators, each of which has a peak thermal design power of 400 W.


Nvidia has made gains in AI using only software improvements. NVIDIA

Dave Salvator, senior product manager for AI inference at Nvidia, pointed to two other results involving the company’s accelerators. First, the Nvidia A100 accelerator was paired with a server-grade Arm CPU instead of an x86 CPU for the first time, and results were nearly identical between the Arm and x86 systems on all six benchmarks. “This is an important milestone for Arm,” Salvator said. “It’s also a statement that our software stack is ready to run on the Arm architecture in a data center environment.”

Second, outside the official MLPerf benchmarks, Nvidia showed off a software technology called multi-instance GPU (MIG), which allows a single GPU to appear, from a software perspective, as if it were seven separate chips. When the company ran all six benchmarks simultaneously, plus an additional instance of object detection for good measure, the result was 95 percent of the value of a single instance running alone.
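That 95 percent figure can be read as a per-workload retention ratio: each instance's score while all seven run together, divided by its score running alone. A minimal sketch with made-up numbers (the function and the values are illustrative, not Nvidia's methodology):

```python
def mig_retention(concurrent_scores, solo_scores):
    """Average fraction of solo (single-instance) performance retained
    when all MIG instances run their workloads simultaneously."""
    ratios = [c / s for c, s in zip(concurrent_scores, solo_scores)]
    return sum(ratios) / len(ratios)

# Seven hypothetical workloads, each keeping ~95% of its solo score
# when run concurrently on one partitioned GPU.
solo = [100.0] * 7
concurrent = [95.0] * 7
```

A retention near 1.0 would mean the partitions interfere with each other hardly at all, which is the point Nvidia was demonstrating.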

Nvidia A100-based systems also cleaned up in the edge-server category, which covers systems designed for use in places like stores and offices. These computers were tested on mostly the same six benchmarks, but with the recommender system replaced by low-resolution object detection. Within this category, though, there is a wider range of accelerators to choose from, including Centaur’s integrated AI coprocessor, Qualcomm’s AI 100, Edgecortix’s DNA-F200 v2, Nvidia’s Jetson Xavier, and FuriosaAI’s Warboy.


Qualcomm topped the efficiency ranking for a machine vision test. QUALCOMM

With systems using different numbers of CPUs and accelerators, six tests, two conditions, and two commercial categories, the MLPerf results don’t reduce to a simple ordered list the way Top500.org’s supercomputer rankings do. The closest thing is the efficiency test, which boils down to inferences per second per watt in the offline condition. Qualcomm’s systems were tested for efficiency in image recognition, object detection, and natural language processing in both the data-center and edge categories. In inferences per second per watt, they beat the Nvidia-powered systems in the machine vision tests but not in language processing. Nvidia-accelerated systems took all the rest.
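The efficiency metric itself is just throughput divided by power. As a quick arithmetic illustration (the numbers below are made up, not actual submission results):

```python
def efficiency(inferences: int, seconds: float, avg_watts: float) -> float:
    """Inferences per second per watt: throughput divided by the
    average power draw measured over the same run."""
    return (inferences / seconds) / avg_watts

# Hypothetical run: 75,000 inferences in 10 s at an average of 75 W
# yields 7,500 inferences/s and 100 inferences/s/W.
score = efficiency(75_000, 10.0, 75.0)
```

This is why a 75 W accelerator can top an efficiency ranking even when a 400 W chip posts a higher absolute throughput.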

A new benchmark launched recently targets a single number, which seems to run against the multidimensional nature of MLPerf. The Transaction Processing Performance Council calls the benchmark TPCx-AI. It:

- Generates and processes large amounts of data

- Trains on the preprocessed data to produce realistic machine learning models

- Produces accurate insights into real customer scenarios based on the generated models

- Can scale to large distributed configurations

- Allows flexible configuration changes to meet the needs of dynamic AI environments

Hamesh Patel, chair of the TPCx-AI committee and a principal engineer at Intel, explained that the benchmark is designed to capture the complete end-to-end machine learning and AI process, including parts that MLPerf leaves out, such as data preparation and optimization. “There is no single benchmark that simulates the entire data science pipeline,” he said. Customers, he added, have reported that it can take a week to prepare the data and two days to train a neural network.

One big difference between MLPerf and TPCx-AI is that the latter relies on synthetic data that resembles real data but is dynamically generated. MLPerf uses real datasets for training and inference, and MLCommons executive director David Kanter is skeptical of the value of results based on synthetic data.

There is considerable overlap in membership between MLCommons and the TPC, so it remains to be seen whether either benchmark gains an edge in credibility. For now, MLPerf has one undeniable advantage: at least two MLPerf participants reported that computer system manufacturers have been asked to provide MLPerf results as part of requests for proposals.
