Benchmarking is critical to developing and deploying machine learning systems, especially TinyML applications. Benchmarks allow developers to measure and compare the performance of different model architectures, training procedures, and deployment strategies. This provides key insights into which approaches work best for the problem at hand and the constraints of the deployment environment.
This chapter will provide an overview of popular ML benchmarks, best practices for benchmarking, and how to use benchmarks to improve model development and system performance. It equips developers with the tools and knowledge to benchmark and optimize their systems effectively, with particular attention to TinyML deployments.
Learning Objectives
Understand the purpose and goals of benchmarking AI systems, including performance assessment, resource evaluation, validation, and more.
Learn about key model benchmarks, metrics, and trends, including accuracy, fairness, complexity, performance, and energy efficiency.
Become familiar with the key components of an AI benchmark, including datasets, tasks, metrics, baselines, reproducibility rules, and more.
Understand the distinction between training and inference and how each phase warrants specialized ML systems benchmarking.
Learn about system benchmarking concepts like throughput, latency, power, and computational efficiency.
Appreciate the evolution of model benchmarking from accuracy to more holistic metrics like fairness, robustness, and real-world applicability.
Recognize the growing role of data benchmarking in evaluating issues like bias, noise, balance, and diversity.
Understand the limitations of evaluating models, data, and systems in isolation and the emerging need for integrated benchmarking.
11.1 Overview
Benchmarking provides the essential measurements needed to drive machine learning progress and truly understand system performance. As the physicist Lord Kelvin famously said, “To measure is to know.” Benchmarks allow us to quantitatively know the capabilities of different models, software, and hardware. They allow ML developers to measure the inference time, memory usage, power consumption, and other metrics that characterize a system. Moreover, benchmarks create standardized processes for measurement, enabling fair comparisons across different solutions.
When benchmarks are maintained over time, they become instrumental in capturing progress across generations of algorithms, datasets, and hardware. The models and techniques that set new records on ML benchmarks from one year to the next demonstrate tangible improvements in what’s possible for on-device machine learning. By using benchmarks to measure, ML practitioners can know the real-world capabilities of their systems and have confidence that each step reflects genuine progress towards the state-of-the-art.
Benchmarking has several important goals and objectives that guide its implementation for machine learning systems.
Performance assessment. This involves evaluating key metrics like a given model’s speed, accuracy, and efficiency. For instance, in a TinyML context, it is crucial to benchmark how quickly a voice assistant can recognize commands, as this evaluates real-time performance.
Power assessment. Evaluating the power drawn by a workload alongside its performance yields a measure of energy efficiency. As the environmental impact of ML computing continues to grow, benchmarking energy can enable us to better optimize our systems for sustainability.
Resource evaluation. This means assessing the model’s impact on critical system resources, including battery life, memory usage, and computational overhead. A relevant example is comparing the battery drain of two different image recognition algorithms running on a wearable device.
Validation and verification. Benchmarking helps ensure the system functions correctly and meets specified requirements. One way is by checking the accuracy of an algorithm, like a heart rate monitor on a smartwatch, against readings from medical-grade equipment as a form of clinical validation.
Competitive analysis. This enables comparing solutions against competing offerings in the market. For example, benchmarking a custom object detection model against widely used TinyML reference models such as MobileNet and Tiny-YOLO.
Credibility. Accurate benchmarks uphold the credibility of AI solutions and the organizations that develop them. They demonstrate a commitment to transparency, honesty, and quality, which are essential in building trust with users and stakeholders.
Regulation and Standardization. As the AI industry continues to grow, there is an increasing need for regulation and standardization to ensure that AI solutions are safe, ethical, and effective. Accurate and reliable benchmarks are essential to this regulatory framework, as they provide the data and evidence needed to assess compliance with industry standards and legal requirements.
This chapter will cover the three types of AI benchmarks, the standard metrics, tools, and techniques designers use to optimize their systems, and the challenges and trends in benchmarking.
11.2 Historical Context
11.2.1 Performance Benchmarks
The evolution of benchmarks in computing vividly illustrates the industry’s relentless pursuit of excellence and innovation. In the early days of computing during the 1960s and 1970s, benchmarks were rudimentary and designed for mainframe computers. For example, the Whetstone benchmark, named after the Whetstone ALGOL compiler, was one of the first standardized tests to measure the floating-point arithmetic performance of a CPU. These pioneering benchmarks prompted manufacturers to refine their architectures and algorithms to achieve better benchmark scores.
The 1980s marked a significant shift with the rise of personal computers. As companies like IBM, Apple, and Commodore competed for market share, benchmarks became critical tools for enabling fair comparison. The SPEC CPU benchmarks, introduced by the System Performance Evaluation Cooperative (SPEC), established standardized tests allowing objective comparisons between different machines. This standardization created a competitive environment, pushing silicon manufacturers and system creators to continually improve their hardware and software offerings.
The 1990s brought the era of graphics-intensive applications and video games. The need for benchmarks to evaluate graphics card performance led to Futuremark’s creation of 3DMark. As gamers and professionals sought high-performance graphics cards, companies like NVIDIA and AMD were driven to rapid innovation, leading to major advancements in GPU technology like programmable shaders.
The 2000s saw a surge in mobile phones and portable devices like tablets. With portability came the challenge of balancing performance and power consumption. Benchmarks like MobileMark by BAPCo evaluated speed and battery life. This drove companies to develop more energy-efficient systems-on-chip (SoCs), leading to the emergence of architectures like ARM that prioritized power efficiency.
The focus of the recent decade has shifted towards cloud computing, big data, and artificial intelligence. Cloud service providers like Amazon Web Services and Google Cloud compete on performance, scalability, and cost-effectiveness. Tailored cloud benchmarks like CloudSuite have become essential, driving providers to optimize their infrastructure for better services.
11.2.2 Energy Benchmarks
Energy consumption and environmental concerns have gained prominence in recent years, making power (more precisely, energy) benchmarking increasingly important in the industry. This shift began in the mid-2000s, when processors and systems started hitting cooling limits and the growth of internet services made scaling a crucial aspect of building large-scale systems. Since then, energy considerations have expanded to encompass all areas of computing, from personal devices to large-scale data centers.
Power benchmarking aims to measure the energy efficiency of computing systems, evaluating performance in relation to power consumption. This is crucial for several reasons:
- Environmental impact: With the growing carbon footprint of the tech industry, there’s a pressing need to reduce energy consumption.
- Operational costs: Energy expenses constitute a significant portion of data center operating costs.
- Device longevity: For mobile devices, power efficiency directly impacts battery life and user experience.
Several key benchmarks have emerged in this space:
- SPEC Power: Introduced in 2007, SPEC Power was one of the first industry-standard benchmarks for evaluating the power and performance characteristics of computer servers.
- Green500: The Green500 list ranks supercomputers by energy efficiency, complementing the performance-focused TOP500 list.
- Energy Star: While not a benchmark per se, the ENERGY STAR certification program for computers has driven manufacturers to improve the energy efficiency of consumer electronics.
Power benchmarking faces unique challenges, such as accounting for different workloads and system configurations and accurately measuring power across hardware that spans from microwatts to megawatts of consumption. As AI and edge computing continue to grow, power benchmarking is likely to become even more critical, driving the development of specialized energy-efficient AI hardware and software optimizations.
11.2.3 Custom Benchmarks
In addition to industry-standard benchmarks, there are custom benchmarks specifically designed to meet the unique requirements of a particular application or task. They are tailored to the specific needs of the user or developer, ensuring that the performance metrics are directly relevant to the intended use of the AI model or system. Custom benchmarks can be created by individual organizations, researchers, or developers and are often used in conjunction with industry-standard benchmarks to provide a comprehensive evaluation of AI performance.
For example, a hospital could develop a benchmark to assess an AI model for predicting patient readmission. This benchmark would incorporate metrics relevant to the hospital’s patient population, like demographics, medical history, and social factors. Similarly, a financial institution’s fraud detection benchmark could focus on identifying fraudulent transactions accurately while minimizing false positives. In automotive, an autonomous vehicle benchmark may prioritize performance in diverse conditions, responding to obstacles, and safety. Retailers could benchmark recommendation systems using click-through rate, conversion rate, and customer satisfaction. Manufacturing companies might benchmark quality control systems on defect identification, efficiency, and waste reduction. In each industry, custom benchmarks provide organizations with evaluation criteria tailored to their unique needs and context. This allows for a more meaningful assessment of how well AI systems meet requirements.
The advantage of custom benchmarks lies in their flexibility and relevance. They can be designed to test specific performance aspects critical to the success of the AI solution in its intended application. This allows for a more targeted and accurate assessment of the AI model or system’s capabilities. Custom benchmarks also provide valuable insights into the performance of AI solutions in real-world scenarios, which can be crucial for identifying potential issues and areas for improvement.
In AI, benchmarks play a crucial role in driving progress and innovation. While benchmarks have long been used in computing, their application to machine learning is relatively recent. AI-focused benchmarks provide standardized metrics to evaluate and compare the performance of different algorithms, model architectures, and hardware platforms.
11.2.4 Community Consensus
A key prerequisite for any benchmark to be impactful is that it must reflect the shared priorities and values of the broader research community. Benchmarks designed in isolation risk failing to gain acceptance if they overlook key metrics considered important by leading groups. Through collaborative development with open participation from academic labs, companies, and other stakeholders, benchmarks can incorporate collective input on critical capabilities worth measuring. This helps ensure the benchmarks evaluate aspects the community agrees are essential to advance the field. The process of reaching alignment on tasks and metrics itself supports converging on what matters most.
Furthermore, benchmarks published with broad co-authorship from respected institutions carry authority and validity that convinces the community to adopt them as trusted standards. Benchmarks perceived as biased by particular corporate or institutional interests breed skepticism. Ongoing community engagement through workshops and challenges is also key after the initial release, and that is what, for instance, led to the success of ImageNet. As research progresses, collective participation enables continual refinement and expansion of benchmarks over time.
Finally, releasing community-developed benchmarks with open access promotes their adoption and consistent use. By providing open-source code, documentation, models, and infrastructure, we reduce barriers to entry, enabling groups to benchmark solutions on an equal footing with standardized implementations. This consistency is essential for fair comparisons. Without coordination, labs and companies might implement benchmarks differently, which can undermine reproducibility and comparability of results.
Community consensus gives benchmarks lasting relevance, while fragmentation breeds confusion. Through collaborative development and transparent operation, benchmarks can become authoritative standards for tracking progress. Several of the benchmarks that we discuss in this chapter were developed and built by the community, for the community, and that is what ultimately led to their success.
11.3 AI Benchmarks: System, Model, and Data
The need for comprehensive benchmarking becomes paramount as AI systems grow in complexity and ubiquity. Within this context, benchmarks are often classified into three primary categories: System, Model, and Data. Let's dive into why each of these categories is essential and the significance of evaluating AI from these three distinct dimensions:
11.3.1 System Benchmarks
AI computations, especially those in deep learning, are resource-intensive. The hardware on which these computations run plays an important role in determining AI solutions’ speed, efficiency, and scalability. Consequently, hardware benchmarks help evaluate the performance of CPUs, GPUs, TPUs, and other accelerators in AI tasks. By understanding hardware performance, developers can choose which hardware platforms best suit specific AI applications. Furthermore, hardware manufacturers use these benchmarks to identify areas for improvement, driving innovation in AI-specific chip designs.
11.3.2 Model Benchmarks
The architecture, size, and complexity of AI models vary widely. Different models have different computational demands and offer varying levels of accuracy and efficiency. Model benchmarks help us assess the performance of various AI architectures on standardized tasks. They provide insights into different models’ speed, accuracy, and resource demands. By benchmarking models, researchers can identify best-performing architectures for specific tasks, guiding the AI community towards more efficient and effective solutions. Additionally, these benchmarks aid in tracking the progress of AI research, showcasing advancements in model design and optimization.
11.3.3 Data Benchmarks
In machine learning, data is foundational because the quality, scale, and diversity of datasets directly impact model efficacy and generalization. Data benchmarks focus on the datasets used in training and evaluation. They provide standardized datasets the community can use to train and test models, ensuring a level playing field for comparisons. Moreover, these benchmarks highlight data quality, diversity, and representation challenges, pushing the community to address biases and gaps in training data. By understanding data benchmarks, researchers can also gauge how models might perform in real-world scenarios, ensuring robustness and reliability.
In the remainder of the sections, we will discuss each of these benchmark types. The focus will be an in-depth exploration of system benchmarks, as these are critical to understanding and advancing machine learning system performance. We will briefly cover model and data benchmarks for a comprehensive perspective, but the emphasis and majority of the content will be devoted to system benchmarks.
11.4 System Benchmarking
11.4.1 Granularity
Machine learning system benchmarking provides a structured and systematic approach to assessing a system’s performance across various dimensions. Given the complexity of ML systems, we can dissect their performance through different levels of granularity and obtain a comprehensive view of the system’s efficiency, identify potential bottlenecks, and pinpoint areas for improvement. To this end, various types of benchmarks have evolved over the years and continue to persist.
Figure 11.1 illustrates the different layers of granularity of an ML system. At the application level, end-to-end benchmarks assess the overall system performance, considering factors like data preprocessing, model training, and inference. At the model layer, benchmarks focus on assessing the efficiency and accuracy of specific models. This includes evaluating how well models generalize to new data and their computational efficiency during training and inference. Furthermore, benchmarking can extend to hardware and software infrastructure, examining the performance of individual components like GPUs or TPUs.
Micro Benchmarks
Micro-benchmarks are specialized, evaluating distinct components or specific operations within a broader machine learning process. These benchmarks focus on individual tasks, offering insights into the computational demands of a particular neural network layer, the efficiency of a unique optimization technique, or the throughput of a specific activation function. For instance, practitioners might use micro-benchmarks to measure the computational time required by a convolutional layer in a deep learning model or to evaluate the speed of data preprocessing that feeds data into the model. Such granular assessments are instrumental in fine-tuning and optimizing discrete aspects of models, ensuring that each component operates at its peak potential.
Micro-benchmarks zoom in on very specific operations or components of the AI pipeline, such as the following:
Tensor Operations: Libraries like cuDNN (by NVIDIA) often have benchmarks to measure the performance of individual tensor operations, such as convolutions or matrix multiplications, which are foundational to deep learning computations.
Activation Functions: Benchmarks that measure the speed and efficiency of various activation functions like ReLU, Sigmoid, or Tanh in isolation.
Layer Benchmarks: Evaluations of the computational efficiency of distinct neural network layers, such as LSTM or Transformer blocks, when operating on standardized input sizes.
Example: DeepBench, introduced by Baidu, is a well-known micro-benchmark suite that evaluates fundamental deep learning operations such as those listed above. DeepBench assesses the performance of basic operations in deep learning models, providing insights into how different hardware platforms handle neural network training and inference.
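To make the idea concrete, below is a minimal micro-benchmark sketch that times a single dense matrix multiplication, the kind of isolated tensor operation that suites like DeepBench target. It uses NumPy purely for illustration; production micro-benchmarks measure vendor kernels (for example, cuDNN convolutions) with far more rigorous methodology, but the pattern of warm-up runs followed by repeated timed trials is the same.

```python
import time
import numpy as np

def benchmark_matmul(n=1024, warmup=3, trials=10):
    """Time an n x n matrix multiplication, a stand-in for the tensor
    operations that micro-benchmarks isolate."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    # Warm-up runs let caches, thread pools, and BLAS libraries settle.
    for _ in range(warmup):
        a @ b

    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        a @ b
        timings.append(time.perf_counter() - start)

    # A matmul performs roughly 2 * n^3 floating-point operations.
    gflops = 2 * n**3 / (min(timings) * 1e9)
    print(f"best: {min(timings) * 1e3:.2f} ms (~{gflops:.1f} GFLOP/s)")

benchmark_matmul()
```

Reporting the best of several trials, as done here, is one common convention; other suites report the median or mean, which is why benchmark rules must spell out exactly how results are aggregated.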
Exercise 11.1: System Benchmarking - Tensor Operations
Macro Benchmarks
Macro benchmarks provide a holistic view, assessing the end-to-end performance of entire machine learning models or comprehensive ML systems. Rather than focusing on individual operations, macro-benchmarks evaluate the collective efficacy of models under real-world scenarios or tasks. For example, a macro-benchmark might assess the complete performance of a deep learning model undertaking image classification on a dataset like ImageNet. This includes gauging accuracy, computational speed, and resource consumption. Similarly, one might measure the cumulative time and resources needed to train a natural language processing model on extensive text corpora or evaluate the performance of an entire recommendation system, from data ingestion to final user-specific outputs.
Examples of macro benchmarks that evaluate complete AI models include:
MLPerf Inference (Reddi et al. 2020): An industry-standard set of benchmarks for measuring the performance of machine learning software and hardware. MLPerf has a suite of dedicated benchmarks for specific scales, such as MLPerf Mobile for mobile class devices and MLPerf Tiny, which focuses on microcontrollers and other resource-constrained devices.
EEMBC’s MLMark: A benchmarking suite for evaluating the performance and power efficiency of embedded devices running machine learning workloads. This benchmark provides insights into how different hardware platforms handle tasks like image recognition or audio processing.
AI-Benchmark (Ignatov et al. 2019): A benchmarking tool designed for Android devices, it evaluates the performance of AI tasks on mobile devices, encompassing various real-world scenarios like image recognition, face parsing, and optical character recognition.
Reddi, Vijay Janapa, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, et al. 2020. “MLPerf Inference Benchmark.” In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 446–59. IEEE; IEEE. https://doi.org/10.1109/isca45697.2020.00045.
Ignatov, Andrey, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu, Lirong Xu, and Luc Van Gool. 2019. “AI Benchmark: All about Deep Learning on Smartphones in 2019.” In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 3617–35. IEEE. https://doi.org/10.1109/iccvw.2019.00447.
End-to-end Benchmarks
End-to-end benchmarks provide an all-inclusive evaluation that extends beyond the boundaries of the ML model itself. Instead of focusing solely on a machine learning model’s computational efficiency or accuracy, these benchmarks encompass the entire pipeline of an AI system. This includes initial data preprocessing, the core model’s performance, post-processing of the model’s outputs, and other integral components like storage and network interactions.
Data preprocessing is the first stage in many AI systems, transforming raw data into a format suitable for model training or inference. These preprocessing steps’ efficiency, scalability, and accuracy are vital for the overall system’s performance. End-to-end benchmarks assess this phase, ensuring that data cleaning, normalization, augmentation, or any other transformation process doesn’t become a bottleneck.
The post-processing phase also takes center stage. This involves interpreting the model’s raw outputs, possibly converting scores into meaningful categories, filtering results, or even integrating with other systems. In real-world applications, this phase is crucial for delivering actionable insights, and end-to-end benchmarks ensure it’s both efficient and effective.
Beyond the core AI operations, other system components are important in the overall performance and user experience. Storage solutions, whether cloud-based, on-premises, or hybrid, can significantly impact data retrieval and storage times, especially with vast AI datasets. Similarly, network interactions, vital for cloud-based AI solutions or distributed systems, can become performance bottlenecks if not optimized. End-to-end benchmarks holistically evaluate these components, ensuring that the entire system operates seamlessly, from data retrieval to final output delivery.
To date, there are no public, end-to-end benchmarks that take into account the role of data storage, network, and compute performance. Arguably, MLPerf Training and Inference come close to the idea of an end-to-end benchmark, but they are exclusively focused on ML model performance and do not represent real-world deployment scenarios of how models are used in the field. Nonetheless, they provide a very useful signal that helps assess AI system performance.
Given the inherent specificity of end-to-end benchmarking, it is typically performed internally at a company by instrumenting real production deployments of AI. This allows engineers to have a realistic understanding and breakdown of the performance, but given the sensitivity and specificity of the information, it is rarely reported outside of the company.
Understanding the Trade-offs
Different issues arise at different stages of an AI system. Micro-benchmarks help fine-tune individual components, macro-benchmarks aid in refining model architectures or algorithms, and end-to-end benchmarks guide the optimization of the entire workflow. By understanding where a problem lies, developers can apply targeted optimizations.
Moreover, while individual components of an AI system might perform optimally in isolation, bottlenecks can emerge when they interact. End-to-end benchmarks, in particular, are crucial to ensure that the entire system, when operating collectively, meets desired performance and efficiency standards.
Finally, organizations can make informed decisions on where to allocate resources by discerning performance bottlenecks or inefficiencies. For instance, if micro-benchmarks reveal inefficiencies in specific tensor operations, investments can be directed toward specialized hardware accelerators. Conversely, if end-to-end benchmarks indicate data retrieval issues, investments might be channeled toward better storage solutions.
11.4.2 Benchmark Components
At its core, an AI benchmark is more than just a test or a score; it’s a comprehensive evaluation framework. To understand this in-depth, let’s break down the typical components that go into an AI benchmark.
Standardized Datasets
Datasets serve as the foundation for most AI benchmarks. They provide a consistent data set on which models are trained and evaluated, ensuring a level playing field for comparisons.
Example: ImageNet, a large-scale dataset containing millions of labeled images spanning thousands of categories, is a popular benchmarking standard for image classification tasks.
Pre-defined Tasks
A benchmark should have a clear objective or task that models aim to achieve. This task defines the problem the AI system is trying to solve.
Example: Tasks for natural language processing benchmarks might include sentiment analysis, named entity recognition, or machine translation.
Evaluation Metrics
Once a task is defined, benchmarks require metrics to quantify performance. These metrics offer objective measures to compare different models or systems. In classification tasks, metrics like accuracy, precision, recall, and F1 score are commonly used. Mean squared or absolute errors might be employed for regression tasks. We can also measure the power consumed by the benchmark execution to calculate energy efficiency.
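As a small illustration of how such metrics are computed in practice, the sketch below applies scikit-learn's metric functions to toy predictions; a real benchmark would run the same calls on a model's outputs over the benchmark's held-out test set.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, mean_squared_error)

# Toy classification results: ground-truth labels vs. model predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))

# Toy regression results evaluated with mean squared error.
print("mse      :", mean_squared_error([2.5, 0.0, 2.1], [3.0, -0.1, 2.0]))
```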
Baselines and Baseline Models
Benchmarks often include baseline models or reference implementations. These usually serve as starting points or minimum performance standards for comparing new models or techniques. Baseline models help researchers measure the effectiveness of new algorithms.
In benchmark suites, simple models like linear regression or basic neural networks are often the common baselines. These provide context when evaluating more complex models. By comparing against these simpler models, researchers can quantify improvements from advanced approaches.
Hardware and Software Specifications
Given the variability introduced by different hardware and software configurations, benchmarks often specify or document the hardware and software environments in which tests are conducted.
Example: An AI benchmark might note that evaluations were conducted on an NVIDIA Tesla V100 GPU using TensorFlow v2.4.
Environmental Conditions
As external factors can influence benchmark results, it’s essential to either control or document conditions like temperature, power source, or system background processes.
Example: Mobile AI benchmarks might specify that tests were conducted at room temperature with devices plugged into a power source to eliminate battery-level variances.
Reproducibility Rules
To ensure benchmarks are credible and can be replicated by others in the community, they often include detailed protocols covering everything from random seeds used to exact hyperparameters.
Example: A benchmark for a reinforcement learning task might detail the exact training episodes, exploration-exploitation ratios, and reward structures used.
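As a minimal sketch of what such a protocol can look like in code, assuming a PyTorch-based setup, the snippet below pins the common sources of randomness and writes the exact hyperparameters next to the results so that another group can rerun the same configuration. Real benchmark rules typically go further, specifying framework versions, data ordering, and hardware details.

```python
import json
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Pin the random seeds that typically affect training runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN kernel selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

# Record the exact configuration alongside the benchmark results.
config = {"seed": 42, "learning_rate": 1e-3, "batch_size": 128, "epochs": 90}
set_seed(config["seed"])
with open("benchmark_config.json", "w") as f:
    json.dump(config, f, indent=2)
```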
Result Interpretation Guidelines
Beyond raw scores or metrics, benchmarks often provide guidelines or context to interpret results, helping practitioners understand the broader implications.
Example: A benchmark might highlight that while Model A scored higher than Model B in accuracy, it offers better real-time performance, making it more suitable for time-sensitive applications.
11.4.3 Training Benchmarks
The development life cycle of a machine learning model involves two critical phases - training and inference. Training represents the phase where the system processes and ingests raw data to adjust and refine its parameters. Benchmarking the training phase reveals how choices in data pipelines, storage solutions, model architectures, computing resources, hyperparameter settings, and optimization algorithms affect the efficiency and resource demands of model training. The goal is to ensure that the ML system can efficiently learn from data, optimizing both the model’s performance and the system’s resource utilization.
Purpose
From a systems perspective, training machine learning models is resource-intensive, especially when working with large models. These models often contain billions or even trillions of trainable parameters and require enormous amounts of data, often on the scale of many terabytes. For example, OpenAI’s GPT-3 (Brown et al. 2020) has 175 billion parameters, was trained on 45 TB of compressed plaintext data, and required 3,640 petaflop-days of compute for pretraining. ML training benchmarks evaluate the systems and resources required to manage the computational load of training such models.
Efficient data storage and delivery during training also play a major role in the training process. For instance, in a machine learning model that predicts bounding boxes around objects in an image, thousands of images may be required. However, loading an entire image dataset into memory is typically infeasible, so practitioners rely on data loaders (as discussed in Section 6.4.3.1) from ML frameworks. Successful model training depends on timely and efficient data delivery, making it essential to benchmark tools like data loaders, data pipelines, preprocessing speed, and storage retrieval times to understand their impact on training performance.
Hardware selection is another key factor in training machine learning systems, as it can significantly impact training time. Training benchmarks evaluate CPU, GPU, memory, and network utilization during the training phase to guide system optimizations. Understanding how resources are used is essential: Are GPUs being fully leveraged? Is there unnecessary memory overhead? Benchmarks can uncover bottlenecks or inefficiencies in resource utilization, leading to cost savings and performance improvements.
In many cases, using a single hardware accelerator, such as a single GPU, is insufficient to meet the computational demands of large-scale model training. Machine learning models are often trained in data centers with multiple GPUs or TPUs, where distributed computing enables parallel processing across nodes. Training benchmarks assess how efficiently the system scales across multiple nodes, manages data sharding, and handles challenges like node failures or drop-offs during training.
Metrics
When viewed from a systems perspective, training metrics offer insights that transcend conventional algorithmic performance indicators. These metrics measure the model's learning efficacy and gauge the efficiency, scalability, and robustness of the entire ML system during the training phase. Let's explore these metrics and their significance in more depth.
The following metrics are often considered important:
Training Time: The time it takes to train a model from scratch until it reaches a satisfactory performance level. It directly measures the computational resources required to train a model. For example, Google's BERT (Devlin et al. 2019) is a natural language processing model that requires several days to train on a massive corpus of text data using multiple GPUs. The long training time poses a significant resource consumption and cost challenge. In some cases, benchmarks can instead measure the training throughput (training samples per unit of time). Throughput can be calculated much faster and more easily than training time but may obscure the metric we really care about (e.g., time to train).
Scalability: How well the training process can handle increases in data size or model complexity. Scalability can be assessed by measuring training time, memory usage, and other resource consumption as data size or model complexity increases. For instance, training OpenAI’s GPT-3 required extensive engineering efforts to scale the training process across many GPU nodes to handle the massive model size. This involved using specialized hardware, distributed training, and other techniques to ensure the model could be trained efficiently.
Resource Utilization: The extent to which the training process utilizes available computational resources such as CPU, GPU, memory, and disk I/O. High resource utilization can indicate an efficient training process, while low utilization can suggest bottlenecks or inefficiencies. For instance, training a convolutional neural network (CNN) for image classification requires significant GPU resources. Utilizing multi-GPU setups and optimizing the training code for GPU acceleration can greatly improve resource utilization and training efficiency.
Memory Consumption: The amount of memory the training process uses. Memory consumption can be a limiting factor for training large models or datasets. For example, Google researchers faced significant memory consumption challenges when training BERT. The model has hundreds of millions of parameters, requiring large amounts of memory. The researchers had to develop techniques to reduce memory consumption, such as gradient checkpointing and model parallelism.
Energy Consumption: The energy consumed during training. As machine learning models become more complex, energy consumption has become an important consideration. Training large machine learning models can consume significant energy, leading to a large carbon footprint. For instance, the training of OpenAI’s GPT-3 was estimated to have a carbon footprint equivalent to traveling by car for 700,000 kilometers (~435,000 miles).
Throughput: The number of training samples processed per unit time. Higher throughput generally indicates a more efficient training process. The throughput is an important metric to consider when training a recommendation system for an e-commerce platform. A high throughput ensures that the model can process large volumes of user interaction data promptly, which is crucial for maintaining the relevance and accuracy of the recommendations. But it’s also important to understand how to balance throughput with latency bounds. Therefore, a latency-bounded throughput constraint is often imposed on service-level agreements for data center application deployments.
Cost: The cost of training a model can include both computational and human resources. Cost is important when considering the practicality and feasibility of training large or complex models. Training large language models like GPT-3 is estimated to cost millions of dollars. This cost includes the computational resources, electricity, and human effort required for model development and training.
Fault Tolerance and Robustness: The ability of the training process to handle failures or errors without crashing or producing incorrect results. This is important for ensuring the reliability of the training process. Network failures or hardware malfunctions can occur in a real-world scenario where a machine-learning model is being trained on a distributed system. In recent years, it has become abundantly clear that faults arising from silent data corruption have emerged as a major issue. A fault-tolerant and robust training process can recover from such failures without compromising the model’s integrity.
Ease of Use and Flexibility: The ease with which the training process can be set up and used and its flexibility in handling different types of data and models. In companies like Google, efficiency can sometimes be measured by the number of Software Engineer (SWE) years saved since that translates directly to impact. Ease of use and flexibility can reduce the time and effort required to train a model. TensorFlow and PyTorch are popular machine-learning frameworks that provide user-friendly interfaces and flexible APIs for building and training machine-learning models. These frameworks support many model architectures and are equipped with tools that simplify the training process.
Reproducibility: The ability to reproduce the training process results. Reproducibility is important for verifying a model’s correctness and validity. However, variations due to stochastic network characteristics often make it hard to reproduce the precise behavior of applications being trained, which can present a challenge for benchmarking.
By benchmarking for these types of metrics, we can obtain a comprehensive view of the training process’s performance and efficiency from a systems perspective. This can help identify areas for improvement and ensure that resources are used effectively.
Benchmarks
Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for training machine learning systems.
MLPerf Training Benchmark: MLPerf is a suite of benchmarks designed to measure the performance of machine learning hardware, software, and services. The MLPerf Training benchmark (Mattson et al. 2020a) focuses on the time it takes to train models to a target quality metric. It includes diverse workloads, such as image classification, object detection, translation, and reinforcement learning. Figure 11.2 highlights the performance improvements across successive versions of the MLPerf Training benchmark, which have all outpaced Moore's Law. Tracking standardized benchmarks over time lets us rigorously showcase the rapid evolution of ML computing.
Metrics:
- Training time to target quality
- Throughput (examples per second)
- Resource utilization (CPU, GPU, memory, disk I/O)
DAWNBench: DAWNBench (Coleman et al. 2019) is a benchmark suite focusing on end-to-end deep learning training time and inference performance. It includes common tasks such as image classification and question answering.
Coleman, Cody, Daniel Kang, Deepak Narayanan, Luigi Nardi, Tian Zhao, Jian Zhang, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. 2019. “Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark.” ACM SIGOPS Operating Systems Review 53 (1): 14–25. https://doi.org/10.1145/3352020.3352024.
Metrics:
- Time to train to target accuracy
- Inference latency
- Cost (in terms of cloud computing and storage resources)
Fathom: Fathom (Adolf et al. 2016) is a benchmark from Harvard University that evaluates the performance of deep learning models using a diverse set of workloads. These include common tasks such as image classification, speech recognition, and language modeling.
Adolf, Robert, Saketh Rama, Brandon Reagen, Gu-yeon Wei, and David Brooks. 2016. “Fathom: Reference Workloads for Modern Deep Learning Methods.” In 2016 IEEE International Symposium on Workload Characterization (IISWC), 1–10. IEEE; IEEE. https://doi.org/10.1109/iiswc.2016.7581275.
Metrics:
- Operations per second (to measure computational efficiency)
- Time to completion for each workload
- Memory bandwidth
Example Use Case
Imagine you have been tasked with benchmarking the training performance of an image classification model on a specific hardware platform. Let’s break down how you might approach this:
Define the Task: First, choose a model and dataset. In this case, you’ll be training a CNN to classify images in the CIFAR-10 dataset, a widely used benchmark in computer vision.
Select the Benchmark: Choosing a widely accepted benchmark helps ensure your setup is comparable with other real-world evaluations. You could choose to use the MLPerf Training benchmark because it provides a structured image classification workload, making it a relevant and standardized option for assessing training performance on CIFAR-10. Using MLPerf enables you to evaluate your system against industry-standard metrics, helping to ensure that results are meaningful and comparable to those achieved on other hardware platforms.
Identify Key Metrics: Now, decide on the metrics that will help you evaluate the system’s training performance. For this example, you might track:
- Training Time: How long does it take to reach 90% accuracy?
- Throughput: How many images are processed per second?
- Resource Utilization: What’s the GPU and CPU usage throughout training?
By analyzing these metrics, you'll gain insights into the model's training performance on your chosen hardware platform. Consider whether the training time meets your expectations and whether there are any bottlenecks, such as underutilized GPUs or slow data loading. This process helps identify areas for potential optimization, like improving data handling or adjusting resource allocation, and can guide future benchmarking decisions. The sketch below shows one way such a run might be instrumented.
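Here is a minimal sketch, assuming you already have training and evaluation routines for your CNN, of how time-to-accuracy and throughput might be instrumented; `train_one_epoch` and `evaluate` are hypothetical placeholders for your own code, and the dummy callables at the end exist only so the example runs.

```python
import random
import time

def benchmark_training(train_one_epoch, evaluate, target_accuracy=0.90,
                       max_epochs=100):
    """Measure time-to-accuracy and training throughput.

    train_one_epoch() runs one epoch and returns the number of samples
    processed; evaluate() returns validation accuracy. Both are
    placeholders for your own training and validation routines.
    """
    samples_seen = 0
    start = time.perf_counter()
    for epoch in range(max_epochs):
        samples_seen += train_one_epoch()
        accuracy = evaluate()
        elapsed = time.perf_counter() - start
        print(f"epoch {epoch}: acc={accuracy:.3f}, "
              f"throughput={samples_seen / elapsed:.0f} samples/s")
        if accuracy >= target_accuracy:
            print(f"reached {target_accuracy:.0%} after {elapsed / 60:.1f} min")
            return elapsed
    return None  # quality target not reached within the epoch budget

# Demo with dummy callables standing in for a real CIFAR-10 training loop.
benchmark_training(lambda: 50_000, lambda: random.uniform(0.85, 0.95))
```

Resource utilization (GPU, CPU, memory) is usually sampled in parallel with an external tool such as nvidia-smi or a profiler rather than inside the training loop itself.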
11.4.4 Inference Benchmarks
Inference in machine learning refers to using a trained model to make predictions on new, unseen data. It is the phase where the model applies its learned knowledge to solve the problem it was designed for, such as classifying images, recognizing speech, or translating text.
Purpose
When we build machine learning models, our ultimate goal is to deploy them in real-world applications where they can provide accurate and reliable predictions on new, unseen data. This process of using a trained model to make predictions is known as inference. A machine learning model’s real-world performance can differ significantly from its performance on training or validation datasets, which makes benchmarking inference a crucial step in the development and deployment of machine learning models.
Benchmarking inference allows us to evaluate how well a machine-learning model performs in real-world scenarios. This evaluation ensures that the model is practical and reliable when deployed in applications, providing a more comprehensive understanding of the model’s behavior with real data. Additionally, benchmarking can help identify potential bottlenecks or limitations in the model’s performance. For example, if a model takes too long to predict, it may be impractical for real-time applications such as autonomous driving or voice assistants.
Resource efficiency is another critical aspect of inference, as it can be computationally intensive and require significant memory and processing power. Benchmarking helps ensure that the model is efficient regarding resource usage, which is particularly important for edge devices with limited computational capabilities, such as smartphones or IoT devices. Moreover, benchmarking allows us to compare the performance of our model with competing models or previous versions of the same model. This comparison is essential for making informed decisions about which model to deploy in a specific application.
Finally, it is vital to ensure that the model’s predictions are not only accurate but also consistent across different data points. Benchmarking helps verify the model’s accuracy and consistency, ensuring that it meets the application’s requirements. It also assesses the model’s robustness, ensuring that it can handle real-world data variability and still make accurate predictions.
Metrics
Accuracy: Accuracy is one of the most vital metrics when benchmarking machine learning models. It quantifies the proportion of correct predictions made by the model compared to the true values or labels. For example, if a spam detection model can correctly classify 95 out of 100 email messages as spam or not, its accuracy would be calculated as 95%.
Latency: Latency is a performance metric that measures the time lag or delay between receiving an input and producing the corresponding output. An example that clearly depicts latency is a real-time translation application; if a half-second delay exists from the moment a user inputs a sentence to the time the app displays the translated text, then the system's latency is 0.5 seconds.
Latency-Bounded Throughput: Latency-bounded throughput is a valuable metric that combines the aspects of latency and throughput, measuring the maximum throughput of a system while still meeting a specified latency constraint. For example, in a video streaming application that utilizes a machine learning model to generate and display subtitles automatically, latency-bounded throughput would measure how many video frames the system can process per second (throughput) while ensuring that the subtitles are displayed with no more than a 1-second delay (latency). This metric is particularly important in real-time applications where meeting latency requirements is crucial to the user experience.
Throughput: Throughput assesses the system’s capacity by measuring the number of inferences or predictions a machine learning model can handle within a specific unit of time. Consider a speech recognition system that employs a Recurrent Neural Network (RNN) as its underlying model; if this system can process and understand 50 different audio clips in a minute, then its throughput rate stands at 50 clips per minute.
Energy Efficiency: Energy efficiency is a metric that determines the amount of energy consumed by the machine learning model to perform a single inference. A prime example of this would be a natural language processing model built on a Transformer network architecture; if it utilizes 0.1 Joules of energy to translate a sentence from English to French, its energy efficiency is measured at 0.1 Joules per inference.
Memory Usage: Memory usage quantifies the volume of RAM needed by a machine learning model to carry out inference tasks. A relevant example to illustrate this would be a face recognition system based on a CNN; if such a system requires 150 MB of RAM to process and recognize faces within an image, its memory usage is 150 MB.
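To ground these definitions, the sketch below measures single-stream latency percentiles, average throughput, and a latency-bounded throughput check; `predict` stands in for any model's inference call, and the 10 ms bound is an arbitrary example rather than a requirement from any particular benchmark.

```python
import time
import numpy as np

def benchmark_inference(predict, sample, warmup=10, trials=1000,
                        latency_bound_ms=10.0):
    """Measure per-inference latency and latency-bounded throughput."""
    for _ in range(warmup):
        predict(sample)

    latencies_ms = []
    for _ in range(trials):
        start = time.perf_counter()
        predict(sample)
        latencies_ms.append((time.perf_counter() - start) * 1e3)

    latencies_ms = np.array(latencies_ms)
    p50, p99 = np.percentile(latencies_ms, [50, 99])
    within_bound = np.mean(latencies_ms <= latency_bound_ms)
    print(f"p50={p50:.2f} ms, p99={p99:.2f} ms")
    print(f"throughput={1000 / latencies_ms.mean():.1f} inferences/s, "
          f"{within_bound:.1%} of requests met the {latency_bound_ms} ms bound")

# Demo with a stand-in predict function (replace with a real model call).
benchmark_inference(lambda x: sum(x), sample=list(range(1000)))
```

Energy efficiency and memory usage are normally captured alongside such timing loops with an external power monitor and the platform's memory profiling tools.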
Benchmarks
Here are some original works that laid the fundamental groundwork for developing systematic benchmarks for machine learning inference systems.
MLPerf Inference Benchmark: MLPerf Inference is a comprehensive benchmark suite that assesses machine learning models' performance during the inference phase. It encompasses a variety of workloads, including image classification, object detection, and natural language processing, aiming to provide standardized and insightful metrics for evaluating different inference systems.
Metrics:
- Inference time
- Latency
- Throughput
- Accuracy
- Energy consumption
AI Benchmark: AI Benchmark is a benchmarking tool that evaluates the performance of AI and machine learning models on mobile devices and edge computing platforms. It includes tests for image classification, object detection, and natural language processing tasks, providing a detailed analysis of the inference performance on different hardware platforms.
Metrics:
- Inference time
- Latency
- Energy consumption
- Memory usage
- Throughput
OpenVINO toolkit: The OpenVINO toolkit provides a benchmark tool to measure the performance of deep learning models for various tasks, such as image classification, object detection, and facial recognition, on Intel hardware. It offers detailed insights into the models' inference performance on different hardware configurations.
Metrics:
- Inference time
- Throughput
- Latency
- CPU and GPU utilization
Example Use Case
Suppose you were tasked with evaluating the inference performance of an object detection model on a specific edge device. Here’s how you might approach structuring this benchmark:
Define the Task: In this case, the task is real-time object detection on video streams, identifying objects such as vehicles, pedestrians, and traffic signs.
Select the Benchmark: To align with your goal of evaluating inference on an edge device, the AI Benchmark is a suitable choice. It provides a standardized framework specifically for assessing inference performance on edge hardware, making it relevant to this scenario.
Identify Key Metrics: Now, determine the metrics that will help evaluate the model’s inference performance. For this example, you might track:
- Inference Time: How long does it take to process each video frame?
- Latency: What is the delay in generating bounding boxes for detected objects?
- Energy Consumption: How much power is used during inference?
- Throughput: How many video frames are processed per second?
By measuring these metrics, you’ll gain insights into how well the object detection model performs on the edge device. This can help identify any bottlenecks, such as slow frame processing or high energy consumption, and highlight areas for potential optimization to improve real-time performance.
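A rough sketch of how the frame-level timing in this scenario might be collected on device is shown below, using the TensorFlow Lite runtime with a dummy input; `detector.tflite` is a hypothetical model file, and energy consumption would be logged separately with an external power monitor.

```python
import time
import numpy as np
import tflite_runtime.interpreter as tflite

# "detector.tflite" is a placeholder for your own object detection model.
interpreter = tflite.Interpreter(model_path="detector.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()

# Dummy frame shaped like the model's expected input (real code would
# feed preprocessed camera frames here).
frame = np.zeros(input_details["shape"], dtype=input_details["dtype"])

latencies = []
for _ in range(200):
    start = time.perf_counter()
    interpreter.set_tensor(input_details["index"], frame)
    interpreter.invoke()
    detections = interpreter.get_tensor(output_details[0]["index"])
    latencies.append(time.perf_counter() - start)

print(f"mean frame latency: {1e3 * np.mean(latencies):.1f} ms")
print(f"achievable frame rate: {1.0 / np.mean(latencies):.1f} FPS")
```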
Exercise 11.2: Inference Benchmarks - MLPerf
Get ready to put your AI models to the ultimate test! MLPerf is like the Olympics for machine learning performance. In this Colab, we’ll use a toolkit called CK to run official MLPerf benchmarks, measure how fast and accurate your model is, and even use TVM to give it a super speed boost. Are you ready to see your model earn its medal?
11.4.5 Benchmark Task Selection
Selecting representative tasks for benchmarking machine learning systems is complex due to the varied applications, data types, and requirements across different domains. Machine learning is applied in fields such as healthcare, finance, natural language processing, and computer vision, each with unique tasks that may not be relevant or comparable to others. Key challenges in task selection include:
- Diversity of Applications and Data Types: Tasks across domains involve different data types (e.g., text, images, video) and qualities, making it difficult to find benchmarks that universally represent ML challenges.
- Task Complexity and Resource Needs: Tasks vary in complexity and resource demands, with some requiring substantial computational power and sophisticated models, while others can be addressed with simpler resources and methods.
- Privacy Concerns: Tasks involving sensitive data, such as medical records or personal information, introduce ethical and privacy issues, making them unsuitable for general benchmarks.
- Evaluation Metrics: Performance metrics vary widely across tasks, and results from one task often do not generalize to others, complicating comparisons and limiting insights from one benchmarked task to another.
Addressing these challenges is essential to designing meaningful benchmarks that are relevant across the diverse tasks encountered in machine learning, ensuring benchmarks provide useful, generalizable insights for both training and inference.
11.4.6 Measuring Energy Efficiency
As machine learning capabilities expand, both in training and inference, concerns about increased power consumption and its ecological footprint have intensified. Addressing the sustainability of ML systems, a topic explored in more depth in the Sustainable AI chapter, has thus become a key priority. This focus on sustainability has led to the development of standardized benchmarks designed to accurately measure energy efficiency. However, standardizing these methodologies poses challenges due to the need to accommodate vastly different scales—from the microwatt consumption of TinyML devices to the megawatt demands of data center training systems. Moreover, ensuring that benchmarking is fair and reproducible requires accommodating the diverse range of hardware configurations and architectures in use today.
One example is the MLPerf Power benchmarking methodology (Tschand et al. 2024), which tackles these challenges by tailoring the methodologies for datacenter, edge inference, and tiny inference systems while measuring power consumption as comprehensively as possible for each scale. This methodology adapts to a variety of hardware, from general-purpose CPUs to specialized AI accelerators, while maintaining uniform measurement principles to ensure that comparisons are both fair and accurate across different platforms.
Figure 11.3 illustrates the power measurement boundaries for different system scales, from TinyML devices to inference nodes and training racks. Each example highlights the components within the measurement boundary and those outside it. This setup allows for accurate reflection of the true energy costs associated with running ML workloads across various real-world scenarios, and ensures that the benchmark captures the full spectrum of energy consumption.
It is important to note that optimizing a system for performance may not lead to the most energy efficient execution. Oftentimes, sacrificing a small amount of performance or accuracy can lead to significant gains in energy efficiency, highlighting the importance of accurately benchmarking power metrics. Future insights from energy efficiency and sustainability benchmarking will enable us to optimize for more sustainable ML systems.
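As a minimal sketch of the arithmetic involved, assuming you have a power trace logged by an external analyzer while the benchmark ran, energy per inference can be derived as shown below; the actual MLPerf Power methodology is considerably more involved, covering measurement boundaries, synchronization, and accuracy requirements.

```python
def energy_per_inference(power_samples_w, sample_interval_s, num_inferences):
    """Estimate energy efficiency from an externally logged power trace.

    power_samples_w: power readings (watts) captured while the benchmark ran.
    sample_interval_s: time between successive readings, in seconds.
    """
    # Rectangle-rule integration of the power trace gives energy in joules.
    energy_joules = sum(power_samples_w) * sample_interval_s
    return energy_joules / num_inferences

# Illustrative values: a 0.5 s trace sampled every 0.1 s over 1,000 inferences.
power_trace_w = [2.1, 2.4, 2.3, 2.5, 2.2]
print(f"{energy_per_inference(power_trace_w, 0.1, 1000) * 1e3:.3f} mJ/inference")
```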
11.4.7 Benchmark Example
To properly illustrate the components of a systems benchmark, we can look at the keyword spotting benchmark in MLPerf Tiny and explain the motivation behind each decision.
Task
Keyword spotting was selected as a task because it is a common use case in TinyML that has been well-established for years. Additionally, the typical hardware used for keyword spotting differs substantially from the offerings of other benchmarks, such as MLPerf Inference’s speech recognition task.
Dataset
Google Speech Commands (Warden 2018) was selected as the best dataset to represent the task. The dataset is well-established in the research community and has permissive licensing, allowing it to be easily used in a benchmark.
Warden, Pete. 2018. “Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition.” ArXiv Preprint abs/1804.03209 (April). http://arxiv.org/abs/1804.03209v1.
Model
The next core component is the model, which acts as the primary workload for the benchmark. The model should be well established as a solution to the selected task rather than a state-of-the-art solution. The model selected is a simple depthwise separable convolution model. This architecture is not the state-of-the-art solution to the task, but it is well-established and not designed for a specific hardware platform like many state-of-the-art solutions. Despite being an inference benchmark, MLPerf Tiny also establishes a reference training recipe so that results are fully reproducible and transparent.
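For illustration, here is a rough Keras sketch of such a depthwise separable convolution network, assuming MFCC-style input features of shape 49x10x1 and the 12 output classes commonly used with Speech Commands (ten keywords plus "silence" and "unknown"); consult the official MLPerf Tiny reference implementation for the exact architecture and training recipe.

```python
import tensorflow as tf
from tensorflow.keras import layers

def ds_cnn(input_shape=(49, 10, 1), num_classes=12):
    """Sketch of a small depthwise-separable CNN for keyword spotting."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(64, (10, 4), strides=(2, 2), padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # Depthwise-separable blocks: depthwise conv followed by a 1x1 pointwise conv.
    for _ in range(4):
        x = layers.DepthwiseConv2D((3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Conv2D(64, (1, 1), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = ds_cnn()
model.summary()
```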
Metrics
Latency was selected as the primary metric for the benchmark, as keyword spotting systems need to react quickly to maintain user satisfaction. Additionally, given that TinyML systems are often battery-powered, energy consumption is measured to ensure the hardware platform is efficient. The accuracy of the model is also measured to ensure that the optimizations applied by a submitter, such as quantization, don’t degrade the accuracy beyond a threshold.
Benchmark Harness
MLPerf Tiny uses EEMBC's EnergyRunner benchmark harness to load the inputs to the model and to isolate and measure the device's energy consumption. When measuring energy consumption, it's critical to select a harness that is accurate at the expected power levels of the devices under test and simple enough not to become a burden for benchmark participants.
Baseline Submission
Baseline submissions are critical for contextualizing results and as a reference point to help participants get started. The baseline submission should prioritize simplicity and readability over state-of-the-art performance. The keyword spotting baseline uses a standard STM microcontroller as its hardware and TensorFlow Lite for Microcontrollers (David et al. 2021) as its inference framework.
David, Robert, Jared Duke, Advait Jain, Vijay Janapa Reddi, Nat Jeffries, Jian Li, Nick Kreeger, et al. 2021. “Tensorflow Lite Micro: Embedded Machine Learning for Tinyml Systems.” Proceedings of Machine Learning and Systems 3: 800–811.
11.4.8 Challenges and Limitations
While benchmarking provides a structured methodology for performance evaluation in complex domains like artificial intelligence and computing, the process also poses several challenges. If not properly addressed, these challenges can undermine the credibility and accuracy of benchmarking results. Some of the predominant difficulties faced in benchmarking include the following:
Incomplete problem coverage: Benchmark tasks may not fully represent the problem space. For instance, common image classification datasets like CIFAR-10 have limited diversity in image types. Algorithms tuned for such benchmarks may fail to generalize well to real-world datasets.
Statistical insignificance: Benchmarks must have enough trials and data samples to produce statistically significant results. For example, benchmarking an OCR model on only a few text scans may not adequately capture its true error rates.
Limited reproducibility: Varying hardware, software versions, codebases, and other factors can reduce the reproducibility of benchmark results. MLPerf addresses this by providing reference implementations and environment specifications.
Misalignment with end goals: Benchmarks focusing only on speed or accuracy metrics may be misaligned with real-world objectives like cost and power efficiency. Benchmarks must reflect all critical performance axes.
Rapid staleness: Due to the rapid pace of advancements in AI and computing, benchmarks and their datasets can quickly become outdated. Maintaining up-to-date benchmarks is thus a persistent challenge.
But of all these, the most important challenge is benchmark engineering.
Hardware Lottery
The hardware lottery, first described by Hooker (2021), refers to the situation where a machine learning model’s success or efficiency is significantly influenced by its compatibility with the underlying hardware (Chu et al. 2021). Some models perform exceptionally well not because they are intrinsically superior, but because they are optimized for specific hardware characteristics, such as the parallel processing capabilities of Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs).
Hooker, Sara. 2021. “The Hardware Lottery.” Communications of the ACM 64 (12): 58–65. https://doi.org/10.1145/3467017.
For instance, Figure 11.4 compares the performance of models across different hardware platforms. The multi-hardware models show comparable results to “MobileNetV3 Large min” on both the CPU uint8 and GPU configurations. However, these multi-hardware models demonstrate significant performance improvements over the MobileNetV3 Large baseline when run on the EdgeTPU and DSP hardware. This emphasizes the variable efficiency of multi-hardware models in specialized computing environments.
Hardware lottery can introduce challenges and biases in benchmarking machine learning systems, as the model’s performance is not solely dependent on the model’s architecture or algorithm but also on the compatibility and synergies with the underlying hardware. This can make it difficult to compare different models fairly and to identify the best model based on its intrinsic merits. It can also lead to a situation where the community converges on models that are a good fit for the popular hardware of the day, potentially overlooking other models that might be superior but incompatible with the current hardware trends.
Benchmark Engineering
Hardware lottery occurs when a machine learning model unintentionally performs exceptionally well or poorly on a specific hardware setup due to unforeseen compatibility or incompatibility. The model is not explicitly designed or optimized for that particular hardware by the developers or engineers; rather, it happens to align or (mis)align with the hardware’s capabilities or limitations. In this case, the model’s performance on the hardware is a byproduct of coincidence rather than design.
In contrast to the accidental hardware lottery, benchmark engineering involves deliberately optimizing or designing a machine learning model to perform exceptionally well on specific hardware, often to win benchmarks or competitions. This intentional optimization might include tweaking the model’s architecture, algorithms, or parameters to exploit the hardware’s features and capabilities fully.
Problem
Benchmark engineering refers to tweaking or modifying an AI system to optimize performance on specific benchmark tests, often at the expense of generalizability or real-world performance. This can include adjusting hyperparameters, training data, or other aspects of the system specifically to achieve high scores on benchmark metrics without necessarily improving the overall functionality or utility of the system.
The motivation behind benchmark engineering often stems from the desire to achieve high-performance scores for marketing or competitive purposes. High benchmark scores can demonstrate the superiority of an AI system compared to competitors and can be a key selling point for potential users or investors. This pressure to perform well on benchmarks sometimes leads to prioritizing benchmark-specific optimizations over more holistic improvements to the system.
It can lead to several risks and challenges. One of the primary risks is that the AI system may perform worse in real-world applications than its benchmark scores suggest. This can lead to user dissatisfaction, reputational damage, and potential safety or ethical concerns. Furthermore, benchmark engineering can contribute to a lack of transparency and accountability in the AI community, as it can be difficult to discern how much of an AI system’s performance is due to genuine improvements versus benchmark-specific optimizations.
The AI community must prioritize transparency and accountability to mitigate the risks associated with benchmark engineering. This can include disclosing any optimizations or adjustments made specifically for benchmark tests and providing more comprehensive evaluations of AI systems that include real-world performance metrics and benchmark scores. Researchers and developers must prioritize holistic improvements to AI systems that improve their generalizability and functionality across various applications rather than focusing solely on benchmark-specific optimizations.
Issues
One of the primary problems with benchmark engineering is that it can compromise the real-world performance of AI systems. When developers focus on optimizing their systems to achieve high scores on specific benchmark tests, they may neglect other aspects of system performance that are crucial in real-world applications. For example, an AI system designed for image recognition might be engineered to perform exceptionally well on a benchmark test that includes a specific set of images, yet struggle to accurately recognize images that differ slightly from those in the test set.
Another problem with benchmark engineering is that it can result in AI systems that lack generalizability. In other words, while the system may perform well on the benchmark test, it may struggle to handle a diverse range of inputs or scenarios. For instance, an AI model developed for natural language processing might be engineered to achieve high scores on a benchmark test that includes a specific type of text but fail to accurately process text that falls outside of that type.
It can also lead to misleading results. When AI systems are engineered to perform well on benchmark tests, the results may not accurately reflect the system’s true capabilities. This can be problematic for users or investors who rely on benchmark scores to make informed decisions about which AI systems to use or invest in. For example, an AI system engineered to achieve high scores on a benchmark test for speech recognition might be far less capable of accurately recognizing speech in real-world situations, leading users or investors to make decisions based on inaccurate information.
Mitigation
There are several ways to mitigate benchmark engineering. Transparency in the benchmarking process is crucial to maintaining benchmark accuracy and reliability. This involves clearly disclosing the methodologies, data sets, and evaluation criteria used in benchmark tests, as well as any optimizations or adjustments made to the AI system for the purpose of the benchmark.
One way to achieve transparency is through the use of open-source benchmarks. Open-source benchmarks are made publicly available, allowing researchers, developers, and other stakeholders to review, critique, and contribute to them, thereby ensuring their accuracy and reliability. This collaborative approach also facilitates sharing best practices and developing more robust and comprehensive benchmarks.
The modular design of MLPerf Tiny connects to the problem of benchmark engineering by providing a structured yet flexible approach that encourages a balanced evaluation of TinyML. In benchmark engineering, systems may be overly optimized for specific benchmarks, leading to inflated performance scores that don’t necessarily translate to real-world effectiveness. MLPerf Tiny’s modular design aims to address this issue by allowing contributors to swap out and test specific components within a standardized framework, such as hardware, quantization techniques, or inference models. The reference implementations, highlighted in green and orange in Figure 11.5, provide a baseline for results, enabling flexible yet controlled testing by specifying which components can be modified. This structure supports transparency and flexibility, enabling a focus on genuine improvements rather than benchmark-specific optimizations.
Another method for achieving transparency is through peer review of benchmarks. This involves having independent experts review and validate the benchmark’s methodology, data sets, and results to ensure their credibility and reliability. Peer review can provide a valuable means of verifying the accuracy of benchmark tests and help build confidence in the results.
Standardization of benchmarks is another important solution to mitigate benchmark engineering. Standardized benchmarks provide a common framework for evaluating AI systems, ensuring consistency and comparability across different systems and applications. This can be achieved by developing industry-wide standards and best practices for benchmarking and through common metrics and evaluation criteria.
Third-party verification of results can also be valuable in mitigating benchmark engineering. This involves having an independent third party verify the results of a benchmark test to ensure their credibility and reliability. Third-party verification can build confidence in the results and provide a valuable means of validating the performance and capabilities of AI systems.
11.5 Model Benchmarking
Benchmarking machine learning models is important for determining the effectiveness and efficiency of various machine learning algorithms in solving specific tasks or problems. By analyzing the results obtained from benchmarking, developers and researchers can identify their models’ strengths and weaknesses, leading to more informed decisions on model selection and further optimization.
The evolution and progress of machine learning models are intrinsically linked to the availability and quality of data sets. In machine learning, data acts as the raw material that powers the algorithms, allowing them to learn, adapt, and ultimately perform tasks that were traditionally the domain of humans. Therefore, it is important to understand this history.
11.5.1 Historical Context
Machine learning datasets have a rich history and have evolved significantly over the years, growing in size, complexity, and diversity to meet the ever-increasing demands of the field. Let’s take a closer look at this evolution, starting from one of the earliest and most iconic datasets – MNIST.
MNIST (1998)
The MNIST dataset, created by Yann LeCun, Corinna Cortes, and Christopher J.C. Burges in 1998, can be considered a cornerstone in the history of machine learning datasets. It comprises 70,000 labeled 28x28 pixel grayscale images of handwritten digits (0-9). MNIST has been widely used for benchmarking algorithms in image processing and machine learning as a starting point for many researchers and practitioners. Figure 11.6 shows some examples of handwritten digits.
ImageNet (2009)
Fast forward to 2009, and we see the introduction of the ImageNet dataset, which marked a significant leap in the scale and complexity of datasets. ImageNet consists of over 14 million labeled images spanning more than 20,000 categories. Fei-Fei Li and her team developed it to advance object recognition and computer vision research. The dataset became synonymous with the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual competition crucial in developing deep learning models, including the famous AlexNet in 2012.
COCO (2014)
The Common Objects in Context (COCO) dataset (Lin et al. 2014), released in 2014, further expanded the landscape of machine learning datasets by introducing a richer set of annotations. COCO consists of images containing complex scenes with multiple objects, and each image is annotated with object bounding boxes, segmentation masks, and captions, as shown in Figure 11.7. This dataset has been instrumental in advancing research in object detection, segmentation, and image captioning.
Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. “Microsoft COCO: Common Objects in Context.” In Computer Vision – ECCV 2014, 740–55. Springer International Publishing. https://doi.org/10.1007/978-3-319-10602-1_48.
GPT-3 (2020)
While the above examples primarily focus on image datasets, there have also been significant developments in text datasets. One notable example is GPT-3 (Brown et al. 2020), developed by OpenAI. GPT-3 is a language model trained on diverse internet text. Although the dataset used to train GPT-3 is not publicly available, the model itself, consisting of 175 billion parameters, is a testament to the scale and complexity of modern machine learning datasets and models.
Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual, edited by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin. https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Present and Future
Today, we have a plethora of datasets spanning various domains, including healthcare, finance, social sciences, and more. The following characteristics help us taxonomize the space and growth of machine learning datasets that fuel model development.
Diversity of Data Sets: The variety of data sets available to researchers and engineers has expanded dramatically, covering many fields, including natural language processing, image recognition, and more. This diversity has fueled the development of specialized machine-learning models tailored to specific tasks, such as translation, speech recognition, and facial recognition.
Volume of Data: The sheer volume of data that has become available in the digital age has also played a crucial role in advancing machine learning models. Large data sets enable models to capture the complexity and nuances of real-world phenomena, leading to more accurate and reliable predictions.
Quality and Cleanliness of Data: The quality of data is another critical factor that influences the performance of machine learning models. Clean, well-labeled, and unbiased data sets are essential for training models that are robust and fair.
Open Access to Data: The availability of open-access data sets has also contributed significantly to machine learning’s progress. Open data allows researchers from around the world to collaborate, share insights, and build upon each other’s work, leading to faster innovation and the development of more advanced models.
Ethics and Privacy Concerns: As data sets grow in size and complexity, ethical considerations and privacy concerns become increasingly important. There is an ongoing debate about the balance between leveraging data for machine learning advancements and protecting individuals’ privacy rights.
The development of machine learning models relies heavily on the availability of diverse, large, high-quality, and open-access data sets. As we move forward, addressing the ethical considerations and privacy concerns associated with using large data sets is crucial to ensure that machine learning technologies benefit society. There is a growing awareness that data acts as the rocket fuel for machine learning, driving the development of machine learning models. Consequently, more focus is being placed on developing the data sets themselves. We will explore this in further detail in the data benchmarking section.
11.5.2 Model Metrics
Machine learning model evaluation has evolved from a narrow focus on accuracy to a more comprehensive approach considering a range of factors, from ethical considerations and real-world applicability to practical constraints like model size and efficiency. This shift reflects the field’s maturation as machine learning models are increasingly applied in diverse, complex real-world scenarios.
Accuracy
Accuracy is one of the most intuitive and commonly used metrics for evaluating machine learning models. In the early stages of machine learning, accuracy was often the primary, if not the only, metric considered when evaluating model performance. However, as the field has evolved, it’s become clear that relying solely on accuracy can be misleading, especially in applications where certain types of errors carry significant consequences.
Consider the example of a medical diagnosis model with an accuracy of 95%. While at first glance this may seem impressive, we must look deeper to assess the model’s performance fully. If the model fails to diagnose rare conditions that carry severe consequences, its high overall accuracy is far less meaningful. A well-known example of this limitation is Google’s diabetic retinopathy model. While it achieved high accuracy in lab settings, it encountered challenges when deployed in real-world clinics in Thailand, where variations in patient populations, image quality, and environmental factors reduced its effectiveness. This example illustrates that even models with high accuracy need to be tested for their ability to generalize across diverse, unpredictable conditions to ensure reliability and impact in real-world settings.
Similarly, if the model performs well on average but exhibits significant disparities in performance across different demographic groups, this, too, would be cause for concern. The evolution of machine learning has thus seen a shift towards a more holistic approach to model evaluation, taking into account not just accuracy, but also other crucial factors such as fairness, transparency, and real-world applicability. A prime example is the Gender Shades project at MIT Media Lab, led by Joy Buolamwini, which showed that commercial facial analysis systems performed markedly better on lighter-skinned and male faces than on darker-skinned and female faces.
While accuracy remains essential for evaluating machine learning models, a comprehensive approach is needed to fully assess performance. This includes additional metrics for fairness, transparency, and real-world applicability, along with rigorous testing across diverse datasets to identify and address biases. This holistic evaluation approach reflects the field’s growing awareness of real-world implications in deploying models.
Fairness
Fairness in machine learning involves ensuring that models perform consistently across diverse groups, especially in high-impact applications like loan approvals, hiring, and criminal justice. Relying solely on accuracy can be misleading if the model exhibits biased outcomes across demographic groups. For example, a loan approval model with high accuracy may still consistently deny loans to certain groups, raising questions about its fairness.
Bias in models can arise directly, when sensitive attributes like race or gender influence decisions, or indirectly, when neutral features correlate with these attributes and affect outcomes. Accuracy alone cannot reveal such behavior: a loan approval model with a 95% accuracy rate says nothing about how that accuracy is distributed across demographic groups. A well-known example is the COMPAS tool used in the US criminal justice system, which showed racial biases in predicting recidivism despite not explicitly using race as a variable.
Addressing fairness requires analyzing a model’s performance across groups, identifying biases, and applying corrective measures like re-balancing datasets or using fairness-aware algorithms. Researchers and practitioners continuously develop metrics and methodologies tailored to specific use cases to evaluate fairness in real-world scenarios. For example, disparate impact analysis, demographic parity, and equal opportunity are some of the metrics employed to assess fairness. Additionally, transparency and interpretability of models are fundamental to achieving fairness. Tools like AI Fairness 360 and Fairness Indicators help explain how a model makes decisions, allowing developers to detect and correct fairness issues in machine learning models.
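As a minimal, hedged illustration of two of these fairness metrics, the snippet below computes a demographic parity gap and an equal opportunity gap on synthetic predictions. The simplified definitions and random data are assumptions for illustration only; production audits use far more careful statistical treatment.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic predictions (1 = approved), true labels, and a binary group attribute.
y_pred = rng.integers(0, 2, size=1000)
y_true = rng.integers(0, 2, size=1000)
group  = rng.integers(0, 2, size=1000)   # 0 = group A, 1 = group B

def demographic_parity_diff(y_pred, group):
    # Difference in positive-prediction (e.g., approval) rates between groups.
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_diff(y_true, y_pred, group):
    # Difference in true-positive rates between groups.
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))

print("Demographic parity gap:", demographic_parity_diff(y_pred, group))
print("Equal opportunity gap:", equal_opportunity_diff(y_true, y_pred, group))
```

A gap near zero on either metric suggests similar treatment across groups, while a large gap flags a disparity that warrants investigation.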
While accuracy is a valuable metric, it doesn’t always provide the full picture; assessing fairness ensures models are effective across real-world scenarios. Ensuring fairness in machine learning models, particularly in applications that significantly impact people’s lives, requires rigorous evaluation of the model’s performance across diverse groups, careful identification and mitigation of biases, and implementation of transparency and interpretability measures.
Complexity
Parameters
In the initial stages of machine learning, model benchmarking often relied on parameter counts as a proxy for model complexity. The rationale was that more parameters typically lead to a more complex model, which should, in turn, deliver better performance. However, this approach overlooks the practical costs associated with processing large models. As parameter counts increase, so do the computational resources required, making such models impractical for deployment in real-world scenarios, particularly on devices with limited processing power.
Relying on parameter counts as a proxy for model complexity also fails to consider the model’s efficiency. A well-optimized model with fewer parameters can often achieve comparable or even superior performance to a larger model. For instance, MobileNets, developed by Google, is a family of models designed specifically for mobile and edge devices. They used depth-wise separable convolutions to reduce parameter counts and computational demands while still maintaining strong performance.
In light of these limitations, the field has moved towards a more holistic approach to model benchmarking that considers parameter counts and other crucial factors such as floating-point operations per second (FLOPs), memory consumption, and latency. This comprehensive approach balances performance with deployability, ensuring that models are not only accurate but also efficient and suitable for real-world applications.
FLOPs
FLOPs, or floating-point operations, have become a critical metric for representing a model’s computational load. Traditionally, parameter count was used as a proxy for model complexity, based on the assumption that more parameters would yield better performance. However, this approach overlooks the computational cost of processing these parameters, which can impact a model’s usability in real-world scenarios with limited resources.
FLOPs measure the number of floating-point operations a model performs to generate a prediction. A model with many FLOPs requires substantial computational resources to process the vast number of operations, which may render it impractical for certain applications. Conversely, a model with a lower FLOP count is more lightweight and can be easily deployed in scenarios where computational resources are limited. Figure 11.8, from Bianco et al. (2018), illustrates the trade-off between ImageNet accuracy, FLOPs, and parameter count, showing that some architectures achieve higher efficiency than others.
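To make the metric concrete, the back-of-the-envelope sketch below compares the multiply-accumulate (MAC) counts of a standard convolution and a depthwise separable convolution of the kind used in MobileNets. The layer dimensions are arbitrary, and conventions for converting MACs to FLOPs vary, so treat the numbers as illustrative.

```python
def conv_macs(h, w, c_in, c_out, k):
    # Standard convolution: one k x k x c_in dot product per output element.
    return h * w * c_out * (k * k * c_in)

def depthwise_separable_macs(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * (k * k)   # one k x k filter per input channel
    pointwise = h * w * c_out * c_in     # 1x1 convolution to mix channels
    return depthwise + pointwise

h = w = 56; c_in = 128; c_out = 128; k = 3
std = conv_macs(h, w, c_in, c_out, k)
dws = depthwise_separable_macs(h, w, c_in, c_out, k)
print(f"standard: {std/1e6:.1f} M MACs, "
      f"depthwise separable: {dws/1e6:.1f} M MACs ({std/dws:.1f}x reduction)")
```

For these dimensions the depthwise separable layer needs roughly 8x fewer MACs, which is the kind of saving that makes a model deployable on a constrained device.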
Let’s consider an example. BERT, or Bidirectional Encoder Representations from Transformers (Devlin et al. 2019), is a popular natural language processing model with over 340 million parameters, giving it high accuracy and impressive performance across various tasks. However, the sheer size of BERT, coupled with its high FLOP count, makes it a computationally intensive model that may not be suitable for real-time applications or deployment on edge devices with limited computational capabilities. In light of this, there has been a growing interest in developing smaller models that can achieve similar performance levels as their larger counterparts while being more efficient in computational load. DistilBERT, for instance, is a smaller version of BERT that retains 97% of its performance while being 40% smaller in terms of parameter count. The size reduction also translates to a lower FLOP count, making DistilBERT a more practical choice for resource-constrained scenarios.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/n19-1423.
While parameter count indicates model size, it does not fully capture the computational cost. FLOPs provide a more accurate measure of computational load, highlighting the practical trade-offs in model deployment. This shift from parameter count to FLOPs reflects the field’s growing awareness of deployment challenges in diverse settings.
Efficiency
Efficiency metrics, such as memory consumption and latency/throughput, have also gained prominence. These metrics are particularly crucial when deploying models on edge devices or in real-time applications, as they measure how quickly a model can process data and how much memory it requires. In this context, Pareto curves are often used to visualize the trade-off between different metrics, helping stakeholders decide which model best suits their needs.
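A small sketch of how such a Pareto frontier might be extracted from benchmark results follows; the candidate models and their latency and accuracy numbers are invented for illustration.

```python
# Hypothetical benchmark results: (model name, latency in ms, accuracy).
candidates = [
    ("model_a", 12.0, 0.91),
    ("model_b", 20.0, 0.93),
    ("model_c", 15.0, 0.90),   # dominated by model_a (slower and less accurate)
    ("model_d", 35.0, 0.95),
    ("model_e", 40.0, 0.94),   # dominated by model_d
]

def pareto_frontier(results):
    # Keep a model only if no other model is at least as fast and at least as accurate.
    frontier = []
    for name, lat, acc in results:
        dominated = any(l <= lat and a >= acc and (l, a) != (lat, acc)
                        for _, l, a in results)
        if not dominated:
            frontier.append((name, lat, acc))
    return sorted(frontier, key=lambda r: r[1])

for name, lat, acc in pareto_frontier(candidates):
    print(f"{name}: {lat} ms, {acc:.2%} accuracy")
```

Stakeholders then choose along the frontier: a latency-critical deployment would pick the fastest frontier model, while an accuracy-critical one would accept the slower end.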
11.5.3 Lessons Learned
Model benchmarking has offered us several valuable insights that can be leveraged to drive innovation in system benchmarks. The progression of machine learning models has been profoundly influenced by the advent of leaderboards and the open-source availability of models and datasets. These elements have served as significant catalysts, propelling innovation and accelerating the integration of cutting-edge models into production environments. However, as we will explore further, these are not the only contributors to the development of machine learning benchmarks.
Leaderboards play a vital role in providing an objective and transparent method for researchers and practitioners to evaluate the efficacy of different models, ranking them based on their performance in benchmarks. This system fosters a competitive environment, encouraging the development of models that are not only accurate but also efficient. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is a prime example of this, with its annual leaderboard significantly contributing to developing groundbreaking models such as AlexNet.
Open-source access to state-of-the-art models and datasets further democratizes machine learning, facilitating collaboration among researchers and practitioners worldwide. This open access accelerates the process of testing, validation, and deployment of new models in production environments, as evidenced by the widespread adoption of models like BERT and GPT-3 in various applications, from natural language processing to more complex, multi-modal tasks.
Community collaboration platforms like Kaggle have revolutionized the field by hosting competitions that unite data scientists from across the globe to solve intricate problems. Specific benchmarks serve as the goalposts for innovation and model development.
Moreover, the availability of diverse and high-quality datasets is paramount in training and testing machine learning models. Datasets such as ImageNet have played an instrumental role in the evolution of image recognition models, while extensive text datasets have facilitated advancements in natural language processing models.
Lastly, the contributions of academic and research institutions should not be overlooked. Their role in publishing research papers, sharing findings at conferences, and fostering collaboration between various institutions has significantly contributed to advancing machine learning models and benchmarks.
Emerging Trends
As machine learning models become more sophisticated, so do the benchmarks required to assess them accurately. There are several emerging benchmarks and datasets that are gaining popularity due to their ability to evaluate models in more complex and realistic scenarios:
Multimodal Datasets: These datasets contain multiple data types, such as text, images, and audio, to represent real-world situations better. An example is the VQA (Visual Question Answering) dataset (Antol et al. 2015), where models’ ability to answer text-based questions about images is tested.
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. “VQA: Visual Question Answering.” In 2015 IEEE International Conference on Computer Vision (ICCV), 2425–33. IEEE. https://doi.org/10.1109/iccv.2015.279.
Fairness and Bias Evaluation: There is an increasing focus on creating benchmarks assessing machine learning models’ fairness and bias. Examples include the AI Fairness 360 toolkit, which offers a comprehensive set of metrics and datasets for evaluating bias in models.
Out-of-Distribution Generalization: Testing how well models perform on data different from the original training distribution. This evaluates the model’s ability to generalize to new, unseen data. Example benchmarks are WILDS (Koh et al. 2021), RxRx, and ANC-Bench.
Koh, Pang Wei, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, et al. 2021. “WILDS: A Benchmark of in-the-Wild Distribution Shifts.” In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, edited by Marina Meila and Tong Zhang, 139:5637–64. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v139/koh21a.html.
Hendrycks, Dan, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2021. “Natural Adversarial Examples.” In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 15257–66. IEEE. https://doi.org/10.1109/cvpr46437.2021.01501.
Xie, Cihang, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, and Quoc V. Le. 2020. “Adversarial Examples Improve Image Recognition.” In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 816–25. IEEE. https://doi.org/10.1109/cvpr42600.2020.00090.
Adversarial Robustness: Evaluating model performance under adversarial attacks or perturbations to the input data. This tests the model’s robustness. Example benchmarks are ImageNet-A (Hendrycks et al. 2021), ImageNet-C (Xie et al. 2020), and CIFAR-10.1.
Real-World Performance: Testing models on real-world datasets that closely match end tasks rather than just canned benchmark datasets. Examples are medical imaging datasets for healthcare tasks or customer support chat logs for dialogue systems.
Energy and Compute Efficiency: Benchmarks that measure the computational resources required to achieve a particular accuracy. This evaluates the model’s efficiency. Examples are MLPerf and Greenbench, already discussed in the Systems benchmarking section.
Interpretability and Explainability: Benchmarks that assess how easy it is to understand and explain a model’s internal logic and predictions. Example metrics are faithfulness to input gradients and coherence of explanations.
11.5.4 Limitations and Challenges
While model benchmarks are an essential tool in assessing machine learning models, several limitations and challenges should be addressed to ensure that they accurately reflect a model’s performance in real-world scenarios.
Dataset does not Correspond to Real-World Scenarios: Often, the data used in model benchmarks is cleaned and preprocessed to such an extent that it may not accurately represent the data a model would encounter in real-world applications. This idealized version of the data can lead to overestimating a model’s performance. In the case of the ImageNet dataset, the images are well-labeled and categorized, but in a real-world scenario, a model may need to deal with images that are blurry, poorly lit, or taken from awkward angles. This discrepancy can significantly affect the model’s performance.
Sim2Real Gap: The Sim2Real gap refers to the difference in a model’s performance when transitioning from a simulated environment to a real-world environment. This gap is often observed in robotics, where a robot trained in a simulated environment struggles to perform tasks in the real world due to the complexity and unpredictability of real-world environments. A robot trained to pick up objects in simulation may struggle to perform the same task in the real world because the simulated environment does not accurately represent the complexities of real-world physics, lighting, and object variability.
Challenges in Creating Datasets: Creating a dataset for model benchmarking is a challenging task that requires careful consideration of various factors such as data quality, diversity, and representation. As discussed in the data engineering section, ensuring that the data is clean, unbiased, and representative of the real-world scenario is crucial for the accuracy and reliability of the benchmark. For example, when creating a dataset for a healthcare-related task, it is important to ensure that the data is representative of the entire population and not biased towards a particular demographic. This ensures that the model performs well across diverse patient populations.
Model benchmarks are essential in measuring the capability of a model architecture in solving a fixed task, but it is important to address the limitations and challenges associated with them. This includes ensuring that the dataset accurately represents real-world scenarios, addressing the Sim2Real gap, and overcoming the challenges of creating unbiased and representative datasets. By addressing these challenges and many others, we can ensure that model benchmarks provide a more accurate and reliable assessment of a model’s performance in real-world applications.
The Speech Commands dataset and its successor, MSWC, are common benchmarks for one of the quintessential TinyML applications, keyword spotting. Speech Commands establishes streaming error metrics, which are more relevant to the keyword spotting use case than the standard top-1 classification accuracy. Using case-relevant metrics is what elevates a dataset to a model benchmark.
11.6 Data Benchmarking
For the past several years, AI has focused on developing increasingly sophisticated machine learning models like large language models. The goal has been to create models capable of human-level or superhuman performance on a wide range of tasks by training them on massive datasets. This model-centric approach produced rapid progress, with models attaining state-of-the-art results on many established benchmarks. Figure 11.9 shows the performance of AI systems relative to human performance (marked by the horizontal line at 0) across five applications: handwriting recognition, speech recognition, image recognition, reading comprehension, and language understanding. Over the past decade, AI performance has surpassed that of humans.
However, growing concerns about issues like bias, safety, and robustness persist even in models that achieve high accuracy on standard benchmarks. Additionally, some popular datasets used for evaluating models are beginning to saturate, with models reaching near-perfect performance on existing test splits (Kiela et al. 2021). As a simple example, there are test images in the classic MNIST handwritten digit dataset that may look indecipherable to most human evaluators but were assigned a label when the dataset was created - models that happen to agree with those labels may appear to exhibit superhuman performance but instead may only be capturing idiosyncrasies of the labeling and acquisition process from the dataset’s creation in 1994. In the same spirit, computer vision researchers now ask, “Are we done with ImageNet?” (Beyer et al. 2020). This highlights limitations in the conventional model-centric approach of optimizing accuracy on fixed datasets through architectural innovations.
Kiela, Douwe, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, et al. 2021. “Dynabench: Rethinking Benchmarking in NLP.” In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4110–24. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.324.
Beyer, Lucas, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. 2020. “Are We Done with ImageNet?” ArXiv Preprint abs/2006.07159 (June). http://arxiv.org/abs/2006.07159v1.
An alternative paradigm is emerging called data-centric AI. Rather than treating data as static and focusing narrowly on model performance, this approach recognizes that models are only as good as their training data. So, the emphasis shifts to curating high-quality datasets that better reflect real-world complexity, developing more informative evaluation benchmarks, and carefully considering how data is sampled, preprocessed, and augmented. The goal is to optimize model behavior by improving the data rather than just optimizing metrics on flawed datasets. Data-centric AI critically examines and enhances the data itself to produce beneficial AI. This reflects an important evolution in mindset as the field addresses the shortcomings of narrow benchmarking.
This section will explore the key differences between model-centric and data-centric approaches to AI. This distinction has important implications for how we benchmark AI systems. Specifically, we will see how focusing on data quality and efficiency can directly improve machine learning performance as an alternative to solely optimizing model architectures. The data-centric approach recognizes that models are only as good as their training data. So, enhancing data curation, evaluation benchmarks, and data handling processes can produce AI systems that are safer, fairer, and more robust. Rethinking benchmarking to prioritize data alongside models represents an important evolution as the field strives to deliver trustworthy real-world impact.
11.6.1 Limitations of Model-Centric AI
In the model-centric AI era, a prominent characteristic was the development of complex model architectures. Researchers and practitioners dedicated substantial effort to devising sophisticated and intricate models in the quest for superior performance. This frequently involved the incorporation of additional layers and the fine-tuning of a multitude of hyperparameters to achieve incremental improvements in accuracy. Concurrently, there was a significant emphasis on leveraging advanced algorithms. These algorithms, often at the forefront of the latest research, were employed to improve the performance of AI models. The primary aim of these algorithms was to optimize the learning process of models, thereby extracting maximal information from the training data.
While the model-centric approach has been central to many advancements in AI, it has several shortcomings. First, the development of complex model architectures can often lead to overfitting: the model performs well on the training data but fails to generalize to new, unseen data. The additional layers and complexity can capture noise in the training data as if it were a real pattern, harming the model’s performance on new data.
Second, relying on advanced algorithms can sometimes obscure the real understanding of a model’s functioning. These algorithms often act as a black box, making it difficult to interpret how the model is making decisions. This lack of transparency can be a significant hurdle, especially in critical applications such as healthcare and finance, where understanding the model’s decision-making process is crucial.
Third, the emphasis on achieving state-of-the-art results on benchmark datasets can sometimes be misleading. These datasets often fail to fully represent the complexities and variability of real-world data. A model that performs well on a benchmark dataset may not necessarily generalize well to new, unseen data in a real-world application. This discrepancy can lead to false confidence in the model’s capabilities and hinder its practical applicability.
Lastly, the model-centric approach often relies on large labeled datasets for training. However, obtaining such datasets is time-consuming and expensive in many real-world scenarios. This reliance on large datasets also limits AI’s applicability in domains where data is scarce or expensive to label.
As a result of the above reasons, and many more, the AI community is shifting to a more data-centric approach. Rather than focusing just on model architecture, researchers are now prioritizing curating high-quality datasets, developing better evaluation benchmarks, and considering how data is sampled and preprocessed. The key idea is that models are only as good as their training data. So, focusing on getting the right data will allow us to develop AI systems that are more fair, safe, and aligned with human values. This data-centric shift represents an important change in mindset as AI progresses.
11.6.2 The Shift Toward Data-centric AI
Data-centric AI is a paradigm that emphasizes the importance of high-quality, well-labeled, and diverse datasets in developing AI models. In contrast to the model-centric approach, which focuses on refining and iterating on the model architecture and algorithm to improve performance, data-centric AI prioritizes the quality of the input data as the primary driver of improved model performance. High-quality data is clean, well-labeled and representative of the real-world scenarios the model will encounter. In contrast, low-quality data can lead to poor model performance, regardless of the complexity or sophistication of the model architecture.
Data-centric AI puts a strong emphasis on the cleaning and labeling of data. Cleaning involves the removal of outliers, handling missing values, and addressing other data inconsistencies. Labeling, on the other hand, involves assigning meaningful and accurate labels to the data. Both these processes are crucial in ensuring that the AI model is trained on accurate and relevant data. Another important aspect of the data-centric approach is data augmentation. This involves artificially increasing the size and diversity of the dataset by applying various transformations to the data, such as rotation, scaling, and flipping training images. Data augmentation helps in improving the model’s robustness and generalization capabilities.
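A minimal sketch of image augmentation using only NumPy is shown below; real pipelines typically apply richer transforms (cropping, color jitter, mixup), so treat this as illustrative rather than a recommended recipe.

```python
import numpy as np

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3))   # stand-in for a training image (H x W x C)

def augment(img, rng):
    # Randomly flip horizontally and rotate by a multiple of 90 degrees.
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)                     # horizontal flip
    img = np.rot90(img, k=rng.integers(0, 4), axes=(0, 1))
    return img

# Each original image yields several augmented variants for training.
augmented = [augment(image, rng) for _ in range(4)]
print([a.shape for a in augmented])
```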
There are several benefits to adopting a data-centric approach to AI development. First and foremost, it leads to improved model performance and generalization capabilities. By ensuring that the model is trained on high-quality, diverse data, the model can better generalize to new, unseen data (Mattson et al. 2020b).
Additionally, a data-centric approach can often lead to simpler models that are easier to interpret and maintain. This is because the emphasis is on the data rather than the model architecture, meaning simpler models can achieve high performance when trained on high-quality data.
The shift towards data-centric AI represents a significant paradigm shift. By prioritizing the quality of the input data, this approach aims to improve model performance and generalization capabilities, ultimately leading to more robust and reliable AI systems. Figure 11.10 illustrates this difference. As we continue to advance in our understanding and application of AI, the data-centric approach is likely to play an important role in shaping the future of this field.
11.6.3 Benchmarking Data
Data benchmarking focuses on evaluating common issues in datasets, such as identifying label errors, noisy features, representation imbalance (for example, out of the 1000 classes in Imagenet-1K, there are over 100 categories which are just types of dogs), class imbalance (where some classes have many more samples than others), whether models trained on a given dataset can generalize to out-of-distribution features, or what types of biases might exist in a given dataset (Mattson et al. 2020b). In its simplest form, data benchmarking seeks to improve accuracy on a test set by removing noisy or mislabeled training samples while keeping the model architecture fixed. Recent competitions in data benchmarking have invited participants to submit novel augmentation strategies and active learning techniques.
Mattson, Peter, Vijay Janapa Reddi, Christine Cheng, Cody Coleman, Greg Diamos, David Kanter, Paulius Micikevicius, et al. 2020b. “MLPerf: An Industry Standard Benchmark Suite for Machine Learning Performance.” IEEE Micro 40 (2): 8–16. https://doi.org/10.1109/mm.2020.2974843.
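One simple recipe in this spirit, sketched below on synthetic data, uses out-of-fold predicted probabilities to flag training samples whose given label looks suspect, drops them, and retrains. This is a rough approximation of confident-learning style cleaning rather than a prescribed benchmark procedure; the 0.2 threshold and the injected noise rate are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data with some labels flipped to simulate annotation noise.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(y), size=200, replace=False)
y_noisy = y.copy()
y_noisy[noisy_idx] = 1 - y_noisy[noisy_idx]

# Out-of-fold predicted probability of each sample's *given* label.
probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y_noisy,
                          cv=5, method="predict_proba")
given_label_prob = probs[np.arange(len(y_noisy)), y_noisy]

# Flag samples where the model strongly disagrees with the given label.
suspect = given_label_prob < 0.2
print(f"Flagged {suspect.sum()} suspect samples; "
      f"{np.isin(np.where(suspect)[0], noisy_idx).mean():.0%} of them were truly flipped.")
```

Retraining on the remaining samples, with the model architecture held fixed, is exactly the kind of data-only intervention these benchmarks reward.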
Data-centric techniques continue to gain attention in benchmarking, especially as foundation models are increasingly trained on self-supervised objectives. Compared to smaller datasets like Imagenet-1K, massive datasets commonly used in self-supervised learning, such as Common Crawl, OpenImages, and LAION-5B, contain higher amounts of noise, duplicates, bias, and potentially offensive data.
DataComp is a recently launched dataset competition that targets the evaluation of large corpora. DataComp focuses on language-image pairs used to train CLIP models. The introductory whitepaper finds that when the total compute budget for training is constant, the best-performing CLIP models on downstream tasks, such as ImageNet classification, are trained on just 30% of the available training sample pool. This suggests that proper filtering of large corpora is critical to improving the accuracy of foundation models. Similarly, Demystifying CLIP Data (Xu et al. 2023) asks whether the success of CLIP is attributable to the architecture or the dataset.
Xu, Hu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. 2023. “Demystifying CLIP Data.” ArXiv Preprint abs/2309.16671 (September). http://arxiv.org/abs/2309.16671v4.
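The core filtering idea behind such curation efforts is straightforward to sketch: score every candidate example and keep only the top fraction of the pool before training. In the toy version below the scores are random stand-ins rather than actual image-text similarities.

```python
import numpy as np

rng = np.random.default_rng(1)
pool_size = 1_000_000
# Stand-in quality scores; in practice these might be image-text similarity scores.
scores = rng.normal(size=pool_size)

keep_fraction = 0.3
threshold = np.quantile(scores, 1 - keep_fraction)
kept = np.where(scores >= threshold)[0]
print(f"Kept {len(kept):,} of {pool_size:,} samples "
      f"({len(kept) / pool_size:.0%}) above score threshold {threshold:.2f}")
```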
DataPerf is another recent effort focusing on benchmarking data in various modalities. DataPerf provides rounds of online competition to spur improvement in datasets. The inaugural offering launched with challenges in vision, speech, acquisition, debugging, and text prompting for image generation.
11.6.4 Data Efficiency
As machine learning models grow larger and more complex and compute resources become scarcer in the face of rising demand, it becomes challenging to meet computation requirements even with the largest machine learning fleets. To overcome these challenges and ensure machine learning system scalability, it is necessary to explore approaches that go beyond conventional resource scaling.
Improving data quality can be a useful method to impact machine learning system performance significantly. One of the primary benefits of enhancing data quality is the potential to reduce the size of the training dataset while still maintaining or even improving model performance. This data size reduction directly relates to the amount of training time required, thereby allowing models to converge more quickly and efficiently. Achieving this balance between data quality and dataset size is a challenging task that requires the development of sophisticated methods, algorithms, and techniques.
Several approaches can be taken to improve data quality. These methods include and are not limited to the following:
- Data Cleaning: This involves handling missing values, correcting errors, and removing outliers. Clean data ensures that the model is not learning from noise or inaccuracies.
- Data Interpretability and Explainability: Common techniques include LIME (Ribeiro, Singh, and Guestrin 2016), which provides insight into the decision boundaries of classifiers, and Shapley values (Lundberg and Lee 2017), which estimate the importance of individual samples in contributing to a model’s predictions.
- Feature Engineering: Transforming or creating new features can significantly improve model performance by providing more relevant information for learning.
- Data Augmentation: Augmenting data by creating new samples through various transformations can help improve model robustness and generalization.
- Active Learning: This is a semi-supervised learning approach where the model actively queries a human oracle to label the most informative samples (Coleman et al. 2022). This ensures that the model is trained on the most relevant data; a minimal sketch of one common variant, uncertainty sampling, appears after this list.
- Dimensionality Reduction: Techniques like PCA can reduce the number of features in a dataset, thereby reducing complexity and training time.
Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. 2016. “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–44.
Lundberg, Scott M., and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, 4765–74. https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html.
Coleman, Cody, Edward Chou, Julian Katz-Samuels, Sean Culatana, Peter Bailis, Alexander C. Berg, Robert Nowak, Roshan Sumbaly, Matei Zaharia, and I. Zeki Yalniz. 2022. “Similarity Search for Efficient Active Learning and Search of Rare Concepts.” Proceedings of the AAAI Conference on Artificial Intelligence 36 (6): 6402–10. https://doi.org/10.1609/aaai.v36i6.20591.
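Below is the promised sketch of uncertainty-sampling active learning on synthetic data. The seed-set size, query budget, and choice of logistic regression are arbitrary assumptions for illustration; real systems replace the `y[query]` lookup with an actual human labeling step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
rng = np.random.default_rng(0)

labeled = list(rng.choice(len(X), size=50, replace=False))   # small seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

for round_ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    # Query the samples the model is least certain about (probability nearest 0.5).
    probs = model.predict_proba(X[unlabeled])[:, 1]
    uncertainty = -np.abs(probs - 0.5)
    query = [unlabeled[i] for i in np.argsort(uncertainty)[-20:]]
    labeled += query                     # "oracle" supplies labels y[query]
    unlabeled = [i for i in unlabeled if i not in set(query)]
    print(f"round {round_}: {len(labeled)} labeled, "
          f"accuracy on full set = {model.score(X, y):.3f}")
```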
There are many other methods in the wild. But the goal is the same. Refining the dataset and ensuring it is of the highest quality can reduce the training time required for models to converge. However, achieving this requires developing and implementing sophisticated methods, algorithms, and techniques that can clean, preprocess, and augment data while retaining the most informative samples. This is an ongoing challenge that will require continued research and innovation in the field of machine learning.
11.7 The Trifecta
While system, model, and data benchmarks have traditionally been studied in isolation, there is a growing recognition that to understand and advance AI fully, we must take a more holistic view. By iterating between benchmarking systems, models, and datasets together, novel insights that are not apparent when these components are analyzed separately may emerge. System performance impacts model accuracy, model capabilities drive data needs, and data characteristics shape system requirements.
Benchmarking the triad of system, model, and data in an integrated fashion will likely lead to discoveries about the co-design of AI systems, the generalization properties of models, and the role of data curation and quality in enabling performance. Rather than narrow benchmarks of individual components, the future of AI requires benchmarks that evaluate the symbiotic relationship between computing platforms, algorithms, and training data. This systems-level perspective will be critical to overcoming current limitations and unlocking the next level of AI capabilities.
Figure 11.11 illustrates the many potential ways to interplay data benchmarking, model benchmarking, and system infrastructure benchmarking together. Exploring these intricate interactions is likely to uncover new optimization opportunities and enhancement capabilities. The data, model, and system benchmark triad offers a rich space for co-design and co-optimization.
While this integrated perspective represents an emerging trend, the field has much more to discover about the synergies and trade-offs between these components. As we iteratively benchmark combinations of data, models, and systems, new insights that remain hidden when these elements are studied in isolation will emerge. This multifaceted benchmarking approach charting the intersections of data, algorithms, and hardware promises to be a fruitful avenue for major progress in AI, even though it is still in its early stages.
11.8 Benchmarks for Emerging Technologies
Given their significant differences from existing techniques, emerging technologies can be particularly challenging to design benchmarks for. Standard benchmarks used for existing technologies may not highlight the key features of the new approach, while new benchmarks may be seen as contrived to favor the emerging technology over others. They may also be so different from existing benchmarks that they cannot be readily understood and lose their insightful value. Thus, benchmarks for emerging technologies must balance fairness, applicability, and ease of comparison with existing benchmarks.
An example of emerging technology where benchmarking has proven to be especially difficult is in Neuromorphic Computing. Using the brain as a source of inspiration for scalable, robust, and energy-efficient general intelligence, neuromorphic computing (Schuman et al. 2022) directly incorporates biologically realistic mechanisms in both computing algorithms and hardware, such as spiking neural networks (Maass 1997) and non-von Neumann architectures for executing them (Davies et al. 2018; Modha et al. 2023). From a full-stack perspective of models, training techniques, and hardware systems, neuromorphic computing differs from conventional hardware and AI. Thus, there is a key challenge in developing fair and useful benchmarks for guiding the technology.
Schuman, Catherine D., Shruti R. Kulkarni, Maryam Parsa, J. Parker Mitchell, Prasanna Date, and Bill Kay. 2022. “Opportunities for Neuromorphic Computing Algorithms and Applications.” Nature Computational Science 2 (1): 10–19. https://doi.org/10.1038/s43588-021-00184-y.
Maass, Wolfgang. 1997. “Networks of Spiking Neurons: The Third Generation of Neural Network Models.” Neural Networks 10 (9): 1659–71. https://doi.org/10.1016/s0893-6080(97)00011-7.
Davies, Mike, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, et al. 2018. “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning.” IEEE Micro 38 (1): 82–99. https://doi.org/10.1109/mm.2018.112130359.
Modha, Dharmendra S., Filipp Akopyan, Alexander Andreopoulos, Rathinakumar Appuswamy, John V. Arthur, Andrew S. Cassidy, Pallab Datta, et al. 2023. “Neural Inference at the Frontier of Energy, Space, and Time.” Science 382 (6668): 329–35. https://doi.org/10.1126/science.adh1174.
Yik, Jason, Korneel Van den Berghe, Douwe den Blanken, Younes Bouhadjar, Maxime Fabre, Paul Hueber, Denis Kleyko, et al. 2023. “NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems,” April. http://arxiv.org/abs/2304.04640v3.
An ongoing initiative to develop standard neuromorphic benchmarks is NeuroBench (Yik et al. 2023). To benchmark neuromorphic solutions fairly, NeuroBench follows high-level principles: inclusiveness, through tasks and metrics applicable to both neuromorphic and non-neuromorphic solutions; actionability, through implementations that use common tooling; and iterative updates, to maintain relevance as the field rapidly grows. NeuroBench and other benchmarks for emerging technologies provide critical guidance for future techniques, which may become necessary as the scaling limits of existing approaches draw nearer.
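As a loose illustration of the inclusiveness principle (and emphatically not the NeuroBench API), the sketch below scores a conventional linear model and a rate-coded spiking model on the same toy task with the same correctness metric, while reporting a paradigm-appropriate compute proxy for each: dense multiply-accumulates for the former, accumulated spike events for the latter. The task, encoding, and proxies are hypothetical choices made for illustration only.

```python
# A hypothetical sketch of metric inclusiveness across paradigms; not the
# NeuroBench API. Both models solve the same toy task (is the sum of a
# 10-dimensional input positive?) and share one accuracy metric, plus a
# compute proxy that is meaningful for each paradigm.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X.sum(axis=1) > 0).astype(int)

# Conventional model: a fixed linear readout; compute proxy = dense MACs.
ann_pred = (X @ np.ones(10) > 0).astype(int)
ann_ops = X.shape[0] * X.shape[1]

# "Spiking" model: rate-code each input as stochastic spikes over T steps and
# decide from the accumulated spike count; compute proxy = spike events, since
# event-driven hardware only does work when a spike occurs.
T = 50
rates = 1.0 / (1.0 + np.exp(-X))              # firing probability per time step
spikes = rng.random((T, *X.shape)) < rates    # (T, samples, features) spike events
snn_pred = (spikes.sum(axis=(0, 2)) > T * X.shape[1] * 0.5).astype(int)
snn_ops = int(spikes.sum())

print("ANN accuracy:", (ann_pred == y).mean(), "| MACs:", ann_ops)
print("SNN accuracy:", (snn_pred == y).mean(), "| spike events:", snn_ops)
```

The point is the shared harness: both paradigms are judged on the same task and correctness metric, while each reports a cost measure that reflects how its hardware actually spends work.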
11.9 Conclusion
What gets measured gets improved. This chapter has explored the multifaceted nature of benchmarking, spanning systems, models, and data. Benchmarking is central to advancing AI because it provides the essential measurements needed to track progress.
ML system benchmarks enable optimization across speed, efficiency, and scalability metrics. Model benchmarks drive innovation through standardized tasks and metrics that go beyond accuracy. Data benchmarks highlight issues of quality, balance, and representation.
Importantly, evaluating these components in isolation has limitations. In the future, more integrated benchmarking will likely be used to explore the interplay between system, model, and data benchmarks. This view promises new insights into co-designing data, algorithms, and infrastructure.
As AI grows more complex, comprehensive benchmarking becomes even more critical. Standards must continuously evolve to measure new capabilities and reveal limitations. Close collaboration among industry, academia, national labs, and other stakeholders is essential to developing benchmarks that are rigorous, transparent, and socially beneficial.
Benchmarking provides the compass to guide progress in AI. By persistently measuring and openly sharing results, we can navigate toward performant, robust, and trustworthy systems. If AI is to serve societal and human needs properly, it must be benchmarked with humanity’s best interests in mind. To this end, there are emerging areas, such as benchmarking the safety of AI systems, but that’s for another day and something we can discuss further in Generative AI!
Benchmarking is a continuously evolving topic. The article The Olympics of AI: Benchmarking Machine Learning Systems covers several emerging subfields in AI benchmarking, including robotics, extended reality, and neuromorphic computing, which we encourage the reader to explore.
11.10 Resources
Here is a curated list of resources to support students and instructors in their learning and teaching journeys. We are continuously working on expanding this collection and will add new exercises soon.
Slides
These slides are a valuable tool for instructors to deliver lectures and for students to review the material at their own pace. We encourage students and instructors to leverage these slides to improve their understanding and facilitate effective knowledge transfer.
Videos
- Coming soon.
Exercises
To reinforce the concepts covered in this chapter, we have curated a set of exercises that challenge students to apply their knowledge and deepen their understanding.
Exercise 11.1
Exercise 11.2