Why AI conformance matters for your GKE clusters
The Kubernetes AI conformance program defines a standard for Kubernetes clusters to ensure they can reliably and efficiently run AI and ML workloads. Setting up a Kubernetes cluster for AI/ML can be complex. It often involves navigating a landscape of specific driver installations, API versions, and potential workarounds for unexpected bugs.
A conformant platform like GKE is designed to handle these underlying complexities for you, providing a path from setup to deployment. By building on a conformant GKE version, you can be confident that your environment is optimized for criteria like the following:
- Scalability: efficiently scale your AI/ML workloads up and down based on demand.
- Performance: get the most out of your hardware, including GPUs and TPUs.
- Portability: run your AI/ML applications on any conformant Kubernetes cluster with minimal changes.
- Interoperability: integrate with other tools and frameworks in the AI/ML ecosystem.
How to create an AI-conformant GKE cluster
To create an AI-conformant GKE cluster, you need to do the following:
- Check the ai-conformance GitHub repository to view the list of conformant versions.
- Create a GKE cluster in Standard mode running on a conformant version, such as 1.34.0-gke.1662000 or later.
- Enable Gateway API on your cluster.
Your cluster now meets the mandatory requirements for Kubernetes AI conformance.
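The steps above can be sketched with the gcloud CLI. The cluster name, region, and version below are placeholders; pick a conformant version from the ai-conformance repository. The `--gateway-api=standard` flag enables the Gateway API at creation time.

```shell
# Create a Standard mode cluster on a conformant version with Gateway API enabled.
# CLUSTER_NAME and REGION are placeholders.
gcloud container clusters create CLUSTER_NAME \
    --region=REGION \
    --cluster-version=1.34.0-gke.1662000 \
    --gateway-api=standard

# Verify that the Gateway API resources are available in the cluster.
kubectl get gatewayclasses
```

If you have an existing Standard cluster on a conformant version, you can enable the Gateway API with `gcloud container clusters update CLUSTER_NAME --gateway-api=standard` instead of recreating the cluster.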
What makes GKE a Kubernetes AI conformant platform
GKE manages the underlying requirements for AI conformance so you don't have to. The following table highlights some of these key features for AI/ML workloads. Some of these features are enabled by default, but others, like Kueue for gang scheduling, are optional additions that you can install to enhance your AI/ML workloads.
The Kubernetes AI conformance program is designed to evolve with the AI/ML ecosystem.
The requirements are updated with each Kubernetes minor version release based on
the state of the ecosystem. For the full set of requirements for a specific
minor version, in the
ai-conformance GitHub repository,
see the docs/AIConformance-MINOR_VERSION.yaml file,
where MINOR_VERSION is your specific version, such as
v1.34.
| Requirement | How GKE meets it |
|---|---|
| Dynamic resource allocation (DRA) | Enables more flexible and fine-grained resource requests beyond counts. For more information, see About dynamic resource allocation. |
| Kubernetes Gateway API | Provides advanced traffic management for inference services, which enables capabilities like weighted traffic splitting and header-based routing. For more information, see About GKE Gateway API. |
| Gang scheduling | Ensures all-or-nothing scheduling for distributed AI workloads. GKE allows for the installation and successful operation of at least one gang scheduling solution (for example, Kueue or Volcano). For an example, see Deploy a batch system using Kueue. |
| Cluster autoscaler for accelerators | Scales node pools that contain specific accelerator types up and down, based on pending Pods that request those accelerators. For more information, see the GKE cluster autoscaler documentation. |
| Horizontal Pod Autoscaler (HPA) for accelerators | Functions correctly for Pods that use accelerators, including the ability to scale these Pods based on custom metrics relevant to AI/ML workloads. For more information, see the GKE documentation on horizontal Pod autoscaling. |
| Accelerator performance metrics | Exposes fine-grained performance metrics through a metrics endpoint that uses a standardized, machine-readable format. For more information, see the GKE documentation on collecting accelerator metrics. |
| Standardized monitoring | Provides a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (for example, Prometheus exposition format). For more information, see Observability for GKE. |
| AI operator support | Supports installing at least one complex AI operator with a custom resource definition (CRD) and running it reliably on the platform. For more information, see Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine. |
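To illustrate the DRA requirement from the table, the following sketch requests an accelerator through a ResourceClaimTemplate instead of a count-based resource request. The device class name `gpu.example.com` is a placeholder; use the device class published by your accelerator's DRA driver, and check which `resource.k8s.io` API version your cluster serves (the fields shown follow the v1beta1 shape).

```shell
# Hypothetical DRA sketch: define a claim template for a single device
# from a placeholder device class, then reference it from a Pod.
kubectl apply -f - <<EOF
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com  # placeholder device class
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9  # placeholder workload image
    resources:
      claims:
      - name: gpu
EOF
```

Compared to `nvidia.com/gpu: 1`-style count requests, a claim can carry richer device selection criteria, which is what the conformance requirement means by requests "beyond counts."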
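Similarly, for the HPA requirement, a hedged sketch of scaling an inference Deployment on a custom metric might look like the following. The Deployment name `inference-server` and the metric name `queue_size` are placeholders: the metric must actually be exported by your workload and made available through a custom metrics adapter.

```shell
# Hypothetical HPA sketch: scale a placeholder inference Deployment
# between 1 and 8 replicas based on a placeholder per-Pod custom metric.
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server   # placeholder Deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: queue_size     # placeholder custom metric
      target:
        type: AverageValue
        averageValue: "10"
EOF
```

Scaling on a workload-level signal such as request queue depth is often a better fit for accelerator-backed inference than CPU utilization, because accelerator-bound Pods can saturate while CPU stays low.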
What's next
- Explore the Kubernetes AI conformance repository for more details on the program.
- Read the Introduction to AI/ML workloads on GKE.
- Learn more about AI model inference on GKE and try inference examples.
- Try an example of training a model on GPUs with GKE Standard mode.