Why AI conformance matters for your GKE clusters
The Kubernetes AI conformance program defines a standard for Kubernetes clusters to ensure they can reliably and efficiently run AI and ML workloads. Setting up a Kubernetes cluster for AI/ML can be complex. It often involves navigating a landscape of specific driver installations, API versions, and potential workarounds for unexpected bugs.
A conformant platform like GKE is designed to handle these underlying complexities for you, providing a path from setup to deployment. By building on a conformant GKE version, you can be confident that your environment is optimized for criteria like the following:
- Scalability: efficiently scale your AI/ML workloads up and down based on demand.
- Performance: get the most out of your hardware, including GPUs and TPUs.
- Portability: run your AI/ML applications on any conformant Kubernetes cluster with minimal changes.
- Interoperability: integrate with other tools and frameworks in the AI/ML ecosystem.
How to create an AI-conformant GKE cluster
To create an AI-conformant GKE cluster, you need to do the following:
- Check the ai-conformance GitHub repository to view the list of conformant versions.
- Create a GKE cluster in Standard mode running on a conformant version, such as 1.34.0-gke.1662000 or later.
- Enable Gateway API on your cluster.
Your cluster now meets the mandatory requirements for Kubernetes AI conformance.
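The steps above can be sketched with the gcloud CLI. The cluster name, region, and version below are placeholders; pick a conformant version from the ai-conformance repository. The `--gateway-api=standard` flag enables the Gateway API at creation time.

```shell
# Create a Standard mode cluster on a conformant version with Gateway API enabled.
# CLUSTER_NAME and REGION are placeholders.
gcloud container clusters create CLUSTER_NAME \
    --region=REGION \
    --cluster-version=1.34.0-gke.1662000 \
    --gateway-api=standard

# Verify that the Gateway API resources are available in the cluster.
kubectl get gatewayclasses
```

If you have an existing Standard cluster on a conformant version, you can enable the Gateway API with `gcloud container clusters update CLUSTER_NAME --gateway-api=standard` instead of recreating the cluster.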
What makes GKE a Kubernetes AI conformant platform
GKE manages the underlying requirements for AI conformance so you don't have to. The following table highlights some of these key features for AI/ML workloads. Some of these features are enabled by default, but others, like Kueue for gang scheduling, are optional additions that you can install to enhance your AI/ML workloads.
The Kubernetes AI conformance program is designed to evolve with the AI/ML ecosystem.
The requirements are updated with each Kubernetes minor version release based on
the state of the ecosystem. For the full set of requirements for a specific
minor version, in the
ai-conformance GitHub repository,
see the docs/AIConformance-MINOR_VERSION.yaml file,
where MINOR_VERSION is your specific version, such as
v1.34.
| Requirement | How GKE meets it |
|---|---|
| Dynamic resource allocation (DRA) | Enables more flexible and fine-grained resource requests beyond counts. For more information, see About dynamic resource allocation. |
| Kubernetes Gateway API | Provides advanced traffic management for inference services, which enables capabilities like weighted traffic splitting and header-based routing. For more information, see About GKE Gateway API. |
| Gang scheduling | Ensures all-or-nothing scheduling for distributed AI workloads. GKE allows for the installation and successful operation of at least one gang scheduling solution (for example, Kueue or Volcano). For an example, see Deploy a batch system using Kueue. |
| Cluster autoscaler for accelerators | Scales node pools that contain specific accelerator types up and down, based on pending Pods that request those accelerators. For more information, see the GKE cluster autoscaler documentation. |
| Horizontal Pod Autoscaler (HPA) for accelerators | Functions correctly for Pods that use accelerators, including the ability to scale these Pods based on custom metrics relevant to AI/ML workloads. For more information, see the GKE documentation on horizontal Pod autoscaling. |
| Accelerator performance metrics | Exposes fine-grained performance metrics through a metrics endpoint that uses a standardized, machine-readable format. For more information, see the GKE documentation on collecting accelerator metrics. |
| Standardized monitoring | Provides a monitoring system capable of discovering and collecting metrics from workloads that expose them in a standard format (for example, Prometheus exposition format). For more information, see Observability for GKE. |
| AI operator support | Supports installing at least one complex AI operator with a custom resource definition (CRD) and running it reliably on the platform. For more information, see Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine. |
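To illustrate the DRA requirement from the table, the following sketch requests an accelerator through a ResourceClaimTemplate instead of a count-based resource request. The device class name `gpu.example.com` is a placeholder; use the device class published by your accelerator's DRA driver, and check which `resource.k8s.io` API version your cluster serves (the fields shown follow the v1beta1 shape).

```shell
# Hypothetical DRA sketch: define a claim template for a single device
# from a placeholder device class, then reference it from a Pod.
kubectl apply -f - <<EOF
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: gpu.example.com  # placeholder device class
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-example
spec:
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9  # placeholder workload image
    resources:
      claims:
      - name: gpu
EOF
```

Compared to `nvidia.com/gpu: 1`-style count requests, a claim can carry richer device selection criteria, which is what the conformance requirement means by requests "beyond counts."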
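Similarly, for the HPA requirement, a hedged sketch of scaling an inference Deployment on a custom metric might look like the following. The Deployment name `inference-server` and the metric name `queue_size` are placeholders: the metric must actually be exported by your workload and made available through a custom metrics adapter.

```shell
# Hypothetical HPA sketch: scale a placeholder inference Deployment
# between 1 and 8 replicas based on a placeholder per-Pod custom metric.
kubectl apply -f - <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server   # placeholder Deployment
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: queue_size     # placeholder custom metric
      target:
        type: AverageValue
        averageValue: "10"
EOF
```

Scaling on a workload-level signal such as request queue depth is often a better fit for accelerator-backed inference than CPU utilization, because accelerator-bound Pods can saturate while CPU stays low.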
What's next
- Explore the Kubernetes AI conformance repository for more details on the program.
- Read the Introduction to AI/ML workloads on GKE.
- Learn more about AI model inference on GKE and try inference examples.
- Try an example of training a model on GPUs with GKE Standard mode.