
Optimizing AI Workloads with NVIDIA GPUs, Time Slicing, and Karpenter (Part 2)

Source link : https://tech365.info/optimizing-ai-workloads-with-nvida-gpus-time-slicing-and-karpenter-half-2/

Introduction: Overcoming GPU Management Challenges

In Part 1 of this blog series, we explored the challenges of hosting large language models (LLMs) on CPU-based workloads within an EKS cluster. We discussed the inefficiencies associated with using CPUs for such tasks, primarily due to the large model sizes and slower inference speeds. The introduction of GPU resources offered a significant performance boost, but it also brought about the need for efficient management of these high-cost resources.

In this second part, we will delve deeper into how to optimize GPU utilization for these workloads. We will cover the following key areas:

NVIDIA Device Plugin Setup: This section will explain the importance of the NVIDIA device plugin for Kubernetes, detailing its role in resource discovery, allocation, and isolation.
Time Slicing: We will discuss how time slicing allows multiple processes to share GPU resources effectively, ensuring maximum utilization.
Node Autoscaling with Karpenter: This section will describe how Karpenter dynamically manages node scaling based on real-time demand, optimizing resource utilization and reducing costs.

Challenges Addressed 

Efficient GPU Management: Ensuring GPUs are fully utilized to justify their high cost.
Concurrency Handling: Allowing multiple workloads to share GPU resources effectively.
Dynamic Scaling: Automatically adjusting the number of nodes based on workload demands.

Section 1: Introduction to the NVIDIA Device Plugin

The NVIDIA device plugin for Kubernetes is a component that simplifies the management and utilization of NVIDIA GPUs in Kubernetes clusters. It allows Kubernetes to recognize and allocate GPU resources to pods, enabling GPU-accelerated workloads.

Why We Need the NVIDIA Device Plugin

Resource Discovery: Automatically detects NVIDIA GPU resources on each node.
Resource Allocation: Manages the distribution of GPU resources to pods based on their requests.
Isolation: Ensures secure and efficient utilization of GPU resources among different pods.

The NVIDIA device plugin simplifies GPU management in Kubernetes clusters. It automates the setup of the NVIDIA driver, container toolkit, and CUDA, ensuring that GPU resources are available for workloads without requiring manual configuration.
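
For reference, one common way to deploy the device plugin DaemonSet is through its public Helm chart. This is a minimal sketch under assumptions; the chart version shown is illustrative and not taken from this lab:

# Assumed example: install the NVIDIA device plugin from its public Helm repository
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin --namespace kube-system --version 0.15.0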

NVIDIA Driver: Required for nvidia-smi and basic GPU operations, interfacing with the GPU hardware. The screenshot below displays the output of the nvidia-smi command, which shows key information such as the driver version, CUDA version, and detailed GPU configuration, confirming that the GPU is properly configured and ready for use.

 

NVIDIA Container Toolkit: Required for using GPUs with containerd. Below we can see the version of the container toolkit and the status of the service running on the instance.

# Installed version
rpm -qa | grep -i nvidia-container-toolkit
nvidia-container-toolkit-base-1.15.0-1.x86_64
nvidia-container-toolkit-1.15.0-1.x86_64

CUDA: Required for GPU-accelerated applications and libraries. Below is the output of the nvcc command, showing the version of CUDA installed on the system:

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

Setting Up the NVIDIA Device Plugin

To ensure the DaemonSet runs only on GPU-based instances, we label the node with the key "nvidia.com/gpu" and the value "true". This is achieved using node affinity, node selectors, and taints and tolerations.
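
For reference, the label itself can be applied with a standard kubectl command; the node name below is reused from the taint examples later in this section:

kubectl label node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true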

Let us now delve into each of these components in detail.

Node Affinity: Node affinity allows pods to be scheduled on nodes based on node labels. With requiredDuringSchedulingIgnoredDuringExecution, the scheduler cannot schedule the Pod unless the rule is met: here the key is "nvidia.com/gpu", the operator is "In", and the value is "true".

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: feature.node.kubernetes.io/pci-10de.present
              operator: In
              values:
                - "true"
        - matchExpressions:
            - key: feature.node.kubernetes.io/cpu-model.vendor_id
              operator: In
              values:
                - NVIDIA
        - matchExpressions:
            - key: nvidia.com/gpu
              operator: In
              values:
                - "true"

Node selector: The node selector is the simplest recommended form of node selection constraint: nvidia.com/gpu: "true"
Taints and Tolerations: Tolerations are added to the DaemonSet to ensure it can be scheduled on the tainted GPU nodes (nvidia.com/gpu=true:NoSchedule).

kubectl taint node ip-10-20-23-199.us-west-1.compute.internal nvidia.com/gpu=true:NoSchedule

kubectl describe node ip-10-20-23-199.us-west-1.compute.internal | grep -i taint
Taints: nvidia.com/gpu=true:NoSchedule

tolerations:
  - effect: NoSchedule
    key: nvidia.com/gpu
    operator: Exists
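
Putting these pieces together, a minimal sketch of the scheduling-related fields in the device plugin DaemonSet pod template could look like the following (the values are taken from the snippets above; all other DaemonSet fields are omitted):

# Sketch of the scheduling-related fields in the device plugin DaemonSet
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu: "true"
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists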

After implementing the node labeling, affinity, node selector, and taints/tolerations, we can ensure the DaemonSet runs only on GPU-based instances. We can verify the deployment of the NVIDIA device plugin using the following command:

kubectl get ds -n kube-system
NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                     AGE
nvidia-device-plugin                      1         1         1       1            1           nvidia.com/gpu=true                               75d
nvidia-device-plugin-mps-control-daemon   0         0         0       0            0           nvidia.com/gpu=true,nvidia.com/mps.capable=true   75d

However, the challenge here is that GPUs are expensive, so we need to ensure they are utilized as fully as possible. Let us explore GPU concurrency in more detail.

GPU Concurrency:   

GPU concurrency refers to the ability to execute multiple tasks or threads simultaneously on a GPU.

Single Process: In a single-process setup, only one application or container uses the GPU at a time. This approach is straightforward but may lead to underutilization of GPU resources if the application does not fully load the GPU.
Multi-Process Service (MPS): NVIDIA's Multi-Process Service (MPS) allows multiple CUDA applications to share a single GPU concurrently, improving GPU utilization and reducing the overhead of context switching.
Time slicing: Time slicing involves dividing GPU time between different processes; in other words, multiple processes take turns on the GPU (round-robin context switching).
Multi-Instance GPU (MIG): MIG is a feature available on NVIDIA A100 GPUs that allows a single GPU to be partitioned into multiple smaller, isolated instances, each behaving like a separate GPU.
Virtualization: GPU virtualization allows a single physical GPU to be shared among multiple virtual machines (VMs) or containers, providing each with a virtual GPU.

Section 2: Implementing Time Slicing for GPUs

Time slicing, in the context of NVIDIA GPUs and Kubernetes, refers to sharing a physical GPU among multiple containers or pods in a Kubernetes cluster. The technology involves partitioning the GPU's processing time into smaller intervals and allocating those intervals to different containers or pods.

Time Slice Allocation: The GPU scheduler allocates time slices to each vGPU configured on the physical GPU.
Preemption and Context Switching: At the end of a vGPU's time slice, the GPU scheduler preempts its execution, saves its context, and switches to the next vGPU's context.
Context Switching: The GPU scheduler ensures smooth context switching between vGPUs, minimizing overhead and ensuring efficient use of GPU resources.
Task Completion: Processes within containers complete their GPU-accelerated tasks within their allotted time slices.
Resource Management and Monitoring
Resource Release: As tasks complete, GPU resources are released back to Kubernetes for reallocation to other pods or containers.

Why We Need Time Slicing

Cost Efficiency: Ensures high-cost GPUs are not underutilized.
Concurrency: Allows multiple applications to use the GPU concurrently.

Configuration Example for Time Slicing

Let us apply the time-slicing config using a ConfigMap as shown below. Here replicas: 3 specifies the number of replicas for GPU resources, which means that each GPU resource will be sliced into 3 shareable instances.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin
  namespace: kube-system
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 3
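
Assuming the device plugin DaemonSet has been configured to read this ConfigMap (for example via its Helm chart's config.name value, which is an assumption here and not shown in this lab), the ConfigMap can be applied and the DaemonSet restarted so it re-advertises the sliced GPUs:

# File name is illustrative
kubectl apply -f nvidia-device-plugin-configmap.yaml
# Restart the device plugin DaemonSet so it picks up the new sharing configuration
kubectl rollout restart daemonset/nvidia-device-plugin -n kube-system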

# We can verify the GPU resources available on the nodes using the following command:
kubectl get nodes -o json | jq -r '.items[] | select(.status.capacity."nvidia.com/gpu" != null) | {name: .metadata.name, capacity: .status.capacity}'



{
  "name": "ip-10-20-23-199.us-west-1.compute.internal",
  "capacity": {
    "cpu": "4",
    "ephemeral-storage": "104845292Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "16069060Ki",
    "nvidia.com/gpu": "3",
    "pods": "110"
  }
}

# The above output shows that the node ip-10-20-23-199.us-west-1.compute.internal has 3 virtual GPUs available.

# We can request GPU resources in pod specs by setting resource limits:

resources:
  limits:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
  requests:
    cpu: "1"
    memory: 2G
    nvidia.com/gpu: "1"
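
For context, here is a minimal, illustrative pod manifest showing where this resources block sits; the pod and container names are placeholders, and the image simply reuses the TensorFlow GPU image referenced in the userData example later in this post:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-pod          # placeholder name
spec:
  tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists
  containers:
    - name: inference              # placeholder name
      image: tensorflow/tensorflow:2.12.0-gpu
      resources:
        limits:
          cpu: "1"
          memory: 2G
          nvidia.com/gpu: "1"
        requests:
          cpu: "1"
          memory: 2G
          nvidia.com/gpu: "1"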

In our case, we are able to host 3 pods on a single node (ip-10-20-23-199.us-west-1.compute.internal), and because of time slicing these 3 pods can use the 3 virtual GPUs as shown below.

[Image 1]

The GPUs have been shared virtually among the pods, and we can see the PIDs assigned to each of the processes below.

[Image 2]

Now that we have optimized GPU usage at the pod level, let us focus on optimizing GPU resources at the node level. We can achieve this by using a cluster autoscaling solution called Karpenter. This is particularly important because the learning labs may not always have a constant load or user activity, and GPUs are extremely expensive. By leveraging Karpenter, we can dynamically scale GPU nodes up or down based on demand, ensuring cost-efficiency and optimal resource utilization.

Section 3: Node Autoscaling with Karpenter

Karpenter is an open-source node lifecycle management solution for Kubernetes. It automates the provisioning and deprovisioning of nodes based on the scheduling needs of pods, allowing efficient scaling and cost optimization.

Dynamic Node Provisioning: Automatically scales nodes based on demand.
Optimizes Resource Utilization: Matches node capacity with workload needs.
Reduces Operational Costs: Minimizes unnecessary resource expenses.
Improves Cluster Efficiency: Enhances overall performance and responsiveness.

Why Use Karpenter for Dynamic Scaling 

Dynamic Scaling: Automatically adjusts node count based on workload demands.
Cost Optimization: Ensures resources are only provisioned when needed, reducing expenses.
Efficient Resource Management: Tracks pods that cannot be scheduled due to lack of resources, reviews their requirements, provisions nodes to accommodate them, schedules the pods, and decommissions nodes when they become redundant.

Installing Karpenter:

# Install Karpenter using Helm:

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version "${KARPENTER_VERSION}" \
  --namespace "${KARPENTER_NAMESPACE}" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi

# Verify the Karpenter installation:

kubectl get pod -n kube-system | grep -i karpenter
karpenter-7df6c54cc-rsv8s             1/1     Running   2 (10d ago)   53d
karpenter-7df6c54cc-zrl9n             1/1     Running   0             53d

 Configuring Karpenter with NodePools and NodeClasses:  

Karpenter can be configured with NodePools and NodeClasses to automate the provisioning and scaling of nodes based on the specific needs of your workloads.

Karpenter NodePool: A NodePool is a custom resource that defines a set of nodes with shared specifications and constraints in a Kubernetes cluster. Karpenter uses NodePools to dynamically manage and scale node resources based on the requirements of running workloads.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: g4-nodepool
spec:
  template:
    metadata:
      labels:
        nvidia.com/gpu: "true"
    spec:
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: g4-nodeclass
  limits:
    cpu: 1000
  disruption:
    expireAfter: 120m
    consolidationPolicy: WhenUnderutilized

NodeClasses are configurations that define the characteristics and parameters for the nodes that Karpenter can provision in a Kubernetes cluster. A NodeClass specifies the underlying infrastructure details for nodes, such as instance types, launch template configurations, and cloud-provider-specific settings.

Note: The userData section contains scripts to bootstrap the EC2 instance, including pulling a TensorFlow GPU Docker image and configuring the instance to join the Kubernetes cluster.

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: g4-nodeclass
spec:
  amiFamily: AL2
  launchTemplate:
    name: "ack_nodegroup_template_new"
    version: "7"
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "nextgen-learninglab"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        encrypted: true
        deleteOnTermination: true
        throughput: 125
  tags:
    Name: Learninglab-Staging-Auto-GPU-Node
  userData: |
    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="//"

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    set -ex
    sudo ctr -n=k8s.io image pull docker.io/tensorflow/tensorflow:2.12.0-gpu

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    B64_CLUSTER_CA=" "
    API_SERVER_URL=""
    /etc/eks/bootstrap.sh nextgen-learninglab-eks --kubelet-extra-args '--node-labels=eks.amazonaws.com/capacityType=ON_DEMAND --pod-max-pids=32768 --max-pods=110' --b64-cluster-ca $B64_CLUSTER_CA --apiserver-endpoint $API_SERVER_URL --use-max-pods false

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".podPidsLimit=32768" $KUBELET_CONFIG)" > $KUBELET_CONFIG

    --//
    Content-Type: text/x-shellscript; charset="us-ascii"

    systemctl stop kubelet
    systemctl daemon-reload
    systemctl start kubelet
    --//--
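
Assuming the NodePool and EC2NodeClass manifests above are saved locally (the file names below are placeholders), they can be applied and inspected with standard kubectl commands:

kubectl apply -f g4-nodepool.yaml -f g4-nodeclass.yaml   # placeholder file names
kubectl get nodepools
kubectl get ec2nodeclasses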

In this scenario, each node (e.g., ip-10-20-23-199.us-west-1.compute.internal) can accommodate up to three pods. If the deployment is scaled to add another pod, the resources will be insufficient, causing the new pod to remain in a pending state.

[Image 3]

Karpenter monitors these unschedulable pods and assesses their resource requirements in order to act accordingly. A NodeClaim is created that claims a node from the NodePool, and Karpenter then provisions a node based on the requirement.
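
As a quick way to observe this behaviour (a sketch, assuming the Karpenter v1beta1 resources installed above), the pending pods and the resulting NodeClaims can be listed with:

# Pods waiting for capacity
kubectl get pods --field-selector=status.phase=Pending
# NodeClaims created by Karpenter while provisioning the new GPU node
kubectl get nodeclaims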

[Image 4]

Conclusion: Efficient GPU Resource Management in Kubernetes

With the growing demand for GPU-accelerated workloads in Kubernetes, managing GPU resources effectively is essential. The combination of the NVIDIA device plugin, time slicing, and Karpenter provides a powerful approach to manage, optimize, and scale GPU resources in a Kubernetes cluster, delivering high performance with efficient resource utilization. This solution has been implemented to host pilot GPU-enabled Learning Labs on developer.cisco.com/learning, providing GPU-powered learning experiences.


Author : tech365

Publish date : 2025-01-22 21:52:08

Copyright for syndicated content belongs to the linked Source.
