Adding GPU Nodes to a K8s Cluster

This post describes how to install and configure GPU nodes in a Kubernetes cluster.

Environment

  • Rocky 8.9
  • Kubernetes 1.25
  • Nvidia A10G
    • AWS g5.4xlarge, 16 vCPU / 64 GB RAM, 24 GB GPU memory

Using Rocky 8.9 as the example, this post covers installing the CUDA driver, configuring containerd, and setting up GPU timeSlicing.

Installing the CUDA Driver

For how to install the CUDA driver, see cuda-installation-guide-linux. Supported distributions currently include RHEL 7 / CentOS 7, RHEL 8 / Rocky 8, RHEL 9 / Rocky 9, Ubuntu, and Debian.

  • The AWS g5.4xlarge instance comes with a 600 GB SSD; mount it at /var/lib/containerd for K8s use
    $ sudo lsblk
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    nvme0n1 259:0 0 200G 0 disk
    ├─nvme0n1p1 259:2 0 99M 0 part /boot/efi
    ├─nvme0n1p2 259:3 0 1000M 0 part /boot
    ├─nvme0n1p3 259:4 0 4M 0 part
    ├─nvme0n1p4 259:5 0 1M 0 part
    └─nvme0n1p5 259:6 0 198.9G 0 part /
    nvme1n1 259:1 0 558.8G 0 disk

    # Create the filesystem
    $ sudo mkfs -t xfs /dev/nvme1n1

    # Mount the data disk
    $ sudo blkid
    /dev/nvme1n1: UUID="12345" BLOCK_SIZE="512" TYPE="xfs"

    $ sudo mkdir -p /var/lib/containerd
    $ cd /etc/
    $ sudo cp fstab fstab.bak
    # Add the following entry to /etc/fstab
    UUID=12345 /var/lib/containerd xfs defaults,nofail 0 2

    $ sudo systemctl daemon-reload
    $ sudo mount -a
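
    To confirm the data disk is actually mounted at the containerd root, a quick sanity check:

    # Should show /dev/nvme1n1 mounted on /var/lib/containerd
    $ df -h /var/lib/containerd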

Preparation

  • Install base packages
    sudo dnf -y install epel-release

    # lspci, used to inspect the GPU card
    sudo dnf -y install pciutils

    # Update base packages
    sudo dnf -y update

    # Other packages useful for debugging
    sudo dnf -y install htop python3
  • Check GPU and system information
    $ sudo lspci | grep -i nvidia
    00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)

    $ uname -m && cat /etc/*release
    x86_64
    Rocky Linux release 8.9 (Green Obsidian)

    # Check that gcc is installed
    $ gcc --version
  • Install kernel headers
    sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
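
    An optional sanity check to confirm the headers match the running kernel:

    # Both packages should be reported as installed for the current kernel release
    $ rpm -q kernel-devel-$(uname -r) kernel-headers-$(uname -r)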

Installing the Driver Online

  • Configure the repo
    • sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo
    • For Rocky 8, $distro/$arch can take the following values; in this post it is rhel8/x86_64
      • rhel8/cross-linux-sbsa
      • rhel8/ppc64le
      • rhel8/sbsa
      • rhel8/x86_64
  • Install the CUDA SDK
  • Reboot the machine
    The complete commands are as follows:
    sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
    sudo dnf clean expire-cache

    sudo dnf module install nvidia-driver:latest-dkms
    sudo dnf install cuda-toolkit
    sudo dnf install nvidia-gds

    sudo reboot -f
  • After the reboot, check the driver version; as shown below, the installed driver version is 545.23.08
    $ cat /proc/driver/nvidia/version
    NVRM version: NVIDIA UNIX x86_64 Kernel Module 545.23.08 Mon Nov 6 23:49:37 UTC 2023
    GCC version: gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)

    $ nvidia-smi

Configuring containerd

container-toolkit

  • Add the GPU node to the K8s cluster
  • Run the following commands to modify the containerd configuration
    curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
    sudo yum install -y nvidia-container-toolkit

    sudo nvidia-ctk runtime configure --runtime=containerd
  • Edit /etc/containerd/config.toml and set default_runtime_name to nvidia
    version = 2
    [plugins]
      [plugins."io.containerd.grpc.v1.cri"]
        [plugins."io.containerd.grpc.v1.cri".containerd]
          default_runtime_name = "nvidia"
  • Restart containerd: sudo systemctl restart containerd
  • Label the node with nvidia.com/gpu.product=A10G, as sketched below
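
    A minimal sketch of the labeling step (gpu-node-1 is a placeholder; substitute the actual node name):

    # This label is used by the device plugin's affinity rule later in this post
    kubectl label node gpu-node-1 nvidia.com/gpu.product=A10G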

Adding GPU Support to the K8s Cluster

NVIDIA/k8s-device-plugin

Installing the device plugin with Helm

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
$ helm repo update
$ helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--namespace nvidia-device-plugin \
--create-namespace \
--version 0.14.3 \
--values values.yaml

Configure affinity in the chart's values.yaml as follows, so that the plugin only runs on nodes that carry the nvidia.com/gpu.product label (i.e., the GPU nodes).

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: Exists

The config field can be modified as needed, as shown below; this virtualizes the GPU into 20 shares so that different Pods can be limited to different amounts of the resource (see Shared Access to GPUs with CUDA Time-Slicing).

config:
  # ConfigMap name if pulling from an external ConfigMap
  name: ""
  # Set of named configs to build an integrated ConfigMap from
  map:
    default: |-
      version: v1
      sharing:
        timeSlicing:
          failRequestsGreaterThanOne: true
          resources:
          - name: nvidia.com/gpu
            replicas: 20
  # Default config name within the ConfigMap
  default: "default"
  # List of fallback strategies to attempt if no config is selected and no default is provided
  fallbackStrategies: ["named", "single"]
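
With replicas: 20 as above, a node with a single physical A10G should advertise 20 nvidia.com/gpu once the plugin is running with this config. A quick way to confirm (gpu-node-1 is a placeholder node name):

# Capacity and Allocatable should both report nvidia.com/gpu: 20
kubectl describe node gpu-node-1 | grep nvidia.com/gpu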
  • Once the device plugin is installed successfully, an nvidia-device-plugin Pod runs on the GPU node; check that the Pod is healthy
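
    A minimal check, assuming the nvidia-device-plugin namespace used in the Helm command above:

    # The device-plugin Pod should be Running on the GPU node
    kubectl get pods -n nvidia-device-plugin -o wide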

Running a GPU Test Workload

  • Run the following workload to check that GPU scheduling works correctly in K8s
    $ cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-pod
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-container
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
        resources:
          limits:
            nvidia.com/gpu: 1 # requesting 1 GPU
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
    EOF
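
    Once the Pod has finished, its result can be read from the container logs (gpu-pod is the Pod name from the manifest above):

    kubectl logs gpu-pod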
  • The output of a successful run looks like this
    2024-01-05T17:23:25.675101250+08:00 [Vector addition of 50000 elements]
    2024-01-05T17:23:25.675134218+08:00 Copy input data from the host memory to the CUDA device
    2024-01-05T17:23:25.675136777+08:00 CUDA kernel launch with 196 blocks of 256 threads
    2024-01-05T17:23:25.675138621+08:00 Copy output data from the CUDA device to the host memory
    2024-01-05T17:23:25.675140416+08:00 Test PASSED
    2024-01-05T17:23:25.675142149+08:00 Done