Adding GPU Nodes to a K8s Cluster
How to install and configure a GPU node in a Kubernetes cluster.
Environment
- Rocky 8.9
- Kubernetes 1.25
- NVIDIA A10G
- AWS g5.4xlarge: 16 vCPU / 64 GB RAM, 24 GB GPU memory
Using Rocky 8.9 as the example, this post covers installing the CUDA driver, configuring containerd, and enabling GPU time-slicing.
Installing the CUDA Driver
For the CUDA driver installation procedure, see cuda-installation-guide-linux; at the time of writing it covers RHEL 7 / CentOS 7, RHEL 8 / Rocky 8, RHEL 9 / Rocky 9, Ubuntu, Debian, and other distributions.
- An AWS g5.4xlarge comes with 600 GB of local SSD storage; mount it at the /var/lib/containerd directory for K8s to use.
$ sudo lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 200G 0 disk
├─nvme0n1p1 259:2 0 99M 0 part /boot/efi
├─nvme0n1p2 259:3 0 1000M 0 part /boot
├─nvme0n1p3 259:4 0 4M 0 part
├─nvme0n1p4 259:5 0 1M 0 part
└─nvme0n1p5 259:6 0 198.9G 0 part /
nvme1n1 259:1 0 558.8G 0 disk
# Create the filesystem
$ sudo mkfs -t xfs /dev/nvme1n1
# Look up the UUID and mount the disk
$ sudo blkid
/dev/nvme1n1: UUID="12345" BLOCK_SIZE="512" TYPE="xfs"
$ sudo mkdir -p /var/lib/containerd
$ cd /etc/
$ sudo cp fstab fstab.bak
# Add the following entry to /etc/fstab
UUID=12345 /var/lib/containerd xfs defaults,nofail 0 2
$ sudo systemctl daemon-reload
$ sudo mount -a
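A quick sanity check that the new filesystem is actually serving /var/lib/containerd (sizes will vary):
$ df -h /var/lib/containerd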
Preparation
- Install base packages
sudo dnf -y install epel-release
# lspci, used to inspect the GPU
sudo dnf -y install pciutils
# Update base packages
sudo dnf -y update
# Other packages, useful for debugging
sudo dnf -y install htop python3
- Check the GPU and system information
$ sudo lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
$ uname -m && cat /etc/*release
x86_64
Rocky Linux release 8.9 (Green Obsidian)
# Check whether gcc is installed
$ gcc --version
- Install the kernel headers
sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
Installing the Driver Online
- Configure the repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo
- For Rocky 8, $distro/$arch can take the following values; this post uses rhel8/x86_64:
rhel8/cross-linux-sbsa
rhel8/ppc64le
rhel8/sbsa
rhel8/x86_64
- Install the CUDA SDK
- Reboot the machine
The full command sequence:
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
sudo dnf clean expire-cache
sudo dnf module install nvidia-driver:latest-dkms
sudo dnf install cuda-toolkit
sudo dnf install nvidia-gds
sudo reboot -f
- After the reboot, check the driver version; as shown below, the installed version is 545.23.08
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 545.23.08 Mon Nov 6 23:49:37 UTC 2023
GCC version: gcc version 8.5.0 20210514 (Red Hat 8.5.0-20) (GCC)
$ nvidia-smi
Configuring containerd
- Add the GPU node to the K8s cluster
- Run the following commands to install the NVIDIA Container Toolkit and update the containerd configuration
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
sudo yum install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
- Edit /etc/containerd/config.toml, setting default_runtime_name to nvidia
version = 2
[plugins]
[plugins."io.containerd.grpc.v1.cri"]
[plugins."io.containerd.grpc.v1.cri".containerd]
default_runtime_name = "nvidia"
- Restart containerd
sudo systemctl restart containerd
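To confirm containerd picked up the new default after the restart, you can grep the CRI config dump; the field name below follows containerd's CRI plugin JSON and may differ across versions:
$ sudo crictl info | grep defaultRuntimeName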
- Label the node with nvidia.com/gpu.product=A10G, e.g. with the command below
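A minimal sketch of the labeling step; gpu-node-1 is a hypothetical node name, substitute the one shown by kubectl get nodes:
$ kubectl label node gpu-node-1 nvidia.com/gpu.product=A10G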
Adding GPU Support to the K8s Cluster
Install the device plugin with Helm, e.g. as sketched below.
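A minimal sketch of the install, assuming NVIDIA's upstream device-plugin chart repo; the release name nvdp and the nvidia-device-plugin namespace are placeholders:
# Register the chart repo and install the plugin as a DaemonSet
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace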
In the chart's values.yaml, configure affinity so the plugin only runs on nodes carrying the nvidia.com/gpu.product label (i.e., the GPU nodes); see the sketch below.
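A values.yaml sketch, assuming the chart passes affinity through to the DaemonSet spec unchanged:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      # Schedule only onto nodes that have the GPU product label
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: Exists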
The config field can be adjusted as needed; the example below virtualizes each GPU into 20 shares, so different Pods can be limited to different amounts. See Shared Access to GPUs with CUDA Time-Slicing.
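A sketch of the time-slicing config in the device plugin's documented format; the 20 replicas match the text, while how the chart mounts this config (e.g. via a config entry in values.yaml) is an assumption to verify against the chart docs:
version: v1
sharing:
  timeSlicing:
    resources:
    # Advertise each physical GPU as 20 schedulable nvidia.com/gpu units
    - name: nvidia.com/gpu
      replicas: 20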
- After the device plugin installs successfully, an nvidia-device-plugin Pod runs on each GPU node; check that the Pod reports no errors, e.g. as below
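A hedged check; the namespace matches the install sketch above and may differ in your setup:
$ kubectl get pods -n nvidia-device-plugin -o wide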
Running a GPU Test Workload
- Run the following workload to check that GPU scheduling works in K8s
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF
- Output after the Pod completes successfully:
2024-01-05T17:23:25.675101250+08:00 [Vector addition of 50000 elements]
2024-01-05T17:23:25.675134218+08:00 Copy input data from the host memory to the CUDA device
2024-01-05T17:23:25.675136777+08:00 CUDA kernel launch with 196 blocks of 256 threads
2024-01-05T17:23:25.675138621+08:00 Copy output data from the CUDA device to the host memory
2024-01-05T17:23:25.675140416+08:00 Test PASSED
2024-01-05T17:23:25.675142149+08:00 Done
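With the 20-way time-slicing config applied, the node should also advertise the virtualized capacity; a quick hedged check, where <gpu-node> is a placeholder for your node name:
$ kubectl describe node <gpu-node> | grep nvidia.com/gpu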