
Distributed Training with RayCluster

In this guide, we walk through setting up a Ray cluster on MicroK8s across multiple nodes and launching a Schola training script on it. We cover the required installation and configuration for MicroK8s, Docker, and Ray. Note that this is not the only way to set up a Ray cluster or launch a training script, and the configuration can be customized to your specific requirements. This guide simply provides a starting point for distributed training with Ray on a local Kubernetes cluster.

Installing Prerequisites

Before you begin, make sure the following prerequisites are installed on your system:

  • Ubuntu 22.04 (22.04.4 Desktop x86 64-bit is recommended for reproducibility)

  • Docker (make sure Docker is installed and running)

  • MicroK8s (a lightweight Kubernetes distribution)

  • Ray (a framework for building and running distributed applications)

Setting Up Docker

  1. Uninstall conflicting Docker packages, if needed:
Terminal window
sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
  2. Install Docker:
Terminal window
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  3. Complete the Docker post-install steps:
Terminal window
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world
sudo systemctl enable docker.service
sudo systemctl enable containerd.service
  4. Configure the Docker registry:
Terminal window
docker run -d -p <your_registry_ip>:32000:5000 --name registry registry:2
cat <<EOF | sudo tee /etc/docker/daemon.json
{
"insecure-registries": ["<your_registry_ip>:32000"]
}
EOF
sudo systemctl restart docker
docker start registry
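
Before wiring MicroK8s to this registry, it can be worth confirming that the registry container actually responds. This is a quick sanity check, not part of the original setup; `<your_registry_ip>` is the same placeholder used above:

```shell
# List the repositories the registry knows about.
# A freshly started, empty registry should return: {"repositories":[]}
curl -s http://<your_registry_ip>:32000/v2/_catalog
```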

Setting Up MicroK8s

  1. Uninstall any existing MicroK8s, if needed:
Terminal window
if command -v microk8s &> /dev/null; then
sudo microk8s reset
sudo snap remove microk8s
fi
sudo rm -rf /var/snap/microk8s/current
  2. Install MicroK8s:
Terminal window
sudo snap install microk8s --classic
microk8s status --wait-ready
  3. Add your user to the MicroK8s group:
Terminal window
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
  4. Enable the required MicroK8s add-ons:
Terminal window
microk8s enable dns storage registry
  5. Configure MicroK8s to use the local Docker registry:
Terminal window
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000
cat <<EOF | sudo tee /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000/hosts.toml
server = "http://<your_registry_ip>:32000"
[host."http://<your_registry_ip>:32000"]
capabilities = ["pull", "resolve"]
EOF
sudo systemctl restart docker
sudo snap stop microk8s
sudo snap start microk8s
microk8s status --wait-ready
  6. Test the MicroK8s setup:
Terminal window
docker start registry
docker pull hello-world
docker tag hello-world <your_registry_ip>:32000/hello-world
docker push <your_registry_ip>:32000/hello-world
microk8s kubectl create deployment hello-world --image=<your_registry_ip>:32000/hello-world
sleep 2
microk8s kubectl get deployments
  7. Add nodes to the MicroK8s cluster: to add a new node, first install MicroK8s on the new machine
Terminal window
sudo snap install microk8s

Then, generate the join command on the main node

Terminal window
join_command=$(microk8s add-node | grep 'microk8s join' | grep 'worker')

Run the join command on the new node

Terminal window
microk8s join <main_node_ip>:25000/<token>

Update the configuration files on the new node so that it can pull images from the local registry (following step 5 of this section and step 4 of the Docker setup)

  • Update /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000/hosts.toml

  • Update /etc/docker/daemon.json

  • Restart the container runtimes

Terminal window
sudo systemctl restart docker
sudo snap stop microk8s
sudo snap start microk8s
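
The new-node registry steps above can be sketched as a single script. This is a hedged sketch rather than part of the original guide: `<your_registry_ip>` is the same placeholder used earlier, and the file contents mirror step 4 of the Docker setup and step 5 of this section:

```shell
#!/bin/bash
# Sketch: point a freshly joined node at the cluster's local image registry.
REGISTRY=<your_registry_ip>:32000   # placeholder: set to your registry address

# 1. Allow Docker to use the registry over plain HTTP.
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "insecure-registries": ["${REGISTRY}"]
}
EOF

# 2. Tell MicroK8s' containerd where to resolve and pull images from.
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/${REGISTRY}
cat <<EOF | sudo tee /var/snap/microk8s/current/args/certs.d/${REGISTRY}/hosts.toml
server = "http://${REGISTRY}"
[host."http://${REGISTRY}"]
capabilities = ["pull", "resolve"]
EOF

# 3. Restart both container runtimes so the changes take effect.
sudo systemctl restart docker
sudo snap stop microk8s
sudo snap start microk8s
```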

Building and Deploying the Docker Image

  1. Create a Dockerfile: use the following Dockerfile as a reference for building your Docker image
FROM rayproject/ray:latest-py39
COPY . ./python
RUN sudo apt-get update && cd python && python -m pip install --upgrade pip && \
pip install .[all] && pip install --upgrade numpy==1.26 && \
pip install --upgrade ray==2.36 && pip install tensorboard
WORKDIR ./python
  2. Build the Docker image: navigate to the directory containing the Dockerfile and run
Terminal window
docker build --no-cache -t <image_name>:<tag> .
  3. Push the Docker image to the MicroK8s registry:
Terminal window
docker tag <image_name>:<tag> localhost:32000/<image_name>:<tag>
docker push localhost:32000/<image_name>:<tag>

Configuring the Ray Cluster

  1. Install the KubeRay operator:
Terminal window
microk8s helm repo add kuberay https://ray-project.github.io/kuberay-helm/
microk8s helm repo update
microk8s helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
sleep 2
microk8s kubectl get pods -o wide
  2. Create the RayCluster configuration file: the following YAML configuration can be used as a reference when defining the Ray cluster
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    meta.helm.sh/release-name: raycluster
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/instance: raycluster
    app.kubernetes.io/managed-by: Helm
    helm.sh/chart: ray-cluster-1.1.1
  name: raycluster-kuberay
  namespace: default
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      spec:
        containers:
          - image: <cluster IP>:32000/ScholaExamples:registry
            imagePullPolicy: Always
            name: ray-head
            resources:
              limits:
                cpu: "8"
                memory: 48Gi
              requests:
                cpu: "2"
                memory: 16Gi
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
          - emptyDir: {}
            name: log-volume
  workerGroupSpecs:
    - groupName: workergroup
      maxReplicas: <max number of worker pods>
      minReplicas: <min number of worker pods>
      numOfHosts: 1
      rayStartParams: {}
      replicas: 3
      template:
        metadata:
          labels:
            app: worker-pod
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values:
                            - worker-pod
                    topologyKey: "kubernetes.io/hostname"
          containers:
            - image: <cluster IP>:32000/ScholaExamples:registry
              imagePullPolicy: Always
              name: ray-worker
              resources:
                limits:
                  cpu: "<num cores>"
                  memory: <memory to use>Gi
                requests:
                  cpu: "<num cores per worker>"
                  memory: <memory per worker>Gi
              securityContext: {}
          imagePullSecrets: []
          nodeSelector: {}
          tolerations: []
          volumes:
            - emptyDir: {}
              name: log-volume
  3. Deploy the RayCluster:
Terminal window
microk8s helm install raycluster kuberay/ray-cluster --version 1.1.1
sleep 2
microk8s kubectl get rayclusters
sleep 2
microk8s kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
sleep 2
echo "Ray Cluster Pods:"
microk8s kubectl get pods -o wide
  4. Verify the Ray cluster setup:
Terminal window
export HEAD_POD=$(microk8s kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
echo $HEAD_POD
get_head_pod_status() {
HEAD_POD=$(microk8s kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
microk8s kubectl get pods | grep $HEAD_POD | awk '{print $3}'
}
head_pod_status=$(get_head_pod_status)
while [ "$head_pod_status" != "Running" ]; do
echo "Current head pod ($HEAD_POD) status: $head_pod_status. Waiting for 'Running'..."
sleep 2
head_pod_status=$(get_head_pod_status)
done
microk8s kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"

Deploying the Ray Cluster

Apply the configuration to your MicroK8s cluster

Terminal window
microk8s kubectl apply -f raycluster.yaml

Launching the Training Script

  1. Run the training script on the Ray cluster: the following command launches the training script on the Ray cluster
Terminal window
microk8s kubectl exec -it $HEAD_POD -- python Schola/Resources/python/schola/scripts/ray/launch.py \
--num-learners <num_learners> --num-cpus-per-learner <num_cpus_per_learner> \
--activation <activation_function> --launch-unreal \
--unreal-path "<path_to_unreal_executable>" -t <training_iterations> \
--save-final-policy -p <port> --headless <APPO/IMPALA>

Key components of the command:

  • --unreal-path "<path_to_unreal_executable>": specifies the path to a fully packaged Unreal Engine executable. Crucially, this executable must be included in the Docker image and accessible at runtime. The Unreal Engine instance is launched as part of the training process, providing the simulation or environment needed for training.

  • --num-learners <num_learners>: specifies the number of learners used during training. This can be adjusted based on the available resources and the complexity of the task.

  • --num-cpus-per-learner <num_cpus_per_learner>: defines the number of CPU cores allocated to each learner. Tune this to balance resource utilization and performance.

  • --activation <activation_function>: sets the activation function used in the neural network model. Modify this to experiment with different activation functions.

  • Additional arguments such as -t <training_iterations> (number of training iterations), --save-final-policy, and -p <port> (port) can be customized to your specific training needs.
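
As an illustration, a fully filled-in invocation might look like the following. Every concrete value here (learner counts, activation function, executable path, iteration count, port, and algorithm) is a hypothetical example, not a recommendation:

```shell
# Hypothetical example values; substitute your own image path and resources.
microk8s kubectl exec -it $HEAD_POD -- python Schola/Resources/python/schola/scripts/ray/launch.py \
    --num-learners 2 --num-cpus-per-learner 2 \
    --activation relu --launch-unreal \
    --unreal-path "/home/ray/python/MyGame.sh" -t 100 \
    --save-final-policy -p 8000 --headless APPO
```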

  2. Monitor the training process: optionally, launch a TensorBoard instance to track training
#!/bin/bash
HEAD_POD=$(microk8s kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
microk8s kubectl exec -it $HEAD_POD -- tensorboard --logdir /path/to/logs --host 0.0.0.0
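
TensorBoard binds inside the head pod, so to view it from your workstation you can forward its default port (6006). A minimal sketch, assuming the TensorBoard instance above is running:

```shell
# Forward TensorBoard's default port from the head pod to localhost,
# then browse to http://localhost:6006
microk8s kubectl port-forward $HEAD_POD 6006:6006
```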