Distributed Training with RayCluster
In this guide, we walk you through setting up a Ray cluster on MicroK8s across multiple nodes and launching a Schola training script on it. We cover the installation and configuration required for MicroK8s, Docker, and Ray. Note that this is not the only way to set up a Ray cluster or launch a training script, and the configuration can be customized to your specific requirements. This guide simply provides a starting point for distributed training with Ray on a local Kubernetes cluster.
Installing the Prerequisites
Before you begin, make sure the following prerequisites are installed on your system:
- Ubuntu 22.04 (22.04.4 Desktop x86 64-bit is recommended for reproducibility)
- Docker (make sure Docker is installed and running)
- MicroK8s (a lightweight Kubernetes distribution)
- Ray (a framework for building and running distributed applications)
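Once you have worked through the sections below (or if these tools are already present), a quick host-side sanity check can confirm the setup. A minimal sketch (Ray itself runs inside the Docker image built later, so it is not checked here):

```bash
#!/bin/bash
# Quick host-side sanity check of the prerequisites (illustrative only).
lsb_release -d               # expect an Ubuntu 22.04 description
docker --version             # Docker CLI installed?
systemctl is-active docker   # Docker daemon running?
snap list microk8s           # MicroK8s installed via snap?
```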
Setting Up Docker
- If needed, uninstall conflicting Docker packages:
```bash
sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
```
- Install Docker:
```bash
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
- After Docker is installed:
```bash
sudo groupadd docker
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world
sudo systemctl enable docker.service
sudo systemctl enable containerd.service
```
- Configure the Docker registry:
```bash
docker run -d -p <your_registry_ip>:32000:5000 --name registry registry:2
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "insecure-registries": ["<your_registry_ip>:32000"]
}
EOF
sudo systemctl restart docker
docker start registry
```
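To confirm the registry container is reachable, you can query the Docker Registry HTTP API; a quick check, using the same <your_registry_ip> as above:

```bash
# Expects a JSON catalog in response, e.g. {"repositories":[]} for a fresh registry.
curl http://<your_registry_ip>:32000/v2/_catalog
```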
Setting Up MicroK8s
- If needed, uninstall any existing MicroK8s:
```bash
if command -v microk8s &> /dev/null; then
  sudo microk8s reset
  sudo snap remove microk8s
fi
sudo rm -rf /var/snap/microk8s/current
```
- Install MicroK8s:
```bash
sudo snap install microk8s --classic
microk8s status --wait-ready
```
- Add your user to the MicroK8s group:
```bash
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
```
- Enable the required MicroK8s services:
```bash
microk8s enable dns storage registry
```
- Configure MicroK8s to use the local Docker registry:
```bash
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000
cat <<EOF | sudo tee /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000/hosts.toml
server = "http://<your_registry_ip>:32000"

[host."http://<your_registry_ip>:32000"]
capabilities = ["pull", "resolve"]
EOF
sudo systemctl restart docker
sudo snap stop microk8s
sudo snap start microk8s
microk8s status --wait-ready
```
- Test the MicroK8s setup:
```bash
docker start registry
docker pull hello-world
docker tag hello-world <your_registry_ip>:32000/hello-world
docker push <your_registry_ip>:32000/hello-world
microk8s kubectl create deployment hello-world --image=<your_registry_ip>:32000/hello-world
sleep 2
microk8s kubectl get deployments
```
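Once the deployment shows up, the test resources can optionally be removed again; a cleanup sketch:

```bash
# Optional: remove the hello-world test deployment after verifying it ran.
microk8s kubectl delete deployment hello-world
```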
- Add nodes to the MicroK8s cluster: To add a new node, first install MicroK8s on the new machine
```bash
sudo snap install microk8s --classic
```
Then generate the join command on the main node
```bash
join_command=$(microk8s add-node | grep 'microk8s join' | grep 'worker')
```
Run the join command on the new node
```bash
microk8s join <main_node_ip>:25000/<token>
```
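To confirm the new node actually joined, list the cluster's nodes from the main node:

```bash
# Run on the main node; the new node should appear with STATUS Ready.
microk8s kubectl get nodes
```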
Update the configuration files on the new node so that it pulls images from the local registry (following step 5 of this section and step 4 of the Docker setup):
- Update /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000/hosts.toml (see the sketch below)
- Update /etc/docker/daemon.json (see the sketch below)
- Restart the container runtimes
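A minimal sketch of the first two updates on the new node, assuming the registry keeps the same <your_registry_ip> as before:

```bash
# Mirror the registry configuration from the main node onto the new node.
sudo mkdir -p /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000
cat <<EOF | sudo tee /var/snap/microk8s/current/args/certs.d/<your_registry_ip>:32000/hosts.toml
server = "http://<your_registry_ip>:32000"

[host."http://<your_registry_ip>:32000"]
capabilities = ["pull", "resolve"]
EOF
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "insecure-registries": ["<your_registry_ip>:32000"]
}
EOF
```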
Then restart the container runtimes:
```bash
sudo systemctl restart docker
sudo snap stop microk8s
sudo snap start microk8s
```
Building and Deploying the Docker Image
- Create a Dockerfile: Use the following Dockerfile as a reference for building your Docker image
```dockerfile
FROM rayproject/ray:latest-py39
COPY . ./python
RUN sudo apt-get update && cd python && python -m pip install --upgrade pip && \
    pip install .[all] && pip install --upgrade numpy==1.26 && \
    pip install --upgrade ray==2.36 && pip install tensorboard
WORKDIR ./python
```
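Note that this Dockerfile copies the build context into ./python and pip-installs it, so the context is expected to contain the Schola Python package, plus the packaged Unreal executable referenced later by --unreal-path. A hypothetical pre-build check:

```bash
# Illustrative: warn if the build context does not look like a pip-installable package.
[ -f pyproject.toml ] || [ -f setup.py ] || echo "warning: no Python package manifest in build context"
```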
- Build the Docker image: Navigate to the directory containing the Dockerfile and run
```bash
docker build --no-cache -t <image_name>:<tag> .
```
- Push the Docker image to the MicroK8s registry:
```bash
docker tag <image_name>:<tag> localhost:32000/<image_name>:<tag>
docker push localhost:32000/<image_name>:<tag>
```
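You can verify that the push succeeded by asking the registry for the image's tags; a quick check:

```bash
# Should list the tag that was just pushed.
curl http://localhost:32000/v2/<image_name>/tags/list
```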
Configuring the Ray Cluster
- Install the KubeRay operator:
```bash
microk8s helm repo add kuberay https://ray-project.github.io/kuberay-helm/
microk8s helm repo update
microk8s helm install kuberay-operator kuberay/kuberay-operator --version 1.1.1
sleep 2
microk8s kubectl get pods -o wide
```
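Optionally, wait for the operator rollout to finish before creating the cluster; a sketch, assuming the chart's default deployment name kuberay-operator:

```bash
# Blocks until the KubeRay operator deployment reports a successful rollout.
microk8s kubectl rollout status deployment/kuberay-operator --timeout=120s
```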
- Create a RayCluster configuration file: You can use the following YAML configuration as a reference when defining your Ray cluster
```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  annotations:
    meta.helm.sh/release-name: raycluster
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/instance: raycluster
    app.kubernetes.io/managed-by: Helm
    helm.sh/chart: ray-cluster-1.1.1
  name: raycluster-kuberay
  namespace: default
spec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: 0.0.0.0
    serviceType: ClusterIP
    template:
      spec:
        containers:
        - image: <cluster IP>:32000/ScholaExamples:registry
          imagePullPolicy: Always
          name: ray-head
          resources:
            limits:
              cpu: "8"
              memory: 48Gi
            requests:
              cpu: "2"
              memory: 16Gi
        volumes:
        - emptyDir: {}
          name: log-volume
  workerGroupSpecs:
  - groupName: workergroup
    maxReplicas: 5
    minReplicas: 3
    replicas: 3
    template:
      metadata:
        labels:
          app: worker-pod
      spec:
        containers:
        - image: <cluster IP>:32000/ScholaExamples:registry
          imagePullPolicy: Always
          name: ray-worker
          resources:
            limits:
              cpu: "8"
              memory: 32Gi
            requests:
              cpu: "3"
              memory: 6Gi
        imagePullSecrets: []
        nodeSelector: {}
        tolerations: []
        volumes:
        - emptyDir: {}
          name: log-volume
```
The workerGroupSpecs section can also be written in a parameterized form, with a pod anti-affinity rule that encourages worker pods to spread across different nodes:
```yaml
workerGroupSpecs:
- groupName: workergroup
  maxReplicas: <max number of worker pods>
  minReplicas: <min number of worker pods>
  numOfHosts: 1
  rayStartParams: {}
  replicas: 3
  template:
    metadata:
      labels:
        app: worker-pod
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - worker-pod
              topologyKey: "kubernetes.io/hostname"
      containers:
      - image: <cluster IP>:32000/ScholaExamples:registry
        imagePullPolicy: Always
        name: ray-worker
        resources:
          limits:
            cpu: "<num cores>"
            memory: <memory to use>Gi
          requests:
            cpu: "<num cores per worker>"
            memory: <memory per worker>Gi
        securityContext: {}
      imagePullSecrets: []
      nodeSelector: {}
      tolerations: []
      volumes:
      - emptyDir: {}
        name: log-volume
```
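Before deploying, the manifest can be validated against the RayCluster CRD with a server-side dry run; a sketch, assuming the file is saved as raycluster.yaml (as in the apply step below):

```bash
# Validates the manifest on the API server without creating any resources.
microk8s kubectl apply --dry-run=server -f raycluster.yaml
```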
- Deploy the RayCluster:
```bash
microk8s helm install raycluster kuberay/ray-cluster --version 1.1.1
sleep 2
microk8s kubectl get rayclusters
sleep 2
microk8s kubectl get pods --selector=ray.io/cluster=raycluster-kuberay
sleep 2
echo "Ray Cluster Pods:"
microk8s kubectl get pods -o wide
```
- Verify the Ray cluster setup:
```bash
export HEAD_POD=$(microk8s kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
echo $HEAD_POD

get_head_pod_status() {
  HEAD_POD=$(microk8s kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
  microk8s kubectl get pods | grep $HEAD_POD | awk '{print $3}'
}

head_pod_status=$(get_head_pod_status)
while [ "$head_pod_status" != "Running" ]; do
  echo "Current head pod ($HEAD_POD) status: $head_pod_status. Waiting for 'Running'..."
  sleep 2
  head_pod_status=$(get_head_pod_status)
done

microk8s kubectl exec -it $HEAD_POD -- python -c "import ray; ray.init(); print(ray.cluster_resources())"
```
Deploying the Ray Cluster
Apply the configuration to your MicroK8s cluster
```bash
microk8s kubectl apply -f raycluster.yaml
```
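With the cluster running, the Ray dashboard (enabled by dashboard-host: 0.0.0.0 in the config) can be reached via a port-forward; a sketch, assuming KubeRay's default head service name raycluster-kuberay-head-svc:

```bash
# Forward the Ray dashboard (port 8265) to http://localhost:8265.
microk8s kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265
```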
Launching the Training Script
- Run the training script on the Ray cluster: The following command launches the training script on the Ray cluster
```bash
microk8s kubectl exec -it $HEAD_POD -- python Schola/Resources/python/schola/scripts/ray/launch.py \
  --num-learners <num_learners> --num-cpus-per-learner <num_cpus_per_learner> \
  --activation <activation_function> --launch-unreal \
  --unreal-path "<path_to_unreal_executable>" -t <training_iterations> \
  --save-final-policy -p <port> --headless <APPO/IMPALA>
```
Key components of the command:
- `--unreal-path "<path_to_unreal_executable>"`: Specifies the path to a fully built Unreal Engine executable. Crucially, this executable must be included in the Docker image and be accessible at runtime. The Unreal Engine instance is launched as part of the training process and provides the simulation environment needed for training.
- `--num-learners <num_learners>`: Specifies the number of learners used during training. Adjust it based on the available resources and the complexity of the task.
- `--num-cpus-per-learner <num_cpus_per_learner>`: Defines the number of CPU cores allocated to each learner. Tune this parameter to balance resource utilization and performance.
- `--activation <activation_function>`: Sets the activation function used in the neural network model. Modify it to experiment with different activation functions.
- Additional arguments such as `-t <training_iterations>` (number of training iterations), `--save-final-policy`, and `-p <port>` (port) can be customized for your specific training needs.
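For illustration, a hypothetical invocation with the placeholders filled in; every value here is arbitrary and should be adapted to your cluster and image layout:

```bash
# Illustrative values only: 2 learners with 4 CPUs each, relu activation,
# 100 training iterations, APPO as the algorithm; the Unreal path is hypothetical.
microk8s kubectl exec -it $HEAD_POD -- python Schola/Resources/python/schola/scripts/ray/launch.py \
  --num-learners 2 --num-cpus-per-learner 4 \
  --activation relu --launch-unreal \
  --unreal-path "/home/ray/python/UnrealBuild/MyProject.sh" -t 100 \
  --save-final-policy -p 8002 --headless APPO
```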
- Monitor the training process: Optionally, launch a TensorBoard instance to track training
```bash
#!/bin/bash
HEAD_POD=$(microk8s kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)
microk8s kubectl exec -it $HEAD_POD -- tensorboard --logdir /path/to/logs --host 0.0.0.0
```
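To view TensorBoard from your workstation, forward its default port from the head pod; a sketch:

```bash
# Forward TensorBoard's default port (6006) from the head pod to localhost.
microk8s kubectl port-forward $HEAD_POD 6006:6006
```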