How to Add and Remove Nodes
OpenPAI doesn't support changing master nodes, thus, only the solution of adding/removing worker nodes is provided. You can add CPU workers, GPU workers, and other computing devices (e.g. TPU, NPU) into the cluster.
Pre-checks on Nodes to Add
Note: If you are going to remove nodes, you can skip this section.
To add worker nodes, please check if the nodes meet The Worker Requirements.
If you have configured any PV/PVC storage, please confirm the nodes meet PV's requirements. See Confirm Worker Nodes Environment for details.
If you are going to add nodes that have been deleted before, you may need to reload the systemd manager configuration on those nodes:
ssh <node> "sudo systemctl daemon-reload"
Pull & Modify Cluster Settings
- Log in to your dev box machine and go into your dev box docker container, change directory to
/pai. If you don't have a dev box docker container, launch one.
sudo docker exec -it <your-dev-box> bash
paictl.pyto pull config files to a certain folder.
Note: Check if the files you pulled contain
config.yaml. Before v1.7.0,
config.yamlis stored in
~/pai-deploy/cluster-cfg/config.yamlon the dev box machine. If you have upgraded to v1.7.0, please copy
<config-folder>and push it to the cluster. If your
config.yamlis lost, you need to create a new one. Refer to config.yaml example.
./paictl.py config pull -o <config-folder>
<config-folder>/layout.yaml. Add new nodes into
machine-list, create a new
machine-skuif necessary. Refer to layout.yaml format for schema requirements.
Note: If you are going to remove nodes, you can skip this step.
```yaml machine-list: - hostname: new-worker-node--0 hostip: x.x.x.x machine-type: xxx-sku pai-worker: "true"
- hostname: new-worker-node-1 hostip: x.x.x.x machine-type: xxx-sku pai-worker: "true"
Make sure that you can access all nodes in the cluster using the settings in
<config-folder>/config.yaml. If you use SSH key pairs to log in to nodes, please mount the folder
~/.sshon the dev box machine to
/root/.sshon the dev box docker container。
Note: If you are using Kubernetes default scheduler, you can skip this step.
Use Paictl to Add / Remove Nodes
Note: All the following operations should be performed in the dev box docker container on the dev box machine.
Note：When removing nodes, the
layout.yaml saved in Kubernetes will be automatically modified after the deletion is successful. We recommend backing up the
<config-folder> in the file system of your dev box machine in case your dev box docker container stops.
- Stop related services.
./paictl.py service stop -n cluster-configuration hivedscheduler rest-server job-exporter
- Push the latest configuration.
./paictl.py config push -p <config-folder> -m service
Add nodes to and/or remove nodes from kubernetes.
To add nodes:
bash ./paictl.py node add -n <node1> <node2> ...
To remove nodes:
bash ./paictl.py node remove -n <node1> <node2> ...
Start related services.
./paictl.py service start -n cluster-configuration hivedscheduler rest-server job-exporter