Installation FAQs and Troubleshooting
Installation FAQs
Which version of NVIDIA driver should I install?
First, check the NVIDIA site for the newest driver version that supports your GPU card. Then check this table for the CUDA versions that the driver version supports.
Please note that some docker images with new CUDA versions cannot be used on machines with old drivers. For now, we recommend installing NVIDIA driver 418 because it supports CUDA 9.0 through CUDA 10.1, the versions used by most deep learning frameworks.
How to speed up deployment on a large cluster?
By default, Ansible uses 5 forks to execute commands in parallel across hosts, which can be slow on a large cluster.
To speed up deployment, add -f <parallel-number> to every ansible or ansible-playbook command. See the Ansible documentation for reference.
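As an illustration, a small wrapper can apply the same -f value to every playbook run. This is a minimal sketch: the function name playbook_with_forks, the FORKS and DRY_RUN variables, and the default of 50 are assumptions made for this example and are not part of the installation scripts.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run ansible-playbook with a configurable fork count.
# FORKS, DRY_RUN, and the default of 50 are assumptions for this sketch.
playbook_with_forks() {
  local forks="${FORKS:-50}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command that would be executed.
    echo "ansible-playbook -f ${forks} $*"
  else
    ansible-playbook -f "${forks}" "$@"
  fi
}
```

For example, DRY_RUN=1 FORKS=20 playbook_with_forks <playbook> prints the command line without running it. Ansible also reads the fork count from the ANSIBLE_FORKS environment variable or the forks setting in ansible.cfg, which avoids passing -f each time.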
How to remove the k8s network plugin?
If you use weave as the k8s network plugin and encounter network errors after installation, such as pods failing to connect to the internet, you can remove the network plugin to solve the issue.
First, run kubectl delete ds weave-net -n kube-system to remove the weave-net daemon set.
Then remove the network plugin with the following ansible-playbook:
---
- hosts: all
  tasks:
    - name: remove cni
      shell: |
        sed -i '/KUBELET_NETWORK_PLUGIN/d' /etc/kubernetes/kubelet.env
        echo KUBELET_NETWORK_PLUGIN=\"\" >> /etc/kubernetes/kubelet.env
      args:
        executable: /bin/bash
    - name: remove weave
      shell: ip link delete weave
      args:
        executable: /bin/bash
    - name: restart network
      shell: systemctl restart networking
      args:
        executable: /bin/bash
    # - name: restart network for ubuntu 18.04
    #   shell: systemctl restart systemd-networkd
    #   args:
    #     executable: /bin/bash
    - name: clean ip table
      shell: |
        iptables -P INPUT ACCEPT
        iptables -P FORWARD ACCEPT
        iptables -P OUTPUT ACCEPT
        iptables -t nat -F
        iptables -t mangle -F
        iptables -F
        iptables -X
    - name: config-docker
      shell: |
        sed -i 's/--iptables=false/--iptables=true --ip-masq=true/g' /etc/systemd/system/docker.service.d/docker-options.conf
        systemctl daemon-reload
      args:
        executable: /bin/bash
    - name: stop kubelet
      shell: systemctl stop kubelet
      args:
        executable: /bin/bash
    - name: restart docker
      shell: systemctl restart docker
      args:
        executable: /bin/bash
    - name: start kubelet
      shell: systemctl start kubelet
      args:
        executable: /bin/bash
After these steps, change the coredns configuration to fix the DNS resolution issue:
Run kubectl edit cm coredns -n kube-system and change .:53 to .:9053.
Run kubectl edit service coredns -n kube-system and change targetPort: 53 to targetPort: 9053.
Run kubectl edit deployment coredns -n kube-system, change containerPort: 53 to containerPort: 9053, and add hostNetwork: true to the pod spec.
How to check whether the GPU driver is installed?
For NVIDIA GPUs, run the command nvidia-smi to check.
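The check can also be scripted, for example to verify many machines in a loop. The sketch below is one possible approach: the function name gpu_driver_version is an assumption for this example, while the --query-gpu and --format options are standard nvidia-smi query flags.

```shell
#!/usr/bin/env bash
# Sketch: print the installed NVIDIA driver version, or fail if nvidia-smi
# is missing or cannot talk to the driver.
# The function name gpu_driver_version is an assumption for this example.
gpu_driver_version() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: the NVIDIA driver is probably not installed" >&2
    return 1
  fi
  local out
  # nvidia-smi prints one line per GPU and exits non-zero on driver errors.
  out="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" || return 1
  # All GPUs on a machine share one driver, so the first line is enough.
  printf '%s\n' "$out" | head -n 1
}
```

A non-zero return status means the driver is missing or not working, which makes the function usable in conditions such as: if gpu_driver_version; then ...; fi.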
How to install GPU driver?
For NVIDIA GPUs, first determine which driver version you want to install (see this question for details). Then run these commands:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-418
sudo reboot
Here we use NVIDIA driver version 418 as an example. Modify nvidia-418 if you want to install a different version, and refer to the NVIDIA community for help if you encounter any problems.
How to install NVIDIA-container-runtime?
Please refer to the official document. Don't forget to set it as Docker's default runtime in the Docker config file. Here is an example of /etc/docker/daemon.json:
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
How to deploy on Azure Kubernetes Service (AKS) with Cluster Autoscaler?
Please refer to this document.
Troubleshooting
Ansible reports Failed to update apt cache or Apt install <some package> fails
Please first check for network-related issues. Besides the network, another possible cause is that ansible sometimes runs apt update to refresh the cache before installing a package. If apt update exits with a non-zero code, the whole command is treated as failed.
You can check this by running sudo apt update; echo $? on the affected machine. If the exit code is not 0, fix the underlying error. Here are two common causes:
If sudo apt update reports "the following signatures couldn't be verified because the public key is not available", use the following commands to fix it. Replace <key-number> with the key ID shown in the message.
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <key-number>
sudo apt update
If sudo apt update reports expired repo lists, use the following commands to fix it. Replace <repo-list-file> with the name of the expired list file.
sudo rm -rf /etc/apt/sources.list.d/<repo-list-file>
sudo apt update
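For the missing public key case, the key IDs can be pulled out of the apt output instead of being copied by hand. The following is a minimal sketch that assumes apt reports each missing key as NO_PUBKEY followed by a hexadecimal ID; the function name missing_apt_keys is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: extract the key IDs that apt reports as missing (NO_PUBKEY <id>).
# Reads apt output on stdin and prints each distinct ID on its own line.
# The function name missing_apt_keys is an assumption for this example.
missing_apt_keys() {
  grep -oE 'NO_PUBKEY [0-9A-Fa-f]+' | awk '{print $2}' | sort -u
}
```

For example, sudo apt update 2>&1 | missing_apt_keys prints the IDs, each of which can then be used as <key-number> in the apt-key command above. Empty output means apt did not report any missing key.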
Ansible playbook exits because of timeout.
Sometimes, if you assign a different hostname to a machine, any command run with sudo becomes very slow on that machine. This happens because the system tries to resolve the new hostname through DNS, and the lookup only fails after a timeout.
To fix this problem, add the new hostname to /etc/hosts on each machine:
echo "127.0.0.1 $(hostname)" | sudo tee -a /etc/hosts
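Running the fix more than once appends a duplicate line each time. If the step may be repeated, an idempotent variant is safer. The sketch below assumes it is run with write access to the target file (for example from a root shell); the function name ensure_hosts_entry and its two arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: append "127.0.0.1 <name>" to a hosts file only if <name> is not
# already listed. Function name and arguments are assumptions for this example.
ensure_hosts_entry() {
  local hosts_file="$1" name="$2"
  # Look for the name as a whole field (after the address) on a non-comment line.
  if awk -v n="$name" '
      !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }' "$hosts_file"; then
    return 0
  fi
  printf '127.0.0.1 %s\n' "$name" >> "$hosts_file"
}
```

From a root shell, ensure_hosts_entry /etc/hosts "$(hostname)" adds the entry on the first call and leaves the file unchanged on later calls.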
Ansible exits because sudo timed out.
The cause and the fix are the same as for "Ansible playbook exits because of timeout" above.
Ansible reports Could not import python modules: apt, apt_pkg. Please install python3-apt package.
Sometimes this error persists even when the python3-apt package is installed. In that case, manually add -e ansible_python_interpreter=/usr/bin/python3 to this line in your local code.
Network-related Issues
If you are in China, please refer to this issue.
Cannot download kubeadm or hyperkube
During installation, the script downloads kubeadm and hyperkube from storage.googleapis.com. Specifically, we use the kubespray 2.11 release, for which the corresponding kubeadm and hyperkube binaries are:
kubeadm: https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubeadm
hyperkube: https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/hyperkube
Please find alternative URLs for these two files and set kubeadm_download_url and hyperkube_download_url in your config file accordingly.
Cannot download image
Please first check the log to see which image blocks the installation. Then point gcr_image_repo, kube_image_repo, quay_image_repo, or docker_image_repo, as appropriate, at a mirror repository in the config file.
For example, if you cannot pull images from gcr.io, first find a mirror repository (we recommend gcr.azk8s.cn if you are in China), then modify gcr_image_repo and kube_image_repo.
For gcr.io in particular, some image links in kubespray do not use gcr_image_repo or kube_image_repo. You must modify them manually in ~/pai-deploy/kubespray. The command grep -r --color gcr.io ~/pai-deploy/kubespray will help you find them.