Getting Kubeflow flowing
Installing Kubeflow on an on-premise Kubernetes cluster has been a pain for me. The solution I finally found was embarrassingly easy, and I will get to it at the end of this story.
The easy way to install Kubeflow has, for me, been microk8s. All you had to do was run ‘microk8s enable kubeflow’. Was that all you needed to do? No, close but no cigar. There was no way to get GPU acceleration, so doing any real deep learning training was out of the question. It was frustrating, because Docker containers, and even containerd-based containers, ran perfectly on the host with full GPU acceleration, yet microk8s gave no access to the GPU.
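A minimal sketch of the kind of host-level check I mean, assuming the Nvidia container runtime is already configured for Docker; the CUDA image tag is only an example, pick one that matches your driver:
# Host-level check: plain Docker should be able to see the GPU
docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi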
However, there is a solution. Or so I thought, because there were several “solutions” listed on GitHub. The problem seemed to stem from an Nvidia library version collision, but also from improvements that had been made in a beta version of microk8s and had not yet been released to the stable channel.
The microk8s solution: start by removing microk8s (sudo snap remove microk8s). Then remove all Nvidia drivers from the system (sudo apt-get remove --purge '^nvidia-.*'). Reboot. After the reboot, install the latest driver (sudo apt-get install nvidia-headless-460-server nvidia-container-runtime).
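Before touching microk8s again, it is worth confirming that the new driver actually loaded; a quick sketch (nvidia-smi is only present if the matching nvidia-utils package is also installed):
# Check that the Nvidia kernel modules are loaded and which driver version is active
lsmod | grep nvidia
cat /proc/driver/nvidia/version
# Optional, requires the nvidia-utils package for your driver series
nvidia-smi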
Next, install microk8s again, but this time from the channel tracking the upcoming release (sudo snap install microk8s --classic --channel=latest/candidate).
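Before going further, check that the snap really tracks the candidate channel and that microk8s comes up:
# Confirm the tracked channel
snap info microk8s | grep tracking
# Block until microk8s reports it is ready
microk8s status --wait-ready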
Check that microk8s works on your system (microk8s enable dns). Now for the main attraction: install GPU support (microk8s enable gpu). Check that the GPU label is at last set on the node (microk8s kubectl get nodes -o yaml | grep -i gpu). This should print nvidia.com/gpu.present: "true".
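The label only tells you the GPU is detected; to see whether a pod can actually use it, I run a throwaway CUDA pod that requests one GPU and prints nvidia-smi. A minimal sketch; the pod name and image tag are just examples:
cat <<EOF | microk8s kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Give the image a moment to pull, then read the output and clean up
microk8s kubectl logs gpu-smoke-test
microk8s kubectl delete pod gpu-smoke-test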
Remember to run microk8s inspect before you start using the development cluster. It may, for example, print this warning:
WARNING: IPtables FORWARD policy is DROP. Consider enabling traffic forwarding with: sudo iptables -P FORWARD ACCEPT
The change can be made persistent with: sudo apt-get install iptables-persistent
Fix all warnings until none are printed when you run microk8s inspect.
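For the FORWARD policy warning above, the fix loop is short:
# Allow forwarding now, make the change persistent, then re-check
sudo iptables -P FORWARD ACCEPT
sudo apt-get install iptables-persistent
microk8s inspect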
However, having the node correctly labeled for GPU availability does not, in this case, mean that it works. Then I saw this message in the GitHub issues:
The following workaround lets me use Kubeflow and GPU with microk8s:
sudo snap install microk8s --channel=1.20 --classic
microk8s enable kubeflow
sudo snap refresh microk8s --channel=1.21/beta
microk8s enable gpu
I can summarize the result like this: No. The setup will still not have a working GPU.
That said, Kubeflow itself works fine without a GPU, just slowly, like everything in microk8s, so I see microk8s as a test environment not meant for real work.
Back to the main topic, which is getting Kubeflow flowing fast. This time I built a lab Kubernetes cluster consisting of 7 GPU nodes and 11 CPU nodes. My favorite way to set up an optimized Kubernetes cluster is with Kubespray. However, there has always been a snag in getting Kubeflow working on it. The problem was that the kfctl command that generates the Kubeflow deployment did not work, and debugging it was difficult. The command uses “kustomize”, which is a really cool technology for building yaml files and very useful for organizing a Kubernetes deployment (a tiny example follows below). However, kustomize is still on my todo list of things to grasp, so this was not an easy win.
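To give a feel for what kustomize does, here is a tiny, generic sketch, not the Kubeflow manifests themselves; all names and the nginx image are made up for the example. A base holds the plain yaml, and an overlay adjusts it without editing the original files:
mkdir -p base overlays/lab

cat <<EOF > base/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: nginx:1.21
EOF

cat <<EOF > base/kustomization.yaml
resources:
- deployment.yaml
EOF

cat <<EOF > overlays/lab/kustomization.yaml
resources:
- ../../base
namePrefix: lab-
EOF

# Render the final yaml without applying anything
kubectl kustomize overlays/lab
kfctl drives the same mechanism, just with a much larger tree of kustomize packages, which is why debugging it is hard when you have not internalized kustomize yet.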
It seems more people find setting up Kubeflow difficult, and someone thought it was worth solving. Enter deepops: NVIDIA/deepops: Tools for building GPU clusters (github.com). Everything needed to set up a working cluster running Kubeflow is included. Under the hood it uses Kubespray (of course), but the tricky part of getting the GPU working is taken care of (one can only wonder how they know how). The rough workflow is sketched below.
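As a rough sketch of that workflow, from memory of the deepops 21.x layout; the script and playbook paths may differ between releases, so check the repo’s README before running anything:
git clone https://github.com/NVIDIA/deepops.git
cd deepops
# Installs Ansible and copies the example config (path from memory)
./scripts/setup.sh
# Edit config/inventory so it lists your CPU and GPU nodes
# Deploy Kubernetes via the bundled Kubespray (playbook path may differ per release)
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml
# Deploy Kubeflow on top (script name from memory, check the repo)
./scripts/k8s/deploy_kubeflow.sh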
All the good things are now up and running: an accelerated Jupyter notebook that uses a free GPU in the cluster, without the limitations of the ones in the cloud.
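Two quick ways to confirm the notebook really got a GPU; the namespace, notebook name and the notebook-name label below are assumptions based on a default Kubeflow profile, so adjust them to your setup:
# From a terminal inside the Jupyter notebook server
nvidia-smi
# From outside, check the resource limits on the notebook pod
kubectl -n kubeflow-user-example-com get pod -l notebook-name=my-notebook \
  -o jsonpath='{.items[0].spec.containers[0].resources.limits}'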
The next adventure will be inside Kubeflow: getting CI/CD and a continuous training pipeline up and running.