JupyterHub + K8s + Nvidia GPUs + On-premise

Description

This post is a detailed set of instructions for setting up a Kubernetes cluster on an on-premise group of servers or workstations. JupyterHub is then deployed to the k8s cluster and customized for machine learning training.

Requirements

  • Each node (computer) should have Ubuntu installed. I used Ubuntu 20.04.
  • Root access (sudo)
  • Nvidia GPU drivers (a verification sketch follows this list)
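
To check that the GPU driver requirement is met (or to install a driver), a sketch like the following should work on Ubuntu 20.04; the specific driver version mentioned is an assumption, and ubuntu-drivers will recommend one for your hardware.

# list detected GPUs and the recommended driver packages
ubuntu-drivers devices
# install the recommended driver, or pick a specific version such as nvidia-driver-470
sudo ubuntu-drivers autoinstall
# after a reboot, confirm the driver is loaded
nvidia-smi
Code language: Bash (bash)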

Contents

All nodes

Primary node

Secondary nodes

Shared file system

Appendix

  • Reset server
  • Reference

Install Docker

First, follow the instructions found in the Docker documentation. I prefer the repository-based install approach (a sketch is shown below). The remaining instructions for GPU and non-GPU environments are provided after it.
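
For reference, a rough sketch of the repository-based install on Ubuntu 20.04 follows; the key and repository URLs reflect the Docker documentation at the time of writing, so defer to the current docs if they differ.

sudo apt update
sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io
Code language: Bash (bash)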

Non-GPU Environment

Replace the contents of the /etc/docker/daemon.json file with the following code block. This associates Docker with the correct cgroup driver. Then skip to the final steps.

{ "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m" }, "storage-driver": "overlay2" }
Code language: JSON / JSON with Comments (json)

GPU Environment

Run the following commands to install the Nvidia Docker package.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt update
sudo apt install -y nvidia-docker2
Code language: Bash (bash)

Replace the contents of the /etc/docker/daemon.json file with the following code block. This associates Docker with the correct cgroup driver and the Nvidia container runtime.

{ "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m" }, "storage-driver": "overlay2", "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }
Code language: JSON / JSON with Comments (json)

Final steps

After updating the daemon.json file, run the following commands to apply the changes.

sudo systemctl enable docker
sudo systemctl daemon-reload
sudo systemctl restart docker
Code language: Bash (bash)
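
To confirm the changes took effect, something like the following sketch should work. The second and third commands only apply to a GPU environment, and the CUDA image tag is an assumption; use one that matches your driver.

sudo docker info | grep -i "cgroup driver"     # should report systemd
sudo docker info | grep -i "default runtime"   # GPU environment: should report nvidia
sudo docker run --rm nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi   # GPU environment: GPUs should be listed
Code language: Bash (bash)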

Install Kubernetes

In the Kubernetes installation instructions, complete the relevant sections (at minimum, installing kubeadm, kubelet, and kubectl).

Note: You have already configured the cgroup driver.
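
For reference, a minimal sketch of the apt-based kubeadm/kubelet/kubectl install as it looked at the time of writing is shown below; the upstream package repository has since moved, so treat the URLs as assumptions and follow the current Kubernetes docs.

sudo apt update && sudo apt install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
Code language: Bash (bash)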

Next, disable swap by commenting out (adding # to the beginning of) the swap line in the /etc/fstab file. Then restart the server or run the command

sudo swapoff -a
Code language: Bash (bash)
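
If you prefer to do this from the command line, a sketch like the following should work; it assumes the swap entry contains a whitespace-delimited "swap" field and writes a backup of the original file.

# comment out any swap entries in /etc/fstab (a backup is written to /etc/fstab.bak)
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab
sudo swapoff -a
Code language: Bash (bash)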

Create Kubernetes Cluster

Create and initialize a k8s cluster by running

sudo kubeadm init --pod-network-cidr=[your cidr]

I used 192.168.0.0/16 as my CIDR. You will see a join command printed to the console. Save it for a later step, when you join secondary nodes to the cluster. Next, run the following commands to point kubectl to the newly created cluster.

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Code language: Bash (bash)

Lastly, create a namespace called jhub in the cluster by running

kubectl create namespace jhub
Code language: Bash (bash)
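
As a quick sanity check, the following should show the primary node and the new namespace.

kubectl get nodes
kubectl get namespaces
Code language: Bash (bash)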

Calico network policy

To apply the Calico network policy to your cluster run the following commands

kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml
Code language: Bash (bash)

Next, wait until all of the pods report a STATUS of Running, which can be verified with the command

watch kubectl get pods -n calico-system
Code language: Bash (bash)

Lastly, to allow pods to be scheduled on the primary node, run the command

kubectl taint nodes --all node-role.kubernetes.io/master-
Code language: Bash (bash)

See the Calico quickstart for more information.

Install Helm

Run the following command to install Helm.

curl https://raw.githubusercontent.com/helm/helm/HEAD/scripts/get-helm-3 | bash
Code language: Bash (bash)

Add relevant repositories to helm by running

helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
Code language: Bash (bash)
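
To confirm Helm is installed and the repositories were added, a quick check like this should work.

helm version
helm repo list
helm search repo jupyterhub/jupyterhub --versions | head
Code language: Bash (bash)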

Install Nvidia Device Plugin

The following commands install the Nvidia device plugin. This is only needed in a GPU environment.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
helm install --version=0.9.0 --generate-name nvdp/nvidia-device-plugin
Code language: Bash (bash)
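
To verify that the plugin is advertising GPUs to the scheduler, something like the following should show the plugin pods and a non-zero nvidia.com/gpu count under each GPU node's Capacity and Allocatable.

kubectl get pods -A | grep nvidia-device-plugin
kubectl describe nodes | grep -i nvidia.com/gpu
Code language: Bash (bash)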

Docker Image for JupyterHub

These instructions create a Docker image based on Jupyter Docker Stacks. Nvidia CUDA support is added through a modified base container. A final custom layer adds Ubuntu and Python packages.

First, build the foundation images by running the following (replacing [repo] with your own custom name):

docker build -t [repo]/base-notebook-cuda \
  --build-arg BASE_CONTAINER=nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 \
  --build-arg PYTHON_VERSION=3.8.10 \
  https://github.com/jupyter/docker-stacks.git#:/base-notebook

docker build -t [repo]/minimal-notebook-cuda \
  --build-arg BASE_CONTAINER=[repo]/base-notebook-cuda \
  https://github.com/jupyter/docker-stacks.git#:/minimal-notebook

docker build -t [repo]/scipy-notebook-cuda \
  --build-arg BASE_CONTAINER=[repo]/minimal-notebook-cuda \
  https://github.com/jupyter/docker-stacks.git#:/scipy-notebook
Code language: Bash (bash)

In the above commands, you can specify the version of CUDA / cuDNN as well as the version of Python by modifying the corresponding arguments.

Next, copy the following into a file named Dockerfile. This is the final layer that allows you to add Ubuntu and Python packages.

# ----------------------------------------------
# Define base container
ARG BASE_CONTAINER=[repo]/scipy-notebook-cuda
FROM $BASE_CONTAINER

LABEL maintainer="Your Name <your@name.com>"

USER root

# ----------------------------------------------
# Install apt packages (root user)
RUN apt-get update && \
    apt-get install --yes --no-install-recommends htop nmon strace npm nodejs build-essential && \
    rm -rf /var/lib/apt/lists/*

# ----------------------------------------------
# Required setup for adding Tensorboard support
# (root user)
RUN mkdir -p /etc/tb/
RUN fix-permissions /etc/tb/
COPY tensorboard-proxy /usr/local/bin/
RUN chmod 755 /usr/local/bin/tensorboard-proxy

USER ${NB_UID}

# ----------------------------------------------
# Install necessary pip packages (regular user)
ARG TF_VERSION=2.5

# Install pip packages
RUN pip install tensorflow==${TF_VERSION} \
    tensorflow-probability \
    tensorflow-io \
    tensorflow-hub \
    tensorflow-addons \
    imageio-ffmpeg \
    pycocotools \
    opencv-contrib-python-headless

# ----------------------------------------------
# Additional setup for Tensorboard support
# (regular user)
WORKDIR /etc/tb/
RUN npm install http-proxy --save
COPY tensorboard-proxy.js /etc/tb/

WORKDIR "${HOME}"
Code language: Dockerfile (dockerfile)

Finally, build the Docker image and optionally push it to your Docker Hub repository.

docker build -t [repo]/tensorflow-notebook-cuda:1.0.x .
docker push [repo]/tensorflow-notebook-cuda:1.0.x
Code language: Bash (bash)
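
As an optional smoke test of the finished image (GPU environment only), a sketch like the following should list the GPUs visible to TensorFlow.

docker run --rm --gpus all [repo]/tensorflow-notebook-cuda:1.0.x \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Code language: Bash (bash)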

Install JupyterHub

Now we deploy JupyterHub to the k8s cluster. First, create a file named config.yaml with the following contents.

hub:
  db:
    type: sqlite-memory
  config:
    Authenticator:
      admin_users:
        - admin
    DummyAuthenticator:
      password: admin_password
    JupyterHub:
      authenticator_class: dummy
  authenticatePrometheus: false
proxy:
  chp:
    networkPolicy:
      egress:
        - ports:
            - port: 6116
            - port: 6117
            - port: 6118
singleuser:
  networkPolicy:
    allowedIngressPorts: [6116, 6117, 6118]
  storage:
    type: none
    extraVolumes:
      - name: vol-gv0
        persistentVolumeClaim:
          claimName: pvc-gv0
      - name: vol-gv1
        persistentVolumeClaim:
          claimName: pvc-gv1
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: vol-gv0
        mountPath: "/home/jovyan/gv0"
      - name: vol-gv1
        mountPath: "/home/jovyan/gv1"
      - name: shm-volume
        mountPath: /dev/shm
  defaultUrl: "/lab"
  profileList:
    - display_name: "TF 2.5 - 0 GPUs"
      kubespawner_override:
        image: [repo]/tensorflow-notebook-cuda:1.0.x
        extra_resource_limits:
          nvidia.com/gpu: "0"
      default: true
    - display_name: "TF 2.5 - 1 GPUs"
      kubespawner_override:
        image: [repo]/tensorflow-notebook-cuda:1.0.x
        extra_resource_limits:
          nvidia.com/gpu: "1"
    - display_name: "TF 2.5 - 4 GPUs"
      kubespawner_override:
        image: [repo]/tensorflow-notebook-cuda:1.0.x
        extra_resource_limits:
          nvidia.com/gpu: "4"
Code language: YAML (yaml)

Some of the contents of this file relate to Gluster storage and Tensorboard support and are explained further in later sections. Next, apply the values in config.yaml to the cluster using Helm with the following command.

helm upgrade --install [release-name] jupyterhub/jupyterhub --namespace jhub --version=1.0.0 --values config.yaml

Finally, make the service visible outside the cluster by running the following command.

kubectl patch svc proxy-public -n jhub -p '{"spec": {"type": "LoadBalancer", "externalIPs":["ip address here"]}}'
Code language: Bash (bash)
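
To verify the deployment, something like the following should show the hub and proxy pods Running and an EXTERNAL-IP on the proxy-public service; JupyterHub should then be reachable at that address.

kubectl get pods -n jhub
kubectl get svc proxy-public -n jhub
Code language: Bash (bash)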

Join nodes to the k8s cluster

On each secondary node, run the join command that was printed when the cluster was initialized. It will look something like this (your IP address, token, and hash will differ).

sudo kubeadm join 192.168.1.102:6443 --token 5ytfgr.oc4pyr54vrgj5b7g \
  --discovery-token-ca-cert-hash sha256:d6f0fd687b14b8506a1fbc63c3ee63ec8f4e81a5c93015245135d3707f433a23
# https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#join-nodes
Code language: Bash (bash)
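
If the original token has expired, a fresh join command can be generated on the primary node; afterwards the new node should appear and eventually report Ready.

# run on the primary node
sudo kubeadm token create --print-join-command
kubectl get nodes
Code language: Bash (bash)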

Create GlusterFS Storage

First, create or modify your hard drive partitions to match the desired configuration. I find this link to be particularly useful.

You'll also want the partitions to be mounted the same way after a reboot. This can be accomplished by adding entries to the /etc/fstab file. First, find the UUID of each partition with the following commands:

sudo lsblk -f       # show UUID for each partition
sudo lshw -C disk   # show additional info about each disk
df -h               # show info about mounted filesystems
Code language: Bash (bash)

An updated /etc/fstab file will look something like this:

# <file system> <mount point> <type> <options> <dump> <pass>
UUID=cc30d1ac-5c3a-41cd-8f97-0bfd648ebb07 / ext4 errors=remount-ro 0 1
UUID=52A3-2923 /boot/efi vfat umask=0077 0 1
# /swapfile none swap sw 0 0
UUID=28b63ddb-75c3-4f18-91d7-b15944ac07a1 /data/glusterfs/gv0/brick1 ext4 defaults 1 2
UUID=7099fa95-068d-4ad2-adb0-8aa8bd39017a /data/glusterfs/gv1/brick1 ext4 defaults 1 2
localhost:/mb1gv1 /mnt/mb1gv1 glusterfs defaults,_netdev 0 0
localhost:/mb2gv1 /mnt/mb2gv1 glusterfs defaults,_netdev 0 0
Code language: plaintext (plaintext)
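
As a sketch, the mount points can be created and tested without a reboot as follows; the paths match the example fstab above, and the glusterfs client mounts will only succeed once the volumes exist.

sudo mkdir -p /data/glusterfs/gv0/brick1 /data/glusterfs/gv1/brick1 /mnt/mb1gv1 /mnt/mb2gv1
# mount everything listed in /etc/fstab and surface any errors now rather than at reboot
sudo mount -a
Code language: Bash (bash)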

Next, create relevant Gluster volumes. The following commands are the most helpful:

gluster peer probe hostname/ipaddress
gluster peer status
gluster volume create [vol name] hostname1:/brick1 hostname2:/brick2
gluster volume status
gluster volume info
Code language: Bash (bash)

Before adding peers to the Gluster cluster, modify the /etc/hosts file to map hostnames to the DHCP-reserved IP addresses of all nodes in the cluster. I've found that this helps reduce unexplained issues when the servers restart and attempt to rejoin. Then, when adding nodes to the Gluster cluster with peer probe, use the hostname instead of the machine's IP address.
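
Putting this together, a minimal sketch of creating a two-node distributed volume is shown below; the hostnames node1/node2, the brick subdirectory, and the volume name gv0 are assumptions you should replace with your own values.

# on every storage node
sudo apt install -y glusterfs-server
sudo systemctl enable --now glusterd
sudo mkdir -p /data/glusterfs/gv0/brick1/brick

# from one node, using hostnames defined in /etc/hosts
sudo gluster peer probe node2
sudo gluster volume create gv0 node1:/data/glusterfs/gv0/brick1/brick node2:/data/glusterfs/gv0/brick1/brick
sudo gluster volume start gv0
sudo gluster volume info gv0
Code language: Bash (bash)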

Add GlusterFS Storage to k8s cluster

First, create Endpoints and Service YAML files that describe the Gluster storage.

apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-cluster
  namespace: jhub
subsets:
  - addresses:
      - ip: [ip address of gluster peer]
    ports:
      - port: 1
Code language: YAML (yaml)
apiVersion: v1
kind: Service
metadata:
  name: glusterfs-cluster
  namespace: jhub
spec:
  ports:
    - port: 1
Code language: YAML (yaml)

Apply these to the k8s cluster by running the following command for each of the above files.

kubectl create -f file_name.yaml
Code language: Bash (bash)

Next, you will create a persistent volume and a persistent volume claim for each of your gluster volumes. These files look something like this.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-name
  namespace: jhub
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
Code language: YAML (yaml)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-name
  namespace: jhub
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: jhub
    name: pvc-name
  glusterfs:
    endpoints: glusterfs-cluster
    path: gv0
    readOnly: false
Code language: YAML (yaml)

Each of these files is applied in the same way as before.

kubectl create -f file_name.yaml
Code language: Bash (bash)
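
To verify the storage wiring, something like the following should show the endpoints and a PVC with STATUS Bound to its persistent volume.

kubectl get endpoints glusterfs-cluster -n jhub
kubectl get pv
kubectl get pvc -n jhub
Code language: Bash (bash)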

Misc

Create JupyterHub usage dashboard

  • Request the jupyterhub-dashboard web app from Ben
  • Create a Python virtual environment that has Flask installed (pip install Flask)
  • Execute python dashboard-app.py from the virtual environment
  • Optionally, launch the dashboard on reboot using crontab

Create service to update the proxy table for enabling Tensorboard

The proxy auth token can be read out of the hub secret with the following command.

kubectl get secret hub -n jhub -o=json | jq -r '.["data"] | .["hub.config.ConfigurableHTTPProxy.auth_token"]' | base64 --decode
Code language: Bash (bash)

Reset server

To reset the server, run

sudo kubeadm reset
Code language: Bash (bash)

Note that this gets more complicated after you have added more nodes to the cluster. See https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#tear-down for details.

Reference

https://zero-to-jupyterhub.readthedocs.io/en/latest/