JupyterHub + K8s + Nvidia GPUs + On-premise

Description

This post is a detailed set of instructions for setting up a Kubernetes cluster on an on-premise group of servers or workstations. JupyterHub is then deployed to the k8s cluster and customized for machine learning training.

Requirements

  • Each node (computer) should have Ubuntu installed. I used Ubuntu 20.04.
  • Root access (sudo)
  • Nvidia GPU drivers (a verification sketch follows this list)
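
To check that the GPU driver requirement is met (or to install a driver), a sketch like the following should work on Ubuntu 20.04; the specific driver version mentioned is an assumption, and ubuntu-drivers will recommend one for your hardware.

# list detected GPUs and the recommended driver packages
ubuntu-drivers devices
# install the recommended driver, or pick a specific version such as nvidia-driver-470
sudo ubuntu-drivers autoinstall
# after a reboot, confirm the driver is loaded
nvidia-smi
Code language: Bash (bash)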

Contents

All nodes

Primary node

Secondary nodes

Shared file system

Appendix

  • Reset server
  • Reference

Install Docker

First, follow the instructions found in the Docker documentation. I prefer the repository-based install approach (a sketch is shown below). The remaining instructions for GPU and non-GPU environments are provided after it.
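
For reference, a rough sketch of the repository-based install on Ubuntu 20.04 follows; the key and repository URLs reflect the Docker documentation at the time of writing, so defer to the current docs if they differ.

sudo apt update
sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io
Code language: Bash (bash)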

Non-GPU Environment

Replace the contents of the /etc/docker/daemon.json file with the following code block. This associates Docker with the correct cgroup driver. Then skip to the final steps.

{ "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m" }, "storage-driver": "overlay2" }
Code language: JSON / JSON with Comments (json)

GPU Environment

Run the following commands to install the Nvidia Docker package.

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt update
sudo apt install -y nvidia-docker2
Code language: Bash (bash)

Replace the contents of the /etc/docker/daemon.json file with the following code block. This associates Docker with the correct cgroup driver and the Nvidia container runtime.

{ "exec-opts": ["native.cgroupdriver=systemd"], "log-driver": "json-file", "log-opts": { "max-size": "100m" }, "storage-driver": "overlay2", "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }
Code language: JSON / JSON with Comments (json)

Final steps

After updating the daemon.json file, run the following commands to apply the changes.

sudo systemctl enable docker
sudo systemctl daemon-reload
sudo systemctl restart docker
Code language: Bash (bash)
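
To confirm the changes took effect, something like the following sketch should work. The second and third commands only apply to a GPU environment, and the CUDA image tag is an assumption; use one that matches your driver.

sudo docker info | grep -i "cgroup driver"     # should report systemd
sudo docker info | grep -i "default runtime"   # GPU environment: should report nvidia
sudo docker run --rm nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi   # GPU environment: GPUs should be listed
Code language: Bash (bash)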

Install Kubernetes

In the Kubernetes installation instructions, complete the relevant sections (at minimum, installing kubeadm, kubelet, and kubectl).

Note: You have already configured the cgroup driver.
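
For reference, a minimal sketch of the apt-based kubeadm/kubelet/kubectl install as it looked at the time of writing is shown below; the upstream package repository has since moved, so treat the URLs as assumptions and follow the current Kubernetes docs.

sudo apt update && sudo apt install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
Code language: Bash (bash)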

Next, disable swap by commenting out (adding # to the beginning of) the swap line in the /etc/fstab file. Then restart the server or run the command

sudo swapoff -a
Code language: Bash (bash)
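
If you prefer to do this from the command line, a sketch like the following should work; it assumes the swap entry contains a whitespace-delimited "swap" field and writes a backup of the original file.

# comment out any swap entries in /etc/fstab (a backup is written to /etc/fstab.bak)
sudo sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab
sudo swapoff -a
Code language: Bash (bash)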

Create Kubernetes Cluster

Create and initialize a k8s cluster by running

sudo kubeadm init --pod-network-cidr=[your cidr]

I used 192.168.0.0/16 as my CIDR. You will see a join command printed to the console. Save it for a later step, when you join secondary nodes to the cluster. Next, run the following commands to point kubectl to the newly created cluster.

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Code language: Bash (bash)

Lastly, create a namespace called jhub in the cluster by running

kubectl create namespace jhub
Code language: Bash (bash)
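
As a quick sanity check, the following should show the primary node and the new namespace.

kubectl get nodes
kubectl get namespaces
Code language: Bash (bash)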

Calico network policy

To apply the Calico network policy to your cluster run the following commands

kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml
Code language: Bash (bash)

Next, wait until all of the pods report a STATUS of Running, which can be verified with the command

watch kubectl get pods -n calico-system
Code language: Bash (bash)

Lastly, to allow pods to be scheduled on the primary node, run the command

kubectl taint nodes --all node-role.kubernetes.io/master-
Code language: Bash (bash)

See the Calico quickstart for more information.

Install Helm

Run the following command to install Helm.

curl https://raw.githubusercontent.com/helm/helm/HEAD/scripts/get-helm-3 | bash
Code language: Bash (bash)

Add relevant repositories to helm by running

helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
Code language: Bash (bash)
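
To confirm Helm is installed and the repositories were added, a quick check like this should work.

helm version
helm repo list
helm search repo jupyterhub/jupyterhub --versions | head
Code language: Bash (bash)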

Install Nvidia Device Plugin

The following commands install the Nvidia device plugin. This is only needed in a GPU environment.

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
helm install --version=0.9.0 --generate-name nvdp/nvidia-device-plugin
Code language: Bash (bash)
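
To verify that the plugin is advertising GPUs to the scheduler, something like the following should show the plugin pods and a non-zero nvidia.com/gpu count under each GPU node's Capacity and Allocatable.

kubectl get pods -A | grep nvidia-device-plugin
kubectl describe nodes | grep -i nvidia.com/gpu
Code language: Bash (bash)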

Docker Image for JupyterHub

These instructions create a Docker image based on Jupyter Docker Stacks. Nvidia CUDA support is added through a modified base container. A final custom layer adds Ubuntu and Python packages.

First, build the foundation images by running the following (replacing [repo] with your own custom name):

docker build -t [repo]/base-notebook-cuda \
  --build-arg BASE_CONTAINER=nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 \
  --build-arg PYTHON_VERSION=3.8.10 \
  https://github.com/jupyter/docker-stacks.git#:/base-notebook

docker build -t [repo]/minimal-notebook-cuda \
  --build-arg BASE_CONTAINER=[repo]/base-notebook-cuda \
  https://github.com/jupyter/docker-stacks.git#:/minimal-notebook

docker build -t [repo]/scipy-notebook-cuda \
  --build-arg BASE_CONTAINER=[repo]/minimal-notebook-cuda \
  https://github.com/jupyter/docker-stacks.git#:/scipy-notebook
Code language: Bash (bash)

In the above commands, you can specify the version of CUDA / cuDNN as well as the version of Python by modifying the corresponding arguments.

Next, copy the following into a file named Dockerfile. This is the final layer that allows you to add Ubuntu and Python packages.

# ----------------------------------------------
# Define base container
ARG BASE_CONTAINER=[repo]/scipy-notebook-cuda
FROM $BASE_CONTAINER

LABEL maintainer="Your Name <your@name.com>"

USER root

# ----------------------------------------------
# Install apt packages (root user)
RUN apt-get update && \
    apt-get install --yes --no-install-recommends htop nmon strace npm nodejs build-essential && \
    rm -rf /var/lib/apt/lists/*

# ----------------------------------------------
# Required setup for adding Tensorboard support
# (root user)
RUN mkdir -p /etc/tb/
RUN fix-permissions /etc/tb/
COPY tensorboard-proxy /usr/local/bin/
RUN chmod 755 /usr/local/bin/tensorboard-proxy

USER ${NB_UID}

# ----------------------------------------------
# Install necessary pip packages (regular user)
ARG TF_VERSION=2.5

# Install pip packages
RUN pip install tensorflow==${TF_VERSION} \
    tensorflow-probability \
    tensorflow-io \
    tensorflow-hub \
    tensorflow-addons \
    imageio-ffmpeg \
    pycocotools \
    opencv-contrib-python-headless

# ----------------------------------------------
# Additional setup for Tensorboard support
# (regular user)
WORKDIR /etc/tb/
RUN npm install http-proxy --save
COPY tensorboard-proxy.js /etc/tb/

WORKDIR "${HOME}"
Code language: Dockerfile (dockerfile)

Finally, build the Docker image and optionally push it to your Docker Hub repository.

docker build -t [repo]/tensorflow-notebook-cuda:1.0.x .
docker push [repo]/tensorflow-notebook-cuda:1.0.x
Code language: Bash (bash)
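
As an optional smoke test of the finished image (GPU environment only), a sketch like the following should list the GPUs visible to TensorFlow.

docker run --rm --gpus all [repo]/tensorflow-notebook-cuda:1.0.x \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Code language: Bash (bash)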

Install JupyterHub

Now we deploy JupyterHub to the k8s cluster. First, create a file named config.yaml with the following contents.

hub:
  db:
    type: sqlite-memory
  config:
    Authenticator:
      admin_users:
        - admin
    DummyAuthenticator:
      password: admin_password
    JupyterHub:
      authenticator_class: dummy
  authenticatePrometheus: false
proxy:
  chp:
    networkPolicy:
      egress:
        - ports:
            - port: 6116
            - port: 6117
            - port: 6118
singleuser:
  networkPolicy:
    allowedIngressPorts: [6116, 6117, 6118]
  storage:
    type: none
    extraVolumes:
      - name: vol-gv0
        persistentVolumeClaim:
          claimName: pvc-gv0
      - name: vol-gv1
        persistentVolumeClaim:
          claimName: pvc-gv1
      - name: shm-volume
        emptyDir:
          medium: Memory
    extraVolumeMounts:
      - name: vol-gv0
        mountPath: "/home/jovyan/gv0"
      - name: vol-gv1
        mountPath: "/home/jovyan/gv1"
      - name: shm-volume
        mountPath: /dev/shm
  defaultUrl: "/lab"
  profileList:
    - display_name: "TF 2.5 - 0 GPUs"
      kubespawner_override:
        image: [repo]/tensorflow-notebook-cuda:1.0.x
        extra_resource_limits:
          nvidia.com/gpu: "0"
      default: true
    - display_name: "TF 2.5 - 1 GPUs"
      kubespawner_override:
        image: [repo]/tensorflow-notebook-cuda:1.0.x
        extra_resource_limits:
          nvidia.com/gpu: "1"
    - display_name: "TF 2.5 - 4 GPUs"
      kubespawner_override:
        image: [repo]/tensorflow-notebook-cuda:1.0.x
        extra_resource_limits:
          nvidia.com/gpu: "4"
Code language: YAML (yaml)

Some of the contents of this file relate to Gluster storage and Tensorboard support and are explained further in later sections. Next, apply the values in config.yaml to the cluster using Helm with the following command.

helm upgrade --install [release-name] jupyterhub/jupyterhub --namespace jhub --version=1.0.0 --values config.yaml

Finally, make the service visible outside the cluster by running the following command.

kubectl patch svc proxy-public -n jhub -p '{"spec": {"type": "LoadBalancer", "externalIPs":["ip address here"]}}'
Code language: Bash (bash)
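
To verify the deployment, something like the following should show the hub and proxy pods Running and an EXTERNAL-IP on the proxy-public service; JupyterHub should then be reachable at that address.

kubectl get pods -n jhub
kubectl get svc proxy-public -n jhub
Code language: Bash (bash)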

Join nodes to the k8s cluster

On each secondary node, run the join command that was printed when the cluster was initialized. It will look something like this (your IP address, token, and hash will differ).

sudo kubeadm join 192.168.1.102:6443 --token 5ytfgr.oc4pyr54vrgj5b7g \
  --discovery-token-ca-cert-hash sha256:d6f0fd687b14b8506a1fbc63c3ee63ec8f4e81a5c93015245135d3707f433a23
# https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#join-nodes
Code language: Bash (bash)
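
If the original token has expired, a fresh join command can be generated on the primary node; afterwards the new node should appear and eventually report Ready.

# run on the primary node
sudo kubeadm token create --print-join-command
kubectl get nodes
Code language: Bash (bash)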

Create GlusterFS Storage

First, create or modify your hard drive partitions to match the desired configuration. I find this link to be particularly useful.

You'll also want the partitions to be mounted the same way after a reboot. This can be accomplished by adding entries to the /etc/fstab file. First, find the UUID of each partition with the following commands:

sudo lsblk -f       # show UUID for each partition
sudo lshw -C disk   # show additional info about each disk
df -h               # show info about mounted filesystems
Code language: Bash (bash)

An updated /etc/fstab file will look something like this:

# <file system> <mount point> <type> <options> <dump> <pass>
UUID=cc30d1ac-5c3a-41cd-8f97-0bfd648ebb07 / ext4 errors=remount-ro 0 1
UUID=52A3-2923 /boot/efi vfat umask=0077 0 1
# /swapfile none swap sw 0 0
UUID=28b63ddb-75c3-4f18-91d7-b15944ac07a1 /data/glusterfs/gv0/brick1 ext4 defaults 1 2
UUID=7099fa95-068d-4ad2-adb0-8aa8bd39017a /data/glusterfs/gv1/brick1 ext4 defaults 1 2
localhost:/mb1gv1 /mnt/mb1gv1 glusterfs defaults,_netdev 0 0
localhost:/mb2gv1 /mnt/mb2gv1 glusterfs defaults,_netdev 0 0
Code language: plaintext (plaintext)
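
As a sketch, the mount points can be created and tested without a reboot as follows; the paths match the example fstab above, and the glusterfs client mounts will only succeed once the volumes exist.

sudo mkdir -p /data/glusterfs/gv0/brick1 /data/glusterfs/gv1/brick1 /mnt/mb1gv1 /mnt/mb2gv1
# mount everything listed in /etc/fstab and surface any errors now rather than at reboot
sudo mount -a
Code language: Bash (bash)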

Next, create relevant Gluster volumes. The following commands are the most helpful:

gluster peer probe hostname/ipaddress
gluster peer status
gluster volume create [vol name] hostname1:/brick1 hostname2:/brick2
gluster volume status
gluster volume info
Code language: Bash (bash)

Before adding peers to the Gluster cluster, modify the /etc/hosts file to map hostnames to the DHCP-reserved IP addresses of all nodes in the cluster. I've found that this helps reduce unexplained issues when the servers restart and attempt to rejoin. Then, when adding nodes to the Gluster cluster with peer probe, use the hostname instead of the machine's IP address.
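
Putting this together, a minimal sketch of creating a two-node distributed volume is shown below; the hostnames node1/node2, the brick subdirectory, and the volume name gv0 are assumptions you should replace with your own values.

# on every storage node
sudo apt install -y glusterfs-server
sudo systemctl enable --now glusterd
sudo mkdir -p /data/glusterfs/gv0/brick1/brick

# from one node, using hostnames defined in /etc/hosts
sudo gluster peer probe node2
sudo gluster volume create gv0 node1:/data/glusterfs/gv0/brick1/brick node2:/data/glusterfs/gv0/brick1/brick
sudo gluster volume start gv0
sudo gluster volume info gv0
Code language: Bash (bash)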

Add GlusterFS Storage to k8s cluster

First, create Endpoints and Service YAML files that describe the Gluster storage.

apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-cluster
  namespace: jhub
subsets:
  - addresses:
      - ip: [ip address of gluster peer]
    ports:
      - port: 1
Code language: YAML (yaml)
apiVersion: v1
kind: Service
metadata:
  name: glusterfs-cluster
  namespace: jhub
spec:
  ports:
    - port: 1
Code language: YAML (yaml)

Apply these to the k8s cluster by running the following command for each of the above files.

kubectl create -f file_name.yaml
Code language: Bash (bash)

Next, you will create a persistent volume and a persistent volume claim for each of your gluster volumes. These files look something like this.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-name
  namespace: jhub
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
Code language: YAML (yaml)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-name
  namespace: jhub
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  claimRef:
    namespace: jhub
    name: pvc-name
  glusterfs:
    endpoints: glusterfs-cluster
    path: gv0
    readOnly: false
Code language: YAML (yaml)

Each of these files is applied in the same way as before.

kubectl create -f file_name.yaml
Code language: Bash (bash)
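
To verify the storage wiring, something like the following should show the endpoints and a PVC with STATUS Bound to its persistent volume.

kubectl get endpoints glusterfs-cluster -n jhub
kubectl get pv
kubectl get pvc -n jhub
Code language: Bash (bash)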

Misc

Create JupyterHub usage dashboard

  • Request the jupyterhub-dashboard web app from Ben
  • Create a Python virtual environment that has Flask installed (pip install Flask)
  • Execute python dashboard-app.py from the virtual environment
  • Optionally, launch the dashboard on reboot using crontab

Create service to update the proxy table for enabling Tensorboard

The proxy auth token can be read out of the hub secret with the following command.

kubectl get secret hub -n jhub -o=json | jq -r '.["data"] | .["hub.config.ConfigurableHTTPProxy.auth_token"]' | base64 --decode
Code language: Bash (bash)

Reset server

To reset the server, run

sudo kubeadm reset
Code language: Bash (bash)

Note that this gets more complicated after you have added more nodes to the cluster. See https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#tear-down for details.

Reference

https://zero-to-jupyterhub.readthedocs.io/en/latest/