Description
This post is a detailed set of instructions for setting up a Kubernetes cluster on an on-premise group of servers or workstations. JupyterHub is then deployed to the k8s cluster and customized for machine learning training.
Requirements
- Each node (computer) should have Ubuntu installed. I used Ubuntu 20.04.
- Root access (sudo)
- Nvidia GPU drivers
Contents
All nodes
Primary node
- Create k8s cluster
- Calico network policy
- Install helm
- Install Nvidia device plugin
- Docker image for JupyterHub
- Install JupyterHub
Secondary nodes
Shared file system
Appendix
- Reset server
- Reference
Install Docker ↟
First, follow the instructions found in the Docker documentation. I prefer the install using the repository approach. The remaining instructions for GPU and non-GPU environments are provided below.
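For reference, the repository-based install looked roughly like the following on Ubuntu 20.04 at the time of writing. The keyring path and repository line change over time, so treat this as a sketch and defer to the official Docker documentation.
sudo apt update
sudo apt install -y ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io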
Non-GPU Environment
Replace the contents of the /etc/docker/daemon.json file with the following code block. This configures Docker to use the systemd cgroup driver. Then skip to Final steps.
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2"
}
GPU Environment
Run the following commands to install the Nvidia Docker package.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update
sudo apt install -y nvidia-docker2
Replace the contents of the /etc/docker/daemon.json file with the following code block. This configures Docker to use the systemd cgroup driver and the Nvidia container runtime.
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2",
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Final steps
After updating the daemon.json file, run the following commands to apply the changes.
sudo systemctl enable docker
sudo systemctl daemon-reload
sudo systemctl restart docker
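To sanity-check the Docker configuration, confirm that the systemd cgroup driver (and, on GPU nodes, the nvidia default runtime) took effect. The grep patterns and the CUDA image tag below are only examples.
docker info | grep -i "cgroup driver"
docker info | grep -i "default runtime"
# GPU nodes only: run nvidia-smi inside a container as an end-to-end check
sudo docker run --rm --gpus all nvidia/cuda:11.2.2-base-ubuntu20.04 nvidia-smi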
Install Kubernetes ↟
Follow the official Kubernetes installation instructions to install kubeadm, kubelet, and kubectl on every node. Note: you have already configured the cgroup driver in the previous step.
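At the time of writing, installing the Kubernetes tools looked roughly like the following; the apt repository has since moved, so treat this as a sketch and follow the current official instructions.
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl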
Next, disable swap by commenting out (add # to the beginning of) the swap line in the /etc/fstab file. Finally, restart the server or run the command
sudo swapoff -a
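If you prefer not to edit /etc/fstab by hand, a sed one-liner such as the following comments out any swap entry; back the file up first, since the pattern is only a sketch.
sudo cp /etc/fstab /etc/fstab.bak
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab   # comment out lines containing a swap mount
sudo swapoff -a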
Create Kubernetes Cluster ↟
Create and initialize a k8s cluster by running
sudo kubeadm init --pod-network-cidr=[your cidr]
I used 192.168.0.0/16 as my CIDR. A join command will be printed to the console; save it for a later step to join secondary nodes to the cluster. Next, run the following commands to point kubectl to the newly created cluster.
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf \
$HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Lastly, create a namespace called jhub in the cluster by running
kubectl create namespace jhub
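At this point a quick sanity check is useful. The control-plane node will report NotReady until a pod network (Calico, next section) is installed.
kubectl get nodes
kubectl get pods -n kube-system
kubectl get namespaces   # jhub should appear in the list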
Calico network policy ↟
To apply the Calico network policy to your cluster, run the following commands.
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml
Next, wait until all of the pods reach a STATUS of Running, which can be verified with the command
watch kubectl get pods -n calico-system
Lastly, to allow scheduling pods on the primary node, run the command
kubectl taint nodes --all node-role.kubernetes.io/master-
See the Calico quickstart for more information.
Install Helm ↟
Run the following command to install helm
curl https://raw.githubusercontent.com/helm/helm/HEAD/scripts/get-helm-3 | bash
Add relevant repositories to helm by running
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
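You can confirm the repositories were added and that the charts are visible; the chart versions listed will vary.
helm repo list
helm search repo jupyterhub/jupyterhub
helm search repo nvdp/nvidia-device-plugin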
Install Nvidia Device Plugin ↟
The following commands install the Nvidia device plugin. This is only needed in a GPU environment.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.9.0/nvidia-device-plugin.yml
helm install --version=0.9.0 --generate-name nvdp/nvidia-device-plugin
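Once the device plugin pods are running, the GPUs should show up in each node's allocatable resources. One way to check (the jsonpath expression is just a sketch):
kubectl get pods -A | grep -i nvidia
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'; echo
kubectl describe nodes | grep -i 'nvidia.com/gpu'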
Docker Image for JupyterHub ↟
These instructions create a Docker image based on Jupyter Docker Stacks. Nvidia CUDA support is added through a modified base container. A final custom layer is added at the end to modify Ubuntu and Python packages.
First, build the foundation images by running the following (replacing [repo] with your own custom name):
docker build -t [repo]/base-notebook-cuda --build-arg BASE_CONTAINER=nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 --build-arg PYTHON_VERSION=3.8.10 https://github.com/jupyter/docker-stacks.git#:/base-notebook
docker build -t [repo]/minimal-notebook-cuda --build-arg BASE_CONTAINER=[repo]/base-notebook-cuda https://github.com/jupyter/docker-stacks.git#:/minimal-notebook
docker build -t [repo]/scipy-notebook-cuda --build-arg BASE_CONTAINER=[repo]/minimal-notebook-cuda https://github.com/jupyter/docker-stacks.git#:/scipy-notebook
In the above commands you can specify the version of CUDA / cuDNN as well as the version of Python by modifying the corresponding arguments.
Next, copy the following into a file named Dockerfile. This is the final layer that allows you to add Ubuntu and Python packages.
# ----------------------------------------------
# Define base container
ARG BASE_CONTAINER=[repo]/scipy-notebook-cuda
FROM $BASE_CONTAINER
LABEL maintainer="Your Name <your@name.com>"
USER root
# ----------------------------------------------
# Install apt packages (root user)
RUN apt-get update && \
apt-get install --yes --no-install-recommends htop nmon strace npm nodejs build-essential && \
rm -rf /var/lib/apt/lists/*
# ----------------------------------------------
# Required setup for adding Tensorboard support
# (root user)
RUN mkdir -p /etc/tb/
RUN fix-permissions /etc/tb/
COPY tensorboard-proxy /usr/local/bin/
RUN chmod 755 /usr/local/bin/tensorboard-proxy
USER ${NB_UID}
# ----------------------------------------------
# Install necessary pip packages (regular user)
ARG TF_VERSION=2.5
# Install pip packages
RUN pip install tensorflow==${TF_VERSION} \
tensorflow-probability \
tensorflow-io \
tensorflow-hub \
tensorflow-addons \
imageio-ffmpeg \
pycocotools \
opencv-contrib-python-headless
# ----------------------------------------------
# Additional setup for Tensorboard support
# (regular user)
WORKDIR /etc/tb/
RUN npm install http-proxy --save
COPY tensorboard-proxy.js /etc/tb/
WORKDIR "${HOME}
Finally, build the Docker image and optionally push it to your Docker Hub repository.
docker build -t [repo]/tensorflow-notebook-cuda:1.0.x .
docker push [repo]/tensorflow-notebook-cuda:1.0.x
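Before wiring the image into JupyterHub, it can be worth a quick local smoke test. The port and entrypoint behavior below are the docker-stacks defaults and may differ if you customized the image; [repo] is the same placeholder as above.
docker run --rm -it -p 8888:8888 [repo]/tensorflow-notebook-cuda:1.0.x
# On a GPU node, confirm TensorFlow sees the GPUs:
docker run --rm --gpus all [repo]/tensorflow-notebook-cuda:1.0.x \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"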
Install JupyterHub ↟
Now we deploy JupyterHub to the k8s cluster. First, create a file named config.yaml with the following contents.
hub:
db:
type: sqlite-memory
config:
Authenticator:
admin_users:
- admin
DummyAuthenticator:
password: admin_password
JupyterHub:
authenticator_class: dummy
authenticatePrometheus: false
proxy:
chp:
networkPolicy:
egress:
- ports:
- port: 6116
- port: 6117
- port: 6118
singleuser:
networkPolicy:
allowedIngressPorts: [6116,6117,6118]
storage:
type: none
extraVolumes:
- name: vol-gv0
persistentVolumeClaim:
claimName: pvc-gv0
- name: vol-gv1
persistentVolumeClaim:
claimName: pvc-gv1
- name: shm-volume
emptyDir:
medium: Memory
extraVolumeMounts:
- name: vol-gv0
mountPath: "/home/jovyan/gv0"
- name: vol-gv1
mountPath: "/home/jovyan/gv1"
- name: shm-volume
mountPath: /dev/shm
defaultUrl: "/lab"
profileList:
- display_name: "TF 2.5 - 0 GPUs"
kubespawner_override:
image: [repo]/tensorflow-notebook-cuda:1.0.x
extra_resource_limits:
nvidia.com/gpu: "0"
default: true
- display_name: "TF 2.5 - 1 GPUs"
kubespawner_override:
image: [repo]/tensorflow-notebook-cuda:1.0.x
extra_resource_limits:
nvidia.com/gpu: "1"
- display_name: "TF 2.5 - 4 GPUs"
kubespawner_override:
image: [repo]/tensorflow-notebook-cuda:1.0.x
extra_resource_limits:
nvidia.com/gpu: "4"
Some of the contents of this file, related to Gluster storage and Tensorboard support, are explained further in later sections. Next, apply the values in config.yaml to the cluster using helm with the following command.
helm upgrade --install [release-name] jupyterhub/jupyterhub --namespace jhub --version=1.0.0 --values config.yaml
Finally, make the service visible outside the cluster by running the following command.
kubectl patch svc proxy-public -n jhub -p '{"spec": {"type": "LoadBalancer", "externalIPs":["ip address here"]}}'
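A few commands to confirm the deployment came up; pod names will differ, and the proxy-public service should show the external IP you patched in.
kubectl get pods -n jhub
kubectl get svc proxy-public -n jhub
# JupyterHub should now be reachable at http://[ip address here] in a browser.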
Join nodes to the k8s cluster ↟
On each secondary node, run the join command that was printed when the cluster was initialized. It will look something like this:
sudo kubeadm join 192.168.1.102:6443 --token 5ytfgr.oc4pyr54vrgj5b7g --discovery-token-ca-cert-hash sha256:d6f0fd687b14b8506a1fbc63c3ee63ec8f4e81a5c93015245135d3707f433a23
See https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#join-nodes for more information.
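Back on the primary node, confirm the new nodes registered. If the original token has expired, kubeadm can print a fresh join command.
kubectl get nodes -o wide
sudo kubeadm token create --print-join-command   # run on the primary node to regenerate the join command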
Create GlusterFS Storage ↟
First, create or modify your hard drive partitions to match the desired configuration. I find this link to be particularly useful.
You’ll also want to make sure the partitions mount the same way after a reboot. This can be accomplished by adding entries to the /etc/fstab file. First, find the UUID of each drive with the following commands:
sudo lsblk -f # show UUID for each partition
sudo lshw -C disk # show additional info about partition
df -h # show info about mounted filesystems
An updated /etc/fstab file will look something like this:
# <file system> <mount point> <type> <options> <dump> <pass>
UUID=cc30d1ac-5c3a-41cd-8f97-0bfd648ebb07 / ext4 errors=remount-ro 0 1
UUID=52A3-2923 /boot/efi vfat umask=0077 0 1
# /swapfile none swap sw 0 0
UUID=28b63ddb-75c3-4f18-91d7-b15944ac07a1 /data/glusterfs/gv0/brick1 ext4 defaults 1 2
UUID=7099fa95-068d-4ad2-adb0-8aa8bd39017a /data/glusterfs/gv1/brick1 ext4 defaults 1 2
localhost:/mb1gv1 /mnt/mb1gv1 glusterfs defaults,_netdev 0 0
localhost:/mb2gv1 /mnt/mb2gv1 glusterfs defaults,_netdev 0 0
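Assuming the brick mount points in the example above, create the directories and mount everything to verify the new entries. The paths are placeholders taken from the example, and the glusterfs lines will only mount once the volumes exist (created in the next step).
sudo mkdir -p /data/glusterfs/gv0/brick1 /data/glusterfs/gv1/brick1
sudo mount -a
df -h | grep -E 'glusterfs|brick'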
Next, create relevant Gluster volumes. The following commands are the most helpful:
gluster peer probe hostname/ipaddress
gluster peer status
gluster volume create [vol name] hostname1:/brick1 hostname2:/brick2
gluster volume status
gluster volume info
Before adding peers to the Gluster cluster, modify the /etc/hosts file to map hostnames to the DHCP-reserved IP addresses of all nodes in the cluster. I’ve found that this helps reduce unexplained issues when the servers restart and attempt to rejoin. Then, when adding nodes to the Gluster cluster with peer probe, use the hostname instead of the IP address.
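As a concrete sketch, with node1 and node2 as placeholder hostnames and assuming glusterfs-server is installed and running on each node (a subdirectory of the mounted brick is used so Gluster never writes to an unmounted path):
sudo gluster peer probe node2
sudo gluster peer status
sudo gluster volume create gv0 node1:/data/glusterfs/gv0/brick1/brick node2:/data/glusterfs/gv0/brick1/brick
sudo gluster volume start gv0
sudo gluster volume info gv0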
Add GlusterFS Storage to k8s cluster ↟
First, create Endpoints and Service YAML files that describe the Gluster storage.
apiVersion: v1
kind: Endpoints
metadata:
name: glusterfs-cluster
namespace: jhub
subsets:
- addresses:
- ip: [ip address of gluster peer]
ports:
- port: 1
apiVersion: v1
kind: Service
metadata:
name: glusterfs-cluster
namespace: jhub
spec:
ports:
- port: 1
Add changes to the k8s cluster by running the following command for each of the above files.
kubectl create -f file_name.yaml
Next, you will create a persistent volume and a persistent volume claim for each of your gluster volumes. These files look something like this.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: pvc-name
namespace: jhub
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 10Gi
apiVersion: v1
kind: PersistentVolume
metadata:
name: pv-name
namespace: jhub
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteMany
claimRef:
namespace: jhub
name: pvc-name
glusterfs:
endpoints: glusterfs-cluster
path: gv0
readOnly: false
Each of these files is applied in the same way as before.
kubectl create -f file_name.yaml
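To confirm the claims bound correctly before (re)installing JupyterHub with the storage configuration, check that each PVC reports a STATUS of Bound. The names here match the earlier examples.
kubectl get pv
kubectl get pvc -n jhub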
Misc ↟
Create a JupyterHub usage dashboard:
- Request the jupyterhub-dashboard web app from Ben
- Create a Python virtual environment that has Flask installed (pip install Flask)
- Execute python dashboard-app.py from the virtual environment
- Optionally launch the dashboard on reboot using crontab
Create a service that updates the proxy table for enabling Tensorboard. The proxy auth token can be retrieved with the following command.
kubectl get secret hub -n jhub -o=json | jq -r '.["data"] | .["hub.config.ConfigurableHTTPProxy.auth_token"]' | base64 --decode
Appendix: Reset server ↟
To reset a node and remove its cluster configuration, run
sudo kubeadm reset
Note that this gets more complicated after you've added more nodes to the cluster. See https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#tear-down for details.
Reference ↟
- https://zero-to-jupyterhub.readthedocs.io/en/latest/