dockerfile for PyTorch

Commit 68360ded authored Jan 26, 2023 by Ulrich Kerzel
--- a/docker/PyTorch/Dockerfile
+++ b/docker/PyTorch/Dockerfile
+##
+## general setup 
+## (this is done as root)
+##
+
+#
+# start with the official (?) pytorch devel container
+# https://hub.docker.com/r/pytorch/pytorch
+# it seems to be released by the PyTorch team although there is no documentation
+# alternatively, we could start from the NVIDIA one
+# https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
+#
+FROM  pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
+
+# say who maintains this image
+LABEL maintainer="Ulrich Kerzel <ulrich.kerzel@rwth-aachen.de>"
+
+ARG DEBIAN_FRONTEND=noninteractive
+
+#
+# refresh the package index and install any systemwide software
+# that is missing from the base image
+# (both steps share one RUN so that a cached layer never holds a stale index)
+#
+# here we could also upgrade all packages, let's leave it for now:
+#   apt-get -y upgrade
+#
+RUN apt-get update && \
+    apt-get -y install curl
+
+
+
+
+
+##
+## Install python packages
+## 
+## The PyTorch container uses conda as package manager
+## https://docs.conda.io/en/latest/ or https://github.com/conda/conda
+##
+# -y confirms the installation without prompting, as a build cannot answer interactively
+RUN conda install -y -c conda-forge matplotlib=3.6.2 \
+                                   numba=0.56.4 \
+                                   numpy=1.23.5 \
+                                   pandas=1.5.2 \
+                                   scikit-learn=1.2.0 \
+                                   scikit-image=0.19.3 \
+                                   scipy=1.9.3 \
+                                   seaborn=0.12.2 \
+                                   shap=0.41.0 \
+                                   lime=0.2.0.1 \
+                                   networkx=3.0
+
+
+##
+## create a local user so we don't have to run as root
+##
+RUN addgroup --gid 1000 aiguru && \
+    adduser --uid 1000 --ingroup aiguru --home /home/aiguru --shell /bin/bash --disabled-password --gecos "" aiguru
+
+#
+# One of the challenges in using docker is that the user and group IDs inside the 
+# container do not match the ones on the host system
+# If we then use, e.g., a bind mount to exchange files, they will have the wrong permissions
+#
+# We can pass the option --user=`id -u`:`id -g` to "docker run", that will fix the permissions
+# However, if our user ID and/or group ID on the host system do not match the setup we have inside
+# the container, the shell shows "I have no name!" instead of a user name.
+# Depending on what we want to do, this is only a cosmetic problem. However, if we rely on an existing
+# and matching user name, it may become an issue.
+# If we do not mount the host file systems, we won't have the issue anyway.
+#
+# The tool "fixuid" helps with this issue
+# https://github.com/boxboat/fixuid
+#
+# However, this is only meant for development, not production use, so we comment this out here.
+#
+#RUN USER=aiguru && \
+#    GROUP=aiguru && \
+#    curl -SsL https://github.com/boxboat/fixuid/releases/download/v0.5.1/fixuid-0.5.1-linux-amd64.tar.gz | tar -C /usr/local/bin -xzf - && \
+#    chown root:root /usr/local/bin/fixuid && \
+#    chmod 4755 /usr/local/bin/fixuid && \
+#    mkdir -p /etc/fixuid && \
+#    printf "user: $USER\ngroup: $GROUP\n" > /etc/fixuid/config.yml
+
+
+
+# switch to the non-root user
+USER aiguru:aiguru
+WORKDIR /home/aiguru
+
+#if we use the fixuid tool for development
+#ENTRYPOINT ["fixuid"]
+
+#
+# create a local directory into which we can mount a local filesystem if needed via bind mount
+# see https://docs.docker.com/storage/bind-mounts/
+# the syntax is --mount type=bind,source=<source dir>,target=/home/aiguru/bindmount
+# 
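+# WORKDIR above already puts us into /home/aiguru; each RUN starts a new shell,
+# so a separate "RUN cd" would have no effect
+RUN mkdir -p /home/aiguru/bindmount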
--- a/docker/Readme.md
+++ b/docker/Readme.md
+# Docker
+
+One of the main (more technical) challenges in machine learning is that the libraries we use evolve rapidly.
+This often leads to a situation where we require a specific setup to make sure that everything works together. Since the various libraries evolve at different paces and are developed and maintained by diverse and unrelated groups, we cannot generally assume that all versions of all packages work together.
+
+This can be resolved by creating a controlled virtual environment using, for example, pip, poetry, or conda.
+However, machine learning libraries such as PyTorch also rely on specific versions of the underlying drivers for the graphics card (GPU) that is used to accelerate the training process.
+
+While we can control this setup on our own system, it becomes more difficult to do so once we move to shared resources such as a cluster, or need to run different versions.
+
+Containers go one step further than virtual environments: we can control which operating system (e.g. Ubuntu/Linux) we use, in which version, and with which specific libraries.
+[Docker](https://www.docker.com/) is a popular container software. It is well supported on Linux, Windows, and macOS machines, and it can also be used as a starting point for running on large clusters such as the HPC cluster at the ITC at RWTH Aachen University.
+The cluster systems use a different kind of container mechanism called [Apptainer](https://apptainer.org/) (formerly known as Singularity). However, one of the recommended ways to build a container image for HPC use is to start from a docker container.
+
+The docker container is created using a "Dockerfile". In this example, we start from the official PyTorch image and add further machine learning libraries.
+
+# Using Docker
+
+## Building images
+Before we can use the docker container, we need to build the image.
+The general syntax is:
+
+    docker build -t TAG DIR
+
+where TAG is a tag by which we identify the docker image, and DIR is the directory that contains a file called "Dockerfile" that is used to build the image.
+
+Example:
+
+    docker build -t pytorchdatascience:v1.0 PyTorch
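
As a hedged sketch, the build step could be wrapped in a small shell function that first checks whether DIR really contains a Dockerfile. The function name `build_image` is hypothetical and not part of this repository; only `docker build -t TAG DIR` is taken from the text above.

```shell
# Hypothetical helper (not part of the repository): build an image only
# if the given directory contains a file called "Dockerfile".
build_image() {
    tag="$1"   # tag for the image, e.g. pytorchdatascience:v1.0
    dir="$2"   # directory that should contain the Dockerfile

    if [ ! -f "${dir}/Dockerfile" ]; then
        echo "no Dockerfile found in ${dir}" >&2
        return 1
    fi

    docker build -t "${tag}" "${dir}"
}
```

Called from the docker directory as `build_image pytorchdatascience:v1.0 PyTorch`, this runs the same command as the example above.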
+
+## Running docker images
+To run the container/image, we use "docker run".
+We can either run in interactive mode or, if the Dockerfile ends with a "CMD" instruction, have this command executed. Which variant we use depends on the use case at hand.
+
+To start an interactive session, we use the parameters "-it".
+Hence, the simplest way is to call "docker run -it TAG" where TAG is the tag we specified when building the container/image. This will give us an interactive shell inside the container, which we can compare (a bit) to logging into a remote machine.
+We can also pass a script or program that should be executed as a further parameter.
+
+### Accessing host files
+By default, the container is fully isolated from the host system, i.e. we also do not have access to the local filesystem of the host (the machine we call "docker run" from). In many cases, this is a good thing as we do not want to risk that programs inside the container access files on the host.
+However, in many scenarios we may want to access files on the host system, for example data files or program files.
+
+We can do this using, for example, so-called "bind mounts" that make a directory on the host system available in the container. The general syntax is:
+
+    --mount type=bind,source=SRC_DIR,target=TARGET_DIR
+
+where SRC_DIR is the directory on the host system that we want to make available, and TARGET_DIR is the path inside the container at which we want to access this directory.
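
Docker expects the source of a bind mount given via `--mount` to be an absolute path. A minimal sketch, assuming a hypothetical helper `bind_mount_spec` that is not part of this repository, resolves a host directory to its absolute path and prints the matching value for `--mount`:

```shell
# Hypothetical helper: print the value for a "--mount" bind mount option.
# $1: directory on the host (may be relative)
# $2: target path inside the container
bind_mount_spec() {
    # resolve the host directory to an absolute path; fail if it does not exist
    src="$(cd "$1" 2>/dev/null && pwd)" || return 1
    printf '%s\n' "type=bind,source=${src},target=$2"
}
```

It could then be used as `docker run -it --mount "$(bind_mount_spec . /home/aiguru/bindmount)" TAG`. Directory names that contain commas would need extra care, which this sketch does not attempt.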
+
+In the PyTorch example, we have created a local (non-root) user ("aiguru") inside the container image and then want to make the current working directory available in a directory called "bindmount":
+
+    --mount type=bind,source="${PWD}",target=/home/aiguru/bindmount
+
+However, we need to note that the user "aiguru" is local to the container and, in general, not known to the host system. To avoid permission issues with the files we create inside the container, we run the container with the (Linux) user and group ID we have on the host system. We do this with the following parameter:
+
+    --user=`id -u`:`id -g`
+
+However, since the user IDs on the host and in the docker image do not necessarily match, we may encounter the situation that no username exists for this ID in the docker container. Unless we require this for our application, it is largely a cosmetic problem for interactive use.
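
To make the mapping concrete, the following sketch prints the value that the `--user` parameter expands to on the current host. It only uses the standard `id` utility; the variable names are illustrative, and `$(...)` is an equivalent spelling of the backticks. The value 1000 is the UID given to "aiguru" in the Dockerfile.

```shell
# numeric user and group ID of the current user on the host
host_uid="$(id -u)"
host_gid="$(id -g)"

# this is what --user=`id -u`:`id -g` expands to, e.g. --user=1000:1000
user_flag="--user=${host_uid}:${host_gid}"
echo "${user_flag}"

# the image defines aiguru with UID 1000; any other UID has no
# user name inside the container
if [ "${host_uid}" != "1000" ]; then
    echo "host UID ${host_uid} differs from aiguru (1000): no user name inside the container"
fi
```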
+
+The full command to run the docker image with a bind mount is then:
+
+    docker run -it --name PyTorchDS --user=`id -u`:`id -g` --mount type=bind,source="${PWD}",target=/home/aiguru/bindmount pytorchdatascience:v1.0 python bindmount/PyTorch_MNIST.py
+
+assuming that we are on the host in the local directory that contains the file PyTorch_MNIST.py that we want to execute. If we do not specify "python bindmount/PyTorch_MNIST.py", we enter an interactive shell.
+
+
+## Useful Docker commands
+
+### Images
+- `docker images`: list images on the system
+- `docker image rm`: remove one or more images
+
+### Build
+- `docker build -t TAG DIR`: build an image from the Dockerfile in DIR
+- `docker tag IMAGE_ID TAG`: add a tag to an existing image
+- `docker push USERNAME/REPO`: push to a docker repository, such as Docker Hub
+
+### Processes
+- `docker ps -a`: list all containers, including stopped ones
+- `docker ps -a -f status=exited`: list all exited containers
+- `docker rm $(docker ps -a -f status=exited -q)`: remove all exited containers
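
Note that `docker rm` reports an error when the inner command returns no IDs, i.e. when there are no exited containers. A minimal sketch of a guard, using a hypothetical function name `remove_exited_containers` that is not part of this repository:

```shell
# Hypothetical helper: remove exited containers, but only if there are any.
remove_exited_containers() {
    ids="$(docker ps -a -f status=exited -q)"

    if [ -z "${ids}" ]; then
        echo "no exited containers to remove"
        return 0
    fi

    # word splitting is intended here: each container ID becomes one argument
    docker rm ${ids}
}
```

Alternatively, `docker container prune` removes all stopped containers in a single command.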
+
+### Cleanup
+- `docker system prune -a`: remove all stopped containers, unused networks, all images not used by a container, and the build cache
+
+### Run
+    docker run -it --name NAME --user=`id -u`:`id -g` --mount type=bind,source=SRC_DIR,target=TARGET_DIR TAG SCRIPT
+
+Remove "-it" for non-interactive mode.
+
+### Misc
+- `docker exec -it NAME /bin/bash`: open an interactive shell in a running container
+
+
+
+