Commit 68360ded authored by Ulrich Kerzel

dockerfile for PyTorch

parent 99fea40b
##
## general setup
## (this is done as root)
##
#
# start with the official (?) pytorch devel container
# https://hub.docker.com/r/pytorch/pytorch
# it seems to be released by the PyTorch team although there is no documentation
# alternatively, we could start from the NVidia one
# https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
#
FROM pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
# say who maintains this image
LABEL maintainer="Ulrich Kerzel <ulrich.kerzel@rwth-aachen.de>"
ARG DEBIAN_FRONTEND=noninteractive
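#
# refresh the package index and install any systemwide software that is
# missing from the base image
# both steps share one RUN instruction so that the install step never runs
# against a stale, cached package index
# (we do not upgrade the base system for now; that would be: apt-get -y upgrade)
#
RUN apt-get update && \
    apt-get -y install curl && \
    rm -rf /var/lib/apt/lists/*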
##
## Install python packages
##
## The PyTorch container uses conda as package manager
## https://docs.conda.io/en/latest/ or https://github.com/conda/conda
##
## -y answers the confirmation prompt, which cannot be answered during a build
RUN conda install -y -c conda-forge \
    matplotlib=3.6.2 \
    numba=0.56.4 \
    numpy=1.23.5 \
    pandas=1.5.2 \
    scikit-learn=1.2.0 \
    scikit-image=0.19.3 \
    scipy=1.9.3 \
    seaborn=0.12.2 \
    shap=0.41.0 \
    lime=0.2.0.1 \
    networkx=3.0
##
## create a local user so we don't have to run as root
##
RUN addgroup --gid 1000 aiguru && \
    adduser --uid 1000 --ingroup aiguru --home /home/aiguru --shell /bin/bash --disabled-password --gecos "" aiguru
#
# One of the challenges in using docker is that the user and group IDs inside the
# container do not match the ones on the host system
# If we then use, e.g., a bind mount to exchange files, they will have the wrong permissions
#
# We can pass the option --user=`id -u`:`id -g` to "docker run", that will fix the permissions
# However, if our user ID and/or group ID on the host system do not match the setup inside
# the container, the user name cannot be resolved and the shell shows "I have no name!" instead.
# Depending on what we want to do, this is only a cosmetic problem. However, if we rely on an existing
# and matching user name, it may become an issue.
# If we do not mount the host file systems, we won't have the issue anyway.
#
# The tool "fixuid" helps with this issue
# https://github.com/boxboat/fixuid
#
# However, this is only meant for development, not production use, so we comment this out here.
#
#RUN USER=docker && \
# GROUP=docker && \
# curl -SsL https://github.com/boxboat/fixuid/releases/download/v0.5.1/fixuid-0.5.1-linux-amd64.tar.gz | tar -C /usr/local/bin -xzf - && \
# chown root:root /usr/local/bin/fixuid && \
# chmod 4755 /usr/local/bin/fixuid && \
# mkdir -p /etc/fixuid && \
# printf "user: $USER\ngroup: $GROUP\n" > /etc/fixuid/config.yml
# switch to the non-root user
USER aiguru:aiguru
WORKDIR /home/aiguru
#if we use the fixuid tool for development
#ENTRYPOINT ["fixuid"]
#
# create a local directory into which we can mount a local filesystem if needed via bind mount
# see https://docs.docker.com/storage/bind-mounts/
# the syntax is --mount type=bind,source=<source dir>,target=/home/aiguru/bindmount
#
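# (a separate "RUN cd ..." would have no effect on later instructions,
#  WORKDIR above already sets the working directory)
RUN mkdir -p /home/aiguru/bindmount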
# Docker
One of the main (more technical) challenges in machine learning is that the libraries we use evolve rapidly.
This often leads to a situation where we require a specific setup to make sure that everything works together. Since the libraries evolve at different paces and are developed and maintained by diverse, unrelated groups, we cannot generally assume that all versions of all packages work together.
This can be resolved by creating a controlled virtual environment using, for example, pip, poetry, or conda.
However, machine learning libraries such as PyTorch also rely on specific versions of the underlying drivers for the graphics card (GPU) that is used to accelerate the training process.
While we can control this setup on our own system, it becomes more difficult once we move to shared resources such as a cluster, or when we need to run different versions side by side.
Containers go one step further than virtual environments: we can also control which operating system (e.g. Ubuntu Linux) we use, in which version, and with which system libraries.
[Docker](https://www.docker.com/) is a popular container software. It is well supported on Linux, Windows, and macOS, and it can also serve as a starting point for running on large clusters such as the HPC cluster at the ITC at RWTH Aachen University.
The cluster systems use a different container mechanism called [Apptainer](https://apptainer.org/) (formerly known as Singularity). However, one of the recommended ways to build a container image for HPC use is to start from a Docker container.
The Docker image is built from a file called "Dockerfile". In this example, we start from the official PyTorch image and add further machine learning libraries.
# Using Docker
## Building images
Before we can use the Docker container, we need to build the image. The general syntax is:

    docker build -t TAG DIR

where TAG is the tag by which we identify the Docker image, and DIR is the directory that contains a file called "Dockerfile" that is used to build the image.
Example:

    docker build -t pytorchdatascience:v1.0 PyTorch
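
A common stumbling block is calling "docker build" from the wrong directory, so that DIR does not contain a Dockerfile. The following is a minimal sketch of a check for this; the function name `build_command` is hypothetical and not part of the repository. It only prints the build command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): check that DIR contains
# a Dockerfile and print the matching "docker build" command.
build_command() {
    tag="$1"   # e.g. pytorchdatascience:v1.0
    dir="$2"   # e.g. PyTorch
    if [ ! -f "${dir}/Dockerfile" ]; then
        echo "error: ${dir}/Dockerfile not found" >&2
        return 1
    fi
    # Printed only; remove "echo" to actually start the build.
    echo docker build -t "${tag}" "${dir}"
}

build_command pytorchdatascience:v1.0 PyTorch ||
    echo "run this from the directory that contains PyTorch/"
```

After a real build, "docker images" should list the new image under the chosen tag.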
## Running Docker images
To run a container from the image, we use "docker run".
We can either run in interactive mode or, if the Dockerfile ends with a "CMD" instruction, have this command executed. Which variant we use depends on the use case at hand.
To start an interactive session, we use the parameters "-it".
Hence, the simplest way is to call "docker run -it TAG" where TAG is the tag we specified when building the image. This gives us an interactive shell inside the container, which is a bit like logging into a remote machine.
We can also pass a script or program that should be executed as a further parameter.
### Accessing host files
By default, the container is isolated from the host system, i.e. we do not have access to the local filesystem of the host (the machine we call "docker run" from). In many cases, this is a good thing, as it prevents programs inside the container from accidentally reading or changing files on the host.
However, in many scenarios we may want to access files on the host system, for example data files or program files.
We can do this using, for example, so-called "bind mounts" that make a directory on the host system available in the container. The general syntax is:

    --mount type=bind,source=SRC_DIR,target=TARGET_DIR

where SRC_DIR is the directory on the host system that we want to make available, and TARGET_DIR is the path inside the container under which this directory appears.
In the PyTorch example, we have created a local (non-root) user ("aiguru") inside the container image and want to make the current working directory of the host available in the directory "bindmount" in this user's home directory:

    --mount type=bind,source="${PWD}",target=/home/aiguru/bindmount
However, we need to note that the user "aiguru" is local to the container and, in general, not known to the host system. In order to avoid permission issues with the files we create inside the container, we run the container with the (Linux) user and group ID that we have on the host system. We do this with the following parameter:

    --user=`id -u`:`id -g`

However, the user ID we pass in this way often has no entry in the container's user database, so no user name exists for it inside the container (the shell then shows "I have no name!"). Unless our application requires a valid user name, this is largely a cosmetic problem for interactive use.
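
Whether the user name can be resolved can be checked before starting the container. The sketch below compares the host IDs with the values 1000:1000 that the Dockerfile assigns to "aiguru"; the variable names are chosen for illustration only.

```shell
#!/bin/sh
# IDs given to the user and group "aiguru" in the Dockerfile.
CONTAINER_UID=1000
CONTAINER_GID=1000

# IDs of the user calling "docker run" on the host.
host_uid="$(id -u)"
host_gid="$(id -g)"
echo "docker run option: --user=${host_uid}:${host_gid}"

if [ "${host_uid}" = "${CONTAINER_UID}" ]; then
    echo "UID ${host_uid} is known inside the container as aiguru"
else
    echo "UID ${host_uid} has no name inside the container"
fi

if [ "${host_gid}" = "${CONTAINER_GID}" ]; then
    echo "GID ${host_gid} is known inside the container as aiguru"
else
    echo "GID ${host_gid} has no name inside the container"
fi
```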
The full command to run the Docker image with a bind mount is then:

    docker run -it --name PyTorchDS --user=`id -u`:`id -g` --mount type=bind,source="${PWD}",target=/home/aiguru/bindmount pytorchdatascience:v1.0 python bindmount/PyTorch_MNIST.py

assuming that we are on the host in the directory that contains the file PyTorch_MNIST.py that we want to execute. If we omit "python bindmount/PyTorch_MNIST.py", we enter an interactive shell instead.
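
Since the full command is long, it can be convenient to wrap it in a small shell function. The following is a sketch under a few assumptions: the function name `run_in_container` and the `DRY_RUN` switch are hypothetical, and "--rm" is used instead of "--name PyTorchDS" so that the container is removed on exit and repeated runs do not fail because the name is already taken. By default the command is only printed.

```shell
#!/bin/sh
# Hypothetical wrapper around the "docker run" command from the text.
IMAGE="pytorchdatascience:v1.0"     # tag from the build example
TARGET="/home/aiguru/bindmount"     # directory created in the Dockerfile

run_in_container() {
    # "$@" holds the command to execute inside the container.
    set -- docker run --rm \
        --user="$(id -u):$(id -g)" \
        --mount "type=bind,source=${PWD},target=${TARGET}" \
        "${IMAGE}" "$@"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"    # default: only show the command
    else
        "$@"         # DRY_RUN=0: actually start the container
    fi
}

run_in_container python bindmount/PyTorch_MNIST.py
```

Setting `DRY_RUN=0` before the call starts the container; for an interactive session, "-it" would have to be added to the options.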
## Useful Docker commands
### Images
    docker images       # list images on the system
    docker image rm     # remove one or more images
### Build
    docker build -t TAG DIR
    docker tag IMAGE_ID TAG
    docker push USERNAME/REPO     # push to a Docker registry, such as Docker Hub
### Processes
    docker ps -a                                      # list all containers (running and stopped)
    docker ps -a -f status=exited                     # list all exited containers
    docker rm $(docker ps -a -f status=exited -q)     # remove all exited containers
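
Note that the last one-liner ends with an error if there are no exited containers, because "docker rm" then receives no arguments. A minimal sketch of a guarded variant follows; the function name `remove_exited_containers` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: remove exited containers only if there are any.
remove_exited_containers() {
    # Collect the IDs of all exited containers (one per line).
    exited="$(docker ps -a -f status=exited -q)" || return 1
    if [ -z "${exited}" ]; then
        echo "no exited containers to remove"
        return 0
    fi
    # Intentionally unquoted: each ID becomes its own argument.
    docker rm ${exited}
}

# Only call the helper where the docker command is available.
if command -v docker >/dev/null 2>&1; then
    remove_exited_containers || echo "could not query docker"
fi
```

Docker also offers "docker container prune", which removes all stopped containers after a confirmation prompt.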
### Cleanup
    docker system prune -a     # remove all stopped containers, unused networks, unused images, and the build cache
### Run
    docker run -it --name NAME --user=`id -u`:`id -g` --mount type=bind,source=SRC_DIR,target=TARGET_DIR TAG SCRIPT

Omit "-it" for non-interactive mode.
### Misc
    docker exec -it <name> /bin/bash     # open an interactive shell in a running container