Commit cf7fe26e authored by Gonzalo Martin Garcia
parents 049ef862 821b786d
@@ -4,11 +4,11 @@
This is the repository for the *diffusion project* for the **(PR) Laboratory Deep Learning 23ss**
This repository houses our comprehensive pipeline, designed to conveniently train, sample from, and evaluate our unconditional diffusion model.
-The pipeline is initiated via the experiment_creator.ipynb notebook, which is run separately on our local machine. This notebook allows for the configuration of every aspect of the diffusion model, including all hyperparameters. These configurations extend to the underlying neural backbone UNet, as well as the training parameters, such as training from checkpoint, Weights & Biases run name for resumption, optimizer selection, adjustment of the learning rate for manual learning rate scheduling, and more. Moreover, it includes parameters for evaluating and sampling images via a trained diffusion model.
+The pipeline is initiated via the experiment_creator.ipynb notebook, which is run separately on our local machine. This notebook allows for the configuration of every aspect of the diffusion model, including all hyperparameters. These configurations extend to the underlying neural backbone UNet, as well as the training parameters, such as training from checkpoint, Weights & Biases run name for resumption, optimizer selection, adjustment of the CosineAnnealingLR learning rate schedule parameters, and more. Moreover, it includes parameters for evaluating and sampling images via a trained diffusion model.
Upon execution, the notebook generates individual JSON files, encapsulating all the hyperparameter information. When running the model on the HPC, we can choose between the operations 'train', 'sample', and 'evaluate'. These operations automatically extract the necessary hyperparameters from the JSON files and perform their respective tasks. This process is managed by the main.py file. The remaining files contain all the necessary functions optimized for HPC to perform the aforementioned tasks.
-Every uniquely trained diffusion model has its own experiment folder, given by its WANDB run name. It holds four different directories: settings, trained_ddpm, samples, and evaluations. The settings folder holds the JSON files specifying the diffusion model's configurations as well as the arguments for the training, sampling, and evaluation functions. The trained_ddpm folder contains .pth files storing the weights and biases of this experiment's diffusion model, saved at different epoch milestones during training. Upon resuming training, the pipeline automatically finds the highest-epoch model in trained_ddpm and continues training from there. When sampling images from these trained diffusion models, the samples are stored in directories named epoch_{i}, so we know which epoch-i version of the diffusion model generated them.
+Every uniquely trained diffusion model has its own experiment folder, given by its WANDB run name. It holds four different directories: settings, trained_ddpm, samples, and evaluations. The settings folder holds the JSON files specifying the diffusion model's configurations as well as the arguments for the training, sampling, and evaluation functions. The trained_ddpm folder contains .pth files storing the weights and biases of this experiment's diffusion model, saved at different epoch milestones during training. Upon resuming training, the pipeline takes the specified model in trained_ddpm and continues training from there. When sampling images from these trained diffusion models, the samples are stored in separate directories named epoch_{i}, one per milestone, so we know which epoch-i version of the diffusion model generated them.
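For orientation, an experiment folder then looks roughly like this (illustrative layout; exact file names depend on the run):

```
<WANDB run name>/
├── settings/        # JSON files with all hyperparameters and function arguments
├── trained_ddpm/    # model_epoch_{e}.pth checkpoints saved at epoch milestones
├── samples/
│   └── epoch_{i}/   # images sampled from the epoch-i checkpoint
└── evaluations/
```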
@@ -3,7 +3,7 @@ import torch
from torchvision import transforms
import re
-def ddpm_sampler(model, checkpoint, experiment_path, device, intermediate=False, batch_size=15):
+def ddpm_sampler(model, checkpoint, experiment_path, device, intermediate=False, batch_size=15, sample_all=False):
'''
Samples a tensor of 'batch_size' images from a trained diffusion model with 'checkpoint'. The generated
images are stored in the directory 'experiment_path/samples/epoch_{e}/sample_{j}', where e is the epoch
@@ -18,6 +18,13 @@ def ddpm_sampler(model, checkpoint, experiment_path, device, intermediate=False,
sample a single image, but store all the intermediate noised latents along the reverse chain
'''
+    if sample_all:
+        # Sample from every stored checkpoint of this experiment
+        # (requires `import os` at the top of the file).
+        f = f'{experiment_path}trained_ddpm/'
+        checkpoint_list = [c for c in os.listdir(f) if c.endswith(".pth")]
+        for checkpoint_i in checkpoint_list:
+            ddpm_sampler(model, checkpoint_i, experiment_path, device,
+                         intermediate=intermediate, batch_size=batch_size,
+                         sample_all=False)
+        return
# load model
try:
checkpoint_path = f'{experiment_path}trained_ddpm/{checkpoint}'
......
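The diff is truncated here. For readers unfamiliar with the pattern, restoring a model from one of these checkpoints typically looks like the following sketch (an assumption about the checkpoint layout, not the repository's exact code):

``` python
import torch

def load_ddpm_checkpoint(model, experiment_path, checkpoint, device):
    # Illustrative only: the actual .pth layout of this repo may differ.
    checkpoint_path = f'{experiment_path}trained_ddpm/{checkpoint}'
    state = torch.load(checkpoint_path, map_location=device)
    # Accept either a bare state dict or a dict wrapping one under 'model'.
    model.load_state_dict(state.get('model', state))
    model.eval()
    return model
```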
%% Cell type:code id: tags:
``` python
from trainer.train import *
from dataloader.load import *
from models.Framework import *
from models.all_unets import *
import torch
from torch import nn
```
%% Cell type:markdown id: tags:
# Prepare experiment
1. Choose Hyperparameter Settings
2. Run the notebook on the local machine to generate the experiment folder with the JSON files containing the settings
3. scp experiment folder to the HPC
4. Run the pipeline by adding one of the following to the batch file:
- Train Model: &emsp;&emsp;&emsp;&emsp;&emsp; `python main.py train "<absolute path of experiment folder in hpc>"`
- Sample Images: &emsp;&emsp;&emsp; `python main.py sample "<absolute path of experiment folder in hpc>"`
- Evaluate Model: &emsp;&emsp;&emsp; `python main.py evaluate "<absolute path of experiment folder in hpc>"`
%% Cell type:code id: tags:
``` python
import torch
####
# Settings
####
# Dataset path
datapath = "/work/lect0100/lhq_256"
# Experiment setup
-run_name = 'main_test0' # WANDB and experiment folder Name!
+run_name = 'main_test1' # WANDB and experiment folder Name!
checkpoint = None #'model_epoch_8.pth' # Name of checkpoint pth file or None
experiment_path = "/work/lect0100/main_experiment/" + run_name +'/'
# Path to save generated experiment folder on local machine
local_path ="experiments/" + run_name + '/settings'
# Diffusion Model Settings
-diffusion_steps = 500
+diffusion_steps = 1000
image_size = 128
channels = 3
# Training
batchsize = 32
-epochs = 20
-store_iter = 5
-eval_iter = 2
+epochs = 100
+store_iter = 10
+eval_iter = 500
learning_rate = 0.0001
optimizername = "torch.optim.AdamW"
optimizer_params = None
-verbose = True
+verbose = False
# checkpoint = None # (if not training from a checkpoint, i.e., random weights)
# Sampling
sample_size = 20
intermediate = False # True if you want to sample one image and all its intermediate latents
+sample_all = False
# Evaluating
...
###
# Advanced Settings Dictionaries
###
meta_setting = dict(modelname = "UNet_Res",
dataset = "UnconditionalDataset",
framework = "DDPM",
trainloop_function = "ddpm_trainer",
sampling_function = 'ddpm_sampler',
evaluation_function = 'ddpm_evaluator',
batchsize = batchsize
)
dataset_setting = dict(fpath = datapath,
img_size = image_size,
frac =0.8,
skip_first_n = 0,
ext = ".png",
transform=True
)
model_setting = dict( n_channels=64,
fctr = [1,2,4,4,8],
time_dim=256,
attention = True,
)
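# Illustrative note (an assumption, not verified against models/all_unets.py):
# if fctr lists per-stage channel multipliers, the UNet stage widths would be
# [n_channels * f for f in fctr] = [64, 128, 256, 256, 512].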
"""
outdated
model_setting = dict( channels_in=channels,
channels_out =channels ,
activation='relu', # activation function. Options: {'relu', 'leakyrelu', 'selu', 'gelu', 'silu'/'swish'}
weight_init='he', # weight initialization. Options: {'he', 'torch'}
projection_features=64, # number of image features after first convolution layer
time_dim=batchsize, # don't change!!!
time_channels=diffusion_steps, # number of time channels #TODO same as diffusion steps?
num_stages=4, # number of stages in contracting/expansive path
stage_list=None, # specify number of features produced by stages
num_blocks=1, # number of ConvResBlock in each contracting/expansive path
num_groupnorm_groups=32, # number of groups used in Group Normalization inside a ConvResBlock
dropout=0.1, # drop-out to be applied inside a ConvResBlock
attention_list=None, # specify MHA pattern across stages
num_attention_heads=1,
)
"""
framework_setting = dict(
diffusion_steps = diffusion_steps, # don't change!!
out_shape = (channels,image_size,image_size), # don't change!!
noise_schedule = 'linear',
beta_1 = 1e-4,
beta_T = 0.02,
alpha_bar_lower_bound = 0.9,
var_schedule = 'same',
kl_loss = 'simplified',
-recon_loss = 'nll',
+recon_loss = 'none',
)
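# Reference note (standard DDPM convention; assumed, not checked against this
# repo's Framework code): a 'linear' noise schedule corresponds to
# betas = torch.linspace(beta_1, beta_T, diffusion_steps).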
training_setting = dict(
epochs = epochs,
store_iter = store_iter,
eval_iter = eval_iter,
optimizer_class=optimizername,
optimizer_params = optimizer_params,
#optimizer_params=dict(lr=learning_rate), # don't change!
learning_rate = learning_rate,
run_name=run_name,
checkpoint= checkpoint,
experiment_path = experiment_path,
verbose = verbose,
-T_max = 0.8*90000/32*150, # CosineAnnealingLR period in optimizer steps: train split * dataset size / batch size * total epochs
+T_max = 0.8*90000/32*100, # CosineAnnealingLR period in optimizer steps: 0.8 * 90000 / 32 * 100 epochs = 225000
eta_min= 1e-10, # cosine lr param
)
sampling_setting = dict(
checkpoint = checkpoint,
experiment_path = experiment_path,
batch_size = sample_size,
-intermediate = intermediate
+intermediate = intermediate,
+sample_all = sample_all
)
# TODO
evaluation_setting = dict(
checkpoint = checkpoint,
experiment_path = experiment_path,
)
```
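%% Cell type:markdown id: tags:
For clarity, here is a minimal sketch of how `T_max` and `eta_min` above plug into PyTorch's `CosineAnnealingLR`. It assumes (not verified against the trainer) that the scheduler is stepped once per optimizer step, so `T_max` counts batches rather than epochs; the `nn.Linear` is just a stand-in for the UNet.
%% Cell type:code id: tags:
``` python
import torch
from torch import nn

net = nn.Linear(8, 8)  # stand-in model, for illustration only
opt = torch.optim.AdamW(net.parameters(), lr=1e-4)

steps_per_epoch = int(0.8 * 90000 / 32)  # train split * dataset size / batch size = 2250
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps_per_epoch * 100, eta_min=1e-10)

for _ in range(3):  # in the real training loop: once per batch
    opt.step()
    sched.step()
```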
%% Cell type:code id: tags:
``` python
import os
import json
f = local_path
if os.path.exists(f):
    print("path already exists, pick a new name!")
    print("break")
else:
    print("create folder")
    os.makedirs(f, exist_ok=True)
    print("folder created")
with open(f+"/meta_setting.json","w+") as fp:
    json.dump(meta_setting,fp)
with open(f+"/dataset_setting.json","w+") as fp:
    json.dump(dataset_setting,fp)
with open(f+"/model_setting.json","w+") as fp:
    json.dump(model_setting,fp)
with open(f+"/framework_setting.json","w+") as fp:
    json.dump(framework_setting,fp)
with open(f+"/training_setting.json","w+") as fp:
    json.dump(training_setting,fp)
with open(f+"/sampling_setting.json","w+") as fp:
    json.dump(sampling_setting,fp)
with open(f+"/evaluation_setting.json","w+") as fp:
    json.dump(evaluation_setting,fp)
print("stored json files in folder")
print(meta_setting)
print(dataset_setting)
print(model_setting)
print(framework_setting)
print(training_setting)
print(sampling_setting)
print(evaluation_setting)
```
%% Output
create folder
folder created
stored json files in folder
{'modelname': 'UNet_Res', 'dataset': 'UnconditionalDataset', 'framework': 'DDPM', 'trainloop_function': 'ddpm_trainer', 'sampling_function': 'ddpm_sampler', 'evaluation_function': 'ddpm_evaluator', 'batchsize': 32}
{'fpath': '/work/lect0100/lhq_256', 'img_size': 128, 'frac': 0.8, 'skip_first_n': 0, 'ext': '.png', 'transform': True}
{'n_channels': 64, 'fctr': [1, 2, 4, 4, 8], 'time_dim': 256, 'attention': True}
-{'diffusion_steps': 500, 'out_shape': (3, 128, 128), 'noise_schedule': 'linear', 'beta_1': 0.0001, 'beta_T': 0.02, 'alpha_bar_lower_bound': 0.9, 'var_schedule': 'same', 'kl_loss': 'simplified', 'recon_loss': 'nll'}
-{'epochs': 20, 'store_iter': 5, 'eval_iter': 2, 'optimizer_class': 'torch.optim.AdamW', 'optimizer_params': None, 'learning_rate': 0.0001, 'run_name': 'main_test0', 'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test0/', 'verbose': True, 'T_max': 337500.0, 'eta_min': 1e-10}
-{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test0/', 'batch_size': 20, 'intermediate': False}
-{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test0/'}
+{'diffusion_steps': 1000, 'out_shape': (3, 128, 128), 'noise_schedule': 'linear', 'beta_1': 0.0001, 'beta_T': 0.02, 'alpha_bar_lower_bound': 0.9, 'var_schedule': 'same', 'kl_loss': 'simplified', 'recon_loss': 'none'}
+{'epochs': 100, 'store_iter': 10, 'eval_iter': 500, 'optimizer_class': 'torch.optim.AdamW', 'optimizer_params': None, 'learning_rate': 0.0001, 'run_name': 'main_test1', 'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test1/', 'verbose': False, 'T_max': 225000.0, 'eta_min': 1e-10}
+{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test1/', 'batch_size': 20, 'intermediate': False}
+{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test1/'}
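%% Cell type:markdown id: tags:
On the HPC side, main.py reads these JSON files back before dispatching to train/sample/evaluate. A minimal sketch of that reading step (illustrative; not the repository's actual main.py):
%% Cell type:code id: tags:
``` python
import json
import os

def load_settings(experiment_path):
    """Read the *_setting.json files written above back into a dict of dicts."""
    folder = os.path.join(experiment_path, "settings")
    settings = {}
    for name in ["meta", "dataset", "model", "framework",
                 "training", "sampling", "evaluation"]:
        with open(os.path.join(folder, name + "_setting.json")) as fp:
            settings[name] = json.load(fp)
    return settings

# Mirrors the batch-file call `python main.py train "<experiment folder>"`:
# settings = load_settings("/work/lect0100/main_experiment/main_test1/")
```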
%% Cell type:code id: tags:
``` python
```
......