Commit cf7fe26e authored by Gonzalo Martin Garcia
parents 049ef862 821b786d
@@ -4,11 +4,11 @@
This is the repository for the *diffusion project* for the **(PR) Laboratory Deep Learning 23ss**
This repository houses our comprehensive pipeline, designed to conveniently train, sample from, and evaluate our unconditional diffusion model.
-The pipeline is initiated via the experiment_creator.ipynb notebook, which is run separately on our local machine. This notebook allows for the configuration of every aspect of the diffusion model, including all hyperparameters. These configurations extend to the underlying neural backbone UNet, as well as the training parameters, such as training from checkpoint, Weights & Biases run name for resumption, optimizer selection, adjustment of the learning rate for manual learning rate scheduling, and more. Moreover, it includes parameters for evaluating and sampling images via a trained diffusion model.
+The pipeline is initiated via the experiment_creator.ipynb notebook, which is run separately on our local machine. This notebook allows for the configuration of every aspect of the diffusion model, including all hyperparameters. These configurations extend to the underlying neural backbone UNet, as well as the training parameters, such as training from checkpoint, Weights & Biases run name for resumption, optimizer selection, adjustment of the CosineAnnealingLR learning rate schedule parameters, and more. Moreover, it includes parameters for evaluating and sampling images via a trained diffusion model.
Upon execution, the notebook generates individual JSON files, encapsulating all the hyperparameter information. When running the model on the HPC, we can choose between the operations 'train', 'sample', and 'evaluate'. These operations automatically extract the necessary hyperparameters from the JSON files and perform their respective tasks. This process is managed by the main.py file. The remaining files contain all the necessary functions optimized for HPC to perform the aforementioned tasks.
-Every uniquely trained diffusion model has its own experiment folder, given by its WANDB run name. It holds four different directories: settings, trained_ddpm, samples, and evaluations. The settings folder holds the JSON files specifying the diffusion model's configurations as well as the arguments for the training, sampling, and evaluation functions. The trained_ddpm folder contains .pth files storing the weights and biases of this experiment's diffusion model, saved at different epoch milestones during training. Upon resuming training, the pipeline automatically finds the highest-epoch model in trained_ddpm and continues training from there. When sampling images from these trained diffusion models, the samples are stored in directories named epoch_{i}, so we know which epoch-i version of the diffusion model generated them.
+Every uniquely trained diffusion model has its own experiment folder, given by its WANDB run name. It holds four different directories: settings, trained_ddpm, samples, and evaluations. The settings folder holds the JSON files specifying the diffusion model's configurations as well as the arguments for the training, sampling, and evaluation functions. The trained_ddpm folder contains .pth files storing the weights and biases of this experiment's diffusion model, saved at different epoch milestones during training. Upon resuming training, the pipeline takes the specified model in trained_ddpm and continues training from there. When sampling images from these trained diffusion models, the samples are stored in separate directories named epoch_{i}, one per milestone, so we know which epoch-i version of the diffusion model generated them.
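For orientation, an experiment folder then looks roughly like this (illustrative layout; exact file names depend on the run):

```
<WANDB run name>/
├── settings/        # JSON files with all hyperparameters and function arguments
├── trained_ddpm/    # model_epoch_{e}.pth checkpoints saved at epoch milestones
├── samples/
│   └── epoch_{i}/   # images sampled from the epoch-i checkpoint
└── evaluations/
```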
@@ -3,7 +3,7 @@ import torch
from torchvision import transforms
import re
-def ddpm_sampler(model, checkpoint, experiment_path, device, intermediate=False, batch_size=15):
+def ddpm_sampler(model, checkpoint, experiment_path, device, intermediate=False, batch_size=15, sample_all=False):
'''
Samples a tensor of 'batch_size' images from a trained diffusion model with 'checkpoint'. The generated
images are stored in the directory 'experiment_path/samples/epoch_{e}/sample_{j}', where e is the epoch
@@ -18,6 +18,13 @@ def ddpm_sampler(model, checkpoint, experiment_path, device, intermediate=False,
sample a single image, but store all the intermediate noised latents along the reverse chain
'''
+    if sample_all:
+        # Sample from every stored checkpoint of this experiment
+        # (requires `import os` at the top of the file).
+        f = f'{experiment_path}trained_ddpm/'
+        checkpoint_list = [c for c in os.listdir(f) if c.endswith(".pth")]
+        for checkpoint_i in checkpoint_list:
+            ddpm_sampler(model, checkpoint_i, experiment_path, device,
+                         intermediate=intermediate, batch_size=batch_size,
+                         sample_all=False)
+        return
# load model
try:
checkpoint_path = f'{experiment_path}trained_ddpm/{checkpoint}'
......
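The diff is truncated here. For readers unfamiliar with the pattern, restoring a model from one of these checkpoints typically looks like the following sketch (an assumption about the checkpoint layout, not the repository's exact code):

``` python
import torch

def load_ddpm_checkpoint(model, experiment_path, checkpoint, device):
    # Illustrative only: the actual .pth layout of this repo may differ.
    checkpoint_path = f'{experiment_path}trained_ddpm/{checkpoint}'
    state = torch.load(checkpoint_path, map_location=device)
    # Accept either a bare state dict or a dict wrapping one under 'model'.
    model.load_state_dict(state.get('model', state))
    model.eval()
    return model
```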
%% Cell type:code id: tags:
``` python
from trainer.train import *
from dataloader.load import *
from models.Framework import *
from models.all_unets import *
import torch
from torch import nn
```
%% Cell type:markdown id: tags:
# Prepare experiment
1. Choose Hyperparameter Settings
2. Run the notebook on the local machine to generate the experiment folder with the JSON files containing the settings
3. scp experiment folder to the HPC
4. Run the pipeline by adding one of the following to the batch file:
- Train Model: &emsp;&emsp;&emsp;&emsp;&emsp; `python main.py train "<absolute path of experiment folder in hpc>"`
- Sample Images: &emsp;&emsp;&emsp; `python main.py sample "<absolute path of experiment folder in hpc>"`
- Evaluate Model: &emsp;&emsp;&emsp; `python main.py evaluate "<absolute path of experiment folder in hpc>"`
%% Cell type:code id: tags:
``` python
import torch
####
# Settings
####
# Dataset path
datapath = "/work/lect0100/lhq_256"
# Experiment setup
-run_name = 'main_test0' # WANDB and experiment folder Name!
+run_name = 'main_test1' # WANDB and experiment folder Name!
checkpoint = None #'model_epoch_8.pth' # Name of checkpoint pth file or None
experiment_path = "/work/lect0100/main_experiment/" + run_name +'/'
# Path to save generated experiment folder on local machine
local_path ="experiments/" + run_name + '/settings'
# Diffusion Model Settings
-diffusion_steps = 500
+diffusion_steps = 1000
image_size = 128
channels = 3
# Training
batchsize = 32
-epochs = 20
-store_iter = 5
-eval_iter = 2
+epochs = 100
+store_iter = 10
+eval_iter = 500
learning_rate = 0.0001
optimizername = "torch.optim.AdamW"
optimizer_params = None
-verbose = True
+verbose = False
# checkpoint = None # (if not training from a checkpoint, i.e., random weights)
# Sampling
sample_size = 20
intermediate = False # True if you want to sample one image and all its intermediate latents
+sample_all = False
# Evaluating
...
###
# Advanced Settings Dictionaries
###
meta_setting = dict(modelname = "UNet_Res",
dataset = "UnconditionalDataset",
framework = "DDPM",
trainloop_function = "ddpm_trainer",
sampling_function = 'ddpm_sampler',
evaluation_function = 'ddpm_evaluator',
batchsize = batchsize
)
dataset_setting = dict(fpath = datapath,
img_size = image_size,
frac =0.8,
skip_first_n = 0,
ext = ".png",
transform=True
)
model_setting = dict( n_channels=64,
fctr = [1,2,4,4,8],
time_dim=256,
attention = True,
)
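# Illustrative note (an assumption, not verified against models/all_unets.py):
# if fctr lists per-stage channel multipliers, the UNet stage widths would be
# [n_channels * f for f in fctr] = [64, 128, 256, 256, 512].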
"""
outdated
model_setting = dict( channels_in=channels,
channels_out =channels ,
activation='relu', # activation function. Options: {'relu', 'leakyrelu', 'selu', 'gelu', 'silu'/'swish'}
weight_init='he', # weight initialization. Options: {'he', 'torch'}
projection_features=64, # number of image features after first convolution layer
time_dim=batchsize, # don't change!!!
time_channels=diffusion_steps, # number of time channels #TODO same as diffusion steps?
num_stages=4, # number of stages in contracting/expansive path
stage_list=None, # specify number of features produced by stages
num_blocks=1, # number of ConvResBlock in each contracting/expansive path
num_groupnorm_groups=32, # number of groups used in Group Normalization inside a ConvResBlock
dropout=0.1, # drop-out to be applied inside a ConvResBlock
attention_list=None, # specify MHA pattern across stages
num_attention_heads=1,
)
"""
framework_setting = dict(
diffusion_steps = diffusion_steps, # don't change!!
out_shape = (channels,image_size,image_size), # don't change!!
noise_schedule = 'linear',
beta_1 = 1e-4,
beta_T = 0.02,
alpha_bar_lower_bound = 0.9,
var_schedule = 'same',
kl_loss = 'simplified',
-recon_loss = 'nll',
+recon_loss = 'none',
)
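# Reference note (standard DDPM convention; assumed, not checked against this
# repo's Framework code): a 'linear' noise schedule corresponds to
# betas = torch.linspace(beta_1, beta_T, diffusion_steps).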
training_setting = dict(
epochs = epochs,
store_iter = store_iter,
eval_iter = eval_iter,
optimizer_class=optimizername,
optimizer_params = optimizer_params,
#optimizer_params=dict(lr=learning_rate), # don't change!
learning_rate = learning_rate,
run_name=run_name,
checkpoint= checkpoint,
experiment_path = experiment_path,
verbose = verbose,
-T_max = 0.8*90000/32*150, # CosineAnnealingLR period in optimizer steps: train split * dataset size / batch size * total epochs
+T_max = 0.8*90000/32*100, # CosineAnnealingLR period in optimizer steps: 0.8 * 90000 / 32 * 100 epochs = 225000
eta_min= 1e-10, # cosine lr param
)
sampling_setting = dict(
checkpoint = checkpoint,
experiment_path = experiment_path,
batch_size = sample_size,
-intermediate = intermediate
+intermediate = intermediate,
+sample_all = sample_all
)
# TODO
evaluation_setting = dict(
checkpoint = checkpoint,
experiment_path = experiment_path,
)
```
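%% Cell type:markdown id: tags:
For clarity, here is a minimal sketch of how `T_max` and `eta_min` above plug into PyTorch's `CosineAnnealingLR`. It assumes (not verified against the trainer) that the scheduler is stepped once per optimizer step, so `T_max` counts batches rather than epochs; the `nn.Linear` is just a stand-in for the UNet.
%% Cell type:code id: tags:
``` python
import torch
from torch import nn

net = nn.Linear(8, 8)  # stand-in model, for illustration only
opt = torch.optim.AdamW(net.parameters(), lr=1e-4)

steps_per_epoch = int(0.8 * 90000 / 32)  # train split * dataset size / batch size = 2250
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=steps_per_epoch * 100, eta_min=1e-10)

for _ in range(3):  # in the real training loop: once per batch
    opt.step()
    sched.step()
```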
%% Cell type:code id: tags:
``` python
import os
import json
f = local_path
if os.path.exists(f):
    print("path already exists, pick a new name!")
    print("break")
else:
    print("create folder")
    os.makedirs(f, exist_ok=True)
    print("folder created")
with open(f+"/meta_setting.json","w+") as fp:
    json.dump(meta_setting,fp)
with open(f+"/dataset_setting.json","w+") as fp:
    json.dump(dataset_setting,fp)
with open(f+"/model_setting.json","w+") as fp:
    json.dump(model_setting,fp)
with open(f+"/framework_setting.json","w+") as fp:
    json.dump(framework_setting,fp)
with open(f+"/training_setting.json","w+") as fp:
    json.dump(training_setting,fp)
with open(f+"/sampling_setting.json","w+") as fp:
    json.dump(sampling_setting,fp)
with open(f+"/evaluation_setting.json","w+") as fp:
    json.dump(evaluation_setting,fp)
print("stored json files in folder")
print(meta_setting)
print(dataset_setting)
print(model_setting)
print(framework_setting)
print(training_setting)
print(sampling_setting)
print(evaluation_setting)
```
%% Output
create folder
folder created
stored json files in folder
{'modelname': 'UNet_Res', 'dataset': 'UnconditionalDataset', 'framework': 'DDPM', 'trainloop_function': 'ddpm_trainer', 'sampling_function': 'ddpm_sampler', 'evaluation_function': 'ddpm_evaluator', 'batchsize': 32}
{'fpath': '/work/lect0100/lhq_256', 'img_size': 128, 'frac': 0.8, 'skip_first_n': 0, 'ext': '.png', 'transform': True}
{'n_channels': 64, 'fctr': [1, 2, 4, 4, 8], 'time_dim': 256, 'attention': True}
-{'diffusion_steps': 500, 'out_shape': (3, 128, 128), 'noise_schedule': 'linear', 'beta_1': 0.0001, 'beta_T': 0.02, 'alpha_bar_lower_bound': 0.9, 'var_schedule': 'same', 'kl_loss': 'simplified', 'recon_loss': 'nll'}
-{'epochs': 20, 'store_iter': 5, 'eval_iter': 2, 'optimizer_class': 'torch.optim.AdamW', 'optimizer_params': None, 'learning_rate': 0.0001, 'run_name': 'main_test0', 'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test0/', 'verbose': True, 'T_max': 337500.0, 'eta_min': 1e-10}
-{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test0/', 'batch_size': 20, 'intermediate': False}
-{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test0/'}
+{'diffusion_steps': 1000, 'out_shape': (3, 128, 128), 'noise_schedule': 'linear', 'beta_1': 0.0001, 'beta_T': 0.02, 'alpha_bar_lower_bound': 0.9, 'var_schedule': 'same', 'kl_loss': 'simplified', 'recon_loss': 'none'}
+{'epochs': 100, 'store_iter': 10, 'eval_iter': 500, 'optimizer_class': 'torch.optim.AdamW', 'optimizer_params': None, 'learning_rate': 0.0001, 'run_name': 'main_test1', 'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test1/', 'verbose': False, 'T_max': 225000.0, 'eta_min': 1e-10}
+{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test1/', 'batch_size': 20, 'intermediate': False}
+{'checkpoint': None, 'experiment_path': '/work/lect0100/main_experiment/main_test1/'}
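%% Cell type:markdown id: tags:
On the HPC side, main.py reads these JSON files back before dispatching to train/sample/evaluate. A minimal sketch of that reading step (illustrative; not the repository's actual main.py):
%% Cell type:code id: tags:
``` python
import json
import os

def load_settings(experiment_path):
    """Read the *_setting.json files written above back into a dict of dicts."""
    folder = os.path.join(experiment_path, "settings")
    settings = {}
    for name in ["meta", "dataset", "model", "framework",
                 "training", "sampling", "evaluation"]:
        with open(os.path.join(folder, name + "_setting.json")) as fp:
            settings[name] = json.load(fp)
    return settings

# Mirrors the batch-file call `python main.py train "<experiment folder>"`:
# settings = load_settings("/work/lect0100/main_experiment/main_test1/")
```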
%% Cell type:code id: tags:
``` python
```
......