Lab: DL Toolchain Automation
The aim of this ticket is to make the EMADL2CPP compilation toolchain as user-friendly and intuitive as possible.
Current state: EMADL2CPP is a code generator / compiler that enables the compilation and training of EmbeddedMontiArc models featuring Deep Learning based components, i.e. those having a CNNArch implementation. Instead of generating code itself, EMADL2CPP delegates generation to the respective sub-generators: EMAM2CPP for architecture and MontiMath code generation, and CNNArch2X for deep learning components (where X is MXNet or Caffe2). Once the code is generated, the user has to make sure that the database containing the training and test data is placed in the right location in the target directory structure, then train the network and compile the result to an executable file "manually".
Goal: Rework the EMADL2CPP compiler such that, based on a given configuration, it generates code, trains all the networks present in the EmbeddedMontiArc model, and compiles the result to an executable in one shot (only one call allowed!).
Therefore, an additional configuration file is needed to set up the training data path for each DL component as well as some metadata concerning the database. A line of the configuration file might look like this (one line is needed per DL component):
de.some.package.MyParentComponent.dlComponentToTrainInstanceName /path/to/data LMDB
Here, the first argument is the fully qualified name of the instance to be trained. The instance is named dlComponentToTrainInstanceName; it is instantiated in the component MyParentComponent, which resides in the package de.some.package.
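To make the line format concrete, the following is a minimal sketch of how such a line could be split into its parts. The names TrainingEntry and parse_training_line are hypothetical, and the sketch assumes that the three fields are separated by whitespace and that the data path itself contains no spaces.

```python
from typing import NamedTuple


class TrainingEntry(NamedTuple):
    """One parsed configuration line (hypothetical structure)."""
    package: str            # e.g. "de.some.package"
    parent_component: str   # e.g. "MyParentComponent"
    instance_name: str      # e.g. "dlComponentToTrainInstanceName"
    data_path: str
    database_type: str


def parse_training_line(line: str) -> TrainingEntry:
    """Split '<qualified instance> <data path> <database type>' into fields."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    qualified_name, data_path, database_type = fields

    # The last segment is the instance, the one before it the parent
    # component, and everything in front of that is the package.
    parts = qualified_name.split(".")
    if len(parts) < 2:
        raise ValueError(f"not a qualified instance name: {qualified_name!r}")
    *package_parts, parent_component, instance_name = parts

    return TrainingEntry(
        package=".".join(package_parts),
        parent_component=parent_component,
        instance_name=instance_name,
        data_path=data_path,
        database_type=database_type,
    )
```

Applied to the example line above, this yields the package de.some.package, the parent component MyParentComponent and the instance dlComponentToTrainInstanceName.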
The line means: look up an LMDB database containing the training and test data in /path/to/data. EMADL2CPP should therefore ask the backend compiler whether it currently supports this kind of database. If an unsupported database type is requested, an error needs to be thrown.
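The support check could look roughly like the sketch below. Backend, UnsupportedDatabaseError and require_supported are hypothetical names, and the set of supported types used in the usage example is a placeholder: which database types MXNet or Caffe2 actually support has to be reported by the respective CNNArch2X generator.

```python
class UnsupportedDatabaseError(Exception):
    """Raised when a backend cannot read the requested database type."""


class Backend:
    """Hypothetical stand-in for a backend compiler such as CNNArch2X."""

    def __init__(self, name, supported_database_types):
        self.name = name
        # Normalise to upper case so "lmdb" and "LMDB" are treated alike.
        self._supported = {t.upper() for t in supported_database_types}

    def supports(self, database_type):
        return database_type.upper() in self._supported


def require_supported(backend, database_type):
    """Ask the backend and fail early if the database type is unknown to it."""
    if not backend.supports(database_type):
        raise UnsupportedDatabaseError(
            f"backend {backend.name!r} does not support "
            f"database type {database_type!r}"
        )
```

A call such as require_supported(Backend("ExampleBackend", {"LMDB"}), "HDF5") would raise the error before any code is generated or any training is started.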
On the other hand, you do not want to retrain all the networks inside your model each time you change and regenerate a MontiMath component. Hence, you need to check whether training is necessary. To do so, you might store an additional file containing the hash value of the data used for training in the target directory of each DL component. If neither the hash value of the training database nor the CNNArch component implementation has changed, re-training can be omitted.
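One way to realise this check is sketched below. It is only an illustration under several assumptions: the hash file name training.hash, the choice of SHA-256, and the decision to fold the CNNArch implementation file into the same hash as the training data are not prescribed by this ticket.

```python
import hashlib
from pathlib import Path

HASH_FILE_NAME = "training.hash"  # hypothetical file name


def hash_training_inputs(data_dir: Path, architecture_file: Path) -> str:
    """Hash the architecture file and every file below the data directory."""
    digest = hashlib.sha256()
    inputs = [(architecture_file.name, architecture_file)]
    # Sort the files so the result does not depend on directory listing order.
    inputs += [
        (p.relative_to(data_dir).as_posix(), p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    ]
    for label, path in inputs:
        digest.update(label.encode("utf-8"))
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large databases are not loaded at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def training_needed(data_dir: Path, architecture_file: Path, target_dir: Path) -> bool:
    """True if no hash is stored yet or the stored hash differs."""
    hash_file = target_dir / HASH_FILE_NAME
    if not hash_file.exists():
        return True
    current = hash_training_inputs(data_dir, architecture_file)
    return hash_file.read_text().strip() != current


def record_training(data_dir: Path, architecture_file: Path, target_dir: Path) -> None:
    """Store the current hash after a successful training run."""
    target_dir.mkdir(parents=True, exist_ok=True)
    current = hash_training_inputs(data_dir, architecture_file)
    (target_dir / HASH_FILE_NAME).write_text(current)
```

In this sketch, record_training would be called once training of a DL component has finished, and training_needed before the next generation run.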
Of course, there might be scenarios where you want to force re-training, and there are also scenarios where you want to skip checking the hash value, as it might take a while for big databases. Therefore, please introduce two new CLI parameters for EMADL2CPP: no-training and force-training.
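The interplay of the two parameters with the hash check can be sketched as follows. This is an assumption-laden illustration, not the EMADL2CPP CLI: the double-dash spelling, the mutual exclusion of the two flags, and the interpretation that no-training skips both the hash check and the training are all choices made for this example.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the two training-related flags."""
    parser = argparse.ArgumentParser(prog="emadl2cpp-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--no-training", action="store_true",
                       help="skip the hash check and do not train")
    group.add_argument("--force-training", action="store_true",
                       help="train even if nothing has changed")
    return parser


def should_train(args, inputs_changed):
    """Decide whether to train.

    inputs_changed is a zero-argument callable performing the (possibly
    slow) hash comparison; it is only invoked when neither flag is set.
    """
    if args.force_training:
        return True
    if args.no_training:
        return False
    return inputs_changed()
```

Passing the hash comparison as a callable keeps the expensive check lazy, so neither flag pays for hashing a large database.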
There is one more pitfall: sometimes you want to have several instances of the very same component, i.e. you want to allow for weight sharing. It should, hence, be possible for several instances of the same DL component to share weights. One possible solution would be to check whether several instances mentioned in the configuration file have the same component type AND the same training data. If so, the weights should be shared. As an alternative, it makes sense to allow a component type, instead of a concrete instance, to be configured with a database. Then all instances of this type should share the same weights (with the exception of instances that are configured explicitly).
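Both ideas can be combined in a small resolution step, sketched below. The function names and the dictionary-based inputs are hypothetical: in practice the mapping from instance to component type would come from the EmbeddedMontiArc model, and the two configuration dictionaries from the configuration file.

```python
from collections import defaultdict


def resolve_training_data(instance_types, instance_config, type_config):
    """Map each configured instance to its training data path.

    instance_types:  instance name -> component type
    instance_config: instance name -> data path (concrete instances)
    type_config:     component type -> data path (type-level default)
    An instance-level entry takes precedence over the type-level entry.
    """
    resolved = {}
    for instance, component_type in instance_types.items():
        if instance in instance_config:
            resolved[instance] = instance_config[instance]
        elif component_type in type_config:
            resolved[instance] = type_config[component_type]
    return resolved


def weight_sharing_groups(instance_types, resolved):
    """Group instances with the same component type AND the same data."""
    groups = defaultdict(list)
    for instance, data_path in resolved.items():
        groups[(instance_types[instance], data_path)].append(instance)
    return dict(groups)
```

Each resulting group would be trained once, with all of its instances sharing the resulting weights. An instance configured explicitly with different data ends up in a group of its own.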
Feel free to ask questions, suggest improvements, and discuss new ideas.