Training loss is NaN sometimes (PyTorch Backend)
Problem
When running certain experiments, the training loss computed in src/main/resources/experiments/steps/MySupervisedTrainer.py (line 80) sometimes evaluates to NaN. This results in debug messages like:
Epoch:1 Train Loss:nan Train Accuracy:10.03%
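For reference, the snippet below is a minimal, hypothetical sketch of how the loss computation could be guarded so that the first batch producing a NaN can be inspected; it is not the generated MySupervisedTrainer.py code, and names such as `model`, `train_loader`, and `optimizer` are assumptions.

```python
import torch
import torch.nn.functional as F

def train_one_epoch(model, train_loader, optimizer, device="cpu"):
    """Hypothetical training loop that stops on the first non-finite loss.

    Diagnostic sketch only; it does not reproduce the generated trainer.
    """
    model.train()
    for step, (inputs, labels) in enumerate(train_loader):
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = F.cross_entropy(outputs, labels)

        # Abort as soon as the loss becomes non-finite so the offending
        # batch and logits can be inspected instead of propagating NaN.
        if not torch.isfinite(loss):
            raise RuntimeError(
                f"Non-finite loss at step {step}: "
                f"logits finite={torch.isfinite(outputs).all().item()}, "
                f"labels={labels.unique().tolist()}"
            )

        loss.backward()
        optimizer.step()
```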
Steps to reproduce
Note: Getting far enough to trigger this problem requires the workaround described in issue #123, since that issue is not yet resolved.
This issue is not deterministic. However, when executing the EMADL2CPP generator as follows, there is a high chance of encountering the problem.
- Main class:
de.monticore.lang.monticar.emadl.generator.MontiAnnaCli
- Program arguments:
-m src/main/resources/calculator_experiment/emadl -r calculator.Connector -o target -b PYTORCH
Out of the six runs generated by this execution, typically 2 to 4 exhibit this behavior. I have never encountered this issue with other experiments such as 'adanet_experiment' or 'squaredigit_experiment', even though all of the mentioned experiments use the same loss function (cross entropy).
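On finite logits, PyTorch's cross entropy is numerically stable, so the NaN most likely originates upstream (in the logits, weights, or input data) rather than in the loss itself. The following standalone sketch, with made-up tensor values, shows how non-finite logits propagate into the loss and how autograd's anomaly mode can help locate the first operation that produces a NaN.

```python
import torch
import torch.nn.functional as F

# Cross entropy on well-formed logits does not produce NaN on its own;
# a NaN loss therefore points at the logits (weights or input data).
logits = torch.randn(4, 10)
labels = torch.tensor([0, 1, 2, 3])
print(F.cross_entropy(logits, labels))   # finite value

logits[0, 0] = float("inf")              # simulate a diverged weight
print(F.cross_entropy(logits, labels))   # tensor(nan)

# Anomaly mode reports the backward op that first produced a NaN,
# which helps narrow down the diverging layer during training.
with torch.autograd.set_detect_anomaly(True):
    loss = F.cross_entropy(torch.randn(4, 10, requires_grad=True), labels)
    loss.backward()
```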