**This project is obsolete! Use ConfLang instead!**

# CNNTrain
CNNTrain is a domain-specific language for describing the training parameters of a feedforward neural network. CNNTrain files must have the `.cnnt` extension. A training configuration starts with the keyword `configuration`, followed by the configuration name and a list of parameters. The available parameters are the batch size, the number of epochs, whether to load a previous checkpoint, and an optimizer with its parameters. All these parameters are optional.

An example configuration:
```
configuration FullConfig {
    num_epoch : 5
    batch_size : 100
    load_checkpoint : true
    optimizer : rmsprop {
        learning_rate : 0.001
        weight_decay : 0.01
        learning_rate_decay : 0.9
        learning_rate_policy : step
        step_size : 1000
        rescale_grad : 1.1
        clip_gradient : 10
        gamma1 : 0.9
        gamma2 : 0.9
        epsilon : 0.000001
        centered : true
        clip_weights : 10
    }
}
```
See `CNNTrain.mc4` for the full grammar definition.

Using the `CNNTrainGenerator` class, a Python file can be generated, which looks as follows (for the example above):
```python
batch_size = 100,
num_epoch = 5,
load_checkpoint = True,
optimizer = 'rmsprop',
optimizer_params = {
    'epsilon': 1.0E-6,
    'weight_decay': 0.01,
    'rescale_grad': 1.1,
    'centered': True,
    'clip_gradient': 10.0,
    'gamma2': 0.9,
    'gamma1': 0.9,
    'learning_rate_policy': 'step',
    'clip_weights': 10.0,
    'learning_rate': 0.001,
    'learning_rate_decay': 0.9,
    'step_size': 1000}
```
## Reinforcement Learning
CNNTrain can be used to describe training parameters for supervised learning methods as well as for reinforcement learning methods. If reinforcement learning is selected, the network is trained with the Deep-Q-Network algorithm (Mnih et al., "Playing Atari with Deep Reinforcement Learning").
An example of a supervised learning configuration can be seen above. The following is an example configuration for reinforcement learning:
```
configuration ReinforcementConfig {
    learning_method : reinforcement
    agent_name : "reinforcement-agent"
    environment : gym { name : "CartPole-v1" }
    context : cpu
    num_episodes : 300
    num_max_steps : 9999
    discount_factor : 0.998
    target_score : 1000
    training_interval : 10
    loss : huber
    use_fix_target_network : true
    target_network_update_interval : 100
    use_double_dqn : true
    replay_memory : buffer {
        memory_size : 1000000
        sample_size : 64
    }
    action_selection : epsgreedy {
        epsilon : 1.0
        min_epsilon : 0.01
        epsilon_decay_method : linear
        epsilon_decay : 0.0001
    }
    optimizer : rmsprop {
        learning_rate : 0.001
        learning_rate_minimum : 0.00001
        weight_decay : 0.01
        learning_rate_decay : 0.9
        learning_rate_policy : step
        step_size : 1000
        rescale_grad : 1.1
        clip_gradient : 10
        gamma1 : 0.9
        gamma2 : 0.9
        epsilon : 0.000001
        centered : true
        clip_weights : 10
    }
}
```
### Available Parameters for Reinforcement Learning
Parameter | Value | Default | Required | Algorithm | Description |
---|---|---|---|---|---|
learning_method | reinforcement, supervised | supervised | No | All | Determines whether this CNNTrain configuration is a reinforcement or a supervised learning configuration |
rl_algorithm | ddpg-algorithm, dqn-algorithm, td3-algorithm | dqn-algorithm | No | All | Determines the RL algorithm that is used to train the agent |
agent_name | String | "agent" | No | All | Names the agent (e.g. for logging output) |
environment | gym, ros_interface | / | Yes | All | If ros_interface is selected, the agent and the environment communicate via ROS. The gym environment comes with a set of environments, which are listed here |
context | cpu, gpu | cpu | No | All | Determines whether the GPU is used during training or the CPU |
num_episodes | Integer | 50 | No | All | Number of episodes the agent is trained for. An episode is a full run of the game from an initial state to a terminal state. |
num_max_steps | Integer | 99999 | No | All | Number of steps within an episode before the environment is forced to reset the state (e.g. to avoid a state in which the agent is stuck) |
discount_factor | Float | 0.9 | No | All | Discount factor |
target_score | Float | None | No | All | If set, the agent stops the training when the average score of the last 100 episodes is greater than the target score. |
training_interval | Integer | 1 | No | All | Number of steps between two training updates |
loss | l2, l1, softmax_cross_entropy, sigmoid_cross_entropy, huber | l2 | No | DQN | Selects the loss function |
use_fix_target_network | bool | false | No | DQN | If set, an extra network with fixed parameters is used to estimate the Q values |
target_network_update_interval | Integer | / | Yes, if use_fix_target_network is true | DQN | If use_fix_target_network is set, it determines the number of steps after which the target network is updated (Mnih et al., "Human-Level Control through Deep Reinforcement Learning") |
use_double_dqn | bool | false | No | DQN | If set, two value functions are used to determine the action values (van Hasselt et al., "Deep Reinforcement Learning with Double Q-Learning") |
replay_memory | buffer, online, combined | buffer | No | All | Determines the behaviour of the replay memory |
strategy | epsgreedy, ornstein_uhlenbeck, gaussian | epsgreedy (discrete), ornstein_uhlenbeck (continuous) | No | All | Determines the action selection policy during training |
reward_function | Full name of an EMAM component | / | Yes, if ros_interface is selected as the environment and no reward topic is given | All | The EMAM component that is used to calculate the reward. It must have two inputs, one for the current state and one boolean input that determines if the current state is terminal. It must also have exactly one output which represents the reward. |
critic | Full name of architecture definition | / | Yes, if DDPG or TD3 is selected | DDPG, TD3 | The architecture definition which specifies the architecture of the critic network |
soft_target_update_rate | Float | 0.001 | No | DDPG, TD3 | Determines the update rate of the critic and actor target network |
actor_optimizer | See supervised learning | adam with LR .0001 | No | DDPG, TD3 | Determines the optimizer parameters of the actor network |
critic_optimizer | See supervised learning | adam with LR .001 | No | DDPG, TD3 | Determines the optimizer parameters of the critic network |
start_training_at | Integer | 0 | No | All | Determines at which episode the training starts |
evaluation_samples | Integer | 100 | No | All | Determines how many episodes are run when evaluating the network |
policy_noise | Float | 0.1 | No | TD3 | Determines the standard deviation of the noise that is added to the actions predicted by the target actor network when calculating the targets. |
noise_clip | Float | 0.5 | No | TD3 | Sets the upper and lower limit of the policy noise |
policy_delay | Integer | 2 | No | TD3 | Every policy_delay steps, the actor network and the target networks are updated. |
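The DDPG/TD3-specific parameters from the table do not appear in either example above. The following is a hedged sketch of how a TD3 configuration might combine them; the critic component name, the environment name, and the exact optimizer syntax are illustrative assumptions and should be checked against CNNTrain.mc4:

```
configuration TD3Config {
    learning_method : reinforcement
    rl_algorithm : td3-algorithm
    environment : gym { name : "Pendulum-v0" }
    critic : comp.td3.CriticNetwork
    soft_target_update_rate : 0.001
    policy_noise : 0.1
    noise_clip : 0.5
    policy_delay : 2
    actor_optimizer : adam { learning_rate : 0.0001 }
    critic_optimizer : adam { learning_rate : 0.001 }
}
```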
### Environment

#### Option: ros_interface
If selected, the communication between the environment and the agent is done via ROS. Additional parameters (see the sketch after this list):
- state_topic: Topic on which the state is published
- action_topic: Topic on which the action is published
- reset_topic: Topic on which the reset command is published
- terminal_state_topic: Topic on which the terminal flag is published
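A minimal sketch of a ros_interface environment block; the topic names are placeholders, and the exact parameter spellings should be checked against CNNTrain.mc4:

```
environment : ros_interface {
    state_topic : "/environment/state"
    action_topic : "/environment/action"
    reset_topic : "/environment/reset"
    terminal_state_topic : "/environment/terminal"
}
```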
#### Option: gym
The gym environment comes with a set of environments which are listed here. Additional parameters:
- name: Name (see https://gym.openai.com/) of the environment
### Replay Buffer

Different buffer behaviours can be selected for training. For more information about the buffer behaviour, see "A Deeper Look at Experience Replay" by Zhang and Sutton.

#### Option: buffer

A simple buffer that stores the SARS (state, action, reward, next state) tuples. Additional parameters:
- memory_size: Determines the size of the buffer
- sample_size: Number of samples that are used for each training step
#### Option: online

No buffer is used. Only the current SARS tuple is used for training.

#### Option: combined

Combination of online and buffer. Both the current SARS tuple and a sample from the buffer are used for each training step. The parameters are the same as for buffer (see the sketch below).
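A minimal sketch of a combined replay memory; it reuses the buffer parameters shown above, with illustrative values:

```
replay_memory : combined {
    memory_size : 1000000
    sample_size : 64
}
```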
### Strategy

Determines the behaviour when selecting an action based on the values.

#### Option: epsgreedy

This strategy is only available for discrete problems. It selects an action based on an epsilon-greedy policy: depending on epsilon, either a random action is chosen or the action with the highest Q-value. Additional parameters (see the sketch after this list):
- epsilon: Probability of choosing an action randomly
- epsilon_decay_method: Method which determines how epsilon decreases after each step. Can be linear for linear decrease or no for no decrease.
- epsilon_decay_start: Number of episodes after which the decay of epsilon starts
- epsilon_decay: The actual decay of epsilon after each step.
- min_epsilon: After min_epsilon is reached, epsilon is not decreased further.
- epsilon_decay_per_step: Expects either true or false. If true, the decay is performed after each step the agent executes instead of after each episode. The default value is false.
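A sketch of an epsgreedy block that uses all of the parameters above; the values are illustrative, and the surrounding keyword (strategy here, action_selection in the configuration above) should match the grammar version in use:

```
strategy : epsgreedy {
    epsilon : 1.0
    min_epsilon : 0.01
    epsilon_decay_method : linear
    epsilon_decay : 0.0001
    epsilon_decay_start : 20
    epsilon_decay_per_step : true
}
```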
#### Option: ornstein_uhlenbeck

This strategy is only available for continuous problems. The action is selected by the actor network; depending on the current epsilon, noise generated by an Ornstein-Uhlenbeck process is added. Additional parameters:

All epsilon parameters from the epsgreedy strategy can be used. Additionally, mu, theta, and sigma need to be specified. For each action output you can specify the corresponding value with a tuple-style notation: (x,y,z)

Example: Given an actor network with an action output of shape (3,), we can write
```
strategy : ornstein_uhlenbeck {
    ...
    mu : (0.0, 0.1, 0.3)
    theta : (0.5, 0.0, 0.8)
    sigma : (0.3, 0.6, -0.9)
}
```
to specify the parameters for each action component.
#### Option: gaussian

This strategy is also only available for continuous problems. If this strategy is selected, uncorrelated Gaussian noise with zero mean is added to the current policy's action selection. This strategy provides the same parameters as the epsgreedy option, plus the parameter noise_variance, which determines the variance of the noise (see the sketch below).
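A sketch of a gaussian block, assuming the same epsilon parameters as epsgreedy plus noise_variance; the values are illustrative:

```
strategy : gaussian {
    epsilon : 1.0
    min_epsilon : 0.01
    epsilon_decay_method : linear
    epsilon_decay : 0.0001
    noise_variance : 0.1
}
```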
## Generation

To execute the generation in your project, use the following code to generate a separate config file:
```java
import java.nio.file.Path;
import java.nio.file.Paths;

import de.monticore.lang.monticar.cnntrain.generator.CNNTrainGenerator;
...
CNNTrainGenerator cnnTrainGenerator = new CNNTrainGenerator();
Path modelPath = Paths.get("path/to/cnnt/file");
// cnnt_filename is the name of the .cnnt configuration to generate code for
cnnTrainGenerator.generate(modelPath, cnnt_filename);
```
Use the following code to get the file contents as a map (`fileContents.getValue()` contains the generated code):
```java
import java.nio.file.Paths;
import java.util.Map;

import de.monticore.lang.monticar.cnntrain.generator.CNNTrainGenerator;
import de.monticore.lang.monticar.cnntrain._symboltable.CNNTrainLanguage;
// the following MontiCore imports may differ depending on the MontiCore version used
import de.monticore.io.paths.ModelPath;
import de.monticore.symboltable.GlobalScope;
...
CNNTrainGenerator cnnTrainGenerator = new CNNTrainGenerator();
ModelPath mp = new ModelPath(Paths.get("path/to/cnnt/file"));
GlobalScope trainScope = new GlobalScope(mp, new CNNTrainLanguage());
Map.Entry<String, String> fileContents = cnnTrainGenerator.generateFileContent(trainScope, cnnt_filename);
```
CNNTrain can be used together with the CNNArch language, which describes the architecture of a neural network. EmbeddedMontiArcDL uses both languages.