Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics
Authors: Alex Kendall, Yarin Gal, Roberto Cipolla
Affiliation: University of Cambridge, University of Oxford
Task: Multi-Task Learning
Link: arXiv, Code (soon)
TL;DR: A multi-task loss function can be automatically defined with the help of homoscedastic uncertainty.
Task & Motivation
Tasks like panoptic segmentation require solving more than one task at the same time, in this case semantic and instance segmentation. Different tasks usually have different objectives and hence different loss functions. The overall loss is therefore a weighted sum of the single-task losses, but it is non-trivial to find weights that give good performance on all tasks.
This work proposes a general framework for finding good weights for a multi-task loss function. The chosen scenario is shown below:
Multi-Task Learning
The importance of finding good weights is shown by the following figure.
Homoscedastic uncertainty
There are two types of uncertainty in a Bayesian model:
- Epistemic uncertainty is uncertainty in the model, which is due to lack of training data.
- Aleatoric uncertainty is uncertainty with respect to information that the data cannot explain. It can be reduced by observing all explanatory variables with increasing precision. There are two categories:
- Data-dependent or Heteroscedastic uncertainty depends on input data and is predicted as a model output.
- Task-dependent or Homoscedastic uncertainty does not depend on the input data and is not a model output. It stays constant across inputs and varies only between tasks.
Task uncertainty captures the relative confidence between tasks and can be used as a basis for weighting the single-task losses.
Multi-Task Likelihoods
A loss function is derived with the help of Maximum Likelihood.
Regression likelihood:
p(y | f^{W}(x)) = \mathcal{N}(f^{W}(x), \sigma^{2})
Classification likelihood:
p(y | f^{W}(x)) = Softmax(f^{W}(x))
We assume that the likelihood factorizes over the K task outputs:
p(y_1, \dots, y_K | f^{W}(x)) = p(y_1 | f^{W}(x)) \dots p(y_K | f^{W}(x))
For two regression losses the final loss is:
\mathcal{L}(W, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\mathcal{L}_1(W) + \frac{1}{2\sigma_2^2}\mathcal{L}_2(W) + \log \sigma_1\sigma_2
with \mathcal{L}_1(W) = || y_1 - f^W(x)||^2 and similar for \mathcal{L}_2(W)
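The step from the Gaussian likelihood to this loss can be written out as follows. This is a sketch for scalar outputs based on the likelihood definitions above, with additive constants that do not depend on W, \sigma_1 or \sigma_2 collected in const:
-\log p(y_1, y_2 | f^{W}(x)) = -\log \mathcal{N}(y_1; f^{W}(x), \sigma_1^{2}) - \log \mathcal{N}(y_2; f^{W}(x), \sigma_2^{2})
= \frac{1}{2\sigma_1^2} || y_1 - f^W(x)||^2 + \frac{1}{2\sigma_2^2} || y_2 - f^W(x)||^2 + \log \sigma_1 + \log \sigma_2 + const
Minimizing the negative log-likelihood over W, \sigma_1 and \sigma_2 is therefore the same as minimizing \mathcal{L}(W, \sigma_1, \sigma_2).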
For one regression loss and one classification loss the final loss is:
\mathcal{L}(W, \sigma_1, \sigma_2) \approx \frac{1}{2\sigma_1^2}\mathcal{L}_1(W) + \frac{1}{\sigma_2^2}\mathcal{L}_2(W) + \log \sigma_1 + \log \sigma_2
with \mathcal{L}_1(W) = || y_1 - f^W(x)||^2 and \mathcal{L}_2(W) = - \log Softmax(y_2, f^W(x))
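The approximation sign comes from the classification term. As a sketch of that step, the paper scales the softmax input by the task noise, p(y | f^{W}(x), \sigma_2) = Softmax(\frac{1}{\sigma_2^2} f^{W}(x)). Writing f_c for the logit of the true class c (notation introduced here for illustration):
\log p(y = c | f^{W}(x), \sigma_2) = \frac{1}{\sigma_2^2} f_c - \log \sum_{c'} \exp(\frac{1}{\sigma_2^2} f_{c'})
-\log p(y = c | f^{W}(x), \sigma_2) = \frac{1}{\sigma_2^2}\mathcal{L}_2(W) + \log \frac{\sum_{c'} \exp(\frac{1}{\sigma_2^2} f_{c'})}{(\sum_{c'} \exp(f_{c'}))^{1/\sigma_2^2}}
The last term is replaced by \log \sigma_2 under the simplifying assumption
\frac{1}{\sigma_2} \sum_{c'} \exp(\frac{1}{\sigma_2^2} f_{c'}) \approx (\sum_{c'} \exp(f_{c'}))^{1/\sigma_2^2}
which holds with equality for \sigma_2 = 1. This is why the classification loss ends up weighted by 1/\sigma_2^2, while the regression loss is weighted by 1/(2\sigma_1^2).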
When the noise \sigma of a task decreases, the weight of its loss increases, and vice versa. The last term penalizes too large values of \sigma_1, \sigma_2. In practice, the log-variance s := \log \sigma^2 is learned instead of \sigma^2, which avoids division by zero.
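The weighting behaviour can be illustrated with a minimal sketch for the two-regression case, using the log-variance parameterization s = \log \sigma^2. The function names and the plain-float interface are assumptions made for this example and are not taken from the authors' code; in a real setup the log-variances would be trainable parameters optimized together with W.

```python
import math


def uncertainty_weighted_loss(loss_1, loss_2, s_1, s_2):
    """Combine two regression losses using learned log-variances.

    s_i = log(sigma_i^2), so 1 / (2 sigma_i^2) = 0.5 * exp(-s_i)
    and log(sigma_i) = 0.5 * s_i.
    """
    weighted_1 = 0.5 * math.exp(-s_1) * loss_1
    weighted_2 = 0.5 * math.exp(-s_2) * loss_2
    regulariser = 0.5 * (s_1 + s_2)  # equals log(sigma_1 * sigma_2)
    return weighted_1 + weighted_2 + regulariser


def optimal_log_variances(loss_1, loss_2):
    """Log-variances that minimise the combined loss for fixed task losses.

    Setting d/ds [0.5 * exp(-s) * L + 0.5 * s] = 0 gives exp(s) = L,
    i.e. sigma^2 equals the current task loss.
    """
    return math.log(loss_1), math.log(loss_2)
```

For fixed task losses the combined loss is convex in each s, so a task with a larger loss settles at a larger \sigma^2 and receives a smaller weight, while the regularizer keeps \sigma^2 from growing without bound.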
Model
The proposed model follows the general encoder-decoder scheme. Specifically, DeepLabV3 is used with ResNet101 as the backbone. A separate decoder is used for each task.
Semantic Segmentation
The standard cross-entropy loss is used.
Instance Segmentation
Each pixel learns to predict the centroid of its instance, so that an L_1 regression loss can be used. The votes are clustered with OPTICS. This also works when parts of an instance are occluded, whereas other approaches might create new instances.
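The regression targets for this voting scheme can be sketched as below. This is an illustrative example, not the authors' implementation: the nested-list mask format, the convention that 0 marks background, and both function names are assumptions. The OPTICS clustering step is left out.

```python
def centroid_targets(instance_mask):
    """Per-pixel (dy, dx) vectors pointing to the pixel's instance centroid.

    instance_mask is a list of rows; 0 marks background, any other
    integer is an instance id. Background pixels get None.
    """
    sums = {}  # instance id -> [sum_y, sum_x, pixel_count]
    for y, row in enumerate(instance_mask):
        for x, inst in enumerate(row):
            if inst != 0:
                acc = sums.setdefault(inst, [0.0, 0.0, 0])
                acc[0] += y
                acc[1] += x
                acc[2] += 1
    centroids = {i: (sy / n, sx / n) for i, (sy, sx, n) in sums.items()}
    targets = []
    for y, row in enumerate(instance_mask):
        targets.append([
            None if inst == 0
            else (centroids[inst][0] - y, centroids[inst][1] - x)
            for x, inst in enumerate(row)
        ])
    return targets


def l1_vote_loss(predicted, targets):
    """Mean L1 distance between predicted and target vectors.

    Only pixels that belong to an instance contribute to the loss.
    """
    total, count = 0.0, 0
    for pred_row, tgt_row in zip(predicted, targets):
        for pred, tgt in zip(pred_row, tgt_row):
            if tgt is not None:
                total += abs(pred[0] - tgt[0]) + abs(pred[1] - tgt[1])
                count += 1
    return total / count if count else 0.0
```

At inference time, adding each predicted vector to its pixel coordinate gives one centroid vote per pixel, and those votes are what the clustering step would group into instances.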
Depth Regression
Standard L_1 is used for supervised learning of inverse depth.
Results
Dataset: CityScapes (2,975 training and 500 validation images)
Quantitative Results
Qualitative Results
Discussion
Positive Aspects
- General framework
- Bayesian based
- Robust to initialization, converges fast
Negative Aspects
- One equation transformation is unclear
- No code yet
- No testing on large datasets
- Spelling issues
Other aspects
Since the uncertainty decreases over time, the weight 1/\sigma^2 of each single-task loss grows, and with it the effective learning rate. This can be compensated by choosing the learning rate accordingly.
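A small sketch makes the scaling explicit for one regression task. It assumes plain SGD and the log-variance parameterization s = \log \sigma^2; the function name is hypothetical, and optimizers that normalize gradient magnitudes would behave differently.

```python
import math


def effective_learning_rate(base_lr, log_variance):
    """Step size seen by the shared weights W for one regression task.

    The weighted loss term is 0.5 * exp(-s) * L(W), so its gradient with
    respect to W is 0.5 * exp(-s) * dL/dW. Under plain SGD this matches
    training on L(W) alone with a rescaled learning rate.
    """
    return base_lr * 0.5 * math.exp(-log_variance)
```

As s decreases during training, exp(-s) grows, so the same base learning rate produces larger steps on W.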