Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics

Authors: Alex Kendall, Yarin Gal, Roberto Cipolla

Affiliation: University of Cambridge, University of Oxford

Task: Multi-Task Learning

Link: arXiv, Code (soon)

TL;DR: A multi-task loss function can be automatically defined with the help of homoscedastic uncertainty.

Task & Motivation

For tasks like Panoptic Segmentation, more than one task has to be solved at the same time: in this case, semantic and instance segmentation. Different tasks usually have different objectives and hence different loss functions. The overall loss is therefore a weighted sum of the single-task losses, but it is non-trivial to find weights that give good performance on all tasks.

This work proposes a general framework for finding good weights for a multi-task loss function. The chosen scenario is shown below:

multi-task-scenario

Multi-Task Learning

The importance of finding good weights is shown by the following figure.

multi-task-weights

Homoscedastic uncertainty

There are two types of uncertainty in a Bayesian model:

  1. Epistemic uncertainty is uncertainty in the model, which is due to lack of training data.
  2. Aleatoric uncertainty is uncertainty with respect to information that the data cannot explain. It can be decreased by observing all explanatory variables with increasing precision. There are two categories:
    • Data-dependent or heteroscedastic uncertainty depends on the input data and is predicted as a model output.
    • Task-dependent or homoscedastic uncertainty does not depend on the input data and is not a model output. It stays constant across inputs and varies only between tasks.

Task uncertainty captures the relative confidence between tasks and can be used as a basis for weighting the single-task losses.

Multi-Task Likelihoods

The loss function is derived by maximizing the joint likelihood of all task outputs.

Regression likelihood:

p(y | f^{W}(x)) = \mathcal{N}(f^{W}(x), \sigma^{2})

Classification likelihood:

p(y | f^{W}(x)) = Softmax(f^{W}(x))

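We assume that the task outputs are conditionally independent given the network output, so the joint likelihood factorizes: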

p(y_1, \dots, y_K | f^{W}(x)) = p(y_1 | f^{W}(x)) \dots p(y_K | f^{W}(x))

For two regression losses the final loss is:

\mathcal{L}(W, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\mathcal{L}_1(W) + \frac{1}{2\sigma_2^2}\mathcal{L}_2(W) + \log \sigma_1\sigma_2

with \mathcal{L}_1(W) = || y_1 - f^W(x)||^2 and similar for \mathcal{L}_2(W)
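
The step from the Gaussian likelihood to this loss can be made explicit. The following is a sketch for a scalar output, not a reproduction of the paper's derivation: taking the negative log of the regression likelihood and summing over two independent tasks gives the weighted form, and setting the derivative with respect to \sigma_i to zero shows which value of \sigma_i the loss prefers.

```latex
\begin{align}
-\log \mathcal{N}\big(y;\, f^{W}(x), \sigma^{2}\big)
  &= \frac{1}{2\sigma^{2}} \big(y - f^{W}(x)\big)^{2} + \log \sigma + \tfrac{1}{2}\log 2\pi \\
-\log p(y_1, y_2 \mid f^{W}(x))
  &= \frac{1}{2\sigma_1^{2}} \mathcal{L}_1(W) + \frac{1}{2\sigma_2^{2}} \mathcal{L}_2(W)
     + \log \sigma_1 \sigma_2 + \text{const} \\
\frac{\partial \mathcal{L}}{\partial \sigma_i}
  = -\frac{\mathcal{L}_i(W)}{\sigma_i^{3}} + \frac{1}{\sigma_i} = 0
  &\;\Longrightarrow\; \sigma_i^{2} = \mathcal{L}_i(W)
\end{align}
```

For fixed W, the optimal variance of a task therefore equals its current loss: a task with a large residual is down-weighted, and the \log \sigma_i term keeps \sigma_i from growing without bound.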

For one regression loss and one classification loss the final loss is:

\mathcal{L}(W, \sigma_1, \sigma_2) \approx \frac{1}{2\sigma_1^2}\mathcal{L}_1(W) + \frac{1}{\sigma_2^2}\mathcal{L}_2(W) + \log \sigma_1\sigma_2

with \mathcal{L}_1(W) = || y_1 - f^W(x)||^2 and \mathcal{L}_2(W) = - \log Softmax(y_2, f^W(x))

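When the noise \sigma of a task decreases, the weight of its loss increases, and vice versa. The last term penalizes values of \sigma_1, \sigma_2 that are too large. In practice, the log-variance s := \log \sigma^2 is learned instead of the variance, which is numerically more stable because the loss then avoids divisions by zero.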
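
A minimal sketch of the combined loss under the log-variance parameterization, written in plain Python for illustration. The function name `uncertainty_weighted_loss` and its arguments are hypothetical and not taken from the paper's code (which is not released). Regression terms are weighted by \frac{1}{2\sigma^2}; for classification terms this sketch assumes a weight of \frac{1}{\sigma^2}, following the paper's approximation of the scaled softmax likelihood.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars, is_regression):
    """Combine single-task losses using log-variances s_i = log(sigma_i^2).

    task_losses:   list of scalar single-task losses L_i(W)
    log_vars:      list of s_i (trainable parameters in a real setup)
    is_regression: list of bools; False marks a classification task
    """
    total = 0.0
    for loss, s, regression in zip(task_losses, log_vars, is_regression):
        precision = math.exp(-s)  # equals 1 / sigma^2 without any division
        # Assumption: 1/(2 sigma^2) for regression, 1/sigma^2 for classification.
        weight = 0.5 * precision if regression else precision
        total += weight * loss + 0.5 * s  # 0.5 * s equals log(sigma)
    return total


# Example: one regression and one classification task, both with sigma = 1.
combined = uncertainty_weighted_loss([2.0, 0.5], [0.0, 0.0], [True, False])
```

In an actual training loop the entries of `log_vars` would be trainable parameters of an automatic differentiation framework, optimized jointly with the network weights W.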

Model

The proposed model follows the general encoder-decoder scheme. Specifically, DeepLabV3 is used with ResNet101 as backbone. A separate decoder is used for each task.

Semantic Segmentation

The standard cross-entropy loss is used.

Instance Segmentation

Each pixel learns to predict the centroid of its instance (as a vector pointing to it), so that an L_1 regression loss can be used. The resulting votes are clustered with OPTICS. This also works when parts of an instance are occluded, whereas other approaches might create new instances in that case.
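
The regression target can be illustrated with a small sketch in plain Python. It assumes an instance mask given as nested lists with 0 meaning "no instance"; the helper names `centroid_targets` and `l1_vector_loss` are hypothetical. The OPTICS clustering of the predicted votes is not shown.

```python
def centroid_targets(instance_mask):
    """Map each labelled pixel to the vector pointing at its instance centroid.

    instance_mask: 2D list of ints, 0 = no instance (assumed convention).
    Returns a dict {(row, col): (d_row, d_col)}.
    """
    pixels_per_instance = {}
    for r, row in enumerate(instance_mask):
        for c, inst in enumerate(row):
            if inst != 0:
                pixels_per_instance.setdefault(inst, []).append((r, c))

    targets = {}
    for coords in pixels_per_instance.values():
        centroid_r = sum(r for r, _ in coords) / len(coords)
        centroid_c = sum(c for _, c in coords) / len(coords)
        for r, c in coords:
            targets[(r, c)] = (centroid_r - r, centroid_c - c)
    return targets


def l1_vector_loss(predicted, targets):
    """Mean L_1 distance between predicted and target vectors over labelled pixels."""
    if not targets:
        return 0.0
    total = 0.0
    for pixel, (target_r, target_c) in targets.items():
        pred_r, pred_c = predicted[pixel]
        total += abs(pred_r - target_r) + abs(pred_c - target_c)
    return total / len(targets)


mask = [[1, 1, 0],
        [0, 2, 2]]
targets = centroid_targets(mask)
zero_prediction = {pixel: (0.0, 0.0) for pixel in targets}
loss = l1_vector_loss(zero_prediction, targets)
```

Because every visible pixel votes for the same centroid, two parts of an object separated by an occluder still point to one location, which is what lets the clustering step merge them into a single instance.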

multi-task-instance-centroid

Depth Regression

A standard L_1 loss is used for supervised learning of inverse depth.
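
As an illustration, an L_1 loss on inverse depth could look like the sketch below. The function name `inverse_depth_l1` is hypothetical, and treating pixels with a depth of zero or less as unlabelled is an assumption of this sketch.

```python
def inverse_depth_l1(pred_inv_depth, depth_m):
    """Mean L_1 error between predicted and ground-truth inverse depth.

    pred_inv_depth: flat list of predicted inverse depths (1 / metres)
    depth_m:        flat list of ground-truth depths in metres;
                    values <= 0 are treated as missing labels (assumption)
    """
    errors = [
        abs(pred - 1.0 / depth)
        for pred, depth in zip(pred_inv_depth, depth_m)
        if depth > 0
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Third pixel has no label and is skipped.
depth_loss = inverse_depth_l1([0.5, 0.1, 0.3], [2.0, 5.0, 0.0])
```

One practical property of the inverse parameterization is that very distant points map to values near zero, so they do not dominate the loss the way raw depth values would.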

Results

Dataset: CityScapes (2,975 training and 500 validation images)

Quantitative Results

multi-task-results1

multi-task-results2

Qualitative Results

multi-task-results3

Discussion

Positive Aspects

  • General framework
  • Bayesian based
  • Robust to initialization, converges fast

Negative Aspects

  • One equation transformation is unclear
  • No code yet
  • No testing on large datasets
  • Spelling issues

Other aspects

Since the learned uncertainty decreases during training, the weight of each single-task loss increases, and with it the effective learning rate. This can be compensated by adjusting the learning rate accordingly.
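
The effect can be written out for plain gradient descent with step size \eta. This is an illustrative sketch, not taken from the paper; \eta_i^{\text{eff}} is a name introduced here, and the weights are those of the regression terms.

```latex
\begin{align}
W &\leftarrow W - \eta \, \nabla_W \mathcal{L}
   = W - \sum_i \frac{\eta}{2\sigma_i^{2}} \, \nabla_W \mathcal{L}_i(W) \\
\eta_i^{\text{eff}} &= \frac{\eta}{2\sigma_i^{2}}
\end{align}
```

For example, if \sigma_i^2 drops from 1 to 0.25 during training, \eta_i^{\text{eff}} grows from \eta/2 to 2\eta, a factor of four, even though \eta itself is unchanged.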