A team builds a single multi-task model to simultaneously predict (1) movie review sentiment and (2) whether a medical record indicates diabetes. Both tasks perform worse than single-task baselines. What is the most likely cause?
A. The batch size was too small to support gradient updates from two different loss functions simultaneously.
B. The tasks don't share meaningful feature structure, so shared layers are pulled toward incompatible representations, harming both tasks.
C. Multi-task learning always requires significantly more training data than single-task models to achieve competitive performance.
D. The learning rate should be doubled to compensate for the gradient signal being split across two tasks.
Answer: B
Multi-task learning only helps when tasks share underlying structure — when features useful for one task are also useful for another. Sentiment analysis and diabetes prediction require completely different feature abstractions (syntactic/semantic text patterns vs. clinical biomarkers). Forcing shared layers to serve both tasks distorts the representation for both, hurting performance. MTL is not universally beneficial: the tasks must be meaningfully related. This is the key failure mode that the 'implicit regularizer' framing can obscure.
Question 2 Multiple Choice
In hard parameter sharing, why does training with auxiliary tasks often improve performance on the MAIN task, even when no new labeled examples are added for that task?
A. Auxiliary tasks supply more training labels for the main task by transferring examples across task heads.
B. Shared layers are forced to learn features that generalize across all tasks, acting as an implicit regularizer that prevents overfitting to quirks in the main task's training data.
C. Auxiliary tasks reduce the effective learning rate for the main task's output head, preventing gradient explosion.
D. Separate task-specific heads isolate the auxiliary tasks, ensuring they don't influence the shared representation at all.
Answer: B
This is the core counterintuitive insight of MTL. The shared backbone cannot memorize idiosyncrasies of a single task because the same weights must serve all tasks simultaneously. This implicit regularization reduces overfitting — the shared layers learn a more general, robust feature space than any single task would force. The auxiliary tasks act like diverse training signals that shape better-generalizing internal representations. The benefit comes from the gradient diversity, not from additional labeled examples for the main task.
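A minimal sketch of hard parameter sharing, assuming a toy model with one scalar shared weight and one scalar head per task, each trained with squared error. The model and all function names are hypothetical and chosen only for illustration. The point it shows is that the shared weight is updated with the sum of both tasks' gradients, while each head is updated only by its own task's loss.

```python
def shared_grad(w, head, x, target):
    """Gradient of (head * w * x - target)**2 with respect to the shared weight w."""
    error = head * w * x - target
    return 2.0 * error * head * x


def head_grad(w, head, x, target):
    """Gradient of the same loss with respect to the task-specific head weight."""
    error = head * w * x - target
    return 2.0 * error * w * x


def mtl_step(w, head_main, head_aux, x, t_main, t_aux, lr=0.01):
    """One gradient-descent step on the summed loss L_main + L_aux (toy model)."""
    # Shared weight: the gradients from BOTH tasks are added together,
    # so it cannot move freely toward whatever suits a single task.
    g_shared = shared_grad(w, head_main, x, t_main) + shared_grad(w, head_aux, x, t_aux)
    # Heads: each one only sees the loss of its own task.
    g_main = head_grad(w, head_main, x, t_main)
    g_aux = head_grad(w, head_aux, x, t_aux)
    return w - lr * g_shared, head_main - lr * g_main, head_aux - lr * g_aux
```

For instance, with the shared weight and both heads at 1.0, input 1.0, and targets 2.0 and 0.0, the two shared gradients are -2.0 and +2.0. They cancel, so the shared weight does not move on that step, while each head still updates toward its own target.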
Question 3 True / False
Multi-task learning can improve a model's performance on a target task even when no additional labeled data is provided for that target task.
True
False
Answer: True
This is one of the most powerful properties of MTL. Auxiliary tasks provide diverse gradient signals that shape the shared representation in ways a single task's gradients would not. The main task benefits from a representation that has been implicitly regularized by the need to also solve other tasks. The additional labels are for auxiliary tasks only — yet the main task improves because its shared layers are better trained. This is why auxiliary tasks are sometimes deliberately chosen for a main task of interest, even when the auxiliary task's predictions are not needed.
Question 4 True / False
Adding more tasks to a multi-task learning setup usually improves the performance of most tasks in the model, because more diverse gradients produce better shared representations.
True
False
Answer: False
More tasks are only beneficial if those tasks share relevant structure. Unrelated or conflicting tasks introduce gradients that actively harm the shared representation for other tasks — a phenomenon called negative transfer. Additionally, even with compatible tasks, task imbalance can cause one task's loss to dominate training, distorting the shared representation. MTL requires careful task selection and balancing; it is not a free lunch. Adding an incompatible task can make every other task perform worse.
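The task-imbalance point above can be illustrated with a small sketch. It assumes the total training loss is a weighted sum of per-task losses and uses made-up loss values. The inverse-scale weighting shown is only one simple illustrative heuristic, not a recommended or standard method, and a task's share of the loss is only a rough proxy for its influence on the shared layers.

```python
def loss_shares(losses, weights=None):
    """Return each task's fraction of the weighted total loss."""
    if weights is None:
        weights = {name: 1.0 for name in losses}
    weighted = {name: weights[name] * value for name, value in losses.items()}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}


def inverse_scale_weights(reference_losses):
    """Illustrative heuristic (an assumption): weight each task by 1 / its reference loss."""
    return {name: 1.0 / value for name, value in reference_losses.items()}


# Hypothetical loss values: a regression task on a large numeric scale
# next to a classification task whose loss is naturally small.
losses = {"regression": 200.0, "classification": 0.5}

# Unweighted sum: the regression loss makes up almost all of the total.
unweighted = loss_shares(losses)

# After rescaling, both tasks contribute roughly equally to the total.
balanced = loss_shares(losses, inverse_scale_weights(losses))
```

With these made-up numbers, the regression task accounts for more than 99% of the unweighted total, which is the kind of dominance the explanation warns about.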
Question 5 Short Answer
Explain why task compatibility is critical for multi-task learning to work, using the concept of shared representations.
Model answer: Shared layers in MTL must learn features that are simultaneously useful for all tasks. If tasks share underlying structure — they require similar abstractions from the input — then the shared representation benefits everyone: the gradient from each task reinforces features useful to other tasks. If tasks are incompatible — their required features are unrelated or contradictory — then gradient signals from different tasks pull the shared weights in different directions, producing a blurred representation that is mediocre for everyone. Task compatibility is the precondition for the shared representation to function as an implicit regularizer rather than a source of interference.
The shared representation is the mechanism behind both MTL's benefits and its failure mode. Think of it as a shared language: if two tasks can express their needs in overlapping terms, they help each other. If they speak entirely different languages, forcing them to share a vocabulary produces incoherence. The art of MTL is identifying tasks whose 'languages' (feature needs) overlap enough that shared training is mutually beneficial. Diagnostics such as task-relatedness measures, the similarity of per-task gradients on the shared layers, or monitoring for negative transfer during training (a task falling below its single-task baseline) can help identify mismatched task combinations.
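One way to make the gradient similarity idea concrete is the cosine similarity between two tasks' gradients on the shared weights. The sketch below assumes those gradients have already been flattened into plain lists of floats; the function name and the example vectors are hypothetical. A value near +1 suggests the tasks push the shared weights in the same direction, a value near -1 suggests they pull against each other, and a value near 0 suggests they are largely unrelated.

```python
import math


def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero gradient has no direction; report "no alignment" by convention.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical per-task gradients on the same shared weights.
aligned = gradient_cosine([1.0, 2.0], [2.0, 4.0])       # same direction
conflicting = gradient_cosine([1.0, 0.0], [-1.0, 0.0])  # opposite directions
unrelated = gradient_cosine([1.0, 0.0], [0.0, 1.0])     # orthogonal
```

A single measurement is noisy, so in practice such a value would usually be tracked over many training steps before drawing conclusions about a task pairing.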