Using our wildest imagination, we can picture a dataset consisting
of features X and labels Y, as on the left. Also imagine that we’d
like to generalize this relationship to additional values of X -
that we’d like to predict future values of Y based on what we’ve
already seen.
With our imagination now undoubtedly spent, we can take a very
simple approach to modeling the relationship between X and Y by
just drawing a line to the general trend of the data.
A Simple Model
Our simple model isn’t the best at modeling the relationship -
clearly there's information in the data that it's failing to
capture.
We'll measure the performance of our model by looking at the
mean-squared error
between its predictions and the true values (displayed in the
bottom bar chart). Our model is close to some of the training
points, but overall there's definitely room for improvement.
The error on the training data is important for model tuning, but
what we really care about is how it performs on data we haven't
seen before, called test data. So let's check that out as well.
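To make this concrete, here is a minimal stand-alone sketch in Python, using NumPy and a small synthetic dataset invented purely for illustration (the true_fn, make_data, and mse helpers are assumptions of this sketch, not the data behind the charts). It fits a straight line to the training data and compares its mean-squared error on the training and test sets:

import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # The "true" relationship between X and Y (unknown to the model).
    return np.sin(2.0 * x)

def make_data(n, rng):
    # Draw n noisy (X, Y) pairs from the true relationship.
    x = np.sort(rng.uniform(0.0, 3.0, n))
    y = true_fn(x) + rng.normal(0.0, 0.3, n)
    return x, y

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

x_train, y_train = make_data(10, rng)
x_test, y_test = make_data(10, rng)

# A very simple model: a single straight line fit to the training data.
line = np.poly1d(np.polyfit(x_train, y_train, deg=1))

print("train MSE:", mse(y_train, line(x_train)))
print("test MSE: ", mse(y_test, line(x_test)))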
Low Complexity & Underfitting
Uh-oh, it looks like our earlier suspicions were correct - our
model is garbage. The test error is even higher than the train
error!
In this case, we say that our model is
underfitting the data: our
model is so simple that it fails to adequately capture the
relationships in the data. The high test error is a direct result
of the lack of complexity of our model.
An underfit model is one that is too simple to accurately
capture the relationships between its features X and label
Y.
A Complex Model
Our previous model performed poorly because it was too simple.
Let's try our luck with something more complex. In fact, let's get
as complex as we can - let's train a model that predicts every
point in our training data perfectly.
Great! Now our training error is zero. As the old saying goes in
Tennessee: Fool me once - shame on you. Fool me twice - er... you
can't get fooled again ;).
High Complexity & Overfitting
Wait a second... Even though our model's training error was
effectively zero, the error on our test data is high. What gives?
Unsurprisingly, our model is too complicated. We say that it
overfits the data. Instead of
learning the true trends underlying our dataset, it memorized
noise and, as a result, the model is not generalizable to datasets
beyond its training data.
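Continuing the illustrative sketch from earlier (and reusing x_train, y_train, x_test, y_test, and mse), one simple way to build a model that reproduces every training point exactly is a 1-nearest-neighbor rule - a stand-in for a maximally complex model, not the model family used in the charts here:

def predict_1nn(x_query, x_train, y_train):
    # For each query point, return the label of the closest training point.
    # This memorizes the training data perfectly, noise and all.
    idx = np.abs(np.atleast_1d(x_query)[:, None] - x_train[None, :]).argmin(axis=1)
    return y_train[idx]

print("train MSE:", mse(y_train, predict_1nn(x_train, x_train, y_train)))  # exactly zero
print("test MSE: ", mse(y_test, predict_1nn(x_test, x_train, y_train)))    # typically much higher,
                                                                           # since the model memorized noise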
Overfitting refers to the case when a model is so specific to
the data on which it was trained that it is no longer applicable
to different datasets.
In situations where your training error is low but your test error
is high, you've likely overfit your model.
Test Error Decomposition
Our test error can come as a result of both under- and over-
fitting our data, but how do the two relate to each other?
In the general case,
mean-squared error can be decomposed into three components:
error due to bias, error due to variance, and error due to
noise.
Or, mathematically:

E[(y - \hat{f}(x))^2] = \mathrm{Bias}[\hat{f}(x)]^2 + \mathrm{Var}[\hat{f}(x)] + \sigma_e^2

We can’t do much about the irreducible noise term \sigma_e^2, but
we can make use of the relationship between both bias and variance
to obtain better predictions.
Bias
Bias represents the difference between the average prediction
and the true value:

\mathrm{Bias}[\hat{f}(x)] = E[\hat{f}(x)] - f(x)

The E[\hat{f}(x)] term is a tricky one. It refers to the average
prediction after the model has been trained over several
independent datasets. We can think of the bias as measuring a
systematic error in prediction.
These different model realizations are shown in the
top chart, while the error decomposition (for each point of data)
is shown in the bottom chart.
For underfit (low-complexity) models, the majority of our error
comes from bias.
Variance
As with bias, the notion of variance also relates to different
realizations of our model. Specifically, variance measures how
much, on average, predictions vary for a given data point:

\mathrm{Var}[\hat{f}(x)] = E\big[(\hat{f}(x) - E[\hat{f}(x)])^2\big]
As you can see in the bottom plot, predictions from overfit
(high-complexity) models show a lot more error from variance than
from bias. It’s easy to imagine that any unseen data points will
be predicted with high error.
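To make these two quantities concrete, here is a rough Python sketch (continuing the illustrative synthetic setup from earlier, and reusing make_data and true_fn) that estimates squared bias and variance empirically by training the same polynomial model on many independently drawn training sets. The degrees and sample sizes are arbitrary choices for illustration:

def estimate_bias_variance(degree, x_query, n_datasets=200, n_points=10, seed=1):
    # Train one model per independently drawn dataset, then compare the
    # predictions at each query point against their average and the truth.
    rng = np.random.default_rng(seed)
    preds = np.empty((n_datasets, x_query.size))
    for i in range(n_datasets):
        x_tr, y_tr = make_data(n_points, rng)
        model = np.poly1d(np.polyfit(x_tr, y_tr, deg=degree))
        preds[i] = model(x_query)
    avg_pred = preds.mean(axis=0)
    bias_sq = (avg_pred - true_fn(x_query)) ** 2   # squared bias at each query point
    variance = preds.var(axis=0)                   # variance at each query point
    return bias_sq.mean(), variance.mean()

# Query grid kept inside the training range to avoid extreme extrapolation.
x_query = np.linspace(0.2, 2.8, 50)
for degree in (1, 3, 7):
    b2, v = estimate_bias_variance(degree, x_query)
    print(f"degree {degree}: bias^2 = {b2:.3f}, variance = {v:.3f}")

In this sketch you should see squared bias dominate at low degrees and variance dominate at high degrees, mirroring the charts.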
Finding A Balance
To obtain our best results, we should work to find a happy medium
between a model that is so basic it fails to learn meaningful
patterns in our data, and one that is so complex it fails to
generalize to unseen data.
In other words, we don’t want an underfit model, but we don’t want
an overfit model either. We want something in between - something
with enough complexity to learn the generalizable patterns
in our data.
By trading some bias for variance (i.e. increasing the
complexity of our model), and without going overboard, we can
find a balanced model for our dataset.
Across Complexities
We just showed, at different levels of complexity, a sample of
model realizations alongside their corresponding prediction error
decompositions.
Let’s direct our focus to the error decompositions across model
complexities.
For each level of complexity, we’ll aggregate the error
decomposition across all data points, and plot the aggregated
errors at that level of complexity.
This aggregation applied to our balanced model (i.e. the middle
level of complexity) is shown to the left.
The Bias Variance Trade-off
Repeating this aggregation across our range of model complexities,
we can see that the relationship between bias and variance in the
prediction error manifests itself as a U-shaped curve detailing the
trade-off between the two.
When a model is too simple (i.e. small values along the x-axis),
it ignores useful information, and the error is composed mostly of
that from bias.
When a model is too complex (i.e. large values along the x-axis),
it memorizes non-general patterns, and the error is composed
mostly of that from variance.
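As a rough numerical counterpart to this curve (again using the illustrative estimate_bias_variance helper and x_query grid sketched earlier, with polynomial degree as the stand-in for complexity), we can sweep the degree and watch squared bias shrink while variance grows:

# Sweep model complexity (polynomial degree) and aggregate the error components.
for degree in range(1, 8):
    b2, v = estimate_bias_variance(degree, x_query)
    print(f"degree {degree}: bias^2={b2:.3f}  variance={v:.3f}  sum={b2 + v:.3f}")
# The degree with the smallest bias^2 + variance (plus the constant noise term)
# marks the balanced model in this toy setup.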
The ideal model aims to minimize both bias and variance. It
lies in the sweet spot - not too simple, nor too complex.
Achieving such a balance will yield the minimum error.