WEBVTT
00:00:00.820 --> 00:00:03.673
When performing linear regression, we have a number of data points.
00:00:03.673 --> 00:00:09.043
Let's say that we have 1, 2, 3 and so on up through M data points.
00:00:09.043 --> 00:00:12.013
Each data point has an output variable, Y,
00:00:12.013 --> 00:00:15.400
and a number of input variables, X1 through X N.
00:00:16.410 --> 00:00:20.400
So in our baseball example Y is the lifetime number of home runs.
00:00:20.400 --> 00:00:24.370
And our X1 and XN are things like height and weight.
00:00:24.370 --> 00:00:27.762
Our one through M samples might be different baseball players.
00:00:27.762 --> 00:00:32.572
So maybe data point one is Derek Jeter, data point two is Barry Bonds, and
00:00:32.572 --> 00:00:34.870
data point M is Babe Ruth.
00:00:34.870 --> 00:00:37.823
Generally speaking, we are trying to predict the values of
00:00:37.823 --> 00:00:41.779
the output variable for each data point, by multiplying the input variables by
00:00:41.779 --> 00:00:45.579
some set of coefficients that we're going to call theta 1 through theta N.
00:00:45.579 --> 00:00:49.105
Each theta, which we'll from here on out call the parameters or
00:00:49.105 --> 00:00:52.826
the weights of the model, tell us how important an input variable is
00:00:52.826 --> 00:00:55.589
when predicting a value for the output variable.
00:00:55.589 --> 00:00:57.381
So if theta 1 is very small,
00:00:57.381 --> 00:01:01.880
X1 must not be very important in general when predicting Y.
00:01:01.880 --> 00:01:03.940
Whereas if theta N is very large,
00:01:03.940 --> 00:01:07.300
then XN is generally a big contributor to the value of Y.
00:01:07.300 --> 00:01:10.420
This model is built in such a way that we can multiply each X by
00:01:10.420 --> 00:01:13.500
the corresponding theta, and sum them up to get Y.
00:01:13.500 --> 00:01:17.080
So that our final equation will look something like the equation down here.
00:01:17.080 --> 00:01:20.250
Theta 1 plus X1 plus theta 2 times X2,
00:01:20.250 --> 00:01:23.948
all the way up to theta N plus XN equals Y.
00:01:23.948 --> 00:01:26.670
And we'd want to be able to predict Y for each of our M data points.
00:01:27.780 --> 00:01:31.844
In this illustration, the dark blue points represent our reserve data points,
00:01:31.844 --> 00:01:34.712
whereas the green line shows the predictive value of Y for
00:01:34.712 --> 00:01:37.598
every value of X given the model that we may have created.
00:01:37.598 --> 00:01:41.380
The best equation is the one that's going to minimize the difference across all
00:01:41.380 --> 00:01:44.160
data points between our predicted Y, and our observed Y.
00:01:45.280 --> 00:01:48.876
What we need to do is find the thetas that produce the best predictions.
00:01:48.876 --> 00:01:52.960
That is, making these differences as small as possible.
00:01:52.960 --> 00:01:56.650
If we wanted to create a value that describes the total areas of our model,
00:01:56.650 --> 00:01:58.380
we'd probably sum up the areas.
00:01:58.380 --> 00:02:02.440
That is, sum over all of our data points from I equals 1, to M.
00:02:02.440 --> 00:02:05.290
The predicted Y minus the actual Y.
00:02:05.290 --> 00:02:08.320
However, since these errors can be both negative and
00:02:08.320 --> 00:02:11.570
positive, if we simply sum them up, we could have
00:02:11.570 --> 00:02:17.040
a total error term that's very close to 0, even if our model is very wrong.
00:02:17.040 --> 00:02:20.710
In order to correct this, rather than simply adding up the error terms,
00:02:20.710 --> 00:02:23.100
we're going to add the square of the error terms.
00:02:23.100 --> 00:02:26.940
This guarantees that the magnitude of each individual error term,
00:02:26.940 --> 00:02:29.680
Y predicted minus Y actual is positive.
00:02:30.920 --> 00:02:33.620
Why don't we make sure the distinction between input of variables and
00:02:33.620 --> 00:02:34.830
output of variables is clear.