Application: Regression problems

In a regression problem, we estimate a real-valued function from observed input-output pairs, so that we can interpolate or extrapolate its value at new inputs.

In particular, linear regression is basically fitting a line to a collection of data points.

Let's show an example of "linear-like" data that would benefit from linear regression:
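The original code cell is not shown here, so the following is a minimal sketch of how such data could be generated. The slope, intercept, noise level, number of points, and random seed are all assumed values chosen for illustration.

```python
import numpy as np

# Assumed values for illustration; the original notebook's values are not shown.
rng = np.random.default_rng(seed=0)
m, n = 2.0, 1.0        # slope and intercept of the underlying line
num_points = 50

# Evenly spaced inputs, and outputs scattered around the line y = m*x + n.
x = np.linspace(0.0, 10.0, num_points)
noise = rng.normal(loc=0.0, scale=1.0, size=num_points)
y = m * x + n + noise
```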

The code above generated noisy linear data with a given slope $m$ and intercept $n$, that is, points scattered around the line $y = m x + n$.

Since we have 2D (x,y) pairs of points, we can display them using a scatter plot:
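A scatter plot along these lines could be drawn with Matplotlib. The data is regenerated so the snippet stands alone; the slope, intercept, and noise level are assumed values, not the original notebook's.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed noisy linear data (illustrative values only).
rng = np.random.default_rng(seed=0)
m, n = 2.0, 1.0
x = np.linspace(0.0, 10.0, 50)
y = m * x + n + rng.normal(scale=1.0, size=x.size)

# One marker per (x, y) pair.
fig, ax = plt.subplots()
ax.scatter(x, y, label="noisy data")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards.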

As we can see, this data clearly has a line-like nature. NumPy's polyfit function fits a polynomial of a chosen degree to data by least squares, so with degree 1 we can use it to get a line estimate:
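As a sketch, a degree-1 fit with `np.polyfit` might look like this. The data values are assumed, and `m_hat` and `n_hat` are illustrative names for the estimated slope and intercept.

```python
import numpy as np

# Assumed noisy linear data (illustrative values only).
rng = np.random.default_rng(seed=0)
m, n = 2.0, 1.0
x = np.linspace(0.0, 10.0, 50)
y = m * x + n + rng.normal(scale=1.0, size=x.size)

# np.polyfit returns coefficients from the highest power down,
# so a degree-1 fit yields (slope, intercept).
m_hat, n_hat = np.polyfit(x, y, deg=1)
print(f"estimated slope: {m_hat:.3f}, estimated intercept: {n_hat:.3f}")
```

The estimates should land near the true $m$ and $n$, though not exactly on them because of the noise.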

Let's show the estimated line alongside our data:
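One way to overlay the fitted line on the scatter plot is sketched below, again with assumed data values and illustrative variable names.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed noisy linear data (illustrative values only).
rng = np.random.default_rng(seed=0)
m, n = 2.0, 1.0
x = np.linspace(0.0, 10.0, 50)
y = m * x + n + rng.normal(scale=1.0, size=x.size)

# Fit a line and evaluate it at the sample inputs.
m_hat, n_hat = np.polyfit(x, y, deg=1)
y_hat = m_hat * x + n_hat

fig, ax = plt.subplots()
ax.scatter(x, y, label="noisy data")
ax.plot(x, y_hat, color="red", label="fitted line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```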

As we can see, visually this looks like a good fit. However, we would also like a numerical score that quantifies how good the fit is, and there are several metrics we can use.

One is the mean squared error (MSE), which is the average of the squared differences between the actual $y$ values and the values $\hat{y}$ predicted by our linear regression model. The closer the MSE is to zero, the better the fit to the given data. There's also the root-mean-squared error (RMSE), which is just the square root of the MSE and is therefore expressed in the same units as $y$. Finally, there is the $R^2$ score (the coefficient of determination), which indicates a better fit as it approaches 1.0.
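For reference, these three metrics are usually defined as follows, where $N$ is the number of data points and $\bar{y}$ is the mean of the observed $y$ values ($N$ is used here to avoid a clash with the intercept $n$):

$$
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
$$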

We can quickly copy the definitions from the textbook:
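A minimal sketch of these metrics in NumPy, applied to a line fit on assumed data, could look like the following. The function names `mse`, `rmse`, and `r_squared` are illustrative helpers, not taken from the original notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as y."""
    return float(np.sqrt(mse(y_true, y_pred)))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Assumed noisy linear data and its degree-1 fit (illustrative values only).
rng = np.random.default_rng(seed=0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
m_hat, n_hat = np.polyfit(x, y, deg=1)
y_hat = m_hat * x + n_hat

mse_linear = mse(y, y_hat)
rmse_linear = rmse(y, y_hat)
r2_linear = r_squared(y, y_hat)
print(f"MSE: {mse_linear:.3f}  RMSE: {rmse_linear:.3f}  R^2: {r2_linear:.3f}")
```

If scikit-learn is available, `sklearn.metrics.mean_squared_error` and `sklearn.metrics.r2_score` compute the same quantities.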

As we see, the MSE is relatively low and the $R^2$ score is close to 1, meaning that the fit is quite good quantitatively as well. Note that since our data is noisy, we can't find any line that would have an MSE of exactly 0.0.

Here's an example of some data which is not linear, but still polynomial in nature:
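The original data is not shown, so this sketch assumes a quadratic with added Gaussian noise. The coefficients `a`, `b`, `c`, the input range, and the noise level are all illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed quadratic data: y = a*x^2 + b*x + c + noise (illustrative values only).
rng = np.random.default_rng(seed=1)
a, b, c = 2.0, -1.0, 0.5
x = np.linspace(-3.0, 3.0, 60)
y = a * x**2 + b * x + c + rng.normal(scale=1.0, size=x.size)

fig, ax = plt.subplots()
ax.scatter(x, y, label="noisy quadratic data")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```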

Let's see what happens when we try to fit a line:
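As a sketch, fitting a degree-1 polynomial to assumed quadratic data and plotting the result could look like this (all data values are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed quadratic data (illustrative values only).
rng = np.random.default_rng(seed=1)
a, b, c = 2.0, -1.0, 0.5
x = np.linspace(-3.0, 3.0, 60)
y = a * x**2 + b * x + c + rng.normal(scale=1.0, size=x.size)

# A straight line cannot follow the curvature of the data.
slope, intercept = np.polyfit(x, y, deg=1)
y_line = slope * x + intercept

fig, ax = plt.subplots()
ax.scatter(x, y, label="noisy quadratic data")
ax.plot(x, y_line, color="red", label="degree-1 fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```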

As we see, the fit is visually not very good. The MSE confirms that the line is a poor model for this data:
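Under the same assumed quadratic data, the error of the line fit can be computed directly. The variable names are illustrative, and the exact numbers depend on the assumed coefficients and noise level.

```python
import numpy as np

# Assumed quadratic data (illustrative values only); noise variance is 1.0.
rng = np.random.default_rng(seed=1)
a, b, c = 2.0, -1.0, 0.5
x = np.linspace(-3.0, 3.0, 60)
y = a * x**2 + b * x + c + rng.normal(scale=1.0, size=x.size)

# Degree-1 fit and its metrics.
slope, intercept = np.polyfit(x, y, deg=1)
y_line = slope * x + intercept

mse_line = float(np.mean((y - y_line) ** 2))
r2_line = float(1.0 - np.sum((y - y_line) ** 2) / np.sum((y - np.mean(y)) ** 2))
print(f"MSE: {mse_line:.3f}  R^2: {r2_line:.3f}")
```

Because the noise variance in this sketch is 1.0, an MSE far above 1.0 indicates that most of the error comes from the model's shape rather than from the noise.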

Now let's try to fit a degree-2 (quadratic) polynomial:
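A sketch of the degree-2 fit on the same assumed data, compared against the line fit, might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed quadratic data (illustrative values only).
rng = np.random.default_rng(seed=1)
a, b, c = 2.0, -1.0, 0.5
x = np.linspace(-3.0, 3.0, 60)
y = a * x**2 + b * x + c + rng.normal(scale=1.0, size=x.size)

# Degree-1 fit for comparison, then the degree-2 fit.
y_line = np.polyval(np.polyfit(x, y, deg=1), x)
coeffs = np.polyfit(x, y, deg=2)   # [a_hat, b_hat, c_hat], highest power first
y_quad = np.polyval(coeffs, x)

mse_line = float(np.mean((y - y_line) ** 2))
mse_quad = float(np.mean((y - y_quad) ** 2))
print(f"MSE (degree 1): {mse_line:.3f}  MSE (degree 2): {mse_quad:.3f}")

fig, ax = plt.subplots()
ax.scatter(x, y, label="noisy quadratic data")
ax.plot(x, y_quad, color="red", label="degree-2 fit")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```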

Much better. Finally, let's see what we can do with some non-linear data from a different family:
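Since the original data is not shown, the sketch below assumes a noisy sine wave. The amplitude, angular frequency, phase, offset, and noise level are illustrative values.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed sinusoidal data: y = A*sin(omega*x + phi) + offset + noise.
rng = np.random.default_rng(seed=2)
A, omega, phi, offset = 2.0, 2.0, 0.3, 1.0
x = np.linspace(0.0, 10.0, 200)
y = A * np.sin(omega * x + phi) + offset + rng.normal(scale=0.3, size=x.size)

fig, ax = plt.subplots()
ax.scatter(x, y, s=10, label="noisy periodic data")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```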

By looking at the data, we can see that it is wave-like and periodic, so we would like to fit some kind of sinusoid to it. To do this, we can use SciPy's curve_fit, passing it a model function that describes a sine wave with free parameters:
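A minimal sketch with `scipy.optimize.curve_fit` follows. The model function `sine_model`, the assumed data parameters, and the initial guess `p0` are all illustrative. Sine fits tend to be sensitive to the starting frequency, so this sketch supplies a rough guess rather than relying on the default.

```python
import numpy as np
from scipy.optimize import curve_fit

def sine_model(x, amplitude, omega, phase, offset):
    """Sine wave with free amplitude, angular frequency, phase and vertical offset."""
    return amplitude * np.sin(omega * x + phase) + offset

# Assumed sinusoidal data (illustrative values only).
rng = np.random.default_rng(seed=2)
x = np.linspace(0.0, 10.0, 200)
y = 2.0 * np.sin(2.0 * x + 0.3) + 1.0 + rng.normal(scale=0.3, size=x.size)

# Rough starting point: amplitude and offset read off the data,
# frequency guessed by eye from the plot (an assumption).
p0 = [(y.max() - y.min()) / 2.0, 1.9, 0.0, float(np.mean(y))]
popt, pcov = curve_fit(sine_model, x, y, p0=p0)
y_sine = sine_model(x, *popt)

mse_sine = float(np.mean((y - y_sine) ** 2))
r2_sine = float(1.0 - np.sum((y - y_sine) ** 2) / np.sum((y - np.mean(y)) ** 2))
print("fitted parameters:", np.round(popt, 3))
print(f"MSE: {mse_sine:.3f}  R^2: {r2_sine:.3f}")
```

The fitted curve `y_sine` can be drawn over the scatter plot with `ax.plot(x, y_sine)` to check the fit visually.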

Again, both the plot and the metrics indicate that this is a good fit for the generated data.