Regression problems are those where we try to fit a real-valued function to given data, so that we can interpolate or extrapolate its values at new inputs.
In particular, linear regression is simply fitting a line to a collection of data points.
Let's show an example of "linear-like" data that would benefit from linear regression:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
xs = np.linspace(-10, 10, 100)
m = 5
n = 10
# Create some linear-like data
ys = (m * xs + n) + np.random.normal(0, 4, xs.size) # add random noise
The code above generates noisy linear data with a given slope $m$ and intercept $n$, following $y = m x + n$.
Since we have 2D (x,y) pairs of points, we can display them using a scatter plot:
plt.scatter(xs, ys)
plt.grid()
As we can see, this data clearly has a line-like nature. NumPy's np.polyfit
function fits a polynomial of a given degree to data, so we can use it with degree 1 to get a line estimate:
mhat, nhat = np.polyfit(xs, ys, 1)
print(mhat, nhat)
4.909841138001781 10.431102318932654
The estimates are close to the true values $m = 5$ and $n = 10$. Let's show the estimated line alongside our data:
plt.plot(xs, xs * mhat + nhat, color='tab:orange')
plt.scatter(xs, ys)
plt.grid()
As we can see, visually this looks like a good fit. However, we would also like to assign the line some kind of score that quantifies how good the fit is, and there are several metrics we can use.
One is the mean squared error (MSE), which measures the average squared difference between the actual $y$ values and the predicted $\hat{y}$ values from our linear regression model; the closer it is to zero, the better the fit. There's also the root mean squared error (RMSE), which is just the square root of the MSE. Finally, there is the $R^2$ metric, which signals a better fit as its score approaches 1.0.
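Written out, the standard definitions are (with $y_i$ the actual values, $\hat{y}_i$ the model's predictions, $\bar{y}$ the mean of the actual values, and $N$ the number of points):

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$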
We can quickly copy the definitions from the textbook:
def MSE(estimated_y, correct_y):
    """
    Mean Squared Error between the estimated_y and correct_y values.
    Inputs are both numpy arrays.
    """
    result = 0
    N = estimated_y.size
    for i in range(N):
        result += (estimated_y[i] - correct_y[i]) ** 2
    return result / N

def RMSE(estimated_y, correct_y):
    """
    Root Mean Squared Error between the estimated_y and correct_y values.
    Inputs are both numpy arrays.
    """
    return MSE(estimated_y, correct_y) ** 0.5

def R_squared(estimated_y, correct_y):
    """
    R^2 score between the estimated_y and correct_y values.
    Inputs are both numpy arrays.
    """
    MSE_of_estimates = MSE(estimated_y, correct_y)
    mean_y = correct_y.mean()
    mean_y_array = np.full(correct_y.size, mean_y)  # an array filled with mean-y
    MSE_wrt_mean_of_y = MSE(estimated_y, mean_y_array)
    result = 1 - MSE_of_estimates / MSE_wrt_mean_of_y
    return result
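As an aside, since the inputs are NumPy arrays, the explicit loop in MSE isn't strictly necessary; a minimal vectorized equivalent (the name MSE_vectorized is just for illustration) could look like this:
def MSE_vectorized(estimated_y, correct_y):
    # Same result as MSE above, computed with NumPy's element-wise operations
    return np.mean((estimated_y - correct_y) ** 2)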
MSE(ys, (xs * mhat + nhat))
17.03008891929832
RMSE(ys, (xs * mhat + nhat))
4.126752829925524
R_squared(ys, (xs * mhat + nhat))
0.9796489148199724
As we see, the MSE is relatively low and the $R^2$ score is close to 1, meaning that the fit is quite good quantitatively as well. Note that since our data is noisy, we can't find any line that would have an MSE of 0.0; with noise of standard deviation 4, the best we can hope for is an MSE of around $4^2 = 16$, which is roughly what we got.
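As a quick sanity check, we could also compute the MSE of the noisy data against the true underlying line $y = 5x + 10$; we'd expect a value near 16 as well, though the exact number varies with the random noise draw:
MSE(ys, m * xs + n)  # compare against the true, noise-free line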
Here's an example of some data which is not linear, but still polynomial in nature:
a = 1
b = 4
c = 4
ys = (a * xs * xs + b * xs + c) + np.random.normal(0, 4, xs.size) # add random noise
plt.scatter(xs, ys)
plt.grid()
Let's see what happens when we try to fit a line:
m, n = np.polyfit(xs, ys, 1)
plt.plot(xs, xs * m + n, label='estimated line', color='orange')
plt.scatter(xs, ys)
plt.legend()
As we see, the fit is visually not very good. The MSE makes this even more obvious:
MSE(ys, (xs * m + n))
939.2141146961674
Now let's try to fit a second-degree polynomial instead:
ahat, bhat, chat = np.polyfit(xs, ys, 2)
plt.plot(xs, ahat * xs * xs + bhat * xs + chat, color='orange')
plt.scatter(xs, ys)
MSE(ys, (ahat * xs * xs + bhat * xs + chat))
15.052760650909258
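We could also compute the $R^2$ score of the parabola the same way as before; given how closely the curve tracks the data, we'd expect a score close to 1, with the exact value depending on the noise draw:
R_squared(ys, ahat * xs * xs + bhat * xs + chat)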
Much better. Finally, let's see what we can do with some non-linear data from a different family:
ys = 4 * np.sin(xs) - 5 + np.random.randn(xs.size)
plt.scatter(xs, ys)
By looking at the data, we can see that it is wave-like and periodic, so we would like to fit some kind of sinusoid to it. To do this, we can use SciPy's curve_fit
and pass it a sine-shaped model function:
# Import optimize module
from scipy import optimize

def sinelike(x, a, b):
    return a * np.sin(x) + b

# Fit the sine model to the data:
solution, _ = optimize.curve_fit(sinelike, xs, ys, method='lm')
print("The estimated solution is: ", solution)
The estimated solution is: [ 4.14373463 -5.03095775]
plt.scatter(xs, ys)
yhat = [sinelike(x, solution[0], solution[1]) for x in xs]
plt.plot(xs, yhat, color='orange')
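Note that since np.sin operates element-wise on NumPy arrays, the list comprehension above isn't strictly needed; an equivalent vectorized call would be:
yhat = sinelike(xs, *solution)  # same values, returned directly as a NumPy array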
MSE(np.array(ys), np.array(yhat))
0.9337464287047195
R_squared(np.array(ys), np.array(yhat))
0.8974051269093574
Again, both the plot and the metrics indicate that this is a good fit for the generated data, and the estimated parameters are close to the true values of $4$ and $-5$ that we used to generate it.