Application: Regression problems¶

In a regression problem, we use input data to predict a numerical output. For example, we might predict a house’s price from its size.

In particular, linear regression is basically fitting a line to a collection of data points.

Let's show an example of "linear-like" data that would benefit from linear regression:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

xs = np.linspace(-10, 10, 100)

m = 5
n = 10

# Create some linear-like data
ys = (m * xs + n) + np.random.normal(0, 4, xs.size) # add random noise

The code above generated noisy linear data with a given slope $m$ and intercept $n$ such that $ y = m * x + n $.

Since we have 2D (x,y) pairs of points, we can display them using a scatter plot:

In [2]:
plt.scatter(xs, ys)
plt.grid()
No description has been provided for this image

As we can see, this data clearly has a line-like nature. NumPy's .polyfit function is useful for fitting polynomials to data, so we can use it to get a line estimate:

In [3]:
mhat, nhat = np.polyfit(xs, ys, 1)
print(mhat, nhat)
5.067523508424096 10.338403758908862

Let's show the estimated line alongside our data:

In [4]:
plt.plot(xs, xs * mhat + nhat, color='tab:orange')
plt.scatter(xs, ys)
plt.grid()
No description has been provided for this image

As we can see, visually this looks like a good fit. However, we would also like to assign some kind of score to this line that computes how good it is, and there are several metrics we can use.

One is the mean squared error (MSE), which measures the mean squared error (haha) between the actual $y$ values and the predicted $\hat{y}$ values from our linear regression model. The closer the value is to zero, the better that the fit is to the given data.There's also root-MSE (RMSE), which is just the square root of the MSE. Finally, there is the $R^2$ metric, which signals that the fit is better as the $R^2$ score approaches 1.0.

We can quickly copy the definitions from the textbook:

In [5]:
def MSE(estimated_y, correct_y):
  """
    Mean Squared Error between the estimated_y and correct_y values.
    Inputs are both numpy arrays.
  """
  result = 0
  N = estimated_y.size
  for i in range(N):
    result += (estimated_y[i] - correct_y[i]) ** 2
  return result / N

def RMSE(estimated_y, correct_y):
  """
    Root Mean Squared Error between the estimated_y and correct_y values.
    Inputs are both numpy arrays.
  """
  return MSE(estimated_y, correct_y) ** 0.5

def R_squared(estimated_y, correct_y):
    """
    Coefficient of determination for predictions and observed values.
    """
    mse_model = MSE(estimated_y, correct_y)

    mean_y = correct_y.mean()
    mean_predictions = np.full(correct_y.size, mean_y)
    mse_baseline = MSE(mean_predictions, correct_y)

    return 1 - mse_model / mse_baseline
In [6]:
MSE(ys, (xs * mhat + nhat))
Out[6]:
np.float64(13.890365494547748)
In [7]:
RMSE(ys, (xs * mhat + nhat))
Out[7]:
np.float64(3.7269780646722013)
In [8]:
R_squared(ys, (xs * mhat + nhat))
Out[8]:
np.float64(0.9840941381977046)

As we see, the MSE is relatively low and the $R^2$ score is close to 1, meaning that the fit is quite good quantitatively as well. Note that since our data is noisy, we can't find any line that would have a MSE of 0.0.

Here's an example of some data which is not linear, but still polynomial in nature:

In [9]:
a = 1
b = 4
c = 4
ys = (a * xs * xs + b * xs + c) + np.random.normal(0, 4, xs.size) # add random noise
plt.scatter(xs, ys)
plt.grid()
No description has been provided for this image

Let's see what happens when we try to fit a line:

In [10]:
m, n = np.polyfit(xs, ys, 1)
In [11]:
plt.plot(xs, xs * m + n, label='estimated line', color='orange')
plt.scatter(xs, ys)
plt.legend()
Out[11]:
<matplotlib.legend.Legend at 0x10ead6ba0>
No description has been provided for this image

As we see, the fit is visually not very good. As for MSE, it's even worse:

In [12]:
MSE(ys, (xs * m + n))
Out[12]:
np.float64(938.017200528655)

Now let's try to fit a second-degree polynomial:

In [13]:
ahat, bhat, chat = np.polyfit(xs, ys, 2)
In [14]:
plt.plot(xs, ahat * xs * xs + bhat * xs + chat, color='orange')
plt.scatter(xs, ys)
Out[14]:
<matplotlib.collections.PathCollection at 0x10ea6b260>
No description has been provided for this image
In [15]:
MSE(ys, (ahat * xs * xs + bhat * xs + chat))
Out[15]:
np.float64(15.630775125359312)

Much better. Finally, let's see what we can do with some non-linear data from a different family:

In [16]:
ys = 4 * np.sin(xs) - 5 + np.random.randn(xs.size)
plt.scatter(xs, ys)
Out[16]:
<matplotlib.collections.PathCollection at 0x10eb2e780>
No description has been provided for this image

By looking at the data, we can see that it is wave-like and periodic, so we would like to fit some kind of sinusoid to the data. To do this, we can use SciPy's curve_fit by passing a function that fits a sine wave to the data:

In [17]:
# Import optimize module
from scipy import optimize

def sinelike(x, a, b):
    return  a * np.sin(x) + b

# Fit a linear model:
solution, _ = optimize.curve_fit(sinelike, xs, ys, method='lm')

print("The estimated solution is: ", solution)
The estimated solution is:  [ 3.93896208 -4.86172707]
In [18]:
plt.scatter(xs, ys)
yhat = [sinelike(x, solution[0], solution[1]) for x in xs]

plt.plot(xs, yhat, color='orange')
Out[18]:
[<matplotlib.lines.Line2D at 0x1112cfd10>]
No description has been provided for this image
In [19]:
MSE(np.array(ys), np.array(yhat))
Out[19]:
np.float64(0.728924909680623)
In [20]:
R_squared(np.array(ys), np.array(yhat))
Out[20]:
np.float64(0.9012331468531657)

Again, both the plot and the metrics indicate that this is a good fit for the generated data.