# jet08013/Talks

 MathTalk slides

Introduction

We begin by looking at a data set to make things concrete.

In [4]:
import pandas as pd
import numpy as np

# df is assumed to already hold the cigarette-sales/cancer-mortality data
# set (columns 'Cigarettes', 'Lung', 'Leukemia', 'Kidney', ...); the
# loading cell is not shown here.
In [5]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))

axes[0,0].scatter(df['Cigarettes'], df['Lung'])
axes[0,0].set_title('Lung')
axes[0,0].set_xticks([0,1500,3000,4500])
axes[0,0].axis([0,4500,0,30])
axes[0,0].set_yticks(range(0,40,10))

# The scatter call for this panel was missing in the source; 'Bladder' is
# the remaining cancer column in the classic cigarette data set.
axes[1,0].scatter(df['Cigarettes'], df['Bladder'], color='orange')
axes[1,0].set_title('Bladder')
axes[1,0].set_xticks([0,1500,3000,4500])
axes[1,0].set_yticks(range(0,40,10))

axes[0,1].scatter(df['Cigarettes'], df['Leukemia'], color='green')
axes[0,1].set_title('Leukemia')
axes[0,1].set_xticks([0,1500,3000,4500])
axes[0,1].set_yticks(range(0,40,10))

axes[1,1].scatter(df['Cigarettes'], df['Kidney'], color='red')
axes[1,1].set_title('Kidney')
axes[1,1].set_xticks([0,1500,3000,4500])
axes[1,1].set_yticks(range(0,40,10))

fig.suptitle("Deaths from Cancer per 100K Population\n vs \n Cigarette sales per capita")

plt.show()
In [6]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
axes.scatter(df['Cigarettes'], df['Lung'])
axes.set_title('Lung Cancer per 100K Population\n vs \n Cigarette Sales per Capita')
axes.set_xticks([0,1500,3000,4500])
axes.axis([0,4500,0,30])
axes.set_yticks(range(0,40,10))
plt.show()

There are 44 data points in the set (each corresponds to a state).

In [7]:
df[['Cigarettes','Lung']][0:10]
Out[7]:
Cigarettes Lung
0 1820 17.05
1 3034 25.88
2 2582 19.80
3 1824 15.98
4 2860 22.07
5 3110 22.83
6 3360 24.55
7 4046 27.27
8 2827 23.57
9 2010 13.58

10 rows × 2 columns

The basic problem of simple linear regression is to find the slope $m$ and intercept $b$ so that the equation of the line $$y=mx+b$$ is the best fit to the data above.

To be specific, we look at the function $$E(m,b)=\sum_{i=1}^{N} (y_i-mx_i-b)^2$$ and we want to find $m$ and $b$ so that this is as small as possible. (For our example, $N=44$.)
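To make the objective concrete, here is a minimal sketch of $E(m,b)$ in Python, evaluated on a small made-up data set (not the cigarette data) to show that a near-optimal line gives a much smaller error than a poor one:

```python
def E(m, b, xs, ys):
    """Sum of squared residuals for the line y = m*x + b."""
    return sum((y - m * x - b) ** 2 for x, y in zip(xs, ys))

# Toy data, roughly on the line y = 2x + 1 (for illustration only):
xs = [0, 1, 2, 3]
ys = [1.1, 2.9, 5.2, 6.8]

print(E(2, 1, xs, ys))  # near-optimal line: small error
print(E(0, 0, xs, ys))  # poor fit: much larger error
```

Minimizing this function over all choices of $m$ and $b$ is exactly the least-squares problem.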

Calculus

Although the equation for $E$ looks complicated, it is really a function of two variables and so we can use calculus to find the minimum value. We compute: $$\frac{\partial E}{\partial m}=\sum_{i=1}^{N} 2(y_i-mx_i-b)x_i$$ and $$\frac{\partial E}{\partial b}=\sum_{i=1}^{N} 2(y_i-mx_i-b).$$

Remembering that the variables are $m$ and $b$, set these two partial derivatives to zero to find the following two equations: $$S_{xy}=S_{xx}m+S_{x}b$$ and $$S_{y}=S_{x}m+Nb,$$ where I've written $S_{x}=\sum_{i=1}^{N} x_i$, $S_{xy}=\sum_{i=1}^{N} x_iy_i$, and so on.

In [8]:
Sx = df['Cigarettes'].sum()
Sy = df['Lung'].sum()
Sxy = (df['Cigarettes'] * df['Lung']).sum()
Sxx = (df['Cigarettes'] ** 2).sum()
N = len(df)
print('N=', N, 'Sx=', Sx, 'Sy=', Sy, 'Sxx=', Sxx, 'Sxy=', Sxy)
N= 44 Sx= 109622 Sy= 874.74 Sxx= 286469700 Sxy= 2248867.14

These are two equations in two unknowns that you can solve "by hand" or, for the general solution, you can use Cramer's rule. Let $$M=NS_{xy}-S_{x}S_{y},$$ $$B=S_{xx}S_y-S_{xy}S_x,$$ $$D=S_{xx}N-S_{x}^2.$$ Then $$m=M/D$$ and $$b=B/D.$$

In [9]:
M=N*Sxy-Sx*Sy
B=Sxx*Sy-Sxy*Sx
D=N*Sxx-Sx*Sx
m=M/D
b=B/D
print('m=', m, 'b=', b)
m= 0.00520586968046 b= 6.91050349746
In [10]:
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
axes.scatter(df['Cigarettes'], df['Lung'])
xs = np.arange(0, 4500)
axes.plot(xs, m * xs + b, color="red", label='Regression Line')
axes.set_title('Lung Cancer per 100K Population\n vs \n Cigarette Sales per Capita')
axes.set_xticks([0,1500,3000,4500])
axes.axis([0,4500,0,30])
axes.set_yticks(range(0,40,10))
plt.legend()
plt.show()
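As a sanity check (not part of the original notebook), the closed-form formulas for $m$ and $b$ can be compared against `np.polyfit` on synthetic data; the two should agree closely. The data below is randomly generated for illustration, since the cigarette data set itself is not reproduced here:

```python
import numpy as np

# Synthetic data with a known underlying line (illustration only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 4500, size=44)
y = 0.005 * x + 7 + rng.normal(0, 1, size=44)

# Closed-form least-squares slope and intercept via Cramer's rule
N = len(x)
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()
D = N * Sxx - Sx ** 2
m_hand = (N * Sxy - Sx * Sy) / D
b_hand = (Sxx * Sy - Sxy * Sx) / D

# Library fit: degree-1 polynomial, returns [slope, intercept]
m_np, b_np = np.polyfit(x, y, 1)
print(m_hand, m_np)  # the two slopes should agree closely
print(b_hand, b_np)  # and so should the intercepts
```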

Linear Algebra

Let's look at the problem from a different point of view. Let's consider three column vectors:

• $Y=\left[\begin{matrix} y_1 \cr y_2 \cr \vdots \cr y_N\end{matrix}\right]$
• $X=\left[\begin{matrix} x_1 \cr x_2 \cr \vdots \cr x_N\end{matrix}\right]$
• $E=\left[\begin{matrix} 1 \cr 1 \cr \vdots \cr 1 \end{matrix}\right]$

If things were "perfect", meaning that the points all belonged to the line $y=mx+b$, we would have $$Y=mX+bE.$$

But, of course, we don't. In linear algebra terms, the vector $Y$ does not lie in the plane spanned by $X$ and $E$ in $\mathbb{R}^{44}$.

So let's try to find the point in the plane spanned by $E$ and $X$ which is closest to $Y$.
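This projection can be computed directly. Below is a sketch (again on synthetic stand-in data, since `df` is not reproduced here) that stacks $X$ and $E$ as the columns of a matrix $A$ and finds the least-squares coefficients; the residual is orthogonal to the plane, and the coefficients are exactly the regression slope and intercept:

```python
import numpy as np

# Synthetic stand-in for the notebook's data (illustration only)
rng = np.random.default_rng(1)
X = rng.uniform(0, 4500, size=44)
Y = 0.005 * X + 7 + rng.normal(0, 1, size=44)

# Columns of A span the plane of all vectors of the form m*X + b*E
A = np.column_stack([X, np.ones_like(X)])

# Least-squares coefficients [m, b] of the projection of Y onto col(A)
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)

# The projection itself: the closest point in the plane to Y
proj = A @ coef

# The residual Y - proj is orthogonal to both spanning vectors
print(A.T @ (Y - proj))  # both entries are essentially zero
```

This is the same computation the calculus approach performs: the orthogonality conditions $X\cdot(Y - mX - bE)=0$ and $E\cdot(Y - mX - bE)=0$ are precisely the two normal equations obtained by setting the partial derivatives of $E(m,b)$ to zero.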

In [11]:
from mpl_toolkits.mplot3d import Axes3D
#plt.subplots(nrows=1,ncols=1,figsize=(6,10))
fig=plt.figure(figsize=(10,10))