Introduction
We begin by looking at a data set to make things concrete.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("/Users/jteitelbaum/Dropbox/CancerData.csv")

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 8))
axes[0, 0].scatter(df['Cigarettes'], df['Lung'])
axes[0, 0].set_title('Lung')
axes[0, 0].set_xticks([0, 1500, 3000, 4500])
axes[0, 0].axis([0, 4500, 0, 30])
axes[0, 0].set_yticks(range(0, 40, 10))
axes[1, 0].scatter(df['Cigarettes'], df['Bladder'], color='black')
axes[1, 0].set_title('Bladder')
axes[1, 0].set_xticks([0, 1500, 3000, 4500])
axes[1, 0].set_yticks(range(0, 40, 10))
axes[0, 1].scatter(df['Cigarettes'], df['Leukemia'], color='green')
axes[0, 1].set_title('Leukemia')
axes[0, 1].set_xticks([0, 1500, 3000, 4500])
axes[0, 1].set_yticks(range(0, 40, 10))
axes[1, 1].scatter(df['Cigarettes'], df['Kidney'], color='red')
axes[1, 1].set_title('Kidney')
axes[1, 1].set_xticks([0, 1500, 3000, 4500])
axes[1, 1].set_yticks(range(0, 40, 10))
fig.suptitle("Deaths from Cancer per 100K Population\n vs \n Cigarette sales per capita")
plt.show()
```
```python
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
axes.scatter(df['Cigarettes'], df['Lung'])
axes.set_title('Lung Cancer per 100K Population\n vs \n Cigarette Sales per Capita')
axes.set_xlabel('Cigarettes Sold per Capita')
axes.set_xticks([0, 1500, 3000, 4500])
axes.axis([0, 4500, 0, 30])
axes.set_yticks(range(0, 40, 10))
plt.show()
```
There are 44 data points in the set (each corresponds to a state).
```python
df[['Cigarettes','Lung']][0:10]
```
| | Cigarettes | Lung |
|---|---|---|
| 0 | 1820 | 17.05 |
| 1 | 3034 | 25.88 |
| 2 | 2582 | 19.80 |
| 3 | 1824 | 15.98 |
| 4 | 2860 | 22.07 |
| 5 | 3110 | 22.83 |
| 6 | 3360 | 24.55 |
| 7 | 4046 | 27.27 |
| 8 | 2827 | 23.57 |
| 9 | 2010 | 13.58 |

10 rows × 2 columns
The basic problem of simple linear regression is to find the slope $m$ and intercept $b$ of the line $y=mx+b$ that best fits the data points $(x_i,y_i)$, in the sense of minimizing the total squared vertical distance between the points and the line.
To be specific, we look at the function
$$
E(m,b)=\sum_{i=1}^{N} (y_i-mx_i-b)^2
$$
and we want to find the values of $m$ and $b$ that minimize $E(m,b)$.
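As a quick illustration, here is the error function $E$ evaluated directly on some made-up toy data (not the cancer data); a line close to the points yields a much smaller value than a poorly chosen one:

```python
import numpy as np

# Toy data, roughly along y = 2x + 1 (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

def E(m, b):
    """Sum of squared vertical distances from the points to the line y = m*x + b."""
    return np.sum((y - m * x - b) ** 2)

print(E(2, 1))  # near-fitting line: small error
print(E(0, 3))  # horizontal line: much larger error
```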
Calculus
Although the equation for $E(m,b)$ may look complicated, it is just a quadratic function of the two unknowns $m$ and $b$ (the $x_i$ and $y_i$ are given data). Remembering that the variables are $m$ and $b$, we minimize $E$ by setting both partial derivatives equal to zero:
$$
\frac{\partial E}{\partial m}=-2\sum_{i=1}^{N} x_i(y_i-mx_i-b)=0
$$
$$
\frac{\partial E}{\partial b}=-2\sum_{i=1}^{N} (y_i-mx_i-b)=0
$$
Writing $S_x=\sum x_i$, $S_y=\sum y_i$, $S_{xx}=\sum x_i^2$, and $S_{yx}=\sum x_iy_i$, these conditions become
$$
mS_{xx}+bS_x=S_{yx}
$$
$$
mS_x+bN=S_y.
$$
We can compute these sums for our data:
```python
Sx = sum(df['Cigarettes'])
Sy = sum(df['Lung'])
Sxy = sum(df['Cigarettes'][i] * df['Lung'][i] for i in range(len(df['Cigarettes'])))
Sxx = sum(df['Cigarettes'][i] ** 2 for i in range(len(df['Cigarettes'])))
N = len(df['Cigarettes'])
print('N=', N, 'Sx=', Sx, 'Sy=', Sy, 'Sxx=', Sxx, 'Sxy=', Sxy)
```

N= 44 Sx= 109622 Sy= 874.74 Sxx= 286469700 Sxy= 2248867.14
These are two equations in two unknowns that you can solve "by hand" or, for the general solution, you can use Cramer's rule. Let $$ M=(NS_{yx}-S_{x}S_{y}) $$ $$ B=(S_{xx}S_y-S_{yx}S_x) $$ $$ D=(S_{xx}N-S_{x}^2) $$ Then $$ m=M/D $$ and $$ b=B/D. $$
```python
M = N * Sxy - Sx * Sy
B = Sxx * Sy - Sxy * Sx
D = N * Sxx - Sx * Sx
m = M / D
b = B / D
print('m=', m, 'b=', b)
```

m= 0.00520586968046 b= 6.91050349746
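As a sanity check, the closed-form solution should agree with `np.polyfit`, which solves the same least-squares problem. A minimal sketch with synthetic data (so it runs without the CSV file; the variable names here are just for the example):

```python
import numpy as np

def fit_mb(x, y):
    """Closed-form least-squares slope and intercept via Cramer's rule."""
    n = len(x)
    sx, sy = x.sum(), y.sum()
    sxx, sxy = (x * x).sum(), (x * y).sum()
    d = n * sxx - sx ** 2
    return (n * sxy - sx * sy) / d, (sxx * sy - sxy * sx) / d

# Synthetic data in roughly the same range as the cancer data (illustrative only)
rng = np.random.default_rng(0)
x_syn = rng.uniform(0, 4500, size=44)
y_syn = 0.005 * x_syn + 7 + rng.normal(scale=2, size=44)

m_cf, b_cf = fit_mb(x_syn, y_syn)
m_np, b_np = np.polyfit(x_syn, y_syn, 1)  # degree-1 least-squares fit
print(np.allclose([m_cf, b_cf], [m_np, b_np]))  # True
```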
```python
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
axes.scatter(df['Cigarettes'], df['Lung'])
xs = np.arange(0, 4500)
axes.plot(xs, m * xs + b, color="red", label='Regression Line')
axes.set_title('Lung Cancer per 100K Population\n vs \n Cigarette Sales per Capita')
axes.set_xlabel('Cigarettes Sold per Capita')
axes.set_xticks([0, 1500, 3000, 4500])
axes.axis([0, 4500, 0, 30])
axes.set_yticks(range(0, 40, 10))
plt.legend()
plt.show()
```
Linear Algebra
Let's look at the problem from a different point of view. Let's consider three column vectors:
- $Y=\left[\begin{matrix} y_1 \cr y_2 \cr \vdots \cr y_N\end{matrix}\right]$
- $X=\left[\begin{matrix} x_1 \cr x_2 \cr \vdots \cr x_N\end{matrix}\right]$
- $E=\left[\begin{matrix} 1 \cr 1 \cr \vdots \cr 1 \end{matrix}\right]$
If things were "perfect", meaning that the points all belonged to the line $y=mx+b$, then we would have $Y=mX+bE$ exactly for some choice of $m$ and $b$.
But, of course, we don't have such a perfect relationship. In linear algebra terms, the vector $Y$ does not lie in the plane spanned by the vectors $X$ and $E$.
So let's try to find the point in the plane spanned by $X$ and $E$ that is as close as possible to $Y$. The situation is illustrated below: a point outside a plane, together with the closest point to it lying in the plane.
```python
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
X = np.arange(-2, 2, 0.25)
Y = np.arange(-2, 2, 0.25)
X, Y = np.meshgrid(X, Y)
Z = 2 * Y - X
t = np.arange(-1, 0, .1)
ax.plot_wireframe(X, Y, Z)
ax.plot(t, -2 * t, 6 + t, color="red")
ax.scatter([0, -1], [0, 2], [6, 5], color="red")
ax.set_title("Closest point lying in a given plane from a point outside the plane")
plt.show()
```
We can use linear algebra to solve the problem of finding the point in the plane spanned by $X$ and $E$ that is closest to $Y$. The key idea is to build an orthonormal basis for the plane.
First, we normalize $E$: since $E\cdot E=N$, the unit vector is $\mathbf{e}=E/\sqrt{N}$.
Next, we subtract the projection of $X$ onto $\mathbf{e}$ from $X$ and normalize the result, obtaining $\mathbf{x}=\frac{X-(X\cdot\mathbf{e})\mathbf{e}}{\|X-(X\cdot\mathbf{e})\mathbf{e}\|}$, so that $\mathbf{x}$ and $\mathbf{e}$ are orthonormal vectors spanning the same plane.
Then the point we are interested in is $$ \mathbf{y}=(Y\cdot\mathbf{x})\mathbf{x}+(Y\cdot\mathbf{e})\mathbf{e} $$
This point lies in the plane spanned by $X$ and $E$, and among all points of that plane it is the closest one to $Y$; the difference $Y-\mathbf{y}$ is orthogonal to the plane.
Let's make a few observations about this. First of all, since $\mathbf{y}$ lies in the plane, it can be written as $\mathbf{y}=mX+bE$, and the coefficients $m$ and $b$ are exactly the least-squares slope and intercept.
We can arrange for our data to have mean zero in both coordinates by subtracting the mean $\bar{x}$ from each $x_i$ and the mean $\bar{y}$ from each $y_i$.
In these coordinates, we get a huge simplification. If both means are zero, then $X\cdot E=0$ and $Y\cdot E=0$, so we get
$$
\mathbf{x}=\frac{X}{\sqrt{X\cdot X}}
$$
and
$$
\mathbf{y}=\frac{Y\cdot X}{X\cdot X}X
$$
In other words, in these coordinates, the regression line passes through the origin and has slope $m=\frac{Y\cdot X}{X\cdot X}$.
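The projection recipe can be sketched numerically. This is a minimal illustration with made-up vectors (the names here are just for the example): we build the orthonormal basis by Gram-Schmidt, project $Y$ onto the plane, and check that the residual is orthogonal to both basis vectors:

```python
import numpy as np

# Made-up data vectors (illustrative only)
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([1.2, 1.9, 3.3, 3.8])
E = np.ones(4)

# Orthonormal basis for the plane spanned by X and E (Gram-Schmidt)
e = E / np.sqrt(E @ E)
x_perp = X - (X @ e) * e
x = x_perp / np.sqrt(x_perp @ x_perp)

# Orthogonal projection of Y onto the plane
y_proj = (Y @ x) * x + (Y @ e) * e

# The residual Y - y_proj is orthogonal to both basis vectors
print(np.allclose((Y - y_proj) @ x, 0), np.allclose((Y - y_proj) @ e, 0))
```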
```python
Xnorm = df['Cigarettes'] - df['Cigarettes'].mean()
Ynorm = df['Lung'] - df['Lung'].mean()
plt.scatter(Xnorm, Ynorm)
m = Xnorm.dot(Ynorm) / Xnorm.dot(Xnorm)  # slope in mean-centered coordinates
plt.plot(Xnorm, m * Xnorm)
plt.show()
```
Statistics
What accounts for the variation of the points around the "true value"? Statisticians approach this through the idea of a statistical model. Let's leave the lung cancer data behind for the moment, and imagine an abstract problem. Suppose that we make measurements of a dependent variable $y$ at fixed values $x_1,\ldots,x_N$ of an independent variable, and that the measurements satisfy $$ y_i=mx_i+b+\epsilon_i $$ where the $\epsilon_i$ are random error terms.
We will assume that the error terms $\epsilon_i$ are independent and normally distributed with mean zero and some common variance $v$.
```python
from scipy.stats import norm

X = np.arange(-5, 5, .1)
# scale is the standard deviation; the labels give the variance v = scale**2
plt.plot(X, norm.pdf(X, scale=np.sqrt(.5), loc=0), color='red', label='v=.5')
plt.plot(X, norm.pdf(X, scale=1), color='blue', label='v=1')
plt.plot(X, norm.pdf(X, scale=1.5), color='orange', label='v=2.25')
plt.title('Normal Distributions')
plt.legend()
plt.show()
```
Suppose for the sake of concreteness that the true line is $y=x+1$ (so $m=1$ and $b=1$) and that the errors are normal with mean $0$ and standard deviation $2$. We can simulate several data sets from this model and fit a regression line to each. First we package the computation of $m$ and $b$ into a function.
```python
def MB(X, Y):
    """Least-squares slope and intercept via the formulas derived above."""
    Sxx = sum(X[i] * X[i] for i in range(len(X)))
    Sxy = sum(X[i] * Y[i] for i in range(len(X)))
    Sx = sum(X)
    Sy = sum(Y)
    N = len(X)
    M = N * Sxy - Sx * Sy
    B = Sxx * Sy - Sxy * Sx
    D = N * Sxx - Sx * Sx
    return M / D, B / D
```
```python
X = np.arange(0, 10, .1)
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(6, 6))
# Four independent simulated samples from y = x + 1 plus normal noise with sd 2
epsilon = norm.rvs(size=100, scale=2, loc=0)
Y0 = X + np.ones(len(X)) + epsilon
epsilon = norm.rvs(size=100, scale=2, loc=0)
Y1 = X + np.ones(len(X)) + epsilon
epsilon = norm.rvs(size=100, scale=2, loc=0)
Y2 = X + np.ones(len(X)) + epsilon
epsilon = norm.rvs(size=100, scale=2, loc=0)
Y3 = X + np.ones(len(X)) + epsilon
for i in [0, 1]:
    for j in [0, 1]:
        ax[i, j].set_xticks(range(0, 11, 2))
        ax[i, j].set_yticks(range(-5, 17, 2))
ax[0, 0].scatter(X, Y0, color='blue')
ax[1, 0].scatter(X, Y1, color='red')
ax[0, 1].scatter(X, Y2, color='green')
ax[1, 1].scatter(X, Y3, color='black')
m0, b0 = MB(X, Y0)
m1, b1 = MB(X, Y1)
m2, b2 = MB(X, Y2)
m3, b3 = MB(X, Y3)
ax[0, 0].plot(X, m0 * X + b0, color="blue")
ax[1, 0].plot(X, m1 * X + b1, color="red")
ax[0, 1].plot(X, m2 * X + b2, color="green")
ax[1, 1].plot(X, m3 * X + b3, color="black")
plt.show()
```
We can superimpose these lines to see that they are slightly different -- with varying slope and intercept -- because each comes from a different random sample of the underlying model.
```python
plt.plot(X, m0 * X + b0, color="blue")
plt.plot(X, m1 * X + b1, color="red")
plt.plot(X, m2 * X + b2, color="green")
plt.plot(X, m3 * X + b3, color="black")
plt.show()
```
Suppose we draw 5000 samples of 100 $Y$-values each, where each of the 100 values is the true value $x_i+1$ plus independent normal noise with mean $0$ and standard deviation $2$. Fitting a line to each sample gives 5000 slope/intercept pairs, which we can plot.
```python
Y = X + np.ones(len(X))  # the "true" Y-values, without noise
M, B = [], []
for i in range(5000):
    Yn = Y + norm.rvs(size=100, loc=0, scale=2)
    m, b = MB(X, Yn)
    M.append(m)
    B.append(b)
plt.title("Slope/Intercept Pairs for Different Samples")
plt.xlabel("Intercept")
plt.ylabel("Slope")
plt.scatter(B, M, s=.3, alpha=.3, color='blue')
plt.show()
```
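The spread of the cloud of slopes is predicted by a standard fact about least squares (not derived here): with independent errors of variance $\sigma^2$, the fitted slope has variance $\sigma^2/\sum_i(x_i-\bar{x})^2$. A small self-contained simulation, written as a sketch with its own variable names, agrees with this:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(0, 10, .1)           # same grid as above
sigma = 2.0
sxx = np.sum((x - x.mean()) ** 2)  # centered sum of squares

slopes = []
for _ in range(5000):
    y = x + 1 + rng.normal(scale=sigma, size=len(x))
    # slope via the mean-centered formula
    slopes.append(np.sum((x - x.mean()) * (y - y.mean())) / sxx)

print(np.var(slopes))    # empirical variance of the slope
print(sigma ** 2 / sxx)  # theoretical value sigma^2 / sum (x_i - xbar)^2
```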
To see where this ellipse comes from, let's recall the orthonormal vectors $\mathbf{e}$ and $\mathbf{x}$ from the linear algebra section. The function below computes the components $A=Y\cdot\mathbf{e}$ and $B=Y\cdot\mathbf{x}$ of a sample $Y$ along these two directions.
```python
def P(X, Y):
    """Components of Y along the orthonormal vectors e and x."""
    N = len(X)
    Xbar = sum(X) / N
    Sx = sum((X - Xbar * np.ones(N)) ** 2)
    A = sum(Y) / np.sqrt(N)                    # A = Y . e
    Z = (X - Xbar * np.ones(N)) / np.sqrt(Sx)  # the unit vector x
    B = sum(Z[i] * Y[i] for i in range(N))     # B = Y . x
    return A, B
```
```python
A = []
B = []
a0, b0 = P(X, X + 1)  # components of the true (noiseless) Y-values
for i in range(5000):
    Yn = Y + norm.rvs(size=100, loc=0, scale=2)
    a, b = P(X, Yn)
    A.append(a - a0)
    B.append(b - b0)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 10))
axes[0].set_title("Orthogonal\n Projection of Residuals")
axes[0].axis([-10, 10, -10, 10])
axes[0].set_aspect('equal')
axes[0].scatter(A, B, s=.3, alpha=.3, color='blue')
axes[1].hist(A, density=True)
axes[1].hist(B, density=True)
axes[1].plot(np.arange(-10, 10, .1), norm.pdf(np.arange(-10, 10, .1), scale=2), color='black')
axes[1].set_title("Distribution of Residuals")
plt.show()
```
We can make a change of variables from the $(A,B)$ components back to slope and intercept. Since $m=B/\sqrt{S_x}$ and $b=A/\sqrt{N}-\bar{x}B/\sqrt{S_x}$, the change of variables is linear, and it carries the circular cloud of $(A,B)$ values to an elliptical cloud in the intercept/slope plane.
```python
fig = plt.figure(num=1, figsize=(10, 10))
axes = fig.gca()
A = np.array(A)
B = np.array(B)
axes.set_title('Projection of Residuals in Slope,Intercept')
axes.axis([-2, 2, -.3, .3])
axes.set_xticks(np.arange(-2, 3, 1))
N = len(X)
Xbar = sum(X) / N
Sx = sum((X - Xbar * np.ones(N)) ** 2)
# Coefficients of the quadratic form A**2 + B**2 expressed in the (b, m) variables
R = Sx + N * Xbar ** 2
S = 2 * N * Xbar
T = N
u = np.arange(-3, 4, .01)
v = np.arange(-3, 4, .01)
U, V = np.meshgrid(u, v)
axes.contour(U, V, T * U ** 2 + S * U * V + R * V ** 2, levels=[4, 16, 36], colors=['black', 'black', 'black'])
axes.scatter(1 / np.sqrt(N) * A - Xbar / np.sqrt(Sx) * B, 1 / np.sqrt(Sx) * B, s=.5, alpha=.3, color='red')
plt.show()
```