linear models
- Material from "Statistical Models: Theory and Practice" by David Freedman
introduction
- $Y = X\beta + \epsilon$

| type of linear regression | X | Y |
| --- | --- | --- |
| simple | univariate | univariate |
| multiple | multivariate | univariate |
| multivariate | either | multivariate |
| generalized | either | error not normal |

- minimize: $L(\beta) = \|Y - X\beta\|_2^2 \implies \hat{\beta} = (X^TX)^{-1}X^TY$ (numerical sketch at the end of this section)
- 2 proofs
- set derivative to zero and solve
- use the projection matrix H to show HY is the projection of Y onto R(X)
- the projection matrix maps the responses to the predictions: $\hat{y} = Hy$
- define the projection (hat) matrix $H = X(X^TX)^{-1}X^T$
- show $\|Y - X\theta\|^2 \ge \|Y - HY\|^2$
- key idea: subtract and add HY
- interpretation
- if features are correlated, weights aren’t stable / can’t be interpreted
- the inverse curvature $(X^TX)^{-1}$ dictates stability
- importance: weight * feature value
- LS doesn’t work when p ≫ n because of collinearity of the columns of X
- assumptions
- $\epsilon \sim N(0, \sigma^2 I)$, equivalently $Y \mid X \sim N(X\beta, \sigma^2 I)$
- homoscedasticity: $\mathrm{var}(Y_i \mid X)$ is the same for all i
- opposite of heteroscedasticity
- multicollinearity - predictors highly correlated
- variance inflation factor (VIF) - measures how much the variance of an estimated regression coefficient is inflated compared to when the predictors are not linearly related
- normal linear regression
- variance MLE: $\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \hat{\theta}^T x_i)^2$
- the unbiased estimator divides by n − p instead
- the LS estimator has distribution $\hat{\theta} \sim N(\theta, \sigma^2 (X^TX)^{-1})$
- linear regression model
- when n is large, the LS estimator has an approximately normal distribution, provided $X^TX/n$ converges to a positive definite matrix
- weighted LS: minimize $\sum_i w_i (y_i - x_i^T\theta)^2$
- $\hat{\theta} = (X^TWX)^{-1}X^TWY$
- heteroscedastic normal linear regression model: errors $\epsilon_i \sim N(0, 1/w_i)$
- leverage scores - measure how much each $x_i$ influences the LS fit
- for data point i, $H_{ii}$ is the leverage score
- LAD (least absolute deviation) fit
- MLE estimator when error is Laplacian
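A minimal numpy sketch (synthetic data of my own choosing, not from Freedman) tying the pieces above together: the normal-equation solution, the hat matrix, the leverage scores $H_{ii}$, and the variance estimate that divides by n − p.

```python
# Minimal sketch (synthetic data): least squares via the normal equations,
# the hat matrix H = X (X^T X)^{-1} X^T, leverage scores H_ii, and the
# unbiased variance estimate sum(e_i^2) / (n - p).
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 4
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.3 * rng.normal(size=n)

# Normal equations: theta_hat = (X^T X)^{-1} X^T y (solve, don't invert).
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix: maps responses to fitted values, y_hat = H y.
H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
leverage = np.diag(H)                          # leverage score of each point
residuals = y - y_hat

sigma2_hat = np.sum(residuals ** 2) / (n - p)  # unbiased variance estimate
print(theta_hat)
print(leverage.sum(), p)                       # trace(H) = p
print(sigma2_hat)
```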
recent notes
regularization
- when $X^TX$ isn’t invertible we can’t use the normal equations, and gradient descent is likely unstable
- X is n×p; usually n ≫ p and X almost always has rank p
- problems when n < p
- an intuitive way to fix this problem is to reduce p by getting rid of features
- a lot of papers assume your data is already zero-centered
- conventionally don’t regularize the intercept term
- ridge regression (L2)
- if $X^TX$ is not invertible, add a small constant to the diagonal
- then it becomes invertible
- proof of invertibility: $X^TX$ is PSD, so $X^TX + \lambda I$ has eigenvalues $\ge \lambda > 0$
- $\arg\min_\beta \sum_i (y_i - \hat{y}_i)^2 + \lambda \|\beta\|_2^2$
- equivalent to minimizing $\sum_i (y_i - \hat{y}_i)^2$ s.t. $\sum_j \beta_j^2 \le t$
- solution is $\hat{\beta}_\lambda = (X^TX + \lambda I)^{-1} X^T y$ (see the sketch after this section)
- for small $\lambda$ the numerical solution is unstable
- when $X^TX = I$, $\hat{\beta}_{\text{ridge}} = \frac{1}{1+\lambda}\hat{\beta}_{\text{LS}}$
- lasso regression (L1)
- $\sum_i (y_i - \hat{y}_i)^2 + \lambda \|\beta\|_1$
- equivalent to minimizing $\sum_i (y_i - \hat{y}_i)^2$ s.t. $\sum_j |\beta_j| \le t$
- “least absolute shrinkage and selection operator”
- acts in a nonlinear manner on the outcome y
- keep the same SSE loss function, but add a constraint on the L1 norm
- doesn’t have a closed form for $\beta$
- because of the absolute value, the gradient doesn’t exist everywhere
- can use directional derivatives (subgradients)
- best solver is LARS (least angle regression)
- if the tuning parameter is chosen well, it will set lots of coordinates to 0
- convex functions / convex sets (like circles) are easier to optimize over
- disadvantages
- if p>n, lasso selects at most n variables
- if pairwise correlations are very high, lasso only selects one variable
- elastic net - hybrid of the other two
- $\hat{\beta}_{\text{naive ENet}} = \arg\min_\beta \sum_i (y_i - \hat{y}_i)^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$
- L1 part generates a sparse model
- L2 part encourages a grouping effect and stabilizes the L1 regularization path
- grouping effect - a group of highly correlated features should all be selected
- naive elastic net has too much shrinkage, so we rescale: $\hat{\beta}_{\text{ENet}} = (1+\lambda_2)\hat{\beta}_{\text{naive ENet}}$
- to solve, fix the L2 penalty and solve a lasso problem
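A minimal numpy sketch (toy data and λ are my own choices) of the ridge closed form; λ = 0 recovers OLS, and a larger λ shrinks the coefficients. Lasso and elastic net have no closed form, so in practice they need an iterative solver (coordinate descent or LARS).

```python
# Minimal sketch: ridge regression via the closed form
# beta_hat = (X^T X + lambda I)^{-1} X^T y, compared against plain OLS.
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))             # no intercept column, so nothing to exempt
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution; lam = 0 recovers ordinary least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)
beta_ridge = ridge(X, y, 10.0)

# Larger lambda shrinks the coefficient vector toward zero.
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```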
the regression line (freedman ch 2)
- regression line
- goes through $(\bar{x}, \bar{y})$
- slope: $r\, s_y / s_x$
- intercept: $\bar{y} - \text{slope} \cdot \bar{x}$
- basically fits the graph of averages (minimizes MSE)
- SD line
- same except slope: $\mathrm{sign}(r)\, s_y / s_x$
- intercept changes accordingly
- for regression, MSE $= (1 - r^2)\,\mathrm{Var}(Y)$
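A quick numpy check (toy data I generated) that the slope $r\, s_y / s_x$ and intercept $\bar{y} - \text{slope}\cdot\bar{x}$ reproduce the least-squares line:

```python
# Minimal sketch: the regression line from summary statistics
# (r, s_x, s_y, means) matches the direct least-squares fit.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

r = np.corrcoef(x, y)[0, 1]
slope = r * y.std() / x.std()            # r * s_y / s_x
intercept = y.mean() - slope * x.mean()  # line goes through (x_bar, y_bar)

# Compare with a direct least-squares fit of degree 1.
slope_ls, intercept_ls = np.polyfit(x, y, 1)
print(slope, slope_ls)                   # should agree
print(intercept, intercept_ls)
```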
multiple regression (freedman ch 4)
- assumptions
- assume n>p and X has full rank (rank p - columns are linearly independent)
- $\epsilon_i$ are iid, mean 0, variance $\sigma^2$
- $\epsilon$ independent of X
- residuals $e_i$ still orthogonal to X
- OLS is conditionally unbiased
- $E[\hat{\theta} \mid X] = \theta$
- $\mathrm{Cov}(\hat{\theta} \mid X) = \sigma^2 (X^TX)^{-1}$
- $\hat{\sigma}^2 = \frac{1}{n-p}\sum_i e_i^2$
- this is unbiased - dividing by n would be too small, since the residuals $e_i$ were minimized and so have smaller variance than the errors $\epsilon_i$
- random errors ϵ
- residuals e
- $H = X(X^TX)^{-1}X^T$
- $e = (I - H)Y = (I - H)\epsilon$
- H is symmetric
- $H^2 = H$, $(I - H)^2 = I - H$
- $HX = X$
- $e \perp X$
- basically H projects Y into R(X)
- $E[\hat{\sigma}^2 \mid X] = \sigma^2$
- random errors don’t need to be normal
- variance
- $\mathrm{var}(Y) = \mathrm{var}(X\hat{\theta}) + \mathrm{var}(e)$
- $\mathrm{var}(X\hat{\theta})$ is the explained variance
- fraction of variance explained: $R^2 = \mathrm{var}(X\hat{\theta}) / \mathrm{var}(Y)$
- like summing squares by projecting
- if there is no intercept in the regression equation, use $R^2 = \|\hat{Y}\|^2 / \|Y\|^2$
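A small numpy sketch (synthetic data; the sizes are my own choices) that numerically verifies the hat-matrix identities and the variance decomposition above:

```python
# Minimal sketch: check the hat-matrix identities and var(Y) = var(X theta_hat) + var(e).
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # includes intercept
y = X @ rng.normal(size=p) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
e = y - y_hat

assert np.allclose(H, H.T)       # H symmetric
assert np.allclose(H @ H, H)     # H idempotent
assert np.allclose(H @ X, X)     # HX = X
assert np.allclose(X.T @ e, 0)   # residuals orthogonal to the columns of X

# var(Y) = var(X theta_hat) + var(e)  (with an intercept in the model)
print(np.var(y), np.var(y_hat) + np.var(e))
```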
advanced topics
BLUE
- drop assumption: ϵ independent of X
- instead: $E[\epsilon \mid X] = 0$, $\mathrm{cov}[\epsilon \mid X] = \sigma^2 I$
- can rewrite: $E[\epsilon] = 0$, $\mathrm{cov}[\epsilon] = \sigma^2 I$, fixing X
- Gauss-Markov theorem - assume the linear model and the assumption above: when X is fixed, the OLS estimator is BLUE = best linear unbiased estimator
- i.e., it has the smallest variance among linear unbiased estimators
- prove this (sketch below)
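A sketch of the standard Gauss-Markov argument (my reconstruction, not copied from Freedman): any linear unbiased estimator is OLS plus a correction D with DX = 0, and the correction only adds variance.

```latex
% Standard Gauss-Markov proof sketch.
% A linear estimator has the form \tilde\theta = AY; conditional unbiasedness
% for every \theta forces AX = I. Write A = (X^TX)^{-1}X^T + D, so DX = 0.
\begin{align*}
\operatorname{Cov}(\tilde\theta \mid X)
  &= A \operatorname{Cov}(Y \mid X) A^T = \sigma^2 A A^T \\
  &= \sigma^2 \big[(X^TX)^{-1}X^T + D\big]\big[X(X^TX)^{-1} + D^T\big] \\
  &= \sigma^2 \big[(X^TX)^{-1} + D D^T\big]
     \qquad \text{(cross terms vanish because } DX = 0\text{)} \\
  &\succeq \sigma^2 (X^TX)^{-1}
   = \operatorname{Cov}(\hat\theta_{\mathrm{OLS}} \mid X),
\end{align*}
% since D D^T is positive semidefinite; equality holds iff D = 0, i.e. OLS.
```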
GLS
- GLMs roughly solve the problem where outcomes are non-Gaussian
- mean is related to $w^Tx$ through a link function (e.g. logistic regression uses the sigmoid)
- also assume a different probability distribution on Y (e.g. logistic regression assumes Bernoulli)
- generalized least squares regression model: instead of the assumption above, use $E[\epsilon \mid X] = 0$, $\mathrm{cov}[\epsilon \mid X] = G$, with $G$ symmetric positive definite
- covariance formula changes: $\mathrm{cov}(\hat{\theta}_{\mathrm{OLS}} \mid X) = (X^TX)^{-1}X^TGX(X^TX)^{-1}$
- the estimator is the same, but it is no longer BLUE - can correct for this by whitening: $(G^{-1/2}Y) = (G^{-1/2}X)\theta + (G^{-1/2}\epsilon)$
- feasible GLS = Aitken estimator - use an estimate $\hat{G}$
- examples
- simple
- iteratively reweighted
- 3 assumptions can break down:
- if $E[\epsilon \mid X] \ne 0$ - GLS estimator is biased
- else if $\mathrm{cov}(\epsilon \mid X) \ne G$ - GLS is unbiased, but the covariance formula breaks down
- if G is estimated from data but the estimation procedure’s assumptions are violated, the estimator gives a misleading estimate of the covariance
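A minimal numpy sketch (diagonal G and toy data of my own) of GLS by whitening: regressing $G^{-1/2}y$ on $G^{-1/2}X$ gives the same answer as the direct formula $(X^TG^{-1}X)^{-1}X^TG^{-1}y$.

```python
# Minimal sketch: GLS by whitening with G^{-1/2} when G is known (diagonal here).
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
X = rng.normal(size=(n, p))
w = rng.uniform(0.5, 3.0, size=n)          # error variances; G = diag(w)
G = np.diag(w)
eps = rng.normal(size=n) * np.sqrt(w)      # heteroscedastic errors
y = X @ np.array([1.0, -2.0, 0.5]) + eps

G_inv_sqrt = np.diag(1.0 / np.sqrt(w))     # G^{-1/2} is easy for a diagonal G
Xw, yw = G_inv_sqrt @ X, G_inv_sqrt @ y    # whitened design and response

beta_whitened = np.linalg.lstsq(Xw, yw, rcond=None)[0]
beta_direct = np.linalg.solve(X.T @ np.linalg.inv(G) @ X,
                              X.T @ np.linalg.inv(G) @ y)
print(beta_whitened, beta_direct)          # should agree
```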
path models
- path model - graphical way to represent a regression equation
- making causal inferences by regression requires a response schedule
simultaneous equations
- simultaneous-equation models - use instrumental variables / two-stage least squares
- these techniques avoid simultaneity bias = endogeneity bias
binary variables
- indicator variables take on the value 0 or 1
- dummy coding - with an intercept the full indicator matrix is singular, so we drop the last indicator variable - the dropped class is called the reference class / baseline class
- effect coding
- one vector is all -1s
- $\beta_0$ should be the weighted average of the class averages
- orthogonal coding
- additive effects assume that each predictor’s effect on the response does not depend on the value of the other predictor (holding the other one fixed)
- assume they have the same slope
- interaction effects allow the effect of one predictor on the response to depend on the values of other predictors.
- $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i$
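A small numpy sketch (toy data; class "a" is my arbitrary choice of baseline) of dummy coding plus an interaction column:

```python
# Minimal sketch: dummy coding with a reference class, plus an interaction
# column (indicator * numeric predictor) in the design matrix.
import numpy as np

group = np.array(["a", "b", "c", "a", "b", "c"])   # categorical predictor, 3 classes
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # numeric predictor

# Dummy coding: drop class "a" as the reference / baseline class.
d_b = (group == "b").astype(float)
d_c = (group == "c").astype(float)

intercept = np.ones_like(x2)
X_additive = np.column_stack([intercept, d_b, d_c, x2])

# Interaction effects: the slope on x2 is allowed to differ by class.
X_interact = np.column_stack([X_additive, d_b * x2, d_c * x2])
print(X_interact)
```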
LR with non-linear basis functions
- can have nonlinear basis functions (ex. polynomial regression)
- radial basis function - ex. kernel function (Gaussian RBF)
- $\exp\!\left(-(x - r)^2 / (2\lambda^2)\right)$
- non-parametric algorithm - we don’t get a fixed parameter vector $\theta$; must keep the data
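A minimal numpy sketch (my own toy data; the grid of centers and the bandwidth are arbitrary choices) of linear regression on Gaussian RBF features:

```python
# Minimal sketch: least squares on Gaussian RBF basis features
# exp(-(x - r)^2 / (2 * lam^2)) with centers r on a fixed grid.
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=80))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

centers = np.linspace(0, 10, 12)    # basis-function centers r
lam = 1.0                           # bandwidth

# Design matrix: one RBF feature per center, plus an intercept column.
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * lam ** 2))
Phi = np.column_stack([np.ones_like(x), Phi])

theta = np.linalg.lstsq(Phi, y, rcond=None)[0]   # ordinary LS in feature space
y_hat = Phi @ theta
print(np.mean((y - y_hat) ** 2))                 # training MSE
```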
locally weighted LR (lowess)
- recompute model for each target point
- instead of minimizing SSE, we minimize SSE weighted by each observation’s closeness to the sample we want to query
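A minimal numpy sketch of locally weighted linear regression on toy data; I use a Gaussian weight function for simplicity (classic lowess uses a tricube kernel plus robustness iterations):

```python
# Minimal sketch: for each query point x0, refit weighted least squares with
# weights that decay with distance from x0, then predict at x0.
import numpy as np

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, size=100))
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
X = np.column_stack([np.ones_like(x), x])      # local model: intercept + slope

def lowess_predict(x0, tau=0.5):
    """Weighted LS fit around x0; tau is the bandwidth (my choice)."""
    w = np.exp(-(x - x0) ** 2 / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x0]) @ theta

print(lowess_predict(3.0), np.sin(3.0))        # prediction near the true curve
```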
kernel regression
- nonparametric method
- $\operatorname{E}(Y \mid X=x) = \int y\, f(y \mid x)\, dy = \int y\, \frac{f(x,y)}{f(x)}\, dy$
- using kernel density estimation for the joint density $f(x,y)$ and for $f(x)$ with a kernel $K$: $\hat{f}(x,y) = \frac{1}{n}\sum_{i=1}^{n} K_h(x-x_i) K_h(y-y_i)$, $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x-x_i)$
- we get the Nadaraya-Watson estimator: $\begin{align} \operatorname{\hat{E}}(Y \mid X=x) &= \int \frac{y \sum_{i=1}^{n} K_h(x-x_i) K_h(y-y_i)}{\sum_{j=1}^{n} K_h(x-x_j)}\, dy \\ &= \frac{\sum_{i=1}^{n} K_h(x-x_i) \int y\, K_h(y-y_i)\, dy}{\sum_{j=1}^{n} K_h(x-x_j)} \\ &= \frac{\sum_{i=1}^{n} K_h(x-x_i)\, y_i}{\sum_{j=1}^{n} K_h(x-x_j)} \end{align}$
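A minimal numpy sketch (Gaussian kernel, toy data of my own) of the resulting estimate, a kernel-weighted average of the $y_i$:

```python
# Minimal sketch: the Nadaraya-Watson kernel regression estimate.
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, size=100)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

def nadaraya_watson(x0, h=0.5):
    """E_hat(Y | X = x0) = sum_i K_h(x0 - x_i) y_i / sum_j K_h(x0 - x_j)."""
    w = np.exp(-(x0 - x) ** 2 / (2 * h ** 2))   # Gaussian kernel K_h
    return np.sum(w * y) / np.sum(w)

print(nadaraya_watson(3.0), np.sin(3.0))
```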
gam
- generalized additive models: assume mean output is sum of functions of individual variables (no interactions)
- learn individual functions using splines
sums interpretation
- SST - total sum of squares - measure of total variation in response variable
- $\sum_i (y_i - \bar{y})^2$
- SSR - regression sum of squares - measure of variation explained by predictors
- $\sum_i (\hat{y}_i - \bar{y})^2$
- SSE - measure of variation not explained by predictors
- $\sum_i (y_i - \hat{y}_i)^2$
- SST = SSR + SSE
- $R^2 = \mathrm{SSR}/\mathrm{SST}$ - coefficient of determination
- measures the proportion of variation in Y that is explained by the predictor
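A quick numpy check (toy data) that SST = SSR + SSE and $R^2 = \mathrm{SSR}/\mathrm{SST}$ for an OLS fit with an intercept (the decomposition needs the intercept):

```python
# Minimal sketch: verify SST = SSR + SSE numerically for an OLS fit.
import numpy as np

rng = np.random.default_rng(7)
n = 50
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
print(sst, ssr + sse)   # equal up to floating point
print(ssr / sst)        # R^2
```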
geometry - J. 6
- LMS = least mean squares (p-dimensional geometries)
- $y_n = \theta^T x_n + \epsilon_n$
- $\theta^{(t+1)} = \theta^{(t)} + \alpha (y_n - \theta^{(t)T} x_n) x_n$
- converges if $0 < \alpha < 2/\|x_n\|^2$
- if N = p and all $x^{(i)}$ are linearly independent, then there exists an exact solution $\theta$
- solving requires finding the orthogonal projection of y onto the column space of X (n-dimensional geometries)
- 3 proofs
- geometry - $y - X\theta^*$ must be orthogonal to the columns of X: $X^T(y - X\theta^*) = 0$
- minimize the least-squares cost function and differentiate
- show HY projects Y onto col(X)
- any of these approaches yields the normal equations: $X^TX\theta^* = X^Ty$
- SGD
- SGD converges to the normal-equation solution
- convergence analysis: requires $0 < \rho < 2/\lambda_{\max}[X^TX]$
- algebraic analysis: expand $\theta^{(t+1)}$ and take $t \to \infty$
- geometric convergence analysis: consider the contours of the loss function
- weighted least squares: $J(\theta) = \frac{1}{2}\sum_n w_n (y_n - \theta^T x_n)^2$
- yields $X^TWX\theta^* = X^TWy$
- probabilistic interpretation
- $p(y \mid x, \theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - \theta^T x_n)^2\right)$
- $\ell(\theta; x, y) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - \theta^T x_n)^2$ (up to an additive constant)
- maximizing the log-likelihood is equivalent to minimizing the least-squares cost function
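A minimal numpy sketch (toy data; the step size is my own choice) showing the LMS update converging to the normal-equation solution:

```python
# Minimal sketch: the LMS / SGD update theta <- theta + alpha * (y_n - theta^T x_n) * x_n
# approaches the normal-equation solution (X^T X)^{-1} X^T y.
import numpy as np

rng = np.random.default_rng(8)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=n)

theta = np.zeros(p)
alpha = 0.01                                  # step size (kept small for stability)
for epoch in range(50):
    for i in rng.permutation(n):              # visit points in random order
        theta += alpha * (y[i] - theta @ X[i]) * X[i]

theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
print(theta, theta_normal_eq)                 # should be close
```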
likelihood calcs
normal equation
- $L(\theta) = \frac{1}{2}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$
- $L(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)$
- $L(\theta) = \frac{1}{2}(\theta^TX^T - y^T)(X\theta - y)$
- $L(\theta) = \frac{1}{2}(\theta^TX^TX\theta - 2\theta^TX^Ty + y^Ty)$
- $0 = \frac{\partial L}{\partial \theta} = X^TX\theta - X^Ty$
- $\theta = (X^TX)^{-1}X^Ty$
ridge regression
- $L(\theta) = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 + \lambda\|\theta\|_2^2$
- $L(\theta) = (X\theta - y)^T(X\theta - y) + \lambda\theta^T\theta$
- $L(\theta) = \theta^TX^TX\theta - 2\theta^TX^Ty + y^Ty + \lambda\theta^T\theta$
- $0 = \frac{\partial L}{\partial \theta} = 2X^TX\theta - 2X^Ty + 2\lambda\theta$
- $\theta = (X^TX + \lambda I)^{-1}X^Ty$
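A quick numeric sanity check (random data of my own) of the gradient used in this ridge derivation and of the closed-form solution:

```python
# Minimal sketch: finite-difference check of dL/dtheta = 2 X^T X theta - 2 X^T y
# + 2 lambda theta, and a check that the closed form zeroes that gradient.
import numpy as np

rng = np.random.default_rng(9)
n, p, lam = 20, 4, 0.7
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def loss(theta):
    return np.sum((X @ theta - y) ** 2) + lam * np.sum(theta ** 2)

def grad(theta):
    return 2 * X.T @ X @ theta - 2 * X.T @ y + 2 * lam * theta

theta0 = rng.normal(size=p)
eps = 1e-6
num_grad = np.array([(loss(theta0 + eps * e) - loss(theta0 - eps * e)) / (2 * eps)
                     for e in np.eye(p)])
print(np.max(np.abs(num_grad - grad(theta0))))   # ~0: gradient formula checks out

theta_star = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.max(np.abs(grad(theta_star))))          # ~0 at the closed-form solution
```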