
linear models

  • Material from “Statistical Models: Theory and Practice” by David Freedman

introduction

  • $Y = X\beta + \epsilon$

    | type of linear regression | X            | Y                |
    | ------------------------- | ------------ | ---------------- |
    | simple                    | univariate   | univariate       |
    | multiple                  | multivariate | univariate       |
    | multivariate              | either       | multivariate     |
    | generalized               | either       | error not normal |
  • minimize $L(\theta) = \|Y - X\theta\|_2^2$; the solution is $\hat{\theta} = (X^TX)^{-1}X^TY$ (see the numpy sketch after this list)

  • 2 proofs

    1. set the derivative to zero and solve
    2. use the projection matrix H to show HY is the projection of Y onto R(X)
      1. the projection matrix maps the responses to the predictions: $\hat{y} = Hy$
    • define the projection (hat) matrix $H = X(X^TX)^{-1}X^T$
      • show $\|Y - X\theta\|^2 \ge \|Y - HY\|^2$ for every $\theta$
      • key idea: subtract and add $HY$
  • interpretation
    • if features are correlated, the weights aren’t stable / can’t be interpreted
    • the inverse curvature $(X^TX)^{-1}$ dictates stability
    • importance: weight * feature value
  • LS doesn’t work when p ≫ n because of collinearity of the columns of X
  • assumptions
    • $Y \sim N(X\beta, \sigma^2 I)$, i.e. $\epsilon \sim N(0, \sigma^2 I)$
    • homoscedasticity: $\mathrm{var}(Y_i|X)$ is the same for all i
      • opposite of heteroscedasticity
  • multicollinearity - predictors highly correlated
    • variance inflation factor (VIF) - measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictors are not linearly related
  • normal linear regression
    • variance MLE: $\hat{\sigma}^2 = \frac{1}{n}\sum_i (y_i - \hat{\theta}^T x_i)^2$
      • for the unbiased estimator, we divide by $n - p$ instead
    • the LS estimator has distribution $N(\theta, \sigma^2(X^TX)^{-1})$
  • linear regression model
    • when n is large, the LS estimator has an approximately normal distribution provided that $X^TX/n$ is approximately positive definite
  • weighted LS: minimize $\sum_i w_i (y_i - x_i^T\theta)^2$
    • $\hat{\theta} = (X^TWX)^{-1}X^TWY$
    • heteroscedastic normal linear regression model: errors $\sim N(0, 1/w_i)$
  • leverage scores - measure how much each $x_i$ influences the LS fit
    • for data point i, $H_{ii}$ is the leverage score
  • LAD (least absolute deviation) fit
    • MLE estimator when error is Laplacian
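A minimal numpy sketch of the closed-form fit, the hat matrix, leverage scores, and weighted LS described above (simulated data; the coefficient values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# normal equations: theta_hat = (X^T X)^{-1} X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# hat (projection) matrix H = X (X^T X)^{-1} X^T; predictions are H y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
assert np.allclose(y_hat, X @ theta_hat)

# leverage score of data point i is the diagonal entry H_ii
leverage = np.diag(H)

# weighted least squares: theta_hat = (X^T W X)^{-1} X^T W y
w = rng.uniform(0.5, 2.0, size=n)
W = np.diag(w)
theta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(theta_hat, theta_wls, leverage[:5])
```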

recent notes

regularization

  • when $X^TX$ isn’t invertible, we can’t use the normal equations, and gradient descent is likely unstable
    • X is $n \times p$; usually n ≫ p and X almost always has rank p
    • problems arise when n < p
  • intuitive way to fix this problem is to reduce p by getting rid of features
  • a lot of papers assume your data is already zero-centered
    • conventionally don’t regularize the intercept term
  1. ridge regression (L2)
    • if $X^TX$ is not invertible, add a small element to the diagonal
    • then it becomes invertible
    • for small λ the numerical solution is still unstable
    • proof of why it’s invertible is difficult
    • $\arg\min_\beta \sum_i (y_i - \hat{y}_i)^2 + \lambda\|\beta\|_2^2$
    • equivalent to minimizing $\sum_i (y_i - \hat{y}_i)^2$ s.t. $\sum_j \beta_j^2 \le t$
    • solution is $\hat{\beta}_\lambda = (X^TX + \lambda I)^{-1}X^Ty$ (see the sketch after this list)
    • when $X^TX = I$, $\hat{\beta}_{\text{Ridge}} = \frac{1}{1+\lambda}\hat{\beta}_{\text{Least Squares}}$
  2. lasso regression (L1)
    • $\sum_i (y_i - \hat{y}_i)^2 + \lambda\|\beta\|_1$
    • equivalent to minimizing $\sum_i (y_i - \hat{y}_i)^2$ s.t. $\sum_j |\beta_j| \le t$
    • “least absolute shrinkage and selection operator”
    • acts in a nonlinear manner on the outcome y
    • keeps the same SSE loss function, but adds an L1-norm constraint
    • doesn’t have a closed form for β
    • because of the absolute value, the gradient doesn’t exist at 0
    • can use directional derivatives
    • best solver is LARS (least angle regression); if the tuning parameter is chosen well, it will set lots of coordinates to 0
    • convex functions / convex sets (like circles) are easier to solve over
    • disadvantages
      • if p > n, lasso selects at most n variables
      • if pairwise correlations are very high, lasso selects only one of the variables
  3. elastic net - hybrid of the other two
    • $\hat{\beta}_{\text{NaiveENet}} = \arg\min_\beta \sum_i (y_i - \hat{y}_i)^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$
    • the L1 part generates a sparse model
    • the L2 part encourages the grouping effect and stabilizes the L1 regularization path
    • grouping effect - a group of highly correlated features should all be selected
      • the naive elastic net has too much shrinkage, so we rescale: $\hat{\beta}_{\text{ENet}} = (1+\lambda_2)\hat{\beta}_{\text{NaiveENet}}$
      • to solve, fix the L2 penalty and solve a lasso problem
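A small numpy sketch contrasting the closed-form ridge solution with a lasso fit via proximal gradient descent with soft-thresholding (ISTA is just one of several possible solvers; the data and λ value are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]              # sparse ground truth
y = X @ beta_true + rng.normal(scale=0.5, size=n)

lam = 1.0

# ridge: closed form (X^T X + lambda I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# lasso: no closed form; use proximal gradient descent (ISTA) with soft-thresholding
def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the SSE gradient
beta_lasso = np.zeros(p)
for _ in range(5000):
    grad = 2 * X.T @ (X @ beta_lasso - y)     # gradient of the SSE term
    beta_lasso = soft_threshold(beta_lasso - step * grad, step * lam)

print("ridge:", np.round(beta_ridge, 2))
print("lasso:", np.round(beta_lasso, 2))      # several coordinates end up exactly 0
```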

the regression line (freedman ch 2)

  • regression line
    • goes through $(\bar{x}, \bar{y})$
    • slope: $r\, s_y / s_x$
    • intercept: $\bar{y} - \text{slope}\cdot\bar{x}$
    • basically fits the graph of averages (minimizes MSE)
  • SD line
    • same, except the slope is $\mathrm{sign}(r)\, s_y / s_x$
    • intercept changes accordingly
  • for regression, MSE $= (1 - r^2)\,\mathrm{Var}(Y)$ (checked in the sketch below)
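A quick numpy check of these formulas on simulated data (the slope 2.0 and the noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.5, size=200)

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(), y.std()

slope = r * sy / sx                      # regression-line slope
intercept = y.mean() - slope * x.mean()  # the line passes through (x_bar, y_bar)

resid = y - (intercept + slope * x)
mse = np.mean(resid ** 2)
print(np.isclose(mse, (1 - r ** 2) * y.var()))  # MSE = (1 - r^2) Var(Y)
```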

multiple regression (freedman ch 4)

  • assumptions
    1. assume n > p and X has full rank (rank p - columns are linearly independent)
    2. $\epsilon_i$ are iid with mean 0 and variance $\sigma^2$
    3. $\epsilon$ independent of X
      • the residuals $e_i$ are still orthogonal to X
  • OLS is conditionally unbiased
    • $E[\hat{\theta}|X] = \theta$
  • $\mathrm{Cov}(\hat{\theta}|X) = \sigma^2(X^TX)^{-1}$
    • $\hat{\sigma}^2 = \frac{1}{n-p}\sum_i e_i^2$
      • this is unbiased - dividing by just n would be too small, since we have minimized the $e_i$, so their variance is lower than the variance of the $\epsilon_i$
  • random errors ϵ
  • residuals e
  • $H = X(X^TX)^{-1}X^T$
    1. $e = (I - H)Y = (I - H)\epsilon$
    2. H is symmetric
    3. $H^2 = H$, $(I - H)^2 = I - H$
    4. $HX = X$
    5. $e \perp X$
  • basically H projects Y onto R(X)
  • $E[\hat{\sigma}^2|X] = \sigma^2$
  • random errs don’t need to be normal
  • variance
    • $\mathrm{var}(Y) = \mathrm{var}(X\hat{\theta}) + \mathrm{var}(e)$
      • $\mathrm{var}(X\hat{\theta})$ is the explained variance
      • fraction of variance explained: $R^2 = \mathrm{var}(X\hat{\theta})/\mathrm{var}(Y)$
      • like summing squares by projecting
    • if there is no intercept in the regression equation, $R^2 = \|\hat{Y}\|^2 / \|Y\|^2$ (see the sketch after this list)
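A short numpy sketch checking the hat-matrix identities, the unbiased variance estimate, and R² on simulated data (all values made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta = np.array([1.0, 0.5, -1.0, 2.0])
y = X @ theta + rng.normal(scale=0.7, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)     # hat matrix
e = (np.eye(n) - H) @ y                   # residuals e = (I - H) y

assert np.allclose(H, H.T)                # H is symmetric
assert np.allclose(H @ H, H)              # H is idempotent
assert np.allclose(H @ X, X)              # HX = X
assert np.allclose(X.T @ e, 0)            # residuals orthogonal to the columns of X

sigma2_hat = e @ e / (n - p)              # unbiased estimate of sigma^2 (divide by n - p)

y_hat = H @ y
R2 = np.var(y_hat) / np.var(y)            # fraction of variance explained (intercept included)
print(sigma2_hat, R2)
```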

advanced topics

BLUE

  • drop the assumption that ϵ is independent of X
    • instead: $E[\epsilon|X] = 0$, $\mathrm{cov}[\epsilon|X] = \sigma^2 I$
    • can rewrite as $E[\epsilon] = 0$, $\mathrm{cov}[\epsilon] = \sigma^2 I$ when X is fixed
  • Gauss-Markov theorem - assume the linear model and the assumption above: when X is fixed, the OLS estimator is BLUE = best linear unbiased estimator
    • has the smallest variance among linear unbiased estimators
    • prove this (a standard proof sketch is below)
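A standard proof sketch of the Gauss-Markov theorem (the usual textbook argument, recorded here for reference):

  • write any linear unbiased estimator as $\tilde{\theta} = CY$ with $C = (X^TX)^{-1}X^T + D$
  • unbiasedness for every $\theta$ gives $E[\tilde{\theta}|X] = \theta + DX\theta$, so $DX = 0$
  • then $\mathrm{cov}(\tilde{\theta}|X) = \sigma^2 CC^T = \sigma^2[(X^TX)^{-1} + DD^T] = \mathrm{cov}(\hat{\theta}_{OLS}|X) + \sigma^2 DD^T$
  • $DD^T$ is positive semidefinite, so OLS has the smallest variance among linear unbiased estimators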

GLS

  • GLMs roughly solve the problem where the outcomes are non-Gaussian
    • the mean is related to $w^Tx$ through a link function (ex. logistic regression assumes the sigmoid)
    • also assume a different probability distribution on Y (ex. logistic regression assumes Bernoulli)
  • generalized least squares regression model: instead of the assumption above, use $E[\epsilon|X] = 0$, $\mathrm{cov}[\epsilon|X] = G$, where $G$ is symmetric positive definite
    • the covariance formula changes: $\mathrm{cov}(\hat{\theta}_{OLS}|X) = (X^TX)^{-1}X^TGX(X^TX)^{-1}$
    • the estimator is the same, but it is no longer BLUE - can correct for this by whitening: $(G^{-1/2}Y) = (G^{-1/2}X)\theta + (G^{-1/2}\epsilon)$ (see the sketch after this list)
  • feasible GLS = Aitken estimator - use an estimate $\hat{G}$
  • examples
    • simple
    • iteratively reweighted
  • 3 ways the assumptions can break down:
    1. if $E[\epsilon|X] \ne 0$ - the GLS estimator is biased
    2. else if $\mathrm{cov}(\epsilon|X) \ne G$ - GLS is unbiased, but the covariance formula breaks down
    3. if $\hat{G}$ is estimated from data but the assumptions behind the estimation procedure are violated, the estimator gives a misleading estimate of the covariance
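A minimal numpy sketch of GLS via whitening, under the assumption that the error covariance G is known and diagonal (all data simulated):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta = np.array([1.0, -2.0, 0.5])

# heteroscedastic errors: cov(eps | X) = G (diagonal here for simplicity)
g = rng.uniform(0.2, 3.0, size=n)
y = X @ theta + rng.normal(size=n) * np.sqrt(g)

# whiten: multiply the model through by G^{-1/2}, then run OLS on the transformed data
G_inv_sqrt = np.diag(1.0 / np.sqrt(g))
Xw, yw = G_inv_sqrt @ X, G_inv_sqrt @ y
theta_gls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

# equivalently: theta_gls = (X^T G^{-1} X)^{-1} X^T G^{-1} y
G_inv = np.diag(1.0 / g)
theta_gls2 = np.linalg.solve(X.T @ G_inv @ X, X.T @ G_inv @ y)
assert np.allclose(theta_gls, theta_gls2)
print(theta_gls)
```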

path models

  • path model - graphical way to represent a regression equation
  • making causal inferences by regression requires a response schedule

simultaneous equations

  • simultaneous-equation models - use instrumental variables / two-stage least squares (see the sketch below)
    • these techniques avoid simultaneity bias (= endogeneity bias)
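A rough numpy sketch of two-stage least squares with one endogenous regressor and one instrument; the data-generating process is made up purely to show the two stages:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
z = rng.normal(size=n)                         # instrument: correlated with x, not with u
u = rng.normal(size=n)                         # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)     # endogenous regressor (correlated with u)
y = 1.0 + 2.0 * x + u                          # structural equation, true slope = 2

def ols(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = ols(X, y)                           # biased, because x is correlated with u

# stage 1: regress the endogenous regressor on the instrument, keep the fitted values
x_hat = Z @ ols(Z, x)
# stage 2: regress y on the fitted values
beta_2sls = ols(np.column_stack([np.ones(n), x_hat]), y)

print("OLS slope:", beta_ols[1], " 2SLS slope:", beta_2sls[1])
```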

binary variables

  • indicator variables take on the value 0 or 1
    • dummy coding - with the intercept included, the indicator matrix is singular, so we drop the last indicator variable - its class is called the reference class / baseline class
    • effect coding
      • one vector is all -1s
      • $\beta_0$ should be the weighted average of the class averages
    • orthogonal coding
  • additive effects assume that each predictor’s effect on the response does not depend on the value of the other predictor (as long as the other one is held fixed)
    • assume they have the same slope
  • interaction effects allow the effect of one predictor on the response to depend on the values of other predictors
    • $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2} + \varepsilon_i$ (see the sketch after this list)
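A small numpy sketch of dummy coding (dropping the last level as the baseline) and of adding an interaction term; the toy factor and covariate are invented:

```python
import numpy as np

# toy data: a 3-level categorical factor and one continuous covariate
levels = np.array(["a", "b", "c"])
group = np.array(["a", "b", "c", "a", "b", "c"])
x = np.array([0.5, 1.0, -0.2, 0.7, 0.1, 1.3])

# dummy coding: one indicator per level, then drop the last level ("c" = baseline)
dummies = (group[:, None] == levels[None, :]).astype(float)[:, :-1]

intercept = np.ones(len(group))

# additive model: intercept, indicators, covariate (parallel slopes across groups)
X_additive = np.column_stack([intercept, dummies, x])

# interaction model: also include indicator * covariate columns
# (each group gets its own slope)
X_interaction = np.column_stack([intercept, dummies, x, dummies * x[:, None]])

print(X_additive.shape, X_interaction.shape)
```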

LR with non-linear basis functions

  • can have nonlinear basis functions (ex. polynomial regression)
  • radial basis function - ex. Gaussian RBF kernel
    • $\exp(-(x - r)^2 / (2\lambda^2))$
  • non-parametric algorithm - we don’t get a fixed set of parameters θ; we must keep the data (see the sketch after this list)
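A brief numpy sketch of linear regression on Gaussian RBF features (the centers r and bandwidth λ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-3, 3, size=100))
y = np.sin(x) + rng.normal(scale=0.2, size=100)

centers = np.linspace(-3, 3, 10)       # RBF centers r (arbitrary grid)
lam = 0.5                              # bandwidth lambda

# design matrix of Gaussian RBF features: phi_j(x) = exp(-(x - r_j)^2 / (2 lambda^2))
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * lam ** 2))
Phi = np.column_stack([np.ones(len(x)), Phi])

theta = np.linalg.lstsq(Phi, y, rcond=None)[0]   # ordinary least squares on the features
y_hat = Phi @ theta
print(theta)
```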

locally weighted LR (lowess)

  • recompute model for each target point
  • instead of minimizing SSE, we minimize SSE weighted by each observation’s closeness to the point we want to query (see the sketch below)
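A minimal numpy sketch of locally weighted linear regression at a single query point with a Gaussian weighting kernel (the bandwidth tau is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.2, size=200)

def locally_weighted_prediction(x_query, x, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and return its prediction."""
    X = np.column_stack([np.ones_like(x), x])
    w = np.exp(-(x - x_query) ** 2 / (2 * tau ** 2))   # closeness weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # weighted normal equations
    return np.array([1.0, x_query]) @ theta

print(locally_weighted_prediction(5.0, x, y))
```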

kernel regression

  • nonparametric method

  • $\operatorname{E}(Y \mid X=x) = \int y\, f(y \mid x)\, dy = \int y\, \frac{f(x,y)}{f(x)}\, dy$

    Using kernel density estimates for the joint density $f(x,y)$ and the marginal $f(x)$ with a kernel $K$,

    $\hat{f}(x,y) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i)\, K_h(y - y_i), \qquad \hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i),$

    we get the Nadaraya-Watson estimator (see the sketch below)

    $\begin{aligned} \operatorname{\hat{E}}(Y \mid X=x) &= \int \frac{y \sum_{i=1}^{n} K_h\left(x - x_i\right) K_h\left(y - y_i\right)}{\sum_{j=1}^{n} K_h\left(x - x_j\right)}\, dy, \\ &= \frac{\sum_{i=1}^{n} K_h\left(x - x_i\right) \int y\, K_h\left(y - y_i\right) dy}{\sum_{j=1}^{n} K_h\left(x - x_j\right)}, \\ &= \frac{\sum_{i=1}^{n} K_h\left(x - x_i\right) y_i}{\sum_{j=1}^{n} K_h\left(x - x_j\right)}. \end{aligned}$
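A compact numpy sketch of the resulting Nadaraya-Watson estimator with a Gaussian kernel (the bandwidth h is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, size=300)
y = np.cos(x) + rng.normal(scale=0.3, size=300)

def nadaraya_watson(x_query, x, y, h=0.5):
    """Kernel-weighted average of the y_i, with weights K_h(x_query - x_i)."""
    K = np.exp(-(x_query - x) ** 2 / (2 * h ** 2))   # Gaussian kernel K_h
    return np.sum(K * y) / np.sum(K)

print(nadaraya_watson(3.0, x, y))
```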

gam

  • generalized additive models: assume mean output is sum of functions of individual variables (no interactions)
    • learn individual functions using splines

sums interpretation

  • SST - total sum of squares - measure of total variation in the response variable
    • $\sum_i (y_i - \bar{y})^2$
  • SSR - regression sum of squares - measure of variation explained by the predictors
    • $\sum_i (\hat{y}_i - \bar{y})^2$
  • SSE - error sum of squares - measure of variation not explained by the predictors
    • $\sum_i (y_i - \hat{y}_i)^2$
  • SST = SSR + SSE
  • $R^2 = \frac{SSR}{SST}$ - coefficient of determination
    • measures the proportion of variation in Y that is explained by the predictors (checked numerically in the sketch below)
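A quick numpy check of the decomposition (note that SST = SSR + SSE relies on the model containing an intercept; data simulated):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 150
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])              # intercept included
y_hat = X @ np.linalg.solve(X.T @ X, X.T @ y)

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

print(np.isclose(SST, SSR + SSE))                 # True
print(SSR / SST, 1 - SSE / SST)                   # both equal R^2
```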

geometry - J. 6

  • LMS = least mean squares (p-dimensional geometries)

    • $y_n = \theta^T x_n + \epsilon_n$
    • LMS update: $\theta^{(t+1)} = \theta^{(t)} + \alpha\,(y_n - \theta^{(t)T} x_n)\, x_n$ (see the sketch at the end of this section)
      • converges if $0 < \alpha < 2/\|x_n\|^2$
    • if N = p and all $x^{(i)}$ are linearly independent, then there exists an exact solution θ
  • solving requires finding orthogonal projection of y on column space of X (n-dimensional geometries)

    1. 3 proofs
      1. geometry - $y - X\theta$ must be orthogonal to the columns of X: $X^T(y - X\theta) = 0$
      2. minimize the least-squares cost function and differentiate
      3. show HY projects Y onto col(X)
    • each of these approaches yields the normal equations: $X^TX\theta = X^Ty$
  • SGD

    • SGD converges to normal eqn
  • convergence analysis: requires $0 < \rho < 2/\lambda_{\max}[X^TX]$

    • algebraic analysis: expand $\theta^{(t+1)}$ and take $t \to \infty$
    • geometric convergence analysis: consider contours of the loss function
  • weighted least squares: $J(\theta) = \frac{1}{2}\sum_n w_n (y_n - \theta^T x_n)^2$

    • yields $X^TWX\theta = X^TWy$
  • probabilistic interpretation

    • $p(y|x,\theta) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - \theta^T x_n)^2\right)$
    • $\ell(\theta; x, y) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y_n - \theta^T x_n)^2$ (up to an additive constant)
      • maximizing the log-likelihood is equivalent to minimizing the least-squares cost function
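A minimal numpy sketch of the LMS (single-sample SGD) update converging toward the normal-equation solution (the step size α is chosen small enough to satisfy the condition above; data simulated):

```python
import numpy as np

rng = np.random.default_rng(10)
N, p = 500, 3
X = rng.normal(size=(N, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=N)

theta = np.zeros(p)
alpha = 0.01                                   # small enough that 0 < alpha < 2 / ||x_n||^2
for epoch in range(20):
    for n in rng.permutation(N):
        # LMS update: theta <- theta + alpha * (y_n - theta^T x_n) * x_n
        theta += alpha * (y[n] - theta @ X[n]) * X[n]

theta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)
print(theta, theta_normal_eq)                  # LMS iterates hover near the LS solution
```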

likelihood calcs

normal equation

  • $L(\theta) = \frac{1}{2}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$
  • $L(\theta) = \frac{1}{2}(X\theta - y)^T(X\theta - y)$
  • $L(\theta) = \frac{1}{2}(\theta^TX^T - y^T)(X\theta - y)$
  • $L(\theta) = \frac{1}{2}(\theta^TX^TX\theta - 2\theta^TX^Ty + y^Ty)$
  • $0 = \frac{\partial L}{\partial \theta} = X^TX\theta - X^Ty$
  • $\theta = (X^TX)^{-1}X^Ty$

ridge regression

  • $L(\theta) = \sum_{i=1}^{n}(\hat{y}_i - y_i)^2 + \lambda\|\theta\|_2^2$
  • $L(\theta) = (X\theta - y)^T(X\theta - y) + \lambda\theta^T\theta$
  • $L(\theta) = \theta^TX^TX\theta - 2\theta^TX^Ty + y^Ty + \lambda\theta^T\theta$
  • $0 = \frac{\partial L}{\partial \theta} = 2X^TX\theta - 2X^Ty + 2\lambda\theta$
  • $\theta = (X^TX + \lambda I)^{-1}X^Ty$ (both closed forms are checked numerically below)
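A tiny numpy check that the gradient of each loss above vanishes at its closed-form solution (simulated data):

```python
import numpy as np

rng = np.random.default_rng(11)
n, p, lam = 40, 4, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

def grad_ls(theta):
    # gradient of (1/2) ||X theta - y||^2
    return X.T @ (X @ theta - y)

def grad_ridge(theta):
    # gradient of ||X theta - y||^2 + lam ||theta||^2
    return 2 * X.T @ (X @ theta - y) + 2 * lam * theta

theta_ls = np.linalg.solve(X.T @ X, X.T @ y)
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(grad_ls(theta_ls), 0))        # True
print(np.allclose(grad_ridge(theta_ridge), 0))  # True
```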