data analysis
pqrs
- Goal: inference - conclusion or opinion formed from evidence
- PQRS
- P - population
- Q - question - 2 types
- hypothesis driven - does a new drug work
- discovery driven - find a drug that works
- R - representative data colleciton
- simple random sampling = SRS
- w/ replacement: $var(\bar{X}) = \sigma^2 / n$
- w/out replacement: $var(\bar{X}) = (1 - \frac{n}{N}) \sigma^2 / n$
- simple random sampling = SRS
- S - scrutinizing answers
visual summaries
- numerical summaries
- mean vs. median
- sd vs. iq range
- visual summaries
- histogram
- kernel density plot - Gaussian kernels
- with bandwidth h $K_h(t) = 1/h K(t/h)$
- plots
- box plot / pie-chart
- scatter plot / q-q plot
- q-q plot = probability plot - easily check normality
- plot percentiles of a data set against percentiles of a theoretical distr.
- should be straight line if they match
- transformations = feature engineering
- log/sqrt make long-tail data more centered and more normal
- delta-method - sets comparable bw (wrt variance) after log or sqrt transform: $Var(g(X)) \approx [g’(\mu_X)]^2 Var(X)$ where $\mu_X = E(X)$
- if assumptions don’t work, sometimes we can transform data so they work
- transform x - if residuals generally normal and have constant variance
- corrects nonlinearity - transform y - if relationship generally linear, but non-constant error variance
- stabilizes variance - if both problems, try y first - Box-Cox: Y’ = $Y^l : if : l \neq 0$, else log(Y)
- least squares
- inversion of pxp matrix ~O(p^3)
- regression effect - things tend to the mean (ex. bball children are shorter)
- in high dims, l2 worked best
- kernel smoothing + lowess
- can find optimal bandwidth
- nadaraya-watson kernel smoother - locally weighted scatter plot smoothing
- \(g_h(x) = \frac{\sum K_h(x_i - x) y_i}{\sum K_h (x_i - x)}\) where h is bandwidth - loess - multiple predictors / lowess - only 1 predictor
- also called local polynomial smoother - locally weighted polynomial
- take a window (span) around a point and fit weighted least squares line to that point
- replace the point with the prediction of the windowed line
- can use local polynomial fits rather than local linear fits
-
silhouette plots - good clusters members are close to each other and far from other clustersf
- popular graphic method for K selection
- measure of separation between clusters $s(i) = \frac{b(i) - a(i)}{max(a(i), b(i))}$
- a(i) - ave dissimilarity of data point i with other points within same cluster
- b(i) - lowest average dissimilarity of point i to any other cluster
- good values of k maximize the average silhouette score
- lack-of-fit test - based on repeated Y values at same X values