data analysis

pqrs

Goal: inference - conclusion or opinion formed from evidence
PQRS
- P - population
- Q - question - 2 types
  1. hypothesis driven - does a new drug work
  2. discovery driven - find a drug that works
- R - representative data colleciton
  - simple random sampling = SRS
    - w/ replacement: $var(\bar{X}) = \sigma^2 / n$
    - w/out replacement: $var(\bar{X}) = (1 - \frac{n}{N}) \sigma^2 / n$
- S - scrutinizing answers

visual summaries

numerical summaries
- mean vs. median
- sd vs. iq range
visual summaries
- histogram
- kernel density plot - Gaussian kernels
  - with bandwidth h $K_h(t) = 1/h K(t/h)$
plots
1. box plot / pie-chart
2. scatter plot / q-q plot
  - q-q plot = probability plot - easily check normality
  - plot percentiles of a data set against percentiles of a theoretical distr.
  - should be straight line if they match
3. transformations = feature engineering
  - log/sqrt make long-tail data more centered and more normal
  - delta-method - sets comparable bw (wrt variance) after log or sqrt transform: $Var(g(X)) \approx [g’(\mu_X)]^2 Var(X)$ where $\mu_X = E(X)$
  - if assumptions don’t work, sometimes we can transform data so they work
  - transform x - if residuals generally normal and have constant variance
  - corrects nonlinearity - transform y - if relationship generally linear, but non-constant error variance
  - stabilizes variance - if both problems, try y first - Box-Cox: Y’ = $Y^l : if : l \neq 0$, else log(Y)
4. least squares
  - inversion of pxp matrix ~O(p^3)
  - regression effect - things tend to the mean (ex. bball children are shorter)
  - in high dims, l2 worked best
5. kernel smoothing + lowess
  - can find optimal bandwidth
  - nadaraya-watson kernel smoother - locally weighted scatter plot smoothing
  - $g_h(x) = \frac{\sum K_h(x_i - x) y_i}{\sum K_h (x_i - x)}$ where h is bandwidth - loess - multiple predictors / lowess - only 1 predictor
  - also called local polynomial smoother - locally weighted polynomial
  - take a window (span) around a point and fit weighted least squares line to that point
  - replace the point with the prediction of the windowed line
  - can use local polynomial fits rather than local linear fits
6. silhouette plots - good clusters members are close to each other and far from other clustersf
  1. popular graphic method for K selection
  2. measure of separation between clusters $s(i) = \frac{b(i) - a(i)}{max(a(i), b(i))}$
  3. a(i) - ave dissimilarity of data point i with other points within same cluster
  4. b(i) - lowest average dissimilarity of point i to any other cluster
  5. good values of k maximize the average silhouette score
7. lack-of-fit test - based on repeated Y values at same X values