Gong Qingfeng | data analysis

data analysis


pqrs

  • Goal: inference - conclusion or opinion formed from evidence
  • PQRS
    • P - population
    • Q - question - 2 types
      1. hypothesis driven - does a new drug work
      2. discovery driven - find a drug that works
    • R - representative data collection
      • simple random sampling = SRS
        • w/ replacement: $var(\bar{X}) = \sigma^2 / n$
        • w/out replacement: $var(\bar{X}) = (1 - \frac{n}{N}) \sigma^2 / n$
    • S - scrutinizing answers
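The two SRS variance formulas can be checked by brute force on a tiny population. This is only a sketch: the population values and the helper name `variance_of_sample_mean` are made up for illustration, and every possible sample is enumerated instead of simulated.

```python
from itertools import combinations, product
from statistics import pvariance, variance

# Hypothetical toy population, chosen only so every sample can be enumerated.
population = [1, 2, 3, 4, 5]
N, n = len(population), 2

def variance_of_sample_mean(samples):
    """Exact variance of X-bar over a collection of equally likely samples."""
    means = [sum(s) / len(s) for s in samples]
    return pvariance(means)

# With replacement: all N**n ordered draws are equally likely.
with_repl = variance_of_sample_mean(product(population, repeat=n))
# Without replacement: all C(N, n) subsets are equally likely.
without_repl = variance_of_sample_mean(combinations(population, n))

formula_with = pvariance(population) / n                  # sigma^2 / n, divisor N
formula_without = (1 - n / N) * variance(population) / n  # sigma^2 with divisor N - 1

print(with_repl, formula_with)
print(without_repl, formula_without)
```

For this population both pairs agree (1 and 0.75). The without-replacement formula is exact when $\sigma^2$ is computed with divisor $N - 1$; with divisor $N$ the correction factor becomes $\frac{N - n}{N - 1}$, which is close to $1 - \frac{n}{N}$ for large N.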

visual summaries

  • numerical summaries
    • mean vs. median
    • sd vs. iq range
  • visual summaries
    • histogram
    • kernel density plot - Gaussian kernels
      • with bandwidth h: $K_h(t) = \frac{1}{h} K(t/h)$
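A minimal sketch of a Gaussian kernel density estimate built from the scaled kernel $K_h$. The data, the bandwidth `h = 0.5`, and the function names are assumptions for illustration, and the estimate is taken to be the average of $K_h(x - x_i)$ over the data points.

```python
import math

def gaussian_kernel(t):
    """Standard normal density K(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def scaled_kernel(t, h):
    """K_h(t) = (1/h) K(t/h)."""
    return gaussian_kernel(t / h) / h

def kde(x, data, h):
    """Density estimate at x: average of K_h(x - x_i) over the data."""
    return sum(scaled_kernel(x - xi, h) for xi in data) / len(data)

data = [1.0, 1.5, 2.0, 4.0, 4.2]  # made-up sample
h = 0.5                           # assumed bandwidth

# Riemann sum over a wide grid: the estimate should integrate to about 1.
step = 0.01
grid = [i * step for i in range(-500, 1200)]
area = sum(kde(x, data, h) for x in grid) * step

print(kde(1.5, data, h), kde(3.0, data, h), area)
```

A larger h spreads each bump out and gives a smoother curve; a smaller h follows individual points more closely.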
  • plots
    1. box plot / pie-chart
    2. scatter plot / q-q plot
      • q-q plot = probability plot - easily check normality
      • plot percentiles of a data set against percentiles of a theoretical distr.
      • should be straight line if they match
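The points of a normal q-q plot can be computed without any plotting library. In this sketch the plotting positions $(i - 0.5)/n$ are one common convention (an assumption, since the notes do not specify one), and `normal_qq_points` and the sample values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(data):
    """Pair each sorted observation with the matching standard-normal quantile."""
    n = len(data)
    ordered = sorted(data)
    std_normal = NormalDist()
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

sample = [2.1, 3.4, 1.9, 5.0, 2.8, 3.1, 4.2]  # made-up data
for theo_q, data_q in normal_qq_points(sample):
    print(f"{theo_q:7.3f}  {data_q:5.2f}")
```

If the data are roughly normal with mean $\mu$ and sd $\sigma$, the pairs fall near a line with intercept $\mu$ and slope $\sigma$.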
    3. transformations = feature engineering
      • log/sqrt make long-tail data more centered and more normal
      • delta-method - sets comparable bandwidth (wrt variance) after log or sqrt transform: $Var(g(X)) \approx [g'(\mu_X)]^2 Var(X)$ where $\mu_X = E(X)$; e.g. $Var(\log X) \approx Var(X) / \mu_X^2$
      • if assumptions don’t work, sometimes we can transform data so they work
      • transform x - if residuals generally normal and have constant variance
        • corrects nonlinearity
      • transform y - if relationship generally linear, but non-constant error variance
        • stabilizes variance
      • if both problems, try y first
      • Box-Cox: $Y' = Y^l$ if $l \neq 0$, else $\log(Y)$
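A small sketch of the Box-Cox power transform in the simple form used in these notes (power if $l \neq 0$, log otherwise). The function name and the data are made up; another common convention, not used here, is the scaled form $(Y^l - 1)/l$.

```python
import math

def box_cox(y, lam):
    """Simple Box-Cox transform: y**lam if lam != 0, else log(y)."""
    if y <= 0:
        raise ValueError("Box-Cox needs strictly positive values")
    if lam == 0:
        return math.log(y)
    return y ** lam

skewed = [1, 2, 4, 8, 16, 32, 64]  # made-up long-tailed data

print([round(box_cox(y, 0), 3) for y in skewed])    # log: evenly spaced values
print([round(box_cox(y, 0.5), 3) for y in skewed])  # sqrt: milder compression
```

In practice $l$ is usually chosen by trying several values and checking the residuals of the refitted model.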
    4. least squares
      • inversion of pxp matrix ~O(p^3)
      • regression effect - extreme values tend to be followed by values closer to the mean (ex. children of basketball players tend to be shorter than their parents)
      • in high dims, l2 worked best
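For a single predictor plus an intercept, least squares has a closed form and the matrix inversion reduces to one division. A minimal sketch, with a hypothetical helper `fit_line` and made-up data:

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # lies exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

With p predictors the same idea becomes solving the p x p normal equations, which is where the ~O(p^3) cost comes from.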
    5. kernel smoothing + lowess
      • can find optimal bandwidth
      • nadaraya-watson kernel smoother - locally weighted average of the $y_i$
      • $g_h(x) = \frac{\sum_i K_h(x_i - x) y_i}{\sum_i K_h (x_i - x)}$ where h is bandwidth
      • loess - multiple predictors / lowess (locally weighted scatter plot smoothing) - only 1 predictor
      • also called local polynomial smoother - locally weighted polynomial
      • take a window (span) around a point and fit weighted least squares line to that point
      • replace the point with the prediction of the windowed line
      • can use local polynomial fits rather than local linear fits
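A minimal sketch of the Nadaraya-Watson estimate $g_h(x)$ with a Gaussian kernel. The function names, data, and bandwidth are assumptions for illustration. This is the local-constant (weighted average) smoother, not a full lowess fit with local lines.

```python
import math

def gaussian_kernel(t):
    """Standard normal density K(t)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def nadaraya_watson(x, xs, ys, h):
    """g_h(x) = sum K_h(x_i - x) y_i / sum K_h(x_i - x)."""
    weights = [gaussian_kernel((xi - x) / h) / h for xi in xs]
    total = sum(weights)  # can underflow to 0 if x is far from every x_i
    return sum(w * yi for w, yi in zip(weights, ys)) / total

xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # made-up predictor values
ys = [0.1, 0.9, 2.2, 2.8, 4.1]  # made-up noisy responses
h = 0.75                        # assumed bandwidth

smoothed = [nadaraya_watson(x, xs, ys, h) for x in xs]
print([round(v, 3) for v in smoothed])
```

Lowess replaces the weighted average with a weighted least-squares line (or polynomial) fitted inside the window around each point.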
    6. silhouette plots - good clusters: members are close to each other and far from other clusters

      1. popular graphic method for K selection
      2. measure of separation between clusters $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
      3. a(i) - ave dissimilarity of data point i with other points within same cluster
      4. b(i) - lowest average dissimilarity of point i to any other cluster
      5. good values of k maximize the average silhouette score
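The silhouette definitions above can be sketched in a few lines. This assumes Euclidean distance as the dissimilarity, at least two clusters, and a score of 0 for singleton clusters (a common convention). The function name and the points are made up.

```python
import math

def silhouette_scores(points, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        same = by_cluster.pop(lab, [])
        if not same:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sum(same) / len(same)                              # within-cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]  # two obvious groups
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

To pick k, one would cluster the data for several values of k and keep the one with the highest average score.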
    7. lack-of-fit test - based on repeated Y values at same X values