principles

breiman

conversation
- moved sf -> la -> caltech (physics) -> columbia (math) -> berkeley (math)
- info theory + gambling
- CART, ACE, a probability book, and bagging
- ucla prof., then consultant, then founded stat computing at berkeley
- lots of cool outside activities
  - ex. selling ice in mexico
2 cultures paper
1. generative - data are generated by a given stochastic model
  - stat does this too much and needs to move to 2
  - ex. assume y = f(x, noise, parameters)
  - validation: goodness-of-fit and residuals
2. predictive - use an algorithmic model and treat the data mechanism as unknown
  - assume nothing about x and y
  - ex. predict y from x with a neural net
  - validation: prediction accuracy
axioms
1. Occam - simplicity vs. accuracy
2. Rashomon - lots of different good models, which explains best? - ex. a single decision tree is not robust (small changes in the data give a different tree)
3. Bellman - curse of dimensionality - might actually want to increase dimensionality (ex. svms embed the data in a higher dimension)
- industry was problem-solving, academia had too much culture
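
A minimal sketch of how the two cultures above validate a model differently. It is not from the paper: the data-generating function, the sample sizes, and the helper names (`make_data`, `fit_line`, `fit_knn`, `mse`) are all assumptions chosen for illustration. Culture 1 assumes a linear form and checks in-sample fit; culture 2 uses a k-nearest-neighbor predictor and checks held-out accuracy.

```python
import math
import random

random.seed(0)

def make_data(n):
    """Assumed truth for the demo only: y = sin(3x) + Gaussian noise."""
    xs = [random.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + random.gauss(0.0, 0.1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Culture 1: assume y = a + b*x + noise, estimate (a, b) by least squares."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, k=5):
    """Culture 2: no model of the mechanism; average the k nearest training responses."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict

def mse(predict, xs, ys):
    """Mean squared prediction error on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(300)
test_x, test_y = make_data(100)

line = fit_line(train_x, train_y)
knn = fit_knn(train_x, train_y)

# Culture 1 validation: goodness-of-fit on the training data (R^2 from residuals).
mean_y = sum(train_y) / len(train_y)
ss_tot = sum((y - mean_y) ** 2 for y in train_y)
ss_res = sum((line(x) - y) ** 2 for x, y in zip(train_x, train_y))
r_squared = 1.0 - ss_res / ss_tot

# Culture 2 validation: prediction accuracy on data not used for fitting.
line_test_mse = mse(line, test_x, test_y)
knn_test_mse = mse(knn, test_x, test_y)

print(f"linear model R^2 (train): {r_squared:.3f}")
print(f"held-out MSE  linear: {line_test_mse:.3f}  kNN: {knn_test_mse:.3f}")
```

Because the assumed truth is nonlinear, the linear model fits poorly, and the held-out comparison shows this without needing any assumption about the mechanism.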

box + tukey

questions
1. what points are relevant and irrelevant today in both papers?
  - relevant
    - box
      - thoughts on scientific method
      - solns should be simple
      - necessity for developing experimental design
      - flaws (cookbookery, mathematistry)
    - tukey
      - separating data analysis and stats
      - all models have flaws
      - no best models
      - lots of good old techniques (e.g. LSR)
  - irrelevant
    - some of the data techniques (I think)
    - tukey multiple-response data has been better attacked (graphical models)
2. how do you think the personal traits of Tukey and Box relate to the scientific opinions expressed in their papers?
  - probably both pretty critical of the science at the time
  - box - great respect for Fisher
  - both very curious in different fields of science
3. what is the most valuable msg that you get from each paper?
  - box - data analysis is a science
  - tukey - models must be useful
  - no best models
  - find data that is useful
box_79 “science and statistics”
- scientific method - iteration between theory and practice
  - learning - driven by discrepancy between theory and practice
  - solns should be simple
- fisher - founder of statistics (early 1900s)
  - couples math with applications
  - data analysis - subiteration between tentative model and tentative analysis
  - develops experimental design
- flaws
  - cookbookery - forcing all problems into 1 or 2 routine techniques
  - mathematistry - development of theory for theory’s sake
tukey_62 “the future of data analysis”
- general considerations
  - data analysis - different from statistics, is a science
  - lots of techniques are very old (LS - Gauss, 1803)
  - all models have flaws
  - no best models
  - must teach multiple data analysis methods
- spotty data - lots of irregularly non-constant variability
  - could just trim highest and lowest values
    - winsorizing - replace suspect values with the closest values that aren’t suspect
  - must decide when to use new techniques, even when not fully understood
  - want some automation
  - FUNOP - full normal plot
    - can be visualized in table
- spotty data in more complex situations
  - FUNOR-FUNOM - full normal rejection, full normal modification
- multiple-response data
  - understudied except for factor analysis
  - multiple-response procedures have been modeled upon how early single-response procedures were supposed to have been used, rather than upon how they were in fact used
  - factor analysis
    1. reduce dimensionality with new coordinates
    2. rotate to find meaningful coordinates
      - can use multiple regression factors as one factor if they are very correlated
  - regression techniques always offer hopes of learning more from less data than do variance-component techniques
- flexibility of attack
  - ex. what unit to measure in
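
A minimal sketch of the two spotty-data fixes noted above, trimming and winsorizing. The helper names (`trim`, `winsorize`), the parameter `k` (how many values to treat as suspect on each end), and the sample data are assumptions for illustration, not Tukey's notation.

```python
def _check_k(values, k):
    """k suspect values per end must leave at least one value in the middle."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must satisfy 0 <= 2*k < len(values)")

def trim(values, k):
    """Trimming: drop the k lowest and k highest values (returned sorted)."""
    _check_k(values, k)
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def winsorize(values, k):
    """Winsorizing: replace the k lowest and k highest values with the
    closest remaining value, keeping the sample size and original order."""
    _check_k(values, k)
    ordered = sorted(values)
    low = ordered[k]                      # smallest value not treated as suspect
    high = ordered[len(ordered) - k - 1]  # largest value not treated as suspect
    return [min(max(v, low), high) for v in values]

# Hypothetical measurements with one wild value on each end.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 42.0, 10.0, -15.0]

trimmed = trim(data, 1)
winsorized = winsorize(data, 1)

print("raw mean:       ", sum(data) / len(data))
print("trimmed mean:   ", sum(trimmed) / len(trimmed))
print("winsorized mean:", sum(winsorized) / len(winsorized))
```

Trimming shrinks the sample, while winsorizing keeps its size by pulling the suspect values in to the nearest non-suspect ones.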