principles
breiman
- conversation
- moved sf -> la -> caltech (physics) -> columbia (math) -> berkeley (math)
- info theory + gambling
- CART, ace, and prob book, bagging
- ucla prof., then consultant, then founded stat computing at berkeley
- lots of cool outside activities
- ex. selling ice in mexico
- 2 cultures paper
- generative - data are generated by a given stochastic model
- statistics does this too much and needs to move toward the second (predictive) culture
- ex. assume y = f(x, noise, parameters)
- validation: goodness-of-fit and residuals
- predictive - use algorithmic model and data mechanism unknown
- assume nothing about x and y
- ex. generate P(x, y) with neural net
- validation: prediction accuracy
- axioms
- Occam
- Rashomon - lots of different good models, which explains best? - ex. rf is not robust at all
- Bellman - curse of dimensionality
- might actually want to increase dimensionality (ex. svms embedded in higher dimension)
- industry was problem-solving, academia had too much culture
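A minimal sketch of the two validation styles from the 2 cultures notes, assuming a straight-line model y = a + b*x purely for illustration. The function names (`fit_line`, `r_squared`, `heldout_mse`) are hypothetical, not from Breiman's paper. The first check is goodness-of-fit on the data used for fitting; the second is prediction accuracy on data held out from the fit.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys, a, b):
    """Observed minus fitted values."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def r_squared(xs, ys, a, b):
    """Data-modeling check: goodness-of-fit on the fitting data."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals(xs, ys, a, b))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def heldout_mse(train_xs, train_ys, test_xs, test_ys):
    """Algorithmic-modeling check: mean squared error on unseen data."""
    a, b = fit_line(train_xs, train_ys)
    errs = residuals(test_xs, test_ys, a, b)
    return sum(e ** 2 for e in errs) / len(errs)
```

The held-out check never looks at whether the assumed form of f is right; it only asks whether predictions are accurate, which is the point of the predictive culture.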
box + tukey
- questions
- what points are relevant and irrelevant today in both papers?
- relevant
- box
- thoughts on scientific method
- solns should be simple
- necessity for developing experimental design
- flaws (cookbookery, mathematistry)
- tukey
- separating data analysis and stats
- all models have flaws
- no best models
- lots of good old techniques (e.g. LSR)
- irrelevant
- some of the data techniques (I think)
- tukey's multiple-response data problem has since been better attacked (graphical models)
- how do you think the personal traits of Tukey and Box relate to the scientific opinions expressed in their papers?
- probably both pretty critical of the science at the time
- box - great respect for Fisher
- both very curious about different fields of science
- what is the most valuable msg that you get from each paper?
- box - data analysis is a science
- tukey - models must be useful
- no best models
- find data that is useful
- box_79 “science and statistics”
- scientific method - iteration between theory and practice
- learning - discrepancy between theory and practice
- solns should be simple
- fisher - founder of statistics (early 1900s)
- couples math with applications
- data analysis - subiteration between tentative model and tentative analysis
- develops experimental design
- flaws
- cookbookery - forcing all problems into 1 or 2 routine techniques
- mathematistry - development of theory for theory’s sake
- tukey_62 “the future of data analysis”
- general considerations
- data analysis - different from statistics, is a science
- lots of techniques are very old (LS - Gauss, 1803)
- all models have flaws
- no best models
- must teach multiple data analysis methods
- spotty data - lots of irregularly non-constant variability
- could just trim highest and lowest values
- winsorizing - replace suspect values with the closest values that aren’t suspect
- must decide when to use new techniques, even when not fully understood
- want some automation
- FUNOP - full normal plot
- can be visualized in table
- spotty data in more complex situations
- FUNOR-FUNOM
- multiple-response data
- understudied except for factor analysis
- multiple-response procedures have been modeled upon how early single-response procedures were supposed to have been used, rather than upon how they were in fact used
- factor analysis
- reduce dimensionality with new coordinates
- rotate to find meaningful coordinates
- can use multiple regression factors as one factor if they are very correlated
- regression techniques always offer hopes of learning more from less data than do variance-component techniques
- flexibility of attack
- ex. what unit to measure in
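A rough sketch of the two simplest fixes for spotty data noted above: trimming and winsorizing. It assumes a fixed count `k` of suspect values at each end, which is a simplification; the function names and the parameter `k` are hypothetical and not Tukey's notation.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k lowest and k highest values."""
    s = sorted(values)
    if 2 * k >= len(s):
        raise ValueError("k is too large for this sample")
    core = s[k:len(s) - k]
    return sum(core) / len(core)


def winsorize(values, k):
    """Replace the k lowest and k highest values with the nearest kept value.

    The original order of the observations is preserved.
    """
    s = sorted(values)
    if 2 * k >= len(s):
        raise ValueError("k is too large for this sample")
    lo, hi = s[k], s[-k - 1]  # nearest values that are not treated as suspect
    return [min(max(v, lo), hi) for v in values]


sample = [1, 2, 3, 4, 100]
print(trimmed_mean(sample, 1))  # 3.0
print(winsorize(sample, 1))     # [2, 2, 3, 4, 4]
```

Trimming discards the extreme observations; winsorizing keeps the sample size and only pulls the extremes inward.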