Table of contents
 Story
 Slides
 Spotfire Dashboard
 Research Notes
 Doing Data Science
 Cover Page
 Inside Cover
 Doing Data Science
 Preface
 Introduction
 Motivation
 Origins of the Class
 Origins of the Book
 What to Expect from This Book
 How This Book Is Organized
 How to Read This Book
 How Code Is Used in This Book
 Who This Book Is For
 Prerequisites
 Supplemental Reading
 About the Contributors
 Conventions Used in This Book
 Using Code Examples
 Safari® Books Online
 How to Contact Us
 Acknowledgments
 1. Introduction: What Is Data Science?
 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
 3. Algorithms
 Machine Learning Algorithms
 Three Basic Algorithms
 Linear Regression
 WTF. So Is It an Algorithm or a Model?
 Figure 3-1. An obvious linear pattern
 Figure 3-2. Looking kind of linear
 Start by writing something down
 Figure 3-3. Which line is the best fit?
 Fitting the model
 Figure 3-4. The line closest to all the points
 Figure 3-5. On the left is the fitted line
 Extending beyond least squares
 Evaluation metrics
 Figure 3-6. Comparing mean squared error in training and test set
 Other models for error terms
 Review
 Exercise
 k-Nearest Neighbors (k-NN)
 Example with credit scores
 Figure 3-7. Credit rating as a function of age and income
 Figure 3-8. What about that guy?
 Caution: Modeling Danger Ahead!
 Similarity or distance metrics
 Training and test sets
 Pick an evaluation metric
 Other Terms for Sensitivity and Specificity
 Putting it all together
 Choosing k
 Binary Classes
 Test Set in k-NN
 What are the modeling assumptions?
 k-means
 Linear Regression
 Exercise: Basic Machine Learning Algorithms
 Summing It All Up
 Thought Experiment: Automated Statistician
 4. Spam Filters, Naive Bayes, and Wrangling
 5. Logistic Regression
 6. Time Stamps and Financial Modeling
 Kyle Teague and GetGlue
 Timestamps
 Cathy O’Neil
 Thought Experiment
 Financial Modeling
 Exercise: GetGlue and Timestamped Event Data
 7. Extracting Meaning from Data
 William Cukierski
 The Kaggle Model
 Thought Experiment: What Are the Ethical Implications of a Robo-Grader?
 Feature Selection
 David Huffaker: Google’s Hybrid Approach to Social Research
 8. Recommendation Engines: Building a User-Facing Data Product at Scale
 A Real-World Recommendation Engine
 Figure 8-1. Bipartite graph with users and items (television shows) as nodes
 Nearest Neighbor Algorithm Review
 Some Problems with Nearest Neighbors
 Beyond Nearest Neighbor: Machine Learning Classification
 The Dimensionality Problem
 Singular Value Decomposition (SVD)
 Important Properties of SVD
 Principal Component Analysis (PCA)
 Alternating Least Squares
 Fix V and Update U
 Last Thoughts on These Algorithms
 Thought Experiment: Filter Bubbles
 Exercise: Build Your Own Recommendation System
 A Real-World Recommendation Engine
 9. Data Visualization and Fraud Detection
 Data Visualization History
 What Is Data Science, Redux?
 A Sample of Data Visualization Projects
 Mark’s Data Visualization Projects
 Data Science and Risk
 Data Visualization at Square
 Ian’s Thought Experiment
 Data Visualization for the Rest of Us
 10. Social Networks and Data Journalism
 11. Causality
 12. Epidemiology
 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
 Claudia’s Data Scientist Profile
 Data Mining Competitions
 How to Be a Good Modeler
 Data Leakage
 How to Avoid Leakage
 Evaluating Models
 Accuracy: Meh
 Probabilities Matter, Not 0s and 1s
 Figure 13-5. An example of how to draw an ROC curve
 Figure 13-6. The so-called lift curve
 Figure 13-7. A way to measure calibration is to bucket predictions and plot predicted probability versus empirical probability for each bucket—here, we do this for an unpruned decision tree
 Figure 13-8. Testing calibration for logistic regression
 Choosing an Algorithm
 A Final Example
 Parting Thoughts
 14. Data Engineering: MapReduce, Pregel, and Hadoop
 15. The Students Speak
 16. Next-Generation Data Scientists, Hubris, and Ethics
 Index
 About the Authors
 Colophon
Story
Doing Data Science Exercises Without Data Cleaning and Coding
So, as a data scientist/data journalist/information designer who is about to teach university courses, I asked: is it possible to teach an introductory-level class that does not require first learning a lot about data cleaning and coding?
I hope to answer that question with the excellent new book, Doing Data Science, by repurposing it from PDF to MindTouch with information design skills and working the exercises in Spotfire as follows:
Process:
 Used the Table of Contents as the basic hierarchical structure (taxonomy) by assigning Headings (1-5) in MindTouch
 Copied and chunked the content into paragraphs
 Edited the line lengths so sentences run the full page width
 Added the figures and equations
 Added the italics and other special characters (I probably did not get all of these)
 Added the Web links
 Checked the work
Table of Contents:
 Preface
 Using Code Examples: Select the Download Entire Repository Button to the Right and Then Unzip the File
 1. Introduction: What Is Data Science?
 A Data Science Profile: Create Your Profile
 Learning Exercise: My Profile: Slides.
 Hands-on Class Exercise: NYT Data Set (31 CSV files, 151 MB) My Note: Unzip the ZIP file and start with NYT1 (CSV).
 Result: See Spotfire Web Player Chapter 2 EDA NYT Clickstream and Spotfire File
 2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
 Case Study: RealDirect
 Team Homework Exercise: Go to the NYC (Manhattan) Housing dataset: How to Find Recent Sales Data for New York City Real Estate and click on Rolling Sales Update (after the fifth paragraph). You can use any or all of the datasets here—start with Manhattan, August 2012–August 2013. My Note: Now November 2012–November 2013: rollingsales_bronx, rollingsales_brooklyn, rollingsales_manhattan, rollingsales_queens, and rollingsales_statenisland
 Result: See Spotfire Web Player Chapter 2 NYC (Manhattan) Housing and Spotfire File
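Before any EDA, the rolling-sales spreadsheets need cleaning: sale prices arrive as formatted strings, and $0 "sales" (typically deed transfers between parties, not market transactions) are usually dropped. This is a hedged Python sketch; the column names are assumptions based on the spreadsheet headers, and the tiny inline frame stands in for reading the real file:

```python
import pandas as pd

# Stand-in for reading rollingsales_manhattan; real headers look like
# "SALE PRICE" and "GROSS SQUARE FEET" after a few banner rows.
sales = pd.DataFrame({
    "SALE PRICE": ["$1,200,000", "$0", "$850,000"],
    "GROSS SQUARE FEET": ["1,500", "0", "1,100"],
})

# Strip currency formatting and coerce to numbers.
for col in ["SALE PRICE", "GROSS SQUARE FEET"]:
    sales[col] = sales[col].str.replace(r"[$,]", "", regex=True).astype(float)

# Drop zero-dollar transfers before plotting price distributions.
market = sales[sales["SALE PRICE"] > 0]
print(market)
```

From here the exercise's plots (price versus square footage, by borough) follow directly.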
 3. Algorithms
 Exercise: Basic Machine Learning Algorithms
 4. Spam Filters, Naive Bayes, and Wrangling
 Jake’s Exercise: Naive Bayes for Article Classification: NYT Data Set (31 CSV files, 151 MB)
 Enron Emails (Big TAR ZIP files)
 5. Logistic Regression
 Media 6 Degrees Exercise
 Hands-on Class Exercise: Media 6 Degrees kindly provided a dataset that is perfect for exploring logistic regression models and evaluating how good the models are: dds_ch5_binaryclassdataset
 Result: See Spotfire Web Player Chapter 5 Logistic Regression Media 6 Degrees and Spotfire File
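Since the Media 6 Degrees column layout isn't reproduced here, this sketch fits a logistic regression by gradient descent on synthetic stand-in data; it shows the mechanics the exercise asks for, not the real dataset:

```python
import numpy as np

# Synthetic binary-class data: two features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([1.5, -2.0])
y = (1 / (1 + np.exp(-(X @ true_w))) > rng.uniform(size=500)).astype(float)

# Gradient descent on the log loss.
w = np.zeros(2)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ w)))      # predicted probabilities
    w -= 0.1 * X.T @ (p - y) / len(y)   # average gradient step
print(w)  # roughly recovers true_w
```

With the real dataset you would hold out a test set and evaluate with probabilities (lift, calibration), as Chapter 13 stresses, rather than 0/1 accuracy.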
 6. Time Stamps and Financial Modeling
 Exercise: GetGlue and Timestamped Event Data
 Team Homework Exercise: Get the Data: Go to Yahoo! Finance and download daily data from a stock that has at least eight years of data, making sure it goes from earlier to later. If you don’t know how to do it, Google it. Yahoo: http://finance.yahoo.com/q/hp?s=%5EO...torical+Prices (CSV)
 Result: See Spotfire Web Player and Spotfire File
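Once the CSV is downloaded, a common first step is computing daily returns. The columns below (Date, Adj Close) match the 2013-era Yahoo! Finance export, but the frame is a tiny stand-in rather than real quotes:

```python
import pandas as pd

# Stand-in for: prices = pd.read_csv("table.csv") from Yahoo! Finance.
prices = pd.DataFrame({
    "Date": ["2013-01-02", "2013-01-03", "2013-01-04"],
    "Adj Close": [100.0, 102.0, 99.96],
})
prices["Date"] = pd.to_datetime(prices["Date"])
prices = prices.sort_values("Date")  # "making sure it goes from earlier to later"

# Daily simple returns from the adjusted close.
prices["return"] = prices["Adj Close"].pct_change()
print(prices)
```

The adjusted close is used so that splits and dividends don't show up as spurious returns.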
 7. Extracting Meaning from Data
 Hands-on Class Exercise: SAS and SAS Public Data Sets
 Result: See Spotfire Web Player and Spotfire File, Spotfire Web Player and Spotfire File, and Spotfire Web Player and Spotfire File
 8. Recommendation Engines: Building a UserFacing Data Product at Scale
 Exercise: Build Your Own Recommendation System
 Team Homework Exercise: Read next week's reading, Data Visualization for the Rest of Us: See my Slides and Spotfire Web Player and Spotfire File. Start to create your own Hubway Data Visualization Challenge entry and eventually submit it for your class project and the challenge (now closed but still accepting submissions) if you want.
 9. Data Visualization and Fraud Detection
 Hands-on Class Exercise: Data Visualization for the Rest of Us: See my Slides and Spotfire Web Player and Spotfire File. Start to create your own Hubway Data Visualization Challenge entry and eventually submit it for your class project and the challenge (now closed but still accepting submissions) if you want.
 Result: See Spotfire Web Player and Spotfire File
 10. Social Networks and Data Journalism
 Hands-on Class Exercise: Social Networks: NodeXL CodePlex and Microsoft NodeXL
 Team Homework Exercise: Data Journalism: Critique One of My Health Data Stories/Products (How could it be better? Is there other data I should have included? How would you present it?)
 11. Causality
 Hands-on Class Exercise: Project TYCO Data for Health. See Research Notes.
 Result: See Spotfire Web Player and Spotfire File
 12. Epidemiology
 Research Experiment (Observational Medical Outcomes Partnership): Data Available??
 Team Homework Exercise: KDD Cup Datasets
 13. Lessons Learned from Data Competitions: Data Leakage and Model Evaluation
 Hands-on Class Exercise: More KDD Cup datasets
 14. Data Engineering: MapReduce, Pregel, and Hadoop
 Team Homework Assignment: Write a brief chapter on what you have learned so far and hope yet to learn in this class.
 15. The Students Speak
 Hands-on Class Exercise: Start AmericasDataFest Competition
 16. NextGeneration Data Scientists, Hubris, and Ethics
 Team Homework Assignment: Study Graph Databases, Graph Computing, and Semantic Medline: Review the Wiki and view the videos: YarcData Videos (Schizo, 7 minutes; Cancer, 21 minutes).
MORE TO FOLLOW
Mine Other Courses:
http://www.kdnuggets.com/2013/11/harvardcs109datasciencecourseresourcesfreeonline.html
http://cs.fit.edu/~pkc/classes/mlinternet/
Free Cloudera Training: Find Email
Slides
Slide 1 Hubway Data Visualization Challenge: Spotfire
http://semanticommunity.info
http://datacommunitydc.org/blog/2013/11/fromcomputeraidedjournalismtodatajournalism/
http://datacommunitydc.org/blog/2013/08/cloudsoasemanticsanddatascienceconference/
Slide 4 Hubway Seeking Metro Boston
Figure 9-17. This is a visualization by Eurry Kim and Kaz Sakamoto of the Hubway shared-bike program
Spotfire Dashboard
For Internet Explorer users and those wanting full-screen display, use the Web Player. Get the Spotfire for iPad App.
Research Notes
Errata for Doing Data Science
Source: http://www.oreilly.com/catalog/errata.csp?isbn=9781449358655
The errata list is a list of errors and their corrections that were found after the product was released. If the error was corrected in a later version or reprint, the date of the correction will be displayed in the column titled "Date Corrected".
The following errata were submitted by our customers and approved as valid errors by the author or editor.
Color Key: Serious Technical Mistake | Minor Technical Mistake | Language or formatting error | Typo | Question | Note | Update
Version | Location | Description | Submitted By | Date Submitted | Date Corrected
Printed | Pages: multiple | Submitted by Rachel Schutt | Nov 20, 2013
This errata was submitted by Philipp Marek via email: "Errata for Doing Data Science. I mark /deletions/ and *changes*. This is in UTF-8, so e.g. a CRLF is shown as a down-left pointing arrow: ↵"
 xvii: move 3 words: there is more breath // than depth *in some cases*
 xxi: forgot to mention "Visual Display of Quantitative Information" ... although listed on p. 37
 2: statis/i/tican
 14: Figure 1-4: use different shades of gray, or dashes, or something like that
 30: observed real-world phenomen*a* (or *a* phenomenon)
 32: x in seconds? Don't integrate over minutes
 38: http://stat.columbia.edu, but everything else on github
 43: hypothesis, not thesis (?); Figure 2-3: Huma*n* behavi*or* (nouns); trying to read associations fails; put Olympics beneath Olympic records?
 44: an extension /of/ or variation of
 48: answered?
 49: Did Doug use ... (... "CPC") aren't used in text, no need to explain
 50: plot(log(), log()): see http://spacecraft.ssl.umd.edu/akins_laws.html, twice: "6. (Mar's Law) Everything is linear if plotted log-log with a fat magic marker."; bk.homes[which ...]: indentation of 3rd line wrong; log() <= 5 ... better use <= 1e5 or 100e3 and remove log()
 68: Figure 3-6: truth = d*e*gree 2 (top right)
 69: x*_2* * x_3
 71: x₁�, not x�₁
 72: you'd have establish*ed* the bins (or have *to* establish)
 73: Figure 3-7 doesn't include the points listed above
 74: Figure 3-8: use "x" for new guy; this point is already in 3-7
 76: Hamming: shoe +ss => hose, distance is 2; we start with a Google search ... *which to use*.
 77: n.points = length(data): why not simply use a boolean vector of some length on data?; swap lines: train <- and #define
 78: swap cl <- and #; swap true.labels and #
 79: "# We're using ..." comment not helpful
 85: http://abt.cm: why a different link shortener?; we showed how *to* explore and clean
 87: remove line setwd()
 90: U of Edinb*o*rough?
 101: parallel//ly
 108: WWW::Mechanize, and generally Perl for text extraction
 111, 112: script could use a few functions
 117: *An Empirical...* format different from other book references or titles
 129: "non discrete)" is still a comment, wrong format used; c[, 2]: space before "," missing
 131: vlist <-: use less space to avoid line break, twice
 132: "use holdout group" join to previous line; "vars" within for loop?
 137: prop/o/agates
 140: Figure 6-3: no colors visible; use distinguishable grays?
 141: Figure 6-4: no counts visible
 147: what does Figure 6-7 show?
 151: Figure 6-8: label both axes with text
 155: Figure 6-12: factors not distinguishable
 156: this_E is unused
 176: "Director of Research..." in one line
 177: the modeling part isn't *what* we want
 183: AIC Info*r*mation
 184: a college studen ... spend *her* time
 191: "column which is our response" is still a comment, has wrong format
 194: "Google's Hybrid Approach" title => italic
 201: simple but comp*l*ete
 215: vr =: indentation wrong
 236: to a*c*cept
 241: the Predicted=False row should have FN, TN
 246: 2nd mouse/keyboard is not needed; other person should read and think, not type simultaneously
 247: discrep*a*ncy
 251: participate ?
 254: digital media at↵Columbia (space missing)
 281: "Overlapping..." title => italic
 287: people that take/s/ some drug: people take, not the population
 293: "Oral..." title => italic
 304: (hers is shown ... *)*
 341: line 44 is hard to read; code doesn't match other formatting
 349: Map*R*educe
 351ff: Index: bunch "Amazon Mechanical Turk" together in Amazon; bunch together "causal ..."; bunch together "chaos ..."; "Protocol buffers" instead of "prtobuf"; and probably some more.
Printed | Pages: multiple | Submitted by Rachel Schutt | Nov 20, 2013
 p. 207: "starup" should be "startup"
 p. 359: "want achieve" should be "want to achieve"
 p. 162-163: section headers are different sizes; "Exercise: GetGlue and Timestamped Event Data" and "Exercise: Financial Data" should be the same size font
 p. 68: "dgree" in Figure 3-6 should be "degree"
 p. 32-33: inconsistent capitalization of random variables: x vs. X
 p. 21-22: indentation is odd and seems arbitrary
 Index: "curse of dimensionality" missing
 p. 282: "That experimental infrastructure" is strange phrasing
Printed | Page 95, 2nd paragraph, 1st sentence | Submitted by donald f caldwell | Dec 01, 2013
 "Thinking back to the previous chapter, in order to use liner regression, ..." should be "linear"
ePub | Page 119 | Submitted by Keith Bierman | Oct 31, 2013 | Corrected Dec 03, 2013
 With respect to my just-submitted errata, it appears that it's my GitHub ignorance. Shift-clicking on the file doesn't have the obvious semantics, but the button on the right side of the pane, "download zipfile", does. So my request would be for a slight change to the text to make this clear for us cvs, sccs, svn, bitkeeper folks who didn't get with Git.
http://columbiadatascience.com/blog/
https://github.com/oreillymedia/doing_data_science
This is the example code repository for Doing Data Science by Cathy O'Neil and Rachel Schutt (O'Reilly Media)
README.md (updated 2013-10-04)
dds_datasets.zip (updated 2013-10-04)
My Note: I cannot open these two files. I sent an email to Cathy O'Neil asking for help, but did not hear back. I went back and clicked on the Download Entire Repository button and got a 35 MB ZIP file that seemed complete, with one text file (3 MB), 31 CSV files (151 MB), and 5 XLS files (23 MB) that I imported into Spotfire. See Errata above.
Exercise: EDA
Exercise: RealDirect Data Strategy
Doing Data Science
Source: http://cdn.oreillystatic.com/oreilly...55_sampler.pdf (PDF)
Doing Data Science
Preface
Introduction
Motivation
Origins of the Class
Origins of the Book
What to Expect from This Book
How This Book Is Organized
How to Read This Book
How Code Is Used in This Book
Who This Book Is For
 Experienced data scientists will perhaps come to see and understand themselves and what they do in a new light.
 Statisticians may gain an appreciation of the relationship between data science and statistics. Or they may continue to maintain the attitude, “that’s just statistics,” in which case we’d like to see that argument clearly articulated.
 Quants, math, physics, or other science PhDs who are thinking about transitioning to data science or building up their data science skills will gain perspective on what that would require or mean.
 Students and those new to data science will be getting thrown into the deep end, so if you don’t understand everything all the time, don’t worry; that’s part of the process.
 Those who have never coded in R or Python before will want to have a manual for learning R or Python. We recommend The Art of R Programming by Norman Matloff (No Starch Press). Students who took the course also benefitted from the expert instruction of lab instructor, Jared Lander, whose book R for Everyone: Advanced Analytics and Graphics (AddisonWesley) is scheduled to come out in November 2013. It’s also possible to do all the exercises using packages in Python.
 For those who have never coded at all before, the same advice holds. You might also want to consider picking up Learning Python by Mark Lutz and David Ascher (O’Reilly) or Wes McKinney’s Python for Data Analysis (also O’Reilly) as well.
Prerequisites
Supplemental Reading
Math
Coding
Data Analysis and Statistical Inference
Artificial Intelligence and Machine Learning
Experimental Design
Visualization
About the Contributors
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
 The brain trust that convened in Cathy’s apartment: Chris Wiggins, David Madigan, Mark Hansen, Jake Hofman, Ori Stitelman, and Brian Dalessandro.
 Our editors, Courtney Nash and Mike Loukides.
 The participants and organizers of the IMA User-level modeling conference where some preliminary conversations took place.
 The students!
 Coppelia, where Cathy and Rachel met for breakfast a lot.
1. Introduction: What Is Data Science?
Big Data and Data Science Hype
Getting Past the Hype
 Sure, there’s is a difference between industry and academia. But does it really have to be that way? Why do many courses in school have to be so intrinsically out of touch with reality?
 Even so, the gap doesn’t represent simply a difference between industry statistics and academic statistics. The general experience of data scientists is that, at their job, they have access to a larger body of knowledge and methodology, as well as a process, which we now define as the data science process (details in Chapter 2), that has foundations in both statistics and computer science.
Why Now?
Datafication
The Current Landscape (with a Little History)
Figure 1-1. Drew Conway's Venn diagram of data science
 Statistics (traditional analysis you’re used to thinking about)
 Data munging (parsing, scraping, and formatting data)
 Visualization (graphs, tools, etc.)
The Role of the Social Scientist in Data Science
Both LinkedIn and Facebook are social network companies. Oftentimes a description or definition of data scientist includes hybrid statistician, software engineer, and social scientist. This made sense in the context of companies where the product was a social product and still makes sense when we’re dealing with human or user behavior. But if you think about Drew Conway’s Venn diagram, data science problems cross disciplines—that’s what the substantive expertise is referring to. In other words, it depends on the context of the problems you’re trying to solve. If they’re social sciencey problems like friend recommendations or people you know or user segmentation, then by all means, bring on the social scientist! Social scientists also do tend to be good question askers and have other good investigative qualities, so a social scientist who also has the quantitative and programming chops makes a great data scientist. But it’s almost a “historical” (historical is in quotes because 2008 isn’t that long ago) artifact to limit your conception of a data scientist to someone who works only with online user behavior data. There’s another emerging field out there called computational social sciences, which could be thought of as a subset of data science. 
Data Science Jobs
A Data Science Profile
 Computer science
 Math
 Statistics
 Machine learning
 Domain expertise
 Communication and presentation skills
 Data visualization
Figure 1-2. Rachel's data science profile
Thought Experiment: Meta-Definition
Start with a textmining model
So what about a clustering algorithm?
Figure 1-4. Harlan Harris's clustering and visualization of subfields of data science
OK, So What Is a Data Scientist, Really?
In Academia
In Industry
2. Statistical Inference, Exploratory Data Analysis, and the Data Science Process
Statistical Thinking in the Age of Big Data
Statistical Inference
Populations and Samples
Populations and Samples of Big Data
Sampling solves some engineering challenges
Bias
Sampling
New kinds of data
 Traditional: numerical, categorical, or binary
 Text: emails, tweets, New York Times articles (see Chapter 4 or Chapter 7)
 Records: user-level data, timestamped event data, JSON-formatted log files (see Chapter 6 or Chapter 8)
 Geobased location data: briefly touched on in this chapter with NYC housing data
 Network (see Chapter 10)
 Sensor data (not covered in this book)
 Images (not covered in this book)
Terminology: Big Data
We’ve been throwing around “Big Data” quite a lot already and are guilty of barely defining it beyond raising some big questions in the previous chapter. A few ways to think about Big Data: “Big” is a moving target. Constructing a threshold for Big Data such as 1 petabyte is meaningless because it makes it sound absolute. Only when the size becomes a challenge is it worth referring to it as “Big.” So it’s a relative term referring to when the size of the data outstrips the stateoftheart current computational solutions (in terms of memory, storage, complexity, and processing speed) available to handle it. So in the 1970s this meant something different than it does today. “Big” is when you can’t fit it on one machine. Different individuals and companies have different computational resources available to them, so for a single scientist data is big if she can’t fit it on one machine because she has to learn a whole new host of tools and methods once that happens. Big Data is a cultural phenomenon. It describes how much data is part of our lives, precipitated by accelerated advances in technology. The 4 Vs: Volume, variety, velocity, and value. Many people are circulating this as a way to characterize Big Data. Take from it what you will. 
Big Data Can Mean Big Assumptions
 Collecting and using a lot of data rather than small samples
 Accepting messiness in your data
 Giving up on knowing the causes
Can N=ALL?
Data is not objective
n = 1
At the other end of the spectrum from N=ALL, we have n=1, by which we mean a sample size of 1. In the old days a sample size of 1 would be ridiculous; you would never want to draw inferences about an entire population by looking at a single individual. And don't worry, that's still ridiculous. But the concept of n=1 takes on new meaning in the age of Big Data, where for a single person we actually can record tons of information about them, and in fact we might even sample from all the events or actions they took (for example, phone calls or keystrokes) in order to make inferences about them. This is what user-level modeling is about.
Modeling
What is a model?
Statistical modeling
But how do you build a model?
Probability distributions
Figure 2-1. A bunch of continuous density functions (aka probability distributions)
Fitting a model
Overfitting
Exploratory Data Analysis
Historical Perspective: Bell Labs
Bell Labs is a research lab going back to the 1920s that has made innovations in physics, computer science, statistics, and math, producing languages like C++, and many Nobel Prize winners as well. There was a very successful and productive statistics group there, and among its many notable members was John Tukey, a mathematician who worked on a lot of statistical problems. He is considered the father of EDA and R (which started as the S language at Bell Labs; R is the open source version), and he was interested in trying to visualize high-dimensional data.

We think of Bell Labs as one of the places where data science was "born" because of the collaboration between disciplines, and the massive amounts of complex data available to people working there. It was a virtual playground for statisticians and computer scientists, much like Google is today. In fact, in 2001, Bill Cleveland wrote "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," which described multidisciplinary investigation, models and methods for data (traditional applied stats), computing with data (hardware, software, algorithms, coding), pedagogy, tool evaluation (staying on top of current trends in technology), and theory (the math behind the data).

You can read more about Bell Labs in the book The Idea Factory by Jon Gertner (Penguin Books).
Philosophy of Exploratory Data Analysis
Exercise: EDA
 Plot the distributions of number of impressions and click-through rate (CTR = # clicks / # impressions) for these six age categories.
 Define a new variable to segment or categorize users based on their click behavior.
 Explore the data and make visual and quantitative comparisons across user segments/demographics (<18-year-old males versus <18-year-old females, or logged-in versus not, for example).
 Create metrics/measurements/statistics that summarize the data. Examples of potential metrics include CTR, quantiles, mean, median, variance, and max, and these can be calculated across the various user segments. Be selective. Think about what will be important to track over time—what will compress the data, but still capture user behavior.
Sample code
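A minimal EDA sketch in this spirit (the book's own sample code is in R, and it notes the exercises can also be done in Python). The column names Age, Impressions, and Clicks are assumed to match the nyt1.csv clickstream file, and the tiny inline frame stands in for reading the real CSV:

```python
import pandas as pd

# Stand-in for: data = pd.read_csv("nyt1.csv")
data = pd.DataFrame({
    "Age":         [10, 20, 30, 40, 50, 70, 0],
    "Impressions": [3, 5, 4, 6, 2, 8, 4],
    "Clicks":      [0, 1, 0, 2, 0, 1, 0],
})

# Bucket ages into categories (Age == 0 usually means not signed in).
bins = [-1, 17, 24, 34, 44, 54, 64, 120]
labels = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]
data["age_group"] = pd.cut(data["Age"], bins=bins, labels=labels)

# Aggregate impressions and clicks per segment, then compute CTR.
summary = data.groupby("age_group", observed=True).agg(
    impressions=("Impressions", "sum"),
    clicks=("Clicks", "sum"),
)
summary["ctr"] = summary["clicks"] / summary["impressions"]
print(summary)
```

Computing CTR from summed clicks and impressions (rather than averaging per-user CTRs) is deliberate: it weights heavy users appropriately and avoids divide-by-zero for users with no impressions.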
On Coding
In a May 2013 op-ed piece, "How to Be a Woman Programmer," Ellen Ullman describes quite well what it takes to be a programmer (setting aside for now the woman part): "The first requirement for programming is a passion for the work, a deep need to probe the mysterious space between human thoughts and what a machine can understand; between human desires and how machines might satisfy them. The second requirement is a high tolerance for failure. Programming is the art of algorithm design and the craft of debugging errant code. In the words of the great John Backus, inventor of the Fortran programming language: You need the willingness to fail all the time. You have to generate many ideas and then you have to work very hard only to discover that they don't work. And you keep doing that over and over until you find one that does work."
The Data Science Process
Figure 2-2. The data science process
This is where you typically start in a standard statistics class, with a clean, orderly dataset. But it’s not where you typically start in the real world. 
A Data Scientist’s Role in This Process
Connection to the Scientific Method
We can think of the data science process as an extension of or variation of the scientific method:
In both the data science process and the scientific method, not every problem requires one to go through all the steps, but almost all problems can be solved with some combination of the stages. For example, if your end goal is a data visualization (which itself could be thought of as a data product), it’s possible you might not do any machine learning or statistical modeling, but you’d want to get all the way to a clean dataset, do some exploratory analysis, and then create the visualization. 
Thought Experiment: How Would You Simulate Chaos?
 A Lorenzian water wheel, which is a Ferris wheel-type contraption with equally spaced buckets of water that rotate around in a circle. Now imagine water being dripped into the system at the very top. Each bucket has a leak, so some water escapes into whatever bucket is directly below the drip. Depending on the rate of the water coming in, this system exhibits a chaotic process that depends on molecular-level interactions of water molecules on the sides of the buckets. Read more about it in the associated Wikipedia article.
 Many systems can exhibit inherent chaos. Philippe M. Binder and Roderick V. Jensen have written a paper entitled “Simulating chaotic behavior with finite-state machines”, which is about digital computer simulations of chaos.
 An interdisciplinary program involving M.I.T., Harvard, and Tufts involved teaching a technique that was entitled “Simulating chaos to teach order”. They simulated an emergency on the border between Chad and Sudan’s troubled Darfur region, with students acting as members of Doctors Without Borders, International Medical Corps, and other humanitarian agencies.
 See also Joel Gascoigne’s related essay, “Creating order from chaos in a startup”.
Instructor Notes
1. Being a data scientist in an organization is often a chaotic experience, and it’s the data scientist’s job to try to create order from that chaos. So I wanted to simulate that chaotic experience for my students throughout the semester. But I also wanted them to know that things were going to be slightly chaotic for a pedagogical reason, and not due to my ineptitude!
2. I wanted to draw out different interpretations of the word “chaos” as a means to think about the importance of vocabulary, and the difficulties caused in communication when people either don’t know what a word means, or have different ideas of what the word means. Data scientists might be communicating with domain experts who don’t really understand what “logistic regression” means, say, but will pretend to know because they don’t want to appear stupid, or because they think they ought to know, and therefore don’t ask. But then the whole conversation is not really successful communication if the two people talking don’t really understand what they’re talking about. Similarly, data scientists ought to be asking questions to make sure they understand the terminology the domain expert is using (be it an astrophysicist, a social networking expert, or a climatologist). There’s nothing wrong with not knowing what a word means, but there is something wrong with not asking! You will likely find that asking clarifying questions about vocabulary gets you even more insight into the underlying data problem.
3. Simulation is a useful technique in data science. It can be useful practice to simulate fake datasets from a model to understand the generative process better, for example, and also to debug code. 
Case Study: RealDirect
How Does RealDirect Make Money?
Exercise: RealDirect Data Strategy
 What data would you advise the engineers to log, and what would your ideal datasets look like?
 How would data be used for reporting and monitoring product usage?
 How would data be built back into the product/website?
 First challenge: load in and clean up the data. Next, conduct exploratory data analysis in order to find out where there are outliers or missing values, decide how you will treat them, make sure the dates are formatted correctly, make sure values you think are numerical are being treated as such, etc.
 Once the data is in good shape, conduct exploratory data analysis to visualize and make comparisons (i) across neighborhoods, and (ii) across time. If you have time, start looking for meaningful patterns in this dataset.
 Does stepping out of your comfort zone and figuring out how you would go about “collecting data” in a different setting give you insight into how you do it in your own field?
 Sometimes “domain experts” have their own set of vocabulary. Did Doug use vocabulary specific to his domain that you didn’t understand (“comps,” “open houses,” “CPC”)? Sometimes if you don’t understand vocabulary that an expert is using, it can prevent you from understanding the problem. It’s good to get in the habit of asking questions because eventually you will get to something you do understand. This involves persistence and is a habit to cultivate.
Sample R code
3. Algorithms
Machine Learning Algorithms
Three Basic Algorithms
Linear Regression
WTF. So Is It an Algorithm or a Model?
While we tried to make a distinction between the two earlier, we admit the colloquial use of the words “model” and “algorithm” gets confusing, because the two words are often used interchangeably when their actual definitions are not the same thing at all. In the purest sense, an algorithm is a set of rules or steps to follow to accomplish some task, and a model is an attempt to describe or capture the world. These two seem obviously different, so it seems the distinction should be obvious. Unfortunately, it isn’t. For example, regression can be described as a statistical model as well as a machine learning algorithm. You’ll waste your time trying to get people to discuss this with any precision.

In some ways this is a historical artifact of the statistics and computer science communities developing methods and techniques in parallel and using different words for the same methods. The consequence of this is that the distinction between machine learning and statistical modeling is muddy. Some methods (for example, k-means, discussed in the next section) we might call an algorithm because it’s a series of computational steps used to cluster or classify objects—on the other hand, k-means can be reinterpreted as a special case of a Gaussian mixture model. The net result is that colloquially, people use the terms algorithm and model interchangeably when it comes to a lot of these methods, so try not to let it worry you. (Though it bothers us, too.) 
 There’s a linear pattern.
 The coefficient relating x and y is 25.
 It seems deterministic.
Figure 3-1. An obvious linear pattern
x  y 
7  276 
3  43 
4  82 
6  136 
10  417 
9  269 
Now, your brain can’t figure out what’s going on by just looking at them (and your friend’s brain probably can’t, either). They’re in no obvious particular order, and there are a lot of them. So you try to plot it as in Figure 3-2.
Figure 3-2. Looking kind of linear
Start by writing something down
Figure 3-3. Which line is the best fit?
Fitting the model
Figure 3-4. The line closest to all the points
x  y 
7  276 
3  43 
4  82 
6  136 
10  417 
9  269 
Figure 3-5. On the left is the fitted line
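As a sketch of the fit itself, the least squares slope and intercept for the six points above can be computed in closed form (shown here in Python, though the book’s own listings use R):

```python
# Least squares fit of y = beta0 + beta1 * x to the six points above.
xs = [7, 3, 4, 6, 10, 9]
ys = [276, 43, 82, 136, 417, 269]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares estimates for simple linear regression.
beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
beta0 = mean_y - beta1 * mean_x
```

The fitted line minimizes the sum of squared vertical distances to the points, which is exactly the “closest to all the points” criterion pictured above.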
Extending beyond least squares
Note this is sometimes not a reasonable assumption. If you are dealing with a known fat-tailed distribution, and if your linear model is picking up only a small part of the value of the variable y, then the error terms are likely also fat-tailed. This is the most common situation in financial modeling. That’s not to say we don’t use linear regression in finance, though. We just don’t attach the “noise is normal” assumption to it. 
Turns out that no matter how the ϵs are distributed, the least squares estimates that you already derived are the optimal estimators for βs because they have the property of being unbiased and of being the minimum variance estimators. If you want to know more about these properties and see a proof for this, we refer you to any good book on statistical inference (for example, Statistical Inference by Casella and Berger). 
Then you estimate the variance (σ²) of ϵ as:
Why are we dividing by n–2? A natural question. Dividing by n–2, rather than just n, produces an unbiased estimator. The 2 corresponds to the number of model parameters. Here again, Casella and Berger’s book is an excellent resource for more background information. 
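As a small numeric sketch (the residuals here are hypothetical), the estimate divides the residual sum of squares by n − 2, the sample size minus the two fitted parameters:

```python
# Unbiased estimate of the noise variance in simple linear regression:
# divide the residual sum of squares by n - 2, not n.
residuals = [1.2, -0.7, 0.3, -1.1, 0.9, -0.6]  # hypothetical residuals
n = len(residuals)
rss = sum(e ** 2 for e in residuals)
sigma2_hat = rss / (n - 2)  # divides by 4 here, not 6
```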
Evaluation metrics
Figure 3-6. Comparing mean squared error in training and test set
Other models for error terms
Review
 Linearity
 Error terms normally distributed with mean 0
 Error terms independent of each other
 Error terms have constant variance across values of x
 The predictors we’re using are the right predictors
 If we want to predict one variable knowing others
 If we want to explain or understand the relationship between two or more things
Exercise
k-Nearest Neighbors (k-NN)
Example with credit scores
age  income  credit 
69  3  low 
66  57  low 
49  79  low 
49  17  low 
58  26  high 
44  71  high 
You can plot people as points on the plane and label people with an empty circle if they have low credit ratings, as shown in Figure 3-7.
Figure 3-7. Credit rating as a function of age and income
Figure 3-8. What about that guy?
Caution: Modeling Danger Ahead!
The scaling question is a really big deal, and if you do it wrong, your model could just suck. Let’s consider an example: say you measure age in years, income in dollars, and credit rating as credit scores normally are given—something like SAT scores. Then two people would be represented by triplets such as (25, 54000, 700) and (35, 76000, 730). In particular, their “distance” would be completely dominated by the difference in their salaries. On the other hand, if you instead measured salary in thousands of dollars, they’d be represented by the triplets (25, 54, 700) and (35, 76, 730), which would give all three variables similar kinds of influence. Ultimately the way you scale your variables, or equivalently in this situation the way you define your concept of distance, has a potentially enormous effect on the output. In statistics it is called your “prior.” 
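The point can be sketched numerically with the two hypothetical people above:

```python
import math

def euclid(a, b):
    # Plain Euclidean distance between two tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Salary in raw dollars: the salary gap swamps age and credit score.
d_dollars = euclid((25, 54000, 700), (35, 76000, 730))

# Salary in thousands: all three variables have comparable influence.
d_thousands = euclid((25, 54, 700), (35, 76, 730))
```

The first distance is essentially just the $22,000 salary difference; the second mixes all three coordinates.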
Similarity or distance metrics
where S is the covariance matrix.
where i is the ith element of each of the vectors.
Training and test sets
Pick an evaluation metric
Other Terms for Sensitivity and Specificity
Sensitivity is also called the true positive rate or recall; which term gets used varies by academic field, but they all mean the same thing. And specificity is also called the true negative rate. There are also the false positive rate and the false negative rate, which don’t get other special names. 
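With hypothetical counts from a labeled test set, the four quantities are:

```python
# Cells of a binary confusion matrix (hypothetical counts).
tp, fn = 40, 10   # actual positives, predicted right and wrong
tn, fp = 45, 5    # actual negatives, predicted right and wrong

sensitivity = tp / (tp + fn)   # true positive rate, a.k.a. recall
specificity = tn / (tn + fp)   # true negative rate
false_positive_rate = 1 - specificity
false_negative_rate = 1 - sensitivity
```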
Putting it all together
Choosing k
Binary Classes
When you have binary classes like “high credit” or “low credit,” picking k to be an odd number can be a good idea because there will always be a majority vote, no ties. If there is a tie, the algorithm just randomly picks. 
Test Set in k-NN
Notice we used the function knn() twice, but in different ways. In the first, the test set was data we used to evaluate how good the model was. In the second, the “test” set was actually a new data point that we wanted a prediction for. We could also have given it many rows of people we wanted predictions for. But notice that R doesn’t know whether what you’re putting in for the test set is truly a “test” set where you know the real labels, or new data where you don’t know the labels and want predictions. 
What are the modeling assumptions?
 Training data has been labeled or classified into two or more classes.
 You pick the number of neighbors to use, k.
 You’re assuming that the observed features and the labels are somehow associated. They may not be, but ultimately your evaluation metric will help you determine how good the algorithm is at labeling. You might want to add more features and check how that alters the evaluation metric. You’d then be tuning both which features you were using and k. But as always, you’re in danger here of overfitting.
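Putting the pieces together, here is a bare-bones k-NN sketch in plain Python, using the small age/income table from earlier (income in thousands, Euclidean distance, majority vote):

```python
from collections import Counter
import math

# (age, income in $1000s) -> credit label, from the table above.
train = [((69, 3), "low"), ((66, 57), "low"), ((49, 79), "low"),
         ((49, 17), "low"), ((58, 26), "high"), ((44, 71), "high")]

def knn_predict(point, train, k):
    # Sort training examples by distance to the query point and take a
    # majority vote among the k nearest labels.
    nearest = sorted(train, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

label = knn_predict((48, 80), train, k=3)
```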
k-means
 You might want to give different users different experiences. Marketing often does this; for example, to offer toner to people who are known to own printers.
 You might have a model that works better for specific groups. Or you might have different models for different groups.
 Hierarchical modeling in statistics does something like this; for example, to separately model geographical effects from household effects in survey results.
2D version
Figure 3-9. Clustering in two dimensions; look at the panels in the left column from top to bottom, and then the right column from top to bottom
 Choosing k is more an art than a science, although there are bounds: 1 ≤ k ≤ n, where n is the number of data points.
 There are convergence issues—the algorithm can fail to converge if, for example, it falls into a loop and keeps going back and forth between two possible solutions; in other words, there isn’t a single unique solution.
 Interpretability can be a problem—sometimes the answer isn’t at all useful. Indeed that’s often the biggest problem.
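The iterative idea can be sketched as follows. This is the classic Lloyd-style iteration with naive seeding, purely for illustration (the R default discussed in the historical note is Hartigan-Wong, and the seeding here is exactly what k-means++ improves on):

```python
def kmeans(points, k, iters=20):
    # Lloyd's iteration: assign each point to its nearest centroid, then
    # recompute each centroid as the mean of its cluster, and repeat.
    centroids = list(points[:k])  # naive seeding; k-means++ does better
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                  + (p[1] - centroids[c][1]) ** 2)
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # guard against an empty cluster
                centroids[j] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, clusters

# Two well-separated blobs in the plane.
points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
centroids, clusters = kmeans(points, k=2)
```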
Historical Perspective: k-means
Wait, didn’t we just describe the algorithm? It turns out there’s more than one way to go after k-means clustering. The standard k-means algorithm is attributed to separate work by Hugo Steinhaus and Stuart Lloyd in 1957, but it wasn’t called “k-means” then. The first person to use that term was James MacQueen in 1967. It wasn’t published outside Bell Labs until 1982. Newer versions of the algorithm are Hartigan-Wong and Lloyd and Forgy, named for their inventors and developed throughout the ’60s and ’70s. The algorithm we described is the default, Hartigan-Wong. It’s fine to use the default. As history keeps marching on, it’s worth checking out the more recent k-means++, developed in 2007 by David Arthur and Sergei Vassilvitskii (now at Google), which helps avoid convergence issues with k-means by optimizing the initial seeds. 
Exercise: Basic Machine Learning Algorithms
 Analyze sales using regression with any predictors you feel are relevant. Justify why regression was appropriate to use.
 Visualize the coefficients and fitted model.
 Predict the neighborhood using a k-NN classifier. Be sure to withhold a subset of the data for testing. Find the variables and the k that give you the lowest prediction error.
 Report and visualize your findings.
 Describe any decisions that could be made or actions that could be taken from this analysis.
Solutions
Sample R code: Linear regression on the housing dataset
Sample R code: k-NN on the housing dataset
Modeling and Algorithms at Scale
The data you’ve been dealing with so far in this chapter has been pretty small on the Big Data spectrum. What happens to these models and algorithms when you have to scale up to massive datasets? In some cases, it’s entirely appropriate to sample and work with a smaller dataset, or to run the same model across multiple sharded datasets. (Sharding is where the data is broken up into pieces and divided among different machines; you then look at the empirical distribution of the estimators across models.) In other words, there are statistical solutions to these engineering challenges.

However, in some cases we want to fit these models at scale, and the challenge of scaling up models generally translates to the challenge of creating parallelized versions or approximations of the optimization methods. Linear regression at scale, for example, relies on matrix inversions or approximations of matrix inversions. Optimization with Big Data calls for new approaches and theory—this is the frontier! From a 2013 talk by Peter Richtarik from the University of Edinburgh:

“In the Big Data domain classical approaches that rely on optimization methods with multiple iterations are not applicable as the computational cost of even a single iteration is often too excessive; these methods were developed in the past when problems of huge sizes were rare to find. We thus need new methods which would be simple, gentle with data handling and memory requirements, and scalable. Our ability to solve truly huge scale problems goes hand in hand with our ability to utilize modern parallel computing architectures such as multicore processors, graphical processing units, and computer clusters.”

Much of this is outside the scope of the book, but a data scientist needs to be aware of these issues, and some of this is discussed in Chapter 14. 
Summing It All Up
Thought Experiment: Automated Statistician
4. Spam Filters, Naive Bayes, and Wrangling
Thought Experiment: Learning by Example
Figure 4-1. Suspiciously spammy
 Any email is spam if it contains Viagra references. That’s a good rule to start with, but as you’ve likely seen in your own email, people figured out this spam filter rule and got around it by modifying the spelling. (It’s sad that spammers are so smart and aren’t working on more important projects than selling lots of Viagra…)
 Maybe something about the length of the subject gives it away as spam, or perhaps excessive use of exclamation points or other punctuation. But some words like “Yahoo!” are authentic, so you don’t want to make your rule too simplistic. And here are a few suggestions regarding code you could write to identify spam:
 Try a probabilistic model. In other words, should you not have simple rules, but have many rules of thumb that aggregate together to provide the probability of a given email being spam? This is a great idea.
 What about k-nearest neighbors or linear regression? You learned about these techniques in the previous chapter, but do they apply to this kind of problem? (Hint: the answer is “No.”)
Why Won’t Linear Regression Work for Filtering Spam?
Aside: State of the Art for Spam Filters
In the last five years, people have started using stochastic gradient methods to avoid the noninvertible (overfitting) matrix problem. Switching to logistic regression with stochastic gradient methods helped a lot, and can account for correlations between words. Even so, Naive Bayes is pretty impressively good considering how simple it is. 
How About k-nearest Neighbors?
Aside: Digit Recognition
Say you want an algorithm to recognize pictures of handwritten digits as shown in Figure 4-2. In this case, k-NN works well.
Figure 4-2. Handwritten digits
To set it up, you take your underlying representation apart pixel by pixel—say in a 16×16 grid of pixels—and measure how bright each pixel is. Unwrap the 16×16 grid and put it into a 256-dimensional space, which has a natural Archimedean metric. That is to say, the distance between two different points in this space is the square root of the sum of the squares of the differences between their entries. In other words, it’s the length of the vector going from one point to the other. Then you apply the k-NN algorithm. If you vary the number of neighbors, it changes the shape of the boundary, and you can tune k to prevent overfitting. If you’re careful, you can get 97% accuracy with a sufficiently large dataset.

Moreover, the result can be viewed in a “confusion matrix.” A confusion matrix is used when you are trying to classify objects into k bins; it is a k×k matrix of actual label versus predicted label, where the (i, j)th element is a count of the items that were actually labeled i but predicted to have label j. From a confusion matrix, you can get accuracy, the proportion of total predictions that were correct. In the previous chapter, we discussed the misclassification rate. Notice that accuracy = 1 − misclassification rate. 
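The confusion matrix bookkeeping can be sketched directly (the labels here are made up):

```python
def confusion_matrix(actual, predicted, k):
    # Entry (i, j) counts items whose actual label is i and whose
    # predicted label is j.
    m = [[0] * k for _ in range(k)]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = [0, 0, 1, 1, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 2, 0]
m = confusion_matrix(actual, predicted, k=3)

# Accuracy is the diagonal's share of all predictions;
# accuracy = 1 - misclassification rate.
accuracy = sum(m[i][i] for i in range(3)) / len(actual)
```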
Naive Bayes
Bayes Law
 99% of sick patients test positive.
 99% of healthy patients test negative.
Figure 4-3. Tree diagram to build intuition
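With the two test accuracies above and an assumed base rate of illness (the 1% here is a stand-in, since the base rate isn’t shown in this excerpt), Bayes’ Law gives the chance that a patient who tests positive is actually sick:

```python
# P(sick | positive) = P(positive | sick) * P(sick) / P(positive)
p_sick = 0.01                # assumed prevalence (illustrative)
p_pos_given_sick = 0.99      # 99% of sick patients test positive
p_neg_given_healthy = 0.99   # 99% of healthy patients test negative

p_positive = (p_pos_given_sick * p_sick
              + (1 - p_neg_given_healthy) * (1 - p_sick))
p_sick_given_pos = p_pos_given_sick * p_sick / p_positive
```

Under these numbers, a positive test means only a 50% chance of actually being sick, because the disease is rare.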
Let’s do it again using fancy notation so we’ll feel smart.
A Spam Filter for Individual Words
 “money”: 80% chance of being spam
 “viagra”: 100% chance
 “enron”: 0% chance
A Spam Filter That Combines Words: Naive Bayes
It’s helpful to take the log because multiplying together tiny numbers can give us numerical problems. 
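Combining per-word evidence in log space looks like the sketch below (the word probabilities are illustrative, and the class prior is ignored for simplicity):

```python
import math

# Illustrative estimates of P(word | spam) and P(word | not spam).
p_word_spam = {"money": 0.20, "viagra": 0.10, "enron": 0.001}
p_word_ham  = {"money": 0.05, "viagra": 0.001, "enron": 0.10}

def log_score(words, probs):
    # Summing logs instead of multiplying tiny probabilities avoids
    # numerical underflow.
    return sum(math.log(probs[w]) for w in words)

words = ["money", "viagra"]
spam_score = log_score(words, p_word_spam)
ham_score = log_score(words, p_word_ham)
is_spam = spam_score > ham_score
```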
 “Idiot’s Bayes – not so stupid after all?” (The whole paper is about why it doesn’t suck, which is related to redundancies in language.)
 “Naive Bayes at Forty: The Independence Assumption in Information Retrieval”
 “Spam Filtering with Naive Bayes – Which Naive Bayes?”
Fancy It Up: Laplace Smoothing
If we take the derivative, and set it to zero, we get:
In other words, just what we had before. So what we’ve found is that the maximal likelihood estimate recovers your result, as long as we assume independence.
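A sketch of the smoothed estimate, with α and β acting as pseudo-counts (this is one common parameterization; the exact form in the full derivation may differ slightly):

```python
def smoothed_theta(spam_count, total_count, alpha=0.2, beta=0.2):
    # Pseudo-counts keep the estimate strictly between 0 and 1, so a
    # word never seen in spam doesn't get probability exactly 0.
    return (spam_count + alpha) / (total_count + alpha + beta)

theta_mle = 0 / 100                      # unsmoothed estimate: exactly 0
theta_smoothed = smoothed_theta(0, 100)  # small but positive
```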
Is That a Reasonable Assumption?
Recall that θ is the chance that a word is in spam if that word is in some email. On the one hand, as long as both α > 0 and β> 0, this distribution vanishes at both 0 and 1. This is reasonable: you want very few words to be expected to never appear in spam or to always appear in spam. On the other hand, when α and β are large, the shape of the distribution is bunched in the middle, which reflects the prior that most words are equally likely to appear in spam or outside spam. That doesn’t seem true either. A compromise would have α and β be positive but small, like 1/5. That would keep your spam filter from being too overzealous without having the wrong idea. Of course, you could relax this prior as you have more and better data; in general, strong priors are only needed when you don’t have sufficient data. 
Comparing Naive Bayes to k-NN
Sample Code in bash
Scraping the Web: APIs and Other Tools
Warning about APIs
Always check the terms and services of a website’s API before scraping. Additionally, some websites limit what data you have access to through their APIs or how often you can ask for data without paying for it. 
Thought Experiment: Image Recognition
How do you determine if an image is a landscape or a headshot? Start by collecting data. You either need to get someone to label these things, which is a lot of work, or you can grab lots of pictures from Flickr and ask for photos that have already been tagged.

Represent each image with a binned RGB (red, green, blue) intensity histogram. In other words, for each pixel, and for each of red, green, and blue, which are the basic colors in pixels, you measure the intensity, which is a number between 0 and 255. Then draw three histograms, one for each basic color, showing how many pixels had which intensity. It’s better to do a binned histogram, so count the number of pixels with intensity 0–51, 52–102, and so on. In the end, for each picture, you have 15 numbers, corresponding to 3 colors and 5 bins per color. We are assuming here that every picture has the same number of pixels.

Finally, use k-NN to decide how much “blue” makes a landscape versus a headshot. You can tune the hyperparameters, which in this case are the number of bins as well as k. 
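The 15-number representation can be sketched as follows (assuming each pixel arrives as an (r, g, b) triple of intensities):

```python
def rgb_features(pixels, bins=5):
    # For each of the three channels, count pixels whose intensity
    # (0-255) falls into each of `bins` equal-width bins.
    width = 256 / bins
    hist = [[0] * bins for _ in range(3)]
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel][min(int(value / width), bins - 1)] += 1
    # Flatten to 3 * bins numbers per image (15 when bins = 5).
    return [count for channel in hist for count in channel]

# Two hypothetical pixels: one dark, one with a bright blue channel.
features = rgb_features([(10, 20, 30), (40, 50, 250)])
```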
Jake’s Exercise: Naive Bayes for Article Classification
Your code should read the title and body text for each article, remove unwanted characters (e.g., punctuation), and tokenize the article contents into words, filtering out stop words (given in the stopwords file). The training phase of your code should use these parsed document features to estimate the weights w (hat), taking the hyperparameters α and β as input. The prediction phase should then accept these weights as inputs, along with the features for new examples, and output posterior probabilities for each class.
Sample R Code for Dealing with the NYT API
Historical Context: Natural Language Processing
The example in this chapter where the raw data is text is just the tip of the iceberg of a whole field of research in computer science called natural language processing (NLP). The types of problems that can be solved with NLP include machine translation, where given text in one language, the algorithm can translate the text to another language; semantic analysis; part of speech tagging; and document classification (of which spam filtering is an example). Research in these areas dates back to the 1950s. 
5. Logistic Regression
The contributor for this chapter is Brian Dalessandro. Brian works at Media6Degrees as a VP of data science, and he’s active in the research community. He’s also served as co-chair of the KDD competition. M6D (also known as Media 6 Degrees) is a startup in New York City in the online advertising space. Figure 5-1 shows Brian’s data science profile—his y-axis is scaled from Clown to Rockstar.
Figure 5-1. Brian’s data science profile
Thought Experiments
 Would we even need data science if we had such a theory?
 Is it even theoretically possible to have such a theory? Do such theories lie only in the realm of, say, physics, where we can anticipate the exact return of a comet we see once a century?
 What’s the critical difference between physics and data science that makes such a theory implausible?
 Is it just accuracy? Or more generally, how much we imagine can be explained? Is it because we predict human behavior, which can be affected by our predictions, creating a feedback loop?
 You hold on to your existing best performer.
 Once you have a new idea to prototype, set up an experiment wherein the two best models compete.
 Rinse and repeat (while not overfitting).
Classifiers
Table 51. Classifier example questions and answers
Runtime
You
Interpretability
Scalability
 Simpler models are more interpretable but aren’t as good performers.
 The question of which algorithm works best is problem-dependent.
 It’s also constraint-dependent.
M6D Logistic Regression Case Study
Click Models
The Underlying Math
Logit Versus Inverse-logit
The logit function takes x values in the range (0, 1) and transforms them to y values along the entire real line. The inverse-logit does the reverse, taking x values along the real line and transforming them to y values in the range (0, 1). 
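As a sketch of the pair:

```python
import math

def logit(p):
    # Maps p in (0, 1) to the whole real line.
    return math.log(p / (1 - p))

def inv_logit(x):
    # Maps any real x back into (0, 1); the two functions are inverses.
    return 1 / (1 + math.exp(-x))
```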
And similarly, if c_{i} =0, the first term cancels out and you have:
Which can also be written as:
Estimating α and β
So putting it all together, you have:
Now, how do you maximize the likelihood?
More on Maximum Likelihood Estimation
We realize that we went through that a bit fast, so if you want more details with respect to maximum likelihood estimation, we suggest looking in Statistical Inference by Casella and Berger, or if it’s linear algebra in general you want more details on, check out Gilbert Strang’s Linear Algebra and Its Applications. 
Newton’s Method
Stochastic Gradient Descent
Implementation
Call this matrix “train,” and then the command line in R would be:
Evaluation
Warning: Feedback Loop!
If you want to productionize logistic regression to rank ads or items based on clicks and impressions, then let’s think about what that means in terms of the data getting generated. So let’s say you put an ad for hair gel above an ad for deodorant, and then more people click on the ad for hair gel—is it because you put it on top or because more people want hair gel? How can you feed that data into future iterations of your algorithm given that you potentially caused the clicks yourself and it has nothing to do with the ad quality? One solution is to always be logging the position or rank that you showed the ads, and then use that as one of the predictors in your algorithm. So you would then model the probability of a click as a function of position, vertical, brand, or whatever other features you want. You could then use the parameter estimated for position and use that going forward as a “position normalizer.” There’s a whole division at Google called Ads Quality devoted to problems such as these, so one little paragraph can’t do justice to all the nuances of it. 
Using A/B Testing for Evaluation
When we build models and optimize them with respect to some evaluation metric such as accuracy or mean squared error, the estimation method itself is built to optimize parameters with respect to these metrics. In some contexts, the metrics we might want to optimize for are something else altogether, such as revenue. So we might try to build an algorithm that optimizes for accuracy, when our real goal is to make money. The model itself may not directly capture this. So a way to capture it is to run A/B tests (or statistical experiments) where we divert some set of users to one version of the algorithm and another set of users to another version of the algorithm, and check the difference in performance of metrics we care about, such as revenue or revenue per user, or something like that. We’ll discuss A/B testing more in Chapter 11. 
Media 6 Degrees Exercise
Sample R Code
6. Time Stamps and Financial Modeling
In this chapter, we have two contributors, Kyle Teague from GetGlue, and someone you are a bit more familiar with by now: Cathy O’Neil. Before Cathy dives into her talk about the main topics for this chapter—time series, financial modeling, and fancy-pants regression—we’ll hear from Kyle Teague from GetGlue about how they think about building a recommendation system. (We’ll also hear more on this topic in Chapter 7.) We then lay some of the groundwork for thinking about timestamped data, which will segue into Cathy’s talk.
Kyle Teague and GetGlue
Figure 6-1. Bipartite graph with users and items (shows) as nodes
Timestamps
Exploratory Data Analysis (EDA)
Figure 6-2. An example of a way to visually display user-level data over time
 What is the typical or average user doing?
 What does variation around that look like?
 How would we classify users into different segments based on their behavior with respect to time?
 How would we quantify the differences between these users?
Figure 6-3. Use color to include more information about user actions in a visual display
Figure 6-4. Aggregating user actions into counts
Timestamps Are Tricky
We’re not gonna lie: timestamps are one of the hardest things to get right in modeling, especially around time changes. That’s why it’s sometimes easiest to convert all timestamps into seconds since the beginning of epoch time. 
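For instance, in Python, a parsed timestamp can be pinned to UTC and converted to epoch seconds (the timestamp string here is made up):

```python
from datetime import datetime, timezone

# Parse a timestamp string, declare it to be UTC, and convert it to
# seconds since the Unix epoch (1970-01-01 00:00:00 UTC).
ts = datetime.strptime("2013-05-01 14:30:00", "%Y-%m-%d %H:%M:%S")
epoch_seconds = int(ts.replace(tzinfo=timezone.utc).timestamp())
```

Doing the conversion in UTC sidesteps time-zone and daylight saving headaches entirely.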
Metrics and New Variables or Features
What’s Next?
Historical Perspective: What’s New About This?
Timestamped data itself is not new, and time series analysis is a well-established field (see, for example, Time Series Analysis by James D. Hamilton). Historically, the available datasets were fairly small, and events were recorded once a day or even reported at aggregate levels. Some examples of timestamped datasets that have existed for a while, even at a granular level, are stock prices in finance, credit card transactions, phone call records, and books checked out of the library. Even so, there are a couple of things that make this new, or at least make the scale of it new. First, it’s now easy to measure human behavior throughout the day because many of us now carry around devices that can be and are used for measurement purposes and to record actions. Next, timestamps are accurate, so we’re not relying on the user to self-report, which is famously unreliable. Finally, computing power makes it possible to store large amounts of data and process it fairly quickly. 
Cathy O’Neil
Figure 6-5. Cathy’s data science profile
Thought Experiment
Figure 6-6. Without keeping track of timestamps, we can’t see time-based patterns; here, we see a seasonal pattern in a time series
Financial Modeling
In-Sample, Out-of-Sample, and Causality
Figure 6-7. In-sample data should come before out-of-sample data in a time series dataset
Preparing Financial Data
Transforming Your Data
Outside of the context of financial data, preparing and transforming data is also a big part of the process. You have a number of possible techniques to choose from to transform your data to better “behave”:

Log Returns
Example: The S&P Index
Figure 6-10. The log of the S&P returns shown over time
Figure 6-11. The volatility-normalized log of the S&P closing returns shown over time
Working out a Volatility Measurement
Exponential Downweighting
This technique is called exponential downweighting, a convenient way of compressing the data into a single value that can be updated without having to save the entire dataset. 
Next, assume we have the current variance estimate as:
Note that we said we would use the sample standard deviation, but the formula for that normally involves removing the mean before taking the sum of squares. Here we ignore the mean, mostly because we are typically taking daily volatility, where the mean (which is hard to anticipate in any case!) is a much smaller factor than the noise, so we can treat it essentially as zero. If we were to measure volatility on a longer time scale such as quarters or years, then we would probably not ignore the mean.
It really matters which downweighting factor you use, as shown in Figure 6-12.
Figure 6-12. Volatility in the S&P with different decay factors
Exponential Downweighting
where e_{t} is the new term.
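The update formulas themselves did not survive in these notes. Writing s² for the running variance estimate, e_t for the new term, and λ for the downweighting factor (these symbol names are our reconstruction, not necessarily the originals), the standard exponentially downweighted variance update has the form:

```latex
s^{2}_{\text{new}} \;=\; \lambda \, s^{2}_{\text{old}} \;+\; (1 - \lambda)\, e_{t}^{2}, \qquad 0 < \lambda < 1 .
```

A λ close to 1 downweights the past slowly; a smaller λ forgets it quickly. Note that, per the discussion above, the mean is treated as zero, so e_t² enters directly rather than (e_t − ē)².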
Additive Estimates
We need each of our estimates to be additive (which is why we have a running variance estimate rather than a running standard deviation estimate). If what we’re after is a weighted average, say, then we will need to have a running estimate of both numerator and denominator. 
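As a concrete illustration, here is a minimal Python sketch (function and variable names are ours, purely illustrative) of keeping a running exponentially downweighted weighted average by maintaining the numerator and denominator separately, each of which is additive:

```python
def update_weighted_average(num, den, x, w, decay=0.97):
    """Exponentially downweight the running numerator and denominator,
    then fold in the new observation x with weight w."""
    num = decay * num + w * x
    den = decay * den + w
    return num, den, num / den

num, den = 0.0, 0.0
for x, w in [(10.0, 1.0), (12.0, 2.0), (8.0, 1.0)]:
    num, den, avg = update_weighted_average(num, den, x, w)
```

Because both the numerator and denominator decay and update additively, you never need the full history to refresh the average.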
The Financial Modeling Feedback Loop
Figure 6-13. A graph of the cumulative PnLs of two theoretical models
This generalizes to any model as well—you plot the cumulative sum of the product of demeaned forecast and demeaned realized. (A demeaned value is one where the mean’s been subtracted.) In other words, you see if your model consistently does better than the “stupidest” model of assuming everything is average.
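In code, the comparison just described might look like this minimal sketch (pure Python; the data and names are illustrative):

```python
def cumulative_pnl(forecasts, realized):
    """Cumulative sum of demeaned forecast times demeaned realized:
    if the running total drifts upward, the model consistently beats
    the 'everything is average' baseline."""
    f_mean = sum(forecasts) / len(forecasts)
    r_mean = sum(realized) / len(realized)
    total, curve = 0.0, []
    for f, r in zip(forecasts, realized):
        total += (f - f_mean) * (r - r_mean)
        curve.append(total)
    return curve
```

Plotting the returned curve over time gives the kind of picture shown in the cumulative-PnL figure.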
Why Regression?
Adding Priors
A Baby Model
Figure 6-14. Looking at autocorrelation out to 100 lags
This can also be solved using calculus, and we solve for beta to get:
Putting the preceding rules to use, we have:
Setting this to 0 and solving for β gives us:
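The intermediate equations are missing from these notes. For a single-predictor model y_t = βx_t + ε_t with no intercept (the symbols here are our reconstruction), the least squares derivation being described runs:

```latex
F(\beta) = \sum_t \left( y_t - \beta x_t \right)^2,
\qquad
\frac{dF}{d\beta} = -2 \sum_t x_t \left( y_t - \beta x_t \right),
\qquad
\frac{dF}{d\beta} = 0
\;\Longrightarrow\;
\hat{\beta} = \frac{\sum_t x_t \, y_t}{\sum_t x_t^{2}} .
```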
Priors and Higher Derivatives
If you want to, you can add a prior about the second derivative (or other higher derivatives) as well, by squaring the derivative operator (I−M) (or taking higher powers of it). 
Exercise: GetGlue and Timestamped Event Data
 How many unique actions can a user take? And how many actions of each type are in this dataset?
 How many unique users are in this dataset?
 What are the 10 most popular movies?
 How many events in this dataset occurred in 2011?
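A minimal sketch of how these counts might be computed in Python, assuming each event has been parsed into a dict with user_id, action, movie, and timestamp fields (these field names are guesses, not necessarily the actual GetGlue schema):

```python
from collections import Counter

def summarize_events(events):
    """events: iterable of dicts with keys 'user_id', 'action', 'movie',
    and 'timestamp' (hypothetical field names). Returns the counts
    asked for in the exercise."""
    actions = Counter(e["action"] for e in events)
    users = {e["user_id"] for e in events}
    movies = Counter(e["movie"] for e in events if e["movie"])
    in_2011 = sum(1 for e in events if e["timestamp"].startswith("2011"))
    return {
        "actions_by_type": actions,            # how many actions of each type
        "n_unique_users": len(users),          # how many unique users
        "top_movies": movies.most_common(10),  # the 10 most popular movies
        "events_in_2011": in_2011,             # how many events occurred in 2011
    }
```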
Exercise: Financial Data
7. Extracting Meaning from Data
William Cukierski
Background: Data Science Competitions
Data Science Competitions Cut Out All the Messy Stuff
Competitions might be seen as formulaic, dry, and synthetic compared to what you’ve encountered in normal life. Competitions cut out the messy stuff before you start building models—asking good questions, collecting and cleaning the data, etc.—as well as what happens once you have your model, including visualization and communication. The team of Kaggle data scientists actually spends a lot of time creating the dataset and evaluation metrics, and figuring out what questions to ask, so the question is: while they’re doing data science, are the contestants? 
Background: Crowdsourcing
 In 1714, the British Royal Navy couldn’t measure longitude, and put out a prize worth $6 million in today’s dollars to get help. John Harrison, an unknown cabinetmaker, figured out how to make a clock to solve the problem.
 In 2002, the TV network Fox issued a prize for the next pop solo artist, which resulted in the television show American Idol, where contestants compete in an eliminationround style singing competition.
 There’s also the X Prize Foundation, which offers “incentivized prize competitions…to bring about radical breakthroughs for the benefit of humanity, thereby inspiring the formation of new industries and the revitalization of markets.” A total of $10 million was offered for the Ansari X Prize, a space competition, and $100 million was invested by contestants trying to solve it. Note this shows that it’s not always such an efficient process overall; on the other hand, it could very well be efficient for the people offering the prize if it gets solved.
Terminology: Crowdsourcing and Mechanical Turks
These are a couple of terms that have started creeping into the vernacular over the past few years. Although crowdsourcing (the concept of using many people to solve a problem independently) is not new, the term itself was only coined fairly recently, in 2006. The basic idea is that a challenge is issued and contestants compete to find the best solution.

A related phenomenon is the central thesis of The Wisdom of Crowds by James Surowiecki (Anchor, 2004): that, on average, crowds of people will make better decisions than experts. But groups of people arrive at the correct solution only under certain conditions: the individuals must be independent, rather than subject to groupthink, where a group of people talking to each other can influence one another into wildly incorrect solutions. And only certain problems are well-suited to this approach.

Amazon Mechanical Turk is an online crowdsourcing service where humans are given tasks. For example, there might be a set of images that need to be labeled as “happy” or “sad.” These labels could then be used as the basis of a training set for a supervised learning problem: an algorithm could be trained on these human-labeled images to automatically label new images. So the central idea of Mechanical Turk is to have humans do fairly routine tasks to help machines, with the goal of the machines then automating tasks to help the humans! Any researcher with a task they need automated can use Amazon Mechanical Turk, as long as they provide compensation for the humans. And any human can sign up and be part of the crowdsourcing service, although there are some quality-control issues: if the researcher realizes a worker is just labeling every other image as “happy” without actually looking at the images, then that worker won’t be used anymore for labeling. Mechanical Turk is an example of artificial artificial intelligence (yes, double up on the “artificial”), in that the humans are helping the machines helping the humans.
The Kaggle Model
A Single Contestant
Figure 7-1. Chris Mulligan, a student in Rachel’s class, created this leapfrogging visualization
Their Customers
Is This Fair?
Is it fair to the data scientists already working at the companies that engage with Kaggle? Some of them might lose their jobs, for example, if the result of the competition is better than the internal model. Is it fair to get people to basically work for free and ultimately benefit a for-profit company? Does it result in data scientists losing their fair market price? Kaggle charges a fee for hosting competitions, and it offers well-defined prizes, so a given data scientist can always choose not to compete. Is that enough? This seems like it could be a great opportunity for companies, but only while the data scientists of the world haven’t realized their value and have extra time on their hands. As soon as they price their skills better, they might think twice about working for (almost) free, unless it’s for a cause they actually believe in.
Kaggle’s Essay Scoring Competition
Part of the final exam for the Columbia class was an essay grading contest. The students had to build it, train it, and test it, just like any other Kaggle competition, and group work was encouraged. The details of the essay contest are discussed below, and you can access the data at https://inclass.kaggle.com. You are provided access to hand-scored essays so that you can build, train, and test an automatic essay scoring engine. Your success depends upon how closely your scores match those of human expert graders. For this competition, there are five essay sets. Each set of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade level from 7 to 10. All essays were hand graded and were double-scored. Each of the datasets has its own unique characteristics. The variability is intended to test the limits of your scoring engine’s capabilities. The data has these columns:
 id: a unique identifier for each individual student essay
 essay_set: 1–5, an id for each set of essays
 essay: the ASCII text of a student’s response
 rater1: Rater 1’s grade
 rater2: Rater 2’s grade
 grade: resolved score between the raters
Thought Experiment: What Are the Ethical Implications of a RoboGrader?
Domain Expertise Versus Machine Learning Algorithms
This is a false dichotomy. It isn’t either/or. You need both to solve data science problems. However, Kaggle’s president Jeremy Howard pissed some domain experts off in a December 2012 New Scientist magazine interview with Peter Aldhous, “Specialist Knowledge Is Useless and Unhelpful.” Here’s an excerpt:

PA: What separates the winners from the also-rans?

JH: The difference between the good participants and the bad is the information they feed to the algorithms. You have to decide what to abstract from the data. Winners of Kaggle competitions tend to be curious and creative people. They come up with a dozen totally new ways to think about the problem. The nice thing about algorithms like the random forest is that you can chuck as many crazy ideas at them as you like, and the algorithms figure out which ones work.

PA: That sounds very different from the traditional approach to building predictive models. How have experts reacted?

JH: The messages are uncomfortable for a lot of people. It’s controversial because we’re telling them: “Your decades of specialist knowledge are not only useless, they’re actually unhelpful; your sophisticated techniques are worse than generic methods.” It’s difficult for people who are used to that old type of science. They spend so much time discussing whether an idea makes sense. They check the visualizations and noodle over it. That is all actively unhelpful.

PA: Is there any role for expert knowledge?

JH: Some kinds of experts are required early on, for when you’re trying to work out what problem you’re trying to solve. The expertise you need is strategy expertise in answering these questions.

PA: Can you see any downsides to the data-driven, black-box approach that dominates on Kaggle?

JH: Some people take the view that you don’t end up with a richer understanding of the problem. But that’s just not true: the algorithms tell you what’s important and what’s not. You might ask why those things are important, but I think that’s less interesting. You end up with a predictive model that works. There’s not too much to argue about there.
Feature Selection
Terminology: Features, Explanatory Variables, Predictors
Example: User Retention
Figure 7-2. Chasing Dragons, the app designed by you
 Number of days the user visited in the first month
 Amount of time until second visit
 Number of points on day j for j=1, . . .,30 (this would be 30 separate features)
 Total number of points in first month (sum of the other features)
 Did user fill out Chasing Dragons profile (binary 1 or 0)
 Age and gender of user
 Screen size of device
Use your imagination and come up with as many features as possible. Notice there are redundancies and correlations between these features; that’s OK.
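A sketch of turning a raw visit log into some of the per-user features brainstormed above (the log format and field names here are hypothetical):

```python
from collections import defaultdict

def user_features(visits):
    """visits: list of (user_id, day, points) tuples from the first month
    (a hypothetical log format). Returns one feature dict per user."""
    by_user = defaultdict(list)
    for user, day, points in visits:
        by_user[user].append((day, points))
    features = {}
    for user, rows in by_user.items():
        days = sorted(d for d, _ in rows)
        features[user] = {
            "n_days_visited": len(set(days)),
            "days_until_second_visit": (days[1] - days[0]) if len(days) > 1 else None,
            "total_points": sum(p for _, p in rows),
        }
    return features
```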
Feature Generation or Feature Extraction
This process we just went through of brainstorming a list of features for Chasing Dragons is the process of feature generation or feature extraction. This process is as much an art as a science. It’s good to have a domain expert around for this process, but it’s also good to use your imagination. In today’s technology environment, we’re in a position where we can generate tons of features through logging. Contrast this with other contexts like surveys, for example: you’re lucky if you can get a survey respondent to answer 20 questions, let alone hundreds. But how many of these features are just noise? In this environment, when you can capture a lot of data, not all of it might actually be useful information. Keep in mind that ultimately you’re limited in the features you have access to in two ways: whether or not it’s possible to even capture the information, and whether or not it even occurs to you to try to capture it. You can think of information as falling into the following buckets:

Relevant and useful, but it’s impossible to capture it. Keep in mind that there’s a lot of information you’re not capturing about users: how much free time do they actually have? What other apps have they downloaded? Are they unemployed? Do they suffer from insomnia? Do they have an addictive personality? Do they have nightmares about dragons? Some of this information might be more predictive of whether or not they return next month. There’s not much you can do about this, except that some of the data you are able to capture may serve as a proxy by being highly correlated with these unobserved pieces of information: e.g., if they play the game every night at 3 a.m., they might suffer from insomnia, or they might work the night shift.

Relevant and useful, possible to log it, and you did. Thankfully it occurred to you to log it during your brainstorming session. It’s great that you chose to log it, but just because you chose to log it doesn’t mean you know that it’s relevant or useful, so that’s what you’d like your feature selection process to discover.

Relevant and useful, possible to log it, but you didn’t. It could be that you didn’t think to record whether users uploaded a photo of themselves to their profile, and this action is highly predictive of their likelihood to return. You’re human, so sometimes you’ll end up leaving out really important stuff, but this shows that your own imagination is a constraint in feature selection. One of the key ways to avoid missing useful features is by doing usability studies (which will be discussed by David Huffaker later in this chapter), to help you think through the user experience and what aspects of it you’d like to capture.

Not relevant or useful, but you don’t know that and log it. This is what feature selection is all about: you’ve logged it, but you don’t actually need it, and you’d like to be able to know that.

Not relevant or useful, and you either can’t capture it or it didn’t occur to you. That’s OK! It’s not taking up space, and you don’t need it.
Filters
Wrappers
Selecting an algorithm
Selection criterion
In practice
Embedded Methods: Decision Trees
Figure 7-3. Decision tree for college student, aka the party tree
Figure 7-4. Decision tree for Chasing Dragons
Entropy
Figure 7-5. Entropy
The Decision Tree Algorithm
Handling Continuous Variables in Decision Trees
Surviving the Titanic
My Note: I did a SAS Exercise with this data.
For fun, Will pointed us to this decision tree for surviving on the Titanic on the BigML website. The original data is from the Encyclopedia Titanica; source code and data are available there. Figure 7-6 provides just a snapshot of it, but if you go to the site, it is interactive.
Random Forests
Criticisms of Feature Selection
Let’s address a common criticism of feature selection: namely, that it’s no better than data dredging. If we just take whatever answer we get that correlates with our target, however far afield it is, then we could end up thinking that Bangladeshi butter production predicts the S&P (PDF). Generally we’d like to first curate the candidate features, at least to some extent. Of course, the more observations we have, the less we need to be concerned with spurious signals. There’s a well-known bias-variance tradeoff: a model is “high bias” if it’s too simple (the features aren’t encoding enough information). In this case, lots more data doesn’t improve our model. On the other hand, if our model is too complicated, then “high variance” leads to overfitting. In this case we want to reduce the number of features we are using.
User Retention: Interpretability Versus Predictive Power
David Huffaker: Google’s Hybrid Approach to Social Research
Moving from Descriptive to Predictive
Thought Experiment: Large-Scale Network Analysis
We’ll dig more into network analysis in Chapter 10 with John Kelly. But for now, think about how you might take the findings from the Google+ usability studies and explore selectively sharing content on a massive scale using data. You can use large data and look at connections between actors like a graph. For Google+, the users are the nodes and the edges (directed) are “in the same circle.” Think about what data you would want to log, and then how you might test some of the hypotheses generated from speaking with the small group of engaged users. As data scientists, it can be helpful to think of different structures and representations of data, and once you start thinking in terms of networks, you can see them everywhere. Other examples of networks:

Social at Google
Privacy
 Financial loss
 Access to personal data
 Really private stuff I searched on
 Unwanted spam
 Provocative photo (oh *&!$ my boss saw that)
 Unwanted solicitation
 Unwanted ad targeting
 Offline threats/harassment
 Harm to my family
 Stalkers
 Employment risks
Thought Experiment: What Is the Best Way to Decrease Concern and Increase Understanding and Control?
 You could write and post a manifesto of your data policy. Google tried that, but it turns out nobody likes to read manifestos.
 You could educate users on your policies a la the Netflix feature “because you liked this, we think you might like this.” But it’s not always so easy to explain things in complicated models.
 You could simply get rid of all stored data after a year. But you’d still need to explain that you do that.
 Make a picture or graph of where data is going.
 Give people a privacy switchboard.
 Provide access to quick settings.
 Make the settings you show people categorized by “things you don’t have a choice about” versus “things you do” for the sake of clarity.
 Best of all, you could make reasonable default settings so people don’t have to worry about it.
8. Recommendation Engines: Building a User-Facing Data Product at Scale
A Real-World Recommendation Engine
Figure 8-1. Bipartite graph with users and items (television shows) as nodes
Nearest Neighbor Algorithm Review
Which Metric Is Best?
You might get a different answer depending on which metric you choose. But that’s a good thing: try out lots of different distance functions, see how your results change, and think about why.
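A small sketch of why the metric matters: on the same three preference vectors, Euclidean distance and cosine distance disagree about which neighbor is nearest (the vectors are made up for illustration):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

u = (1, 1, 0)   # target user
a = (2, 2, 0)   # same tastes, stronger ratings
b = (1, 0, 0)   # overlaps on one item only
```

Euclidean distance picks b as u’s nearest neighbor, while cosine distance, which ignores magnitude, picks a.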
Some Problems with Nearest Neighbors
Beyond Nearest Neighbor: Machine Learning Classification
Let’s make more rigorous the preceding argument that huge coefficients imply overfitting, or maybe even just a bad model. For example, if two of your variables are exactly the same, or nearly the same, then the coefficient on one can be 100,000 and the other can be –100,000, and really they add nothing to the model. In general you should always have some prior on what a reasonable size would be for your coefficients, which you do by normalizing all of your variables and imagining what an “important” effect would translate to in terms of the size of coefficients; anything much larger than that (in absolute value) would be suspect.
You can’t use this penalty term for large coefficients and assume the “weighting of the features” problem is still solved, because in fact you’d be penalizing some coefficients way more than others if they start out on different scales. The easiest way to get around this is to normalize your variables before entering them into the model, similar to how we did it in Chapter 6. If you have some reason to think certain variables should have larger coefficients, then you can normalize different variables with different means and variances. At the end of the day, the way you normalize is again equivalent to imposing a prior. 
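A minimal sketch of the normalization step (pure Python, names illustrative): z-score each column so a shared penalty treats all coefficients on a comparable scale.

```python
def normalize_columns(rows):
    """Z-score each column: subtract the column mean, divide by the column
    standard deviation, so an L2 penalty treats all coefficients comparably."""
    cols = list(zip(*rows))
    out_cols = []
    for col in cols:
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        sd = var ** 0.5 or 1.0  # guard against constant columns
        out_cols.append([(x - mean) / sd for x in col])
    return [list(r) for r in zip(*out_cols)]
```

Normalizing with different means and variances per variable, as the text notes, amounts to imposing a different prior on each coefficient.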
The Dimensionality Problem
Time to Brush Up on Your Linear Algebra if You Haven’t Already
A lot of the rest of this chapter likely won’t make sense (and we want it to make sense to you!) if you don’t know linear algebra and understand the terminology and geometric interpretation of words like rank (hint: the linear algebra definition of that word has nothing to do with ranking algorithms), orthogonal, transpose, basis, span, and matrix decomposition. Thinking about data in matrices as points in space, and what it would mean to transform that space or take subspaces, can give you insights into your models, why they’re breaking, or how to make your code more efficient. This isn’t just a mathematical exercise for its own sake (although there is elegance and beauty in it): it can be the difference between a startup that fails and a startup that gets acquired by eBay. We recommend Khan Academy’s excellent free online introduction to linear algebra if you need to brush up your skills.
Singular Value Decomposition (SVD)
Important Properties of SVD
Principal Component Analysis (PCA)
The resulting latent features are the basis of a well-defined subspace of the total n-dimensional space of potential latent variables. There’s no reason to think this solution is unique if there are a bunch of missing values in your “answer” matrix. But that doesn’t necessarily matter, because you’re just looking for a solution.
Theorem: The resulting latent features will be uncorrelated
Alternating Least Squares
 Pick a random V.
 Optimize U while V is fixed.
 Optimize V while U is fixed.
 Keep doing the preceding two steps until you’re not changing very much at all. To be precise, you choose an ϵ and if your coefficients are each changing by less than ϵ, then you declare your algorithm “converged.”
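The loop above, sketched for the simplest possible case: a rank-1 factorization R ≈ u·vᵀ with an L2 prior λ, in pure Python (the data layout, initialization, and parameter defaults are illustrative, not a reference implementation):

```python
def als_rank1(R, lam=0.1, eps=1e-6, max_iter=200):
    """Alternating least squares for a rank-1 factorization R ~ u v^T,
    with ridge penalty lam; R is a dict {(user, item): rating} where
    missing entries are simply absent."""
    n_users = 1 + max(i for i, _ in R)
    n_items = 1 + max(j for _, j in R)
    u = [1.0] * n_users
    v = [1.0] * n_items  # "pick a V" -- an all-ones start keeps the sketch deterministic
    for _ in range(max_iter):
        # Optimize u while v is fixed (closed-form ridge solution per user).
        new_u = []
        for i in range(n_users):
            num = sum(v[j] * r for (a, j), r in R.items() if a == i)
            den = lam + sum(v[j] ** 2 for (a, j), _ in R.items() if a == i)
            new_u.append(num / den)
        # Optimize v while u is fixed.
        new_v = []
        for j in range(n_items):
            num = sum(new_u[i] * r for (i, b), r in R.items() if b == j)
            den = lam + sum(new_u[i] ** 2 for (i, b), _ in R.items() if b == j)
            new_v.append(num / den)
        # Declare convergence when no coefficient moves by more than eps.
        delta = max(max(abs(a - b) for a, b in zip(u, new_u)),
                    max(abs(a - b) for a, b in zip(v, new_v)))
        u, v = new_u, new_v
        if delta < eps:
            break
    return u, v
```

On a tiny ratings matrix that is exactly rank 1, the fitted products u[i]·v[j] land close to the observed entries, shrunk slightly toward zero by λ.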
Theorem with no proof: The preceding algorithm will converge if your prior is large enough
Fix V and Update U
Last Thoughts on These Algorithms
Thought Experiment: Filter Bubbles
Exercise: Build Your Own Recommendation System
Sample Code in Python
9. Data Visualization and Fraud Detection
Data Visualization History
Gabriel Tarde
Mark’s Thought Experiment
What Is Data Science, Redux?
Processing
Franco Moretti
A Sample of Data Visualization Projects
Figure 9-1. Nuage Vert by Helen Evans and Heiko Hansen
Figure 9-2. One Tree by Natalie Jeremijenko
Figure 9-3. Dusty Relief from New Territories
Figure 9-4. Project Reveal from the New York Times R & D lab
Mark’s Data Visualization Projects
New York Times Lobby: Moveable Type
Figure 9-6. Moveable Type, the New York Times lobby, by Ben Rubin and Mark Hansen
Figure 9-7. Display box for Moveable Type
Project Cascade: Lives on a Screen
Figure 9-8. Project Cascade by Jer Thorp and Mark Hansen
This was done two years ago, and Twitter has gotten a lot bigger since then. 
Cronkite Plaza
Figure 9-9. And That’s The Way It Is, by Jer Thorp, Mark Hansen, and Ben Rubin
eBay Transactions and Books
Figure 9-11. The underlying data for the eBay installation
Public Theater Shakespeare Machine
Figure 9-12. Shakespeare Machine, by Mark, Jer, and Ben
Goals of These Exhibits
Data Science and Risk
About Square
The Risk Challenge
Detecting suspicious activity using machine learning
Figure 9-14. Seller schema
Figure 9-15. Settlement schema
 Payment data, where we can assume the fields are transaction_id, seller_id, buyer_id, amount, success (0 or 1), and timestamp.
 Seller data, where we can assume the fields are seller_id, sign_up_date, business_name, business_type, and business_location.
 Settlement data, where we can assume the fields are settlement_id, state, and timestamp.
Figure 9-16. Risk engine
 Get data.
 Derive features.
 Train model.
 Estimate performance.
 Publish model!
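A toy sketch of the middle steps of this pipeline on the payment schema above: derive per-seller features, then “train” the simplest possible model, a single threshold on failure rate chosen from labeled examples. The features, labels, and decision rule are illustrative only, not Square’s actual approach; only the payment field names come from the schema in the text.

```python
def derive_features(payments):
    """payments: list of dicts with transaction_id, seller_id, buyer_id,
    amount, success (0 or 1), and timestamp, per the payment schema.
    Returns per-seller feature dicts."""
    feats = {}
    for p in payments:
        f = feats.setdefault(p["seller_id"], {"n": 0, "fails": 0, "amount": 0.0})
        f["n"] += 1
        f["fails"] += 1 - p["success"]
        f["amount"] += p["amount"]
    for f in feats.values():
        f["fail_rate"] = f["fails"] / f["n"]
    return feats

def train_threshold(feats, labels):
    """labels: {seller_id: 0/1 suspicious}. Pick the fail_rate cutoff with the
    best training accuracy -- a stand-in for the real model-training step."""
    best, best_acc = 0.5, -1.0
    for cut in sorted({f["fail_rate"] for f in feats.values()}):
        acc = sum((feats[s]["fail_rate"] >= cut) == bool(y)
                  for s, y in labels.items()) / len(labels)
        if acc > best_acc:
            best, best_acc = cut, acc
    return best
```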
The Trouble with Performance Estimation
Defining the error metric
Table 9-1. Actual versus predicted table, also called the confusion matrix
Defining the labels
 What counts as a suspicious activity?
 What is the right level of granularity? An event or an entity (or both)?
 Can we capture the label reliably? What other systems do we need to integrate with to get this data?
Challenges in features and learning
What’s the Label?
Here’s another example of the trickiness of labels. At DataEDGE, a conference held annually at UC Berkeley’s School of Information, in a conversation with Michael Chui of the McKinsey Global Institute, Itamar Rosenn, Facebook’s first data scientist (hired in 2007), described the difficulties in defining an “engaged user.” What’s the definition of engaged? If you want to predict whether or not a user is engaged, then you need some notion of engagement if you’re going to label users as engaged or not. There is no one obvious definition, and, in fact, a multitude of definitions might work depending on the context: there is no ground truth! Some definitions of engagement could depend on the frequency or rhythm with which a user comes to a site, or how much they create or consume content. It’s a semi-supervised learning problem where you’re simultaneously trying to define the labels as well as predict them.
Model Building Tips
Code readability and reusability
Get a pair!
Productionizing machine learning models
 High dimensionality? Don’t worry, we’ll just do an SVD, save off the transformation matrices, and multiply them with the holdout data.
 Transforming arrival rate data? Hold on, let me first fit a Poisson model across the historical data and holdout set.
 Time series? Let’s throw in some Fourier coefficients.
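Each bullet above is an instance of the same leak: fitting a transformation with the holdout data included. The safe pattern, sketched here for simple standardization (the names are illustrative), is to estimate the transformation’s parameters on the training data only and then apply those frozen parameters to the holdout:

```python
def fit_standardizer(train_col):
    """Learn the mean and standard deviation from training data only."""
    mean = sum(train_col) / len(train_col)
    var = sum((x - mean) ** 2 for x in train_col) / len(train_col)
    return mean, var ** 0.5 or 1.0  # guard against a constant column

def apply_standardizer(col, params):
    """Reuse the training-time parameters on any later data."""
    mean, sd = params
    return [(x - mean) / sd for x in col]

train = [1.0, 2.0, 3.0]
holdout = [4.0]
params = fit_standardizer(train)          # fitted without peeking at the holdout
z_holdout = apply_standardizer(holdout, params)
```

The same fit-on-train, apply-to-holdout split applies to an SVD’s transformation matrices, a Poisson fit, or Fourier coefficients.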
Data Visualization at Square
 Enable efficient transaction review.
 Reveal patterns for individual customers and across customer segments.
 Measure business health.
 Provide ambient analytics.
 Play with real data.
 Build math, stats, and computer science foundations in school.
 Get an internship.
 Be literate, not just in statistics.
 Stay curious!
Ian’s Thought Experiment
Data Visualization for the Rest of Us
 There’s a nice orientation to the building blocks of data visualization by Michael Dubakov at http://www.targetprocess.com/article...encoding.html.
 Nathan Yau, who was Mark Hansen’s PhD student at UCLA, has a collection of tutorials on creating visualizations in R at http://flowingdata.com/. Nathan Yau also has two books: Visualize This: The Flowing Data Guide to Design, Visualization, and Statistics (Wiley); and Data Points: Visualization That Means Something (Wiley).
 Scott Murray, code artist, has a series of tutorials to get up to speed on d3 at http://alignedleft.com/tutorials/d3/. These have been developed into a book, Interactive Data Visualization (O’Reilly).
 Hadley Wickham, who developed the R package ggplot2 based on Wilkinson’s Grammar of Graphics, has a corresponding book: ggplot2: Elegant Graphics for Data Analysis (Use R!) (Springer).
 Classic books on data visualization include several books (The Visual Display of Quantitative Information [Graphics Press], for example) by Edward Tufte, a statistician widely regarded as one of the fathers of data visualization (we know we already said that about Mark Hansen; they’re different generations), with an emphasis less on the tools and more on the principles of good design. Also, William Cleveland (who we mentioned back in Chapter 1 because of his proposal to expand the field of statistics into data science) has two books: Elements of Graphing Data (Hobart Press) and Visualizing Data (Hobart Press).
 Newer books by O’Reilly include the R Graphics Cookbook, Beautiful Data, and Beautiful Visualization. We’d be remiss not to mention that art schools have graphic design departments and books devoted to design principles. An education in data visualization that doesn’t take these into account, as well as the principles of journalism and storytelling, and only focuses on tools and statistics is only giving you half the picture. Not to mention the psychology of human perception.
 This talk, “Describing Dynamic Visualizations” by Bret Victor, comes highly recommended by Jeff Heer, a Stanford professor who created d3 with Michael Bostock (who used to work at Square, and now works for the New York Times). Jeff described this talk as presenting an alternative view of data visualization.
 Collaborate with an artist or graphic designer!
Data Visualization Exercise
Figure 9-17. This is a visualization by Eurry Kim and Kaz Sakamoto of the Hubway shared-bike program
Readme
For more information, see the contest page.
Metadata for Trips Table:
Variables:
 id: trip id
 status: trip status; "closed" indicates a trip has terminated
 duration: time of trip in seconds
 start_date: start date of trip with date and time, in EST
 start_station: station id of start station
 end_date: end date of trip with date and time, in EST
 end_station_id: station id of end station
 bike_nr: id of bicycle used
 subscription_type: "Registered" is user with membership; "Casual" is user without membership
 zip_code: zipcode of user
 birth_date: birth year of user
 gender: gender of user
Notes
* first row contains column names
* Total records = 552,030
* Trips that did not include a start or end date were removed from original table.
This resulted in the removal of 12,562 trips.
* zip_code, birth_date, gender are only available for Registered users
* character columns are quoted in double quotes
Metadata for Stations table:
Variables:
 id: station id
 terminalName: station terminal name
name: station name
 installed: logical, indicates whether station has been installed
 locked: logical
 temporary: logical, indicates whether station is temporary
 lat: station latitude
 lng: station longitude
Notes:
* installDate and removalDate were removed from original table due to quality assurance concerns
Metadata for Station Capacity table:
Variables:
 id: unique id
 station_id: station id
 day: date (yyyy-mm-dd)
 capacity: capacity of station (the number of bikes that a station can hold)
Notes:
* Data is aggregated from data found at: http://thehubway.com/data/stations/bikeStations.xml
* Data collection started on 8/22/2011
* Data for 9/21/2011 to 11/9/2011 is missing
Spotfire Dashboard
See: Spotfire Dashboard
10. Social Networks and Data Journalism
Social Network Analysis at Morning Analytics
Case-Attribute Data versus Social Network Data
Social Network Analysis
Terminology from Social Networks
Centrality Measures
The Industry of Centrality Measures
Thought Experiment
Morningside Analytics
Figure 10-1. Example of the Arabic blogosphere
How Visualizations Help Us Find Schools of Fish
More Background on Social Network Analysis from a Statistical Point of View
Representations of Networks and Eigenvalue Centrality
A First Example of Random Graphs: The Erdos-Renyi Model
The Bernoulli Network
Not all networks with N nodes occur with equal probability under this model: observing a network with all nodes attached to all other nodes has probability p^{D}, while observing a network with all nodes disconnected has probability (1−p)^{D}, where D is the number of possible edges. And of course there are many other possible networks between these two extremes. The Erdos-Renyi model is also known as a Bernoulli network. In the mathematics literature, the Erdos-Renyi model is treated as a mathematical object with interesting properties that allow for theorems to be proved.
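A minimal sketch of sampling a Bernoulli network, with each possible undirected edge included independently with probability p (assuming D counts the N(N−1)/2 possible undirected edges, with no self-loops):

```python
import random
from itertools import combinations

def erdos_renyi(n, p, seed=0):
    """Each of the D = n*(n-1)/2 possible undirected edges is included
    independently with probability p."""
    rng = random.Random(seed)
    return {(i, j) for i, j in combinations(range(n), 2) if rng.random() < p}
```

At the extremes from the text: p = 1 always yields the complete graph, and p = 0 always yields the fully disconnected graph.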
A Second Example of Random Graphs: The Exponential Random Graph Model
Inference for ERGMs
Further examples of random graphs: latent space models, small-world networks
 Networks, Crowds, and Markets (Cambridge University Press) by David Easley and Jon Kleinberg at Cornell’s computer science department.
 Chapter on Mining Social-Network Graphs in the book Mining Massive Datasets (Cambridge University Press) by Anand Rajaraman, Jeff Ullman, and Jure Leskovec in Stanford’s computer science department.
 Statistical Analysis of Network Data (Springer) by Eric D. Kolaczyk at Boston University.
Data Journalism
A Bit of History on Data Journalism
Writing Technical Journalism: Advice from an Expert
11. Causality
Correlation Doesn’t Imply Causation
Asking Causal Questions
Figure 11-1. Relationship between ice cream sales and bathing suit sales
The terms “treated” and “untreated” come from the biostatistics, medical, and clinical trials realm, where patients are given a medical treatment, examples of which we will encounter in the next chapter. The terminology has been adopted by the statistical and social science literature. 
Confounders: A Dating Example
There are lots of things we’re not doing here that we might want to try. For example, we’re not thinking about Frank’s attributes. Maybe he’s a really weird unattractive guy that no woman would want to date no matter what he says, which would make this a tough question to solve. Maybe he can’t even spell “beautiful.” Conversely, what if he’s gorgeous and/or famous and it doesn’t matter what he says? Also, most dating sites allow women to contact men just as easily as men contact women, so it’s not clear that our definitions of “treated” and “untreated” are welldefined. Some women might ignore their emails but spontaneously email Frank anyway. 
OK Cupid’s Attempt
Figure 11-2. OK Cupid’s attempt to demonstrate that using the word “beautiful” in an email hurts your chances of getting a response
One important piece of information missing in this plot is the bucket sizes. How many first contacts contained each of the words? It doesn’t really change things, except it would help in making it clear that the horizontal line at 32% is a weighted average across these various buckets of emails. 
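To illustrate how that weighted average works, here is a toy calculation with made-up bucket sizes and per-word response rates (OK Cupid did not publish the counts, so all numbers below are hypothetical):

```python
# Hypothetical (count, response rate) per first-contact word.
buckets = {
    "beautiful": (3000, 0.22),
    "awesome":   (5000, 0.35),
    "cool":      (8000, 0.33),
}

total = sum(n for n, _ in buckets.values())
overall = sum(n * rate for n, rate in buckets.values()) / total
# 'overall' is where the horizontal line sits: large buckets pull it
# toward their rates, while small buckets barely move it.
```

Without the bucket sizes, a reader can't tell how much each word contributes to the overall line.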
The Gold Standard: Randomized Clinical Trials
Average Versus the Individual
Randomized clinical trials measure the effect of a certain drug averaged across all people. Sometimes they might bucket users to figure out the average effect on men or women or people of a certain age, and so on. But in the end, the result is still an average, so for a given individual we don’t know what the effect would be on them. There is a push these days toward personalized medicine with the availability of genetic data, which means we stop looking at averages because we want to make inferences about the one. Even when we were talking about Frank and OK Cupid, there’s a difference between conducting this study across all men versus Frank alone. 
A/B Tests
From “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation”
The design goals for our experiment infrastructure are therefore: more, better, faster.

More
 We need scalability to run more experiments simultaneously. However, we also need flexibility: different experiments need different configurations and different sizes to be able to measure statistically significant effects. Some experiments only need to change a subset of traffic, say Japanese traffic only, and need to be sized appropriately. Other experiments may change all traffic and produce a large change in metrics, and so can be run on less traffic.

Better
 Invalid experiments should not be allowed to run on live traffic. Valid but bad experiments (e.g., buggy or unintentionally producing really poor results) should be caught quickly and disabled. Standardized metrics should be easily available for all experiments so that experiment comparisons are fair: two experimenters should use the same filters to remove robot traffic when calculating a metric such as CTR.

Faster
 It should be easy and quick to set up an experiment; easy enough that a non-engineer can do so without writing any code. Metrics should be available quickly so that experiments can be evaluated quickly. Simple iterations should be quick to do. Ideally, the system should not just support experiments, but also controlled ramp-ups, i.e., gradually ramping up a change to all traffic in a systematic and well-understood way. 
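The "better" goal hinges on evaluating experiments with standardized statistics. One common building block for comparing two variants' conversion rates is a two-proportion z-test; here is a minimal sketch with made-up counts (Google's actual infrastructure is of course far more elaborate):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates between
    variants A and B, using the pooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Made-up counts: 120/1000 conversions on variant A, 150/1000 on variant B
z, p_value = two_proportion_z(120, 1000, 150, 1000)
```

The "sized appropriately" point in the excerpt is exactly about the n's here: small or subsetted traffic means a larger standard error and a harder-to-detect effect.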
Second Best: Observational Studies
Simpson’s Paradox
Figure 11-3. Probability of having a heart attack (also known as MI, or myocardial infarction) as a function of the size of the dose of a bad drug
Figure 11-4. Probability of having a heart attack as a function of the size of the dose of a bad drug and whether or not the patient also took aspirin
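The reversal shown in the two figures above can be reproduced with made-up counts (mine, not the book's): the drug looks harmful in aggregate, but helpful within each aspirin stratum, because aspirin takers have a higher baseline MI risk and are overrepresented at high doses.

```python
# (MI count, patients) for each dose/aspirin cell -- made-up numbers
counts = {
    ("low",  "aspirin"):    (30, 100),
    ("low",  "no aspirin"): (10, 900),
    ("high", "aspirin"):    (180, 900),
    ("high", "no aspirin"): (1, 100),
}

def rate(cells):
    mi = sum(m for m, _ in cells)
    n = sum(p for _, p in cells)
    return mi / n

agg_low = rate([counts[("low", "aspirin")], counts[("low", "no aspirin")]])
agg_high = rate([counts[("high", "aspirin")], counts[("high", "no aspirin")]])
# Aggregated, the high dose looks worse (agg_high > agg_low), yet within
# each aspirin stratum the high-dose MI rate is the lower one.
```

The trick is in the composition: 90% of high-dose patients take aspirin versus 10% of low-dose patients, so the aggregate comparison is mostly comparing aspirin takers to non-takers.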
The Rubin Causal Model
Visualizing Causality
Definition: The Causal Effect
The causal effect is sometimes defined as the ratio of these two numbers instead of the difference. 
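The difference and ratio versions answer different questions (absolute versus relative change). A tiny hypothetical illustration:

```python
# Hypothetical outcome probabilities under treatment vs. under control
p_treated, p_control = 0.30, 0.20

effect_as_difference = p_treated - p_control   # additive scale: +10 points
effect_as_ratio = p_treated / p_control        # relative scale: 1.5x
```

A drug that moves risk from 0.002 to 0.003 has the same 1.5x ratio but a far smaller difference, which is why the choice of scale matters when reporting effects.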
Three Pieces of Advice
12. Epidemiology
Madigan’s Background
Thought Experiment
Modern Academic Statistics
Medical Literature and Observational Studies
Stratification Does Not Solve the Confounder Problem
Table 12-1. Aggregated: Both men and women

         Treatment: Drugged   Treatment: Counterfactual   Control: Counterfactual   Control: No Drug
Y=1      30                   20                          30                        20
Y=0      70                   80                          70                        80
P(Y=1)   0.3                  0.2                         0.3                       0.2
Table 12-2. Stratified: Men

         Treatment: Drugged   Treatment: Counterfactual   Control: Counterfactual   Control: No Drug
Y=1      15                   2                           5                         5
Y=0      35                   8                           65                        15
P(Y=1)   0.3                  0.2                         0.07                      0.25
Table 12-3. Stratified: Women

         Treatment: Drugged   Treatment: Counterfactual   Control: Counterfactual   Control: No Drug
Y=1      15                   18                          25                        15
Y=0      35                   72                          5                         65
P(Y=1)   0.3                  0.2                         0.83                      0.1875
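To see the section's point numerically, this sketch (mine, not the book's) recomputes P(Y=1) using only the observable columns of the tables above, Drugged and No Drug; the counterfactual columns, which reveal the true treatment effect of 0.3 − 0.2 = 0.1, are never available in practice:

```python
# Observable cells only, as (Y=1 count, Y=0 count),
# transcribed from the tables above.
observed = {
    "aggregated": {"drugged": (30, 70), "no_drug": (20, 80)},
    "men":        {"drugged": (15, 35), "no_drug": (5, 15)},
    "women":      {"drugged": (15, 35), "no_drug": (15, 65)},
}

def p_y1(y1, y0):
    return y1 / (y1 + y0)

effects = {
    group: p_y1(*cells["drugged"]) - p_y1(*cells["no_drug"])
    for group, cells in observed.items()
}
# The aggregated contrast (0.1) happens to match the counterfactual
# effect here, but the stratified contrasts (0.05 for men, 0.1125 for
# women) do not: stratifying has not removed the confounding, because
# within each stratum the controls are not exchangeable with the treated.
```

The giveaway is in the counterfactual columns: for men, the controls' no-drug rate (0.25) differs wildly from what the treated men's rate would have been without the drug (0.2), so the within-stratum comparison is still biased.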