Data Science for Tackling the Challenges of Big Data

Table of contents
  1. Story
  2. Slides
    1. Slide 1 Data Science for Tackling the Challenges of Big Data
    2. Slide 2 Overview
    3. Slide 3 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Assessment
    4. Slide 4 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Progress
    5. Slide 5 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Big Data Storage
    6. Slide 6 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Modern Databases
    7. Slide 7 Courseware: Big Data Storage
    8. Slide 8 Selected Slides: Professor Sam Madden
    9. Slide 9 Selected Slides: Professor David Karger
    10. Slide 10 Selected Slides: Professor Daniela Rus
    11. Slide 11 Google Search: Singapore Taxi Data
    12. Slide 12 Think Business: Why can’t I find a taxi when I really need one?
    13. Slide 13 Labor Supply Decisions of Singaporean Cab Drivers: Table 1: Summary Statistics by Days
    14. Slide 14 MIT Big Data Knowledge Base: Table 1 Spreadsheet
    15. Slide 15 Singapore Land Transport Authority: Traffic Info Service Providers
    16. Slide 16 Singapore Land Transport Authority: MyTransport.sg
    17. Slide 17 Singapore Land Transport Authority: All Datasets Spreadsheet
    18. Slide 18 MIT Big Data Knowledge Base: MindTouch
    19. Slide 19 MIT Big Data: Knowledge Base Spreadsheet
    20. Slide 20 MIT Big Data: Course Participant Spreadsheet
    21. Slide 21 MIT Big Data: Spotfire Cover Page
    22. Slide 22 MIT Big Data: Student Enrollment
    23. Slide 23 MIT Big Data: Singaporean Cab Drivers
    24. Slide 24 New York City Open Data: Socrata 
    25. Slide 25 New York City Open Data: Search Results
    26. Slide 26 New York City Open Data: Data Table
    27. Slide 27 Visualizing NYC’s Open Data: Socrata Beta
    28. Slide 28 MIT Big Data Assessment: Questions and Answers
  3. Spotfire Dashboard
  4. Why can’t I find a taxi when I really need one?
  5. Labor Supply Decisions of Singaporean Cab Drivers
    1. Abstract
    2. 1. Introduction
    3. 2. Cab Drivers and Operators in Singapore
    4. 3. Data Description
      1. a. Real time trip data from global positioning system (GPS) enabled cabs in Singapore
        1. Table 1: Comparison of the Cabdriver Data Used in Related Studies
      2. b. Summary Statistics
        1. Table 2: Summary Statistics
        2. Inter-day/Intra-day variations in activities
        3. Figure 1: Distributions of Cabs per minutes by Status for the week of August 15-21, 2010
      3. c. Inter-day Demand and Supply Elasticity for Cab
        1. Figure 2: Distributions of Day-by-Day Supply and Demand of Cab Services and Utilization Rate for the Week of August 15-21, 2010
      4. d. Shift-level Work hours and Wage Measures
        1. Figure 3: Distributions of Cabdrivers’ activities by shift
    5. 4. Empirical Tests of Cabdriver Labor Supply Elasticity
      1. Figure 4: Distribution of Shift Length for the Month of August 2010
      2. Figure 5: Distribution of Wage Rate of Cabdrivers
      3. a. Pairwise Relationships between shift length and wage rate
        1. Figure 6: Scatter Plot and Local Polynomial Regression Fitting of the Relationship between Shift Length and Wage Rate for all Shifts
        2. Figure 7: Differences between Single-Shift and Two-Shift Cabdrivers
        3. Reduced form models of cabdriver labor supply
        4. Table 3: OLS Models of Worked Hours on Wage Rate for Cabdrivers
        5. Table 4: Labor Supply Models with Cabdriver Fixed Effects
        6. Figure 8: Relationship between Shift Length and Wage Rate based on a Calibrated Polynomial Function
      4. b. Supply shocks and the labor supply elasticity of cabdrivers
        1. Table 5: Shocks to Taxi Driver Labor Supply
      5. c. Heterogeneity in Cabdrivers’ Labor Supply Elasticities
        1. Table 6: Heterogeneity of Cabdrivers
      6. d. Reference-Dependence Preferences in Labor Supply
        1. Table 7: Reference Dependence Hypothesis with Driver Average Targets
        2. Table 8: Reference Dependence Hypothesis with Driver/Day-of-the-Week Average Targets
      7. e. Constancy in Labor Supply of Cabdrivers
        1. Table 9: Income Targets of Cabdrivers
      8. f. Testing the Importance of Daily Targets using Cabdrivers’ Response to Earnings Shocks in the Previous Day
        1. Figure 9: Distribution of Income Ratio of Cabdrivers
      9. g. Effects of Different Reference Points on Labor Supply
        1. Table 10: Reference Dependence and Labor Supply Elasticity
    6. 5. Conclusion
    7. Footnotes
      1. 1
      2. 2
      3. 3
      4. 4
      5. 5
      6. 6
      7. 7
      8. 8
    8. References
    9. Appendix: Fares and Rates of one of the largest Cab Operators in Singapore
  6. Course Emails
    1. December 23, 2014
    2. December 16, 2014
    3. December 15, 2014
    4. December 9, 2014
    5. December 2, 2014
    6. November 25, 2014
    7. November 24, 2014
    8. November 18, 2014
    9. November 11, 2014
    10. November 4, 2014
    11. October 29, 2014 
    12. October 28, 2014
    13. October 3, 2014
  7. The CancerMath.net Web Calculators
    1. Research Summary
    2. Databases
    3. Medical Usage and Cost
    4. A Mathematical Approach To Cancer Lethality
    5. CancerMath.net Calculators
    6. Notes
    7. References
    8. Laboratory of Quantitative Medicine Technical Reports
  8. Courseware
    1. Course Survey
    2. Welcome to the Course
      1. How To Participate in the Course
        1. How to Use the Discussion Forum
          1. Meet Your Course TAs!
          2. Brian Bell
          3. Manasi Vartak
        2. Discussion and Community Guidelines
        3. Certificates and CEUs
      2. Technical Assistance and Contacts
      3. Introduction: Big Data Challenges
      4. Assessment 1
    3. Introduction and Use Cases
      1. 1.0 Introduction: Use Cases
        1. 1.1 Case Study: Transportation 
        2. 1.2 Case Study: Visualizing Twitter
        3. Assessment 2
        4. Discussion 1
    4. Big Data Collection
      1. 2.0 Introduction: Big Data Collection
        1. Overview
        2. Goals
        3. Objectives: Students should be able to...
      2. 2.1 Data Cleaning and Integration
      3. 2.2 Hosted Data Platforms and the Cloud
      4. Assessment 3
      5. Discussion 2
    5. Big Data Storage
      1. 3.0 Introduction to Big Data Storage
        1. Overview
        2. Goals
        3. Objectives: Students should be able to...
      2. 3.1 Modern Databases
      3. 3.2 Distributed Computing Platforms
      4. 3.3 NoSQL NewSQL
      5. Assessment 4
      6. Discussion 3
    6. Big Data Systems
      1. 4.0 Introduction: Big Data Systems
        1. Overview
        2. Goals (4.1 Security)
        3. Objectives: Students should be able to...
        4. Goals (4.2 Multicore Scalability)
        5. Objectives: Students should be able to...
        6. Goals (4.3 Interfaces and Visualization)
        7. Objectives: Students should be able to...
      2. 4.1 Security
      3. 4.2 Multicore Scalability
      4. 4.3 Visualization and User Interfaces
      5. Assessment 5
      6. Discussion 4
    7. Big Data Analytics
      1. 5.0 Introduction: Big Data Analytics
        1. Overview
        2. Goals
        3. Objectives: Students should be able to...
      2. 5.1 Fast Algorithms I
      3. 5.2 Fast Algorithms II
      4. 5.3 Data Compression
      5. 5.4 Machine Learning Tools
      6. 5.5 Case Study: Information Summarization
      7. 5.6 Applications: Medicine
      8. 5.7 Applications: Finance
      9. Assessment 6
      10. Discussion 5
  9. Slides
    1. 1.0_introduction-to-bigdata
      1. What Is This Course Going to Cover?
    2. 1.1_transportation
      1. Case Study: Transportation in Singapore
    3. 1.2_visualizing-twitter
      1. Spatial Correlations
      2. Other Techniques We'll Cover
    4. 2.1_data-cleaning-integration
      1. Data Curation
      2. Startups in This Space
      3. Data Tamer Future
      4. The Way Forward 1
      5. The Way Forward 2
    5. 2.2_hosted-data-platforms
      1. Examples
      2. Summary
      3. Examples
      4. Conclusions
    6. 3.1_modern-databases
      1. History Lesson
      2. My Thesis
      3. Rest of This Module
      4. Data Warehouse Marketplace
      5. The Participants
      6. Roughly Speaking
      7. OLTP Data Bases -- 3 Big Decisions
      8. Summary
      9. Everything Else
      10. Array DBMSs--e.g. SciDB
      11. Array DBMSs--Summary
      12. Graph DBMSs
      13. What Is Hadoop?
      14. What Is Happening Now?
      15. Most Likely Future
      16. Thoughts While Shaving 1
      17. Thoughts While Shaving 2
      18. Thoughts While Shaving 3
      19. The Curse--May You Live in Interesting Times 1
      20. The Curse--May You Live in Interesting Times 2
    7. 3.2_distributed-computing-platforms
      1. Motivation
      2. Software in This Space
      3. Applications
      4. This Lecture
    8. 3.3_NoSQL-NewSQL
      1. What Does a Traditional Database Provide?
      2. A Thousand Flowers Bloom
      3. Rest of the Module
    9. 4.1_security
      1. Security is a Negative Goal
      2. Multiple Encryption Schemes
      3. Conclusion
    10. 4.2_multicore-scalability
      1. Goal: Scalability
      2. Outline
      3. Conclusion: Multicore Scalability
    11. 4.3_visualization-user-interfaces
      1. My Research Group
      2. Why User Interfaces? 1
      3. Why User Interfaces? 2
      4. The Value of Interfaces
      5. User Interfaces for Data
      6. Spectrum of Interface Capabilities
      7. Overview
      8. Small Multiples
      9. What Is Visualization?
      10. Why Visualizations?
      11. John Tukey 1
      12. Challenger Disaster
      13. Morton Thiokol
      14. Make a Decision: Challenger 1
      15. Make a Decision: Challenger 2
      16. Information Visualization
      17. How Not To Lie
      18. Tufte: Graphical Integrity 1
      19. Tufte: Graphical Integrity 2
      20. Summary
      21. Interactivity
      22. Why?
      23. Plan
      24. John Tukey 2
      25. Exploratory Versus Confirmatory
      26. John Tukey 3
      27. Summary
      28. Goal
      29. Spectrum of Interface Capabilities
      30. Interaction Strategy
    12. 5.1_fast-algorithms-1
      1. What To Do About REALLY Big Data?
      2. No Time
      3. Really Big Data
      4. What Can We Hope To Do Without Viewing Most of the Data?
      5. What Types of Approximation?
      6. Conclusion
    13. 5.2_fast-algorithms-2
      1. Streaming and Sampling
      2. Streaming Versus Sampling
      3. Rest of This Lecture
      4. Computing Fourier Transform
      5. Idea: Leverage Sparsity
      6. Benefits of Sparse Fourier Transform
    14. 5.3_data-compression
      1. Learning Big Data Patterns From Tiny Core-Sets
      2. Data Challenge
      3. Challenges
      4. Outline
      5. Coreset and Data Compression Techniques
      6. References for Compression
      7. Example: Coresets for Life Logging
      8. System Overview
      9. Coresets for Latent Semantic Analysis
      10. Wrapping Up
    15. 5.4_machine-learning-tools
      1. Our Research Group
      2. Machine Learning 1
      3. Machine Learning 2
      4. Structured Prediction
    16. 5.5_information-summarization
      1. My Research
      2. Need in Information Extraction
      3. Information Extraction for Big Data
      4. Data Set 1
      5. Data Set 2
      6. Conclusion
    17. 5.6_applications-medicine
      1. My Research Group
      2. Medical Analytics
      3. The Good News
      4. Problems Posed by Increase in Available Data 1
      5. Problems Posed by Increase in Available Data 2
      6. Using ML to Make Useful Predictions
      7. Accurate Predictions Can Help
      8. The Big Data Challenge
      9. What Is Machine Learning? 1
      10. What Is Machine Learning? 2
      11. How Are Things Learned?
      12. Machine Learning Methods
      13. More Not Always Better
      14. Approach
      15. The Data
      16. Variables Considered
      17. Two Step Approach
      18. Results
      19. Wrapping Up
    18. 5.7_applications-finance
      1. Consumer Credit Risk Management
      2. MIT Laboratory for Financial Engineering
      3. Anonymized Data From Large U.S. Commercial Bank
      4. Objectives
      5. Machine Learning Objectives
      6. Empirical Results
      7. Macro Forecasts of Credit Losses
      8. Conclusions
  10. Course Updates & News
    1. November 4, 2014
      1. Review the course Syllabus, Wiki, and Course Handouts
      2. Join the course networking groups
  11. Course Syllabus for Tackling the Challenges of Big Data
    1. Time Requirement/Commitment
    2. Who Should Participate?
    3. Learning Objectives
    4. Course Staff
    5. Course Requirements
    6. Course Schedule
      1. Week 1 - MODULE ONE: INTRODUCTION AND USE CASES
        1. Introduction: Big Data Challenges (Sam Madden)
        2. Case Study: Transportation (Daniela Rus)
        3. Case Study: Visualizing Twitter (Sam Madden)
        4. Recommended weekly activities
      2. Week 2 - MODULE TWO: BIG DATA COLLECTION
        1. Data Cleaning and Integration (Michael Stonebraker)
        2. Hosted Data Platforms and the Cloud (Matei Zaharia)
        3. Recommended weekly activities
      3. Week 3 - MODULE THREE: BIG DATA STORAGE
        1. Modern Databases (Michael Stonebraker)
        2. Distributed Computing Platforms (Matei Zaharia)
        3. NoSQL, NewSQL (Sam Madden)
        4. Recommended weekly activities
      4. Week 4 - MODULE FOUR: BIG DATA SYSTEMS
        1. Security (Nickolai Zeldovich)
        2. Multicore Scalability (Nickolai Zeldovich)
        3. User Interfaces for Data (David Karger)
        4. Recommended weekly activities
      5. Week 5 - MODULE FIVE, PART I: BIG DATA ANALYTICS
        1. Fast Algorithms I (Ronitt Rubinfeld)
        2. Fast Algorithms II (Piotr Indyk)
        3. Data Compression (Daniela Rus)
        4. Recommended weekly activities
      6. Week 6 - MODULE FIVE, PART II: BIG DATA ANALYTICS
        1. Machine Learning Tools (Tommi Jaakkola)
        2. Case Study: Information Summarization (Regina Barzilay)
        3. Applications: Medicine (John Guttag)
        4. Applications: Finance (Andrew Lo)
        5. Recommended weekly activities
        6. Completing the course
        7. Post-course
  12. Wiki
    1. Networking
    2. Readings and Resources
    3. Resources by Module
      1. 1.0 Introduction: Big Data Challenges - PDF of Presentation slides (Madden)
        1. STUDENT-ADDED RESOURCES
      2. 1.1 Case Study: Transportation - PDF of Presentation slides (Rus)
        1. STUDENT-ADDED RESOURCES
      3. 1.2 Case Study: Visualizing Twitter - PDF of Presentation slides (Madden)
        1. STUDENT-ADDED RESOURCES
      4. 2.0 Introduction: Big Data Collection
      5. 2.1 Data Cleaning and Integration - PDF of Presentation slides (Stonebraker)
        1. STUDENT-ADDED RESOURCES
      6. 2.2 Hosted Data Platforms and the Cloud - PDF of Presentation slides (Zaharia)
        1. STUDENT-ADDED RESOURCES
      7. 3.0 Introduction: Big Data Storage
      8. 3.1 Modern Databases - PDF of Presentation slides (Stonebraker)
        1. STUDENT-ADDED RESOURCES
      9. 3.2 Distributed Computing Platforms - PDF of Presentation slides (Zaharia)
        1. STUDENT-ADDED RESOURCES
      10. 3.3 NoSQL, NewSQL - PDF of Presentation slides (Madden)
        1. STUDENT-ADDED RESOURCES
      11. 4.0 Introduction: Big Data Systems
      12. 4.1 Security - PDF of Presentation slides (Zeldovich)
        1. STUDENT-ADDED RESOURCES
      13. 4.2 Multicore Scalability - PDF of Presentation slides (Zeldovich)
        1. STUDENT-ADDED RESOURCES
      14. 4.3 Visualization and User Interfaces - PDF of Presentation slides (Karger)
        1. STUDENT-ADDED RESOURCES
      15. 5.0 Introduction: Big Data Analytics
      16. 5.1 Fast Algorithms I - PDF of Presentation slides (Rubinfeld)
        1. STUDENT-ADDED RESOURCES
      17. 5.2 Fast Algorithms II - PDF of Presentation slides (Indyk)
        1. STUDENT-ADDED RESOURCES
      18. 5.3 Data Compression - PDF of Presentation slides (Rus)
        1. STUDENT-ADDED RESOURCES
      19. 5.4 Machine Learning Tools - PDF of Presentation slides (Jaakkola)
        1. STUDENT-ADDED RESOURCES
      20. 5.5 Case Study: Information Summarization - PDF of Presentation slides (Barzilay)
        1. STUDENT-ADDED RESOURCES
      21. 5.6 Applications: Medicine - PDF of Presentation slides (Guttag)
        1. STUDENT-ADDED RESOURCES
      22. 5.7 Applications: Finance - PDF of Presentation slides (Lo)
        1. STUDENT-ADDED RESOURCES
  13. NEXT

Story

Data Science for Tackling the Challenges of Big Data

This six-week MIT online course consists of the following:

I mined this MIT online course for data sets and ideas. In the process I found a subset of the slides that contained data sets and ideas and were interesting and useful visualizations in themselves. Professor Karger's lecture slides on Visualization and User Interfaces were all about my heroes: Tukey, Tufte, Shneiderman, and Spotfire. In fact, it was everything leading up to Spotfire, but not Spotfire itself!

1.0_introduction-to-bigdata.png

1.2_visualizing-twitter2.png

To aid in completion of the Course, I downloaded and extracted all the slides (18 - 67 MB), reviewed them and selected what I thought were the key slides, downloaded the Class Directory Spreadsheet, downloaded 4 PDF files and added them to the knowledge base (3 about the course and one on The CancerMath.net Web Calculators), downloaded and read the 18 video scripts, and completed the six Assessments with 100%.

To preserve my work and present it as a tutorial to the Federal Big Data Working Group Meetup, I built a MindTouch knowledge base, an Excel spreadsheet index, and Spotfire interactive visualizations.

I was especially interested in the following since both Professors Stonebraker and Madden presented to our Federal Big Data Working Group Meetup:

This module begins with an overview of a number of these technologies by renowned database professor Mike Stonebraker. In his unique and ardent fashion, Mike expresses his skepticism about many new technologies, particularly Hadoop/MapReduce and NoSQL, and voices support for many new relational technologies, including column stores and main memory databases.

After that, Professors Matei Zaharia and Samuel Madden provide a more nuanced view of the tradeoffs between the various approaches, discussing Hadoop and its derivatives, as well as NoSQL and its tradeoffs, in more detail.

I also found some of the Assessment Questions and Answers very interesting and wanted to preserve them as follows:

  • Welcome to the Course
    • 2) In the MGH Cancer Patient Database example given by Professor Madden, the primary explanation for dramatically higher costs in the most expensive patients is:
      • The most expensive patients have the most virulent tumors
      • The most expensive patients have the best insurance
      • The most expensive patients have particular doctors - correct
      • The most expensive patients are treated for the longest time
  • Introduction and Use Cases
    • 1) How does data enable more effective transportation?
      • Improved level of service by visualization, analysis, planning support
      • Personalization of traffic services
      • Optimization for traffic services
      • All of the above - correct
  • Big Data Collection
    • 2) Data science requires:
      • Knowledge of statistics
      • Knowledge of data management
      • Knowledge of curation
      • All of the above - correct
  • Big Data Storage
    • 1) Column stores will win in the data warehouse marketplace over row stores because:
      • They have less disk reads
      • They compress the data better
      • Their execution engines are more efficient
      • All of the above - correct
    • 3) According to Professor Mike Stonebraker, the Hadoop layer will fail because:
      • It is written in Java
      • Most problems are not embarrassingly parallel - correct
      • It is based on HDFS
      • There are not appropriate support tools
  • Big Data Systems
    • 13) For which of the following tasks is interactive visualization most useful? (choose all that apply)
      • Developing a hypothesis about data - correct
      • Formally confirming a hypothesis
      • Communicating a conclusion about data - correct
      • All of the above
    • 15) How does linking and brushing help users understand data? (choose all that apply)
      • They can follow links like joins in a database
      • They can see connections between data in different visualizations - correct
      • They can change the colors of data points for better clarity
      • They can move data between different visualizations
    • 17) Which of the following is often used for faceted browsing? (choose all that apply)
      • A tag cloud - correct
      • A slider for a numeric range - correct
      • A text search box
      • A map
  • Big Data Analytics
    • 12) Why are machine learning algorithms so popular nowadays? Select the most appropriate explanation.
      • Automation is critical to software design
      • Large data repositories act as a resource supplementing traditional engineering solutions
      • Machine learning methods are data driven approaches that automate the process of creating models for prediction - correct
    • 13) Big Data means that there's no shortage of useful data.
      • True
      • False - correct

I also mined the course content for big data sets that I could access (taxis in Singapore, medical data, financial data, etc.), but could not find any. Doing a Google search, I found the following:

  • Why can’t I find a taxi (in Singapore) when I really need one? I use a unique administrative dataset of over 520 million data points from the largest taxi company in Singapore for a period of one month 
  • Data Mall: LTA publishes a variety of transport-related data for public use, available for download for the creation, development, and testing of innovative applications by third parties. The featured apps have incorporated the various datasets from LTA. This data sharing initiative is a Distinguished Winner in the eGov Excellence Awards 2013 in the Data Sharing category.

The Singapore Land Transport Authority (LTA) is an excellent data source for both of the references above, so let's do a Data Science Data Publication for them. Since data science requires knowledge of statistics, data management, and curation (ingest, validate, transform, correct, consolidate, and visualize information to be integrated), we especially need to do data curation here.
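
As a concrete illustration of that curation step, here is a minimal sketch in Python (pandas), assuming a hypothetical export of the Table 1 summary statistics. The file name and column names (day, trips, available_cabs, hired_cabs) are illustrative assumptions only, not the actual MIT or LTA spreadsheets.

# A minimal curation sketch: ingest, validate, correct, transform, consolidate.
# File and column names are assumptions for illustration only.
import pandas as pd

REQUIRED = ["day", "trips", "available_cabs", "hired_cabs"]

def curate(path):
    # Ingest: read the raw export.
    df = pd.read_csv(path)

    # Validate: check the expected columns and drop duplicate days.
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError("missing columns: %s" % missing)
    df = df.drop_duplicates(subset=["day"])

    # Correct: coerce numeric fields, dropping rows that cannot be parsed.
    for col in REQUIRED[1:]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.dropna(subset=REQUIRED)

    # Transform: derive a utilization rate comparable to Figure 2 in the paper.
    df["utilization_rate"] = df["hired_cabs"] / df["available_cabs"]

    # Consolidate: write a clean table ready to load into Spotfire.
    df.to_csv("table1_curated.csv", index=False)
    return df

if __name__ == "__main__":
    print(curate("table1_summary_statistics.csv").head())

The cleaned table1_curated.csv is the kind of file I would load into Spotfire rather than the raw export.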

The process and results of building the MindTouch knowledge base, Excel spreadsheet index, and Spotfire interactive visualizations are captured in the Slides below.

The MIT online course is very useful, but it would benefit from actual case studies with public data that students can use to apply the data science principles they have learned, as I have done here.

The Assessment Question: Big Data means that there's no shortage of useful data.

  • True
  • False - correct

was illustrated here: a summarization of the big data (the Singapore taxi data), built with knowledge of statistics, data management, and curation, proved arguably more useful for data science than the unique administrative dataset of over 520 million data points from the largest taxi company in Singapore for a period of one month.
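
To make that point concrete, here is a minimal sketch in Python (pandas) of how raw trip-level records, if one had access to them, could be reduced to day-level summary statistics of the kind reported in the paper's Table 1. The input file and its columns (cab_id, pickup_time, fare) are hypothetical; the 520-million-point administrative dataset itself is not public.

# Hypothetical summarization of raw trip records into day-level statistics.
# Column names are assumptions; the real administrative data is not public.
import pandas as pd

def summarize(trips_path):
    trips = pd.read_csv(trips_path, parse_dates=["pickup_time"])
    trips["day"] = trips["pickup_time"].dt.date

    # Collapse millions of rows into one row per day.
    summary = trips.groupby("day").agg(
        trips=("cab_id", "size"),           # total trips per day
        active_cabs=("cab_id", "nunique"),  # distinct cabs observed
        mean_fare=("fare", "mean"),         # average fare per trip
        total_fare=("fare", "sum"),         # daily gross revenue
    )
    return summary.reset_index()

if __name__ == "__main__":
    print(summarize("singapore_trips_august_2010.csv").head())

A compact table like this is far easier to analyze, curate, and visualize than hundreds of millions of raw GPS records.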

Slides

Slide 2 Overview

BrandNiemann11142014Slide2.PNG

Slide 3 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Assessment

Web Site (private)

BrandNiemann11142014Slide3.PNG

Slide 4 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Progress

https://mitprofessionalx.edx.org/cou...T2014/progress

BrandNiemann11142014Slide4.PNG

Slide 5 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Big Data Storage

Web Site (private)

BrandNiemann11142014Slide5.PNG

Slide 6 MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Modern Databases

Web Site (private) and Script (Public)

BrandNiemann11142014Slide6.PNG

Slide 7 Courseware: Big Data Storage

3.0 Introduction to Big Data Storage and Discussion 3

BrandNiemann11142014Slide7.PNG

Slide 8 Selected Slides: Professor Sam Madden

What Is This Course Going to Cover? and Other Techniques We'll Cover

BrandNiemann11142014Slide8.PNG

Slide 9 Selected Slides: Professor David Karger

Overview and Interaction Strategy

BrandNiemann11142014Slide9.PNG

Slide 11 Google Search: Singapore Taxi Data

BrandNiemann11142014Slide11.PNG

Slide 12 Think Business: Why can’t I find a taxi when I really need one?

http://thinkbusiness.nus.edu/smart-f...eally-need-one?

BrandNiemann11142014Slide12.PNG

Slide 13 Labor Supply Decisions of Singaporean Cab Drivers: Table 1: Summary Statistics by Days

http://www.ushakrisna.com/Cabdrivers.pdf

BrandNiemann11142014Slide13.PNG

Slide 14 MIT Big Data Knowledge Base: Table 1 Spreadsheet

Spreadsheet

BrandNiemann11142014Slide14.PNG

Slide 15 Singapore Land Transport Authority: Traffic Info Service Providers

http://www.lta.gov.sg/content/ltaweb...providers.html

BrandNiemann11142014Slide15.PNG

Slide 16 Singapore Land Transport Authority: MyTransport.sg

http://www.mytransport.sg/content/my...l#All_Datasets

BrandNiemann11142014Slide16.PNG

Slide 17 Singapore Land Transport Authority: All Datasets Spreadsheet

Spreadsheet

BrandNiemann11142014Slide17.PNG

Slide 18 MIT Big Data Knowledge Base: MindTouch

Data Science for Tackling the Challenges of Big Data

BrandNiemann11142014Slide18.PNG

Slide 19 MIT Big Data: Knowledge Base Spreadsheet

Spreadsheet

BrandNiemann11142014Slide19.PNG

Slide 20 MIT Big Data: Course Participant Spreadsheet

Spreadsheet

BrandNiemann11142014Slide20.PNG

Slide 21 MIT Big Data: Spotfire Cover Page

Web Player

BrandNiemann11142014Slide21.PNG

Slide 22 MIT Big Data: Student Enrollment

Web Player

BrandNiemann11142014Slide22.PNG

Slide 23 MIT Big Data: Singaporean Cab Drivers

Web Player

BrandNiemann11142014Slide23.PNG

Slide 24 New York City Open Data: Socrata 

https://nycopendata.socrata.com/

BrandNiemann11142014Slide24.PNG

Slide 25 New York City Open Data: Search Results

Web Site

BrandNiemann11142014Slide25.PNG

Slide 26 New York City Open Data: Data Table

Web Site and Medallion_Drivers_-_Active.xlsx

BrandNiemann11142014Slide26.PNG

Slide 27 Visualizing NYC’s Open Data: Socrata Beta

https://nycopendata.socrata.com/viz

BrandNiemann11142014Slide27.PNG

Slide 28 MIT Big Data Assessment: Questions and Answers

Story

BrandNiemann11142014Slide28.PNG

Spotfire Dashboard

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Why can’t I find a taxi when I really need one?

Source: http://thinkbusiness.nus.edu/smart-f...eally-need-one?

By: Sumit Agarwal Wednesday, 15 May 2013

 
taxi280

Ever since I moved to Singapore my biggest gripe has been the short supply of taxis in Singapore.

Normally, you can get a taxi within a few minutes but when you really need one – heavy rain or morning and evening rush hour and late nights -- you cannot find a taxi. Sometimes you can end up waiting for over an hour. It just happened last week.

So, why?

Economic theory suggests that as wages rise, individuals will supply more labour (in economics jargon: a positive price elasticity of labour supply). It seems very simple: if I can earn more, I will work harder. During rush hour and rainy periods, taxi drivers know that they will find many more customers, so why can't I find a taxi?

There could be many reasons:

1. There is a supply and demand mismatch during peak hours. The supply cannot keep up with the demand during those hours.

2. Taxi drivers withdraw labour supply during heavy rains because they are risk averse and do not want to get into accidents. This is especially true because they have to pay the first $2000 towards accident claims.

3. The taxi drivers do not want to drive in the evening because they are tired and value leisure more than labour.

4. Maybe taxi drivers have a reference-dependent utility function. In other words, they only want to supply labour until they earn a fixed amount, and then they quit. For instance, if they want to earn $100 a day and can make that within four or eight hours, they will only work until they make that money.

In a recent paper with my colleagues Diao, Pan, and Tien Foo, I look at this question. I use a unique administrative dataset of over 520 million data points from the largest taxi company in Singapore for a period of one month (August 2010).

I have the wage and labour supply information for all the taxi drivers at this company.

In terms of background: a typical cab driver in Singapore works a 12-hour shift. All the drivers are Singaporeans; they rent their cabs from cab operators at a fixed rental rate of $120 and incur a typical fuel cost of $50 a day. They typically earn $25 an hour, so they can break even in 7 hours. They can sub-lease the cab to a second driver and split the shift. Additionally, a driver's wages vary according to pricing regulations that apply depending on where and when the journey takes place, and whether the passenger made an advance booking.

Rain: Singapore is a tropical country; in 2010 it had a total of 178 rainy days and total annual precipitation of 2,075 millimetres (mm), an average of 11.66 mm per rainy day. In my sample month of August 2010, there were 7 days with precipitation (summed over four major stations) of less than 2.54 mm (0.1 inch) and 8 days with precipitation of greater than 20.32 mm (0.8 inch).

Findings: I find that in a given 24 hour period, a typical cab driver has a passenger on board for over 6 hours, they are on break for close to 5 hours, they are free and looking for a passenger for 4 hours, and they are offline (probably sleeping) for another 8 hours.

Specifically, Figure 1 shows the worked-hours characteristics of active cabdrivers in the month of August 2010. The figure shows a bell-shaped distribution: an average cabdriver works 572.5 minutes (9.54 hours) in one shift. The lengths of 93.7% of the shifts (597,074 shifts) fall within the range of 3 to 16 hours.

Figure 1: Average Shift Length (in minutes) for the 4000 taxi drivers

taxifig1
  
Next, Figure 2 shows the wage rate for cabdrivers. On average, a driver spends less than 50% of the time with a passenger on board during a given shift. Some cabdrivers are very efficient and have a passenger on board for over 70% of the shift time, and a small fraction have a passenger on board for only 20% of the shift time.

Figure 2: Average Wage Rate for the 4000 taxi drivers

taxifig2
  
Next, I find that on average there is a negative relationship between wages and labor supply. The wage elasticity of labor supply has a kinked shape: for the most part the elasticity is zero, but beyond a point it turns negative, with a point estimate of -0.05. In other words, cab drivers do stop driving once they perceive that they have earned a certain amount.

Finally, exogenous variation in wages has a positive impact on labor supply: the results show that cab drivers supply more labour on rainy days and during peak-load pricing hours and at peak-load pricing locations.

So, now I understand as to why I do not find a taxi in Singapore. Cabdrivers in Singapore are not following the neoclassical economic model that dictates a positive price elasticity of labor supply.
 

Labor Supply Decisions of Singaporean Cab Drivers

Source: http://papers.ssrn.com/sol3/papers.c...act_id=2338476 (PDF)

Sumit Agarwal 1, Mi Diao 2, #, Jessica Pan 3, and Tien Foo Sing 4
National University of Singapore
September 2014

* We would like to thank SMART for providing the data used for the analysis.
# Singapore – MIT Alliance for Research and Technology (SMART)
1 Email: ushakri@yahoo.com
2 Email: rstdm@nus.edu.sg
3 Email: jesspan13@gmail.com
4 Email: rststf@nus.edu.sg

Abstract

We use a unique administrative dataset of over 10,000 taxi drivers in Singapore to study the labor supply decisions of these drivers. Our results indicate that both intra-day and across days, cab drivers deviate from the neoclassical economic model and that their labor supply is affected by a reference level of income. We also find some evidence that cab drivers who earn more at the beginning of the month work longer shifts at the end of the month. Nevertheless, conditional on previous day’s (or week’s) income, cabdrivers continue to exhibit a significant negative labor supply elasticity in response to daily wages.

Keywords: Labor Supply, Reference Dependent Preferences, Cab Drivers
JEL Classification Codes: J22, B49

1. Introduction

A growing literature has attempted to measure the effects of transitory wage changes on the labor supply of groups of workers who are free to choose their hours of work. Invariably, these studies have focused on taxi drivers (Camerer et al., 1997, Farber, 2005, 2008, Crawford and Meng, 2011), stadium vendors (Oettinger, 1999) and bicycle messengers/taxi drivers (Fehr and Goette, 2007, Dupas and Robinson, 2013). These studies provide mixed evidence on the standard neoclassical model of consumer behavior – that in the absence of large income effects, there should be a positive relationship between wages and hours worked.

The earlier studies by Camerer et al. (1997), Farber (2005) and Oettinger (1999) arrive at conflicting results. In direct opposition to the predictions of the standard neoclassical model of labor supply, Camerer et al. (1997) find a negative wage elasticity, which they interpret as providing evidence in support of the idea that taxi drivers set loose daily income targets and quit working once the target has been reached. Oettinger (1999) studies the daily participation decisions of stadium vendors at baseball games and find evidence of a large positive intertemporal supply elasticity. Similarly, Fehr and Goette (2007) find evidence of positive labor supply responses to experimentally-induced wage changes among bicycle messengers. In a re-analysis of Camerer et al.’s results, Farber (2005) argues that daily income effects are small and that the decision to stop work is more strongly related to cumulative daily hours than to wages.

More recent studies, most notably, Farber (2008) and Crawford and Meng (2011), have introduced the idea of reference-dependent preferences as an alternative to the standard neoclassical approach. Farber (2008) develops an empirical model of daily labor supply that incorporates reference-dependent preferences and finds some evidence suggesting that drivers’ labor supply decision may be influenced by a reference level of daily income. However, he finds that the reference level varies substantially day to day for a particular driver and that most shifts appear to end before the reference level is reached, leading him to conclude that there is a limited scope for reference-dependent preferences to explain the labor supply decisions of taxi drivers in his sample. Crawford and Meng (2011) followed-up on Farber’s (2008) idea by modeling income and hours targets using proxied rational expectations. By reducing the dimensionality of the problem, they find strong evidence that reference dependence (Koszegi and Rabin, 2006) is an important part of the labor supply decisions of taxi drivers in Farber’s (2008) sample and that natural proxies of driver’s income and hours targets are able to avoid Farber’s criticism that driver’s income targets are too unstable to yield a useful model of labor supply.

It is likely that Crawford and Meng’s (2011) findings will further intensify the debate surrounding whether reference-dependent preferences outperform the standard neoclassical model in describing the labor supply decisions of taxi drivers and other groups of workers. Nevertheless, one of the biggest obstacles in settling (and furthering) the debate lies in the quality of data. Most of the research on the labor supply of taxi drivers is based on a small sample of drivers and non-externally verified trip sheets. Existing data limitations also preclude the study of important economic behavior. As a complete history of trip sheets is typically not available for any particular driver, existing research has been constrained to studying the daily supply of labor and cannot address relevant questions such as the intertemporal (inter-day) substitution of labor and the relevance of income targets at time horizons beyond the daily level.

In this paper, we revisit the debate with a unique administrative dataset from taxi companies in Singapore that covers the minute-by-minute trip history of over 10,000 cabs for a period of one month in August 2010. Unlike the data for the NYC cab drivers, which were hand collected, our data is based on an automatic data collection system, known as the "TrafficScan" system, which collects traffic speed data from the GPS satellite-based tracking and dispatching system installed on a fleet of cabs on the road. 1 This system enables real-time speed data collection from cabs on the road at an interval of less than 3 minutes.

Apart from providing us with data that is essentially free of measurement error, one of the other advantages that our data has over existing studies is the ability to observe market-level supply and demand conditions. On any given day, we not only observe the labor supply decisions of a particular driver, but the labor supply decisions of the majority of other drivers. Therefore, in our estimation strategy, we can obtain cleaner variation in the demand shocks that generate transitory wage changes by explicitly controlling for aggregate supply conditions.

We find that in a given 24 hour period, a typical cab has a passenger on board for about 7 hours, is on break for close to 5 hours, is free and looking for passengers for 4 hours, and is offline for the remaining 8 hours. There is large variation in terms of time-use across days and across drivers. For instance, the average shift length for a given driver over our sample period is 10 hours with a standard deviation of 90 minutes. Conditional on having a shift, they have a passenger on board for slightly over 50% of the time and the standard deviation is 10%.

We find that, on average, there is a negative relationship between wages and labor supply. These results are robust to the addition of controls for supply shocks. We find that cabdrivers’ stopping probabilities are strongly related to earnings and hours targets as proxied for by rational expectations. Our results indicate that when targets on income and hours are reached, the probability of cabdrivers stopping work increases significantly. We also find some evidence that cabdrivers who earn more at the beginning of the month tend to work longer shifts at the end of the month. Nevertheless, conditional on previous day’s (or week’s) income, cabdrivers continue to exhibit a significant negative labor supply elasticity in response to daily wages.

The rest of the paper is organized as follows. In Section 2, we provide institutional details of the taxicab operations in Singapore. Section 3 describes the data and empirical strategy. In Section 4, we present the main empirical results. Section 5 concludes.

2. Cab Drivers and Operators in Singapore

Singapore is an island state with a total land area of 715.8 square kilometers (km2), which is slightly smaller than the city of New York (790 km2). The total population of Singapore is estimated at 5.3 million as of 2012. Its population density of 7,421 persons / km2 is about 30% lower than that of New York City, which is estimated at 10,436 persons / km2 in 2011. Despite its small geographical size, Singapore has one of the world’s most extensive transportation networks. Relative to other major cities such as New York and London, Singapore has one of the highest cab densities in the world with 5,128 cabs per one million inhabitants in 2007 compared to 1,522 in New York and 3,285 in London. Cab services are relatively inexpensive in Singapore, where a 10-km trip will cost approximately US$8 (S$10), which is less than half of the US$17.4 (S$21.5) fare for the same trip taken in New York City. 2

The taxi market in Singapore was liberalized in 2003 and barriers to entry were lifted for new operators. Currently, there are seven operators, comprising public and private firms that run a total fleet of 28,210 cabs. The cab market is regulated, and only Singapore citizens with Taxi Driver’s Vocational Licences (TDVLs) are allowed to work as cabdrivers in Singapore. In 2012, a total of 95,764 TDVLs had been issued. Cab drivers usually join an operator either as hirers or relief drivers. Hirers lease cabs directly from the operators, usually on a six-month or an annual contract; whereas relief drivers make private arrangements with hirers to lease their cabs on selected shifts in a day. Different operators charge different daily rental rates in Singapore. The rates can range approximately from US$52.83 (S$65) to US$97.53 (S$120) per day depending on the age, model, and fuel type of the cabs. Cabdrivers cover their on-the-road expenses, which include fuel costs, washing, parking and other miscellaneous expenses. A typical taxi driver could earn an average daily income of S$318. 3 A small number of cabdrivers own their cabs, but they constitute not more than 1% of the total fleet.

All cabs in Singapore are fitted with electronic meters. Cab fares are highly variable, and are made up of base fares and additional surcharges. The base fares include a flag-down fare of US$2.44 (S$3.00) to US$4.06 (S$5.00) for the first 1 km, and a variable fare that varies by distance travelled and waiting time. The distance-based fare is a step-up charge with US$0.18 (S$0.22) for every 400 meters in the first 1 km to 10 km; and subsequently at US$0.18 (S$0.22) for every 350 meters for distances above 10 km. The waiting time fare is charged at US$0.18 (S$0.22) for every 45 seconds that cabs are caught in a traffic jam. To regulate the demand by commuters during peak hours and at selected locations, additional time-based surcharges and area-based surcharges are levied. All cab operators in Singapore also offer current or advance booking services at a fee, which varies from US$2.03 (S$2.50) to US$14.63 (S$18.00) depending on the time of booking and the taxi type. Cab drivers keep all the fare charged (flag-down fare, additional surcharges and booking fees), but are required to pay a fee to operators for the booking services. Commuters, who travel into the city at peak hours, are also subject to congestion charges, which are paid through the Electronic Road Pricing (ERP) readers in the cabs. The ERP charges are payable on top of the cab fares and are paid directly to the government. The total fare is shown on the taximeter. Appendix A1 provides a summary of the fares and rates of a typical cab operator in Singapore.
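
As a rough illustration of how the base metered fare described above accumulates, here is a minimal sketch that codes only the flag-down, distance and waiting-time components using the S$ figures quoted in this section; booking fees, peak and area surcharges and ERP charges are left out, and the flag-down amount is a parameter because it varies by cab type.

    # Sketch of the base metered fare in S$ (surcharges, booking fees and ERP excluded).
    def base_fare(distance_km, waiting_seconds, flag_down=3.00):
        fare = flag_down                          # flag-down covers the first 1 km
        billable = max(distance_km - 1.0, 0.0)
        first_band = min(billable, 9.0)           # 1 km to 10 km: S$0.22 per 400 m
        fare += 0.22 * (first_band * 1000 // 400)
        second_band = max(billable - 9.0, 0.0)    # above 10 km: S$0.22 per 350 m
        fare += 0.22 * (second_band * 1000 // 350)
        fare += 0.22 * (waiting_seconds // 45)    # waiting time: S$0.22 per 45 seconds
        return round(fare, 2)

    # A 10-km off-peak trip with 5 minutes of waiting, the comparison used earlier in this section:
    print(base_fare(10.0, 5 * 60))                # about S$9.16, in line with the roughly S$10 cited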

3. Data Description

a. Real time trip data from global positioning system (GPS) enabled cabs in Singapore

Previous studies by Camerer et al. (1997) and Farber (2005) used data from trip sheets recorded by cabdrivers, which were provided by different cab companies in New York City. Farber (2008) and Crawford and Meng (2011) use the same set of data collected by Farber (2005) in their test of reference-dependent preferences in cabdriver labor supply. The largest of Camerer et al.’s (1997) samples consisted of 484 cabdrivers with 1044 recorded trips; whereas in Farber (2005), 584 trips were extracted from the trip sheets of a smaller sample of 21 cabdrivers. Apart from the small sample of drivers, another major limitation of existing data is that the trip sheets are not organized in a systematic fashion. As a result, the authors are not able to back out information on the number of shifts worked by an individual cab driver. 4 Hence, all the existing research on taxi driver labor supply has focused exclusively on the intensive margin of labor supply within days. In 1999, the cab operators in Singapore started installing GPS satellite-based tracking and dispatching systems. 5 In the same year, the Singapore Land Transport Authority (LTA) also developed an automatic data collection system, known as the “TrafficScan” system, to collect traffic speed data from a fleet of cabs on the road. In 2006, an enhanced e-TrafficScan system was implemented after the cab operators had upgraded the wireless communication in cabs to the new high-bandwidth General Packet Radio Service (GPRS). The e-TrafficScan system enables real time speed data collection from cabs on the road at an interval of less than 3 minutes. The real time data for the full month of August 2010 were collected from the cab operators in Singapore for this study.

The real time logs contain 16,793,501 data points from 15,406 cabs each day on average. A data point is a row of GPS trace data of a cab in the real time logs, which includes fields such as the XY location, timestamp, speed, status, plate number and driver ID of the cab at that point in time. A time interval, which is less than 3 minutes, is computed as the difference in times recorded between two consecutive points. Based on the points, we generate data on trips. The first point in the log of a cab is the start time of a trip identified by a specific status, and the end time of a trip is identified by the last point in a series of consecutive points with the same status. Trip time is thus defined as the difference in times recorded between the first point and the last point in the series of observations with the same status for a cabdriver, or equivalently as the cumulative intervals between consecutive points with the same status from the first to the last point. Table 1 compares the cabdriver data used in the three studies.

Table 1: Comparison of the Cabdriver Data Used in Related Studies

Table 1 Comparison of the Cabdriver Data Used in Related Studies.png
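
To make the trip-construction step above concrete, the sketch below collapses a cab's consecutive GPS points that share the same status into trips and computes each trip's duration. The column names (driver_id, timestamp, status) are assumptions about how the real-time log would be laid out, not the actual field names.

    # Collapse consecutive same-status GPS points into trips (column names are assumed).
    import pandas as pd

    def points_to_trips(log: pd.DataFrame) -> pd.DataFrame:
        log = log.sort_values(["driver_id", "timestamp"]).copy()
        # A new trip starts whenever the status (or the driver) changes between rows.
        change = (log["status"] != log["status"].shift()) | \
                 (log["driver_id"] != log["driver_id"].shift())
        log["trip_id"] = change.cumsum()
        trips = log.groupby("trip_id").agg(
            driver_id=("driver_id", "first"),
            status=("status", "first"),
            start=("timestamp", "first"),
            end=("timestamp", "last"))
        trips["minutes"] = (trips["end"] - trips["start"]).dt.total_seconds() / 60
        return trips.reset_index(drop=True)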

b. Summary Statistics

The one-month real time logs for August 2010 yield a cumulative total of over 520 million data points. Based on the point observations, we generate trip data for each day according to the vehicle service indicators described in Table 2. These include status descriptions such as whether the vehicle is hired by a passenger (HIRED), available to take a passenger (AVAILABLE), or not free to take a passenger (REST or UNAVAILABLE), etc. Based on the 11 statuses identified each day in the real time logs (a total of 1,440 minutes or 24 hours), the “HIRED” status takes up 27.7% of the time on the road, which is equivalent to a monthly cumulative time interval of about 190,637,213 minutes, or 400 minutes per day on average. The cabdrivers’ fare-generating services (HIRED), which measure their output, average 6.7 hours out of each 24-hour day. They also spend on average about 304 minutes (5.1 hours) searching for commuters, as indicated by the “AVAILABLE” status, which takes up 21.1% of their time each day. If we group the “active” working statuses, indicated by statuses 1 to 6 in Table 2 (AVAILABLE, HIRED, BOOKED, PICKUP, CALLOFF and CASH), and treat the remainder as “inactive” statuses, a typical cab was used on average 51.4% of the time each day on the road. 6 The distributions of the activities of cabdrivers at the monthly aggregate level are consistent with the daily activities observed for average cabdrivers. The multiple trips in each day are added up for each driver to give the day-shift data for both single-shift cabs and two-shift cabs. Using the day-shift data, we compute the daily income and wage rate of the taxi drivers, who can be identified by their unique IDs.

Table 2: Summary Statistics

Table 2 Summary Statistics.png
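
A minimal sketch of the time-use shares reported above: starting from a per-trip table like the one sketched earlier, it computes the fraction of each day's 1,440 minutes spent in every status and the share spent in the "active" statuses of Table 2. Column names are assumptions.

    # Share of each day's 1,440 minutes spent in each status (columns assumed:
    # driver_id, status, start, minutes from the trips table sketched earlier).
    import pandas as pd

    ACTIVE = ["AVAILABLE", "HIRED", "BOOKED", "PICKUP", "CALLOFF", "CASH"]

    def daily_status_shares(trips: pd.DataFrame) -> pd.DataFrame:
        trips = trips.assign(day=trips["start"].dt.date)
        minutes = trips.pivot_table(index=["driver_id", "day"], columns="status",
                                    values="minutes", aggfunc="sum", fill_value=0)
        shares = minutes / 1440.0                              # fraction of the 24-hour day
        shares["active"] = minutes[minutes.columns.intersection(ACTIVE)].sum(axis=1) / 1440.0
        return shares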

Inter-day/Intra-day variations in activities

We plot the distributions of cabs by time and by cab activity status for the week from August 15 to 21, 2010 (Figure 1). The vertical axis measures the number of cabs and the horizontal axis measures time in minutes on an ordinal scale covering the one-week period starting on August 15, 2010. Four major activities, HIRED (purple line), LOGOFF (light blue line), AVAILABLE (light green line) and REST (orange line), represent the main activities of cabs in Figure 1. The LOGOFF (light blue line) activity always peaks just before the HIRED activity, indicating either a change of shift or the stoppage of work for the day by the sample of cabdrivers. The results clearly show significant variation in inter-day demand and in the wage rate of cabdrivers across different days of the week.

Figure 1: Distributions of Cabs per minutes by Status for the week of August 15-21, 2010

Figure 1 Distributions of Cabs per minutes by Status for the week of August 15-21, 2010.png

c. Inter-day Demand and Supply Elasticity for Cabs

Based on the real time data, “AVAILABLE”, “HIRED”, “BOOKED”, “PICKUP”, “CALLOFF” and “CASH” are identified as the “active” working hours for cabs each day. However, only the HIRED status is used to capture the fare services provided by a cab. We compute the cumulative frequency distributions of cabs for “active” working status and fare paying services (HIRED) by minute each day. The ratio of the cumulative number of cabs in HIRED status over the cumulative number of cabs in “active” status is computed over a one-minute interval. This ratio measures the demand (utilization) for cabs on the road by minute. A ratio close to unity also implies high productivity (efficiency) of cabdrivers on the road. In contrast, a high utilization rate will have a negative impact on the service delivery ratio, because we may expect a longer wait time to flag down a cab during peak hours. The ratio is an indicator that could be used to monitor the “taxi availability” standard implemented by the regulator (LTA) with effect from 1 January 2013.
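
The utilization ratio just described can be sketched as below. For brevity this version counts distinct cabs in HIRED and in any active status minute by minute and takes their ratio, a simplification of the cumulative-count version used in the paper; the minute-level table is assumed to have timestamp, cab_id and status columns.

    # Minute-by-minute utilization: hired cabs divided by active cabs (columns assumed).
    import pandas as pd

    ACTIVE = {"AVAILABLE", "HIRED", "BOOKED", "PICKUP", "CALLOFF", "CASH"}

    def utilization_by_minute(points: pd.DataFrame) -> pd.Series:
        points = points.assign(minute=points["timestamp"].dt.floor("min"))
        active = points[points["status"].isin(ACTIVE)]
        hired = points[points["status"] == "HIRED"]
        n_active = active.groupby("minute")["cab_id"].nunique()
        n_hired = hired.groupby("minute")["cab_id"].nunique()
        return (n_hired / n_active).rename("utilization")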

Figure 2 plots the cumulative number of cabs in an active working status (a proxy for cab supply), the number of cabs that have been hired (a proxy for cab demand) and the ratio of the cumulative number of hired cabs over the cumulative number of active working cabs by minute (multiplied by 10,000) for the week of August 15 to 21, 2010. The supply (active working cabs) and demand (hired cabs) are higher during the weekdays relative to Saturday and Sunday. The results seem to imply that cabdriver labor supply is highly correlated with demand; as a result, the inter-day utilization rate (an indicator of the cab availability standard) on Sunday is not significantly lower than on other days of the week.

Figure 2: Distributions of Day-by-Day Supply and Demand of Cab Services and Utilization Rate for the Week of August 15-21, 2010

Figure 2 Distributions of Day-by-Day Supply and Demand of Cab Services and Utilization Rate for the Week of August 15-21, 2010.png

d. Shift-level Work hours and Wage Measures

This study aims to empirically test the target-earning hypothesis of Camerer et al. (1997), which predicts a negative labor supply elasticity, against Farber’s (2005 and 2008) neoclassical view, which argues for a positive inter-temporal labor supply elasticity. Like the two earlier studies, we need to measure the worked hours and wage (income) of cabdrivers. Camerer et al. (1997) and Farber (2005 and 2008) both used hand-recorded trip sheet data from cabdrivers in New York City; whereas we have access to high-frequency real-time data collected through the GPS-enabled system installed in Singapore’s cabs. In Camerer et al. (1997) and Farber (2005 and 2008), hours worked per day are defined by the difference between the trip start time (when the first passenger is picked up) and the trip end time (when the last passenger is dropped off). The worked hours are truncated if cabdrivers start work before the first passenger is picked up, and/or if they do not stop immediately after the last passenger is dropped off.

In our study, we use the GPS-based data to identify the work hours of Singapore cabs through the status recorded in the real time logs of each cab. The high frequency data are more accurate and reliable in tracking the work hours of cabdrivers. In our dataset, taxi fares vary by factors such as whether the cab was flagged down, the distance-based rates, and the waiting-time rates. Additional surcharges are also levied for trips starting from selected locations such as airports, the city area, Marina Bay Sands, etc. For the tests of the hours of work and wage relationships, we derive the wage rate of individual cabdrivers in each shift taking into account the above variations in fare structure.

The shift-level data allow us to separate the behavior of hirers (main cabdrivers) from that of relief drivers, especially for two-shift cabs. We define a shift as a set of consecutive trips with different statuses by the same driver of the same cab. The shift starting time is the first point of the first active trip (HIRED, AVAILABLE, etc.) in the real time log of a cabdriver, and his shift ending time is identified by the last point of the last active trip in the log. Consecutive shifts are separated by a non-working break of at least 6 hours in the case of a single-shift cabdriver (hirer) (i.e. a single-shift cabdriver could have driven up to a maximum of 18 hours a day), or by the start of a new shift by a relief cabdriver if a cab is driven in two shifts in a day. Each cabdriver’s hours of work (denoted T_SHIFT) are computed as the time interval (in minutes) between the shift starting time and the shift ending time.
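
The shift definition above reduces to a simple rule over one driver's ordered trips: an idle gap of at least six hours closes the current shift and opens a new one. The sketch below implements only that single-shift rule (the two-shift handover, detected by a change of driver on the same cab, is not handled) and assumes the trips table sketched earlier.

    # Split one driver's active trips into shifts using the 6-hour break rule (sketch).
    import pandas as pd

    def split_into_shifts(trips: pd.DataFrame, break_hours: float = 6.0) -> pd.DataFrame:
        trips = trips.sort_values("start").copy()
        gap = trips["start"] - trips["end"].shift()            # idle time since the previous trip
        trips["shift_id"] = (gap >= pd.Timedelta(hours=break_hours)).cumsum()
        shifts = trips.groupby("shift_id").agg(start=("start", "first"), end=("end", "last"))
        shifts["T_SHIFT"] = (shifts["end"] - shifts["start"]).dt.total_seconds() / 60
        return shifts.reset_index(drop=True)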

Figure 3 plots the activities of 5000 randomly selected cabdrivers from the sample against shift time (T_SHIFT) on the X-axis, assuming that all of them start work at the same time. The black colored portion of the Figure represents the offline status, which indicates that the majority of cabdrivers stop work after approximately 10 hours (600 minutes). Some cabdrivers stop work earlier, whereas a few hard-working drivers stretch their work hours up to 16.7 hours (1,000 minutes). The data show significant variation in the work intervals of cabdrivers.

Figure 3: Distributions of Cabdrivers’ activities by shift

Figure 3 Distributions of Cabdrivers’ activities by shift.png

4. Empirical Tests of Cabdriver Labor Supply Elasticity

In this section, we test the target-earning hypothesis versus the inter-temporal labor substitution hypothesis. We first do a pair-wise curve fitting of shift length (in minutes) and wage rate (per working hour). Next, structural multivariate models of labor supply on wage rate and other covariates are estimated. Figure 4 shows the distributions of shift length by the full sample of 637,255 shifts recorded in the full month of August 2010. 548,216 shift-level observations with shift time ranging from 3 to 16 hours (180 - 960 minutes) 7 are used in both pairwise curve fitting and multivariable regressions. Figure 5 plots the distribution of wage rates of the 548,216 shifts in the analyses.

Figure 4: Distribution of Shift Length for the Month of August 2010

Figure 4 Distribution of Shift Length for the Month of August 2010.png

Figure 5: Distribution of Wage Rate of Cabdrivers

Figure 5 Distribution of Wage Rate of Cabdrivers.png

a. Pairwise Relationships between shift length and wage rate

We use local polynomial regressions to fit the shift length (in minutes) and wage rate for all the 548,216 shift level observations in Figure 6. The scatter plots indicate each of the shift level observations while the black solid line is the local polynomial regression fit to the underlying data. The figure indicates that shift length is generally increasing in wage rate at lower wage rates, but tends to decrease in wage rate at higher wage rates. Overall, the downward sloping labor supply curves are reminiscent of the negative elasticity of cabdriver labor supply as documented by Camerer et al. (1997), but the relationships are non-linear. The curve shows an initial positive income effect when the wage rate is less than S$40 per hour. This result is consistent with the split sample model of Crawford and Meng (2011). However, their results found no significant relationship between worked hours and wage, when income exceeds the target. In our non-linear (polynomial) fitted curves, however, we find that the inter-temporal labor supply elasticity is negative, if the wage rate exceeds S$40 per hour.

Figure 6: Scatter Plot and Local Polynomial Regression Fitting of the Relationship between Shift Length and Wage Rate for all Shifts

Figure 6 Scatter Plot and Local Polynomial Regression Fitting of the Relationship between Shift Length and Wage Rate for all Shifts.png
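
Readers who want to reproduce a curve like Figure 6 on their own shift-level data could use LOWESS (a locally weighted regression available in statsmodels) as a stand-in for the local polynomial regression used in the paper; the wage and T_SHIFT column names below are assumptions.

    # LOWESS fit of shift length on wage rate (a stand-in for local polynomial regression).
    import pandas as pd
    import matplotlib.pyplot as plt
    from statsmodels.nonparametric.smoothers_lowess import lowess

    def plot_shift_vs_wage(shifts: pd.DataFrame) -> None:
        fit = lowess(shifts["T_SHIFT"], shifts["wage"], frac=0.3)   # returns sorted (x, y-hat) pairs
        plt.scatter(shifts["wage"], shifts["T_SHIFT"], s=2, alpha=0.1)
        plt.plot(fit[:, 0], fit[:, 1], color="black")
        plt.xlabel("Wage rate (S$ per hour)")
        plt.ylabel("Shift length (minutes)")
        plt.show()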

Figure 7 shows the income-wage relationships separately for single-shift and two-shift drivers. The curvature (second derivative) is larger for single-shift cabdrivers (Figure 7a), implying that the decision to stop work among single-shift drivers is more sensitive to changes in the wage rate. The labor elasticity of income is smaller for two-shift drivers (Figure 7b), and a flatter curve is observed if weekday shifts are used (Figures 7c and 7d).

Figure 7: Differences between Single-Shift and Two-Shift Cabdrivers

Figure 7 Differences between Single-Shift and Two-Shift Cabdrivers.png

Reduced form models of cabdriver labor supply

Like Camerer et al. (1997), we estimate the cabdriver labor supply as a function of wage rate and other control variables in an OLS framework as follows:

T_SHIFT_{i,t} = α_i + δ_i·WAGE_{i,t} + X_{i,t}β_i + ξ_{i,t}    (1)

where T_SHIFT_{i,t} denotes the shift length (in minutes) of cabdriver i in shift t of a day, WAGE is the wage rate, calculated as the cumulative fare earned divided by the shift length in hours in each shift t, and X_{i,t} is a vector of exogenous factors that affect cabdriver labor supply. We use day dummies to control for lull periods on “Saturday”, “Sunday” and “National Day” (August 9, 2010). We control for heterogeneity in shifts using a “NIGHT” dummy, which has a value of 1 if a cabdriver starts the shift after 3.00PM, and 0 otherwise, and an “Is2Driver” dummy, which has a value of 1 for a two-shift cab, i.e. a cab that is shared by 2 drivers, and 0 otherwise. As in Camerer et al. (1997) and Farber (2005 and 2008), we control for the experience of cabdrivers based on cabdrivers’ ID numbers. We rank cabdrivers by their ID and sort them into two groups, where experienced cabdrivers have ID numbers that are smaller than the first-quartile cut-off. We also control for rainy days by including two weather dummies that indicate days with the highest precipitation (i.e. “Rain_high” = 1 if the aggregate rainfall at four weather stations in a day is more than 0.8 inches) and days with the lowest precipitation (i.e. “Rain_low” = 1 if the aggregate rainfall at four weather stations in a day is less than 0.1 inches).
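
Equation (1) with the controls listed above maps directly onto a formula-style OLS. The sketch below uses statsmodels; the shift-level DataFrame and its column names (T_SHIFT, WAGE, Saturday, Sunday, NationalDay, NIGHT, Is2Driver, Experienced, Rain_high, Rain_low) are assumptions about how the data would be laid out, not the authors' actual code.

    # Sketch of the OLS labor supply model (equation 1) and its log version.
    import numpy as np
    import statsmodels.formula.api as smf

    CONTROLS = ("Saturday + Sunday + NationalDay + NIGHT + Is2Driver"
                " + Experienced + Rain_high + Rain_low")

    def labor_supply_ols(shifts):
        ols = smf.ols("T_SHIFT ~ WAGE + " + CONTROLS, data=shifts).fit()
        log_ols = smf.ols("np.log(T_SHIFT) ~ np.log(WAGE) + " + CONTROLS, data=shifts).fit()
        # The two coefficients of interest, comparable to Models 1 and 2 in Table 3.
        return ols.params["WAGE"], log_ols.params["np.log(WAGE)"]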

The OLS regression results and the corresponding log-version of the models are summarized in Table 3. Our results for both the OLS and the log-OLS models indicate that the coefficients on wage rates both in the absolute value models (Model 1: -2.060 and Model 3: -1.498) and the log-value models (Model 2: -0.050 and Model 4: -0.031) are negative and significantly different from zero. The results support Camerer et al.’s (1997) findings of a significant but negative wage elasticity of labor supply for cabdrivers, where we find that cabdrivers stop work after the target income is reached. However, compared to the estimates of Camerer et al. (1997) of -3.55 to -5.01, our estimate of the wage elasticity of cabdriver labor supply is substantially smaller, at -0.031 (Model 4).

Table 3: OLS Models of Worked Hours on Wage Rate for Cabdrivers

Table 3 OLS Models of Worked Hours on Wage Rate for Cabdrivers.png

As Camerer et al. (1997) and Farber (2005 and 2008) pointed out, using the shift length, which is also the dependent variable in the model, as the denominator of the wage rate measure could potentially cause downward (negative) biases in the estimates of the wage elasticity of labor supply (δ_i) if there are any measurement errors in worked hours. As our data are real time data recorded with precision by GPS-enabled technology, we expect measurement errors to be minimal compared to the hand-collected trip sheet data in the two earlier studies on New York cabdrivers. We control for potential biases by adding cabdriver fixed effects to the OLS and log-OLS models in Table 4. The results show that the coefficients on wage are still significant and negative. The coefficient in the OLS model decreases marginally to -1.409 after controlling for fixed cabdriver effects. We also estimate a polynomial model, and the results show a significant non-linear wage elasticity of labor supply. Figure 8 plots the curve of shift length against wage rate based on the results of the polynomial model, assuming that all dummy variables take the value of 0. The curve is similar to the ones generated by the local polynomial fitting. In the log labor supply models, we show that the wage elasticity of labor supply is still negative and significant at -0.040.

Table 4: Labor Supply Models with Cabdriver Fixed Effects

Table 4 Labor Supply Models with Cabdriver Fixed Effects.png

Figure 8: Relationship between Shift Length and Wage Rate based on a Calibrated Polynomial Function

Figure 8 Relationship between Shift Length and Wage Rate based on a Calibrated Polynomial Function.png
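
A minimal sketch of the driver fixed effects (within) estimator behind Table 4: demean log shift length and log wage within each driver and run OLS on the demeaned variables, which is equivalent to including one dummy per driver. The column names are assumed and the day and shift controls are omitted for brevity.

    # Within (driver fixed effects) estimator of the wage elasticity (controls omitted).
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def within_elasticity(shifts: pd.DataFrame) -> float:
        df = shifts.assign(ly=np.log(shifts["T_SHIFT"]), lw=np.log(shifts["WAGE"]))
        df["ly_dm"] = df["ly"] - df.groupby("driver_id")["ly"].transform("mean")
        df["lw_dm"] = df["lw"] - df.groupby("driver_id")["lw"].transform("mean")
        fit = sm.OLS(df["ly_dm"], sm.add_constant(df["lw_dm"])).fit()
        return fit.params["lw_dm"]        # comparable to the -0.040 reported in Table 4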

b. Supply shocks and the labor supply elasticity of cabdrivers

One of the key concerns in interpreting the estimates in Tables 3 and 4 as measures of the labor supply elasticity of cabdrivers is the potential endogeneity of wages. The assumption behind the empirical specifications used thus far is that the daily variation in cabdriver wages is driven by shifts in the demand for taxi services. However, these results could be downward biased if the observed daily wage fluctuations are the result of unobserved shifts in the labor supply curves of cabdrivers (Oettinger, 1999). Since we have the universe of cabdrivers in our sample, we are able to construct measures of aggregate labor supply to control for potential unobserved labor supply shocks. We construct two measures of supply shocks – the first is based on the current day aggregate active working time among all cabdrivers in our sample and the second is based on the previous day aggregate active working time of all cabdrivers.
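
The two aggregate supply-shock controls can be built by summing active working minutes over all drivers by day and attaching the same-day and previous-day totals to each shift, as in the sketch below (day and active_minutes are assumed column names).

    # Aggregate supply-shock controls: same-day and previous-day total active minutes.
    import pandas as pd

    def add_supply_shocks(shifts: pd.DataFrame) -> pd.DataFrame:
        daily = shifts.groupby("day")["active_minutes"].sum().rename("supply_today").to_frame()
        daily["supply_yesterday"] = daily["supply_today"].shift(1)
        return shifts.merge(daily, left_on="day", right_index=True, how="left")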

The results controlling for these aggregate supply measures are reported in Table 5. The supply shock variables (both the current day and previous day) have a positive and significant impact on the labor supply of individual cabdrivers. Nevertheless, controlling for these variables does not change the coefficient on wages significantly: the coefficient is estimated at -0.052 when the proxy for the current-day aggregate labor supply shock is included, and -0.049 when the proxy for the previous-day aggregate labor supply shock is included. These values are slightly larger in absolute value than the -0.040 estimated in the model without the supply shocks in Table 4. Overall, the results suggest that our estimates of the negative income elasticity of cabdriver labor supply are robust to the inclusion of measures of aggregate supply shocks.

Table 5: Shocks to Taxi Driver Labor Supply

Table 5 Shocks to Taxi Driver Labor Supply.png

c. Heterogeneity in Cabdrivers’ Labor Supply Elasticities

Like Camerer et al. (1997), we sort cabdrivers based on their ID in ascending order, with the earlier numbers in the first quartile identified as experienced cabdrivers and cabdrivers whose IDs fall in the 4th quartile identified as new (relatively inexperienced) cabdrivers. We run the two fixed effect models, and the results are summarized in Table 6. We find that experienced cabdrivers are more sensitive to the wage rate effects and they stop work earlier when they reach their target earnings. While the negative elasticity for newer cabdrivers is smaller, the coefficient is still statistically significant at less than the 1% level.

Table 6: Heterogeneity of Cabdrivers

Table 6 Heterogeneity of Cabdrivers.png

We also examine the impact of the shift constraint imposed on 2-shift cabdrivers, who are subject to a work-hour limit of 12 hours a day because they are required to transfer their cabs to their co-cabdrivers (relief drivers) at an agreed time. The results in Table 6 show that single-shift and 2-shift cabdrivers are both target earners who stop work when they meet their income target for the day. The coefficients on the log wage variable were significant and negative for both groups of drivers. However, the results show no significant evidence that single-shift cabdrivers, who are not subject to the shift-time constraint, are more likely to drive longer hours or are less responsive to income effects.

Based on the average shift length, we sort cabdrivers into a “leisure” group (1st quartile) and a “hard-working” group (4th quartile) and rerun the fixed-effect models for these two groups of cabdrivers. The results in columns 5 and 6 of Table 6 show that for cabdrivers who work short hours in a shift (“leisure” group), the wage elasticity is more negative at -0.075 compared to -0.007 for the group of cabdrivers who work long hours (“hard-working” group). In other words, cabdrivers who work shorter shifts in a day are more sensitive to target earnings relative to cabdrivers who work longer hours in each shift. If earnings are higher in the earlier hours of a shift, the “leisure” cabdrivers are more likely to stop work earlier. These results appear consistent with the asymmetric target earners model of Crawford and Meng (2011).

d. Reference-Dependence Preferences in Labor Supply

Camerer et al. (1997) attribute the negative labor supply elasticity in their results to the target-earnings behavior of cabdrivers. Consistent with reference-dependent preferences, target earnings imply a discontinuity in the marginal utility of income at a reference level. A sharp kink is expected in the utility function, where cabdrivers choose to stop work when the reference/target income is earned for the day. Farber (2008), however, found evidence that constancy in the reference income across days of the week is strongly rejected, and that New York cabdrivers appear to stop work well before the reference income level is reached. He claims that these results are inconsistent with the reference dependence hypothesis, in which stable income targets influence cabdrivers’ decisions to stop work.

Crawford and Meng (2011) repeated the empirical tests using the same data set as Farber (2008). Unlike the latent targets used in Farber (2008), they treat cabdrivers’ income and hour targets as rational expectations using exogenous proxies from driver/day-of-the-week point estimates of income and hours, excluding the day in question. They estimate the probability of stopping work using split samples of cabdrivers by their “early earnings” criteria, and their results show significant effects of the second (hours) target in predicting the likelihood that cabdrivers stop work. The results support the reference dependence hypothesis, which suggests that if early earnings targets are reached, the stopping probabilities are significantly and positively correlated with the second (hours) target. The reverse is true, such that when the first (early) earnings target is not met, the hours target is unlikely to explain the stopping probability of cabdrivers.

Like Farber (2008) and Crawford and Meng (2011), we expand our tests on whether the reference dependence hypothesis holds in predicting the cabdriver labor supply elasticity in our Singapore sample. Based on our unique dataset, we modify the estimation strategies for target income and hours used by Crawford and Meng (2011). We first use the driver-specific averages of day-by-day income and hours for the sample periods (“calibration” periods) up to but excluding the “treatment” periods to proxy the targets. We also estimate the average driver-specific, day-of-the-week targets for income and hours based on the first three-week calibration period (August 1 to 21, 2010) as an alternative approach to control for day-of-the-week variation. We define two dummy variables indicating that the targets have been exceeded (i.e. cumulative income > the target, and cumulative hours > the target), by day and by day-of-the-week, as measures of the incremental effects of hitting the target in the stopping probability models. We test the effects of these above-target indicators on the stopping probability using shift/trip data from the last week of August 22 to 28, 2010, which is defined as the treatment (test) period.

In our Logit models, we define the stopping probability using a dummy variable that has a value of 1 if a driver stops work after a “HIRED” trip is made, and 0 otherwise. The positive and significant coefficients on the cumulative efforts (both income and hours) above the targets, as shown in Table 7 (Column 1), imply that when the targets on income and hours are reached, the probability of cabdrivers stopping work increases significantly. The results are consistent and significant across the models when we control for weather, day of the week, hour of the day and driver fixed effects (Column 2). The two “above” targets representing incremental effects are still significant and positive. The coefficients on cumulative hours on the “treated” working day are significant and positive, which is consistent with the earlier negative labor supply elasticity results. The results imply that cabdrivers whose cumulative working time is high are more likely to stop work earlier. The cumulative income coefficients are significant, but the signs change from negative to positive when variations in day of the week, hour of the day, rainy days and driver fixed effects are controlled for.

Table 7: Reference Dependence Hypothesis with Driver Average Targets

Table 7 Reference Dependence Hypothesis with Driver Average Targets.png
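
The stopping-probability specification can be sketched as a statsmodels Logit on trip-level data, where stop equals 1 if the driver ends the shift after a HIRED trip, and above_income and above_hours flag cumulative earnings and hours above the driver-specific targets. All column names are assumptions, and driver fixed effects are left out of this small sketch.

    # Logit model of the probability of stopping work after a HIRED trip (sketch).
    import statsmodels.formula.api as smf

    def stopping_model(trips):
        # trips: one row per HIRED trip with within-shift cumulative income and hours,
        # dummies for being above the driver's income and hours targets, and controls.
        model = smf.logit(
            "stop ~ above_income + above_hours + cum_income + cum_hours"
            " + Rain_high + Rain_low + C(day_of_week) + C(hour_of_day)",
            data=trips)
        return model.fit(disp=False)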

We repeat the Logit regressions that adjust for day-of-the-week variations in the targets for specific drivers, where the target for Monday is set based on the average of the past three Mondays’ averages from the calibration periods (August 1 to 21, 2010), and the same for other days of the week. Our results in Table 8 are consistent with those reported in Table 7. The fact that we also observe constancy in cabdrivers’ stopping behavior in our Singapore sample also allays the criticism of Farber (2008) that the reference income is too unstable in predicting cabdrivers’ stopping behavior.

Table 8: Reference Dependence Hypothesis with Driver/Day-of-the-Week Average Targets

Table 8 Reference Dependence Hypothesis with Driver Day-of-the-Week Average Targets.png

e. Constancy in Labor Supply of Cabdrivers

In the neo-classical model, variation in income should be uncorrelated within a day and across days, such that these changes in the drivers’ income should have no explanatory effects on their work hours in a day or a shift. We test the constancy hypothesis and the importance of daily wage targeting using a reduced form model that regresses the work hours per shift by driver/day against the lagged effort-level indicators, represented by the log-income target variables. For the lag effects, we test the day-by-day effects using the previous day’s effort levels as the lagged control variable, and also test the beginning-of-the-month effects, where the first-week-of-the-month effort levels (August 1-7, 2010) are used as the control variables and the last-week-of-the-month effort levels (August 22-28, 2010) are used as the response variables. As in previous specifications, the models also control for variation related to weekend days (Saturday and Sunday), rainy days, and drivers’ heterogeneity. For the day-by-day regression, the National Day effect on August 9, 2010 and other day fixed effects are controlled for.

Table 9 summarizes the regression results. The shift length in the day-by-day model is positively and significantly correlated with the previous day income, as represented by the log of one-day lagged income (“Log(Income Target)”), suggesting that higher income in the previous day is associated with a longer shift duration. Controlling for previous day income does not appear to impact the effect of daily wages – the coefficient on the current log wage remains negative and statistically significant at -0.139. These results are also robust to the inclusion of driver fixed effects – conditional on previous day income, the coefficient on current log wage is smaller in absolute value at -0.044 but remains highly statistically significant.

Table 9: Income Targets of Cabdrivers

Table 9 Income Targets of Cabdrivers.png
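
The day-by-day constancy test can be sketched as an OLS of log shift length on the current log wage and the previous day's log income for the same driver; driver_id, day, T_SHIFT, WAGE and income are assumed column names, and the other controls are omitted.

    # Shift length on current wage and previous-day income (day-by-day constancy test).
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    def constancy_test(shifts: pd.DataFrame):
        df = shifts.sort_values(["driver_id", "day"]).copy()
        df["income_target"] = df.groupby("driver_id")["income"].shift(1)   # previous day's income
        df = df.dropna(subset=["income_target"])
        return smf.ols("np.log(T_SHIFT) ~ np.log(WAGE) + np.log(income_target)", data=df).fit()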

We further examine the effects of both weekly and monthly income targets on cabdriver labor supply. We use the end-of-the-month work hours (Week 4) as the base (calibration) model, and in our extended models we include the income estimated from the “Week 1” sample as a proxy for the weekly target and the cumulative income from “Weeks 1 to 3” as a proxy for the monthly target. 8

For each of the models that include the prior “Week 1” and “Weeks 1 to 3” income, we find that shift length is positively and significantly related to previous income. Higher income in the earlier weeks (as proxied for by the cumulative log income target the previous weeks) is correlated with longer shift duration in Week 4. Nevertheless, conditional on previous weeks’ earnings, cabdrivers continue to exhibit negative labor supply elasticity in response to daily wage variation. Overall, these results point to the importance of daily income targets in determining cabdriver labor supply.

f. Testing the Importance of Daily Targets using Cabdrivers’ Response Earnings Shocks in the Previous Day

The analysis in the previous section suggests that daily income targets are important determinants of cabdriver labor supply. In this section, we show the importance of daily targets by exploiting shocks in the earnings of cabdrivers in the previous day and examining their subsequent labor supply behavior. Moreover, as our data captures the labor supply of all drivers over a month-long period, we can use cabdrivers’ behavior in the first three weeks of the month (August 1-21) to establish each individual cabdriver’s reference daily income target.

Daily income targeting suggests that cabdriver labor supply is only sensitive to variation in daily wages and should not be affected by the previous day’s income. On the other hand, if cabdrivers are using income targets beyond the daily level, e.g. weekly or monthly targeting, we would expect cabdrivers who earned significantly less in the previous day to try to make up the difference in the days after. Conversely, cabdrivers who earn significantly more in the previous day might decide to reduce their labor supply in response to the positive income shock. To test this hypothesis, for each day in the last week of August (August 22-28), we select cabdrivers who experience a negative or positive income shock based on the ratio of their actual income on that day to their reference daily income (income-ratio). Cabdrivers defined as experiencing a negative shock in their income are those in the bottom 25th percentile of the income-ratio distribution. Similarly, cabdrivers who are in the top 25th percentile of the income-ratio distribution on a particular day are defined as those who experienced a positive shock in their income. For both these groups of workers, we track the ratio of their actual income to their reference daily income in the following two days.

Figure 9A plots the kernel density of the income-ratio for cabdrivers who experience a negative shock in each of the seven days in the last week of August. There are 7 series in the Figure and each series consists of three kernel density curves, which represent the distribution of the income ratio for the same set of cabdrivers in the first (blue), second (green) and third day (red), respectively. For example, the first day distributions are depicted by the blue lines and, by construction, these distributions are right-skewed and peak to the left of 1.0, indicating that this sample of cabdrivers earned substantially below the reference daily incomes on the first day. Next, for the same set of cabdrivers for each day (each series is indicated by the date and line type), we track their income-ratio over the next two days; these distributions are depicted by the green (second day) and red (third day) lines. Note that the same sets of cabdrivers are tracked in each individual series, but different sets of cabdrivers are used across series. Figure 9B depicts the patterns for cabdrivers who experienced a positive income shock. Strikingly, both figures show that regardless of whether cabdrivers experience a negative or positive shock in the previous day, the income of these same drivers reverts back to the reference level in each of the following two days. This pattern is highly consistent across days, providing further evidence that cabdriver labor supply is generally unaffected by shocks to previous day income and is determined largely by adherence to daily income targets.

Figure 9: Distribution of Income Ratio of Cabdrivers

Figure 9 Distribution of Income Ratio of Cabdrivers.png
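
Figure 9's construction can be sketched as follows: compute each driver's reference daily income from the August 1-21 calibration period, form the income ratio, pick the bottom (or top) quartile of that ratio on a given day, and plot kernel densities of the same drivers' ratios on that day and the next two. The daily-income table and its columns (driver_id, day, income) are assumptions.

    # Track the income ratio of shocked drivers over the next two days (sketch of Figure 9).
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    def income_ratio_kde(daily: pd.DataFrame, shock_day: pd.Timestamp, shock: str = "negative"):
        calib = daily[daily["day"] <= pd.Timestamp("2010-08-21")]
        reference = calib.groupby("driver_id")["income"].mean().rename("ref_income")
        df = daily.merge(reference, on="driver_id")
        df["ratio"] = df["income"] / df["ref_income"]

        day0 = df[df["day"] == shock_day]
        if shock == "negative":
            picked = day0.loc[day0["ratio"] <= day0["ratio"].quantile(0.25), "driver_id"]
        else:
            picked = day0.loc[day0["ratio"] >= day0["ratio"].quantile(0.75), "driver_id"]

        grid = np.linspace(0, 3, 300)
        for offset, color in zip(range(3), ["blue", "green", "red"]):   # first, second, third day
            sel = df[(df["day"] == shock_day + pd.Timedelta(days=offset)) &
                     (df["driver_id"].isin(picked))]
            plt.plot(grid, gaussian_kde(sel["ratio"].dropna())(grid), color=color)
        plt.xlabel("Actual income / reference daily income")
        plt.show()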

g. Effects of Different Reference Points on Labor Supply

Next, we test the effects of different reference points on cabdrivers’ labor supply elasticity. We first estimate two reference points for each driver using the average income and working hours per shift using the sample observations in the first three-week from 1st to 21st, August. Based on the reference points derived, we split the sample of cabdrivers’ shifts based on their income and working hours per shift in Week 4 (22nd to 28th, August) into four groups: high income shifts versus low income shifts and high working hour shifts versus low working hour shifts. We then run separate fixed effect models with the log-working hour per shift as the dependent variable using the pooled sample and the four sub-samples in the treatment periods (Week 4). The results are summarized in Table 10.

Table 10: Reference Dependence and Labor Supply Elasticity

Table 10 Reference Dependence and Labor Supply Elasticity.png

In the pooled-sample model, the wage elasticity of labor supply is estimated at -0.052 (see Column (1)). When estimating the models using the sub-samples of the data sorted by the first-three-week income and working hour references, we find that the wage coefficients in the stratified models are all significant and negative. The wage elasticity for the high-income shifts is estimated at -0.423 and is greater than the -0.275 estimate for the low-income shifts. For cabdrivers whose shifts are above the reference level, they are more likely to stop work earlier when their income target is achieved. However, the wage elasticity for cabdrivers working long hours (above the reference level) is estimated at -0.023, and the elasticity is insignificantly different from the elasticity of -0.025 estimated for cabdrivers working below the reference number of hours. The results imply that the working hour reference is not significant in explaining the variation in the labor supply elasticity of cabdrivers.

5. Conclusion

The debate surrounding the wage elasticity in cabdriver labor supply between Camerer et al. (1997) and Farber (2005) has not been resolved due to differences in econometric methodology, conceptual model and measurement of wage rates. The reference-dependent utility preference model (Farber, 2008, and Crawford and Meng, 2011) is used to describe cabdrivers’ target earnings behavior in the decision to stop work. Again, no consensus has been reached between the two schools of thought on labor supply, where the neoclassical school represented by Farber (2008) found no significant effects of target income on cabdrivers’ decision to stop work as most cabdrivers appear to end shifts before reaching their target earnings. In contrast, Crawford and Meng (2011) find evidence of asymmetric target income effects on cabdriver labor supply consistent with the behavioral story of Camerer et al. (1997). They show that cabdrivers who do not meet their target income in initial work hours are more likely to work longer (i.e. positive income effects on probability of stopping work).

The earlier studies use two different sets of hand-collected trip-sheet data from a small number of New York City cabdrivers. We collected high-frequency real-time data through GPS-enabled technology installed in Singapore’s cabs for the month of August 2010, and use the data to revisit the debate surrounding the determinants of cabdriver labor supply. After controlling for various sources of heterogeneity and cabdriver fixed effects in the model, we find evidence that supports the negative wage elasticity documented by Camerer et al. (1997), which predicts that cabdrivers do stop work earlier after their target earnings have been reached. However, our results show a much smaller elasticity of between -0.031 and -0.052, compared with Camerer et al.'s (1997) estimates of -0.411 to -0.618 (unadjusted for fixed effects). The low negative wage elasticity in our results indicates some smoothness in cabdrivers’ stopping probabilities in reacting to income targets, as shown in Farber (2008).

Next, we follow Crawford and Meng’s (2011) approach in estimating the targets, using the average income and hours by driver/day and also by driver/day-of-the-week. Our results show that hitting the targets early has a positive and significant impact on the probability of stopping work: cabdrivers who work up to the target cumulative hours and/or who earn the daily target income are more likely to stop work earlier. The results support the reference-dependence hypothesis in explaining cabdriver labor supply.

Given the uniqueness of our dataset, we are able to conduct more robustness tests by including supply shocks and examining whether cabdriver labor supply is sensitive to daily, weekly or monthly income targets. Overall, our findings are consistent with the observed negative wage elasticity being largely driven by cabdrivers’ use of daily income targeting.

In future work, we plan to further evaluate the effect of transitory and permanent changes in income by examining the effects of area-based fare variations on cabdrivers’ labor supply. The proposed extension will require extensive spatial data not just on trip time, but also the trip route of cabdrivers. Such an experiment can be designed with our high-frequency data.

Footnotes

1. “World-first satellite tracking, booking system for Tibs taxis,” The Straits Times, November 24, 1995.

2. Fares are estimated based on a 10-km taxi trip, with 5 minutes of waiting time, at off-peak hours and with no surcharges. The exchange rate is based on July 2012 figures. (Source: LTA, Singapore)

3. “Shorter wait for cabs, better takings for cabbies,” The Straits Times, January 22, 2008.

4. If a trip sheet is not available for a particular driver on a given day, one cannot tell whether the driver did not work that day or the trip sheet was simply not provided.

5. “World-first satellite tracking, booking system for Tibs taxis,” The Straits Times, November 24, 1995.

6. Despite the high cab-to-population ratio of 5.2 cabs per 1,000 population, vis-à-vis 1.5 cabs per 1,000 population in New York, cabdrivers in Singapore drive only about 9 to 10 hours a day, compared with the 18 hours clocked by New York City cabdrivers. See Tan, Christopher, “Will new taxi rules boost availability?” The Straits Times, November 26, 2012.

7. The shift length of these observations is roughly within two standard deviations of the mean.

8. As our sample is only available for the one-month period of August 2010, we could not use a full-month sample to estimate the monthly target. Instead, we use the first three weeks to proxy the monthly target and calibrate the effects using the last-week sample in the models.


References

Altonji, Joseph G., 1986, Intertemporal Substitution in Labor Supply: Evidence from Micro Data, Journal of Political Economy, 94: 176-215.

Browning, Martin, Angus Deaton and Margaret Irish, 1985, A Profitable Approach to Labor Supply and Commodity Demands Over the Life Cycle, Econometrica, 53: 503-543.

Camerer, Colin, Linda Babcock, George Loewenstein, and Richard Thaler, 1997, Labor Supply of New York City Cabdrivers: One Day at a Time, Quarterly Journal of Economics, 112(2): 408-441.

Crawford, Vincent P. and Juanjuan Meng, 2011, New York City Cab Drivers’ Labor Supply Revisited: Reference-Dependent Preferences with Rational Expectations Targets for Hours and Income, American Economic Review, 101: 1912-1932.

Dupas, Pascaline and Jonathan Robinson, 2013, The Daily Grind: Cash Needs, Labor Supply and Self-Control, Working Paper.

Farber, Henry, 2005, Is Tomorrow Another Day? The Labor Supply of New York City Cabdrivers, Journal of Political Economy, 113(1): 46-82.

Farber, Henry, 2008, Reference-Dependent Preferences and Labor Supply: The Case of New York City Taxi Drivers, American Economic Review, 98(3): 1069-1082.

Fehr, Ernst and Lorenz Goette, 2007, Do Workers Work More if Wages are High? Evidence from a Randomized Field Experiment, American Economic Review, 97(1): 298-317.

Goette, Lorenz, Ernst Fehr, and David Huffman, 2004, Loss Aversion and Labor Supply, Journal of the European Economic Association, 2(2-3): 216-228.

Heckman, James J., 1974, Life Cycle Consumption and Labor Supply: An Explanation of the Relationship between Income and Consumption Over the Life Cycle, American Economic Review, 64(1): 188-194.

Lucas, Robert E. Jr and Leonard A. Rapping, 1969, Real Wages, Employment, and Inflation, Journal of Political Economy, 77.

Köszegi, Botond, and Matthew Rabin, 2006, A Model of Reference-Dependent Preferences, Quarterly Journal of Economics, 121(4): 1133-1165.

Mankiw, N. Gregory, Julio J. Rotemberg and Lawrence H. Summers, 1986, Intertemporal Substitution in Macroeconomics, Quarterly Journal of Economics, 100: 225-251.

Oettinger, Gerald S., 1999, An Empirical Analysis of the Daily Labor Supply of Stadium Vendors, Journal of Political Economy, 107(2): 360-392.

Tversky, Amos, and Daniel Kahneman, 1991, Loss Aversion in Riskless Choice: A Reference-Dependent Model, Quarterly Journal of Economics, 106: 1039-1061.

Appendix: Fares and Rates of one of the largest Cab Operators in Singapore

Source: http://www.cdgtaxi.com.sg/commuters_...vn?cid=2214322

Appendix Fares and Rates of one of the largest Cab Operators in Singapore.png

Course Emails

Certificate.png

December 23, 2014

Dear Participant,

In our previous email, the link to the final course survey was incorrect. Please follow this link to take the survey: https://www.surveymonkey.com/s/BigData-NOV2014-Final (My Note: Already taken it.)

We apologize for the inconvenience.

CONGRATULATIONS!

We are pleased to inform you that if you have earned a Certificate of Completion, it is now posted to your student dashboard at https://mitprofessionalx.edx.org/ PDF

Unfortunately, if you have not earned a certificate, it is because you were not able to complete the assessments with an average of 80 percent success rate by the course end date. Please note that no extensions, exceptions, or transfers can be accommodated, as per our course FAQs.

FINAL COURSE SURVEY

Now that the course has come to an end, we would like you to fill out a final course survey to provide us with feedback on your experience with the course. A link to the survey is included here: https://www.surveymonkey.com/s/BigData-NOV2014-Final

Please keep in mind that this is a survey on your experience with this course, similar to the mid-course survey. For those that wish to be eligible for 2.0 CEUs for this course, you must answer all of the questions on the final course survey by January 2, 2015. Please see the requirements for CEU and Certificate of Completion eligibility here.

Thank you for your time, and congratulations on completing the course!

MIT Professional Education
Online X Programs Team

December 16, 2014

Dear Participant,

Tackling the Challenges of Big Data is ending today!

We realize that there has been some confusion around the course end time, and we apologize for this inconvenience. In the courseware, the assessment due date is set to December 17, 05:00 UTC. This time converts to today, December 16, at 11:59 p.m. EST. To see what time this is in your local time zone, please visit this link.

Please note that no extensions, exceptions, or transfers can be accommodated, as per our course FAQs.

Course content (including videos, discussion boards, content, and Wiki) will be accessible for an additional 90 days post program. The site will officially close by March 15, 2015.

Please remember that, per the course Terms of Service, you have agreed that access to the archives and materials is for your own personal use and that downloading of videos or posting/distribution of course assets is prohibited.

FINAL COURSE SURVEY

Now that the course has come to an end, we would like you to fill out a final course survey to provide us with feedback on your experience with the course. A link to the survey is included here: https://www.surveymonkey.com/s/BigData-NOV2014-Final.

Please keep in mind that this is a survey on your experience with this course, similar to the mid-course survey. For those that wish to be eligible for 2.0 CEUs for this course, you must answer all of the questions on the final course survey by January 2, 2015. Please see the requirements for CEU and Certificate of Completion eligibility here.

If you have earned a Certificate of Completion, it will be posted to your student dashboard by December 23, 2014.

A CEU award letter will be emailed to all participants that earn them by January 15, 2015. 

We enjoyed working with you over the last six weeks and thank you for taking on Tackling the Challenges of Big Data!

MIT Professional Education
Online X Programs Team

December 15, 2014

Tomorrow is the last day of the course! Please remember you have until tomorrow at 4:59 UTC to complete all assessments with an 80% pass rate in order to earn a Certificate of Completion.

Week 6 Review

Top Posts

Below are our top posts from week six that the course TAs wanted to highlight for you. Take a few minutes to read them, and consider adding your own contribution if you have not already.

Thank you to everyone who participated in the discussion boards for this course!

  • Discussion Question 5.1: How can you apply the sampling algorithms mentioned in this module to the data that you work with?
  • Discussion Question 5.5: To what sources of data other than Twitter do you think you could apply Professor Barzilay’s NLP techniques?
  • Discussion Question 5.6: Are there any inherent challenges to your dataset which make the analysis difficult?
  • Discussion Question 5.7: What do you think Professor Lo means when he says that it is the high dimensionality of Big Data which makes his machine learning algorithms so potent?

IMPORTANT ANNOUNCEMENTS

MIT Professional Education Certificate of Completion

This course will be ending tomorrow, on December 16th, 2014 at 4:59 UTC. In order to earn an MIT Professional Education Certificate of Completion, you must complete all course assessments with an average of 80 percent success rate by the course end date. There will be no exceptions or extensions. You can check your progress here to see how much you have left to complete.

If you have earned a certificate, it will be posted to your student dashboard by December 23, 2014.

Linkedin Group for MIT Professional Education Alumni

In addition, for those that earn a Certificate of Completion, an invitation will be sent out to join our restricted LinkedIn professional alumni group by December 30, 2014.

Final Course Survey and Continuing Education Units (CEUs)

The final course survey will be distributed tomorrow via email, and will be posted on the Course Info page. Please keep in mind that this is a survey on your experience with this course, similar to the mid-course survey.

In order to earn 2.0 CEUs for this course, you must do the following:

  • Earn a Certificate of Completion
  • Answer all of the questions on the final course survey by January 2, 2015.

A CEU award letter will be emailed to all participants that earn them by January 15, 2015. More details, including a link to the final course survey, will be sent in the final week email and will be posted to the Course info Tab.

90 Days Archived Access

Course content (including videos, discussion boards, content, and Wiki) will be accessible for an additional 90 days post program. The site will officially close by March 15, 2015.

Please remember that, per the course Terms of Service, you have agreed that access to the archives and materials is for your own personal use and that downloading of videos or posting/distribution of course assets is prohibited.

Good luck completing the course!

MIT Professional Education
The Online X Programs Team

December 9, 2014

Congratulations everyone on completing week five of Tackling the Challenges of Big Data! We hope you enjoyed learning more about algorithms and machine learning techniques that can be used to extract patterns, trends, and insights from data.

This is the last official week of the course. Please read the important announcements below the Week 6 Agenda regarding Certificates of Completion, the Final Course survey, the restricted LinkedIn alumni group, and archived access to the course.

WEEK 6 AGENDA

In week six, the following topics will be covered:

MODULE FIVE, PART II: BIG DATA ANALYTICS

  • Machine Learning Tools (Tommi Jaakkola)
  • Case Study: Information Summarization (Regina Barzilay)
  • Applications: Medicine (John Guttag)
  • Applications: Finance (Andrew Lo)

The discussion questions below pertain to all of Module 5, so please feel free to answer them as you work your way through the material for week 6.

DISCUSSION QUESTIONS

  • Question 5.1: How can you apply the sampling algorithms mentioned in this module to the data that you work with?
  • Question 5.2: How are different types of sketches applicable to your work?
  • Question 5.3: In your current work, are there opportunities for data compression with coresets?
  • Question 5.4: Identify a problem that you have that can be solved with machine learning and the tools you will need.
  • Question 5.5: To what sources of data other than Twitter do you think you could apply Professor Barzilay’s NLP techniques?
  • Question 5.6: Are there any inherent challenges to your dataset which make the analysis difficult?
  • Question 5.7: What do you think Professor Lo means when he says that it is the high dimensionality of Big Data which makes his machine learning algorithms so potent?

The lead TA for week six will be Brian Bell, who will be answering your questions and responding to the top upvoted posts within the discussion forum.

We encourage you to continue posting in the discussion forums and networking with one another on the course LinkedIn page.

IMPORTANT ANNOUNCEMENTS

MIT Professional Education Certificate of Completion

This course will be ending in one week, on December 16th, 2014. In order to earn an MIT Professional Education Certificate of Completion, you must complete all course assessments with an average of 80 percent success rate by the course end date. There will be no exceptions or extensions. You can check your progress here to see how much you have left to complete.

If you have earned a certificate, it will be posted to your student dashboard 7 days after the course end date.

Linkedin Group for MIT Professional Education Alumni

In addition, for those that earn a Certificate of Completion, an invitation will be sent out to join our restricted LinkedIn professional alumni group one week after certificates have been posted.

Final Course Survey and Continuing Education Units (CEUs)

The final course survey will be distributed on the final day of the course. Please keep in mind that this is a survey on your experience with this course, similar to the mid-course survey. 

In order to earn 2.0 CEUs for this course, you must do the following:

  • Earn a Certificate of Completion
  • Answer all of the questions on the final course survey by January 2, 2015.

A CEU award letter will be emailed to all participants that earn them by January 15, 2015. More details, including a link to the final course survey, will be sent in the final week email and will be posted to the Course info Tab.

90 Days Archived Access

Course content (including videos, discussion boards, content, and Wiki) will be accessible for an additional 90 days post program. The site will officially close by March 15, 2015.

We welcome your continued thoughts and feedback.

Good luck in Week 6!

MIT Professional Education
The Online X Programs Team

December 2, 2014

Congratulations everyone on completing week four of Tackling the Challenges of Big Data! We hope you enjoyed learning more about additional aspects of managing Big Data, including security, multicore scalability, and user interfaces and visualization.

WEEK 4 REVIEW


TOP POSTS

Below are our top posts from the fourth week that the course TAs wanted to highlight for you. Take a few minutes to read them, and consider adding your own contribution if you have not already.

If you want to network with other professionals, learn more about what is going on in the industry, and learn how to apply the knowledge you have gained from this course, please visit our course LinkedIn group and take a look at the rich conversations taking place. If you have not yet joined the LinkedIn group, the course Wiki provides instructions.

WEEK 5 AGENDA

In week five, the following topics will be covered:


MODULE FIVE, PART I: BIG DATA ANALYTICS

  • Fast Algorithms I (Ronitt Rubinfeld)
  • Fast Algorithms II (Piotr Indyk)
  • Data Compression (Daniela Rus)

Module 5 will be split into 2 parts over the final weeks of the course. The discussion questions below pertain to all of Module 5, so please feel free to answer them as you work your way through the material for weeks 5 and 6.


DISCUSSION QUESTIONS

  • Question 5.1: How can you apply the sampling algorithms mentioned in this module to the data that you work with?
  • Question 5.2: How are different types of sketches applicable to your work?
  • Question 5.3: In your current work, are there opportunities for data compression with coresets?
  • Question 5.4: Identify a problem that you have that can be solved with machine learning and the tools you will need.
  • Question 5.5: To what sources of data other than Twitter do you think you could apply Professor Barzilay’s NLP techniques?
  • Question 5.6: Are there any inherent challenges to your dataset which make the analysis difficult?
  • Question 5.7: What do you think Professor Lo means when he says that it is the high dimensionality of Big Data which makes his machine learning algorithms so potent?

The lead TA for week five will be Brian Bell, who will be answering your questions and responding to the top upvoted posts within the discussion forum.

We welcome your continued thoughts and feedback.

Good luck in Week 5!

MIT Professional Education
The Online X Programs Team

November 25, 2014

Congratulations everyone on completing week three of Tackling the Challenges of Big Data! We hope you enjoyed learning more about data storage processing technologies for Big Data.

MID-COURSE SURVEY
If you have not already, please take a few minutes to fill out a mid-course survey and provide us with feedback on how you think the course is going. To have your feedback counted, please complete the survey by Sunday, November 30 at 4:59 UTC. Completing the survey is not a requirement to earn a Certificate of Completion or CEUs.

Take the survey today: https://www.surveymonkey.com/s/Bigdata-NOV2014

WEEK 3 REVIEW

TOP POSTS

Below are our top posts from the third week that the course TAs wanted to highlight for you:

DISCUSSION QUESTIONS

The TAs have posted discussion questions for modules 3, 4, and 5 that are now pinned to the “General Discussion” board for each of these modules. We encourage you to review these as you finish watching the videos and provide any thoughts that you may have in the appropriate thread.

  • Question 3.1: Do you agree with Professor Stonebraker's opinions?
  • Question 3.2: Which MapReduce features are most important for performance?
  • Question 3.3: For your data processing challenges, which works better, a conventional SQL system, or a system like Hadoop/MapReduce/Spark?

WEEK 4 AGENDA

In week four we will go over:

MODULE FOUR: BIG DATA SYSTEMS

  • Multicore Scalability (Nickolai Zeldovich)
  • Security (Nickolai Zeldovich)
  • User Interfaces for Data (David Karger)

DISCUSSION QUESTIONS

  • Question 4.1: What types of data would you use CryptDB with?
  • Question 4.2: What kinds of programs can benefit from multi-core parallelism?
  • Question 4.3: Find a visualization that is deceptive. Why is it deceptive, and what would you do to correct the deception?

The lead TA for week four will be Manasi Vartak, who will be answering your questions and responding to the top upvoted posts within the discussion forum.

We welcome your continued thoughts and feedback.

Good luck in Week 4!

MIT Professional Education
The Online X Programs Team

November 24, 2014

Week three of Tackling the Challenges of Big Data is almost finished. 

Now that you are about halfway finished with the course, please take a few minutes to fill out a mid-course survey and provide us with feedback on how you think the course is going. A link to the survey is included below.

To have your feedback counted, please complete the survey by Sunday, November 30 at 4:59 UTC. Completing the survey is not a requirement to earn a Certificate of Completion or CEUs.

https://www.surveymonkey.com/s/Bigdata-NOV2014

MIT Professional Education
The Online X Programs Team

November 18, 2014

Congratulations everyone on completing week two of Tackling the Challenges of Big Data! We hope you enjoyed learning more about data collection, integration, and storage.

Some of you have been adding articles to the course Wiki page, and if you have not already done so, we want to encourage you to visit the Wiki, read some of the resources, and contribute your own. The course Wiki is a collaborative page featuring 100+ resource links from faculty, TAs, and your fellow participants. In addition, the PDFs of course slides are available for download, which many of you were asking about this week.

The conversations in the discussion forum are still going strong. Below are our top posts from the second week that the course TAs wanted to highlight for you, as well as a couple of great threads from the first week that you may have missed:

If you have an interesting question, or one that others may benefit from, please remember that posting in the discussion area is the best way to get an answer. Another participant may have the answer, or if your post is upvoted enough times, one of the course TAs may respond with further knowledge. To upvote a post, click on the plus sign (+) at the top right of the post.

1. Participant XavierAnthony posted a great response to optional discussion question #1, about “messy” data.

2. Participant marsuaga asked the TA about other options for tools besides Data Tamer and Wrangler.

3. Participant AshishNarayan started an interesting discussion about Big Data and healthcare.

4. Participant VinayIN began a conversation on big data for small and medium businesses.

Week 3 Agenda

In week three, course material will cover:

MODULE THREE: BIG DATA STORAGE

  • Modern Databases (Michael Stonebraker)
  • Distributed Computing Platforms (Matei Zaharia)
  • NoSQL, NewSQL (Sam Madden)

The lead TA for week three will be Manasi Vartak, who will be answering your questions and responding to the top upvoted posts within the discussion forum.

We welcome your questions, comments, and suggestions as we move forward through the course material.

Good luck in Week 3!

MIT Professional Education
The Online X Programs Team

November 11, 2014

Congratulations everyone on completing week one of Tackling the Challenges of Big Data!

We hope you enjoyed the introduction to Big Data and the case studies that were discussed in the first week. Please remember that while this is a self-paced course, we encourage everyone to stay on schedule with the videos as we believe this will make for more resourceful information on the discussion boards. Please refer to the syllabus and the course schedule.
 
Here are our top 5 posts from the first week that the course TAs wanted to highlight for you. Take a few minutes to read them, and consider adding your own contribution if you have not already.

1.    Participant amgmails posed a question about what makes data “Big Data” – is it volume? What is the threshold?
2.    Participant Max-Haifei-Li started a provocative discussion on whether or not Sam Madden purposefully gave half of the story on the MGH case study.
3.    Participant marsuaga asked about data in GPU memory in the Twitter use case.
4.    Participant ShadadMoradi started a conversation regarding loop detectors and collecting data.
5.    Participant coldskin kicked off a great discussion on whether the 'question' is as important as the analysis itself.

Are you networking with your fellow participants?
We also wanted to encourage everyone to use the Linkedin and Facebook networking groups. There are some interesting conversations happening there, and lots of connections taking place. Also, in case you are interested in some reasons why participants are taking this course, read and contribute to this Linkedin thread on “What Made you Join this course”. 

In order to join the LinkedIn group, please see the instructions on the course Wiki page.
 
Student Directory

If you are interested in sharing your information so that we can get some statistics about the participants, and so you might find more meaningful ways to network, please submit your information here. You can view the submitted information here.
 
Week 2 Agenda

In week two, course material will cover:

MODULE TWO: BIG DATA COLLECTION

  • Data Cleaning and Integration (Michael Stonebraker)
  • Hosted Data Platforms and the Cloud (Matei Zaharia)

The lead TA for week two will be Brian Bell, who will be answering your questions and responding to the top upvoted posts within the discussion forum.
 
We welcome your questions, comments, and suggestions as we move forward through the course material.

Good luck in Week 2!
 
MIT Professional Education
The Online X Programs Team

November 4, 2014

Dear Participant,

Congratulations and welcome to the MIT Professional Education course Tackling the Challenges of Big Data. The course is now open and available from your dashboard! All of the MIT instructors are very excited to have you in the course.

Tackling the Challenges of Big Data is designed to provide you with a broad survey of key technologies and techniques. You will learn about important new developments that will likely shape the commercial Big Data landscape in the future.

Upon successful completion of this course, participants will be equipped with new tools to solve their own Big Data challenges.

To begin, once you login to your dashboard at https://mitprofessionalx.edx.org and click “View Course,” you will be brought to a brief entrance survey, which will be the first required assignment of the course. 

This survey will only take a couple minutes, and your answers will help the course team and faculty better understand how familiar you are with big data concepts and your goals for taking this course, and they will ultimately be a guide to improving your experience and that of future offerings. As soon as you complete the survey, you will be granted access to the videos and content, and can start the course.

Be sure to also check the Course Info tab for important announcements about the course, as well as important course handouts, such as information about discussion forum guidelines, the course TAs, and how to earn a Certificate of Completion and CEUs for this course.

Best wishes and welcome to the course and MIT,

Course Directors,

Professors Daniela Rus and Sam Madden, 

Department of Electrical Engineering & Computer Science

 

   

October 29, 2014 

Dear Participant,

We are now less than a week away from the start of Tackling the Challenges of Big Data! 

On November 04, 2014 at 05:00 UTC, the course will be open and accessible from your dashboard. Once you login at https://mitprofessionalx.edx.org and click “View Course,” you will be brought to a brief entrance survey, which will be the first required assignment of the course. 

This survey will only take a couple minutes, and your answers will help the course team and faculty better understand how familiar you are with big data concepts and your goals for taking this course, and they will ultimately be a guide to improving your experience and that of future offerings. As soon as you complete the survey, you will be granted access to the videos and content, and can start the course.

In the meantime, if you have not already, you can begin the following recommended activities today:

edX demonstration course

If you wish to familiarize yourself with the edX platform, we encourage you to take this demonstration course that was built specifically to help participants become more familiar with using the platform. 

Join the course networking groups

We invite you to join the Tackling the Challenges of Big Data networking groups on Facebook and LinkedIn if you wish to start networking with your fellow participants.

Many of you have already joined and we are pleased to see the conversations that are unfolding.

1. FACEBOOK (https://www.facebook.com/groups/1450017638606503)

Visit the above link and request to “Join Group”. An admin will process your request as soon as possible. We appreciate your patience.

2. LINKEDIN (https://www.linkedin.com/groups?home=&gid=8155708&trk=anet_ug_hm)

Visit the above link and request to “Join Group”. If your LinkedIn Account email is the same as the email you used to register for the course, you should be pre-approved to join and will be automatically added as a member of the group. If your LinkedIn email is different than the one used to register for this course, an admin will process your request and add you as a group member as soon as possible. We appreciate your patience.

As always, we welcome your questions, comments, and suggestions.

MIT Professional Education

The Online X Programs Team

October 28, 2014

Tackling the Challenges of Big Data 

 "Tackling the Challenges of Big Data"

November 4 - Dec 16, 2014

Save 5% when you use code: pefan5

Dear Brand, 

We are one week away from the start of our online professional course "Tackling the Challenges of Big Data". There is still time to register for the November 4 - December 16 offering, but we encourage you to do so soon.

This six-week online course was designed for engineers, technical managers, and entrepreneurs looking to understand new Big Data technologies and concepts to apply in their work, and to gain perspective on current trends and future capabilities. 

Overview

  • Date: Nov 4 - Dec 16, 2014 (also available Feb 3 - Mar 17, 2015)
  • Fee: $545 (save 5% when you register using code: pefan5)
  • Six-week, self-paced online course for professionals
  • 20 hours of video lectures
  • Taught by 12 MIT CSAIL Faculty Instructors
  • Participants who successfully complete all course requirements are eligible to receive a Certificate of Completion and 2.0 CEUs from MIT Professional Education 

CIO Today says:

"The big data course offered by MIT should be required in any enterprise where business users interact with data. Business users crave big data and analytics tools, but without an understanding of what makes data good or bad they may make decisions based on insight that's fallacious. MIT's big data course is an important step for the industry."

Past Participants say:

"This course provided a comprehensive overview of what Big Data really represents, and how the analysis of large data sources may improve operating efficiencies, result in new business opportunities, and improve profit margins. This knowledge will allow me to lead efforts to utilize resources more efficiently."  - 

Norman Yale, Professional Technical Architect, AT&T Corporation, UNITED STATES

"I learned the latest technologies and financial models from both the course content and the discussion forum where I communicated with participants from across the continents. I could apply the knowledge I gained from this course to my projects right away." -  

Satoshi Hashimoto, Account Manager, Coca-Cola Business Services Company, Ltd., JAPAN

"I was working with Big Data previously, testing Big Data use cases with my team of graduate interns, but I was missing some new developments and structured information since I left university 9 years back. Having attended this course, I am now able to remove the gaps, become aware of what is going on in research and academics, and I have better insight into the problems with Big Data. With this certificate, people across departments now recognize me as an SME." - Hemant Kumar, Associate Architect in Advance Analytics and Big Data, IBM Global Services, SINGAPORE

Position yourself in your organization as a vital subject matter expert on major technologies and applications driving the Big Data revolution, and position your company to propel forward and stay competitive in today's market.

Learn more about "Tackling the Challenges of Big Data" >> 

MIT Professional Education Friends Discount 

MIT Professional Education Friends will save 5% on tuition by entering promotional code PEFAN5 on course registration forms. Please share this information with your colleagues and others in your network so they can take advantage of this special savings. 

Sincerely,

MIT Professional Education

Online X Programs Team

October 3, 2014

Welcome from the 

Executive Director of
MIT Professional Education

Bhaskar Pant

November 3, 2014

Dear Participant:  

A warm welcome to the second offering of our MIT Professional Education Online X Program: Tackling the Challenges of Big Data. Thank you for enrolling in what we are confident is an enhanced version of our inaugural course offered this past spring, and the next of what we hope will be many more exciting online courses to come.

Our mission with MIT Professional Education Online X Programs is to provide busy professionals like you the opportunity to advance your knowledge with as little disruption to your work-life balance as possible. We hope that the flexible nature of this course will allow you to gain cutting-edge knowledge about big data at your own pace within the six-week schedule.

Over the duration of the course, we invite you to look out for the mid-term and final course surveys. Your honest feedback will provide us valuable information to assist in the planning of future online courses.

Thank you again for registering for Tackling the Challenges of Big Data. On behalf of the entire team, I wish you much success in this course.

Sincerely,

Bhaskar Pant

Executive Director, MIT Professional Education

The CancerMath.net Web Calculators

Source: PDF

See: http://www.lifemath.net/quantmed/

The Laboratory for Quantitative Medicine
Massachusetts General Hospital/Harvard Medical School
James Michaelson PhD

DATA RESOURCES AND APPLICATIONS

Databases on Patients,
Mathematics of Cancer Outcome, Screening and Treatment
Analysis of Medical Usage and Cost among Cancer Patients
Communication of Cancer Outcome (The CancerMath.net web calculators)

Research Summary

For the past decade, the work of my group has been concerned with building very large databases on patients; using this information to understand disease, especially cancer, and its treatment; using mathematics to answer practical questions about health; and building web-based tools for communicating this information to patients and medical professionals (the CancerMath.net calculators).

Databases

We have built a number of very large databases on patients. Last June, we were tasked by the MGH Cancer Center to build a database on all of the 173,301 Massachusetts General Hospital cancer patients, 167,814 of whom were diagnosed between 1968 and 2010. The database contains 559,921 pathology reports, 575,204 discharge reports, 10,938,444 encounter notes, 304,211 operative reports, 22,009,527 procedure notes, 9,159,232 radiology reports, ~1,700,000 aggregated medical bills, and ~250,000 images. The database contains all-cause survival information from the Social Security Administration Death Master File (which provides information on all deaths of persons issued social security numbers since 1937), and cause-of-death information from the Massachusetts Death Certificate Database (which contains International Classification of Diseases cause-of-death information on 1,984,790 people who died in the state of Massachusetts between 1970 and 2008). The database is linked to the MGH SNAPSHOT gene sequence dataset, thus providing a great wealth of genetic data on a large number of patients.

As far as we are aware, in terms of the total mass of data, this database is the largest source of clinical information on
cancer in the world.

Our group also maintains a great wealth of other data on patients, including the largest multi-institutional database on breast cancer in the world, with data on more than 33,000 breast cancer patients (and all the varieties of information noted above). We have also built a similar multi-institutional database on melanoma patients, and we possess the Van Nuys breast cancer database. We also have our own MS SQL Server re-build of the SEER national database with, for example, ~325,000 breast cancer patients for whom there is full tumor size and survival information, as well as the SEER/Medicare dataset for breast cancer. Finally, we have also built a number of specialized databases, including our database on all MGH cancer patients who were treated by liver resection.

Medical Usage and Cost

Over the past year, we have served on the MGH Cancer Center's lung cancer and colon cancer redesign committees, providing detailed and actionable data on medical use and cost for cancer patients. For example, our analysis of lung cancer patients has revealed that the total cost of treating lung cancer patients at the MGH is ~$59 million/year, but that standardization of care has the potential to save ~$30 million/year.

Studies are underway for generating data on treatment and cost for patients with many types of cancer.

A Mathematical Approach To Cancer Lethality

We have developed a mathematical framework for comprehending cancer lethality: the binary biological model of metastasis. This mathematics has proven to provide a highly specific method for estimating the risk of death for individual breast carcinoma patients, with an accuracy of 1%. It also provides estimates of the risk of death for melanoma, renal cell carcinoma, sarcoma, and head and neck squamous cell carcinomas, and its applicability to survival analysis for other cancers is under study. The same mathematics provides a basis for estimating the risk of node positivity for breast carcinoma and melanoma patients, offers insight into the nature of the events of spread that underlie cancer lethality, and provides a way to address a whole range of questions in oncology.

Our binary-biological mathematics has also formed the core of a computer simulation model of cancer screening,
which has made it possible to derive biologically plausible and testable estimates of the reduction in cancer death that can
be expected from screening various patients at various intervals. This work has been accompanied by a whole range of
studies on the operational details of the usage of breast cancer screening.

Our binary-biological mathematics has been used to create a series of web-based CancerMath.net calculators, which provide patients with breast carcinoma, melanoma, and renal cell carcinoma with information on their likely outcomes. For breast carcinoma, we also provide a CancerMath web calculator that shows the benefit patients can expect from the various adjuvant chemotherapy agents available to them. Analysis of the use of these CancerMath.net calculators has taught us that they are very widely used, being consulted by 1-in-5 USA breast carcinoma patients, 1-in-2 renal cell carcinoma patients, and a large number of melanoma patients. More details can be found below.

CancerMath.net Calculators

For the last decade, my research has concerned the development of a mathematical framework for predicting survival for cancer patients, together with the impact that various treatment choices will have on that outcome, and the creation of very large databases on patients so that this mathematics can be accurate.

One of the applications of this work has been the creation of a series of CancerMath.net web calculators, which appear to have become the most widely used decision aids among cancer patients (www.CancerMath.net). These calculators are used by ~1-in-5 breast carcinoma patients, ~1-in-2 renal cell carcinoma patients, and large numbers of patients with other cancers. This has happened quite spontaneously, without publicizing these tools.

The goal of this work has been to provide patients and their physicians with highly accurate, disinterested information on their survival expectations, together with information on the impact that can be expected from the various treatment options available to them.

This math also has applications in other fields, such as: generating accurate estimates of the benefit of cancer screening;
generating accurate estimates of life expectancy for the insurance industry; generating accurate estimates for legal
professionals of the harm caused (if any) in the detection and treatment of cancer.

We have also created a web-calculator, PreventiveMath.net (www.PreventiveMath.net), which provides individuals with a
list of the class A US Preventive Services Task Force recommendations, prioritized by the benefit that they can expect, so
that people can see the benefit, and choose those steps that will give them the greatest possible extension in life.

The current tools available at CancerMath.net website include:

1) Breast Cancer Outcome Calculator (Provides information on survival expectation [see below for definition*], at the time
of diagnosis, assuming standard of care therapy)

2) Breast Cancer Therapy Calculator (Provides information on survival expectation*, and the impact which various
adjuvant chemotherapy options can be expected to have on that outcome)

3) Breast Cancer Conditional Survival Calculator (Provides information on survival expectation*, assuming standard of
care therapy, for patients who have remained disease free 2-15 years after diagnosis)

4) Breast Cancer Nodal Status Calculator (Provides information on the likelihood of cancer spread to the local lymph
nodes)

5) Breast Cancer Nipple Involvement Calculator (Provides information to the surgeon on the likelihood that nipple-sparing surgery can be carried out without leaving cancer behind)

6) Melanoma Outcome Calculator (Provides information on survival expectation*, at the time of diagnosis, assuming
standard of care therapy) (Also includes Conditional Survival Information)

7) Renal Cell Carcinoma Outcome Calculator (Provides information on survival expectation*, at the time of diagnosis,
assuming standard of care therapy) (Also includes Conditional Survival Information)

8) Colon Cancer Outcome Calculator (Provides information on survival expectation*, at the time of diagnosis, assuming standard of care therapy) (Also includes Conditional Survival Information)

9) Head & Neck Cancer Outcome Calculator (Provides information on survival expectation*, at the time of diagnosis,
assuming standard of care therapy) (Also includes Conditional Survival Information)

10) Sarcoma Outcome Calculator (Provides information on survival expectation*, at the time of diagnosis, assuming standard of care therapy) (Also includes Conditional Survival Information) (completed but not yet posted; viewable at the hidden link: http://www.lifemath.net/cancer/sarco...come/index.php)

Notes

1) * The survival expectation measures provided by the CancerMath calculators are the risk of death, for each of the first 15 years after diagnosis: (a) from cancer; (b) from causes of death other than cancer; and (c) from all causes.

Also provided are the life expectancy with cancer, the life expectancy without cancer, and the reduction in life expectancy that is caused by cancer.

For the therapy calculator, the impact of the various breast cancer adjuvant chemotherapy regimens on these measures is given (a numerical sketch of how these measures relate to one another follows these notes).

2) The outcome calculators also provide Cancer Stage.
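As a rough numerical illustration of how these measures fit together (this is not the CancerMath model; the annual hazards below are invented and the horizon is truncated at 15 years), the following Python sketch tabulates cumulative risk of death by cause and the implied reduction in life expectancy:

# Illustrative only (not the CancerMath model): how 15-year risk-of-death measures
# and the reduction in life expectancy could be tabulated from annual hazards.
h_cancer = [0.03] * 15   # invented annual hazard of death from cancer
h_other  = [0.01] * 15   # invented annual hazard of death from other causes

surv_with, surv_without = 1.0, 1.0   # survival with cancer vs. hypothetical cancer-free survival
risk_cancer = risk_other = 0.0
le_with = le_without = 0.0           # person-years lived over the 15-year window
for hc, ho in zip(h_cancer, h_other):
    risk_cancer += surv_with * hc    # probability of dying of cancer this year
    risk_other  += surv_with * ho    # probability of dying of another cause this year
    surv_with   *= (1 - hc - ho)
    surv_without *= (1 - ho)
    le_with     += surv_with
    le_without  += surv_without

print(f"15-year risk of death: cancer {risk_cancer:.2f}, "
      f"other causes {risk_other:.2f}, all causes {risk_cancer + risk_other:.2f}")
print(f"Reduction in life expectancy over the window: {le_without - le_with:.2f} years")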

References

1. Michaelson, J, Halpern, E, Kopans, D. A Computer Simulation Method For Estimating The Optimal Intervals For
Breast Cancer Screening. Radiology. 212:551-560 1999

2. Michaelson, JS, Kopans, DB, Cady, B. The Breast Cancer Screening Interval is Important. Cancer 2000 88:1282-
1284

3. Michaelson JS Using Information on Breast Cancer Growth, Spread, and Detectability to Find the Best Ways To Use
Screening to Reduce Breast Cancer Death Woman’s Imaging 3:54-57 2001

4. Michaelson JS Satija S Moore R Weber G Garland G Kopans DB, Observations on Invasive Breast Cancers
Diagnosed in a Service Screening and Diagnostic Breast Imaging Program Journal of Woman’s Imaging 3:99-104
2001

5. Michaelson JS Satija S, Moore R Weber G Garland G Phuri, D. Kopans DB The Pattern of Breast Cancer Screening
Utilization and its Consequences CANCER 94:37-43 2002

6. Michaelson JS Silverstein M, Wyatt J Weber G Moore R Kopans DB, Hughes, K. Predicting the survival of patients
with breast carcinoma using tumor size CANCER 95: 713-723 2002

7. Beckett JR, Kotre CJ, Michaelson JS Analysis of benefit:risk ratio and mortality reduction for the UK Breast Screening
Programme. Br J Radiol 76:309-20 2003

8. Michaelson JS Satija S, Moore R Weber G Garland G Kopans DB Estimates of the Breast Cancer Growth Rate and
Sojourn Time from Screening Database Information Journal of Women’s Imaging 5:3-10 2003

9. Michaelson JS Satija S, Moore R Weber G Garland G Kopans DB, Hughes, K. Estimates of the Sizes at which Breast
Cancers Become Detectable on Mammographic and on Clinical Grounds Journal of Women’s Imaging 5:10-19 2003

10. del Carmen MG, Hughes KS, Halpern E, Rafferty E, Kopans D, Parisky YR, Sardi A, Esserman L, Rust S, Michaelson
J Racial differences in mammographic breast density. CANCER 98:590-6 2003

11. Michaelson JS, Satija S, Kopans DB, Moore RA, Silverstein, M, Comegno A, Hughes K, Taghian A, Powell S, Smith,
B Gauging the Impact of Breast Cancer Screening, in Terms of Tumor Size and Death Rate Cancer 98:2114-24 2003

12. Michaelson JS, Silverstein M, Sgroi D, Cheongsiatmoy JA, Taghian A, Powell S, Hughes K, Comegno A, Tanabe KK,
Smith B The effect of tumor size and lymph node status on breast carcinoma lethality. CANCER 98:2133-43 2003

13. Chen Y, Taghian A, Goldberg S, Assaad S, Abi Raad R, Michaelson J, Powell S Influence of margin status and tumor
bed boost dose on local recurrence rate in breast-conserving therapy: does a higher radiation dose to the tumor bed
overcome the effect of close or positive margin status in breast-conserving therapy? Int J Radiat Oncol Biol Phys
57:S358 2003

14. Jagsi R, Powell S, Raad RA, Goldberg S, Michaelson J, Taghian A Loco-regional recurrence rates and prognostic
factors for failure in node-negative patients treated with mastectomy alone: implications for postmastectomy radiation.
Int J Radiat Oncol Biol Phys 57:S128-9 2003

15. Blanchard K, Weissman J, Moy B, Puri D, Kopans D, Kaine E, Moore R, Halpern E, Hughes K, Tanabe K, Smith B, Michaelson J, Mammographic screening: Patterns of use and estimated impact on breast carcinoma survival. Cancer 101, 495-507 2004

16. Ma XJ, Wang Z, Ryan PD, Isakoff SJ, Barmettler A, Fuller A, Muir B, Mohapatra G, Salunga R, Tuggle JT, Tran Y,
Tran D, Tassin A, Amon P, Wang W, Wang W, Enright E, Stecker K, Estepa-Sabal E, Smith B, Younger J, Balis U,
Michaelson J, Bhan A, Habin K, Baer TM, Brugge J, Haber DA, Erlander MG, Sgroi DC. A two-gene expression ratio
predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell. 2004 Jun;5(6):607-16.

17. Jones JL, Hughes KS, Kopans DB, Moore RH, Howard-McNatt M, Hughes SS, Lee NY, Roche CA, Siegel N, Gadd
MA, Smith BL, Michaelson JS. Evaluation of hereditary risk in a mammography population. Clin Breast Cancer. 2005
Apr;6(1):38-44.

18. Colbert, J Bigby JA, Smith D, Moore R, Rafferty E, Georgian-Smith D, D’Alessandro HA, Yeh E, Kopans DB, Halpern
E, Hughes K, Smith BL, Tanabe KK, Michaelson J. The Age at Which Women Begin Mammographic Screening.
CANCER: 101, 1850-1859

19. Blanchard K, Colbert J, Kopans D, Moore R, Halpern E, Hughes K, Tanabe K, Smith BL, Michaelson JS. The Risk of
False Positive Screening Mammograms, as a Function of Screening Usage. RADIOLOGY, 240: 335 – 342 2006

20. Jagsi R, Raad RA, Goldberg S, Sullivan T, Michaelson J, Powell SN, Taghian AG. Locoregional recurrence rates and
prognostic factors for failure in node-negative patients treated with mastectomy: implications for postmastectomy
radiation. Int J Radiat Oncol Biol Phys. 2005 Jul 15;62(4):1035-9.

21. Dominguez FJ, Jones JL, Zabicki K, Smith BL, Gadd MA, Specht M, Kopans DB, Moore RH, Michaelson JS, Hughes
KS. Prevalence of hereditary breast/ovarian carcinoma risk in patients with a personal history of breast or ovarian
carcinoma in a mammography population. Cancer. 2005 Nov 1;104(9):1849-53.

22. Livestro DP, Muzikansky A, Kaine EM, Flotte TJ, Sober AJ, Mihm MC Jr, Michaelson JS, Cosimi AB, Tanabe KK. A
Case-Control Study of Desmoplastic Melanoma J Clin Oncol. 2005 Sep 20;23(27):6739-46.

23. Michaelson JS, Cheongsiatmoy JA Dewey F, Silverstein M, Sgroi D Smith B. Tanabe KK, The Spread of Human
Cancer Cells Occurs with Probabilities Indicative of A Non Genetic Mechanism British Journal of Cancer 93:1244-
1249 2005

24. Zabicki K, Colbert JA, Dominguez FJ, Gadd MA, Hughes KS, Jones JL, Specht MC, Michaelson JS, Smith BL. Breast
cancer diagnosis in women < or = 40 versus 50 to 60 years: increasing size and stage disparity compared with older
women over time. Ann Surg Oncol. 2006 Aug;13(8):1072-7

25. Rusby JE, Brachtel EF, Michaelson JS, Koerner FC, Smith BL. Breast Duct Anatomy in the Human Nipple: Three-
Dimensional Patterns and Clinical Implications Breast Cancer Research and Treatment Jan 2007

26. Livestro DP, Kaine EM, Michaelson JS, Mihm MC, Haluska FC, Muzikansky A, Sober AJ, Tanabe KK, M.D.
Melanoma of the young: differences and similarities with adult melanoma, a case-matched controlled analysis Cancer
Aug 1;110(3):614-24 2007

27. Pawlik TM, Gleisner AL, Bauer TW, Adams RB, Reddy SK, Clary BM, Martin RC, Scoggins CR, Tanabe KK,
Michaelson JS, Kooby DA, Staley CA, Schulick RD, Vauthey JN, Abdalla EK, Curley SA, Choti MA, Elias D. Liver-
Directed Surgery for Metastatic Squamous Cell Carcinoma to the Liver: Results of a Multi-Center Analysis. Ann Surg
Oncol. Jun 6 2007

28. Virani S, Michaelson JS, Hutter MM, Lancaster RT, Warshaw AL, Henderson WG, Khuri SF, Tanabe KK. Morbidity
and mortality after liver resection: results of the patient safety in surgery study. J Am Coll Surg. Jun;204(6):1284-92
2007.

29. Michaelson J, Reducing Delay in the Detection and Treatment of Breast Cancer. Adv Imag Onc In 2007

30. Michaelson J, Mammographic Screening: Impact on Survival in CANCER IMAGING Ed: M.A. Hayat in 2007

31. Dominguez FJ, Golshan M, Black DM, Hughes KS, Gadd MA, Christian R, Lesnikoski B, Specht M, Michaelson JS,
Smith BL Sentinel Node Biopsy is Important in Mastectomy for Ductal Carcinoma in Situ Ann Surgical Oncology
2008 Jan;15(1):268-73.

32. Rusby JE, Kirstein LJ, Brachtel EF, Michaelson JS, Koerner FC, Smith BL Nipple-sparing mastectomy: Lessons from
ex-vivo procedures The Breast Journal. 2008 Sep-Oct;14(5):464-70

33. Rusby JE, Brachtel EF, Taghian AG, Michaelson JS, Koerner FC, Smith BL. Microscopic anatomy within the nipple:
Implications for nipple sparing mastectomy. American Journal of Surgery 2007 Oct;194(4):433-7

34. Murphy CD, Jones JL, Javid SJ, Michaelson JS, Nolan ME, Lipsitz SR, Specht MC, Lesnikoski B, Hughes KS, Gadd
MA, Smith BL,Do Sentinel Node Micrometastases Predict Recurrence Risk in Ductal Carcinoma in Situ and Ductal
Carcinoma in Situ with Microinvasion? American Journal of Surgery Volume 196, Issue 4, Pages 566-568 2008

35. Cady B, Nathan, NR , Michaelson JS, Golshan M , Smith BL , Matched Pair Analyses of Stage IV Breast Cancer With
or Without Resection of Primary Breast Site J Surgical Oncology 2008 Dec;15(12):3384-95 2008

36. Rusby JE, Brachtel EF, Othus M, Michaelson JS, Koerner FC and Smith BL, Development and validation of a model
predictive of occult nipple involvement in women undergoing mastectomy British Journal of Surgery 2008; 95: 1356–
1361

37. Samphao S, Wheeler AJ, Rafferty E, Michaelson JS, Specht MC, Gadd MA, Hughes KS, Smith BL. Diagnosis of
breast cancer in women age 40 and younger: delays in diagnosis result from underuse of genetic testing and breast
imaging. Am J Surg. 2009 Oct;198(4):538-43

38. Michaelson JS,Chen LL, Silverstein M Mihm MV, Jr., Sober AJ, Tanabe KK, Smith BL, Younger J. How Cancer at
The Primary Site And In The Nodes Contributes To The Risk Of Cancer Death CANCER Nov 1;115(21):5095-107
2009

39. Michaelson JS,Chen LL, Silverstein M, Cheongsiatmoy JA, Mihm MV, Jr., Sober AJ, Tanabe KK, Smith BL, Younger
J. Why Cancer at The Primary Site And In The Nodes Contributes To The Risk Of Cancer Death CANCER Nov
1;115(21):5084-94 2009

40. Chen LL, Nolan, M, Silverstein M, Mihm MV, Jr., Sober AJ, Tanabe KK, Smith BL, Younger J., Michaelson JS, The
Impact Of Primary Tumor Size, Nodal Status, And Other Prognostic Factors On The Risk Of Cancer Death CANCER
Nov 1;115(21):5071-83 2009

41. Tanabe KK, Jara S, Michaelson J. Creating and providing predictions of melanoma outcome. Ann Surg Oncol. 2010
Aug;17(8):1981-2.

42. Pandalai PK, Dominguez FJ, Michaelson J, Tanabe KK. Clinical Value of Radiographic Staging in Patients
Diagnosed With AJCC Stage III Melanoma. Ann Surg Oncol. 2011 Feb;18(2):506-13

43. Cady B, Michaelson JS Chung MA, The “Tipping Point” for breast cancer mortality decline has resulted from size
reductions due to mammographic screening Annals of Surgical Oncology in Press 2011

44. Bush D, Smith B, Younger J, Michaelson JS. The non-breast-cancer death rate among breast cancer patients. Breast
Cancer Res Treat. 2010 Oct 7

45. Michaelson JS, Chen L, Bush D, Smith B, Younger J, Improved web-based calculators for predicting breast
carcinoma outcomes. Breast Cancer Res Treat. In Press 2011

46. Emmons KM, Cleghorn D, Tellez T, Greaney ML, Sprunck KM, Bastani R, Battaglia T, Michaelson JS, Puleo E.
Prevalence and implications of multiple cancer screening needs among Hispanic community health center patients.
Cancer Causes Control. 2011 Sep;22(9):1343-9

47. Michaelson JS, Chen LL, Bush D, Fong A, Smith B, Younger J. Improved web-based calculators for predicting breast
carcinoma outcomes. Breast Cancer Res Treat. 2011 Aug;128(3):827-35

48. Barnes JA, Lacasce AS, Feng Y, Toomey CE, Neuberg D, Michaelson JS, Hochberg EP, Abramson JS. Evaluation of
the addition of rituximab to CODOX-M/IVAC for Burkitt's lymphoma: a retrospective analysis. Ann Oncol. 2011
Aug;22(8):1859-64. Epub 2011 Feb 21.

49. Rich, S, Ali S Calkins, J, Michaelson J. Survival trends in childhood hematological malignancies Int. J. Biomath.B 05,
1250053 (2012)

50. Wender R, Fontham ET, Barrera E Jr, Colditz GA, Church TR, Ettinger DS, Etzioni R, Flowers CR, Scott Gazelle G,
Kelsey DK, Lamonte SJ, Michaelson JS, Oeffinger KC, Shih YC, Sullivan DC, Travis W, Walter L, Wolf AM, Brawley
OW, Smith RA American Cancer Society lung cancer screening guidelines.CA Cancer J Clin. 2013 Jan 11.

51. Valsangkar NP, Bush DM, Michaelson JS, Ferrone CR, Wargo JA, Lillemoe KD, Fernández-Del Castillo C, Warshaw
AL, Thayer SP. The Effect of Lymph Node Number on Accurate Survival Prediction in Pancreatic Ductal
Adenocarcinoma. J Gastrointest Surg. 2013 Feb;17(2):257-66.

Laboratory of Quantitative Medicine Technical Reports

(Available at http://www.lifemath.net/cancer/about...orts/index.php)

1. Technical Report #1 - Mathematical Methods (March 9, 2009)

2. Technical Report #2 - Equation Parameters (March 9, 2009)

3. Technical Report #3 - Validation: SizeOnly Equation (June 24, 2008)

4. Technical Report #4 - Validation: Size+Nodes Equation (June 26, 2008)

5. Technical Report #5 - Validation: Size+Nodes+PrognosticFactors Equation (July 3, 2008)

6. Technical Report #6 - Comparisons with AdjuvantOnline (July 7, 2008)

7. Technical Report #7a - Partners Breast Cancer Database (May 12, 2008)

8. Technical Report #7b - SEER Breast Cancer Database (May 12, 2008)

9. Technical Report #8 - How and Why Primary Tumor Size, Nodal Status, and Other Prognostic Factors Contribute to
the Risk of Cancer Death (March 9, 2009)

10. Technical Report #9 - Adjuvant Multi-agent Chemotherapy and Tamoxifen Usage Trends for Breast Cancer in the
United States (March 27, 2009)

11. Technical Report #10 - How the CancerMath.net Breast Cancer Calculators Work (April 6, 2009, Updated Nov 28
2009)

12. Technical Report #11 - Comparative Effectiveness Calculators For Predicting Melanoma Death (August 19, 2009)

13. Technical Report #12 - Accuracy of the CancerMath.net Breast Cancer Calculators over 15 years following diagnosis
(August 27, 2009)

14. Technical Report #12b - Accuracy of the CancerMath.net Breast Cancer Calculators (version 2) over 15 years
following diagnosis (August 29, 2009)

15. Technical Report #13 - Computer Simulation Estimation of the Impact of Various Breast Cancer Screening Intervals in
Women of Various Ages (April 5, 2009)

16. Technical Report #14 - Computer Simulation Estimation of the Benefits and Costs of Breast Cancer Chemoprevention
(April 5, 2009)

Pre-Course Survey

Courseware

Course Survey

You can begin your course as soon as you complete the following form. Required fields are marked with an asterisk (*).


As of today, how would you rate your level of understanding of the following Big Data technologies and concepts:

  Five Big Data technologies and concepts, each rated on a scale of 1 to 10, where 1 is the lowest level of understanding and 10 represents a high level of proficiency.

Welcome to the Course

How To Participate in the Course

One way to participate in this course is to engage with your fellow learners using the discussion forum.

Take a moment to review and become familiar with how to use the discussion forum (My Note: See below) and the discussion forum guidelines (My Note: See below). To discuss the topics presented in each module, including course videos, optional discussion questions, and assessments, go to the "Discussion" tab at the top of the page. There will be one discussion board per module.

You can start by introducing yourself in the discussion forum under the topic titled "Introduce Yourself." Say a little bit about yourself and your reasons for taking this course. My Note: I did this.

How to Use the Discussion Forum

Source: https://mitprofessionalx.edx.org/c4x...sion_forum.pdf (PDF)

Course Title: Tackling the Challenges of Big Data

Meet Your Course TAs!
Brian Bell

Brian is a Master's student in the ALFA Machine Learning Group at MIT CSAIL. His research interests focus on applications of machine learning to challenges within healthcare, such as building effective models from clinical data. He has extensive experience developing classification tools and predictive models for patient EMRs (Electronic Medical Records). Other interests include NLP for clinical notes, distributed machine learning systems, NoSQL databases, and personalized search.

Brian will be the Lead TA for the following modules:

Week 1: Big Data Challenges

Week 5-6: Big Data Analytics 1

Manasi Vartak

Manasi is a PhD student in the Database Group at MIT CSAIL. Her research interests lie at the intersection of databases and machine learning, and she works on new techniques to efficiently query, analyze, and visualize large-scale data. Past projects have included tools to find interesting differences in datasets, a recommender system for item sets, and a comparative analysis of current data management systems for complex analytics. She has worked at Google, Facebook, and Microsoft in areas including dynamic ad targeting and real-time data querying. She is particularly interested in data management and analysis for healthcare applications.

Contact: Web: http://www.mit.edu/~mvartak/ Twitter: @DataCereal

Manasi will be the Lead TA for the following modules:

Week 2: Big Data Collection

Week 3: Big Data Storage

Week 4: Big Data Systems

Participating in the discussion forum is optional, but will enrich your learning experience by giving you the opportunity to interact with a community of professionals from around the globe.

To discuss the topics presented in each module (including videos and optional discussion questions), go to the “Discussion” tab at the top of the page. There will be one discussion board per module. All discussion regarding video sequences, discussion questions posed by course staff, assessments and anything related to the topics in the particular module will be discussed in these boards.

Creating a discussion forum post:

1) In order to create a post in the discussion forum, click on the blue “New Post” button.
2) When adding a new post, you must identify the post as either asking a question or starting a discussion.
3) Select “Topic Area” from the drop-down menu. This is important. If you do not specify, your post will go under the “General” discussion thread and may get overlooked. However, if you have a general topic not related to a specific module in the courseware, this is where to begin these discussions.
4) When titling your post, be as specific as possible as to the subject of your post. For example, just titling a post “Big Data Issue” is not as effective as titling it “Big data challenges in customer relationship management”.
5) Enter text for your post and then hit “submit”.

The TAs will be monitoring the discussion forum daily, and will respond to questions and other posts of note throughout the duration of the course.

For questions and concerns you would like directly addressed by the TAs, please post in the “Ask your TA” thread that will be under the “General” discussion board. TAs will also pay close attention to the posts that get the most upvotes, so if you come across a question or conversation that you find especially interesting or helpful, click the plus sign (+) at the top right of the post.

Discussion and Community Guidelines

Source: https://mitprofessionalx.edx.org/c4x...s_10.10.14.pdf (PDF)

November 2014

Our approach to experiential learning integrates cutting edge theory from world-renowned MIT faculty with real world practice and experience. Our goal is to develop innovative leaders and thinkers who solve complex problems and produce systemic changes in order to transform their teams and organizations.

We will provide a discussion forum within the course platform for all participants. You may use the forum to discuss concepts and problem-solving approaches, reflect, share interesting references, or raise anything else that may be of interest.
The discussion forum is for your benefit – to connect with other professionals around the globe, to create a sense of community, and to enhance your learning experience.

Please review the following guidelines in order to keep this discussion forum easy to navigate for both participants and course staff.
General Guidelines

1. Observe the Honor Code. Participants are encouraged to collaborate with others; however, posting answers to assessments or otherwise engaging in any activity that would falsify or misrepresent your results or the results of others may result in your account being disabled and your course progress removed.
2. Upvote good posts. If you like a post and/or find it helpful, please upvote the post by clicking on the plus sign (+) at the top right of the post. This applies to both questions and answers within the discussion forum. The more votes a post gets, the easier it is to find for participants, and the more likely course staff is to respond.
3. Use the search function. Before posting a question in the forum, click on the magnifying glass on the top right to search whether or not your question has already been asked and/or answered. Having too many threads makes the forum cluttered and hard to navigate, plus a discussion becomes richer the more people are participating in it.
4. Be as specific as possible. Use a descriptive title for your post and give details as to which part of the assessment or video you are referring to, what your questions are, and what, if anything, you have already tried.
5. When commenting or posting in the discussion forum, formulate your opinion clearly and, where applicable, provide references that support your position. Just saying “I agree” or “I disagree” is not as effective in generating a good discussion.

Certificates and CEUs

Source: https://mitprofessionalx.edx.org/c4x...Us_handout.pdf (PDF)

November 2014

Certificate of Completion:
Upon successful completion of this course, a Certificate of Completion will be awarded by MIT Professional Education. Please note that the program is held over six weeks and is entirely asynchronous. Lectures are pre-recorded by faculty so you can follow along whenever and wherever you find it convenient. However, all work must be completed by December 16, 2014.

To ensure that you earn and receive a Certificate of Completion from MIT, you must follow the below steps.

  • Watch all course videos (please note: MIT Professional Education does not track your progress; however, your understanding of all course content is necessary to complete the course assessments).
  • Complete all course assessments with an average success rate of 80 percent by December 16, 2014.

Your Certificate of Completion will be posted to your student dashboard 7 days after the course end date.

CERTIFICATE OF COMPLETION ASSESSMENTS:

  • When answering assessments 0-5, please click the CHECK button beneath each question to be sure that your answer is recorded.
  • If your answer is correct, you will see a green check mark next to it.
  • If your answer is incorrect, you will see a red X (×) next to it.
  • If you are not satisfied with your results, you may review the course materials and resubmit your answer to any of the questions you answered incorrectly to achieve a higher average.
  • Your average can be checked by clicking on the PROGRESS tab and viewing the percentage score of the last red column titled TOTAL.

Continuing Education Units (CEUs):

Participants of Tackling the Challenges of Big Data offered November 4 – December 16, 2014 who successfully complete all course requirements and earn a Certificate of Completion are eligible to receive 2.0 Continuing Education Units (2.0 CEUs). In order to earn CEUs, participants must complete the final course survey/CEU assessment by January 2, 2015. A link to the final course/CEU assessment will be emailed to all participants, and will be posted in the Course Info tab.

CEUs are a nationally recognized means of recording noncredit/non-degree study. They are accepted by many employers, licensing agencies, and professional associations as evidence of a participant’s serious commitment to the development of professional competence. CEUs are based on hours of instruction. For example: one CEU = 10 hours of instruction. CEUs may not be applied toward any MIT undergraduate or graduate level course.

A CEU award letter will be emailed by January 15, 2015 to all participants who earn them.

Technical Assistance and Contacts

If you are experiencing any technical difficulties with this course site, including problems viewing videos or other features, please email: onlinex-techsupport@mit.edu

If you have questions related to the course curriculum, please make use of the course discussion forums.

Introduction: Big Data Challenges

Before we get started, if you are new to using the edX platform, please view this quick tutorial on how to navigate through course videos. Once you feel comfortable, you can move on to the next video in the sequence, where Sam Madden will introduce Big Data challenges.

If you still wish to become more familiar with the edX platform, you can view the entire demonstration course here.

LILA FISHER: Hi, welcome to Edx.
I'm Lila Fisher, an Edx fellow helping to put together these courses.
As you know, our courses are entirely online.
So before we start learning about the subjects that brought you here, let's learn about the tools that you will use to navigate through the course material.
Let's start with what is on your screen right now. You are watching a video of me talking.
You have several tools associated with these videos.
Some of them are standard video buttons, like the play/pause button on the bottom left.
Like most video players, you can see how far you are into this particular video segment and how long the entire video segment is.
Something that you might not be used to is the speed option.
While you are going through the videos, you can speed up or slow down the video player with these buttons.
Go ahead and try that now.
Make me talk faster and slower.
If you ever get frustrated by the pace of speech, you can adjust it this way.
Another great feature is the transcript on the side.
This will follow along with everything that I am saying as I am saying it, so you can read along if you like.
You can also click on any of the words, and you will notice that the video jumps to that word.
The video slider at the bottom of the video will let you navigate through the video quickly.
If you ever find the transcript distracting, you can toggle the captioning button in order to make it go away or reappear.
Now that you know about the video player, I want to point out the sequence navigator.
Right now you're in a lecture sequence, which interweaves many videos and practice exercises.
You can see how far you are in a particular sequence by observing which tab you're on.
You can navigate directly to any video or exercise by clicking on the appropriate tab.
You can also progress to the next element by pressing the Arrow button, or by clicking on the next tab.
Try that now.
The tutorial will continue in the next video.

Assessment 1

1) Which of the following is not one of the "3 Vs" of Big Data?

 

2) In the MGH Cancer Patient Database example given by Professor Madden, the primary explanation for dramatically higher costs in the most expensive patients is:

 

3) According to Professor Madden, the primary Big Data challenge associated with flight data is:

 

Introduction and Use Cases

1.0 Introduction: Use Cases

Discussions will take place in the forum (accessible through the link at the top of the page) rather than in the modules themselves, as the full functionality of the discussion feature — including organization by topic and searchability — is not enabled within the courseware.

1.1 Case Study: Transportation 

DANIELA RUS: Hello, my name is Daniela Rus.
I'm a professor at MIT, and today, I would like to talk to you about data analytics for transportation.
I'm interested in this problem because a few years ago my commute time increased from five minutes on a regular basis to 17 minutes on Sunday morning, and an hour and a half on Friday before a long weekend.
And this is unsustainable.
In fact, for most of us, the demands of transportation in urban settings have been consistently creeping up.
We want more.
We want better.
We want different modalities.
But it is unsustainable to expand the road network or the amount of space available for parking.
In fact, over the past 30 years, we have been spending increasingly more time commuting, whether we live in small cities, medium sized cities, or very large cities.
Our time spent in transportation has doubled and tripled, in many cases.
And this has a lot of negative consequences, loss of productivity, negative impact on the environment, not to mention the amount of stress that it induces in all of us.
So we are excited, because technology can come to the rescue of this very important problem.
Today, we have devices that allow us to collect data about transportation, analyze this data, and then rethink how we plan and how we design our transportation networks.
We have in-car devices that enable vehicles to keep track of where they're going and how much time it takes to go from one
place to the other.
We have cameras installed on the road network that give us a global view of what happens at that particular spot in the network.
We have wireless communication devices that allow us to upload all this data into the cloud, and we can personalize everything using our smartphones.
All these devices enable us to aggregate a lot of data about our transportation systems and process them in the cloud.
We can use this data to enable transportation capabilities that we have never had before.
We can visualize how congestion impacts the amount of time it takes to get home.
We can visualize where people are.
We can see what happens in the case of special events.
For example, when the Red Sox are out playing at Fenway Park.
We can also see what happens when it snows or it rains, and we want to go from one part of the city to the other.
These are all exciting applications that have motivated a case study in transportation we have conducted in Singapore.
Singapore is a small country in Southeast Asia.
It occupies an area of about 30 by 16 miles.
It has 5 million people, most of whom are early adopters of high tech.
It has a very high GDP, and lots of taxis.
Singapore has over 26,000 taxis.
In fact, people tend to move across the country using taxis.
Each taxi in the Singapore system is equipped with an in-car device you can see in this picture.
In our study, we had 16,000 unique taxis available to us, and we collected data from these taxis over a long period of time, many months.
Each data point consisted of the taxi ID, its GPS location, the speed, status, and time stamp.
And this data was logged about once every half a minute.
One month of taxi data from the 16,000 unique taxis yielded about 33 gigabytes of data.
That's a lot of data.
And this data had to undergo several data cleaning steps in order to make it useful to us.
GPS logs data with errors, so we had to detect all those points in the data where GPS placed us in the middle of the Atlantic or the Pacific.
The GPS points did not always line up with roads on the map.
So we had to do extra work in order to ensure that we understood exactly on which road the taxi was located.
However, after all this work, the data we collected enabled a whole suite of really exciting applications.
And to give you a taste of the kinds of things we can do with this data, here's a snapshot of traffic visualization: specifically, how we can visualize traffic volume in Singapore using taxi data.
The maps correspond to the country of Singapore, and the level of congestion is encoded in these maps with the color red at different levels of intensity.
So for instance, we see that at the airport, there is intense red.
That's because there's a lot of traffic at the airport.
In the central business district, there's a fairly high concentration of traffic.
But in other parts of the country, the taxi concentration is not so high.
The traffic is pretty light.
In addition to visualizing what happens in the country, we can also visualize how the vehicles that are on the road to deliver service to customers behave.
So in this chart, we see the number of taxis in the road network for different times of day, and we see in red, the number of taxis that have a person on board, in other words, are delivering service, versus the number of taxis that are driving around empty aiming to find a new customer.
And in looking at this graph, we see a first surprising fact.
We see that the number of empty taxis is approximately the same as the number of occupied (red) taxis in the system.
This suggests great inefficiency in the system, because passengers are waiting in areas that are not serviced by taxis.
So can we use this data in order to figure out where the demand is and to match the demand with the supply?
These are the kinds of things that are enabled by a data driven approach to characterizing traffic.
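
My Note: To make the data-cleaning step concrete, here is a minimal sketch (my own, not from the course) of the first filtering pass in Python. The field names, the input file, and the Singapore bounding box are assumptions for illustration; the real pipeline also performs map-matching to snap points onto the road network, which is not shown.

    import csv

    # Rough bounding box around Singapore (approximate, for illustration only).
    MIN_LAT, MAX_LAT = 1.15, 1.48
    MIN_LON, MAX_LON = 103.6, 104.1

    def clean_taxi_records(path):
        """Yield taxi records whose GPS fix falls inside the Singapore bounding box.

        Each row is assumed to hold: taxi_id, latitude, longitude, speed, status, timestamp.
        """
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                try:
                    lat = float(row["latitude"])
                    lon = float(row["longitude"])
                except (KeyError, ValueError):
                    continue  # drop malformed rows
                # Drop GPS errors that place the taxi far outside the country,
                # e.g. in the middle of the Atlantic or the Pacific.
                if MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON:
                    yield row

    if __name__ == "__main__":
        kept = sum(1 for _ in clean_taxi_records("taxi_log_2010_08.csv"))
        print("records kept:", kept)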

1.2 Case Study: Visualizing Twitter

SAMUEL MADDEN: Hi.
It's me, Sam, again.
In this module, I'm going to introduce you to a tool we've been building at MIT that lets you process big data really, really fast.
It's called MapD.
And what it does is use the graphics processing unit, the GPU, in your computer in order to be able to pore through millions of records in just a few milliseconds.
I'm going to introduce MapD to you in the context of a specific application, Twitter data.
So we're going to use MapD to process and visualize a large quantity of Twitter data very, very quickly.
So through this module, I'm going to introduce you to four key topics.
The first topic is just to give you a little bit of background about what Twitter data is.
I'm going to show you a demonstration of MapD running on Twitter data.
And finally, I'm going to introduce you to some of the other technologies that we'll use in this class to process data very, very quickly.
So the first topic we're going to talk about is, what is Twitter data, and what does it look like?
You may know that there are a lot of tweets that are posted every day.
But you may not have known that, as of October 2013, there are 500 million tweets posted every day.
So 500 million is more than one tweet for every person in the United States.
To me, that's just a staggering number.
In addition to that, of those 500 million, about 8 million of them are geo-coded.
This means that they have position information attached to them, so we can tell where the person was who sent the tweet.
So that geocoding is going to allow MapD to make some really compelling visualizations on top of maps that show what's happening inside of Twitter.
So what is a tweet, and what information do you get from Twitter when you access a tweet?
Well, you probably know that tweets are 140 characters.
But there's lots of other information that's attached to these tweets as well.
So there's this position information, a latitude and longitude.
There's information about the user who posted the tweet, who his or her friends or followers were.
There's a time stamp.
There's information about whether this tweet was a reply to some other tweet or message in the system.
So it's not just the text but lots of other information that we can use to place context on these tweets.
So let me just quickly give you one example.
This is a visualization from MapD itself, and we'll do a bunch more of these visualizations in the next module.
But this particular map is showing all the tweets that contain the word "snow" on October 4, 2013.
So if you look here, you see this big red blob right here in the middle of Denver.
And so it turns out that there was an early snowstorm on October 4, 2013.
And many people on Twitter, of course, posted about it, because it was unusual that there was a snowstorm at this time of year.
So you can see that this is a very nice way to visualize this information.
We can very clearly see the correspondence between this big spike in the word "snow" and the location, that is, Denver.
So this is what MapD is going to allow us to do.
This kind of map-oriented visualization is a really compelling way to access Twitter information.
So Twitter has so many tweets in it that it can be very hard to make sense of the information, especially when you're just presented with a long list of all the tweets that have been posted.
So this is the kind of interface that many of you probably are familiar with when you go to access Twitter.
You just see all the things your friends are saying.
And that's fine if you're trying to look for recent tweets that have been posted.
But it's not a very good way to understand these kinds of macro trends that are happening in the Twitter space.
So I would argue that this kind of mapping visualization is just a much, much more compelling way to access and see this kind of information.
There are some other things that make making sense out of Twitter data challenging and some other things that you could imagine you might like to do when you're applying big data techniques to Twitter.
So one thing you might like to do is to correlate external data sources with your tweets.
So for example, you might like to combine census data and understand how particular words or patterns vary with demographic information or other kinds of attributes that the census tracks.
So I'll show you an example of that in a minute.
Another thing you might like to do is some deep, kind of contextual analysis about tweets.
So for example, you might like to understand what specific product or what specific show a given tweet is referencing.
And this can be very challenging, because tweets are such a sparse data source.
They have so few characters in them that often people use abbreviations.
We need rather sophisticated English language and natural language processing techniques to understand what it is that people really are talking about.
So let's come back to this issue of correlating external data sources with Twitter.
This is a map showing all of the zip codes in the Boston area, color coded by income.
So you can see that there's quite a diversity of incomes in Boston.
Here green are the richer zip codes and red are the poorer zip codes.
You could imagine that you might like to combine this kind of income information with information about certain kinds of key words.
For example, how does the use of the word "Starbucks" vary with the income of the zip code of the person who's posting it?
So using this geocoded Twitter data will allow us to do this.
And then we can do these kinds of regressions across various attributes.
In this particular case, this data is made up.
This isn't a real example.
But this sort of analysis is what MapD is going to allow us to do.
So with that brief introduction to what Twitter is and some of the big data analytics that we would like to apply to it, what I'm going to do next is to give you a demonstration of the MapD system and interface.
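
My Note: The Starbucks/income example above is explicitly made up, so here is an equally hypothetical Python sketch of the join-and-correlate step being described: count geocoded tweets containing a keyword per zip code and pair the counts with median income, ready for a regression. All names and numbers are invented for illustration.

    from collections import Counter

    # Hypothetical inputs: geocoded tweets already mapped to a zip code,
    # and median household income per zip code (e.g., from census data).
    tweets = [
        {"zip": "02139", "text": "Grabbing a Starbucks before class"},
        {"zip": "02139", "text": "snow already?!"},
        {"zip": "02116", "text": "Starbucks line is out the door"},
    ]
    income_by_zip = {"02139": 79000, "02116": 105000}

    def keyword_counts_by_zip(tweets, keyword):
        """Count tweets mentioning the keyword in each zip code."""
        counts = Counter()
        for t in tweets:
            if keyword.lower() in t["text"].lower():
                counts[t["zip"]] += 1
        return counts

    counts = keyword_counts_by_zip(tweets, "starbucks")
    pairs = [(income_by_zip[z], counts.get(z, 0)) for z in income_by_zip]
    print(pairs)  # (median income, keyword tweet count) pairs, ready for a regression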

Assessment 2

Done 100%

Discussion 1

Optional Discussion Question:

Consider your local transportation challenges. What data, policies, and services could urban authorities in your area provide to address them?

To discuss this question with your fellow participants, please go to the module 1 discussion area.

Big Data Collection

2.0 Introduction: Big Data Collection

Overview

By "Big Data Collection" we mean a collection of techniques for ingesting and integrating Big Data, as well as using cloud infrastructure to store it (i.e., "collecting" data into a hosted repository). We will see that Big Data's messy and vast nature makes it inherently different from working with smaller datasets. Professor Mike Stonebreaker will talk about how to tackle the problem of how to ingest and integrate disparate data sources together into a single unified data set that can be accessed and queried. After that, Professor Matei Zaharia will discuss how cloud infrastructure can be used to store and process Big Data.

Goals
  1. To understand the challenges of data integration, including:
    1. Creating a unified schema from multiple data sets, where uniform names and data types are used to describe each concept or entity.
    2. Performing entity resolution, where the same object or entity (e.g., a particular product, customer, or company) referenced in multiple data sets is identified using the same name or identifier in the integrated data set (a minimal sketch appears after the objectives below).
  2. To understand how these problems are different in the era of Big Data;  in particular, how to handle these challenges with hundreds or thousands of data sources and billions of records.
  3. To understand the key concepts of cloud computing, including elasticity and pay-as-you-go, and to understand the economics of cloud computing.
  4. To understand why these concepts are particularly attractive for Big Data, including the ability to quickly harness the power of many computers, and the ability to deal with bursty workloads.
  5. To gain an understanding of the features of popular hosted cloud and Big Data platforms, including Amazon AWS (EC2 and S3), as well as hosted services like Google Big Query.
Objectives: Students should be able to...
  1. Understand the concepts of data integration, including both schema integration and entity resolution.
  2. Understand how Big Data requires a new, statistical approach to these problems, instead of the traditional, manually intensive approach.
  3. Articulate the benefits of cloud computing in a Big Data era.
  4. Understand the different features of popular cloud platforms, and the level at which they operate (e.g., hosted compute, hosted data, hosted service).
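
My Note: As a deliberately simplified illustration of the entity-resolution goal above, the sketch below normalizes company names before grouping, so that records such as "IBM Incorporated" and "IBM S.A." resolve to the same key. Production systems use statistical matching over hundreds or thousands of sources; the suffix list and sample records here are my own assumptions.

    import re

    # Corporate suffixes to strip when building a matching key (assumed list).
    SUFFIXES = {"inc", "incorporated", "corp", "corporation", "sa", "ltd", "llc"}

    def company_key(name):
        """Build a crude matching key: lowercase, drop punctuation and legal suffixes."""
        tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower().replace(".", "")).split()
        return " ".join(t for t in tokens if t not in SUFFIXES)

    def resolve(records):
        """Group records that appear to refer to the same company."""
        groups = {}
        for rec in records:
            groups.setdefault(company_key(rec["customer"]), []).append(rec)
        return groups

    sales = [
        {"customer": "IBM Incorporated", "amount_usd": 100_000},
        {"customer": "IBM S.A.", "amount_usd": 120_000},
        {"customer": "Widgets Ltd.", "amount_usd": 5_000},
    ]
    for key, recs in resolve(sales).items():
        print(key, "->", len(recs), "record(s)")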

2.1 Data Cleaning and Integration

MIKE STONEBRAKER: I'm Mike Stonebraker.
I'm on the faculty here at MIT, and I'm here today to talk to you about data integration.
And this is a really complicated topic.
I will call it data curation, because what you really need to do is ingest data from a data source--usually not written by your team--validate it-- make sure it's correct.
Often, you need to make transformations.
We'll talk about that some more in a minute.
Data is invariably dirty, and so you need to clean it-- i.e. correct it--and then you need to consolidate it with any other data sources that you have on-site.
And often, you want to look at the data when you get done.
So this whole process is called data curation.
Where did this come from?
The roots of data integration and data curation go back to data warehouses: the retail sector--meaning people like Kmart, Walmart, those guys--pioneered data warehouses in the early 1990s.
And the whole idea was they wanted to get consolidated sales data into a data warehouse, and what you wanted to do was be able to let your business intelligence guys-- namely your buyers--make better buying decisions.
So the whole idea was to figure out that pet rocks are out, and Barbie dolls are in.
So you want to send the pet rocks back to the factory, or move them up front and put them on sale, tie up the manufacturer of Barbie dolls so that nobody else can get any.
So that was buying decisions were the focus of early data warehouses.
The average system that got built in the '90s was 2x over budget and two times late, and it was all because of data integration issues.
So these were big headaches in the '90s.
However, data warehouses were a huge success.
The average data warehouse paid for itself within six months with smarter buying decisions.
So then, of course, what happened was it's a small world, and essentially everybody else piled on and built data warehouses for their customer facing data, which is products, customers, sales, all that stuff.
This generated what's come to be called the extract, transform, and load business.
So ETL tools serviced the data warehouse market.
So the traditional wisdom that dates from the 1990s is that if you want to construct one of these consolidated data warehouses, you send a very smart human off to think about things, and you define the schema, which is the way the data is going to look in the data warehouse.
Then, you assign a programmer to go out and take a close look at each data source, and he has to understand it, figure out what the various fields are, what they mean, how it's formatted, and write the transformations from the local schema--whatever the local data looks like--to whatever you want the global data to look like in the warehouse.
He has to write cleaning routines, and then you have to run this workflow ETL.
So there's a human with a few tools to help him out, and this scales to maybe 10 data sources or maybe 20, because it's very human intensive.
So this is traditional ETL and where it came from.
So the architecture is pretty straightforward.
You have a collection of data sources.
You have this ETL system, which is downstream from all your data sources and upstream from your data warehouse, and it's doing data curation, run by a human, and it's a bunch of tools.
So let's just take a little look at what some of the issues are.
Why is this stuff hard?
Why was this 2x over budget?
Well, suppose you're somebody who sells stuff.
So you sell widgets, and you have a European and a US subsidiary.
So one of your sales records is your US subsidiary sold 100k of widgets to IBM Incorporated, and your European subsidiary sold 800k Euros of m-widgets to IBM SA.
So of course the issues are you've got to translate currencies, because Euros aren't directly comparable to dollars.
And then you have the thornier questions which is, is IBM SA the same thing as IBM Incorporated?
Yes or no.
And then, are m-widgets the same thing as widgets?
So these semantic, thorny difficulties and transformations, and then, of course, this data isn't dirty, so if it was dirty, you'd have to clean it.
So this stuff is just hard.
And so what do the traditional ETL tools do to help you?
Well, first of all, they often give you a visual picture of the source schema and the schema you're aiming for, and they give you a line drawing tool so that you can say, here's the thing in the local schema.
Here's an attribute in the local schema, map it to this other attribute in the global schema.
So you basically get line drawing tools that allow you to line up the attributes, so that helps you a bunch.
But then you have to do transformations, so all the ETL tools give you a scripting language.
Think of it as Python, if you want.
And what you do is that you write Python scripts that convert IBM SA to IBM Incorporated, if that's what you need to do.
And as often as not, the ETL vendors give you a workflow so that you can define modules, line them up with boxes and arrows, and every box is some code written by a programmer.
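
My Note: Professor Stonebraker's widget example boils down to exactly the kind of small transformation script the ETL scripting languages are meant for. Here is a hedged Python sketch; the exchange rate, field names, and mapping tables are assumptions for illustration, and in practice they would be maintained alongside the cleaning rules.

    # Assumed static exchange rate for illustration; a real ETL job would look this up.
    EUR_TO_USD = 1.10

    # Hand-written mappings from local names to the names used in the warehouse.
    CUSTOMER_MAP = {"IBM SA": "IBM Incorporated"}
    PRODUCT_MAP = {"m-widgets": "widgets"}

    def to_warehouse(record):
        """Transform one local sales record into the global warehouse schema."""
        amount = record["amount"]
        if record["currency"] == "EUR":
            amount = round(amount * EUR_TO_USD, 2)  # convert Euros to dollars
        return {
            "customer": CUSTOMER_MAP.get(record["customer"], record["customer"]),
            "product": PRODUCT_MAP.get(record["product"], record["product"]),
            "amount_usd": amount,
        }

    us_sale = {"customer": "IBM Incorporated", "product": "widgets", "amount": 100_000, "currency": "USD"}
    eu_sale = {"customer": "IBM SA", "product": "m-widgets", "amount": 800_000, "currency": "EUR"}
    print([to_warehouse(r) for r in (us_sale, eu_sale)])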

2.2 Hosted Data Platforms and the Cloud

MATEI ZAHARIA: Hi.
I'm Matei Zaharia.
I've recently joined MIT as an assistant professor working in large scale systems in big data.
And today, I'm going to talk about hosted data platforms and cloud computing.
So what is cloud computing?
Cloud computing has been in the press a lot lately, and there are many different definitions floating around and many services that claim to be clouds.
But at a high level, what cloud computing means is computing resources that are available on demand.
By computing resources, I mean a wide range of resources.
They can be just storage or computing cycles.
But they can also be higher level software built on top.
So for example, several providers today offer databases as a hosted service.
So the provider will set up and manage a database for you and do all the administration.
And as a user, you directly see and use a database.
Now the "on demand" aspect means that resources are fast to set up and fast to give away or tear down.
And it means that pricing is usually pay as you go.
What I mean by that is that you pay at a small granularity.
If you only use your hosted database for a day, you only pay for a day.
And often with these services, you pay at a granularity as small as an hour.
For big data workloads, in particular, clouds can be attractive for several reasons.
So the first one is that clouds provide easy access to large scale infrastructure that would otherwise be very hard to set up and operate in-house.
For example, if you need to store 100 terabytes of data, or if you need to launch 100 servers to do a computation, this can be quite a bit of setup, quite a bit of administrative overhead to do in-house.
But with a cloud, it's possible to pay with your credit card and have access to these resources right away and start trying out a computation and seeing if it would be useful.
The second reason that clouds are attractive for big data is that big data workloads are often bursty.
And so they benefit quite a bit from the pay-as-you-go model.
What I mean by that is, when you collect a large amount of data, you're not usually just continuously doing computations on it.
Instead, maybe once in a while, you have a large computation that you want done.
And with the cloud model, you can acquire a bunch of computing resources for just that computation, and then give them away and only pay for the time that you use them.
So because of these properties, cloud computing has seen a major growth in the past few years, in actually almost all software domains, and certainly in big data.
Let's talk about some examples of cloud services.
Now, cloud services actually exist at multiple different levels.
At the lowest level, you have just raw storage or computing cycles and a variety of services that offer that today.
The most well-known are Amazon S3, or simple storage service, and EC2, elastic compute cloud, which lets you store bytes and launch virtual servers to do computations.
But since Amazon began these services, quite a few other providers are offering similar ones as well, including Google's Compute Engine, Windows Azure from Microsoft, and Rackspace.
At the next level up, you have hosted services that provide a higher level piece of software and just host and manage it for you.
And a great example of that is Amazon's Relational Database Service, or RDS, which offers hosted versions of database software, including, for example, MySQL or Oracle.
Now normally, managing and hosting a database isn't entirely easy.
You need to make sure the database is up and running, highly available.
You need to make sure that you're setting up and taking backups.
And you may even want to replicate the data across data centers for disaster recovery in case one of the data centers goes down.
With the database hosted on Amazon, Amazon manages all these properties for you.
And you can go ahead and use a database.
Other examples of this include Google's BigQuery, which lets you run SQL queries on Google's distributed infrastructure, and Amazon Redshift, which is a hosted analytical database similar to the analytical databases you'll see in other parts of this course.
And finally, at the highest level, you have entire hosted applications which are directly accessed by end users.
So one example of that is Salesforce, which provides a variety of enterprise software, such as customer relationship management software, hosted so that business users can directly use it.
Splunk is software for analyzing log files from servers.
And there's a version of Splunk that runs in the cloud so that you just point logs from your servers to it, and then Splunk will analyze them and show you interesting graphs and interesting things that are going on.
And finally, Tableau is a visualization software that lets you quickly explore and slice through data.
And there's a hosted version of Tableau that is managed by the Tableau company itself.

My Note: Spotfire does this better in my opinion!
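
My Note: The pay-as-you-go argument is easy to check with back-of-the-envelope arithmetic. The hourly price and hardware cost below are assumptions purely for illustration, not actual provider pricing, and the comparison ignores staffing, power, and data-transfer costs; the point is only that a bursty workload pays for the hours it uses rather than for idle capacity.

    # Hypothetical numbers for illustration only.
    hourly_price = 0.50          # assumed on-demand price per server-hour
    servers = 100                # burst size for the occasional big computation
    hours_per_run = 4            # length of each computation
    runs_per_month = 2           # how bursty the workload is

    cloud_monthly = hourly_price * servers * hours_per_run * runs_per_month
    print("cloud, pay-as-you-go:", cloud_monthly, "per month")      # 400.0

    # Owning the same 100 servers (assumed $3,000 each, amortized over 3 years),
    # whether or not they sit idle between runs:
    owned_monthly = servers * 3000 / 36
    print("owned hardware:", round(owned_monthly, 2), "per month")  # 8333.33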

Assessment 3

Done 100%

Discussion 2

Optional Discussion Question:

Think back to the data that you usually come across or work with. Is it messy? Have you had to combine multiple datasets together? How did you perform entity resolution and correct missing/wrong values?  What sort of tools would have made this job easier?

To discuss this question with your fellow participants, please go to the module 2 discussion area.

Big Data Storage

3.0 Introduction to Big Data Storage

Overview

In this module, you will learn about new data storage processing technologies that have been developed to process Big Data. Specifically, we will discuss:

  1. Trends in relational database systems, including column-oriented and main-memory systems, which offer much greater scalability and performance than first-generation relational systems, capabilities that are needed in the era of Big Data.
  2. Hadoop, MapReduce, Spark, and other scalable computation platforms, their limitations, and recent developments aimed at making them perform better.
  3. So-called "NoSQL" and "NewSQL" systems, which offer different interfaces, consistency, and performance than traditional relational systems.

This module begins with an overview of a number of these technologies by renowned database professor Mike Stonebraker. In his unique and ardent fashion, Mike expresses his skepticism about many new technologies, particularly Hadoop/MapReduce and NoSQL, and voices support for many new relational technologies, including column stores and main memory databases.

After that, Professors Matei Zaharia and Samuel Madden provide a more nuanced view of the tradeoffs between the various approaches, discussing Hadoop and its derivatives, as well as NoSQL and its tradeoffs, in more detail.

Goals
  1. Survey how data storage technology has evolved in the last decade, from conventional single-node relational databases to a plethora of storage technologies, including modern analytical databases, Hadoop, and NoSQL.
  2. Highlight the storage challenges that Big Data presents, particularly as it relates to new analytical needs on that stored data, from processing massive-scale SQL aggregates to querying arrays to processing graphs.
  3. Present the tradeoffs between relational, NoSQL, and other technologies, particularly as they relate to storing and accessing very large volumes of data.
  4. Describe what a column-oriented database is and how it works.
  5. Describe how the MapReduce (Hadoop) framework works and what its limitations are.
  6. Understand how next generation "NewSQL" databases, like the H-Store system, can provide very fast SQL queries over data sets that fit into memory.
Objectives: Students should be able to...
  1. Understand the differences between row- and column-oriented relational databases (see the sketch after this list).
  2. Understand the differences between transactional and analytical database workloads.
  3. Describe the MapReduce abstraction and its limitations, especially when processing repetitive operations over data that fits in memory.
  4. Articulate how the new Spark framework addresses these limitations.
  5. Understand ACID semantics and related reduced-consistency concepts like eventual consistency.
  6. Understand how SQL systems can be modified to provide very high performance on main memory workloads without giving up ACID properties.
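
My Note: A toy illustration of the row- versus column-oriented distinction referenced in the objectives above: the same three records stored both ways, and an analytical aggregate that only has to touch a single column in the columnar layout. This is my own sketch, not course material; real column stores add compression and vectorized execution on top of this idea.

    # Row-oriented layout: each record stored together (good for fetching whole records).
    rows = [
        {"id": 1, "name": "alice", "balance": 120.0},
        {"id": 2, "name": "bob", "balance": 75.5},
        {"id": 3, "name": "carol", "balance": 310.2},
    ]

    # Column-oriented layout: each attribute stored together (good for analytical scans).
    columns = {
        "id": [1, 2, 3],
        "name": ["alice", "bob", "carol"],
        "balance": [120.0, 75.5, 310.2],
    }

    # An aggregate over one attribute has to walk every field of every row here...
    total_row_store = sum(r["balance"] for r in rows)
    # ...but only one contiguous array here, which is why column stores win for analytics.
    total_column_store = sum(columns["balance"])
    assert total_row_store == total_column_store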

3.1 Modern Databases

MICHAEL STONEBRAKER: Hi, I'm Mike Stonebraker.
I'm on the faculty here at MIT.
I'm delighted to be with you today.
And I'm going to talk about modern databases.
So the first thing I'm going to do is give you an introduction, which is going to be a history lesson of where this all came from.
Since I have a lot of gray hair, I was around at the time.
So the world pretty much started in the 1970s.
It started with a pioneering paper by Ted Codd in CACM that, to a first approximation, invented the modern relational database model.
During the '70s, there were some prototypes built.
But the next significant thing happened in 1984, when IBM released a system called DB2, which is still sold today.
And by that one act, they declared that relational database systems were a mainstream technology and that they were behind them.
During the remainder of the '80s, relational database systems gained market traction.
And in the '90s, they basically took over.
All new database implementations were basically using relational systems by the 1990s.
And a concept evolved called "One-size fits all", that relational database systems were the universal answer to all database problems.
Put differently, your relational database salesman was the guy with the hammer, and everything looked like a nail.
So that was pretty much true.
Relational systems were the answer to whatever data management question you had.
And that was pretty much true until the mid 2000s.
And if you want to pick one event, I wrote a paper that appeared in the ICDE proceedings.
At the time, I claimed that one-size did not fit all, that the days of universal relational database system implementations were over, and they would have to coexist with a bunch of other stuff.
Now it's seven years later, and I'm here today to tell you that one-size fits none, that the implementations that you've come to know and love are not good at anything anymore.
So before I tell you why, I've got to tell you what it is that I'm complaining about.
So unless you squint, all the major vendors are selling you roughly the same thing.
They're selling you a disk-based SQL-oriented database system, in which data is stored on disk blocks.
Disk blocks are heavily encoded for a variety of technical reasons.
To improve performance, there is a main memory buffer pool of blocks that sit in main memory.
It's up to the database system to move blocks back and forth between disk and the main memory cache.
SQL is the way you express interactions.
There's a bunch of magic that parses and produces query plans.
These query plans figure out the best way to execute the command you gave them by optimizing CPU and I/O.
And the fundamental operation, the inner loop of those query plans, is to read and or update a row out of a table.
Indexing is by the omnipresent B-tree, which has taken over as the way to do indexing.
All the systems provide concurrency control and crash recovery, so-called ACID in the lingo.
To do that, they use dynamic row-level locking and a write-ahead logging system called ARIES, invented by a guy named C. Mohan at IBM.
Most systems these days support replication.
The way, essentially, all of the major vendors work is that they declare one node to be the primary.
They update that node first, then they move the log over the network and roll the log forward at a backup site, and bring the backup site up to date.
So that's what I mean by "the implementations of the current major vendors", who I will affectionately call "the elephants." You are almost certainly using "elephant" technology.
And the thing to note is that all of the vendors' implementations date from the 1980s.
So they are 30-year-old technology.
You would call them legacy systems these days.
So you're buying legacy systems from the major vendors.
And my thesis is they are currently not good at anything, which is to say, whatever data management problem you have, there is a much better way to do it with some other, more recently developed technology.
So the major vendors suffer from The Innovator's Dilemma.
That's a fabulous book written by Clayton Christensen who's on the faculty at the Harvard Business School.
It basically says, if you're selling the old stuff, and technology changes to where the new stuff is the right way to do things, it's really
difficult for you to morph from the old stuff to the new stuff successfully without losing market share.
But in any case, the vendors suffer from the innovator's dilemma.
That keeps them from moving quickly to adopt a new technology.
And therefore, their current stuff is tired and old, and deserves to be sent to the home for tired software.
So the rest of this module is going to be my explanation of why the "elephant" systems are not good at anything anymore.
So I will claim that there's three major database markets, so I will segment the market into three pieces.
About one-third data warehouses, about one-third transaction processing, so-called OLTP, and one-third everything else.
So I'll talk about each of these markets and how you can beat the "elephants" implementations by a couple orders of magnitude in every single one of them.
And then I will have some conclusions at the end to leave you with.

3.2 Distributed Computing Platforms

MATEI ZAHARIA: Hi, I'm Matei Zaharia.
I'm starting an assistant professorship here at MIT.
And today I'm going to talk about distributed computing platforms to run general computations on large clusters.
So let's start with the motivation for distributed computing.
Today, in many domains, large data sets are inexpensive to collect and store, but unfortunately processing these data sets is requiring higher and higher degrees of parallelism.
Just as an example, one terabyte of disk space today costs about $50 if you just go out and buy a terabyte disk.
So it's very inexpensive to collect large amounts of data if you have an application that generates them.
However, reading through this one terabyte disk end to end takes about six hours.
To do anything quickly with these large data sets we need to parallelize them.
And for example, even if we parallelize the one terabyte of data across 1,000 disks, it would still take 20 seconds to read.
It's actually still not exactly interactive.
Now in other parts of this course, you've seen work on speeding up query processing.
But in fact, not just query processing but all computations that we do on data will need this scale.
For example, loading the data, transforming it, and indexing it into a format that enables fast queries is also a large-scale computing problem and needs to be done in parallel.
In the same way, complex analytics functions that might require custom code will also need to be parallelized over large clusters.
So in this lecture, we're going to talk about how can we program these large clusters.
Just as an example, here's a picture of a Google data center that Google put out this year.
And you can see in there it has thousands of computers organized into these racks.
There's actually a fairly detailed network topology that dictates which computers can easily talk to other ones.
And applications need to spread out across this whole data center to run their computations.
So Google's internal applications regularly use over 1,000 nodes just for one application, and these same kinds of data centers are available externally as a cloud.
Now traditional network programming in the past has been based on message passing between nodes.
This is a very natural way to write distributed programs.
You can have a program on each node that sends messages to other ones.
Unfortunately, message passing is very difficult to do at large scale due to several problems.
The first problem is how to split the computation across nodes.
The way we split the computation has to consider the network topology as well as data placement throughout the data center because otherwise moving excessive amounts of data over the network will be too slow.
The second problem is how to deal with failures.
Typically if you have a single server that you're running computations on, failures are a very uncommon event.
For example, an average server today might fail once every three years.
However when you put together more nodes, the probability of a failure happening in a small time interval increases.
So for example, if you put 10,000 of these same nodes together, you start to see ten faults per day.
And so any long-running computation needs to handle failures within the computers running it.
Finally, the third problem that arises as you put more and more nodes together is stragglers.
This is a case when a node hasn't outright failed but it's simply going slower than the other nodes.
And if your goal was to parallelize the computation across 1,000 nodes and 900 of those nodes finish but 100 remain slow, then the computation as a whole will still be going slowly.
So as a result of these problems, almost nobody uses the message passing model directly to write computations on large clusters.
Instead a wide array of new programming models have been developed.
In this lecture, we're going to talk about one of the more popular classes of parallel computing models designed to solve this problem, which is data-parallel models.
In data-parallel models the programmer simply specifies an operation that they want the system to run on all the data.
And the programmer doesn't care where exactly the operation runs.
The system gets to schedule that.
And in fact, it's even possible for the system to run parts of the operation twice on different nodes to do things like
recovering from a failure.
The main example of such a model in use today is MapReduce which is the model that was introduced by Google and popularized by the open source Hadoop platform.
There's quite a bit of software in this space.
Even the open-source software for handling these kinds of computations offers a lot of diversity.
So these names here on the slide are some of the names you might see associated with big data processing.
I'm just going to cover these systems a little bit in turn and group them into classes.
So first of all, at the top left we have Hadoop.
Hadoop is the open source implementation of Google's MapReduce model.
And Hadoop has also been used to build higher-level systems on top that compile down to MapReduce, including Hive, which runs SQL queries, and Pig.
We're going to cover these in more detail later in the lecture.
Next there are some generalizations of MapReduce which are able to express more types of computations efficiently.
And one of the main ones is Microsoft's Dryad model, and then systems built on top of that, like DryadLINQ.
Another system that generalizes Dryad further is Spark, which also supports in-memory data sharing between computations.
And this has also been used to build systems on top, such as Shark, which does large-scale SQL queries.
For specific domains of computations there are also more specialized systems.
For example, after developing MapReduce, Google developed the Pregel computation model for large-scale graph processing.
And there are several different open-source implementations that either follow this model or extend it, including Giraph and GraphLab.
For large-scale, interactive SQL queries, Google developed Dremel, which is a large-scale parallel SQL tool.
And there are open-source systems such as Impala and Tez that implement similar models.
Finally, for real-time stream processing, systems like Storm and Samza follow a different programming model from the batch systems you'll see, but it is still a data-parallel model.
These systems are used for a wide variety of applications.
First and most importantly, in most types of organizations these general-purpose platforms are used for extract, transform, and load--or ETL--workloads, which means taking data that arrives in a potentially unstructured format or in arbitrary external formats, such as, say, log files or JSON records, and transforming it into forms that enable faster queries, for example, by indexing it.
Google's MapReduce was first developed to support its web indexing pipeline and similar systems are used at Yahoo, Microsoft, and other large-scale search engines.
Yahoo uses Hadoop, among other things, for spam filtering and Yahoo Mail.
Netflix uses it for product recommendation, to figure out which movies you are likely to want to watch.
Organizations like Facebook use it for ad hoc queries, where maybe there wasn't time to index the data into the right format to support the query and they want to go back to the raw data.
And finally, financial organizations are using these systems to do fraud detection with large-scale data.
So in the rest of this lecture, I'm going to start by talking about the challenges in large-scale computing environments.
And I'll cover the MapReduce model and show you some examples of how it can be used.
After that, I'll talk about limitations and extensions of MapReduce.
And I'll talk about a few other types of platforms that have been developed to tackle the large-scale programming challenge.
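
My Note: The canonical example of the MapReduce model discussed in this lecture is word count. The sketch below simulates the map, shuffle, and reduce phases in plain Python on a single machine; on Hadoop the same map and reduce functions would run in parallel across the cluster, with the framework handling scheduling, shuffling, and failure recovery.

    from collections import defaultdict

    def map_phase(document):
        """Map: emit a (word, 1) pair for every word in the input."""
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(word, counts):
        """Reduce: sum all counts emitted for a single word."""
        return word, sum(counts)

    def mapreduce_wordcount(documents):
        # Shuffle: group the intermediate (key, value) pairs by key.
        grouped = defaultdict(list)
        for doc in documents:
            for word, one in map_phase(doc):
                grouped[word].append(one)
        return dict(reduce_phase(w, c) for w, c in grouped.items())

    docs = ["big data big clusters", "data parallel models"]
    print(mapreduce_wordcount(docs))  # {'big': 2, 'data': 2, 'clusters': 1, ...}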

3.3 NoSQL NewSQL

SAMUEL MADDEN: Hi, it's me, Sam, again.
I'm going to talk to you in this module about a very important topic in big data, namely, how do we store that data?
I'm going to start off by doing a quick survey of the traditional way that people have stored their data for their applications, namely in database systems.
I'll talk about the properties that traditional database systems provide, and then I'll talk a little bit about how those properties aren't a great fit for some applications in the big data era.
Before I do that, I just want to start with a little bit of terminology.
I want to define two terms, transactions and analytics.
A transaction is a database operation that fetches or updates a small piece of information inside of the database.
Transactions are a really important workload that database systems have to support, and they're used very widely in things like web applications.
So, for example, in your bank, you would be running a transaction every time you logged in and accessed your bank account.
The transaction would go read your bank account balance.
Another transaction might transfer some money from bank account A to bank account B, so on and so forth.
So in a transactional system you have lots of users concurrently using the system, each running some set of transactions on their behalf in order to fetch or manipulate their data.
OK, let's contrast transactions with analytics.
So in an analytic database what you're trying to do is to access or read a large number of historical records.
So unlike a transaction, which only looks at one record at a time, an analytical workload might look at, for example, the entire history of banking transactions, in order to analyze how much money has been transferred over some period of time, or to understand what investments a set of customers are currently making.
So in this module, we're mostly going to focus on this transactional workload, and then in future modules we'll talk a lot more about analytics.
All right, so let's understand a little bit about some of the properties that these conventional transactional database systems provide.
They provide a number of really powerful features.
So the first one is what I call record-oriented persistent storage.
So by a record, I mean that instead of storing data in one large collection--like a file where, for example, all the bank accounts might just be stored as a collection of uninterpretable bytes--in a record-oriented system each bank account balance would be stored separately and independently.
And the system would provide the ability to access or manipulate those records one at a time.
In addition, the database system typically requires and enforces that every record inside of a particular collection--these collections are sometimes called tables--conforms to what we call a schema.
So a schema just describes the structure of each of these records.
So for example, in a bank account table each record might correspond to a particular customer; it might represent that customer's balance, maybe something about their interest rate, and so on and so forth.
Third, these database systems provide a query language.
A query language is just a way to access the data and manipulate it.
The most common query language in the world is SQL (often pronounced "sequel"), and it is used in almost every database system in existence.
SQL queries are sort of a high level way of describing what data you would like to fetch from the database.
So here's a very simple query that fetches my bank account balance.
Queries can get much more complicated than this.
They can do things like access records from multiple tables, modify records, and so on and so forth.
We'll see some more examples later in this module.
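As a rough illustration (not the exact query on the slide), here is what such a balance lookup might look like in SQL, issued here through Python's built-in sqlite3 module; the accounts table and its columns are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('sam', 100.0)")

# The SQL query itself: fetch one customer's balance.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE owner = ?", ("sam",)
).fetchone()[0]
print(balance)  # 100.0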
Finally, database systems provide a really powerful tool, transactions.
Sometimes transactions are called ACID semantics or ACID properties.
And this stands for Atomicity, Consistency, Isolation, and Durability.
For our purposes, we're really most concerned with this atomicity property.
Atomicity lets us take a collection of queries and group them together with something that I'll call the all or nothing property.
So what this means is if I have a group of statements or a group of operations on the database that manipulate a collection of records, we can ensure that either all of those statements successfully complete or none of them do.
And I'll give you an example about why this is important, again going back to our bank account balances.
Imagine I have two bank accounts, A and B, and I want to do a transfer of some money from bank account A to bank account B. If I'm not careful, the database system might withdraw the money from bank account A and then crash.
So at that point, we've effectively lost money if we're not careful, and we really would like to avoid this from happening.
So what I'd like to do is to ensure that the withdrawal from A and the subsequent credit to bank account B happen together.
Either they both happen, or neither of them happens and we haven't lost any money from bank account A, right?
So if the system crashes halfway through this operation, what transactions ensure is that we're able to recover from the crash.
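Here is a minimal sketch of that all-or-nothing transfer, again using Python's built-in sqlite3 module; the table, account names, and amounts are made up. The point is that the two UPDATE statements run inside a single transaction, so a failure between them rolls both back.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 50.0)])
conn.commit()

def transfer(conn, src, dst, amount):
    # "with conn" opens one transaction: it commits if the block finishes,
    # and rolls back (undoing the withdrawal) if anything fails partway through.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer(conn, "A", "B", 25.0)
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'A': 75.0, 'B': 75.0}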
These transactional properties hold, not only in a single system, but developers have even figured out how to enforce them in a distributed setting, where the database is spread across multiple machines.
And often this comes at the cost of some complexity inside of the database system.
We'll talk about this a little bit more.
All right, so let's understand how these conventional properties map into this new world of big data.
In big data, there really are a thousand flowers blooming.
We have a ton of different types of data that people want to store and represent.
And it's no longer the case that the properties these conventional transaction-oriented database systems provide are a perfect fit for every one of these applications.
They're a good fit for some applications, like banking, but for others they may not apply so well.
So let's try and understand a little bit about what some of the properties we might like in this big data era are.
What has happened in big data, in effect, is a proliferation of new data storage systems for all different types of requirements.
And some of the new capabilities that these new systems provide are things like really high throughput and high performance--the ability to run operations on thousands or even tens of thousands of database records per second.
And this is something that these conventional systems weren't necessarily very good at.
In addition, these big data requirements have introduced the requirement that our data be spread across multiple machines.
Sometimes these databases are just too big to fit on a single computer.
Or, we may require the data to be stored on multiple machines, in order to be able to mask the failure of one of the machines inside of the database system.
Third, there's a requirement that these database systems be really, really highly available.
By that I mean that no matter what, they continue to provide answers.
So rather than enforcing these ACID semantics--this idea that the data is perfectly consistent--in these highly available systems we'll sometimes emit inconsistent data, data that doesn't necessarily reflect these groups of transactional operations, in order to ensure that the system is online.
Finally, the SQL query language, even though it's a very powerful language, is not always the right language for some of these new big data applications.
So what we're going to do next is to try and understand some of these properties in a little bit more detail, and look at why it is that conventional database systems don't necessarily provide them.

Assessment 4

Done 100%

Discussion 3

Optional Discussion Question:

Professor Stonebraker expresses a number of strong opinions in this module. Which of them do you agree with? Which do you disagree with? Why?

To discuss this question with your fellow participants, please go to the module 3 discussion area.

Big Data Systems

4.0 Introduction: Big Data Systems

Overview

In this module, you will learn about several additional aspects of managing Big Data, including security and multicore scalability from Professor Nickolai Zeldovich and user interfaces and visualization from Professor David Karger. Each of these is a broad, important topic in its own right, so we will not be able to cover them completely. Instead, our goal in these lectures is to introduce you to some of the key challenges that arise in these diverse areas, particularly when processing Big Data workloads.  We’ve grouped them together in this module because they are systemic, cross-cutting issues that affect many aspects of Big Data software.

Since these topics are so diverse, we summarize the goals and objectives of each separately.

Goals (4.1 Security)
  1. Understand what data security means and why it is hard to achieve.
  2. Understand how encryption can protect data in a database and protect from a number of different potential attacks.
  3. Look at several different approaches to building an encrypted database, and learn about how new cryptographic advances can allow us to run queries directly on encrypted data.
Objectives: Students should be able to...
  1. Understand why operating directly on encrypted data is important in Big Data databases.
  2. Understand the notion of homomorphic encryption.
  3. Be able to describe the differences between randomized, deterministic, and order-preserving encryption, in terms of the properties they provide and the types of queries that can be run on them.
Goals (4.2 Multicore Scalability)
  1. Explain the concept of scalability and its relationship to performance.
  2. Articulate the need for scalability when processing large amounts of data.
  3. Describe why achieving scalability is challenging by looking a bit at how modern processor architectures work.
  4. Delve into some programming techniques and data structures that can be employed to achieve scalability on Big Data workloads and modern processors.
Objectives: Students should be able to...
  1. Understand the functioning of the cache coherence protocol in modern multi-core processors.
  2. Understand why, in data-intensive applications, cache-coherence can create bottlenecks to scalability in modern processor architectures.
  3. Be able to describe how locking — which is used to prevent concurrent modification of data by multiple processors — can be efficiently implemented on a multicore system.
  4. Be able to describe how efficient locking, combined with garbage collection, can be used to implement many data structures that are at the core of Big Data systems, including stacks and trees.
Goals (4.3 Interfaces and Visualization)
  1. Illustrate different interfaces and visualizations for data.
  2. Explain the power of visualization for accessing large data sets.
  3. Describe some of the pitfalls and rules of thumb of what makes a good (or bad) visualization.
  4. Describe the role of interactivity in data analysis.
  5. Present a number of recently developed visualization and browsing interfaces for Big Data.
Objectives: Students should be able to...
  1. Understand what makes some visualizations and charts bad, such as the presence of chart junk, lack of proper labels, use of perspective, and so on.
  2. Understand how concepts like brushing and linking, and direct manipulation, can be used to allow users to interact with and better understand their data.
  3. Apply Shneiderman’s process of overviewing, filtering, zooming, and obtaining details for interacting with data.
  4. Use recently proposed extensions to the HTML standard to templatize the presentation of data, so that creating a visualization is no harder than authoring an HTML document.

4.1 Security

NICKOLAI ZELDOVICH: Hi, I'm Professor Nickolai Zeldovich.
And in this module, I'm going to tell you about how to build secure computer systems that handle large amounts of data.
The reason that computer security is a hard problem is because it's a negative goal.
And what I mean by this is that in order to build a secure computer system, you have to make sure there's no way for an adversary to violate your computer system's security policy.
For example, it means making sure there's no way for an adversary to obtain confidential data from your computer system or any way for them to corrupt that data.
And the reason it's so difficult to achieve computer security in many computer systems is because adversaries have so many different avenues of attack to choose from.
And as a defender, as a builder of a secure computer system, you have to defend your system against all of these possible avenues of attack.
To understand what this means, let's take an example of a database storing lots of confidential data.
Typically this looks like the following diagram on the slide.
We have an application server on the left and a database server on the right, storing all of our data on disk.
And the application server is going to issue database queries in a language, typically like SQL, over to the database server.
And the database server will return results to the application.
So how can an adversary try to steal confidential data from this database server?
Well in fact, there's many ways an adversary could attack here.
One possibility is that the adversary could listen on the network and watch either the queries or the database results going over the wire, and steal the data that way.
Another alternative that's often exploited by adversaries is to find some sort of a software bug in the database server software and exploit that vulnerability to gain access to the database server that way.
Another option is to focus on the hardware.
So if the adversary can get a copy of the data on disk--maybe when the database server is decommissioned or thrown away--or can somehow access the in-memory contents of the database server, they can get a copy of the data there.
Yet another route is to attack the humans that manage this database server.
Presumably there is some sort of an administrator that has access to this database server and can log in and maintain the database server, but also has the privilege to read all the data on disk.
And finally, another concern that you might have is government agencies that might file subpoenas and require a database server to disclose information to law enforcement officials.
So how do we prevent adversaries from compromising our data if there are so many different kinds of things we have to worry about?
One promising approach that can actually take away all of these avenues of attack is to use encryption.
What I mean by encryption is imagine placing some sort of a proxy in the middle between the application server and the database.
And the proxy is going to store a key of some sort.
So whenever queries go from the application server to the database, they'll first be intercepted by this proxy and all the data in those queries will be encrypted before being passed onto the database server.
So where we have regular data being sent over here, all of the data being sent to the database server is actually encrypted.
And when the database server provides results back to the proxy, maybe these results will be encrypted but the proxy will be able to decrypt them using its key and send the plain text data back to the application server.
The advantage of this approach, if we can somehow pull it off, is that everything on the right of this diagram, as I'm indicating by this shading, is encrypted.
Those components--the database server and the disk--never store plain text data and never have access to the decryption key.
And even if they're compromised, no matter in what way, they will not be able to recover our plain text data and leak our confidential information.
So this is the power of encryption.
It allows us to address a very broad threat model even if the server happens to be compromised without having to enumerate all of the different avenues of attack.
Of course, the problem with the simplistic approach that I've sketched out on this slide is that database servers often need to decrypt the data in order to process it--to run queries over the database, compute aggregate statistics, select certain records, et cetera.
So in the rest of this module, we're going to look at techniques for how to be able to process queries over encrypted data without giving the database server the decryption key.
So in the rest of this module, as I mentioned, we'll first go over some of the recent theoretical results in the space of execution over encrypted data, and look at a scheme called fully homomorphic encryption, which is theoretically promising and a good thing to know about, but unfortunately is not a practical solution to the problem I've posed so far.
Then we'll look at a practical approach that I have been developing with some of my colleagues at MIT called CryptDB, and we'll look at how CryptDB uses specialized encryption schemes to efficiently execute database queries over encrypted data.
I'll give you a sense of what all these schemes look like by illustrating how to construct an order preserving encryption scheme.
And then I'll describe a technique called onions of encryption that allows us to dynamically adjust the level of encryption at run time.
And finally, I'll conclude by showing you all the promising results from a prototype of the CryptDB system that we've built with our colleagues here at MIT.
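As a flavor of why specialized encryption schemes help, here is a toy sketch (not CryptDB's actual construction) of the deterministic-encryption idea: equal plaintexts map to equal tokens, so an untrusted server can answer equality queries without ever holding the key. A keyed hash (HMAC) stands in here for a real deterministic encryption scheme, which the key holder could also decrypt; the key, names, and rows are invented for illustration.

import hmac, hashlib

KEY = b"proxy-secret-key"  # held only by the trusted proxy

def det_token(value: str) -> str:
    # Deterministic: the same plaintext always produces the same token.
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# The proxy stores only tokens on the (untrusted) server side.
server_rows = [det_token(name) for name in ["alice", "bob", "alice"]]

# To run  SELECT ... WHERE name = 'alice'  the proxy sends the token for 'alice',
# and the server counts matches on ciphertexts alone, never seeing plaintext.
query_token = det_token("alice")
print(sum(row == query_token for row in server_rows))  # 2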

4.2 Multicore Scalability

NICKOLAI ZELDOVICH: Hi, I'm Professor Nickolai Zeldovich, and my research focuses on building computer systems.
And in particular, we worry about how to build systems that perform well when running on many cores in a large computer.
You probably know that many computers these days consist of multiple cores running in a single system, all with access to a single shared memory.
And in this topic, we'll look at, how do you make sure that this computer system performs well?
So what do I mean by performing well?
In particular, what we're going to talk about is scalability.
And by scalability, I mean that if we have a computer that has n cores, ideally, this computer should do n times as much work in the same amount of time as a computer with a single core.
Now, this notion of scalability is often a convenient way to think about achieving good performance on a multi-core system, because as you add more cores, a scalable system will achieve more and more total performance or throughput or work.
Now, scalability and performance are not exactly the same thing.
It's easy to have a system that scales well but has poor performance--it just does lots of useless work on every core.
And oftentimes the highest-performance system isn't the most scalable, either.
But there's quite a bit of correlation between these two notions, so we'll stick with the goal of trying to achieve good scalability on a multi-core system for the rest of this topic.
So one reason why a computer system might not achieve good scalability is because of the presence of serial sections in the overall system.
One example of a serial section you might be already familiar with is a typical lock in an application that ensures that only one piece of code is executing at a time.
Another example of a serial section you might encounter in a computer system is accessing shared memory.
So if multiple cores are accessing the same data structure in memory, hardware must ensure that only one core can perform each individual memory access at a time.
And thus this serializes the instructions issued by multiple cores, limiting the scalability of the overall application.
And finally, there are hardware resources that might pose a serial section.
For example, if all of your programs are accessing a single DRAM interface in your computer system or a single network interface card, then hardware can only allow a single application to access the DRAM or NIC at a given time.
And this again, forms a serial section that bottlenecks your application scalability.
Now, how significant is this notion of a serial section to the overall application performance?
One rough way to think of this problem is to imagine that your application is split into two sections--one part that runs in parallel and spends time p in the parallel section of your code, and one part that is purely serial and spends time s.
Now, if you roughly imagine how long such a program might run on a machine with n cores, we can take the parallel time, p, and divide it by the number of cores, n.
So as we have more cores, the parallel section will run faster and faster.
But the serial section will still execute serially, regardless of how many cores we have.
And as a result, as we add more and more cores--as n goes to infinity--the total runtime of your application, roughly s + p/n, is going to approach s, and will not improve beyond the length of our serial section, no matter how many hardware resources we throw at this problem.
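A minimal numeric sketch of that runtime model, with made-up times: a serial part s and a parallel part p give runtime(n) = s + p/n, so the speedup can never exceed (s + p)/s.

def runtime(s, p, n):
    return s + p / n

s, p = 1.0, 9.0  # e.g. 10% serial, 90% parallel
for n in (1, 2, 4, 16, 1_000_000):
    t = runtime(s, p, n)
    print(f"{n:>7} cores: runtime {t:6.3f}, speedup {runtime(s, p, 1) / t:5.2f}")
# As n grows, the runtime approaches s = 1.0 and the speedup approaches 10x.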
So we understand that serial sections are a problem for scalability.
But do we really need these serial sections?
One approach to achieving good scalability on a multi-core system is to simply partition your system into n independent cores.
Think of them as just n independent computers, working in parallel with one another.
Now, this is a good approach if you can actually partition your data set, your application, your workload across these multiple partitions, because these partitions will not share much with each other and will achieve pretty good scalability.
And in fact, this is what you have to do if your problem or data set is so large that it cannot fit into a single computer, no matter how many cores you can buy from Intel.
Now, the problem is that partitioning requires you to actually choose a specific partitioning up front.
And it might be hard to break up your single monolithic problem into n independent partitions.
Another disadvantage of this partitioning approach is that once you've broken up your problem into n independent chunks, you might need to load balance these partitions with each other, if it happens that one partition has too much work and another one doesn't have enough.
So for the rest of this topic, we'll avoid this partitioning approach and look at situations where you must have some amount of sharing between the cores and consequently cannot purely partition your data set between the cores of your system.
So for the rest of this topic, I will first tell you about how the hardware works at a low-level, because that's critical to understand how to achieve good performance and what kinds of operations scale well and what kinds of operations do not.
Then we'll look at a couple of case studies to understand what does it take to build a scalable application.
First, we're going to look at how cache coherence impacts scalability by examining a simple implementation of a lock, that is, a basic primitive used by almost every concurrent application.
We'll look at how a lock can collapse in terms of performance when you have more and more cores.
And we'll look at techniques that help you avoid performance collapse in a lock implementation.
Then we're going to look at techniques to improve scalability of an overall system by avoiding locks altogether.
We'll look at an example application of a stack data structure and look at how we can actually perform lock-free reads--reads that don't require us to take any lock at all.
And then we'll generalize this approach into a technique widely known as Read-Copy-Update, or RCU, that allows us to perform lock-free reads in almost any application.
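As a rough sketch of the read-copy-update idea (not a kernel-style implementation with explicit grace periods), here is a tiny Python class in which readers follow a reference to an immutable snapshot and never take a lock, while a writer copies the structure, modifies the copy, and publishes it by swapping the reference; the garbage collector plays the role of reclaiming old versions. The class and data are invented for illustration.

import threading

class RcuList:
    def __init__(self):
        self._snapshot = ()              # immutable tuple, atomically replaced
        self._write_lock = threading.Lock()

    def read(self):
        # Lock-free read: grab the current snapshot and work with it.
        return self._snapshot

    def push(self, item):
        with self._write_lock:           # writers still serialize among themselves
            # Copy, update, then publish the new version in one reference swap.
            self._snapshot = self._snapshot + (item,)

rcu = RcuList()
rcu.push("a")
rcu.push("b")
print(rcu.read())  # ('a', 'b') -- readers never block on the writer lock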

4.3 Visualization and User Interfaces

DAVID KARGER: Hi there.
So I'm David Karger and I'm going to be talking to you today about user interfaces for data.
And I'm going to start with the big picture.
And in particular, an introduction to myself and my research group.
My group is focused on understanding what is it that makes it hard for people to manage data.
And when I started working on this problem, I looked at traditional areas like databases, and information retrieval.
But over time, it became more and more clear to me that the real challenges in data management were in the interfaces that people use to interact with their computers.
And so that's where the focus of my research group is now.
And we sort of have this cycle where we first try to understand what it is that makes it hard to manage data.
What are the problems with the user interfaces people have today?
And then we try to create tools that address those problems that we understand.
Then we put those tools out and watch people use them.
And we use that to try to understand still more about what makes data hard to manage, and to create the next generation of tools.
And the theme in all of this work is really the end user and how we can empower them to manage data on their own.
So we spend less time thinking about highly specialized tools for power users with specific unusual types of data.
And think more about the kind of data management tasks that everybody has.
Now, why is the user interface so important?
Well, Hal Varian had a pretty good summary of what's going on nowadays.
As data becomes more and more pervasive, the ability to take that data and understand it, process it, visualize it, and communicate with it is becoming more and more important.
And at least three of those tasks--understanding, visualization, and communication--are fundamentally user interface tasks.
And it's the user interface that's going to affect how well somebody can do it.
Now, the reason we need those interfaces is that while computers can store and process information far better than a human being, people are still a lot better than computers at a number of information management tasks.
They're better at seeing patterns, noticing oddities, imposing order on some data that doesn't have a model yet, figuring out what a suitable model is.
And so if we can somehow create a combination of a computer with its processing capabilities and a human being with their understanding capabilities, putting those together is something more powerful than either one.
And even if the computer is going to be doing in various measures the majority of the work, there still has to be a way for the human to tell it what work to do.
Computers are still not very good at taking initiative.
I'm going to be talking about interfaces that have value in really two different ways.
One is that they can really help users analyze data in order to reason about it to draw conclusions.
And as a step in this understanding, these users need a sort of expanded memory.
There's no way that we can think about millions of data points all at once in our heads.
The computer can hold those millions of data points and present them to us in a way that we can think about them.
Once we see them, we can find patterns, based on that we can develop and assess hypotheses, we can look for errors in the data.
The second major use of interfaces is for communication to others.
Even if you see something in the data or you have done some work to understand something in data, a lot of times you can only have an effect with your understanding by communicating what you understand to other people.
And for that, you need ways to present your conclusions to people.
And we know from the famous cliche that a picture is worth a thousand words, that visual imagery is very effective for communicating to other people.
And so using interfaces to create that communication, to share and persuade other people, or to collaborate with them on your understanding and revise your own ideas is equally important in applications of user interfaces.
So just as a quick example of how powerful vision and human understanding is, here we have a well known data set called Anscombe's quartet.
It's a collection of four distinct data sets, each consisting of pairs of x- and y-coordinates.
And if you look at this table, you really can't see anything at all.
And if you feed this data into your typical statistical package, which will tell you, say, the means and standard deviations of these four data sets, you'll find out that all four of these data sets have exactly the same means, and exactly the same standard deviations.
So from a numeric perspective, they're really no different.
So what is the difference between these data sets?
Well, if we take these x- and y-coordinates and treat them as points in a plane and plot them, well then it's completely obvious what the differences between these four data sets are.
We've got one that's almost linear following a diagonal line.
We've got one that's sort of parabolic.
We've got one that's truly linear with a single outlier.
And one that's completely concentrated in the x-coordinate, but has some variability in the y-coordinate that leaps out at us when we look at the pictures.
We didn't see any of that by looking at the numbers.
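Here is a minimal sketch of that demonstration, assuming the copy of Anscombe's quartet that ships with the seaborn library (columns dataset, x, and y): the summary statistics are nearly identical, but the scatter plots are obviously different.

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("anscombe")

# Nearly identical summary statistics for all four data sets...
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "std"]))

# ...but plotting reveals four very different shapes.
fig, axes = plt.subplots(1, 4, figsize=(12, 3), sharex=True, sharey=True)
for ax, (name, group) in zip(axes, df.groupby("dataset")):
    ax.scatter(group["x"], group["y"])
    ax.set_title(f"Data set {name}")
plt.tight_layout()
plt.show()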

Assessment 5

Done 100%

Discussion 4

Optional Discussion Question:

CryptDB uses the idea that there are several different types of encryption schemes which allow for different levels of computation and different levels of leakage. Think about a data set that you work with. Discuss whether it would be appropriate to use something like CryptDB with it. Which types of data must be encrypted using which type of encryption scheme?

To discuss this question with your fellow participants, please go to the module 4 discussion area.

Big Data Analytics

5.0 Introduction: Big Data Analytics

Overview

In previous modules, we learned about how to store and process Big Data. Now that we have the tools for managing Big Data, we can move on to doing complex analytical work with that data. In particular, in this module, we will learn about several algorithms and machine learning techniques that can be used to extract patterns, trends, and insights from data.

A common theme among these lectures will be that Big Data is inherently hard to work with due to its large, messy, erroneous, and noisy nature.  We will discuss a variety of methods that researchers have developed to combat the problem of data complexity.

  1. Professor Ronitt Rubinfeld will start the lectures by talking about sampling algorithms, which test properties and answer queries on large datasets in sublinear time.
  2. Professor Piotr Indyk will then talk about streaming algorithms, which use small, randomized structures to summarize a dataset and provide approximate estimates of its properties.
  3. Finally, Professor Daniela Rus will return to talk about coresets, which compress Big Data by storing only the most meaningful bits of information while preserving certain statistical properties of the data.

Armed with these tools, we turn to the problem of machine learning, focusing on how it can be used to extract information from large datasets on the web. In particular, Professor Tommi Jaakkola will teach us how a recommendation engine works, and Professor Regina Barzilay will show us how to infer events from a stream of Tweets on Twitter.

Finally, we will end the module with two very important applications of analytics on Big Data: medicine and finance. In the context of medicine, Professor John Guttag will discuss how machine learning can be applied to predict the likelihood that a particular patient will survive a heart attack or the probability that a patient will acquire a secondary infection as a part of his or her hospital stay. We will also see from Professor Andrew Lo how machine learning models can be used to develop a new measure of creditworthiness that has more predictive power (in terms of probability of default on a loan) than the traditional credit score (FICO) used by U.S. banks.

One last thing to note is that Big Data algorithms and machine learning are both extremely large fields which could span several courses on their own. Naturally, we will not be able to explain in detail every algorithm or term that comes up. We heavily encourage you to pursue further reading material in our resources page to learn more about the ideas and methodologies used in this module.

Goals
  1. To understand what machine learning is and how it differs from traditional computer programs.
  2. To understand the challenges of machine learning on Big Data.
  3. To understand the trade-offs between privacy and accuracy when profiling users.
  4. To understand how sampling can lead to sublinear algorithms.
  5. To understand the different advantages of streaming and sampling.
  6. To understand what a coreset is and how it can efficiently encapsulate large datasets.
  7. To understand how natural language processing can be used to extract information from large text datasets.
  8. To understand how machine learning can be applied to the field of medicine.
  9. To understand that applying machine learning on Big Data can give us information we could not have gotten from traditional statistical methods.
Objectives: Students should be able to...
  1. List the various terms used in machine learning.
  2. Describe how a recommendation engine (such as the one used for Amazon) works.
  3. Implement an approximate sublinear algorithm for the six degrees of separation problem.
  4. Explain the difference between c-multiplicative approximation and c-additive approximation.
  5. Use collision statistics to solve problems such as minimum vertex cover.
  6. Implement an algorithm which approximately counts the distinct number of numbers in a set.
  7. Use sampling to implement the Sparse Fourier Transform.
  8. Describe the k-segment means algorithm.
  9. Explain how iDiary uses coresets to log the large amount of GPS data it generates.
  10. Have the basic intuition of how multi-aspect summarization works.
  11. Talk about how the factor graph model helps in inferring events from a stream of Twitter tweets.
  12. Discuss what makes medical analysis on Big Data so hard.
  13. Use a confusion matrix to identify false positives and false negatives.
  14. Describe what the ROC curve is.

5.1 Fast Algorithms I

RONITT RUBINFELD: Hi.
I'm Ronitt Rubinfeld, and I'm a professor at MIT.
There's been a huge hubbub about big data.
But what I want to talk to you about is what you do when you have really big data, so big that there's no time to look at it all.
What can you possibly hope to say about data that you can't even view?
Let's take an example to be concrete.
Let's take the famous Small World Phenomenon.
We'll model the social network as a graph.
We'll associate with each person a node.
And we'll place an edge between pairs of people that actually know each other.
And the six degrees of separation property, which Milgram first studied, and which was later the subject of a Broadway play and then even a movie, asks whether all pairs of people are connected by a path of length at most six.
So you see in this graph, this guy is not connected to anybody.
So this graph does not have the six degrees of separation property.
So this illustrates the problem we have with really big data.
It may be impossible to access some of the data.
And even what is accessible is so enormous that no single individual or even reasonably sized group of individuals could hope to access all of this data.
And even what you can access as a group is so large that by the time you access it all, the data might actually change.
So in our small world example, somebody could die, somebody could be born, things could change.
So this illustrates the problem we might have with the standard notions we teach in undergraduate algorithms courses.
We teach that the goal is to get a linear time algorithm.
A linear time algorithm is an algorithm which, on inputs encoded by n bits or words, takes at most a constant times n time steps.
We teach you that when you get this, you're done.
You get a gold star, A plus.
And you can move on to the next problem on the exam.
But maybe this notion is inadequate for big data.
But what could you possibly hope to do if you can't even see all of the input data?
You certainly can't answer the same types of questions that we'd hoped to answer before.
Any question that has a "for all" or an "exactly" type statement in it is going to be hard for us to solve.
We're not going to be able to answer exactly how many individuals on earth are left-handed.
We're not going to be able to answer whether all individuals are connected by at most six degrees of separation.
So we're going to have to make some compromise.
OK.
So the compromise that's traditionally taken by statisticians is to approximate.
Let's approximately determine how many individuals on earth are left handed.
And this is well understood--all the sampling bounds and polling bounds.
We use this to determine who's going to vote for which political candidate.
We use this to determine if a certain type of drug is working on a certain disease.
So this is something that's classically studied and used all over.
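A minimal sketch of that classical polling idea: estimate a population proportion from a small random sample instead of examining everyone. The population here is synthetic, with a true left-handed rate of 10%.

import random

random.seed(0)
# Synthetic population: roughly 10% are "left-handed".
population = [random.random() < 0.10 for _ in range(1_000_000)]

sample = random.sample(population, 1_000)       # look at only 1,000 people
estimate = sum(sample) / len(sample)
print(f"estimated fraction: {estimate:.3f}")    # close to the true 0.10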
But because the types of data that have now appeared are much more interesting and have certain kinds of structure that we're looking for, the questions that are now arising are quite a bit different from those that have been classically studied.
So in particular, for the six degrees of separation property, we might ask, is there a large group of individuals that are connected by at most six degrees of separation?
You might ask, are at least 99% of the individuals on this earth connected by at most six degrees of separation?
So what types of approximation or compromises are we looking for?
We're going to talk about two in this module.
We're going to talk about property testing.
We're going to try to distinguish data that has a certain property, such as networks that have six degrees of separation, from data that is far from having the property, networks where not even 50% of the population has the six degrees of separation property.
And we'll define what that means.
We'll also talk about classical approximation problems.
We're going to try to approximate the correct output of a computational problem.
The computational problem might be the classical statistics problem of estimating how many people are left handed on this earth or it might be something more interesting, like approximating some solution to a combinatorial optimization problem.
So we've seen that we're going to have to make some sort of compromise.
We've talked about two types of compromises that we might make.
And in the following segments, we're going to see how and when we might apply these types of compromises to specific questions.
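As one concrete illustration of the sampling compromise, here is a sketch that estimates the fraction of pairs of people connected within six degrees by sampling random pairs and running a depth-limited breadth-first search from one endpoint. The tiny synthetic network and all parameters are made up, and a real property tester would come with explicit guarantees about how many samples are needed.

import random

def within_k_hops(graph, src, dst, k=6):
    # Depth-limited breadth-first search: only a local neighborhood is explored.
    frontier, seen = {src}, {src}
    for _ in range(k):
        if dst in frontier:
            return True
        frontier = {v for u in frontier for v in graph[u]} - seen
        seen |= frontier
    return dst in frontier

def estimate_six_degrees(graph, samples=1_000):
    nodes = list(graph)
    hits = sum(within_k_hops(graph, *random.sample(nodes, 2)) for _ in range(samples))
    return hits / samples

# Tiny synthetic social network: each person knows a few random others.
random.seed(1)
graph = {i: set() for i in range(2_000)}
for i in range(2_000):
    for j in random.sample(range(2_000), 3):
        graph[i].add(j)
        graph[j].add(i)

print(f"estimated fraction within six degrees: {estimate_six_degrees(graph):.2f}")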

5.2 Fast Algorithms II

PROFESSOR INDYK: My name's Piotr Indyk, and in this module, I will give you an overview of basic algorithmic techniques used to process massive amounts of data.
In particular, we'll focus on two algorithmic techniques, called streaming and sampling.
So what are these techniques about?
Let's start from sampling.
The basic idea of sampling is a very natural one.
The idea is that instead of processing the whole data set, we instead pick up a random sample of the data, and then perform the computation only on this sample that we picked up.
So for example, if this is our data set, which is a large collection of, in this case, digits, the sampling algorithm will start by first picking a small sample of the data set.
So for example, the algorithm could pick up this element, this element, this element, this element, this element, and this element.
So that will be the first step of the algorithm.
And in the remainder of the computation, the algorithm will perform the processing only on the sample.
So this clearly reduces the amount of resources needed to perform the computation.
And as long as the sample is picked in a representative way, the answers for the sample are still pretty accurate compared to the true answers.
So that's sampling.
On the other hand, streaming takes a different approach to massive data processing.
The streaming algorithm touches all the data elements.
So the streaming algorithm would first start by looking at the first element, then second element, third element, and so on.
It would make one pass over the whole data set.
And as a result of this pass, it would store some sketch, or synopsis, of the data.
So instead of storing the whole data set, it would only store some partial information about the data set.
And after the whole data set is read, the algorithm would simply infer the desired properties of the data from the synopsis alone.
So this is the streaming approach to massive data processing.
So what are the pros and cons of these two approaches?
Let's start from sampling again.
What are the main positives of using sampling?
Well, the main positive of using sampling is that we perform the computation only on the sample, which means that we reduce both the storage needed to hold the data set and the computation time.
The second advantage is that we only need access to the data elements which are actually included in the sample, which means that all the data elements which are not in the sample do not even need to be materialized or generated.
So in situations where it is costly to even generate the data, the cost can be further reduced since we don't have to generate the data which we are not using.
And we'll see a few examples of this benefit later in this module.
On the other hand, what are the advantages of streaming?
Well, compared to sampling, the main advantage, or one of the advantages, is that in streaming algorithms, every data element is seen at least once.
So one can say that no element is left behind.
All the elements are seen at least once.
One can see the advantage of this approach if, for example, the goal of the algorithm is to simply detect whether the data stream contains a particular element--for example, a one.
The streaming algorithm for this problem is very easy.
The algorithm just makes a pass over the data and checks whether any data element is equal to one.
On the other hand, sampling can potentially miss the data element if the stream contains very few ones.
So in that sense, streaming can be more accurate than sampling because all the data elements are accessed.
Another advantage of streaming is that despite the fact that it makes a pass over the whole data set, it still uses very little storage, because it doesn't store the whole data set.
It only stores some synopsis.
And the computation time is at least linear in the data size because the algorithm makes a pass over the full data.
However, most of the algorithms we see in this module in fact have very efficient running times.
In the remainder of this module, I will present an overview and examples of both of these approaches.
We'll start from streaming.
In particular, we'll see a very efficient algorithm for estimating the number of distinct elements in the stream using very limited space.
So for example, if the data set consists of a bunch of web clicks, the number of distinct elements estimates the number of different users that actually clicked on this website, as opposed to the total number of clicks, which is a very useful statistic in many situations.
And these algorithms are actually very efficient.
In particular, one of them needs only 128 bytes of state to estimate, with pretty good accuracy, the number of distinct words in all the works of Shakespeare--an author with a rather large vocabulary, and a very prolific one.
So you can see that using a very limited synopsis, we can perform a computation on a rather large data set.
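As a sketch of the kind of small synopsis involved (this is the "k minimum values" estimator, not necessarily the exact algorithm used for the Shakespeare experiment), the state kept below is just the k smallest hash values seen so far, and the estimate comes from how tightly they are packed near zero; the synthetic stream is for illustration only.

import hashlib
import heapq
import random

def estimate_distinct(stream, k=256):
    # Hash each element to a pseudo-random point in [0, 1).
    def h(x):
        return int(hashlib.sha1(str(x).encode()).hexdigest(), 16) / float(16 ** 40)

    kept = set()   # the k smallest distinct hash values seen so far
    heap = []      # max-heap over kept (store negated values)
    for item in stream:
        v = h(item)
        if v in kept:
            continue
        if len(kept) < k:
            kept.add(v)
            heapq.heappush(heap, -v)
        elif v < -heap[0]:
            kept.discard(-heapq.heappop(heap))
            kept.add(v)
            heapq.heappush(heap, -v)
    if len(kept) < k:
        return len(kept)             # fewer than k distinct items: the count is exact
    return int((k - 1) / max(kept))  # k-th smallest hash is roughly k / (number distinct)

random.seed(0)
stream = (random.randrange(20_000) for _ in range(200_000))
print(estimate_distinct(stream))     # close to the ~20,000 distinct values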
And we will also see examples of other problems which are solvable using a streaming approach.
And I will give an overview of what kind of questions those algorithms answer.
In the second part of this lecture, we'll see an example of a sampling approach.
And in particular, we're going to see an overview of recent algorithms for performing a discrete Fourier Transform, a well-known task in signal processing, in time which is much faster than the well-known fast Fourier Transform.
This algorithm applies to data which is sparse--namely, to signals which have a very small number of large coefficients in the spectrum.
For such signals, our algorithms will run very efficiently because they access only a few samples of the data.
And as we will see, in many situations the running time of the algorithm is actually better than the running time of the well-known fast Fourier Transform algorithm.
Thank you.

5.3 Data Compression

DANIELA RUS: Hello, everyone.
My name is Daniela Rus.
I'm a professor at MIT, and today I would like to talk with you about data compression for big-data analytics.
We will look at techniques for learning big-data patterns from tiny coresets.
What does this mean?
We are going to examine how to extract, in a very efficient way, the most meaningful parts of our datasets.
Here's the intuition.
Say you have an algorithm that runs well on a small dataset, and now, all of a sudden, you have to run the algorithm on a huge dataset.
What options do you have?
Well, if you have to look at every data item, your algorithm might not ever finish.
It might be intractable.
One option is to try to make the algorithm more efficient to improve its running time.
But another option is to leave the algorithm as is, and focus on the data.
The idea is to extract the most meaningful portions of your data stream in such a way that running the original algorithm on this vastly reduced dataset gives approximately the same results as running it on the entire dataset.
How are we going to do this?
We're going to develop a technique called coresets.
But first let's look at an example.
We have a lot of devices today that are producing a lot of data.
In fact, in 2012 we produced 2.5 quintillion bytes of data per day.
That's a lot of data.
This data comes from phones, from cameras, from our computers, from our sensors embedded in the world.
Let's look at an example.
Let's look at the data that comes just from our phones.
And in fact, let's focus on one type of data.
Let's just focus on GPS data.
Here's our smartphone.
Each GPS data packet is about 100 bytes.
If we log 100 bytes every 10 seconds, we end up with about 0.4 megabytes of data per hour, or 100 megabytes of data per day.
That means 0.1 gigabytes of data per day, per device.
Wow, that's a lot of data.
But now, if we consider many phones producing data, let's see how the numbers look.
In 2010, over 300 million smartphones were sold.
Let's say only a third of these phones, about 100 million of them, produce data.
So for 100 million devices, we have 10 petabytes of data per day.
That's about 10,000 terabytes per day.
If you take your favorite external drive that stores 2 terabytes of data, you need 5,000 of such devices in order to store all this data for just one day.
That's a lot of data.
So how useful is it to capture such a lot of data?
Let's look at an example, which captures a day in my life, in the form of a movie.
I'm represented by the beautiful, yellow icon in the movie.
In this example, my phone keeps track of all my activities.
I start the logging at work, then I drive home.
I walk to a restaurant to have dinner, and then I drive to a shopping area.
I can extract all this semantic information about my activities just by looking at GPS data.
Keeping track of this history, in the form of a diary, is not only exciting as a way of logging my life, but it's also useful.
You can create a system that can automatically keep track of your work travel, work meals, and work meetings for tax purposes.
We can keep track of where we go and who we meet.
And if many people do this all together, the system can synthesize all the relevant correlations.
And we can do much more.
We can figure out which of our travel was on wheels, and which of our travel was on foot.
In other words, we can take GPS data and extract quite reliably the travel mode.
How reliable?
We can measure reliability using two important measures used in information retrieval.
The first measure is recall.
Recall tells us the fraction of relevant instances that were classified correctly.
In other words, of all the true instances of a class, how many did we find?
Precision tells us the fraction of our predictions that were correct.
In other words, if we look at all the data points we classified--in our example, as on foot or on wheels--how many of those data points were classified correctly?
For on-foot predictions, the recall is very high--93%.
The precision is 86%.
This is because, at rush hour, you can often walk faster than you drive.
For wheels prediction, the numbers are high along both axes.
The recall is about 96%, and the precision is about 98%.
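Here is a minimal sketch of how those two measures are computed, with made-up travel-mode labels rather than the real GPS-derived data.

def precision_recall(y_true, y_pred, positive="on foot"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)  # of everything we labeled "on foot", how much was right
    recall = tp / (tp + fn)     # of all true "on foot" points, how many we found
    return precision, recall

y_true = ["on foot", "on foot", "wheels", "wheels", "on foot", "wheels"]
y_pred = ["on foot", "wheels",  "wheels", "on foot", "on foot", "wheels"]
print(precision_recall(y_true, y_pred))  # (0.666..., 0.666...)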
So what else can we do with this data?
It turns out we don't walk randomly around the planet following Brownian motion.
In fact, we tend to have patterns.
We tend to go to the same places with a certain degree of repetition.
And we can determine our normal activities, and we can detect what is abnormal, on a day by day basis.
This is all very useful information, which we can extract just by looking at GPS data.
For example, we can take the GPS signal, which you can see on the left, and with our coreset data-compression analytics, we can map that onto textual descriptions of our activities that correspond to geographic locations.
And then we can further take that information and map it into real activity--semantic activities.
Have we been drinking coffee?
Working on homework?
Doing the dishes, et cetera?
There are many challenges associated with solving and enabling all these applications.
The first one is that storing data on smartphones, and in general, on sensor nodes, is very expensive.
Transmitting the data is also expensive, and it's very hard to interpret raw data.
The GPS signal consists of three numbers.
How do we know those three numbers correspond to a particular geographic location?
Our techniques are going to help us address how to mitigate these challenges, by using the notion of coreset.
And here is the intuition.
We will address these challenges by a computational technique that will allow us to find the right data from big data.
Say you're given a huge dataset--billions of points, which are represented by the blue dots in the slide.
The set is called D. And say you have an algorithm called A. And trying to run the algorithm on the dataset, D, is intractable.
Takes too much time.
The question is, can we efficiently reduce our dataset, D, to a subset C, denoted by the red points in my slide, so that running the algorithm on the subset, C, is fast?
And the solution computed by the algorithm on C is approximately the same as the solution computed on the entire set, D, which we can't do, because there's too much data.
The challenge is to find C very fast, and to be able to prove that running the algorithm on the two different sets gives approximately the same answer.
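To make that contract concrete, here is a toy sketch in which the "algorithm" is just evaluating a clustering-style cost, and a weighted uniform sample stands in for the carefully constructed coreset C (a real coreset construction chooses and weights points so that the approximation holds provably for every query); all of the data is synthetic.

import random

random.seed(0)
D = [random.gauss(0, 1) + (5 if random.random() < 0.5 else 0)
     for _ in range(100_000)]                        # the big data set

def cost(points, center, weights=None):
    """Sum of squared distances to a candidate center (weighted if given)."""
    weights = weights or [1.0] * len(points)
    return sum(w * (x - center) ** 2 for x, w in zip(points, weights))

# Build the small set C: m sampled points, each weighted by |D| / m.
m = 1_000
C = random.sample(D, m)
w = [len(D) / m] * m

for center in (0.0, 2.5, 5.0):
    print(center, round(cost(D, center)), round(cost(C, center, w)))
# The two costs typically agree to within a few percent for every candidate center.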
In order to provide concrete solutions for how to compute this special, magical set, C, I would like to introduce our big-data computation model.
In this model, we assume an infinite stream of vectors.
We have seen n vectors so far, but there are more coming up.
We only have log n memory to store the data, so we clearly cannot keep all the data around.
We have M processors, which will enable parallelism.
And we wish to have insertions for each data point be efficient.
In other words, they should take on the order of (log n)/M time.
The coreset techniques we will describe in the rest of this module belong to a wide set of techniques for data compression that have been developed by various communities in computer science.
Coresets originated with the computational geometry community.
And you see some of the key names of authors you might want to consult if you'd like to learn more.
But there are other techniques that have the same spirit as our approach to data compression by coresets.
In graph theory, people have developed sparsifiers.
In matrix approximation, volume sampling.
In statistics we have importance sampling.
In PAC learning we have epsilon samples.
And in combinatorial geometry, you have epsilon nets and epsilon approximations.
If you would like to learn more about data compression for big data, I encourage you to follow up with some of these references.
Next we will look at some concrete examples for constructing coresets.

5.4 Machine Learning Tools

TOMMI JAAKKOLA: Hi, I'm Tommi Jaakkola.
I'm a professor in computer science.
And today we'll talk about machine learning tools for big data analytics.
I have a group of seven people at MIT working on machine learning.
Machine learning, you can think of it as computer programs that learn from experience.
There are lots of issues from theory algorithms to applications related to such methods.
So, for example, from a theoretical perspective, we try to understand how learning is possible, when it is possible, when it is not, how much experience is required for learning anything, and what guarantees we might be able to give for the outcome of learning.
Algorithms are needed to efficiently implement those computer programs.
And there are lots of applications that we are trying to apply these methods to, and they serve as motivation for the underlying theory and algorithms.
So, for example, we work on natural language processing, trying to understand how computers could understand natural language documents.
If we could do that, we could have very natural user interfaces for computers.
We could ask the computer to search for information.
We could translate from one language to another, and so on.
Also, nowadays the access to information is almost invariably through recommendations.
If you type keywords into Google, you will get back recommendations--what Google thinks you're likely searching for.
Such recommendations are used in interfaces all over the place: when you visit the amazon.com site, or when you are watching movies and get recommendations for movies that you would like to see, and so on.
There are lots of other applications that we are motivated by that you will see here, from predicting user behavior, to reverse engineering biological systems, and so on.

5.5 Case Study: Information Summarization

PROFESSOR: I'm Regina Barzilay, and I'm a professor at CSAIL at MIT.
I work on natural language processing.
My group and I are developing machine learning applications that aim to extract useful information from text.
We work on very fundamental tasks, such as part of speech tagging, or syntactic parsing.
But we also do a lot of applications, such as text summarization and information extraction.
Sometimes we do really esoteric things, like lost-language decipherment.
And we even succeeded with one of these languages, Ugaritic.
But today, I will focus on one application which is closely related to big data, and this is information extraction.
So as you all know, lots and lots of big data material is actually generated by lay users.
And they generate it in natural language.
So one example would be reviews.
Lots and lots of reviews are generated every second on Amazon and other services.
And whenever we are making a decision on what book to buy, or to what restaurant to go, we often check these reviews and make these decisions.
However, we can only check a very small fraction of reviews given the time constraints that we have.
And for this reason, we often rely on star ratings, even though they're not perfect.
They don't contain a lot of information, but they are aggregated across thousands of reviews.
So in the ideal case, what we would really want is a way to take all this text--thousands of reviews--and put it in some compressed format that we can read very fast and use to make our decision.
Unfortunately, systems today do not provide a lot of opportunities to do so.
The research area that has been trying to achieve this goal for quite some time is called information extraction.
What information extraction systems try to do is take text and transform it into a structured representation, such as a database.
And a database is something that we, as computer scientists, know how to work with, because we know how to query it and how to aggregate it.
We can analyze information fast in structured form and compare it.
So the benefits would be really great if we could take all this unstructured big data and transform it into a structured format.
So a question you may ask at this point, if you are not familiar with natural language processing, is: how exactly are we planning to do this?
Even humans have difficulty reading through all of this material.
So how can a machine do it, when machines don't understand language?
They don't understand the grammar of natural language.
How can they take text and put it into database form?
So a quite surprising thing that people found maybe two or three decades ago is that you can solve a lot of tasks without actually understanding text.
You can just look at the very simple statistics.
Let's say you are trying to decide whether a text is about finance or about weather--that is, to determine the topic of the document.
It turns out that it is enough to look at the word distribution over the document.
If you see words like "blistering," "clouds," and "sunny" many times, most likely this is a document about weather and not about finance.
So if the machine is provided with many examples of text about topic A and topic B, it can learn the distributional patterns which are characteristic of each topic.
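A minimal sketch of that idea, with a handful of invented labeled examples and a tiny Naive Bayes-style scorer over word counts; a real system would be trained on far more data.

from collections import Counter
import math

train = [
    ("weather", "sunny skies with scattered clouds and a chance of rain"),
    ("weather", "clouds clear by evening leaving a sunny weekend"),
    ("finance", "stocks rallied as bank earnings beat market forecasts"),
    ("finance", "the market fell on weak bank lending and earnings"),
]

counts = {"weather": Counter(), "finance": Counter()}
for topic, text in train:
    counts[topic].update(text.lower().split())
vocab = len({w for c in counts.values() for w in c})

def classify(text):
    words = text.lower().split()
    scores = {}
    for topic, c in counts.items():
        total = sum(c.values())
        # Log-likelihood with add-one smoothing so unseen words don't zero out a topic.
        scores[topic] = sum(math.log((c[w] + 1) / (total + vocab)) for w in words)
    return max(scores, key=scores.get)

print(classify("sunny with a few clouds"))          # weather
print(classify("bank earnings lifted the market"))  # finance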
Now you can say, OK, this is really an easy task.
Can they do something that is really information extraction?
Let's look at another example of information extraction, which is named entity disambiguation.
You often see names of entities when you look at Google News.
They give you all the people in the news, all the locations.
How would you know?
So it turns out that it's, again, a simple task.
You can think of many predictors that tell us whether it is a person, an organization, a location, or an unrelated entity.
You can look at whether the entity is capitalized, or whether it matches a name from some list of names.
Or at the words that appear just before it: a word like Mister or Miss would be a strong indicator that this is a person.
So there are lots and lots of features you can think about.
And it turns out that there's a very useful piece of technology, called classification, which can learn a function that maps all these features to a prediction, to a label.
And what you will see is that in natural language processing, many, many tasks can be solved using just this kind of mapping, which takes features that a person engineered and maps them to a prediction.
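
A minimal sketch of that feature-to-label mapping, again assuming Python with scikit-learn; the features (capitalization, a title word before the entity, membership in a name list) follow the description above, but the feature names, the name list, and the toy examples are all invented for illustration.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    KNOWN_FIRST_NAMES = {"regina", "john", "andrew"}  # illustrative list only

    def features(token, prev_token):
        # Hand-engineered features of the kind described above.
        return {
            "is_capitalized": token[:1].isupper(),
            "prev_is_title": prev_token.lower() in {"mister", "mr.", "miss", "ms."},
            "in_name_list": token.lower() in KNOWN_FIRST_NAMES,
        }

    # Toy labeled examples: (token, previous token) -> entity type.
    X = [features("Regina", "Mister"), features("Boston", "in"),
         features("Acme", "at"), features("John", "Mr.")]
    y = ["person", "location", "organization", "person"]

    clf = make_pipeline(DictVectorizer(), LogisticRegression())
    clf.fit(X, y)
    print(clf.predict([features("Andrew", "Mister")]))  # most likely: ['person']
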

5.6 Applications: Medicine

John Guttag: Hello I'm John Guttag, and I want to spend some time talking with you about the application of big data analytics to improving medical outcomes.
But before I do, I want to put in a little ad for my research group.
In particular, I wanted an excuse to show you this picture.
This is a picture of my current graduate students.
And maybe it's the sad truth, or maybe it's the happy truth, about life in academia: professors, or at least this professor, don't do any real work.
All the work is done by the students, and the job of the professor is to claim credit.
So I thought the very least I should do is show you a picture of my students.
I do want to point out that this handsome fellow down at the bottom is me.
I thought I should be at the top of the pyramid, but my students told me that they were tired of propping me up, and it was time I spent at least some time propping them up.
So that's how I ended up here at the bottom.
In our group, basically, we do machine learning. We study big data.
We do data mining.
We do some computer vision, a lot of things.
Our main thrust is indeed medical analytics, but recently, we've ramped up our activity in financial analytics and, in particular, in sports analytics, because I'm kind of a sports nut, and it's fun.
And to be fair, I should acknowledge all of the people who paid for the work I'm going to be talking about.
On to medical analytics.
So what I want to talk about is using machine learning to build predictive models.
I'll talk about signals, because a lot of medical data is signals.
EKGs, for example, and biomarkers, because things like lab tests are important.
I won't talk about our work on video-based monitoring and diagnosis, because, well, I don't want to take hours and hours of your time.
I should point out that these are the logos of the hospitals with which we're currently collaborating.
It's important to say that the kind of work I'm interested in can't be done in vacuo, in a university.
If you're not collaborating with people who have data, with people who are treating patients, you can't do anything real.
You can prove theorems.
You can try toy examples, but you won't actually help change medicine.
I want to start with some good news, things that help us do this kind of work.
To start with, we have a lot more data than used to be available.
The capacity to gather medical data is growing enormously.
We have better instrumentation.
Today's MRI machines are much better than they used to be.
We have a lot of ambulatory monitors, and that's very important.
We can now gather data about people going through their normal lives, rather than in clinical environments.
And that data is often very different from the data you see in a doctor's office or in a hospital setting.
So we get a lot more data, a lot more information about each patient.
What's also important is that we get to keep this information.
Hospitals have always gathered a lot of information about patients.
Somebody has a heart attack.
They check into the hospital.
They may get monitored for a week at a time.
Well, in the bad old days, this data got trashed not long after the patient left the hospital, because, well, there was no place to keep it.
First, it originally would be on paper, and there's no way to keep all that paper.
And even if you could keep it, you'd never find anything.
Then it was electronic, but storage was expensive.
But as we now know, storage is darn close to free, and so increasingly, medical environments are saving information.
Another very important factor, which is not to be underestimated, is that the economic and social forces have changed, and we're now getting aggregated data.
It used to be that every hospital was a tub on its own bottom and kept its own data.
Today, we see an enormous growth in medical systems.
Hospitals are combining for economic reasons.
And suddenly, we're seeing aggregated databases.
We're also seeing consolidation among the payers, and in particular, as the government has become a more important payer, it has started collecting national data sets in the US, and many other countries are building international data sets.
And so we're able to look at much larger aggregations of data.
That's the good news.
You're not shocked that a slide that says Good News is followed by a slide that sort of says Bad News.
What's the bad news?
Well, from a patient's perspective, part of the bad news is clinicians are spending ever more time studying data about their patients.
I'm sure many of you have had the experience of going into your doctor's office and noticing the doctor is looking at his or her computer screen, rather than at you.
It can be very frustrating at times, but they have to, because there's more data to study about the patient.
But they're still ignoring most of the data.
It's surprising, for example, how many lab tests are never picked up.
Tracking the onslaught of new medical information is impossible.
Like all fields of science, medicine is exploding in the rate at which information is appearing in literature.
There's no way a practicing physician can read everything of relevance.
There are some technical problems, as well.
Researchers, or at least many researchers, are using analytical techniques that were not designed for today's data sets.
The techniques don't scale to multi-modal data.
They don't scale to thousands of variables and millions of patients.
That's just not what they were designed for.
Also, they were typically not designed to uncover new knowledge, what we might call data mining.
They were designed to test hypotheses, kind of like the scientific method we learned in high school.
Here's a hypothesis.
Let's design an experiment that might refute it.
That's what a clinical trial really is.
Today, that's just not going to cut it.
We need to have techniques that will uncover new knowledge.
Finally, a lot of the techniques today require prohibitively expensive and time consuming clinical studies.
And all you have to do is pick up the newspaper to realize how those clinical studies, the way we currently do them, are impeding the progress of medicine.

5.7 Applications: Finance

ANDREW LO: Hi, I'm Andrew Lo.
And I'm here to talk to you today about big data analytics, in particular applications to finance.
And the particular financial application we're going to be talking about today is the challenge of consumer credit risk management.
We're going to be applying the techniques of big data and machine learning to try to understand a little bit more about how consumer credit evolves over time and under different economic conditions.
I want to start with a little bit of an introduction about consumer credit and why risk management is so important.
First of all, keep in mind that there's over $3 trillion of consumer credit outstanding as of August, 2013.
And $840 billion of that is actually revolving consumer credit.
Revolving credit is credit where you're given a credit line.
And you can take out as much or as little of it as you want.
Unlike a mortgage or an auto loan where you've got fixed payments, with revolving credit like a credit card or a home equity line of credit, you get to decide how much credit you want and how much you need.
One of the amazing facts about this industry is that as of October, 2013 the average credit card balance for the typical individual consumer is over $15,000.
That's an astonishing number when you think about it, because the typical consumer is paying on average 15% annual interest on that kind of credit card.
Now, I know that you and I don't get 15% on our savings accounts.
So it seems kind of incredible that somebody's getting 15% on your credit card balances.
Well, the reason, of course, is that these credit card companies are actually taking a fair bit of risk.
And the risk that they're taking is that you might not actually pay back or you might be late in paying back.
And you might not be able to pay all the finance charges.
To give you a sense of just how large that is, as of the second quarter of 2013, 6.7% of all of the consumer credit outstanding was considered to be a loss.
That is, delinquencies and defaults basically caused credit card companies to write that off.
And in the first quarter of 2010, that number reached 10.2%.
So there are a fair bit of losses involved in this industry.
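
Just to put those percentages into dollar terms, here is a quick back-of-the-envelope calculation; applying the charge-off rate to the revolving-credit figure is done only for scale and is this write-up's simplification, not a number from the lecture.

    # Figures quoted above (approximate, as of 2013).
    avg_balance = 15_000        # average credit card balance, USD
    annual_rate = 0.15          # roughly 15% annual interest
    print(avg_balance * annual_rate)  # 2250.0 -> about $2,250 per year in finance charges

    revolving_credit = 840e9    # ~$840 billion of revolving consumer credit
    charge_off_rate = 0.067     # 6.7% charge-off rate (2013 Q2 figure)
    print(revolving_credit * charge_off_rate)  # ~5.6e10 -> on the order of $56 billion
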
So the challenge is whether we can predict these kinds of cycles and figure out how to tell whether these credit card losses are about to happen, or whether they are not going to be as much of a problem as we feared.
So to give you an example of why this is such an important challenge, let's take a look at typical consumer credit scores, something like a FICO score that's used by various credit rating agencies.
So here's a plot of consumer credit scores that are used in the industry to categorize consumers as good credits or bad credits.
The horizontal axis is the level of the score.
The low part of the curve means that you're a very big risk.
The high part means that you're a pretty good credit risk.
And each of the curves that you see on this graph represents the scores of various consumers in a particular commercial bank's database for each of the years from 2005 through 2008.
Now, the interesting thing about these curves is that, except for the very top curve, the black curve here, which is 2008, all of the other curves are really pretty close together.
And what the height of the curve represents is the actual probability of loss of consumers during that year.
And what you observe is that, apart from 2008 where the credit crisis really started to hit, there's very little difference in the scores of these various different consumers across the various default rates.
So in other words, these scores that are traditionally used for categorizing consumers into good credits or bad are very, very insensitive from one year to the next.
They're not good for predicting defaults, even though they're good for ranking consumers into high credit or low credit.
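
To make the ranking-versus-prediction distinction concrete, here is a tiny numeric illustration with invented default rates; none of these numbers come from the bank data in the study.

    # Default rates by score band for two hypothetical years.
    year_a = {"low_score": 0.08, "mid_score": 0.03, "high_score": 0.01}
    year_b = {"low_score": 0.20, "mid_score": 0.09, "high_score": 0.04}

    # The ordering of risk across bands is identical in both years ...
    assert sorted(year_a, key=year_a.get) == sorted(year_b, key=year_b.get)

    # ... but the overall level of losses is very different, which is exactly
    # what a ranking-only score fails to anticipate from one year to the next.
    print(sum(year_a.values()) / len(year_a), sum(year_b.values()) / len(year_b))
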
So what I'm going to be talking to you about is some research that we've been doing at the MIT laboratory for financial engineering.
It's based on a research article that you can all get a copy of, co-authored by Amir Khandani, Adlar Kim, and myself.
We published this in the Journal of Banking and Finance in 2010.
And it's research that applies the tools of big data and machine learning to a sample of consumers from a major commercial bank for the years 2005 through 2009.
And over the course of the next few segments, we're going to be going through that.
And we'll describe how it is that we can use these tools to get a sense of where these consumer credit risks are and what we can do about them.

Thank you.

Assessment 6

Done 100%

Discussion 5

Optional Discussion Question:

How do you think you can apply the sampling algorithms mentioned in this module to the data that you work with? Are the approximate results useful for the conclusions you're trying to draw from the data? Why or why not?

To discuss this question with your fellow participants, please go to the module 5 discussion area.

Slides

1.0_introduction-to-bigdata

What Is This Course Going to Cover?

1.0_introduction-to-bigdata.png

1.1_transportation

Case Study: Transportation in Singapore

Source: Live Singapore, Future of Urban Mobility, Singapore-MIT Alliance for Research and Technology

1.1_transportation.png

1.2_visualizing-twitter

Other Techniques We'll Cover

1.2_visualizing-twitter2.png

2.1_data-cleaning-integration

http://en.wikipedia.org/wiki/Data_integration
Demo of http://goby.com

Wrangler video
http://vis.stanford.edu/wrangler/

Data Curation

2.1_data-cleaning-integration1.png

Startups in This Space

2.1_data-cleaning-integration2.png

Data Tamer Future

2.1_data-cleaning-integration3.png

The Way Forward 1

2.1_data-cleaning-integration4.png

The Way Forward 2

2.1_data-cleaning-integration5.png

2.2_hosted-data-platforms

Examples

2.2_hosted-data-platforms1.png

Summary

2.2_hosted-data-platforms2.png

Examples

2.2_hosted-data-platforms3.png

Conclusions

2.2_hosted-data-platforms4.png

3.1_modern-databases

History Lesson

3.1_modern-databases1.png

My Thesis

3.1_modern-databases2.png

Rest of This Module

3.1_modern-databases3.png

Data Warehouse Marketplace

3.1_modern-databases4.png

The Participants

3.1_modern-databases5.png

Roughly Speaking

3.1_modern-databases6.png

OLTP Data Bases -- 3 Big Decisions

3.1_modern-databases7.png

Summary

3.1_modern-databases8.png

Everything Else

3.1_modern-databases9.png

Array DBMSs--e.g. SciDB

3.1_modern-databases10.png

Array DBMSs--Summary

3.1_modern-databases11.png

Graph DBMSs

3.1_modern-databases12.png

What Is Hadoop?

3.1_modern-databases13.png

What Is Happening Now?

3.1_modern-databases14.png

Most Likely Future

3.1_modern-databases15.png

Thoughts While Shaving 1

3.1_modern-databases16.png

Thoughts While Shaving 2

3.1_modern-databases17.png

Thoughts While Shaving 3

3.1_modern-databases18.png

The Curse--May You Live in Interesting Times 1

3.1_modern-databases19.png

The Curse--May You Live in Interesting Times 2

3.1_modern-databases20.png

3.2_distributed-computing-platforms

Motivation

3.2_distributed-computing-platforms1.png

Software in This Space

3.2_distributed-computing-platforms2.png

Applications

3.2_distributed-computing-platforms3.png

This Lecture

3.2_distributed-computing-platforms4.png

3.3_NoSQL-NewSQL

What Does a Traditional Database Provide?

3.3_NoSQL-NewSQL1.png

A Thousand Flowers Bloom

3.3_NoSQL-NewSQL2.png

Rest of the Module

3.3_NoSQL-NewSQL3.png

4.1_security

Security is a Negative Goal

4.1_security1.png

Multiple Encryption Schemes

4.1_security2.png

Conclusion

4.1_security3.png

4.2_multicore-scalability

Goal: Scalability

4.2_multicore-scalability1.png

Outline

4.2_multicore-scalability2.png

Conclusion: Multicore Scalability

4.2_multicore-scalability3.png

4.3_visualization-user-interfaces

My Research Group

4.3_visualization-user-interfaces1.png

Why User Interfaces? 1

4.3_visualization-user-interfaces2.png

Why User Interfaces? 2

4.3_visualization-user-interfaces3.png

The Value of Interfaces

4.3_visualization-user-interfaces4.png

User Interfaces for Data

4.3_visualization-user-interfaces5.png

Spectrum of Interface Capabilities

4.3_visualization-user-interfaces6.png

Overview

4.3_visualization-user-interfaces7.png

Small Multiples

4.3_visualization-user-interfaces8.png

What Is Visualization?

4.3_visualization-user-interfaces9.png

Why Visualizations?

4.3_visualization-user-interfaces10.png

John Tukey 1

4.3_visualization-user-interfaces11.png

Challenger Disaster

4.3_visualization-user-interfaces12.png

Morton Thiokol

4.3_visualization-user-interfaces13.png

Make a Decision: Challenger 1

4.3_visualization-user-interfaces14.png

Make a Decision: Challenger 2

4.3_visualization-user-interfaces15.png

Information Visualization

4.3_visualization-user-interfaces16.png

How Not To Lie

4.3_visualization-user-interfaces17.png

Tufte: Graphical Integrity 1

4.3_visualization-user-interfaces18.png

Tufte: Graphical Integrity 2

4.3_visualization-user-interfaces19.png

Summary

4.3_visualization-user-interfaces20.png

Interactivity

4.3_visualization-user-interfaces21.png

Why?

4.3_visualization-user-interfaces22.png

Plan

4.3_visualization-user-interfaces23.png

John Tukey 2

4.3_visualization-user-interfaces24.png

Exploratory Versus Confirmatory

4.3_visualization-user-interfaces25.png

John Tukey 3

4.3_visualization-user-interfaces26.png

Summary

4.3_visualization-user-interfaces27.png

Goal

4.3_visualization-user-interfaces28.png

Spectrum of Interface Capabilities

4.3_visualization-user-interfaces29.png

Interaction Strategy

4.3_visualization-user-interfaces30.png

5.1_fast-algorithms-1

What To Do About REALLY Big Data?

5.1_fast-algorithms-1-1.png

No Time

5.1_fast-algorithms-1-2.png

Really Big Data

5.1_fast-algorithms-1-3.png

What Can We Hope To Do Without Viewing Most of the Data?

5.1_fast-algorithms-1-4.png

What Types of Approximation?

5.1_fast-algorithms-1-5.png

Conclusion

5.1_fast-algorithms-1-6.png

5.2_fast-algorithms-2

Streaming and Sampling

5.2_fast-algorithms-2-1.png

Streaming Versus Sampling

5.2_fast-algorithms-2-2.png

Rest of This Lecture

5.2_fast-algorithms-2-3.png

Computing Fourier Transform

5.2_fast-algorithms-2-4.png

Idea: Leverage Sparsity

5.2_fast-algorithms-2-5.png

Benefits of Sparse Fourier Transform

5.2_fast-algorithms-2-6.png

5.3_data-compression

Learning Big Data Patterns From Tiny Core-Sets

5.3_data-compression1.png

Data Challenge

5.3_data-compression2.png

Challenges

5.3_data-compression3.png

Outline

5.3_data-compression4.png

Coreset and Data Compression Techniques

5.3_data-compression5.png

References for Compression

5.3_data-compression6.png

Example: Coresets for Life Logging

5.3_data-compression7.png

System Overview

5.3_data-compression8.png

Coresets for Latent Semantic Analysis

5.3_data-compression9.png

Wrapping Up

5.3_data-compression10.png

5.4_machine-learning-tools

Our Research Group

5.4_machine-learning-tools1.png

Machine Learning 1

5.4_machine-learning-tools2.png

Machine Learning 2

5.4_machine-learning-tools3.png

Structured Prediction

5.4_machine-learning-tools4.png

5.5_information-summarization

My Research

5.5_information-summarization1.png

Need in Information Extraction

5.5_information-summarization2.png

Information Extraction for Big Data

5.5_information-summarization3.png

Data Set 1

5.5_information-summarization4.png

Data Set 2

5.5_information-summarization5.png

Conclusion

5.5_information-summarization6.png

5.6_applications-medicine

My Research Group

5.6_applications-medicine1.png

Medical Analytics

5.6_applications-medicine2.png

The Good News

5.6_applications-medicine3.png

Problems Posed by Increase in Available Data 1

5.6_applications-medicine4.png

Problems Posed by Increase in Available Data 2

5.6_applications-medicine5.png

Using ML to Make Useful Predictions

5.6_applications-medicine6.png

Accurate Predictions Can Help

5.6_applications-medicine7.png

The Big Data Challenge

5.6_applications-medicine8.png

What Is Machine Learning? 1

5.6_applications-medicine9.png

What Is Machine Learning? 2

5.6_applications-medicine10.png

How Are Things Learned?

5.6_applications-medicine11.png

Machine Learning Methods

5.6_applications-medicine12.png

More Not Always Better

5.6_applications-medicine13.png

Approach

5.6_applications-medicine14.png

The Data

5.6_applications-medicine15.png

Variables Considered

5.6_applications-medicine16.png

Two Step Approach

5.6_applications-medicine17.png

Results

5.6_applications-medicine19.png

Wrapping Up

5.6_applications-medicine20.png

5.7_applications-finance

Consumer Credit Risk Management

5.7_applications-finance1.png

MIT Laboratory for Financial Engineering

5.7_applications-finance2.png

Anonymized Data From Large U.S. Commercial Bank

5.7_applications-finance3.png

Objectives

5.7_applications-finance4.png

Machine Learning Objectives

5.7_applications-finance5.png

Empirical Results

5.7_applications-finance6.png

Macro Forecasts of Credit Losses

5.7_applications-finance7.png

Conclusions

5.7_applications-finance8.png

Course Updates & News

Source: https://mitprofessionalx.edx.org/cou...DX/2T2014/info

November 4, 2014

Welcome to the MIT Professional Education course Tackling the Challenges of Big Data. Now that the course is open, you can begin the following activities:

Review the course Syllabus, Wiki, and Course Handouts

SYLLABUS: The course syllabus can be viewed in the “Syllabus” tab of the course platform.

Please remember, this course is held over six weeks and is entirely asynchronous. Lectures are pre-taped and you can follow along when you find it convenient, as long as you finish all required assignments by December 16, 2014. You may complete all assignments before the course end date; however, you may find it more beneficial to adhere to a weekly schedule so you can stay up-to-date with the discussion forums.

WIKI: The Wiki can be viewed in the “Wiki” tab of the course platform.

The course Wiki is a collaborative space for participants to share information, tips, suggestions, etc. about the course. Everyone is welcome and encouraged to edit these pages or add new ones as they see fit. There will be a lot of great resources available, including PDFs of the presentations the faculty refer to in their videos.

COURSE HANDOUTS: Course handouts can be viewed near the top right side of this page. Handouts include information about Discussion Forum Guidelines, an Introduction to your course Teaching Assistants (TAs) and how they can help, and details on how to earn a Certificate of Completion and CEUs for this course.

Join the course networking groups

These groups are meant to provide participants additional ways to interact with each other during the active running of the course, and joining them is optional. They will be un-moderated, so any questions meant for course staff should instead be asked in the discussion forums where your course TAs can assist you. Details about how to join these groups can be found at the top of the course Wiki.

Be sure to check this page regularly, as we will post copies of email communications as well as important announcements throughout the duration of the course.

MIT Professional Education
The Online X Programs Team

Course Syllabus for Tackling the Challenges of Big Data

https://mitprofessionalx.edx.org 

Course Dates: 
Course Officially Begins: November 4, 2014, 05:00 UTC 
Course Officially Ends: December 16, 2014, 4:59 UTC

Course Description: 
Read the full course description here.

Time Requirement/Commitment

Taking into consideration various time zones, this course is self-paced with online accessibility 24/7. Lectures are pre-taped and you can follow along when you find it convenient, as long as you finish by the course end date. You may complete all assignments before the course end date; however, you may find it more beneficial to adhere to the suggested weekly schedule so you can stay up-to-date with the discussion forums. There are approximately three hours of video every week. Most participants will spend about five hours a week on course-related activities.

Who Should Participate?

Prerequisite(s): This course is designed to be suitable for anyone with a bachelor’s level education in computer science or equivalent work experience, such as working hands-on with IT / technology systems (programming, database administration, data analysis, actuarial work, etc.). No programming experience or knowledge of programming languages is required.

Learning Objectives

Participants will learn the state-of-the-art in Big Data. The course aims to reduce the time from research to industry dissemination and expose participants to some of the most recent ideas and techniques in Big Data.

After taking this course, participants will:

  1. Understand the challenges posed by Big Data (volume, velocity, and variety), its sources, and its potential impact for your industry. Determine how and where Big Data challenges arise in a number of domains, including social media, transportation, finance, and medicine
  2. Investigate multicore challenges and how to engineer around them
  3. Explore the relational model, SQL, and capabilities of new relational systems in terms of scalability and performance
  4. Understand NoSQL systems, their capabilities and pitfalls, and how the NewSQL movement addresses these issues
  5. Learn how to make the most of the MapReduce programming model: its benefits, how it compares to relational systems, and new developments that improve its performance and robustness
  6. Learn why building secure Big Data systems is so hard, and survey recent techniques that help, including direct processing on encrypted data, information flow control, auditing, and replay
  7. Discover user interfaces for Big Data and what makes building them difficult
  8. Manage the development of data compression algorithms
  9. Formulate the “data integration problem”: semantic and schematic heterogeneity and discuss recent breakthroughs in solving this problem
  10. Understand the benefits and challenges of open-linked data
  11. Comprehend machine learning and algorithms for data analytics

Methodology: Online recorded lectures, optional discussion boards, case studies, assessments, and a community Wiki.

Learning Activities Planned for the Program:

  • Optional participation in threaded discussions on designated forums
  • End of topic assessments
  • Video learning sequences
  • Wiki for sharing of resources and external links

Course Staff

Faculty Co-Directors:

Daniela Rus, Professor, Electrical Engineering and Computer Science

Rus is Professor of Electrical Engineering and Computer Science and Director of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.  Rus’ research interests include distributed robotics, mobile computing, and programmable matter. At CSAIL, she has led numerous groundbreaking research projects in the areas of transportation, security, environmental modeling and monitoring, underwater exploration, and agriculture. Her research group, the Distributed Robotics Lab, has developed modular and self-reconfiguring robots, systems of self-organizing robots, networks of robots and sensors for first-responders, mobile sensor networks, techniques for cooperative underwater robotics, and new technology for desktop robotics. They have built robots that can tend a garden, bake cookies from scratch, cut birthday cake, fly in swarms without human aid to perform surveillance functions, and dance with humans.

Sam Madden, Professor, Electrical Engineering and Computer Science

Madden is a computer scientist specializing in database management systems. He is the faculty director of MIT’s Big Data Initiative at CSAIL and co-director of the Intel Science and Technology Center (ISTC) in Big Data at CSAIL. Recent projects include CarTel, a distributed wireless platform that monitors traffic and on-board diagnostic conditions in order to generate road surface reports, and Relational Cloud, a project investigating research issues in building a database-as-a-service. In 2005, Madden was named one of Technology Review's “Top 35 Under 35.” He is also co-founder of Vertica (acquired by HP).

See the full list of faculty for this course.

Course Requirements

Students must complete a mandatory entrance survey in order to gain access to the videos and other course materials.

In order to get the most out of this course, you are encouraged to watch all course videos, complete all weekly assessments, and actively participate in the discussion forums.

Grading:

Grades are not awarded for this program. However, to earn a Certificate of Completion from MIT, you are required to watch all course videos and complete all assessments with an 80% or higher average score. MIT Professional Education will not track your video progress, but please note that your understanding of all course content is necessary to complete the course assessments.

Participants who successfully complete all course requirements and earn a Certificate of Completion are eligible to receive 2.0 Continuing Education Units (2.0 CEUs). In order to earn CEUs, participants must complete the final course survey/CEU assessment by January 2, 2015.

Course Schedule

This course is structured into a six-week program and is entirely asynchronous. Below is a suggested weekly schedule for the purpose of staying up-to-date with the discussion forums.

Please note that no extensions will be granted, and all required assessments and assignments must be completed and submitted on or before December 16, 2014, 4:59 UTC.

Pre-course Assignment: Participants are required to provide some information via a short course entrance survey. Your answers will help the course team and faculty better understand your goals for taking this course and how familiar you are with big data concepts, and they will ultimately guide improvements to your experience and that of future courses.

This survey is your first assignment of the course. You will be able to access the survey on the course start date, November 4, 2014. As soon as you complete the survey, you will be granted access to the videos, and can start the course.

Week 1 - MODULE ONE: INTRODUCTION AND USE CASES

November 4 – November 10

The introductory module aims to give a broad survey of Big Data challenges and opportunities and highlights applications as case studies.

Introduction: Big Data Challenges (Sam Madden)
  • Identify and understand the application of existing tools and new technologies needed to solve next generation data challenges
  • Challenges posed by the ability to scale and the constraints of today's computing platforms and algorithms
  • Addressing the universal issue of Big Data and how to use the data to align with a company’s mission and goals
Case Study: Transportation (Daniela Rus)
  • Data driven models for transportation
  • Coresets for Global Positioning System (GPS) data streams
  • Congestion aware planning
Case Study: Visualizing Twitter (Sam Madden)
  • Understand the power of geocoded Twitter data
  • Learn how Graphic Processing Units (GPUs) can be used for extremely high throughput data processing
  • Utilize MapD, a new GPU based database system for visualizing Twitter in action
Recommended weekly activities
  • Watch course videos for this week
  • Review and contribute to discussion forum, including module discussion questions (NOTE: Contributing to discussion forums is not required to earn a certificate or CEUs.)
  • Review and contribute to Wiki

Week 2 - MODULE TWO: BIG DATA COLLECTION

November 11 – November 17

The data capture module surveys approaches to data collection, cleaning, and integration.

Data Cleaning and Integration (Michael Stonebraker)
  • Available tools and protocols for performing data integration
  • Curation issues (cleaning, transforming, and consolidating data)
Hosted Data Platforms and the Cloud (Matei Zaharia)
  • How performance, scalability, and cost models are impacted by hosted data platforms in the cloud
  • Internal and external platforms to store data
Recommended weekly activities
  • Watch course videos for this week
  • Review and contribute to discussion forum, including module discussion questions (NOTE: Contributing to discussion forums is not required to earn a certificate or CEUs.)
  • Review and contribute to Wiki

Week 3 - MODULE THREE: BIG DATA STORAGE

November 18 - November 24

The module on Big Data storage describes modern approaches to databases and computing platforms.

Modern Databases (Michael Stonebraker)
  • Survey data management solutions in today’s market place, including traditional RDBMS, NoSQL, NewSQL, and Hadoop
  • Strategic aspects of database management
Distributed Computing Platforms (Matei Zaharia)
  • Parallel computing systems that enable distributed data processing on clusters, including MapReduce, Dryad, Spark
  • Programming models for batch, interactive, and streaming applications
  • Tradeoffs between programming models
NoSQL, NewSQL (Sam Madden)
  • Survey of new emerging database and storage systems for Big Data
  • Tradeoffs between reduced consistency, performance, and availability
  • Understanding how rethinking the design of database systems can lead to order-of-magnitude performance improvements
Recommended weekly activities
  • Watch course videos for this week
  • Review and contribute to discussion forum, including module discussion questions (NOTE: Contributing to discussion forums is not required to earn a certificate or CEUs.)
  • Review and contribute to Wiki

Optional midcourse evaluation will be distributed by 4:59 UTC November 24, 2014 and is due 4:59 UTC November 25, 2014.

Week 4 - MODULE FOUR: BIG DATA SYSTEMS

November 25 - December 1

The systems module discusses solutions to creating and deploying working Big Data systems and applications.

Security (Nickolai Zeldovich)
  • Protecting confidential data in a large database using encryption
  • Techniques for executing database queries over encrypted data without decryption
Multicore Scalability (Nickolai Zeldovich)
  • Understanding what affects the scalability of concurrent programs on multicore systems
  • Lock-free synchronization for data structures in cache-coherent shared memory
User Interfaces for Data (David Karger)
  • Principles of and tools for data visualization and exploratory data analysis
  • Research in data-oriented user interfaces
Recommended weekly activities
  • Watch course videos for this week
  • Review and contribute to discussion forum, including module discussion questions (NOTE: Contributing to discussion forums is not required to earn a certificate or CEUs.)
  • Review and contribute to Wiki

Week 5 - MODULE FIVE, PART I: BIG DATA ANALYTICS

December 2 - December 8

The analytics module covers state-of-the-art algorithms for very large data sets and streaming computation.

Fast Algorithms I (Ronitt Rubinfeld)
  • Efficiency in data analysis
Fast Algorithms II (Piotr Indyk)
  • Advanced applications of efficient algorithms
  • Scale-up properties
Data Compression (Daniela Rus)
  • Reducing the size of the Big Data file and its impact on storage and transmission capacity
  • Design of data compression schemes, such as coresets, to apply to Big Data sets
Recommended weekly activities
  • Watch course videos for this week
  • Review and contribute to discussion forum, including module discussion questions (NOTE: Contributing to discussion forums is not required to earn a certificate or CEUs.)
  • Review and contribute to Wiki

Week 6 - MODULE FIVE, PART II: BIG DATA ANALYTICS

December 9 – December 15

The analytics module covers state-of-the-art algorithms for very large data sets and streaming computation. 

Machine Learning Tools (Tommi Jaakkola)
  • Computational capabilities of the latest advances in machine learning
  • Advanced machine learning algorithms and techniques for application to large data sets
Case Study: Information Summarization (Regina Barzilay)
Applications: Medicine (John Guttag)
  • Utilize data to improve operational efficiency and reduce costs
  • Analytics and tools to improve patient care and control risks
  • Using Big Data to improve hospital performance and equipment management
Applications: Finance (Andrew Lo)
  • Learn how big data and machine learning can be applied to financial forecasting and risk management
  • Analyze the dynamics of the consumer credit card business of a major commercial bank
  • Recognize and acquire intuition for business cases where big data is useful and where it isn't
Recommended weekly activities
  • Watch course videos for this week
  • Review and contribute to discussion forum, including module discussion questions (NOTE: Contributing to discussion forums is not required to earn a certificate or CEUs.)
  • Review and contribute to Wiki
Completing the course
  • In order to receive a Certificate of Completion, all end of topic assessments must be completed with a minimum of 80% success rate by 4:59 UTC, December 16, 2014. Certificates will be posted to your student dashboard 7 days after course end date.
  • In order to receive 2.0 CEUs, you must earn a Certificate of Completion and complete the final course/CEU assessment by January 2, 2015. A CEU award letter will be emailed to all participants that earn them by January 15, 2015.
Post-course
  • Course content will be accessible for an additional 90 days post program. There will be no exceptions or extensions. The site will officially close by March 15, 2015.
  • An invitation will be sent out to join our restricted LinkedIn professional alumni group one week after certificates have been posted.
 
Thank you for your participation in 
Tackling the Challenges of Big Data.
MIT Professional Education
http://web.mit.edu/professional/

Wiki

Networking

If you wish to connect with your fellow participants outside the course discussion forum, please see these official networking groups:

Facebook- course group

Visit the above link and request to “Join Group”. An admin will process your request as soon as possible. We appreciate your patience.

LinkedIn- course group

Visit the above link and request to “Join Group”. If your LinkedIn Account email is the same as the email you used to register for the course, you should be pre-approved to join and will be automatically added as a member of the group. If your LinkedIn email is different than the one used to register for this course, an admin will process your request and add you as a group member as soon as possible. We appreciate your patience.

If you are interested in sharing your information so that we can get some statistics about the participants, and find more meaningful ways to network, please submit your information here. You can view the submitted information here.

These are the official, un-moderated course networking groups for Tackling the Challenges of Big Data, November 4 – December 16, 2014.

Please note that once you complete the course, you will all be invited to join the MIT Professional Education - Alumni LinkedIn Group. This group is exclusive to all those that have received certificates through MIT Professional Education programs and will allow you to network with a broad group of professionals from around the world, representing many industries and job functions, including Big Data.

Readings and Resources

Course members are invited to share information and resources in this wiki. Resources that are appropriate for a particular unit should be added under "student-added resources" for that unit. When adding resources such as links to files, please follow accessibility guidelines and label the file type (e.g. PDF, Word Document, etc.)

General resources, not specific to a module, are listed in a separate article.

You can get a preview of the course materials through the links below. During the first week of the course, we will also publish the slides in a single package to make it easier to download all of them at once.

Resources by Module

1.0 Introduction: Big Data Challenges - PDF of Presentation slides (Madden)

STUDENT-ADDED RESOURCES

2.0 Introduction: Big Data Collection

3.0 Introduction: Big Data Storage

4.0 Introduction: Big Data Systems
