Table of contents
  1. Story
  2. Spotfire Dashboard
  3. Slides
    1. Slide 1 What MIT CSAIL is Doing About Big Data
    2. Slide 2 Big Data
    3. Slide 3 When Do You Have a Big Data Problem?
    4. Slide 4 Example: Medical Costs
    5. Slide 5 Why Big Data Now
    6. Slide 6 What are we (CSAIL) Doing?
    7. Slide 7 What is the ISTC in Big Data?
    8. Slide 8 bigdata@csail: focus
    9. Slide 9 Mission
    10. Slide 10 bigdata@csail: Founding Members
    11. Slide 11 Meetings & Workshops
    12. Slide 12 Annual Workshops & Meetings
    13. Slide 13 Big Data Privacy Workshop
    14. Slide 14 Big Data & Privacy Working Group
    15. Slide 15 Education & Impact
    16. Slide 16 MIT Big Data Challenge: Innovation + Students 1
    17. Slide 17 MIT Big Data Challenge: Innovation + Students 2
    18. Slide 18 BigDataX: Tackling the Challenges of Big Data
    19. Slide 19 Research
    20. Slide 20 big data: systems and tools 1
    21. Slide 21 big data: systems and tools 2
    22. Slide 22 Example Research: MapD
    23. Slide 23 What is a GPU Accelerated Database
    24. Slide 24 Demo 1
    25. Slide 25 Demo 2
    26. Slide 26 Demo 3
    27. Slide 27 Demo 4
    28. Slide 28 Summary
  4. Spotfire Dashboard
  5. Research Notes
  6. bigdata@csail
    1. ABOUT
      1. The Big Data Problem
      2. Web Analytics
      3. Finance
      4. Medical
      5. APPROACH
        1. Computational Platforms
        2. Scalable Algorithms
        3. Machine Learning and Understanding
        4. Privacy and Security
      6. TEAM
        1. Sam Madden
        2. Elizabeth Bruce
        3. Susana Kevorkova
        4. Sheila Marian
      7. CONTACT
        1. For sponsors/membership
        2. For general information
    2. PEOPLE
      1. Hal Abelson CSAIL MIT
      2. Saman Amarasinghe CSAIL MIT
      3. Hari Balakrishnan CSAIL MIT
      4. Regina Barzilay CSAIL MIT
      5. Tim Berners-Lee CSAIL MIT
      6. Elizabeth Bruce CSAIL MIT
      7. Erik Demaine CSAIL MIT
      8. Srini Devadas CSAIL MIT
      9. Alan Edelman CSAIL MIT
      10. John Fisher CSAIL MIT
      11. William Freeman CSAIL MIT
      12. Polina Golland CSAIL MIT
      13. John Guttag CSAIL MIT
      14. Piotr Indyk CSAIL MIT
      15. Tommi Jaakkola CSAIL MIT
      16. Lalana Kagal CSAIL MIT
      17. David Karger CSAIL MIT
      18. Boris Katz CSAIL MIT
      19. Andrew Lo CSAIL MIT and Sloan School of Management
      20. Nancy Lynch CSAIL MIT
      21. Sam Madden CSAIL MIT
      22. Una-May O'Reilly CSAIL MIT
      23. Rob Miller CSAIL MIT
      24. Aude Oliva CSAIL MIT
      25. Alex Pentland Human Dynamics Laboratory MIT Media Lab
      26. Ronitt Rubinfeld CSAIL MIT
      27. Cynthia Rudin MIT Sloan School of Management
      28. Michael Stonebraker CSAIL MIT
      29. Peter Szolovits CSAIL MIT
      30. Antonio Torralba CSAIL MIT
      31. Daniel Weitzner CSAIL MIT
      32. Nickolai Zeldovich CSAIL MIT
    3. RESEARCH
      1. RESEARCH PROJECTS
        1. A SURVEY OF SYSTEMIC RISK ANALYTICS
        2. BIG DATA FOR A BETTER LIFE: THE TRENTO SMART CITY PROJECT
        3. BLINKDB
        4. BUILDING THE NEXT GENERATION OF SEARCH ENGINES: GROWING A LIST OF ALMOST ANYTHING
        5. CLOUD-SCALE, FLEXIBLE SCALING AND FACTORING OF MACHINE LEARNING ALGORITHMS: THE FLEXGP PROJECT
        6. CONSUMER CREDIT-RISK MODELS VIA MACHINE-LEARNING ALGORITHMS
        7. DECLARATIVE, GRAPHICAL CONSTRUCTION OF COMPLEX REPORT QUERIES
        8. DYNAMIC REDUCTION OF QUERY RESULT SETS FOR INTERACTIVE VISUALIZATION
        9. ENERGY-EFFICIENT ALGORITHMS
        10. ENERGY: HYDROCARBON EXPLORATION
        11. ERASURE CODING MEETS DISTRIBUTED COMPUTING: CODED EMULATION OF SHARED MEMORY ARCHITECTURES
        12. EVIDENCE-BASED EDUCATION RESEARCH FOR ALL
        13. EXECUTION MIGRATION MACHINE (EM2)
        14. GROWING BIG LINKED DATA FROM SEED: BUILDING A DEMO
        15. IMAGES: LARGE SCALE VISION
        16. INFORMATION ACCOUNTABILITY
        17. INTERPRETING BIG DATA: MEANINGFUL MODELS FOR HEALTHCARE, CRIMINOLOGY, AND BEYOND
        18. MACHINE LEARNING
        19. MACHINE LEARNING FOR CRIME SERIES DETECTION
        20. MAINTAINING COHERENT MEMORY IN DYNAMIC DISTRIBUTED SYSTEMS
        21. NATIONAL ALLIANCE FOR MEDICAL IMAGE COMPUTING
        22. NATURAL LANGUAGE INTERFACE FOR BIG DATA
        23. PREDICTIVE ANALYTICS FOR POWER GRID MAINTENANCE
        24. PRIVACY-PRESERVING METHODS FOR SHARING FINANCIAL RISK EXPOSURES
        25. SCIDB
        26. SMART TRANSPORTATION: CARTEL
        27. SOCIAL: CONDENSR
        28. SOCIAL: INFLUENCE MODELING
        29. SOCIAL: TWITINFO
        30. SPORTS ANALYTICS
        31. SUBLINEAR TIME ALGORITHMS
        32. TUNABLE FAST SIMILARITY SEARCH FOR HIGH-DIMENSIONAL DATA
        33. UNCOVERING CLINICALLY RELEVANT MEDICAL KNOWLEDGE
        34. VISION MACHINE: LEARNING ONLINE FROM 25 MILLION IMAGES
    4. EVENTS
      1. UPCOMING MEETINGS
      2. UPCOMING TALKS
      3. ARCHIVED TALKS
      4. MEMBER MEETINGS
        1. BIG DATA ADVISORY BOARD MEETINGS
        2. MAR 26 2014 MEMBER WORKSHOP: BIG DATA ANALYTICS [AGENDA, SLIDES & VIDEO]
        3. MAR 3 2014 MIT WHITE HOUSE BIG DATA PRIVACY WORKSHOP [AGENDA, SLIDES & VIDEO] -- [DOWNLOAD WORKSHOP REPORT]
        4. NOV 13-14 2013 ANNUAL BIG DATA MEMBER MEETING [AGENDA, SLIDES & VIDEO]
        5. JUN 19 2013 MEMBER WORKSHOP: BIG DATA PRIVACY [AGENDA, SLIDES & VIDEO] -- [DOWNLOAD WORKSHOP REPORT]
        6. APR 4 2013 MEMBER WORKSHOP: BIG DATA INTEGRATION [AGENDA, SLIDES & VIDEO] -- [DOWNLOAD WORKSHOP REPORT]
        7. NOV 28-29 2012 ANNUAL BIG DATA MEMBER MEETING [AGENDA, SLIDES & VIDEO]
    5. PARTNERS
      1. MEMBERS
      2. DATA PARTNERS
    6. SPECIAL PROJECTS
      1. BIG DATA PRIVACY
        1. Workshop: Big Data Privacy
        2. DEFINING “PRIVACY” IN A BIG DATA WORLD
          1. DAVID VLADECK - GEORGETOWN UNIVERSITY LAW CENTER
        3. BIG DATA, SYSTEMIC RISK, AND PRIVACY-PRESERVING RISK MEASUREMENT
          1. ANDREW LO - MIT SLOAN SCHOOL OF MANAGEMENT
        4. BIG DATA: NEW OIL OF THE INTERNET
          1. ALEX (SANDY) PENTLAND - MIT MEDIA LAB
        5. DIFFERENTIAL PRIVACY
          1. KOBBI NISSIM - BEN-GURION UNIVERSITY
        6. NO FREE LUNCH
          1. ASHWIN MACHANAVAJJHALA - DUKE UNIVERSITY
        7. DEVELOPING ACCOUNTABLE SYSTEMS
          1. LALANA KAGAL - MIT CSAIL
        8. ENCRYPTED QUERY PROCESSING
          1. RALUCA ADA POPA - MIT CSAIL
      2. DATA CHALLENGES
        1. The MIT Big Data Challenge
        2. TAKE ME TO THE CITY OF BOSTON TRANSPORTATION CHALLENGE PAGE!
        3. OVERVIEW
          1. MIT Big Data Challenge: Transportation in the City of Boston
          2. MIT Big Data Challenge: Transportation in the City of Boston Visualization and Prediction Winners!
      3. A LIVING LAB
    7. NEWS
      1. WHITE HOUSE & MIT RELEASE REPORTS ON BIG DATA AND PRIVACY
      2. JUN YANG'S TALK
      3. JANET WIENER'S TALK
      4. MEMBER WORKSHOP ON DATA ANALYTICS
      5. BIG, FAST, WEIRD DATA
      6. BOSTON GLOBE ARTICLE ON MIT BIG DATA CHALLENGE
      7. MIT, WHITE HOUSE CO-SPONSOR WORKSHOP ON BIG-DATA PRIVACY
      8. MAR 26 2014 MEMBER WORKSHOP: BIG DATA ANALYTICS
      9. WHITE HOUSE TO CO-HOST WORKSHOP AT MIT ON ‘BIG DATA’ AND PRIVACY
      10. BRUCE SCHNEIER TALK
      11. MIT LAUNCHES FIRST ONLINE PROFESSIONAL COURSE ON BIG DATA
      12. JAN 31 2014 NEW ENGLAND DATABASE SUMMIT
      13. DEC 10 2013 BIG DATA CHALLENGE MEETUP
      14. MIT BIG DATA INITIATIVE LAUNCHES TRANSPORTATION CHALLENGE, PRIVACY WORKING GROUP AT WHITE HOUSE EVENT
      15. NOV 13-14 2013 ANNUAL BIG DATA MEMBER MEETING
      16. NSA MASS SURVEILLANCE: TURNING THE FOURTH (AND FIRST) AMENDMENT ON ITS HEAD
      17. MICHAEL I. JORDAN TALK
      18. BIG DATA DESIGN CONTEST
      19. BIG DATA AND BLINKDB: A QUERY ENGINE FOR BIG DATA
      20. BIG DATA AND THE LAW: APPLYING NLP TO SUPREME COURT DECISIONS
      21. MIT BIG DATA INITIATIVE AT CSAIL IS AWARDED GRANT FROM ALFRED P. SLOAN FOUNDATION
      22. JUNE 19 2013 WORKSHOP: BIG DATA PRIVACY
      23. BIG DATA FOR THE REST OF US
      24. APRIL 4 2013 MEMBER WORKSHOP: BIG DATA INTEGRATION
      25. FEB 1 2013 NEW ENGLAND DATABASE SUMMIT
      26. MIT NEWS: SPEEDING ALGORITHMS BY SHRINKING DATA
      27. MIT NEWS: MINING PHYSICIANS' NOTES FOR MEDICAL INSIGHTS
      28. NOV 28-29 2012 BIGDATA@CSAIL MEMBER MEETING
      29. CSAIL TACKLES BIG DATA
      30. LOGIN

Data Science for MIT Big Data Initiative

Story

Data Science for MIT Big Data Initiative

Great presentation by Madden and Stonebraker at our June 30th Meetup

Request: I was looking to download the MIT Big Data Challenge Data Set: http://bigdatachallenge.csail.mit.edu/datasets, but found it is no longer available. Could it still be made available for our members to work with?

In the meantime please see: Data Visualization for the Rest of Us using the Boston Hubway Data Visualization challenge

The knowledge base below shows the details of this major big data program!

I like the listing of all 32 PEOPLE and all 34 Research Projects and who is doing them, like Michael Stonebraker CSAIL MIT doing SCIDB.

MORE TO FOLLOW

The MIT Big Data Challenge Data Set is available and I prepared the following files for future DC Data Science and Data Owls Meetups:

Data Mining Documentation

Spreadsheet File Inventory

Spotfire File (162 MB)

Spotfire Dashboard

For Internet Explorer users and those wanting full-screen display, use: Web Player. Get Spotfire for iPad App.


Slides


Slide 1 What MIT CSAIL is Doing About Big Data

http://bigdata.csail.mit.edu/
http://istc-bigdata.org/

SamuelMadden06302014Slide1.PNG

Slide 2 Big Data

A driver’s risk of getting in an accident
Probability of default on a loan
The best ad to show to a user
Effectiveness of treatments for patients
Probability of aircraft engine failure

SamuelMadden06302014Slide2.PNG

Slide 3 When Do You Have a Big Data Problem?

Example: Medical Costs
Found that 2 doctors were ordering the vast majority of x-rays (unnecessary)
Group sharing of data among doctors can be very effective at lowering costs and improving health

SamuelMadden06302014Slide3.PNG

 

Slide 4 Example: Medical Costs

Found that 2 doctors were ordering the vast majority of x-rays (unnecessary)
Group sharing of data among doctors can be very effective at lowering costs and improving health

SamuelMadden06302014Slide4.PNG

Slide 5 Why Big Data Now

SamuelMadden06302014Slide5.PNG

Slide 6 What are we (CSAIL) Doing?

http://bigdata.csail.mit.edu/
http://istc-bigdata.org/

SamuelMadden06302014Slide6.PNG

Slide 7 What is the ISTC in Big Data?

SamuelMadden06302014Slide7.PNG

Slide 8 bigdata@csail: focus

SamuelMadden06302014Slide8.PNG

Slide 9 Mission

SamuelMadden06302014Slide9.PNG

Slide 10 bigdata@csail: Founding Members

SamuelMadden06302014Slide10.PNG

Slide 11 Meetings & Workshops

SamuelMadden06302014Slide11.PNG

Slide 12 Annual Workshops & Meetings

SamuelMadden06302014Slide12.PNG

Slide 13 Big Data Privacy Workshop

SamuelMadden06302014Slide13.PNG

Slide 14 Big Data & Privacy Working Group

SamuelMadden06302014Slide14.PNG

Slide 15 Education & Impact

SamuelMadden06302014Slide15.PNG

Slide 16 MIT Big Data Challenge: Innovation + Students 1

SamuelMadden06302014Slide16.PNG

Slide 17 MIT Big Data Challenge: Innovation + Students 2

http://bigdatachallenge.csail.mit.edu

winners and viz gallery: http://bigdata.csail.mit.edu/challenge_2014

SamuelMadden06302014Slide17.PNG

Slide 18 BigDataX: Tackling the Challenges of Big Data

SamuelMadden06302014Slide18.PNG

Slide 19 Research

SamuelMadden06302014Slide19.PNG

Slide 20 big data: systems and tools 1

http://bigdata.csail.mit.edu/research_projects_2 (Access Denied)

SamuelMadden06302014Slide20.PNG

Slide 22 Example Research: MapD

SamuelMadden06302014Slide22.PNG

Slide 23 What is a GPU Accelerated Database

SamuelMadden06302014Slide23.PNG

Slide 24 Demo 1

SamuelMadden06302014Slide24.PNG

Slide 25 Demo 2

SamuelMadden06302014Slide25.PNG

Slide 26 Demo 3

SamuelMadden06302014Slide26.PNG

Slide 27 Demo 4

SamuelMadden06302014Slide27.PNG

Spotfire Dashboard

For Internet Explorer users and those wanting full-screen display, use: Web Player. Get Spotfire for iPad App.

Research Notes

Additional Information on Data Resources and Applications, by James Michaelson, PhD My Note: See PDF and Add Here

bigdata@csail

Source: http://bigdata.csail.mit.edu/

 
  • Big Data promises a better world. A world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources. These changes are enabled by a proliferation of new technologies and...
  • The MIT Big Data Challenge. Take me to the CITY OF BOSTON TRANSPORTATION challenge page! OVERVIEW: The MIT Big Data Initiative at CSAIL is organizing competitions designed to spur innovation in how we think about and use data to address major societal issues. Big Data...
  • MapD (Massively Parallel Database) is an analytics database being built by Todd Mostak and Prof. Sam Madden at MIT that allows interactive querying of big datasets. It takes advantage of the immense computational power and memory bandwidth available in commodity-level, off-the-shelf multicore...

 

Next Major Event: THE BIG DATA ANNUAL MEETING is on NOV 12-13th


Big Data Privacy Workshop: Advancing the State of the Art in Technology and Practice
Co-hosted with the White House Office of Science and Technology Policy
WATCH VIDEO or VIEW VIDEO CLIPS AND SLIDES
MIT BIG DATA PRIVACY WORKSHOP SUMMARY REPORT
 

MIT Big Data Challenge - Transportation in the City of Boston [Nov 12, 2013 - Jan 20, 2014]
VISIT VISUALIZATION SUBMISSIONS GALLERY!
AND THE WINNERS ARE...

 


The goal of the MIT Big Data Initiative, a multi-year effort launched in May 2012, is to identify and develop the new technologies needed to solve next-generation data challenges that will require the ability to scale well beyond what today's computing platforms, algorithms, and methods can provide. We want to enable people to leverage Big Data by developing systems and platforms that are reusable and scalable across multiple application domains.

Our approach includes two important aspects. First, we will work closely with key industry and government stakeholders to provide real-world applications and drive impact. Promoting in-depth interactions between academic researchers, industry, and government is a key goal. Second, we believe the solution to Big Data is fundamentally multi-disciplinary. The team includes faculty and researchers from diverse research backgrounds, including algorithms, architecture, data management, machine learning, privacy and security, user interfaces, and visualization, as well as domain experts in finance, industry, medicine, smart infrastructure, education, and science.

                       

 

ABOUT

The Big Data Problem


We define big data as data that is too big, too fast, or too hard for existing tools to process. Here, “too big” means that organizations increasingly have to deal with petabyte-scale collections of data, which come from click streams, transaction records, sensors, and many other places. “Too fast” means that not only is data big, but it needs to be processed quickly – for example, to perform fraud detection at a point of sale, or determine what ad to show to a user on a web page. “Too hard” is a catchall for data that doesn’t fit neatly into an existing processing tool, i.e., data that needs more complex analysis than existing tools can readily provide. Examples of the big data problem abound.


Web Analytics

On the Internet, many websites now register millions of unique visitors per day. Each of these visitors may access and create a range of content. This can easily amount to tens to hundreds of gigabytes per day (tens of terabytes per year) of accumulating user and log data, even for medium sized websites. Increasingly, companies want to be able to mine this data to understand limitations of their site, improve response time, offer more targeted ads, and so on. Doing this requires tools that can perform complicated analytics on data that far exceeds the memory of a single machine or even a cluster of machines.


Finance


As another example, consider the big data problem as it applies to banks and other financial organizations. These organizations have vast quantities of data about consumer spending habits, credit card transactions, financial markets, and so on. This data is massive: for example, Visa processes more than 35B transactions per year; if it records 1 KB of data per transaction, this represents about 35 terabytes of data per year. Visa, and the large banks that issue Visa cards, would like to use this data in a number of ways: to predict customers at risk of default, to detect fraud, to offer promotions, and so on. This requires complex analytics. Additionally, this processing needs to be done quickly and efficiently, and needs to be easy to tune as new models are developed and refined.
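
The storage estimate scales linearly with the size of each transaction record, so it is worth checking under a few assumptions. The sketch below takes the 35 billion transactions per year from the text; the record sizes and decimal units are illustrative assumptions, not Visa figures.

```python
# Back-of-the-envelope yearly data volume for a payment network.
# The transaction count comes from the text; record sizes are assumptions.

TRANSACTIONS_PER_YEAR = 35_000_000_000
KB = 1_000        # decimal kilobyte, in bytes
TB = 1_000 ** 4   # terabyte, in bytes
PB = 1_000 ** 5   # petabyte, in bytes


def yearly_bytes(record_size_bytes: int) -> int:
    """Total bytes stored per year at a fixed record size."""
    return TRANSACTIONS_PER_YEAR * record_size_bytes


for size_kb in (1, 10, 100):
    total = yearly_bytes(size_kb * KB)
    print(f"{size_kb:>3} KB/record -> {total / TB:,.0f} TB ({total / PB:.3f} PB) per year")
```

At 1 KB per record the total is 35 TB per year; reaching 3.5 PB per year would take records of roughly 100 KB each.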


Medical


Consider the impact of new sensors on our ability to continuously monitor a patient's health. Recent advances in wireless networking, miniaturization of sensors via MEMS processes, and incredible advances in digital imaging technology have made it possible to cheaply deploy wearable sensors that monitor a number of biological signals on patients, even outside of the doctor’s office. These signals measure functioning of the heart, brain, circulatory system, etc. Additionally, accelerometers and touch screens can be used to assess mobility and cognitive function. This creates an unprecedented opportunity for doctors to provide outpatient care, by understanding how patients are progressing outside of the doctor’s office, and when they need to be seen urgently. Additionally, by correlating signals from thousands of different patients, it becomes possible to develop a new understanding of what is normal or abnormal, or what kinds of signal features are indicative of potentially serious problems.


Similar challenges arise across most industry sectors today including healthcare, finance, government, transportation, biotech and drug discovery, insurance, retail, telecommunications and energy.

APPROACH

We believe the solution to big data is fundamentally multi-disciplinary. Our approach brings together world leaders in parallel architecture, massive-scale data processing, algorithms, machine learning, visualization, and interfaces to collectively identify and address the fundamental technology challenges we face with Big Data.

Our approach focuses on four broad research themes, summarized below:

Computational Platforms
Scalable Algorithms
Machine Learning and Understanding
Privacy and Security

Computational Platforms

We are building several parallel data processing platforms, including SciDB, BlinkDB, and several cloud-based deployment platforms, including FOS and Relational Cloud. The goal of these platforms is to make it easy for developers of big data applications to write programs much as they would on a single-node computational environment, and to be able to rapidly deploy those applications on tens or hundreds of nodes. Additionally, as the computation and storage requirements of applications change, these platforms should be able to dynamically and elastically adapt to those changes.

Scalable Algorithms

We are developing a range of algorithms designed to deal with very large volumes of data, and to process that data in parallel. These include parallel implementations of a range of known algorithms, including matrix computations, as well as statistical operations like regression, optimization methods like gradient descent, and machine learning algorithms like clustering and classification. In addition, we are developing fundamental new types of algorithms designed to handle the challenges of Big Data. For example, we are working on sublinear algorithms that can compute a range of statistics, such as estimates of the number of distinct items in a set, using space that is exponentially smaller than the input. Additionally, we are developing new algorithms for encoding, comparing, and searching massive data sets; specific examples include hash-based similarity search on massive scale data, and algorithms for compressed sensing that provide a new way to encode sparse matrices that arise in a number of scientific applications.
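
To make the distinct-count example concrete, here is a minimal sketch of one well-known small-space estimator, the k-minimum-values (KMV) sketch. It is an illustration chosen for brevity, not necessarily the algorithm used in this research: it keeps only the `k` smallest hash values seen, so its memory use is fixed regardless of stream length. The function names and the choice of SHA-256 as the hash are assumptions.

```python
import hashlib
import heapq


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k: int = 256) -> float:
    """Estimate the number of distinct items using only the k smallest hashes."""
    heap = []        # max-heap (negated values) holding the k smallest hashes
    members = set()  # the same k values, for duplicate checks
    for item in stream:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the current largest of the k smallest values.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))   # fewer than k distinct items: count is exact
    return (k - 1) / -heap[0]     # standard KMV estimator


# 60,000 events drawn from 20,000 distinct (hypothetical) user ids.
events = (f"user-{i % 20_000}" for i in range(60_000))
print(round(estimate_distinct(events)))
```

With `k = 256` the relative error is typically a few percent. Sketches in the HyperLogLog family reach similar accuracy in even less space.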

Machine Learning and Understanding

On top of these algorithms, we are deploying a number of novel machine learning applications focused on machine understanding in specific domains. For example, in work on scene understanding in images we are building tools that automatically label parts of an image, or that classify an image as belonging to a certain category or categories based on the types of objects that appear in it. As a second example, we are using natural language processing to convert massive quantities of text tweets and text reviews on the web into structured information about products, restaurants, and services that indicates the type of content in some text (e.g., a food review, a rating), an assessment of the sentiment of the text, etc.

Privacy and Security

Finally, because much of the mining and analysis involved in a big data context involves sensitive, private information, we are working on technologies and policies for protecting and anonymizing data, and for allowing people to retain control over their data. As an example, in the CryptDB project, we are building a database system that stores data in an encrypted format in the cloud, in such a way that a curious database or system administrator cannot decrypt the data. Users retain the encryption keys for their data, but can execute queries over that encrypted data on the database server, enabling much better performance than simply sending the data back and decrypting it on the client’s machine.
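
The idea of a server answering queries over data it cannot read can be illustrated with a very small sketch. This is not CryptDB's design. It shows only one ingredient, equality matching on deterministic keyed tokens, using HMAC from the Python standard library. The `EqualityIndex` class and `make_token` helper are hypothetical names, and a real system would also store proper ciphertext so the user can recover the values.

```python
import hashlib
import hmac
import secrets


class EqualityIndex:
    """Server-side store that sees only keyed tokens, never plaintext values."""

    def __init__(self):
        self._rows = {}  # token -> list of row ids

    def insert(self, token: bytes, row_id: int) -> None:
        self._rows.setdefault(token, []).append(row_id)

    def lookup(self, token: bytes) -> list:
        return list(self._rows.get(token, []))


def make_token(key: bytes, value: str) -> bytes:
    """Deterministic keyed token: equal plaintexts give equal tokens."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()


client_key = secrets.token_bytes(32)  # stays with the user, never sent to the server
server = EqualityIndex()
for row_id, city in enumerate(["Boston", "Trento", "Boston"]):
    server.insert(make_token(client_key, city), row_id)

# The client asks "which rows equal Boston?" without revealing the word itself.
print(server.lookup(make_token(client_key, "Boston")))  # [0, 2]
```

The trade-off is that deterministic tokens reveal which rows hold equal values, so this kind of scheme suits only columns where that leakage is acceptable.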

Work in these four areas is coupled with application experts in finance (Professor Andrew Lo), medicine (Professor John Guttag), science (Professor Michael Stonebraker), education (through a relationship with the MITx initiative), and transportation (Professor Balakrishnan and Professor Madden).

TEAM

Sam Madden
CSAIL MIT
Faculty Director

Elizabeth Bruce
CSAIL MIT
Executive Director

Susana Kevorkova
CSAIL MIT
Administrative Assistant

Sheila Marian
CSAIL MIT
Administrative Assistant
 

CONTACT

Big Data requires a new generation of technologies to store, manage, analyze, share and understand the huge quantities of data we are now capable of collecting -- join bigdata@CSAIL to learn more.

For sponsors/membership

Elizabeth Bruce
Executive Director, MIT Big Data Initiative
ebruce@csail.mit.edu

For general information

Susana Kevorkova
Assistant to Elizabeth Bruce & Big Data Initiative
MIT-CSAIL, 32 Vassar Street, 32-G206, Cambridge, MA 02139
Phone: 617-324-8424, E-Mail: skevorkova@csail.mit.edu

Sheila Marian
Sr. Administrative Assistant to Professor Sam Madden
MIT-CSAIL, 32 Vassar Street, 32-G944, Cambridge, MA 02139
Phone: 617-253-1996, E-Mail: sheila@csail.mit.edu

 

PEOPLE

Hal Abelson CSAIL MIT

Saman Amarasinghe CSAIL MIT

Hari Balakrishnan CSAIL MIT

Regina Barzilay CSAIL MIT

Tim Berners-Lee CSAIL MIT

Elizabeth Bruce CSAIL MIT

Erik Demaine CSAIL MIT

Srini Devadas CSAIL MIT

Alan Edelman CSAIL MIT

John Fisher CSAIL MIT

William Freeman CSAIL MIT

Polina Golland CSAIL MIT

John Guttag CSAIL MIT

Piotr Indyk CSAIL MIT

Tommi Jaakkola CSAIL MIT

Lalana Kagal CSAIL MIT

David Karger CSAIL MIT

Boris Katz CSAIL MIT

Andrew Lo CSAIL MIT and Sloan School of Management

Nancy Lynch CSAIL MIT

Sam Madden CSAIL MIT

Una-May O'Reilly CSAIL MIT

Rob Miller CSAIL MIT

Aude Oliva CSAIL MIT

Alex Pentland Human Dynamics Laboratory, MIT Media Lab

Ronitt Rubinfeld CSAIL MIT

Cynthia Rudin MIT Sloan School of Management

Michael Stonebraker

Peter Szolovits CSAIL MIT

Antonio Torralba CSAIL MIT

Daniel Weitzner CSAIL MIT

Nickolai Zeldovich CSAIL MIT

RESEARCH

RESEARCH PROJECTS

 
A SURVEY OF SYSTEMIC RISK ANALYTICS
Andrew Lo, Dimitrios Bisias, Mark Flood, Stavros Valavanis

We provide a survey of 31 quantitative measures of systemic risk in the economics and finance literature, chosen to span key themes and issues in systemic risk measurement and management. We motivate these measures from the supervisory, research, and data perspectives in the main text, and present concise definitions of each risk measure—including required inputs, expected outputs, and data requirements—in an extensive appendix.

We have deployed a data sensing and data sharing architecture in the city of Trento in order to "mash up" government, company, and individual mobile data. The goal is to validate the value, monetization, and privacy/ownership issues of Big Data in running a "smart city." Joint with Telefonica, Telecom Italia, Government of Trento, European Inst. Technology.

BlinkDB is a database system that runs on top of Hadoop (MapReduce), running SQL queries and translating them into MapReduce jobs. The key idea is that rather than running queries over the entire data set, it runs queries on a random (precomputed) sample of the data, and uses sampling theory to estimate the true query answer.
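
As a rough illustration of answering a query from a sample, the sketch below estimates an average from a uniform random sample and attaches a textbook 95% confidence interval. It is not BlinkDB code: the synthetic `population` stands in for a full table, and the plain normal-approximation interval is an assumption made for simplicity.

```python
import math
import random


def approximate_mean(sample, z: float = 1.96):
    """Return (estimate, half_width) of an approximate 95% interval for the mean."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean, z * math.sqrt(variance / n)


rng = random.Random(0)
population = [rng.uniform(0, 100) for _ in range(200_000)]  # stand-in for a full table
sample = rng.sample(population, 2_000)                       # precomputed uniform sample

estimate, half_width = approximate_mean(sample)
true_mean = sum(population) / len(population)
print(f"estimate = {estimate:.2f} +/- {half_width:.2f}, exact = {true_mean:.2f}")
```

The query touches 1% of the rows, and the interval width shrinks with the square root of the sample size, which is the basic trade-off between accuracy and response time.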

The next generation of search engines should not simply retrieve URLs, but should aim at retrieving information. We designed a system that leads into this next generation, leveraging information from across the Internet to grow an authoritative list on almost any topic. Our method starts from a small seed of examples, and intelligently grows a list of items relevant to the seed.

Scalable machine learning systems will underpin the complex data analytics afforded by BigData. The goal of the FlexGP project is scaled machine learning which exploits cloud-scale parallelism and resource elasticity. FlexGP launches diverse learning engines onto the cloud which are different along training data partitions and explanatory dimensions as well as model expression and objective function choices. Among other related questions, the Evolutionary Design and Optimization Group is working to learn how much data is enough in the context of BigData's high dimensionality and volume and how the diverse outcomes of factored learners can be efficiently fused.

CONSUMER CREDIT-RISK MODELS VIA MACHINE-LEARNING ALGORITHMS
Andrew Lo, Amir E. Khandani, Adlar J. Kim

We apply machine-learning techniques to construct nonlinear nonparametric forecasting models of consumer credit risk. By combining customer transactions and credit bureau data from January 2005 to April 2009 for a sample of a major commercial bank’s customers, we are able to construct out-of-sample forecasts that significantly improve the classification rates of credit-card-holder delinquencies and defaults, with linear-regression R² values of forecasted versus realized delinquencies of 85%.

Many use cases for business-oriented databases involve the creation of tailor-made summaries known as "reports". Report development is tedious because multiple SQL queries may be required to generate a single report, because queries may include complex combinations of formulas and aggregate functions (e.g. averages of totals), and because the visual output layout of non-tabular results must be manually defined through the use of templating languages or a graphical form editor. This project aims to reduce the user input required for report development by combining a visual query system generating hierarchical data with a system for automatic creation of report page layouts.

Modern database management systems (DBMS) have been designed to efficiently store, manage and perform computations on massive amounts of data. In contrast, many existing visualization systems do not scale seamlessly from small data sets to enormous ones. We have designed a three-tiered visualization system called ScalaR to deal with this issue. ScalaR dynamically performs resolution reduction when the expected result of a DBMS query is too large to be effectively rendered on existing screen real estate. Instead of running the original query, ScalaR inserts aggregation, sampling or filtering operations to reduce the size of the result. This paper presents the design and implementation of ScalaR, and shows results for an example application, displaying satellite imagery data stored in SciDB as the back-end DBMS.
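
The resolution-reduction decision can be sketched in a few lines. ScalaR applies it by rewriting the DBMS query; the in-memory version below only illustrates the choice between aggregation, sampling and filtering once a result exceeds a rendering budget. The function name, parameters and strategies as written are assumptions, not ScalaR's interface.

```python
import random


def reduce_resolution(rows, max_rows, strategy="aggregate", seed=0):
    """Shrink a list of numeric readings so that at most max_rows remain."""
    if len(rows) <= max_rows:
        return list(rows)  # small enough to render as-is
    if strategy == "aggregate":
        # Average consecutive groups; ceiling division keeps the output in budget.
        group = -(-len(rows) // max_rows)
        return [
            sum(rows[i:i + group]) / len(rows[i:i + group])
            for i in range(0, len(rows), group)
        ]
    if strategy == "sample":
        return random.Random(seed).sample(list(rows), max_rows)
    if strategy == "filter":
        return list(rows[:max_rows])
    raise ValueError(f"unknown strategy: {strategy}")


readings = list(range(10))
print(reduce_resolution(readings, 4))  # [1.0, 4.0, 7.0, 9.0]
```

Aggregation preserves the overall shape of the data, sampling preserves individual values, and filtering preserves one region exactly, so the right choice depends on what the visualization needs to show.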

The new field of energy-efficient algorithms aims to develop new techniques for solving computational problems with vastly reduced energy consumption (for some problems, by several orders of magnitude) in exchange for a small increase in time and memory requirements. Specifically, we explore how to algorithmically exploit reversible computation, an idea that has been around since the 1970s and has just started to become a practical reality in the latest AMD chips, but for which we have only just begun to understand how to design efficient algorithms. Our preliminary investigations indicate that some basic problems (but not others) can have their energy cost reduced to far less than that spent by a traditional algorithm on traditional hardware. This theoretical groundwork will become especially important over the next two decades, when Landauer’s Principle predicts that the energy efficiency of computers (kilowatt hours spent per nonreversible computation) must stop improving. Improving the energy efficiency of computation will reduce the ~$15 billion annual cost spent to power today’s data centers, not to mention many other computers in the world processing big data. Furthermore, reducing the energy consumed by a chip reduces its heat generation, which should allow that chip to run proportionally faster with the same cooling system, or to run proportionally longer powered by the same battery. Thus our work will enable computation to become cheaper, faster, and longer lasting, and determine the theoretical limits thereof.
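
For reference, Landauer's Principle puts the minimum energy needed to erase one bit at k·T·ln 2, where k is the Boltzmann constant and T the absolute temperature. The short calculation below evaluates that bound; the 300 K operating temperature is an assumption.

```python
import math

BOLTZMANN = 1.380649e-23  # J/K, exact SI value


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy to erase one bit at the given temperature."""
    return BOLTZMANN * temperature_kelvin * math.log(2)


print(f"{landauer_limit_joules(300.0):.3e} J per erased bit")  # about 2.871e-21 J
```

Reversible computation avoids erasing bits, which is why it is not subject to this floor.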

In this project, the goal is to identify boundaries between different types of underground rocks using seismic sensors. Such boundaries are of interest in hydrocarbon exploration as they are places where oil is often present. These sensors produce massive streams of data that need to be mined to understand the location of boundaries.

ERASURE CODING MEETS DISTRIBUTED COMPUTING: CODED EMULATION OF SHARED MEMORY ARCHITECTURES
Nancy Lynch, Viveck R. Cadambe (MIT), Muriel Médard (MIT), Peter Musial (EMC Corporation)

Modern data services manage data by storing it on distributed servers. It is a common requirement of such systems that they ensure that the data is available to the users even though the system components can be unreliable, for instance, the servers can crash. It is also important that the data should appear to the users as if stored on a single centralized system, even though the data is stored in distributed servers. For instance, when the data is being constantly updated, a user that reads from the system should obtain the latest version. In this talk, we present a new approach for the design of storage systems satisfying these requirements. Our approach merges ideas from two distinct fields, coding theory and distributed computing theory, to create a solution that incurs a significantly smaller footprint in terms of the amount of communication and storage as compared with current state of the art.
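
The storage saving that coding offers over plain replication can be seen in the smallest possible erasure code. The sketch below splits a value into two halves plus one XOR parity fragment, so any two of the three fragments rebuild it. It covers only the coding ingredient; the consistency protocol that is the subject of this work is not modeled, and the function names are assumptions.

```python
def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes):
    """Split data into two halves plus an XOR parity fragment (a (3, 2) code)."""
    if len(data) % 2:
        data += b"\x00"  # pad to even length; a real system would record the length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, _xor(a, b)]


def decode(fragments):
    """Rebuild the (padded) data from any two fragments; a lost one is None."""
    a, b, parity = fragments
    if a is None:
        a = _xor(b, parity)
    elif b is None:
        b = _xor(a, parity)
    return a + b


fragments = encode(b"big data")
fragments[0] = None        # simulate the crash of one server
print(decode(fragments))   # b'big data'
```

Each server stores half the value, so surviving one crash costs 1.5 times the original size, compared with 2 times for keeping two full replicas.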

EVIDENCE-BASED EDUCATION RESEARCH FOR ALL
Daniel Weitzner, Una-May O'Reilly, Hal Abelson (CSAIL, MIT); Lori Breslow (Director, MIT Teaching & Learning Laboratory)

We are using the ALFA group's MOOCdb project as a framework to explore specific technical aspects related to global research collaboration around online education data. We are exploring software sharing, common data organization, data privacy and data access. The piloting effort uses data from several MITx courses. Our activities recognize the importance of a transparent, fair access policy on online education data which respects privacy, confidentiality and the legal obligations of data controllers. We recognize that computer technology will play a key role in supporting general access so that respect for privacy and confidentiality is ideally balanced with the public good of investigating the digital evidence gathered from online education offerings. We are interested in joining forces with all who want to participate in these activities. Please contact us for more information.

Big Data needs Big Processors, and Big Processors need Big Caches. Increasingly, however, power and thermal considerations dictate that many small processors and many small caches supplant the paradigm of few big processors and caches. The Execution Migration Machine (EM²) project aims to find the best way of using these resources.

The goal of Linked Data is to replace traditional app-data silos with a universal integration platform to provide globally contextualized information using global identifiers, authentication, authorization, storage, and privacy. The architecture separates application from data giving users control of the data and where it’s stored, independent of the choice of application.

The goal of this project is to study computer and human vision when large amounts of visual data become available. We are developing the Scene UNderstanding (SUN) database, a large database of images found on the web, organized by scene type, that are being fully segmented and annotated. With this large database we are developing computer vision algorithms for scene understanding that make use of a large training set combined with non-parametric (memory-based) methods. In parallel, we are also studying how humans memorize large amounts of visual information. As a result, we try to understand which representations might be useful for developing new efficient computer vision algorithms, and also how we can use computer vision models of human memory to predict which images will be remembered.

INFORMATION ACCOUNTABILITY
Daniel Weitzner, Hal Abelson, Tim Berners-Lee, Joan Feigenbaum, Jim Hendler, Gerald Sussman

Ease of information flow is both the boon and the bane of large-scale, decentralized systems like the World Wide Web. For all the benefits and opportunities brought by the information revolution, with that same revolution have come the challenges of inappropriate use. Such excesses and abuses in the use of information are most commonly viewed through the lens of information security. This paper argues that debates over online privacy, copyright, and information policy questions have been overly dominated by the access restriction perspective. Our alternative is to design systems that are oriented toward information accountability and appropriate use, rather than information security and access restriction. Our goal is to extend the Web architecture to support transparency and accountability.

We are designing machine learning algorithms whose results are meaningful and intuitive to human experts, yet have predictive accuracy on par with state-of-the-art machine learning algorithms. These models are parsimonious (sparse), as humans can handle only a handful of cognitive entities at one time, and take the form of a decision list or linear scoring system. Our algorithms have been used to design predictive models for medical condition prediction, stroke prediction in medical patients, prediction of violent crime in youth raised in out-of-home care, and for other applications.

Modern use of data relies heavily on predictive modeling. Machine learning methods are needed to distill large, heterogeneous, and fragmented data sources into useful pieces of information such as answers to search queries, purchasing patterns of customers, or likely actions of mobile users. This research focuses on predicting the behavior of mobile users -- actions they are likely to take in any particular context -- based on a collection of intermittent sensors such as GPS, wifi, accelerometer, and others. Our goal is to develop methods that will be useful more broadly.

Many crimes can happen every day in a major city, and figuring out which ones are committed by the same individual or group is an important and difficult data mining challenge. To do this, we propose a pattern detection method called Series Finder. Series Finder incorporates both the common characteristics of all patterns and the unique aspects of each specific pattern. This is joint work between MIT and the Cambridge Police Department.

Modern data services manage data by storing it on distributed servers. The data should appear to its users to be consistent, as if it were maintained on a single centralized server, even though it is actually distributed and replicated. It should be efficient to access, and available in the face of unpredictable failures and other network changes. We design and analyze algorithms to implement such data services.

The National Alliance for Medical Image Computing (NA-MIC) is a multi-institutional, interdisciplinary team of computer scientists, software engineers, and medical investigators who develop computational tools for the analysis and visualization of medical image data.

As we develop storage and compute platforms for scaling to big data and practical algorithms for efficiently processing it, we will need to create new ways to access and interact with massive scale data. A comprehensive solution to the problem of dealing with large amounts of Web and sensor data involves not only analysis strategies, but also access strategies. It is entirely possible that for a given, large dataset, there will be hundreds if not thousands of distinct types of queries that may be applied to the data.

In collaboration with New York City's power company, Con Edison, we aim to identify components of New York's underground electrical grid that are the most vulnerable to failure. This information can be used to assist with Con Edison's pre-emptive maintenance and repair programs. We focus in particular on prediction of manhole events (fires, explosions, smoking manholes) on the low-voltage network. These events can be difficult to predict, and some of the data used for the project are over a century old. This project is the winner of the 2013 INFORMS Innovative Applications in Analytics Award.

The vast majority of machine learning, statistical, and scientific operations can be expressed via a small number of linear algebra operations. SciDB is a database system designed to support scalable linear algebra over massive arrays stored on the disks of a large cluster of machines. It is much faster than relational databases on these types of workloads, and scales to much larger datasets than main-memory matrix-oriented systems like Matlab and R.

The goal of CarTel (“car telecommunications”) is to investigate how sensor equipped cars and smartphones can be used to capture information about the transportation network and urban environment in general. Example results include an interactive map of the biggest potholes in Cambridge and Boston, collected using car-mounted accelerometers, and traffic aware routing, where real-time traffic delays from cars are used to find the fastest driving routes.

Condensr is a review summarization system that processes Yelp restaurant reviews and categorizes them, breaking down reviews into comments about food, ambiance, service and value, as well as giving an overall summary of reviewer sentiment. The goal is to go beyond a simple star rating to give the overall consensus of diners about various aspects of a restaurant experience.

The goal of this project is to learn how people inside of large organizations influence each other, and to track the flow of influence throughout an organization. Relationships can be modeled as graphs, with edges indicating the degree of influence. Weights are learned from a variety of data sources, including personal communication and data gathered from sensors about face-to-face interaction.

TwitInfo extracts a series of tweets that match a keyword from Twitter and arranges them on a timeline, providing a quick summary of a collection of tweets on a topic in a simple visualization. The key idea is to identify “peaks” in the frequency of tweets that represent interesting occurrences in time (e.g., points scored in a sporting event, or a major speech by a politician).
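
One simple way to flag such peaks is to compare each time bin with the mean and spread of the bins just before it. The sketch below does that with a trailing window. It is an illustrative detector, not TwitInfo's published algorithm, and the window size, threshold and sample counts are assumptions.

```python
from statistics import mean, pstdev


def find_peaks(counts, window=5, threshold=2.0):
    """Return indices whose count exceeds the trailing-window mean
    by more than `threshold` standard deviations."""
    peaks = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu, sigma = mean(history), pstdev(history)
        # Floor sigma at 1.0 so a perfectly flat history does not flag noise.
        if counts[i] > mu + threshold * max(sigma, 1.0):
            peaks.append(i)
    return peaks


tweets_per_minute = [10, 12, 11, 9, 10, 11, 48, 60, 13, 10, 11, 12]
print(find_peaks(tweets_per_minute))  # [6, 7]
```

Adjacent flagged bins, such as 6 and 7 here, would normally be merged into a single event before being labeled on the timeline.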

SPORTS ANALYTICS
Cynthia Rudin, Theja Tulabandhula

We are designing a prediction and decision system for real-time use during a professional car race. This project exemplifies some of the main challenges in Big Data, where careful knowledge of the domain (racing) needs to be infused within statistical modeling techniques in order to build a real-time decision system.

The goal of this project is to develop powerful algorithmic sampling techniques which allow one to estimate parameters of the data by viewing only a minuscule portion of it. Such parameters may be combinatorial, such as whether a large network has the "six degrees of separation" property; algebraic, such as whether the data is well-approximated by a linear function; or even distributional, such as whether the data comes from a distribution over a large number of distinct elements.

Locality-Sensitive Hashing (LSH) is an efficient algorithm for finding pairs of similar (or highly correlated) objects in a database without enumerating all pairs of such objects. Example applications include searching for near-duplicate documents, similar images, highly correlated stocks etc. Although the algorithm is very fast, one can envision further improvements in its efficiency by adapting it to specific data sets. The goal of this project is to develop tools and techniques for performing such tuning.
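
A minimal sketch of one LSH variant, MinHash with banding for set similarity, shows how candidate pairs emerge without comparing every pair. The `bands` and `rows` parameters are exactly the kind of knobs one would tune to a data set. The function names, the SHA-256 based hash family and the sample documents are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _h(seed: int, token: str) -> int:
    """One member of a seeded hash family."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes):
    """Signature entry i is the minimum of hash function i over the token set."""
    return [min(_h(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(documents, bands=16, rows=2):
    """Return pairs of document ids whose signatures agree on at least one band."""
    buckets = defaultdict(set)
    for doc_id, tokens in documents.items():
        sig = minhash_signature(tokens, bands * rows)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        # Only documents sharing a bucket are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": {"big", "data", "systems", "tools"},
    "b": {"big", "data", "systems", "tools"},
    "c": {"seismic", "rock", "boundary", "sensor"},
}
print(candidate_pairs(docs))
```

More rows per band make a match stricter and cut false candidates; more bands give similar pairs more chances to collide. Picking the two for a target similarity threshold is the tuning problem in miniature.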

The day-to-day practice of medicine is based largely on a combination of the personal experience of those making the decisions and non-patient-specific information derived by applying conventional statistical methods to large clinical trials. With the boom in the collection of clinical information in computationally accessible formats, it is now possible to use advanced machine learning and data mining techniques to put clinical decision making on a sounder, more patient-specific basis. That is the mission of CSAIL's Data-driven Medical Research Group. Current projects include risk stratification post acute coronary syndrome, prediction of impending heart failure, diagnosis of mental disease from EEG data, and understanding ways to reduce the prevalence of healthcare-associated infections.

We want to teach machines to see. If machines could recognize objects and locations, this would have a large impact on robotics, assistive care, and public safety, to name just a few areas. Presently, machine vision systems can recognize a small number of object categories, and can localize objects within an image only moderately well. The next big task in computer vision is to scale up object recognition: to reliably detect thousands of object classes. An important component of any long-term solution to the vision problem is online, unsupervised training from a massive dataset of images.

EVENTS

 

Upcoming Big Data Lectures/Talks

Archived Big Data Lectures/Talks (Members only)

Member meetings (Members only)

My Note: See Below

UPCOMING MEETINGS

JUN 30  2014:  11-12:30PM Advisory Board meeting
NOV 12-13 2014:  MIT Big Data Annual Member Meeting  +  Advisory Board meeting

Upcoming Big Data Privacy Working Group Meetings in 2014: regular monthly meetings on the second Wednesday of the month at 11:00am ET:
JUNE 11 2014  11AM
JULY 9  2014    11AM
AUGUST 13 2014 11AM
SEPTEMBER 10 2014  11AM
OCTOBER 8 2014  11AM
NOVEMBER 12 2014 11AM

UPCOMING TALKS

BIG DATA TALKS & LECTURE SERIES --FALL 2014

MIT CSAIL, Stata Center, Bldg 32

Talks will feature distinguished individuals from academia, industry and government including pre-eminent people from all the subfields of computer science that have something to say about data, data processing and analytics, as well as people from organizations that are consumers of Big Data from both industry and government.

Prior speakers:  Erik Brynjolfsson (MIT Sloan School); Ion Stoica (UC Berkeley); Alyosha Efros (CMU); Murli Buluswar (AIG); Chris Olston (Google); Jeff Dean and Sanjay Ghemawat (Google).

MEMBER MEETINGS

MAR 26 2014 MEMBER WORKSHOP: BIG DATA ANALYTICS [AGENDA, SLIDES & VIDEO]  
MAR 3 2014 MIT WHITE HOUSE BIG DATA PRIVACY WORKSHOP [AGENDA, SLIDES & VIDEO] -- [DOWNLOAD WORKSHOP REPORT]
NOV 13-14 2013 ANNUAL BIG DATA MEMBER MEETING [AGENDA, SLIDES & VIDEO]    
JUN 19 2013  MEMBER WORKSHOP: BIG DATA PRIVACY [AGENDA, SLIDES & VIDEO] -- [DOWNLOAD WORKSHOP REPORT]
APR 4 2013  MEMBER WORKSHOP: BIG DATA INTEGRATION  [AGENDA, SLIDES & VIDEO] -- [DOWNLOAD WORKSHOP REPORT]
NOV 28-29 2012 ANNUAL BIG DATA MEMBER MEETING  [AGENDA, SLIDES & VIDEO]

PARTNERS

We partner with different organizations including industry, government, non-profits and other universities.

Members provide the Initiative with invaluable funding support.  As members, organizations have unique access to Big Data research at MIT.  Membership is also an opportunity for corporations to share and discuss their real world challenges and concerns.

MEMBERS

Data Partners make data sets available for research at MIT -- for exploring new ideas; for testing out theories and new algorithms, systems and tools; for student projects and data challenges; and ultimately, for demonstrating the impact of Big Data with real world data.

DATA PARTNERS

My Note: See Below

MEMBERS

The MIT Big Data Initiative acknowledges our industry partners for their generous support.

 

Industry Sponsors include:  AIG, Alior Bank, British Telecommunications (BT), EMC, Facebook, Huawei, Intel, Microsoft, Quanta, Samsung, SAP, Shell, Thomson Reuters.

DATA PARTNERS

We partner with different organizations to make data sets available for research here at MIT -- for exploring new ideas; for testing out theories and new algorithms, systems and tools; for student projects and challenges; and ultimately, for demonstrating the impact of Big Data with real world data.



 

Data Resources:  MGH and The Laboratory for Quantitative Medicine MGH/Harvard Medical School have built a number of very large databases on patients. Last June, we were tasked by the MGH Cancer Center to build a database on all of the 173,301 Massachusetts General Hospital cancer patients, 167,814 of whom were diagnosed between 1968 and 2010. The database contains 559,921 pathology reports, 575,204 discharge reports, 10,938,444 encounter notes, 304,211 operative reports, 22,009,527 procedure notes, 9,159,232 radiology reports, ~1,700,000 aggregated medical bills, and ~250,000 images. The database contains all-cause survival information from the Social Security Administration Death Master File (which provides information on all deaths of persons issued social security numbers since 1937), and cause-of-death information from the Massachusetts Death Certificate Database (which contains International Classification of Diseases cause-of-death information on 1,984,790 people who died in the state of Massachusetts between 1970 and 2008). The database is linked to the MGH SNAPSHOT gene sequence dataset, thus providing a great wealth of genetic data on a large number of patients.

As far as we are aware, in terms of the total mass of data, this database is the largest source of clinical information on cancer in the world.

Additional Information on Data Resources and Applications, by James Michaelson, PhD My Note: See PDF and Add Here

Seminar Series on Quantitative Medicine, Spring 2013, featuring speakers from Harvard, MIT, CSAIL and others.

SPECIAL PROJECTS

The MIT Big Data Initiative supports a number of special projects and activities:

 Big Data & Privacy Workshop [JUNE 2013] and Working Group

 MIT Big Data Challenge

 MIT Living Lab

My Note: See Below

BIG DATA PRIVACY

Big Data promises a better world.  A world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources.  These changes are enabled by a proliferation of new technologies and tools that have the ability to measure, monitor, record, combine and query all kinds of data about us and the world around us -- but how will that data get used and for what purpose?  Who owns the data?   How do we assure accountability for misuse? 

Just as Big Data lays out many promises, it lays out many questions and challenges when it comes to privacy.  We must think carefully about the role of technology and how we design and engineer next generation systems to appropriately protect and manage privacy, in particular within the context of how policy and laws are developed to protect personal privacy.  Decisions about how to address privacy in big data systems will impact almost everyone as we push to make more data open and available inside organizations and publicly. Governments around the world are pushing themselves and private companies to make data transparent and accessible.  Some of this will be personal data.  We will need new tools and technologies for analysis, for anonymizing data, for running queries over encrypted data, for auditing and tracking information, and for managing and sharing our own personal data in the future.  Because issues of data privacy will be relevant across so many aspects of our lives, including banking, insurance, medical, public health, government, etc., we believe it is important to collectively address major challenges in managing data privacy in a big data world.


Workshop: Big Data Privacy

Exploring the Future Role of Technology in Protecting Privacy

The goal of this workshop [held in June 2013] is to bring together a select group of thought leaders, from academia, industry and government, to focus on the future of Big Data and some of the unique issues and challenges around data privacy.  Our aim is to think longer term (5 years +) and better understand and help define the role of technology in protecting and managing privacy, particularly when large and diverse data sets are collected and combined.  We will use the workshop to collectively articulate major challenges and begin to lay out a roadmap for future research and technology needs.

This workshop was supported by the MIT Big Data Initiative at CSAIL and by a grant from the Alfred P. Sloan Foundation.

AGENDA

MEETING SUMMARY REPORT

[Nov 2013]  BIG DATA PRIVACY WORKING GROUP planning (members only)


DEFINING “PRIVACY” IN A BIG DATA WORLD
DAVID VLADECK - GEORGETOWN UNIVERSITY LAW CENTER

Professor Vladeck formerly served as Director of the Bureau of Consumer Protection at the US Federal Trade Commission.

How can we make sure that we harness the power of big data effectively without sacrificing personal privacy completely?


BIG DATA, SYSTEMIC RISK, AND PRIVACY-PRESERVING RISK MEASUREMENT
ANDREW LO - MIT SLOAN SCHOOL OF MANAGEMENT

In the financial industry, privacy is a hotly contested issue.  The problem is that in the financial system we don’t use patents to protect our intellectual property; we use trade secrecy, so we equate data privacy with profitability.  This is "big data versus big dollars".


BIG DATA: NEW OIL OF THE INTERNET
ALEX (SANDY) PENTLAND - MIT MEDIA LAB

In a report, “Personal Data: Emergence of a New Asset Class,” prepared for the World Economic Forum, we proposed a "New Deal on Data" framework with a vision of ownership rights, personal data stores, and peer-to-peer contract law.  The report findings helped to shape the EU Human Rights on Data document and the US Consumer Privacy Bill of Rights.  Among the proposals in the report was the notion that there could be a combination of informed consent and contract law that allows for auditing of data about oneself.  The personal data would also carry metadata showing provenance, permissions, context, and ownership.  At MIT, an open source version of such a scheme has been created in the form of Open Personal Data Store (openPDS) [ref:  http://openpds.media.mit.edu/]


DIFFERENTIAL PRIVACY
KOBBI NISSIM - BEN-GURION UNIVERSITY

I think there is great promise in the marriage of Big Data and Differential Privacy. Big data brings with it a promise for research and society, but often the data contains detailed sensitive information about individuals, making privacy a real issue. Heuristic privacy protection techniques were designed for an information regime very different from today's, and many failures were demonstrated in the last decade.  Differential privacy provides provable guarantees for individuals and provides good utility on larger datasets.
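
The talk summary above does not describe a specific construction. As an illustration only, the following is a minimal sketch of the textbook Laplace mechanism for a counting query, one standard way to obtain epsilon-differential privacy. The function names, the toy `ages` list, and the chosen `epsilon` are assumptions made for this example and are not taken from the workshop material.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # If X and Y are independent Exponential(rate = 1/scale) variables,
    # then X - Y follows a Laplace distribution with location 0 and this scale.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records that satisfy `predicate`."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity of a counting query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical toy data: ages of eight individuals (illustration only).
ages = [34, 51, 29, 62, 47, 58, 41, 70]
rng = random.Random(2013)
noisy = private_count(ages, lambda age: age >= 50, epsilon=0.5, rng=rng)
print(f"noisy count of people aged 50+: {noisy:.2f}")
```

The noise scale depends only on `epsilon` and not on the number of records, so the relative error shrinks as the data set grows, which is consistent with the remark about good utility on larger datasets. This sketch is not hardened for production use; real deployments also need privacy-budget accounting and careful handling of floating-point noise generation.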


NO FREE LUNCH
ASHWIN MACHANAVAJJHALA - DUKE UNIVERSITY

The issues relating to data privacy in the real world exhibit some similarities and some differences across all kinds of different domains.  In the medical environment, there is medical data, genomic data or other research data based on private information about patients and subjects.  Functional uses of the private information include finding correlations between a disease and a geographic region, or between a genome and disease.  In an advertising context, social media firms focus on the clicks and browsing habits of their users, assessing trends by region, age, gender, or other distinguishing features of the user population.  Private information includes an individual’s personal profile and “friends”.  Functional uses of the data could entail a prompt to recommend certain things to certain groups of users or to produce ads targeted to users based on their social networks.  Different applications will have different requirements for the level of privacy needed.  The big challenge in any of these applications is: "How do you really trade off privacy for utility?"


DEVELOPING ACCOUNTABLE SYSTEMS
LALANA KAGAL - MIT CSAIL

Much of the research in the area of data privacy focuses on controlling access to the data.  But, as we have seen, it is possible to break these kinds of systems.  You can in fact infer private information from anonymized data sets; examples include the re-identification of medical records, exposure of sexual orientation on Facebook, and breaking the anonymity of the Netflix prize dataset.  What we are proposing is an accountability approach to privacy for cases where security approaches are insufficient.  The accountability approach is a supplement to, and not a replacement for, upfront prevention.


ENCRYPTED QUERY PROCESSING
RALUCA ADA POPA - MIT CSAIL

In 2012 hackers extracted 6.5 million hashed passwords from LinkedIn’s database and were able to reverse most of them. This is a problem we all are familiar with: confidentiality leaks. There are many reasons why data leaks, and for the purpose of this talk I’m going to group them into two threats. First, consider the layout of an application that has data stored in a database.  The first threat is attacks on the database server.  These are attacks in which an adversary gets full access to the database server but does not modify the data; it just reads it.  The second threat is more general and includes any attacks, passive or active, on any part of the servers.  For example, hackers today can infiltrate the application systems and even obtain root access.  How do we protect data confidentiality in the face of these threats?

 

DATA CHALLENGES

Source: http://bigdata.csail.mit.edu/challenge

The MIT Big Data Challenge
OVERVIEW

The MIT Big Data Initiative at CSAIL is organizing competitions designed to spur innovation in how we think about and use data to address major societal issues.  Big Data promises a better world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources.  These changes are enabled by a proliferation of new technologies and tools that have the ability to measure, monitor, record, combine and query all kinds of data about us and the world around us. 

Working with our partners, the MIT Big Data Challenge will define real-world challenges in different areas such as transportation, health, finance and education, and make available data sets for the competition.  Our goal is to provide the MIT community with new and unique opportunities to show how data can make a difference.

 


MIT Big Data Challenge: Transportation in the City of Boston

OPEN: NOV 12 2013
END: JAN 20, 2014
PRIZES:  totaling $10K

 

The first MIT Big Data Challenge, launched November 12, 2013 in partnership with the City of Boston and co-sponsored by Transportation@MIT, focuses on transportation in downtown Boston.  The challenge will make available multiple data sets, including transportation data from more than 2.3 million taxi rides, local events, social media and weather records, with the goal of predicting demand for taxis in downtown Boston and creating visualizations that provide new ways to understand public transportation patterns in the city.

The City of Boston is interested in gaining new insights into how people use all modes of transportation to travel in and around the downtown Boston area.  A critical imperative of Boston's Complete Streets Policy is to move all modes of transportation more efficiently and to use real-time data to facilitate better trip-planning between modes of transportation.  With urban congestion on the rise, city planners are looking for ways to improve transportation, such as providing people with more options to get from one place to another (walking, biking, driving, or using public transit) and reducing and more efficiently routing vehicles in the city.

This Data Challenge provides a unique opportunity to analyze City of Boston taxi data and combine multiple data sets, including social media, transit ridership, events data and weather data, to effectively predict demand and better understand patterns in taxi ridership.  We hope this will result in new insights for the City of Boston and the public that will improve transportation in the city (and the ability to get a cab when you need one)!


Announcing:

A LIVING LAB

Source: http://bigdata.csail.mit.edu/Living_Lab

 
 
Background

We now have the ability to collect and acquire digital information at an unprecedented rate across practically all aspects of our life including healthcare, financial transactions, social interactions, education, energy usage, transportation, environmental monitoring and so on.  "Big Data" is about harnessing all of this digital information by combining and analyzing it in completely new ways to make better predictions and ultimately, better decisions.  Over the next decade Big Data has the potential to profoundly change the way we live, work and play.  Big Data also introduces unique challenges when it comes to managing and protecting personal privacy. Big Data privacy issues are complex, introducing a host of ethical, legal, policy and technical questions.  How do we build on Big Data’s potential for good, while maintaining essential privacy protections?  And, how do we design future technologies, policies, and practices to get that balance right for society? 

Our goal with this project is to demonstrate how organizations can leverage data in the future, including how we collect, manage, and use personal information, from setting appropriate policies to demonstrating systems that can implement them in practice.  In terms of integrating, analyzing, and sharing data, MIT faces similar challenges to many organizations across different sectors, whether in industry or government.  A Big Data testbed at MIT will allow us to demonstrate how data can be used to better understand and improve our community; collectively explore ways to address technical and privacy challenges; and demonstrate new approaches and solutions emerging from the research community.

A Living Lab
 
What is a "Living Lab"?  It is an open innovation ecosystem, operating in a specific context, integrating research and innovation processes within a public-private partnership.*  In short, a testbed for research and experimentation.
 
We propose creating a Big Data Living Lab at MIT to allow the community to access, share, and use data about life on campus.  Why? To explore issues around large-scale data integration, privacy, visualization, and performance, as well as social implications of big data.

 

NEWS

Source: http://bigdata.csail.mit.edu/news

WHITE HOUSE & MIT RELEASE REPORTS ON BIG DATA AND PRIVACY

Article by: Adam Conner-Simons, MIT CSAIL News

"The White House and MIT each released reports on big data and privacy this past week.

The White House report stemmed from a 90-day review undertaken by White House Counselor John Podesta that included a day-long workshop hosted by the MIT Big Data Initiative at CSAIL.

 

JUN YANG'S TALK

 

JANET WIENER'S TALK

Abstract: 

 

MEMBER WORKSHOP ON DATA ANALYTICS

MIT Big Data Initiative at CSAIL
Member Workshop
"Big Data Analytics: Challenges in Big Data for Data Mining, Machine Learning and Statistics"

Extracting useful information from very large data sets is challenging. In this workshop, we will focus on the challenges of applying machine learning, data mining, and statistics to massive-scale data sets.

Date: MARCH 26, 2014
Location:  MIT CSAIL, Stata Center, Bldg 32, Student Street 1st floor, 32-155
Cambridge, MA 02139

 

BIG, FAST, WEIRD DATA

“In the past, we've focused on scale, but over the last few years, the big new problems have been about variety,” says Sam Madden, professor of electrical engineering at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). “Big Data is no longer just about processing a huge number of bytes, but doing things with data that you couldn’t do previously. Increasingly, data is coming at you really fast, and it’s much more complex. It’s not just tabular data you can easily stick into a spreadsheet or a database.”

 

BOSTON GLOBE ARTICLE ON MIT BIG DATA CHALLENGE

"Taxi data show Bostonians don't fear the cold"
An MIT contest reveals a portrait of the city through cab rides

"FOR YEARS, Boston’s Department of Transportation has collected GPS data on every taxi pickup and drop-off throughout the city. It is an astonishing accumulation of raw numbers on how Bostonians get around, ripe with opportunity for analysis.

 

MIT, WHITE HOUSE CO-SPONSOR WORKSHOP ON BIG-DATA PRIVACY

Article by: Larry Hardesty, MIT News Office
March 4, 2014

"On Monday, MIT hosted a daylong workshop on big data and privacy, co-sponsored by the White House as part of a 90-day review of data privacy policy that President Barack Obama announced in a Jan. 17 speech on U.S. intelligence gathering.

 

MAR 26 2014 MEMBER WORKSHOP: BIG DATA ANALYTICS

Extracting useful information from very large data sets is challenging. In this workshop, we will focus on the challenges of applying machine learning, data mining, and statistics to massive-scale data sets. 

View WORKING AGENDA

 

WHITE HOUSE TO CO-HOST WORKSHOP AT MIT ON ‘BIG DATA’ AND PRIVACY

As part of President Barack Obama’s call for a review of privacy issues in the context of increased digital information, MIT and the White House Office of Science and Technology Policy will co-host a daylong workshop on “big data” and privacy March 3 at MIT.

 

BRUCE SCHNEIER TALK

"NSA Surveillance and What To Do About It"

MIT LAUNCHES FIRST ONLINE PROFESSIONAL COURSE ON BIG DATA

Cambridge, Mass. – MIT will offer its first online professional course, Tackling the Challenges of Big Data, to a global audience beginning March 4, 2014. The four-week online course, aimed at technical professionals and executives, will tackle state-of-the-art topics in big data ranging from data collection, storage and processing to analytics and visualization, as well as address a range of real world applications.

 

JAN 31 2014 NEW ENGLAND DATABASE SUMMIT

Announcing a call for papers and participants for the Seventh Annual New England Database Summit (NEDB Summit '14). This will be an all day conference-style event where participants from the research community in the New England area can come together to present ideas and discuss their research. Registration for the event is free for students, with a nominal charge ($25) for non-students, and anyone is welcome to attend. Lunch, drinks and appetizers will be provided.

The event will also feature keynotes from James Mickens (Microsoft
 

DEC 10 2013 BIG DATA CHALLENGE MEETUP

Join us at a meetup for the MIT Big Data Challenge - Transportation in the City of Boston.
MEETUP: TUESDAY DEC 10 5:00-7:00pm
LOCATION:  Stata Center, 32-G882, Hewlett Meeting Room
MAP: http://www.csail.mit.edu/contactus
REGISTER:  https://www.eventbrite.com/e/mit-big-data-challenge-meetup-tickets-9532175995
 

 

MIT BIG DATA INITIATIVE LAUNCHES TRANSPORTATION CHALLENGE, PRIVACY WORKING GROUP AT WHITE HOUSE EVENT

The Big Data Initiative at the MIT Computer Science and Artificial Intelligence Lab (CSAIL) today announced two new activities aimed at improving the use and management of Big Data.  The first is a series of data challenges designed to spur innovation in how people use data to solve problems and make decisions. The second is a new Big Data and Privacy Working Group that will bring together leaders from industry, government and academia to address the role of technology in protecting and managing privacy.

 

NOV 13-14 2013 ANNUAL BIG DATA MEMBER MEETING

The MIT Big Data Annual Member Meeting 2013 at CSAIL -- learn about the latest Big Data research, see cool demos, meet faculty and students, and network.

View the latest agenda here:  http://bigdata.csail.mit.edu/annual2013

 

MICHAEL I. JORDAN TALK


 

BIG DATA DESIGN CONTEST

MIT + Big Data T-Shirt Graphic Design Contest!!
 
hosted by: MIT Big Data Initiative at CSAIL
http://bigdata.csail.mit.edu/

Help us design a cool graphic for our new Big Data T-Shirt!  It's your chance to participate in designing the official t-shirt for BIG DATA @ CSAIL.  Submit your t-shirt design by September 30th for a chance to win the competition and have your design become part of BIG DATA!
 
The winner:  $200 Amazon Gift Certificate

 

BIG DATA AND BLINKDB: A QUERY ENGINE FOR BIG DATA

BlinkDB is a new query processing system that allows one to run interactive SQL queries over tens of terabytes of data with "blink-time" response times.  This was the goal when CSAIL and UC Berkeley AMPLab joined together to collaborate on BlinkDB.
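
The teaser above does not say how such response times are reached. BlinkDB is publicly described as an approximate query engine that answers from samples and reports error bounds; the following is only a sketch of that general idea, not BlinkDB's implementation. It estimates an average from a uniform random sample and attaches an approximate 95% confidence interval. The function name and the synthetic `fares` data are assumptions made for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size: int, rng: random.Random, z: float = 1.96):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width), where half_width is the half-width of an
    approximate confidence interval based on the normal approximation.
    """
    if not 2 <= sample_size <= len(values):
        raise ValueError("sample_size must be between 2 and len(values)")
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    # Standard error of the sample mean (finite-population correction ignored).
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Synthetic stand-in for a large table column (illustration only).
data_rng = random.Random(0)
fares = [data_rng.uniform(5.0, 50.0) for _ in range(200_000)]

estimate, half_width = approximate_mean(fares, sample_size=2_000, rng=random.Random(1))
exact = statistics.fmean(fares)
print(f"approximate mean: {estimate:.2f} +/- {half_width:.2f} (exact: {exact:.2f})")
```

Here the estimate reads 1% of the rows. The error bar shrinks with the square root of the sample size, which is the accuracy-for-latency trade that sampling-based engines expose to the user.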

 

BIG DATA AND THE LAW: APPLYING NLP TO SUPREME COURT DECISIONS

Big Data and the Law:  Researchers at MIT and Harvard have applied Natural Language Processing to Supreme Court decisions to determine authorship in cases where the justices have decided not to sign the opinions.

"A U.S. Supreme Court mystery drew them together: the Harvard 3L, the engineer, the Jenner & Block
associate, the Massachusetts Institute of Technology professor and a team of MIT doctoral students. The
mystery: Who actually wrote the joint dissent in last year's health care blockbuster?

MIT BIG DATA INITIATIVE AT CSAIL IS AWARDED GRANT FROM ALFRED P. SLOAN FOUNDATION

The MIT Big Data Initiative at CSAIL was awarded a grant from the Alfred P. Sloan Foundation in support of the Big Data Privacy Workshop.

 

JUNE 19 2013 WORKSHOP: BIG DATA PRIVACY

Big Data promises a better world.  A world where data will be used to make better decisions, from how we invest money to how we manage our healthcare to how we educate our children and manage our cities and resources.  These changes are enabled by a proliferation of new technologies and tools that have the ability to measure, monitor, record, combine and query all kinds of data about us and the world around us -- but how will that data get used and for what purpose?  Who owns the data?   How do we assure accountability for misuse? 

 

BIG DATA FOR THE REST OF US

Recently the news has been full of stories about the potential for ‘big data’ about customer behavior to revolutionize business, and personal data has been called the ‘new oil of the internet’.  But what about big data for the average person?  For the poor?

 

APRIL 4 2013 MEMBER WORKSHOP: BIG DATA INTEGRATION

Big Data integration -- combining multiple diverse data sets together -- is one of the key problems organizations face when working with Big Data.  The challenges of integrating data are myriad:  different data sets may come from different sources (internal and external); may have been created by different people; be structured in different ways; and contain different data types (e.g., text, images, maps, database tables, etc.)  In addition, data sets may represent different levels of temporal or spatial granularity (e.g., real time per-transaction data vs.

 

FEB 1 2013 NEW ENGLAND DATABASE SUMMIT

Coming up on February 1, 2013 is the Sixth Annual New England Database Summit (NEDB Summit), to be held at the MIT Stata Center.  This will be an all day conference-style event where participants from the research community and industry in the New England area can come together to present ideas and discuss their latest research and experiences on databases.

 

MIT NEWS: SPEEDING ALGORITHMS BY SHRINKING DATA

A new approach to processing 'big data' creates succinct representations of huge data sets, so that existing algorithms can handle them efficiently

 

MIT NEWS: MINING PHYSICIANS' NOTES FOR MEDICAL INSIGHTS

A new approach to algorithmically distinguishing words with multiple possible meanings could help find useful data in electronic medical records

 

NOV 28-29 2012 BIGDATA@CSAIL MEMBER MEETING

This 1st annual bigdata@csail member meeting will include our 8 founding member companies -- AIG, EMC, Huawei, Intel, Microsoft, Samsung, SAP and Thomson Reuters -- and the MIT CSAIL research community. 

 

CSAIL TACKLES BIG DATA

In May 2012, CSAIL announced a major new initiative to tackle the challenges of the burgeoning field known as “big data” -- data collections that are too big, growing too fast, or too complex for existing information technology systems to handle. The announcement was made at an MIT event attended by Massachusetts Governor Deval Patrick, who simultaneously announced a new statewide initiative to establish Massachusetts as a hub of big data research.
