Table of contents
  1. Story
  2. Meeting Description
  3. Slides
    1. Tommy Shen
      1. Slide 1 Data Simulations for Open Data
      2. Slide 2 Congrats, you're hired!
      3. Slide 3 Four Months Later...
      4. Slide 4 Inheriting Chaos
      5. Slide 5 Common Simulation Uses
      6. Slide 6 My Motivation
      7. Slide 7 Simulations for #OpenData
      8. Slide 8 Don't
      9. Slide 9 Easiest Method of Simulating Truth
      10. Slide 10 Some Ideas on Simulating Truth
      11. Slide 11 Push Boundaries and Analyze Extreme Case Senarios
      12. Slide 12 Scale, Variation, and 'Black Swans'
      13. Slide 13 Pitfalls of Scaling to the Universe
      14. Slide 14 Ideas on Scale-driven Simulations
      15. Slide 15 What Happens When Your Data Looks Like This?
      16. Slide 16 Creating Entire Databases
      17. Slide 17 Unexpected Challenges of Creating an Entire DB
      18. Slide 18 My Vision
      19. Slide 19 My Info
      20. Slide 20
    2. Daniell Toth
      1. Slide 1 Disclosure Limitation
      2. Slide 2 Talk Outline
      3. Slide 3 Why Disclosure Limitation?
      4. Slide 4 Utility/Protection Tradeoff
      5. Slide 5 Different Method = Different Utility
      6. Slide 6 What Are We Protecting? 1
      7. Slide 7 What Are We Protecting? 2
      8. Slide 8 Methods for Protecting Data
      9. Slide 49 Recap
      10. Slide 50 Thank You
  4. Story
    1. Slide 1 Cover Page
    2. Slide 2 Select Name Filter for Bayesian Data Analysis Meetup
    3. Slide 3 Select Status Filter for Upcoming Meetup
    4. Slide 4 Knowledge Base
    5. Slide 5 Posts
    6. Slide 6 Sunlight Foundation
    7. Slide 7 Committee to Protect Journalists
    8. Slide 8 Trace the Guns
    9. Slide 9 Trace the Guns Report Tables
    10. Spotfire Dashboard
  5. DataViz Contest Announcement
    1. DSDC RSVP Visualization 1
    2. DSDC RSVP Visualization 2
    3. DSDC RSVP Visualization 3
  6. Past Meetups
    1. July 30, 2013
    2. June 18, 2013
      1. Untangling the webs of immigration lobbying
        1. Figure 1. Immigration Lobbying in Congress; click for interactive graphic
        2. Cluster A: Agricultural / H-2A Visas
        3. Cluster B: Seasonal / H-2B Visas
        4. Cluster C: High-Skill Visas, Employee Verification, Other Corporate Concerns
        5. Cluster D: Family Issues
        6. Cluster E: "Dream Act", General Path To Citizenship, Enforcement
        7. Cluster F: Performing Artist Visas
        8. Who Lobbies The Most?
          1. Table 1. 20 most active sectors and their issues
        9. Most Active Issues
          1. Table 2. 20 most active issues and their top sectors
        10. Conclusion
        11. Methodology
      2. Looking back to predict what FWD.us means for tech and immigration
        1. Table 3. Most lobbied on bills in the 109th – 112th Congresses
    3. May 28, 2013
    4. April 23, 2013
    5. March 28, 2013
    6. March 7, 2013
    7. January 28, 2013
    8. December 18, 2012
    9. November 19, 2012
    10. November 11, 2012
    11. October 17, 2012
    12. October 1, 2012
    13. September 17, 2012
    14. August 27, 2012
    15. July 23, 2012
    16. June 28, 2012
    17. May 22, 2012
    18. April 30, 2012
    19. March 27, 2012
    20. March 20, 2012
    21. February 20, 2012
    22. January 25, 2012
    23. December 21, 2011
    24. November 29, 2011
    25. October 25, 2011
    26. September 26, 2011
    27. August 25, 2011
  7. Past Blogs
    1. The Sunlight Foundation is looking for an Enthusiastic Software Developer
    2. Selling Data Science: Common Language
    3. Fantastic presentations from R using slidify and rCharts
      1. Slidify
      2. rCharts
      3. Bikeshare maps, or how to create stellar interactive visualizations using R and Javascript
      4. Interactive learning environments
      5. Open presentation
    4. Data Science MD August Recap: Sports Analytics Meetup
    5. Data Visualization: Exploring Biodiversity
    6. Create Data-Driven Tools that May Improve Diabetes Care for a Chance to Win $100K
    7. Data Visualization: Our Gun Debate
    8. Recommended Reading List for “Teaching Machines to Read” Event
    9. Data Visualization: Graphics with GGPlot2
      1. London Bike Routes
      2. Raman Spectroscopic Grading of Gliomas
      3. TwitteR Package
      4. Sentencing Data for Local Courts
    10. Big Data Meta Review – The Evolution of Big Data as Told by Google
      1. The Google File System – October 2003
      2. MapReduce: Simplified Data Processing on Large Clusters - December 2004
      3. Bigtable: A Distributed Storage System for Structured Data - November 2006
      4. Large-scale Incremental Processing Using Distributed Transactions and Notifications - 2010
      5. Pregel: a system for large-scale graph processing – June 2010
      6. Dremel: Interactive Analysis of Web-Scale Datasets -  2010
      7. Spanner: Google’s Globally-Distributed Database - October 2012
    11. Weekly Round-Up: Long Data, Data Storytelling, Operational Intelligence, and Visualizing Inaugural Speeches
    12. Data Visualization: Drawing us in
      1. Zeus’s Affairs
      2. Immigration Insanity
      3. #WorkingInAmerica
    13. The Rise of Data Products
    14. D3.js Meta Tutorial
    15. Data Competition Announcement – SBP 2013 Big-Data Challenge
      1. Reality Commons
    16. Data Visualization: How we get it done!
      1. Data Animators Bring Data to Life
      2. Transitioning Charts on Identical Data
      3. Visualizing Your System Files
    17. The DC Data Source Weekly: Week 1 – Data for Testing Recommendation Algorithms
      1. Jester
      2. Book-Crossing Dataset
      3. Yandex
      4. hetrec2011-movielens-2k
      5. hetrec2011-delicious-2k
      6. hetrec2011-lastfm-2k
    18. R Users DC Meetup: “Analyze US Government Survey Data with R”
    19. Data Visualization: Biology & Medicine
      1. Tablets in Science
      2. Exploring Gene Ontology
      3. Visualizing Brain Surgery
    20. Weekly Round-Up: Getting Started with Data Science, Interview Questions, GE’s Big Data Plans, Lean Startup Experiment Data, and Brain Visualizations
      1. Scientists construct first map of how the brain organizes everything we see
    21. Data Visualization: 2013 Week #1
      1. A Simple Mashup
      2. Mapping our Thoughts
      3. UI Design Evolution
    22. The (near) Future of Data Analysis – A Review
    23. Weekly Round-Up: Simpler Data Tools, The Future of Data, Open Data, Big Data Dangers, and NBA Missile Tracking Cameras
    24. Data Science DC Event Review: Political Campaign Data Science
    25. Weekly Round-Up: Data-Driven Processes & Products, SMB Big Data, Telling Stories, and Fighting Bullying
      1. The key to data science? Telling stories
    26. DBDC Event Review: Money for Nothing – Productizing Government Data
    27. Weekly Round-Up: Big Data Business Models, Cloudera, BI Trends, and Cool Visualizations
      1. Infographic: Google's Flu Map Might Predict The Next Big Epidemic
    28. Data Science DC Event Review: Implicit Sentiment Mining in Twitter Streams
    29. Meet Hadoop Creator Doug Cutting
    30. 3 Ways You’re Ruining Your GA Data, and How to Stop: Tips from a Data Scientist
    31. Community Indicators Conference
    32. Event Review: Carl Morris Symposium on Large-Scale Data Inference
    33. Data Scientists survey results teaser

Data Science DC


Story

Suppressing, Synthesizing, and Analyzing Confidential Data

Meeting Description: See below

Attendance: 125

Photos: 10

Comments: Many

Audio:  DataConfidentiality.mp3

Excellent Meetup! I suggest we have another Meetup there and have Capital One Labs tell us about the work they do, especially their recent acquisition of Bundle to advance their big data agenda: "Bundle gives you unbiased ratings on businesses based on anonymous credit card data."

This meetup provided the contrasting views of a data scientist and a statistician on a controversial problem: the use of "restricted data".

Open Government Data can be restricted under the Open Data Policy of the US Federal Government, as outlined at Data.gov:

  • Public Information: All datasets accessed through Data.gov are confined to public information and must not contain National Security information as defined by statute and/or Executive Order, or other information/data that is protected by other statute, practice, or legal precedent. The supplying Department/Agency is required to maintain currency with public disclosure requirements.
  • Security: All information accessed through Data.gov is in compliance with the required confidentiality, integrity, and availability controls mandated by Federal Information Processing Standard (FIPS) 199 as promulgated by the National Institute of Standards and Technology (NIST) and the associated NIST publications supporting the Certification and Accreditation (C&A) process. Submitting Agencies are required to follow NIST guidelines and OMB guidance (including C&A requirements).
  • Privacy: All information accessed through Data.gov must be in compliance with current privacy requirements including OMB guidance. In particular, Agencies are responsible for ensuring that the datasets accessed through Data.gov have any required Privacy Impact Assessments or System of Records Notices (SORN) easily available on their websites.
  • Data Quality and Retention: All information accessed through Data.gov is subject to the Information Quality Act (P.L. 106-554). For all data accessed through Data.gov, each agency has confirmed that the data being provided through this site meets the agency's Information Quality Guidelines.
  • Secondary Use: Data accessed through Data.gov do not, and should not, include controls over its end use. However, as the data owner or authoritative source for the data, the submitting Department or Agency must retain version control of datasets accessed. Once the data have been downloaded from the agency's site, the government cannot vouch for their quality and timeliness. Furthermore, the US Government cannot vouch for any analyses conducted with data retrieved from Data.gov.

Federal Government Data is also governed by the Principles and Practices for a Federal Statistical Agency Fifth Edition:

Practice 7: Respect for the Privacy and Autonomy of Data Providers

Practice 8: Protection of the Confidentiality of Data Providers’ Information

Statistical researchers are granted access to restricted Federal Statistical and other data on the condition that their public disclosures will not violate the laws and regulations associated with these data; otherwise, the fundamental trust involved in the collection and reporting of these data is broken and the data collection methodology is compromised.

Tommy Shen, a data scientist, commented afterwards: One of the reasons I agreed to present yesterday is that I fundamentally believe that we, as a data science community, can do better than sums and averages; that instead of settling for the utility curves presented to us by government agencies, we can expand the universe of possible information and knowledge that can be gleaned from the data that your tax dollars and mine help to collect, without making sacrifices to privacy.

Daniell Toth, a mathematical statistician, described the methods he uses in his work for a government agency as follows:

  • Identity
    • Suppression; Data Swapping
  • Value
    • Top-Coding; Perturbation;
    • Synthetic Data Approaches
  • Link
    • Aggregation/Cell Suppression; Data Smearing

His slides include examples of each method and he concluded:

  • Protecting data always involves a tradeoff of utility
  • You must know what you are trying to protect
  • We discussed a number of methods – the best depends on the intended use of the data and what you are protecting
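To make the tradeoff concrete, here is a minimal sketch in Python of two of the value-protection methods listed above, top-coding and perturbation. This is not code from either talk; the column name, cutoff, and noise level are illustrative assumptions only.

```python
# Minimal sketch of two value-protection methods applied to a hypothetical
# sensitive column; not code from either presentation.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"income": rng.lognormal(mean=10.5, sigma=0.8, size=1000)})

# Top-coding: censor values above the 99th percentile so the extreme,
# most identifiable respondents cannot be singled out.
cap = df["income"].quantile(0.99)
df["income_topcoded"] = df["income"].clip(upper=cap)

# Perturbation: multiply by random noise so exact values cannot be matched
# back to a respondent, while aggregate statistics stay roughly intact.
df["income_perturbed"] = df["income"] * rng.normal(loc=1.0, scale=0.05, size=len(df))

print(df.describe().round(0))
```

Each transformation buys protection at the cost of some analytic utility, which is exactly the tradeoff the recap emphasizes.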

My comment was that the first speaker needs to employ the services of a professional statistician who knows how to anonymize and/or aggregate data while preserving its statistical properties, and that the second speaker needs to explain that decision makers in the government have access to the raw data and detailed results, while the public needs to work with available open government data and lobby their Congressional representatives to support legislation like the DATA Act of 2013.

SAS provides simulated statistical data sets for training, and the Data Transparency Coalition has a meeting on September 10th, Data Transparency 2013, to discuss moving forward.

Finally, I suggest that members of Data Community DC volunteer to preview the presentations and provide a brief discussion afterwards that is then posted as a more detailed blog. I am also going back and writing a blog digest of the Past Meetups.

Brand Niemann, former Senior Enterprise Architect and Data Scientist with the US EPA, completed 30 years of federal service in 2010. Since then he has worked as a data scientist for a number of organizations, produced data science products for a large number of data sets, and published data stories for Federal Computer Week, Semantic Community, AOL Government, Breaking Gov, and Information Week.

Meeting Description

Source: http://www.meetup.com/Data-Science-D...nts/133650972/

Wednesday, August 28, 2013

6:00 PM to 9:00 PM

Capital One Labs

3030 Clarendon Blvd, 8th Floor, Arlington, VA (map)

For our August Meetup, we're thrilled to have two speakers who will walk us through a range of options for sharing and working with data that cannot be fully shared. Tom Shen, Senior Data Analyst at the District of Columbia Office of the State Superintendent of Education, will talk about simulation-based approaches to keeping the public and researchers informed about K-12 education, without violating student confidentiality. Then Daniell Toth, Senior Research Mathematical Statistician for the Bureau of Labor Statistics, will dive deeper into mathematical techniques that preserve the key properties of data sets, while suppressing potentially identifying information.

Whether you're a government or industry organization looking to provide value to the public or to your customers, or a data scientist who consumes data from other organizations, or merely curious about how these statistical techniques work, this will be a highly valuable presentation.

Plus, we are extremely excited to be branching out a bit as far as location goes! This Meetup will be at the Clarendon, VA offices of Capital One Labs. In addition to being more convenient for our Virginia members (sorry, Maryland, we'll be back in the District in September), Capital One has a fantastic venue that lets us continue the discussion on their roof deck (weather permitting)! To make this happen, the event will be starting 30 minutes earlier than is typical.

Agenda:

• 6:00pm -- Networking, Food, and Refreshments
• 6:30pm -- Introduction
• 6:45pm -- Presentations
• 7:45pm -- Speaker and Audience Discussion
• 8:00pm -- Stay put for Data Drinks!
Bios:

Tommy Shen has a Masters in Public Policy from Georgetown University, and currently works as a Senior Data Analyst for the District of Columbia Office of the State Superintendent of Education. He's interested in open data, R, and waffles. Follow him @Gimperion.

Daniell Toth has a PhD in Mathematics from Indiana University in Bloomington, and works as a Senior Research Mathematical Statistician at the Bureau of Labor Statistics. He is also an Associate Editor of The American Statistician, and publishes on topics related to survey methods.

Stedman Blake Hood

From WaPo today: "Sensitive details are so pervasive in the documents that The Post is publishing only summary tables and charts online." They should've come to the talk!

http://www.washingtonpost.com/w...

Tommy Shen

My slides are here: http://www.straydots.com/sim_talk_deck/

Harlan Harris

The geospatial animation that Keelan created of RSVPs is here: http://www.youtube.com/watch?v=...

Slides

Tommy Shen

Source: www.straydots.com/sim_talk_deck/

Slide 1 Data Simulations for Open Data

DataSimulationsforOpenDataSlide1.png

Slide 2 Congrats, you're hired!

DataSimulationsforOpenDataSlide2.png

Slide 3 Four Months Later...

DataSimulationsforOpenDataSlide3.png

Slide 4 Inheriting Chaos

DataSimulationsforOpenDataSlide4.png

Slide 5 Common Simulation Uses

DataSimulationsforOpenDataSlide5.png

Slide 6 My Motivation

DataSimulationsforOpenDataSlide6.png

Slide 7 Simulations for #OpenData

DataSimulationsforOpenDataSlide7.png

Slide 8 Don't

DataSimulationsforOpenDataSlide8.png

Slide 9 Easiest Method of Simulating Truth

DataSimulationsforOpenDataSlide9.png

Slide 10 Some Ideas on Simulating Truth

DataSimulationsforOpenDataSlide10.png

Slide 11 Push Boundaries and Analyze Extreme Case Senarios

DataSimulationsforOpenDataSlide11.png

Slide 12 Scale, Variation, and 'Black Swans'

DataSimulationsforOpenDataSlide12.png

Slide 13 Pitfalls of Scaling to the Universe

DataSimulationsforOpenDataSlide13.png

Slide 14 Ideas on Scale-driven Simulations

DataSimulationsforOpenDataSlide14.png

Slide 15 What Happens When Your Data Looks Like This?

My Note: I think there was a problem displaying this slide.

DataSimulationsforOpenDataSlide15.png

Slide 16 Creating Entire Databases

DataSimulationsforOpenDataSlide16.png

Slide 17 Unexpected Challenges of Creating an Entire DB

DataSimulationsforOpenDataSlide17.png

Slide 18 My Vision

DataSimulationsforOpenDataSlide18.png

Slide 19 My Info

DataSimulationsforOpenDataSlide19.png

Slide 20

My Note: Blank Slide

Daniell Toth

Slides (PDF)

My Note: Requested permission to post all 50 slides. I would suggest posting screen captures of only slides 1-8 and 49-50.

Slide 1 Disclosure Limitation

DisclosureLimitiationDaniellTothSlide1.png

Slide 2 Talk Outline

DisclosureLimitiationDaniellTothSlide2.png

Slide 3 Why Disclosure Limitation?

DisclosureLimitiationDaniellTothSlide3.png

Slide 4 Utility/Protection Tradeoff

DisclosureLimitiationDaniellTothSlide4.png

Slide 5 Different Method = Different Utility

DisclosureLimitiationDaniellTothSlide5.png

Slide 6 What Are We Protecting? 1

DisclosureLimitiationDaniellTothSlide6.png

Slide 7 What Are We Protecting? 2

DisclosureLimitiationDaniellTothSlide7.png

Slide 8 Methods for Protecting Data

DisclosureLimitiationDaniellTothSlide8.png

Slide 49 Recap

DisclosureLimitiationDaniellTothSlide49.png

Slide 50 Thank You

DisclosureLimitiationDaniellTothSlide50.png

Story

Data Science DC In Review

Eminent Data Scientist DJ Patil says "data scientists are a lot like journalists in that both collect data to tell stories. It's that story part that's most important -- a successful data scientist is one who can weave a cohesive narrative from the numbers and statistics."

Dominic Sale, the new OMB Chief of Data Analytics & Reporting, said the new Digital Government Strategy is "treating all content as data." So big data = all your content.

So my purpose is to structure the content at Data Community DC and Data Science DC Meetup as data, create effective visualizations, and begin the process of weaving a cohesive narrative out of 28 Meetups and 206 Posts.

The Meetups in the last six months are summarized in the table below along with the work in progress to identify data sets, and produce data products and stories.

 

Meetup Date | Title and Slides (if available) | Blog | Data Product | Data Set
August 28, 2013 | Suppressing, Synthesizing, and Analyzing Confidential Data | Story and Post (in review process) | None | None
July 30, 2013 | Lightning Talks! | Post and Story (in process) | Julia | Neo4j Graph Database of Semantic Medline (in process)
June 18, 2013 | Untangling the Webs of Immigration Lobbying Slides | Blogs 1 and 2 | Sunlight Foundation and In Process | Excel
May 28, 2013 | Making Data Available with Data Platforms | Story | Story | Multiple in Stories
April 23, 2013 | Natural Language Processing and Big Data | Story | Graph Databases and Neo4j Tutorial | Semantic Medline
March 28, 2013 - August 25, 2011 | For 23 Previous see Past Meetups | | |

The Posts were structured as data in a spreadsheet of 9 entities for 206 Posts as follows:

  • Year, Month, Author, Title, and URL were obvious choices of entities and easy to extract
    • Lesson Learned: Do the easy five first to get an overview, help decide on categories (adjusting if necessary), and see the distributions.
  • Category, MindTouch, Data Set, and Comment were more subjective and difficult to extract
    • Lesson Learned: This could take a lot of time and thought, so I just did a first pass to identify at least four data sets I could use now, plus lots of things to explore in subsequent iterations using the comments I attached to each post, such as: "I definitely want to look at these links for data sets and ideas."
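A minimal sketch of that first pass, assuming the posts have already been scraped into (URL, Author, Title, Date) tuples; the rows and field values below are placeholders, not the actual 206-post data set.

```python
# Sketch of the "easy five first" pass: Year, Month, Author, Title, URL.
# The rows below are placeholders; the real spreadsheet covered 206 posts.
import pandas as pd

posts = [
    ("http://example.org/post-1", "Author A", "D3.js Meta Tutorial", "2013-01-15"),
    ("http://example.org/post-2", "Author B", "The Rise of Data Products", "2013-02-05"),
]

df = pd.DataFrame(posts, columns=["URL", "Author", "Title", "Published"])
df["Published"] = pd.to_datetime(df["Published"])
df["Year"] = df["Published"].dt.year
df["Month"] = df["Published"].dt.month_name()

# Quick distributions give the overview; Category, MindTouch, Data Set, and
# Comment can then be filled in by hand on a later, slower pass.
print(df.groupby(["Year", "Month"]).size())
print(df["Author"].value_counts())
```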

I also looked for a possible mapping of the Meetups to the Posts and did not see one. I should spend more time on that later. My previous Post of a Meetup is an example of that.

The four data sets I found were:

I created a 5th and a 6th data set from the content at Data Community DC (Posts) and the Data Science DC Meetup (Knowledge Base): Excel

The results are shown in the screen captures of the Spotfire visualizations and the Spotfire Dashboard below. Also see separate Story for Trace the Guns.

Some of the quotes from the Posts that I found particularly interesting were:

  • P.S. If you are one of those Data Creatives or Engineers with Javascript skills to burn and a bit of free time, we’d love your help putting together a web-based tool related to this project. Please drop us a line.
  • Luke Franci says: “D3 has a steep learning curve, especially if (like me) you are not used to the pixel-precision of graphics programming. To build a visualization with D3, you need to understand JavaScript objects, functions, and the method-chaining paradigm of jQuery; the basics of SVG and CSS; D3's API; and the principles for designing effective infographics.”
  • Sean Murphy says: "A data product is the combination of data and possibly algorithms that creates value–social, financial, and/or environmental in nature–for one or more individuals."
  • Google’s search is the consummate data product. Take one part giant web index (data) and add it to the PageRank algorithm (the algorithms) and you have a ubiquitous data product.

So this Data Product and Story shows:

  • There is a web-based tool that can do this
  • There is something easier than D3 to do this
  • A data product is the data sets and their stories woven together
  • One can make a small web index (data) of very important content that makes it very searchable in one page.

Slide 1 Cover Page

DSDCRegistrationData-Spotfire1.png

Slide 2 Select Name Filter for Bayesian Data Analysis Meetup

DSDCRegistrationData-Spotfire2.png

Slide 3 Select Status Filter for Upcoming Meetup

DSDCRegistrationData-Spotfire3.png

Slide 4 Knowledge Base

DSDCRegistrationData-Spotfire4.png

Slide 5 Posts

DSDCRegistrationData-Spotfire5.png

Slide 6 Sunlight Foundation

DSDCRegistrationData-Spotfire6.png

Slide 7 Committee to Protect Journalists

DSDCRegistrationData-Spotfire7.png

Slide 8 Trace the Guns

DSDCRegistrationData-Spotfire8.png

Slide 9 Trace the Guns Report Tables

DSDCRegistrationData-Spotfire9.png

Spotfire Dashboard

For Internet Explorer users and those wanting a full-screen display, use the Web Player. Get the Spotfire for iPad App

DataViz Contest Announcement

My Note: Use this data set

Congratulations to Bill Palombi, who won our DataViz contest with his set of visualizations that told a compelling story about how he learned about the Meetup from the data set: http://billpalombi.com/DSDCRSVP... Runners-up were Jonathon Taylor (http://jonathantaylor.co.nf/DSD..., Chrome only) and Clarence Dillon (https://docs.google.com/open?id=...). Thanks to contest coordinator David Luria, judge Matt Stiles (http://thedailyviz.com), and contest sponsor NetApp (http://www.netapp.com/us/), who provided a set of Beats headphones! This was fun -- we'll have to do something like it again soon!

Harlan Harris

Reminder and two-day deadline: We're looking for the Best Data Visualization around DSDC's RSVPs. The visualization can be a glossy static image, or an interactive web page, or 3D-printed if you're into that sort of thing. We've found a data visualization expert to judge, and found a sponsor to offer an *excellent* prize. Seriously, you want this prize. David Luria (davidluria@gmail.com), will be coordinating entries which must be submitted to him no later than Friday Dec. 14th. Please contact him if you have questions. Results announced at Tuesday's Meetup. The data is here: http://files.meetup.com/2215331... CSV Excel

DSDC RSVP Visualization 1

Source: http://billpalombi.com/DSDCRSVPVis.html

Getting Acquainted Through Data

Hey there! I'm Bill, nice to meet you. I'm new to DSDC. I haven't even been to my first meetup yet, so I'm not sure what DSDC is all about. I'd like to learn more. But how? Sure, I could look at all the past event descriptions (and I have, cool stuff!), but it'd be nice to get an idea of DSDC event topics at a glance. This vis lets me do just that.

DSDCRSVPVisualizationWordCloud.png

My Note: These links go nowhere.

Each of the words in the tag cloud was in a DSDC event name at least twice. Words are scaled by their frequency. Unsurprisingly, "Data" and "Science" are the most frequent. I can also see, though, that DSDCers interact with data largely through regression and are also interested in data discovery and the presentation of results.
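A minimal sketch of the counting behind a cloud like this: split event names into words, count them, and keep the words that appear at least twice. The event names below are just a few real DSDC titles standing in for the full event list.

```python
# Sketch of the word counts behind the tag cloud; the list of event names
# is a small placeholder sample, not the complete DSDC event history.
from collections import Counter

event_names = [
    "Bayesian Data Analysis, The Basic Basics: Inference on a Binomial Proportion",
    "Political Campaign Data Science",
    "Data Science Classroom: Clustering",
]

words = Counter(
    word.strip(",:").lower()
    for name in event_names
    for word in name.split()
)
cloud = {w: n for w, n in words.items() if n >= 2}
print(cloud)  # word -> frequency, used to scale words in the cloud
```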

DSDC sounds cool, but how do I know that the meetup is going in the right direction? I want to be part of a group that's growing, not fading. Again, the data has answers to my questions.

DSDCRSVPVisualization1.png

Wow, it looks like a really popular event took place in October! What was that? A quick vis of attendance by event should show me.

DSDCRSVPVisualization2.png

Would you look at that, DSDC's most popular event ever - "Bayesian Data Analysis, The Basic Basics: Inference on a Binomial Proportion" - was in October. It's a shame I missed out. Oh well, at least next week's "Political Campaign Data Science" event is almost as big. I'm looking forward to it!

Oh wait, I'm on the waitlist. I guess I should have RSVP'd earlier. What time do DSDCers typically RSVP anyway? You can probably guess where I'm going with this.

DSDCRSVPVisualization3.png

This graph may not come as a surprise to you, but it did to me. I don't think that many DSDCers check meetup.com while they sip their morning coffee. But I bet that they do check their email. I have email notifications disabled for meetup.com, so I don't know about new events until I check the webpage in the evening. It looks like I'm in the minority. I should think about enabling email notifications if I want to join the morning RSVP rush.
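A minimal sketch of how the RSVP-time chart could be reproduced from the contest CSV; the file name and the "rsvp_time" column are assumptions, since the actual headers of the export aren't reproduced here.

```python
# Sketch: RSVPs by hour of day from the contest CSV.
# File name and column name are assumptions about the export's layout.
import pandas as pd
import matplotlib.pyplot as plt

rsvps = pd.read_csv("DSDC_rsvps.csv", parse_dates=["rsvp_time"])

by_hour = rsvps["rsvp_time"].dt.hour.value_counts().sort_index()
by_hour.plot(kind="bar")
plt.xlabel("Hour of day")
plt.ylabel("RSVPs")
plt.title("When DSDC members RSVP")
plt.show()
```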

I hope this vis has been as interesting to you as it was to me. I'm looking forward to attending DSDC events in the future and meeting other data heads.

May your residuals be normally distributed,

Bill

DSDC RSVP Visualization 2

Source: http://jonathantaylor.co.nf/DSDC.html

My Note: Clickable data points and bars.

jonathantaylor.co.nfDSDC.png

 

DSDC RSVP Visualization 3

Source: https://docs.google.com/open?id=0B2L...0FVSWdrSzRJZkU

My Note: I had problems signing into my Google Account, then solved the problem by signing in another way!

clipboard_1378215036131.png

 

DSDCRSVPVisualization3Slide2.png

 

DSDCRSVPVisualization3Slide3.png

Past Meetups

My Note: Mine the content at Data Community DC for Tools and Public Data Sets (In Process).

My Note: These past Meetups are copied from above.

 

Meetup Date | Title and Slides (if available) | Blog | Data Product | Data Set
August 28, 2013 | Suppressing, Synthesizing, and Analyzing Confidential Data | Story and Post (in review process) | None | None
July 30, 2013 | Lightning Talks! | Post and Story (in process) | Julia | Neo4j Graph Database of Semantic Medline (in process)
June 18, 2013 | Untangling the Webs of Immigration Lobbying Slides | Blogs 1 and 2 | Sunlight Foundation and In Process | Excel
May 28, 2013 | Making Data Available with Data Platforms | Story | Story | Multiple in Stories
April 23, 2013 | Natural Language Processing and Big Data | Story | Graph Databases and Neo4j Tutorial | Semantic Medline
March 28, 2013 - August 25, 2011 | For 23 Previous see Past Meetups | | |
 

 

Meetup Date | Title and Slides | Blog | Data Product | Data Set
March 28, 2013 | Uncovering Hidden Social Information | | |
March 7, 2013 | Estimating Effect Sizes in Machine Learning Predictive Models | | |
January 28, 2013 | Recommendation Systems in the Real World | | |
December 18, 2012 | Political Campaign Data Science | | |
November 19, 2012 | Implicit Sentiment Mining in Twitter Streams | | |
November 11, 2012 | Workshop: Number Crunching in Python with NumPy | | |
October 17, 2012 | Building Enterprise Apps for Big Data with Cascading | | |
October 1, 2012 | Bayesian Data Analysis, The Basic Basics: Inference on a Binomial Proportion | | |
September 17, 2012 | PAW-Gov Data Drinks | | |
August 27, 2012 | The Split Personalities of Data Scientists: Survey Results and Insights | | |
July 23, 2012 | The Beating Heart of Kim Kardashian | | |
June 28, 2012 | Rare and Hidden Event Detection | | |
May 22, 2012 | Data Science and Scientific Discovery: New Approaches to Nature’s Complexity | | |
April 30, 2012 | Information Visualization for Knowledge Discovery | | |
March 27, 2012 | Data Science Classroom: Clustering | | |
March 20, 2012 | Seminar on Bayesian Inference (full day event!) | | |
February 20, 2012 | Big Data: Fuel for Innovation and Progress or Pollution of the Information Age? | | |
January 25, 2012 | Presentation/Discussion: Maximal Information Coefficient | | |
December 21, 2011 | Data Science Classroom: Naive Bayes and Logistic Regression | | |
November 29, 2011 | Panel Discussion on Selling Data Science Projects and Results | | |
October 25, 2011 | If a Tree Falls in a LogicForest... An Introduction to Logic Regression | | |
September 26, 2011 | Presentation/Discussion: "What the heck is Data Science, anyway?" | | |
August 25, 2011 | Kickoff -- Regularization Regression | | |

 

July 30, 2013

Source: http://www.meetup.com/Data-Science-D...nts/124794492/

6:30 PM
Lightning Talks!
200 Data Scientists |  4.50 | 1 Photo

For our July Meetup, we're thrilled to be presenting the work of not one, not two, but nine members of our Data Science community! These lightning talks will include highlights of work presented at recent or upcoming academic conferences, demos of tools and techniques that people are excited about, and more, all in bite-sized packages of less than 10 minutes.

Tentative Speaker/Affiliation/Topic list:

  1. Harsha Rajasimha, Jeeva Informatics, Next-Gen Sequencing Data Analysis
  2. Jon Schwabish, CBO, Fundamental Graphic Types
  3. Robert Dempsey, Intridea, Growth Hacker
  4. Kevin Coogan, Amalgamood, Hadoop Summit / GraphLab Review
  5. Patrick Wagstrom, IBM Research, GitHub API and Graph Databases
  6. Elena Zheleva, LivingSocial, Incentivized Sharing in Social Networks
  7. Dhruv Sharma, FDIC, Random Forests and Credit Scoring
  8. Natalie Robb, WaveLength Analytics, Analyzing Data Exhaust
  9. Tommy Jones, STPI, Consensus Clustering

Agenda:

  • 6:30pm -- Networking, Empanadas, and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentations
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (Tonic, 22nd & G St.)

Robert Dempsey

Good morning everyone,

Thanks for all of your fantastic questions after the lightning talks. Here are links to everything I mentioned in my talk:

Slides - http://www.slideshare.net/intri...

Surfiki - http://surfiki.com/

Refine (Open Source) - https://github.com/intridea/surf...

Python script for scraping LinkedIn company data (Open Source) - https://github.com/intridea/link...

If you have any questions, give me a call: 888-968-4332 x517

Brand Niemann
Very good, but I suggested the presenters make their data sets and results available for others to use, and as far as I can tell no one has done that yet, so here is an example of what I have been working on that does:

The President’s Fiscal Year 2014 Budget as Big Data
http://semanticommunity.info/Da...

The Digital Government Strategy Should Begin with the Budget Analytics 2014
http://semanticommunity.info/Da...

I think that auditing/verifying/reproducing what others do is an important part of data science practice.

Robert Dempsey

So really what you're saying Brand is that you'd prefer talks where the presenters can share their data. Is that correct?

Brand Niemann

Yes, and document their steps and methods so they can be used to train data science students and be reproduced by, say, forensic data scientists like myself.

June 18, 2013

Source: http://www.meetup.com/Data-Science-D...nts/118644262/

6:30 PM
Untangling the Webs of Immigration Lobbying
150 Data Scientists |  5.00

My Note: The Slides were not that helpful for understanding their work, so I read their original articles below and found a Spreadsheet in Socrata. See Most Active Issues

For our June Meetup, we're thrilled to have three researchers from the Sunlight Foundation talk about a recent data science project. They analyzed 7,814 lobbying reports related to immigration bills in Congress, and used that data to build an outstanding network analysis and interactive visualization. (To see the full version, click here.) Lee Drutman will frame the question that they tried to answer, then Zander Furnas will get into the nitty-gritty of the analysis and visualization, and Amy Cesal will talk about how they gave the results a compelling and informative visual design. You'll learn about tools like Latent Semantic Analysis, Hierarchical Agglomerative Clustering, Gephi, and sigma.js, and maybe a little bit about lobbying too!

Agenda:

  • 6:30pm -- Networking, Empanadas, and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentations and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (Tonic, 22nd & G St.)

Bios:

Amy Cesal is a graphic designer at the Sunlight Foundation. Prior to joining Sunlight, she worked at a small design studio in Adams Morgan as a web and print designer. Amy graduated from American University with a major in graphic design and a minor in marketing. Follow her on Twitter at @amycesal.

Lee Drutman is a Senior Fellow at the Sunlight Foundation and adjunct professor of political science at Johns Hopkins University and the University of California. He holds a Ph.D. in political science from the University of California, Berkeley and a B.A. from Brown University. He has been quoted by NPR, ABC News, The Colbert Report, The New York Times, The Washington Post, Politico, The Hill, Roll Call, among many other news outlets.

Alexander Furnas is a Research Fellow at the Sunlight Foundation. He graduated from Wesleyan University's College of Social Studies (CSS) with a B.A. in Social Studies, and holds an MSc. (oxon) with Distinction in the Social Science of the Internet from the University of Oxford, UK. Prior to joining the Sunlight Foundation, Alexander worked as a research fellow at the Center for Legislative Research and Advocacy in Delhi, India, on national innovation policy.

Lauren Johnson

Hi there, sharing a short blog post about the event: http://laurencjohnson.com/2013/...

Amy

Check out the video of the talk: http://youtu.be/J5cMwit0knk

Jennifer Parks

The event write up and audio are available here: http://datacommunityd...

(Visualizations by Alexander Furnas and Amy Cesal)

As Congress inches toward major immigration legislation, a new Sunlight Foundation analysis (based on almost 8,000 lobbying reports) offers a comprehensive and interactive guide to the web of interests with something at stake.

As legislation continues to take shape, a wide range of sectors will continue flooding Congress with their lobbyists, trying to make sure that their particular concerns are fully addressed. These visualizations can help to better understand who these interests are, what they care about, and how intensely they are likely to lobby to get what they want.

Figure 1. Immigration Lobbying in Congress; click for interactive graphic

Click for interactive version

Figure 1 gives us the big picture. The network connects lobbying interests with the specific immigration bills on which they’ve lobbied. The size of the circles represents the amount of lobbying activity. We’ve given a different color to each of 34 distinct sub-issues identified by a textual analysis of bill summaries (see our methodology section for more details).
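A minimal sketch, assuming nothing about the Sunlight Foundation's actual pipeline, of the bipartite structure this paragraph describes: interests on one side, bills on the other, edges weighted by lobbying reports, and node size driven by total activity. The report records below are made-up placeholders; per the meetup description, the real graph was laid out with Gephi and rendered with sigma.js.

```python
# Sketch of the interest-bill network described above; the records are
# placeholders, not rows from the ~7,800 actual lobbying reports.
import networkx as nx

reports = [
    ("Chambers of commerce", "E-Verify bill"),
    ("Chambers of commerce", "High-skill visa bill"),
    ("Computer software", "High-skill visa bill"),
    ("Schools & colleges", "Dream Act"),
    ("Minority/Ethnic Groups", "Dream Act"),
]

G = nx.Graph()
for interest, bill in reports:
    if G.has_edge(interest, bill):
        G[interest][bill]["weight"] += 1
    else:
        G.add_edge(interest, bill, weight=1)

# Circle size ~ lobbying activity, i.e. each node's weighted degree.
activity = dict(G.degree(weight="weight"))
print(sorted(activity.items(), key=lambda kv: -kv[1]))
```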

The network can be viewed in three ways:

  1. By zooming out and viewing it as a whole (Figure 1), we can see  five somewhat (though not entirely) distinct clusters, representing five major immigration lobbying hotspots in recent years.

  2. By zooming in and looking at each of the hotspots in more detail (Clusters A-F), we see more messiness, representing the wider range of interests and sub-issues in each of the clusters.

  3. Finally, and most importantly, by exploring our interactive graphic, you can investigate the links between every interest, issue and bill. There are thousands of links to uncover.

(To understand how we built this visualization, visit our methodology section at the end of this post.)

That immigration policy attracts a tangled web of lobbyists from a wide swath of the economy and dozens of advocacy interests will not be news to those who have followed the issue for years. After all, the last time Congress tried to pass immigration reform, the Comprehensive Immigration Reform Act of 2007 stretched to 789 pages.

And in the five years (2008-2012) since the reform last died on the Senate floor, we count 6,712 quarterly lobbying reports filed by 678 lobbying organizations in 170 sectors mentioning 987 unique bills, associated with more than $1.5 billion in lobbying spending.

Let’s look at six identifiable clusters in a little more detail.

Cluster A: Agricultural / H-2A Visas

 

 

Starting in the top left, the first cluster brings together lobbying on agricultural work visas. A series of bills have been introduced to streamline the H-2A agricultural worker program, based on compromises worked out over years between agricultural employers, farm worker unions, and key lawmakers. The bills have generally been introduced under the title “The AgJOBS Act.” As then-Senator Ken Salazar D-CO explained in 2007: “There are 567 organizations that have endorsed [the AgJOBS Act], from the Colorado Farm Bureau, to the Farmers Union, to every single agricultural organization in America. The leaders on AgJOBS in the Senate, Senator Feinstein and Senator Craig, have been eloquent in making their statements about the need for the agricultural community, farmers and ranchers, to be able to have a stable workforce. We need to stop the rotting of the vegetables and the fruits in California, in Colorado, and across this country. The only way we are going to be able to do that is if we have a stable workforce for agriculture.”

One other proposal that has been heavily lobbied – “The Dairy and Sheep H-2A Visa Enhancement Act” – would expand the H-2A visa program to also include sheepherders and dairy workers. Naturally, this has been of particular interest to the dairy industry, which relies on immigrant labor. As bill sponsor Rep. John McHugh (R-NY) told the House floor in 2009: “During the past decade, dairy farms throughout the nation have increasingly experienced difficulty in hiring local workers to meet their needs and, as a result, are ever more reliant upon immigrant labor. The tremendous uncertainty regarding that labor supply has a profound impact on their ability to plan for the future and make sound business decisions.”

Cluster B: Seasonal / H-2B Visas

 

 

 

Another cluster highlights massive lobbying in support of extending the H-2B visa worker exemption to allow non-agricultural seasonal businesses to hire immigrant workers. As Sen. Barbara Mikulski (D-MD), sponsor of the proposal, told the Senate floor in 2009, “Every member of the Senate who has heard from their constituents —whether they are seafood processors, landscapers, resorts, timber companies, fisheries, pool companies or carnivals – knows the urgency in their voices, knows the immediacy of the problem and knows that the Congress must act now to save these businesses. I urge my colleagues to join this effort, support the Save our Small and Seasonal Businesses Act, and push this Congress to fix the problem today.”

The lobbying interests most active on these issues have been hotels and motels, restaurants, florists and nursery services, forestry and forest products companies, and real estate agents.

Cluster C: High-Skill Visas, Employee Verification, Other Corporate Concerns

 

The messiest cluster is in the middle, and it largely coalesces around two major lobbying interests: computer software and manufacturers, and the Chamber of Commerce. The concerns of these groups overlap. Part of their lobbying (particularly the tech companies) is aimed at high-skilled workers: retaining math and science Ph.D. graduates, enabling high-skilled immigration generally, and expanding H-1B and L-1 temporary visa programs. (H-1B visas allow skilled workers, mostly in engineering, science, medicine, and finance, to enter the U.S. for a limited period, while L-1 visas allow employees of international companies to work in U.S. offices of the company for a limited period.)

The other major issue in this cluster is around the employee verification program, a web-based system that helps employers to check whether potential employees are in fact eligible to work in the United States. Both the Chamber of Commerce and the computer industry have lobbied heavily on this issue. This cluster also catches some concerns over homeland security, border enforcement, and travel visas, and brings in the building trade unions and the lodging and tourism sector.

Cluster D: Family Issues

A smaller cluster on the right centers around advocacy on family-related issues, with the most lobbying centering on legislation known as the “Uniting American Families Act” which provides family immigration benefits for same-sex couples. As Sen. Patrick Leahy (D-VT), one of the sponsors of the legislation put it on the Senate floor in 2012: “My legislation would grant same-sex binational couples the same immigration benefits provided to heterosexual couples. Passage of this important legislation would help put our country on par with over 25 other developed countries that value and respect human rights.” Lobbying interests in this space include gay and lesbian groups, liberal advocacy groups, and some religious organizations.

This issue and lobbying cluster also covers some issues related to child protection and asylum. The sector lobbying most heavily on these issues is attorneys and law firms.

Cluster E: "Dream Act", General Path To Citizenship, Enforcement

The cluster at the bottom of the network centers primarily around a single piece of major legislation, The Dream Act, and the two sectors most interested in advancing its passage: minority and ethnic groups, and schools & colleges. The Dream Act would create a path to citizenship for young people brought to the U.S. as children who go to college or serve in the military.

This cluster also covers other primary lobbying concerns of minority and ethnic groups, including a set of tough-on-immigration enforcement and criminal justice-related bills, as well as some bills covering various path-to-citizenship issues.

Cluster F: Performing Artist Visas

Finally, there is one small cluster of lobbying that sits on its own. Live theater, museums, bands, orchestras, and other artistic performance industries have lobbied for special visa exemptions for performing artists. The most active bill in this cluster is H.R. 1785 from the 111th Congress, the Arts Require Timely Service (ARTS) Act, which would require expedited processing of visa petitions filed by employers on behalf of individuals with extraordinary artistic ability.

As Senate bill sponsor Orrin Hatch (R-UT) told the Senate floor in 2009: “There is no doubt that nonprofit arts organizations across the country engage foreign guest artists in their orchestras, theatres, and dance and opera companies. In my home state of Utah, I am aware that many organizations that will benefit from passage of the ARTS Act, including Brigham Young University, Cache Valley Center for the Arts, The Orchestra of Southern Utah, University of Utah, Murray Symphony Orchestra, Salt Lake Symphony, and the Utah Shakespeare Festival, to name a few.”

This cluster would normally have been filtered out by the methodology we used (see below), but we left it in because it is a good example of a specialty niche issue within the larger constellation of immigration lobbying.

Who Lobbies The Most?

Another way to slice the data is to look at which sectors are the most active, and on which issues. For full underlying data on all sectors and all issues, click here. My Note: See Excel

Table 1. 20 most active sectors and their issues
Sector | Total bill mentions | Gini | Top Issue | #2 Issue
Minority/Ethnic Groups | 760 | 0.549 | Dream Act, 63 | Reuniting Families, 40
Schools & colleges | 624 | 0.866 | Dream Act, 173 | Science Tech workers, 19
Chambers of commerce | 508 | 0.657 | General employer responsibility, 68 | E-Verify, 55
Computer software | 420 | 0.763 | Science Tech workers, 49 | H-1B and L-1, 30
Milk & dairy producers | 369 | 0.952 | Agricultural Jobs, 132 | H-2A Visas, 95
Computer manufacture & services | 346 | 0.814 | General employer responsibility, 63 | E-Verify, 51
Building trades unions | 320 | 0.816 | Veterans affairs, 26 | Border funding, 18
Farm organizations & cooperatives | 263 | 0.959 | Agricultural Jobs, 94 | H-2A Visas, 59
Attorneys & law firms | 240 | 0.738 | Child Protection / Asylum, 42 | Dream Act, 15
Computers, components & accessories | 239 | 0.869 | Science Tech workers, 45 | High Skilled Immigration, 20
Lodging & tourism | 193 | 0.885 | Travel / extended stay, 51 | Travel Visa, 27
Florists & Nursery Services | 181 | 0.939 | Seasonal businesses, 63 | Agricultural Jobs, 19
Meat processing & products | 176 | 0.888 | E-Verify, 38 | Border enforcement, 26
Human Rights | 174 | 0.806 | Dream Act, 19 | Uniting Same-sex Families, 18
Democratic/Liberal | 170 | 0.76 | Dream Act, 29 | Uniting Same-sex Families, 22
Restaurants & drinking establishments | 153 | 0.901 | Seasonal businesses, 27 | General enforcement, 18
Hotels & motels | 150 | 0.918 | Seasonal businesses, 33 | Healthcare coverage, 17
Data processing & computer services | 146 | 0.791 | Homeland security, 16 | Border funding, 12
Pro-Israel | 114 | 0.818 | Dream Act, 18 | Reuniting Families, 11

(numbers for each issue represent bill mentions in issue area)

Table 1 gives us the sectors that have been most active on immigration, and the issues on which they’ve been the most active. Our measure of activity is lobbying bill mentions – that is, each time a sector mentions an immigration-related bill in a lobbying report, we count it as a bill mention.

The “Gini” coefficient is a measure of how much each sector’s lobbying is concentrated on a narrow set of issues. The closer the Gini value is to 1.0, the more concentrated the lobbying. Not surprisingly, minority and ethnic groups have the most lobbying bill mentions in our data (760) and the lowest Gini coefficient (0.549), representing a very diverse set of issues on which they lobby.
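
For readers who want to reproduce this kind of concentration measure, below is a minimal sketch in Python using the Lorenz-curve form of the Gini coefficient; the per-issue counts are made up for illustration, and the article does not specify which variant of the statistic was used.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of a vector of per-issue bill-mention counts.

    Values near 0 mean lobbying is spread evenly across issues;
    values near 1 mean it is concentrated on just a few issues.
    """
    x = np.sort(np.asarray(counts, dtype=float))  # ascending order
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    lorenz = np.cumsum(x) / total                 # cumulative shares of mentions
    return (n + 1 - 2 * lorenz.sum()) / n

# Hypothetical mention counts across ten issue areas for two sectors
print(gini([63, 40, 30, 25, 20, 18, 15, 12, 10, 8]))  # broad portfolio -> lower Gini
print(gini([132, 95, 10, 5, 3, 2, 1, 1, 1, 1]))       # concentrated -> higher Gini
```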

Schools and Colleges are the second most active sector. They also show a tremendous interest in the Dream Act, which is not surprising since the Dream Act concerns higher education. Chambers of Commerce (primarily the U.S. Chamber of Commerce) have been particularly interested in lobbying on employer responsibility issues. Computer software companies have been lobbying heavily to get more high-skilled workers. And milk and dairy producers have been lobbying for reform to the agricultural visa programs.

Among unions, the building trades union has been the most active. It’s worth noting that among the top lobbying sectors, a few have very high Gini coefficients (e.g. farm organizations & cooperatives at 0.959, and milk & dairy producers at 0.952), meaning that their lobbying is intensely concentrated on just a few bills. By contrast, groups like the Chamber of Commerce and minority and ethnic groups have a much broader issue profile in this space.

Most Active Issues

We can also slice our data based on which issues got the most bill mentions, which reveals a statistical tie between agricultural jobs and the Dream Act, with issues related to permanent residence for math and science Ph.D.s, employer responsibility, and small and seasonal business issues coming in close behind.

Table 2. 20 most active issues and their top sectors
Issue | Total | Gini | Top Sector | #2 Sector
Agricultural Jobs | 510 | 0.953 | Milk & dairy producers, 132 | Farm organizations & cooperatives, 94
Dream Act | 504 | 0.945 | Schools & colleges, 173 | Minority/Ethnic Groups, 63
Science & Tech workers | 372 | 0.902 | Computer software, 49 | Computers, components & accessories, 45
General employer responsibility | 359 | 0.935 | Chambers of commerce, 68 | Computer manufacture & services, 63
Seasonal businesses | 320 | 0.945 | Florists & Nursery Services, 63 | Hotels & motels, 33
Homeland security | 263 | 0.902 | Transportation, 31 | Lodging & tourism, 23
General enforcement | 247 | 0.921 | Minority/Ethnic Groups, 38 | Chambers of commerce, 19
H-2A Visas | 241 | 0.963 | Milk & dairy producers, 95 | Farm organizations & cooperatives, 59
E-Verify | 233 | 0.964 | Chambers of commerce, 55 | Computer manufacture & services, 51
Border enforcement | 231 | 0.882 | Minority/Ethnic Groups, 33 | Meat processing & products, 26
Defense funding | 223 | 0.897 | Computer manufacture & services, 23 | Lodging & tourism, 20
H-1B and L-1 | 217 | 0.925 | Computer software, 30 | Electronics manufacturing & services, 27
Border funding | 217 | 0.955 | Chambers of commerce, 34 | Computer software, 18
Uniting Same-sex Families | 217 | 0.968 | Gay & lesbian rights & issues, 55 | Minority/Ethnic Groups, 35
High-Skilled Immigration | 206 | 0.922 | Computer software, 29 | Computers, components & accessories, 20
Reuniting Families | 169 | 0.954 | Minority/Ethnic Groups, 40 | Human Rights, 16
Criminal justice | 158 | 0.957 | Computer software, 19 | Minority/Ethnic Groups, 13
Healthcare funding | 151 | 0.926 | Computer software, 23 | Online computer services, 8
Travel / extended stay | 144 | 0.959 | Lodging & tourism, 51 | Minority/Ethnic Groups, 18

(numbers for each sector represent bill mentions in that issue area)

Not surprisingly, agricultural jobs issues are lobbied on very heavily by the agricultural industries, especially the milk and dairy producers, the farm organizations and cooperatives, and the vegetable and fruit industries. The Dream Act is primarily lobbied on by universities, minority and ethnic groups, and Democratic-leaning non-profits. Getting more math and science Ph.D.s is of particular interest to the computer industry.

From the high Gini coefficients, we can tell that almost all of these issues are of interest to only a handful of sectors.

Conclusion

As comprehensive immigration reform goes forward, a wide range of lobbying sectors are going to be fighting to make sure that their particular concerns get included in the final reform (we count 170 sectors that have registered an interest since 2007).

Our analysis shows that the immigration debate is both structured and chaotic. When we zoom out to see the entire network in one glimpse (Figure 1) we can see five big clusters of activity on immigration. When we zoom in, we see much more messiness: lots of sectors, lots of bills, lots of interests, and many overlaps and tangles. In other words, a lot of clamoring is about to take place. We hope you will spend some time exploring our interactive graphic to see for yourself who cares about what and how much.

Methodology

We began our analysis with all bills mentioned in lobbying disclosure forms under the category of immigration during the 109th – 112th Congresses. We then excluded the top two percent of most-lobbied bills – generally omnibus bills – leaving a corpus of more specific, single-issue immigration reform bills. The remaining 915 bills were classified into sub-categories based on the text of bill summaries produced by the Congressional Research Service. To create issue categories, we used Latent Semantic Analysis (LSA) -- a form of natural language processing commonly used for text comparison -- to measure the conceptual similarity of the bills. We then clustered the most similar bills together with hierarchical agglomerative clustering. This process ensured that the most conceptually similar bills – at least according to the content of their CRS summaries – were treated as belonging to similar issue areas, like H-2A visa reform or the Dream Act, under the umbrella of immigration reform. We manually labeled the resulting 34 clusters based on their contents.
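
As a rough sketch of what such a pipeline might look like in Python with scikit-learn (the loader function and the number of LSA components are illustrative assumptions; only the 34-cluster target comes from the description above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import AgglomerativeClustering

# Hypothetical loader returning the ~915 CRS bill summaries as plain-text strings
summaries = load_crs_summaries()

# Latent Semantic Analysis: TF-IDF term weighting followed by truncated SVD
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(summaries)
lsa = TruncatedSVD(n_components=100, random_state=0)
X_lsa = Normalizer(copy=False).fit_transform(lsa.fit_transform(X))

# Hierarchical agglomerative clustering of the LSA vectors into 34 issue areas
labels = AgglomerativeClustering(n_clusters=34).fit_predict(X_lsa)
```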

We then applied the labels (denoted by color) from this clustering to the bills in the network representation of lobbying activity shown above. The nodes in this bipartite network represent specific bills and the industries (based on Center for Responsive Politics’ industry classifications) that lobbied on them. The weighted directed edges are based on the number of lobbying reports filed by the given industry that mentioned the given bill. To highlight significant activity, we filtered the network visualization to show only the subgraph with a k-core of 3. The graph layout was done using the Gephi implementation of the OpenOrd algorithm, which employs aggressive edge-cutting to promote clustering.
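
For illustration, the bipartite graph and the k-core filtering step could be reproduced with networkx along the following lines; the two edges shown are taken from Table 3 below, and the rest of the edge list is left as a placeholder:

```python
import networkx as nx

# (industry, bill, number of lobbying reports mentioning the bill)
edges = [
    ("Computer software", "S.887 (111)", 30),
    ("Computer software", "H.R.3012 (112)", 16),
    # ... one tuple per industry-bill pair observed in the disclosure data
]

G = nx.DiGraph()
for industry, bill, weight in edges:
    G.add_node(industry, kind="industry")
    G.add_node(bill, kind="bill")
    G.add_edge(industry, bill, weight=weight)

# Keep only the densely connected core (k-core of 3); edge direction is ignored here
core = nx.k_core(G.to_undirected(), k=3)
```

The filtered subgraph can then be exported (for example to GEXF with nx.write_gexf) and laid out in Gephi with the OpenOrd algorithm, as described above.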

Data sources: Influenceexplorer.com, Sunlight Foundation Congress API, opensecrets.org,

correction (5/17/13): We removed a reference to the number of lobbyists working on immigration. This story had linked to a piece from our reporting group, which made an incorrect assessment about the number of lobbyists. That piece has been updated.

Looking back to predict what FWD.us means for tech and immigration

April 11, 2013, 5:37 p.m.

When Facebook's Mark Zuckerberg formally launched his new immigration advocacy effort today with an op-ed in the Washington Post, it underscored the role being taken in the immigration debate by the tech community, which has been emerging as a powerful lobbying force in Washington.

According to sources quoted in a Politico story, FWD.us, the new group that Zuckerberg founded with other Silicon Valley luminaries, will "place its initial focus on the passage of comprehensive immigration reform." While the group emphasized the need for reform that is "comprehensive" -- Capitol Hill-speak for legislation that includes border security measures and a path to citizenship for millions of people now living in the country illegally -- the Politico piece goes on to note that the tech industry has a very specific goal, namely, "to increase the number of H1-B temporary visas for high-skilled foreign workers as well as lessen the wait times for green cards for their workers." Last year, Facebook filed 384 requests for the roughly 65,000 visas available for high-skilled employees, according to the website myVisajobs.com.

The backers of Zuckerberg's new nonprofit match closely with one of the clusters we identified in our look at the structure of immigration lobbying, late last month. Other co-founders of the group are: Reid Hoffman, the co-founder of LinkedIn, and John Doerr, whose powerhouse venture capital firm, Kleiner, Perkins, helped incubate Google and Amazon.com. Other major supporters pictured on the website or mentioned in Zuckerberg's op-ed include: Yahoo CEO Marissa Mayer, and Eric Schmidt, Google’s top executive, as well as top officials from blue chip tech firms such as Airbnb, Facebook, Instagram, Netflix, Reed Hastings, Groupon, PayPal, Y Combinator and Zynga.

MORE: Give me your poor, your tired your . . . lobbyists?

Several industries that have lobbied extensively on immigration in the past are well represented among that group, in particular computer software, data processing and computer services and online computer services. Below is a closeup of the larger immigration lobbying network we produced, with these industries and some of the bills on which they lobbied most since 2007 highlighted. Looking more closely at this cluster can provide us some useful clues into the provisions that FWD.us is likely to advocate for as it enters the influence game.

Tech Sector Lobbying on Immigration

The key bills in this network tell us a lot about the kinds of provisions that the tech industry cares about when it comes to immigration reform. The table below shows the most lobbied on bills in the 109th – 112th Congresses by the sectors most relevant to the parties behind FWD.us (computer software, online computer services, and data processing and computer services).

Table 3. Most lobbied on bills in the 109th – 112th Congresses
Bill Number | Immigration sub-issue | Bill Title | Lobbying Reports
S.887 (111) | H-1B and L-1 | S.887 H-1B and L-1 Visa Reform Act of 2009 | 30
H.R.5921 (110) | High Skilled Immigration | H.R.5921 High Skilled Per Country Level Elimination Act | 30
H.R.6039 (110) | Science/Tech workers | H.R.6039 To amend the Immigration and Nationality Act … | 28
H.R.1 (111) | Omnibus | H.R.1 American Recovery and Reinvestment Act of 2009 | 24
H.R.1791 (111) | Science/Tech workers | H.R.1791 STAPLE Act | 20
H.R.3012 (112) | High Skilled Immigration | H.R.3012 Fairness for High-Skilled Immigrants Act | 16

It is clear that, of the Sunlight-identified immigration sub-issues, the tech industry has been most active in three areas: H-1B and L-1 visa reform, high-skilled immigration reform, and science/tech worker-specific provisions. Let's look at these three areas in turn.

  • H-1B and L-1 visa reform: Both H-1B and L-1 visas apply to temporary workers. H-1B visas are granted to foreign workers in specialty occupations, while L-1 visas are granted for intra-company transfers.

  • High-skilled visa reform: Currently, the amount of employment-based immigration is limited by per-country caps. In the past, tech companies have lobbied to have these caps lifted.

  • Science/Tech worker provisions: Various provisions, most famously the STAPLE Act, have been introduced to grant permanent residence to graduates receiving master's-level or higher degrees in the STEM disciplines. The tech sector has, in the past, put its weight behind these reforms.

We don’t know for sure exactly what FWD.us will push for in this current round of immigration reform, but we can make some good guesses. If the associations of its major backers, and the history of the technology sector are any indication, look for FWD.us to work hard to ensure that some very specific asks are included in any comprehensive immigration bill.

(Contributing: Anu Narayanswamy)

May 28, 2013

Source: http://www.meetup.com/Data-Science-D...nts/118272102/

6:30 PM
Making Data Available with Data Platforms
150 Data Scientists |  4.50 | 1 Photo

For our May Meetup, we are exceedingly happy to have three experts talk about a key component in any data ecosystem -- the platforms that let you manage, catalog, share, and access data via CSV or API. The tools behind portals like data.gov can be used by organizations large or small, and can be an extremely efficient way to make data available -- to other people or other systems, to the public or just within your organization.

Whether you're a producer of data that you'd like others to be able to use, or a consumer of data that others have made available, or one of the people responsible for making data more valuable, you should come and hear from our speakers. Matt Bailey will be talking about his experiences using the open-source CKAN tool to support the DC Brigade of Code for America. Marcus Louie will talk about Socrata, a commercial platform (with an open-source community edition) used by governments large and small. And Philip Ashlock will talk about DataBeam, a lightweight, open-source, and very quick to set up data platform tool he created for the GSA.

Agenda:

  • 6:30pm -- Networking, Empanadas, and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentations and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (Tonic, 22nd & G St.)

Bios:

Matt Bailey is a technology strategist at the Consumer Financial Protection Bureau, and was formerly CTO of Citizen Effect, a civic engagement and fundraising nonprofit. He is also co-captain of Code for DC, a local brigade of Code for America. Matt has an MA in English from Georgetown University. He is @MattBailey0 on Twitter.

Marcus Louie is a Data Solutions Architect at Socrata, where he helps to design and sell an enterprise/government-class data platform. Previously, he worked as a product manager at Collective[i], a New York-based business analytics firm. He has a PhD in Engineering and Public Policy from Carnegie Mellon.

Philip Ashlock served as a Presidential Innovation Fellow, working with the White House and GSA on my.usa.gov, and working with Code for America on their Commons project, helping local governments share effort and value around open data and civic technology projects. Currently, he is consulting through his company, Civic Agency. He has a BS from Western Washington University, where he studied New Media and Computer Science. He is @PhilipAshlock on Twitter.

Andy

http://r4stats.com/2013/05/14/b...

April 23, 2013

Source: http://www.meetup.com/Data-Science-D...nts/109386702/

6:30 PM
Natural Language Processing and Big Data
210 Data Scientists |  4.50

For our April Meetup, we are excited to bring you an event themed around Big Data Week! We have two presenters talking about their work with very different, very large text data sets. First, Ben Bengfort from UMBC and Full Stack Data Science will be talking about how to use Python's NLTK and Hadoop Streaming to make sense of large text corpora. Then, Tom Rindflesch from the National Library of Medicine will talk about his group's work building a system to help medical researchers keep up with the flood of current and historical articles published on PubMed.

Notes:

  • Please check out the other events around Big Data Week DC, and follow the #bdw13 hash tag on Twitter!
  • We're very happy to have the new DC NLP Meetup cross-listing this event! Welcome to folks coming from DC NLP! Members of DSDC interested in Natural Language Processing should definitely consider joining DC NLP.
  • We're back at GWU for this event.

Agenda:

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentations and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (Tonic, 22nd & G St., space reserved!)

Presentations:

Natural Language Processing of Big Data using NLTK and Hadoop Streaming

Many of the largest and most difficult to process data sets that we encounter during the course of big data processing tend not to be well structured log data or database row values, but rather unstructured bodies of text. In recent years, Natural Language Processing techniques have accelerated our ability to stochastically mine data from unstructured text and in fact require large training data sets themselves to produce meaningful results. Simultaneously, the growth of distributed computational architectures and file systems has allowed data scientists to deal with large volumes of data; clearly there is common ground that can allow us to achieve spectacular results. The two most popular open source tools for NLP and distributed computing, the Natural Language Toolkit and Apache Hadoop, are written in different languages -- Python and Java. We will discuss the methodology to integrate them using Hadoop's Streaming interface, which sends and receives data into and from mapper and reducer scripts via the standard file descriptors.
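
To make the streaming pattern concrete, here is a minimal word-count-style sketch (not the speaker's code): NLTK does the tokenization inside the mapper, and Hadoop Streaming moves tab-separated key/value pairs between the scripts over stdin/stdout.

```python
#!/usr/bin/env python
# mapper.py -- tokenize each input line with NLTK and emit (token, 1) pairs
import sys
from nltk.tokenize import word_tokenize  # assumes the NLTK punkt data is available on worker nodes

for line in sys.stdin:
    for token in word_tokenize(line.lower()):
        if token.isalpha():
            sys.stdout.write("%s\t1\n" % token)
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each token; Hadoop delivers mapper output sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))
```

A job like this is typically launched with the hadoop-streaming jar, passing the two scripts via its -mapper and -reducer options (and shipping them to the cluster with -file).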

Semantic MEDLINE:  An Advanced Information Management Application for Biomedicine

Semantic MEDLINE integrates information retrieval, advanced natural language processing, automatic summarization, and visualization into a single Web portal. The application is intended to help manage the results of PubMed searches by condensing core semantic content in the citations retrieved. Output is presented as a connected graph of semantic relations, with links to the original MEDLINE citations. The ability to connect salient information across documents helps users keep up with the research literature and discover connections which might otherwise go unnoticed. Semantic MEDLINE can  make an impact on biomedicine by supporting scientific discovery and the timely translation of insights from basic research into advances in clinical practice and patient care.

Bios:

Benjamin Bengfort is a Data Science consultant at Full Stack Data Science, and has used Machine Learning and Natural Language Processing techniques to determine textual complexity in large literary corpora. He is a PhD candidate in Computer Science, with a focus on NLP, at the University of Maryland, Baltimore County, and has a MS in Computer Science from North Dakota State University.

Please follow Ben on Twitter at @bbengfort!

Thomas Rindflesch has a Ph.D. in linguistics from the University of Minnesota and conducts research in natural language processing in the Lister Hill Center for Biomedical Communications at the National Library of Medicine. He leads a research group that focuses on developing semantic interpretation of biomedical text and exploiting results in innovative informatics methodology for clinical practice and basic research. Recent efforts concentrate on supporting literature-based discovery.

Tony Ojeda

For those interested in learning more about NLP with Python's NLTK and how to analyze text yourself, Ben is teaching an introductory workshop for Data Community DC on the subject. Check it out here - http://www.meetup.com/Data-Comm...

Janet Dobbins

Ben will also be teaching an online course called, "Introduction to Analytics using Hadoop and R" at Statistics.com this fall. You can check out our text analytics courses, including NLP, taught by Dr. Nitin Indurkhya, here - http://www.statistics...

Lee Angelelli

I enjoyed the presentations yesterday - very informative. For those who want to learn more about how Big Data, NLP, Watson, advanced analytics, and cognitive computing are being applied today in the healthcare industry -> as part of Big Data week -> IBM would like to invite Data Science DC members to IBM's "Analytics in Support of Health Care Transformation : Making Better Health Care Decisions with IBM Watson and Advanced Analytics" seminar tomorrow 4/25 starting 8:30am - 12:20pm located at IBM 600 14th Street, NW Washington D.C. -> IEG center 2nd floor.

To register for the event and see the list of presenters/topics - please visit the IBM Analytics Solution Centers (ASC) website -> https://www-950.ibm.com/events/w...

Harlan Harris

Glad to see so many people last night! Two things. Is there anyone interested in writing an event review (what happened, what did you get out of it) for the DC2 blog? Please get in touch. Also, Ben and Tom suggested resources related to this event here: http://datacommunitydc.org/blog...

March 28, 2013

Source: http://www.meetup.com/Data-Science-D...nts/107231302/

6:30 PM
Uncovering Hidden Social Information
130 Data Scientists |  5.00 | 8 Photos

For our March Meetup, we are very pleased to have Professor Jennifer Golbeck from the University of Maryland talking about her research in the field of social media and social network analysis. Public information on social network sites can be very informative about individuals' interests and about group communication and influence. Building computational and statistical models of these processes is a rapidly advancing field, and any data scientist will want to learn how to think about and work with this sort of data.

NOTE: We are holding this event at a new venue -- the Microsoft offices in Chevy Chase/Friendship Heights. We realize that this will make it much easier for some of you to attend than our usual downtown location, and much harder for others. In general, expect DSDC events to be downtown most of the time, and Metro-accessible all of the time.

Agenda:

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentation and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (Clyde's, 5441 Wisconsin Ave.)

Presentation:

Uncovering Hidden Social Information: Inferring User Traits and Relationships Online 

People share a lot of information about themselves in social media such as Facebook and Twitter. However, through all that sharing, additional hidden information is embedded. In this talk, I will discuss several studies we have run that infer information about users and their relationships online, including predictions of users' personality traits, their political preferences, and the trust they have in one another. I will discuss the results of these studies, the methodologies used, and how our techniques can be applied to similar problems.

Bio:

Jennifer Golbeck is Director of the Human-Computer Interaction Lab and an Associate Professor in the College of Information Studies at the University of Maryland, College Park. 

Her research focuses on analyzing and computing with social media. This includes building models of social relationships, particularly trust, as well as user preferences and attributes, and using the results to design and build systems that improve the way people interact with information online. 

She is a Research Fellow of the Web Science Research Initiative and, in 2006, she was selected as one of IEEE Intelligent Systems' Top Ten to Watch, a list of their top young AI researchers. She has a PhD in Computer Science from the University of Maryland, College Park.

Her new book, Analyzing the Social Web, was published on March 26th. You should definitely buy a copy.

Andy

it's finally here

http://www.barnesandnoble.com/w...

March 7, 2013

Source: http://www.meetup.com/Data-Science-D...nts/105714512/

6:30 PM
Estimating Effect Sizes in Machine Learning Predictive Models
125 Data Scientists |  4.50 | 2 Photos

For our, ah, extremely-late February Meetup, we are very pleased to welcome back Dr. Abhijit Dasgupta, who will be speaking about some of his cutting edge work that straddles the "two cultures" of statistics and machine learning. When using classical regression models, it is relatively easy to estimate effect size -- the conditional effect on the outcome as you change one of the predictors. But when your predictive model is a black box, such as a random forest or neural network, this valuable information is typically unattainable. Abhijit and his colleagues have found practical new methods for estimating effect size when using predictive models. Anyone who has had the experience of being at a loss for words when trying to interpret or communicate a complex predictive model will want to learn about these new approaches.

Agenda:

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentation and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (reserved space at Tonic!)


Abstract:

Predictive modeling has been widely used for prediction, but a constant criticism has been the difficulty of interpreting the conditional effects of different predictors on the outcome. This criticism has been especially loud with respect to "black box" methods like random forests, ensemble learners and neural networks. We go back to the basic definition of effect size estimates in statistics, which is based on the idea of counterfactual outcomes, and find that we can explicitly leverage the idea of counterfactuals within the predictive modeling framework to estimate traditional and non-traditional effect sizes at the individual, subgroup, and global levels in a flexible manner. We will show how main effects, interactions, and more general nonlinear effects can be estimated in this fashion, without explicitly specifying a model structure per se. We will illustrate this with some recent work on binary regression, obtaining odds ratios, risk differences, risk ratios and both additive and multiplicative interaction effects.
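
One way to picture the counterfactual idea (this is a generic sketch with simulated data, not the probability-machine method from the paper below): fit a black-box classifier, then score every subject twice, once with the predictor of interest set to each counterfactual value, and summarize the differences.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: binary exposure t, covariates X, binary outcome y
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))
t = rng.binomial(1, 0.5, size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * t + X[:, 0]))))

model = RandomForestClassifier(n_estimators=500, random_state=0)
model.fit(np.column_stack([t, X]), y)

# Predicted outcome probability for every subject under t=1 and under t=0
p1 = model.predict_proba(np.column_stack([np.ones(n), X]))[:, 1]
p0 = model.predict_proba(np.column_stack([np.zeros(n), X]))[:, 1]

risk_difference = (p1 - p0).mean()   # additive effect size
risk_ratio = p1.mean() / p0.mean()   # multiplicative effect size
print(risk_difference, risk_ratio)
```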

The presentation will be based, in part, on the following paper:

Dasgupta, Szymczak, Moore, Bailey-Wilson, and Malley (2013). Risk Estimation using Probability Machines. Under review.

Bio:

Abhijit Dasgupta is a biostatistician, data scientist, and consultant for the NIH, local startups,  and other  clients. He has a PhD in biostatistics from the University of Washington, and has published work in JASA, Genetic Epidemiology, and a number of other journals.

Andy

Google Refine for Data cleanup in Patents

I thought I'd post this article here, since data cleanup questions always come up during the Data Science meetups. Here is an interesting example of data cleanup in patent analysis using Google Refine.

http://www.patinformatics.com/b...

Abhijit

The slides from the talk are here: https://www.dropbox.com/s/51yv7c Slides

Harlan Harris

And for completeness, the intro slides are here: https://docs.google.co...  Slides

Also, if anyone's interested in blogging about this event for DC2, please email me. Thanks! 

January 28, 2013

Source: http://www.meetup.com/Data-Science-D...ents/94856042/

6:30 PM
Recommendation Systems in the Real World
200 Data Scientists |  4.50 | 33 Photos

For our January Meetup, we're very happy to have two experts in what it takes to build a recommendation system (think Netflix or Amazon) in the messy real world. First, Matt Bryan from WaPo Labs (a digital startup embedded in the Washington Post) will introduce the technology, and show how it powers systems that suggest news articles. Then, Bryce Nyeggen from LivingSocial will share some pitfalls you can make when designing a recommendation system, and how to get it right the first time.

Agenda:

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentations and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks

Bios:

Matt Bryan is a Senior Product Manager for the news aggregation and personalization platform called Trove at WaPo Labs, a small digital innovation group at the Washington Post Company. The Trove Platform powers the Personal Post on washingtonpost.com, the social news site socialreader.com and numerous news-focused iOS and Android apps. Matt has a MS in Mathematics and Statistics from Georgetown University.

Bryce Nyeggen is a Data Scientist at LivingSocial, the local marketplace to buy and share the best things in your city. He focuses on building recommendation systems and creating collaborative filtering algorithms to send customers the most relevant deals, and turning interesting math into fast code. Bryce has a BS in Economics and Political Science from the University of Wisconsin-Madison.

David Anderson

Thanks John, you are right to bring this up. These situations always remind me of Stephen Jay Gould's work, specifically his book, The Mismeasure of Man which I highly recommend to anyone who has not read it yet. Let's get it right with Big Data. http://en.wikipedia.org/wiki/Th...

Andy

David Brooks' thoughts on data as well,

http://www.nytimes.com/2013/02/...

David Anderson

Eventbrite Director of Data discusses recommendations system: http://gigaom.com/2013/01/29/yo...

Harlan Harris

Two things! First, there's a post up with some suggestions from our speakers on Getting Started with Recommendation Engines on the DC2 blog: http://bit.ly/10OCFPo Check it out.

Jan

I usually start with a framework and then go from there. I personally think graphlab + graphchi (http://bickson.blogsp...), MyMediaLite (http://mymedialite.ne...), SVDFeature (http://svdfeature.ape...) or the Matrix Factorization code in Vowpal Wabbit (https://github.com/Joh...) are all good starting points.

Peter Dudka

Is there any way that we can get the slides from Bryce and Matt's presentation? I'd love to be able to refer back to them again

A former member

My slides are here: https://mega.co.nz/#!H... My Note: I could not get this to download.

Andy

http://www.hilarymason.com/blog...

December 18, 2012

Source: http://www.meetup.com/Data-Science-D...ents/87687152/

6:30 PM
Political Campaign Data Science
180 Data Scientists |  5.00

For our December Meetup, we're very happy to have a panel of four experts in the role of Data Science in Political Campaigns talk about their work. As we have all heard, the 2012 Presidential election was steeped in data like never before, with the Obama campaign in particular using advanced analytical methods to target voters, and with Nate Silver and other polling aggregators providing fascinating insights into the dynamics of the campaign. Sasha Issenberg, author and journalist, will summarize the ways that modern campaigns aggregate and use data in every aspect of voter contact. Ken Strasma, practitioner, will talk about the value of microtargeting, and explain how statistically customized messaging is valuable in both political and nonpolitical marketing. Alex Lundry, practitioner, will dive a little further into microtargeting, describing how these models are built in practice. And Shrayes Ramesh, academic, will talk about his research into causality and political contributions, and the extent to which money leads or follows policy.

Agenda:

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:15pm -- Presentations and discussion
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks @ Tonic

Bios:

Sasha Issenberg is the "Victory Lab" columnist for Slate and the Washington correspondent for Monocle, where he covers politics, business, diplomacy, and culture. He covered the 2008 election as a national political reporter in the Washington bureau of The Boston Globe, and his work has also appeared in New York, The New York Times Magazine, The Washington Monthly, Inc., The Atlantic, Boston, Philadelphia, and George, where he served as a contributing editor. You should definitely buy his book, The Victory Lab.

Ken Strasma is the Founder and President of Strategic Telemetry, which provides cutting edge microtargeting, mapping, and data analysis consulting services to corporations, campaigns, and non-profit organizations. He served as the Targeting Director for the 2008 Obama and 2004 Kerry campaigns, and managed campaigns and congressional caucuses in Minnesota and Wisconsin. He is the author of numerous articles and studies regarding targeting, redistricting technology and the Census.

Alex Lundry is Vice President and Director of Research at TargetPoint Consulting, where he works as a political pollster, microtargeter, data-miner and data-visualizer. He is one of the country’s leading experts on electoral targeting, voter analytics and political data-mining, and is a creator, connoisseur and critic of political data visualizations and infographics. He has a Masters of Public Policy from Georgetown University, and teaches Statistics at Georgetown's Government Department.

Shrayes Ramesh is a PhD Candidate in the Department of Economics at the University of Maryland, College Park, and a Data Scientist and Senior Technical Instructor at EMC. His research focuses on the economic and political strategies private agents use when interacting with government and policymakers. He has published on the topics of federal spending, campaign contributions, and the game theory of political decision-making.

Andy

http://techcrunch.com/2013/01/0...

Sean Murphy

Data Science DC's sister meetup, R Users DC recently had a talk by the famous Hadley Wickham on The (near) Future of Data Analysis. A review and the slides from the talk just went live on:
http://datacommunitydc.org

Tony Ojeda

Event review, audio of each segment, and slides where available are posted at http://datacommunitydc.org/blog....

Andy

http://gigaom.com/2012/12/22/we...

Phil Kalina

A new series of articles by Sasha Issenberg on big data in politics begins at http://www.technologyreview.com...

Harlan Harris

Congratulations! to Bill Palomi, who won our DataViz contest with his set of visualizations that told a compelling story about how he learned about the Meetup from the data set: http://billpalombi.com/DSDCRSVP... Runners up were Jonathon Taylor (http://jonathantaylor.co.nf/DSD..., Chrome only) and Clarence Dillon (https://docs.google.com/open?id=...). Thanks to contest coordinator David Luria, judge Matt Stiles (http://thedailyviz.com), and contest sponsor NetApp (http://www.netapp.com/us/), who provided a set of Beats headphones! This was fun -- we'll have to do something like it again soon!

Harlan Harris

Reminder and two-day deadline: We're looking for the Best Data Visualization around DSDC's RSVPs. The visualization can be a glossy static image, or an interactive web page, or 3D-printed if you're into that sort of thing. We've found a data visualization expert to judge, and found a sponsor to offer an *excellent* prize. Seriously, you want this prize. David Luria (davidluria@gmail.com), will be coordinating entries which must be submitted to him no later than Friday Dec. 14th. Please contact him if you have questions. Results announced at Tuesday's Meetup. The data is here: http://files.meetup.com/2215331...

November 19, 2012

Source: http://www.meetup.com/Data-Science-D...ents/75041762/

6:30 PM
Implicit Sentiment Mining in Twitter Streams
170 Data Scientists |  4.50 | 8 Photos

For our November Meetup, we're very happy to have Maksim (Max) Tsvetovat from local analytics consulting firm Deepmile Networks, talking about extracting sentiment from Twitter data. Although the idea of using billions of tweets to learn about opinions is appealing, getting it to work in a compelling and valuable manner has been fraught with difficulty. Max will bring us up to speed, and discuss a method that works well for certain domains.

Notes: We're back at Google for this event! And we'll be continuing our experiment with informal pre-event themed networking -- please come early to meet and chat with people interested in Natural Language Processing!

Agenda:

  • 6:30pm -- Networking and Refreshments (Discussion theme: NLP)
  • 7:00pm -- Introduction
  • 7:15pm -- Max's presentation and Q&A
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks (location TBA)

Abstract:

In this talk, I will describe a new method for estimating sentiment in online speech. This method does not rely on pre-defined lists of "good" or "bad" words -- but, rather, measures affinity toward a subject, brand, politician, etc. by locating and measuring psycholinguistic similarities between speakers and producing aggregate sentiment statistics. This method is ideally suited to understanding sentiment toward politicians, journalists, advertisers -- anyone that produces large amounts of direct speech. While this limits the domains in which this method is applicable, its accuracy increases significantly.
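
As a toy illustration of one ingredient of such an approach (not Max's actual method), linguistic affinity between a subject's own speech and another speaker's speech can be proxied by cosine similarity of word-frequency vectors:

```python
from collections import Counter
import math

def word_freqs(text):
    """Bag-of-words frequency vector for a snippet of speech."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    dot = sum(a[w] * b[w] for w in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical snippets: a politician's own statement vs. a commenter's tweet
subject = word_freqs("we will cut taxes and create jobs for working families")
speaker = word_freqs("cutting taxes creates jobs and helps working families")
print(cosine_similarity(subject, speaker))
```

Aggregating similarity scores like this over many speakers is the kind of step that yields the population-level sentiment statistics the abstract describes.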

Bio:

Max is the Chief Technology Officer at DeepMile. He has a PhD from Carnegie Mellon University and is currently a Research Assistant Professor at George Mason University where he teaches Social Network Analysis. He is widely published in computer science, organizational theory and social network journals, and is a regular presenter at industry conferences. To learn more about Max and his research, you can explore his website -- www.tsvetovat.org. You should also buy his book, Social Network Analysis for Startups.

Janet Dobbins

Max Tsvetovat will be teaching an online course, "Social Network Analysis Using Python" at Statistics.com that starts 9/20 - 10/18/13. More details:
http://www.statistics.com/sna-p...

Andy

Similar topic on Twitter analysis, UC Berkeley course 
http://blogs.ischool.berkeley.e...

Harlan Harris

The audio for this event is now available on the files page: http://www.meetup.com/Data-Scie... Also, if anyone is interested in writing up an event summary/review for the Data Community DC blog, please contact me! My Note: http://files.meetup.com/2215331/MaxT...rSentiment.mp3

Maksim Tsvetovat

I added the slides to the repo. For some reason they didn't go through the first time. Also, see Slideshare -- http://www.slideshare.net/jazzf...

Maksim Tsvetovat

Guys -- slides and code from the presentation are posted on GitHub:

https://github.com/maksim2042/DC...

November 11, 2012

Source: http://www.meetup.com/Data-Science-D...ents/82037832/

10:00 AM
Workshop: Number Crunching in Python with NumPy
1 Member

DSDC is pleased to work with DC Python to put together this free workshop! To participate, go to the DC Python web page, join the group if you haven't previously, and RSVP for the event there.


Join Matt Davis (bio) as he leads this much-awaited workshop titled, "Number Crunching in Python with NumPy". Basic familiarity with Python is expected as Matt teaches us how to be effective data manipulators using the excellent NumPy library. In addition to NumPy, Matt will also cover related tools/libraries such as IPython, SciPy, matplotlib, and Pandas.

The Computer Science department at GWU has graciously agreed to make an instructional classroom available for this workshop. The room has limited capacity so please RSVP early to reserve your spot. If your plans change, please update your reply for the benefit of everyone on the waiting list.

DC Python and Data Science DC meetup groups have teamed up to make this workshop possible. It will be an all-volunteer run event and there is no sign-up fee but we strongly urge $10 cash donations at the door to cover the cost of pizza and drinks.

Andy

http://techcrunch.com/2012/10/2...

October 17, 2012

Source: http://www.meetup.com/Data-Science-D...ents/83813992/

6:30 PM
Building Enterprise Apps for Big Data with Cascading
95 Data Scientists |  5.00 | 11 Photos

For our October Meetup, we're thrilled to have Paco Nathan talking about his experiences working with and deploying enterprise-scale predictive systems. Cascading, the open-source application framework that Paco specializes in, is a wrapper around Hadoop, so we're thrilled to be partnering with Hadoop DC this month!

Notes: We're back at newBrandAnalytics for this event! And we'll be continuing our experiment with informal pre-event themed networking -- please come early to meet and chat with people interested in, or perhaps immersed in, startup businesses with an analytics or data science focus, and continue the conversation afterwards at Data Drinks.

Agenda:

  • 6:30pm -- Networking and Refreshments (Discussion theme: Startups)
  • 7:00pm -- Introduction
  • 7:15pm -- Paco's presentation and Q&A
  • 8:30pm -- Post presentation conversations
  • 8:45pm -- Adjourn for Data Drinks
    • Happy Hour Prices
    • & Our own floor
    • @ Science Club DC (19th btwn L&M)

Abstract:

Cascading is an open source project which provides an abstraction layer on top of Hadoop and other compute frameworks for Big Data apps. The API provides workflow orchestration for defining complex apps, and is particularly well-suited for Enterprise IT. Large deployments run at Twitter, Etsy, Climate Corp, Trulia, AirBnB, and many other firms, based on the Java API or alternatively using DSLs in Scala (Scalding) and Clojure (Cascalog), as well as other JVM-based languages.
 

This talk will review some of the speaker's experiences leading Data teams for large-scale deployments of predictive analytics, and how those learnings have led into trade-offs and best practices which we use in Cascading. We will discuss use cases and architectural patterns for large MapReduce workflows, when robustness and predictability are high priorities. We will also review a sample recommender application (on GitHub) based on government Open Data.

Bio:

Paco Nathan is a Data Scientist at Concurrent in SF and a committer on the Cascading.org open source project. He has expertise in Hadoop, R, AWS, machine learning, predictive analytics, and NLP -- with 25+ years in the tech industry overall, in a range of Enterprise and Consumer Internet firms. For the past 10 years Paco has led innovative Data teams, deploying Big Data apps based on Cascading, Hadoop, HBase, Hive, Lucene, Redis, and related technologies.

Paco Nathan

Thank you very much for the opportunity to present at Data Science DC and Hadoop DC. Wonderful discussions! I really appreciated getting to meet many people involved in Data here in the DC area. Just posted the slide deck for tonight's talk on SlideShare: http://www.slideshare.net/pacoi... My Note: I requested this and got a .key file that I could not open.

October 1, 2012

Source: http://www.meetup.com/Data-Science-D...ents/75682162/

6:30 PM
Bayesian Data Analysis, The Basic Basics: Inference on a Binomial Proportion
155 Data Scientists |  4.00 | 7 Photos

For our slipped-to-early-October Meetup, we're thrilled to have Rob Mealey, of former host newBrandAnalytics, presenting the next in our occasional Data Science Classroom series! Bayesian data analysis is a critically important modern skill for flexibly modelling and understanding complex (or, as we'll see, simple) data.

New things:

  1. New location! We're thrilled to be hosting this event in Google's gorgeous (and very colorful) meeting space! Thank you, Google! (And thank you newBrandAnalytics for your outstanding hosting in recent months!)
  2. New informal themed discussions! Before the main event, during the refreshments and mingling portion of the evening, we're going to set aside an area for people interested in a particular topic to meet each other and discuss that topic. This month, we've chosen Python data analysis as the theme.

Agenda:

  • 6:30pm -- Networking and Refreshments (Discussion theme: Python)
  • 7:00pm -- Introduction
  • 7:15pm -- Rob's presentation and Q&A
  • around 8:30pm -- Adjourn for Data Drinks

Abstract:

Compared with traditional statistical methods, the basic toolkit of Bayesian statistics produces more intuitive, easier to understand -- and use and update and compare -- outputs through comparatively difficult computational and mathematical procedures. Everything in and out of a Bayesian analysis is probability and can be combined or broken apart according to the rules of probability. But understanding code and sampling algorithms -- really understanding the algorithms and computation -- and a much deeper grasp of probability distribution theory are much more important in understanding Bayesian inference earlier on.

This tutorial is an introduction to the basic basics of Bayesian inference through a self-contained example involving data simulation and inference on a binomial proportion, such as vote share in a two-way election or a basketball player's free-throw percentage. Time allowing, we will also introduce some more advanced concepts. It is meant to help people with a general understanding of traditional statistics and probability take that first step towards drinking the Bayesian kool-aid. It is delicious stuff, I promise. Tastes like genuine reductions in uncertainty. And cherries.
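
For a taste of the "basic basics," here is a tiny self-contained Beta-Binomial example; the simulated data and the uniform Beta(1, 1) prior are assumptions of this sketch, not necessarily the exact example used in the talk.

```python
import numpy as np
from scipy import stats

# Simulate data: 100 free throws with a true (but "unknown") success rate of 0.7
rng = np.random.default_rng(42)
shots = rng.binomial(1, 0.7, size=100)
successes, trials = shots.sum(), shots.size

# Conjugate update: Beta(a, b) prior + binomial likelihood -> Beta posterior
a_prior, b_prior = 1, 1
posterior = stats.beta(a_prior + successes, b_prior + trials - successes)

print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```

Everything that comes out is a probability distribution over the unknown proportion, which is exactly the property the abstract emphasizes.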

Bio:

Rob Mealey has a BA in Political Science from St. Michael's College and an MS in Applied Economics from Johns Hopkins. He works as a data scientist at DC-based social media analytics start-up newBrandAnalytics, surfacing disruptive and actionable findings from messy and disparate data sources for organizations and companies large and small across many different sectors. He previously worked at Fannie Mae... afterwards... where he single-handedly figured out what went wrong. It isn't his fault if no one listened.

Rob blogs at http://www.obscureanalytics.com and tweets at @robbymeals.

Robert Mealey

Hi all. Really appreciate everyone coming. Sounds like I rushed through things way too fast for some people. Sorry! I was worried I would do just that. I was trying to not talk down to all the massive brains in the room, but I should have been more careful.

An earlier version of the same ideas can be found here: http://www.obscureanalytics.com..., and playing with the code up on github can help. The books I've been using to learn this stuff are: 
Introduction to Bayesian Statistics, William Bolstad 
Doing Bayesian Data Analysis, John K Kruschke 
Scott Lynch's Introduction to Applied Bayesian Statistics and Estimation for Social Scientists

A good very basic intro book is: 
Introduction to Bayesian Statistics, William Bolstad

I hope that some people got at least something out of it, and I really appreciated the opportunity to talk with you all!

Ross Mohan

Hey Rob, let's try this time with content <g> So, thinking of your Google/RCP Naive Bayes political poll analysis, I ran across another Tonka Toy(tm) that might be great for Bayesian analysis: http://en.wikipedia.o...

Robert Mealey

Here's a link to the slides:
http://obscureanalytics.com/dsd...

Jerzy Wieczorek

Python users might also want to check out Think Stats by Allen Downey (free online, or paperback from O'Reilly). It's an intro stats book that takes the Bayes approach from the start, and it's targeted at coders.
http://www.greenteapress.com/th...

Portman Wills

Are the slides online anywhere? I found the code (https://github.com/robbymeals) but couldn't find the presentation.

Robert Mealey

Obviously it was much more detailed and tool focused than this will be. And better. The slides and code are all on his github page: https://github.com/joh... and are fantastically useful. This presentation will really be a conceptual intro with simple applications.

September 17, 2012

Source: http://www.meetup.com/Data-Science-D...ents/75036512/

6:30 PM
PAW-Gov Data Drinks
25 Data Scientists |  4.50

For those interested in or attending the Predictive Analytics World for Government conference here in Washington, DC on September 17th and 18th, Data Science DC will be holding a Data Drinks event following the first evening of the conference.

All are welcome! Come meet others from DC and elsewhere interested in Data Science, Analytics, and Big Data applications in the government sector!

We have reserved the downstairs Small Bar area of the Iron Horse Taproom, which is less than 10 minutes from the conference venue.

Statistics.com will be sponsoring this event and providing food. Iron Horse has an excellent beer selection.

August 27, 2012

Source: http://www.meetup.com/Data-Science-D...ents/76428622/

6:30 PM
The Split Personalities of Data Scientists: Survey Results and Insights
90 Data Scientists |  4.50

Who are Data Scientists, and what is the distribution of their skills in the community? What skills do organizations say they want, and which do they actually need?

For our August event, Harlan Harris, Marck Vaisman, and Sean Murphy will be presenting results from a survey of Data Scientists that addresses these questions and explores the disconnect between reality and recruitment. They will highlight unique combinations of skills and suggest ways that Data Scientists can market themselves and get ahead.

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:10pm -- Harlan, Marck and Sean's presentation and Q&A
  • around 8:30pm -- Adjourn for Data Drinks

Bios

Harlan Harris has a PhD in Machine Learning from the University of Illinois at Urbana-Champaign, and is Senior Data Scientist at Kaplan Test Prep.

Marck Vaisman has an MBA from Vanderbilt and a BS in Mechanical Engineering from Boston University, and is Principal Data Scientist at DataXtract, working with a broad range of clients.

Sean Murphy has an MBA from Oxford and a MS in Biomedical Engineering from Johns Hopkins, and has worked as a Data Scientist at Hopkins and at several startups.

Harlan and Marck co-organize the Data Science DC and R Users DC Meetups, and Sean will soon be launching a new Data Business DC Meetup. All three are on the Board of the forthcoming Data Community DC, Inc.

Janet Dobbins

Hi Marck,
New article by Harvard Business Review, "Data Scientist: The Sexiest Job of the 21st Century."

http://hbr.org/2012/10/data-sci...

July 23, 2012

Source: http://www.meetup.com/Data-Science-D...ents/70123182/

6:30 PM
The Beating Heart of Kim Kardashian
105 Data Scientists |  4.50

For our July event, we're thrilled to have Mike Dewar, PhD, Data Scientist at bitly in New York City, talking about modeling streams of data. And Kim Kardashian.

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:10pm -- Dr. Dewar's presentation and Q&A
  • around 8:30pm -- Adjourn for Data Drinks

Abstract

Kim Kardashian is a dynamic input/output system that maps attention onto revenue. The administrators of this system have a straightforward task: by maximising the attention that flows into the system, they can maximise the revenue that flows out of the system. While it is widely accepted how to measure revenue it is not so clear how to measure attention.

Armed with bitly click data, I will describe a mechanism that creates a continuous signal representative of the attention entering the Kim Kardashian system. I will then discuss the efficacy of the Kardashian administrators in herding attention through the system, by comparing this new signal with signals of a similar nature.

We will look at bitly data, make an argument about why binning event streams is bad, talk about a realtime database, and wonder at the collective browsing behaviour of hundreds of thousands of people. Note: this talk will not deal with the internal mechanics of the Kardashian attention to revenue mapping, which remain a mystery.

Bio

Mike has cheerfully refused to provide a bio, so here are some possibly relevant facts from the Internet: Mike Dewar graduated from the University College of Wales, Aberystwyth with a BSc Hons in Agriculture in 1983. He has a PhD in modelling dynamic systems from data from the University of Sheffield in the UK, and has worked as a Machine Learning post-doc in The University of Edinburgh and Columbia University. Among other accomplishments, he is Group Management Accountant at South African Fruit Exporters, was briefly famous in the internets for a visualization of the Afghan War Diaries Wikileaks data, and is Canada's Sexiest Election Candidate. At bitly, he builds mathematical models and visualizations of how links are shared. He has a book coming out this summer entitled "Getting Started with D3", and previously wrote several books about the British Army in Northern Ireland and elsewhere.

Harlan Harris

Mike's slides are at https://github.com/mikedewar/bea... . Download the repo and open index.html in a web browser (Chrome works better than Firefox, it looks like). I've posted recorded audio on the Files page: http://www.meetup.com/Data-Scie... My Note: http://files.meetup.com/2215331/Mike...tProcesses.mp3

June 28, 2012

Source: http://www.meetup.com/Data-Science-D...ents/65834962/

6:30 PM
Rare and Hidden Event Detection
100 Data Scientists |  4.50 | 1 Photo

For our June Meetup, we are thrilled to have Gerhard Pilcher from Elder Research, Inc., presenting on the topic of rare event detection in data sets.

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:10pm -- Mr. Pilcher's presentation and Q&A
  • around 8:30pm -- Adjourn for Data Drinks

Abstract

There is a collection of interesting problems associated with rare and hidden events. Most people naturally associate this class of problems with fraud detection and criminal activity, but there are many other applications in science and marketing. By definition there are a lot of “non-events” in rare event detection, resulting in noise that confuses attempts to model or discriminate among event outcomes. I will share some of our experiences with sorting through the noise and amplifying the response signal, and discuss some techniques to increase confidence (or not!) in the resulting model. I use the term “hidden event”, also called unsupervised modeling, to describe a set of problems where the event outcome is unknown or the number of known cases is too small to be useful in machine learning algorithms. Detecting hidden events requires a much higher level of subject matter knowledge. I will discuss some general approaches to this class of problem and then focus on Mahalanobis distance measurement as an example technique for anomaly detection. I hope everyone will have the opportunity to add a few new tools to their data analysis “tool box”.
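As a minimal, hedged sketch of the Mahalanobis-distance idea mentioned in the abstract (an illustration on simulated data, not Elder Research's implementation), base R's mahalanobis() can flag observations that sit far from the bulk of a data set:

[code lang="R"]# Toy anomaly detection with Mahalanobis distance (illustrative only)
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)   # 200 unremarkable two-dimensional observations
x <- rbind(x, c(6, 6))                  # one injected anomaly

# Squared Mahalanobis distance of every row from the sample centroid
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))

# Under approximate normality, d2 follows a chi-squared distribution with 2 df,
# so a high quantile gives a natural flagging threshold
threshold <- qchisq(0.999, df = 2)
which(d2 > threshold)                   # indices of flagged rows (should include row 201)
[/code]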

Bio

Gerhard Pilcher's work experience spans both private and government sectors, and has featured applications of data mining techniques to Fraud Detection and Risk Management. He currently serves as Vice President and Senior Scientist at Elder Research. Among his previous roles, he was Chief Technology Officer and VP of Engineering for Pulse Communications, where he directed the design of early digital subscriber line (DSL) systems.

Gerhard has served on various boards including the Strategic Advisory Board for the North Carolina State University Computer Science Department. He has a Master of Science in Analytics (Institute for Advanced Analytics, NCSU) and Bachelor of Science in Computer Science from NCSU.

Gerhard Pilcher

The presentation is here: http://files.meetup.com/2215331... Slides

Harlan Harris

The audio is here: http://files.meetup.com/2215331... You'll have to be logged in to Meetup to download it.

Harlan Harris

Intro slides are here: https://docs.google.com/present/... Slides

May 22, 2012

Source: http://www.meetup.com/Data-Science-D...ents/61338572/

6:30 PM
Data Science and Scientific Discovery: New Approaches to Nature’s Complexity
90 Data Scientists |  4.50

For our May Meetup, we're very pleased to have Dr. John Rumble from R&R Data Services share his views of how Data Science and Scientific Data interact. Among many other professional accomplishments, he co-founded the CODATA Data Science Journal, and during his many years at NIST, he led the development of standards for representation and management of scientific data. Dr. Rumble's talk is entitled Data Science and Scientific Discovery: New Approaches to Nature’s Complexity.

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:10pm -- Dr. Rumble's presentation and Q&A
  • around 8:30pm -- Adjourn for Data Drinks

Abstract

As we approach an era in which virtually all scientific and technical (S&T) data are generated, collected, managed, accessed, and exploited digitally, we can ask how emerging Data Science technology and tools will impact science and scientific discovery. The federal government and many individual scientific disciplines are bringing new focus to S&T data with new initiatives called “Big Data” and “The Fourth Paradigm.” To appreciate the challenges and opportunities, it is critical to understand the complexity of nature and how that affects the application of Data Science techniques to S&T Big Data. In my talk I will discuss issues such as (1) independent variables and the growth of knowledge, (2) correlation and causation – they ain’t the same thing, (3) why 6 billion is not so big after all, and (4) data and scientific discovery. There are many emerging opportunities for Data Scientists to work with S&T Big Data, and you will be better prepared after listening to these stories.

Bio

Dr. John Rumble has combined his analytical and scientific skills with years of executive experience in a variety of government, industry, and non-profit organizations. Rumble was a pioneer in using computers and information technology in all aspects of scientific and technical (S&T) data. Working for the National Institute of Standards and Technology (NIST), he expanded its Standard Reference Data Program into new areas including materials science, biotechnology, engineering, and other areas. He has also led several national and international non-profit groups and committees involved in S&T data work, including standards development and the Committee on Data for Science and Technology (CODATA) of the International Council for Science, of which he was elected President in 1998.

Dr. Rumble received a B.A. in Chemistry from Cornell University and a Ph.D. in Chemical Physics from Indiana University, Bloomington. He has received many honors for his work including having been named Fellow of six national and international scientific societies. In 2006, Rumble was awarded the CODATA Prize for his achievements. He has authored three books, written over 60 technical papers, and given more than 150 talks on five continents.

Harlan Harris

Audio and slides from this presentation are now available on the Files page! http://www.meetup.com/Data-Scie...

My Note: http://files.meetup.com/2215331/Rumb...e%20final.pptx Slides

and http://files.meetup.com/2215331/John...cienceData.mp3

April 30, 2012

Source: http://www.meetup.com/Data-Science-D...ents/56623962/

6:30 PM
Information Visualization for Knowledge Discovery
90 Data Scientists |  5.00

For our April Meetup, we are very happy to have Prof. Ben Shneiderman from the Department of Computer Science and the Human-Computer Interaction Lab at the University of Maryland, talking about Information Visualization for Knowledge Discovery.

  • 6:30pm -- Networking and Refreshments
  • 7:00pm -- Introduction
  • 7:10pm -- Prof. Shneiderman's presentation and Q&A
  • around 8:30pm -- Adjourn for Data Drinks

Abstract

Interactive information visualization tools provide researchers with remarkable capabilities to support discovery. These telescopes for high-dimensional data combine powerful statistical methods with user-controlled interfaces. Users can begin with an overview, zoom in on areas of interest, filter out unwanted items, and then click for details-on-demand. With careful design and efficient algorithms, the dynamic queries approach to data exploration can provide 100msec updates even for million-item visualizations that can represent billion-record databases.

Interactive visualization tools have led to commercial success stories such as Spotfire, Smart Money's Market Map, and the Hive Group, as well as research tools such as TimeSearcher, developed at UMD for time series data analysis. The integration of statistics with visualizations has been successfully applied to temporal event sequences such as electronic health records (Lifelines2 and Lifeflow) and social network data (SocialAction and NodeXL).

Bio

Ben Shneiderman is a Professor in the Department of Computer Science and Founding Director (1983-2000) of the Human-Computer Interaction Laboratory at the University of Maryland. He is a Fellow of the AAAS, ACM, and IEEE, and a Member of the National Academy of Engineering. He is the co-author with Catherine Plaisant of Designing the User Interface: Strategies for Effective Human-Computer Interaction (5th ed., 2010). With Stu Card and Jock Mackinlay, he co-authored Readings in Information Visualization: Using Vision to Think (1999). With Ben Bederson he co-authored The Craft of Information Visualization (2003). His book Leonardo’s Laptop appeared in October 2002 (MIT Press) and won the IEEE book award for Distinguished Literary Contribution.

His latest book, with Derek Hansen and Marc Smith, is Analyzing Social Media Networks with NodeXL (2010). Prof. Shneiderman will be happy to sign copies of this book (available at Amazon) after the presentation.

Note

HCIL's Annual Symposium is May 22nd and 23rd, and may be of interest to many!

Harlan Harris

Audio is posted in the Files section. I'll find out about slides and video... My Note: http://files.meetup.com/2215331/BenS...ualization.3gp

March 27, 2012

Source: http://www.meetup.com/Data-Science-D...ents/48644692/

7:00 PM
Data Science Classroom: Clustering
70 Data Scientists |  4.50

We're very pleased to have Dr. Abhijit Dasgupta, statistical consultant at NIH and other local businesses and machine learning researcher, present our second Data Science Classroom event! Clustering (Unsupervised Learning) is a foundational Statistics and Machine Learning task, used to identify and visualize structure and commonalities in unlabeled data.

At 6:30pm, mingling, food and drink.

At 7:00pm, after the usual introduction, Abhijit will give a mostly-introductory talk, talking about when clustering is and isn't useful, methods for determining the number of clusters, scalability of different approaches, and related issues.

At 8:30pm or so, we will find a bar for Data Drinks.

So that we can best plan for this event, please do not RSVP "Yes" unless you would be willing to bet us even odds that you will actually attend.
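Ahead of the talk, here is a minimal, hedged sketch (base R on the built-in iris data) of one topic Abhijit mentions above, picking the number of clusters, using k-means and the within-cluster sum of squares "elbow":

[code lang="R"]# k-means on the standardized numeric columns of iris, scanning k = 1..8
x <- scale(iris[, 1:4])

set.seed(123)
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")   # look for the "elbow"

fit <- kmeans(x, centers = 3, nstart = 25)           # the elbow suggests roughly 3 here
table(fit$cluster, iris$Species)                     # compare clusters to the known labels
[/code]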

Harlan Harris

Intro slides are here: https://docs.google.com/present/...  My Note: Requires Password
Presentation slides are here: http://dl.dropbox.com/u/1282581... My Note: Do not download this!
Presentation audio is here: http://www.meetup.com/Data-Scie...My Note: http://files.meetup.com/2215331/Abhi...Clustering.3gp

March 20, 2012

Source: http://www.meetup.com/Data-Science-D...ents/55076962/

8:00 AM
Seminar on Bayesian Inference (full day event!)
80 Data Scientists |  4.50

We're pleased to announce that John Myles White will be in town on Tuesday, March 20th, for a full-day seminar on Bayesian Inference at the Computer Science Department of George Mason University in Fairfax, VA. The event is open to the public, and we would be very happy for DSDC Members and others to attend! There is a small charge to cover John's expenses, which is optional for students and faculty of GMU and other educational institutions.

The seminar begins at 8am and ends at 3:30pm, with a break from 11am to 12:30pm. The location is "SUB II (The Hub) Rooms 3, 4, and 5", and The Hub can be found on this map of campus (pdf).

The content to be covered is as follows:

Section 1:

  • An Introduction to Bayesian Inference
  • Introduce the Bayesian paradigm of inference as probabilistic calculation
  • Provide a loose treatment of the Cox axioms
  • Discuss useful statistical theory:
  • Likelihood functions
  • Maximum likelihood estimation
  • Fisher information
  • Bias, variance, consistency and the Central Limit Theorem for estimators
  • Review standard probability distributions
  • Go through the classical coin-flipping example in detail with a beta prior (a minimal sketch of this example appears after the outline below)
  • Describe results of Bayesian inference as comparable to MLE with regularization added in

Section 2:

  • BUGS as a Tool for Automating Bayesian Inference
  • Describe how to specify models using BUGS language
  • Go through many example models
  • Normal with unknown mean, known variance
  • Normal with unknown mean, unknown variance
  • Linear regression: unknown coefficients and variance, Normal priors
  • Linear regression with Laplace priors
  • Logistic regression
  • Hierarchical models
  • LDA
  • SNA models
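For those who want to try the coin-flipping example from Section 1 before the seminar, here is a minimal sketch of the conjugate Beta-Binomial update in base R; the flips below are made-up data, and this is not John's seminar code:

[code lang="R"]# Coin flipping with a Beta prior: the posterior is again a Beta distribution
flips <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)   # hypothetical data: 1 = heads, 0 = tails
a0 <- 2; b0 <- 2                           # Beta(2, 2) prior on P(heads)

a_post <- a0 + sum(flips)                  # prior "heads" plus observed heads
b_post <- b0 + sum(1 - flips)              # prior "tails" plus observed tails

a_post / (a_post + b_post)                 # posterior mean
qbeta(c(0.025, 0.975), a_post, b_post)     # 95% credible interval

mean(flips)                                # the MLE, which the posterior approaches as data accumulate
[/code]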

Bio: John Myles White is a Ph.D. student in the Princeton Psychology Department, where he studies how humans make decisions both theoretically and experimentally. Along with the political scientist Drew Conway, he is the author of a book recently published by O’Reilly Media entitled “Machine Learning for Hackers”, which is meant to introduce experienced programmers to the machine learning toolkit. John is now working with the statistician Mark Hansen on a book for laypeople about exploratory data analysis. He is also the lead maintainer for several popular R packages, including ProjectTemplate and log4r.

John Myles White

For those interested, the slides from yesterday's talk are available at https://github.com/johnmyleswhit...

February 20, 2012

Source: http://www.meetup.com/Data-Science-D...ents/50169542/

6:30 PM
Big Data: Fuel for Innovation and Progress or Pollution of the Information Age?
45 Data Scientists |  4.00

Join us on February 20 at 6:30pm for a lively discussion about data privacy given by Jules Polonetsky, Director and Co-chair of the Future of Privacy Forum, a Washington, D.C.-based think tank that seeks to advance responsible data practices.

"Big Data: Fuel for Innovation and Progress or Pollution of the Information Age?"

Companies large and small look increasingly to data for insights to improve products and services, to market more effectively and to develop new services. At the same time, regulators and advocates are increasingly concerned about the privacy risks created by large data sets and seek new technical and legal protections for consumers. What is the future of data and innovation in an era of “Do Not Track” proposals? How can companies maximize the benefits of data use while minimizing the risks?

Harlan Harris

Carrie and others, a (not great quality) recording of this talk is now downloadable here: http://files.meetup.com/2215331...

January 25, 2012

Source: http://www.meetup.com/Data-Science-D...ents/45226242/

7:00 PM
Presentation/Discussion: Maximal Information Coefficient
55 Data Scientists |  5.00

For the January Meetup, we're pleased to have Sean Murphy lead a discussion of this new paper:

Detecting Novel Associations in Large Data Sets, David N. Reshef et al., Science 16 December 2011: 1518-1524.

Here's a paragraph of the Perspective:

Most scientists will be familiar with the use of Pearson's correlation coefficient r to measure the strength of association between a pair of variables: for example, between the height of a child and the average height of their parents (r ≈ 0.5; see the figure, panel A), or between wheat yield and annual rainfall (r ≈ 0.75, panel B). However, Pearson's r captures only linear association, and its usefulness is greatly reduced when associations are nonlinear. What has long been needed is a measure that quantifies associations between variables generally, one that reduces to Pearson's in the linear case, but that behaves as we'd like in the nonlinear case. On page 1518 of this issue, Reshef et al. (1) introduce the maximal information coefficient, or MIC, that can be used to determine nonlinear correlations in data sets equitably.
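As a quick, hedged illustration of the point in that paragraph (not the paper's own analysis), a simulated nonlinear relationship shows Pearson's r near zero even though the dependence is strong; the minerva R package, assumed installed, is one route to computing MIC itself:

[code lang="R"]set.seed(42)
x        <- runif(500, -1, 1)
y_linear <- 2 * x + rnorm(500, sd = 0.2)     # linear signal plus noise
y_parab  <- x^2 + rnorm(500, sd = 0.05)      # strong but nonlinear signal

cor(x, y_linear)   # close to 1: Pearson's r sees the linear association
cor(x, y_parab)    # close to 0, despite the obvious parabolic dependence

# If the minerva package is installed, MIC can be computed directly:
# library(minerva); mine(x, y_parab)$MIC     # expected to be well above |Pearson's r|
[/code]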

After presenting a short overview of the paper and why it is important to statistical practitioners, Sean will lead a discussion of MIC. We would urge everyone to at least skim the paper before attending. (If you don't have free access to Science, let us know and we'll hook you up...)

As always, mingling, food & drink start at 6:30pm, the presentation starts at 7pm, and we try to find a bar for Data Drinks around 8:30pm!

Sean Murphy is an entrepreneur, educator, Senior Research Scientist at Johns Hopkins and Director of Research at Manhattan GMAT. His research has spanned anomaly detection in time series data, agent-based models for disease simulation, and prediction techniques for viral evolution. In his spare time, he analyzes large data sets to design, build and deliver better educational web applications.

A former member

Andrew Gelman's blog has an interesting critique of MIC: http://andrewgelman.com/2012/03...

Harlan Harris

And here are Sean's slides: http://files.meetup.com/2215331... Slides

Harlan Harris

Intro slides are here: https://docs.google.com/present/... Slides

John Dennison

http://www.youtube.com/watch?v=...

John Dennison

http://web.mit.edu/newsoffice/2...

December 21, 2011

Source: http://www.meetup.com/Data-Science-D...ents/43638302/

6:30 PM
Data Science Classroom: Naive Bayes and Logistic Regression
60 Data Scientists |  4.50

For our next Meetup, we're pleased to have Elena Zheleva presenting an introduction to two fundamental methods used in data analysis and prediction. And to celebrate the holiday season, and with thanks to our sponsors, we'll be serving egg nog along with tasty eats and even tastier data science!

Consider the problem of separating spam email from non-spam email, or predicting which leads will become customers, or any other classification problem. Logistic Regression is a classical method from Statistics used to model the probability that data is in one of two categories. Naive Bayes is a very similar Machine Learning algorithm for categorization with independent predictors. Any Data Scientist should have a solid understanding of these two foundational methods, what they do, how they differ, and how to use them.
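For readers who would like to poke at the two methods before the talk, here is a minimal, hedged R sketch (using the built-in iris data and the e1071 package, assumed installed) contrasting logistic regression with Naive Bayes on the same toy classification problem:

[code lang="R"]library(e1071)   # provides naiveBayes()

# Toy two-class problem: is an iris flower of species virginica?
iris2 <- transform(iris, is_virginica = factor(Species == "virginica"))

# Logistic regression models P(class | predictors) directly
logit_fit <- glm(is_virginica ~ Sepal.Length + Petal.Width,
                 data = iris2, family = binomial)

# Naive Bayes models P(predictors | class), assuming the predictors are independent
nb_fit <- naiveBayes(is_virginica ~ Sepal.Length + Petal.Width, data = iris2)

# Both produce class probabilities we can compare side by side
head(predict(logit_fit, type = "response"))
head(predict(nb_fit, newdata = iris2, type = "raw"))
[/code]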

Elena Zheleva earned her PhD this year from the Computer Science Department at the University of Maryland, College Park. Her dissertation explored new methods for classification in social networks which consider both friendship links and social groups of users, as well as models for evolving social networks and groups, and the application of machine learning algorithms to privacy issues in social networks. She currently works in the Data Science group at LivingSocial in Washington, DC.

Finally, to help us predict turnout at our events, we ask you to only RSVP "Yes" if there's at least a 50% likelihood that you will attend! (Frequentists, please do the best you can.)

Amrinder Arora

It was an excellent meetup. Forgot to post this earlier, but I was live blogging this.. http://www.standardwisdom.com/s...

John Dennison

Here is a great, simple Python implementation of Naive Bayes. For those who glaze over at Bayesian/sigma notation, this is a great way to learn. http://ebiquity.umbc.edu/blogge...

Harlan Harris

Introductory slides are here:
https://docs.google.com/present/... Slides

Josh

Some material exploring this topic (I believe Elena mentioned the second and third ones):

http://www.quora.com/What-is-th...

http://www.cs.cmu.edu/~tom/mlbo...

http://ai.stanford.edu/~ang/pap...

Saurabh

http://blog.kaggle.com/2011/03/...

November 29, 2011

Source: http://www.meetup.com/Data-Science-D...ents/39933182/

6:30 PM
Panel Discussion on Selling Data Science Projects and Results
40 Data Scientists |  4.50

One of the more difficult tasks for people who build analytics models is communicating with others about their work and why they should care. For this Meetup, we're going to talk about "Selling Data Science Projects and Results to People with Budgets", whether those people are executives in your company, or granting agencies, or higher-ups in government. A panel of data science experts with a range of experiences will start off with 5-minute presentations about an attempt to sell a data science project that went either particularly well or particularly poorly. We'll then transition to a panel and Q&A discussion.

Our panel will be:

Brett Berlin - Senior S&T Strategy Advisor, Dynamics Research Corp.
Brian Johnson - Co-founder, Donor Precision
Doug Samuelson - President, InfoLogix
Nick Tabbal - SVP, Resonate Networks
Russ Vane - Managing Consultant, IBM

Harlan Harris

Intro slides here: https://docs.google.com/present/... Slides

October 25, 2011

Source: http://www.meetup.com/Data-Science-D...ents/37103942/

6:30 PM
If a Tree Falls in a LogicForest... An Introduction to Logic Regression
40 Data Scientists |  4.50

For our next Meetup, John Dennison will be discussing Logic Regression, a 10-year-old technique that blends aspects of decision trees with linear regression.

Socializing starts at 6:30, and the presentation at 7:00!

Blurb:

With an absolute blur of possible machine learning techniques out there, it can be difficult to know even where to start. I hope to share some thoughts useful for settling on an approach. That is, what are the questions that you ask yourself when settling on a specific technique? Using some of these ideas, we will explore tools for learning with binary predictors. This will include an introduction to Logic Regression, simulated annealing, and an ensemble technique implemented in R called LogicForest.

Bio:

John Dennison is a Data Product Developer for AcuStream LLC, a healthcare software start-up in McLean, VA. While relatively new to the machine learning world, he has taken the dive into the nerdy world of data science and is not looking back. He also despises thinly titled talks.

Harlan Harris

John's presentation here:
http://files.meetup.com/2215331... Slides

Introductory slides here:
https://docs.google.com/present/...

September 26, 2011

Source: http://www.meetup.com/Data-Science-D...ents/33054232/

6:30 PM
Presentation/Discussion: "What the heck is Data Science, anyway?"
35 Data Scientists |  4.50

Harlan Harris, co-founder of the Data Science DC Meetup, veteran of the NYC Machine Learning community, and bona fide Data Scientist, will untangle the ways in which people use the term Data Science and lead a discussion following his presentation.

As the tech community and even the media catch on to the sophisticated ways in which people use data, the term "data science" has emerged as a catch-all. But as with any new buzz word it's worth unpacking just what the reality is. Join us on September 26th to hear and see how people are using the term and even offer your own take!

We'll start at 6:30 with socializing and food/drinks and the presentation will begin at 7pm.

We're still finalizing the location, but you can assume that it will be in downtown DC and close to a Metro station.

August 25, 2011

Source: http://www.meetup.com/Data-Science-D...ents/27740321/

6:30 PM
Kickoff -- Regularization Regression
100 Data Scientists |  4.50

We're going to kick off the DC Data Science Meetup by joining our friends over at the R Users DC Group on August 25th for a presentation by John Myles White on some tricks for solving machine learning problems with more than a few predictors. John is a Ph.D. student and is currently working on a new O'Reilly book called "Machine Learning for Hackers".

We're still working to finalize a location, but are committed to hosting the meetup near a Metro station in Downtown DC. Things will kick off at 6:30 with refreshments and networking and we'll start the presentation at 7:00pm.

For those curious about Data Science, here are some links:

http://www.dataists.com/2010/09/the-data-science-venn-diagram/

http://www.quora.com/What-is-data-science

And for more on John Myles White: 

Twitter: @johnmyleswhite

Bio: http://www.johnmyleswhite.com/about/

Blog: http://www.johnmyleswhite.com

And here's a link to the DC R Users Group:

http://www.meetup.com/R-users-DC/

Past Blogs

IN PROCESS

The Sunlight Foundation is looking for an Enthusiastic Software Developer

This guest post is a job listing reposted from the Sunlight Foundation.

My Note: See their statement and the technologies they use: We collect huge amounts of information throughout the day from all over Congress, the U.S. executive branch, and state and municipal governments. We use this information to build powerful, impactful websites

The Sunlight Foundation uses cutting-edge technology and ideas to make government more open and transparent. We’re a non-partisan non-profit organization, founded in 2006, with more than 50 employees, including journalists, policy advocates and software developers.

Software Developer

The Sunlight Foundation is looking for an enthusiastic software developer to join our ranks in Washington, DC and expand the breadth and sophistication of what we’re doing with government data. This is a new position.

We collect huge amounts of information throughout the day from all over Congress, the U.S. executive branch, and state and municipal governments. We use this information to build powerful, impactful websites (such as Open States and Scout); mobile applications (such as our Congress apps for Android and iOS); and APIs (like these). Along the way, we make heavy use of GitHub to work with other groups and developers to build an ecosystem of tools and datasets for the entire open government community (you can see some examples at github.com/unitedstates and github.com/sunlightlabs).

We want to do more of all of this. If you take this position, you’ll be asked to:

  • Find and incorporate new sources of useful data into our APIs and applications.
  • Create new open source tools that solve real problems for Sunlight, our users and others in the open government community.
  • Research and develop cutting edge ways of using our data to create innovative features and solve new kinds of problems.
  • Notice what we’re not doing but should be, and do it.

This position will be part of a small team that focuses on many of the projects listed above. There will be plenty of opportunities to work with other teams in our technology department, as well as Sunlight’s journalists and policy staff.

You’ll also be asked to use a wide variety of programming languages and technologies.

Some technologies we use at Sunlight:

  • Web, system, and tool development with Ruby, Python, and Node.
  • Front-end and data visualization in JavaScript (and especially D3).
  • Both SQL and NoSQL, with PostgreSQL and MongoDB.
  • Searching lots of text using Elasticsearch and language analysis.
  • Native app development on Android (Java) and iOS (Objective-C).

You don’t need to know all of these, but you should know a couple of them, and be able to learn some others.

Sunlight works in both cyberspace and the physical world, so in addition to writing code, it would be an extra bonus if you enjoy writing about your work, public speaking on Sunlight’s behalf or meeting with people both inside and outside of government.

To apply, send an email to jobs@sunlightfoundation.com with your name and “Software Developer” in the subject line. Include your cover letter in the body and include a link to your GitHub account or other open-source work. Attach or link to a resume.

We will confirm receipt of all applications made to jobs@sunlightfoundation.com. No phone calls, please. Recruiters: principals only.

Selling Data Science: Common Language

 

GreenBook: Are You Burning Away Your Data Fuel?

What do you think of when you say the word “data”?  For data scientists this means SO MANY different things from unstructured data like natural language and web crawling to perfectly square excel spreadsheets.  What do non-data scientists think of?  Many times we might come up with a slick line for describing what we do with data, such as, “I help find meaning in data” but that doesn’t help sell data science.  Language is everything, and if people don’t use a word on a regular basis it will not have any meaning for them.  Many people aren’t sure whether they even have data let alone if there’s some deeper meaning, some insight, they would like to find.  As with any language barrier the goal is to find common ground and build from there.

You can’t blame people, the word “data” is about as abstract as you can get, perhaps because it can refer to so many different things.  When discussing data casually, rather than mansplain what you believe data is or what it could be, it’s much easier to find examples of data that they are familiar with and preferably are integral to their work.

The most common data that everyone runs into is natural language; unfortunately, this unstructured data is also some of the most difficult to work with. In other words, they may know what it is, but showing how it's data may still be difficult. One solution: discuss a metric with a qualitative name; such metrics include “similarity”, “diversity”, or “uniqueness”. We may use the Jaro algorithm to measure similarity, where we count common letters between two strings and their transpositions, and there are other algorithms. When we discuss “similarity” with someone new, or any other word that measures relationships in natural language, we are exploring something we both accept and we are building common ground.
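For the curious, here is a minimal sketch of the Jaro idea in base R, counting the characters two strings share within a matching window and then the transpositions among them; packages such as stringdist offer production-ready implementations.

[code lang="R"]jaro <- function(s1, s2) {
  a <- strsplit(s1, "")[[1]]
  b <- strsplit(s2, "")[[1]]
  la <- length(a); lb <- length(b)
  if (la == 0 || lb == 0) return(0)
  window <- max(floor(max(la, lb) / 2) - 1, 0)

  matched_a <- logical(la); matched_b <- logical(lb)
  # Pass 1: count characters the two strings share, looking only within the window
  for (i in seq_len(la)) {
    lo <- max(1, i - window); hi <- min(lb, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched_b[j] && a[i] == b[j]) {
        matched_a[i] <- TRUE; matched_b[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched_a)
  if (m == 0) return(0)

  # Pass 2: transpositions are matched characters that appear in a different order
  t <- sum(a[matched_a] != b[matched_b]) / 2
  (m / la + m / lb + (m - t) / m) / 3
}

jaro("MARTHA", "MARHTA")   # roughly 0.944: very similar strings
jaro("DATA", "SCIENCE")    # 0: no common letters within the window
[/code]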

 

first50JournalistDeathAssoc

Clustered Journalist Professions based on mortality rate. Those with more connections, and thus at the center, have a higher mortality rate.

Some data is obvious, like this neatly curated spreadsheet (Excel) from the Committee to Protect Journalists. Part of my larger presentation at Freedom Hack (thus the lack of labels), the visualization shown on the right was only possible to build in short order because the data was already well organized. If we're lucky enough to have such an easy start to a conversation, we get to bring the conversation to the next level and maybe build something interesting that all parties can appreciate; in other words, we get to “geek out” professionally.

Fantastic presentations from R using slidify and rCharts

Ramnath Vaidynathan presenting in DC

Dr. Ramnath Vaidyanathan of McGill University gave an excellent presentation at a joint Data Visualization DC/Statistical Programming DC event on Monday, August 19 at nclud, on two R projects he leads — slidify and rCharts. After the evening, all I can say is, Wow!! It’s truly impressive to see what can be achieved in presentation and information-rich graphics directly from R. Again, wow!! (I think many of the attendees shared this sentiment)

Slidify

Slidify is an R package that

helps create, customize and share elegant, dynamic and interactive HTML5 documents through R Markdown.

We have blogged about slidify, but it was great to get an overview of slidify directly from the creator. Dr. Vaidyanathan explained that the underlying principle in developing slidify is the separation of the content and the appearance and behavior of the final product. He achieves this using HTML5 frameworks, layouts and widgets which are customizable (though he provides several here and through his slidifyExamples R package).

Example RMarkdown file for slidify

You start with a modified R Markdown file as seen here. This file can have chunks of R code in it. It is then processed to a pure Markdown file, interlacing the output of R code into the file. This is then split-apply-combined to produce the final HTML5 document. This document can be shared using GitHub, Dropbox or RPubs directly from R. Dr. Vaidyanathan gave examples of how slidify can even be used to create interactive quizzes or even interactive documents utilizing slidify and Shiny.
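For anyone who wants to try the workflow described above, here is a minimal, hedged sketch; it assumes the slidify package and its supporting libraries have been installed from Ramnath's GitHub repositories, since slidify is not on CRAN.

[code lang="R"]# Install once from GitHub, e.g.:
#   devtools::install_github("ramnathv/slidify")
#   devtools::install_github("ramnathv/slidifyLibraries")
library(slidify)

author("mydeck")       # scaffolds a new deck, including the index.Rmd you then edit
slidify("index.Rmd")   # knits the R Markdown (running its R chunks) and builds index.html
# publish() can then push the finished deck to GitHub, Dropbox or RPubs
[/code]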

One really neat feature he demonstrated is the ability to embed an interactive R console within a slidify presentation. He explained that this used a Shiny server backend locally, or an OpenCPU backend if published online. This feature changes how presentations can be delivered, by not forcing the presenter to bounce around between windows but actually demonstrate within the presentations.

rCharts

rCharts is

an R package to create, customize and share interactive visualizations, using a lattice-like formula interface

Again, we have blogged about rCharts, but there have been several advances in the short time since then, both in rCharts and interactive documents that Dr. Vaidyanathan has developed.

rCharts creates a formula-driven interface to several Javascript graphics frameworks, including NVD3, Highcharts, Polycharts, and Vega. This formula interface is familiar to R users, and makes the process of creating these charts quite straightforward. Some customization is possible, as well as putting in basic controls without having to use Shiny. We saw several examples of excellent interactive charts using simple R commands. There is even a gallery where users can contribute their rCharts creations. There is really no excuse any more for avoiding these technologies for visualization, and it makes life so much more interesting!!
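As a hedged taste of that formula interface (assuming the rCharts package has been installed from GitHub, since it is not on CRAN), the following calls reproduce the flavor of the examples shown during the talk:

[code lang="R"]# devtools::install_github("ramnathv/rCharts")
library(rCharts)

# Polycharts, with a lattice-style conditioning formula
r1 <- rPlot(mpg ~ wt | am + vs, data = mtcars, type = "point", color = "gear")
r1   # printing the object renders the interactive chart in the browser/viewer

# NVD3, same idea with a grouping variable
hair_eye <- as.data.frame(HairEyeColor)
n1 <- nPlot(Freq ~ Hair, group = "Eye", data = hair_eye, type = "multiBarChart")
n1
[/code]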

Bikeshare maps, or how to create stellar interactive visualizations using R and Javascript

Dr. Vaidyanathan demonstrated one project which, I feel, shows the power of the technologies he is developing using R and Javascript. He created a web application using R, Shiny, and his rCharts package, which accesses the Leaflet Javascript library, plus a very little bit of Javascript magic, to visualize the availability of bicycles at different stations in a bike sharing network. This application can automatically download real-time data and visualize availability in over 100 bike sharing systems worldwide. He focused on the London bike share map, which was fascinating in that it showed how bikes had moved from the city to the outer fringes at night. Clicking on any dot showed how many bikes were available at that station.

London Bike Share map
Dr. Vaidyanathan quickly demonstrated a basic process of how to map points on a city map, how to change their appearance, and how to add additional metadata to each point, which will appear as a pop-up when clicked.
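A stripped-down version of that basic process, hedged and using rCharts' Leaflet reference class rather than the full bike-share application (the station names and bike counts below are made up), looks roughly like this:

[code lang="R"]library(rCharts)

map <- Leaflet$new()
map$setView(c(51.505, -0.09), zoom = 13)   # center the map on London

# Each marker is a point on the city map; bindPopup supplies the click-through metadata
map$marker(c(51.505, -0.090), bindPopup = "<p>Station A: 12 bikes available</p>")
map$marker(c(51.498, -0.083), bindPopup = "<p>Station B: 3 bikes available</p>")

map   # printing the object renders the interactive map
[/code]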

You can see the full project and how Dr. Vaidyanathan developed this application here.

Interactive learning environments

Finally, Dr. Vaidyanathan showed a new application he is developing using slidify, rCharts, and other open-source technologies like OpenCPU and PopcornJS. This application allows him to author a lesson in R Markdown, integrate interactive components including interactive R consoles, record the lesson as a screencast, sync the screencast with the slides, and publish it. This seems to me to be one possible future for presenting massive online courses. An example presentation is available here, and the project is hosted here.

Open presentation

The presentation and all the relevant code and demos are hosted on GitHub, and the presentation can be seen (developed using slidify, naturally) here.

Stay tuned for an interview I did with Dr. Vaidyanathan earlier, which will be published here shortly.

Have fun using these fantastic tools in the R ecosystem to make really cool, informative presentations of your data projects. See you next time!!!

Data Science MD August Recap: Sports Analytics Meetup

pitchfx

For August’s meetup, Data Science MD hosted a discussion on one of the most popular fields of analytics, the wide world of sports. From Moneyball to the MIT Sloan Sports Analytics conference, there has been much interest by researchers, team owners, and athletes in the area of sports analytics. Enthusiastic fans eagerly pore over recent statistics and crunch the numbers to see just how well their favorite sports team will do this season.

One issue that sports teams must deal with is the reselling of tickets to different events. Joshua Brickman, Director of Ticket Analytics, led off the night by discussing how the Washington Wizards are addressing secondary markets such as StubHub. One of the major initiatives taking place is a joint venture between Ticketmaster and the NBA to create a unified ticket exchange for all teams. Tickets, like most items, operate on a free market, where customers are free to purchase from whomever they choose.

brickman-slide-1

Joshua went on to explain that teams could either try to beat the secondary markets by limiting printing, changing fee structures, and offering guarantees, or they could instead take advantage of the transaction data received each week from the league across secondary markets.

brickman-slide-2

Josh outlined that the problem with the data was that it was only for the Wizards, it was only received weekly, and it didn't take into consideration dynamic pricing changes. So instead they built their own models and queries to create heat maps. The first heat map shows the inventory sold. For this particular example, the Wizards had a sold-out game.

brickman-slide-3

Possibly of more importance was the heat map showing at what premium were tickets sold on the secondary market. In certain cases, the prices were actually lower than face value.

brickman-slide-4

As with most data science products, the visualization of the results is extremely important. Joshua explained that the graphical heat maps make the data easily digestible for sales and directors, and supplement their numerical tracking. Their current process involves combining SQL queries with hand-drawn shape files. Joshua also explained how they can track secondary markets and calculate current dynamic prices to see discrepancies.

brickman-slide-5

Joshua ended by describing how future work could involve incorporating historical data and current secondary market prices to modify pricing to more closely reflect current conditions.

brickman-slide-6

Our next speaker for the night was Tom, the Player Information Analyst for our very own Baltimore Orioles. He began by describing how PITCHf/x is installed in every major league stadium and provides teams with information on the location, velocity, and movement of every pitch. Using heat maps, Tom was able to show how the strike zone has changed between 2009 and 2013.

duncan-slide-1

Tom then described the R code necessary to generate the heat maps.
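Tom's actual code isn't reproduced here; as a hedged stand-in, a pitch-location heat map of the same general sort can be drawn with ggplot2 from simulated data (px and pz, the usual PITCHf/x names for horizontal and vertical plate location in feet, are generated below rather than taken from real records):

[code lang="R"]library(ggplot2)

set.seed(1)
pitches <- data.frame(px = rnorm(5000, 0, 0.9),     # simulated horizontal locations
                      pz = rnorm(5000, 2.5, 0.7))   # simulated vertical locations

ggplot(pitches, aes(x = px, y = pz)) +
  stat_bin2d(bins = 30) +                           # 2-D bin counts drawn as a heat map
  scale_fill_gradient(low = "blue", high = "red") +
  coord_equal() +
  labs(x = "Horizontal location (ft)", y = "Vertical location (ft)", fill = "Pitches")
[/code]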

duncan-slide-2

Since different batters have different strike zones, locations needed to be rescaled to define new boundaries. For instance, Jose Altuve, who is 5’5″, has a relative Z location that is shifted slightly higher.

duncan-slide-3

Tom then went on to describe the impact that home plate umpires have on the game. On average, 155 pitches are called per game, with 15 being within one inch of the strike zone, and 31 being within two inches. With a game sometimes being determined by a single pitch, the decisions that a home plate umpire makes are very important. A given pitch is worth approximately 0.13 runs.

duncan-slide-4

Next Tom showed various heat map comparisons that highlighted differences between umpires, batters, and counts. One of the most surprising results was the difference when the batter faced a 0-2 count versus a 3-0 count. I suggest readers look at all the slides to see the other interesting results.

duncan-slide-5

While the heat maps provide a lot of useful information, it is sometimes interesting to look at certain pitches of interest. By linking to video clips, Tom demonstrated how an interactive strike scatter plot could be created. Please view the video to see this demonstration.

duncan-slide-6

Tom concluded by saying that PITCHf/x is very powerful, and yes, umpires have a very difficult job!

Slides can be found here for Joshua and here for Tom.

The video for the event is below:

Data Visualization: Exploring Biodiversity

BHL_Flikr

When you have a few hundred years' worth of data on biological records, as the Smithsonian does, from journals to preserved specimens to field notes to sensor data, even the most diligently kept records don't perfectly align over the years, and in some cases there is outright conflicting information. This data is important: it is our civilization's best minds giving their all to capture and record the biological diversity of our planet. Unfortunately, as it stands today, if you or I were to decide we wanted to learn more, or if we wanted to research a specific species or subject, accessing and making sense of that data effectively becomes a career. Earlier this year an executive order was given which generally stated that federally funded research had to comply with certain data management rules, and the Smithsonian took that order to heart, even though it didn't necessarily apply directly to them, and has embarked on making their treasure of information more easily accessible. This is a laudable goal, but how do we actually go about accomplishing this? Starting with digitized information, which is a challenge in and of itself, we have a real Big Data challenge, setting the stage for data visualization.

The Smithsonian has already gone a long way in curating their biodiversity data on the Biodiversity Heritage Library (BHL) website, where you can find ever increasing sources.  However, we know this curation challenge can not be met by simply wrapping the data with a single structure or taxonomy.   When we search and explore the BHL data we may not know precisely what we’re looking for, and we don’t want a scavenger hunt to ensue where we’re forced to find clues and hidden secrets in hopes of reaching our national treasure; maybe the Gates family can help us out…

People see relationships in the data differently, so when we go exploring, one person may do better with a tree structure, others prefer a classic title/subject style search, or we may be interested in reference types and frequencies. Debating why we don't think about it as one monolithic system is akin to discussing the number of angels that fit on the head of a pin: we'll never be able to test our theories. Our best course is to accept that we all dive into data from different perspectives, and we must therefore make available different methods of exploration.

Data Visualization DC (DVDC) is partnering with the Smithsonian’s Biodiversity Heritage Library to introduce new methods of exploring their vast national data treasure.  Working with cutting edge visualizers such as Andy Trice of Adobe, DVDC is pairing new tools with the Smithsonian’s, our public, biodiversity data.  Andy’s development of web standards for Visualizing data with HTML5 is a key step forward to making the BHL data more easily accessible not only by creating rich immersive experiences, but also by providing the means through which we all can take a bite out of this huge challenge.  We have begun with simple data sets, such as these sets organized by title and subject, but there are many fronts to tackle.  Thankfully the Smithsonian is providing as much of their BHL data as possible through their API, but for all that is available there is more that is yet to even be digitized.  Hopefully over time we can utilize data visualization to unlock this great public treasure, and hopefully knowledge of biodiversity will help us imagine greater things.

Create Data-Driven Tools that May Improve Diabetes Care for a Chance to Win $100K

This post is being reblogged by DC2 as it looked like a pretty worthy cause. The original author was Dwayne Spradlin.


In 2011, McKinsey Global Institute put out a study that projected, “If U.S. healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year.” The transformative power of open data has been a hot topic of national conversation over the past several years, particularly in its application to healthcare, as data practitioners ponder groundbreaking new developments that could change the lives of patients. In what may come as a surprising move for some, Sanofi US, an established healthcare player, is leading the charge to find out exactly what’s possible when you apply the power of open data to healthcare innovation to help millions of people living with diabetes.

Last week, in partnership with the Health Data Consortium, the Sanofi US 2013 Data Design Diabetes Innovation Challenge – Prove It! kicked off the Redesigning Data Challenge Series, inviting innovators to create data-driven tools that could potentially improve diabetes care in the US. Now in its third year, the partnership with the Health Data Consortium brings a new focus on data to Data Design Diabetes, encouraging entrepreneurs, data scientists, and designers to create the evidence needed to make better decisions related to diabetes. Through baseline knowledge models, evidence-based practice, or predictive analysis, the Challenge asks innovators to submit Prove It! concepts that have the potential to create real change with real knowledge.

Since its inception in 2011, Data Design Diabetes has helped to launch a number of successful startups, spurring innovative solutions to help people living with diabetes, their families, and their caregivers. The first Data Design Diabetes winner, Ginger.io, uses an app that alerts caregivers to concerning behavioral changes. The team has raised $8.2M in venture funding, has grown to a team of twelve, and recently acquired Rock Health startup, Pipette. Ginger.io has been groundbreaking in its use of mobile technology to create data that can be utilized to help improve the health and daily lives of people living with diabetes. Last year’s winner, n4a Diabetes Care Center, uses predictive analysis to isolate and target patients based on cost patterns and risk profiles, to provide them with support and services designed to slow the progression of the disease, improve the quality of health, and slow the spending associated with a patient’s health.

Prove It! is open for submissions now through April 7, 2013. Finalist teams will present at Health Datapalooza IV in Washington, DC, and one winner will receive $100,000.

Criteria

EVIDENCE-BASED HEALTH OUTCOMES: Ability to demonstrate in an evidence-based way how the concept can improve the outcomes and/or experience of people living with diabetes in the US.

TARGET AUDIENCE: Ability to support one or more members of the healthcare ecosystem and provide them with data-driven tools or evidence-based insight that can help them make better contributions to staving off the diabetes epidemic in the US.

DECISION-MAKING: Ability to illustrate how the concept can enable better data-driven decision-making at a particular stage across the spectrum of type 1 or type 2 diabetes, from lifestyle and environmental factors to diagnosis, treatment, maintenance, and beyond.

DATA SCIENCE: Utilize new or traditional data methodology — such as baseline knowledge models, evidence-based practice, and predictive analysis — to create a tool that may potentially change the landscape of diabetes management through richer insight, more timely information, or better sets of decisions.

2013 Data Design Diabetes Innovation Challenge Timeline

  • March 18: Challenge open for submissions
  • April 7: Last day to submit to the Challenge
  • April 18: Finalists announced during TEDMED
  • April 18-June 2: Finalists participate in virtual incubator
  • April 26-28: Innovators’ bootcamp in San Francisco, CA
  • June 3: Finalists present at Health Datapalooza IV
  • June 4: Winner announced at Health Datapalooza IV

Submit a concept today, or follow the Challenge on Twitter or Facebook for updates as the 2013 Data Design Diabetes Innovation Challenge – Prove It! progresses.

Data Visualization: Our Gun Debate

Our nation’s gun debate has been thrust back into the spotlight, and as a data scientist I am interested in what data we have and how we can visualize it to glean new information and knowledge.  As we know, data visualization is an art as well as a science, so when we see a data visualization we are seeing the writer’s perspective of the data.  What can we track?  Murder rates, laws, friend connections, gun sales, registrations, campaign donations, and crimes to name a few.  Will visualizing this data highlight core issues and help us understand what we can do, or what we shouldn’t do?  Are writers’ representations of data obscuring other insights?  Are writers primarily attempting to appeal to our heartbreak over gun violence tragedies?

 

USGunTraffic

Some visualizations empathize with the human loss in gun violence, while other infographics and interactive visualizations look to shed light on gun data; “NRA influence,” “officers down,” and “guns across state lines” are signature works of the Washington Post. The message of “guns across state lines” is stated plainly across the top as ”… states with strong gun laws import from states with weaker laws.” The static presentation of the data certainly highlights this fact, using the classic red=bad, green=good, implying that good states are receiving guns from bad states. Viewing the data dynamically by selecting different states and observing where guns are coming from and going to highlights another factor in gun import/export: a state's physical proximity is also important.

 

USStateGunLaws

What is the effect of the law on gun movement? New York has very strict gun laws compared to the rest of the country, and its primary gun suppliers are VA, GA, and FL, which all have almost no gun laws by comparison; this trend is also true of Illinois. By itself this would confirm the hypothesis; however, we can find a similar trend among states with very few regulations, and by contrast California supplies guns throughout the country and especially to neighboring states. We can more easily break apart the gun export/import at Trace the Guns, which shows that many states have significant amounts of both.

One thing is clear: states with stricter laws export less compared to their imports; perhaps it's easier to control supply than demand. Based strictly on this data, if the goal is to limit availability of guns for violent crimes, either all states would have to adopt strict gun laws, or interstate gun sales would have to stop. I believe we need better data, or we need to recognize different trends.

Recommended Reading List for “Teaching Machines to Read” Event

On March 14th, the Mid-Maryland Data Science group will be hosting a meetup on Teaching Machines to Read: Processing Text with NLP, featuring three NLP experts discussing their work. Since NLP is a complex topic that intersects many disciplines such as computer science, linguistics, and artificial intelligence, we wanted to provide a collection of resources that may help members become familiar with the field.

book

 

Fundamentally, natural language processing deals with the ability of computational systems to analyze and understand human language. Craig recommended several online courses about NLP.

Starting February 24th, Coursera is offering a 10 week long course on Natural Language Processing, taught by Michael Collins of Columbia University. There is also another planned course, Natural Language Processing, taught by several Stanford professors.

From the main Stanford website, the CS224N – Natural Language Processing course has many lectures and slides available. Craig also recommended the accompanying textbook, Foundations of Statistical Natural Language Processing.

For those interested in playing around with NLP themselves, the Weka project is a collection of machine learning algorithms for data mining tasks using Java.

Data Visualization: Graphics with GGPlot2


My Note: Interesting to see how this compares to Spotfire!

Basic plots in R using standard packages like lattice work for most situations where you want to see trends in small data sets, such as your simulation variables, which makes sense considering lattice began with Bell Labs' S language. However, when we need to summarize and communicate our work with those primarily interested in the “forest” perspective, we use tools like ggplot2. In other words, the difference between lattice and ggplot2 is the difference between understanding data versus drawing pictures.

You can learn all about ggplot2 by downloading the R package and reading, but even Hadley Wickham, author of ggplot2, thinks going through the R help documentation will “drive you crazy!” To alleviate stress, we've compiled references, examples, documentation, blogs, books, groups, and commentary from practitioners who use ggplot2 regularly. Enjoy.

GGplot2 is an actively maintained open-source chart-drawing library for R based upon the principles of the “Grammar of Graphics”, thus the “gg”. Grammar of Graphics was written for statisticians, computer scientists, geographers, research and applied scientists, and others interested in visualizing data. GGplot2 can be generalized as layers composed of: a data set, mappings and aesthetics (position, shape, size, color), statistical transforms, and scaling. To better wrap our minds around how this applies to ggplot2, we can take Hadley's tour, or attend one of his events. The overall goal is to automate graphical processes and put more resources at our fingertips; below are some great works from practitioners.
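To make the layer idea concrete, here is a minimal sketch on the built-in mtcars data: a data set with aesthetic mappings, a geometry, a statistical transform, and an explicit scale, each added with the + operator.

[code lang="R"]library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +  # data and aesthetic mappings
  geom_point(size = 3) +                                      # geometric layer
  stat_smooth(method = "lm", se = FALSE) +                    # statistical transform (per group)
  scale_colour_brewer(palette = "Set1") +                     # scaling
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")
[/code]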

London Bike Routes

PopularLondonBikeRoutes

The London bike routes image is built with three layers: building polygons, waterways and lakes, and bike routes.  The route data itself is a count of the number of bikes, as well as their position, featured as thickness and color intensity in yellow, which is a nice contrast to the black and grey of the city map.  I enjoy this dataviz because you can imagine yourself trying to get around on a bicycle in London.

Raman Spectroscopic Grading of Gliomas

SpectroscopicObservations

The background of this work is the classification of tumour tissues using their Raman-Spectra. A detailed discussion can be found in C. Beleites et al.  Gliomas are the most frequent brain tumours, and astrocytomas are their largest subgroup. These tumours are treated by surgery. However, the exact borders of the tumour are hardly visible. Thus the need for new tools that help the surgeon find the tumour border. A grading scheme is given by the World Health Organization (WHO).

TwitteR Package

twitter-ggplot

Curious about your influence on twitter?  Want to see how your messages resonate within and outside your network?  Here is a great website that goes through many examples on using the TwitteR package in R, with the following ggplot2 code that creates the chart on our right-hand-side:

[code lang="R"]library(ggplot2)
# Bar chart of retweet sources (df$rt comes from the TwitteR tutorial's data frame);
# rotate and shrink the x-axis labels so the screen names stay legible
ggplot() +
  geom_bar(aes(x = na.omit(df$rt))) +
  theme(axis.text.x = element_text(angle = -90, size = 6)) +
  xlab(NULL)
[/code]

The ggplot2 interface is interesting because you’re using the + operator, thus manifesting the Grammar of Graphics concept of layers.

Sentencing Data for Local Courts

visualizingSentencingData-ggplot2

This final example, Sentencing Data for Local Courts, breaks the data down by the demographics of those committing different classes of crimes. As above, the R code is very simple and follows the layering paradigm:

[code lang="R"]# Age distribution of offenders by sex, with one panel per offence type
ggplot(iw, aes(AGE, fill = sex)) +
  geom_bar() +
  facet_wrap(~ Offence_type)
[/code]

Big Data Meta Review – The Evolution of Big Data as Told by Google


Every now and then a new paper is published that offers a step-change in the status quo, forever altering the community. As Google is one of the original big data companies, it is not surprising that many of these papers in the big data space have come from Google.

Below is a list of citations, including abstracts, of the Google papers that you should read to understand the last decade of big data. You can see Google lay the most basic foundation for big data in terms of a distributed file system (GFS) capable of handling truly big (but unstructured) data. They then provide a paradigm for processing that data at scale (MapReduce), easing the burden of developers and increasing productivity. Next comes a way to handle structured data at scale (BigTable) and then a system (Percolator) for incrementally updating their existing big data sets. Google then addresses some of the shortfalls of MapReduce by telling the world about Pregel, designed for large scale graph processing, and Dremel, designed to handle near instantaneous interrogation of web-scale data, further increasing end-user productivity. Finally, the big “G” publicizes Spanner, a database distributed not just across cores and machines in a data center, but across machines distributed across the globe.

Enjoy the abstracts (and for the more technically minded, click on over to the PDFs) and get a better sense not only of big data but, more importantly, the evolutionary trajectory that one of the giants in this rapidly growing field has taken.

The Google File System – October 2003

Authors: Sanjay GhemawatHoward Gobioff, and Shun-Tak Leung

Abstract:

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

MapReduce: Simplified Data Processing on Large Clusters - December 2004

Authors: Jeffrey Dean and Sanjay Ghemawat

Abstract:

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
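As a toy illustration of the map/reduce pattern itself (my sketch, nothing to do with Google's implementation), a word count can be expressed as a map step that emits a (word, 1) pair per word and a reduce step that sums the counts per key:

[code lang="R"]# Toy word count in the map/reduce style -- a single-machine sketch,
# not Google's implementation
docs <- c("the quick brown fox", "the lazy dog", "the quick dog")

# Map: split each document into words, emitting a (word, 1) pair per word
words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

# Reduce: group by key (the word) and sum the counts
counts <- tapply(rep(1, length(words)), words, sum)
counts

[/code]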

 

Bigtable: A Distributed Storage System for Structured Data - November 2006

Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

Abstract:

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

 

Large-scale Incremental Processing Using Distributed Transactions and Notifications - 2010

Authors: Daniel Peng and Frank Dabek

Abstract:

Updating an index of the web as documents are crawled requires continuously transforming a large repository of existing documents as new documents arrive. This task is one example of a class of data processing tasks that transform a large repository of data via small, independent mutations. These tasks lie in a gap between the capabilities of existing infrastructure. Databases do not meet the storage or throughput requirements of these tasks: Google’s indexing system stores tens of petabytes of data and processes billions of updates per day on thousands of machines. MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.

We have built Percolator, a system for incrementally processing updates to a large data set, and deployed it to create the Google web search index. By replacing a batch-based indexing system with an indexing system based on incremental processing using Percolator, we process the same number of documents per day, while reducing the average age of documents in Google search results by 50%.

 

Pregel: a system for large-scale graph processing – June 2010

Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski

Abstract:

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs – in some cases billions of vertices, trillions of edges – poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

Dremel: Interactive Analysis of Web-Scale Datasets - 2010

Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis

Abstract:

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand node instances of the system.

Spanner: Google’s Globally-Distributed Database - October 2012

Authors: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford

Abstract:

Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.

Weekly Round-Up: Long Data, Data Storytelling, Operational Intelligence, and Visualizing Inaugural Speeches

Welcome back to the round-up, an overview of the most interesting data science, statistics, and analytics articles of the past week. This week, we have four fascinating articles on topics ranging from long data and data storytelling to visualizing presidential inauguration speeches.

In this week’s round-up:

  • Forget Big Data, Think Long Data
  • Telling a Story with Data
  • Is Operational Intelligence the End Game for Big Data?
  • US Presidents Inaugural Speech Text Network Analysis

Forget Big Data, Think Long Data

While Big Data is all the rage these days, this Wired article trumpets the merits of Long Data – data sets that have massive historical sweeps. It points out that Big Data is usually from the present or from the recent past and so you often don’t get the same perspective as you do from data sets that span very long timelines. These data sets let you observe how events unfold over time, which can provide valuable insights. The article goes on to describe more differences between Big and Long Data and cites examples of some of the ways Long Data is used today.

Telling a Story with Data

Deloitte University Press published an interesting post this week about how to tell a story with data. The post argues that unless decision-makers understand the data and its implications, they may not change their behavior and adopt analytical approaches while making decisions. This is where data storytelling – the art of communicating the insights that can be drawn from the data – comes in. The post goes on to describe some good and bad examples of this and also provides some useful guidelines for it.

Is Operational Intelligence the End Game for Big Data?

This is a post on the Inside Analysis blog that talks about how Business Intelligence is beginning to be taken to the next level, and how that level is Operational Intelligence. With the advancement of data science and big data technologies, organizations are starting to be able to take a deeper look into their data, draw insights that weren’t visible previously, and start using predictive analytics to forecast more accurately. The post goes on to talk about Operational Intelligence and how these new insights can be transferred to a user or system that can make the appropriate business decisions and take the required actions.

US Presidents Inaugural Speech Text Network Analysis

This is a post from Nodus Labs showing off some of the interesting work they’ve done creating network visualizations out of US presidential inaugural addresses. The post describes their methodology, includes a video explaining the networks, and they even embedded some examples that you can play around with and explore!

That’s it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there’s something we missed, feel free to let us know in the comments below.

Data Visualization: Drawing us in

My Note: Again, I need to look at the data behind these visualizations.

Do we decide to focus on something, or are we drawn to focus on something?  These ideas and others are what we will begin to explore in our new Data Visualization DC meetup.  Personally, I believe we can’t help but be drawn in by good presentation, but if what’s presented isn’t compelling then we quickly lose interest.  The many nuanced layers of the interactive visualizations below draw us in, mimic the physical world, and fulfill our expectation of well segmented classes of information.  The layers of information are contrasted so well that when we want to see those relationships independently, there is a convenient filter built into the visualization.  These visualizations use cartoons to smooth out unimportant detail, and emphasize the artist’s intent.  I love playing with these interactive visualizations and breaking them down.  I hope all learning becomes this intuitive.

Zeus’s Affairs

Zeus's Affairs

I remember studying Greek mythology in school, and there were hints that Zeus was naughty, but NEVER did I come across the extent of Zeus’s “Divine Influence.” Rather than chapters in a textbook on each relationship, we can explore by individual Titan, God, and Mortal, but most importantly by which writer/historian captures the details of the relationships (we’re short on eyewitnesses).  This allows us to learn by whatever strikes our fancy at the moment, which is a far more effective way to retain information.  As far as I’m concerned, this is how textbooks should be written.

Immigration Insanity

immigration-flowchart-cartoon

We hear about immigration all the time in the news, and many of us know someone going through the process.  We know the process is difficult, but its opacity has never been accepted as a legitimate excuse, and we expect ourselves to just figure it out.  Well, stop blaming yourself, because now that we have the process mapped out for us, it's very clear that even the most diligent of people can easily get lost in the immigration maze.

How did it get this way?  Over Data Drinks you might find me discussing “legal creep,” similar to “requirements creep” in engineering, where many small changes result in a very large change overall, and usually one that never would have been approved.  Whatever the process, now that we can see it clearly, it begs for empirical evidence, which this New York Times Interactive Visualization elegantly provides.  Now, if we could get the immigration process map for each year going back to 1880 and compare it against the empirical results, we could understand the effectiveness of immigration legislation.  Perhaps I'm dreaming again.

#WorkingInAmerica

American-Jobs

A bar graph is a bar graph however we create the bar, right?  Well, I can't help but think about the people in each jobs sector when looking at this interactive jobs visualization, and it has to do with the little cartoon people constituting each bar.  Details like this, and the fact that it tracks its hashtag #workinginamerica in real time across the bottom, are key concepts behind a great interactive visualization.  Rather than spending all of our time remembering the raw facts, the visualization encourages us to ask questions such as, “why are education and health grouped,” “why is hospitality outpacing manufacturing,” and “what happened between 1971 and 1991 that raised employment of ages 16-34 above 35-54?”  The answer to the last question is likely Baby Boomers, but isn't it fun to confirm our suspicions?

The Rise of Data Products

My Note: Again I like this slide format and evolution of the definition of data products.

I had the great opportunity to present at the kick-off event for the Mid-Maryland Data Science Meetup on “The Rise of Data Products”. Below is the talk captured in images and text.

Update: You can also download the audio here and follow along.

 

Slide01

Tonight’s talk is focused on capturing what I see as a new (or continuing) Gold Rush and could not be more excited about.

 

Before we can talk about the Rise of Data products, we need to define a Data Product.  Hilary Mason provides the following definition: “a data product is a product that is based on the combination of data and algorithms.”  To flesh this definition out a bit, here are some examples.

1) LinkedIn has a well known data science team and highlighted below is one such data product – a vanity metric indicating how many times you have appeared in searches and how many times people have viewed your profile. While some may argue that this is more of a data feature than product, I am sure it drives revenue as you have to pay to find out who is viewing your profile.

Slide04

2) Google’s search is the consummate data product. Take one part giant web index (the data), add the PageRank algorithm (the algorithms), and you have a ubiquitous data product.

Slide05

3) Last, but not least, is Hipmunk. This company allows users to search flight data and visualize the results in an easy-to-understand fashion. Additionally, Hipmunk attempts to quantify the pain entailed by different flights (those 3 layovers add up) into an “agony” metric.

 

Slide06

So let’s try a slightly different definition – a data product is the combination of data and algorithms that creates value–social, financial, and/or environmental in nature–for one or more individuals.

Slide07

One can argue that data products have been around for some time and I would completely agree. However, the point of this talk is why are they exploding now?

Slide08

 I would argue that it is all about supply and demand. And, for this brief 15 minute talk (a distillation of a much longer talk), I am going to constrain the data product supply issue to the availability and cost of the tools required to explore data and the infrastructure required to deliver data products. On the demand side, I am going to do a “proof by example,” complete with much arm waving, to show that today’s mass market consumers want data.

Slide09

On the demand side, let’s start with something humans have been doing ever since they came down from the trees: running.

With a small sensor embedded in the shoe (not the only way these days), Nike+ collects detailed information about runners and simply cannot give enough data back to its customers. In terms of this specific success as evidence of general data product demand, Nike+ users have logged over 2 billion miles as of 1/29/2013.

Slide10

As further evidence of mass market data desire, 23andMe has convinced nearly a quarter million people to spit into a little plastic cup, seal it up, mail it off, and get their DNA sequenced. 23andMe then gives the data back to the user in the form of a genetic profile, complete with relative genetic disease risks and clear, detailed explanations of those numbers.

Slide11

And finally there is Google Maps, or GPS navigation in general, which merges complex GIS data with sophisticated algorithms to compute optimal routing and estimated time of arrival. Who doesn’t use this data product?

Slide12

In closing, the case for overwhelming data product demand is strong ::insert waving arms::: and made stronger by the fact that our very language has become sprinkled with quasi stat/math terms.  Who would ever think that pre-teens would talk about something trending?

Slide13

Let’s talk about the supply side of the equation now, starting with the tools required to explore data.

Slide14

Then: Everyone’s “favorite” old-school tool, Excel, costs a few hundred dollars depending on many factors.

Now: Google Docs has a spreadsheet where 100 of your closest friends can simultaneously edit your data while you watch in real time.

And the cost, FREE.

Slide17

Let’s take a step past spreadsheets and rapidly prototype some custom algorithms using Matlab (Yes, some would do it in C but I would argue that most can do it faster in Matlab). The only problem here is that Matlab ain’t cheap. Beware when a login is required to get even individual license pricing.

Now, you have Python and a million different modules to support your data diving and scientific needs. Or, for the really adventurous, you can jump to the very forward looking, wickedly-fast, big-data ready, Julia. If a scientific/numeric programming language can be sexy, it would be Julia.

And the cost, FREE.

Slide20

Let’s just say you want to work with data frames and run some hardcore statistical analyses. For a number of years, you have had SAS, Stata, and SPSS, but these tools come at an extremely high cost. Now, you have R. And it’s FREE.

Slide23

Yes, an amazing set of robust and flexible tools for exploring data and prototyping data products can now be had for the low, low price of free, which is a radical departure from the days of yore.

Now that you have an amazing result from using your free tools, it is time to tell the world.

Back in the day (think Copernicus and Galileo), you would write a letter containing your amazing results (your data product), which would then take a few months to reach a colleague (your market). This was not a scalable infrastructure.

Slide24

Contemporary researchers push their findings out through the twisted world of peer-reviewed publications … where the content producers (researchers) often have to pay to get published while someone else makes money off of the work. Curious. More troubling is the fact that these articles are static.

Now, if you want to reach a global audience, you can pick up a CMS like WordPress or a web framework such as Rails or Django and build an interactive application.  Oh yeah, these tools are free.

Slide27

So the tools are free and now the question of infrastructure must be addressed. And before we hit infrastructure, I need to at least mention that over used buzz word, “big data.”

In terms of data products, “big data” is interesting for at least the simple reason that having more data increases the odds of having something valuable to at least someone.

Think of it this way, if Google only indexed a handful of pages, “Google” would never have become the verb that it is today.

Slide28

If you noticed the pattern of tools getting cheaper, we see the exact same trend with data stores. Whether your choice is relational or NOSQL, big or little-data, you can have your pick for FREE.

Slide31 Slide34

With data stores available for the low cost of nothing, we need actual computers to run everything. Traditionally, one bought servers, which cost an arm and a leg, and don’t forget siting requirements and maintenance, among other costs. Now Amazon’s EC2 and Google Compute Engine allow you to spin up a cluster of 100 instances in a few minutes. Even better, with Heroku, sitting on top of Amazon, you can stand up any number of different data stores in minutes.

Slide37

 

Why should you be excited? Because the entire tool set and the infrastructure required to build and offer world-changing data products is now either free or incredibly low cost.

Let me put it another way. Imagine if Ford started giving away car factories, complete with all required car parts, to anyone with the time to make cars!!!!!

Slide38

Luckily, there are such individuals who will put this free factory to work. These “data scientists” understand the entire data science stack or pipeline. They can by themselves take raw data to a product ready to be consumed globally  (or at least make a pretty impressive prototype). While these individuals are relatively rare now, this state will change. Such an opportunity will draw a flood of individuals, and that rate will only increase as the tools become simpler to use.

Slide39

Let’s make the excitement a bit more personal and go back to that company with a lovable logo, Hipmunk.

If I remember the story correctly, two guys at the end of 2010 taught themselves Ruby On Rails and built what would become the Hipmunk we know and love today.

Two guys.

Three months.

Learned to Code.

And, by the way, Hipmunk has $20.2 million in funding 2 years later!

Slide40

It is a great time to work with data.

Slide41Slide42

 

D3.js Meta Tutorial

My Note: This is what I need for my D3.js work! See Luke Franci quote.

D3.js, the follow-up to Mike Bostock’s impressive and useful Protovis library, is a fantastic tool for building web-based, dynamic data visualizations consumable by the masses. However, for anyone not familiar with JavaScript, jQuery, SVG, CSS, and/or HTML, getting anything up and running can be a challenge. While there are tutorials on the GitHub page, these docs might assume a higher level of knowledge than many who are interested in visualizing some data actually have. Below are some great tutorials and articles about working with D3.

d3

 

Mere Mortals
This is simply one of the best and most detailed introductions to D3 that I have seen. Seriously, if you want to have a chance of building your own custom visualizations and not just hacking at someone else’s prebuilt example, read this article!

Luke Franci puts it best: “D3 has a steep learning curve, especially if (like me) you are not used to the pixel-precision of graphics programming. To build a visualization with D3, you need to understand JavaScript objects, functions, and the method-chaining paradigm of jQuery; the basics of SVG and CSS; D3's API; and the principles for designing effective infographics.”

D3 Tutorial Set
Scott Murray offers up a great set of tutorials that walk you through a lot of the basics.

Data for D3
As Jerome Cukier says, “[p]utting your data in the right form is crucial to have concise code that runs fast and is easy to read” and this article discusses and shows different ways you can format data for use with D3. A highly recommended read!

Setting up your Development Environment
This is another excellent tutorial by Jerome Cukier focused on getting the development environment up and running (a crucial and often frustrating step).

D3 Workshop Slides by Mike Bostock
The following set of slides presents a “large swath of introductory material” that was initially given at the VIZBI 2012 conference on March 5th over 3 hours.

D3: Data-Driven Documents
This paper is the paper for citing D3 and describes a lot of the design decisions made and the rationale behind them.

Data Competition Announcement – SBP 2013 Big-Data Challenge

SBP 2013 Big-Data Challenge

International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction (SBP 2013)
Washington, DC, April 2-5, 2013
 
SBP is pleased to offer a challenge problem again in 2013, and we expect it to build on the successes of last year’s inaugural challenge. The theme of the SBP 2013 Challenge is how to better demonstrate the scientific and business value of “big data.”

Challenge Overview

Because of their pervasiveness, sensing capabilities, and computational power, cell phones afford a convenient platform to advance the understanding of social dynamics and influence. However, lack of data in the public domain and the interdisciplinary nature of conducting social science research with mobile phones continue to limit the role that phones play in social research and mobile commerce backed by a solid social-science foundation.

In an attempt to address these challenges, we are working with the MIT Human Dynamics laboratory to release several mobile datasets on “Reality Commons” that contain the dynamics of several communities of about 100 people each.

Challenge Problem

We invite researchers to either 1) propose and submit their own applications to demonstrate the scientific and business values of these data sets, 2) suggest how to meaningfully extend these experiments to larger populations, or 3) develop math that fits agent-based models or systems dynamics models to larger populations. The problem itself will be open-ended, and we encourage approaches from different disciplines that encompass a range of applications using the data.

Submissions and Finalists

Submissions (of approximately 3,000 to 4,000 words) will be evaluated based on theoretical grounding and experimental evidence. Winners will be selected by an interdisciplinary committee of researchers and recognized at the conference with 1st-, 2nd-, and 3rd-place prizes. The winners will also introduce their ideas briefly at the conference and/or give a quick demo. We hope to collect the submissions into a publication.

Important Dates

* Submission deadline: January 31, 2013 (23:59 PST)

* Notification of Winners: March 2, 2013

2012 challenge problem winners

* Matthew Lease, School of Information, University of Texas at Austin “Discovering and Navigating Memes in Social Media”

* Masoud Makrehchi, Research Scientist, Thomson Reuters, Toronto “Conflict Thermometer: Predicting Social Conflicts by Analyzing Language Gap in Polarized Social Media”

For details, please contact Nitin Agarwal at nxagarwal@ualr.edu, or Wen Dong at wdong@media.mit.edu.

Reality Commons

brought to you by the MIT Human Dynamics Lab

My Note: Data sets.

DATASETS

Badge Dataset

Friends and Family

Reality Mining

Social Evolution

Cell phones have become an important platform for the understanding of social dynamics and influence, because of their pervasiveness, sensing capabilities, and computational power. Many applications have emerged in recent years in mobile health, mobile banking, location based services, media democracy, and social movements. With these new capabilities, we can potentially be able to identify exact points and times of infection for diseases, determine who most influences us to gain weight or become healthier, know exactly how information flows among employees and productivity emerges in our work spaces, and understand how rumors spread.

There remain, however, significant challenges to making mobile phones the essential tool for conducting social science research and also supporting mobile commerce with a solid social science foundation. Perhaps the greatest challenge is the lack of data in the public domain, data large and extensive enough to capture the disparate facets of human behavior and interactions. Another major challenge lies in the interdisciplinary nature of conducting social science research with mobile phones. Software engineers need to work collaboratively alongside social scientists and data miners in various fields.

In an attempt to address these challenges, we release several mobile data sets here in "Reality Commons" that contain the dynamics of several communities of about 100 people each. We invite researchers to propose and submit their own applications of the data to demonstrate the scientific and business values of these data sets, suggest how to meaningfully extend these experiments to larger populations, and develop the math that fits agent-based models or systems dynamics models to larger populations.

These data sets were collected with tools developed in the MIT Human Dynamics Lab and are now available as open source projects (see the funf open-source sensing platform for Android phones, http://funf.media.mit.edu) or at cost (e.g., the sociometric badges for sensing organizational behavior, see http://sociometricsolutions.com).

Data Visualization: How we get it done!

Knowing what’s possible in data visualization allows us to stay ‘grounded’ in what we suggest to clients; cocktail parties are another story.  It’s a lot of fun to imagine that you’re going to create a visually stunning interactive chart that is ‘self-evident’; it’s another thing to know you need to use D3.js on Google App Engine (GAE) powered by Renjin with rapidly developed R algorithms.  The ‘how to’ of data visualization is not limited to coding languages and libraries; it is also about where to deploy and what services already exist that allow you to focus on your client’s needs and bring them the value they appreciate.  This week we introduce some approaches to realistically wading into data visualization.

Data Animators Bring Data to Life

 13_01_18 VectorWise Demo

Ultimately everyone ends up working with people, and to understand what we can expect from our peers and colleagues we give them titles and positions, such as “Data Scientist”.  In this article on visualizing Big Data, I came across the use of “Data Animator,” which is understandable in this context as the team was expecting interactive graphics from their data visualizations.  What was of particular interest to me was how the data scientists took what we may typically think of as a visualization and, by making the exploration fast and easy, introduced animation to their charts in much the same way that a flip book makes a movie out of a set of pictures.  This article goes into detail on how the right analysis platform can help you “tell a story” about a particular dataset, thereby animating what was once brutally intensive analysis and presentation.

Transitioning Charts on Identical Data


13_01_18 D3js Example

 

I often go for the flashy visualizations, where at first glance the viewer can’t help but say ‘Wow’, but eventually we are all hit with the reality of building these visualizations.  This article goes through the D3.js and CoffeeScript code used to create a demo that offers three different perspectives on a single dataset.  This is important as different people think about the data in different ways, and allowing them to transition easily, either on their own or as part of a presentation, puts the data in the right perspective.

Visualizing Your System Files

13_01_18 HTML5 System File Visualizer

 

In conversation I often find that we try to impress each other, discussing clients we’re landing or courting and the size of their data, but it can be just as compelling to apply great tools closer to home.  We all know about tree graphs, force graphs, layered pie charts, etc., but, much as it took centuries before we put wheels on the bottom of our luggage (saving countless sore arms and backs), why haven’t we applied some of these visualization tools to our own system files?  Lou Montulli, a founding engineer at Netscape, recently applied his knowledge of web browsers to his file system to determine how his disk space was being allocated.  The data is presented in a variety of formats, the exploration is smooth, and the depth of precision allows you to start wherever you’re comfortable.  Now I have to decide what to do with all those old pictures on my hard drive…

The DC Data Source Weekly: Week 1 – Data for Testing Recommendation Algorithms

Data Community DC is excited to bring you another weekly blog post, the DC Data Source Weekly, to complement the excellent Data Visualization by Sean Gonzalez and the Weekly Roundup of Top Data Stories by Tony Ojeda.

The DC Data Source Weekly will spend precious few words overviewing and directing our readers to fascinating sources of free data. Even better, these data sources will often be themed with upcoming or previous Data Science DC, Data Business DC, and/or R Users DC meetup events.

To kick off, we will look at sample data sets for testing out recommendation systems to align with January 28th’s meetup: Recommendation Systems in the Real World.

Jester

This dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users; it differentiates itself from other datasets by having a much smaller number of rateable items.

Book-Crossing Dataset

This dataset is from the Book-Crossing community, and contains 278,858 users providing 1,149,780 ratings about 271,379 books.

Yandex

The Relevance Prediction Challenge provides a unique opportunity to consolidate and scrutinize the work from industrial labs on predicting the relevance of URLs using user search behavior. It provides a fully anonymized dataset shared by Yandex which has clicks and relevance judgements. Predicting relevance based on clicks is difficult, and is not a solved problem. This Challenge and the shared dataset will enable a whole new set of researchers to conduct such experiments.

The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. Unlike previous click datasets, it also includes relevance judgments for the ranked URLs, for the purposes of training relevance prediction models. To allay privacy concerns the user data is fully anonymized. So, only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by sessions and no user IDs are provided. The dataset consists of several parts.

 

hetrec2011-movielens-2k

This is an extension of the MovieLens10M dataset, which contains personal ratings and tags about movies. From the original dataset, only those users with both ratings and tags have been maintained.

In the dataset, the movies are linked to the Internet Movie Database (IMDb) and RottenTomatoes (RT) movie review systems. Each movie has its IMDb and RT identifiers, English and Spanish titles, picture URLs, genres, directors, actors (ordered by “popularity”), RT audience and expert ratings and scores, countries, and filming locations.

http://ir.ii.uam.es/hetrec2011/datas...ous/readme.txt

hetrec2011-delicious-2k

This dataset has been obtained from the Delicious social bookmarking system. Its users are interconnected in a social network generated from Delicious “mutual fan” relations. Each user has bookmarks, tag assignments, i.e. tuples [user, tag, bookmark], and contact relations within the dataset's social network. Each bookmark has a title and URL.

hetrec2011-lastfm-2k

This dataset has been obtained from the Last.fm online music system. Its users are interconnected in a social network generated from Last.fm “friend” relations. Each user has a list of most listened music artists, tag assignments, i.e. tuples [user, tag, artist], and friend relations within the dataset's social network. Each artist has a Last.fm URL and a picture URL.

A thank you goes out to Bryce Nyeggen for suggesting several of the above data sources.

R Users DC Meetup: “Analyze US Government Survey Data with R”


We are excited to bring to you a review written by guest columnist Jerzy Wieczorek, U.S. Census Bureau statistician, “working primarily in small area estimation and Bayesian statistics, and using statistics in humanitarian and volunteer work.”

 

At this past Thursday’s R Users DC Meetup, Anthony Damico presented a rousing encouragement of the use of R and introduced us to several excellent resources:

  • "Analyze Survey Data for Free," his blog with instructions and “obsessively commented” example R scripts for downloading and analyzing publicly-available US government datasets… including the use of MonetDB, an open-source database management system that helps R handle huge datasets
  • twotorials.com, his collection of 2-minute-long targeted tutorial videos for learning R (there are over 90 so far)
  • SAScii, his R package which can automatically process a flat data file’s SAS input statements and read the data directly into R, thus bypassing the need for manual conversion or running SAS (a minimal usage sketch follows below).
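If I have the package’s interface right, usage looks roughly like the following; the file names below are hypothetical placeholders rather than anything from Anthony’s scripts:

[code lang="R"]# Hedged sketch of reading a fixed-width survey file with SAScii;
# both file paths are hypothetical placeholders
library(SAScii)

dat <- read.SAScii(
  fn     = "survey_data.dat",      # the raw fixed-width data file
  sas_ri = "survey_layout.sas"     # the SAS INPUT statements describing its columns
)
head(dat)

[/code]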

For more details, including Anthony’s data analysis flowchart and his advice on R vs. its competitors, see the full review over at Civil Statistician.

You can also download the audio of Anthony’s presentation here or watch a video of it here.

Data Visualization: Biology & Medicine

Welcome back to our weekly installment of visualizations in data science!  This week we have a lineup exploring biological systems, and ultimately how great visualizations are furthering medical practice.  These visualizations are designed with a select audience in mind and not necessarily the general public, but they are a thrilling way to discover what previous formats had effectively rendered unavailable.

Tablets in Science

iPathCaseKEGG

KEGG is an online and integrated molecular database designed to support our interests in metabolism by linking sequenced genomes to higher level systematic functions.  Having the data is nice, but unless molecular biology is your profession the KEGG database was only proof that others were interested in the subject.  With iPathCaseKEGG, an iOS App designed to visually explore the KEGG database, anyone with a passing curiosity can learn quickly, and there are a plethora of similar toys when you’re ready for a new subject.

Exploring Gene Ontology

Figure S3

There was a time in history when a treatise could be written on a given subject: a single person could relate their work to all other corresponding works in the world in a single book.  Today, what we relate to depends on what we focus on and how we visualize the data, leading to new emphasis and new renderings.  In this data mining approach to biological networks, the author explores various hypotheses by emphasizing different data using similar data visualizations, and each time discovers a unique aggregate in the data.  What we know depends on how we explore.

Visualizing Brain Surgery

DESI - Brain Surgery Imagery

There has always been a give and take between gathering passive or minimally invasive diagnostic information about a patient and using the operation itself to make final decisions.  As was the case with my own ankle surgery, the precision of X-rays and MRIs is not infinite and only offers some confidence that surgery will be advantageous.  At least with an ankle, once surgery begins, it is clear what is healthy bone, versus post-traumatic growth, versus scar tissue; thus, the decision to operate is relatively easy.  However, in brain surgery, cancerous cells are not as obvious, and variation in judgement can mean serious loss of motor, sensory, or cognitive ability.  This article explores the work of Purdue professor R. Graham Cooks to visualize the borders of cancerous brain tumors using charged solvents during surgery.

Weekly Round-Up: Getting Started with Data Science, Interview Questions, GE’s Big Data Plans, Lean Startup Experiment Data, and Brain Visualizations

In this first week of 2013, we found a variety of interesting data science articles ranging in topics from some ways to get started in data science and some interview questions you can expect if you land an interview to some future applications for machine-generated data.

In this week’s round-up:

  • Software engineer’s guide to getting started with data science
  • Interview Questions for Data Scientists
  • General Electric lays out big plans for big data
  • Early Evidence Is Often Too Early And Not Really Evidence
  • Scientists construct first map of how the brain organizes everything we see

Software engineer’s guide to getting started with data science

This is a great guide submitted to R-bloggers describing the process and resources the author used to learn the basics of data science. It references the training materials he used for self learning, which classes he took, which books he read, what he found useful, and what didn’t work so well for him. This article is great for those who are interested in getting started in data science as well as those who have learned the basics and are looking for what they should learn next.

Interview Questions for Data Scientists

This is a post by Hilary Mason, Chief Scientist at bitly, about some of the questions she asks data scientist candidates during job interviews. The questions she highlights in this post are the non-technical questions that give her a sense of the applicant’s personality, how passionate they are about the work they’ve done, and how their brain works. For those job-seekers out there, you should be prepared to answer these types of questions in addition to the usual technical ones.

General Electric lays out big plans for big data

This InfoWorld article describes some of the plans GE has announced for combining sensors, data, analytics, and the cloud in a new initiative they call the Industrial Internet. This means making a lot of the processes, products, and solutions across their many lines of business more intelligent. The article also includes a link to a keynote speech from GE CEO Jeff Immelt about the initiative, which can be viewed here.

Early Evidence Is Often Too Early And Not Really Evidence

In this OnStartups article, Dharmesh Shah from HubSpot warns us about the quality of the data that is captured during experiments conducted by startups. Conducting experiments to validate your assumptions, getting customer feedback, and learning from that feedback is crucial to the success of a startup and is a key component of the Lean Startup methodology, but Shah warns that you should make sure you have enough data and that the data is sufficiently clean before making decisions. The article goes into some more detail about this and provides a couple really good examples.

Scientists construct map of how the brain organizes everything we see

My Note: See Below

This is an interesting project out of UC Berkeley, where scientists have constructed a map of how the brain processes and organizes things that we see. The article includes a video explaining how they captured the data necessary to do this, what they have created from it, and an overview of how they created it. It also includes a link to the awesome visualization that resulted so that you can play around with it yourself.

That’s it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there’s something we missed, feel free to let us know in the comments below.

Scientists construct first map of how the brain organizes everything we see

BERKELEY —

Source: http://newscenter.berkeley.edu/2012/12/19/semanticspace/

Our eyes may be our window to the world, but how do we make sense of the thousands of images that flood our retinas each day? Scientists at the University of California, Berkeley, have found that the brain is wired to put in order all the categories of objects and actions that we see. They have created the first interactive map of how the brain organizes these groupings.

Alex Huth explains the science of how the brain organizes visual categories. (Alex Huth video)

The result — achieved through computational models of brain imaging data collected while the subjects watched hours of movie clips — is what researchers call “a continuous semantic space.”

Some relationships between categories make sense (humans and animals share the same “semantic neighborhood”) while others (hallways and buckets) are less obvious. The researchers found that different people share a similar semantic layout. 

“Our methods open a door that will quickly lead to a more complete and detailed understanding of how the brain is organized. Already, our online brain viewer appears to provide the most detailed look ever at the visual function and organization of a single human brain,” said Alexander Huth, a doctoral student in neuroscience at UC Berkeley and lead author of the study published today (Wednesday, Dec. 19) in the journal Neuron.

A clearer understanding of how the brain organizes visual input can help with the medical diagnosis and treatment of brain disorders. These findings may also be used to create brain-machine interfaces, particularly for facial and other image recognition systems. Among other things, they could improve a grocery store self-checkout system’s ability to recognize different kinds of merchandise.

“Our discovery suggests that brain scans could soon be used to label an image that someone is seeing, and may also help teach computers how to better recognize images,” said Huth, who has produced a video and interactive website to explain the science of what the researchers found.

It has long been thought that each category of object or action humans see — people, animals, vehicles, household appliances and movements — is represented in a separate region of the visual cortex. In this latest study, UC Berkeley researchers found that these categories are actually represented in highly organized, overlapping maps that cover as much as 20 percent of the brain, including the somatosensory and frontal cortices.

Maps show how different categories of living and non-living objects that we see are related to one another in the brain’s “semantic space.”

To conduct the experiment, the brain activity of five researchers was recorded via functional Magnetic Resonance Imaging (fMRI) as they each watched two hours of movie clips. The brain scans simultaneously measured blood flow in thousands of locations across the brain.

Researchers then used regularized linear regression analysis, which finds correlations in data, to build a model showing how each of the roughly 30,000 locations in the cortex responded to each of the 1,700 categories of objects and actions seen in the movie clips. Next, they used principal components analysis, a statistical method that can summarize large data sets, to find the “semantic space” that was common to all the study subjects.
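My Note: As a rough illustration of the PCA step only (not the researchers’ actual pipeline or data), R’s built-in prcomp summarizes a wide response matrix into a handful of components:

[code lang="R"]# Illustrative only: PCA on a made-up voxel-by-category response matrix,
# not the study's data or code
set.seed(1)
responses <- matrix(rnorm(300 * 20), nrow = 300, ncol = 20)  # 300 voxels x 20 categories

pca <- prcomp(responses, center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance captured by each component
head(pca$x[, 1:4])  # voxel coordinates in the reduced "semantic space"

[/code]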

The results are presented in multicolored, multidimensional maps showing the more than 1,700 visual categories and their relationships to one another. Categories that activate the same brain areas have similar colors. For example, humans are green, animals are yellow, vehicles are pink and violet and buildings are blue. For more details about the experiment, watch the video above.

“Using the semantic space as a visualization tool, we immediately saw that categories are represented in these incredibly intricate maps that cover much more of the brain than we expected,” Huth said.

Other co-authors of the study are UC Berkeley neuroscientists Shinji Nishimoto, An T. Vu and Jack Gallant.

Data Visualization: 2013 Week #1

For the first week of the New Year, Data Community DC is expanding into a natural extension of its current outlook: Data Visualization.  As has been our style, we leverage our network to bring you the most practical and interesting news and points of view.  In this article, we focus on two important aspects of Data Visualization: gathering data and effective visualization techniques.

A Simple Mashup

As Data Scientists we often want to use every tool in our arsenal, possibly because we believe it will help or we want to live on the cutting edge, but we forget that people are the best pattern matchers and all we need to tease out useful insights is data juxtaposition.  Here the Trauma Professional’s Blog creates a mashup of the Fatal Accident Reporting System (FARS), the Census of Fatal Occupational Injuries (CFOI), and other hidden government databases, allowing our natural pattern recognition to find what’s most relevant to us.

Mapping our Thoughts

On the other hand, it is simply fun to go off the deep end and use every tool in the box!  This fall, Berkeley researchers measured blood flow in video watchers’ brains and combined it with PCA analysis to create a “continuous semantic map” of how we organize the world we see around us.  The stunning interactive map relates position in the brain with its influence in the organizational process, and much more.  While the raw data is all available in one place, as opposed to hidden government databases, its inter-relationships are not self-evident.  This is a fun visualization because we can explore the raw data versus the analysis results, allowing us to reconcile our own understandings of brain function.

UI Design Evolution

If you’ve ever wandered up to the cockpit of an airplane, you may have gotten that intimidating feeling as you imagined how to synthesize all that information in real time.  This intimidating feeling is exactly what F-22 Raptor engineers aimed to eliminate.  This article by AssertTrue() breaks down the thinking behind cockpit design over time; but, as data scientists, we know that there is a crucial relationship between automation (algorithms, AI, etc.) and action (pilot/user input).  Too much automation and vital information is unavailable for critical decisions, while information overload may force attention to non-critical issues.  We can abstract these concepts for our own UI challenges.  In organizations, people are pilots, and what information is presented to them from around the organization greatly affects what decisions they make on its behalf.  Action = f(Awareness)

Written by: Sean M. Gonzalez

The (near) Future of Data Analysis – A Review

My Note: This is very similar to what I have been doing with slides, but without using code for the analytics!

The author co-organizes Data Business DC, among many other things.

Hadley Wickham, having just taught workshops in DC for RStudio, shared with the DC R Meetup his view on the future, or at least the near future, of data analysis. Herein lie my notes for this talk, spiffed up into semi-comprehensible language. Please note that my thoughts, opinions, and biases have been split, combined, and applied to his. If you really only want to download the slides, scroll to the end of this article.

 

As another legal disclaimer, one of the best parts of this talk was Hadley’s commitment to making a statement, or, as he related, “I am going to say things that I don’t necessarily believe but I will say them strongly.”

You will also note that Hadley’s slides are striking even at 300px wide … evidence that the fundamental concepts of data visualization and graphic design overlap considerably.

 

Data analysis is a heavily overloaded term with different meanings to different people.

However, there are three sets of tools for data analysis:

  1. Transform, which he equated to data munging or wrangling
  2. Visualization, which is useful for raising new questions but, as it requires eyes on each image, does not scale well; and
  3. Modeling,  which complements visualization and where you have made a question sufficiently precise that you can build a quantitative model. The downside to modeling is that it doesn’t let you find what you don’t expect.

 

Now, I have to admit I loved this one. Data analysis is “typing not clicking.” Not to disparage all of those Excel users out there but programming or coding (in open source languages) allows one to automate processes, make analyses reproducible, and even help communicate your thoughts and results, even to “future you.”  You can also throw your code on Stack Overflow for help or to help others.

Hadley also described data analysis as much more cogitation time than CPU execution time. One should spend more time thinking about things than actually doing them.  However, as data sets scale, this balance may shift a bit … or one could argue the longer it takes to run your analysis, the more thought you should put into the idea and code before it runs for days as the penalty for doing the wrong analysis or an incorrect analysis grows. Luckily, we aren’t quite back to the days of the punchcard.

Above is a nice way of looking at some of the possible data analysis tool sets for different types of individuals. To put this into the vernacular of the data scientist survey that Harlan, Marck, and I put out, R+js/python would map well to the Data Researcher, R+sql+regex+xpath could map to the Data Creative, and R+java+scala+C/C++ could map to the Data Developer.  Ideally, one would be a polyglot and know languages that span these categorizations.

Who doesn’t love this quote? The future (of R and data analysis) is here in pockets of advanced practitioners. As ideas disperse through the community and the rest of the masses catch up, we push forward.

Communication is key …

but traditional tools fall short when communicating data science ideas, results, and methods. Thus, rmarkdown gets it done and can be quickly and easily translated into HTML.

Going one step further but still coupled to rmarkdown is a new service, RPubs, that allows one click publishing of rmarkdown to the web for free. Check it out …

If rmarkdown is the Microsoft Word of data science, then Slidify is comparable to PowerPoint (and it is free), allowing one to integrate text, code, and output powerfully and easily.

While these tools are great, they aren’t perfect.  We are not yet at a point where our local IDE has been seamlessly integrated into our code versioning system, our data versioning system, our environment and dependency versioning system, our publishing/broadcasting results generating system, or our collaboration systems.

Not there yet …

Amen.

Basically, Rcpp allows you to embed C++ code into your R code easily.  Why would someone want to do that? Because it allows you to easily circumvent the performance penalty of FOR loops in R; just write them in C++.
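For example (my own sketch, not from Hadley’s slides), Rcpp’s cppFunction compiles a C++ loop and exposes it to R as an ordinary function:

[code lang="R"]library(Rcpp)

# Compile a C++ loop and call it from R like any other function
# (a sketch for illustration, not taken from the talk)
cppFunction('
double sumC(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i) {
    total += x[i];
  }
  return total;
}')

sumC(runif(1e6))  # behaves like sum(), but the loop runs in compiled C++

[/code]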

On a personal rant, I don’t think mixing in additional languages is necessarily a good idea, especially C++.

Notice the units of microseconds.  There is always a trade off between the time spent optimizing your code and running your slow code.

Awesome name, full stop. Let’s take two great tastes, ggplot2 and D3.js, and put them together.

If you don’t know about ggplot2 or the Grammar of Graphics, click the links!

D3 allows one to make beautiful, animated, and even interactive data visualizations in javascript for the web. If you have seen some of the impressive interactive visualizations at the New York Times, you have seen D3 in action.  However, D3 has a quite steep learning curve, as it requires an understanding of CSS, HTML, and javascript.

As a comparison, what might take you a few lines of code to create in R + ggplot2 could take you a few hundred lines of code in D3.js.  Some middle ground is needed, allowing R to produce web-suitable, D3-esque graphics.

P.S. Just in case you were wondering, r2d3 does not yet exist. It is currently vaporware.

Enter shiny, which allows you to make web apps with R hidden as the back end, generating .pngs that are refreshed, potentially with an adjustable parameter input from the same web page. This doesn’t seem to be the Holy Grail everyone is looking for, but it is moving the conversation forward.
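A minimal sketch of what such an app looks like, using the single-file shiny API (which postdates this talk): a slider drives a plot that R re-renders on the server side.

[code lang="R"]library(shiny)

# Minimal sketch: a slider re-renders a scatterplot computed by R on the server side
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 1000, value = 100),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n), xlab = "x", ylab = "y")
  })
}

shinyApp(ui = ui, server = server)

[/code]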

One central theme was the idea that we want to say what we want and allow the software to figure out the best way to do it. We want a D3-type visualization, but we don’t want to learn 5 languages to do it. Also, this applies equally on the data analysis side, for data sets ranging over many orders of magnitude.

Another theme was that the output of the future is HTML5.  I did not know this, but RStudio is basically a web browser; everything is drawn using HTML5, js, and CSS.

Loved this slide because who doesn’t want to know?!

dplyr is an attempt at a grammar of data manipulation, abstracting the back end that crunches the data away from the description of what someone wants done (and no, SQL is not the solution to that problem).
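To make the idea concrete, here is a sketch in the dplyr that eventually shipped (the package was still in development at the time of the talk): the verbs describe what you want, and the back end decides how to compute it.

[code lang="R"]library(dplyr)

# Describe the result; the back end (data frame, database, etc.) handles the crunching
mtcars %>%
  filter(mpg > 20) %>%                          # keep the efficient cars
  group_by(cyl) %>%                             # one group per cylinder count
  summarise(n = n(), avg_hp = mean(hp)) %>%     # count and average horsepower per group
  arrange(desc(avg_hp))

[/code]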

And this concludes what was a fantastic talk about the (near) future of data analysis. If you’ve made it this far and still want to download Hadley’s full slide deck or Marck’s introductory talk, look no further:

Weekly Round-Up: Simpler Data Tools, The Future of Data, Open Data, Big Data Dangers, and NBA Missile Tracking Cameras

This week, we encountered several interesting data articles and blog posts we thought were worthwhile for the round-up. As this year comes to a close, we were reminded of how much noise is out there about data, having to sift through hundreds of “Year in Review” and “Big Data Predictions for 2013” articles that are currently being published. We chose to bypass those for this round-up in favor of focusing on longer-term trends, interesting ways data is being used, and important things to consider when handling data.

Happy New Year, everyone!

In this week’s round-up:

  • We don’t need more data scientists, just make big data easier to use
  • The Secret Life of Data in the Year 2020
  • Open data is not a panacea
  • The Danger Of Big Data
  • How Missile Tracking Cameras Are Remaking The NBA

We don’t need more data scientists, just make big data easier to use

This is an interesting GigaOM article in which Scott Brave of Baynote argues that in order to make big data available to the masses, what we need is not more data scientists but simpler tools so that the people who need the information to make decisions have the ability to access it themselves.

The Secret Life of Data in the Year 2020

In this article, Intel futurist Brian David Johnson explains his vision for the future uses of data in the coming years. Some of the things he touches on are smarter tracking of personal statistics, algorithms that talk to machines and other algorithms, important considerations that need to be kept in mind in order to design all these things, and some overarching concepts that touch on humanness, intelligence, and inter-connectivity.

Open data is not a panacea

This is a blog post by Cathy O’Neil on her blog, Mathbabe, containing some insights about open data, the way it’s generally defined, people’s reactions to it, and some pragmatic ways to think about the issues surrounding it.

The Danger of Big Data

In this article, Kerry Bodine from Forrester Research warns us about the dangers of relying too heavily on quantitative analysis and not enough on qualitative research. Kerry explains how she sees this more and more in her line of work and provides some effective ways to combine both the quantitative and the qualitative research to gain a better understanding of what we are looking at.

How Missile Tracking Cameras Are Remaking The NBA

This Fast Company article describes new ways basketball teams are using data (with the help of cameras that were formerly used for missile tracking). The NBA teams are using these cameras to help them track everything that happens in a game and use algorithms to identify patterns and trends. Check out the article for more details.

That’s it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there’s something we missed, feel free to let us know in the comments below.

Data Science DC Event Review: Political Campaign Data Science

Our December Data Science DC Meetup was about Data Science in Political Campaigns, and we had four experts come in to give short presentations about their work and then participate in a panel discussion. Here is an overview of each of the four presentations, including slides and audio where available. For those of you who couldn’t make it to the event, hopefully this gives you a good sense of what was presented, and for those who were there, hopefully it lets you relive the political data magic.

Political Campaign Data Science

Sasha Issenberg
First to present was Sasha Issenberg, a political columnist and author of the book The Victory Lab. Sasha laid the groundwork for the evening, speaking about what kinds of data political campaigns have available to them and the major strategies they employ. He talked about how information is used to model scenarios and how individuals will vote. The quantity of data available is impressive, as there are now hundreds of variables to model against voting data.

Sasha also spoke about some of the ways the data is modeled, such as forming subgroups from individual-level data to measure which of those subgroups are responding to your tactics, using the data to gain a better understanding of what would mobilize someone to vote or move them toward voting a certain way, and being able to divide the electorate into smaller groups and figuring out how to move them.

Download the audio of Sasha’s talk here


Shrayes Ramesh
Next up was Shrayes Ramesh, a PhD Candidate in the Department of Economics at the University of Maryland, College Park, and a Data Scientist and Senior Technical Instructor at EMC. Shrayes gave a presentation about the correlation between political monetary contributions and changes in the political climate, the difficulties encountered in modeling these things, and some solutions that can lead to better models. He described some of the problems, such as unobserved factors caused by lack of data, that can lead to overestimating the impact contributions have on vote share.

Shrayes then presented some potential solutions, the first of which involved finding variables correlated with contributions but uncorrelated with election outcomes. Another solution involved what he called exogenous exits, or looking at exits from political office that were not the result of losing elections (e.g., resignations, deaths, promotions). His final message to the audience was to be cautious about studies that show correlation in politics and to ask yourself whether they are really telling the whole story.

View Shrayes’ presentation slides here
Download the audio of Shrayes’ presentation here


Ken Strasma
Ken Strasma was the first of two speakers presenting on micro-targeting in political campaigns. Ken worked on the data science team employed by the Obama campaign during the 2012 election. He started out by describing how micro-targeting utilizes statistical models to predict voter behaviors at the individual level. He mentioned that there is a lot of skepticism around the effectiveness of micro-targeting and that the media tends to focus on one or two of the most interesting indicators when reporting stories about it – for example, that cat or dog owners tend to vote a certain way. However, Ken said that most of the competitive advantage comes from doing the boring things and doing them very well.

He also devoted some time to talking about which voters are worthwhile targets, specifically taking into consideration whether or not they are persuadable and what message you can deliver to get them to act favorably. Finally, Ken drew some comparisons between political campaigns and commercial marketing, focusing on the lessons the commercial world can learn from the micro-targeting methods used in politics.

View Ken’s presentation slides here
Download the audio of Ken’s presentation here


Alex Lundry
Last but certainly not least was Alex Lundry, who worked on the Romney campaign’s data science team for the 2012 presidential election. Alex provided examples of the types of data analysis the Romney campaign used throughout the election season. This included monitoring paid media for both campaigns, tracking what proportion of TV spots aired for each party each day, tracking the different commercials that the opposition aired over time, using factor analysis to determine how dispersed an ad buy was, and using predictive modeling to estimate how likely people were to vote and who they would vote for if they did. They also conducted real-time assessments of the opposition’s convention speeches by gender and partisanship, experimented with analyzing the effects on voters when candidates visit their towns, and conducted sentiment analysis of political discussions.

As far as tools, Alex said they relied heavily on the R statistical programming language and on Tableau’s data discovery and visualization software for the bulk of their analyses.

Download the audio of Alex’s presentation here


Panel Discussion
After the presentations, there was a panel discussion where the presenters answered questions such as whether they would prefer to have better algorithms or more data, how they measure results when the results occur mostly on a single day, and what they predict for the 2016 election season.

Download the audio of the panel discussion here

Weekly Round-Up: Data-Driven Processes & Products, SMB Big Data, Telling Stories, and Fighting Bullying

For this week’s round-up of the most interesting data articles and blog posts from around the web, we have several articles ranging in topics from using big data in small and medium sized businesses to using machine learning to combat bullying.

In this week’s round-up:

  • So I’ve got Big Data. Now What?
  • How small and mid-sized businesses can tap the power of big data
  • The key to data science? Telling stories
  • Fighting bullying with machine learning

So I’ve got Big Data. Now What?

This TechCrunch article delves into some of the things that data scientists can help companies do and serves as a call to action for businesses to get more data-driven in their processes and decision-making. It provides some concrete examples of what we data scientists can do for a company, some ways data is being used to drive intelligent and insightful features inside existing products, and some recommendations for how to get started down the road to managing a more data-driven organization.

How small and mid-sized businesses can tap the power of big data

This article from the Washington Post highlights some of the ways small and medium sized businesses can use data to run their organizations more efficiently and make more informed decisions. Historically, smaller companies have been hindered when it comes to leveraging the power of data due to the large investment in business intelligence systems that was required. However, new technologies are leveling the playing field and this article provides some useful tips on how smaller companies can make big data part of their arsenal.

The key to data science? Telling stories

My Note: See below.

This is a ZDNet article summarizing a talk that DJ Patil gave at Le Web about the importance of using data to tell stories. The article contains some highlights of the talk and a video of Patil’s insightful presentation.

Fighting bullying with machine learning

This was an interesting summary of how an interdisciplinary team of computer scientists and psychologists at the University of Wisconsin is using machine learning to capture data about bullying so that it can be prevented. There are links to their publications and also to the code and data they are using.

That’s it for this week. Make sure to come back next week when we’ll have some more interesting articles! If there’s something we missed, feel free to let us know in the comments below.

The key to data science? Telling stories

Source: http://www.zdnet.com/the-key-to-data-science-telling-stories-7000008765/

Summary: Data scientists and journalists like those of us at ZDNet aren't all that different. Both require collecting data to weave narratives, Greylock Partners' D.J. Patil says.

 

We weren’t at last week’s LeWeb conference in Paris, France -- a pity! -- but D.J. Patil, data scientist-in-residence at the venture capital firm Greylock Partners, was, and he had a few choice words to offer the crowd about data science.

Patil says data scientists are a lot like journalists in that both collect data to tell stories. It's that story part that's most important -- a successful data scientist is one who can weave a cohesive narrative from the numbers and statistics.

Like journalism, there are many stories to tell from the same set of data, and data scientists must choose carefully. And just like you, the reader, are wondering what the point of this article is -- what’s my key takeaway? -- there’s a level of subjectivity that data science doesn’t normally get credit for.

The Wall Street Journal's Ben Rooney was there to capture Patil's words.

Five highlights:

  1. "Data science is about creating narratives. It is about creating analogies, about using complex data to tell stories."
  2. "Data science is about trying to create a process that allows you to create new ways of thinking about problems that are novel, or you are trying to use data to create or make something."
  3. "We need to separate out the idea of 'big data,' from being 'data driven.' Companies need to be data-driven but often you don't need lots of data to do something."
  4. "Subjective areas are where data science shines. It allows us to ask questions...the point is to have a debate."
  5. "We posted job postings for the same job but with different titles. Data scientist, research analyst, research scientist -- the people that were the true creative curious types all came through the data science job title."

The full talk, in a video, below:

Topics: Big Data, Data Management

DBDC Event Review: Money for Nothing – Productizing Government Data

Some companies are increasingly closing their APIs and locking down their data with the aim of figuring out how to monetize it (see Twitter, for instance). The government, however, is doing the opposite: it is actively creating APIs to make government more transparent and accessible. This is creating new opportunities for businesses that are willing to make sense of the data and transform it into something valuable for a particular market. These new opportunities, and the lessons learned by those in the trenches, were the theme for Data Business DC‘s (DBDC’s) second meetup, Money For Nothing — Productizing Government Data. Presenters ranged from a small, one-person startup using government data as the foundation of its business to a senior government official talking about the challenges, successes, and opportunities of using government data.

Josh Hurd of Fedalytics Presents

 

Josh Hurd, founder of Fedalytics, was the first speaker of the night. Fedalytics provides a tool for lobbyists and other interested parties to gain a unified view of politicians and politically oriented organizations. The problem Josh solves is that this data is spread across a number of government databases and files, which in the past required tremendous effort to clean, combine, use, and present. Josh makes it easy for users to get the information and results they need – quickly.

Matthew Eierman talks about HDScores

Next up was Matthew Eierman of HDScores, whose business harmonizes health department data such as restaurant inspections. If you’ve ever been to New York City, you might have noticed that every restaurant has a score prominently displayed right next to the door, mandated by government regulation. Matthew mentioned that once the regulations went into effect, restaurant closings due to health and sanitation issues went down by 15%. However, while that’s cool, not every municipality or state does this. Moreover, each government entity has different formats and ways of distributing this information. HDScores does the hard job of compiling and harmonizing this data so that consumers can get the information they need to make good decisions. Check out their website to see which restaurants near you might be sketchy — you might be surprised!

Our last speaker of the night was Brian Forde, Senior Advisor to the U.S. CTO for Mobile and Data Innovation. He talked about a number of ways government data is saving lives and making a difference. One story in particular was about how the government, businesses, high school students, and hackers came together to coordinate relief efforts for Hurricane Sandy. High school students identified open gas stations via Twitter posts, businesses like Hess provided near real-time data on which stations still had gas, and hackers helped pull all of the disparate parts together. The end result was a site that helped citizens affected by Sandy find where they could buy gas.

The more important point Brian wanted to make was that the government is working hard to make data more transparent and accessible. With that in mind, he encouraged the audience to contact him about government data sources that need to be machine readable for inclusion in apps. If you know of any data that needs liberation, contact him at bforde at ostp.eop.gov. Lastly, he also encouraged us all to check out http://www.data.gov/ and http://challenge.gov/, two government sites designed to free data and solve tough problems.

In sum, it was a great evening of learning from both the folks who are using government data and those within the government who are working hard to make it easier to access and use such data. I encourage you to check out these awesome companies and reach out to Brian if you know of government data sources that need to be machine readable.

Weekly Round-Up: Big Data Business Models, Cloudera, BI Trends, and Cool Visualizations

Have you used Google to search for cold symptoms, or a home remedy? Then you’ve unwittingly helped the search giant to predict the outbreak of flu on an international scale.

You’re definitely going to get the flu this year.

Alright, sorry, maybe not definitely. But the CDC is reporting that flu season is off to an “early start,” and will likely be one of the worst in the past decade.

How does the CDC predict such things? With cold, hard clinical evidence: The organization publishes a weekly FluView report based on the number of patients who have reported flu-like symptoms and the number of hospitalizations. But as CDC Director Thomas Frieden noted, the spread of the flu is fairly “unpredictable,” and FluView has a one- to two-week lag.

Leave it to Google to leverage our search data to create an almost real-time prediction map. “In 2007, a small team of engineers began to explore ways of accurately modeling real-world phenomena using patterns in search queries,” Corrie Conrad, a senior program manager at Google, tells Co.Design. “We found that what people search for is actually a good indicator of influenza activity in a population.” The resulting Google.org project, Google Flu Trends, tracks the number of times that users search for flu information, like “flu symptoms” or “flu remedy.” Then, each search is added to the map using the IP address associated with the query, creating a localized map that predicts outbreaks in near real time. It’s essentially a map that uses Google Trends to predict illness. “While traditional systems require 1-2 weeks to gather and process surveillance data, our estimates are current each day,” the Google team explains in a piece in the journal Nature (PDF here).

“Of course, not every person who searches for ‘flu’ is actually sick, but a pattern emerges when all the flu-related search queries are added together,” say the designers. “We compared our query counts with traditional flu surveillance systems and found that many search queries tend to be popular exactly when flu season is happening.” This year, a group of Johns Hopkins and CDC researchers authored a study proving that the Flu Trends data is closely correlated with an increase in emergency room activity in general.

Could Google Trends emerge as an ally for hospitals struggling during times of epidemics and pandemics? Despite the privacy qualms many have about Google, it’s a remarkable idea. Conrad tells Co.Design that they have no current plans to expand the methodology to more diseases, but they’ve found that their system is also accurate in predicting Dengue. It’s interesting to imagine how the methodology could be applied to more general trends -- birth rates, for example, or even mental health issues. One Finnish data junkie, for example, has used the data to compare it to sick days taken. It correlates … roughly.

By Kelsey Campbell-Dollaghan

Data Science DC Event Review: Implicit Sentiment Mining in Twitter Streams

Geoff Moes wrote a great, comprehensive writeup of our most recent Data Science DC event, “Implicit Sentiment Mining in Twitter Streams” by Maksim (Max) Tsvetovat. Geoff is a seasoned software developer and math enthusiast, with an interest in applying analytics to software engineering problems. He blogs about his thoughts at Elegant Coding. Here’s an excerpt (pardon the language in the example):

The field of computational linguistics has developed a number of techniques to handle some of the complexity issues… by parsing text using POS (parts-of-speech) identification, which helps with homonyms and some ambiguity. [Max] gives the following example:

Create rules with amplifier words and inverter words:

  • This concert (np) was (v) f**ing (AMP) awesome (+1) = +2
  • But the opening act (np) was (v) not (INV) great (+1) = -1
  • My car (np) got (v) rear-ended (v)! F**ing (AMP) awesome (+1) = +2??

Here he introduces two concepts which modify the sentiment, which might fall under the heading of sentiment “polarity classification” or detection. One idea is an amplifier (AMP), which makes the sentiment stronger, and the other is an inverter (INV), which creates the opposite sentiment. I found this idea of “sentiment modification” intriguing, did a little searching, and came across a paper called “Multilingual Sentiment Analysis on Social Media” which describes these ideas [page 12] and a few more, including an attenuator, which is the opposite of an amplifier. It also describes some other modifiers that control sentiment flow in the text; pretty interesting concepts, and the whole paper looks worth a read.
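To make the amplifier/inverter idea concrete, here is a rough R sketch of my own (toy lexicons and scoring rules, not Max’s actual code) that scores a tokenized sentence using those two modifiers:

    # Toy lexicons: the base scores and modifier word lists are made up for illustration.
    base_scores <- c(awesome = 1, great = 1, terrible = -1)
    amplifiers  <- c("really", "totally")   # AMP: doubles the next sentiment word
    inverters   <- c("not", "never")        # INV: flips the next sentiment word

    score_sentence <- function(tokens) {
      tokens <- tolower(tokens)
      total <- 0
      modifier <- 1
      for (tok in tokens) {
        if (tok %in% amplifiers) {
          modifier <- modifier * 2
        } else if (tok %in% inverters) {
          modifier <- modifier * -1
        } else if (tok %in% names(base_scores)) {
          total <- total + modifier * base_scores[[tok]]
          modifier <- 1   # a modifier only applies to the next sentiment-bearing word
        }
      }
      total
    }

    score_sentence(c("this", "concert", "was", "really", "awesome"))   # +2
    score_sentence(c("the", "opening", "act", "was", "not", "great"))  # -1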

Read the rest on Geoff’s blog…

Meet Hadoop Creator Doug Cutting

My Note: I have written a story about how Hadoop is costing people 50X what they originally thought it would cost them to use.

Posted by Sean Murphy

Although I would much prefer you come to the next DataBusinessDC event, there is another event, far, far away in Columbia, MD, that some might prefer. Details below.


When:
Wed, Dec 5, 2012
6 – 8:00 PM

Location:
Johns Hopkins University Applied Physics Laboratory
11100 Johns Hopkins Road, Laurel, MD 20723
Building 200, Room E100


Session Abstract
Beyond Batch: Hadoop started as an offline, batch-processing system. It made it practical to store and process much larger datasets than before. Subsequently, more interactive, online systems emerged, integrating with Hadoop. First among these was HBase, the key/value store. Now scalable interactive query engines are beginning to join the Hadoop ecosystem. Realtime is gradually becoming a viable peer to batch in big data.
Please register here!

3 Ways You’re Ruining Your GA Data, and How to Stop: Tips from a Data Scientist

This is a cross post by Sean Patrick Murphy (@sayhitosean), who runs the DataBusinessDC meetup. Finish reading over at Spinnakr’s blog.

Marketers who want to be more data-driven often start with Google Analytics. It’s ubiquitous, free, and comparatively easy to use. It’s also the first client data I have access to in many of my consulting projects. Since I see it all the time, I’ve noticed that most companies’ GA is afflicted with serious hygiene problems.

To learn more about three of the common issues ruining the usability of GA data, and instructions for fixing them, keep reading here.

Community Indicators Conference

This is a guest post from Jim Farnham, co-organizer of the upcoming Community Indicators Consortium Impact Summit in College Park. Community indicators are an effort to bring performance metrics to local-level governance, and as such are related to the Open Data movement. CIC is sponsoring DC2 this month, in an effort to get the local community aware of this event. We thank Jim and CIC for their support, and urge locals interested in Open Data and related topics to consider attending!

CIC’s Impact Summit – November 15-16, 2012, in College Park, MD – will be a forum for community indicators practitioners and stakeholders to share projects, research, and lessons from various fields including Sustainability, Health, and Education, as well as to explore approaches and tools for creating positive impacts within our communities. The Community Indicators Consortium (CIC) is an active, open learning network and global community of practice among persons interested or engaged in the field of indicators development and application. We host webinars and conferences, manage an indicator project database, and undertake projects aimed at building the field.

Our conference provides an opportunity to share ways of increasing the impact of our work on behalf of our communities, public officials, funders, professional networks, businesses and clients in a variety of formats and tracks.

Join over 200 practitioners, analysts, academics, funders, and data providers in over 20 sessions to share our work and explore new ways and tools to track impact, understand the macro trends buffeting our communities, and learn how to effect change and bridge the distance between objective, high-quality data and subjective perceptions and interpretation. Full details are available at the conference web site. For those of you who cannot make it to College Park, we will provide streaming of portions of the conference, and other portions will be online for member access after the conference.

(Individual membership in CIC is $75 per year and provides discounted access to CIC conferences, free access to 10-20 webinars per year and all webinar and conference archives).

Presenters at the CIC Impact Summit  include:
Robert Groves – Former Director of US Census Bureau
Bryan Sivak – Chief Technology Officer, U.S. Department of Health and Human Services
Eugenie Birch – University of Pennsylvania
Charlotte Kahn – Director of Boston Indicators Project
Chantel Bottoms – Austin Community Action Network
Michael McAfee – Promise Neighborhoods Institute, PolicyLink
Salin Geevargese – HUD, Office of Sustainable Housing and Communities

And about 50 others.

Follow us @CommunityIC and tweet to #CICSUMMIT

Please contact us at conference@communityindicators.net with any questions or comments regarding the conference.

Event Review: Carl Morris Symposium on Large-Scale Data Inference

This is a guest blog post by area Statistician Jerzy Wieczorek. Jerzy is an active contributor to the local community through his blog, Civil Statistician, his twitter account, @civilstat, and Meetups, where he recently talked about interactive map visualizations in R. He is also active in pro-bono statistical consulting for non-profits and government agencies: internationally via Statistics Without Borders, and locally via DataKind‘s DataCorps project with DC Action for Children. Thanks, Jerzy!

The 2nd Symposium on Large-Scale Data Inference took place last Thursday in Silver Spring. The symposium, organized by Social & Scientific Systems, focused on the intersection of statistics and data visualization and honored Carl Morris as the keynote speaker. Dr. Morris, whose work on Empirical Bayes methods and hierarchical models underlies many advances in large-scale data analysis, spoke about using multilevel modeling to reduce false positives in situations where you expect to see regression to the mean. He explained why “you’ll do better as a frequentist if you use Bayesian methods.” The other speakers included:

  • Mark Hansen, on data-driven art installations, teaching data journalism, and looking at the data from all perspectives (“Everyone types the name of an object in R and watches all 20 billion lines scroll by at least once, right?”),
  • Di Cook, on “visual inference,” i.e., informal hypothesis testing by trying to pick the real data plot out of a lineup of fake data plots; and testing this approach on Mechanical Turk vs. on colleagues (“It’s impressive that statisticians have got good visual skills…”),
  • Rob Kass, on statistical modeling in cognitive neuroscience and the role of statistical thinking in general (Use models to decompose variation into knowledge vs. uncertainty; and Analyse the modeling procedures themselves),
  • Chris Volinsky, on using mobile phone data in city planning, useful metrics for classifying/clustering such structured data, and the practice of doing data visualization as a team (“Physical manifestations beat small screens”),
  • a final panel discussion, on defining big data (starting with “Big data is whatever can’t be read into R easily”); the statistician’s role on a research team; distinctions between statistics and other quantitative sciences; tools and skills that statisticians should learn; and when to optimize carefully vs. just try things out.

See civilstat.com for more detailed notes on each talk: part 1 (Morris and Hansen), part 2 (Cook and Kass), and part 3 (Volinsky and the panel).

With the symposium’s small, comfortable scale, I found it easy to chat with the speakers and fellow attendees during breaks. There was also a student poster session during the lunch break, showcasing work in dynamic graphics and hierarchical modeling. For anyone who wants a broad, accessible insight into recent research and underlying themes in statistics, big data, and visualization, I highly recommend this symposium series.

Data Scientists survey results teaser

Earlier this year, several of us from the DC2 community (Harlan Harris – that’s me – Marck Vaisman, and Sean Murphy) conducted a web-based survey of Data Scientists, with the goal of better understanding the varieties of people, skills, and experiences that fall under this rather broad buzzword. We have analyzed the results from over 250 respondents, and are excited to share some initial findings here!

The first task in the survey was to rank a set of 21 skill categories. We used the technique of non-negative matrix factorization to find five underlying dimensions of variation among the rankings. We found that Data Scientists have skills that tend to be associated together, and by grouping those skills, we can provide people with a useful shorthand. Here are the skill groups, with category names that we think clarify what we as Data Scientists bring to the table:

  • Programming: Back-end Programming, Front-end Programming, Systems Administration
  • Stats: Classical Statistics, Data Manipulation, Science, Spatial Statistics, Surveys and Marketing, Temporal Statistics, Visualization
  • Math: Algorithms, Bayesian/Monte Carlo Statistics, Graphical Models, Math, Optimization, Simulation
  • Business: Business, Product Development
  • Machine Learning/Big Data: Big and Distributed Data, Machine Learning, Structured Data, Unstructured Data

Clearly not everyone who is strong in some aspects of these categories will be expert in every area. But, as a general rule, these skill groups co-occur. Equally important, a Data Scientist who has skills in Machine Learning and Big Data may have little expertise in Surveys or Front-End Programming.
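For readers curious what such a factorization looks like in code, here is a rough sketch using simulated data and the CRAN NMF package (this is not our actual survey analysis code; the matrix and its dimensions are hypothetical):

    library(NMF)   # CRAN package providing nmf(), basis(), and coef()

    # Hypothetical input: a respondents-by-skills matrix of non-negative scores
    # (e.g. ranks reverse-coded so that a larger number means a stronger skill).
    set.seed(1)
    skills <- matrix(runif(250 * 21), nrow = 250, ncol = 21)

    fit <- nmf(skills, rank = 5)   # look for five underlying skill groups

    W <- basis(fit)   # respondents x 5: how strongly each person loads on each group
    H <- coef(fit)    # 5 x skills: which skills define each group
    primary_group <- apply(W, 1, which.max)   # each respondent's dominant skill group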

We performed a similar NMF analysis on a series of self-evaluation questions near the end of the survey. Respondents gave “Completely Agree” to “Completely Disagree” responses to statements that started with “I think of myself as a(n)…” We view the Self-Identification groups that fell out of the NMF analysis as critical to clarifying the diverse backgrounds and interests of Data Scientists. Here is how the responses to these questions grouped, along with category names that we feel are useful:

  • Data Businessperson: Business person, Leader, Entrepreneur
  • Data Creative: Artist, Jack-of-All-Trades, Hacker
  • Data Researcher: Scientist, Researcher, Statistician
  • Data Engineer: Engineer, Developer

Many people responded to many of these self-ID questions positively, but the analysis shows underlying dimensions of variation that can inform people’s career paths and interests. Even more fascinating, the two groupings we identified, skills and self-ID, correlate in ways that we think are highly valuable to Data Scientists and to the organizations that need our skills. The graph below shows how survey participants, labeled by their primary (strongest factor loading) skill group and their primary self-ID group, arrange themselves in a cross-tabulation table (click to see larger).

As we further dive into these results, we will be stressing the point that our data shows substantial variation in skills and interests among Data Scientists. The field is quite diverse, and a Data Creative who can build an amazing Javascript tool to visualize data from a set of disparate sources may be very different from a Data Businessperson who starts a data-related business or a Data Researcher who uses advanced mathematical tools to bring insight to organizations or a Data Engineer who integrates enterprise databases with predictive or optimization systems.

We’d love to share more results with you! If you are in the Washington, DC area on August 27th, please come see us talk about the survey results at the Data Science DC Meetup! And if you’ll be attending DataGotham in New York City on September 14th, we’ll be presenting highlights there too! Otherwise, stay tuned for future presentations and publications. If you have any specific questions that we might be able to answer as we further explore the data, please email us!

Harlan (harlan at datacommunitydc.org)
Sean (seanm at datacommunitydc.org)
Marck (marck at datacommunitydc.org)

P.S. If you are one of those Data Creatives or Engineers with JavaScript skills to burn and a bit of free time, we’d love your help putting together a web-based tool related to this project. Please drop us a line.
