Table of contents
  1. Story
    1. Data Science for EPA Big Data Analytics
    2. Footnote: Email to EPA CIO Anne Duncan
    3. Slides
      1. Slide 1 EPA Big Data Analytics: EnviroAtlas Data
      2. Slide 2 Overview
      3. Slide 3 EPA EnviroAtlas Data: Web Page
      4. Slide 4 EPA EnviroAtlas Data: Description
      5. Slide 5 EPA EnviroAtlas Data: Maps
      6. Slide 6 EPA EnviroAtlas Data: National
      7. Slide 7 EPA EnviroAtlas Data: Community
      8. Slide 8 EPA EnviroAtlas Data: Map of Communities
      9. Slide 9 Geodatabases-to-Shape Files
      10. Slide 10 FME Workbench: National Metrics Log File
      11. Slide 11 FME Workbench: National Metrics GDB-to-SHP
      12. Slide 12 Data Science Data Publication: MindTouch Knowledge Base
      13. Slide 13 Data Science Data Publication: Spreadsheet Knowledge Base
      14. Slide 14 EPA EnviroAtlas National & Community Inventory
      15. Slide 15 Data Science Data Publication: Spotfire Cover Page
      16. Slide 16 Data Science Data Publication: IRM Strategic Plan
      17. Slide 17 Data Science Data Publication: IRM Strategic Plan Tables
      18. Slide 18 Data Science Data Publication: EnviroAtlas Inventory National
      19. Slide 19 Data Science Data Publication: EnviroAtlas Inventory Community
      20. Slide 20 Data Science Data Publication: EnviroAtlas Inventory NatureServe
      21. Slide 21 Data Science Data Publication: EnviroAtlas Inventory Land Cover
      22. Slide 22 Conclusions and Recommendations
      23. Slide 23 Exploratory Data Science on Even Bigger Data
      24. Slide 24 Spotfire Data Tables and Relations
      25. Slide 25 Exploratory Data Science: BioMass by HUC12 1
      26. Slide 26 Exploratory Data Science: BioMass by HUC12 2
      27. Slide 27 Exploratory Data Science: Florida BenMap
      28. Slide 28 Exploratory Data Science: Florida BG_Pop
  2. Spotfire Dashboard
  3. Slides
    1. Slide 1 President's Chief Data Scientist: DJ Patil and HHS IDEA LAB Demand-Driven Open Data: David Portnoy
    2. Slide 2 Silicon Valley to Washington
    3. Slide 3 Dhanurjay Patil in the News
    4. Slide 4 U.S. Data Chief Aims to Empower Citizens with Information
    5. Slide 5 Back-of-the-Napkin Math, with DJ Patil: #PiDay Challenge
    6. Slide 6 First White House Data Chief Discusses His Top Priorities
    7. Slide 7 On the Case at Mount Sinai, It’s Dr. Data 1
    8. Slide 8 On the Case at Mount Sinai, It’s Dr. Data 2
    9. Slide 9 Data Science for Natural Medicines and Epigenetics
    10. Slide 10 Natural Medicine for Disease and Wellness Meetup
    11. Slide 11 The Birth of Demand-Driven Open Data
    12. Slide 12 Demand-Driven Open Data
    13. Slide 13 Key Questions
    14. Slide 14 EPA Big Data Analytics: Turning Data Into Value
    15. Slide 15 EPA's Ethan McMahon's Excellent Questions
    16. Slide 16 Agenda
  4. Research Notes
    1. Oregon Data
    2. New EPA Report on Fracking
    3. Data Sharing
      1. How to share data with a statistician
      2. What you should deliver to the statistician
        1. The raw data
        2. The tidy data set
        3. The code book
        4. How to code variables
        5. The instruction list/script
      3. You should then expect from the statistician:
      4. Contributors
    4. EPA OW and OEI policy
    5. Oregon data on watersheds and harmful algal blooms
    6. Ideas for Big Data and Public Policy Article
    7. Convert GPX to SHP with a Free Trial of FME Desktop
    8. Response and Invitation to EPA About Meetup
    9. EPA’s Cross-Agency Data Analytics and Visualization Program
    10. Big Data Analytics Questions From EPA
  5. The President Speaks At Hadoop World: Introduces DJ Patil as Nation’s First Chief Data Scientist
  6. On Demand Webinar Featuring New Federal Chief Data Scientist DJ Patil and Hilary Mason
  7. Data Driven: Creating a Data Culture
    1. Preface
      1. Data Driven
      2. Revision History for the First Edition
    2. Data Driven: Creating a Data Culture
      1. What Is a Data Scientist?
      2. What Is a Data-Driven Organization?
        1. Democratizing Data
      3. What Does a Data-Driven Organization Do Well?
        1. Managing Research
        2. Designing the Organization
        3. Process
          1. Figure 1-1. DILBERT
          2. Daily dashboard
          3. Metrics meetings
          4. Standup and domain-specific review meetings
      4. Tools, Tool Decisions, and Democratizing Data Access
      5. Creating Culture Change
  8. The White House Names Dr. DJ Patil as the First U.S. Chief Data Scientist
  9. Unleashing the Power of Data to Serve the American People
    1. Overview: What Is Data Science, and Why Does It Matter?
    2. The Role of the First-Ever U.S. Chief Data Scientist
      1. Providing vision on how to provide maximum social return on federal data.
      2. Creating nationwide data policies that enable shared services and forward-leaning practices to advance our nation’s leadership in the data age.
      3. Working with agencies to establish best practices for data management and ensure long-term sustainability of databases.
      4. Recruiting and retaining the best minds in data science for public service to address these data science objectives and act as conduits among the government, academia, and industry.
      5. Precision medicine
      6. Usable data products
      7. Responsible data science
  10. DJ Patil Scores the Sexiest Job in D.C.
  11. EPA Ecosystems Research
    1. Ecosystems Research Areas
    2. Available and Awarded Funding
    3. Publications
    4. Tools and Resources
  12. EPA Methods, Models, Tools, and Databases
    1. Methods
    2. Models
    3. Tools
    4. Databases
  13. Databases, GIS and Mapping Applications, Online Reporting
    1. Databases
    2. GIS and Mapping Applications
  14. EnviroAtlas: Fresno, CA and surrounding area
    1. Background
      1. Fresno Area Demographics
    2. Ecosystem Services Overview
      1. Access to Parks
      2. Stream and Lake Buffers
    3. EnviroAtlas Tools and Features
  15. Municipalities within EnviroAtlas Boundaries
    1. How are EnviroAtlas boundaries created?
    2. How do I use this document?
  16. NEXT

Data Science for EPA Big Data Analytics


Story

Data Science for EPA Big Data Analytics

President Obama has personally appointed and introduced a new Chief Data Scientist, Dr. DJ Patil, who has outlined his new program focusing on four activities and three priority areas:

  1. Providing vision on how to provide maximum social return on federal data.
  2. Creating nationwide data policies that enable shared services and forward-leaning practices to advance our nation’s leadership in the data age.
  3. Working with agencies to establish best practices for data management and ensure long-term sustainability of databases.
  4. Recruiting and retaining the best minds in data science for public service to address these data science objectives and act as conduits among the government, academia, and industry.

 

  1. Precision medicine
  2. Usable data products
  3. Responsible data science

The Federal Big Data Working Group Meetup supports these four activities and three priority areas as follows:

  1. Vision: Federal Big Data Working Group Meetup Mission Statement
  2. Data Policies: Data Science Data Publications
  3. Working with Agencies: Meetups and MOOCs (Massive Open Online Courses)
  4. Data Scientists: The People in People, Process, and Products

 

  1. Precision medicine: Semantic Medline and Natural Medicine for Disease and Wellness Meetup (Precision Medicine should include Precision Nutrition: Data Science for Natural Medicines)
  2. Usable data products: Semantic Community and Federal Big Data Working Group Meetup Data Science Data Publications
  3. Responsible data science: Federal Big Data Working Group Meetup and Eastern Foundry

In support of him, I am developing a Data Science for EPA Big Data Analytics data product and Meetup (see slides below), in cooperation with EPA, using EPA ecosystem data to answer not only the excellent questions from EPA's Ethan McMahon (see below) but also to address the broader matter of:

Turning Data Into Value: Organizations do not fear a shortage of data. Systems, applications, and devices all produce and consume exponentially increasing amounts of it. The Federal Big Data Working Group Meetup Data Scientists help organizations use their data to best advantage by integrating, analyzing, and acting on it via event processing as follows:

  • Integration provides the right data to the right system or person in real time.
  • Analytics lets users develop insights using vast amounts of data to understand the past and anticipate the future.
  • Event processing combines the knowledge gained from analytics with real-time information to identify patterns of events and act to bring about the best outcomes.

As you can see, analytics (of Earth Science Data from EPA, etc.) is just one of three elements to support the President’s new Chief Data Scientist.

DJ Patil and Hilary Mason have recently written an excellent short booklet, Data Driven: Creating a Data Culture, because cultural change is the most challenging part of data science in an organization. I certainly found this to be true in my final years at EPA as a data scientist. I also agree with the suggestions in a recent article on how to share data with a statistician (see the Data Sharing research notes below), since I worked as a statistician earlier in my career at EPA.

EPA's Ethan McMahon has provided the following public request: EPA is planning to stand up a big data analytics service within the agency. We’d appreciate ideas from the ESIP community in a few areas:

1. What problems have you tried to solve using data analytics and/or visualization?

Answer: Exploratory Data Analysis and Data Science Products for many EPA and Earth Science data sets.

2. Are there any strategies or best practices you used to manage data within or between enterprise data systems? 

Answer: The Cross-Industry Standard Process for Data Mining (see the Wikipedia page on the CRISP-DM process model) and Spotfire.

3. What techniques make sense for integrating large or varied data from multiple sources?

Answer: Spotfire, Tamr, etc.

4. What technologies have you used and how did you select them?

Answer: MindTouch and Spotfire, chosen on the basis of the data scientist's recommendations and prior success using them.

5. Did you use any particular training resources for using big data analytics systems, and if so which ones?

Answer: MOOCs (See Five MOOCs for Big Data Applications and Analytics)

6. What lessons would be helpful for us to learn as we set up this service?

Answer: Answer the four questions: How was the data collected? Where is it stored? What are the results? Why should we believe them?

We’re open to your ideas and we’re ready to share what we have learned. Please respond to me directly (mcmahon.ethan@epa.gov). Interestingly, his email response says: If you have questions about EPA data analytics, please contact Richard Allen at allen.richard@epa.gov.

The EPA Office of Environmental Information has produced EPA's IRM Strategic Plan for FY 2015-2018 in PDF, which was converted to Word using Adobe cloud tools and imported into MindTouch as part of the EPA Big Data Analytics Knowledge Base. The EPA IRM Strategic Plan was mined, and the following excerpts were of particular interest:

  • Semantic Web: EPA needs to manage terminology to position the Agency for the Semantic Web. See 9.2.9 Terminology Services
  • Over the past few years, the pace of IT change has continued to increase and spread to new areas of technical and business operations. Trends such as Agile development and service-oriented architecture, the move from dedicated hosting to cloud computing, mobile technology in the office and in the field, “big data,” the Internet of Things, bring-your-own-device, Voice-Over-IP—these are more than gradual technical changes. They have disrupted traditional mission business practices, raised service expectations and restructured IT funding strategies.
  • To respond to these trends in an era of diminishing resources, where IT must do more with less, OEI has set new goals for itself in four specific areas. These goals play out across the subsequent sections of this IRM Strategic Plan. See 2.5 IRM Goals
  • As is true for almost all government agencies, EPA necessarily lags the private sector in its adoption of cutting edge technology due to its longer financial planning horizons and lower risk tolerance. Where possible, however, EPA is experimenting with technical innovations in instances where they are relatively low in cost, are supported by a strong development community and offer promise of breakthroughs in service improvement and burden reduction. A case in point is the Advanced Data Analytics initiative, where OEI is focusing the power of big data analytics on areas such as compliance improvement and environmental forecasting. Another is the Smart Mobile initiative to automate previously manual field operations. See 2.5.4 Apply Leading Edge Technology in Existing and New Programs
  • As more system data dictionaries and value lists are added to the system and mapped, DERS will show which data from different systems might be brought together or integrated. It encourages the adoption of common pick-lists and shows mappings across analogous pick lists. DERS’s semantic mapping capabilities enable information assets to be more easily understood. EPA program offices can identify redundancies among their systems and direct modernization efforts accordingly. See 9.2.7 Data Element Registry Services
  • In GIS usage, “symbology” refers to the use of consistent icons and symbols across maps to indicate the same things, such as monitoring points, emission points or watershed boundaries. Symbology becomes a major issue when integrating data from disparate sources. See Reference 7
  • Especially for scientific data sets, which can be extremely large, storage and maintenance of old data can be excessively costly. Information is not saved regardless of cost: EPA’s policy is to store and maintain data only in relation to its value for existing and potential future uses. See Reference 14

We have been working with the EPA ORD EnviroAtlas national and community data, which are available to download below as geodatabases. However, ESRI geodatabases are not an open format, so there is a need to convert them to shapefiles. EnviroAtlas is part of EPA's ecosystems research, which is working to protect ecosystems and the air and water resources that provide numerous benefits for humans and other living things. In the author's opinion, this is the best EPA "big data" for analytics. We are also looking at state (e.g., Oregon) and local data that can be integrated as the EPA IRM Plan describes.
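
The slides below show this geodatabase-to-shapefile conversion being done with FME Workbench. As a rough open-source alternative, the same conversion could be sketched with the GDAL/OGR stack via the fiona and geopandas Python packages. This is only a sketch under stated assumptions: the geodatabase name and output folder are placeholders, and it assumes GDAL's file-geodatabase driver can read the layers.

# Sketch: convert each layer (feature class) of an ESRI file geodatabase to a
# shapefile using fiona and geopandas. Paths are placeholders.
import os

import fiona
import geopandas as gpd

gdb_path = "EnviroAtlas_National.gdb"   # hypothetical geodatabase name
out_dir = "shapefiles"
os.makedirs(out_dir, exist_ok=True)

# A file geodatabase can hold many feature classes ("layers").
for layer in fiona.listlayers(gdb_path):
    gdf = gpd.read_file(gdb_path, layer=layer)          # read one feature class
    out_path = os.path.join(out_dir, f"{layer}.shp")
    gdf.to_file(out_path, driver="ESRI Shapefile")      # write the open-format shapefile
    print(f"Wrote {len(gdf)} features to {out_path}")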

We have also invited EPA’s Cross-Agency Data Analytics and Visualization Program staff to our Meetups to coordinate and communicate (see Response and Invitation to EPA About Meetup). Interestingly, the author tried to do the same basic things while at EPA:

  • New and enhanced ways to gather data, conduct analysis, perform data visualization and use “big data” to explore and address environmental, business, and public policy challenges. Business intelligence tools, geospatial tools and visualization tools are also converging, providing opportunities for knowledge and insight from disparate data sources.
    • Response: Served as the first EPA Data Architect and Data Scientist and piloted the use of Spotfire Cloud (which incorporates S-PLUS).
  • Work with EPA programs and regional offices to create a central platform for data analytics to evaluate information across a wide range of data collections and break through the Agency’s stove-piped set of data systems, which do not make these associations easy or common today.
    • Response: 50+ EPA data science stories and products that are still available
  • Use external data sources in combination with EPA data sources, expert systems and machine learning on unstructured data and/or scattered data sets across the Agency.

The EPA Big Data Analytics Spreadsheet contains the Knowledge Bases for this page and for the EPA IRM Strategic Plan and its Appendix Tables, which are a valuable resource. This is an example of natural language processing and entity extraction: structuring unstructured content into a more machine- and human-readable form to support semantic searching, analytics, and visualization. Some of the key entities extracted are given above. The Appendix 4 Table provides an interface to EPA Big Data Analytics, and the Appendix 5 Table shows the challenge, with only 5 Integrated Data Collections.
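
For illustration only, a toy keyword-spotting script along the following lines can tabulate where terms of interest occur in the plan text. This is not the process actually used to build the spreadsheet knowledge base; the input file name and term list are placeholders.

# Toy sketch: count occurrences of a fixed vocabulary of terms in the
# extracted IRM Strategic Plan text. "irm_plan.txt" is a placeholder.
import re
from collections import Counter

TERMS = ["big data", "semantic", "analytics", "cloud", "mobile",
         "Internet of Things", "data element registry", "symbology"]

with open("irm_plan.txt", encoding="utf-8") as fh:
    text = fh.read()

counts = Counter()
for term in TERMS:
    # Case-insensitive count of each term in the plan text
    counts[term] = len(re.findall(re.escape(term), text, flags=re.IGNORECASE))

for term, n in counts.most_common():
    print(f"{term}: {n}")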

The most integrated EPA data collections are those for EPA Ecosystem Research, but these do not appear to be the focus of the new EPA Big Data Analytics Program, so that is what the Federal Big Data Working Group Meetup will focus on in creating another Data Science MOOC, like the USDA Data Science MOOC in process for their CIO and ACDO.

This reminded me of what a former EPA data colleague said to me about five years ago, so I recently emailed: Remember when you told me that senior management does not see the value of what I do? Well, they do now! They just found out that I have been doing what EPA’s Cross-Agency Data Analytics and Visualization Program now plans to do, both before and after leaving EPA. Please see this story and join our April 20th Meetup.

After exploring the EPA EnviroAtlas and Ecosystem Research data sets, the State of Oregon Department of Environmental Quality data, and some suggested local Oregon data sets, it was decided to analyze and visualize the following:

CityPts_020915.png

The data mining process, tools, and sample results are shown in the Slides below.

The Conclusions and Recommendations are:

This is the first Data Science for EPA Big Data Analytics use case (also see the previous Data Science for EPA EnviroAtlas and Data Science for EPA Air Data).

Two more Data Science for EPA Big Data Analytics use cases are:

More use cases are in process for an EPA Big Data Analytics MOOC (Massive Open Online Course) for the Federal Big Data Working Group Meetup.

Footnote: Email to EPA CIO Anne Duncan

Ms. Duncan, I am a retired EPA employee of 30+ years who served as EPA’s first Data Architect, Senior Enterprise Architect, and Data Scientist.

Since leaving EPA, I have remained active as a data scientist/data journalist with my own company, Semantic Community, and have founded and organized the Federal Big Data Working Group Meetup which recently had a Meetup entitled: President's Chief Data Scientist and EPA Big Data Analytics Meetup at http://www.meetup.com/Federal-Big-Data-Working-Group/events/220799665/

We had EPA people involved in this Meetup and I have been working with former EPA colleagues to do big data analytics on EPA data sets like EnviroAtlas, Fracturing Data, Oregon Data, etc.

We recently had a Meetup with USDA CIO Joyce Hunter, and she asked for my suggestions on how to advance her Open Data and Data Science Programs. I suggested she use former USDA people who had worked well with USDA data, just as I am doing for EPA and other agencies whose data I worked with before and since leaving government service.

She agreed and asked that we help her Open Data and Data Science Programs by doing a MOOC (Massive Open Online Course) for them, which will be the subject of our May 18th Meetup, USDA Data Science MOOC: http://www.meetup.com/Federal-Big-Data-Working-Group/events/221457264/

We have decided to do the same for EPA because I have over 50 data science products from my EPA days and since then that can be used to educate and train data scientists in the analysis and visualization of EPA Data sets.

We have been in communication with the new President’s Chief Data Scientist and participated in their recent Tech Meetup organized by White House CTO, Megan Smith.

Best wishes in your new position, Brand

Slides

Slide 2 Overview

BrandNiemann04172015Slide2.PNG

Slide 3 EPA EnviroAtlas Data: Web Page

http://enviroatlas.epa.gov/enviroatl...oad/index.html

BrandNiemann04172015Slide3.PNG

Slide 4 EPA EnviroAtlas Data: Description

http://enviroatlas.epa.gov/enviroatl...oad/index.html

BrandNiemann04172015Slide4.PNG

Slide 5 EPA EnviroAtlas Data: Maps

http://enviroatlas.epa.gov/enviroatl...ata/scale.html

BrandNiemann04172015Slide5.PNG

Slide 6 EPA EnviroAtlas Data: National

http://enviroatlas.epa.gov/enviroatl...ata/scale.html

BrandNiemann04172015Slide6.PNG

Slide 7 EPA EnviroAtlas Data: Community

http://enviroatlas.epa.gov/enviroatl...ata/scale.html

BrandNiemann04172015Slide7.PNG

Slide 8 EPA EnviroAtlas Data: Map of Communities

http://enviroatlas.epa.gov/enviroatl...mmunities.html

BrandNiemann04172015Slide8.PNG

Slide 9 Geodatabases-to-Shape Files

BrandNiemann04172015Slide9.PNG

Slide 10 FME Workbench: National Metrics Log File

BrandNiemann04172015Slide10.PNG

Slide 11 FME Workbench: National Metrics GDB-to-SHP

http://www.safe.com/

BrandNiemann04172015Slide11.PNG

Slide 12 Data Science Data Publication: MindTouch Knowledge Base

Data Science for EPA Big Data Analytics

BrandNiemann04172015Slide12.PNG

Slide 13 Data Science Data Publication: Spreadsheet Knowledge Base

EPABigDataAnalytics.xlsx

BrandNiemann04172015Slide13.PNG

Slide 14 EPA EnviroAtlas National & Community Inventory

xlscurrentdata.xls

BrandNiemann04172015Slide14.PNG

Slide 15 Data Science Data Publication: Spotfire Cover Page

Web Player

BrandNiemann04172015Slide15.PNG

Slide 16 Data Science Data Publication: IRM Strategic Plan

Web Player

BrandNiemann04172015Slide16.PNG

Slide 17 Data Science Data Publication: IRM Strategic Plan Tables

Web Player

BrandNiemann04172015Slide17.PNG

Slide 18 Data Science Data Publication: EnviroAtlas Inventory National

Web Player

BrandNiemann04172015Slide18.PNG

Slide 19 Data Science Data Publication: EnviroAtlas Inventory Community

Web Player

BrandNiemann04172015Slide19.PNG

Slide 20 Data Science Data Publication: EnviroAtlas Inventory NatureServe

Web Player

BrandNiemann04172015Slide20.PNG

Slide 21 Data Science Data Publication: EnviroAtlas Inventory Land Cover

Web Player

BrandNiemann04172015Slide21.PNG

Slide 22 Conclusions and Recommendations

BrandNiemann04172015Slide22.PNG

Slide 23 Exploratory Data Science on Even Bigger Data

BrandNiemann04172015Slide23.png

Slide 24 Spotfire Data Tables and Relations

BrandNiemann04172015Slide24.png

Slide 25 Exploratory Data Science: BioMass by HUC12 1

BrandNiemann04172015Slide25.png

Slide 26 Exploratory Data Science: BioMass by HUC12 2

BrandNiemann04172015Slide26.png

Slide 27 Exploratory Data Science: Florida BenMap

BrandNiemann04172015Slide27.png

Slide 28 Exploratory Data Science: Florida BG_Pop

BrandNiemann04172015Slide28.png

Spotfire Dashboard

For Internet Explorer users and those wanting a full-screen display, use the Web Player or get the Spotfire for iPad app.

If the embedded dashboard does not display, use Google Chrome.

Slides

Slide 2 Silicon Valley to Washington

BrandNiemann04202015Slide2.PNG

Slide 3 Dhanurjay Patil in the News

http://en.wikipedia.org/wiki/DJ_Patil

BrandNiemann04202015Slide3.PNG

Slide 4 U.S. Data Chief Aims to Empower Citizens with Information

http://www.wsj.com/articles/white-ho...ion-1426024220

BrandNiemann04202015Slide4.PNG

Slide 5 Back-of-the-Napkin Math, with DJ Patil: #PiDay Challenge

https://www.whitehouse.gov/blog/2015/03/16/back-napkin-math-dj-patil-piday-challenge

BrandNiemann04202015Slide5.PNG

Slide 6 First White House Data Chief Discusses His Top Priorities

http://www.scientificamerican.com/article/first-white-house-data-chief-discusses-his-top-priorities/

BrandNiemann04202015Slide6.PNG

Slide 7 On the Case at Mount Sinai, It’s Dr. Data 1

http://nyti.ms/1Frc9B6

BrandNiemann04202015Slide7.PNG

Slide 8 On the Case at Mount Sinai, It’s Dr. Data 2

BrandNiemann04202015Slide8.PNG

Slide 9 Data Science for Natural Medicines and Epigenetics

http://www.meetup.com/Federal-Big-Data-Working-Group/events/221457330/

BrandNiemann04202015Slide9.PNG

Slide 10 Natural Medicine for Disease and Wellness Meetup

http://www.meetup.com/Natural-Medicine-for-Disease-and-Wellness-Meetup/

BrandNiemann04202015Slide10.PNG

Slide 14 EPA Big Data Analytics: Turning Data Into Value

BrandNiemann04202015Slide14.PNG

Slide 15 EPA's Ethan McMahon's Excellent Questions

If you have questions about EPA data analytics, please contact Richard Allen at allen.richard@epa.gov.

BrandNiemann04202015Slide15.PNG

Research Notes

Oregon Data

http://water.usgs.gov/GIS/huc.html

https://gdg.sc.egov.usda.gov/GDGHome_StatusMaps.aspx

https://gdg.sc.egov.usda.gov/Catalog...ption/WBD.html

ftp://ftp.ftw.nrcs.usda.gov/wbd/WBD_...ion_March2015/

IN PROCESS (EPA Ecosystem Data and Oregon Data)

http://www.deq.state.or.us/wq/tmdls/basinmap.htm
http://www.deq.state.or.us/wq/tmdls/deschutes.htm
http://www.deq.state.or.us/news/databases.htm
http://www.deq.state.or.us/lq/tanks/...blicLookup.asp
http://www.deq.state.or.us/lq/tanks/tankslists.htm
Underground Storage Tank Program
http://www.deq.state.or.us/news/databases.htm
http://www.deq.state.or.us/msd/gis/gis.htm
http://www.deq.state.or.us/wq/dwp/results.htm
http://aol.research.pdx.edu/
downloadable data
http://cfpub.epa.gov/ecotox/data_download.cfm
http://www.deq.state.or.us/programs/...EnfResults.asp
http://www.deq.state.or.us/wqTemp/Client478.csv

New EPA Report on Fracking

The EPA released its peer-reviewed analysis of over two years of data from the FracFocus Chemical Disclosure Registry 1.0. FracFocus is a publicly accessible website, managed by the Ground Water Protection Council (GWPC) and Interstate Oil and Gas Compact Commission, where oil and gas production well operators can disclose information about ingredients used in hydraulic fracturing fluids at individual wells.

Along with the report, the EPA is making available the:

  • EPA database developed from the FracFocus Registry 1.0 data and statistical codes used for analysis,
  • Selected tables from the report and the underlying data used to create each table in Microsoft Excel spreadsheets, and
  • State-level summaries of chemical and water used for hydraulic fracturing during the time of this study.

The FracFocus 1.0 data analysis will be combined with a broad review of existing literature, other EPA research, and input we’ve collected through our outreach efforts in the Draft Assessment of Potential Impacts of Hydraulic Fracturing on Drinking Water Resources. When final, the Assessment will provide a new lens to help our states and communities better understand the potential impacts on our drinking water resources from hydraulic fracturing.

Study: http://www2.epa.gov/hfstudy

Data: http://www2.epa.gov/hfstudy/epa-anal...acfocus-1-data

Published Scientific Papers: http://www2.epa.gov/hfstudy/publishe...entific-papers

EPA Analysis of FracFocus 1.0 Data: http://www2.epa.gov/hfstudy/epa-anal...acfocus-1-data

EPA Project Database Developed from FracFocus 1 Disclosures: http://www2.epa.gov/hfstudy/epa-proj...-1-disclosures (96 MB) My Note: Downloaded Access

Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1: http://www2.epa.gov/hfstudy/analysis...ure-registry-1 (4 xlsx - 23 MB) My Note: Downloaded

EPA State-level Summaries of FracFocus 1 Hydraulic Fracturing Data: http://www2.epa.gov/hfstudy/epa-stat...racturing-data (20 PDF)

EPA Analysis of FracFocus 1.0 Data Webinar Presentation: http://www2.epa.gov/hfstudy/epa-anal...r-presentation (PDF 2 MB) My Note: Downloaded

Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1.0: Data Management and Quality Assessment Report (PDF) (42 pp, 2 MB) My Note: Downloaded

Analysis of Hydraulic Fracturing Fluid Data from the FracFocus Chemical Disclosure Registry 1 (PDF): http://www2.epa.gov/hfstudy/analysis...registry-1-pdf My Note: Downloaded

EPA Analysis of FracFocus 1 Data Fact Sheet: http://www2.epa.gov/sites/production...e_fracfocu.pdf My Note: Downloaded

Data Sharing

https://github.com/jtleek/datasharin...wsltr_20150401

How to share data with a statistician

This is a guide for anyone who needs to share data with a statistician. The target audiences I have in mind are:

  • Scientific collaborators who need statisticians to analyze data for them
  • Students or postdocs in scientific disciplines looking for consulting advice
  • Junior statistics students whose job it is to collate/clean data sets

The goals of this guide are to provide some instruction on the best way to share data to avoid the most common pitfalls and sources of delay in the transition from data collection to data analysis. The Leek group works with a large number of collaborators and the number one source of variation in the speed to results is the status of the data when they arrive at the Leek group. Based on my conversations with other statisticians this is true nearly universally.

My strong feeling is that statisticians should be able to handle the data in whatever state they arrive. It is important to see the raw data, understand the steps in the processing pipeline, and be able to incorporate hidden sources of variability in one's data analysis. On the other hand, for many data types, the processing steps are well documented and standardized. So the work of converting the data from raw form to directly analyzable form can be performed before calling on a statistician. This can dramatically speed the turnaround time, since the statistician doesn't have to work through all the pre-processing steps first.

What you should deliver to the statistician

For maximum speed in the analysis this is the information you should pass to a statistician:

  1. The raw data.
  2. A tidy data set
  3. A code book describing each variable and its values in the tidy data set.
  4. An explicit and exact recipe you used to go from 1 -> 2,3

Let's look at each part of the data package you will transfer.

The raw data

It is critical that you include the rawest form of the data that you have access to. Here are some examples of the raw form of data:

  • The strange binary file your measurement machine spits out
  • The unformatted Excel file with 10 worksheets the company you contracted with sent you
  • The complicated JSON data you got from scraping the Twitter API
  • The hand-entered numbers you collected looking through a microscope

You know the raw data is in the right format if you:

  1. Ran no software on the data
  2. Did not manipulate any of the numbers in the data
  3. Did not remove any data from the data set
  4. Did not summarize the data in any way

If you did any manipulation of the data at all it is not the raw form of the data. Reporting manipulated data as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a forensic study of your data to figure out why the raw data looks weird.

The tidy data set

The general principles of tidy data are laid out by Hadley Wickham in this paper and this video. The paper and the video are both focused on the R package, which you may or may not know how to use. Regardless the four general principles you should pay attention to are:

  1. Each variable you measure should be in one column
  2. Each different observation of that variable should be in a different row
  3. There should be one table for each "kind" of variable
  4. If you have multiple tables, they should include a column in the table that allows them to be linked

While these are the hard and fast rules, there are a number of other things that will make your data set much easier to handle. First is to include a row at the top of each data table/spreadsheet that contains the full variable names. So if you measured age at diagnosis for patients, you would head that column with the name AgeAtDiagnosis instead of something like ADx or another abbreviation that may be hard for another person to understand.

Here is an example of how this would work from genomics. Suppose that for 20 people you have collected gene expression measurements with RNA-sequencing. You have also collected demographic and clinical information about the patients including their age, treatment, and diagnosis. You would have one table/spreadsheet that contains the clinical/demographic information. It would have four columns (patient id, age, treatment, diagnosis) and 21 rows (a row with variable names, then one row for every patient). You would also have one spreadsheet for the summarized genomic data. Usually this type of data is summarized at the level of the number of counts per exon. Suppose you have 100,000 exons, then you would have a table/spreadsheet that had 21 rows (a row for gene names, and one row for each patient) and 100,001 columns (one column for patient ids and one column for each exon).

If you are sharing your data with the collaborator in Excel, the tidy data should be in one Excel file per table. They should not have multiple worksheets, no macros should be applied to the data, and no columns/cells should be highlighted. Alternatively share the data in a CSV or TAB-delimited text file.
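
As a minimal sketch of the clinical/demographic table described above (with made-up values and a hypothetical file name), the tidy table and its CSV export might look like this in pandas:

# Sketch: one column per variable, one row per patient, full column names,
# exported as plain CSV rather than a multi-sheet Excel workbook.
import pandas as pd

clinical = pd.DataFrame({
    "PatientID":      ["P01", "P02", "P03"],
    "AgeAtDiagnosis": [54, 61, 47],                  # years
    "Treatment":      ["drugA_10mg", "drugB_5mg", "placebo"],
    "Diagnosis":      ["typeI", "typeII", "typeI"],
})

# One table per "kind" of variable; nothing depends on formatting or macros.
clinical.to_csv("clinical_tidy.csv", index=False)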

The code book

For almost any data set, the measurements you calculate will need to be described in more detail than you will sneak into the spreadsheet. The code book contains this information. At minimum it should contain:

  1. Information about the variables (including units!) in the data set not contained in the tidy data
  2. Information about the summary choices you made
  3. Information about the experimental study design you used

In our genomics example, the analyst would want to know what the unit of measurement for each clinical/demographic variable is (age in years, treatment by name/dose, level of diagnosis and how heterogeneous). They would also want to know how you picked the exons you used for summarizing the genomic data (UCSC/Ensembl, etc.). They would also want to know any other information about how you did the data collection/study design. For example, are these the first 20 patients that walked into the clinic? Are they 20 highly selected patients by some characteristic like age? Are they randomized to treatments?

A common format for this document is a Word file. It should have a section called "Study design" with a thorough description of how you collected the data, and a section called "Code book" that describes each variable and its units.

How to code variables

When you put variables into a spreadsheet there are several main categories you will run into depending on their data type:

  1. Continuous
  2. Ordinal
  3. Categorical
  4. Missing
  5. Censored

Continuous variables are anything measured on a quantitative scale that could be any fractional number. An example would be something like weight measured in kg. Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered. This could be, for example, survey responses where the choices are: poor, fair, good. Categorical data are data where there are multiple categories, but they aren't ordered. One example would be sex: male or female. Missing data are data that are missing and you don't know the mechanism. You should code missing values as NA. Censored data are data where you know something about the missingness mechanism, for example a measurement below a detection limit or a patient lost to follow-up. Code censored values as NA as well, add a column (for example, VariableNameCensored) that flags which values are censored, and explain in the code book why they are missing. You should not impute, make up, or throw away missing observations.

In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3. This will avoid potential mix-ups about which direction effects go and will help identify coding errors.

Always encode every piece of information about your observations using text. For example, if you are storing data in Excel and use a form of colored text or cell background formatting to indicate information about an observation ("red variable entries were observed in experiment 1.") then this information will not be exported (and will be lost!) when the data is exported as raw text. Every piece of data should be encoded as actual text that can be exported.
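
A minimal pandas sketch of this coding advice, with purely illustrative values and hypothetical column names: categorical and ordinal values stored as text, missing values coded as NA, and a separate TRUE/FALSE column flagging censored measurements.

# Sketch: text coding for categorical/ordinal variables, NA for missing
# values, and an explicit censoring flag column. Values are made up.
import pandas as pd

df = pd.DataFrame({
    "Sex":                 ["male", "female", "female"],   # categorical, as text
    "SelfRatedHealth":     ["poor", "good", "fair"],        # ordinal, as text (not 1, 2, 3)
    "WeightKg":            [81.2, None, 56.4],              # continuous; None becomes NA
    "LeadUgPerDl":         [3.1, 2.7, None],                # one value below detection limit
    "LeadUgPerDlCensored": [False, False, True],            # censoring flag as its own column
})

# Export as plain text so every piece of information survives outside Excel.
df.to_csv("tidy_with_coding.csv", index=False)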

The instruction list/script

You may have heard this before, but reproducibility is kind of a big deal in computational science. That means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate the analyses from raw data all the way to final results. If you are trying to be efficient, you will likely perform some summarization/data analysis steps before the data can be considered tidy.

The ideal thing for you to do when performing summarization is to create a computer script (in R, Python, or something else) that takes the raw data as input and produces the tidy data you are sharing as output. You can try running your script a couple of times and see if the code produces the same output.
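
A minimal sketch of such a script, assuming a hypothetical raw CSV and placeholder cleaning steps; the point is only that the entire path from raw data to tidy data is captured in runnable code:

# Sketch: read the raw file untouched, apply documented cleaning steps, and
# write the tidy data set. File names and steps are placeholders.
import pandas as pd

RAW_FILE = "raw_measurements.csv"
TIDY_FILE = "tidy_measurements.csv"

def make_tidy(raw_path, tidy_path):
    raw = pd.read_csv(raw_path)                            # load raw data as delivered
    tidy = raw.rename(columns={"ADx": "AgeAtDiagnosis"})   # use full variable names
    tidy = tidy.dropna(how="all")                          # drop completely empty rows
    tidy.to_csv(tidy_path, index=False)                    # write the tidy data set

if __name__ == "__main__":
    make_tidy(RAW_FILE, TIDY_FILE)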

In many cases, the person who collected the data has incentive to make it tidy for a statistician to speed the process of collaboration. They may not know how to code in a scripting language. In that case, what you should provide the statistician is something called pseudocode. It should look something like:

  1. Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
  2. Step 2 - run the software separately for each sample
  3. Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set

You should also include information about which system (Mac/Windows/Linux) you used the software on and whether you tried it more than once to confirm it gave the same results. Ideally, you will run this by a fellow student/labmate to confirm that they can obtain the same output file you did.
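
One simple way to confirm that two runs (or your run and a labmate's) produced exactly the same output file is to compare checksums; a small sketch with placeholder file names:

# Sketch: hash two output files and check that they are byte-for-byte identical.
import hashlib

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of("tidy_measurements.csv") == sha256_of("tidy_measurements_rerun.csv"))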

What you should expect from the analyst

When you turn over a properly tidied data set it dramatically decreases the workload on the statistician. So hopefully they will get back to you much sooner. But most careful statisticians will check your recipe, ask questions about steps you performed, and try to confirm that they can obtain the same tidy data that you did with, at minimum, spot checks.

You should then expect from the statistician:

  1. An analysis script that performs each of the analyses (not just instructions)
  2. The exact computer code they used to run the analysis
  3. All output files/figures they generated.

This is the information you will use in the supplement to establish reproducibility and precision of your results. Each of the steps in the analysis should be clearly explained and you should ask questions when you don't understand what the analyst did. It is the responsibility of both the statistician and the scientist to understand the statistical analysis. You may not be able to perform the exact analyses without the statistician's code, but you should be able to explain why the statistician performed each step to a labmate/your principal investigator.

Contributors

Jeff Leek - Wrote the initial version.
L. Collado-Torres - Fixed typos, added links.
Nick Reich - Added tips on storing data as text.

EPA OW and OEI policy

New long-term vision for TMDL issue in OW

http://water.epa.gov/lawsregs/lawsguidance/cwa/tmdl/ My Note: Need to study this

EPA IRM Strategic Plan

http://www2.epa.gov/aboutepa/about-office-environmental-information-oei My Note: See EPA IRM Strategic Plan

Center for Data Innovation Meetups with Open Data Leaders

EPA's session was in January. Video on YouTube

http://www.meetup.com/Open-Data-Leaders/ My Note: EPA Staff at Meetups!

Oregon data on watersheds and harmful algal blooms

If we can pull in other Oregon information on environment and natural resources, that would be helpful.  Land use change is one indicator of interest.  Type of land ownership is another factor.  EPA and Earthengine.google.org are huge of course.  

http://oregonstate.edu/inr/
Institute for Natural Resources
Oregon University System Institute
Oregon State University, Portland State University

These are for South Coast Basin to illustrate the issues.  The starting example is Tenmile Lakes.  However, the intent is to use statewide information.

Datasets and Reports for Oregon -- Harmful Algal Blooms and Watersheds

Focus: South Coast Basin

South Coast Basin is a management region in the Oregon Department of Environmental Quality (DEQ).

http://www.deq.state.or.us/wq/tmdls/basinmap.htm

Oregon DEQ Harmful Algal Bloom (HAB) Strategy 2011

http://www.deq.state.or.us/wq/algae/algae.htm#deq

Appendix C – Specific Waterbodies with HABs

Tenmile Lakes (Coos County) is the only South Coast Basin waterbody listed in Appendix C.

Assessing Oregon’s Basins: South Coast Basin Water Quality Status and Action Plan (Dec 2013)

http://www.deq.state.or.us/wq/watershed/watershed.htm

Executive Summary (pp. 1 – 5)

The context is that most of the land is forested with over half privately owned and managed under the Forest Practices Act. 

Human Health – Harmful Algal Blooms (pp. 92 – 99)

This section reviews information on watersheds, lakes, TMDLs and HAB Advisories. HAB toxins microcystin and anatoxin have appeared. Tenmile Lakes and Sru Lake have had HAB health advisories but algae problems are not restricted to them (Table 48).  Some lakes have not even been characterized.

Safe Drinking Water Act Implementation (pp. 155 – 164)

There are issues with contaminants in drinking water systems but not in relation to HABs in this area. 

TMDLs and Water Quality Implementation Plans (pp. 172 - 178)

The approach emphasizes coordination among existing plans and authorities.

Section 319 Grants – Nonpoint Source Pollution Control (pp. 185 – 186)

A majority of the funds have focused on “ground enhancement activities such as riparian enhancement and sediment abatement projects.”

South Coast Basin Action Matrix (pp. 188 – 193)

This summarizes water quality resource concerns and partners.   The activities of ecosystem function restoration may cut across concerns.

South Coast Basin TMDLs

(Refer also to TMDLs and Water Quality Implementation Plans in South Coast Basin Plan above.)

http://www.deq.state.or.us/wq/tmdls/southcoast.htm

Chetco Subbasin:  TMDL Initiated

Coos Subbasin:  TMDL Initiated

Tenmile Lakes Watershed: TMDL approved by USEPA on May 31, 2007

Focus on sediment and phosphorus for weed and algae problems

Activities: Sediment abatement; Riparian vegetation; Lakefront wetlands; Invasive weeds

Coquille Subbasin:

Coquille River & Estuary:  TMDL approved by USEPA on July 3, 1996

Focus on dissolved oxygen levels harming aquatic life

Activities:  Sewage treatment; “existing strategies” for nonpoint sources

Upper South Fork Coquille Watershed:  TMDL approved by USEPA on March 23, 2001

Focus on high stream temperature adversely affecting salmonids and trout

Activities: Oregon Plan to conserve and restore “functional elements of the ecosystem”

Sixes Subbasin:  TMDL Initiated

Garrison Lake:  TMDL approved by USEPA on October 7, 1988

Focus on total phosphorus affecting aesthetics and algal growth

Activities: Sewage effluent

Oregon Plan for Salmon & Watersheds

http://www.oregon.gov/opsw/pages/index.aspx

This coordinates the activities of many groups with the aim of restoring native fish populations and the aquatic systems that support them.

Oregon Watershed Enhancement Board (OWEB) – South Coast Basin

http://www.oregon.gov/OWEB/Pages/index.aspx

http://www.oregon.gov/OWEB/Pages/BiennialReport1315/South_Coast.aspx

http://www.oregon.gov/OWEB/Pages/BiennialReport1315/Ecosystem-Services.aspx

OWEB reports on the Oregon Plan for Salmon & Watersheds in the South Coast Basin.  There are some conservation and restoration activities with a focus on fish populations.  The ecosystem services report provides a statewide perspective on ecosystem services with stream restoration, Forest Resource Trust, Willamette Basin project, Eugene payments to landowners to preserve landscapes, water quality trading in Medford, Klamath Basin water quality improvement tracking, Circle Creek floodplain reconnection, fish passage banking, studies of blue carbon in Oregon estuaries, and sage grouse habitat mitigation.

Oregon Watershed Councils

http://www.oregonwatersheds.org/councils

These are community groups that use grants and volunteers for restoration.   There are several in the Southwest area that includes South Coast Basin.  Tenmile Lakes and Coos Watershed are two examples.

Tenmile Lakes Basin Partnership

http://tlbp.presys.com/

Projects include riparian planting (for shade), fencing (to exclude livestock), and fish passage.

Coos Watershed Association

http://www.cooswatershed.org/

Restoration activities include fish passage and in-stream habitat improvements, riparian area enhancement and road-related erosion control.

Partnership for Coastal Watersheds is another local group working on Coos Bay in connection with the South Slough National Estuarine Research Reserve.

http://www.partnershipforcoastalwatersheds.org/coos-bay-water-quality-monitoring-network/

Harmful Algal Bloom Advisories

Recreation

http://public.health.oregon.gov/HealthyEnvironments/Recreation/HarmfulAlgaeBlooms/Pages/Blue-GreenAlgaeAdvisories.aspx

In 2014, there were 10 total advisories in 6 regions plus a permanent advisory at South Umpqua River & Lawson Bar (Douglas County).  This included a HAB advisory for microcystin in Tenmile Lakes.

Drinking Water

http://public.health.oregon.gov/HealthyEnvironments/DrinkingWater/Operations/Treatment/Pages/algae.aspx

There is a list of seven basins of concern that does not include South Coast Basin.

Southwest Oregon health services cover Coos, Curry, Douglas, Lane, Jackson, and Josephine counties.

http://www.healthyoregon.com/  

Ideas for Big Data and Public Policy Article

Source: February 18, 2015, Email

Brand: I have been tossing around ideas for an EPA-related article for the Big Data and Public Policy special issue.  We would have Bob Hall, an ecologist in EPA Region 9 with whom I have been working on ecosystem function and watersheds.  We might also have a colleague Robin Schafer, who is a co-author on the ecosystem function work with a focus on applying ecosystem function to risk assessment.  Building on our other publications is important so that we have some actual data to present.

The idea is to link the local watershed studies of ecosystem function to the national picture of EnviroAtlas, the new EPA ORD map of ecosystem services.   Can we use EnviroAtlas to expand and accelerate studies of ecosystem function?  This would involve linking from local to national and then gaining insight on how to apply these tools in other local settings.  That's where we are.   It's not a plan yet, just an idea.  The ecosystem function work at EPA provides the local data.   You have the expertise on Big Data and EnviroAtlas.  I am also wondering how TMDL reports fit into this picture.  They provide information on where an ecosystem function approach might be useful for managing watersheds (especially for addressing nonpoint source pollution).

How does this look to you?  Suggestions?

Thanks for the followup on this.
 
EnviroAtlas Data: http://enviroatlas.epa.gov/enviroatl...oad/index.html
 
It is still not in an open format as they promised. Maybe your EPA colleagues can help move that along.
 
TMDL data is in good shape from our DataBay work.
 
I need to look at these ORD Ecosystem Databases to see if I can work with them: http://www2.epa.gov/eco-research/met...stems-research
 
Am I missing any databases that are relevant here?
 
We could also respond to Ethan McMahon’s recent request:
 
EPA is planning to stand up a big data analytics service within the agency. We’d appreciate ideas from the ESIP community in a few areas:

1. What problems have you tried to solve using data analytics and/or visualization?
2. Are there any strategies or best practices you used to manage data within or between enterprise data systems?
3. What techniques make sense for integrating large or varied data from multiple sources?
4. What technologies have you used and how did you select them?
5. Did you use any particular training resources for using big data analytics systems, and if so which ones?
6. What lessons would be helpful for us to learn as we set up this service?

We’re open to your ideas and we’re ready to share what we have learned. Please respond to me directly (mcmahon.ethan@epa.gov) or to the ESIP listserv as you deem appropriate.

Brand: Concerning EnviroAtlas, do you have access to the proprietary software to make use of the data?   If we can demonstrate the utility of the data, that would help the EPA staff get support for open data.  For the article deadline, we have to work with whatever is available right now.

The article should include a response to Ethan's request. We can cite it as a request made through ESIP, which gives another piece of the policy implementation.

I have to think more about TMDL data.  There might be some interest in looking at Toledo, Ohio and Lake Erie.

If you can get a list of databases of interest, then I can check with Bob to see how they relate to the local studies and what is missing.

Convert GPX to SHP with a Free Trial of FME Desktop

FME's data conversion software is used by over 10,000 customers to convert data into any of 335+ formats (including GPX and SHP), while preserving the highest levels of accuracy and data quality.

With your free, fully-functional 30-day trial you can use FME Desktop right now to:

  • Connect GPX to SHP or 325+ other formats, databases, and web services
  • Transform GPX data into the precise SHP data model you need
  • Automate GPX to SHP workflows for ongoing time savings

There's no risk, no obligation and no credit card required.

http://www.safe.com/c/gpx-to-shp/?ut...FZE0aQodQEcAUA
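
For readers without an FME license, the same GPX-to-SHP conversion can be sketched with open-source tools. The following is a minimal sketch using GeoPandas (which relies on GDAL's GPX and ESRI Shapefile drivers); the input and output file names are hypothetical placeholders, and this is not the FME workflow described above.

    # Sketch: convert a GPX track layer to an Esri Shapefile with GeoPandas.
    # "sample_ride.gpx" is a hypothetical input file.
    import geopandas as gpd

    # "tracks" is one of the standard GPX layers (others: waypoints, routes,
    # track_points, route_points).
    tracks = gpd.read_file("sample_ride.gpx", layer="tracks")

    # Shapefiles cannot store long field names or mixed geometry types, so a
    # real workflow may need to rename or drop columns first.
    tracks.to_file("sample_ride_tracks.shp", driver="ESRI Shapefile")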

As of February 2015, the EnviroAtlas is transitioning to a more recent version of the 12-digit HUCs; data aggregated to these new boundaries will be available soon.

I didn't have any of your problems when I tried to generate the attached workspace with the Pittsburgh data, and I had no trouble saving the data to Shape in a directory with spaces in its name while viewing the source FILEGDB data in the FME Data Inspector.

I suspect that installing FME as Administrator might cause these problems, but I am not certain about that. Can you uninstall FME and then try to install it again as yourself (your user account with admin privileges)? Please refer to the doc at http://docs.safe.com/fme/pdf/FME_Desktop_Admin_Guide.pdf. If you have any trouble installing FME or if you get unusual errors when trying to do this translation, please send me a screenshot of the error that you are getting.
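
As a cross-check on the FME translation, here is a minimal open-source sketch of the same geodatabase-to-shapefile export using Fiona and GeoPandas (GDAL's OpenFileGDB driver reads .gdb folders). The geodatabase path and output folder are hypothetical, and both deliberately contain spaces to mirror the issue discussed above.

    # Sketch: export every layer of an Esri File Geodatabase to Shapefiles.
    # Paths are hypothetical placeholders; both contain spaces on purpose.
    from pathlib import Path
    import fiona
    import geopandas as gpd

    gdb = Path("EnviroAtlas National Metrics.gdb")
    out_dir = Path("shapefiles out")
    out_dir.mkdir(exist_ok=True)

    for layer in fiona.listlayers(str(gdb)):
        gdf = gpd.read_file(gdb, layer=layer)
        # Shapefile field names are limited to 10 characters; GDAL truncates
        # them automatically but warns, which is worth checking in a real run.
        gdf.to_file(str(out_dir / f"{layer}.shp"), driver="ESRI Shapefile")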

Response and Invitation to EPA About Meetup

Megan, Hello. Your statement: “EPA did not request assistance from this meetup nor provide materials for it” is incorrect.

If you will look at the Wiki page I created for this meetup: http://semanticommunity.info/Data_Science/Data_Science_for_EPA_Big_Data_Analytics

you will see that both Ethan McMahon and Richard Allen did provide information and request feedback in open discussion forums, and that I have already provided feedback in those forums and by email.

In addition, you are unaware of the many EPA people who have provided EPA data to me over the past 5 years, since I left EPA as senior enterprise architect and data scientist, and who continue to provide data and conversations around it. I created over 50 data science products while at EPA and many more since leaving, using EPA data. We live in an age of open government data, and this is what is supposed to happen and is happening.

I would invite you and your staff to attend this meetup and receive feedback, as Ethan and Richard have requested, and you could even present/respond, as other senior government officials have done in our meetups.

After 30+ years of government service, I was asked by senior administration officials to continue the work I had done, and been honored for by OMB and the Federal CIO Council, by founding and co-organizing these meetups. As you can see, we have held almost 30 meetups with a number of agencies, including EPA.

I sincerely hope that you and your staff are able to attend, as we now have over 100 RSVPs, and several EPA staff have expressed to me personally their interest in attending and seeing the presentations posted.

Best regards, Brand

Dr. Brand Niemann

Director and Senior Data Scientist/Data Journalist

Semantic Community

http://semanticommunity.info

http://www.meetup.com/Federal-Big-Data-Working-Group/

http://www.meetup.com/Virginia-Big-Data-Meetup/

http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup

From: Carroll, Megan [mailto:Carroll.Megan@epa.gov]
Sent: Thursday, April 02, 2015 3:54 PM
To: bniemann@cox.net
Cc: McMahon, Ethan; Cooper, Geoff
Subject: For the record regarding Federal Big Data Working Group Meetup

Dear Mr. Niemann:

I am writing to set the record straight regarding the Federal Big Data Working Group Meetup scheduled for April 20, 2015 entitled “President's Chief Data Scientist and EPA Big Data Analytics.” EPA did not request assistance from this meetup nor provide materials for it.

EPA’s Office of Environmental Information, as the organization that is leading the agency’s data analytics effort, is concerned that you advertised and organized this meeting in a way that implies that EPA and you are coordinating on this topic, when in fact we are not. Your use of EPA’s name in the title implies that EPA has sought your input or requested that you organize this activity on EPA’s behalf, but that is not the case. It is inappropriate for you to present yourself and EPA in this fashion.

We found out about the meetup after it was advertised and we were surprised that our efforts, which are in their early phases, were being discussed in public by a third party. Please do not imply that you and EPA are coordinating in any future events unless you truly are.

Thank you.

Megan Carroll

Acting Director, Environmental Analysis Division

Office of Information Analysis and Access

Office of Environmental Information

202-564-2814 (desk)

EPA/WJC West, Room 5305A

Cc:  Geoff Cooper, US EPA Office of General Counsel

        Ethan McMahon, US EPA Office of Environmental Information

From: Brand Niemann [mailto:bniemann@cox.net]
Sent: Monday, March 30, 2015 4:44 AM
To: 'Patil, Dhanurjay'
Cc: David Portnoy; McMahon, Ethan; Taveau, Daniella
Subject: Slides for President's Chief Data Scientist and EPA Big Data Analytics on Mon, April 20

I modified the agenda and slides. Slide 13 poses three key questions for discussion:

  • Will Precision Medicine include Natural Medicine for Disease and Wellness?
  • How does Demand-Driven Open Data fit with DJ Patil’s Top Three Priorities?
  • Who will do the Responsible Data Science?

I will be presenting Data Science for EPA Big Data Analytics in support of the third question.

I welcome your feedback and responses beforehand to refine this.

Brand

From: Brand Niemann [mailto:bniemann@cox.net]
Sent: Sunday, March 29, 2015 10:58 PM
To: 'Patil, Dhanurjay'
Cc: David Portnoy (portnoy.david@gmail.com)
Subject: DRAFT Slides for President's Chief Data Scientist and EPA Big Data Analytics on Mon, April 20

DJ and David, My DRAFT slides to introduce DJ’s top three priorities, which we are working on, and the DDOD that you want us to work on, are at:

http://semanticommunity.info/@api/deki/files/32814/BrandNiemann04202015.pptx

I welcome your feedback and suggestions on how best to do this.

Thank you, Brand

From: Patil, Dhanurjay [mailto:Dhanurjay_A_Patil@ostp.eop.gov]
Sent: Saturday, March 28, 2015 7:15 PM
To: Brand Niemann
Subject: RE: RSVPs to President's Chief Data Scientist and EPA Big Data Analytics on Mon, April 20

I won’t be able to make this one, but could one later in the year.

From: Brand Niemann [mailto:bniemann@cox.net]
Sent: Saturday, March 28, 2015 12:55 PM
To: Patil, Dhanurjay
Cc: David Portnoy
Subject: RSVPs to President's Chief Data Scientist and EPA Big Data Analytics on Mon, April 20

DJ, FYI re over 100 RSVPs for you to present at our upcoming Meetup of the Federal Big Data Working Group on President's Chief Data Scientist and EPA Big Data Analytics

http://www.meetup.com/Federal-Big-Da...nts/220799665/

Please advise me of your availability on April 20th Thank you.

Brand Niemann

P.S. David Portnoy is also speaking.

From: info@meetup.com [mailto:info@meetup.com]
Sent: Saturday, March 28, 2015 10:43 AM
To: bniemann@cox.net
Subject: Vivien RSVPed to President's Chief Data Scientist and EPA Big Data Analytics on Mon, April 20

 

Latest RSVPs for

President's Chief Data Scientist and EPA Big Data Analytics

Monday, April 20, 2015 6:30PM

with Federal Big Data Working Group

95 Members and 7 guests are going to this Meetup

EPA’s Cross-Agency Data Analytics and Visualization Program

Source: http://www2.epa.gov/toxics-release-i...zation-program

EPA is at the beginning of a transformative stage in information management, where there are new and enhanced ways to gather data, conduct analysis, perform data visualization and use “big data” to explore and address environmental, business, and public policy challenges. Business intelligence tools, geospatial tools and visualization tools are also converging, providing opportunities for knowledge and insight from disparate data sources.

EPA will reap huge benefits from the ability to harness this power of data corporately across the enterprise to produce knowledge on demand and drive better environmental results. The Office of Information Analysis and Access intends to work with EPA programs and regional offices to create a central platform for data analytics to evaluate information across a wide range of data collections and break through the Agency’s stove-piped set of data systems, which do not make these associations easy or common today.

Work in this area will also fit with the Agency’s interests in advanced monitoring technology that leverages sensors, real time data and external data sources such as NASA satellites, or financial or health data in combination with EPA data sources. Expert systems and machine learning to target violations using data acquired through electronic reporting, as well as analytics with unstructured data and/or scattered data sets across the Agency are also envisioned as part of this new, emerging program at EPA.
From: Brand Niemann [mailto:bniemann@cox.net]
Sent: Tuesday, February 03, 2015 12:52 PM
To: 'Allen, Richard'
Cc: 'mcmahon.ethan@epa.gov'; joanaron@ymail.com
Subject: RE: [ESIP-all] Job Opp - EPA - Division Director Data Analytics

Richard, I seem to recall meeting you when I was the Senior Data Scientist and Enterprise Architect at EPA and OEI.

Since then I have done more data science and data analytics on EPA and other government data through the Federal Big Data Working Group Meetup, which I founded and co-organize.

You can see our work with EPA data in my Wiki at http://semanticommunity.info

Best wishes with this new activity.

Brand

P.S. I recently provided Ethan with suggestions on his new EPA Big Data program, which we discussed at our Meetup last night:

http://www.meetup.com/Federal-Big-Da...nts/218868025/

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
http://semanticommunity.info 
http://www.meetup.com/Federal-Big-Data-Working-Group/ 
http://www.meetup.com/Virginia-Big-Data-Meetup/ 
http://semanticommunity.info/Data_Sc...g_Group_Meetup 

From: ESIP-all [mailto:esip-all-bounces@lists.esipfed.org] On Behalf Of Allen, Richard via ESIP-all
Sent: Tuesday, February 03, 2015 11:35 AM
To: esip-all@lists.esipfed.org
Subject: [ESIP-all] Job Opp - EPA - Division Director Data Analytics

Hello ESIP Members,

The Environmental Protection Agency’s Office of Environmental Information is seeking a visionary leader with experience in advanced data analytics to lead our effort to develop an enterprise analytics program. We are looking for someone who has experience with IT and experience leading and managing an organization of around 20 – 30 people. The position has two advertisements: one for current and reinstatement-eligible federal employees and one for US citizens. Please note that this position closes for external applicants on Feb. 6 and for federal employees on Feb. 18.

Finally, I'd greatly appreciate your help getting the word out to the right candidates. Please forward to any friends and colleagues that may be interested. 

Federal Employees
https://www.usajobs.gov/GetJob/ViewDetails/392669200

Open to all US Citizens
https://www.usajobs.gov/GetJob/ViewDetails/392808600

Thanks,

Richard
_________________________________
Richard G. Allen PhD
Environmental Protection Specialist
Office of Information Analysis and Access
Office of Environmental Information
202-566-0624

Big Data Analytics Questions From EPA

Source: February 20, 2015, Email

Hi ESIP folks,

EPA is planning to stand up a big data analytics service within the agency. We’d appreciate ideas from the ESIP community in a few areas:

1. What problems have you tried to solve using data analytics and/or visualization?
2. Are there any strategies or best practices you used to manage data within or between enterprise data systems?
3. What techniques make sense for integrating large or varied data from multiple sources?
4. What technologies have you used and how did you select them?
5. Did you use any particular training resources for using big data analytics systems, and if so which ones?
6. What lessons would be helpful for us to learn as we set up this service?

We’re open to your ideas and we’re ready to share what we have learned. Please respond to me directly (mcmahon.ethan@epa.gov) or to the ESIP listserv as you deem appropriate.

By the way, these topics are related to the ESIP Earth Science Data Analytics Cluster, so feel free to track that group’s progress.

Thanks, Ethan

Ethan McMahon
US EPA
Office of Environmental Information
202-566-0359

The President Speaks At Hadoop World: Introduces DJ Patil as Nation’s First Chief Data Scientist

Feb 21, 2015 09:13 pm By Bob Gourley

Data science history was made on 19 Feb 2015. For the first time in the history of Hadoop World, the President of the United States gave a keynote. Dwell on that for a bit. This is huge. No matter what your politics are, you really have to agree: this is huge.
After the President's address, also see the overview provided by DJ Patil, the new Chief Data Scientist for the federal government.
See the video here at this link and embedded below.

On Demand Webinar Featuring New Federal Chief Data Scientist DJ Patil and Hilary Mason

Feb 24, 2015 06:08 pm By Bob Gourley

If you are an enterprise technologist working with infrastructure solutions, or a statistician, or one of those famed data scientists, you have probably already discovered who is responsible for your career advancement and education... You are. Others might help, but you must take charge of your career, including your self-education.

One of the most important things you can do to advance your career is to learn the lessons of others who have survived and thrived in a wide variety of situations. When it comes to data science and the use of new technologies to produce actionable information, there are few greater teachers than DJ Patil (data scientist for the federal government) and Hilary Mason (well-renowned data scientist and founder of Fast Forward Labs). I most strongly recommend taking advantage of every opportunity you can get to learn from them.

They are the co-authors of a book you can download for free from O'Reilly called Data Driven. So that is one great way right there.

They also recently participated in a moderated discussion with another great data scientist, Josh Wills of Cloudera.  You can listen to this discussion online here. It is loaded with great career advice, tips, techniques and success strategies relevant to the profession of data science and related disciplines of strategy, technology and statistics.

Sign up for the on-demand recording of the session here.

About DJ Patil:
Dr. DJ Patil has been the VP of Product at RelateIQ and the Data Scientist in Residence at Greylock Partners.

He has held a variety of roles in academia, industry, and government, including Chief Scientist, Chief Security Officer, and Head of Analytics and Data Teams at the LinkedIn Corporation. Additionally he has held a number of roles at Skype, PayPal, and eBay.

Data Driven: Creating a Data Culture

Source: http://www.oreilly.com/data/free/fil...ata-driven.pdf (PDF)

DJ Patil and Hilary Mason

DataDrivenCoverPage.png

Preface

Data Driven

by DJ Patil and Hilary Mason

Copyright © 2015 O’Reilly Media, Inc. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles ( http://safaribooksonline.com ). For more information, contact our corporate/institutional sales department:
800-998-9938 or corporate@oreilly.com .

Editor: Timothy McGovern
Copyeditor: Rachel Monaghan
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Rebecca Demarest
January 2015: First Edition

Revision History for the First Edition

2015-01-05: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Driven, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While the publisher and the author(s) have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author(s) disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-92119-7

[LSI]

Data Driven: Creating a Data Culture

The data movement is in full swing. There are conferences (Strata+Hadoop World), bestselling books (Big Data, The Signal and the Noise, Lean Analytics), business articles (“Data Scientist: The Sexiest Job of the 21st Century”), and training courses (An Introduction to Machine Learning with Web Data, the Insight Data Science Fellows Program) on the value of data and how to be a data scientist. Unfortunately, there is little that discusses how companies that successfully use data actually do that work. Using data effectively is not just about which database you use or how many data scientists you have on staff, but rather it’s a complex interplay between the data you have, where it is stored and how people work with it, and what problems are considered worth solving.

While most people focus on the technology, the best organizations recognize that people are at the center of this complexity. In any organization, the answers to questions such as who controls the data, who they report to, and how they choose what to work on are always more important than whether to use a database like PostgreSQL or Amazon Redshift or HDFS.

We want to see more organizations succeed with data. We believe data will change the way that businesses interact with the world, and we want more people to have access. To succeed with data, businesses must develop a data culture.

What Is a Data Scientist?

Culture starts with the people in your organization, and their roles and responsibilities. And central to a data culture is the role of the data scientist. The title data scientist has skyrocketed in popularity over the past five years. Demand has been driven by the impact on an organization of using data effectively. There are chief data scientists now in startups, in large companies, in nonprofits, and in government. So what exactly is a data scientist?

A data scientist doesn’t do anything fundamentally new. We’ve long had statisticians, analysts, and programmers. What’s new is the way data scientists combine several different skills in a single profession. The first of these skills is mathematics, primarily statistics and linear algebra. Most scientific graduate programs provide sufficient mathematical background for a data scientist.

Second, data scientists need computing skills, including programming and infrastructure design. A data scientist who lacks the tools to get data from a database into an analysis package and back out again will become a second-class citizen in the technical organization.

Finally, a data scientist must be able to communicate. Data scientists are valued for their ability to create narratives around their work. They don’t live in an abstract, mathematical world; they understand how to integrate the results into a larger story, and recognize that if their results don’t lead to action, those results are meaningless.

DataDrivenFigure1.png

In addition to these skills, a data scientist must be able to ask the right questions. That ability is harder to evaluate than any specific skill, but it’s essential. Asking the right questions involves domain knowledge and expertise, coupled with a keen ability to see the problem, see the available data, and match up the two. It also requires empathy, a concept that is neglected in most technical education programs.

The old Star Trek shows provide a great analogy for the role of the data scientist. Captain Kirk is the CEO. Inevitably, there is a crisis and the first person Kirk turns to is Spock, who is essentially his chief data officer. Spock’s first words are always “curious” and “fascinating”— he’s always adding new data. Spock not only has the data, but more importantly, he uses it to understand the situation and its context. The combination of data and context allows him to use his domain expertise to recommend solutions. This combination gives the crew a unique competitive advantage.

Does your organization have its version of a Spock in the boardroom? Or in another executive meeting? If the data scientists are isolated in a group that has no real contact with the decision makers, your organization’s leadership will suffer from a lack of context and expertise. Major corporations and governments have realized that they need a Spock on the bridge, and have created roles such as the chief data scientist (CDS) and chief data officer (CDO) to ensure that their leadership teams have data expertise. Examples include Walmart, the New York Stock Exchange, the cities of Los Angeles and New York, and even the US Department of Commerce and National Institutes of Health.

DataDrivenFigure2.png

Why have a CDO/CDS if the organization already has a chief technology officer (CTO) or a chief information officer (CIO)? First, it is important to establish the chief data officer as a distinct role; that’s much more important than who should report to whom. Second, all of these roles are rapidly evolving. Third, while these roles overlap, the primary measures of success for the CTO, CIO, and CDS/CDO are different. The CIO has a rapidly increasing set of IT responsibilities, from negotiating the “bring your own device” movement to supporting new cloud technologies. Similarly, the CTO is tasked with an increasing number of infrastructure-related technical responsibilities. The CDS/CDO is responsible for ensuring that the organization is data driven.

What Is a Data-Driven Organization?

The most well-known data-driven organizations are consumer Internet companies: Google, Amazon, Facebook, and LinkedIn. However, being data driven isn’t limited to the Internet. Walmart has pioneered the use of data since the 1970s. It was one of the first organizations to build large data warehouses to manage inventory across its business. This enabled it to become the first company to have more than $1 billion in sales during its first 17 years. And the innovation didn’t stop there. In the 1980s, Walmart realized that the quality of its data was insufficient, so to acquire better data it became the first company to use barcode scanners at the cash registers. The company wanted to know what products were selling and how the placement of those products in the store impacted sales. It also needed to understand seasonal trends and how regional differences impacted its customers. As the number of stores and the volume of goods increased, the complexity of its inventory management increased. Thanks to its historical data, combined with a fast predictive model, the company was able to manage its growth curve. To further decrease the time for its data to turn into a decision, it became the first large company to invest in RFID technologies. More recently it’s put efforts behind cutting-edge data processing technologies like Hadoop and Cassandra.

FedEx and UPS are well known for using data to compete. UPS’s data led to the realization that, if its drivers took only right turns (limiting left turns), it would see a large improvement in fuel savings and safety, while reducing wasted time. The results were surprising: UPS shaved an astonishing 20.4 million miles off routes in a single year.

Similarly, General Electric uses data to improve the efficiency of its airline engines. Currently there are approximately 20,000 airplanes operating with 43,000 GE engines. Over the next 15 years, 30,000 more engines are expected to be in use. A 1% improvement in efficiency would result in $30 billion in savings over the next 15 years. Part of its effort to attack these problems has been the new GEnx engine. Each engine weighs 13,740 pounds, has 4,000 parts with 18 fan blades spinning at 1,242 ft/sec, and has a discharge temperature of 1,325ºF. But one of the most radical departures from traditional engines is the amount of data that is recorded in real time. According to GE, a typical flight will generate a terabyte of data.

This data is used by the pilots to make better decisions about efficiencies, and by the airlines to find optimal flight paths as well as to anticipate potential issues and conduct preventative maintenance.

What about these data-driven organizations enables them to use data to gain a competitive advantage? In Building Data Science Teams, we said that a data-driven organization

acquires, processes, and leverages data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

Let’s break down the statement a little. The first steps in working with data are acquiring and processing. But it’s not obvious what it takes to do these regularly. The best data-driven organizations focus relentlessly on keeping their data clean. The data must be organized, well documented, consistently formatted, and error free. Cleaning the data is often the most taxing part of data science, and is frequently 80% of the work. Setting up the process to clean data at scale adds further complexity. Successful organizations invest heavily in tooling, processes, and regular audits. They have developed a culture that understands the importance of data quality; otherwise, as the adage goes, garbage in, garbage out.
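
To make the cleaning step concrete, here is a minimal pandas sketch of the kind of routine cleanup described above; the file and column names (raw_extract.csv, date, site_id, value) are hypothetical placeholders, not anything from the book or from EPA data.

    # Sketch: typical "80% of the work" cleaning steps on a raw extract.
    # File and column names are hypothetical placeholders.
    import pandas as pd

    raw = pd.read_csv("raw_extract.csv")

    clean = (
        raw.rename(columns=str.lower)                      # consistent column names
           .drop_duplicates()                              # remove repeated records
           .assign(date=lambda d: pd.to_datetime(d["date"], errors="coerce"))
           .dropna(subset=["date", "site_id"])             # drop rows missing keys
    )
    clean["value"] = pd.to_numeric(clean["value"], errors="coerce")

    # Document what was removed so the cleanup is auditable.
    print(f"kept {len(clean)} of {len(raw)} rows")
    clean.to_csv("clean_extract.csv", index=False)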

A surprising number of organizations invest heavily in processing the data, with the hopes that people will simply start creating value from it. This “if we build it, they will come” attitude rarely works. The result is large operational and capital expenditures to create a vault of data that rarely gets used. The best organizations put their data to use. They use the data to understand their customers and the nuances of their business. They develop experiments that allow them to test hypotheses that improve their organization and processes. And they use the data to build new products. The next section explains how they do it.

Democratizing Data

The democratization of data is one of the most powerful ideas to come out of data science. Everyone in an organization should have access to as much data as legally possible.

While broad access to data has become more common in the sciences (for example, it is possible to access raw data from the National Weather Service or the National Institutes of Health), Facebook was one of the first companies to give its employees access to data at scale. Early on, Facebook realized that giving everyone access to data was a good thing. Employees didn’t have to put in a request, wait for prioritization, and receive data that might be out of date. This idea was radical because the prevailing belief was that employees wouldn’t know how to access the data, incorrect data would be used to make poor business decisions, and technical costs would become prohibitive. While there were certainly challenges, Facebook found that the benefits far outweighed the costs; it became a more agile company that could develop new products and respond to market changes quickly. Access to data became a critical part of Facebook’s success, and remains something it invests in aggressively.

All of the major web companies soon followed suit. Being able to access data through SQL became a mandatory skill for those in business functions at organizations like Google and LinkedIn. And the wave hasn’t stopped with consumer Internet companies. Nonprofits are seeing real benefits from encouraging access to their data—so much so that many are opening their data to the public. They have realized that experts outside of the organization can make important discoveries that might have been otherwise missed. For example, the World Bank now makes its data open so that groups of volunteers can come together to clean and interpret it. It’s gotten so much value that it’s gone one step further and has a special site dedicated to public data.

Governments have also begun to recognize the value of democratizing access to data, at both the local and national level. The UK government has been a leader in open data efforts, and the US government created the Open Government Initiative to take advantage of this movement. As the public and the government began to see the value of making the data more open, governments began to catalog their data, provide training on how to use the data, and publish data in ways that are compatible with modern technologies. In New York City, access to data led to new Moneyball-like approaches that were more efficient, including finding “a five-fold return on the time of building inspectors looking for illegal apartments” and “an increase in the rate of detection for dangerous buildings that are highly likely to result in firefighter injury or death.” International governments have also followed suit to capitalize on the benefits of opening their data.

One challenge of democratization is helping people find the right data sets and ensuring that the data is clean. As we’ve said many times, 80% of a data scientist’s work is preparing the data, and users without a background in data analysis won’t be prepared to do the cleanup themselves. To help employees make the best use of data, a new role has emerged: the data steward. The steward’s mandate is to ensure consistency and quality of the data by investing in tooling and processes that make the cost of working with data scale logarithmically while the data itself scales exponentially.

What Does a Data-Driven Organization Do Well?

There’s almost nothing more exciting than getting access to a new data set and imagining what it might tell you about the world! Data scientists may have a methodical and precise process for approaching a new data set, but while they are clearly looking for specific things in the data, they are also developing an intuition about the reliability of the data set and how it can be used.

For example, one of New York’s public data sets includes the number of people who cross the city’s bridges each day. Let’s take just the data for the Verrazano-Narrows Bridge. You might imagine that this would produce a very predictable pattern. People commute during the week, and perhaps don’t on the weekends. And, in fact, we see exactly that for the first few months of 2012. We can ask a few straightforward questions. What’s the average number of commuters per day? How many people commuted on the least busy day? On the most busy day? But then something strange happens. There’s a bunch of missing data. What’s going on?

DataDrivenFigure3.png

A bit of digging around those dates will show you that there’s no conspiracy here: that data represents Hurricane Sandy, when the bridges and tunnels were deliberately closed. It also explains the spike that happens when the bridges reopened. You also see traffic drop sharply for the blizzard of February 2013. The data set is as simple as they come—it’s just one integer per day—and yet there’s a fascinating story hiding here.
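
A minimal pandas sketch of this kind of first-pass exploration follows; the file name and column names (verrazano_daily_crossings.csv, date, crossings) are hypothetical stand-ins for the public data set described above.

    # Sketch: first-pass exploration of a daily bridge-crossing count series.
    # File and column names are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("verrazano_daily_crossings.csv", parse_dates=["date"])
    s = df.set_index("date")["crossings"].asfreq("D")   # expose missing days as NaN

    print("average per day:", s.mean())
    print("least busy day:", s.idxmin(), s.min())
    print("busiest day:   ", s.idxmax(), s.max())

    # Gaps like the Hurricane Sandy closure show up as runs of missing values.
    missing_days = s[s.isna()]
    print("days with no data:", list(missing_days.index.date))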

When data scientists initially dive into a data set, they are not just assembling basic statistics, they’re also developing an intuition for any flaws in the data and any unexpected things the data might be able to explain. It’s not a matter of checking statistics off a list, but rather of building a mental model of what data says about the world.

The process is similar, though on a larger scale, for organizations with a data culture. One of the most important distinctions between organizations that are data driven and those that are not is how they approach hypothesis formulation and problem solving. Data-driven organizations all follow some variant of the scientific method, which we call the data scientific method:

  1. Start with data.
  2. Develop intuitions about the data and the questions it can answer.
  3. Formulate your question.
  4. Leverage your current data to better understand if it is the right question to ask. If not, iterate until you have a testable hypothesis.
  5. Create a framework where you can run tests/experiments.
  6. Analyze the results to draw insights about the question.

In 2009, Twitter was faced with a challenge. There was tremendous excitement about the service, but people were just not using it regularly: three out of four people would stop using it within two months. To solve the engagement problem, Twitter started by asking questions and looking at its current data. It found a number of surprising results. First, users who had used the service at least seven times in their first month were over 90% likely to return in subsequent months. For many organizations, identifying this magic number would be more than sufficient. But Twitter continued to study the data, and was well rewarded. Among users with high retention, it found that once a user followed 30 or more people, that person was almost certainly going to become a long-term user. The company continued to dig, and found that the nature of the people followed was also essential. Two-thirds of the people who new users followed were purely for content, but one-third had to follow the new users back.

Armed with these facts, the Twitter team was able to discover a solution that was counter to the conventional thinking about onboarding new users. Sites like Facebook and LinkedIn presented new users with an “address book importer” that would crawl the user’s email addresses. Next, a new user would see a page filled with suggestions for “people you may know.” Any other pages that might add “friction” to the user’s experience would cause a significant (20%+) number of users to abandon the onboarding.

The analysis showed that Twitter needed to (a) teach new users what a tweet was, (b) suggest accounts that had high-quality content segmented by categories (e.g., NFL, NBA, news sites), and then (c) suggest other users who were highly likely to follow someone once they knew that person was on Twitter. Implementing these ideas adds friction to the onboarding process by teaching users about the tweet; it also puts people a user is likely to interact with last. However, the result wasn’t a decrease in new users, but instead a 30% increase in people completing the experience and a 20% increase in long-term engagement!

In hindsight, the process and results almost look magical. They’re far from that; they represent dedicated adherence to the data scientific method. It took roughly 2.5 years to arrive at and test these results—and the process is nowhere near complete. Regular tests are ongoing to further improve what happens when new users arrive. The data scientific method never stops.

Twitter isn’t the only place that employs the data scientific method. Google is famous for testing hundreds of experiments a day to improve its search functionality. LinkedIn and Facebook are constantly conducting experiments to learn how to improve the experience of new users. Netflix is well known for testing and adjusting its entire experience to reduce the probability that users will cancel their subscriptions. It is using its data to make very costly investments into the types of shows that need to be created to keep its users engaged. The Obama campaign, in its record-breaking fundraising effort, did over 500 A/B tests over 20 months, resulting in increasing donation conversions by 49% and signup conversions by a whopping 161%.
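
As a minimal sketch of steps 5 and 6 of the data scientific method (run an experiment, analyze the result), the following analyzes a hypothetical onboarding A/B test with a two-proportion z-test from statsmodels; the counts are invented for illustration and are not the Twitter or Obama-campaign figures.

    # Sketch: analyzing a simple A/B test of an onboarding change with a
    # two-proportion z-test. Counts are invented for illustration only.
    from statsmodels.stats.proportion import proportions_ztest

    completed = [4_150, 3_800]    # users who finished onboarding (variant B, variant A)
    exposed   = [10_000, 10_000]  # users shown each variant

    z_stat, p_value = proportions_ztest(count=completed, nobs=exposed)
    lift = completed[0] / exposed[0] - completed[1] / exposed[1]

    print(f"absolute lift: {lift:.1%}, z = {z_stat:.2f}, p = {p_value:.4f}")
    if p_value < 0.05:
        print("difference is unlikely to be chance; consider shipping variant B")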

Managing Research

Once you have a sense of the problems that you’d like to tackle, you need to develop a robust process for managing research. Without a process, it’s easy to spend too much time on unimportant problems, and without a research-specific process, it’s easy to get drawn into the engineering world where research is not a priority.

Here’s a set of questions that can be asked about every data science problem. They provide a loose framework for managing a robust portfolio of research efforts with both short- and long-term rewards. For each research problem, we ask:

What is the question we’re asking?

It’s important to state the question in language that everyone in the company can understand. This is harder than you think it will be! Most companies have teams with diverse backgrounds, so it’s important to articulate clearly the question that you are addressing so that everyone in the organization can imagine why it might be relevant and useful.

How do we know when we’ve won?

Once you have defined the question, you need to define the metrics by which you will evaluate your answer. In many cases, these are quantitative metrics (e.g., cross-validation), but in some cases the metric may be qualitative, or even a “looks good to me.” Everyone on the team needs to be clear about what a success will look like.

Assuming we solve this problem perfectly, what will we build first?

This question is designed to assess the solution’s potential to impact your business. What capabilities will you have that you don’t have now? Is this an important problem to solve right now?

While you should always have a “first thing” in mind, we recommend coming up with further questions that you’ll be able to investigate once you have answered the first one. That way, you can manage both short- and long-term value.

If everyone in the world uses this, what is the impact?

What’s the maximum potential impact of this work? If it’s not inspiring, is it worth pursuing at all? It’s vital to make sure that data science resources are invested in projects that will have a significant impact on the business. There is no greater insult than “You’ve created an elegant solution to an irrelevant problem.”

What’s the most evil thing that can be done with this?

This question is a bit different! Don’t ask it if you work with people who enjoy doing evil. Instead, save this one for groups that are so lawful and good that they limit their thinking. By asking the team to imagine what their impact could be if you abandon all constraints, you allow for a conversation that will help you identify opportunities that you would otherwise miss, and refine good ideas into great ones. We don’t want to build “evil” products, but subversive thinking is a good way to get outside the proverbial box.

One of the challenges with data is the power that it can unleash for both good and bad, and data scientists may find themselves making decisions that have ethical consequences. It is essential to recognize that just because you can, doesn’t mean you should. It’s important to get outside input. When uncertain, we turn to well-regarded experts on privacy and legal matters (e.g., the Electronic Frontier Foundation).

Designing the Organization

In the last few years, a lot of attention has been focused on the celebrity data scientist. But data science isn’t about celebrities; it’s a team sport. While a single person who has access to data and knows how to use it can have a huge impact, relying on a single celebrity isn’t scalable. A culture that is dependent on one individual is fragile and won’t be sustainable. It’s more important to think about the composition of the team and how it should be organized.

Should the data team be centralized or decentralized? Should it be part of Engineering, a product group, or Finance, or should it be a separate organization? These are all important questions, but don’t focus on them at first. Instead, focus on whether you have the key ingredients that will allow the team to be effective. Here are some of the questions you should ask:

  • What are the short-term and long-term goals for data?
  • Who are the supporters and who are the opponents?
  • Where are conflicts likely to arise?
  • What systems are needed to make the data scientists successful?
  • What are the costs and time horizons required to implement those systems?

Ask these questions constantly. As the data culture emerges and gains sophistication, periodic restructuring will be necessary.

LinkedIn’s early data efforts were split between data scientists supporting the CFO (dashboards and basic reporting) and data scientists building products. When I (DJ) joined LinkedIn we had a debate about how we should structure the team. The conventional options were to build a team under the CEO, under the CTO, or in Engineering. We tried something different. We put the data team in the Product organization. First, we wanted this team to be able to drive and implement change while ensuring ownership and accountability. Second, we had a phenomenal Engineering team, and we realized that we could bring the Product and Engineering teams into better alignment through common DNA.

Over time, we realized that this model couldn’t support the speed at which we were growing (from 200 to 2,000 employees in under four years), so we decentralized the team. The unfortunate consequence was that data scientists would end up isolated, supporting a specific group. We tried many solutions, but eventually decided that a decentralized organization worked only when there were at least three data scientists supporting a given area.

All of these organizational changes took place as we implemented new technologies, built out our data warehouse, and grew our data operations. We constantly needed to rethink and reevaluate our organizational structure to provide the best career growth and impact. However, we always had one central tenet in mind: to grow a massive company, every part of the organization must be data driven. This means that the data would be fully democratized, and everyone would be sufficiently data proficient. Naturally, we would still need those with a specific skill set, but data would become an intrinsic skill and asset for every team.

Process

While most organizations focus on corporate structure, they give less attention to the processes and technology needed to build a data-driven culture. The next three sections outline some of the most essential processes and ways to evaluate technologies.

Figure 1-1. DILBERT

DataDrivenFigure4.png

© 2007 Scott Adams. Used By permission of UNIVERSAL UCLICK. All rights reserved.

Daily dashboard

Data-driven organizations look at their data every morning. Starting every day with a review of the data isn’t just a priority, it’s a habitual practice. The simplest way to review the data is by looking at dashboards that describe key metrics. These dashboards might be implemented by a spreadsheet that is emailed, or by a business intelligence application accessed through the Web.
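
As an illustration only, here is a bare-bones version of such a dashboard built from an event log with pandas and matplotlib; the file and column names (events.csv, timestamp, user_id) are hypothetical placeholders rather than anything prescribed by the authors.

    # Sketch: a minimal daily metrics "dashboard" built from an event log.
    # File and column names are hypothetical placeholders.
    import pandas as pd
    import matplotlib.pyplot as plt

    events = pd.read_csv("events.csv", parse_dates=["timestamp"])
    daily = events.groupby(events["timestamp"].dt.date).agg(
        active_users=("user_id", "nunique"),
        events=("user_id", "size"),
    )

    print(daily.tail(7))                      # the numbers reviewed each morning

    daily["active_users"].plot(title="Daily active users")
    plt.tight_layout()
    plt.savefig("daily_active_users.png")     # attach to the morning email or wiki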

There are two classic complaints around dashboards. The first is that they don’t contain enough data; the second is that there is too much data. How do you find the right balance? Here are some hints.

Data vomit

Don’t fall for the urge to add “just one more thing” to the dashboard. Adding more information at greater density creates data vomit. Data vomit is bad and leads to frustration. The data becomes intimidating and, as a result, is just ignored.

Time dependency

Put data on your dashboard only if you know what you will do if something changes. For example, if there is a significant change on an hourly dashboard, does someone’s pager go off? Does the appropriate team know what to investigate? Similarly, display the data in a form that allows action to be taken. If the dashboard contains a pie graph, will you be able to tell if there is a change? Instead of a single dashboard, create different dashboards that reflect different time scales. For example, some dashboards might be on an hourly scale, while others could be on a quarterly scale. These simple measures prevent your dashboard from turning into data vomit.

Value

Manage your dashboards instead of letting them manage you. Review them and ask whether they are still giving you value. If not, change them. It’s surprising how often people consider their dashboard “fixed” and unchangeable. It’s quite the opposite: the dashboard is a living entity that allows you to manage your organization. As the organization’s data sophistication increases, it’s likely that old ways of measuring the system will become too simplistic. Hence, those older measures should be replaced with newer ones.

Visual

Make your data look nice. It’s surprising how ugly most dashboards are. The font is too small or in a typeface that isn’t clearly readable. If there is a line graph, it looks like it came from the 1980s. Sometimes dashboards are put in 3D, or have a color palette with no real meaning. Turn the data you regularly look at into something that you’d like to look at.

Fatigue

Finally, watch out for “alert/alarm fatigue.” We like to create alerts when something changes. But if there are too many alarms, you create alarm fatigue: the team becomes desensitized to the alerts, because they’re occurring so often and they’re frequently meaningless. Review alarms and ask what actions are taken once the alarm is activated. Similarly, review false positives and false negatives to see if the alerting system can be improved. Don’t be afraid to remove an alert or an alarm if it isn’t serving a purpose. Some well-regarded teams carry pagers, so they can share the pain of unnecessary alarms. There is nothing like successive wake-up calls at 3 a.m. to motivate change.

As a general rule of thumb, we like to ask four questions whenever data is displayed (in a dashboard, a presentation, or in a product):

  • What do you want users to take away? In other words, what information do you want the user to walk away with? Is it that things are good or things are bad?
  • What action should you take? When presenting a result, ask what you want your audience to do. For example, if there is a problem with sales, and the recipient is the CEO, you might want the CEO to call the head of sales right away. You may not be able to convey this in the dashboard, but you certainly can discuss it in your data meetings.
  • How do you want the viewer to feel? Most effective organizations embrace putting emotion and narrative around the data. If the goal is to make people feel excited, use green. If the feeling is neutral, use black or blue. If you want to express concern or urgency, use yellow or red. Great data teams spend time and energy ensuring the narrative provides adequate context, is compelling, and is intellectually honest.
  • Finally, is the data display adding value regularly? If not, don’t be afraid to “prune” it. Removing items that are no longer valuable keeps the dashboard effective and actionable.

Metrics meetings

One of the biggest challenges an organization faces isn’t creating the dashboard, it’s getting people to spend time studying it. Many teams go to great lengths to create a dashboard, only to learn how rarely people use it. We’ve seen many attempts to force people to look at the dashboards, including automated delivery to mobile devices, human-crafted email summaries, and print copies of the dashboards left on the chairs of executives. Most of these techniques don’t work. When we used tracking codes to measure the open rates of these emails and to see who viewed the dashboards, the numbers were atrocious.

The model that I’ve found works best is inspired by Sustained Silent Reading, or SSR (a popular school reading program in the United States). Instead of assuming that people looked at the data on their own, we spent the first part of the meeting looking at the data as a group. During this time, people could ask questions that would help them understand the data. Everyone would then write down notes, circle interesting results, or otherwise annotate the findings. At the end of the reading period, time was dedicated to a discussion of that data.

I’ve seen dramatic results from this method. During the first few meetings, the conversation is focused on basic questions, but those are quickly replaced by deeper questions. The team begins to develop a common language for talking about the data. Even the sophistication of the data presented begins to change.

This process prevents data from being used as a weapon to push an agenda. Rather than jumping straight into decision making, we start with a conversation about the data. At the end of the conversation, we can then ask if we have enough information to make a decision. If the answer is yes, we can move forward. If it’s no, we can ask what it will take to make an informed decision. Finally, we ask if there is any reason to think that we should go against the data.

As a result, everyone becomes smarter. A discussion makes everyone more informed about the data and its different interpretations. It also limits mistakes. This kind of forum provides a safe environment for basic questions that might otherwise seem dumb (such as “what do the labels on the axis mean?”). Simple questions often expose flaws in the way data was collected or counted. It’s better to find the flaws before a decision has been made.

Second, the conversation disarms the data as a political weapon. All too often, Team 1 shows data defending its argument, and Team 2 shows similar but conflicting data for its argument. Who is right? Focusing the conversation on the data rather than the decision makes it possible to talk openly about how data was collected, counted, and presented. Both teams might be right, but their data may be addressing different issues. In this way, assessment through discussion wins, not the best graph.

One word of caution: don’t follow the data blindly. Being data driven doesn’t mean ignoring your gut instinct. This is what we call “letting the data drive you off a cliff.” Do a web search for “GPS” and “cliff ” and you’ll find that a surprising number of people actually crash their car when the GPS is giving them bad information. Think about that for a second. The windshield is huge relative to the screen size of the GPS. As a result, the data coming through to the driver is massive relative to the information that is output by the voice or the screen of the GPS. By likewise hyper-optimizing to a specific set of metrics, you too can drive (your business) off a cliff. Sometimes it is necessary to ignore the data. Imagine a company that is trying to determine which market to enter next. There is stronger user adoption in Market A, but in Market B there is a new competitor. Let’s suppose all the data says that the company should go after Market A. It still might be better to go after Market B. Why? The data can’t capture everything. And sometimes you have to trust your gut.

How can you prevent these kinds of catastrophic failures? First, regularly ask “are we driving off a cliff?” By doing so, you create a culture that challenges the status quo. When a person uses that phrase, it signals that it’s safe to challenge the data. Everyone can step back and take into account the broader landscape.

Standup and domain-specific review meetings

We’ve discussed a number of meetings that are needed as part of a data-driven culture. But we all know that endless, dull meetings kill creativity and independent thought. How do we get beyond that? I’ve found that it works to borrow processes and structures from companies that implement agile methods.

The first of these is the standup meeting. These are short meetings (often defined by the time a person is willing to stand) that are used to make sure everyone on the team is up to date on issues. Questions or issues become action items that are addressed outside of the standup meeting. At the next standup meeting, the action items are reviewed to see if they have been resolved, and if not, to determine when resolution is expected. Even if you’re not literally standing up, standup meetings should run no longer than 30 minutes. They’re a great way to start the day and enhance communication.

It’s a misconception that standup meetings and other ideas from the agile movement work only for high-tech Silicon Valley firms. Standup meetings are used to monitor situations in the US Department of Defense. The National Weather Service has a daily meeting where the weather forecasters gather (both in person and virtually) to raise any issues. The key to success in all of these forums is to be ruthlessly efficient during the meeting and to make sure issues raised (action items) are acted upon.

It’s also important for the data team to hold product review, design review, architecture review, and code review meetings. All of these meetings are forums where domain-specific expertise can provide constructive criticism, governance, and help. The key to making these meetings work is to make sure participants feel safe to talk about their work. During these meetings, definitions of metrics, methodologies, and results should be presented before being deployed to the broader organization.

Tools, Tool Decisions, and Democratizing Data Access

Data scientists are always asked about their tools: What tools do you use? How do I become an expert user of a particular tool? What’s the fastest way to learn Hadoop?

The secret of great data science is that the tools are almost irrelevant. An expert practitioner can do better work in the Bash shell environment than a nonexpert can do in R. Learning how to understand the problem, formulate an experiment and ask good questions, collect or retrieve data, calculate statistics or implement an algorithm, and verify the accuracy of the result is much more important than mastering the tools.

However, there are a few attributes of tools that both are timeless and enable stronger teamwork:

  • The best tools are powerful. They aren’t visual dashboards that offer a limited set of options, but Turing-complete programming languages. Powerful tools allow for unconventional and powerful analysis techniques.
  • The best tools are easy to use and learn. While tools should be powerful, it should be easy to understand how to apply them. With programming languages, you want to see tutorials, books, and strong communities.
  • The best tools support teamwork. These tools should make it easier to collaborate on analysis and to make data science work reproducible.
  • The best tools are beloved by the community. A tool that’s popular in a technical community will have many more resources supporting it than a proprietary one. It’ll be easier to find people who already use the tools your team uses, and it’ll give your employees the ability to demonstrate your company’s expertise by participating in that community.

We’ve stressed the democratization of data access. Democratization doesn’t come without costs, and may require rethinking your organization’s data practices and tools. Not that long ago, it was common for organizations with a lot of data to build a “data warehouse.” To create a data warehouse, you would build a very robust database that assumed the data types wouldn’t change very much over time. Access to the warehouse was restricted to those who were “sanctioned”; all others were required to go to them. This organizational structure created a data bottleneck. The tools in the data warehouse might have been powerful, but they weren’t easy to use, and certainly didn’t support teamwork or a larger data community. If you needed data, you had to go through the data bureaucracy, which meant that you’d get your data a day (or days) later; if you wanted data that didn’t fit into the warehouse’s predefined schema, you might be out of luck. Rather than teaching people to fish, data bureaucrats opted to create dashboards that had limited functionality outside of the key questions they were designed to answer.

Requiring users to go through a data bureaucracy to get access to data is sure to halt democratization. Don’t force the users who need data to go through channels; train them to get it themselves. One challenge of this approach is the ability to support a large number of users. Most data solutions are evaluated on speed—but when you’re supporting large numbers of users, raw speed may not be relevant. Almost anything will be faster than submitting a request through data warehouse staff. We made this mistake early in the process of building LinkedIn’s data solutions, and learned that the following criteria are more relevant than pure speed:

  • How well will the solution scale with the number of concurrent users?
  • How does it scale with the volume of data?
  • How does the price change as the number of users or the volume of data grows?
  • Does the system fail gracefully when something goes wrong?
  • What happens when there is a catastrophic failure?

Price is important, but the real driver is the ability to support the broadest possible set of users.
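
To make the first of those criteria concrete, here is a minimal load-test sketch; it assumes nothing about any particular product, and the run_query function is only a placeholder (it sleeps for 50 ms) to be replaced with a real client call to the system under evaluation.

    # Minimal load-test harness sketch: how does per-query latency behave as
    # the number of concurrent users grows? run_query() is a placeholder that
    # simulates a 50 ms query; swap in a real client call for the system
    # being evaluated.
    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    def run_query() -> float:
        start = time.perf_counter()
        time.sleep(0.05)                      # stand-in for a real query
        return time.perf_counter() - start

    def median_latency(users: int, queries_per_user: int = 20) -> float:
        with ThreadPoolExecutor(max_workers=users) as pool:
            futures = [pool.submit(run_query)
                       for _ in range(users * queries_per_user)]
            return statistics.median(f.result() for f in futures)

    if __name__ == "__main__":
        for users in (1, 10, 50, 100):
            print(f"{users:>3} users: median latency "
                  f"{median_latency(users) * 1000:.1f} ms")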

Consider what Facebook found when it first allowed everyone to access the data. Simply providing access to data didn’t help people make better decisions. People couldn’t find the data they needed, and there remained a huge gap in technical proficiency. Meanwhile, its data was growing at an unprecedented rate. To scale more efficiently, Facebook invested aggressively in tooling. One of its first investments was a new way to store, access, and interact with the data. With that came the realization that the company would need a new language that would be easier for a broader set of employees to use.

While Hadoop, the underlying technology, showed great promise, the primary language it supported, Pig, was not friendly for analysts, product managers, or anyone accustomed to languages like SQL. Facebook decided to invest in Hive, a SQL-like language better suited to Hadoop, and a unique tooling layer called HiPal that would be the primary GUI for Hive. HiPal allowed any user to see what others in the company were accessing. This unique form of transparency allowed a new user to get up to speed quickly by studying what other people on the team were requesting and then building on it.
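
As a rough illustration of why a SQL-like layer mattered, the sketch below shows the kind of HiveQL an analyst could write instead of a Pig or MapReduce job. The table and column names are invented, and the open-source PyHive client is used only as an example of submitting HiveQL; it is not HiPal, which was internal to Facebook.

    # Illustrative only: an analyst-friendly HiveQL query. Table and column
    # names are made up; PyHive is one open-source client for Hive.
    from pyhive import hive  # pip install "pyhive[hive]"

    QUERY = """
    SELECT country, COUNT(DISTINCT user_id) AS weekly_active_users
    FROM   page_views
    WHERE  view_date BETWEEN '2015-02-01' AND '2015-02-07'
    GROUP  BY country
    ORDER  BY weekly_active_users DESC
    LIMIT  20
    """

    conn = hive.Connection(host="hive.example.com", port=10000, username="analyst")
    cursor = conn.cursor()
    cursor.execute(QUERY)
    for country, wau in cursor.fetchall():
        print(country, wau)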

As powerful as HiPal was, it still was not enough to help users who were unfamiliar with coding or query languages. As a result, the Facebook data team started a series of training classes. These classes allowed the team to educate staff about HiPal and simultaneously share best practices. Combined with training, HiPal jumpstarted the data capabilities of Facebook’s teams and fostered a strong sense of data culture. It lowered the cost of data access and created the expectation that you needed data to support your business decisions. It was a major foundation for Facebook’s growth strategy and international expansion.

Creating Culture Change

We want organizations to succeed with data. But succeeding with data isn’t just a matter of putting some Hadoop in your machine room, or hiring some physicists with crazy math skills. Succeeding with data requires real cultural change. It requires learning how to have a discussion about the data—how to hear what the data might be saying rather than just enlisting it as a weapon in company politics. It requires spreading data through your organization, not just by adding a few data scientists (though they are critical to the process), but by enabling everyone in the organization to access the data and see what they can learn.

As Peter Drucker stated in Management Challenges for the 21st Century, “Everybody has accepted by now that change is unavoidable. But that still implies that change is like death and taxes—it should be postponed as long as possible and no change would be vastly preferable. But in a period of upheaval, such as the one we are living in, change is the norm.” Good data science gives organizations the tools to anticipate and stay on the leading edge of change. Building a data culture isn’t easy; it requires persistence and patience. You’re more likely to succeed if you start with small projects and build on their success than if you create a grandiose scheme. But however you approach it, building a data culture is the key to success in the 21st century.

The White House Names Dr. DJ Patil as the First U.S. Chief Data Scientist

Source: http://www.whitehouse.gov/blog/2015/...data-scientist

Megan Smith

February 18, 2015 

Today, I am excited to welcome Dr. DJ Patil as Deputy Chief Technology Officer for Data Policy and Chief Data Scientist here at the White House in the Office of Science and Technology Policy. President Obama has prioritized bringing top technical talent like DJ into the federal government to harness the power of technology and innovation to help government better serve the American people.

Across our great nation, we’ve begun to see an acceleration of the power of data to deliver value. From early open data work by the National Oceanic and Atmospheric Administration (NOAA), which provides data that enables weather forecasts to come directly to our mobile phones, to powering GPS systems that feed geospatial data to countless apps and services — government data has supported a transformation in the way we live today for the better.

DJ joins the White House following an incredible career as a data scientist — a term he helped coin — in the public and private sectors, and in academia. Most recently, DJ served as the Vice President of Product at RelateIQ, which was acquired by Salesforce. DJ also previously held positions at LinkedIn, Greylock Partners, Skype, PayPal, and eBay. Prior to his work in the private sector, DJ worked at the Department of Defense, where he directed new efforts to bridge computational and social sciences in fields like social network analysis to help anticipate emerging threats to the United States.

As a doctoral student and faculty member at the University of Maryland, DJ used open datasets published by NOAA to make major improvements in numerical weather forecasting. He holds a bachelor’s degree in mathematics from the University of California, San Diego, and a PhD in applied mathematics from the University of Maryland College Park. DJ has also authored a number of influential articles and books explaining the important current and potential applications of data science.

As Chief Data Scientist, DJ will help shape policies and practices to help the U.S. remain a leader in technology and innovation, foster partnerships to help responsibly maximize the nation’s return on its investment in data, and help to recruit and retain the best minds in data science to join us in serving the public. DJ will also work on the Administration’s Precision Medicine Initiative, which focuses on utilizing advances in data and health care to provide clinicians with new tools, knowledge, and therapies to select which treatments will work best for which patients, while protecting patient privacy.

As part of the CTO team, DJ will work closely with colleagues across government, including the Chief Information Officer and U.S. Digital Service. DJ’s work will also include providing data science leadership to build on the Administration’s momentum on open data and data science.

Over the past six years, the Obama administration has made historic progress in this area. In addition to making more than 138,000 data sets available to the public for innovation and entrepreneurship, the Administration is also empowering Americans with secure access to their personal data and expanding our capacity to process and examine large and complex datasets. Utilizing data for innovation holds amazing potential for the future of our country.

DJ’s work will help ensure government remains effective and innovative for the American public in our increasingly digital world. We welcome DJ to our team.

Megan Smith is the U.S. Chief Technology Officer.

Unleashing the Power of Data to Serve the American People

Source: https://medium.com/@WhiteHouse/unlea...e-198534d009a2

Memorandum: Unleashing the Power of Data to Serve the American People
To: The American People
From: Dr. DJ Patil, Deputy U.S. CTO for Data Policy and Chief Data Scientist
Date: February 20, 2015

Overview: What Is Data Science, and Why Does It Matter?

The data age has arrived. From crowd-sourced product reviews to real-time traffic alerts, “big data” has become a regular part of our daily lives. In 2013, researchers estimated that there were about 4 zettabytes of data worldwide: That’s approximately the total volume of information that would be created if every person in the United States took a digital photo every second of every day for over four months! The vast majority of existing data has been generated in the past few years, and today’s explosive pace of data growth is set to continue. In this setting, data science — the ability to extract knowledge and insights from large and complex data sets — is fundamentally important.
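
As a quick back-of-the-envelope check of that comparison (the population and photo-size figures below are assumptions, not numbers from the memo): with roughly 316 million U.S. residents and about 1 MB per photo, one photo per person per second does add up to a few months’ worth of 4 zettabytes.

    # Back-of-the-envelope check of the 4-zettabyte comparison.
    # Assumed inputs (not from the memo): ~316 million U.S. residents (2013)
    # and ~1 MB per digital photo.
    ZETTABYTE = 10 ** 21          # bytes
    total_data = 4 * ZETTABYTE    # estimated worldwide data volume in 2013
    us_population = 316e6
    photo_bytes = 1e6             # ~1 MB per photo

    seconds = total_data / (us_population * photo_bytes)
    months = seconds / (30 * 24 * 3600)
    print(f"~{months:.1f} months of one photo per person per second")
    # prints roughly 4.9 months, consistent with "over four months"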

While there is a rich history of companies using data to their competitive advantage, the disproportionate beneficiaries of big data and data science have been Internet technologies like social media, search, and e-commerce. Yet transformative uses of data in other spheres are just around the corner. Precision medicine and other forms of smarter health care delivery, individualized education, and the “Internet of Things” (which refers to devices like cars or thermostats communicating with each other using embedded sensors linked through wired and wireless networks) are just a few of the ways in which innovative data science applications will transform our future.

The Obama administration has embraced the use of data to improve the operation of the U.S. government and the interactions that people have with it. On May 9, 2013, President Obama signed Executive Order 13642, which made open and machine-readable data the new default for government information. Over the past few years, the Administration has launched a number of Open Data Initiatives aimed at scaling up open data efforts across the government, helping make troves of valuable data — data that taxpayers have already paid for — easily accessible to anyone. In fact, I used data made available by the National Oceanic and Atmospheric Administration to improve numerical methods of weather forecasting as part of my doctoral work. So I know firsthand just how valuable this data can be — it helped get me through school!

Given the substantial benefits that responsibly and creatively deployed data can provide to us and our nation, it is essential that we work together to push the frontiers of data science. Given the importance this Administration has placed on data, along with the momentum that has been created, now is a unique time to establish a legacy of data supporting the public good. That is why, after a long time in the private sector, I am returning to the federal government as the Deputy Chief Technology Officer for Data Policy and Chief Data Scientist.

Organizations are increasingly realizing that in order to maximize their benefit from data, they require dedicated leadership with the relevant skills. Many corporations, local governments, federal agencies, and others have already created such a role, which is usually called the Chief Data Officer (CDO) or the Chief Data Scientist (CDS). The role of an organization’s CDO or CDS is to help their organization acquire, process, and leverage data in a timely fashion to create efficiencies, iterate on and develop new products, and navigate the competitive landscape.

The Role of the First-Ever U.S. Chief Data Scientist

Similarly, my role as the U.S. CDS will be to responsibly source, process, and leverage data in a timely fashion to enable transparency, provide security, and foster innovation for the benefit of the American public, in order to maximize the nation’s return on its investment in data.

So what specifically am I here to do? As I start, I plan to focus on these four activities:

Providing vision on how to provide maximum social return on federal data.

https://soundcloud.com/whitehouse/01...ist-dr/s-IwA82

Creating nationwide data policies that enable shared services and forward-leaning practices to advance our nation’s leadership in the data age.

https://soundcloud.com/whitehouse/02...ist-dr/s-EupsN

Working with agencies to establish best practices for data management and ensure long-term sustainability of databases.

https://soundcloud.com/whitehouse/03...ist-dr/s-H6hil

Recruiting and retaining the best minds in data science for public service to address these data science objectives and act as conduits among the government, academia, and industry.

https://soundcloud.com/whitehouse/04...ist-dr/s-ab65N

As I work to fulfill these duties across the Administration, I’ll be focusing on several priority areas, including:

Precision medicine

Medical and genomic data provides an incredible opportunity to transition from a “one-size-fits-all” approach to health care towards a truly personalized system, one that takes into account individual differences in people’s genes, environments, and lifestyles in order to optimally prevent and treat disease. We will work through collaborative public and private efforts carried out under the President’s new Precision Medicine Initiative to catalyze a new era of responsible and secure data-based health care.

Usable data products

The President’s Executive Order 13642 on machine-readable data gives us a tremendous opportunity to productively connect unique data sets. The challenge is that open data is necessary, but not always sufficient, to create value and drive innovation. For example, the binary 0s and 1s that allow a computer to generate an MRI are of little use to a patient — it is the computationally rendered MRI image that communicates the information locked inside of that binary data. We will work to deliver not just raw datasets, but also value-added “data products” that integrate and usefully present information from multiple sources.

Responsible data science

We will work carefully and thoughtfully to ensure data science policy protects privacy and considers societal, ethical, and moral consequences.

Data will continue to transform the way we live and work. I am eager to get started as the first U.S. CDS, and I look forward to providing regular updates on our progress.

DJ Patil Scores the Sexiest Job in D.C.

Source: http://www.meritalk.com/blog.php?use...gentry_id=3873

Posted: 2/20/2015


By Bill Glanz

DJ Patil is not a turntable icon, spinning electronic dance music. He is the master of another medium – data. And last week he was named the nation’s first Chief Data Scientist.

In fact, Patil may have coined the term. He co-wrote a paper that appeared in the October 2012 Harvard Business Review, which referred to the data scientist as the sexiest job of the 21st Century.

“If ‘sexy’ means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills,” Patil and Thomas Davenport wrote.

In the same paper, Patil also wrote that he and Jeff Hammerbacher coined the term in 2008.

What Will Patil Do?

Putting data to work will be Patil’s principal function. John Podesta, a counselor to the president, said in a conference call a few weeks ago that Patil will be “looking at the way big data science can improve services across the board, starting particularly in the health care world,” according to a report by NextGov’s Jack Moore.

Patil outlined his ambitious new goals in a memo, “Unleashing the Power of Data to Serve the American People,” which he released Friday.

In the memo, Patil said he sees his role as leading the Federal charge “to responsibly source, process, and leverage data in a timely fashion to enable transparency, provide security, and foster innovation for the benefit of the American public, in order to maximize the nation’s return on its investment in data.”

Patil says he wants to focus on: providing vision on how to provide maximum social return on Federal data; creating nationwide data policies that enable shared services and forward-leaning practices to advance our nation’s leadership in the data age; working with agencies to establish best practices for data management and ensure long-term sustainability of databases; and recruiting and retaining the best minds in data science for public service to address these data science objectives and act as conduits among the government, academia and industry.

Agencies know data holds secrets that can help them improve service or understand what consumers want and need – but typically lack the tools and expertise needed to turn those ideas into practical reality.

Patil summed up the potential value of Big Data in his Harvard Business Review piece this way: “If your organization stores multiple petabytes of data, if the information most critical to your business resides in forms other than rows and columns of numbers, or if answering your biggest question would involve a ‘mashup’ of several analytical efforts, you’ve got a big data opportunity.”

Federal agencies are “more data-driven than most companies are right now,” Patil said at the Strata + Hadoop big data conference in San Jose last Thursday. “And that’s a bold statement. But from everything [I’ve seen] in the small period of time that I’ve been there, it’s absolutely true.” Perhaps nowhere is that challenge – and the potential for payoff – greater than in the public healthcare sector. Megan Smith, the nation’s chief technology officer, wrote on the White House blog last week that Patil’s healthcare focus will include working on the administration’s Precision Medicine Initiative, “which focuses on utilizing advances in data and health care to provide clinicians with new tools, knowledge, and therapies to select which treatments will work best for which patients, while protecting patient privacy.”

Silicon Valley to DoD 

Patil worked for a number of IT powerhouses, including Salesforce, eBay, Skype, PayPal and LinkedIn. He’s also done work for NOAA to improve weather forecasting and with the Defense Department, Smith said, “where he directed new efforts to bridge computational and social sciences in fields like social network analysis to help anticipate emerging threats to the United States.”

Patil joins Smith; VMware’s Tony Scott, the new U.S. chief information officer; Mikey Dickerson, who leads the U.S. Digital Service; and Alexander Macgillivray, the nation’s deputy CTO, as high-level Silicon Valley transplants who have been brought into government by the Obama administration to inject private-sector ingenuity into public-sector IT.

The federal government sits on the world’s biggest data treasure trove, much of it still walled off from public or industrial use. But the Obama administration has made some 138,000 data sets available since taking office, with more to come. The Data DJ will have his work cut out for him.


EPA Ecosystems Research

Source: http://www2.epa.gov/eco-research

EPA’s ecosystems research is working to protect ecosystems and the air and water resources that provide numerous benefits for humans and other living things.

EPA Methods, Models, Tools, and Databases

Source: http://www2.epa.gov/eco-research/met...stems-research

Methods

  • Biological Methods and Manual Development 
    EPA's research in stream and source monitoring indicators includes fish, macroinvertebrates, periphyton, zooplankton, functional ecosystem indicators, water and sediment toxicity, and fish tissue contaminants. EPA exposure scientists regularly prepare and update field and laboratory protocol and methods manuals. They also provide technical assistance to EPA regions, program offices and states on the implementation and interpretation of these manuals. This website lists currently available manuals and protocols.
  • Regional Vulnerability Assessment (ReVA)
    ReVA is an approach to regional scale, priority-setting assessment by integrating research on human and environmental health, ecorestoration, landscape analysis, regional exposure and process modeling, problem formulation, and ecological risk guidelines.

Models

  • Bioaccumulation in Aquatic Systems Simulator (BASS)
    BASS simulates population and bioaccumulation dynamics of age-structured fish communities. While BASS was designed to investigate bioaccumulation of chemicals within community or ecosystem contexts, it also allows EPA to evaluate various dimensions of fish health associated with non-chemical stressors. Accurate bioaccumulation estimates help predict realistic dietary exposures to humans and fish-eating wildlife.
  • Community Multi-scale Air Quality (CMAQ) Model
    CMAQ is an air quality model and software suite designed to model multiple pollutants at multiple scales. CMAQ allows regulatory agencies and state governments to evaluate the impact of air quality management decisions, and gives scientists the ability to probe, simulate, and understand chemical and physical interactions in the atmosphere.
  • Exposure Analysis Modeling System (EXAMS)
    EXAMS is a modeling system that supports development of aquatic ecosystem models for rapid evaluation of the fate, transport, and exposure concentrations of synthetic organic chemicals like pesticides, industrial materials, and leachates from disposal sites. The system is able to generate and summarize data critical for ecological risk assessments. Much of the data required for EXAMS to function has been collected historically. This allows data needs to be met for some projects without intensive field sampling.
  • Food and Gill Exchange of Toxic Substances (FGETS)
    FGETS predicts temporal dynamics of fish whole body concentration (ug chemical/(g live weight fish)) of non ionic, non metabolized, organic chemicals that are bioaccumulated from either water only or water and food jointly.
  • Framework for Risk Analysis of Multi-Media Environmental Systems (FRAMES)
    FRAMES takes collections of models and modeling tools and applies them to real world problems. It facilitates communication between models, supporting the passage of data that helps simulate complex environmental processes. The tool has been used in EPA assessments in support of the Hazardous Waste Identification Rule, which establishes contaminant concentration levels in industrial waste streams that are considered safe for disposal.
  • Markov Chain Nest Productivity Model (MCnest)
    The Markov Chain Nest Productivity Model (or MCnest) integrates existing toxicity information from three standardized avian toxicity tests with information on species life history and the timing of pesticide applications relative to the timing of avian breeding seasons, to quantitatively estimate the impact of pesticide-use scenarios on the annual reproductive success of bird populations. 
  • PRZM
    PRZM is a one-dimensional, finite-difference model that accounts for pesticide and nitrogen fate in the crop root zone. The software includes modeling capabilities for such phenomena as soil temperature simulation, volatilization and vapor phase transport in soils, irrigation simulation, microbial transformation, and a method of characteristics algorithm to eliminate numerical dispersion.
  • Regional Vulnerability Assessment (ReVA)
    ReVA is an approach to regional scale, priority-setting assessment by integrating research on human and environmental health, ecorestoration, landscape analysis, regional exposure and process modeling, problem formulation, and ecological risk guidelines.
  • SPARC Performs Automated Reasoning in Chemistry (SPARC)
    SPARC estimates chemical reactivity parameters and physical properties for a wide range of organic molecules. This information is needed to be able to predict the fate and transport of pollutants in the environment. SPARC is being designed to incorporate multiple mathematical approaches to estimate important chemical reactions and behavior. It will then interface directly with air, water, and land models to provide scientists with data that can inform risk assessments and help prioritize toxicity-testing requirements for regulated chemicals.
  • Storm Water Management Model (SWMM)
    SWMM is a hydrology and hydraulics model that aids in the design of green and grey stormwater infrastructure alternatives.
    Research question: How can urban stormwater and combined sewer overflows best be managed through some combination of conventional structural controls and non-structural BMPs and low-impact development controls?
  • Supercomputer for Model Uncertainty and Sensitivity Evaluation (SuperMUSE)
    SuperMUSE enhances quality assurance in environmental models and applications. With SuperMUSE, EPA can now better investigate new and existing uncertainty analysis (UA) and sensitivity analysis (SA) methods. EPA can also more easily achieve UA/SA of complex, Windows-based environmental models, allowing scientists to conduct analyses that have, to date, been impractical to consider.
  • Vadose zone LEACHing (VLEACH)
    VLEACH is a one-dimensional, finite difference model for making preliminary assessments of the effects on groundwater from the leaching of volatile, sorbed contaminants through the vadose zone. The program models four main processes: liquid-phase advection, solid-phase sorption, vapor-phase diffusion, and three-phase equilibration.
  • Virulo
    Virulo is a probabilistic screening model for predicting leaching of viruses in unsaturated soils. Monte Carlo sampling is used to generate ensemble simulations of virus attenuation, and the model reports the probability of failing to achieve a user-chosen degree of attenuation (a generic sketch of the Monte Carlo idea appears below).
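
    The toy sketch below illustrates only the general Monte Carlo screening pattern described here; the parameter distributions and the log-removal formula are invented for illustration and are not Virulo's actual model.

        # Generic Monte Carlo screening sketch (NOT Virulo's model): sample
        # uncertain parameters, compute a made-up log10 removal for each draw,
        # and report the probability of failing a chosen attenuation target.
        import random

        def simulated_log_removal() -> float:
            travel_time_days = random.lognormvariate(mu=3.5, sigma=0.5)
            decay_rate_per_day = random.uniform(0.1, 0.5)   # first-order die-off
            return decay_rate_per_day * travel_time_days / 2.303  # ln -> log10

        def probability_of_failure(target_log_removal: float, n: int = 100_000) -> float:
            failures = sum(simulated_log_removal() < target_log_removal
                           for _ in range(n))
            return failures / n

        print(f"P(failing 4-log attenuation) ≈ {probability_of_failure(4.0):.1%}")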

Tools

  • Causal Analysis/Diagnosis Decision Information System (CADDIS)
    CADDIS helps scientists and engineers conduct causal assessments in aquatic systems. It is organized into five volumes:
    • Volume 1: Stressor Identification provides a step-by-step guide for identifying probable causes of impairment in a particular system, based on the U.S. EPA's Stressor Identification process. If you are interested in conducting a complete causal assessment, learning about different types of evidence, or reviewing a history of causal assessment theory, start with this volume.
    • Volume 2: Sources, Stressors & Responses provides background information on many common sources, stressors, and biotic responses in stream ecosystems. If you are interested in viewing source- and stressor-specific summary information (e.g., for urbanization, physical habitat, nutrients, metals, pH and other stressors), start with this volume.
    • Volume 3: Examples & Applications provides examples illustrating different steps of causal assessments. If you are interested in reading completed causal assessment case studies, seeing how Stressor Identification worksheets are completed, or examining example applications of data analysis techniques, start with this volume.
    • Volume 4: Data Analysis provides guidance on the use of statistical analysis to support causal assessments. If you are interested in learning how to use data in your causal assessment, start with this volume.
    • Volume 5: Causal Databases provides access to literature databases and associated tools for use in causal assessments. If you are interested in applying literature-based evidence to your causal assessment, start with this volume.
  • Eco-Health Relationship Browser
    The Eco-Health Relationship Browser illustrates the linkages between human health and ecosystem services—benefits supplied by Nature. This interactive tool provides information about our nation's ecosystems, the services they provide, and how those services, or their degradation and loss, may affect people.
    Research question: What goods or services does a certain ecosystem provide and how do those goods and services affect human health and well-being?
  • Enhancing CMAQ Air-Surface Exchange for Ecosystems
    While EPA's Community Multiscale Air Quality (CMAQ) is a state of the science one-atmosphere model, it needs to have state-of-the-science connections to ecosystem models for multimedia analyses. The emphasis of this ecosystem research to enhance air-surface exchange capabilities in CMAQ is to add, modify, and evaluate CMAQ components for better ecosystem model linkage and land-use change study.  The research objective is to improve uni-directional and bi-directional exchange process descriptions in CMAQ and reduce uncertainty in the model estimates of dry and wet deposition, especially for sulfur and nitrogen compounds.  Bi-directional air-surface exchange parameterizations represent state-of-the-science capabilities for regional air quality modeling.  The objective is also to incorporate missing processes in CMAQ and to make the characterization of land use models more flexible.
  • Exposure Model for Soil-Organic Fate and Transport (EMSOFT)
    EMSOFT is used:
    • To determine concentrations of contaminants remaining in the soil over a given time (when the initial soil concentration is known);
    • To quantify the mass flux (rate of transfer) of contaminants into the atmosphere over time; and
    • To subsequently calculate contaminant air concentrations by inputting mass flux values into atmospheric dispersion models.
    EMSOFT can also calculate average chemical concentrations at a given depth over time.
  • Integrated Climate and Land-Use Scenarios (ICLUS)-Online
    Initial set of housing density scenarios to assist in integrated assessments of the impacts of climate and land-use change
  • Report on the Environment (ROE)
    ROE presents the best available indicators of national conditions and trends in air, water, land, human health, and ecological systems, addressing 23 questions EPA considers mission critical to protecting our environment and human health.
  • ReVA - Environmental Decision Toolkits under development and ReVA Tools and Projects 
    ReVA is designed to create the methods needed to understand a region's environmental quality and its spatial pattern. Impacts of human activities are not uniformly distributed across landscapes and regions (defined here as a multi-state area) and are often interacting in complex ways.
  • Risk Assessment Guidance & Tools
    EPA has developed help for assessing and managing environmental risks, including guidance and tools which are the models and databases used in risk assessments.
  • Sanitary Sewer Overflow Analysis & Planning (SSOAP) Toolbox
    The SSOAP toolbox is a suite of computer software tools used for quantification of rainfall-derived infiltration and inflow (RDII) and for facilitating capacity analysis of sanitary sewer systems. This toolbox includes the Storm Water Management Model Version 5 (SWMM5) for performing dynamic routing of flows through sanitary sewer systems.
  • Spatial Allocator
    The Spatial Allocator is used by the air quality modeling community to perform commonly needed spatial tasks without the use of a commercial Geographic Information System (GIS).
  • Spreadsheet-based Ecological Risk Assessment for the Fate of Mercury (SERAFM)
    SERAFM is a steady-state, process based mercury cycling model designed specifically to assist a risk assessor or researcher in estimating mercury concentrations in the water column, sediment, and fish tissue for a given water body for a specified watershed. SERAFM predicts mercury concentrations in these media for the species Hg0, HgII, and MeHg.
  • System for Urban Stormwater Treatment and Analysis Integration (SUSTAIN)
    SUSTAIN is a decision support system to facilitate selection and placement of Best Management Practices (BMPs) and Low Impact Development (LID) techniques at strategic locations in urban watersheds. It was developed to assist stormwater management professionals in developing implementation plans for flow and pollution control to protect source waters and meet water quality goals. From an understanding of the needs of the user community, SUSTAIN was designed for use by watershed and stormwater practitioners to develop, evaluate, and select optimal BMP combinations at various watershed scales on the basis of cost and effectiveness. 
    Research questions:
    • How effective are BMPs in reducing runoff and pollutant loadings?
    • What are the most cost-effective solutions for meeting water quality and quantity objectives?
    • Where, what type of, and how big should BMPs be?
  • Watershed Health Assessment Tools Investigating Fisheries (WHATIF) 
    WHATIF is software that integrates a number of calculators, tools, and models for assessing the health of watersheds and streams with an emphasis on fish communities. The toolset consists of hydrologic and stream geometry calculators, a fish assemblage predictor, a fish habitat suitability calculator, macro-invertebrate biodiversity calculators, and a process-based model to predict biomass dynamics of stream biota. WHATIF also supports screening analyses, such as prioritizing areas for restoration and comparing alternative watershed and habitat management scenarios.
  • Web-based Interspecies Correlation Estimation (WEB-ICE)
    WEB-ICE estimates acute toxicity to aquatic and terrestrial organisms for use in risk assessment.
  • Wildlife Contaminants Exposure Model (WCEM)
    WCEM estimates wildlife exposure to substances through inhalation and through ingestion of food, water, and soil in North American environments. It is suitable for any screening-level risk assessment exercise requiring an estimate of wildlife exposure to organic or inorganic compounds but can also support more detailed risk characterizations.
  • Wildlife Exposure Factors Handbook
    The Wildlife Exposure Factors Handbook provides data, references, and guidance for conducting exposure assessments for wildlife species exposed to toxic chemicals in their environment. The goals of this Handbook are
    • To promote the application of risk assessment methods to wildlife species,
    • To foster a consistent approach to wildlife exposure and risk assessments, and
    • To increase the accessibility of the literature applicable to these assessments.
  • Ubertool: Ecological Risk Web Application for Pesticide Modeling
    Ubertool supports ecological risk decisions.  The dashboard infrastructure integrates the processing of model results for over a dozen commonly-used EPA aquatic and terrestrial regulatory models and supporting datasets.

Databases

  • CADLit Database
    CADLit contains stressor-response information for multiple stressor exposures reported in the peer-reviewed scientific literature. As part of a causal analysis, CADLit can help:
    • To identify potential causes of impairment by providing information that may support or negate causal pathways in conceptual model diagrams.
    • To support or negate the contribution of a specific stressor to an impairment by providing qualitative and quantitative data from other studies of similar stressor scenarios.
  • Database of Sources of Environmental Releases of Dioxin-like Compounds in the United States
    The database is a repository of certain specific chlorinated dibenzo-p-dioxin/dibenzofuran (CDD/CDF) emissions data from all known sources in the US. The database contains information that can be analyzed to track emissions of CDD/CDF over time, compare specific profiles between and among source categories, and develop source specific emission factors that can then be used to develop emission estimates.
  • ECOTOX Databases (My Note: I downloaded this; a loading sketch appears after this list)
    ECOTOX is a comprehensive database, which provides information on adverse effects of single chemical stressors to ecologically relevant aquatic and terrestrial species. ECOTOX includes more than 400,000 test records covering 5,900 aquatic and terrestrial species and 8,400 chemicals. 
  • Environmental Decision Toolkits under development
    List of National, Regional, and Place Based ReVA toolkits in development
  • Health & Environmental Research Online (HERO)
    HERO provides access to scientific literature used to support EPA’s integrated science assessments.
    Research question: How can EPA increase transparency and provide access to the scientific literature behind scientific assessments?
  • Landscape Ecology Data Browsers
  • Regional Vulnerability Assessment (ReVA)
    • ReVA
      ReVA is an approach to regional scale, priority-setting assessment by integrating research on human and environmental health, ecorestoration, landscape analysis, regional exposure and process modeling, problem formulation, and ecological risk guidelines.
    • ReVA Data 
      ReVA uses EPA's Environmental Information Management System (EIMS) to manage its library of projects, data sets, models, and documents. The EIMS database is a comprehensive resource for persons interested in environmental information. By accessing EIMS, you will be able to browse and review ReVA's current data, meetings, projects, and documents.
    • ReVA - Environmental Decision Toolkits under development
      ReVA is designed to create the methods needed to understand a region's environmental quality and its spatial pattern. Impacts of human activities are not uniformly distributed across landscapes and regions (defined here as a multi-state area) and are often interacting in complex ways.
  • ReefLink Database
    This scientific and management information database utilizes systems thinking to describe the linkages between decisions, human activities, and provisioning of reef ecosystem goods and services.
  • Virtual Field Reference Database (VFRDB)
    The VFRDB provides in situ reference measurement data for statistically rigorous accuracy assessments of land-cover maps derived from satellite and airborne remote sensing platforms.
  • Watershed Deposition Tool Data
    Deposition components available from CMAQ
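
For the ECOTOX download noted above, here is a minimal pandas loading sketch; the file name, the pipe delimiter, and the example column name are assumptions about a typical delimited export rather than ECOTOX's documented schema.

    # Minimal sketch for exploring a downloaded ECOTOX export with pandas.
    # File name, delimiter, and column names are assumptions, not the
    # documented ECOTOX schema; inspect the real columns first.
    import pandas as pd

    records = pd.read_csv("ecotox_export.txt", sep="|", low_memory=False)
    print(records.shape)              # rows x columns in the export
    print(records.columns.tolist())   # check the actual field names

    # Example once the real column names are known (hypothetical column):
    # records.groupby("chemical_name").size().sort_values(ascending=False).head(20)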

Databases, GIS and Mapping Applications, Online Reporting

Source: http://www.deq.state.or.us/news/databases.htm

These databases are available for your review and are intended as a starting point in your research; they are not intended as a legal resource. The information and data within the databases are provided with the understanding that conclusions drawn from them are the responsibility of the user.

Databases and GIS/Mapping Applications

  • Air Quality Index: Daily index of air quality for specific locations in Oregon
  • Geographic Information Systems (GIS): Integrates hardware, software, and data for capturing, managing, analyzing, and displaying all forms of geographically referenced information
  • Enforcement Actions: Formal enforcement actions that include date, respondent name, location, inspector, penalty and violation
  • E-Cycles Collection Sites: Find an Oregon E-Cycles collection site. Oregon E-Cycles collection sites provide free recycling of computers, monitors and televisions to anyone bringing seven or fewer items for recycling at one time.
  • Enforcement Notice of NonCompliance: Includes name of alleged violator, location, violation data, and description of violation
  • Facility Profiler: Search for DEQ-regulated or permitted facilities and sites
  • Environmental Cleanup Site Information (ECSI): Sites in Oregon with known or potential contamination from hazardous substances
  • Laboratory Analytical Storage and Retrieval (LASAR): Air and water quality monitoring data
  • La Pine Demonstration Project: Field data and/or flow and usage data
  • Longitude Latitude Identification (LLID): Map tool for geographic information such as longitude and latitude, stream name associated with LLID, and river mile
  • Leaking Underground Storage Tank (LUST) Cleanup Sites: Search for sites that reported releases of petroleum products from underground storage tanks, including residential heating oil tanks
  • Oregon Incident Response Information System (OR-IRIS): Helps users understand the natural, physical and jurisdictional setting of a hazardous material release or emergency response. Arranged by functional/operational themes:
    • Transportation/Infrastructure
    • People at Risk
    • Water Resources Protection
    • Potential Toxic Sources
    • Incident Notification Groups
    • Emergency Response Resources
    • Wildlife/Habitat
    • Natural Resources and Hazards
    • Public Health
  • Onsite Septic System Installers and Pumpers: Database of licensed Onsite System Installer and Pumper businesses
  • Pacific Northwest Water Quality Data Exchange: Various water quality monitoring databases throughout the Northwest
  • Wastewater Permit Reports: Reports on National Pollutant Discharge Elimination System (NPDES) permits and Water Pollution Control Facility (WPCF) permits
  • Wastewater Permit Search: Search for a wastewater permit and water quality permit-related documents
  • Wastewater System Certified Operators: Search for Certified Wastewater System Operators
  • Wastewater System Operators Exams: Search the Operator Exams database
  • Underground Injection Control (UIC): UIC database of Class V wells

Online Reporting (for contractors' and generators' use only)

  • Hazardous Waste Reporting (HazWaste.net): Submit annual hazardous waste reports online
  • Online Petroleum Release Reporting: Report the source and cause of underground storage tank releases

EnviroAtlas: Fresno, CA and surrounding area

Source: http://enviroatlas.epa.gov/enviroatl...f/FresnoCA.pdf (PDF)

November 2014

EnviroAtlas combines maps, graphs, and other analysis tools, fact sheets, and downloadable data into an easy-to-use, web-based educational and decision-support tool. EnviroAtlas helps users understand the connections between the benefits we derive from ecosystem services and the natural resources that provide them. For more information, please visit our website.

For selected U.S. communities, EnviroAtlas includes fine-scale maps and other information about existing and potential benefits from the local natural environment. This fact sheet highlights some of the many community data layers available.

Background

The Fresno, CA area was chosen as an EnviroAtlas community because it offers several opportunities to leverage existing research and community engagement activities. It has also received poor air quality ratings that can be evaluated from a green infrastructure perspective. The EnviroAtlas boundary for the Fresno, CA area was determined using the 2010 Census definition of an Urban Area. It includes the greater Fresno area, as well as Clovis and other portions of Fresno County. The area measures 597 square kilometers, and encompasses 405 census block groups.

EnviroAtlasFresnoCAFigure2.png

EnviroAtlasFresnoCAFigure3.png

Fresno Area Demographics

  • Total population: 659,628
  • Under 13 years old: 21.05%
  • Over 70 years of age: 7.15%
  • Other than white/non-Hispanic: 64.33%
  • Below twice the poverty level: 44.47%

EnviroAtlasFresnoCAFigure1.png

The Fresno, CA area is in the Central California Valley ecoregion. It has a mild mid-latitude climate with long, hot, dry summers and slightly wet winters. The region was once vegetated with extensive grasslands but has mostly been converted into agriculture with some urban and suburban development. The leading industries in the area are agriculture and healthcare. Community and Saint Agnes Medical Centers, the City of Fresno, and Kaiser Permanente are among the largest employers in the area. The demographics of the Fresno community area indicate that the potential exists for income and other disparities in the distribution of environmental assets. EnviroAtlas includes demographic maps that can help screen for potential health and well-being disparities resulting from disproportionate distribution of “green infrastructure.”

Ecosystem Services Overview

In EnviroAtlas, ecosystem services are organized into seven benefit categories:

  • Clean air
  • Clean and plentiful water
  • Natural hazard mitigation
  • Climate stabilization
  • Recreation, culture, and aesthetics
  • Biodiversity conservation
  • Food, fuel, and materials (data available only in select communities)

EnviroAtlas helps users understand the connections between the benefits we receive from ecosystem services and the natural resources that provide them. Examples of some of the environmental data included in EnviroAtlas are detailed below:

Access to Parks

Parks provide access to green space, encourage physical activity, and improve the livability and aesthetics of urban areas. Those who live closer to parks may be more likely to receive multiple benefits associated with this proximity.

  • An estimated 14 percent of the Fresno area is located within easy walking distance (500 meters) of a park entrance (dark and medium green areas in figure at right).
  • An estimated 6 percent of the Fresno area does not have a park entrance within 5 kilometers (white areas in figure at right).

EnviroAtlasFresnoCAFigure4.png

Photo: Eric Vance, EPA

Stream and Lake Buffers

Natural land cover adjacent to streams and rivers, sometimes called the riparian area or zone, helps protect water quality and supply for drinking, recreation, and aquatic habitat. The EnviroAtlas community component analyzes stream and lake buffers in widths of 15 and 50 meters.

  • An estimated 1.8 percent of Fresno’s land area is within 50 meters of a stream or lake.
  • 14 percent of Fresno’s 50 meter stream and lake buffers are forested.
  • 17 percent of Fresno’s 50 meter stream and lake buffers are impervious.

In addition to the Clean and Plentiful Water benefit category, stream and lake buffer data layers can also be found under Biodiversity Conservation, Natural Hazard Mitigation, and Recreation, Culture and Aesthetics in the interactive map’s table of contents.
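
The sketch below shows one way a metric like "percent of land area within 50 meters of a stream or lake" can be computed with open-source GIS tools. It is not EnviroAtlas's actual workflow; the file names are placeholders, and both layers are assumed to be reprojected to a meter-based coordinate system (EPSG:5070, a CONUS Albers projection, is used here).

    # Sketch of a 50 m stream/lake buffer metric (not the EnviroAtlas workflow).
    # Input file names are placeholders; layers are reprojected to meters.
    import geopandas as gpd

    community = gpd.read_file("community_boundary.shp").to_crs(epsg=5070)
    waters = gpd.read_file("streams_and_lakes.shp").to_crs(epsg=5070)

    # Buffer all water features by 50 m and merge them into one geometry.
    buffer_50m = waters.buffer(50).unary_union
    community_geom = community.unary_union

    # Clip the buffer to the community boundary and compare areas.
    buffered_within = buffer_50m.intersection(community_geom)
    share = buffered_within.area / community_geom.area
    print(f"{share:.1%} of the community is within 50 m of a stream or lake")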


Municipalities within EnviroAtlas Boundaries

Source: http://enviroatlas.epa.gov/enviroatl...InBnd_List.pdf (PDF)

How are EnviroAtlas boundaries created?

EnviroAtlas community boundaries are derived from the 2010 US Census Bureau’s Urbanized Areas (UAs). The UAs are created using Census Blocks that “comprise a densely settled core of … blocks that meet minimum population density requirements, along with adjacent territory containing non-residential urban land uses as well as territory with low population density included to link outlying densely settled territory with the densely settled core.” UAs must have at least 50,000 people. Because the EnviroAtlas community component uses the Census Block Group (each comprised of 4-10 Census Blocks) as a unit of analysis, community boundaries were created from the UAs rather than reflecting the UAs themselves.

Block groups were initially included if 50% of their population was within the UA boundary. From there, all holes were filled and all islands were excluded. If a block group only touched the main body of block groups at one corner, it was excluded.
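
A rough sketch of that first selection rule follows (this is not the EnviroAtlas team's actual procedure): the file and column names are placeholders, and a block is counted as "within" the Urbanized Area when its centroid falls inside the UA polygon.

    # Approximate the 50% population rule described above (placeholder file
    # and column names; block membership approximated by block centroids).
    import geopandas as gpd

    blocks = gpd.read_file("census_blocks.shp")        # assumed POP10, BG_ID columns
    urban_area = gpd.read_file("urbanized_area.shp")   # assumed same CRS as blocks

    ua_geom = urban_area.unary_union
    blocks["in_ua"] = blocks.geometry.centroid.within(ua_geom)

    # Population inside the UA versus total population, per block group.
    totals = blocks.groupby("BG_ID")["POP10"].sum()
    inside = blocks[blocks["in_ua"]].groupby("BG_ID")["POP10"].sum()
    share_inside = (inside / totals).fillna(0)

    selected = share_inside[share_inside >= 0.5].index
    print(f"{len(selected)} block groups meet the 50% rule")
    # Hole-filling and island removal would follow as further geometry steps.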

The EnviroAtlas team then assessed each community boundary individually to ensure that it included all the area relevant to the principal community of focus and did not extend far beyond that community. Where available, the municipal boundaries were compared to the EnviroAtlas boundaries to ensure that the municipal core of the principal community was fully included.

Because EnviroAtlas boundaries are based on Census Urbanized Areas rather than municipal boundaries, they often contain many municipalities near the target city.

How do I use this document?

Page 2 of this document contains a table of contents listing each EnviroAtlas target community. Each community name is a link to that community’s page within this document. The additional municipalities within that community are listed on the community’s page along with an approximation of how much of each municipality is included.

In general, an “All” designation means that the entire municipality is within the EnviroAtlas boundary. “Most” means that approximately 50 to 99 percent of the municipality is within the EnviroAtlas boundary. “Part” means that approximately 0 to 50 percent of the municipality is within the EnviroAtlas boundary.
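
Those cutoffs amount to a simple threshold rule; the tiny helper below just restates them, with made-up percentages in the example.

    # Restates the All/Most/Part cutoffs described above; example percentages
    # are invented for illustration.
    def designation(pct_within_boundary: float) -> str:
        if pct_within_boundary >= 100:
            return "All"
        if pct_within_boundary >= 50:
            return "Most"
        return "Part"

    for name, pct in [("Municipality A", 100), ("Municipality B", 85), ("Municipality C", 30)]:
        print(name, designation(pct))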

Please note that these lists were created using the NavTEQ 2011 municipalities data layer. Because the data are not directly from the municipalities, the resulting municipal boundaries may not exactly reflect the actual municipal boundary. Any changes to the municipal boundary since 2011 are not included in this list.

In the figure to the left, Durham (yellow) and Chapel Hill (pink) are both given the “All” designation because they are completely within the EnviroAtlas boundary while Hillsborough (orange) and Carrboro (green) are both given “Most” because they go slightly beyond the EnviroAtlas boundary.
 

EnviroAtlasMUAsFigure1.png
