Data Science for NIST Big Data Framework

Table of contents
  1. Story
  2. Slides
    1. Slide 1 Data Science for NIST Big Data Framework
    2. Slide 2 Introduction
    3. Slide 3 Federal Big Data Working Group Meetup
    4. Slide 4 NIST Requests Comments on NIST Big Data Interoperability Framework
    5. Slide 5 NIST Big Data Interoperability Framework: Seven Volumes
    6. Slide 6 NIST Big Data Interoperability Framework: Three Stages
    7. Slide 7 Purpose
    8. Slide 8 Data Mining Standard Process
    9. Slide 9 Method and Results
    10. Slide 10 Data Mining Standard Results
    11. Slide 11 Data Science for NIST Big Data Framework: MindTouch Knowledge Base Index
    12. Slide 12 Data Science for NIST Big Data Framework: MindTouch Knowledge Base Find
    13. Slide 13 Data Science for NIST Big Data Framework: Spreadsheet Knowledge Base: Find
    14. Slide 14 Data Science for NIST Big Data Framework: Spreadsheet Knowledge Base: Other
    15. Slide 15 Data Science for NIST Big Data Framework: Spotfire Cover Page
    16. Slide 16 Data Science for NIST Big Data Framework: Spotfire Tab 1
    17. Slide 17 Data Science for NIST Big Data Framework: Spotfire Tab 2
    18. Slide 18 Data Science for NIST Big Data Framework: Spotfire Tab 3
    19. Slide 19 Data Science for NIST Big Data Framework: Spotfire Tab 4
    20. Slide 20 Conclusions and Recommendations
  3. Spotfire Dashboard
  4. Research Notes
    1. Comment Template for SP1500-x (replace x with volume number)
  5. NIST Requests Comments on NIST Big Data Interoperability Framework
  6. Definitions
    1. Cover Page
    2. Inside Cover Page
    3. National Institute of Standards and Technology NIST Special Publication 1500-1
    4. Reports on Computer Systems Technology
    5. Abstract
    6. Acknowledgements
    7. Notice to Readers
    8. Table of Contents
    9. Executive Summary
    10. 1 Introduction
      1. 1.1 Background
      2. 1.2 Scope and Objectives of the Definitions and Taxonomies Subgroup
      3. 1.3 Report Production
      4. 1.4 Report Structure
      5. 1.5 Future Work on this Volume
    11. 2 Big Data and Data Science Definitions
      1. 2.1 Big Data Definitions
      2. 2.2 Data Science Definitions
        1. Figure 1: Skills Needed in Data Science
      3. 2.3 Other Big Data Definitions
        1. Table 1: Sampling of Concepts Attributed to Big Data
    12. 3 Big Data Features
      1. 3.1 Data Elements and Metadata
      2. 3.2 Data Records and Non-Relational Models
      3. 3.3 Dataset Characteristics and Storage
      4. 3.4 Data in Motion
      5. 3.5 Data Science Lifecycle Model for Big Data
      6. 3.6 Big Data Analytics
      7. 3.7 Big Data Metrics and Benchmarks
      8. 3.8 Big Data Security and Privacy
      9. 3.9 Data Governance
    13. 4 Big Data Engineering Patterns (Fundamental Concepts)
    14. Appendix A: Index of Terms
      1. A
      2. B
      3. C
      4. D
      5. F
      6. M
      7. N
      8. O
      9. P
      10. R
      11. S
      12. T
      13. U
      14. V
    15. Appendix B: Terms and Definitions
    16. Appendix C: Acronyms
    17. Appendix D: References
      1. [1]
      2. [2]
      3. [3]
      4. [4]
      5. [5]
      6. [6]
      7. [7]
      8. [8]
      9. [9]
      10. [10]
      11. [11]
      12. [12]
      13. [13]
      14. [14]
      15. [15]
      16. [16]
      17. [17]
      18. [18]
      19. [19]
      20. [20]
      21. [21]
      22. [22]
  7. Taxonomies
    1. Cover Page
    2. Inside Cover Page
    3. National Institute of Standards and Technology Special Publication 1500-2
    4. Reports on Computer Systems Technology
    5. Abstract
    6. Acknowledgements
    7. Notice to Readers
    8. Table of Contents
    9. Executive Summary
    10. 1 Introduction
      1. 1.1 Background
      2. 1.2 Scope and Objectives of the Definitions and Taxonomies Subgroup
      3. 1.3 Report Production
      4. 1.4 Report Structure
      5. 1.5 Future Work on this Volume
    11. 2 Reference Architecture Taxonomy
      1. 2.1 Actors and Roles
        1. Figure 1: NIST Big Data Reference Architecture
        2. Figure 2: Roles and a Sampling of Actors in the NBDRA Taxonomy
      2. 2.2 System Orchestrator
        1. Figure 3: System Orchestrator Actors and Activities
      3. 2.3 Data Provider
        1. Figure 4: Data Provider Actors and Activities
      4. 2.4 Big Data Application Provider
        1. Figure 5: Big Data Application Provider Actors and Activities
      5. 2.5 Big Data Framework Provider
        1. Figure 6: Big Data Framework Provider Actors and Activities
      6. 2.6 Data Consumer
        1. Figure 7: Data Consumer Actors and Activities
      7. 2.7 Management Fabric
        1. Figure 8: Big Data Management Actors and Activities
      8. 2.8 Security and Privacy Fabric
        1. Figure 9: Big Data Security and Privacy Actors and Activities
    12. 3 Data Characteristic Hierarchy
        1. Figure 10: Data Characteristic Hierarchy
      1. 3.1 Data Elements
      2. 3.2 Records
      3. 3.3 Datasets
      4. 3.4 Multiple Datasets
    13. 4 Summary
    14. Appendix A: Acronyms
    15. Appendix B: References
      1. [1]
      2. [2]
      3. [3]
  8. Use Case & Requirements
    1. Cover Page
    2. Inside Cover Page
    3. National Institute of Standards and Technology Special Publication 1500-3
    4. Reports on Computer Systems Technology
    5. Abstract
    6. Acknowledgements
    7. Notice to Readers
    8. Table of Contents
    9. Executive Summary
    10. 1 Introduction
      1. 1.1 Background
      2. 1.2 Scope and Objectives of the Use Cases and Requirements Subgroup
      3. 1.3 Report Production
      4. 1.4 Report Structure
      5. 1.5 Future Work on this Volume
    11. 2 Use Case Summaries
      1. 2.1 Use Case Development Process
      2. 2.2 Government Operation
        1. 2.2.1 Use Case 1: Census 2010 and 2000—Title 13 Big Data
        2. 2.2.2 Use Case 2: NARA Accession, Search, Retrieve, Preservation
        3. 2.2.3 Use Case 3: Statistical Survey Response Improvement
        4. 2.2.4 Use Case 4: Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design)
      3. 2.3 Commercial
        1. 2.3.1 Use Case 5: Cloud Eco-System for Financial Industries
        2. 2.3.2 Use Case 6: Mendeley – An International Network of Research
        3. 2.3.3 Use Case 7: Netflix Movie Service
        4. 2.3.4 Use Case 8: Web Search
        5. 2.3.5 Use Case 9: Big Data Business Continuity and Disaster Recovery Within a Cloud Eco-System
        6. 2.3.6 Use Case 10: Cargo Shipping
          1. Figure 1: Cargo Shipping Scenario
        7. 2.3.7 Use Case 11: Materials Data for Manufacturing
        8. 2.3.8 Use Case 12: Simulation-Driven Materials Genomics
      4. 2.4 Defense
        1. 2.4.1 Use Case 13: Cloud Large-Scale Geospatial Analysis and Visualization
        2. 2.4.2 Use Case 14: Object Identification and Tracking from Wide-Area Large Format Imagery or Full Motion Video—Persistent Surveillance
        3. 2.4.3 Use Case 15: Intelligence Data Processing and Analysis
      5. 2.5 Health Care and Life Sciences
        1. 2.5.1 Use Case 16: Electronic Medical Record (EMR) Data
        2. 2.5.2 Use Case 17: Pathology Imaging/Digital Pathology
          1. Figure 2: Pathology Imaging/Digital Pathology—Examples of 2-D and 3-D Pathology Images
          2. Figure 3: Pathology Imaging/Digital Pathology
        3. 2.5.3 Use Case 18: Computational Bioimaging
        4. 2.5.4 Use Case 19: Genomic Measurements
        5. 2.5.5 Use Case 20: Comparative Analysis for Metagenomes and Genomes
        6. 2.5.6 Use Case 21: Individualized Diabetes Management
        7. 2.5.7 Use Case 22: Statistical Relational Artificial Intelligence for Health Care
        8. 2.5.8 Use Case 23: World Population-Scale Epidemiological Study
        9. 2.5.9 Use Case 24: Social Contagion Modeling for Planning, Public Health, and Disaster Management
        10. 2.5.10 Use Case 25: Biodiversity and LifeWatch
      6. 2.6 Deep Learning and Social Media
        1. 2.6.1 Use Case 26: Large-Scale Deep Learning
        2. 2.6.2 Use Case 27: Organizing Large-Scale, Unstructured Collections of Consumer Photos
        3. 2.6.3 Use Case 28: Truthy—Information Diffusion Research from Twitter Data
        4. 2.6.4 Use Case 29: Crowd Sourcing in the Humanities as Source for Big and Dynamic Data
        5. 2.6.5 Use Case 30: CINET—Cyberinfrastructure for Network (Graph) Science and Analytics
        6. 2.6.6 Use Case 31: NIST Information Access Division—Analytic Technology Performance Measurements, Evaluations, and Standards
      7. 2.7 The Ecosystem for Research
        1. 2.7.1 Use Case 32: DataNet Federation Consortium (DFC)
          1. Figure 4: DataNet Federation Consortium DFC – iRODS Architecture
        2. 2.7.2 Use Case 33: The Discinnet Process
        3. 2.7.3 Use Case 34: Semantic Graph Search on Scientific Chemical and Text-Based Data
        4. 2.7.4 Use Case 35: Light Source Beamlines
      8. 2.8 Astronomy and Physics
        1. 2.8.1 Use Case 36: Catalina Real-Time Transient Survey: A Digital, Panoramic, Synoptic Sky Survey
          1. Figure 5: Catalina CRTS: A Digital, Panoramic, Synoptic Sky Survey
        2. 2.8.2 Use Case 37: DOE Extreme Data from Cosmological Sky Survey and Simulations
        3. 2.8.3 Use Case 38: Large Survey Data for Cosmology
        4. 2.8.4 Use Case 39: Particle Physics—Analysis of Large Hadron Collider Data: Discovery of Higgs Particle
          1. Figure 6: Particle Physics: Analysis of LHC Data: Discovery of Higgs Particle – CERN LHC Location
          2. Figure 7: Particle Physics: Analysis of LHC Data: Discovery of Higgs Particle – The Multi-tier LHC Computing Infrastructure
        5. 2.8.5 Use Case 40: Belle II High Energy Physics Experiment
      9. 2.9 Earth, Environmental, and Polar Science
        1. 2.9.1 Use Case 41: EISCAT 3D Incoherent Scatter Radar System
          1. Figure 8: EISCAT 3D Incoherent Scatter Radar System – System Architecture
        2. 2.9.2 Use Case 42: ENVRI, Common Operations of Environmental Research Infrastructure
          1. Figure 9: ENVRI Common Architecture
          2. Figure 10(a): ICOS Architecture
          3. Figure 10(b): LifeWatch Architecture
          4. Figure 10(c): EMSO Architecture
          5. Figure 10(d): EURO-Argo Architecture
          6. Figure 10(e): EISCAT 3D Architecture
        3. 2.9.3 Use Case 43: Radar Data Analysis for the Center for Remote Sensing of Ice Sheets
          1. Figure 11: Typical CReSIS Radar Data After Analysis
          2. Figure 12: Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets
          3. Figure 13: Typical echogram with detected boundaries
        4. 2.9.4 Use Case 44: Unmanned Air Vehicle Synthetic Aperture Radar (UAVSAR) Data Processing, Data Product Delivery, and Data Services
          1. Figure 14: Combined Unwrapped Coseismic Interferograms
        5. 2.9.5 Use Case 45: NASA Langley Research Center/ Goddard Space Flight Center iRODS Federation Test Bed
        6. 2.9.6 Use Case 46: MERRA Analytic Services (MERRA/AS)
          1. Figure 15: Typical MERRA/AS Output
        7. 2.9.7 Use Case 47: Atmospheric Turbulence – Event Discovery and Predictive Analytics
          1. Figure 16: Typical NASA Image of Turbulent Waves
        8. 2.9.8 Use Case 48: Climate Studies Using the Community Earth System Model at the U.S. Department of Energy (DOE) NERSC Center
        9. 2.9.9 Use Case 49: DOE Biological and Environmental Research (BER) Subsurface Biogeochemistry Scientific Focus Area
        10. 2.9.10 Use Case 50: DOE BER AmeriFlux and FLUXNET Networks
      10. 2.10 Energy
        1. 2.10.1 Use Case 51: Consumption Forecasting in Smart Grids
    12. 3 Use Case Requirements
      1. 3.1 Use Case Specific Requirements
      2. 3.2 General Requirements
    13. Appendix A: Use Case Study Source Materials
      1. NBD-PWG Use Case Studies Template
        1. Comments on fields
      2. Submitted Use Case Studies
        1. Government Operation> Use Case 1: Big Data Archival: Census 2010 and 2000
        2. Government Operation> Use Case 2: NARA Accession, Search, Retrieve, Preservation
        3. Government Operation> Use Case 3: Statistical Survey Response Improvement
        4. Government Operation> Use Case 4: Non Traditional Data in Statistical Survey
        5. Commercial> Use Case 5: Cloud Computing in Financial Industries
        6. Commercial> Use Case 6: Mendeley—An International Network of Research
        7. Commercial> Use Case 7: Netflix Movie Service
        8. Commercial> Use Case 8: Web Search
        9. Commercial> Use Case 9: Cloud-based Continuity and Disaster Recovery
        10. Commercial> Use Case 10: Cargo Shipping
        11. Commercial> Use Case 11: Materials Data
        12. Commercial> Use Case 12: Simulation Driven Materials Genomics
        13. Defense> Use Case 13: Large Scale Geospatial Analysis and Visualization
        14. Defense> Use Case 14: Object Identification and Tracking – Persistent Surveillance
        15. Defense> Use Case 15: Intelligence Data Processing and Analysis
        16. Healthcare and Life Sciences> Use Case 16: Electronic Medical Record (EMR) Data
        17. Healthcare and Life Sciences> Use Case 17: Pathology Imaging/Digital Pathology
        18. Healthcare and Life Sciences> Use Case 18: Computational Bioimaging
        19. Healthcare and Life Sciences> Use Case 19: Genomic Measurements
        20. Healthcare and Life Sciences> Use Case 20: Comparative Analysis for (meta) Genomes
        21. Healthcare and Life Sciences> Use Case 21: Individualized Diabetes Management
        22. Healthcare and Life Sciences> Use Case 22: Statistical Relational AI for Health Care
        23. Healthcare and Life Sciences> Use Case 23: World Population Scale Epidemiology
        24. Healthcare and Life Sciences> Use Case 24: Social Contagion Modeling
        25. Healthcare and Life Sciences> Use Case 25: LifeWatch Biodiversity
        26. Deep Learning and Social Media> Use Case 26: Large-scale Deep Learning
        27. Deep Learning and Social Media> Use Case 27: Large Scale Consumer Photos Organization
        28. Deep Learning and Social Media> Use Case 28: Truthy Twitter Data Analysis
        29. Deep Learning and Social Media> Use Case 29: Crowd Sourcing in the Humanities
        30. Deep Learning and Social Media> Use Case 30: CINET Network Science Cyberinfrastructure
        31. Deep Learning and Social Media> Use Case 31: NIST Analytic Technology Measurement and Evaluations
        32. The Ecosystem for Research> Use Case 32: DataNet Federation Consortium (DFC)
        33. The Ecosystem for Research> Use Case 33: The ‘Discinnet Process’
        34. The Ecosystem for Research> Use Case 34: Graph Search on Scientific Data
        35. The Ecosystem for Research> Use Case 35: Light Source Beamlines
        36. Astronomy and Physics> Use Case 36: Catalina Digital Sky Survey for Transients
        37. Astronomy and Physics> Use Case 37: Cosmological Sky Survey and Simulations
        38. Astronomy and Physics> Use Case 38: Large Survey Data for Cosmology
        39. Astronomy and Physics> Use Case 39: Analysis of LHC (Large Hadron Collider) Data
          1. Note: See Table Below
        40. Astronomy and Physics> Use Case 40: Belle II Experiment
        41. Earth, Environmental and Polar Science> Use Case 41: EISCAT 3D Incoherent Scatter Radar System
        42. Earth, Environmental and Polar Science> Use Case 42: ENVRI, Common Environmental Research Infrastructure
        43. Earth, Environmental and Polar Science> Use Case 43: Radar Data Analysis for CReSIS
          1. Note: See Table Below
        44. Earth, Environmental and Polar Science> Use Case 44: UAVSAR Data Processing
        45. Earth, Environmental and Polar Science> Use Case 45: NASA LARC/GSFC iRODS Federation Testbed
        46. Earth, Environmental and Polar Science> Use Case 46: MERRA Analytic Services
        47. Earth, Environmental and Polar Science> Use Case 47: Atmospheric Turbulence—Event Discovery
        48. Earth, Environmental and Polar Science> Use Case 48: Climate Studies using the Community Earth System Model
        49. Earth, Environmental and Polar Science> Use Case 49: Subsurface Biogeochemistry
        50. Earth, Environmental and Polar Science> Use Case 50: AmeriFlux and FLUXNET
        51. Energy> Use Case 51: Consumption Forecasting in Smart Grids
    14. Appendix B: Summary of Key Properties
      1. Table B-1: Use Case Specific Information by Key Properties
    15. Appendix C: Use Case Requirements Summary
      1. Table C-1: Use Case Specific Requirements
    16. Appendix D: Use Case Detail Requirements
      1. Table D-1: Data Sources Requirements
        1. General Requirements
        2. Use Case Specific Requirements for Data Sources
      2. Table D-2: Data Transformation
        1. General Requirements
        2. Use Case Specific Requirements for Data Transformation
      3. Table D-3: Capabilities
        1. General Requirements
        2. Use Case Specific Requirements for Capabilities
      4. Table D-4: Data Consumer
        1. General Requirements
        2. Use Case Specific Requirements for Data Consumers
      5. Table D-5: Security and Privacy
        1. General Requirements
        2. Use Case Specific Requirements for Security and Privacy
      6. Table D-6: Lifecycle Management
        1. General Requirements
        2. Use Case Specific Requirements for Lifecycle Management
      7. Table D-7: Others
        1. General Requirements
        2. Use Case Specific Requirements for Others
    17. Appendix E: Acronyms
    18. Appendix F: References
      1. Use Case 6: Mendeley—an International Network of Research
      2. Use Case 7: Netflix Movie Service
      3. Use Case 8: Web Search
      4. Use Case 9: Cloud-Based Continuity and Disaster Recovery
      5. Use Cases 11 and 12: Materials Data & Simulation Driven Materials Genomics
      6. Use Case 13: Large Scale Geospatial Analysis and Visualization
      7. Use Case 14: Object Identification and Tracking - Persistent Surveillance
      8. Use Case 15: Intelligence Data Processing and Analysis
      9. Use Case 16: Electronic Medical Record (EMR) Data
      10. Use Case 17: Pathology Imaging/Digital Pathology
      11. Use Case 19: Genomic Measurements
      12. Use Case 20: Comparative Analysis for (Meta) Genomes
      13. Use Case 26: Large-Scale Deep Learning
      14. Use Case 27: Large Scale Consumer Photos Organization
      15. Use Case 28: Truthy Twitter Data Analysis
      16. Use Case 30: CINET Network Science Cyberinfrastructure
      17. Use Case 31: NIST Analytic Technology Measurement and Evaluations
      18. Use Case 32: DataNet Federation Consortium (DFC)
      19. Use Case 33: The 'Discinnet Process'
      20. Use Case 34: Graph Search on Scientific Data
      21. Use Case 35: Light Source Beamlines
      22. Use Case 36: Catalina Digital Sky Survey for Transients
      23. Use Case 37: Cosmological Sky Survey and Simulations
      24. Use Case 38: Large Survey Data for Cosmology
      25. Use Case 39: Analysis of LHC (Large Hadron Collider) Data
      26. Use Case 40: Belle II Experiment
      27. Use Case 41: EISCAT 3D Incoherent Scatter Radar System
      28. Use Case 42: ENVRI, Common Environmental Research Infrastructure
      29. Use Case 43: Radar Data Analysis for CReSIS
      30. Use Case 44: UAVSAR Data Processing
      31. Use Case 47: Atmospheric Turbulence - Event Discovery
      32. Use Case 48: Climate Studies Using the Community Earth System Model
      33. Use Case 50: AmeriFlux and FLUXNET
      34. Use Case 51: Consumption Forecasting in Smart Grids
    19. Document References
      1. [1]
      2. [2]
      3. [3]
      4. [4]
      5. [5]
      6. [6]
      7. [7]
      8. [8]
      9. [9]
      10. [10]
      11. [11]
  9. Security and Privacy
  10. Architecture White Paper Survey
  11. Reference Architecture
  12. Standards Roadmap

Story

Data Science for NIST Big Data Framework

NIST is seeking feedback on the Version 1 draft of the NIST Big Data Interoperability Framework. Once public comments are received, compiled, and addressed by the NBD-PWG, and reviewed and approved by NIST internal editorial board, Version 1 of Volume 1 through Volume 7 will be published as final. Three versions are planned, with Versions 2 and 3 building on the first.

I complimented the NIST team (see Acknowledgements) on excellent work over a long period of time and told them that I have asked the 700+ members of our Federal Big Data Working Group Meetup to review the DRAFT documents and provide comments. I said I think this will take us longer than the May 21st deadline, and we plan to do a Meetup on this in July. We are looking especially for the six Use Cases that have data sets, according to a recent email we saw from the NIST Big Data Working Group participants.

While I have started a Comment Template for detailed comments, my focus is to use the excellent content for the Federal Big Data Working Group Meetup as follows:

Data Science for NIST Big Data Framework will be done by data mining following the six-step CRISP-DM standard (sketched in the code outline after the list):

  • CRISP-DM Step 1: Business (Organizational) Understanding
  • CRISP-DM Step 2: Data Understanding
  • CRISP-DM Step 3: Data Preparation
  • CRISP-DM Step 4: Modeling
  • CRISP-DM Step 5: Evaluation
  • CRISP-DM Step 6: Deployment
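
A minimal sketch of what this six-step workflow could look like as a script follows; the inline sample rows, field names, and step contents are illustrative assumptions, not part of the NIST framework or the Meetup tooling.

# Hypothetical outline of the six CRISP-DM steps applied to the NIST use case inventory.
# The sample rows and field names below are illustrative assumptions only.

SAMPLE_USE_CASES = [
    {"Category": "Healthcare and Life Sciences", "Use Case": "19: Genomic Measurements", "Data Set": "yes"},
    {"Category": "Energy", "Use Case": "51: Consumption Forecasting in Smart Grids", "Data Set": ""},
]

def business_understanding():
    # Step 1: restate the Meetup's questions against the seven NIST volumes.
    return ["Which use cases publish data sets?", "Which requirements recur across domains?"]

def data_understanding():
    # Step 2: load the use case index exported from the MindTouch knowledge base
    # (a real run would read the NIST Big Data Spreadsheet instead of this sample).
    return list(SAMPLE_USE_CASES)

def data_preparation(rows):
    # Step 3: keep only the fields needed downstream.
    return [{"category": r["Category"], "has_data": bool(r["Data Set"])} for r in rows]

def modeling(rows):
    # Step 4: exploratory summary, e.g., use cases with data sets per category.
    counts = {}
    for r in rows:
        counts.setdefault(r["category"], [0, 0])
        counts[r["category"]][0] += 1
        counts[r["category"]][1] += int(r["has_data"])
    return counts

def evaluation(counts):
    # Step 5: check the summary against the questions from Step 1.
    return {cat: f"{with_data} of {total} use cases have data sets"
            for cat, (total, with_data) in counts.items()}

def deployment(findings):
    # Step 6: publish the findings to the data publication and MOOC pages.
    for category, finding in findings.items():
        print(f"{category}: {finding}")

if __name__ == "__main__":
    print(business_understanding())
    deployment(evaluation(modeling(data_preparation(data_understanding()))))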

The method and results will be documented in the Slides and Spotfire Dashboard below. The Knowledge Base Index and selected tables will be documented in the NIST Big Data Spreadsheet.

The Meetup date and agenda will be announced soon.

The Data Mining Standard Results are:

  • CRISP-DM Step 1: Business (Organizational) Understanding:
    • Knowledge Base: 7 Word Documents to MindTouch
  • CRISP-DM Step 2: Data Understanding:
    • MindTouch Index to Spreadsheet
  • CRISP-DM Step 3: Data Preparation:
    • Report Tables and Use Case Data Sets
  • CRISP-DM Step 4: Modeling:
    • Spotfire Exploratory Data Analysis
  • CRISP-DM Step 5: Evaluation:
    • Data Science Answer to Four Questions
  • CRISP-DM Step 6: Deployment:
    • Data Science Data Publication and MOOC

The Conclusions and Recommendations are:

  • The Version 1 DRAFT NIST Big Data Interoperability Framework (7 volumes) has been reviewed for detailed comments and repurposed by the Federal Big Data Working Group Meetup.
  • A Knowledge Base, Data Science Data Publication, and Massive Open Online Course (MOOC) have been created from the excellent content using the CRISP-DM (Cross Industry Standard Process for Data Mining) standard.
  • The methods and results are documented to aid the NIST Big Data Work Group and Federal Big Data Working Group Meetup in future activities.
  • The Federal Big Data Working Group Meetup is creating an interface (Stage 2) and applications (Stage 3) by doing Data Science for NIST Big Data Framework!
  • The Federal Big Data Working Group Meetup is focused on Use Cases with Government Data and Workforce Education of Data Scientists and Chief Data Officers.

Notes:

Volume 3 Appendices:

  • Appendix A: Not a good format for a data set
  • Appendix B: Used
  • Appendix C: Used
  • Appendix D: Not a good format for a data set
  • Appendix E: Acronyms
  • Appendix F: Check for data sets
  • Document References: Used

NIST references the Geoffrey Fox MOOC, which was our March 2nd Meetup: Data Science for Big Data Application and Analytics MOOC and Meetup.

The NBD-PWG formed the Use Cases and Requirements Subgroup to gather use cases and extract requirements. The Subgroup developed a use case template with 26 fields that were completed by 51 users in the following broad areas (a quick tally of the category counts appears after the table):

 

Use Case Category (NIST Number) | FBDWGM Example | Meetup Date
Government Operations (4) | Data Science for Agency Initiatives 2015 | August 3rd Meetup
Commercial (8) | USDA Data Science MOOC 2015 | Wharton DC Innovation Summit, April 28-29: FBDWGM Workshop, 4/29 Meetup
Defense (3) | Data Science for the DTIC Data Ecosystem RFI | January 27th Meetup
Healthcare and Life Sciences (10) | Data Science for Affordable Care Act Data | July 20th Meetup
Deep Learning and Social Media (6) | Big Data from Everywhere for Families and Community Service and Data Science for MyFamilySearch.org | February 13th Meetup
The Ecosystem for Research (4) | Data Science for the National Big Data R and D Initiative | February 2nd Meetup
Astronomy and Physics (5) | Data Science for Cyber Physical Systems-Internet of Things | June 29th Meetup
Earth, Environmental and Polar Science (10) | Data Science for EPA Big Data Analytics | April 20th Meetup
Energy (1) | Data Science for USGS Minerals Big Data | May 29th EarthCube and June 15th Meetup
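
As a quick arithmetic check, the category counts above account for all 51 submitted use cases:

# Counts per broad area, taken directly from the table above.
use_case_counts = {
    "Government Operations": 4,
    "Commercial": 8,
    "Defense": 3,
    "Healthcare and Life Sciences": 10,
    "Deep Learning and Social Media": 6,
    "The Ecosystem for Research": 4,
    "Astronomy and Physics": 5,
    "Earth, Environmental and Polar Science": 10,
    "Energy": 1,
}

total = sum(use_case_counts.values())
assert total == 51, f"expected 51 use cases, found {total}"
print(f"{len(use_case_counts)} categories, {total} use cases in all")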

The Federal Big Data Working Group Meetup will continue focusing on government big data use cases for Data Science Data Publications and MOOCs.

MORE TO FOLLOW

Slides

Slides

Slide 2 Introduction

BrandNiemann05212015Slide2.PNG

Slide 3 Federal Big Data Working Group Meetup

http://www.meetup.com/Federal-Big-Da...nts/222458479/

BrandNiemann05212015Slide3.PNG

Slide 4 NIST Requests Comments on NIST Big Data interoperability Framework

http://bigdatawg.nist.gov/V1_output_docs.php

BrandNiemann05212015Slide4.PNG

Slide 5 NIST Big Data interoperability Framework: Seven Volumes

BrandNiemann05212015Slide5.PNG

Slide 6 NIST Big Data interoperability Framework: Three Stages

BrandNiemann05212015Slide6.PNG

Slide 7 Purpose

BrandNiemann05212015Slide7.PNG

Slide 8 Data Mining Standard Process

Data Mining

BrandNiemann05212015Slide8.PNG

Slide 9 Method and Results

BrandNiemann05212015Slide9.PNG

Slide 10 Data Mining Standard Results

BrandNiemann05212015Slide10.PNG

Slide 11 Data Science for NIST Big Data Framework: MindTouch Knowledge Base Index

Data Science for NIST Big Data Framework NIST Big Data Framework

BrandNiemann05212015Slide11.PNG

Slide 12 Data Science for NIST Big Data Framework: MindTouch Knowledge Base Find

Data Science for NIST Big Data Framework NIST Big Data Framework

BrandNiemann05212015Slide12.PNG

Slide 13 Data Science for NIST Big Data Framework: Spreadsheet Knowledge Base: Find

NIST Big Data Spreadsheet

BrandNiemann05212015Slide13.PNG

Slide 14 Data Science for NIST Big Data Framework: Spreadsheet Knowledge Base: Other

NIST Big Data Spreadsheet

BrandNiemann05212015Slide14.PNG

Slide 15 Data Science for NIST Big Data Framework: Spotfire Cover Page

Web Player

BrandNiemann05212015Slide15.PNG

Slide 16 Data Science for NIST Big Data Framework: Spotfire Tab 1

Web Player

BrandNiemann05212015Slide16.PNG

Slide 17 Data Science for NIST Big Data Framework: Spotfire Tab 2

Web Player

BrandNiemann05212015Slide17.PNG

Slide 18 Data Science for NIST Big Data Framework: Spotfire Tab 3

Web Player

BrandNiemann05212015Slide18.PNG

Slide 19 Data Science for NIST Big Data Framework: Spotfire Tab 4

Web Player

BrandNiemann05212015Slide19.PNG

Slide 20 Conclusions and Recommendations

BrandNiemann05212015Slide20.PNG

Spotfire Dashboard

For Internet Explorer users and those wanting a full-screen display, use: Web Player. Get Spotfire for iPad App

If the embedded dashboard does not display, use Google Chrome.

Research Notes

Wo and Nancy, Congratulations on excellent work over a long period of time!

I am asking the 700+ members of our Federal Big Data Working Group Meetup to review your DRAFT documents and provide comments.

To facilitate that process and my own comments, I am doing that at: 
http://semanticommunity.info/Data_Science/Data_Science_for_NIST_Big_Data_Framework

My initial comments are at: 
http://semanticommunity.info/Data_Science/Data_Science_for_NIST_Big_Data_Framework#Research_Notes

I think this will take us longer than the May 21st schedule and we plan to do a Meetup on this in July.

We are looking especially for the six Use Cases that have data sets according to a recent email we saw from your participants. 
Best regards, Brand 

Dear Brand, 

Thanks for your encouragement and special thanks for your Federal Big Data WG Meetup for willing to review and help to enhance the content of our seven NBD-PWG documents! 

Yes, the 45 days public comment period is kind of short but the good news is, our future version 2 and 3 will be built on top of version 1 meaning any late comments can still be useful for our enhancement. Thanks for starting the commenting on Vol. 1 and 2, and in the process of commenting Vol. 3. I would be very appreciative if you can send in a version before May 21 so that our editing team can review and incorporate any appropriate incoming comments.

Thanks so much for your help and looking forward for your Meetup’s comments! 
--Wo

Comment Template for SP1500-x (replace x with volume number)

Source: http://bigdatawg.nist.gov/_uploadfil..._template.docx (Word)

Submitted by: Brand Niemann

Email: bniemann@cox.net

Date: May 10, 2015 (IN PROCESS)

Type: E = Editorial, G = General, T = Technical

 

Item # | Type | Page # | Line # | Section | Comment (with rationale) | Suggested Change
1 | E | | | Volume 1, Table 1: Sampling of Concepts Attributed to Big Data | What do the numbers in this table refer to? |
2 | E | | | Volume 2, Appendix B: References | Why are References (a), (b), and (c) missing in the Word document? They appeared when I copied the Word document to MindTouch. |
3 | E | | | Volume 3, Figure 16: Typical NASA Image of Turbulent Waves | Figure 16 is missing in the List of Figures in the Table of Contents |
IN PROCESS | | | | Volume 3 | LOOKING FOR THE 6 WITH DATA TO ANALYZE |
 | | | | Volume 6, Appendix D: Acronyms | Missing Definition of WiFi |

NIST Requests Comments on NIST Big Data interoperability Framework

Version 1.0 Working Drafts for public comments. Please submit comments by May 21, 2015

NIST is seeking feedback on the Version 1 draft of the NIST Big Data Interoperability Framework. Once public comments are received, compiled, and addressed by the NBD-PWG, and reviewed and approved by NIST internal editorial board, Version 1 of Volume 1 through Volume 7 will be published as final. Three versions are planned, with Versions 2 and 3 building on the first. Further explanation of the three planned versions and the information contained therein is included in each volume.

When submitting comments, please be as specific as possible in any comments or edits to the text. Specific edits could include, but are not limited to, changes in the current text, additional text further explaining a topic or explaining a new topic, additional references, or comments about the text, topics, or document organization. These specific edits can be recorded using one of the two following methods.

  1. TRACK CHANGES: make edits to and comments on the text directly into this Word document using track changes and embedded comments
     
  2. COMMENT TEMPLATE: capture specific edits using the Comment Template (http://bigdatawg.nist.gov/_uploadfiles/SP1500-1-to-7_comment_template.docx), which includes space for Section number, page number, comment, and text edits

Submit the edited file from either method 1 or 2 to SP1500comments@nist.gov with the volume number in the subject line (e.g., Edits for Volume 1.)

Please contact Wo Chang (wchang@nist.gov) with any questions about the feedback submission process.

Big Data professionals continue to be welcome to join the NBD-PWG to help craft the work contained in the volumes of the NIST Big Data Interoperability Framework. Please register to join our effort.

  • NIST Big Data Definitions & Taxonomies Subgroup
    • 1. M0392 | PDF | Draft SP 1500-1 -- Volume 1: Definitions
    • 2. M0393 | PDF | Draft SP 1500-2 -- Volume 2: Taxonomies
  • NIST Big Data Use Case & Requirements Subgroup
    • 3. M0394 | PDF | Draft SP 1500-3 -- Volume 3: Use Cases and General Requirements
  • NIST Big Data Security & Privacy Subgroup
    • 4. M0395 | PDF | Draft SP 1500-4 -- Volume 4: Security and Privacy
  • NIST Big Data Reference Architecture Subgroup
    • 5. M0396 | PDF | Draft SP 1500-5 -- Volume 5: Architectures White Paper Survey
    • 6. M0397 | PDF | Draft SP 1500-6 -- Volume 6: Reference Architecture
  • NIST Big Data Technology Roadmap Subgroup
    • 7. M0398 | PDF | Draft SP 1500-7 -- Volume 7: Standards Roadmap

Definitions

Source: http://bigdatawg.nist.gov/_uploadfil...022325181.docx (Word)

Cover Page

NIST Special Publication 1500-1

DRAFT NIST Big Data Interoperability Framework:

Volume 1, Definitions

NIST Big Data Public Working Group

Definitions and Taxonomies Subgroup

Draft Version 1

April 6, 2015

http://dx.doi.org/10.6028/NIST.SP.1500-1

Inside Cover Page

NIST Special Publication 1500-1

Information Technology Laboratory

DRAFT NIST Big Data Interoperability Framework:

Volume 1, Definitions

Draft Version 1

NIST Big Data Public Working Group (NBD-PWG)

Definitions and Taxonomies Subgroup

National Institute of Standards and Technology

Gaithersburg, MD 20899

April 2015

U. S. Department of Commerce

Penny Pritzker, Secretary

National Institute of Standards and Technology

Dr. Willie E. May, Under Secretary of Commerce for Standards and Technology and Director

National Institute of Standards and Technology NIST Special Publication 1500-1

33 pages (April 6, 2015)

Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

There may be references in this publication to other publications currently under development by NIST in accordance with its assigned statutory responsibilities. The information in this publication, including concepts and methodologies, may be used by Federal agencies even before the completion of such companion publications. Thus, until each publication is completed, current requirements, guidelines, and procedures, where they exist, remain operative. For planning and transition purposes, Federal agencies may wish to closely follow the development of these new publications by NIST.

Organizations are encouraged to review all draft publications during public comment periods and provide feedback to NIST. All NIST Information Technology Laboratory publications, other than the ones noted above, are available at http://www.nist.gov/publication-portal.cfm.

Public comment period: April 6, 2015 through May 21, 2015

Comments on this publication may be submitted to Wo Chang

National Institute of Standards and Technology

Attn: Wo Chang, Information Technology Laboratory

100 Bureau Drive (Mail Stop 8900) Gaithersburg, MD 20899-8930

Email: SP1500comments@nist.gov

Reports on Computer Systems Technology

The Information Technology Laboratory (ITL) at NIST promotes the U.S. economy and public welfare by providing technical leadership for the Nation’s measurement and standards infrastructure. ITL develops tests, test methods, reference data, proof of concept implementations, and technical analyses to advance the development and productive use of information technology. ITL’s responsibilities include the development of management, administrative, technical, and physical standards and guidelines for the cost-effective security and privacy of other than national security-related information in Federal information systems. This document reports on ITL’s research, guidance, and outreach efforts in Information Technology and its collaborative activities with industry, government, and academic organizations.

Abstract

Big Data is a term used to describe the new deluge of data in our networked, digitized, sensor-laden, information-driven world. While great opportunities exist with Big Data, it can overwhelm traditional technical approaches and its growth is outpacing scientific and technological advances in data analytics. To advance progress in Big Data, the NIST Big Data Public Working Group (NBD-PWG) is working to develop consensus on important, fundamental questions related to Big Data. The results are reported in the NIST Big Data Interoperability Framework series of volumes. This volume, Volume 1, contains a definition of Big Data and related terms necessary to lay the groundwork for discussions surrounding Big Data.

Keywords

Big Data, Data Science, Reference Architecture, System Orchestrator, Data Provider, Big Data Application Provider, Big Data Framework Provider, Data Consumer, Security and Privacy Fabric, Management Fabric, Big Data taxonomy, use cases, Big Data characteristics

Acknowledgements

This document reflects the contributions and discussions by the membership of the NBD-PWG, co-chaired by Wo Chang of the NIST ITL, Robert Marcus of ET-Strategies, and Chaitanya Baru, University of California San Diego Supercomputer Center.

The document contains input from members of the NBD-PWG Definitions and Taxonomies Subgroup, led by Nancy Grady (SAIC), Natasha Balac (SDSC), and Eugene Luster (R2AD).

NIST SP1500-1, Version 1 has been collaboratively authored by the NBD-PWG. As of the date of this publication, there are over six hundred NBD-PWG participants from industry, academia, and government. Federal agency participants include the National Archives and Records Administration (NARA), National Aeronautics and Space Administration (NASA), National Science Foundation (NSF), and the U.S. Departments of Agriculture, Commerce, Defense, Energy, Health and Human Services, Homeland Security, Transportation, Treasury, and Veterans Affairs.

NIST would like to acknowledge the specific contributions to this volume by the following NBD-PWG members:

Deborah Blackstock (MITRE Corporation)

David Boyd (L3 Data Tactics)

Pw Carey (Compliance Partners, LLC)

Wo Chang (NIST)

Yuri Demchenko (University of Amsterdam)

Frank Farance (Consultant)

Geoffrey Fox (University of Indiana)

Ian Gorton (CMU)

Nancy Grady (SAIC)

Karen Guertler (Consultant)

Keith Hare (JCC Consulting, Inc.)

Christine Hawkinson (U.S. Bureau of Land Management)

Thomas Huang (NASA)

Philippe Journeau (ResearXis)

Pavithra Kenjige (PK Technologies)

Orit Levin (Microsoft)

Eugene Luster (U.S. Defense Information Systems Agency/R2AD LLC)

Ashok Malhotra (Oracle)

Bill Mandrick (L3 Data Tactics)

Robert Marcus (ET-Strategies)

Lisa Martinez (Consultant)

Gary Mazzaferro (AlloyCloud, Inc.)

William Miller (MaCT USA)

Sanjay Mishra (Verizon)

Bob Natale (Mitre)

Rod Peterson (U.S. Department of Veterans Affairs)

Ann Racuya-Robbins (World Knowledge Bank)

Russell Reinsch (Calibrum)

John Rogers (HP)

Arnab Roy (Fujitsu)

Mark Underwood (Krypton Brothers LLC)

William Vorhies (Predictive Modeling LLC)

Tim Zimmerman (Consultant)

Alicia Zuniga-Alvarado (Consultant)

The editors for this document were Nancy Grady and Wo Chang.

Notice to Readers

NIST is seeking feedback on the proposed working draft of the NIST Big Data Interoperability Framework: Volume 1, Definitions. Once public comments are received, compiled, and addressed by the NBD-PWG, and reviewed and approved by NIST internal editorial board, Version 1 of this volume will be published as final. Three versions are planned for this volume, with Versions 2 and 3 building on the first. Further explanation of the three planned versions and the information contained therein is included in Section 1.5 of this document.

Please be as specific as possible in any comments or edits to the text. Specific edits include, but are not limited to, changes in the current text, additional text further explaining a topic or explaining a new topic, additional references, or comments about the text, topics, or document organization. These specific edits can be recorded using one of the two following methods.

  1. TRACK CHANGES: make edits to and comments on the text directly into this Word document using track changes
  2. COMMENT TEMPLATE: capture specific edits using the Comment Template : (http://bigdatawg.nist.gov/_uploadfiles/SP1500-1-to-7_comment_template.docx), which includes space for Section number, page number, comment, and text edits

Submit the edited file from either method 1 or 2 to SP1500comments@nist.gov with the volume number in the subject line (e.g., Edits for Volume 1.)

Please contact Wo Chang (wchang@nist.gov) with any questions about the feedback submission process.

Big Data professionals continue to be welcome to join the NBD-PWG to help craft the work contained in the volumes of the NIST Big Data Interoperability Framework. Additional information about the NBD-PWG can be found at http://bigdatawg.nist.gov.

Table of Contents

Executive Summary

1 Introduction
1.1 Background
1.2 Scope and Objectives of the Definitions and Taxonomies Subgroup
1.3 Report Production
1.4 Report Structure
1.5 Future Work on this Volume

2 Big Data and Data Science Definitions
2.1 Big Data Definitions
2.2 Data Science Definitions
2.3 Other Big Data Definitions

3 Big Data Features
3.1 Data Elements and Metadata
3.2 Data Records and Non-Relational Models
3.3 Dataset Characteristics and Storage
3.4 Data in Motion
3.5 Data Science Lifecycle Model for Big Data
3.6 Big Data Analytics
3.7 Big Data Metrics and Benchmarks
3.8 Big Data Security and Privacy
3.9 Data Governance

4 Big Data Engineering Patterns (Fundamental Concepts)

Appendix A: Index of Terms
Appendix B: Terms and Definitions
Appendix C: Acronyms
Appendix D: References

Figure
Figure 1: Skills Needed in Data Science

Table
Table 1: Sampling of Concepts Attributed to Big Data

Executive Summary

The NIST Big Data Public Working Group (NBD-PWG) Definitions and Taxonomy Subgroup prepared this NIST Big Data Interoperability Framework: Volume 1, Definitions to address fundamental concepts needed to understand the new paradigm for data applications, collectively known as Big Data, and the analytic processes collectively known as data science. While Big Data has been defined in a myriad of ways, the shift to a Big Data paradigm occurs when the scale of the data leads to the need for a cluster of computing and storage resources to provide cost-effective data management. Data science combines various technologies, techniques, and theories from various fields, mostly related to computer science and statistics, to obtain actionable knowledge from data. This report seeks to clarify the underlying concepts of Big Data and data science to enhance communication among Big Data producers and consumers. By defining concepts related to Big Data and data science, a common terminology can be used among Big Data practitioners.

The NIST Big Data Interoperability Framework consists of seven volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The seven volumes are as follows:

  • Volume 1, Definitions
  • Volume 2, Taxonomies
  • Volume 3, Use Cases and General Requirements
  • Volume 4, Security and Privacy
  • Volume 5, Architectures White Paper Survey
  • Volume 6, Reference Architecture
  • Volume 7, Standards Roadmap

The NIST Big Data Interoperability Framework will be released in three versions, which correspond to the three stages of the NBD-PWG work. The three stages aim to achieve the following:

Stage 1: Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic

Stage 2: Define general interfaces between the NIST Big Data Reference Architecture (NBDRA) components

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces

Potential areas of future work for the Subgroup during stage 2 are highlighted in Section 1.5 of this volume. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data.

1 Introduction

1.1 Background

There is broad agreement among commercial, academic, and government leaders about the remarkable potential of Big Data to spark innovation, fuel commerce, and drive progress. Big Data is the common term used to describe the deluge of data in today’s networked, digitized, sensor-laden, and information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach, including the following:

  • How can a potential pandemic reliably be detected early enough to intervene?
  • Can new materials with advanced properties be predicted before these materials have ever been synthesized?
  • How can the current advantage of the attacker over the defender in guarding against cyber-security threats be reversed?

There is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The growth rates for data volumes, speeds, and complexity are outpacing scientific and technological advances in data analytics, management, transport, and data user spheres.

Despite widespread agreement on the inherent opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions continues to confuse potential users and stymie progress. These questions include the following:

  • What attributes define Big Data solutions?
  • How is Big Data different from traditional data environments and related applications?
  • What are the essential characteristics of Big Data environments?
  • How do these environments integrate with currently deployed architectures?
  • What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

Within this context, on March 29, 2012, the White House announced the Big Data Research and Development Initiative.[1] The initiative’s goals include helping to accelerate the pace of discovery in science and engineering, strengthening national security, and transforming teaching and learning by improving the ability to extract knowledge and insights from large and complex collections of digital data.

Six federal departments and their agencies announced more than $200 million in commitments spread across more than 80 projects, which aim to significantly improve the tools and techniques needed to access, organize, and draw conclusions from huge volumes of digital data. The initiative also challenged industry, research universities, and nonprofits to join with the federal government to make the most of the opportunities created by Big Data.

Motivated by the White House initiative and public suggestions, the National Institute of Standards and Technology (NIST) has accepted the challenge to stimulate collaboration among industry professionals to further the secure and effective adoption of Big Data. As one result of NIST’s Cloud and Big Data Forum held on January 15–17, 2013, there was strong encouragement for NIST to create a public working group for the development of a Big Data Standards Roadmap. Forum participants noted that this roadmap should define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytics, and technology infrastructure. In doing so, the roadmap would accelerate the adoption of the most secure and effective Big Data techniques and technology.

On June 19, 2013, the NIST Big Data Public Working Group (NBD-PWG) was launched with extensive participation by industry, academia, and government from across the nation. The scope of the NBD-PWG involves forming a community of interests from all sectors—including industry, academia, and government—with the goal of developing consensus on definitions, taxonomies, secure reference architectures, security and privacy, and—from these—a standards roadmap. Such a consensus would create a vendor-neutral, technology- and infrastructure-independent framework that would enable Big Data stakeholders to identify and use the best analytics tools for their processing and visualization requirements on the most suitable computing platform and cluster, while also allowing value-added from Big Data service providers.

The NIST Big Data Interoperability Framework consists of seven volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The seven volumes are as follows:

  • Volume 1, Definitions
  • Volume 2, Taxonomies
  • Volume 3, Use Cases and General Requirements
  • Volume 4, Security and Privacy
  • Volume 5, Architectures White Paper Survey
  • Volume 6, Reference Architecture
  • Volume 7, Standards Roadmap

The NIST Big Data Interoperability Framework will be released in three versions, which correspond to the three stages of the NBD-PWG work. The three stages aim to achieve the following:

Stage 1: Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic

Stage 2: Define general interfaces between the NIST Big Data Reference Architecture (NBDRA) components

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces

Potential areas of future work for the Subgroup during stage 2 are highlighted in Section 1.5 of this volume. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data.

1.2 Scope and Objectives of the Definitions and Taxonomies Subgroup

This volume was prepared by the NBD-PWG Definitions and Taxonomy Subgroup, which focused on identifying Big Data concepts and defining related terms in areas such as data science, reference architecture, and patterns.

The aim of this volume is to provide a common vocabulary for those involved with Big Data. For managers, the terms in this volume will distinguish the concepts needed to understand this changing field. For procurement officers, this document will provide the framework for discussing organizational needs, and distinguishing among offered approaches. For marketers, this document will provide the means to promote solutions and innovations. For the technical community, this volume will provide a common language to better differentiate the specific offerings.

1.3 Report Production

Big Data and data science are being used as buzzwords and are composites of many concepts. To better identify those terms, the NBD-PWG Definitions and Taxonomy Subgroup first addressed the individual concepts needed in this disruptive field. Then, the two over-arching buzzwords—Big Data and data science—and the concepts they encompass were clarified.

To keep the topic of data and data systems manageable, the Subgroup attempted to limit discussions to differences affected by the existence of Big Data. Expansive topics such as data type or analytics taxonomies and metadata were only explored to the extent that there were issues or effects specific to Big Data. However, the Subgroup did include the concepts involved in other topics that are needed to understand the new Big Data methodologies.

Terms were developed independent of a specific tool or implementation, to avoid highlighting specific implementations, and to stay general enough for the inevitable changes in the field.

The Subgroup is aware that some fields, such as legal, use specific language that may differ from the definitions provided herein. The current version reflects the breadth of knowledge of the Subgroup members. During the comment period, the broader community is requested to address any domain conflicts caused by the terminology used in this volume.

1.4 Report Structure

This volume seeks to clarify the meanings of the broad terms Big Data and data science, which are discussed at length in Section 2. The more elemental concepts and terms that provide additional insights are discussed in Section 3. Section 4 explores several concepts that are more detailed. This first version of NIST Big Data Interoperability Framework: Volume 1, Definitions describes some of the fundamental concepts that will be important to determine categories or functional capabilities that represent architecture choices.

Tightly coupled information can be found in the other volumes of the NIST Big Data Interoperability Framework. Volume 2, Taxonomies provides a description of the more detailed components of the NIST Big Data Reference Architecture (NBDRA) presented in Volume 6, Reference Architecture. Security and privacy related concepts are described in detail in Volume 4, Security and Privacy. To understand how these systems are architected to meet users’ needs, the reader is referred to Volume 3, Use Cases and General Requirements. Volume 7, Standards Roadmap recaps the framework established in Volumes 1 through 6 and discusses NBDRA related standards. Comparing related sections in these volumes will provide a more comprehensive understanding of the consensus of the NBD-PWG.

1.5 Future Work on this Volume

This volume represents the beginning stage of the NBD-PWG’s effort to provide order and clarity to an emerging and rapidly changing field. Big Data encompasses a large range of data types, fields of study, technologies, and techniques. Distilling from the varied viewpoints a consistent, core set of definitions to frame the discussion has been challenging. However, through discussion of the varied viewpoints a greater understanding of the Big Data paradigm will emerge. As the field matures, this document will also need to mature to accommodate innovations in the field. To ensure the concepts are accurate, future NBD-PWG tasks will consist of the following:

  • Defining the different patterns of communications between Big Data resources to better clarify the different approaches being taken
  • Updating Volume 1, taking into account the efforts of other working groups such as the International Organization for Standardization (ISO) Joint Technical Committee 1 (JTC 1) and the Transaction Processing Performance Council
  • Improving the discussions of governance and data ownership
  • Developing the Management section
  • Developing the Security and Privacy section
  • Adding a discussion of the value of data

2 Big Data and Data Science Definitions

The rate of growth of data generated and stored has been increasing exponentially. In a 1965 paper [2], Gordon Moore estimated that the density of transistors on an integrated circuit board was doubling every two years. Known as “Moore’s Law”, this rate of growth has been applied to all aspects of computing, from clock speeds to memory. The growth rates of data volumes are considered faster than Moore’s Law, with data volumes more than doubling every eighteen months. This data explosion is creating opportunities for new ways of combining and using data to find value, as well as providing significant challenges due to the size of the data being managed and analyzed. One significant shift is in the amount of unstructured data. Historically, structured data has typically been the focus of most enterprise analytics, and has been handled through the use of the relational data model. Recently, the quantity of unstructured data, such as micro-texts, web pages, relationship data, images and videos, has exploded and the trend indicates an increase in the incorporation of unstructured data to generate value. The central benefit of Big Data analytics is the ability to process large amounts and various types of information. Big Data does not imply that the current data volumes are simply “bigger” than before, or bigger than current techniques can efficiently handle. The need for greater performance or efficiency happens on a continual basis. However, Big Data represents a fundamental change in the architecture needed to efficiently handle current datasets.
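
As a rough, illustrative calculation (not from the source text), compounding those two doubling periods over a decade shows how quickly the gap widens:

# Illustrative comparison of the two doubling periods over ten years.
years = 10
transistor_growth = 2 ** (years * 12 / 24)  # doubling every 24 months (Moore's Law)
data_growth = 2 ** (years * 12 / 18)        # data volumes doubling every 18 months
print(f"~{transistor_growth:.0f}x transistor density vs ~{data_growth:.0f}x data volume in {years} years")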

In the evolution of data systems, there have been a number of times when the need for efficient, cost effective data analysis has forced a change in existing technologies. For example, the move to a relational model occurred when methods to reliably handle changes to structured data led to the shift toward a data storage paradigm that modeled relational algebra. That was a fundamental shift in data handling. The current revolution in technologies referred to as Big Data has arisen because the relational data model can no longer efficiently handle all the current needs for analysis of large and often unstructured datasets. It is not just that data is bigger than before, as it has been steadily getting larger for decades. The Big Data revolution is instead a one-time fundamental shift in architecture, just as the shift to the relational model was a one-time shift. As relational databases evolved to greater efficiencies over decades, so too will Big Data technologies continue to evolve. Many of the conceptual underpinnings of Big Data have been around for years, but the last decade has seen an explosion in their maturation and application to scaled data systems.

The term Big Data has been used to describe a number of concepts, in part because several distinct aspects are consistently interacting with each other. To understand this revolution, the interplay of the following four aspects must be considered: the characteristics of the datasets, the analysis of the datasets, the performance of the systems that handle the data, and the business considerations of cost effectiveness.

In the following sections, the two broad concepts, Big Data and data science, are broken down into specific individual terms and concepts.

2.1 Big Data Definitions

Big Data refers to the inability of traditional data architectures to efficiently handle the new datasets. Characteristics of Big Data that force new architectures are volume (i.e., the size of the dataset) and variety (i.e., data from multiple repositories, domains, or types), and the data in motion characteristics of velocity (i.e., rate of flow) and variability (i.e., the change in other characteristics). These characteristics—volume, variety, velocity, and variability—are known colloquially as the ‘Vs’ of Big Data and are further discussed in Section 3. Each of these characteristics influences the overall design of a Big Data system, resulting in different data system architectures or different data lifecycle process orderings to achieve needed efficiencies.

Big Data consists of extensive datasets—primarily in the characteristics of volume, variety, velocity, and/or variability—that require a scalable architecture for efficient storage, manipulation, and analysis.

Note that this definition contains the interplay between the characteristics of the data and the need for a system architecture that can scale to achieve the needed performance and cost efficiency. There are two fundamentally different methods for system scaling, often described metaphorically as “vertical” or “horizontal” scaling. Vertical scaling implies increasing the system parameters of processing speed, storage, and memory for greater performance. This approach is limited by physical capabilities whose improvements have been described by Moore’s Law, requiring ever more sophisticated elements (e.g., hardware, software) that add time and expense to the implementation. The alternate method is to use horizontal scaling, to make use of a cluster of individual (usually commodity) resources integrated to act as a single system. It is this horizontal scaling that is at the heart of the Big Data revolution.

The Big Data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.

This new paradigm leads to a number of conceptual definitions that suggest Big Data exists when the scale of the data causes the management of the data to be a significant driver in the design of the system architecture. This definition does not explicitly refer to the horizontal scaling in the Big Data paradigm.

As stated above, fundamentally, the Big Data paradigm is a shift in data system architectures from monolithic systems with vertical scaling (i.e., adding more power, such as faster processors or disks, to existing machines) into a parallelized, “horizontally scaled”, system (i.e., adding more machines to the available collection) that uses a loosely coupled set of resources in parallel. This type of parallelization shift began over 20 years ago in the simulation community, when scientific simulations began using massively parallel processing (MPP) systems.

Massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program.

In different combinations of splitting the code and data across independent processors, computational scientists were able to greatly extend their simulation capabilities. This, of course, introduced a number of complications in such areas as message passing, data movement, latency in the consistency across resources, load balancing, and system inefficiencies, while waiting on other resources to complete their computational tasks.

The Big Data paradigm of today is similar. Data systems need a level of extensibility that matches the scaling in the data. To get that level of extensibility, different mechanisms are needed to distribute data and data retrieval processes across loosely coupled resources.

While the methods to achieve efficient scalability across resources will continually evolve, this paradigm shift (in analogy to the prior shift in the simulation community) is a one-time occurrence. Eventually, a new paradigm shift will likely occur beyond this distribution of a processing or data system that spans multiple resources working in parallel. That future revolution will need to be described with new terminology.

Big Data focuses on the self-referencing viewpoint that data is big because it requires scalable systems to handle it. Conversely, architectures with better scaling have come about because of the need to handle Big Data. It is difficult to delineate a size requirement for a dataset to be considered Big Data. Data is usually considered “big” if the use of new scalable architectures provides a cost or performance efficiency over the traditional vertically scaled architectures (i.e., if similar performance cannot be achieved in a traditional, single platform computing resource.) This circular relationship between the characteristics of the data and the performance of data systems leads to different definitions for Big Data if only one aspect is considered.

Some definitions for Big Data focus on the systems innovations required because of the characteristics of Big Data.

Big Data engineering includes advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.

Once again the definition is coupled, so that Big Data engineering is used when the characteristics of the data require it. New engineering techniques in the data layer have been driven by the growing prominence of datasets that cannot be handled efficiently in a traditional relational model. The need for scalable access in structured data has led to software built on the key-value pair paradigm. The rise in importance of document analysis has spawned a document-oriented database paradigm, and the increasing importance of relationship data has led to efficiencies in the use of graph-oriented data storage.

The new non-relational model database paradigms are typically referred to as NoSQL (Not Only or No Structured Query Language [SQL]) systems, which are further discussed in Section 3. The problem with identifying Big Data storage paradigms as NoSQL is, first, that it describes the storage of data with respect to a set theory-based language for query and retrieval of data, and, second, that there is a growing capability in the application of the SQL query language against the new non-relational data repositories. While NoSQL is in such common usage that it will continue to refer to the new data models beyond the relational model, it is hoped the term itself will be replaced with a more suitable term, since it is unwise to name a set of new storage paradigms with respect to a query language currently in use against that storage.

Non-relational models, frequently referred to as NoSQL, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.
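
A minimal sketch of how one record might be represented under the three non-relational styles mentioned above (key-value, document-oriented, and graph-oriented); the structures below are plain Python stand-ins, not any particular product's API.

# Illustrative structures only; the order record and stores are hypothetical.

# Key-value: an opaque value retrieved by key; the application interprets the bytes.
key_value_store = {
    "order:1001": '{"customer": "C42", "items": ["widget", "gear"], "total": 19.95}',
}

# Document-oriented: the value is a structured, queryable document (e.g., JSON).
document_store = {
    "orders": [
        {"_id": 1001, "customer": "C42", "items": ["widget", "gear"], "total": 19.95},
    ],
}

# Graph-oriented: entities are nodes and relationships are first-class edges.
graph_store = {
    "nodes": {"C42": {"type": "customer"}, "1001": {"type": "order"}},
    "edges": [("C42", "PLACED", "1001")],
}

print(key_value_store["order:1001"])
print(document_store["orders"][0]["total"])
print(graph_store["edges"][0])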

Another related engineering technique is the federated database system, which is related to the variety characteristic of Big Data.

A federated database system is a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database.

A federated database is thus a database system comprised of underlying database systems. Big Data systems can likewise pull a variety of data from many sources, but the underlying repositories do not all have to conform to the relational model.

Note that for systems and analysis processes, the Big Data paradigm shift also causes changes in the traditional data lifecycle processes. One description of the end-to-end data lifecycle categorizes the process steps as collection, preparation, analysis, and action. Different Big Data use cases can be characterized in terms of the dataset characteristics and in terms of the time window for the end-to-end data lifecycle. Dataset characteristics change the data lifecycle processes in different ways, for example in the point in the lifecycle at which the data is placed in persistent storage. In a traditional relational model, the data is stored after preparation (for example, after the extract-transform-load and cleansing processes). In a high velocity use case, the data is prepared and analyzed for alerting, and only then is the data (or aggregates of the data) given a persistent storage. In a volume use case, the data is often stored in the raw state in which it was produced—before being cleansed and organized (sometimes referred to as extract-load-transform). The consequence of persistence of data in its raw state is that a schema or model for the data is only applied when the data is retrieved for preparation and analysis. This Big Data concept is described as schema-on-read.

Schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.
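
A minimal sketch of schema-on-read, assuming raw events persisted as JSON lines and the pandas library; the field names and values are illustrative.

import io
import json

import pandas as pd

# Raw data is persisted exactly as produced, with no schema enforced at write time.
raw_store = io.StringIO(
    '{"ts": "2015-05-01T12:00:00", "sensor": "A", "temp_f": 71.3}\n'
    '{"ts": "2015-05-01T12:01:00", "sensor": "B", "humidity": 0.41}\n'
)

# The schema is applied only when the data is read for preparation and analysis:
# parse, select, and transform just the fields the analysis needs.
records = [json.loads(line) for line in raw_store]
df = pd.DataFrame.from_records(records)
df["ts"] = pd.to_datetime(df["ts"])
df["temp_c"] = (df["temp_f"] - 32) * 5 / 9  # transformation happens at read time
print(df[["ts", "sensor", "temp_c"]])

Nothing about the humidity field had to be declared up front; it simply remains unused until an analysis asks for it.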

Another concept of Big Data is often referred to as moving the processing to the data, not the data to the processing.

Computational portability is the movement of the computation to the location of the data.

The implication is that data is too extensive to be queried and moved into another resource for analysis, so the analysis program is instead distributed to the data-holding resources, with only the results being aggregated on a remote resource. This concept of data locality is actually a critical aspect of parallel data architectures. Additional system concepts are the interoperability (ability for tools to work together), reusability (ability to apply tools from one domain to another), and extendibility (ability to add or modify existing tools for new domains). These system concepts are not specific to Big Data, but their presence in Big Data can be understood in the examination of a Big Data reference architecture, which is discussed in NIST Big Data Interoperability Framework: Volume 6, Reference Architecture of this series.
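
A minimal sketch of moving the computation to the data, with local lists standing in for partitions held on separate nodes and worker processes standing in for those nodes; only small per-partition summaries travel back to be aggregated.

from multiprocessing import Pool

# Each partition stands in for data stored on a separate resource.
partitions = [
    [3.2, 4.8, 5.1, 2.9],
    [6.4, 7.0, 5.5],
    [4.1, 3.9, 4.4, 4.7, 5.0],
]

def local_summary(values):
    """Runs where the data lives; only a small summary leaves the resource."""
    return (len(values), sum(values))

if __name__ == "__main__":
    with Pool(processes=len(partitions)) as pool:
        summaries = pool.map(local_summary, partitions)  # ship the function, not the data
    count = sum(n for n, _ in summaries)
    total = sum(s for _, s in summaries)
    print(f"global mean computed from {count} values: {total / count:.2f}")

Only the (count, sum) pairs cross between resources, which is the data locality benefit described above.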

Additional concepts used in reference to the term Big Data refer to changes in analytics, which will be discussed in Section 2.2. A number of other terms (particularly terms starting with the letter V) are also used, several of which refer to the data science process or its benefit, instead of new Big Data characteristics. Some of these additional terms include veracity (i.e., accuracy of the data), value (i.e., value of the analytics to the organization), volatility (i.e., tendency for data structures to change over time), and validity (i.e., appropriateness of the data for its intended use). While these characteristics and others—including quality control, metadata, and data provenance—long pre-dated Big Data, their impact is still important in Big Data systems. Several of these terms are discussed with respect to Big Data analytics in Section 3.4.

Essentially, Big Data refers to the extensibility of data repositories and data processing across resources working in parallel, in the same way the compute-intensive simulation community embraced massively parallel processing two decades ago. By working out methods for communication among resources, the same scaling is now available to data-intensive applications.

2.2 Data Science Definitions

In its purest form, data science is the fourth paradigm of science, following theory, experiment, and computational science. The fourth paradigm is a term coined by Dr. Jim Gray in 2007. It refers to the conduct of data analysis as an empirical science, learning directly from data itself. Data science as a paradigm would refer to the formulation of a hypothesis, the collection of the data—new or pre-existing—to address the hypothesis, and the analytical confirmation or denial of the hypothesis (or the determination that additional information or study is needed.) In many data science projects, the raw data is browsed first, which informs a hypothesis, which is then investigated. As in any experimental science, the end result could be that the original hypothesis itself needs to be reformulated. The key concept is that data science is an empirical science, performing the scientific process directly on the data. Note that the hypothesis may be driven by a business need, or can be the restatement of a business need in terms of a technical hypothesis.

The data science paradigm is extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.

Data science can be understood as the activities happening in the processing layer of the system architecture, against data stored in the data layer, in order to extract knowledge from the raw data through the complete data lifecycle.

The data lifecycle is the set of processes that transform raw data into actionable knowledge.

Traditionally, the term analytics has been used as one of the steps in the data lifecycle of collection, preparation, analysis, and action.

Analytics is the synthesis of knowledge from information.

With the new Big Data paradigm, analytics are no longer separable from the data model and the distribution of that data across parallel resources. When structured data was almost exclusively stored as organized information in a relational model, the analytics could be designed for this structure. While the working definition of the data science paradigm refers to learning directly from data, in the Big Data paradigm this learning must implicitly involve all steps in the data lifecycle, with analytics being only a subset.

Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

Data science across the entire data lifecycle now incorporates principles, techniques, and methods from many disciplines and domains, including the analytics domains of mathematics, data mining (specifically machine learning and pattern recognition), statistics, operations research, and visualization, along with the domains of systems, software, and network engineering. Data scientists and data science teams solve complex data problems by employing deep expertise in one or more of these disciplines, in the context of business strategy, and under the guidance of domain knowledge. Personal skills in communication, presentation, and inquisitiveness are also very important given the complexity of interactions within Big Data systems.

A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data lifecycle.

While this full collection of skills can be present in a single individual, it is also possible that these skills, as shown in Figure 1, are covered in the members of a team.

Figure 1: Skills Needed in Data Science

Volume1Figure1.png

Data science is not solely concerned with analytics, but also with the end-to-end experimental lifecycle, where the data system is essentially the scientific equipment. The implication is that the data scientist must be aware of the sources and provenance of the data, the appropriateness and accuracy of the transformations on the data, the interplay between the transformation algorithms and processes, and the data storage mechanisms. This end-to-end overview role ensures that everything is performed correctly to meaningfully address the hypothesis. These analytics concepts are discussed further in Section 3.4.

Data science is increasingly used to influence business decisions. In Big Data systems, identifying a correlation is often sufficient for a business to take action. As a simple example, if it can be determined that using the color blue on a website leads to greater sales than using green, then this correlation can be used to improve the business. The reason for the preference is not needed—it is enough to determine correlation.
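
A minimal sketch of that kind of correlation check, using made-up visit and purchase counts:

# Hypothetical A/B counts: visits and purchases observed for each page color.
observations = {
    "blue": {"visits": 10_000, "purchases": 460},
    "green": {"visits": 10_000, "purchases": 395},
}

for color, counts in observations.items():
    rate = counts["purchases"] / counts["visits"]
    print(f"{color}: {rate:.2%} conversion")

# The business can act on the higher rate without knowing why blue performs better.
best = max(observations, key=lambda c: observations[c]["purchases"] / observations[c]["visits"])
print(f"act on: {best}")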

Several issues are currently being debated within the data science community, two of which are data sampling and the idea that more data is superior to better algorithms.

Data sampling, a central concept of statistics, involves the selection of a subset of data from the larger data population. The subset can be used as input for analytical processes, to determine the methodology to be used for experimental procedures, or to address specific questions. For example, it is possible to calculate the amount of data needed to determine an outcome for an experimental procedure (e.g., during a pharmaceutical clinical trial).

When the data mining community began, the emphasis was typically on re-purposed data (i.e., data used to train models was sampled from a larger dataset that was originally collected for another purpose). The often-overlooked critical step was to ensure that the analytics were not prone to over-fitting (i.e., the analytical pattern matched the data sample but did not work well to answer questions of the overall data population). In the new Big Data paradigm, it is implied that data sampling from the overall data population is no longer necessary since the Big Data system can theoretically process all the data without loss of performance. However, even if all of the available data is used, it still only represents a population subset whose behaviors led them to produce the data, which might not be the true population of interest. For example, studying Twitter data to analyze people’s behaviors does not represent all people, as not everyone uses Twitter. While less sampling may be used in data science processes, it is important to be aware of the implicit data sampling when trying to address business questions.
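
The implicit-sampling caution can be illustrated with a small, self-contained Python sketch; the population, the age-dependent participation rule, and all numbers are invented purely for illustration.

    import random

    random.seed(0)

    # Hypothetical population of interest: ages of 100,000 people.
    population = [random.gauss(45, 18) for _ in range(100_000)]

    # A platform whose users skew younger: the chance of joining falls with age.
    platform_users = [age for age in population if random.random() < max(0.05, 1 - age / 60)]

    mean = lambda xs: sum(xs) / len(xs)
    print(f"True mean age of the population:      {mean(population):5.1f}")
    print(f"Mean age from 'all' platform records: {mean(platform_users):5.1f}")
    # Even though every record the platform holds is used, the result describes
    # the platform's users, not the full population of interest.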

The assertion that more data is superior to better algorithms implies that better results can be achieved by analyzing larger samples of data rather than by refining the algorithms used in the analytics. The heart of this debate is that a few bad data elements are less likely to influence the analytical results in a large dataset than if errors are present in a small sample of that dataset. If the analytics needs are correlation and not causation, then this assertion is easier to justify. Outside the context of large datasets in which aggregate trending behavior is all that matters, the data quality rule remains “garbage-in, garbage-out,” meaning that accurate results cannot be expected from inaccurate data.

For descriptive purposes, analytics activities can be broken into different categories, including discovery, exploratory analysis, correlation analysis, predictive modeling, and machine learning. Again, these analytics categories are not specific to Big Data, but some have gained more visibility due to their greater application in data science.

Data science is tightly linked to Big Data, and refers to the management and execution of the end-to-end data processes, including the behaviors of the data system. As such, data science includes all of analytics, but analytics does not include all of data science.

2.3 Other Big Data Definitions

A number of Big Data definitions have been suggested as efforts have been made to understand the extent of this new field. Several Big Data concepts, discussed in previous sections, were observed in a sample of definitions taken from blog posts [3] [4] [5] [6]. This sample of formal and informal definitions offers a sense of the spectrum of concepts applied to the term Big Data. The sampled Big Data concepts and definitions are aligned in Table 1. The NBD-PWG’s definition is closest to the Gartner definition, with additional emphasis that the horizontal scaling is the element that provides the cost efficiency. The Big Data concepts and definitions in Table 1 are not comprehensive, but rather illustrate the interrelated concepts attributed to the catch-all term Big Data.

Table 1: Sampling of Concepts Attributed to Big Data

 

Concept: 4Vs (Volume, Variety, Velocity, and Variability) and Engineering

Gartner [7], [8]: “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”

Concept: Volume

Techtarget [9]: “Although Big data doesn't refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data.”

Oxford English Dictionary (OED) [10]: “big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.”

Concept: Bigger Data

Annette Greiner [5]: “Big data is data that contains enough observations to demand unusual handling because of its sheer size, though what is unusual changes over time and varies from one discipline to another.”

Concept: Not Only Volume

Quentin Hardy [5]: “What’s ‘big’ in big data isn’t necessarily the size of the databases, it’s the big number of data sources we have, as digital sensors and behavior trackers migrate across the world.”

Chris Neumann [5]: “…our original definition was a system that (1) was capable of storing 10 TB of data or more … As time went on, diversity of data started to become more prevalent in these systems (particularly the need to mix structured and unstructured data), which led to more widespread adoption of the “3 Vs” (volume, velocity, and variety) as a definition for big data.”

Concept: Big Data Engineering

IDC [11] [16]: “Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis.”

Hal Varian [5]: “Big data means data that cannot fit easily into a standard relational database.”

McKinsey [12]: “Big Data refers to a dataset whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”

Concept: Less Sampling

John Foreman [5]: “Big data is when your business wants to use data to solve a problem, answer a question, produce a product, etc., …crafting a solution to the problem that leverages the data without simply sampling or tossing out records.”

Peter Skomoroch [5]: “Big data originally described the practice in the consumer Internet industry of applying algorithms to increasingly large amounts of disparate data to solve problems that had suboptimal solutions with smaller datasets.”

Concept: New Data Types

Tom Davenport [13]: “The broad range of new and massive data types that have appeared over the last decade or so.”

Mark van Rijmenam [5]: “Big data is not all about volume, it is more about combining different data sets and to analyze it in real-time to get insights for your organization. Therefore, the right definition of big data should in fact be: mixed data.”

Concept: Analytics

Ryan Swanstrom [5]: “Big data used to mean data that a single machine was unable to handle. Now big data has become a buzzword to mean anything related to data analytics or visualization.”

Concept: Data Science

Joel Gurin [5]: “Big data describes datasets that are so large, complex, or rapidly changing that they push the very limits of our analytical capability.”

Josh Ferguson [5]: “Big data is the broad name given to challenges and opportunities we have as data about every aspect of our lives becomes available. It’s not just about data though; it also includes the people, processes, and analysis that turn data into meaning.”

Concept: Value

Harlan Harris [5]: “To me, ‘big data’ is the situation where an organization can (arguably) say that they have access to what they need to reconstruct, understand, and model the part of the world that they care about.”

Jessica Kirkpatrick [5]: “Big data refers to using complex datasets to drive focus, direction, and decision making within a company or organization.”

Hilary Mason [5]: “Big data is just the ability to gather information and query it in such a way that we are able to learn things about the world that were previously inaccessible to us.”

Gregory Piatetsky-Shapiro [5]: “The best definition I saw is, “Data is big when data size becomes part of the problem.” However, this refers to the size only. Now the buzzword “big data” refers to the new data-driven paradigm of business, science and technology, where the huge data size and scope enables better and new services, products, and platforms.”

Concept: Cultural Change

Drew Conway [5]: “Big data, which started as a technological innovation in distributed computing, is now a cultural movement by which we continue to discover how humanity interacts with the world—and each other—at large-scale.”

Daniel Gillick [5]: “‘Big data’ represents a cultural shift in which more and more decisions are made by algorithms with transparent logic, operating on documented immutable evidence. I think ‘big’ refers more to the pervasive nature of this change than to any particular amount of data.”

Cathy O’Neil [5]: “‘Big data’ is more than one thing, but an important aspect is its use as a rhetorical device, something that can be used to deceive or mislead or overhype.”

3 Big Data Features

The diversity of Big Data concepts discussed in Section 2 is similarly reflected in the discussion of Big Data features in this section. Section 3 examines selected Big Data terms and concepts to clarify the new aspects brought about by the Big Data paradigm in the context of existing data architectures and analysis practices.

3.1 Data Elements and Metadata

Individual data elements have not changed with Big Data and are not discussed in detail in this document. For additional information on data types, readers are directed to the ISO standard ISO/IEC 11404:2007 General Purpose Datatypes [14], and, as an example, its extension into healthcare information data types in ISO 21090:2011 Health Informatics [15].

One concept important to Big Data is metadata, which is often described as “data about data.” Metadata describes additional information about the data, such as how and when the data was collected and how it has been processed. Metadata should itself be viewed as data, with all the requirements for tracking, change management, and security. Many metadata standards are being developed, both for general metadata coverage (e.g., ISO/IEC 11179-x [16]) and for discipline-specific metadata (e.g., ISO 19115-x [17] for geospatial data).

Metadata that describes the history of a dataset is called its provenance, which is discussed further in Section 3.6. As open data (data available to others) and linked data (data that is connected to other data) become the norm, it is increasingly important to have information about how data was collected, transmitted, and processed. Provenance metadata guides users toward correct data utilization when the data is repurposed from its original collection process in an effort to extract additional value.
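
As an illustration only, the sketch below represents provenance-style metadata carried alongside a dataset as a simple Python record; the field names are hypothetical and do not follow ISO/IEC 11179, ISO 19115, or any other standard.

    # Illustrative provenance record; field names are invented for this sketch.
    provenance = {
        "dataset_id": "sensor-feed-2015-04",
        "collected_by": "field sensor network",
        "collected_on": "2015-04-06",
        "transmission": ["sensor -> gateway", "gateway -> ingest cluster"],
        "processing_steps": [
            {"date": "2015-04-07", "step": "deduplication", "tool": "ingest-job"},
            {"date": "2015-04-08", "step": "unit conversion", "tool": "prep-job"},
        ],
        "original_purpose": "equipment monitoring",
    }

    # A consumer repurposing the data can review its full history before use.
    for step in provenance["processing_steps"]:
        print(f'{step["date"]}: {step["step"]} ({step["tool"]})')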

Semantic metadata, another type of metadata, refers to the definitional description of a data element to assist with proper interpretation. An ontology can be conceptualized as a graphical model representing semantic relationships between entities; ontologies are semantic models constrained to follow different levels of logic models. Ontologies and semantic models predate Big Data and are not discussed in depth in this document. Ontologies can be very general or extremely domain specific in nature. A number of mechanisms exist for implementing these definitional descriptions, and the reader is referred to the World Wide Web Consortium (W3C) efforts on the semantic web [18] [19] for additional information. Semantic data is important in the new Big Data paradigm, since the Semantic Web represents a Big Data attempt to provide cross-cutting meanings for terms. Again, semantic metadata is especially important for linked data efforts.

Taxonomies represent, in some sense, metadata about relationships between data elements. A taxonomy is a hierarchical relationship between entities, in which a data element is broken down into smaller component parts. While these concepts are important, they predate the Big Data paradigm shift.

3.2 Data Records and Non-Relational Models

Data elements are collected into records that describe a particular observation, event, or transaction. Previously, most of the data in business systems was structured data, where each record was consistently structured and could be described efficiently in a relational model. Records are conceptualized as the rows in a table where data elements are in the cells. Unstructured data types, such as text, image, video, and relationship data, have been increasing in both volume and prominence. While modern relational databases tend to have support for these types of data elements, their ability to directly analyze, index, and process them has tended to be limited and is typically accessed through non-standard SQL extensions. The need to analyze unstructured or semi-structured data has been present for many years. However, the Big Data paradigm shift has increased the emphasis on the value of unstructured and relationship data, and on different engineering methods that can handle such data more efficiently.

Big Data engineering refers to the new ways data is stored in records. In some cases, the records still follow the concept of a table structure. One storage paradigm is a key-value structure, in which a record consists of a key and a string of data together in the value. The data is retrieved through the key, and the non-relational database software handles accessing the data in the value. This can be viewed as a subset or simplification of a relational database table with a single index field and column. A variant on this is the document store, where the document has multiple value fields, any of which can be used as the index/key. The difference from the relational table model is that the documents in a set do not all need to have the same value fields.
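
The difference between the two storage paradigms can be sketched in a few lines of Python; the stores are modeled here as in-memory structures purely for illustration.

    import json

    # Key-value store: the value is opaque to the database; only the key is indexed.
    kv_store = {
        "user:1001": '{"name": "Ada", "city": "Portland", "age": 36}',
        "user:1002": '{"name": "Lin", "city": "Austin"}',
    }
    record = json.loads(kv_store["user:1001"])   # retrieval is by key only

    # Document store: named fields are visible, documents need not share the same
    # fields, and any field can be used to look records up.
    doc_store = [
        {"_id": "1001", "name": "Ada", "city": "Portland", "age": 36},
        {"_id": "1002", "name": "Lin", "city": "Austin"},          # no "age" field
    ]
    austin_users = [d for d in doc_store if d.get("city") == "Austin"]
    print(record["name"], austin_users)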

Another new type of Big Data record storage is the graph model. A graph model represents the relationships between data elements. The data elements are nodes, and each relationship is represented as a link between nodes. Graph storage models represent each data element as a series of subject, predicate, and object triples. Often, the available types of objects and relationships are described via ontologies, as discussed above.
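
A minimal sketch of the triple representation, with invented subjects and predicates that loosely echo the genomics example below, might look as follows.

    # Subject-predicate-object triples; the identifiers are illustrative only.
    triples = [
        ("gene:BRCA1", "located_on", "chromosome:17"),
        ("gene:BRCA1", "associated_with", "condition:breast_cancer"),
        ("chromosome:17", "part_of", "genome:human"),
    ]

    def objects(subject, predicate):
        """Return every object linked to a subject by a given predicate."""
        return [o for s, p, o in triples if s == subject and p == predicate]

    print(objects("gene:BRCA1", "associated_with"))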

Another data element relationship concept that is not new in the Big Data paradigm shift is the presence of complexity between the data elements. There are systems in which data elements cannot be analyzed outside the context of other data elements. This is evident, for example, in the analytics for the Human Genome Project, where it is the relationship between the elements and their position and proximity to other elements that matters. The term complexity is often attributed to Big Data, but it refers to this inter-relationship between data elements or across data records, independent of whether the dataset has the characteristics of Big Data.

3.3 Dataset Characteristics and Storage

Data records are grouped into datasets, which can have the Big Data characteristics of volume, velocity, variety, and variability. Dataset characteristics can refer to data at rest (the data itself in storage), while data that is traversing a network or temporarily residing in computer memory to be read or updated is referred to as data in motion, which is discussed in Section 3.4.

Data at Rest: Typical characteristics of data at rest that are notably different in the era of Big Data are volume and variety. Volume is the characteristic of data at rest that is most associated with Big Data. Estimates show that the amount of data in the world doubles every two years [20]. Should this trend continue, by 2020 there would be 500 times the amount of data as existed in 2011. The sheer volume of the data is colossal. The data volumes have stimulated new ways for scalable storage across a collection of horizontally coupled resources, as described in Section 2.1.

The second characteristic of data at rest is the increasing need to use a variety of data, meaning the data represents a number of data domains and a number of data types. Traditionally, a variety of data was handled through transformations or pre-analytics to extract features that would allow integration with other data. The wider range of data formats, logical models, timescales, and semantics that analysts wish to use complicates the integration of this variety of data. For example, data to be integrated could be text from social networks, image data, or a raw feed directly from a sensor source. To deal with a wider range of data formats, the federated database model was designed as a single logical database across the underlying databases. Data to be integrated for analytics may now be of such volume that it cannot be moved to be integrated, or some of the data may not be under the control of the organization creating the data system. In either case, the variety of Big Data forces a range of new Big Data engineering solutions to efficiently and automatically integrate data that is stored across multiple repositories, in multiple formats, and in multiple logical data models.

Big Data engineering has spawned data storage models that are more efficient for unstructured data than the traditional relational model, causing a derivative issue for the mechanisms to integrate this data. New scalable techniques have arisen to manage and manipulate Big Data not stored in traditional expensive high-performance “vertically” scaled systems, but rather spread across a number of less expensive resources. For example, the document store was developed specifically to support the idea of storing and indexing heterogeneous data in a common repository for analysis. New types of non-relational storage for data records are discussed below.

Shared-disk File Systems: These approaches, such as Storage Area Networks (SANs) and Network Attached Storage (NAS), use a single storage pool, which is accessed from multiple computing resources. While these technologies solved many aspects of accessing very large datasets from multiple nodes simultaneously, they suffered from issues related to data locking and updates and, more importantly, created a performance bottleneck (from every input/output [I/O] operation accessing the common storage pool) that limited their ability to scale up to meet the needs of many Big Data applications. These limitations were overcome through the implementation of fully distributed file systems.

Distributed File Systems: In distributed file storage systems, multi-structured (object) datasets are distributed across the computing nodes of the server cluster(s). The data may be distributed at the file/dataset level, or more commonly, at the block level, allowing multiple nodes in the cluster to interact with different parts of a large file/dataset simultaneously. Big Data frameworks are frequently designed to take advantage of data locality to each node when distributing the processing, which avoids any need to move the data between nodes. In addition, many distributed file systems also implement file/block level replication where each file/block is stored multiple times on different machines for both reliability/recovery (data is not lost if a node in the cluster fails), as well as enhanced data locality. Any type of data and many sizes of files can be handled without formal extract, transformation, and load conversions, with some technologies performing markedly better for large file sizes.
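
The block-distribution and replication idea can be sketched as follows; the node names, block count, and replication factor of three are assumptions for illustration rather than a description of any particular file system.

    import itertools

    nodes = ["node1", "node2", "node3", "node4", "node5"]
    replication = 3
    blocks = [f"block-{i}" for i in range(6)]    # a large file split into 6 blocks

    # Round-robin placement: each block is stored on three different nodes.
    node_cycle = itertools.cycle(nodes)
    placement = {block: [next(node_cycle) for _ in range(replication)] for block in blocks}

    for block, holders in placement.items():
        print(block, "->", holders)
    # Processing is then scheduled on a node that already holds each block
    # ("data locality") instead of moving the block to the computation.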

Distributed Computing: The popular framework for distributed computing consists of a storage layer and processing layer combination that implements a multiple-class, algorithm-programming model. Low cost servers supporting the distributed file system that stores the data can dramatically lower the storage costs of computing on a large scale of data (e.g., web indexing). MapReduce is the default processing component in data-distributed computing. Processing results are typically then loaded into an analysis environment.
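
As a hedged illustration of the MapReduce style, the following pure-Python word count separates the map, shuffle, and reduce steps that a real framework would run in parallel on the nodes holding the data blocks.

    from collections import defaultdict

    documents = ["big data needs scalable storage", "big data needs scalable analysis"]

    def map_phase(doc):
        """Emit (word, 1) pairs for every word in a document."""
        return [(word, 1) for word in doc.split()]

    def reduce_phase(word, counts):
        """Combine all counts emitted for one word."""
        return word, sum(counts)

    # Shuffle: group the intermediate pairs by key.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = dict(reduce_phase(word, counts) for word, counts in grouped.items())
    print(results)    # e.g., {'big': 2, 'data': 2, 'needs': 2, 'scalable': 2, ...}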

The use of inexpensive servers is appropriate for slower, batch-speed Big Data applications, but does not provide good performance for applications requiring low-latency processing. The use of basic MapReduce for processing places limitations on updating or iterative access to the data during computation. Bulk Synchronous Parallel systems or newer MapReduce developments can be used when repeated updating is a requirement. Improvements and “generalizations” of MapReduce have been developed that provide additional functions lacking in the older technology, including fault tolerance, iteration flexibility, elimination of the middle layer, and ease of query.

Resource Negotiation: The common distributed computing system has little in the way of built-in data management capabilities. In response, several technologies have been developed to provide the necessary support functions, including operations management, workflow integration, security, and governance. Of special importance to resource management development are new features for supporting additional processing models (other than MapReduce) and controls for multi-tenant environments, higher availability, and lower latency applications.

In a typical implementation, the resource manager is the hub for several node managers. The client or user accesses the resource manager, which in turn launches a request to an application master within one or more node managers. A second client may also launch its own requests, which will be given to other application masters within the same or other node managers. Tasks are assigned a priority value, are allocated resources based on available CPU and memory, and are provided the appropriate processing resources in the node.
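
A toy version of this negotiation, with invented application names and resource figures, is sketched below; it illustrates the priority-and-capacity logic only and does not model any specific resource manager.

    # Available capacity on one node; the numbers are assumptions for the sketch.
    node = {"cpu": 16, "memory_gb": 64}

    requests = [
        {"app": "interactive-query", "priority": 1, "cpu": 4, "memory_gb": 8},
        {"app": "batch-etl",         "priority": 3, "cpu": 8, "memory_gb": 32},
        {"app": "model-training",    "priority": 2, "cpu": 8, "memory_gb": 40},
    ]

    # Grant containers in priority order while CPU and memory remain.
    for req in sorted(requests, key=lambda r: r["priority"]):
        if req["cpu"] <= node["cpu"] and req["memory_gb"] <= node["memory_gb"]:
            node["cpu"] -= req["cpu"]
            node["memory_gb"] -= req["memory_gb"]
            print(f'granted  {req["app"]}')
        else:
            print(f'deferred {req["app"]} (insufficient resources)')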

Data movement is normally handled by transfer and application program interface (API) technologies other than the resource manager. In rare cases, peer-to-peer (P2P) communications protocols can also propagate or migrate files across networks at scale, meaning that technically these P2P networks are also distributed file systems. The largest social networks, arguably some of the most dominant users of Big Data, move binary large objects (BLOBs) of over 1 gigabyte (GB) in size internally over large numbers of computers via such technologies. The internal use case has been extended to private file synchronization, where the technology permits automatic updates to local folders whenever two end users are linked through the system.

In external use cases, each end of the P2P system contributes bandwidth to the data movement, making this currently the fastest way to deliver documents to the largest number of concurrent users. For example, NASA (U.S. National Aeronautics and Space Administration) uses this technology to make 3 GB images available to the public. However, any large bundle of data (e.g., video, scientific data) can be quickly distributed at lower bandwidth cost.

There are additional aspects of Big Data that are changing rapidly and are not fully explored in this document, including cluster management and other mechanisms for providing communication among the cluster resources holding the data in the non-relational models. Discussion of the use of multiple tiers of storage (e.g., in-memory, cache, solid state drive, hard drive, network drive) in the newly emerging software defined storage can be found in other industry publications. Software defined storage is the use of software to determine the dynamic allocation of tiers of storage to reduce storage costs while maintaining the required data retrieval performance.
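
A software-defined storage policy can be sketched as a simple placement rule; the tier names and thresholds below are assumptions chosen only to illustrate trading storage cost against retrieval performance.

    def choose_tier(reads_per_day, required_latency_ms):
        """Place an object on the cheapest tier that still meets its access pattern."""
        if required_latency_ms < 1:
            return "in-memory"
        if reads_per_day > 1000:
            return "solid-state drive"
        if reads_per_day > 10:
            return "hard drive"
        return "network archive"

    print(choose_tier(reads_per_day=50000, required_latency_ms=5))   # solid-state drive
    print(choose_tier(reads_per_day=2, required_latency_ms=500))     # network archive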

3.4 Data in Motion

Another important characteristic of Big Data is the time window in which the analysis can take place. Data in motion is processed and analyzed in real time, or near-real time, and has to be handled in a very different way than data at rest (i.e., persisted data). Data in motion tends to resemble event-processing architectures, and focuses on real-time or operational intelligence applications.

Typical characteristics of data in motion that are significantly different in the era of Big Data are velocity and variability. Velocity is the rate of flow at which the data is created, stored, analyzed, and visualized. Big Data velocity means that a large quantity of data is being processed in a short amount of time. In the Big Data era, data is created and passed on in real time or near real time. Increasing data flow rates create new challenges to enable real-time or near-real-time data usage. Traditionally, this concept has been described as streaming data. While these aspects are new for some industries, other industries (e.g., telecommunications) have processed high-volume, short-time-interval data for years. However, the new in-parallel scaling approaches do add new Big Data engineering options for efficiently handling this data.
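
A minimal sketch of near-real-time handling is a rolling time window maintained as events arrive; the window length and the event stream below are invented for illustration.

    from collections import deque

    window_seconds = 10
    window = deque()                     # (timestamp, value) pairs inside the window

    def on_event(timestamp, value):
        """Add an event, expire old ones, and return the current window summary."""
        window.append((timestamp, value))
        while window and window[0][0] <= timestamp - window_seconds:
            window.popleft()
        return len(window), sum(v for _, v in window)

    for t, v in [(1, 5), (4, 3), (9, 7), (12, 2), (25, 1)]:
        count, total = on_event(t, v)
        print(f"t={t:2d}  events_in_window={count}  running_sum={total}")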

The second characteristic for data in motion is variability, which refers to any change in data over time, including the flow rate, the format, or the composition. Given that many data processes generate a surge in the amount of data arriving in a given amount of time, new techniques are needed to efficiently handle this data. The data processing is often tied up with the automatic provisioning of additional virtualized resources in a cloud environment. Detailed discussions of the techniques used to process data can be found in other industry publications that focus on operational cloud architectures [21] [22]. Early Big Data systems built by Internet search providers and others were frequently deployed on bare metal to achieve the best efficiency at distributing I/O across the clusters and multiple storage devices. While cloud (i.e., virtualized) infrastructures were frequently used to test and prototype Big Data deployments, there are recent trends, due to improved efficiency in I/O virtualization infrastructures, of production solutions being deployed on cloud or Infrastructure-as-a-Service (IaaS) platforms. A high velocity system with high variability may be deployed on a cloud infrastructure, because of the cost and performance efficiency of being able to add or remove nodes to handle the peak performance. Being able to release those resources when they are no longer needed provides significant cost savings for operating this type of Big Data system. Very large implementations and in some cases cloud providers are now implementing this same type of elastic infrastructure on top of their physical hardware. This is especially true for organizations that already need extensive infrastructure but simply need to balance resources across application workloads that can vary.
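
The elasticity argument can be sketched as a simple scaling rule; the per-node capacity and arrival rates below are assumptions, and real provisioning would be carried out through the cloud platform's own interfaces.

    def nodes_needed(events_per_second, capacity_per_node=5000, min_nodes=2):
        """Ceiling division of the arrival rate by per-node capacity, with a floor."""
        return max(min_nodes, -(-events_per_second // capacity_per_node))

    current = 2
    for rate in [3000, 40000, 125000, 8000]:      # a surge in arrivals, then a lull
        target = nodes_needed(rate)
        action = "add" if target > current else ("release" if target < current else "keep")
        print(f"rate={rate:6d}/s  target_nodes={target:3d}  action={action}")
        current = target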

3.5 Data Science Lifecycle Model for Big Data

As was introduced in Section 2.1, the data lifecycle consists of the following four stages:

1.      Collection: This stage gathers and stores data in its original form (i.e., raw data).

2.      Preparation: This stage involves the collection of processes that convert raw data into cleansed, organized information.

3.      Analysis: This stage involves the techniques that produce synthesized knowledge from organized information.

4.      Action: This stage involves processes that use the synthesized knowledge to generate value for the enterprise.

In the traditional data warehouse, the data handling process followed a fixed order of collection, preparation, storage, and analysis. The relational model was designed in a way that optimized the intended analytics. The different Big Data characteristics have influenced changes in the ordering of the data handling processes. Examples of these changes are as follows:

·         Data warehouse: Persistent storage occurs after data preparation

·         Big Data volume system: Data is stored immediately in raw form before preparation; preparation occurs on read, and is referred to as ‘schema on read’

·         Big Data velocity application: The collection, preparation, and analytics (alerting) occur on the fly, and possibly includes some summarization or aggregation prior to storage

Just as simulations split the analytical processing across clusters of processors, data processes are redesigned to split data transformations across data nodes. Because the data may be too big to move, the transformation code may be sent in parallel across the data persistence nodes, rather than the data being extracted and brought to the transformation servers.
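
The 'schema on read' pattern noted above can be sketched in a few lines of Python: raw records are stored exactly as collected, and field selection, type conversion, and normalization are applied only when the data is read. The field names and the unit conversion are illustrative assumptions.

    import json

    # Raw records stored exactly as collected; note the inconsistent fields.
    raw_storage = [
        '{"ts": "2015-04-06T10:00:00", "temp_f": "71.6", "site": "A"}',
        '{"ts": "2015-04-06T10:05:00", "temp_c": 22.1, "site": "B"}',
    ]

    def read_with_schema(line):
        """Apply the schema (field names, types, units) at read time."""
        rec = json.loads(line)
        temp_c = rec.get("temp_c")
        if temp_c is None and "temp_f" in rec:
            temp_c = (float(rec["temp_f"]) - 32) * 5 / 9     # normalize units on read
        return {"timestamp": rec["ts"], "site": rec["site"], "temp_c": round(temp_c, 1)}

    for line in raw_storage:
        print(read_with_schema(line))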

3.6 Big Data Analytics

Analytic processes are often characterized as discovery, for the initial hypothesis formulation; development, for establishing the analytics process for a specific hypothesis; and applied, for the encapsulation of the analysis into an operational system. While Big Data has touched all three types of analytic processes, the majority of the changes are observed in development and applied analytics. New Big Data engineering technologies change the types of analytics that are possible, but do not result in completely new types of analytics. However, given the retrieval speeds, analysts are able to interact with their data in ways that were not previously possible. Traditional statistical analytic techniques downsize, sample, or summarize the data before analysis. This was done to make analysis of large datasets reasonable on hardware that could not scale to the size of the dataset. Big Data analytics often emphasize the value of computation across the entire dataset, which gives analysts better chances to determine causation, rather than just correlation. Correlation, though, is still useful when knowing the direction or trend of something is enough to take action. Today, most analytics in statistics and data mining focus on causation, that is, being able to describe why something is happening. Discovering the cause aids actors in changing a trend or outcome. Actors, which in system development can represent individuals, organizations, software, or hardware, are discussed in NIST Big Data Interoperability Framework: Volume 2, Taxonomies. Big Data solutions make it more feasible to implement causation-type complex analytics for large, complex, and heterogeneous data.

In addition to volume, velocity, variety, and variability, several terms, many beginning with V, have been used in connection with Big Data requirements for the system architecture. Some of these terms strongly relate to analytics on the data. Veracity and provenance are two such terms and are discussed below.

Veracity refers to the completeness and accuracy of the data and relates to the vernacular “garbage-in, garbage-out” description for data quality issues in existence for a long time. If the analytics are causal, then the quality of every data element is extremely important. If the analytics are correlations or trending over massive volume datasets, then individual bad elements could be lost in the overall counts and the trend will still be accurate. As mentioned in Section 2.2, many people debate whether “more data is superior to better algorithms,” but that is a topic better discussed elsewhere.

As discussed in Section 3.1, the provenance, or history of the data, is increasingly an essential factor in Big Data analytics, as more and more data is being repurposed for new types of analytics in completely different disciplines from which the data was created. As the usage of data persists far beyond the control of the data producers, it becomes ever more essential that metadata about the full creation and processing history is made available along with the data. In addition, it is vital to know what analytics may have produced the data, since there are always confidence ranges, error ranges, and precision/recall limits associated with analytic outputs.

Another analytics consideration is the speed of interaction between the analytics processes and the person or process responsible for delivering the actionable insight. Analytic data processing speed can fall along a continuum between batch and streaming-oriented processing. Although the processing continuum existed prior to the era of Big Data, the desired location on this continuum is a large factor in the choice of architectures and component tools to be used. Given the greater query and analytic speeds within Big Data due to the scaling across a cluster, there is an increasing emphasis on interactive (i.e., real-time) processing. Rapid analytics cycles allow an analyst to do exploratory discovery on the data, browsing more of the data space than might otherwise have been possible in any practical time frame. The processing continuum is further discussed in NIST Big Data Interoperability Framework: Volume 6, Reference Architecture.

3.7 Big Data Metrics and Benchmarks

Initial considerations in the use of Big Data engineering include the determination, for a particular situation, of the size threshold after which data should be considered Big Data. Multiple factors must be considered in this determination and the outcome is particular to each application. As described in Section 2.1, Big Data characteristics lead to use of Big Data engineering techniques to allow the data system to operate affordably and efficiently. Whether a performance or cost efficiency can be attained for a particular application requires a design analysis, which is beyond the scope of this report.

There is a significant need for metrics and benchmarking to provide standards for the performance of Big Data systems. This topic is being addressed by the Transaction Processing Performance Council (TPC) TPC-xHD Big Data Committee, and available information from their efforts may be included in future versions of this report.

3.8 Big Data Security and Privacy

Security and privacy have also been affected by the emergence of the Big Data paradigm. A detailed discussion of the influence of Big Data on security and privacy is included in NIST Big Data Interoperability Framework: Volume 4, Security and Privacy. Some of the effects of Big Data characteristics on security and privacy are summarized below:

·         Variety: Retargeting traditional relational database security to non-relational databases has been a challenge. An emergent phenomenon introduced by Big Data variety that has gained considerable importance is the ability to infer identity from anonymized datasets by correlating with apparently innocuous public databases.

·         Volume: The volume of Big Data has necessitated storage in multi-tiered storage media. The movement of data between tiers has led to a requirement of systematically analyzing the threat models and research and development of novel techniques.

·         Velocity: As with non-relational databases, distributed programming frameworks such as Hadoop were not developed with security as a primary objective.

·         Veracity: Complex challenges have been introduced in protecting data integrity as well as maintaining privacy policies as data moves across individual boundaries to groups, communities of interest, state, national, and international boundaries.

·         Volatility: Security and privacy requirements can shift according to the time-dependent nature of the roles that collected, processed, aggregated, and stored the data. Governance can shift as responsible organizations merge or even disappear.

Privacy concerns, and frameworks to address these concerns, predate Big Data. While bounded in comparison to Big Data, past solutions considered legal, social, and technical requirements for privacy in distributed systems, very large databases, and in HPCC. The addition of variety, volume, velocity, veracity, volatility, and value to the mix has amplified these concerns to the level of a national conversation, with unanticipated impacts on privacy frameworks.

3.9 Data Governance

Data governance is a fundamental element in the management of data and data systems.

Data governance refers to administering, or formalizing, discipline (e.g., behavior patterns) around the management of data.

The definition of data governance includes management across the complete data lifecycle, whether the data is at rest, in motion, in incomplete stages, or transactions. To maximize its benefit, data governance must also consider the issues of privacy and security of individuals of all ages, individuals as companies, and companies as companies.

Data governance is needed to address important issues in the new global Internet Big Data economy. For example, many businesses provide a data hosting platform for data that is generated by the users of the system. While governance policies and processes from the point of view of the data hosting company are commonplace, the issue of governance and control rights of the data providers is new. Many questions remain, including the following: Do the data providers still own their data, or is the data owned by the hosting company? Do the data producers have the ability to delete their data? Can they control who is allowed to see their data?

The question of governance resides between the value that one party (e.g., the data hosting company) wants to generate versus the rights that the data provider wants to retain to obtain their own value. New governance concerns arising from the Big Data Paradigm need greater discussion, and will be discussed during the development of the next version of this document.

4 Big Data Engineering Patterns (Fundamental Concepts)

To define the differences between Big Data technologies, different ‘scenarios’ and ‘patterns’ are needed to illustrate relationships between Big Data characteristics (Section 2.1) and between the NBDRA components found in NIST Big Data Interoperability Framework: Volume 6, Reference Architecture. The scenarios would describe the high-level functional processes that can be used to categorize and, therefore, provide better understanding of the different use cases presented in NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements, as well as help to clarify the differences in specific implementations of components listed in the NIST Big Data Interoperability Framework: Volume 6, Reference Architecture.

The topics surrounding the relaxation of the principles of a relational model in non-relational systems are very important. These topics are discussed in industry publications on concurrency and will be addressed more fully in future versions of this document.

Appendix A: Index of Terms

A

analytics, 7

B

Big Data, 5, 6

Big Data engineering, 5

Big Data paradigm, 4

Big Data velocity application, 14

Big Data volume system, 14

C

complexity, 10

Computational portability, 6

D

data lifecycle, 6

data sampling, 8

data science, 8

data science paradigm, 6

data scientist, 7

data warehouse, 14

F

federated database system, 5

fourth paradigm, 6

M

massively parallel processing, 4

metadata, 10

N

non-relational models, 5

NoSQL, 5

O

ontologies, 10

P

provenance, 13

R

relational model, 10

S

Schema-on-read, 6

semantic data, 10

semi-structured data, 10

streaming data, 12

structured data, 10

T

taxonomies, 10

U

unstructured data, 10

V

validity, 8

value, 8

variability, 4

variety, 4

velocity, 10

veracity, 8

volatility, 8

volume, 4

Appendix B: Terms and Definitions

Analytics is the synthesis of knowledge from information.

Big Data consists of extensive datasets (primarily in the characteristics of volume, variety, velocity, and/or variability) that require a scalable architecture for efficient storage, manipulation, and analysis.

Big Data engineering includes advanced techniques that harness independent resources for building scalable data systems when the characteristics of the datasets require new architectures for efficient storage, manipulation, and analysis.

The Big Data paradigm consists of the distribution of data systems across horizontally coupled, independent resources to achieve the scalability needed for the efficient processing of extensive datasets.

Computational portability is the movement of the computation to the location of the data.

Data governance refers to the overall management of the availability, usability, integrity, and security of the data employed in an enterprise.

The data lifecycle is the set of processes that transforms raw data into actionable knowledge, which includes data collection, preparation, analytics, visualization, and access.

Data science is the empirical synthesis of actionable knowledge from raw data through the complete data lifecycle process.

The data science paradigm is extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing.

A data scientist is a practitioner who has sufficient knowledge in the overlapping regimes of business needs, domain knowledge, analytical skills, and software and systems engineering to manage the end-to-end data processes through each stage in the data lifecycle.

Distributed computing is a computing system in which components located on networked computers communicate and coordinate their actions by passing messages.

Distributed file systems contain multi-structured (object) datasets that are distributed across the computing nodes of the server cluster(s).

A federated database system is a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database.

Horizontal scaling implies the coordination of individual resources (e.g., server) that are integrated to act in parallel as a single system (i.e., operate as a cluster).

Latency refers to the delay in processing or in availability.

Massively parallel processing refers to a multitude of individual processors working in parallel to execute a particular program.

Non-relational models, frequently referred to as NoSQL, refer to logical data models that do not follow relational algebra for the storage and manipulation of data.

Resource negotiation consists of built-in data management capabilities that provide the necessary support functions, such as operations management, workflow integration, security, governance, support for additional processing models, and controls for multi-tenant environments, providing higher availability and lower latency applications.

Schema-on-read is the application of a data schema through preparation steps such as transformations, cleansing, and integration at the time the data is read from the database.

Shared-disk file systems, such as Storage Area Networks (SANs) and Network Attached Storage (NAS), use a single storage pool, which is accessed from multiple computing resources.

Validity refers to appropriateness of the data for its intended use.

Value refers to the inherent wealth, economic and social, embedded in any dataset.

Variability refers to the change in other data characteristics.

Variety refers to data from multiple repositories, domains, or types.

Velocity refers to the rate of data flow.

Veracity refers to the accuracy of the data.

Vertical scaling implies increasing the system parameters of processing speed, storage, and memory for greater performance.

Volatility refers to the tendency for data structures to change over time.

Volume refers to the size of the dataset.

Appendix C: Acronyms

API                 application program interface

BLOBs           binary large objects

GB                  gigabyte

I/O                  input/output

ISO                 International Organization for Standardization

ITL                 Information Technology Laboratory

JTC 1              Joint Technical Committee 1

MPP                massively parallel processing

NARA            National Archives and Records Administration

NAS               Network Attached Storage

NASA             National Aeronautics and Space Administration

NBD-PWG     NIST Big Data Public Working Group

NBDRA          NIST Big Data Reference Architecture

NIST               National Institute of Standards and Technology

NSF                National Science Foundation

OED               Oxford English Dictionary

P2P                 peer-to-peer

SANs              Storage Area Networks

SQL                Structured Query Language

NoSQL           Not Only or No Structured Query Language

W3C               World Wide Web Consortium

Appendix D: References

Document References

[1]

The White House Office of Science and Technology Policy, “Big Data is a Big Deal,” OSTP Blog, accessed February 21, 2014, http://www.whitehouse.gov/blog/2012/03/29/big-data-big-deal.

[2]

Gordon Moore, "Cramming More Components Onto Integrated Circuits," Electronics, Volume 38, Number 8 (1965), pages 114-117.

[3]

Jenna Dutcher, “What is Big Data,” Data Science at Berkeley Blog, September 3, 2014, http://datascience.berkeley.edu/what-is-big-data/.

[4]

Emerging Technology From the arXiv (Contributor), “The Big Data Conundrum: How to Define It?,” MIT Technology Review, October 3, 2013, http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-define-it/.

[5]

ISO/IEC JTC 1 Study Group on Big Data (SGBD), “N0095 Final SGBD Report to JTC1,” September 3, 2014, http://jtc1bigdatasg.nist.gov/_uploadfiles/N0095_Final_SGBD_Report_to_JTC1.docx

[6]

Gil Press (Contributor), “12 Big Data Definitions: What’s Yours?,” Forbes.com, accessed November 17, 2014, http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/.

[7]

ISO/IEC JTC 1 Study Group on Big Data (SGBD), “N0095 Final SGBD Report to JTC1,” September 3, 2014, http://jtc1bigdatasg.nist.gov/_uploadfiles/N0095_Final_SGBD_Report_to_JTC1.docx

[8]

Gartner IT Glossary, “Big Data” (definition), Gartner.com, accessed November 17, 2014, http://www.gartner.com/it-glossary/big-data.

[9]

Jenna Dutcher, “What is Big Data,” Data Science at Berkeley Blog, September 3, 2014, http://datascience.berkeley.edu/what-is-big-data/.

[10]

Oxford English Dictionary, “Big Data” (definition), OED.com, accessed November 17, 2014, http://www.oed.com/view/Entry/18833#eid301162178.

[11]

John Gantz and David Reinsel, “Extracting Value from Chaos,” IDC iView sponsored by EMC Corp, accessed November 17, 2014, http://www.emc.com/collateral/analyst-reports/idc-extracting-value-from-chaos-ar.pdf.

[12]

James Manyika et al., “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, May 2011.

[13]

Tom Davenport, “Big Data@Work,” Harvard Business Review Press, February 25, 2014.

[14]

ISO/IEC 11404:2007, “Information technology -- General-Purpose Datatypes (GPD),” International Organization for Standardization, http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=39479.

[15]

ISO 21090:2011, “Health informatics -- Harmonized data types for information interchange,” International Organization for Standardization, http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=35646.  

[16]

ISO/IEC 11179-2004, Information technology – “Metadata registries (MDR) – Part 1: Framework,” International Organization for Standardization, http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=35343.

[17]

ISO 19115-2014, “Geographic information – Metadata – Part 1: Fundamentals,” International Organization for Standardization, http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=53798.

[18]

Phil Archer, “W3C Data Activity Building the Web of Data,” W3C, http://www.w3.org/2013/data/.

[19]

Dan Brickley and Ivan Herman, “Semantic Web Interest Group,” W3C, June 16, 2012, http://www.w3.org/2001/sw/interest/.

[20]

EMC2, “Digital Universe,” EMC, accessed February 21, 2014, http://www.emc.com/leadership/programs/digital-universe.htm.

[21]

Lee Badger, David Bernstein, Robert Bohn, Frederic de Vaulx, Mike Hogan, Michaela Iorga, Jian Mao, John Messina, Kevin Mills, Eric Simmon, Annie Sokol, Jin Tong, Fred Whiteside, and Dawn Leaf, “US Government Cloud Computing Technology Roadmap Volume I: High-Priority Requirements to Further USG Agency Cloud Computing Adoption; and Volume II: Useful Information for Cloud Adopters,” National Institute of Standards and Technology, October 21, 2014, http://dx.doi.org/10.6028/NIST.SP.500-293.

[22]

Lee Badger, Tim Grance, Robert Patt-Corner, and Jeff Voas, “Cloud Computing Synopsis and Recommendations,” National Institute of Standards and Technology, May 2012, http://csrc.nist.gov/publications/nistpubs/800-146/sp800-146.pdf.

 

Taxonomies

Source: http://bigdatawg.nist.gov/_uploadfil...613775223.docx (Word)

Cover Page

NIST Special Publication 1500-2

DRAFT NIST Big Data Interoperability Framework:

Volume 2, Big Data Taxonomies

NIST Big Data Public Working Group

Definitions and Taxonomies Subgroup

Draft Version 1

April 6, 2015

http://dx.doi.org/10.6028/NIST.SP.1500-2

Inside Cover Page

NIST Special Publication 1500-2

Information Technology Laboratory

DRAFT NIST Big Data Interoperability Framework:

Volume 2, Big Data Taxonomies

Draft Version 1

NIST Big Data Public Working Group (NBD-PWG)

Definitions and Taxonomies Subgroup

National Institute of Standards and Technology

Gaithersburg, MD 20899

April 2015

U. S. Department of Commerce

Penny Pritzker, Secretary

National Institute of Standards and Technology

Dr. Willie E. May, Under Secretary of Commerce for Standards and Technology and Director

National Institute of Standards and Technology Special Publication 1500-2

32 pages (April 6, 2015)

Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

There may be references in this publication to other publications currently under development by NIST in accordance with its assigned statutory responsibilities. The information in this publication, including concepts and methodologies, may be used by Federal agencies even before the completion of such companion publications. Thus, until each publication is completed, current requirements, guidelines, and procedures, where they exist, remain operative. For planning and transition purposes, Federal agencies may wish to closely follow the development of these new publications by NIST.

Organizations are encouraged to review all draft publications during public comment periods and provide feedback to NIST. All NIST Information Technology Laboratory publications, other than the ones noted above, are available at http://www.nist.gov/publication-portal.cfm.

Public comment period: April 6, 2015 through May 21, 2015

Comments on this publication may be submitted to Wo Chang

National Institute of Standards and Technology

Attn: Wo Chang, Information Technology Laboratory

100 Bureau Drive (Mail Stop 8900) Gaithersburg, MD 20899-8930

Email: SP1500comments@nist.gov

 

Reports on Computer Systems Technology

The Information Technology Laboratory (ITL) at NIST promotes the U.S. economy and public welfare by providing technical leadership for the Nation’s measurement and standards infrastructure. ITL develops tests, test methods, reference data, proof of concept implementations, and technical analyses to advance the development and productive use of information technology. ITL’s responsibilities include the development of management, administrative, technical, and physical standards and guidelines for the cost-effective security and privacy of other than national security-related information in Federal information systems. This document reports on ITL’s research, guidance, and outreach efforts in Information Technology and its collaborative activities with industry, government, and academic organizations.

Abstract

Big Data is a term used to describe the new deluge of data in our networked, digitized, sensor-laden, information-driven world. While great opportunities exist with Big Data, it can overwhelm traditional technical approaches and its growth is outpacing scientific and technological advances in data analytics. To advance progress in Big Data, the NIST Big Data Public Working Group (NBD-PWG) is working to develop consensus on important, fundamental questions related to Big Data. The results are reported in the NIST Big Data Interoperability Framework series of volumes. This volume, Volume 2, contains the Big Data taxonomies developed by the NBD-PWG. These taxonomies organize the reference architecture components, fabrics, and other topics to lay the groundwork for discussions surrounding Big Data.

Keywords

Big Data, Data Science, Reference Architecture, System Orchestrator, Data Provider, Big Data Application Provider, Big Data Framework Provider, Data Consumer, Security and Privacy Fabric, Management Fabric, Big Data taxonomy, use cases, Big Data characteristics

Acknowledgements

This document reflects the contributions and discussions by the membership of the NBD-PWG, co-chaired by Wo Chang of the NIST ITL, Robert Marcus of ET-Strategies, and Chaitanya Baru, University of California San Diego Supercomputer Center.

The document contains input from members of the NBD-PWG: Definitions and Taxonomies Subgroup led by Nancy Grady (SAIC), Natasha Balac (SDSC), and Eugene Luster (R2AD); Security and Privacy Subgroup, led by Arnab Roy (Fujitsu) and Akhil Manchanda (GE); and Reference Architecture Subgroup, led by Orit Levin (Microsoft), Don Krapohl (Augmented Intelligence), and James Ketner (AT&T).

NIST SP1500-2, Version 1 has been collaboratively authored by the NBD-PWG. As of the date of this publication, there are over six hundred NBD-PWG participants from industry, academia, and government. Federal agency participants include the National Archives and Records Administration (NARA), National Aeronautics and Space Administration (NASA), National Science Foundation (NSF), and the U.S. Departments of Agriculture, Commerce, Defense, Energy, Health and Human Services, Homeland Security, Transportation, Treasury, and Veterans Affairs.

NIST would like to acknowledge the specific contributions to this volume by the following NBD-PWG members:

Natasha Balac

University of California, San Diego, Supercomputer Center

Chaitan Baru

University of California, San Diego, Supercomputer Center

Deborah Blackstock

MITRE Corporation

Pw Carey

Compliance Partners, LLC

Wo Chang

NIST

Yuri Demchenko

University of Amsterdam

Nancy Grady

SAIC

Karen Guertler

Consultant

Christine Hawkinson

U.S. Bureau of Land Management

Pavithra Kenjige

PK Technologies

Orit Levin

Microsoft

Eugene Luster

U.S. Defense Information Systems Agency/R2AD LLC

Bill Mandrick

Data Tactics

Robert Marcus

ET-Strategies

Gary Mazzaferro

AlloyCloud, Inc.

William Miller

MaCT USA

Sanjay Mishra

Verizon

Rod Peterson

U.S. Department of Veterans Affairs

John Rogers

HP

William Vorhies

Predictive Modeling LLC

Mark Underwood

Krypton Brothers LLC

Alicia Zuniga-Alvarado

Consultant

The editors for this document were Nancy Grady and Wo Chang.

Notice to Readers

NIST is seeking feedback on the proposed working draft of the NIST Big Data Interoperability Framework: Volume 2, Big Data Taxonomies. Once public comments are received, compiled, and addressed by the NBD-PWG, and reviewed and approved by the NIST internal editorial board, Version 1 of this volume will be published as final. Three versions are planned for this volume, with Versions 2 and 3 building on the first. Further explanation of the three planned versions and the information contained therein is included in Section 1.5 of this document.

Please be as specific as possible in any comments or edits to the text. Specific edits include, but are not limited to, changes in the current text, additional text further explaining a topic or explaining a new topic, additional references, or comments about the text, topics, or document organization. These specific edits can be recorded using one of the two following methods.

  1. TRACK CHANGES: make edits to and comments on the text directly into this Word document using track changes
  2. COMMENT TEMPLATE: capture specific edits using the Comment Template (http://bigdatawg.nist.gov/_uploadfiles/SP1500-1-to-7_comment_template.docx), which includes space for Section number, page number, comment, and text edits

Submit the edited file from either method 1 or 2 to SP1500comments@nist.gov with the volume number in the subject line (e.g., Edits for Volume 2.)

Please contact Wo Chang (wchang@nist.gov) with any questions about the feedback submission process.

Big Data professionals continue to be welcome to join the NBD-PWG to help craft the work contained in the volumes of the NIST Big Data Interoperability Framework. Additional information about the NBD-PWG can be found at http://bigdatawg.nist.gov.

Table of Contents

Executive Summary

1 Introduction
1.1 Background
1.2 Scope and Objectives of the Definitions and Taxonomies Subgroup
1.3 Report Production
1.4 Report Structure
1.5 Future Work on this Volume

2 Reference Architecture Taxonomy
2.1 Actors and Roles
2.2 System Orchestrator
2.3 Data Provider
2.4 Big Data Application Provider
2.5 Big Data Framework Provider
2.6 Data Consumer
2.7 Management Fabric
2.8 Security and Privacy Fabric

3 Data Characteristic Hierarchy
3.1 Data Elements
3.2 Records
3.3 Datasets
3.4 Multiple Datasets

4 Summary

Appendix A: Acronyms
Appendix B: References

Figures

Figure 1: NIST Big Data Reference Architecture
Figure 2: Roles and a Sampling of Actors in the NBDRA Taxonomy
Figure 3: System Orchestrator Actors and Activities
Figure 4: Data Provider Actors and Activities
Figure 5: Big Data Application Provider Actors and Activities
Figure 6: Big Data Framework Provider Actors and Activities
Figure 7: Data Consumer Actors and Activities
Figure 8: Big Data Management Actors and Activities
Figure 9: Big Data Security and Privacy Actors and Activities
Figure 10: Data Characteristic Hierarchy

Executive Summary

This NIST Big Data Interoperability Framework: Volume 2, Taxonomies was prepared by the NIST Big Data Public Working Group (NBD-PWG) Definitions and Taxonomy Subgroup to facilitate communication and improve understanding across Big Data stakeholders by describing the functional components of the NIST Big Data Reference Architecture (NBDRA). The top-level roles of the taxonomy are System Orchestrator, Data Provider, Big Data Application Provider, Big Data Framework Provider, Data Consumer, Security and Privacy, and Management. The actors and activities for each of the top-level roles are outlined in this document as well. The NBDRA taxonomy aims to describe new issues in Big Data systems but is not an exhaustive list. In some cases, exploration of new Big Data topics includes current practices and technologies to provide needed context.

The NIST Big Data Interoperability Framework consists of seven volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The seven volumes are as follows:

·         Volume 1, Definitions

·         Volume 2, Taxonomies

·         Volume 3, Use Cases and General Requirements

·         Volume 4, Security and Privacy

·         Volume 5, Architectures White Paper Survey

·         Volume 6, Reference Architecture

·         Volume 7, Standards Roadmap

The NIST Big Data Interoperability Framework will be released in three versions, which correspond to the three stages of the NBD-PWG work. The three stages aim to achieve the following:

Stage 1: Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic

Stage 2: Define general interfaces between the NBDRA components

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces

Potential areas of future work for the Subgroup during stage 2 are highlighted in Section 1.5 of this volume. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data.

1 Introduction

1.1 Background

There is broad agreement among commercial, academic, and government leaders about the remarkable potential of Big Data to spark innovation, fuel commerce, and drive progress. Big Data is the common term used to describe the deluge of data in today’s networked, digitized, sensor-laden, and information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach, including the following:

  • How can a potential pandemic reliably be detected early enough to intervene?
  • Can new materials with advanced properties be predicted before these materials have ever been synthesized?
  • How can the current advantage of the attacker over the defender in guarding against cyber-security threats be reversed?

There is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The growth rates for data volumes, speeds, and complexity are outpacing scientific and technological advances in data analytics, management, transport, and data user spheres.

Despite widespread agreement on the inherent opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions continues to confuse potential users and stymie progress. These questions include the following:

  • What attributes define Big Data solutions?
  • How is Big Data different from traditional data environments and related applications?
  • What are the essential characteristics of Big Data environments?
  • How do these environments integrate with currently deployed architectures?
  • What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

Within this context, on March 29, 2012, the White House announced the Big Data Research and Development Initiative.[1] The initiative’s goals include helping to accelerate the pace of discovery in science and engineering, strengthening national security, and transforming teaching and learning by improving the ability to extract knowledge and insights from large and complex collections of digital data.

Six federal departments and their agencies announced more than $200 million in commitments spread across more than 80 projects, which aim to significantly improve the tools and techniques needed to access, organize, and draw conclusions from huge volumes of digital data. The initiative also challenged industry, research universities, and nonprofits to join with the federal government to make the most of the opportunities created by Big Data.

Motivated by the White House initiative and public suggestions, the National Institute of Standards and Technology (NIST) has accepted the challenge to stimulate collaboration among industry professionals to further the secure and effective adoption of Big Data. As one result of NIST’s Cloud and Big Data Forum held on January 15–17, 2013, there was strong encouragement for NIST to create a public working group for the development of a Big Data Interoperability Framework. Forum participants noted that this framework should define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytics, and technology infrastructure. In doing so, the framework would accelerate the adoption of the most secure and effective Big Data techniques and technology.

On June 19, 2013, the NIST Big Data Public Working Group (NBD-PWG) was launched with extensive participation by industry, academia, and government from across the nation. The scope of the NBD-PWG involves forming a community of interests from all sectors—including industry, academia, and government—with the goal of developing consensus on definitions, taxonomies, secure reference architectures, security and privacy requirements, and, from these, a standards roadmap. Such a consensus would create a vendor-neutral, technology- and infrastructure-independent framework that would enable Big Data stakeholders to identify and use the best analytics tools for their processing and visualization requirements on the most suitable computing platform and cluster, while also allowing value-added from Big Data service providers.

The NIST Big Data Interoperability Framework consists of seven volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The seven volumes are as follows:

·         Volume 1, Definitions

·         Volume 2, Taxonomies

·         Volume 3, Use Cases and General Requirements

·         Volume 4, Security and Privacy

·         Volume 5, Architectures White Paper Survey

·         Volume 6, Reference Architecture

·         Volume 7, Standards Roadmap

The NIST Big Data Interoperability Framework will be released in three versions, which correspond to the three stages of the NBD-PWG work. The three stages aim to achieve the following:

Stage 1: Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic

Stage 2: Define general interfaces between the NIST Big Data Reference Architecture (NBDRA) components

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces

Potential areas of future work for the Subgroup during stage 2 are highlighted in Section 1.5 of this volume. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data.

1.2 Scope and Objectives of the Definitions and Taxonomies Subgroup

The NBD-PWG Definitions and Taxonomy Subgroup focused on identifying Big Data concepts, defining terms needed to describe this new paradigm, and defining reference architecture terms. This taxonomy provides a hierarchy of the components of the reference architecture. It is designed to meet the needs of specific user groups, as follows:

For managers, the terms provide a categorization of the techniques needed to understand this changing field.

For procurement officers, it will provide the framework for discussing organizational needs and distinguishing among offered approaches.

For marketers, it will provide the means to promote Big Data solutions and innovations.

For the technical community, it will provide a common language to better differentiate Big Data’s specific offerings.

1.3 Report Production

This document derives from discussions in the NBD-PWG Definitions and Taxonomy Subgroup and with interested parties. This volume provides the taxonomy of the components of the NBDRA. This taxonomy was developed using a mind map representation, which provided a mechanism for multiple inputs and easy editing.

It is difficult to describe the new components of Big Data systems without fully describing the context in which they reside. The Subgroup attempted to describe only what has changed in the shift to the new Big Data paradigm, and only the components needed to clarify this shift. For example, there is no attempt to create a taxonomy of analytics techniques as these pre-date Big Data. This taxonomy will be a work in progress to mature as new technologies are developed and the patterns within data and system architectures are better understood.

In addition to the reference architecture taxonomy, the Subgroup began the development of a data hierarchy, which is presented in Section 3 of this document.

1.4 Report Structure

This document provides multiple hierarchical presentations related to Big Data.

The first presentation is the taxonomy for the NBDRA. This taxonomy provides the terminology and definitions for the components of technical systems that implement technologies for Big Data. Section 2 introduces the NBDRA using concepts of actors and roles and the activities each performs. In the NBDRA presented in NIST Big Data Interoperability Framework Volume 6: Reference Architecture, there are two roles that span the activities within the other roles: Management, and Security and Privacy. These two topic areas will be addressed further in future versions of this document. The NBDRA components are more fully described in the NIST Big Data Interoperability Framework: Volume 6, Reference Architecture and the NIST Big Data Interoperability Framework: Volume 4, Security and Privacy documents. Comparing the related sections in these two documents will give the reader a more complete picture of the consensus of the working groups.

The second presentation is a hierarchical description about the data itself. For clarity, a strict taxonomy is not followed; rather, data is examined at different groupings to better describe what is new with Big Data. The grouping-based description presents data elements, data records, datasets, and multiple datasets.  This examination at different groupings provides a way to easily identify the data characteristics that have driven the development of Big Data engineering technologies, as described in the NIST Big Data Interoperability Framework: Volume 1, Definitions.

Within the following sections, illustrative examples are given to facilitate understanding of the role/actor and activity of the NBDRA. There is no expectation of completeness in the components; the intent is to provide enough context to understand the specific areas that have changed because of the new Big Data paradigm. Likewise, the data hierarchy only expresses the broad overview of data at different levels of granularity to highlight the properties that drive the need for Big Data architectures.

For descriptions of the future of Big Data and opportunities to use Big Data technologies, the reader is referred to the NIST Big Data Interoperability Framework: Volume 7, Standards Roadmap. Finally, to understand how these systems are architected to meet users’ needs, the reader is referred to NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements.

1.5 Future Work on this Volume

As mentioned in the previous section, the Subgroup is continuing to explore the changes in both Management and in Security and Privacy. As changes in the activities within these roles are clarified, the taxonomy will be developed further. In addition, a fuller understanding of Big Data and its technologies should consider the interactions between the characteristics of the data and the desired methods in both technique and time window for performance. These characteristics drive the application and the choice of tools to meet system requirements. Investigation of the interfaces between data characteristics and technologies is a continuing task for the NBD-PWG Definitions and Taxonomy Subgroup and the NBD-PWG Reference Architecture Subgroup. Finally, societal impact issues have not yet been fully explored. There are a number of overarching issues in the implications of Big Data, such as data ownership and data governance, which need more examination. Big Data is a rapidly evolving field, and the initial discussion presented in this volume must be considered a work in progress.

2 Reference Architecture Taxonomy

This section focuses on a taxonomy for the NBDRA, and is intended to describe the hierarchy of actors and roles and the activities the actors perform in those roles. There are a number of models for describing the technologies needed for an application, such as a layered model of network, hardware, operating system, and application. For elucidating the taxonomy, a hierarchy has been chosen to allow placing the new technologies within the context of previous technologies. As this taxonomy is not definitive, it is expected that the taxonomy will mature as new technologies emerge and increase understanding of how to best categorize the different methods for building data systems.

2.1 Actors and Roles

In system development, actors and roles have the same relationship as in the movies. The roles are the parts the actors play in the overall system. One actor can perform multiple roles. Likewise, a role can be played by multiple actors, in the sense that a team of independent entities, perhaps from independent organizations, may be used to satisfy end-to-end system requirements. System development actors can represent individuals, organizations, software, or hardware. Each activity in the taxonomy can be executed by a different actor. Examples of actors include the following:

  • Sensors
  • Applications
  • Software agents
  • Individuals
  • Organizations
  • Hardware resources
  • Service abstractions

In the past, data systems tended to be hosted, developed, and deployed with the resources of only one organization. Currently, roles may be distributed, analogous to the diversity of actors within a given cloud-based application. Actors in Big Data systems can likewise come from multiple organizations.

Developing the reference architecture taxonomy began with a review of the NBD-PWG analyses of the use cases and reference architecture survey provided in NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements and NIST Big Data Interoperability Framework: Volume 5, Reference Architecture Survey, respectively. From these analyses, several commonalities between Big Data architectures were identified and formulated into five general architecture components, and two fabrics interwoven in the five components, as shown in Figure 1.

Figure 1: NIST Big Data Reference Architecture

Volume2Figure1.png

These seven items (five main architecture components and two fabrics interwoven in them) form the foundation of the reference architecture taxonomy.

The five main components, which represent the central roles, are summarized below and discussed in this section (Section 2).

·         System Orchestrator: Defines and integrates the required data application activities into an operational vertical system

·         Data Provider: Introduces new data or information feeds into the Big Data system

·         Big Data Application Provider: Executes a lifecycle to meet security and privacy requirements as well as System Orchestrator-defined requirements

·         Big Data Framework Provider: Establishes a computing framework in which to execute certain transformation applications while protecting the privacy and integrity of data

·         Data Consumer: Includes end users or other systems who use the results of the Big Data Application Provider

The two fabrics, which are discussed in Sections 2.7 and 2.8, are as follows:

·         Security and Privacy Fabric

·         Management Fabric

Figure 2 outlines potential actors for the seven items listed above. The five central roles are explained in greater detail in the following subsections.

Figure 2: Roles and a Sampling of Actors in the NBDRA Taxonomy

Volume2Figure2.png

2.2 System Orchestrator

The System Orchestrator provides the overarching requirements that the system must fulfill, including policy, governance, architecture, resources, and business requirements, as well as monitoring or auditing activities to ensure the system complies with those requirements.

The System Orchestrator role includes defining and integrating the required data application activities into an operational vertical system. The System Orchestrator role provides system requirements, high-level design, and monitoring for the data system. While the role pre-dates Big Data systems, some related design activities have changed within the Big Data paradigm.

Figure 3 lists the actors and activities associated with the System Orchestrator, which are further described below.

Figure 3: System Orchestrator Actors and Activities

Volume2Figure3.png

A.   Business Ownership Requirements and Monitoring

As the business owner of the system, the System Orchestrator oversees the business context within which the system operates, including specifying the following:

·         Business goals

·         Targeted business action

·         Data Provider contracts and service-level agreements (SLAs)

·         Data Consumer contracts and SLAs

·         Negotiation with capabilities provider

·         Make/buy cost analysis

A number of new business models have been created for Big Data systems, including Data as a Service (DaaS), where a business provides the Big Data Application Provider role as a service to other actors. In this case, the business model is to process data received from a Data Provider and provide the transformed data to the contracted Data Consumer.

B.   Governance Requirements and Monitoring

The System Orchestrator establishes all policies and regulations to be followed throughout the data lifecycle, including the following:

·         Policy compliance requirements and monitoring

·         Change management process definition and requirements

·         Data stewardship and ownership

Big Data systems potentially interact with processes and data being provided by other organizations, requiring more detailed governance and monitoring between the components of the overall system.

C.   Data Science Requirements and Monitoring

The System Orchestrator establishes detailed requirements for functional performance of the analytics for the end-to-end system, translating the business goal into data and analytics design, including:

·         Data source selection (e.g., identifying descriptions, location, file types, and provenance)

·         Data collection and storage requirements and monitoring

·         Data preparation requirements and monitoring

·         Data analysis requirements and monitoring

·         Analytical model choice (e.g., search, aggregation, correlation and statistics, and causal modeling)

·         Data visualization requirements and monitoring

·         Application type specification (e.g., streaming, real-time, and batch)

A number of the design activities have changed in the new paradigm. In particular, a greater choice of data models now exists beyond the relational model. Choosing a non-relational model will depend on the data type. Choosing the data fields that are used to decide how to distribute the data across multiple nodes will depend on the organization’s data analysis needs, and on the ability to use those fields to distribute the data evenly across resources.
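
To make the distribution consideration concrete, the following is a minimal Python sketch of hash-based partitioning of records across a small number of nodes. The field names, record contents, and node count are illustrative assumptions, not part of the NBDRA; real systems delegate this choice to the data platform's sharding configuration. It shows why a high-cardinality field spreads records evenly while a low-cardinality field concentrates them on a few nodes.

```python
# Minimal sketch (not part of the NBDRA): hash-partitioning records by a chosen field.
# Field names, values, and the node count are hypothetical.
import hashlib

NUM_NODES = 4

def node_for(record, shard_key):
    """Assign a record to a node by hashing the chosen distribution field."""
    digest = hashlib.md5(str(record[shard_key]).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

records = [
    {"customer_id": i, "region": "west" if i % 10 == 0 else "east", "amount": i * 1.5}
    for i in range(1, 1001)
]

# A high-cardinality key (customer_id) spreads records roughly evenly across nodes;
# a low-cardinality key (region) leaves most nodes empty and a few overloaded.
for key in ("customer_id", "region"):
    counts = [0] * NUM_NODES
    for r in records:
        counts[node_for(r, key)] += 1
    print(key, counts)
```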

D.   System Architecture Requirements and Monitoring

The System Orchestrator establishes detailed architectural requirements for the data system, including the following:

·         Data process requirements

·         Software requirements

·         Hardware requirements

·         Logical data modeling and partitioning

·         Data export requirements

·         Scaling requirements

The system architecture has changed in the Big Data paradigm due to the potential interplay between the different actors. The coordination between the five functional NBDRA components is more complex, with additional communications and interconnectivity requirements among the independently operated component activities. Maintaining the needed performance can lead to a very different architecture from that used prior to the new distribution of data across system nodes.

2.3 Data Provider

A Data Provider makes data available to itself or to others. The actor fulfilling this role can be part of the Big Data system, from another system, or internal or external to the organization orchestrating the system. Once the data is within the local system, requests to retrieve the needed data will be made by the Big Data Application Provider and routed to the Big Data Framework Provider. Data Provider actors include those shown in Figure 4.

Figure 4: Data Provider Actors and Activities

Volume2Figure4.png

While the concept of a Data Provider is not new, the greater data collection and analytics capabilities have opened up new possibilities for providing valuable data. The U.S. government’s Open Data Initiative advocates that Federal agencies that are stewards of public data also serve the role of Data Provider.

The nine possible Data Provider activities outlined in Figure 4 are discussed further below.

A.   Data Capture from Sources

The Data Provider captures data from its own sources or others. This activity could be described as the capture from a data producer, whether it is a sensor or an organizational process. Aspects of the data sources activity include both online and offline sources. Among possible online sources are the following:

·         Web browsers

·         Sensors

·         Deep packet inspection devices (e.g., bridge, router, border controller)

·         Mobile devices

Offline sources can include the following:

·         Public records

·         Internal records

While perhaps not theoretically different from what has been in use before, data capture from sources is an area that is exploding in the new Big Data paradigm. New forms of sensors now provide not only a greater number of data sources, but also data in large quantities. Smartphones and personal wearable devices (e.g., exercise monitors, household electric meters) can all be used as sensors. In addition, technologies such as radio frequency identification (RFID) chips are sources of data for the location of shipped items. Collectively, all the data-producing sensors are known as the Internet of Things (IoT). The subset of personal information devices is often referred to as “wearable tech”, with the resulting data sometimes referred to as “digital exhaust”.

B.   Data Persistence

The Data Provider stores the data in a repository from which the data can be extracted and made available to others. The stored data is subject to a data retention policy. The data can be stored (i.e., persisted) in the following ways:

·         Internal hosting

·         External hosting

·         Cloud hosting (a different hosting model whether internal or external)

Hosting models have expanded through the use of cloud computing. In addition, the data persistence is often accessed through mechanisms such as web services that hide the specifics of the underlying storage. DaaS is a term used for this kind of data persistence that is accessed through specific interfaces.

C.   Data Scrubbing

Some datasets contain sensitive data elements that are naturally collected as part of the data production process. Whether for regulatory compliance or sensitivity, such data elements may be altered or removed. As one example of data scrubbing for Personally Identifiable Information (PII), the Data Provider can:

·         Remove PII

·         Perform data randomization

The latter obscures the PII to remove the possibility of directly tracing the data back to an individual, while maintaining the value distributions within the data. In the era of Big Data, data scrubbing requires greater diligence. While individual sources may not contain PII, when combined with other data sources, the risk arises that individuals may be identified from the integrated data.
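
As an illustration only, the following Python sketch shows two common scrubbing approaches: outright removal of PII fields, and salted-hash pseudonymization standing in for the randomization step described above. The record fields and salt are hypothetical, and a production system would rely on vetted de-identification tooling rather than this sketch.

```python
# Minimal sketch of PII scrubbing; field names and the salt are hypothetical.
import hashlib

SALT = "replace-with-a-secret-salt"
PII_FIELDS = ("name", "ssn", "email")

def remove_pii(record):
    """Drop PII fields outright."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

def pseudonymize(record):
    """Replace PII values with one-way tokens so records stay linkable
    within the dataset without directly identifying an individual."""
    out = dict(record)
    for k in PII_FIELDS:
        if k in out:
            out[k] = hashlib.sha256((SALT + str(out[k])).encode()).hexdigest()[:12]
    return out

record = {"name": "Jane Doe", "ssn": "123-45-6789", "zip": "20899", "purchase": 42.50}
print(remove_pii(record))
print(pseudonymize(record))
```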

D.   Data Annotation and Metadata Creation

The Data Provider maintains information about the data and its processing, called metadata, in its repository, and also maintains the data itself. The metadata, or data annotation, provides information about the origins and history of the data, in sufficient detail to enable proper use and interpretation of the data. The following approaches can be used to encode the metadata:

·         In an ontology: a semantic description of the elements of the data

·         Within a data file: in any number of formats

With the push for open data where data is repurposed to draw out additional value beyond the initial reason for which it was generated, it has become even more critical that information about the data be encoded to clarify the data’s origins and processing. While the actors that collected the data will have a clear understanding of the data history, repurposing data for other uses is open to misinterpretations when other actors use the data at a later time.
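
As one hedged illustration of the "within a data file" option, the sketch below writes a self-describing JSON file in which provenance metadata travels with the records. The field names, values, and file name are invented for the example; real deployments would follow an established metadata standard rather than this ad hoc layout.

```python
# Minimal sketch: provenance metadata embedded alongside the data in one JSON file.
# All names and values are illustrative.
import json

dataset = {
    "metadata": {
        "title": "Hourly temperature readings (example)",
        "source": "example sensor network",
        "collected": "2015-03-01/2015-03-31",
        "processing_history": ["raw capture", "unit conversion to Celsius"],
        "contact": "data-steward@example.org",
    },
    "records": [
        {"station": "A1", "time": "2015-03-01T00:00:00Z", "temp_c": 4.2},
        {"station": "A1", "time": "2015-03-01T01:00:00Z", "temp_c": 3.9},
    ],
}

with open("readings.json", "w") as f:
    json.dump(dataset, f, indent=2)
```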

E.    Access Rights Management

The Data Provider determines the different mechanisms that will be used to define the rights of access, which can be specified individually or by groupings such as the following:

·         Data sources: the collection of datasets from a specific source

·         Data producer: the collection of datasets from a given producer

·         PII access rights: as an example of restrictions on data elements

F.    Access Policy Contracts

The Data Provider defines policy for others’ use of the accessed data, as well as what data will be made available. These contracts specify:

·         Policies for primary and secondary rights

·         Agreements

To expand this description, the contracts specify acceptable use policies and any specific restrictions on the use of the data, as well as ownership of the original data and any derivative works from the data.

G.   Data Distribution Application Programming Interfaces

Technical protocols are defined for different types of data access from data distribution application programming interfaces (APIs), which can include:

·         File Transfer Protocol (FTP) or streaming

·         Compression techniques (e.g., single compressed file, split compressed file)

·         Authentication methods

·         Authorization

H.   Capabilities Hosting

In addition to offering data downloads, the Data Provider offers several capabilities to access the data, including the following:

·         Providing query access without transferring the data

·         Allowing analytic tools to be sent to operate on the data sets

For large volumes of data, it may become impractical to move the data to another location for processing. This is often described as moving the processing to the data, rather than the data to the processing.

I.      Data Availability Publication

The Data Provider makes available the information needed to know what data or data services they offer. Such publication can consist of the following:

·         Web description

·         Services catalog

·         Data dictionaries

·         Advertising

A number of third-party locations also currently publish a list of links to available datasets (e.g., the U.S. Government’s Open Data Initiative [2]).

2.4 Big Data Application Provider

The Big Data Application Provider executes the manipulations of the data lifecycle to meet requirements established by the System Orchestrator, as well as meeting security and privacy requirements. This is where the general capabilities within the Big Data framework are combined to produce the specific data system. Figure 5 lists the actors and activities associated with the Big Data Application Provider.

Figure 5: Big Data Application Provider Actors and Activities

Volume2Figure5.png

While the activities of an application provider are the same whether the solution being built concerns Big Data or not, the methods and techniques have changed for Big Data because the data and data processing are parallelized across resources.

A.   Collection

The Big Data Application Provider must establish the mechanisms to capture data from the Data Provider. These mechanisms include the following:

  • Transport protocol and security
  • Data format
  • Metadata

While the foregoing transport mechanisms predate Big Data, the resources needed to handle large volumes or velocities do change the way these processes are provisioned.

B.   Preparation

A number of processes, analogous to current processes for data systems, are used in the data preparation activity, whether they occur before or after the storage of raw data. Preparation processes include the following:

  • Data validation (e.g., checksums/hashes, format checks)
  • Data cleansing (e.g., eliminating bad records/fields, deduplication)
  • Outlier removal
  • Data conversion (e.g., standardization, reformatting, and encapsulating)
  • Calculated field creation and indexing
  • Data aggregation and summarization
  • Data partition implementation
  • Data storage preparation
  • Data virtualization layer

Just as data collection may require a number of resources to handle the load, data preparation may also require new resources or new techniques. For large data volumes, data collection is often followed by storage of the data in its raw form. Data preparation processes then occur after the storage and are handled by the application code. This technique of storing raw data first and applying a schema upon interaction with the data is commonly called “schema on read”, and is a new area of emphasis in Big Data due to the size of the datasets. When storing a new cleansed copy of the data is prohibitive, the data is stored in its raw form and only prepared for a specific purpose when requested.
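
The following Python sketch illustrates the schema-on-read idea under stated assumptions: raw records (here, JSON lines with invented fields) are stored exactly as collected, and a purpose-specific schema is applied only when the data is read.

```python
# Minimal sketch of "schema on read"; file name and fields are hypothetical.
import json
from datetime import datetime

RAW_FILE = "raw_events.jsonl"

# Ingest: store whatever arrives, with no validation or restructuring.
with open(RAW_FILE, "w") as f:
    for line in (
        '{"ts": "2015-04-06T12:00:00", "user": "u1", "amount": "19.99"}',
        '{"ts": "2015-04-06T12:05:00", "user": "u2"}',
        '{"ts": "bad-timestamp", "user": "u3", "amount": "5"}',
    ):
        f.write(line + "\n")

def read_with_schema(path):
    """Apply types and defaults at read time; records that do not fit this
    particular schema are skipped but remain available in the raw store."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            try:
                yield {
                    "ts": datetime.fromisoformat(raw["ts"]),
                    "user": raw["user"],
                    "amount": float(raw.get("amount", 0.0)),
                }
            except (KeyError, ValueError):
                continue

print(list(read_with_schema(RAW_FILE)))
```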

Data summarization is a second area of added emphasis due to Big Data. With very large datasets, it is difficult to render all the data for visualization. Proper sampling would need some a priori understanding of the distribution of the entire dataset. Summarization techniques can characterize local subsets of the data, and then provide these characterizations for visualization as the data is browsed.
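
As a small illustration of summarization for browsing, the sketch below computes compact per-partition statistics (count, minimum, mean, maximum) so that only the summaries, not the raw records, need to be shipped to a visualization layer. The partitioning by day and the simulated data are assumptions made for the example.

```python
# Minimal sketch: per-partition summaries stand in for the raw records when browsing.
from collections import defaultdict
from statistics import mean
import random

random.seed(0)
# Stand-in for a large dataset: (day, measurement) pairs.
data = [(day, random.gauss(20 + day, 5)) for day in range(30) for _ in range(5000)]

by_day = defaultdict(list)
for day, value in data:
    by_day[day].append(value)

# Only these compact characterizations are sent to the visualization layer.
summaries = {
    day: {"count": len(v), "min": min(v), "mean": mean(v), "max": max(v)}
    for day, v in by_day.items()
}
print(summaries[0])
```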

C.   Analytics

The term data science is used in many ways. While it can refer to the end-to-end data lifecycle, the most common usage focuses on the steps of discovery (i.e., rapid hypothesis-test cycle) for finding value in big volume datasets. This rapid analytics cycle (also described as agile analytics) starts with quick correlation or trending analysis, with greater effort spent on hypotheses that appear most promising.

The analytics processes for structured and unstructured data have been maturing for many years. There is now more emphasis on the analytics of unstructured data because of the greater quantities now available. The knowledge that valuable information resides in unstructured data promotes a greater attention to the analysis of this type of data.

While analytic methods have not changed with Big Data, their implementation has changed to accommodate parallel data distribution across a cluster of independent nodes and the associated data access methods. For example, the overall data analytic task may be broken into subtasks that are assigned to the independent data nodes, as sketched below. The results from each subtask are collected and compiled to achieve the final full dataset analysis. Furthermore, data previously resided mostly in simple tables or relational databases. With the introduction of new storage paradigms, analytics techniques should be modified for different types of data access.
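
The sketch below illustrates that pattern under simplifying assumptions: the partitions held by four data nodes are simulated in one process pool, each subtask returns a partial sum and count, and the partials are compiled into a global mean. A real cluster framework distributes the subtasks and data, but the shape of the computation is the same.

```python
# Minimal sketch of subtasks per data partition combined into one result.
# Partitions are simulated in-process; a real system distributes them across nodes.
from multiprocessing import Pool

def partial_stats(partition):
    """Subtask executed against one node's partition: local sum and count."""
    return sum(partition), len(partition)

if __name__ == "__main__":
    # Four simulated data-node partitions.
    partitions = [list(range(i, 1_000_000, 4)) for i in range(4)]

    with Pool(processes=4) as pool:
        partials = pool.map(partial_stats, partitions)

    # Compile the partial results into the full-dataset answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print("global mean:", total / count)
```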

Some considerations for analytical processes used for Big Data or small data are the following:

·         Metadata matching processes

·         Analysis complexity considerations (e.g., computational, machine learning, data extent, data location)

·         Analytics latency considerations (e.g., real-time or streaming, near real-time or interactive, batch or offline)

·         Human-in-the-loop analytics lifecycle (e.g., discovery, hypothesis, hypothesis testing)

While these considerations are not new to Big Data, implementing them can be tightly coupled with the specifics of the data storage and the preparation step. In addition, some of the preparation tasks are done during the analytics phase (the schema-on-read discussed above).

D.   Visualization

While visualization (or the human in the loop) is often placed under analytics, the added emphasis due to Big Data warrants separate consideration of visualization. The following are three general categories of data visualization:

·         Exploratory data visualization for data understanding (e.g., browsing, outlier detection, boundary conditions)

·         Explicatory visualization for analytical results (e.g., confirmation, near real-time presentation of analytics, interpreting analytic results)

·         Explanatory visualization to “tell the story” (e.g., reports, business intelligence, summarization)

Data science relies on exploratory visualization of the full dataset, through which the data scientist forms a hypothesis. While clearly predating Big Data, a greater emphasis now exists on exploratory visualization: it is immensely helpful in understanding large volumes of repurposed data, and the size of the datasets requires new techniques.

Explanatory visualization is the creation of a simplified, digestible visual representation of the results, suitable for assisting a decision or communicating the knowledge gained. Again, while this technique has long been in use, there is now greater emphasis to “tell the story”. Often this is done through simple visuals or “infographics”. Given the large volumes and varieties of data, and the data’s potentially complex relationships, the communication of the analytics to a non-analyst audience requires careful visual representation to communicate the results in a way that can be easily consumed.

E.    Access

The Big Data Application Provider gives the Data Consumer access to the results of the data system, including the following:

·         Data export API processes (e.g., protocol, query language)

·         Data charging mechanisms

·         Consumer analytics hosting

·         Analytics as a service hosting

The access activity of the Big Data Application Provider should mirror all actions of the Data Provider, since the Data Consumer may view this system as the Data Provider for their follow-on tasks. Many of the access-related tasks have changed with Big Data, as algorithms have been rewritten to accommodate and take advantage of parallelized resources.

2.5 Big Data Framework Provider

The Big Data Framework Provider has general resources or services to be used by the Big Data Application Provider in the creation of the specific application. There are many new technologies from which the Big Data Application Provider can choose in using these resources and the network to build the specific system. Figure 6 lists the actors and activities associated with the Big Data Framework Provider.

Figure 6: Big Data Framework Provider Actors and Activities

Volume2Figure6.png

The Big Data Framework Provider role has seen the most significant changes with the introduction of Big Data. The Big Data Framework Provider consists of one or more instances of the three subcomponents or activities: infrastructure frameworks, data platform frameworks, and processing frameworks. There is no requirement that all instances at a given level in the hierarchy be of the same technology and, in fact, most Big Data implementations are hybrids combining multiple technology approaches. These provide flexibility and can meet the complete range of requirements that are driven by the Big Data Application Provider. Due to the rapid emergence of new techniques, this is an area that will continue to need discussion. As the Subgroup continues its discussion into patterns within these techniques, different orderings will no doubt be more representative and understandable.

A.   Infrastructure Frameworks

Infrastructure frameworks can be grouped as follows:

·         Networking: These are the resources that transfer data from one resource to another (e.g., physical, virtual, software defined)

·         Computing: These are the physical processors and memory that execute and hold the software of the other Big Data system components (e.g., physical resources, operating system, virtual implementation, logical distribution)

·         Storage: These are resources which provide persistence of the data in a Big Data system (e.g., in-memory, local disk, hardware/software [HW/SW] redundant array of independent disks [RAID], Storage Area Networks [SAN], network-attached storage [NAS])

·         Environmental: These are the physical plant resources (e.g., power, cooling) that must be accounted for when establishing an instance of a Big Data system

The biggest change under the Big Data paradigm is the cooperation of horizontally scaled, independent resources to achieve the desired performance.

B.   Data Platform Frameworks

This is the most recognized area for changes in Big Data engineering, and given rapid changes, the hierarchy in this area will likely change in the future to better represent the patterns within the techniques. The data platform frameworks activity was expanded into the following logical data organization and distribution approaches to provide additional clarity needed for the new approaches of Big Data.

·         Physical storage (e.g., distributed and non-distributed file systems and object stores)

·         File systems (e.g., centralized, distributed)

·         Logical storage

  • Simple tuple (e.g., relational, non-relational or not only SQL [NoSQL] tables both row and column)
  • Complex tuple (e.g., indexed document store, non-indexed key-value or queues)
  • Graph (e.g., property, hyper-graph, triple stores)

The logical storage paradigm has expanded beyond the “flat file” and relational model paradigms to develop new non-relational models. This has implications for the concurrency of the data across nodes within the non-relational model. Transaction support in this context refers to the completion of an entire data update sequence and the maintenance of eventual consistency across data nodes. This is an area that needs more exploration and categorization.
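
To ground the storage categories listed above, the sketch below expresses one invented purchase record as a simple tuple, a key-value pair, a document, and a set of graph triples. The record contents are illustrative only; the point is that the same facts can be organized under any of these logical models.

```python
# Minimal sketch: one invented record in several logical storage forms.

# Simple tuple: a relational-style row with positional fields.
row = ("order-1001", "u42", "widget", 3, "2015-04-06")

# Complex tuple, key-value: an opaque value addressed by a key.
kv_store = {"order-1001": '{"user": "u42", "item": "widget", "qty": 3}'}

# Complex tuple, document: an indexed, self-describing structure.
document = {
    "_id": "order-1001",
    "user": "u42",
    "lines": [{"item": "widget", "qty": 3}],
    "date": "2015-04-06",
}

# Graph: the same facts as subject-predicate-object triples.
triples = [
    ("order-1001", "placed_by", "u42"),
    ("order-1001", "contains", "widget"),
    ("order-1001", "placed_on", "2015-04-06"),
]

print(row, kv_store, document, triples, sep="\n")
```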

C.   Processing Frameworks

Processing frameworks provide the software support for applications which can deal with the volume, velocity, variety, and variability of data. Some aspects related to processing frameworks are the following:

·         Data type processing services (e.g., numeric, textual, spatial, images, video)

·         Schema information or metadata (e.g., on demand, pre-knowledge)

·         Query frameworks (e.g., relational, arrays)

·         Temporal frameworks

  • Batch (e.g., dense linear algebra, sparse linear algebra, spectral, N-body, structured grids, unstructured grids, Map/Reduce, Bulk Synchronous Parallel [BSP])
  • Interactive
  • Real-time/streaming (e.g., event ordering, state management, partitioning)

·         Application frameworks (e.g., automation, test, hosting, workflow)

·         Messaging/communications frameworks

·         Resource management frameworks (e.g., cloud/virtualization, intra-framework, inter-framework)

Both the Big Data Application Provider activities and the Big Data Framework Provider activities have changed significantly due to Big Data engineering. Currently, the interchange between these two roles operates over a set of independent, yet coupled, resources. It is in this interchange that the new methods for data distribution over a cluster have developed. Just as simulations went through a process of parallelization (or horizontal scaling) to harness massive numbers of independent processes and coordinate them into a single analysis, Big Data services now perform the orchestration of data processes over parallel resources.

2.6 Data Consumer

The Data Consumer receives the value output of the Big Data system. In many respects, the Data Consumer receives the same functionality that the Data Provider brings to the Big Data Application Provider. After the system adds value to the original data sources, the Big Data Application Provider then offers that same functionality to the Data Consumer. There is less change in this role due to Big Data, except, of course, in the desire for Consumers to extract extensive datasets from the Big Data Application Provider. Figure 7 lists the actors and activities associated with the Data Consumer.

Figure 7: Data Consumer Actors and Activities

Volume2Figure7.png

The activities listed in Figure 7 are explicit to the Data Consumer role within a data system. If the Data Consumer is in fact a follow-on application, then the Data Consumer would look to the Big Data Application Provider for the activities of any other Data Provider. The follow-on application’s System Orchestrator would negotiate with this application’s System Orchestrator for the types of data wanted, access rights, and other requirements. The Big Data Application Provider would thus serve as the Data Provider, from the perspective of the follow-on application.

A.   Search and Retrieve

The Big Data Application Provider could allow the Data Consumer to search across the data, and query and retrieve data for its own usage.

B.   Download

All the data from the Data Provider could be exported to the Data Consumer for download.

C.   Analyze Locally

The Data Provider could allow the Data Consumer to run their own application on the data.

D.   Reporting

The data can be presented according to the chosen filters, values, and formatting.

E.    Visualization

The Data Consumer could be allowed to browse the raw data, or the data output from the analytics.

2.7 Management Fabric

The Big Data characteristics of volume, velocity, variety, and variability demand a versatile management platform for storing, processing, and managing complex data. Management of Big Data systems should handle both system and data related aspects of the Big Data environment. The Management Fabric of the NBDRA encompasses two general groups of activities: system management and Big Data lifecycle management. System management includes activities such as provisioning, configuration, package management, software management, backup management, capability management, resources management, and performance management. Big Data lifecycle management involves activities surrounding the data lifecycle of collection, preparation/curation, analytics, visualization, and access. More discussion about the Management Fabric is needed, particularly with respect to new issues in the management of Big Data and Big Data engineering. This section will be developed in Version 2 of this document.

Figure 8 lists an initial set of activities associated with the Management role of the NBDRA.

Figure 8: Big Data Management Actors and Activities

Volume2Figure8.png

2.8 Security and Privacy Fabric

Security and privacy issues affect all other components of the NBDRA, as depicted by the encompassing Security and Privacy box in Figure 1. A Security and Privacy Fabric could interact with the System Orchestrator for policy, requirements, and auditing and also with both the Big Data Application Provider and the Big Data Framework Provider for development, deployment, and operation. These ubiquitous security and privacy activities are described in the NIST Big Data Interoperability Framework: Volume 4, Security and Privacy document. Figure 9 lists representative actors and activities associated with the Security and Privacy Fabric of the NBDRA. Security and privacy actors and activities will be further developed in Version 2 of NIST Big Data Interoperability Framework: Volume 4, Security and Privacy document and summarized in this volume.

Figure 9: Big Data Security and Privacy Actors and Activities

Volume2Figure9.png

3 Data Characteristic Hierarchy

Equally important to understanding the new Big Data engineering that has emerged in the last ten years is the need to understand what data characteristics have driven the need for the new technologies. In Section 2 of this document, a taxonomy was presented for the NBDRA, which is described in NIST Big Data Interoperability Framework: Volume 6, Reference Architecture. The NBDRA taxonomy has a hierarchy of roles/actors, and activities. To understand the characteristics of data and how they have changed with the new Big Data paradigm, it is illustrative to look at the data characteristics at different levels of granularity. Understanding what characteristics have changed with Big Data can best be done by examining the data scales of data elements, of related data elements grouped into a record that represents a specific entity or event, of records collected into a dataset, and of multiple datasets, all in turn, as shown in Figure 10. Therefore, this section does not present a strict taxonomy, breaking down each element into parts, but provides a description of data objects at a specific granularity, attributes for those objects, and characteristics and subcharacteristics of the attributes. The framework described will help illuminate areas where the driving characteristics for Big Data can be understood in the context of the characteristics of all data.

Figure 10: Data Characteristic Hierarchy

Volume2Figure10.png

3.1 Data Elements

Individual data elements have naturally not changed in the new Big Data paradigm. Data elements are understood by their data type and additional contextual data, or metadata, which provides history or additional understanding about the data.

A.   Data Format

Data formats are well characterized through International Organization for Standardization (ISO) standards such as ISO 8601:2004, Data elements and interchange formats -- Information interchange -- Representation of dates and times. [3] The data formats have not changed for Big Data.
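
As a brief illustration, the Python sketch below produces and re-parses an ISO 8601 timestamp of the kind governed by the standard cited above; the specific date is arbitrary.

```python
# Minimal sketch: writing and reading an ISO 8601 timestamp.
from datetime import datetime, timezone

moment = datetime(2015, 4, 6, 13, 30, 0, tzinfo=timezone.utc)
text = moment.isoformat()              # '2015-04-06T13:30:00+00:00'
parsed = datetime.fromisoformat(text)  # round-trips to the same moment
assert parsed == moment
print(text)
```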

B.   Data Values and Vocabulary

The data element is characterized by its actual value. This value is restricted to its defined data type (e.g., numeric, string, date) and chosen data format. Sometimes the value is restricted to a specific standard vocabulary for interoperability with others in the field, or to a set of allowed values.

C.   Metadata and Semantics

Metadata is sometimes simplistically described as “data about data.” Metadata can refer to a number of categories of contextual information, including the origins and history of the data, the processing times, the software versions, and other information. In addition, data can be described semantically to better understand what the value represents, and to make the data machine-operable. Both metadata and semantic data are not specific to Big Data. [a]

D.   Quality and Veracity

Veracity is one of the characteristics used in describing Big Data, but the accuracy of the data is not a new concern. Data quality is another name for the consideration of the reliability of the data. Again, this topic predates Big Data and is beyond the scope of this volume.[b]

3.2 Records

Data elements are grouped into records that describe a specific entity or event or transaction. At the level of records, new emphasis for Big Data begins to be seen.

A.   Record Format

Records have structure and formats. Record structures are commonly grouped as structured, semi-structured, and unstructured. Structured data was traditionally described through formats such as comma separated values, or as a row in a relational database. Unstructured refers to free text, such as in a document or a video stream. An example of semi-structured is a record wrapped with a markup language such as XML or HTML, where the contents within the markup can be free text.

These categories again predate Big Data, but two notable changes have occurred with Big Data. First, structured and unstructured data can be stored in one of the new non-relational formats, such as a key-value record structure, a key-document record, or a graph. Second, a greater emphasis is placed on unstructured data due to increasing amounts on the Web (e.g., online images and video).
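
The sketch below gives one invented example of each record-structure category named above: a structured comma-separated row, a semi-structured record whose markup wraps free text, and an unstructured free-text note. The parsing calls are included only to show where the structure, if any, can be exploited.

```python
# Minimal sketch of structured, semi-structured, and unstructured records.
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed fields in a comma-separated row.
structured = "1001,Jane Doe,2015-04-06,42.50"
fields = next(csv.reader(io.StringIO(structured)))

# Semi-structured: markup wraps free text.
semi_structured = (
    "<review id='1001'><author>Jane Doe</author>"
    "<body>The device worked well, though setup took an hour.</body></review>"
)
body = ET.fromstring(semi_structured).find("body").text

# Unstructured: free text with no imposed schema.
unstructured = "Customer called to say the device worked well but setup took an hour."

print(fields, body, unstructured, sep="\n")
```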

B.   Complexity

Complexity refers to the interrelationship between data elements in a record, or between records (e.g., the interrelationships in genomic data between genes and proteins). Complexity is not new to Big Data.

C.   Volume

Records themselves have an aspect of volume in the emerging data sources, such as when the entire DNA sequence of an organism is considered a single record.

D.   Metadata and Semantics

The same metadata categories described for data elements can be applied to records. In addition, relationships between data elements can be described semantically in terms of an ontology.

3.3 Datasets

Records can be grouped to form datasets. This grouping of records can reveal changes due to Big Data.

Quality and Consistency

A new aspect of data quality for records focuses on the characteristic of consistency. As records are distributed horizontally across a collection of data nodes, consistency becomes an issue. In relational databases, consistency was maintained by assuring that all operations in a transaction were completed successfully, otherwise the operations were rolled back. This assured that the database maintained its internal consistency. [c] For Big Data, with multiple nodes and backup nodes, new data is sent in turn to the appropriate nodes. However, constraints may or may not exist to confirm that all nodes have been updated when the query is sent. The time delay in replicating data across nodes can cause an inconsistency. The methods used to update nodes are one of the main areas in which specific implementations of non-relational data storage methods differ. A description of these patterns is a future focus area for this NBD-PWG.
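
The sketch below illustrates, in a deliberately simplified way, the replication delay described above: a write reaches the primary node immediately but propagates to a backup replica asynchronously, so a read served by the backup in the interim returns stale data. Node counts and timings are invented for the example.

```python
# Minimal sketch of replication lag leading to a temporarily inconsistent read.
import threading
import time

class Replica:
    def __init__(self):
        self.value = "v0"

primary, backup = Replica(), Replica()

def write(new_value, replication_delay=0.5):
    primary.value = new_value
    # Replication to the backup happens asynchronously, after a delay.
    threading.Timer(replication_delay,
                    lambda: setattr(backup, "value", new_value)).start()

write("v1")
print("backup read immediately:", backup.value)  # likely still 'v0' (stale)
time.sleep(1.0)
print("backup read after delay:", backup.value)  # 'v1' once replication completes
```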

3.4 Multiple Datasets

The primary focus on multiple datasets concerns the need to integrate or fuse multiple datasets. The focus here is on the variety characteristic of Big Data. Extensive datasets cannot always be converted into one structure (e.g., all weather data being reported on the same spatio-temporal grid). Since large volume datasets cannot be easily copied into a normalized structure, new techniques are being developed to integrate data as needed.

Personally Identifiable Information

An area of increasing concern with Big Data is the identification of individuals from the integration of multiple datasets, even when the individual datasets would not allow the identification. For additional discussion, the reader is referred to NIST Big Data Interoperability Framework: Volume 4, Security and Privacy.

[a] Further information about metadata and semantics can be found in: ISO/IEC 11179 Information Technology–Metadata registries; W3C’s work on the Semantic Web.

[b] Further information about data quality can be found in ISO 8000 Data Quality.

[c] For additional information on this concept the reader is referred to the literature on ACID properties of databases.

4 Summary

Big Data and data science represent a rapidly changing field due to the recent emergence of new technologies and rapid advancements in methods and perspectives. This document presents a taxonomy for the NBDRA, which is presented in NIST Big Data Interoperability Framework: Volume 6, Reference Architecture. This taxonomy is a first attempt at providing a hierarchy for categorizing the new components and activities of Big Data systems. This initial version does not incorporate a breakdown of either the Management or the Security and Privacy roles within the NBDRA as those areas need further discussion within the NBD-PWG. In addition, a description of data at different scales was provided to place concepts being ascribed to Big Data into their context. The NBD-PWG will further develop the data characteristics and attributes in the future, in particular determining whether additional characteristics related to data at rest or in motion should be described. The Big Data patterns related to transactional constraints such as ACID (Atomicity, Consistency, Isolation, Durability—a set of properties guaranteeing reliable processing of database transactions) have not been described here, and are left to future work, as the interfaces between resources are an important area for discussion. This document constitutes a first presentation of these descriptions, and future enhancements should provide additional understanding of what is new in Big Data and in specific technology implementations.

Appendix A: Acronyms

ACID               Atomicity, Consistency, Isolation, Durability

APIs                 application programming interfaces

BSP                  Bulk Synchronous Parallel

DaaS                Data as a Service

FTP                  File Transfer Protocol

HW/SW RAID hardware/software redundant array of independent disks

IoT                   Internet of Things

ISO                  International Organization for Standardization

ITL                  Information Technology Laboratory

NARA             National Archives and Records Administration 

NAS                 network-attached storage

NASA              National Aeronautics and Space Administration

NBD-PWG       NIST Big Data Public Working Group

NBDRA           NIST Big Data Reference Architecture

NIST                National Institute of Standards and Technology

NoSQL            not only SQL

NSF                 National Science Foundation

PII                    Personally Identifiable Information

RFID                radio frequency identification

SAN                 Storage Area Networks

SLAs                service-level agreements

Appendix B: References

Document References

[1]

Tom Kalil, “Big Data is a Big Deal”, The White House, Office of Science and Technology Policy, accessed February 21, 2014, http://www.whitehouse.gov/blog/2012/...-data-big-deal.

[2]

General Services Administration, “The home of the U.S. Government’s open data,” Data.gov, http://www.data.gov/

[3]

ISO 8601:2004, “Data elements and interchange formats -- Information interchange -- Representation of dates and times,” International Organization for Standardization, http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=40874

Use Case & Requirements

Source: http://bigdatawg.nist.gov/_uploadfil...746659136.docx (Word)

Cover Page

NIST Special Publication 1500-3

DRAFT NIST Big Data Interoperability Framework:

Volume 3, Use Cases and General Requirements

NIST Big Data Public Working Group

Use Cases and Requirements Subgroup

Draft Version 1

April 6, 2015

http://dx.doi.org/10.6028/NIST.SP.1500-3

Inside Cover Page

NIST Special Publication 1500-3

Information Technology Laboratory

DRAFT NIST Big Data Interoperability Framework:

Volume 3, Use Cases and General Requirements

Draft Version 1

NIST Big Data Public Working Group (NBD-PWG)

Use Cases and Requirements Subgroup

National Institute of Standards and Technology

Gaithersburg, MD 20899

April 2015

U. S. Department of Commerce

Penny Pritzker, Secretary

National Institute of Standards and Technology

Dr. Willie E. May, Under Secretary of Commerce for Standards and Technology and Director

National Institute of Standards and Technology Special Publication 1500-3

260 pages (April 6, 2015)

Certain commercial entities, equipment, or materials may be identified in this document in order to describe an experimental procedure or concept adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the entities, materials, or equipment are necessarily the best available for the purpose.

There may be references in this publication to other publications currently under development by NIST in accordance with its assigned statutory responsibilities. The information in this publication, including concepts and methodologies, may be used by Federal agencies even before the completion of such companion publications. Thus, until each publication is completed, current requirements, guidelines, and procedures, where they exist, remain operative. For planning and transition purposes, Federal agencies may wish to closely follow the development of these new publications by NIST.

Organizations are encouraged to review all draft publications during public comment periods and provide feedback to NIST. All NIST Information Technology Laboratory publications, other than the ones noted above, are available at http://www.nist.gov/publication-portal.cfm.

Public comment period: April 6, 2015 through May 21, 2015

Comments on this publication may be submitted to Wo Chang

National Institute of Standards and Technology

Attn: Wo Chang, Information Technology Laboratory

100 Bureau Drive (Mail Stop 8900) Gaithersburg, MD 20899-8930

Email: SP1500comments@nist.gov

 

Reports on Computer Systems Technology

The Information Technology Laboratory (ITL) at NIST promotes the U.S. economy and public welfare by providing technical leadership for the Nation’s measurement and standards infrastructure. ITL develops tests, test methods, reference data, proof of concept implementations, and technical analyses to advance the development and productive use of information technology. ITL’s responsibilities include the development of management, administrative, technical, and physical standards and guidelines for the cost-effective security and privacy of other than national security-related information in Federal information systems. This document reports on ITL’s research, guidance, and outreach efforts in Information Technology and its collaborative activities with industry, government, and academic organizations.

Abstract

Big Data is a term used to describe the deluge of data in our networked, digitized, sensor-laden, information-driven world. While great opportunities exist with Big Data, it can overwhelm traditional technical approaches and its growth is outpacing scientific and technological advances in data analytics. To advance progress in Big Data, the NIST Big Data Public Working Group (NBD-PWG) is working to develop consensus on important, fundamental questions related to Big Data. The results are reported in the NIST Big Data Interoperability Framework series of volumes. This volume, Volume 3, contains the 51 use cases gathered by the NBD-PWG Use Cases and Requirements Subgroup and the requirements generated from those use cases. The use cases are presented in their original and summarized form. Requirements, or challenges, were extracted from each use case, and then summarized over all of the use cases. These generalized requirements were used in the development of the NIST Big Data Reference Architecture (NBDRA), which is presented in Volume 6.

Keywords

Big Data, data science, reference architecture, System Orchestrator, Data Provider, Big Data Application Provider, Big Data Framework Provider, Data Consumer, Security and Privacy Fabric, Management Fabric, Big Data taxonomy, use cases, Big Data characteristics

Acknowledgements

This document reflects the contributions and discussions by the membership of the NBD-PWG, co-chaired by Wo Chang of the NIST ITL, Robert Marcus of ET-Strategies, and Chaitanya Baru, University of California San Diego Supercomputer Center. 

The document contains input from members of the NBD-PWG Use Cases and Requirements Subgroup, led by Geoffrey Fox (Indiana University) and Tsegereda Beyene (Cisco Systems).

NIST SP1500-3, Version 1 has been collaboratively authored by the NBD-PWG. As of the date of this publication, there are over six hundred NBD-PWG participants from industry, academia, and government. Federal agency participants include the National Archives and Records Administration (NARA), National Aeronautics and Space Administration (NASA), National Science Foundation (NSF), and the U.S. Departments of Agriculture, Commerce, Defense, Energy, Health and Human Services, Homeland Security, Transportation, Treasury, and Veterans Affairs.

NIST would like to acknowledge the specific contributions to this volume by the following NBD-PWG members:

Tsegereda Beyene (Cisco Systems)
Deborah Blackstock (MITRE Corporation)
David Boyd (Data Tactics Corporation)
Scott Brim (Internet2)
Pw Carey (Compliance Partners, LLC)
Wo Chang (NIST)
Marge Cole (SGT, Inc.)
Yuri Demchenko (University of Amsterdam)
Safia Djennane (Cloud-Age-IT)
Geoffrey Fox (Indiana University)
Nancy Grady (SAIC)
Jay Greenberg (The Boeing Company)
Karen Guertler (Consultant)
Keith Hare (JCC Consulting, Inc.)
Babak Jahromi (Microsoft)
Pavithra Kenjige (PK Technologies)
Donald Krapohl (Augmented Intelligence)
Luca Lepori (Data Hold)
Orit Levin (Microsoft)
Eugene Luster (DISA/R2AD)
Ashok Malhotra (Oracle Corporation)
Robert Marcus (ET-Strategies)
Gary Mazzaferro (AlloyCloud, Inc.)
William Miller (MaCT USA)
Sanjay Mishra (Verizon)
Doug Scrimager (Slalom Consulting)
Cherry Tom (IEEE-SA)
Wilco van Ginkel (Verizon)
Timothy Zimmerlin (Automation Technologies Inc.)
Alicia Zuniga-Alvarado (Consultant)

The editors for this document were Geoffrey Fox and Wo Chang.

 

Notice to Readers

NIST is seeking feedback on the proposed working draft of the NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements. Once public comments are received, compiled, and addressed by the NBD-PWG, and reviewed and approved by the NIST internal editorial board, Version 1 of this volume will be published as final. Three versions are planned for this volume, with Versions 2 and 3 building on the first. Further explanation of the three planned versions and the information contained therein is included in Section 1.5 of this document.

Please be as specific as possible in any comments or edits to the text. Specific edits include, but are not limited to, changes in the current text, additional text further explaining a topic or explaining a new topic, additional references, or comments about the text, topics, or document organization. These specific edits can be recorded using one of the two following methods.

  1. TRACK CHANGES: make edits to and comments on the text directly into this Word document using track changes
  2. COMMENT TEMPLATE: capture specific edits using the Comment Template (http://bigdatawg.nist.gov/_uploadfiles/SP1500-1-to-7_comment_template.docx), which includes space for Section number, page number, comment, and text edits

Submit the edited file from either method 1 or 2 to SP1500comments@nist.gov with the volume number in the subject line (e.g., “Edits for Volume 3”).

Please contact Wo Chang (wchang@nist.gov) with any questions about the feedback submission process.

Big Data professionals continue to be welcome to join the NBD-PWG to help craft the work contained in the volumes of the NIST Big Data Interoperability Framework. Additional information about the NBD-PWG can be found at http://bigdatawg.nist.gov.

Table of Contents

Executive Summary
1 Introduction
1.1 Background
1.2 Scope and Objectives of the Use Cases and Requirements Subgroup
1.3 Report Production
1.4 Report Structure
1.5 Future Work on this Volume
2 Use Case Summaries
2.1 Use Case Development Process
2.2 Government Operation
2.2.1 Use Case 1: Census 2010 and 2000 – Title 13 Big Data
2.2.2 Use Case 2: NARA Accession, Search, Retrieve, Preservation
2.2.3 Use Case 3: Statistical Survey Response Improvement
2.2.4 Use Case 4: Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design)
2.3 Commercial
2.3.1 Use Case 5: Cloud Eco-System for Financial Industries
2.3.2 Use Case 6: Mendeley – An International Network of Research
2.3.3 Use Case 7: Netflix Movie Service
2.3.4 Use Case 8: Web Search
2.3.5 Use Case 9: Big Data Business Continuity and Disaster Recovery Within a Cloud Eco-System
2.3.6 Use Case 10: Cargo Shipping
2.3.7 Use Case 11: Materials Data for Manufacturing
2.3.8 Use Case 12: Simulation-Driven Materials Genomics
2.4 Defense
2.4.1 Use Case 13: Cloud Large-Scale Geospatial Analysis and Visualization
2.4.2 Use Case 14: Object Identification and Tracking from Wide-Area Large Format Imagery or Full Motion Video – Persistent Surveillance
2.4.3 Use Case 15: Intelligence Data Processing and Analysis
2.5 Health Care and Life Sciences
2.5.1 Use Case 16: Electronic Medical Record (EMR) Data
2.5.2 Use Case 17: Pathology Imaging/Digital Pathology
2.5.3 Use Case 18: Computational Bioimaging
2.5.4 Use Case 19: Genomic Measurements
2.5.5 Use Case 20: Comparative Analysis for Metagenomes and Genomes
2.5.6 Use Case 21: Individualized Diabetes Management
2.5.7 Use Case 22: Statistical Relational Artificial Intelligence for Health Care
2.5.8 Use Case 23: World Population-Scale Epidemiological Study
2.5.9 Use Case 24: Social Contagion Modeling for Planning, Public Health, and Disaster Management
2.5.10 Use Case 25: Biodiversity and LifeWatch
2.6 Deep Learning and Social Media
2.6.1 Use Case 26: Large-Scale Deep Learning
2.6.2 Use Case 27: Organizing Large-Scale, Unstructured Collections of Consumer Photos
2.6.3 Use Case 28: Truthy – Information Diffusion Research from Twitter Data
2.6.4 Use Case 29: Crowd Sourcing in the Humanities as Source for Big and Dynamic Data
2.6.5 Use Case 30: CINET – Cyberinfrastructure for Network (Graph) Science and Analytics
2.6.6 Use Case 31: NIST Information Access Division – Analytic Technology Performance Measurements, Evaluations, and Standards
2.7 The Ecosystem for Research
2.7.1 Use Case 32: DataNet Federation Consortium (DFC)
2.7.2 Use Case 33: The Discinnet Process
2.7.3 Use Case 34: Semantic Graph Search on Scientific Chemical and Text-Based Data
2.7.4 Use Case 35: Light Source Beamlines
2.8 Astronomy and Physics
2.8.1 Use Case 36: Catalina Real-Time Transient Survey: A Digital, Panoramic, Synoptic Sky Survey
2.8.2 Use Case 37: DOE Extreme Data from Cosmological Sky Survey and Simulations
2.8.3 Use Case 38: Large Survey Data for Cosmology
2.8.4 Use Case 39: Particle Physics – Analysis of Large Hadron Collider Data: Discovery of Higgs Particle
2.8.5 Use Case 40: Belle II High Energy Physics Experiment
2.9 Earth, Environmental, and Polar Science
2.9.1 Use Case 41: EISCAT 3D Incoherent Scatter Radar System
2.9.2 Use Case 42: ENVRI, Common Operations of Environmental Research Infrastructure
2.9.3 Use Case 43: Radar Data Analysis for the Center for Remote Sensing of Ice Sheets
2.9.4 Use Case 44: Unmanned Air Vehicle Synthetic Aperture Radar (UAVSAR) Data Processing, Data Product Delivery, and Data Services
2.9.5 Use Case 45: NASA Langley Research Center/Goddard Space Flight Center iRODS Federation Test Bed
2.9.6 Use Case 46: MERRA Analytic Services (MERRA/AS)
2.9.7 Use Case 47: Atmospheric Turbulence – Event Discovery and Predictive Analytics
2.9.8 Use Case 48: Climate Studies Using the Community Earth System Model at the U.S. Department of Energy (DOE) NERSC Center
2.9.9 Use Case 49: DOE Biological and Environmental Research (BER) Subsurface Biogeochemistry Scientific Focus Area
2.9.10 Use Case 50: DOE BER AmeriFlux and FLUXNET Networks
2.10 Energy
2.10.1 Use Case 51: Consumption Forecasting in Smart Grids
3 Use Case Requirements
3.1 Use Case Specific Requirements
3.2 General Requirements
Appendix A: Use Case Study Source Materials
Appendix B: Summary of Key Properties
Appendix C: Use Case Requirements Summary
Appendix D: Use Case Detail Requirements
Appendix E: Acronyms
Appendix F: References

Figures

Figure 1: Cargo Shipping Scenario
Figure 2: Pathology Imaging/Digital Pathology – Examples of 2-D and 3-D Pathology Images
Figure 3: Pathology Imaging/Digital Pathology
Figure 4: DataNet Federation Consortium DFC – iRODS Architecture
Figure 5: Catalina CRTS: A Digital, Panoramic, Synoptic Sky Survey
Figure 6: Particle Physics: Analysis of LHC Data: Discovery of Higgs Particle – CERN LHC Location
Figure 7: Particle Physics: Analysis of LHC Data: Discovery of Higgs Particle – The Multi-tier LHC Computing Infrastructure
Figure 8: EISCAT 3D Incoherent Scatter Radar System – System Architecture
Figure 9: ENVRI Common Architecture
Figure 10(a): ICOS Architecture
Figure 10(b): LifeWatch Architecture
Figure 10(c): EMSO Architecture
Figure 10(d): EURO-Argo Architecture
Figure 10(e): EISCAT 3D Architecture
Figure 11: Typical CReSIS Radar Data After Analysis
Figure 12: Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets – Typical Flight Paths of Data Gathering in Survey Region
Figure 13: Typical Echogram with Detected Boundaries
Figure 14: Combined Unwrapped Coseismic Interferograms
Figure 15: Typical MERRA/AS Output

Tables

Table B-1: Use Case Specific Information by Key Properties
Table C-1: Use Case Specific Requirements

Executive Summary

The NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements document was prepared by the NIST Big Data Public Working Group (NBD-PWG) Use Cases and Requirements Subgroup to gather use cases and extract requirements. The Subgroup developed a use case template with 26 fields, which was completed by 51 users in the following broad areas:

·         Government Operations (4)

·         Commercial (8)

·         Defense (3)

·         Healthcare and Life Sciences (10)

·         Deep Learning and Social Media (6)

·         The Ecosystem for Research (4)

·         Astronomy and Physics (5)

·         Earth, Environmental and Polar Science (10)

·         Energy (1)

The use cases are, of course, only a representative sample and do not cover the entire spectrum of Big Data usage. All of the use cases were openly submitted and no significant editing has been performed. While there are differences in scope and interpretation, the benefits of free and open submission outweighed those of greater uniformity.

This document covers the process used by the Subgroup to collect use cases and extract requirements to form the NIST Big Data Reference Architecture (NBDRA). Included in this document are summaries of each use case, extracted requirements, and the original, unedited use case materials.

The NIST Big Data Interoperability Framework consists of seven volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The seven volumes are as follows:

·         Volume 1, Definitions

·         Volume 2, Taxonomies

·         Volume 3, Use Cases and General Requirements

·         Volume 4, Security and Privacy

·         Volume 5, Architectures White Paper Survey

·         Volume 6, Reference Architecture

·         Volume 7, Standards Roadmap

The NIST Big Data Interoperability Framework will be released in three versions, which correspond to the three stages of the NBD-PWG work. The three stages aim to achieve the following:

Stage 1: Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic

Stage 2: Define general interfaces between the NBDRA components

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces

Potential areas of future work for the Subgroup during stage 2 are highlighted in Section 1.5 of this volume. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data.

1 Introduction

1.1 Background

There is broad agreement among commercial, academic, and government leaders about the remarkable potential of Big Data to spark innovation, fuel commerce, and drive progress. Big Data is the common term used to describe the deluge of data in today’s networked, digitized, sensor-laden, and information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach, including the following:

·         How can a potential pandemic reliably be detected early enough to intervene?

·         Can new materials with advanced properties be predicted before these materials have ever been synthesized?

·         How can the current advantage of the attacker over the defender in guarding against cyber-security threats be reversed?

There is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The growth rates for data volumes, speeds, and complexity are outpacing scientific and technological advances in data analytics, management, transport, and data user spheres.

Despite widespread agreement on the inherent opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions continues to confuse potential users and stymie progress. These questions include the following:

·         What attributes define Big Data solutions?

·         How is Big Data different from traditional data environments and related applications?

·         What are the essential characteristics of Big Data environments?

·         How do these environments integrate with currently deployed architectures?

·         What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

Within this context, on March 29, 2012, the White House announced the Big Data Research and Development Initiative. [1] The initiative’s goals include helping to accelerate the pace of discovery in science and engineering, strengthening national security, and transforming teaching and learning by improving the ability to extract knowledge and insights from large and complex collections of digital data.

Six federal departments and their agencies announced more than $200 million in commitments spread across more than 80 projects, which aim to significantly improve the tools and techniques needed to access, organize, and draw conclusions from huge volumes of digital data. The initiative also challenged industry, research universities, and nonprofits to join with the federal government to make the most of the opportunities created by Big Data.

Motivated by the White House initiative and public suggestions, the National Institute of Standards and Technology (NIST) has accepted the challenge to stimulate collaboration among industry professionals to further the secure and effective adoption of Big Data. As one result of NIST’s Cloud and Big Data Forum held on January 15–17, 2013, there was strong encouragement for NIST to create a public working group for the development of a Big Data Interoperability Framework. Forum participants noted that this roadmap should define and prioritize Big Data requirements, including interoperability, portability, reusability, extensibility, data usage, analytics, and technology infrastructure. In doing so, the roadmap would accelerate the adoption of the most secure and effective Big Data techniques and technology.

On June 19, 2013, the NIST Big Data Public Working Group (NBD-PWG) was launched with extensive participation by industry, academia, and government from across the nation. The scope of the NBD-PWG involves forming a community of interests from all sectors, including industry, academia, and government, with the goal of developing consensus on definitions, taxonomies, secure reference architectures, security and privacy requirements, and, from these, a standards roadmap. Such a consensus would create a vendor-neutral, technology- and infrastructure-independent framework that would enable Big Data stakeholders to identify and use the best analytics tools for their processing and visualization requirements on the most suitable computing platform and cluster, while also allowing value-added from Big Data service providers.

The NIST Big Data Interoperability Framework consists of seven volumes, each of which addresses a specific key topic, resulting from the work of the NBD-PWG. The seven volumes are as follows:

·         Volume 1, Definitions

·         Volume 2, Taxonomies

·         Volume 3, Use Cases and General Requirements

·         Volume 4, Security and Privacy

·         Volume 5, Architectures White Paper Survey

·         Volume 6, Reference Architecture

·         Volume 7, Standards Roadmap

The NIST Big Data Interoperability Framework will be released in three versions, which correspond to the three stages of the NBD-PWG work. The three stages aim to achieve the following:

Stage 1: Identify the high-level Big Data reference architecture key components, which are technology, infrastructure, and vendor agnostic

Stage 2: Define general interfaces between the NBDRA components

Stage 3: Validate the NBDRA by building Big Data general applications through the general interfaces

Potential areas of future work for the Subgroup during stage 2 are highlighted in Section 1.5 of this volume. The current effort documented in this volume reflects concepts developed within the rapidly evolving field of Big Data.

1.2 Scope and Objectives of the Use Cases and Requirements Subgroup

This volume was prepared by the NBD-PWG Use Cases and Requirements Subgroup. The effort focused on forming a community of interest from industry, academia, and government, with the goal of developing a consensus list of Big Data requirements across all stakeholders. This included gathering and understanding various use cases from nine diversified areas (i.e., application domains). To achieve this goal, the Subgroup completed the following tasks:

·         Gathered input from all stakeholders regarding Big Data requirements

·         Analyzed and prioritized a list of challenging use case specific requirements that may delay or prevent adoption of Big Data deployment

·         Developed a comprehensive list of generalized Big Data requirements

·         Collaborated with the NBD-PWG Reference Architecture Subgroup to provide input for the NIST Big Data Reference Architecture (NBDRA)

·         Documented the findings in this report

1.3 Report Production

This report was produced using an open collaborative process involving weekly telephone conversations and information exchange using the NIST document system. The 51 use cases included herein came from Subgroup members participating in the calls and from other interested parties informed of the opportunity to contribute.

The outputs from the use case process are presented in this report and online at the following locations:

·         Index to all use cases: http://bigdatawg.nist.gov/usecases.php

·         List of specific requirements versus use case: http://bigdatawg.nist.gov/uc_reqs_summary.php

·         List of general requirements versus architecture component: http://bigdatawg.nist.gov/uc_reqs_gen.php

·         List of general requirements versus architecture component with record of use cases giving requirements: http://bigdatawg.nist.gov/uc_reqs_gen_ref.php

·         List of architecture components and specific requirements plus use case constraining the components: http://bigdatawg.nist.gov/uc_reqs_gen_detail.php

·         General requirements: http://bigdatawg.nist.gov/uc_reqs_gen.php.

1.4 Report Structure

Following this introductory section, the remainder of this document is organized as follows:

·         Section 2 presents the 51 use cases

o   Section 2.1 discusses the process that led to their production

o   Sections 2.2 through 2.10 provide summaries of the 51 use cases; each summary has three subsections: Application, Current Approach, and Future. The use cases are organized into the nine broad areas (application domains) listed below, with the number of associated use cases in parentheses:

  • Government Operation (4)
  • Commercial (8)
  • Defense (3)
  • Healthcare and Life Sciences (10)
  • Deep Learning and Social Media (6)
  • The Ecosystem for Research (4)
  • Astronomy and Physics (5)
  • Earth, Environmental, and Polar Science (10)
  • Energy (1)

·         Section 3 presents a more detailed analysis of requirements across use cases

·         Appendix A contains the original, unedited use cases

·         Appendix B summarizes key properties of each use case

·         Appendix C presents a summary of use case requirements

·         Appendix D provides the requirements extracted from each use case and aggregated general requirements grouped by characterization category

·         Appendix E contains acronyms and abbreviations used in this document

·         Appendix F supplies the document references

1.5 Future Work on this Volume

Future work on this document will include the following:

·         Identify general features or patterns and a classification of use cases by these features

·         Draw on the use case classification to suggest classes of software models and system architectures [2][3][4][5][6]

·         Conduct a more detailed analysis of the reference architecture based on sample codes being implemented in a university class [7]

·         Collect benchmarks that capture the “essence” of individual use cases

Additional work may arise from these or other NBD-PWG activities. Other future work may include collection and classification of additional use cases in areas that would benefit from additional entries, such as Government Operations, Commercial, Internet of Things, and Energy. Additional information on current or new use cases may become available, including associated figures. In future use cases, more quantitative specifications could be made, including more precise and uniform recording of data volume. In addition, further requirements analysis can be performed now that the reference architecture is more mature.

2 Use Case Summaries

2.1 Use Case Development Process

A use case is a typical application stated at a high level for the purposes of extracting requirements or comparing usages across fields. In order to develop a consensus list of Big Data requirements across all stakeholders, the Subgroup began by collecting use cases. Publicly available information was collected for various Big Data architecture examples used in nine broad areas (i.e., application domains). The examples were collected in a broad call, with special attention given to some areas, including Healthcare and Government. After collection of the 51 use cases, these nine broad areas were identified by the Subgroup members to better organize the collection. Each example of Big Data architecture constituted one use case. The nine application domains were as follows:

·         Government Operation

·         Commercial

·         Defense

·         Healthcare and Life Sciences

·         Deep Learning and Social Media

·         The Ecosystem for Research

·         Astronomy and Physics

·         Earth, Environmental, and Polar Science

·         Energy

As noted above, participants in the NBD-PWG Use Cases and Requirements Subgroup and other interested parties supplied the information for the use cases. The template used to collect use case information, provided at the front of Appendix A, was valuable for gathering consistent information that enabled the Subgroup to develop supporting analysis and comparison of the use cases. However, varied levels of detail and quantitative or qualitative information were received for each use case template section. The original, unedited use cases are also included in Appendix A and may be downloaded from the NIST document library (http://bigdatawg.nist.gov/usecases.php).

Beginning with Section 2.2 below, each Big Data use case is presented with a high-level description, the current approach to the use case, and, if available, a future desired computational environment. For some application domains, several similar Big Data use cases are presented, providing a more complete view of Big Data requirements within that application domain.

The use cases are numbered sequentially to facilitate cross-referencing between the use case summaries presented in this section, the original use cases (Appendix A), and the use case summary tables (Appendices B, C, and D).

2.2 Government Operation

2.2.1 Use Case 1: Census 2010 and 2000 – Title 13 Big Data

Submitted by Vivek Navale and Quyen Nguyen, National Archives and Records Administration (NARA)

Application

Census 2010 and 2000 – Title 13 data must be preserved for several decades so they can be accessed and analyzed after 75 years. Data must be maintained ‘as-is’ with no access and no data analytics for 75 years, preserved at the bit level, and curated, which may include format transformation. Access and analytics must be provided after 75 years. Title 13 of the U.S. Code authorizes the U.S. Census Bureau to collect and preserve census related data and guarantees that individual and industry-specific data are protected.
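Bit-level preservation of this kind is commonly verified with fixity checks. The following minimal Python sketch illustrates the idea using SHA-256 digests; the manifest format and file paths are hypothetical and are not part of the use case.

```python
# Minimal sketch: verify bit-level preservation with fixity checksums.
# The manifest format and paths are hypothetical, not part of the use case.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest: dict, root: Path) -> list:
    """Return the files whose current digest no longer matches the manifest."""
    return [name for name, expected in manifest.items()
            if sha256_of(root / name) != expected]
```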

Current Approach

The dataset contains 380 terabytes (TB) of scanned documents.

Future

Future data scenarios and applications were not expressed for this use case.

2.2.2 Use Case 2: NARA Accession, Search, Retrieve, Preservation

Submitted by Vivek Navale and Quyen Nguyen, NARA

Application

This area comprises accession, search, retrieval, and long-term preservation of government data.

Current Approach

The data are currently handled as follows:

1.      Get physical and legal custody of the data

2.      Pre-process data by conducting virus scans, identifying file formats, and removing empty files

3.      Index the data

4.      Categorize records (e.g., sensitive, non-sensitive, privacy data)

5.      Transform old file formats to modern formats (e.g., WordPerfect to PDF)

6.      Conduct e-discovery

7.      Search and retrieve to respond to special requests

8.      Search and retrieve public records by public users

Currently hundreds of TBs are stored centrally in commercial databases supported by custom software and commercial search products.
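Steps 2 and 4 of the workflow above (pre-processing and categorization) can be illustrated with a minimal sketch. The format table and sensitivity markers below are hypothetical stand-ins, not NARA's actual rules or software.

```python
# Minimal sketch of pre-processing (step 2) and categorization (step 4).
# The format table and sensitivity markers are illustrative only.
import os

FORMAT_BY_EXTENSION = {".wpd": "WordPerfect", ".pdf": "PDF", ".txt": "Plain text"}
SENSITIVE_MARKERS = (b"SSN", b"DOB")  # hypothetical markers for privacy data

def preprocess(path):
    """Skip empty files, identify the format, and flag possibly sensitive records."""
    if os.path.getsize(path) == 0:
        return None  # empty files are removed from the accession
    ext = os.path.splitext(path)[1].lower()
    with open(path, "rb") as f:
        head = f.read(4096)
    return {
        "path": path,
        "format": FORMAT_BY_EXTENSION.get(ext, "unknown"),
        "sensitive": any(marker in head for marker in SENSITIVE_MARKERS),
    }
```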

Future

Federal agencies possess many distributed data sources, which currently must be transferred to centralized storage. In the future, those data sources may reside in multiple cloud environments. In that case, taking physical custody should not require transferring Big Data from cloud to cloud or from cloud to data center.

2.2.3 Use Case 3: Statistical Survey Response Improvement

Submitted by Cavan Capps, U.S. Census Bureau

Application

Survey costs are increasing as survey responses decline. The goal of this work is to increase the quality, and reduce the cost, of field surveys by using advanced ‘recommendation system techniques.’ These techniques are open and scientifically objective, using data mashed up from several sources as well as historical survey para-data (i.e., administrative data about the survey).

Current Approach

This use case handles about a petabyte (PB) of data coming from surveys and other government administrative sources. Data can be streamed. During the decennial census, approximately 150 million records transmitted as field data are streamed continuously. All data must be both confidential and secure. All processes must be auditable for security and confidentiality as required by various legal statutes. Data quality should be high and statistically checked for accuracy and reliability throughout the collection process. Software used includes Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, and Pig.

Future

Improved recommendation systems, similar to those used in e-commerce (e.g., the Netflix use case), are needed to reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable. Data visualization is useful for data review, operational activity, and general analysis. The system continues to evolve and incorporate important features such as mobile access.

2.2.4 Use Case 4: Non-Traditional Data in Statistical Survey Response Improvement (Adaptive Design)

Submitted by Cavan Capps, U.S. Census Bureau

Application

Survey costs are increasing as survey response declines. This use case has goals similar to those of the Statistical Survey Response Improvement use case. However, this case involves non-traditional commercial and public data sources from the web, wireless communication, and electronic transactions mashed up analytically with traditional surveys. The purpose of the mashup is to improve statistics for small area geographies and new measures, as well as the timeliness of released statistics.

Current Approach

Data from a range of sources are integrated, including survey data, other government administrative data, web-scraped data, wireless data, e-transaction data, possibly social media data, and positioning data from various sources. Software, visualization, and data characteristics are similar to those in the Statistical Survey Response Improvement use case.

Future

Analytics need to be developed that give more detailed statistical estimations, on a more near real-time basis, for less cost. The reliability of estimated statistics from such mashed up sources still must be evaluated.

2.3 Commercial

2.3.1 Use Case 5: Cloud Eco-System for Financial Industries

Submitted by Pw Carey, Compliance Partners, LLC

Application

Use of cloud (e.g., Big Data) technologies needs to be extended in financial industries (i.e., banking, securities and investments, insurance) transacting business within the U.S.

Current Approach

The financial industry is already using Big Data and Hadoop for fraud detection, risk analysis, and assessments, as well as for improving its knowledge and understanding of customers. At the same time, the industry is still using traditional client/server/data warehouse/relational database management systems (RDBMSs) for the handling, processing, storage, and archival of financial data. Real-time data and analysis are important in these applications.

Future

Security, privacy, and regulation must be addressed. For example, the financial industry must examine SEC-mandated use of XBRL (eXtensible Business Reporting Language) and the use of other cloud functions.

2.3.2 Use Case 6: Mendeley – An International Network of Research

Submitted by William Gunn, Mendeley

Application

Mendeley has built a database of research documents and facilitates the creation of shared bibliographies. Mendeley collects and uses the information about research reading patterns and other activities conducted via their software to build more efficient literature discovery and analysis tools. Text mining and classification systems enable automatic recommendation of relevant research, improving research teams’ performance and cost-efficiency, particularly those engaged in curation of literature on a particular subject.

Current Approach

Data size is presently 15 TB and growing at a rate of about 1 TB per month. Processing takes place on Amazon Web Services (AWS) using the following software: Hadoop, Scribe, Hive, Mahout, and Python. The database uses standard libraries for machine learning and analytics, latent Dirichlet allocation (LDA, a generative probabilistic model for collections of discrete data), and custom-built reporting tools for aggregating readership and social activities for each document.

Future

Currently Hadoop batch jobs are scheduled daily, but work has begun on real-time recommendation. The database contains approximately 400 million documents and roughly 80 million unique documents, and receives 500,000 to 700,000 new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient way (i.e., scalable and parallelized) when they are uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.
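One common technique for clustering near-duplicate documents at scale is MinHash signatures, which approximate Jaccard similarity without comparing full texts pairwise. The sketch below illustrates the idea only; it is not Mendeley's actual pipeline, and the shingle size and signature length are arbitrary choices.

```python
# Minimal MinHash sketch for grouping near-duplicate documents.
# Illustrates one scalable deduplication technique; not Mendeley's implementation.
import hashlib

def shingles(text, k=5):
    """k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(doc_shingles, num_hashes=64):
    """Signature: for each seed, the minimum hash over all shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in doc_shingles))
    return sig

def similarity(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Documents whose signatures agree above a chosen threshold would be candidates for the same cluster, with exact comparison reserved for those candidates.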

2.3.3 Use Case 7: Netflix Movie Service

Submitted by Geoffrey Fox, Indiana University

Application

Netflix allows streaming of user-selected movies to satisfy multiple objectives for different stakeholders, with a focus on retaining subscribers. The company needs to find the best possible ordering of a set of videos for a user (e.g., household) within a given context in real time, with the objective of maximizing movie consumption. Recommendation systems and streaming video delivery are core Netflix technologies. Recommendation systems are always personalized and use logistic/linear regression, elastic nets, matrix factorization, clustering, LDA, association rules, gradient-boosted decision trees, and other tools. Digital movies are stored in the cloud with metadata, along with individual user profiles and rankings for a small fraction of movies. The current system uses multiple criteria: a content-based recommendation system, a user-based recommendation system, and diversity. Algorithms are continuously refined with A/B testing (i.e., two-variable randomized experiments used in online marketing).

Current Approach

Netflix held a competition for the best collaborative filtering algorithm to predict user ratings for films, the goal of which was to improve the accuracy of rating predictions by 10%. The winning system combined over 100 different algorithms. Netflix systems use SQL, NoSQL, and MapReduce on AWS. Netflix recommendation systems have features in common with e-commerce systems such as Amazon.com. Streaming video has features in common with other content-providing services such as iTunes, Google Play, Pandora, and Last.fm. Business initiatives such as Netflix-sponsored content have been used to increase viewership.
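Collaborative filtering of the kind used in the competition is often implemented with matrix factorization. The following minimal sketch trains user and item factors with stochastic gradient descent; it illustrates the technique only and is not Netflix's production system, and all parameter values are arbitrary.

```python
# Minimal matrix-factorization sketch (SGD on observed ratings).
# Illustrates the collaborative-filtering idea only; production recommenders
# combine many such models with other signals.
import random

def factorize(ratings, n_users, n_items, k=8, lr=0.01, reg=0.05, epochs=20):
    """ratings: list of (user, item, value). Returns user/item factor matrices."""
    random.seed(0)
    P = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

# The predicted rating for user u on item i is the dot product P[u]·Q[i].
```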

Future

Streaming video is a very competitive business. Netflix needs to be aware of other companies and trends in both content (e.g., which movies are popular) and Big Data technology.

2.3.4 Use Case 8: Web Search

Submitted by Geoffrey Fox, Indiana University

Application

A web search function returns results in ~0.1 seconds based on search terms with an average of three words. It is important to maximize quantities such as “precision@10,” the fraction of highly accurate/appropriate responses among the top 10 ranked results.
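The precision@10 metric can be made concrete with a short computation; the document identifiers and relevance judgments below are hypothetical.

```python
# precision@k: fraction of the top-k ranked results judged relevant.
def precision_at_k(ranked_ids, relevant_ids, k=10):
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

# Hypothetical example: 7 of the top 10 results are relevant -> 0.7
ranked = [f"d{i}" for i in range(10)]
relevant = {"d0", "d1", "d2", "d3", "d4", "d5", "d6"}
assert precision_at_k(ranked, relevant) == 0.7
```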

Current Approach

The current approach uses the following steps:

1.      Crawl the web

2.      Pre-process data to identify what is searchable (words, positions)

3.      Form an inverted index, which maps words to their locations in documents

4.      Rank the relevance of documents using the PageRank algorithm

5.      Employ advertising technology, e.g., using reverse engineering to identify ranking models—or preventing reverse engineering

6.      Cluster documents into topics (as in Google News)

7.      Update results efficiently.

Modern clouds and technologies such as MapReduce have been heavily influenced by this application, which now comprises ~45 billion web pages total.
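Steps 3 and 4 above (the inverted index and PageRank) can be sketched in a few lines. The toy functions below are illustrative only and omit the sharding, compression, and link-analysis refinements of a real search engine.

```python
# Minimal sketch of steps 3 and 4 above: build an inverted index and run a
# few PageRank iterations over a toy link graph (not a production system).
from collections import defaultdict

def inverted_index(docs):
    """docs: doc_id -> text. Maps each word to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def pagerank(links, d=0.85, iters=20):
    """links: page -> list of outgoing links. Returns an approximate rank per page."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = d * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:  # dangling page: spread its rank uniformly
                for target in pages:
                    new[target] += d * rank[page] / len(pages)
        rank = new
    return rank
```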

Future

Web search is a very competitive field, so continuous innovation is needed. Two important innovation areas are addressing the growing segment of mobile clients, and increasing sophistication of responses and layout to maximize the total benefit of clients, advertisers, and the search company. The “deep web” (content not indexed by standard search engines, buried behind user interfaces to databases, etc.) and multimedia searches are also of increasing importance. Each day, 500 million photos are uploaded, and each minute, 100 hours of video are uploaded to YouTube.

2.3.5 Use Case 9: Big Data Business Continuity and Disaster Recovery Within a Cloud Eco-System

Submitted by Pw Carey, Compliance Partners, LLC

Application

Business Continuity and Disaster Recovery (BC/DR) needs to consider the role that four overlaying and interdependent forces will play in ensuring a workable solution to an entity's business continuity plan and requisite disaster recovery strategy. The four areas are people (i.e., resources), processes (e.g., time/cost/return on investment [ROI]), technology (e.g., various operating systems, platforms, and footprints), and governance (e.g., subject to various and multiple regulatory agencies).

Current Approach

Data replication services are provided through cloud ecosystems, incorporating Infrastructure as a Service (IaaS) and supported by Tier 3 data centers. Replication is different from backup: it moves only the changes that took place since the previous replication, including block-level changes. Each replication pass completes quickly (within a five-second window), and replication runs every four hours. The resulting data snapshot is retained for seven business days, or longer if necessary. Replicated data can be moved to a failover center (i.e., a backup system) to satisfy an organization’s recovery point objectives (RPO) and recovery time objectives (RTO). Relevant technologies are available from VMware, NetApp, Oracle, IBM, and Brocade. Data sizes range from terabytes to petabytes.
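The replication schedule above implies a simple worst-case recovery point objective: if changes are shipped every four hours, up to four hours of updates can be lost at failover. A small illustrative calculation follows; the assumption that every replication pass is retained for the full window is hypothetical.

```python
# Arithmetic implied by the replication schedule above.
replication_interval_hours = 4          # changes shipped every four hours
retention_business_days = 7             # snapshot retained for seven business days

worst_case_rpo_hours = replication_interval_hours   # up to 4 hours of changes can be lost
# If every replication pass were retained for the full window (an assumption),
# that would give roughly this many restore points:
restore_points = retention_business_days * 24 // replication_interval_hours
print(worst_case_rpo_hours, restore_points)   # 4, 42
```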

Future

Migrating from a primary site to either a replication site or a backup site is not yet fully automated. The goal is to enable the user to automatically initiate the failover sequence. Both organizations must know which servers have to be restored and what the dependencies and inter-dependencies are between the primary site servers and replication and/or backup site servers. This knowledge requires continuous monitoring of both.

2.3.6 Use Case 10: Cargo Shipping

Submitted by William Miller, MaCT USA

Application

Delivery companies such as Federal Express, United Parcel Service (UPS), and DHL need optimal means of monitoring and tracking cargo.

Current Approach

Information is updated only when items are checked with a bar code scanner, which sends data to the central server. An item’s location is not currently displayed in real time. Figure 1 provides an architectural diagram.

Future

Tracking items in real time is feasible through the Internet of Things application, in which objects are given unique identifiers and capability to transfer data automatically, i.e., without human interaction. A new aspect will be the item’s status condition, including sensor information, global positioning system (GPS) coordinates, and a unique identification schema based upon standards under development (specifically International Organization for Standardization [ISO] standard 29161) from the ISO Joint Technical Committee 1, Subcommittee 31, Working Group 2, which develops technical standards for data structures used for automatic identification applications.


Figure 1: Cargo Shipping Scenario

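The kind of item status record described above (unique identifier, GPS coordinates, sensor readings) might be represented as follows; the field names are illustrative and do not follow ISO 29161 or any carrier's actual schema.

```python
# Hypothetical structure for a real-time cargo tracking event; field names are
# illustrative and do not follow ISO 29161 or any carrier's actual schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class TrackingEvent:
    item_id: str                              # unique identifier for the shipped item
    latitude: float                           # GPS coordinates reported by the tag
    longitude: float
    temperature_c: Optional[float] = None     # optional sensor reading
    shock_detected: bool = False
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

event = TrackingEvent(item_id="PKG-0001", latitude=39.14, longitude=-77.22,
                      temperature_c=4.5)
```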

2.3.7 Use Case 11: Materials Data for Manufacturing

Submitted by John Rumble, R&R Data Services

Application

Every physical product is made from a material that has been selected for its properties, cost, and availability. This translates into hundreds of billions of dollars of material decisions made every year. However, the adoption of new materials normally takes decades (usually two to three decades) rather than a small number of years, in part because data on new materials are not easily available. To speed adoption time, accessibility, quality, and usability must be broadened, and proprietary barriers to sharing materials data must be overcome. Sufficiently large repositories of materials data are needed to support discovery.

Current Approach

Decisions about materials usage are currently unnecessarily conservative, are often based on older rather than newer materials research and development (R&D) data, and do not take advantage of advances in modeling and simulation.

Future

Materials informatics is an area in which the new tools of data science can have a major impact by predicting the performance of real materials (in gram to ton quantities) starting at the atomistic, nanometer, and/or micrometer levels of description. The following efforts are needed to support this area:

·         Establish materials data repositories, beyond the existing ones, that focus on fundamental data.

·         Develop internationally accepted data recording standards that can be used by a very diverse materials community, including developers of materials test standards (e.g., ASTM International and ISO), testing companies, materials producers, and R&D labs.

·         Develop tools and procedures to help organizations that need to deposit proprietary materials in data repositories to mask proprietary information while maintaining the data’s usability.

·         Develop multi-variable materials data visualization tools in which the number of variables can be quite high.

2.3.8 Use Case 12: Simulation-Driven Materials Genomics

Submitted by David Skinner, Lawrence Berkeley National Laboratory (LBNL)

Application

Massive simulations spanning wide spaces of possible design lead to innovative battery technologies. Systematic computational studies are being conducted to examine innovation possibilities in photovoltaics. Search and simulation is the basis for rational design of materials. All these require management of simulation results contributing to the materials genome.

Current Approach

Survey results are produced using PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, and various materials community codes running on large supercomputers, such as Hopper at the National Energy Research Scientific Computing Center (NERSC), a 150,000-core machine that produces high-resolution simulations.

Future

Large-scale computing and flexible data methods at scale for messy data are needed for simulation science. The advancement of goal-driven thinking in materials design requires machine learning and knowledge systems that integrate data from publications, experiments, and simulations. Other needs include scalable key-value and object store databases; the current 100 TB of data will grow to 500 TB over the next five years.

2.4 Defense

2.4.1 Use Case 13: Cloud Large-Scale Geospatial Analysis and Visualization

Submitted by David Boyd, Data Tactics

Application

Large-scale geospatial data analysis and visualization must be supported. As the number of geospatially aware sensors and geospatially tagged data sources increase, the volume of geospatial data requiring complex analysis and visualization is growing exponentially.

Current Approach

Traditional geographic information systems (GISs) are generally capable of analyzing millions of objects and visualizing thousands. Data types include imagery (various formats such as NITF, GeoTiff, and CADRG) and vector (various formats such as shape files, KML [Keyhole Markup Language], and text streams). Object types include points, lines, areas, polylines, circles, and ellipses. Image registration—transforming various data into one system—requires data and sensor accuracy. Analytics include principal component analysis (PCA) and independent component analysis (ICA) and consider closest point of approach, deviation from route, and point density over time. Software includes a server with a geospatially enabled RDBMS, geospatial server/analysis software (ESRI ArcServer or Geoserver), and visualization (either browser-based or using the ArcMap application).
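One of the analytics named above, closest point of approach, reduces to a short calculation for two objects moving at constant velocity in planar map coordinates. This is a generic illustration, not code from the GIS products mentioned.

```python
# Closest point of approach (CPA) for two objects moving at constant velocity,
# one of the analytics mentioned above. Positions/velocities are in planar map
# coordinates; this is a generic illustration, not ESRI or GeoServer code.
def closest_point_of_approach(p1, v1, p2, v2):
    """p*, v*: (x, y) position and velocity. Returns (time_of_cpa, distance)."""
    dx, dy = p1[0] - p2[0], p1[1] - p2[1]          # relative position
    dvx, dvy = v1[0] - v2[0], v1[1] - v2[1]        # relative velocity
    dv2 = dvx * dvx + dvy * dvy
    t = 0.0 if dv2 == 0 else max(0.0, -(dx * dvx + dy * dvy) / dv2)
    cx, cy = dx + dvx * t, dy + dvy * t
    return t, (cx * cx + cy * cy) ** 0.5

t_cpa, d_cpa = closest_point_of_approach((0, 0), (1, 0), (10, 2), (-1, 0))
# The tracks close at 2 units/time; CPA occurs at t = 5 with separation 2.
```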

Future

Today’s intelligence systems often contain trillions of geospatial objects and must visualize and interact with millions of objects. Critical issues are indexing, retrieval and distributed analysis (note that geospatial data requires unique approaches to indexing and distributed analysis); visualization generation and transmission; and visualization of data at the end of low-bandwidth wireless connections. Data are sensitive and must be completely secure in transit and at rest (particularly on handhelds).

2.4.2 Use Case 14: Object Identification and Tracking from Wide-Area Large Format Imagery or Full Motion Video – Persistent Surveillance

Submitted by David Boyd, Data Tactics

Application

Persistent surveillance sensors can easily collect PB of imagery data in the space of a few hours. The data should be reduced to a set of geospatial objects (e.g., points, tracks) that can be easily integrated with other data to form a common operational picture. Typical processing involves extracting and tracking entities (e.g., vehicles, people, packages) over time from the raw image data.

Current Approach

It is not feasible for humans to process these data for either alerting or tracking purposes. The data need to be processed close to the sensor, which is likely forward-deployed since it is too large to be easily transmitted. Typical object extraction systems are currently small (e.g., 1 to 20 nodes) graphics processing unit (GPU)-enhanced clusters. There are a wide range of custom software and tools, including traditional RDBMSs and display tools. Real-time data are obtained at Full Motion Video (FMV)—30 to 60 frames per second at full-color 1080p resolution (i.e., 1920 x 1080 pixels, a high-definition progressive scan) or Wide-Area Large Format Imagery (WALF)—1 to 10 frames per second at 10,000 pixels x 10,000 pixels and full-color resolution. Visualization of extracted outputs will typically be as overlays on a geospatial (i.e., GIS) display. Analytics are basic object detection analytics and integration with sophisticated situation awareness tools with data fusion. Significant security issues must be considered; sources and methods cannot be compromised (i.e., “the enemy” should not know what we see).
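The sensor figures above imply substantial raw data rates, which can be estimated with simple arithmetic assuming uncompressed 24-bit color (three bytes per pixel); the assumption of no compression is for illustration only.

```python
# Rough uncompressed data rates implied by the sensor figures above,
# assuming 3 bytes per pixel (24-bit color) and no compression.
BYTES_PER_PIXEL = 3

fmv_rate = 1920 * 1080 * BYTES_PER_PIXEL * 60          # 1080p at 60 frames/s
walf_rate = 10_000 * 10_000 * BYTES_PER_PIXEL * 10     # WALF at 10 frames/s

print(f"FMV:  ~{fmv_rate / 1e9:.2f} GB/s")    # ~0.37 GB/s
print(f"WALF: ~{walf_rate / 1e9:.2f} GB/s")   # ~3.0 GB/s
# At ~3 GB/s, a single WALF sensor approaches a petabyte in roughly four days;
# the petabyte-in-a-few-hours scale above reflects multiple sensors operating
# in parallel.
```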

Future

A typical problem is integration of this processing into a large GPU cluster capable of processing data from several sensors in parallel and in near real time. Transmission of data from sensor to system is also a major challenge.

2.4.3 Use Case 15: Intelligence Data Processing and Analysis

Submitted by David Boyd, Data Tactics

Application

Intelligence analysts need the following capabilities:

·         Identify relationships between entities (e.g., people, organizations, places, equipment).

·         Spot trends in sentiment or intent for either the general population or a leadership group such as state and non-state actors.

·         Identify the locations and possibly timing of hostile actions including implantation of improvised explosive devices.

·         Track the location and actions of potentially hostile actors.

·         Reason against and derive knowledge from diverse, disconnected, and frequently unstructured (e.g., text) data sources.

·         Process data close to the point of collection, and allow for easy sharing of data to/from individual soldiers, forward-deployed units, and senior leadership in garrisons.

Current Approach

Software includes Hadoop, Accumulo (Big Table), Solr, natural language processing (NLP), Puppet (for deployment and security), and Storm running on medium-size clusters. Data size ranges from tens of terabytes to hundreds of petabytes, with imagery intelligence devices gathering a petabyte in a few hours. Dismounted warfighters typically carry at most one to a few hundred gigabytes (GB), usually on handheld data storage.

Future

Data currently exist in disparate silos. These data must be accessible through a semantically integrated data space. A wide variety of data types, sources, structures, and quality will span domains and require integrated search and reasoning. Most critical data are either unstructured or maintained as imagery or video, which requires significant processing to extract entities and information. Network quality, provenance, and security are essential.

2.5 Health Care and Life Sciences

2.5.1 Use Case 16: Electronic Medical Record (EMR) Data

Submitted by Shaun Grannis, Indiana University

Application

Large national initiatives around health data are emerging. These include developing a digital learning health care system to support increasingly evidence-based clinical decisions with timely, accurate, and up-to-date patient-centered clinical information; using electronic observational clinical data to efficiently and rapidly translate scientific discoveries into effective clinical treatments; and electronically sharing integrated health data to improve healthcare process efficiency and outcomes. These key initiatives all rely on high-quality, large-scale, standardized, and aggregate health data. Advanced methods are needed for normalizing patient, provider, facility, and clinical concept identification within and among separate health care organizations. With these methods in place, feature selection, information retrieval, and enhanced machine learning decision-models can be used to define and extract clinical phenotypes from non-standard discrete and free-text clinical data. Clinical phenotype data must be leveraged to support cohort selection, clinical outcomes research, and clinical decision support.

Current Approach

The Indiana Network for Patient Care (INPC), the nation's largest and longest-running health information exchange, houses clinical data from more than 1,100 discrete logical operational healthcare sources. Comprising more than 20 TB of raw data, these data describe over 12 million patients and over 4 billion discrete clinical observations. Between 500,000 and 1.5 million new real-time clinical transactions are added every day.

Future

Running on an Indiana University supercomputer, Teradata, PostgreSQL, and MongoDB will support information retrieval methods to identify relevant clinical features (e.g., term frequency–inverse document frequency [tf-idf], latent semantic analysis, mutual information). NLP techniques will extract relevant clinical features. Validated features will be used to parameterize clinical phenotype decision models based on maximum likelihood estimators and Bayesian networks. Decision models will be used to identify a variety of clinical phenotypes such as diabetes, congestive heart failure, and pancreatic cancer.
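One of the information retrieval methods named above, term frequency–inverse document frequency (tf-idf), can be sketched briefly. The clinical notes below are hypothetical, and the sketch is illustrative rather than the INPC implementation.

```python
# Minimal tf-idf sketch over hypothetical clinical notes, illustrating one of
# the feature-scoring methods mentioned above (not the INPC implementation).
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of tokenized notes. Returns per-document term weights."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()})
    return weights

notes = [["elevated", "glucose", "polyuria"],
         ["glucose", "normal"],
         ["chest", "pain", "normal", "ecg"]]
scores = tf_idf(notes)   # "polyuria" scores high in note 0; "normal" is downweighted
```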

2.5.2 Use Case 17: Pathology Imaging/Digital Pathology

Submitted by Fusheng Wang, Emory University

Application

Digital pathology imaging is an emerging field in which examination of high-resolution images of tissue specimens enables novel and more effective ways to diagnose diseases. Pathology image analysis segments massive spatial objects (e.g., millions of objects per image) such as nuclei and blood vessels, represented with their boundaries, along with many extracted image features from these objects. The derived information is used for many complex queries and analytics to support biomedical research and clinical diagnosis. Figure 2 presents examples of two- and three-dimensional (2D and 3D) pathology images.

Figure 2: Pathology Imaging/Digital Pathology – Examples of 2-D and 3-D Pathology Images


Current Approach

Each 2D image comprises 1 GB of raw image data and entails 1.5 GB of analytical results. Message Passing Interface (MPI) is used for image analysis. Data processing happens with MapReduce (a data processing program) and Hive (to abstract the MapReduce program and support data warehouse interactions), along with spatial extension on supercomputers and clouds. GPUs are used effectively for image creation. Figure 3 shows the architecture of Hadoop-GIS, a spatial data warehousing system, over MapReduce to support spatial analytics for analytical pathology imaging.

Figure 3: Pathology Imaging/Digital Pathology – Architecture of Hadoop-GIS

Volume3Figure3.png

Future

Recently, 3D pathology imaging has been made possible using 3D laser technologies or by serially sectioning hundreds of tissue sections onto slides and scanning them into digital images. Segmenting 3D microanatomic objects from registered serial images could produce tens of millions of 3D objects from a single image. This provides a deep ‘map’ of human tissues for next-generation diagnosis. 3D images can comprise 1 TB of raw image data and entail 1 TB of analytical results. A moderately sized hospital would generate about 1 PB of data per year.

2.5.3 Use Case 18: Computational Bioimaging

Submitted by David Skinner, Joaquin Correa, Daniela Ushizima, and Joerg Meyer, LBNL

Application

Bioimaging data are increasingly automated in acquisition, higher in resolution, and multi-modal. This has created a data analysis bottleneck that, if resolved, can advance bioscience discovery through Big Data techniques.

Current Approach

The current piecemeal analysis approach does not scale to situations in which a single scan on emerging machines is 32 TB and medical diagnostic imaging is annually around 70 PB, excluding cardiology. A web-based, one-stop shop is needed for high-performance, high-throughput image processing for producers and consumers of models built on bio-imaging data.

Future

The goal is to resolve that bottleneck with extreme-scale computing and community-focused science gateways, both of which apply massive data analysis toward massive imaging data sets. Workflow components include data acquisition, storage, enhancement, noise minimization, segmentation of regions of interest, crowd-based selection and extraction of features, and object classification, as well as organization and search. Suggested software packages are ImageJ, OMERO, VolRover, and advanced segmentation and feature detection software.
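
As an illustration of the segmentation and feature-extraction steps in such a workflow, the sketch below thresholds a synthetic image, labels connected regions of interest, and reports simple per-object features using scikit-image. The package choice, thresholds, and synthetic data are assumptions for illustration; scikit-image is the same class of tool as, but not one of, the packages suggested above.

    # Minimal sketch of the segmentation / feature-extraction step on a
    # synthetic image using scikit-image; the image and thresholds are invented.
    import numpy as np
    from skimage import filters, measure

    # Synthetic "micrograph": bright blobs on a dark, noisy background.
    rng = np.random.default_rng(0)
    image = rng.normal(0.1, 0.02, (256, 256))
    yy, xx = np.mgrid[0:256, 0:256]
    for cy, cx in [(60, 70), (150, 180), (200, 60)]:
        image += 0.8 * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 12 ** 2))

    # Segment regions of interest with a global Otsu threshold.
    mask = image > filters.threshold_otsu(image)

    # Label connected components and extract simple per-object features.
    labels = measure.label(mask)
    for region in measure.regionprops(labels):
        print(region.label, region.area,
              round(region.eccentricity, 3),
              [round(c, 1) for c in region.centroid])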

2.5.4 Use Case 19: Genomic Measurements

Submitted by Justin Zook, National Institute of Standards and Technology

Application

The NIST Genome in a Bottle Consortium integrates data from multiple sequencing technologies and methods to develop highly confident characterization of whole human genomes as reference materials. The consortium also develops methods to use these reference materials to assess performance of any genome sequencing run.

Current Approach

NIST’s approximately 40 TB network file system (NFS) is full. The National Institutes of Health (NIH) and the National Center for Biotechnology Information (NCBI) are also currently storing petabytes of data. NIST also analyzes the data, using open-source sequencing bioinformatics software from academic groups (UNIX-based) on a 72-core cluster, supplemented by larger systems at collaborators.

Future

DNA sequencers can generate ~300 GB of compressed data per day, a volume that has grown much faster than the increase in computer processing power described by Moore’s Law. Future data could include other ‘omics’ measurements, which will be even larger than DNA sequencing data. Clouds have been explored as a cost-effective, scalable approach.

2.5.5 Use Case 20: Comparative Analysis for Metagenomes and Genomes

Submitted by Ernest Szeto, LBNL, Joint Genome Institute

Application

Given a metagenomic sample, this use case aims to do the following:

·         Determine the community composition in terms of other reference isolate genomes

·         Characterize the function of its genes

·         Begin to infer possible functional pathways

·         Characterize similarity or dissimilarity with other metagenomic samples

·         Begin to characterize changes in community composition and function due to changes in environmental pressures

·         Isolate subsections of data based on quality measures and community composition

Current Approach

The current integrated comparative analysis system for metagenomes and genomes is front-ended by an interactive web user interface (UI) with core data. The system involves backend precomputations and batch job computation submission from the UI. The system provides an interface to standard bioinformatics tools (e.g., BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors).

Future

Management of the heterogeneity of biological data is currently performed by an RDBMS (i.e., Oracle), which unfortunately does not scale for even the current volume of 50 TB of data. NoSQL solutions aim to provide an alternative, but they do not always lend themselves to real-time interactive use or to rapid and parallel bulk loading, and they sometimes have issues regarding robustness.

2.5.6 Use Case 21: Individualized Diabetes Management

Submitted by Ying Ding, Indiana University

Application

Diabetes is a growing illness in the world population, affecting both developing and developed countries. Current management strategies do not adequately take into account individual patient profiles, such as co-morbidities and medications, which are common in patients with chronic illnesses. Advanced graph-based data mining techniques must be applied to electronic health records (EHR), converting them into RDF (Resource Description Framework) graphs. These advanced techniques would facilitate searches for diabetes patients and allow for extraction of their EHR data for outcome evaluation.

Current Approach

Typical patient data records are composed of 100 controlled vocabulary values and 1,000 continuous values. Most values have a timestamp. The traditional paradigm of relational row-column lookup needs to be updated to semantic graph traversal.

Future

The first step is to compare patient records to identify similar patients from a large EHR database (i.e., an individualized cohort). Each patient’s management outcome should be evaluated to formulate the most appropriate solution for a given patient with diabetes. The process would use efficient parallel retrieval algorithms, suitable for cloud or high-performance computing (HPC), using the open-source HBase database with both indexed and custom search capability to identify patients of possible interest. The Semantic Linking for Property Values method would be used to convert an existing data warehouse at Mayo Clinic, called the Enterprise Data Trust (EDT), into RDF triples that enable one to find similar patients through linking of both vocabulary-based and continuous values. The time-dependent properties need to be processed before querying to allow matching based on derivatives and other derived properties.
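
As a small illustration of the triple-conversion idea, the sketch below represents one timestamped EHR observation as RDF triples with rdflib. The namespace, identifiers, and codes are invented placeholders rather than the EDT schema or the Semantic Linking for Property Values method itself.

    # Minimal sketch: one timestamped EHR observation as RDF triples with rdflib.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/ehr/")   # placeholder namespace
    g = Graph()

    obs = EX["obs/123"]
    g.add((obs, RDF.type, EX.Observation))
    g.add((obs, EX.patient, EX["patient/42"]))
    g.add((obs, EX.code, Literal("hba1c")))                              # controlled vocabulary value
    g.add((obs, EX.value, Literal(7.9, datatype=XSD.decimal)))           # continuous value
    g.add((obs, EX.timestamp, Literal("2014-05-01T08:30:00", datatype=XSD.dateTime)))

    print(g.serialize(format="turtle"))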

2.5.7 Use Case 22: Statistical Relational Artificial Intelligence for Health Care

Submitted by Sriraam Natarajan, Indiana University

Application

The goal of the project is to analyze large, multi-modal medical data, including different data types such as imaging, EHR, genetic, and natural language data. This approach employs relational probabilistic models that have the capability of handling rich relational data and modeling uncertainty using probability theory. The software learns models from multiple data types and can possibly integrate information and reason about complex queries. Users can provide a set of descriptions, for instance, magnetic resonance imaging (MRI) images and demographic data about a particular subject. They can then query for the onset of a particular disease (e.g., Alzheimer’s), and the system will provide a probability distribution over the possible occurrence of this disease.

Current Approach

A single server can handle a test cohort of a few hundred patients with associated data of hundreds of gigabytes.

Future

A cohort of millions of patients can involve PB-size datasets. A major issue is the availability of too much data (e.g., images, genetic sequences), which can make the analysis complicated. Sometimes, large amounts of data about a single subject are available, but the number of subjects is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in analysis. Another challenge lies in aligning and merging data from multiple sources into a form that will be useful for a combined analysis.

2.5.8 Use Case 23: World Population-Scale Epidemiological Study

Submitted by Madhav Marathe, Stephen Eubank, and Chris Barrett, Virginia Tech

Application

There is a need for reliable, real-time prediction and control of pandemics similar to the 2009 H1N1 influenza. Addressing various kinds of contagion diffusion may involve modeling and computing the spread of information, diseases, and social unrest. Agent-based models can utilize the underlying interaction network (i.e., a network defined by a model of people, vehicles, and their activities) to study the evolution of the desired phenomena.
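
To make the agent-based idea concrete, the toy sketch below runs a simple susceptible-infected-recovered (SIR) process on a small random interaction network with NetworkX. The graph, infection and recovery rates, and scale are invented for illustration; the actual study uses MPI/Charm++ simulations over synthetic populations many orders of magnitude larger.

    # Toy agent-based SIR epidemic on an interaction network (all values invented).
    import random
    import networkx as nx

    random.seed(1)
    g = nx.erdos_renyi_graph(n=2000, p=0.004, seed=1)   # stand-in interaction network

    state = {node: "S" for node in g}                   # S/I/R per agent
    for node in random.sample(list(g), 10):
        state[node] = "I"

    beta, gamma = 0.08, 0.05                            # per-contact infection / recovery probabilities
    for day in range(60):
        newly_infected, newly_recovered = [], []
        for node in g:
            if state[node] == "I":
                for neighbor in g.neighbors(node):
                    if state[neighbor] == "S" and random.random() < beta:
                        newly_infected.append(neighbor)
                if random.random() < gamma:
                    newly_recovered.append(node)
        for node in newly_infected:
            state[node] = "I"
        for node in newly_recovered:
            state[node] = "R"
        counts = {s: sum(1 for v in state.values() if v == s) for s in "SIR"}
        print(day, counts)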

Current Approach

There is a two-step approach: (1) build a synthetic global population; and (2) run simulations over the global population to reason about outbreaks and various intervention strategies. The current 100 TB dataset was generated centrally with an MPI-based simulation system written in Charm++. Parallelism is achieved by exploiting the disease residence time period.

Future

Large social contagion models can be used to study complex global-scale issues, greatly increasing the size of systems used.

2.5.9 Use Case 24: Social Contagion Modeling for Planning, Public Health, and Disaster Management

Submitted by Madhav Marathe and Chris Kuhlman, Virginia Tech

Application

Social behavior models are applicable to national security, public health, viral marketing, city planning, and disaster preparedness. In a social unrest application, people take to the streets to voice either unhappiness with or support for government leadership. Models would help quantify the degree to which normal business and activities are disrupted because of fear and anger, the possibility of peaceful demonstrations and/or violent protests, and the potential for government responses ranging from appeasement, to allowing protests, to issuing threats against protestors, to taking actions to thwart protests. Addressing these issues would require fine-resolution models (at the level of individual people, vehicles, and buildings) and datasets.

Current Approach

The social contagion model infrastructure simulates different types of human-to-human interactions (e.g., face-to-face versus online media), and also interactions between people, services (e.g., transportation), and infrastructure (e.g., Internet, electric power). These activity models are generated from averages such as census data.

Future

One significant concern is data fusion (i.e., how to combine data from different sources and how to deal with missing or incomplete data). A valid modeling process must take into account heterogeneous features of hundreds of millions or billions of individuals, as well as cultural variations across countries. For such large and complex models, the validation process itself is also a challenge.

2.5.10 Use Case 25: Biodiversity and LifeWatch

Submitted by Wouter Los and Yuri Demchenko, University of Amsterdam

Application

This use case researches and monitors different ecosystems, biological species, their dynamics, and their migration, using a mix of custom sensors, data access and processing, and a federation with relevant projects in the area. Particular case studies include monitoring alien species, migrating birds, and wetlands. One of many efforts from the consortium titled Common Operations for Environmental Research Infrastructures (ENVRI) is investigating integration of LifeWatch with other environmental e-infrastructures.

Current Approach

At this time, this project is in the preliminary planning phases and, therefore, the current approach is not fully developed.

Future

The LifeWatch initiative will provide integrated access to a variety of data, analytical, and modeling tools as served by a variety of collaborating initiatives. It will also offer data and tools in selected workflows for specific scientific communities. In addition, LifeWatch will provide opportunities to construct personalized “virtual labs,” allowing participants to enter and access new data and analytical tools. New data will be shared with the data facilities cooperating with LifeWatch, including both the Global Biodiversity Information Facility and the Biodiversity Catalogue, also known as the Biodiversity Science Web Services Registry. Data include ‘omics’, species information, ecological information (e.g., biomass, population density), and ecosystem data (e.g., carbon dioxide [CO2] fluxes, algal blooming, water and soil characteristics.)

2.6 Deep Learning and Social Media

2.6.1 Use Case 26: Large-Scale Deep Learning

Submitted by Adam Coates, Stanford University

Application

There is a need to increase the size of datasets and models that can be tackled with deep learning algorithms. Large models (e.g., neural networks with more neurons and connections) combined with large datasets are increasingly the top performers in benchmark tasks for vision, speech, and NLP. It will be necessary to train a deep neural network from a large (e.g., much greater than 1 TB) corpus of data, which typically comprises imagery, video, audio, or text. Such training procedures often require customization of the neural network architecture, learning criteria, and dataset pre-processing. In addition to the computational expense demanded by the learning algorithms, the need for rapid prototyping and ease of development is extremely high.

Current Approach

The largest applications so far are to image recognition and scientific studies of unsupervised learning with 10 million images and up to 11 billion parameters on a 64 GPU HPC Infiniband cluster. Both supervised (i.e., using existing classified images) and unsupervised applications are being investigated.

Future

Large datasets of 100 TB or more may be necessary to exploit the representational power of the larger models. Training a self-driving car could take 100 million images at megapixel resolution. Deep learning shares many characteristics with the broader field of machine learning. The paramount requirements are high computational throughput for mostly dense linear algebra operations, and extremely high productivity for researcher exploration. High-performance libraries must be integrated with high-level (e.g., Python) prototyping environments.
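
A minimal sketch of what such high-level prototyping looks like is given below, using PyTorch as one example of a framework that couples a Python front end to high-performance dense linear algebra. The tiny network and random data are placeholders for the large corpora and models described above.

    # Minimal sketch of deep network prototyping in Python with PyTorch
    # (one possible framework; the model and random data are placeholders).
    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Linear(1024, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    # Stand-in mini-batches; a real run would stream data from a >1 TB corpus.
    for step in range(100):
        x = torch.randn(64, 1024)
        y = torch.randint(0, 10, (64,))
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()               # dense linear algebra dominates this step
        optimizer.step()
        if step % 20 == 0:
            print(step, loss.item())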

2.6.2 Use Case 27: Organizing Large-Scale, Unstructured Collections of Consumer Photos

Submitted by David Crandall, Indiana University

Application

Collections of millions to billions of consumer images are used to produce 3D reconstructions of scenes—with no a priori knowledge of either the scene structure or the camera positions. The resulting 3D models allow efficient and effective browsing of large-scale photo collections by geographic position. New images can be geolocated by matching them to 3D models, and object recognition can be performed on each image. The 3D reconstruction can be posed as a robust, non-linear, least squares optimization problem: observed or noisy correspondences between images are constraints, and unknowns are six-dimensional (6D) camera poses of each image and 3D positions of each point in the scene.
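
The non-linear least-squares formulation can be illustrated on a drastically simplified problem. The sketch below uses scipy.optimize.least_squares with a robust (Huber) loss to refine camera translations and 3D point positions from noisy synthetic 2D projections; real bundle adjustment also optimizes the full 6D poses and intrinsics, which are fixed or assumed known here, and all data are synthetic.

    # Simplified sketch of the robust non-linear least-squares formulation:
    # noisy 2D projections constrain unknown camera translations and 3D points.
    import numpy as np
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    f = 500.0                                            # assumed focal length (pixels)
    points = rng.uniform([-1, -1, 4], [1, 1, 6], (8, 3))            # true 3D points
    cams = np.array([[0, 0, 0], [0.5, 0, 0], [-0.5, 0.2, 0]], float)  # true camera centers

    def project(points, cams):
        rel = points[None, :, :] - cams[:, None, :]      # (n_cams, n_pts, 3)
        return f * rel[..., :2] / rel[..., 2:3]          # pinhole projection

    observed = project(points, cams) + rng.normal(0, 1.0, (3, 8, 2))  # noisy pixels

    def residuals(params):
        c = params[:9].reshape(3, 3)
        p = params[9:].reshape(8, 3)
        return (project(p, c) - observed).ravel()

    x0 = np.concatenate([cams.ravel(), points.ravel()]) + rng.normal(0, 0.05, 33)
    fit = least_squares(residuals, x0, loss="huber")     # robust loss

    print("initial RMS residual:", np.sqrt(np.mean(residuals(x0) ** 2)))
    print("final RMS residual:  ", np.sqrt(np.mean(fit.fun ** 2)))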

Current Approach

The current system is a Hadoop cluster with 480 cores, processing data for the initial applications. Over 500 billion images are currently on Facebook, and over 5 billion are on Flickr, with over 500 million images added to social media sites each day.

Future

Necessary maintenance and upgrades require many analytics including feature extraction, feature matching, and large-scale probabilistic inference. These analytics appear in many or most computer vision and image processing problems, including recognition, stereo resolution, and image denoising. Other needs are visualizing large-scale, 3D reconstructions and navigating large-scale collections of images that have been aligned to maps.

2.6.3 Use Case 28: Truthy – Information Diffusion Research from Twitter Data

Submitted by Filippo Menczer, Alessandro Flammini, and Emilio Ferrara, Indiana University

Application

How communication spreads on socio-technical networks must be better understood, and methods are needed to detect potentially harmful information spread at early stages (e.g., deceiving messages, orchestrated campaigns, untrustworthy information).

Current Approach

Twitter generates a large volume of continuous streaming data—about 30 TB a year, compressed—through circulation of ~100 million messages per day. The increase over time is roughly 500 GB of data per day. All these data must be acquired and stored. Additional needs include near real-time analysis of such data for anomaly detection, stream clustering, signal classification, and online learning, as well as data retrieval, Big Data visualization, data-interactive web interfaces, and public application programming interfaces (APIs) for data querying. Software packages for data analysis include Python/SciPy/NumPy/MPI. Information diffusion, clustering, and dynamic network visualization capabilities already exist.

Future

Truthy plans to expand, incorporating Google+ and Facebook, and so needs to move toward advanced distributed storage programs, such as Hadoop/Indexed HBase and Hadoop Distributed File System (HDFS). Redis should be used as an in-memory database to be a buffer for real-time analysis. Solutions will need to incorporate streaming clustering, anomaly detection, and online learning.
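
As a sketch of the proposed buffering pattern, the snippet below uses the redis-py client to push incoming messages onto an in-memory Redis list and pop them for near-real-time analysis. It assumes a locally running Redis server; the message format and key names are invented for illustration.

    # Minimal sketch of Redis as an in-memory buffer between data collection
    # and real-time analysis (assumes a local Redis server and redis-py).
    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    def collect(message):
        # Producer side: buffer each incoming message (e.g., a tweet) in Redis.
        r.lpush("stream:messages", json.dumps(message))

    def analyze_forever():
        # Consumer side: block until a message is available, then process it
        # (anomaly detection, stream clustering, online learning would go here).
        while True:
            item = r.brpop("stream:messages", timeout=5)
            if item is None:
                continue
            _, payload = item
            message = json.loads(payload)
            print("processing", message.get("id"))

    collect({"id": 1, "text": "example message"})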

2.6.4 Use Case 29: Crowd Sourcing in the Humanities as Source for Big and Dynamic Data

Submitted by Sebastian Drude, Max-Planck-Institute for Psycholinguistics, Nijmegen, the Netherlands

Application

Information is captured from many individuals and their devices using a range of sources: manually entered, recorded multimedia, reaction times, pictures, sensor information. These data are used to characterize wide-ranging individual, social, cultural, and linguistic variations among several dimensions (e.g., space, social space, time).

Current Approach

At this point, typical systems used are Extensible Markup Language (XML) technology and traditional relational databases. Other than pictures, not much multi-media is employed yet.

Future

Crowd sourcing is beginning to be used on a larger scale. However, the availability of sensors in mobile devices provides a huge potential for collecting large amounts of data from numerous individuals. This possibility has not been explored on a large scale so far; existing crowd sourcing projects are usually of a limited scale and web-based. Privacy issues may be involved because of access to individuals’ audiovisual files; anonymization may be necessary but not always possible. Data management and curation are critical. With multimedia, the size could be hundreds of terabytes.

2.6.5 Use Case 30: CINET – Cyberinfrastructure for Network (Graph) Science and Analytics

Submitted by Madhav Marathe and Keith Bisset, Virginia Tech

Application

CINET provides a common web-based platform that allows the end user seamless access to the following:

·         Network and graph analysis tools such as SNAP, NetworkX, and Galib (a minimal NetworkX sketch follows this list)

·         Real-world and synthetic networks

·         Computing resources

·         Data management systems.
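
The sketch below illustrates, on a small synthetic graph, the kind of network analysis such tools expose. NetworkX is one of the toolkits named above, but the graph, metrics, and scale are chosen only for illustration; the real platform dispatches such methods to HPC resources.

    # Minimal sketch of graph analysis with NetworkX on a synthetic network.
    import networkx as nx

    g = nx.barabasi_albert_graph(n=1000, m=3, seed=42)   # synthetic scale-free network

    print("nodes:", g.number_of_nodes(), "edges:", g.number_of_edges())
    print("connected components:", nx.number_connected_components(g))
    print("average clustering:", round(nx.average_clustering(g), 4))

    # Top five nodes by degree centrality.
    centrality = nx.degree_centrality(g)
    top = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print("highest-degree nodes:", top)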

Current Approach

CINET uses an Infiniband-connected HPC cluster with 720 cores to provide HPC as a service. The platform is being used for research and education; CINET is used in classes and to support research by the social science and social networking communities.

Future

Rapid repository growth is expected to lead to at least 1,000 to 5,000 networks and methods in about a year. As more fields use graphs of increasing size, parallel algorithms will be important. Two critical challenges are data manipulation and bookkeeping of the derived data, as there are no well-defined and effective models and tools for unified management of various graph data.

2.6.6 Use Case 31: NIST Information Access Division – Analytic Technology Performance Measurements, Evaluations, and Standards

Submitted by John Garofolo, NIST

Application

Performance metrics, measurement methods, and community evaluations are needed to ground and accelerate development of advanced analytic technologies in the areas of speech and language processing, video and multimedia processing, biometric image processing, and heterogeneous data processing, as well as the interaction of analytics with users. Typically, one of two processing models is employed: (1) push test data out to test participants and analyze the output of participant systems, or (2) push algorithm test harness interfaces out to participants, bring in their algorithms, and test them on internal computing clusters.

Current Approach

There are large annotated corpora of unstructured/semi-structured text, audio, video, images, multimedia, and heterogeneous collections of the above, including ground truth annotations for training, developmental testing, and summative evaluations. The test corpora exceed 900 million web pages occupying 30 TB of storage, 100 million tweets, 100 million ground-truthed biometric images, several hundred thousand partially ground-truthed video clips, and terabytes of smaller fully ground-truthed test collections.

Future

Even larger data collections are being planned for future evaluations of analytics involving multiple data streams and very heterogeneous data. In addition to larger datasets, the future includes testing of streaming algorithms with multiple heterogeneous data. The use of clouds is being explored.

2.7 The Ecosystem for Research

2.7.1 Use Case 32: DataNet Federation Consortium (DFC)

Submitted by Reagan Moore, University of North Carolina at Chapel Hill

Application

The DataNet Federation Consortium (DFC) promotes collaborative and interdisciplinary research through a federation of data management systems across federal repositories, national academic research initiatives, institutional repositories, and international collaborations. The collaboration environment runs at scale and includes petabytes of data, hundreds of millions of files, hundreds of millions of metadata attributes, tens of thousands of users, and a thousand storage resources.

Current Approach

Currently, 25 science and engineering domains have projects that rely on the iRODS (Integrated Rule-Oriented Data System) policy-based data management system. Active organizations include the National Science Foundation, with major projects such as the Ocean Observatories Initiative (sensor archiving); Temporal Dynamics of Learning Center (cognitive science data grid); iPlant Collaborative (plant genomics); Drexel’s engineering digital library; and H. W. Odum Institute for Research in Social Science (data grid federation with Dataverse). iRODS currently manages petabytes of data, hundreds of millions of files, hundreds of millions of metadata attributes, tens of thousands of users, and a thousand storage resources. It interoperates with workflow systems (e.g., the National Center for Supercomputing Applications’ [NCSA’s] Cyberintegrator, Kepler, Taverna), cloud, and more traditional storage models, as well as different transport protocols. Figure 4 presents a diagram of the iRODS architecture.

Future

Future data scenarios and applications were not expressed for this use case.


Figure 4: DataNet Federation Consortium DFC – iRODS Architecture

Volume3Figure4.png

2.7.2 Use Case 33: The Discinnet Process

Submitted by P. Journeau, Discinnet Labs

Application

Discinnet has developed a Web 2.0 collaborative platform and research prototype as a pilot installation, which is now being deployed and tested by researchers from a growing number of diverse research fields. The goal is to reach a wide enough sample of active research fields, represented as clusters (i.e., researchers projected and aggregating within a manifold of mostly shared experimental dimensions) to test general, hence potentially interdisciplinary, epistemological models throughout the present decade.

Current Approach

Currently, 35 clusters have been started, with close to 100 awaiting more resources. There is potential for many more to be created, administered, and animated by research communities. Examples of clusters include optics, cosmology, materials, microalgae, health care, applied math, computation, rubber, and other chemical products/issues.

Future

Discinnet itself would not be Big Data but rather will generate metadata when applied to a cluster that involves Big Data. In interdisciplinary integration of several fields, the process would reconcile metadata from many complexity levels.

2.7.3 Use Case 34: Semantic Graph Search on Scientific Chemical and Text-Based Data

Submitted by Talapady Bhat, NIST

Application

Social media-based infrastructure, terminology and semantic data-graphs are established to annotate and present technology information. The process uses root- and rule-based methods currently associated primarily with certain Indo-European languages, such as Sanskrit and Latin.

Current Approach

Many reports, including a recent one on the Material Genome Project, find that exclusive top-down solutions to facilitate data sharing and integration are not desirable for multi-disciplinary efforts. However, a bottom-up approach can be chaotic. For this reason, there is a need for a balanced blend of the two approaches to support easy-to-use techniques for metadata creation, integration, and sharing. This challenge is very similar to the challenge faced by language developers, so a recently developed method is based on these ideas. There are ongoing efforts to extend this method to publications of interest to the Material Genome Initiative [8], the Open Government movement [9], and the NIST Integrated Knowledge Editorial Net (NIKE) [10], a NIST-wide publication archive. These efforts are a component of the Research Data Alliance Metadata Standards Directory Working Group [11].

Future

A cloud infrastructure should be created for social media of scientific information. Scientists from across the world could use this infrastructure to participate and deposit results of their experiments. Prior to establishing a scientific social medium, some issues must be resolved including the following:

·         Minimize challenges related to establishing re-usable, interdisciplinary, scalable, on-demand, use-case, and user-friendly vocabulary.

·         Adopt an existing or create new on-demand ‘data-graph’ to place information in an intuitive way, such that it would easily integrate with existing data-graphs in a federated environment, independently of details of data management.

·         Find relevant scientific data without spending too much time on the Internet.

Start with resources such as the Open Government movement, Material Genome Initiative, and Protein Databank. This effort includes many local and networked resources. Developing an infrastructure to automatically integrate information from all these resources using data-graphs is a challenge, but steps are being taken to solve it. Strong database tools and servers for data-graph manipulation are needed.

2.7.4 Use Case 35: Light Source Beamlines

Submitted by Eli Dart, LBNL

Application

Samples are exposed to X-rays from light sources in a variety of configurations, depending on the experiment. Detectors, essentially high-speed digital cameras, collect the data. The data are then analyzed to reconstruct a view of the sample or process being studied.

Current Approach

A variety of commercial and open source software is used for data analysis. For example, Octopus is used for tomographic reconstruction, and Avizo (http://vsg3d.com) and FIJI (a distribution of ImageJ) are used for visualization and analysis. Data transfer is accomplished using physical transport of portable media (which severely limits performance), high-performance GridFTP managed by Globus Online, or workflow systems such as SPADE (Support for Provenance Auditing in Distributed Environments), an open source software infrastructure.

Future

Camera resolution is continually increasing. Data transfer to large-scale computing facilities is becoming necessary because of the computational power required to conduct the analysis on timescales useful to the experiment. Because of the large number of beamlines (e.g., 39 at the LBNL Advanced Light Source), aggregate data load is likely to increase significantly over the coming years, as will the need for a generalized infrastructure for analyzing GB per second of data from many beamline detectors at multiple facilities.

2.8 Astronomy and Physics

2.8.1 Use Case 36: Catalina Real-Time Transient Survey: A Digital, Panoramic, Synoptic Sky Survey

Submitted by S. G. Djorgovski, Caltech

Application

Catalina Real-Time Transient Survey (CRTS) explores the variable universe in the visible light regime, on timescales ranging from minutes to years, by searching for variable and transient sources. It discovers a broad variety of astrophysical objects and phenomena, including various types of cosmic explosions (e.g., supernovae), variable stars, phenomena associated with accretion to massive black holes (e.g., active galactic nuclei) and their relativistic jets, and high proper motion stars. The data are collected from three telescopes (two in Arizona and one in Australia), with additional ones expected in the near future in Chile.

Current Approach

The survey generates up to approximately 0.1 TB on a clear night with a total of approximately 100 TB in current data holdings. The data are preprocessed at the telescope and then transferred to the University of Arizona and Caltech for further analysis, distribution, and archiving. The data are processed in real time, and detected transient events are published electronically through a variety of dissemination mechanisms, with no proprietary withholding period (CRTS has a completely open data policy). Further data analysis includes classification of the detected transient events, additional observations using other telescopes, scientific interpretation, and publishing. This process makes heavy use of the archival data (several PBs) from a wide variety of geographically distributed resources connected through the virtual observatory (VO) framework.

Future

CRTS is a scientific and methodological test bed and precursor of larger surveys to come, notably the Large Synoptic Survey Telescope (LSST), expected to operate in the 2020s and selected as the highest-priority ground-based instrument in the 2010 Astronomy and Astrophysics Decadal Survey. LSST will gather about 30 TB per night. Figure 5 illustrates the schematic architecture for a cyber infrastructure for time domain astronomy.

Figure 5: Catalina CRTS: A Digital, Panoramic, Synoptic Sky Survey

Volume3Figure5.png

Survey pipelines from telescopes (on the ground or in space) produce transient event data streams, and the events, along with their observational descriptions, are ingested by one or more depositories, from which the event data can be disseminated electronically to human astronomers or robotic telescopes. Each event is assigned an evolving portfolio of information, which includes all available data on that celestial position. The data are gathered from a wide variety of data archives unified under the Virtual Observatory framework, expert annotations, etc. Representations of such federated information can be both human-readable and machine-readable. The data are fed into one or more automated event characterization, classification, and prioritization engines that deploy a variety of machine learning tools for these tasks. The engines’ output, which evolves dynamically as new information arrives and is processed, informs the follow-up observations of the selected events, and the resulting data are communicated back to the event portfolios for the next iteration. Users, either human or robotic, can tap into the system at multiple points, both for information retrieval and to contribute new information, through a standardized set of formats and protocols. This could be done in (near) real-time or in archival (i.e., not time-critical) modes.
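
A generic sketch of the classification step such engines perform is given below: a scikit-learn random forest is trained on synthetic event features (magnitude change, timescale, color) and returns class probabilities that could feed follow-up prioritization. The features, labels, and data are invented and do not describe the actual CRTS or follow-on pipelines.

    # Generic sketch of an event-classification engine: random forest on
    # synthetic transient-event features (all features, labels, data invented).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 2000
    X = np.column_stack([
        rng.normal(1.5, 0.8, n),      # magnitude change
        rng.lognormal(1.0, 0.7, n),   # timescale (days)
        rng.normal(0.2, 0.5, n),      # color index
    ])
    # Toy labels: 0 = variable star, 1 = supernova-like transient.
    y = ((X[:, 0] > 1.8) & (X[:, 1] > 3.0)).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # Probabilistic output lets the engine prioritize follow-up observations.
    print("held-out accuracy:", round(clf.score(X_test, y_test), 3))
    print("class probabilities for first 3 events:\n", clf.predict_proba(X_test[:3]).round(3))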

2.8.2 Use Case 37: DOE Extreme Data from Cosmological Sky Survey and Simulations

Submitted by Salman Habib, Argonne National Laboratory; Andrew Connolly, University of Washington

Application

A cosmology discovery tool integrates simulations and observation to clarify the nature of dark matter, dark energy, and inflation—some of the most exciting, perplexing, and challenging questions facing modern physics, including the properties of fundamental particles affecting the early universe. The simulations will generate data sizes comparable to observation.

Current Approach

At this time, this project is in the preliminary planning phases and, therefore, the current approach is not fully developed.

Future

These systems will use huge amounts of supercomputer time—over 200 million hours. Associated data sizes are as follows:

·         Dark Energy Survey (DES): 4 PB per year in 2015

·         Zwicky Transient Factory (ZTF): 1 PB per year in 2015

·         LSST (see CRTS discussion above): 7 PB per year in 2019

·         Simulations: 10 PB per year in 2017

2.8.3 Use Case 38: Large Survey Data for Cosmology

Submitted by Peter Nugent, LBNL

Application

For DES, the data are sent from the mountaintop, via a microwave link, to La Serena, Chile. From there, an optical link forwards them to the NCSA and to NERSC for storage and ‘reduction.’ Here, galaxies and stars in both the individual and stacked images are identified and catalogued, and finally their properties are measured and stored in a database.

Current Approach

Subtraction pipelines are run using extant imaging data to find new optical transients through machine learning algorithms. Data technologies are Linux cluster, Oracle RDBMS server, Postgres PSQL, large memory machines, standard Linux interactive hosts, and the General Parallel File System (GPFS). HPC resources are needed for simulations. Software needs include standard astrophysics reduction software as well as Perl/Python wrapper scripts and Linux Cluster scheduling.

Future

Techniques are needed for handling Cholesky decomposition for thousands of simulations with matrices of order one million on a side, as well as for parallel image storage. LSST will generate 60 PB of imaging data and 15 PB of catalog data, and a correspondingly large (or larger) amount of simulation data. In total, over 20 TB of data will be generated per night.
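
For readers unfamiliar with the Cholesky step, the small sketch below factors a symmetric positive definite covariance-like matrix of order 1,000 with NumPy and uses the factor to solve a linear system; the same operation at order ~10^6 across thousands of simulations requires distributed-memory parallel solvers. The test matrix here is synthetic.

    # Small-scale illustration of the Cholesky step: factor C = L L^T and use
    # the factor to solve C x = b, as a likelihood evaluation would.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.normal(size=(n, n))
    C = A @ A.T + n * np.eye(n)          # symmetric positive definite test matrix

    L = np.linalg.cholesky(C)            # C = L @ L.T, lower triangular L

    b = rng.normal(size=n)
    y = np.linalg.solve(L, b)            # forward substitution (dense solve here)
    x = np.linalg.solve(L.T, y)          # back substitution

    print("max reconstruction error:", np.max(np.abs(L @ L.T - C)))
    print("residual of solve:", np.max(np.abs(C @ x - b)))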

2.8.4 Use Case 39: Particle Physics – Analysis of Large Hadron Collider Data: Discovery of Higgs Particle

Submitted by Michael Ernst, Brookhaven National Laboratory (BNL); Lothar Bauerdick, Fermi National Accelerator Laboratory (FNAL); Geoffrey Fox, Indiana University; Eli Dart, LBNL

Application

Analysis is conducted on collisions at the European Organization for Nuclear Research (CERN) Large Hadron Collider (LHC) accelerator (Figure 6) and on Monte Carlo data producing events that describe particle-apparatus interactions.

Figure 6: Particle Physics: Analysis of LHC Data: Discovery of Higgs Particle – CERN LHC Location

Volume3Figure6.png

Processed information defines physics properties of events and generates lists of particles with type and momenta. These events are analyzed to find new effects: both new particles (e.g., the Higgs boson) and evidence that conjectured particles (e.g., supersymmetric particles) have not been detected. A few major experiments are being conducted at the LHC, including ATLAS and CMS (Compact Muon Solenoid). These experiments have global participants (e.g., CMS has 3,600 participants from 183 institutions in 38 countries), and so the data at all levels are transported and accessed across continents.

Current Approach

The LHC experiments are pioneers of a distributed Big Data science infrastructure. Several aspects of the LHC experiments’ workflow highlight issues that other disciplines will need to solve. These issues include automation of data distribution, high-performance data transfer, and large-scale high-throughput computing. Figure 7 shows grid analysis with 350,000 cores running near-continuously—over two million jobs per day arranged in three major tiers: CERN, Continents/Countries, and Universities. The analysis uses a distributed, high-throughput (i.e., pleasingly parallel) computing architecture, with facilities integrated across the world by the Worldwide LHC Computing Grid (WLCG) and the Open Science Grid in the U.S. Accelerator data and analysis generate 15 PB of data each year, for a total of 200 PB. Specifically, in 2012, ATLAS had 8 PB on Tier 1 tape and over 10 PB on Tier 1 disk at BNL, and 12 PB on disk cache at U.S. Tier 2 centers. CMS has similar data sizes. Over half the resources are used for Monte Carlo simulations as opposed to data analysis.

Figure 7: Particle Physics: Analysis of LHC Data: Discovery of Higgs Particle – The Multi-tier LHC Computing Infrastructure

Volume3Figure7.png

Future

In the past, the particle physics community has been able to rely on industry to deliver exponential increases in performance per unit cost over time, as described by Moore's Law. However, the available performance will be much more difficult to exploit in the future since technology limitations, in particular regarding power consumption, have led to profound changes in the architecture of modern central processing unit (CPU) chips. In the past, software could run unchanged on successive processor generations and achieve performance gains that follow Moore's Law, thanks to the regular increase in clock rate that continued until 2006. The era of scaling sequential applications on high energy physics (HEP) computing resources is now over. Changes in CPU architectures imply significantly more software parallelism, as well as exploitation of specialized floating point capabilities. The structure and performance of HEP data processing software need to be changed such that they can continue to be adapted and developed to run efficiently on new hardware. This represents a major paradigm shift in HEP software design and implies large-scale re-engineering of data structures and algorithms. Parallelism needs to be added simultaneously at all levels: the event level, the algorithm level, and the sub-algorithm level. Components at all levels in the software stack need to interoperate, and therefore the goal is to standardize as much as possible on basic design patterns and on the choice of a concurrency model. This will also help to ensure efficient and balanced use of resources.
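
As a toy illustration of event-level parallelism only, the sketch below farms independent "events" out to a Python process pool. Production HEP frameworks implement this in C++ with threading and vectorization at the algorithm and sub-algorithm levels as well; the event structure and reconstruction step here are placeholders.

    # Toy event-level parallelism: independent events reconstructed concurrently
    # by a process pool (the "event" and "reconstruction" are placeholders).
    import math
    from multiprocessing import Pool

    def reconstruct(event):
        # Stand-in for per-event reconstruction: derive quantities from raw "hits".
        hits = event["hits"]
        energy = sum(hits)
        return {"id": event["id"], "energy": energy, "log_e": math.log1p(energy)}

    if __name__ == "__main__":
        events = [{"id": i, "hits": [(i * j) % 97 / 10.0 for j in range(1000)]}
                  for i in range(10000)]
        with Pool(processes=4) as pool:
            results = pool.map(reconstruct, events, chunksize=100)
        print(len(results), "events reconstructed; first:", results[0])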

2.8.5 Use Case 40: Belle II High Energy Physics Experiment

Submitted by David Asner and Malachi Schram, Pacific Northwest National Laboratory (PNNL)

Application

The Belle experiment is a particle physics experiment in which more than 400 physicists and engineers investigate charge-parity (CP) violation effects with B meson production at the KEKB e+ e- accelerator of the High Energy Accelerator Research Organization (KEK) in Tsukuba, Japan. In particular, numerous decay modes at the Upsilon(4S) resonance are sought to identify new phenomena beyond the standard model of particle physics. This accelerator has the highest intensity of any in the world, but the events are simpler than those from the LHC, so the analysis is less complicated, though similar in style to the CERN accelerator analysis.

Current Approach

At this time, this project is in the preliminary planning phases and, therefore, the current approach is not fully developed.

Future

An upgraded experiment, Belle II, and an upgraded accelerator, SuperKEKB, will start operation in 2015. Data will increase by a factor of 50, with total integrated raw data of ~120 PB, physics data of ~15 PB, and ~100 PB of Monte Carlo samples. The next stage will necessitate a move to a distributed computing model requiring continuous raw data transfer of ~20 GB per second at design luminosity between Japan and the United States. Open Science Grid, Geant4, DIRAC, FTS, and Belle II framework software will be needed.

2.9 Earth, Environmental, and Polar Science

2.9.1 Use Case 41: EISCAT 3D Incoherent Scatter Radar System

Submitted by Yin Chen, Cardiff University; Ingemar Häggström, Ingrid Mann, and Craig Heinselman, EISCAT

Application

EISCAT, the European Incoherent Scatter Scientific Association, conducts research on the lower, middle, and upper atmosphere and ionosphere using the incoherent scatter radar technique. This technique is the most powerful ground-based tool for these research applications. EISCAT studies instabilities in the ionosphere and investigates the structure and dynamics of the middle atmosphere. With the addition of a separate heating facility, EISCAT also operates a diagnostic instrument for ionospheric modification experiments. Currently, EISCAT operates three of the ten major incoherent scatter radar instruments worldwide; its three systems are located in the Scandinavian sector, north of the Arctic Circle.

Current Approach

The currently running EISCAT radar generates data at rates of terabytes per year. The system does not present special challenges.

Future

The design of the next-generation radar, EISCAT_3D, will consist of a core site with transmitting and receiving radar arrays and four sites with receiving antenna arrays at some 100 kilometers from the core. The fully operational five-site system will generate several thousand times the data volume of the current EISCAT system, reaching 40 PB per year in 2022, and is expected to operate for 30 years. The EISCAT_3D data e-Infrastructure plans to use high-performance computers for central site data processing and high-throughput computers for mirror site data processing. Downloading the full data is not time-critical, but operations require real-time information about certain pre-defined events, which would be sent from the sites to the operations center, and a real-time link from the operations center to the sites to set the mode of radar operation in real time. See Figure 8.

Figure 8: EISCAT 3D Incoherent Scatter Radar System – System Architecture

Volume3Figure8.png

2.9.2 Use Case 42: ENVRI, Common Operations of Environmental Research Infrastructure

Submitted by Yin Chen, Cardiff University

Application

ENVRI addresses European distributed, long-term, remote-controlled observational networks focused on understanding processes, trends, thresholds, interactions, and feedbacks, as well as increasing the predictive power to address future environmental challenges. The following efforts are part of ENVRI:

·         ICOS (Integrated Carbon Observation System) is a European distributed infrastructure dedicated to the monitoring of greenhouse gases (GHGs) through its atmospheric, ecosystem, and ocean networks.

·         EURO-Argo is the European contribution to Argo, which is a global ocean observing system.

·         EISCAT_3D (described separately) is a European new-generation incoherent scatter research radar system for upper atmospheric science.

·         LifeWatch (described separately) is an e-science infrastructure for biodiversity and ecosystem research.

·         EPOS (European Plate Observing System) is a European research infrastructure for earthquakes, volcanoes, surface dynamics, and tectonics.

·         EMSO (European Multidisciplinary Seafloor and Water Column Observatory) is a European network of seafloor observatories for the long-term monitoring of environmental processes related to ecosystems, climate change, and geo-hazards.

·         IAGOS (In-service Aircraft for a Global Observing System) is setting up a network of aircraft for global atmospheric observation.

·         SIOS (Svalbard Integrated Arctic Earth Observing System) is establishing an observation system in and around Svalbard that integrates the studies of geophysical, chemical, and biological processes from all research and monitoring platforms.

Current Approach

ENVRI develops a reference model (ENVRI RM) as a common ontological framework and standard for the description and characterization of computational and storage infrastructures. The goal is to achieve seamless interoperability between the heterogeneous resources of different infrastructures. The ENVRI RM serves as a common language for community communication, providing a uniform framework into which the infrastructure’s components can be classified and compared. The RM also serves to identify common solutions to common problems. Data sizes in a given infrastructure vary from gigabytes to petabytes per year.

Future

ENVRI’s common environment will empower the users of the collaborating environmental research infrastructures and enable multidisciplinary scientists to access, study, and correlate data from multiple domains for system-level research. Collaboration affects Big Data requirements coming from interdisciplinary research.

ENVRI analyzed the computational characteristics of the six European Strategy Forum on Research Infrastructures (ESFRI) environmental research infrastructures, and identified five common subsystems (Figure 9). They are defined in the ENVRI RM (http://www.envri.eu/rm) and below:

·         Data acquisition: Collects raw data from sensor arrays, various instruments, or human observers, and brings the measurements (data streams) into the system.

·         Data curation: Facilitates quality control and preservation of scientific data and is typically operated at a data center.

·         Data access: Enables discovery and retrieval of data housed in data resources managed by a data curation subsystem.

·         Data processing: Aggregates data from various resources and provides computational capabilities and capacities for conducting data analysis and scientific experiments.

·         Community support: Manages, controls, and tracks users' activities and supports users in conduct of their community roles.

Figure 9: ENVRI Common Architecture

Volume3Figure9.png

Figures 10(a) through 10(e) illustrate how well the five subsystems map to the architectures of the ESFRI environmental research infrastructures.

Figure 10(a): ICOS Architecture

Volume3Figure10a.png

Figure 10(b): LifeWatch Architecture

Volume3Figure10b.png

Figure 10(c): EMSO Architecture

Volume3Figure10c.png

Figure 10(d): EURO-Argo Architecture

Volume3Figure10d.png

Figure 10(e): EISCAT 3D Architecture

Volume3Figure10e.png

2.9.3 Use Case 43: Radar Data Analysis for the Center for Remote Sensing of Ice Sheets

Submitted by Geoffrey Fox, Indiana University

Application

As illustrated in Figure 11, the Center for Remote Sensing of Ice Sheets (CReSIS) effort uses custom radar systems to measure ice sheet bed depths and (annual) snow layers at the North and South Poles and mountainous regions.

Figure 11: Typical CReSIS Radar Data After Analysis

Volume3Figure11.png

Resulting data feed into the Intergovernmental Panel on Climate Change (IPCC). The radar systems are typically flown in by aircraft in multiple paths, as illustrated by Figure 12.

Figure 12: Radar Data Analysis for CReSIS Remote Sensing of Ice Sheets – Typical Flight Paths of Data Gathering in Survey Region

Volume3Figure12.png

Current Approach

The initial analysis uses Matlab signal processing that produces a set of radar images. These cannot be transported from the field over the Internet and are typically copied onsite to a few removable disks that hold a terabyte of data, then flown to a laboratory for detailed analysis. Figure 13, a typical echogram with detected boundaries, illustrates image features (i.e., layers) found using image understanding tools with some human oversight. The upper (green) boundary is between the air and ice layers, while the lower (red) boundary is between the ice and terrain. This information is stored in a database front-ended by a geographical information system. The ice sheet bed depths are used in simulations of glacier flow. Each trip into the field, usually lasting a few weeks, results in 50 to 100 TB of data.

Figure 13: Typical Echogram with Detected Boundaries

Volume3Figure13.png
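
As a crude illustration of the layer-detection idea, the sketch below picks, for each trace of a simulated echogram, the strongest return as the air/ice boundary and the strongest return below it as the ice/terrain boundary. The actual CReSIS processing relies on far more sophisticated signal processing and image understanding with human oversight; the echogram here is synthetic.

    # Crude layer picking on a synthetic echogram: per-column strongest returns
    # approximate the air/ice and ice/terrain boundaries (all data simulated).
    import numpy as np

    rng = np.random.default_rng(0)
    depth_bins, traces = 400, 300
    echo = rng.normal(0, 0.1, (depth_bins, traces))

    surface = (80 + 10 * np.sin(np.linspace(0, 3, traces))).astype(int)   # air/ice
    bed = (300 + 25 * np.cos(np.linspace(0, 2, traces))).astype(int)      # ice/terrain
    for t in range(traces):
        echo[surface[t], t] += 3.0
        echo[bed[t], t] += 1.5

    picked_surface = np.argmax(echo, axis=0)
    picked_bed = np.array([
        s + 20 + np.argmax(echo[s + 20:, t]) for t, s in enumerate(picked_surface)
    ])

    print("mean surface-pick error (bins):", np.mean(np.abs(picked_surface - surface)))
    print("mean bed-pick error (bins):   ", np.mean(np.abs(picked_bed - bed)))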

Future

With improved instrumentation, an order of magnitude more data (a petabyte per mission) is projected. As the increasing field data must be processed in an environment with constrained power access, low-power or low-performance architectures, such as GPU systems, are indicated.

2.9.4 Use Case 44: Unmanned Air Vehicle Synthetic Aperture Radar (UAVSAR) Data Processing, Data Product Delivery, and Data Services

Submitted by Andrea Donnellan and Jay Parker, National Aeronautics and Space Administration (NASA) Jet Propulsion Laboratory

Application

Synthetic aperture radar (SAR) can identify landscape changes caused by seismic activity, landslides, deforestation, vegetation changes, and flooding. This function can be used to support earthquake science, as shown in Figure 14, as well as disaster management. Figure 14 shows the combined unwrapped coseismic interferograms for flight lines 26501, 26505, and 08508 for the October 2009–April 2010 time period. End points where slip can be seen on the Imperial, Superstition Hills, and Elmore Ranch faults are noted. GPS stations are marked by dots and are labeled. This use case supports the storage, image processing application, and visualization of geo-located data with angular specification.

Figure 14: Combined Unwrapped Coseismic Interferograms

Volume3Figure14.png

Current Approach

Data from planes and satellites are processed on NASA computers before being stored after substantial data communication. The data are made public upon processing. They require significant curation owing to instrumental glitches. The current data size is approximately 150 TB.

Future

The data size would increase dramatically if the Earth Radar Mission is launched. Clouds are suitable hosts but are not used in production today.

2.9.5 Use Case 45: NASA Langley Research Center/ Goddard Space Flight Center iRODS Federation Test Bed

Submitted by Brandi Quam, NASA Langley Research Center

Application

NASA Center for Climate Simulation and NASA Atmospheric Science Data Center have complementary data sets, each containing vast amounts of data that are not easily shared and queried. Climate researchers, weather forecasters, instrument teams, and other scientists need to access data from across multiple datasets in order to compare sensor measurements from various instruments, compare sensor measurements to model outputs, calibrate instruments, look for correlations across multiple parameters, and more.

Current Approach

Data are generated from two sources: the Modern Era Retrospective Analysis for Research and Applications (MERRA, described separately) and the NASA Clouds and Earth's Radiant Energy System (CERES) products, specifically the EBAF–TOA (Energy Balanced and Filled–Top of Atmosphere) product, which accounts for about 420 MB, and the EBAF–Surface product, which accounts for about 690 MB. Data volumes grow with each version update (about every six months). To analyze, visualize, and otherwise process data from heterogeneous datasets is currently a time-consuming effort. Scientists must separately access, search for, and download data from multiple servers, and often the data are duplicated without an understanding of the authoritative source. Often, accessing data takes longer than the scientific analysis. Current datasets are hosted on modest-sized (144 to 576 cores) Infiniband clusters.

Future

Improved access will be enabled through the use of iRODS. These systems support parallel downloads of datasets from selected replica servers, providing users with worldwide access to the geographically dispersed servers. iRODS operation will be enhanced with semantically organized metadata and managed via a highly precise NASA Earth Science ontology. Cloud solutions will also be explored.

2.9.6 Use Case 46: MERRA Analytic Services (MERRA/AS)

Submitted by John L. Schnase and Daniel Q. Duffy, NASA Goddard Space Flight Center

Application

This application produces global temporally and spatially consistent syntheses of 26 key climate variables by combining numerical simulations with observational data. Three-dimensional results are produced every six hours extending from 1979 to the present. The data support important applications such as IPCC research and the NASA/Department of Interior RECOVER wildfire decision support system; these applications typically involve integration of MERRA with other datasets. Figure 15 shows a typical MERRA/AS output.

Figure 15: Typical MERRA/AS Output

Volume3Figure15.png

Current Approach

MapReduce is used to process a current total of 480 TB. The current system is hosted on a 36-node Infiniband cluster.

Future

Clouds are being investigated. The data are growing by about 1 TB a month.

2.9.7 Use Case 47: Atmospheric Turbulence – Event Discovery and Predictive Analytics

Submitted by Michael Seablom, NASA headquarters

Application

Data mining is built on top of reanalysis products, including MERRA (described separately) and the North American Regional Reanalysis (NARR), a long-term, high-resolution climate data set for the North American domain. The analytics correlate aircraft reports of turbulence (either from pilot reports or from automated aircraft measurements of eddy dissipation rates) with recently completed atmospheric reanalyses. The information is of value to aviation industry and to weather forecasters. There are no standards for reanalysis products, complicating systems for which MapReduce is being investigated. The reanalysis data are hundreds of terabytes, slowly updated, whereas the turbulence dataset is smaller in size and implemented as a streaming service. Figure 16 shows a typical turbulent wave image.

Figure 16: Typical NASA Image of Turbulent Waves

Volume3Figure16.png

Current Approach

The current 200 TB dataset can be analyzed with MapReduce or the like using SciDB or another scientific database.

Future

The dataset will reach 500 TB in five years. The initial turbulence case can be extended to other ocean/atmosphere phenomena, but the analytics would be different in each case.

2.9.8 Use Case 48: Climate Studies Using the Community Earth System Model at the U.S. Department of Energy (DOE) NERSC Center

Submitted by Warren Washington, National Center for Atmospheric Research

Application

Simulations with the Community Earth System Model (CESM) can be used to understand and quantify contributions of natural and anthropogenic-induced patterns of climate variability and change in the 20th and 21st centuries. The results of supercomputer simulations across the world should be stored and compared.

Current Approach

The Earth System Grid (ESG) enables global access to climate science data on a massive scale—petascale, or even exascale—with multiple petabytes of data at dozens of federated sites worldwide. The ESG is recognized as the leading infrastructure for the management and access of large distributed data volumes for climate change research. It supports the Coupled Model Intercomparison Project (CMIP), whose protocols enable the periodic assessments carried out by the IPCC.

Future

Rapid growth of data is expected, with 30 PB produced at NERSC (assuming 15 end-to-end climate change experiments) in 2017 and many times more than this worldwide.

2.9.9 Use Case 49: DOE Biological and Environmental Research (BER) Subsurface Biogeochemistry Scientific Focus Area

Submitted by Deb Agarwal, LBNL

Application

A genome-enabled watershed simulation capability (GEWaSC) is needed to provide a predictive framework for understanding the following:

·         How genomic information stored in a subsurface microbiome affects biogeochemical watershed functioning.

·         How watershed-scale processes affect microbial functioning.

·         How these interactions co-evolve.

Current Approach

Current modeling capabilities can represent processes occurring over an impressive range of scales—from a single bacterial cell to that of a contaminant plume. Data cross all scales from genomics of the microbes in the soil to watershed hydro-biogeochemistry. Data are generated by the different research areas and include simulation data, field data (e.g., hydrological, geochemical, geophysical), ‘omics’ data, and observations from laboratory experiments.

Future

Little effort to date has been devoted to developing a framework for systematically connecting scales, as is needed to identify key controls and to simulate important feedbacks. GEWaSC will develop a simulation framework that formally scales from genomes to watersheds and will synthesize diverse and disparate field, laboratory, and simulation datasets across different semantic, spatial, and temporal scales.

2.9.10 Use Case 50: DOE BER AmeriFlux and FLUXNET Networks

Submitted by Deb Agarwal, LBNL

Application

AmeriFlux and Flux Tower Network (FLUXNET) are U.S. and world collections, respectively, of sensors that observe trace gas fluxes (e.g., CO2, water vapor) across a broad spectrum of times (e.g., hours, days, seasons, years, and decades) and space. Moreover, such datasets provide the crucial linkages among organisms, ecosystems, and process-scale studies—at climate-relevant scales of landscapes, regions, and continents—for incorporation into biogeochemical and climate models.

Current Approach

Software includes EddyPro, custom analysis software, R, Python, neural networks, and Matlab. There are approximately 150 towers in AmeriFlux and over 500 towers distributed globally collecting flux measurements.
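
As a small, hypothetical illustration of the kind of post-processing such software performs, the sketch below aggregates half-hourly CO2 flux (net ecosystem exchange, NEE) from a single tower to daily means in Python, treating -9999 as the missing-value marker (a common convention in these networks). The file and column names are assumptions.

    # Illustrative sketch only: aggregate half-hourly NEE from one flux tower
    # to daily means; -9999 marks missing values.
    import pandas as pd

    flux = pd.read_csv(
        "tower_halfhourly.csv",
        parse_dates=["TIMESTAMP"],
        index_col="TIMESTAMP",
        na_values=[-9999],
    )

    daily = flux["NEE"].resample("D").agg(["mean", "count"])
    # Keep only days with at least 40 of the 48 half-hourly values present
    daily_means = daily.loc[daily["count"] >= 40, "mean"]
    print(daily_means.head())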

Future

Field experiment data-taking would be improved by access to existing data and automated entry of new data via mobile devices. Interdisciplinary studies integrating diverse data sources will be expanded.

2.10 Energy

2.10.1 Use Case 51: Consumption Forecasting in Smart Grids

Submitted by Yogesh Simmhan, University of Southern California

Application

Smart meters support prediction of energy consumption for customers, transformers, substations, and the electrical grid service area. Advanced meters provide measurements every 15 minutes at the granularity of individual consumers within the service area of smart power utilities. Data to be combined include the head end of smart meters (distributed), utility databases (customer information, network topology; centralized), U.S. Census data (distributed), NOAA weather data (distributed), micro-grid building information systems (centralized), and micro-grid sensor networks (distributed). The central theme is real-time, data-driven analytics for time series from cyber-physical systems.

Current Approach

Forecasting uses GIS-based visualization. Data amount to around 4 TB per year for a city such as Los Angeles with 1.4 million sensors. The process uses R/Matlab, Weka, and Hadoop software. There are significant privacy issues requiring anonymization by aggregation. Real-time and historic data are combined with machine learning to predict consumption.
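
A minimal sketch of the forecasting step is shown below, predicting the next 15-minute reading for one meter from lagged readings and temperature. It uses scikit-learn purely for illustration in place of the R/Matlab, Weka, and Hadoop stack named above; the file name, column names, and feature choices are assumptions, and production systems would also aggregate across meters for privacy.

    # Illustrative sketch only: next-interval consumption forecast for one meter.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    data = pd.read_csv("meter_readings.csv", parse_dates=["time"]).sort_values("time")

    # Lag features: the three previous 15-minute readings plus outdoor temperature
    for lag in (1, 2, 3):
        data[f"kwh_lag{lag}"] = data["kwh"].shift(lag)
    data = data.dropna()

    features = ["kwh_lag1", "kwh_lag2", "kwh_lag3", "temperature"]
    train, test = data.iloc[:-96], data.iloc[-96:]   # hold out the last day (96 intervals)

    model = GradientBoostingRegressor().fit(train[features], train["kwh"])
    predictions = model.predict(test[features])
    print("mean absolute error (kWh):", abs(predictions - test["kwh"]).mean())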

Future

Advanced grid technologies will have widespread deployment. Smart grids will have new analytics integrating diverse data and supporting curtailment requests. New technologies will support mobile applications for client interactions.

3 Use Case Requirements

Requirements are the challenges limiting further use of Big Data. After collection, processing, and review of the use cases, requirements within seven characteristic categories were extracted from the individual use cases. These use case specific requirements were then aggregated to produce high-level, general requirements, within the seven characteristic categories, that are vendor neutral and technology agnostic. It is emphasized that neither the use case nor the requirements lists are exhaustive.

3.1 Use Case Specific Requirements

Each use case was evaluated for requirements within the following seven categories. These categories were derived from Subgroup discussions and motivated by components of the evolving reference architecture at the time. The process involved several Subgroup members extracting requirements and iterating back their suggestions for modifying the categories.

1.      Data source (e.g., data size, file formats, rate of growth, at rest or in motion)

2.      Data transformation (e.g., data fusion, analytics)

3.      Capabilities (e.g., software tools, platform tools, hardware resources such as storage and networking)

4.      Data consumer (e.g., processed results in text, table, visual, and other formats)

5.      Security and privacy

6.      Lifecycle management (e.g., curation, conversion, quality check, pre-analytic processing)

7.      Other requirements

Some use cases contained requirements in all seven categories while others only included requirements for a few categories. The complete list of specific requirements extracted from the use cases is presented in Appendix D. Section 2.1 of the NIST Big Data Interoperability Framework: Volume 6 Reference Architecture maps these seven categories to terms used in the reference architecture. The categories map in a one-to-one fashion but have slightly different terminology as the use case requirements analysis was performed before the reference architecture was finalized.
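
The extraction and aggregation step can be pictured as a simple grouping of tagged requirements, as in the hedged sketch below. The example entries are paraphrased placeholders, not quotations from Appendix D.

    # Illustrative sketch only: group use case specific requirements by the seven
    # categories so recurring needs can be generalized. Entries are placeholders.
    from collections import defaultdict

    specific_requirements = [
        ("Use Case 2",  "Data source",          "bursty arrival of record batches"),
        ("Use Case 48", "Data source",          "petabyte-scale distributed archives"),
        ("Use Case 51", "Data transformation",  "real-time analytics on sensor streams"),
        ("Use Case 2",  "Security and privacy", "restricted access to sensitive records"),
    ]

    by_category = defaultdict(list)
    for use_case, category, requirement in specific_requirements:
        by_category[category].append((use_case, requirement))

    for category, items in by_category.items():
        print(f"{category}: {len(items)} specific requirement(s)")
        for use_case, requirement in items:
            print(f"  - {use_case}: {requirement}")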

3.2 General Requirements

Aggregation of the use case-specific requirements allowed formation of more generalized requirements under the seven categories. These generalized requirements are listed below by category.

Data Source Requirements (DSR)

·         DSR-1: Needs to support reliable real-time, asynchronous, streaming, and batch processing to collect data from centralized, distributed, and cloud data sources, sensors, or instruments.

·         DSR-2: Needs to support slow, bursty, and high-throughput data transmission between data sources and computing clusters.

·         DSR-3: Needs to support diversified data content, including structured and unstructured text, document, graph, web, geospatial, compressed, timed, spatial, multimedia, simulation, and instrumental data.

Transformation Provider Requirements (TPR)

·         TPR-1: Needs to support diversified compute-intensive, analytic processing, and machine learning techniques.

·         TPR-2: Needs to support batch and real-time analytic processing.

·         TPR-3: Needs to support processing large diversified data content and modeling.

·         TPR-4: Needs to support processing data in motion (streaming, fetching new content, tracking, etc.).

Capability Provider Requirements (CPR)

·         CPR-1: Needs to support legacy and advanced software packages (software).

·         CPR-2: Needs to support legacy and advanced computing platforms (platform).

·         CPR-3: Needs to support legacy and advanced distributed computing clusters, co-processors, input output (I/O) processing (infrastructure).

·         CPR-4: Needs to support elastic data transmission (networking).

·         CPR-5: Needs to support legacy, large, and advanced distributed data storage (storage).

·         CPR-6: Needs to support legacy and advanced executable programming: applications, tools, utilities, and libraries (software).

Data Consumer Requirements (DCR)

·         DCR-1: Needs to support fast searches (~0.1 seconds) from processed data with high relevancy, accuracy, and recall.

·         DCR-2: Needs to support diversified output file formats for visualization, rendering, and reporting.

·         DCR-3: Needs to support visual layout for results presentation.

·         DCR-4: Needs to support rich user interface for access using browser, visualization tools.

·         DCR-5: Needs to support high-resolution, multi-dimension layer of data visualization.

·         DCR-6: Needs to support streaming results to clients.

Security and Privacy Requirements (SPR)

·         SPR-1: Needs to protect and preserve security and privacy of sensitive data.

·         SPR-2: Needs to support sandbox, access control, and multi-level, policy-driven authentication on protected data.

Lifecycle Management Requirements (LMR)

·         LMR-1: Needs to support data quality curation including pre-processing, data clustering, classification, reduction, and format transformation.

·         LMR-2: Needs to support dynamic updates on data, user profiles, and links.

·         LMR-3: Needs to support data lifecycle and long-term preservation policy, including data provenance.

·         LMR-4: Needs to support data validation.

·         LMR-5: Needs to support human annotation for data validation.

·         LMR-6: Needs to support prevention of data loss or corruption.

·         LMR-7: Needs to support multi-site archives.

·         LMR-8: Needs to support persistent identifier and data traceability.

·         LMR-9: Needs to support standardizing, aggregating, and normalizing data from disparate sources.

Other Requirements (OR)

·         OR-1: Needs to support rich user interface from mobile platforms to access processed results.

·         OR-2: Needs to support performance monitoring on analytic processing from mobile platforms.

·         OR-3: Needs to support rich visual content search and rendering from mobile platforms.

·         OR-4: Needs to support mobile device data acquisition.

·         OR-5: Needs to support security across mobile devices.

Appendix A: Use Case Study Source Materials

Appendix A contains one blank use case template and the original completed use cases. These use cases were the source material for the use case summaries presented in Section 2 and the use case requirements presented in Section 3 of this document. The completed use cases have not been edited and contain the original text as submitted by the author(s). The use cases are as follows:

NBD-PWG Use Case Studies Template

Use Case Title

 

Vertical (area)

 

Author/Company/Email

 

Actors/ Stakeholders and their roles and responsibilities

 

Goals

 

Use Case Description

 

Current

Solutions

Compute(System)

 

Storage

 

Networking

 

Software

 

Big Data
Characteristics

Data Source (distributed/centralized)

 

Volume (size)

 

Velocity

(e.g. real time)

 

Variety

(multiple datasets, mashup)

 

Variability (rate of change)

 

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

 

Visualization

 

Data Quality (syntax)

 

Data Types

 

Data Analytics

 

Big Data Specific Challenges (Gaps)

 

Big Data Specific Challenges in Mobility

 

Security and Privacy

Requirements

 

Highlight issues for generalizing this use case (e.g. for ref. architecture)

 

More Information (URLs)

 

Note: <additional comments>

Notes:    No proprietary or confidential information should be included

ADD picture of operation or data architecture of application below table.

Comments on fields

The following descriptions are provided to clarify the intent and meaning of the 26 template fields and to indicate ways they could be improved. A structured sketch of the template, for illustration, appears after the field descriptions.

·         Use Case Title: Title provided by the use case author.

·         Vertical (area): Intended to categorize the use cases. However, an ontology was not created prior to the use case submissions so this field was not used in the use case compilation.

·         Author/Company/Email: Name, company, and email (if provided) of the person(s) submitting the use case.

·         Actors/ Stakeholders and their roles and responsibilities: Describes the players and their roles in the use case.

·         Goals: Objectives of the use case.

·         Use Case Description: Brief description of the use case.

·         Current Solutions: Describes current approach to processing Big Data at the hardware and software infrastructure level.

o   Compute (System): Computing component of the data analysis system.

o   Storage: Storage component of the data analysis system.

o   Networking: Networking component of the data analysis system.

o   Software: Software component of the data analysis system.

·         Big Data Characteristics: Describes the properties of the (raw) data including the four major ‘V’s’ of Big Data described in NIST Big Data Interoperability Framework: Volume 1, Big Data Definition of this report series.

o   Data Source: The origin of data, which could be from instruments, Internet of Things, Web, Surveys, Commercial activity, or from simulations. The source(s) can be distributed, centralized, local, or remote.

o   Volume: The characteristic of data at rest that is most associated with Big Data. The size of data varied drastically between use cases, from terabytes to petabytes for science research (100 petabytes, for LHC data analysis, was the largest science use case) and up to exabytes in a commercial use case.

o   Velocity: Refers to the rate of flow at which the data is created, stored, analyzed, and visualized. For example, big velocity means that a large quantity of data is being processed in a short amount of time.

o   Variety: Refers to data from multiple repositories, domains, or types.

o   Variability: Refers to changes in rate and nature of data gathered by use case.

·         Big Data Science: Describes the high-level aspects of the data analysis process.

o   Veracity: Refers to the completeness and accuracy of the data with respect to semantic content. NIST Big Data Interoperability Framework: Volume 1, Big Data Definition discusses veracity in more detail.

o   Visualization: Refers to the way data is viewed by an analyst making decisions based on the data. Typically visualization is the final stage of a technical data analysis pipeline and follows the data analytics stage.

o   Data Quality: This refers to syntactical quality of data. In retrospect, this template field could have been included in the Veracity field.

o   Data Types: Refers to the style of data such as structured, unstructured, images (e.g., pixels), text (e.g., characters), gene sequences, and numerical.

o   Data Analytics: Defined in NIST Big Data Interoperability Framework: Volume 1, Big Data Definition as “the synthesis of knowledge from information”. In the context of these use cases, analytics refers broadly to the tools and algorithms used in processing the data at any stage, including the data-to-information, information-to-knowledge, and knowledge-to-wisdom stages.

·         Big Data Specific Challenges (Gaps): Allows for explanation of special difficulties for processing Big Data in the use case and gaps where new approaches/technologies are used.

·         Big Data Specific Challenges in Mobility: Refers to issues in accessing or generating Big Data from Smart Phones and tablets.

·         Security and Privacy Requirements: Allows for explanation of security and privacy issues or needs related to this use case.

·         Highlight issues for generalizing this use case: Allows for documentation of issues that could be common across multiple use cases and could lead to reference architecture constraints.

·         More Information (URLs): Resources that provide more information on the use case.

·         Note: <additional comments>: Includes pictures of the use case in action but was not otherwise used.
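
For illustration only, the sketch below represents the template as a machine-readable record so that submissions could be collected and checked programmatically; the field grouping follows the template above, and the validation helper is an assumption rather than part of the NBD-PWG process.

    # Illustrative sketch only: the use case template as a structured record.
    from dataclasses import dataclass, field

    @dataclass
    class UseCaseSubmission:
        title: str
        vertical: str
        author: str
        actors: str
        goals: str
        description: str
        current_solutions: dict = field(default_factory=dict)         # compute, storage, networking, software
        big_data_characteristics: dict = field(default_factory=dict)  # data source, volume, velocity, variety, variability
        big_data_science: dict = field(default_factory=dict)          # veracity, visualization, data quality, data types, analytics
        challenges: str = ""
        mobility_challenges: str = ""
        security_privacy: str = ""
        generalization_issues: str = ""
        more_information: str = ""
        notes: str = ""

        def missing_fields(self):
            """Names of free-text fields the submitter left empty."""
            return [name for name, value in vars(self).items()
                    if isinstance(value, str) and not value.strip()]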

Submitted Use Case Studies

Government Operation> Use Case 1: Big Data Archival: Census 2010 and 2000

Use Case Title

Big Data Archival: Census 2010 and 2000 (Title 13 Big Data)

Vertical (area)

Digital Archives

Author/Company/Email

Vivek Navale and Quyen Nguyen (NARA)

Actors/Stakeholders and their roles and responsibilities

NARA’s Archivists

Public users (after 75 years)

Goals

Preserve data for a long term in order to provide access and perform analytics after 75 years. Title 13 of the U.S. Code authorizes the Census Bureau and guarantees that individual and industry-specific data is protected.

Use Case Description

Maintain data “as-is”. No access and no data analytics for 75 years.

Preserve the data at the bit-level.

Perform curation, which includes format transformation if necessary.

Provide access and analytics after nearly 75 years.

Current

Solutions

Compute(System)

Linux servers

Storage

NetApps, Magnetic tapes.

Networking

 

Software

 

Big Data
Characteristics

Data Source (distributed/centralized)

Centralized storage.

Volume (size)

380 Terabytes.

Velocity

(e.g. real time)

Static.

Variety

(multiple datasets, mashup)

Scanned documents

Variability (rate of change)

None

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Cannot tolerate data loss.

Visualization

TBD

Data Quality

Unknown.

Data Types

Scanned documents

Data Analytics

Only after 75 years.

Big Data Specific Challenges (Gaps)

Preserve data for a long time scale.

Big Data Specific Challenges in Mobility

TBD

Security and Privacy

Requirements

Title 13 data.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

 

More Information (URLs)

 

 

Government Operation> Use Case 2: NARA Accession, Search, Retrieve, Preservation

Use Case Title

National Archives and Records Administration (NARA) Accession, Search, Retrieve, Preservation

Vertical (area)

Digital Archives

Author/Company/Email

Quyen Nguyen and Vivek Navale (NARA)

Actors/Stakeholders and their roles and responsibilities

Agencies’ Records Managers

NARA’s Records Accessioners

NARA’s Archivists

Public users

Goals

Accession, Search, Retrieval, and Long term Preservation of Big Data.

Use Case Description

1)       Get physical and legal custody of the data. In the future, if data reside in the cloud, physical custody should avoid transferring Big Data from Cloud to Cloud or from Cloud to Data Center.

2)       Pre-process data: virus scan, file format identification, removal of empty files

3)       Index

4)       Categorize records (sensitive, non-sensitive, privacy data, etc.)

5)       Transform old file formats to modern formats (e.g. WordPerfect to PDF)

6)       E-discovery

7)       Search and retrieve to respond to special request

8)       Search and retrieve of public records by public users

Current

Solutions

Compute(System)

Linux servers

Storage

NetApps, Hitachi, Magnetic tapes.

Networking

 

Software

Custom software, commercial search products, commercial databases.

Big Data
Characteristics

Data Source (distributed/centralized)

Distributed data sources from federal agencies.

Current solution requires transfer of those data to a centralized storage.

In the future, those data sources may reside in different Cloud environments.

Volume (size)

Hundreds of Terabytes, and growing.

Velocity

(e.g. real time)

Input rate is relatively low compared to other use cases, but the trend is bursty. That is, the data can arrive in batches of sizes ranging from GB to hundreds of TB.

Variety

(multiple datasets, mashup)

A variety of data types, unstructured and structured: textual documents, emails, photos, scanned documents, multimedia, social networks, web sites, databases, etc.

Variety of application domains, since records come from different agencies.

Data come from variety of repositories, some of which can be cloud-based in the future.

Variability (rate of change)

Rate can change, especially if input sources are variable: some have more audio and video, some more text, and others more images, etc.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Search results should have high relevancy and high recall.

Categorization of records should be highly accurate.

Visualization

TBD

Data Quality

Unknown.

Data Types

A variety of data types: textual documents, emails, photos, scanned documents, multimedia, databases, etc.

Data Analytics

Crawl/index; search; ranking; predictive search.

Data categorization (sensitive, confidential, etc.)

Personally Identifiable Information (PII) data detection and flagging.

Big Data Specific Challenges (Gaps)

Perform pre-processing and long-term management of large and varied data.

Search huge amounts of data.

Ensure high relevancy and recall.

Data sources may be distributed in different clouds in the future.

Big Data Specific Challenges in Mobility

Mobile search must have similar interfaces/results

Security and Privacy

Requirements

Need to be sensitive to data access restrictions.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

 

More Information (URLs)

 

 

 

Government Operation> Use Case 3: Statistical Survey Response Improvement

Use Case Title

Statistical Survey Response Improvement (Adaptive Design)

Vertical (area)

Government Statistical Logistics

Author/Company/Email

Cavan Capps / U.S. Census Bureau / cavan.paul.capps@census.gov

Actors/Stakeholders and their roles and responsibilities

U.S. statistical agencies are charged to be the leading authoritative sources about the nation’s people and economy, while honoring privacy and rigorously protecting confidentiality. This is done by working with states, local governments and other government agencies.

Goals

Using advanced methods that are open and scientifically objective, the statistical agencies endeavor to improve the quality, the specificity, and the timeliness of the statistics provided while reducing operational costs and maintaining the confidentiality of those measured.

Use Case Description

Survey costs are increasing as survey response declines. The goal of this work is to use advanced “recommendation system techniques” using data mashed up from several sources and historical survey para-data to drive operational processes in an effort to increase quality and reduce the cost of field surveys.

Current

Solutions

Compute(System)

Linux systems

Storage

SAN and Direct Storage

Networking

Fiber, 10 gigabit Ethernet, Infiniband 40 gigabit.

Software

Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig

Big Data
Characteristics

Data Source (distributed/centralized)

Survey data, other government administrative data, geographical positioning data from various sources.

Volume (size)

For this particular class of operational problem, approximately one petabyte.

Velocity

(e.g. real time)

Varies; paradata from the field is streamed continuously, and during the decennial census approximately 150 million records were transmitted.

Variety

(multiple datasets, mashup)

Data is typically defined strings and numerical fields. Data can be from multiple datasets mashed together for analytical use.

Variability (rate of change)

Varies depending on surveys in the field at a given time. High rate of velocity during a decennial census.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Data must have high veracity and systems must be very robust. The semantic integrity of conceptual metadata concerning what exactly is measured and the resulting limits of inference remain a challenge.

Visualization

Data visualization is useful for data review, operational activity and general analysis. It continues to evolve.

Data Quality (syntax)

Data quality should be high and statistically checked for accuracy and reliability throughout the collection process.

Data Types

Pre-defined ASCII strings and numerical data

Data Analytics

Analytics are required for recommendation systems, continued monitoring and general survey improvement.

Big Data Specific Challenges (Gaps)

Improving recommendation systems that reduce costs and improve quality while providing confidentiality safeguards that are reliable and publicly auditable.

Big Data Specific Challenges in Mobility

Mobile access is important.

Security and Privacy

Requirements

All data must be both confidential and secure. All processes must be auditable for security and confidentiality as required by various legal statutes.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Recommender systems have features in common with e-commerce systems such as Amazon, Netflix, UPS, etc.

More Information (URLs)

 

 

 

Government Operation> Use Case 4: Non Traditional Data in Statistical Survey

Use Case Title

Non Traditional Data in Statistical Survey Response Improvement (Adaptive Design)

Vertical (area)

Government Statistical Logistics

Author/Company/Email

Cavan Capps / U.S. Census Bureau / cavan.paul.capps@census.gov

Actors/Stakeholders and their roles and responsibilities

U.S. statistical agencies are charged to be the leading authoritative sources about the nation’s people and economy, while honoring privacy and rigorously protecting confidentiality. This is done by working with states, local governments and other government agencies.

Goals

Using advanced methods that are open and scientifically objective, the statistical agencies endeavor to improve the quality, the specificity, and the timeliness of the statistics provided while reducing operational costs and maintaining the confidentiality of those measured.

Use Case Description

Survey costs are increasing as survey response declines. This use case explores the potential of using non-traditional commercial and public data sources from the web, wireless communication, and electronic transactions, mashed up analytically with traditional surveys, to improve statistics for small-area geographies and new measures and to improve the timeliness of released statistics.

Current

Solutions

Compute(System)

Linux systems

Storage

SAN and Direct Storage

Networking

Fiber, 10 gigabit Ethernet, Infiniband 40 gigabit.

Software

Hadoop, Spark, Hive, R, SAS, Mahout, Allegrograph, MySQL, Oracle, Storm, BigMemory, Cassandra, Pig

Big Data
Characteristics

Data Source (distributed/centralized)

Survey data, other government administrative data, web-scraped data, wireless data, e-transaction data, potentially social media data, and positioning data from various sources.

Volume (size)

TBD

Velocity

(e.g. real time)

TBD

Variety

(multiple datasets, mashup)

Textual data as well as the traditionally defined strings and numerical fields. Data can be from multiple datasets mashed together for analytical use.

Variability (rate of change)

TBD.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Data must have high veracity and systems must be very robust. The semantic integrity of conceptual metadata concerning what exactly is measured and the resulting limits of inference remain a challenge.

Visualization

Data visualization is useful for data review, operational activity and general analysis. It continues to evolve.

Data Quality (syntax)

Data quality should be high and statistically checked for accuracy and reliability throughout the collection process.

Data Types

Textual data, pre-defined ASCII strings and numerical data

Data Analytics

Analytics are required to create reliable estimates using data from traditional survey sources, government administrative data sources and non-traditional sources from the digital economy.

Big Data Specific Challenges (Gaps)

Improving analytic and modeling systems that provide reliable and robust statistical estimates using data from multiple sources, that are scientifically transparent, and that provide confidentiality safeguards that are reliable and publicly auditable.

Big Data Specific Challenges in Mobility

Mobile access is important.

Security and Privacy

Requirements

All data must be both confidential and secure. All processes must be auditable for security and confidentiality as required by various legal statutes.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Statistical estimation that provides more detail, on a more nearly real-time basis, for less cost. The reliability of estimates from such “mashed up” sources still must be evaluated.

More Information (URLs)

 

 

 

Commercial> Use Case 5: Cloud Computing in Financial Industries

Use Case Title

This use case represents one approach to implementing a BD (Big Data) strategy, within a Cloud Eco-System, for FI (Financial Industries) transacting business within the United States.

Vertical (area)

The following lines of business (LOB) include:

Banking, including: Commercial, Retail, Credit Cards, Consumer Finance, Corporate Banking, Transaction Banking, Trade Finance, and Global Payments.

Securities and Investments, such as; Retail Brokerage, Private Banking/Wealth Management, Institutional Brokerages, Investment Banking, Trust Banking, Asset Management, Custody and Clearing Services

Insurance, including; Personal and Group Life, Personal and Group Property/Casualty, Fixed and Variable Annuities, and Other Investments

Please Note: Any Public/Private entity, providing financial services within the regulatory and jurisdictional risk and compliance purview of the United States, are required to satisfy a complex multilayer number of regulatory GRC/CIA (Governance, Risk and Compliance/Confidentiality, Integrity and Availability) requirements, as overseen by various jurisdictions and agencies, including; Fed., State, Local and cross-border.

Author/Company/Email

Pw Carey, Compliance Partners, LLC, pwc.pwcarey@email.com

Actors/Stakeholders and their roles and responsibilities

Regulatory and advisory organizations and agencies including the; SEC (Securities and Exchange Commission), FDIC (Federal Deposit Insurance Corporation), CFTC (Commodity Futures Trading Commission), US Treasury, PCAOB (Public Company Accounting and Oversight Board), COSO, CobiT, reporting supply chains and stakeholders, investment community, shareholders, pension funds, executive management, data custodians, and employees.

At each level of a financial services organization, an inter-related and inter-dependent mix of duties, obligations and responsibilities are in-place, which are directly responsible for the performance, preparation and transmittal of financial data, thereby satisfying both the regulatory GRC (Governance, Risk and Compliance) and CIA (Confidentiality, Integrity and Availability) of their organizations financial data. This same information is directly tied to the continuing reputation, trust and survivability of an organization's business.

Goals

The following represents one approach to developing a workable BD/FI strategy within the financial services industry. Prior to initiation and switch-over, an organization must perform the following baseline methodology for utilizing BD/FI within a Cloud Eco-system for both public and private financial entities offering financial services within the regulatory confines of the United States; Federal, State, Local and/or cross-border such as the UK, EU and China.

Each financial services organization must approach the following disciplines supporting their BD/FI initiative, with an understanding and appreciation for the impact each of the following four overlaying and inter-dependent forces will play in a workable implementation.

These four areas are:

  1. People (resources),
  2. Processes (time/cost/ROI),
  3. Technology (various operating systems, platforms and footprints) and
  4. Regulatory Governance (subject to various and multiple regulatory agencies).

In addition, these four areas must work through the process of being; identified, analyzed, evaluated, addressed, tested, and reviewed in preparation for attending to the following implementation phases:

  1. Project Initiation and Management Buy-in
  2. Risk Evaluations and Controls
  3. Business Impact Analysis
  4. Design, Development and Testing of the Business Continuity Strategies
  5. Emergency Response and Operations (aka; Disaster Recovery)
  6. Developing and Implementing Business Continuity Plans
  7. Awareness and Training Programs
  8. Maintaining and Exercising Business Continuity, (aka: Maintaining Regulatory Currency)

Please Note: Whenever appropriate, these eight areas should be tailored and modified to fit the requirements of each organizations unique and specific corporate culture and line of financial services.

Use Case Description

Big Data as developed by Google was intended to serve as an Internet Web site indexing tool to help them sort, shuffle, categorize and label the Internet. At the outset, it was not viewed as a replacement for legacy IT data infrastructures. With the spin-off development within OpenGroup and Hadoop, Big Data has evolved into a robust data analysis and storage tool that is still undergoing development. However, in the end, Big Data is still being developed as an adjunct to the current IT client/server/big iron data warehouse architectures; it is better at some things than these same data warehouse environments, but not at others.

Currently within FI, BD/Hadoop is used for fraud detection, risk analysis and assessments as well as improving the organizations knowledge and understanding of the customers via a strategy known as....'know your customer', pretty clever, eh?

However, this strategy must still follow a well-thought-out taxonomy that satisfies the entity's unique and individual requirements. One such strategy is the following formal methodology, which addresses two fundamental yet paramount questions: “What are we doing?” and “Why are we doing it?”:

1). Policy Statement/Project Charter (Goal of the Plan, Reasons and Resources....define each),

2). Business Impact Analysis (how does effort improve our business services),

3). Identify System-wide Policies, Procedures and Requirements,

4). Identify Best Practices for Implementation (including Change Management/‌Configuration Management) and/or Future Enhancements,

5). Plan B-Recovery Strategies (how and what will need to be recovered, if necessary),

6). Plan Development (Write the Plan and Implement the Plan Elements),

7). Plan buy-in and Testing (important everyone Knows the Plan, and Knows What to Do), and

8). Implement the Plan (then identify and fix gaps during first 3 months, 6 months, and annually after initial implementation)

9). Maintenance (Continuous monitoring and updates to reflect the current enterprise environment)

10). Lastly, System Retirement

Current

Solutions

Compute(System)

Currently, Big Data/Hadoop within a Cloud Eco-system within the FI is operating as part of a hybrid system, with BD being utilized as a useful tool for conducting risk and fraud analysis, in addition to assisting organizations in the process of 'know your customer'. These are the three areas where BD has proven to be good:

  1. detecting fraud,
  2. associated risks and a
  3. 'know your customer' strategy.

At the same time, the traditional client/server/data warehouse/RDBM (Relational Database Management) systems are used for the handling, processing, storage, and archival of the entity's financial data. Recently the SEC has approved the initiative requiring the FI to submit financial statements via XBRL (eXtensible Business Reporting Language), as of May 13th, 2013.

Storage

The same Federal, State, Local and cross-border legislative and regulatory requirements can impact any and all geographical locations, including; VMware, NetApps, Oracle, IBM, Brocade, et cetera.

Please Note: Based upon legislative and regulatory concerns, these storage solutions for FI data must ensure this same data conforms to US regulatory compliance for GRC/CIA, at this point in time.

For confirmation, please visit the following agencies web sites: SEC (Security and Exchange Commission), CFTC (Commodity Futures Trading Commission), FDIC (Federal Deposit Insurance Corporation), DOJ (Dept. of Justice), and my favorite the PCAOB (Public Company Accounting and Oversight Board).

Networking

Please Note: The same Federal, State, Local and cross-border legislative and regulatory requirements can impact any and all geographical locations of HW/SW, including but not limited to; WANs, LANs, MANs WiFi, fiber optics, Internet Access, via Public, Private, Community and Hybrid Cloud environments, with or without VPNs.

Based upon legislative and regulatory concerns, these networking solutions for FI data must ensure this same data conforms to US regulatory compliance for GRC/CIA, such as the US Treasury Dept., at this point in time.

For confirmation, please visit the following agencies web sites: SEC (Security and Exchange Commission), CFTC (Commodity Futures Trading Commission), FDIC (Federal Deposit Insurance Corporation), US Treasury Dept., DOJ (Dept. of Justice), and my favorite the PCAOB (Public Company Accounting and Oversight Board).

Software

Please Note: The same legislative and regulatory obligations impacting the geographical location of HW/SW, also restricts the location for; Hadoop, MapReduce, Open-source, and/or Vendor Proprietary such as AWS (Amazon Web Services), Google Cloud Services, and Microsoft

Based upon legislative and regulatory concerns, these software solutions incorporating both SOAP (Simple Object Access Protocol), for Web development and OLAP (Online Analytical Processing) software language for databases, specifically in this case for FI data, both must ensure this same data conforms to US regulatory compliance for GRC/CIA, at this point in time.

For confirmation, please visit the following agencies web sites: SEC (Security and Exchange Commission), CFTC (Commodity Futures Trading Commission), US Treasury, FDIC (Federal Deposit Insurance Corporation), DOJ (Dept. of Justice), and my favorite the PCAOB (Public Company Accounting and Oversight Board).

Big Data
Characteristics

Data Source (distributed/

centralized)

Please Note: The same legislative and regulatory obligations impacting the geographical location of HW/SW, also impacts the location for; both distributed/centralized data sources flowing into HA/DR Environment and HVSs (Hosted Virtual Servers), such as the following constructs: DC1---> VMWare/KVM (Clusters, w/Virtual Firewalls), Data link-Vmware Link-Vmotion Link-Network Link, Multiple PB of NAS (Network as A Service), DC2--->, VMWare/KVM (Clusters w/Virtual Firewalls), DataLink (Vmware Link, Vmotion Link, Network Link), Multiple PB of NAS (Network as A Service), (Requires Fail-Over Virtualization), among other considerations.

Based upon legislative and regulatory concerns, these data source solutions, either distributed and/or centralized for FI data, must ensure this same data conforms to US regulatory compliance for GRC/CIA, at this point in time.

For confirmation, please visit the following agencies web sites: SEC (Security and Exchange Commission), CFTC (Commodity Futures Trading Commission), US Treasury, FDIC (Federal Deposit Insurance Corporation), DOJ (Dept. of Justice), and my favorite the PCAOB (Public Company Accounting and Oversight Board).

Volume (size)

Tera-bytes up to Peta-bytes.

Please Note: This is a 'Floppy Free Zone'.

Velocity

(e.g. real time)

Velocity is more important for fraud detection, risk assessments and the 'know your customer' initiative within the BD FI.

Please Note: However, based upon legislative and regulatory concerns, velocity is not at issue regarding BD solutions for FI data, except for fraud detection, risk analysis and customer analysis.

Based upon legislative and regulatory restrictions, velocity is not at issue, rather the primary concern for FI data, is that it must satisfy all US regulatory compliance obligations for GRC/CIA, at this point in time.

Variety

(multiple data sets, mash-up)

Multiple virtual environments either operating within a batch processing architecture or a hot-swappable parallel architecture supporting fraud detection, risk assessments and customer service solutions.

Please Note: Based upon legislative and regulatory concerns, variety is not at issue regarding BD solutions for FI data within a Cloud Eco-system, except for fraud detection, risk analysis and customer analysis.

Based upon legislative and regulatory restrictions, variety is not at issue, rather the primary concern for FI data, is that it must satisfy all US regulatory compliance obligations for GRC/CIA, at this point in time.

Variability (rate of change)

Please Note: Based upon legislative and regulatory concerns, variability is not at issue regarding BD solutions for FI data within a Cloud Eco-system, except for fraud detection, risk analysis and customer analysis.

Based upon legislative and regulatory restrictions, variability is not at issue, rather the primary concern for FI data, is that it must satisfy all US regulatory compliance obligations for GRC/CIA, at this point in time.

Variability with BD FI within a Cloud Eco-System will depend upon the strength and completeness of the SLA agreements, the costs associated with CapEx, and the requirements of the business.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Please Note: Based upon legislative and regulatory concerns, veracity is not at issue regarding BD solutions for FI data within a Cloud Eco-system, except for fraud detection, risk analysis and customer analysis.

Based upon legislative and regulatory restrictions, veracity is not at issue, rather the primary concern for FI data, is that it must satisfy all US regulatory compliance obligations for GRC/CIA, at this point in time.

Within a Big Data Cloud Eco-System, data integrity is important over the entire life-cycle of the organization due to regulatory and compliance issues related to individual data privacy and security, in the areas of CIA (Confidentiality, Integrity and Availability) and GRC (Governance, Risk and Compliance) requirements.

Visualization

Please Note: Based upon legislative and regulatory concerns, visualization is not at issue regarding BD solutions for FI data, except for fraud detection, risk analysis and customer analysis, FI data is handled by traditional client/server/data warehouse big iron servers.

Based upon legislative and regulatory restrictions, visualization is not at issue, rather the primary concern for FI data, is that it must satisfy all US regulatory compliance obligations for GRC/CIA, at this point in time.

Data integrity within BD is critical and essential over the entire life-cycle of the organization due to regulatory and compliance issues related to CIA (Confidentiality, Integrity and Availability) and GRC (Governance, Risk and Compliance) requirements.

Data Quality

Please Note: Based upon legislative and regulatory concerns, data quality will always be an issue, regardless of the industry or platform.

Based upon legislative and regulatory restrictions, data quality is at the core of data integrity, and is the primary concern for FI data, in that it must satisfy all US regulatory compliance obligations for GRC/CIA, at this point in time.

For BD/FI data, data integrity is critical and essential over the entire life-cycle of the organization due to regulatory and compliance issues related to CIA (Confidentiality, Integrity and Availability) and GRC (Governance, Risk and Compliance) requirements.

Data Types

Please Note: Based upon legislative and regulatory concerns, data types is important in that it must have a degree of consistency and especially survivability during audits and digital forensic investigations where the data format deterioration can negatively impact both an audit and a forensic investigation when passed through multiple cycles.

For BD/FI data, multiple data types and formats, include but is not limited to; flat files, .txt, .pdf, android application files, .wav, .jpg and VOIP (Voice over IP)

Data Analytics

Please Note: Based upon legislative and regulatory concerns, data analytics is an issue regarding BD solutions for FI data, especially in regards to fraud detection, risk analysis and customer analysis.

However, data analytics for FI data is currently handled by traditional client/server/data warehouse big iron servers which must ensure they comply with and satisfy all United States GRC/CIA requirements, at this point in time.

For BD/FI data analytics must be maintained in a format that is non-destructive during search and analysis processing and procedures.

Big Data Specific Challenges (Gaps)

Currently, the areas of concern associated with BD/FI with a Cloud Eco-system, include the aggregating and storing of data (sensitive, toxic and otherwise) from multiple sources which can and does create administrative and management problems related to the following:

  • Access control
  • Management/Administration
  • Data entitlement and
  • Data ownership

However, based upon current analysis, these concerns and issues are widely known and are being addressed at this point in time, via the R&D (Research and Development) SDLC/HDLC (Software Development Life Cycle/Hardware Development Life Cycle) sausage makers of technology. Please stay tuned for future developments in this regard

Big Data Specific Challenges in Mobility

Mobility is a continuously growing layer of technical complexity; however, not all Big Data mobility solutions are technical in nature. There are two interrelated and co-dependent parties who are required to work together to find a workable and maintainable solution: the FI business side and IT. When both are in agreement, sharing a common lexicon, taxonomy, and appreciation and understanding of the requirements each is obligated to satisfy, these technical issues can be addressed.

Both sides in this collaborative effort will encounter the following current and on-going FI data considerations:

  • Inconsistent category assignments
  • Changes to classification systems over time
  • Use of multiple overlapping or
  • Different categorization schemes

In addition, each of these changing and evolving inconsistencies is required to satisfy the following data characteristics associated with ACID:

  • Atomic- All of the work in a transaction completes (commit) or none of it completes
  • Consistent- A transmittal transforms the database from one consistent state to another consistent state. Consistency is defined in terms of constraints.
  • Isolated- The results of any changes made during a transaction are not visible until the transaction has committed.
  • Durable- The results of a committed transaction survive failures.

When each of these data categories is satisfied, well, it's a glorious thing. Unfortunately, sometimes glory is not in the room, however, that does not mean we give up the effort to resolve these issues.

Security and Privacy

Requirements

No amount of security and privacy due diligence will make up for the innate deficiencies associated with human nature that creep into any program and/or strategy. Currently, the BD/FI must contend with a growing number of risk buckets, such as:

  • AML-Anti-money Laundering
  • CDD- Client Due Diligence
  • Watch-lists
  • FCPA – Foreign Corrupt Practices Act

...to name a few.

For a reality check, please consider Mr. Harry M. Markopolos's nine year effort to get the SEC among other agencies to do their job and shut down Mr. Bernard Madoff's billion dollar Ponzi scheme.

However, that aside, identifying and addressing the privacy/security requirements of the FI, providing services within a BD/Cloud Eco-system, via continuous improvements in:

  1. technology,
  2. processes,
  3. procedures,
  4. people and
  5. regulatory jurisdictions

...is a far better choice for both the individual and the organization, especially when considering the alternative.

Utilizing a layered approach, this strategy can be broken down into the following sub categories:

  1. Maintaining operational resilience
  2. Protecting valuable assets
  3. Controlling system accounts
  4. Managing security services effectively, and
  5. Maintaining operational resilience

For additional background security and privacy solutions addressing both security and privacy, we'll refer you to the two following organization's:

  • ISACA (Information Systems Audit and Control Association)
  • isc2 (International Information Systems Security Certification Consortium)

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Areas of concern include how the aggregating and storing of data from multiple sources can create problems related to the following:

  • Access control
  • Management/Administration
  • Data entitlement and
  • Data ownership

Each of these areas is being improved upon, yet they still must be considered and addressed, via access control solutions, and SIEM (Security Incident/Event Management) tools.

I don't believe we're there yet, based upon current security concerns mentioned whenever Big Data/Hadoop within a Cloud Eco-system is brought up in polite conversation.

Current and on-going challenges to implementing BD Finance within a Cloud Eco-system, as well as traditional client/server data warehouse architectures, include the following areas of Financial Accounting under both US GAAP (Generally Accepted Accounting Practices) and IFRS (…..):

XBRL (eXtensible Business Reporting Language)

Consistency (terminology, formatting, technologies, regulatory gaps)

SEC-mandated use of XBRL (eXtensible Business Reporting Language) for regulatory financial reporting.

SEC, GAAP/IFRS and the yet to be fully resolved new financial legislation impacting reporting requirements are changing and point to trying to improve the implementation, testing, training, reporting and communication best practices required of an independent auditor, regarding:

Auditing, Auditor's reports, Control self-assessments, Financial audits, GAAS / ISAs, Internal audits, and the Sarbanes–Oxley Act of 2002 (SOX).

More Information (URLs)

  1. Cloud Security Alliance Big Data Working Group, “Top 10 Challenges in Big Data Security and Privacy”, 2012.
  2. The IFRS, Securities and Markets Working Group, http://www.xbrl-eu.org
  3. IEEE Big Data conference http://www.ischool.drexel.edu/bigdata/bigdata2013/topics.htm
  4. MapReduce http://www.mapreduce.org.
  5. PCAOB http://www.pcaob.org
  6. http://www.ey.com/GL/en/Industries/Financial-Services/Insurance
  7. http://www.treasury.gov/resource-center/fin-mkts/Pages/default.aspx
  8. CFTC http://www.cftc.org
  9. SEC http://www.sec.gov
  10. FDIC http://www.fdic.gov
  11. COSO http://www.coso.org
  12. isc2 International Information Systems Security Certification Consortium, Inc.: http://www.isc2.org
  13. ISACA Information Systems Audit and Control Association: http://www.isca.org
  14. IFARS http://www.ifars.org
  15. Apache http://www.opengroup.org
  16. http://www.computerworld.com/s/article/print/9221652/IT_must_prepare_for_Hadoop_security_issues?tax ...
  17. "No One Would Listen: A True Financial Thriller" (hard-cover book). Hoboken, NJ: John Wiley & Sons. March 2010. Retrieved April 30, 2010. ISBN 978-0-470-55373-2
  18. Assessing the Madoff Ponzi Scheme and Regulatory Failures (Archive of: Subcommittee on Capital Markets, Insurance, and Government Sponsored Enterprises Hearing) (http://financialserv.edgeboss.net/wmedia/‌financialserv/hearing020409.wvx) (Windows Media). U.S. House Financial Services Committee. February 4, 2009. Retrieved June 29, 2009.
  19. COSO, The Committee of Sponsoring Organizations of the Treadway Commission (COSO), Copyright© 2013, http://www.coso.org.
  20. (ITIL) Information Technology Infrastructure Library, Copyright© 2007-13 APM Group Ltd. All rights reserved, Registered in England No. 2861902, http://www.itil-officialsite.com.
  21. CobiT, Ver. 5.0, 2013, ISACA, Information Systems Audit and Control Association, (a framework for IT Governance and Controls), http://www.isaca.org.
  22. TOGAF, Ver. 9.1, The Open Group Architecture Framework (a framework for IT architecture), http://www.opengroup.org.
  23. ISO/IEC 27000:2012 Info. Security Mgt., International Organization for Standardization and the International Electrotechnical Commission, http://www.standards.iso.org/

Note: Please feel free to improve our INITIAL DRAFT, Ver. 0.1, August 25th, 2013....as we do not consider our efforts to be pearls, at this point in time......Respectfully yours, Pw Carey, Compliance Partners, LLC  pwc.pwcarey@gmail.com

 

 

Commercial> Use Case 6: Mendeley – An International Network of Research

Use Case Title

Mendeley – An International Network of Research

Vertical (area)

Commercial Cloud Consumer Services

Author/Company/Email

William Gunn / Mendeley / william.gunn@mendeley.com

Actors/Stakeholders and their roles and responsibilities

Researchers, librarians, publishers, and funding organizations.

Goals

To promote more rapid advancement in scientific research by enabling researchers to efficiently collaborate, librarians to understand researcher needs, publishers to distribute research findings more quickly and broadly, and funding organizations to better understand the impact of the projects they fund.

Use Case Description

Mendeley has built a database of research documents and facilitates the creation of shared bibliographies. Mendeley uses the information collected about research reading patterns and other activities conducted via the software to build more efficient literature discovery and analysis tools. Text mining and classification systems enable automatic recommendation of relevant research, improving the cost and performance of research teams, particularly those engaged in curation of literature on a particular subject, such as the Mouse Genome Informatics group at Jackson Labs, which has a large team of manual curators who scan the literature. Other use cases include enabling publishers to more rapidly disseminate publications, facilitating research institutions and librarians with data management plan compliance, and enabling funders to better understand the impact of the work they fund via real-time data on the access and use of funded research.

Current

Solutions

Compute(System)

Amazon EC2

Storage

HDFS, Amazon S3

Networking

Client-server connections between Mendeley and end user machines, connections between Mendeley offices and Amazon services.

Software

Hadoop, Scribe, Hive, Mahout, Python

Big Data
Characteristics

Data Source (distributed/centralized)

Distributed and centralized

Volume (size)

15TB presently, growing about 1 TB/month

Velocity

(e.g. real time)

Currently Hadoop batch jobs are scheduled daily, but work has begun on real-time recommendation

Variety

(multiple datasets, mashup)

PDF documents and log files of social network and client activities

Variability (rate of change)

Currently a high rate of growth as more researchers sign up for the service, highly fluctuating activity over the course of the year

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Metadata extraction from PDFs is variable, it’s challenging to identify duplicates, there’s no universal identifier system for documents or authors (though ORCID proposes to be this)

Visualization

Network visualization via Gephi, scatterplots of readership vs. citation rate, etc.

Data Quality

90% correct metadata extraction according to comparison with Crossref, Pubmed, and Arxiv

Data Types

Mostly PDFs, some image, spreadsheet, and presentation files

Data Analytics

Standard libraries for machine learning and analytics, LDA, custom built reporting tools for aggregating readership and social activities per document

Big Data Specific Challenges (Gaps)

The database contains ~400M documents, roughly 80M unique documents, and receives 5-700k new uploads on a weekday. Thus a major challenge is clustering matching documents together in a computationally efficient way (scalable and parallelized) when they’re uploaded from different sources and have been slightly modified via third-party annotation tools or publisher watermarks and cover pages.

Big Data Specific Challenges in Mobility

Delivering content and services to various computing platforms from Windows desktops to Android and iOS mobile devices

Security and Privacy

Requirements

Researchers often want to keep what they’re reading private, especially industry researchers, so the data about who’s reading what has access controls.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

This use case could be generalized to providing content-based recommendations to various scenarios of information consumption

More Information (URLs)

http://mendeley.com http://dev.mendeley.com

 

 

Commercial> Use Case 7: Netflix Movie Service

Use Case Title

Netflix Movie Service

Vertical (area)

Commercial Cloud Consumer Services

Author/Company/Email

Geoffrey Fox, Indiana University gcf@indiana.edu

Actors/Stakeholders and their roles and responsibilities

Netflix Company (Grow sustainable Business), Cloud Provider (Support streaming and data analysis), Client user (Identify and watch good movies on demand)

Goals

Allow streaming of user selected movies to satisfy multiple objectives (for different stakeholders) -- especially retaining subscribers. Find best possible ordering of a set of videos for a user (household) within a given context in real time; maximize movie consumption.

Use Case Description

Digital movies stored in cloud with metadata; user profiles and rankings for small fraction of movies for each user. Use multiple criteria – content based recommender system; user-based recommender system; diversity. Refine algorithms continuously with A/B testing.

Current

Solutions

Compute(System)

Amazon Web Services AWS

Storage

Uses Cassandra NoSQL technology with Hive, Teradata

Networking

Need Content Delivery System to support effective streaming video

Software

Hadoop and Pig; Cassandra; Teradata

Big Data
Characteristics

Data Source (distributed/centralized)

Add movies institutionally. Collect user rankings and profiles in a distributed fashion

Volume (size)

Summer 2012. 25 million subscribers; 4 million ratings per day; 3 million searches per day; 1 billion hours streamed in June 2012. Cloud storage 2 petabytes (June 2013)

Velocity

(e.g. real time)

Media (video and properties) and Rankings continually updated

Variety

(multiple datasets, mashup)

Data varies from digital media to user rankings, user profiles and media properties for content-based recommendations

Variability (rate of change)

Very competitive business. Need to be aware of other companies and of trends in both content (which movies are hot) and technology. Need to investigate new business initiatives such as Netflix-sponsored content.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Success of business requires excellent quality of service

Visualization

Streaming media and quality user-experience to allow choice of content

Data Quality

Rankings are intrinsically “rough” data and need robust learning algorithms

Data Types

Media content, user profiles, “bag” of user rankings

Data Analytics

Recommender systems and streaming video delivery. Recommender systems are always personalized and use logistic/linear regression, elastic nets, matrix factorization, clustering, latent Dirichlet allocation, association rules, gradient boosted decision trees, and others. The winner of the Netflix Prize competition (to improve rating predictions by 10%) combined over 100 different algorithms.
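As a toy illustration of the matrix factorization technique named above, the following Python sketch fits user and item factors by stochastic gradient descent on a handful of invented (user, movie, rating) triples; the dimensions, learning rate, and data are assumptions for illustration and bear no relation to Netflix's production systems.

# Toy matrix factorization by SGD over invented (user, item, rating) triples.
import random

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]
n_users, n_items, k = 3, 3, 2          # latent dimension k is an assumption
lr, reg, steps = 0.05, 0.02, 5000      # learning rate and regularization are assumptions

random.seed(0)
U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]

for _ in range(steps):
    u, i, r = random.choice(ratings)
    err = r - sum(U[u][f] * V[i][f] for f in range(k))
    for f in range(k):
        # Tuple assignment so both factors are updated from the pre-step values.
        U[u][f], V[i][f] = (U[u][f] + lr * (err * V[i][f] - reg * U[u][f]),
                            V[i][f] + lr * (err * U[u][f] - reg * V[i][f]))

# Score an unseen (user, movie) pair, e.g. user 0 and movie 2.
print(sum(U[0][f] * V[2][f] for f in range(k)))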

Big Data Specific Challenges (Gaps)

Analytics needs continued monitoring and improvement.

Big Data Specific Challenges in Mobility

Mobile access important

Security and Privacy

Requirements

Need to preserve privacy for users and digital rights for media.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Recommender systems have features in common with e-commerce sites like Amazon. Streaming video has features in common with other content-providing services like iTunes, Google Play, Pandora, and Last.fm.

More Information (URLs)

http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial by Xavier Amatriain

http://techblog.netflix.com/

 

 

Commercial> Use Case 8: Web Search

Use Case Title

Web Search (Bing, Google, Yahoo...)

Vertical (area)

Commercial Cloud Consumer Services

Author/Company/Email

Geoffrey Fox, Indiana University gcf@indiana.edu

Actors/Stakeholders and their roles and responsibilities

Owners of web information being searched; search engine companies; advertisers; users

Goals

Return, in ~0.1 seconds, the results of a search based on an average of 3 words; it is important to maximize "precision@10", the number of highly relevant responses in the top 10 ranked results.
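For clarity, precision@10 is simply the fraction of the top 10 ranked results that are judged relevant; a minimal Python illustration with made-up relevance judgments:

# precision@k: fraction of the top-k results that are relevant (k = 10 in the goal above).
def precision_at_k(ranked_results, relevant, k=10):
    return sum(1 for doc in ranked_results[:k] if doc in relevant) / k

ranked = [f"doc{i}" for i in range(1, 21)]            # hypothetical ranking
relevant = {"doc1", "doc3", "doc4", "doc9", "doc15"}  # hypothetical judgments
print(precision_at_k(ranked, relevant))               # 0.4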

Use Case Description

1) Crawl the web; 2) Pre-process data to get searchable things (words, positions); 3) Form an inverted index mapping words to documents; 4) Rank relevance of documents (PageRank); 5) Apply lots of technology for advertising, "reverse engineering of ranking", and "preventing reverse engineering"; 6) Cluster documents into topics (as in Google News); 7) Update results efficiently.
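A compact Python sketch of two of the numbered steps, building an inverted index (step 3) and ranking pages with a power-iteration form of PageRank (step 4); the tiny corpus and link graph are invented, and production systems shard both structures across many thousands of machines.

# Sketch of step 3 (inverted index) and step 4 (PageRank) on a made-up corpus.
from collections import defaultdict

docs = {
    "d1": "big data framework for web search",
    "d2": "web search ranking with pagerank",
    "d3": "big data analytics",
}
links = {"d1": ["d2"], "d2": ["d1", "d3"], "d3": ["d1"]}  # hypothetical link graph

# Step 3: inverted index mapping each word to (doc, position) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index[word].append((doc_id, pos))
print(index["search"])   # [('d1', 5), ('d2', 1)]

# Step 4: PageRank by power iteration with damping factor 0.85.
damping, rank = 0.85, {d: 1 / len(docs) for d in docs}
for _ in range(50):
    new_rank = {d: (1 - damping) / len(docs) for d in docs}
    for src, outs in links.items():
        for dst in outs:
            new_rank[dst] += damping * rank[src] / len(outs)
    rank = new_rank
print(rank)   # scores sum to 1; higher means more "authoritative"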

Current

Solutions

Compute(System)

Large Clouds

Storage

Inverted Index not huge; crawled documents are petabytes of text – rich media much more

Networking

Need excellent external network links; most operations pleasingly parallel and I/O sensitive. High performance internal network not needed

Software

MapReduce + Bigtable; Dryad + Cosmos. PageRank. Final step essentially a recommender engine

Big Data
Characteristics

Data Source (distributed/centralized)

Distributed web sites

Volume (size)

45B web pages total, 500M photos uploaded each day, 100 hours of video uploaded to YouTube each minute

Velocity

(e.g. real time)

Data continually updated

Variety

(multiple datasets, mashup)

Rich set of functions. After processing, data similar for each page (except for media types)

Variability (rate of change)

Average page has life of a few months

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Exact results not essential but important to get main hubs and authorities for search query

Visualization

Not important although page layout critical

Data Quality

A lot of duplication and spam

Data Types

Mainly text but more interest in rapidly growing image and video

Data Analytics

Crawling; searching including topic based search; ranking; recommending

Big Data Specific Challenges (Gaps)

Search of “deep web” (information behind query front ends)

Ranking of responses sensitive to intrinsic value (as in Pagerank) as well as advertising value

Link to user profiles and social network data

Big Data Specific Challenges in Mobility

Mobile search must have similar interfaces/results

Security and Privacy

Requirements

Need to be sensitive to crawling restrictions. Avoid Spam results

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Relation to Information retrieval such as search of scholarly works.

More Information (URLs)

http://www.slideshare.net/kleinerperkins/kpcb-internet-trends-2013

http://webcourse.cs.technion.ac.il/236621/Winter2011-2012/en/ho_Lectures.html

http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws

http://www.slideshare.net/beechung/recommender-systems-tutorialpart1intro

http://www.worldwidewebsize.com/

 

 

Commercial> Use Case 9: Cloud-based Continuity and Disaster Recovery

Use Case Title

IaaS (Infrastructure as a Service) Big Data Business Continuity and Disaster Recovery (BC/DR) Within A Cloud Eco-System provided by Cloud Service Providers (CSPs) and Cloud Brokerage Service Providers (CBSPs)

Vertical (area)

Large Scale Reliable Data Storage

Author/Company/Email

Pw Carey, Compliance Partners, LLC, pwc.pwcarey@email.com

Actors/Stakeholders and their roles and responsibilities

Executive Management, Data Custodians, and Employees responsible for the integrity, protection, privacy, confidentiality, availability, safety, security, and survivability of a business by ensuring that the 3-As of data accessibility to an organization's services are satisfied: anytime, anyplace, and on any device.

Goals

The following represents one approach to developing a workable BC/DR strategy. Prior to outsourcing an organization's BC/DR onto the backs/shoulders of a CSP or CBSP, the organization must perform the following Use Case, which will provide each organization with a baseline methodology for business continuity and disaster recovery (BC/DR) best practices within a Cloud Eco-system, for both Public and Private organizations.

Each organization must approach the ten disciplines supporting BC/DR (Business Continuity/Disaster Recovery) with an understanding of, and appreciation for, the impact each of the following four overlapping and inter-dependent forces will have in ensuring a workable solution to an entity's business continuity plan and requisite disaster recovery strategy. The four areas are: people (resources), processes (time/cost/ROI), technology (various operating systems, platforms, and footprints), and governance (subject to various and multiple regulatory agencies).

These four concerns must be identified, analyzed, evaluated, tested, reviewed, and addressed during the following ten phases:

  1. Project Initiation and Management Buy-in
  2. Risk Evaluations and Controls
  3. Business Impact Analysis
  4. Design, Development and Testing of the Business Continuity Strategies
  5. Emergency Response and Operations (aka Disaster Recovery)
  6. Developing and Implementing Business Continuity Plans
  7. Awareness and Training Programs
  8. Maintaining and Exercising Business Continuity Plans, (aka: Maintaining Currency)
  9. Public Relations (PR) and Crises Management Plans
  10. Coordination with Public Agencies

Please Note: When appropriate, these ten areas can be tailored to fit the requirements of the organization.

Use Case Description

Big Data as developed by Google was intended to serve as an Internet Web site indexing tool to help the company sort, shuffle, categorize, and label the Internet. At the outset, it was not viewed as a replacement for legacy IT data infrastructures. With spin-off development within the open source community and Hadoop, Big Data has evolved into a robust data analysis and storage tool that is still undergoing development. However, in the end, Big Data is still being developed as an adjunct to current IT client/server/big-iron data warehouse architectures; it is better at some things than these same data warehouse environments, but not at others.

As a result, within this business continuity/disaster recovery use case, it is necessary to ask good questions, such as: why are we doing this and what are we trying to accomplish? What are our dependencies upon manual practices, and when can we leverage them? What systems have been and remain outsourced to other organizations (such as our telephony), and what are their DR/BC business functions, if any? Lastly, we must recognize which functions can be simplified and which low-cost preventative steps we can take, such as simplifying business practices.

We must identify the critical business functions that need to be recovered first, second, third in priority, or at a later time/date, define the model of the disaster we are trying to resolve, and determine which types of disasters are most likely to occur, recognizing that we do not need to resolve all types of disasters. Backing up data within a Cloud Eco-system is a good solution when it shortens the fail-over time and satisfies the requirements of RTO/RPO (Recovery Time Objectives and Recovery Point Objectives). In addition, there must be 'buy-in', as this is not just an IT problem; it is a business services problem as well, requiring the testing of the Disaster Plan via formal walk-throughs, et cetera. There should be a formal methodology for developing a BC/DR Plan, including:

  1. Policy Statement (goal of the plan, reasons, and resources; define each)
  2. Business Impact Analysis (how does a shutdown impact the business financially and otherwise?)
  3. Identify Preventive Steps (can a disaster be avoided by taking prudent steps?)
  4. Recovery Strategies (how and what you will need to recover)
  5. Plan Development (write the plan and implement the plan elements)
  6. Plan Buy-in and Testing (very important, so that everyone knows the plan and what to do during its execution)
  7. Maintenance (continuous changes to reflect the current enterprise environment)

Current

Solutions

Compute(System)

Cloud Eco-systems incorporating IaaS (Infrastructure as a Service), supported by Tier 3 Data Centers (secure and fault tolerant for security, power, air conditioning, et cetera) and geographically off-site data recovery centers providing data replication services. Note: replication is different from backup; replication moves only the changes since the last replication, including block-level changes. Replication can be done quickly, within a five-second window, while the data is replicated every four hours. This data snapshot is retained for seven business days, or longer if necessary. Replicated data can be moved to a Fail-over Center to satisfy the organization's RPO (Recovery Point Objectives) and RTO (Recovery Time Objectives).

 

Storage

VMware, NetApp, Oracle, IBM, Brocade

 

Networking

WANs, LANs, WiFi, Internet Access, via Public, Private, Community and Hybrid Cloud environments, with or without VPNs.

 

Software

Hadoop, MapReduce, Open-source, and/or Vendor Proprietary such as AWS (Amazon Web Services), Google Cloud Services, and Microsoft

 

Big Data
Characteristics

Data Source (distributed/centralized)

Both distributed and centralized data sources flowing into an HA/DR environment and HVSs (Hosted Virtual Servers), such as the following: DC1 ---> VMware/KVM (clusters with virtual firewalls), data links (VMware link, vMotion link, network link), multiple PB of NAS (Network-Attached Storage); DC2 ---> VMware/KVM (clusters with virtual firewalls), data links (VMware link, vMotion link, network link), multiple PB of NAS. (Requires fail-over virtualization.)

 

Volume (size)

Terabytes up to Petabytes

 

Velocity

(e.g. real time)

Tier 3 Data Centers that are secure and fault tolerant for security, power, and air conditioning; IaaS (Infrastructure as a Service) in this example is based upon NetApp. Replication is different from backup: replication moves only the changes since the last replication was performed, including block-level changes. Replication can be done quickly because the data is replicated every four hours; each replication can be performed within a 5-second window, and the resulting snapshot is kept for 7 business days, or longer if necessary, at a Fail-Over Center to satisfy the RPO and RTO.
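A small sanity-check sketch of the arithmetic implied above: with replication every four hours, the worst-case data-loss window (and therefore the achievable RPO) equals the replication interval, so any business RPO tighter than that interval cannot be met by this schedule. The Python helper below is hypothetical and not part of any vendor tool; the intervals are taken from the description.

# Hypothetical RPO/retention sanity check for the replication schedule described above.
from datetime import timedelta

replication_interval = timedelta(hours=4)   # "replicated every four hours"
snapshot_retention = timedelta(days=7)      # "kept for 7 business days"

def meets_rpo(business_rpo):
    # Worst case, a failure happens just before the next replication run,
    # so the achievable RPO equals the replication interval.
    return replication_interval <= business_rpo

def restore_point_available(age_of_needed_data):
    # A restore is only possible while the snapshot is still retained.
    return age_of_needed_data <= snapshot_retention

print(meets_rpo(timedelta(hours=8)))                 # True: an 8-hour RPO is satisfied
print(meets_rpo(timedelta(hours=1)))                 # False: needs more frequent replication
print(restore_point_available(timedelta(days=10)))   # False: beyond the 7-day retention window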

 

Variety

(multiple data sets, mash-up)

Multiple virtual environments either operating within a batch processing architecture or a hot-swappable parallel architecture.

 

Variability (rate of change)

Depending upon the SLA, costs (CapEx) increase based upon the RTO/RPO and the requirements of the business.

 

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Data integrity is critical and essential over the entire life-cycle of the organization due to regulatory and compliance issues related to data CIA (Confidentiality, Integrity and Availability) and GRC (Governance, Risk and Compliance) data requirements.

 

Visualization

Same as Veracity above: the data CIA (Confidentiality, Integrity and Availability) and GRC (Governance, Risk and Compliance) requirements apply over the entire life-cycle of the organization.

 

Data Quality

Same as Veracity above: the data CIA (Confidentiality, Integrity and Availability) and GRC (Governance, Risk and Compliance) requirements apply over the entire life-cycle of the organization.

 

Data Types

Multiple data types and formats, including but not limited to; flat files, .txt, .pdf, android application files, .wav, .jpg and VOIP (Voice over IP)

 

Data Analytics

Must be maintained in a format that is non-destructive during search and analysis processing and procedures.

 

Big Data Specific Challenges (Gaps)

Migration from a Primary Site to either a Replication Site or a Backup Site is not fully automated at this point in time. The goal is to enable the user to automatically initiate the fail-over sequence; however, moving data hosted within the cloud requires well-defined and continuously monitored server configuration management. In addition, both organizations must know which servers have to be restored and what the dependencies and inter-dependencies are between the Primary Site servers and the Replication and/or Backup Site servers. This requires continuous monitoring of both, since there are two solutions involved in this process: servers housing stored images, or servers running hot all the time (parallel systems with hot-swappable functionality). All of this requires accurate and up-to-date information from the client.

Big Data Specific Challenges in Mobility

Mobility is a continuously growing layer of technical complexity; however, not all DR/BC solutions are technical in nature, as two sides are required to work together to find a solution: the business side and the IT side. When they are in agreement, the technical issues must be addressed by the BC/DR strategy implemented and maintained by the entire organization. One area, not limited to mobility challenges, concerns a fundamental issue impacting most BC/DR solutions: if your Primary Servers (A, B, C) understand X, Y, Z, but your Secondary Virtual Replication/Backup Servers (a, b, c) are not properly maintained over time (configuration management), become out of sync with your Primary Servers, and only understand X and Y when called upon to perform a replication or backup, well, "Houston, we have a problem...."

Please Note: Over time all systems can and will suffer from sync-creep, some more than others, when relying upon manual processes to ensure system stability.

Security and Privacy

Requirements

Dependent upon the nature and requirements of the organization's industry verticals (such as Finance, Insurance, and Life Sciences, including both public and private entities) and the restrictions placed upon them by regulatory, compliance, and legal jurisdictions.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Challenges to implement BC/DR include the following:

  1. Recognition:
     a. Management vision.
     b. Assuming the issue is an IT issue, when it is not just an IT issue.
  2. People:
     a. Staffing levels: many SMBs are understaffed in IT for their current workload.
     b. Vision (driven from the top down): can the business and IT resources see the whole problem and craft a strategy, such as a 'call list', in case of a disaster?
     c. Skills: are there resources that can architect, implement, and test a BC/DR solution?
     d. Time: do resources have the time, and does the business have the window of time, for constructing and testing a DR/BC solution? Because DR/BC is an additional add-on project, the organization needs the time and resources.
  3. Money: this can be turned into an OpEx solution rather than a CapEx solution, which can be controlled by varying RPO/RTO.
     a. Capital is always a constrained resource.
     b. BC solutions need to start with "what is the risk?" and "how does cost constrain the solution?"
  4. Disruption: build BC/DR into the standard "Cloud" infrastructure (IaaS) of the SMB.
     a. Planning for BC/DR is disruptive to business resources.
     b. Testing BC is also disruptive.

More Information (URLs)

  1. http://www.disasterrecovery.org/, (March, 2013).
  2. BC_DR From the Cloud, Avoid IT Disasters EN POINTE Technologies and dinCloud, Webinar Presenter Barry Weber, http://www.dincloud.com.
  3. COSO, The Committee of Sponsoring Organizations of the Treadway Commission (COSO), Copyright© 2013, http://www.coso.org.
  4. ITIL Information Technology Infrastructure Library, Copyright© 2007-13 APM Group Ltd. All rights reserved, Registered in England No. 2861902, http://www.itil-officialsite.com.
  5. CobiT, Ver. 5.0, 2013, ISACA, Information Systems Audit and Control Association, (a framework for IT Governance and Controls), http://www.isaca.org.
  6. TOGAF, Ver. 9.1, The Open Group Architecture Framework (a framework for IT architecture), http://www.opengroup.org.
  7. ISO/IEC 27000:2012 Info. Security Mgt., International Organization for Standardization and the International Electrotechnical Commission, http://www.standards.iso.org/.
  8. PCAOB, Public Company Accounting and Oversight Board, http://www.pcaobus.org.

Note: Please feel free to improve our INITIAL DRAFT, Ver. 0.1, August 10th, 2013....as we do not consider our efforts to be pearls, at this point in time......Respectfully yours, Pw Carey, Compliance Partners, LLC_pwc.pwcarey@gmail.com

 

 

Commercial> Use Case 10: Cargo Shipping

Use Case Title

Cargo Shipping

Vertical (area)

Industry

Author/Company/Email

William Miller/MaCT USA/mact-usa@att.net

Actors/Stakeholders and their roles and responsibilities

End-users (Sender/Recipients)

Transport Handlers (Truck/Ship/Plane)

Telecom Providers (Cellular/SATCOM)

Shippers (Shipping and Receiving)

Goals

Retention and analysis of items (Things) in transport

Use Case Description

The following use case provides an overview of a Big Data application related to the shipping industry (e.g. FedEx, UPS, DHL). The shipping industry represents possibly the largest potential use case of Big Data in common use today. It relates to the identification, transport, and handling of items (Things) in the supply chain. The identification of an item begins with the sender, extends to the recipients, and covers all those in between who need to know the location and time of arrival of the items while in transport. A new aspect will be the status condition of the items, which will include sensor information, GPS coordinates, and a unique identification schema based upon new ISO 29161 standards under development within ISO JTC1 SC31 WG2. The data is updated in near real time when a truck arrives at a depot or upon delivery of the item to the recipient. Intermediate conditions are not currently known; the location is not updated in real time, and items lost in a warehouse or while in shipment potentially represent a problem for homeland security. The records are retained in an archive and can be accessed for xx days.
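A minimal sketch of the kind of status record such a system might synchronize to the central server when an item is scanned; the field names, placeholder identifier, and sensor reading are illustrative assumptions (the ISO 29161 identification schema is described above as still under development), not an actual schema.

# Illustrative shipment scan event; field names and identifier format are assumptions.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class ScanEvent:
    item_id: str          # placeholder unique identifier (standard still under development)
    status: str           # e.g. "picked_up", "arrived_depot", "delivered"
    latitude: float
    longitude: float
    temperature_c: float  # example sensor reading attached to the item
    scanned_at: str

event = ScanEvent(
    item_id="EXAMPLE-0001",
    status="arrived_depot",
    latitude=38.8951,
    longitude=-77.0364,
    temperature_c=4.2,
    scanned_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(event)))   # payload synchronized to the central server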

Current

Solutions

Compute(System)

Unknown

Storage

Unknown

Networking

LAN/T1/Internet Web Pages

Software

Unknown

Big Data
Characteristics

Data Source (distributed/centralized)

Centralized today

Volume (size)

Large

Velocity

(e.g. real time)

The system is not currently real time.

Variety

(multiple datasets, mashup)

Updated when the driver arrives at the depot and downloads the time and date the items were picked up. This is currently not real time.

Variability (rate of change)

Today the information is updated only when the items that were checked with a bar code scanner are sent to the central server. The location is not currently displayed in real time.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

 

Visualization

NONE

Data Quality

YES

Data Types

Not Available

Data Analytics

YES

Big Data Specific Challenges (Gaps)

Provide more rapid assessment of the identity, location, and conditions of the shipments, provide detailed analytics and location of problems in the system in real time.

Big Data Specific Challenges in Mobility

Currently conditions are not monitored on-board trucks, ships, and aircraft

Security and Privacy

Requirements

Security needs to be more robust.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

This use case includes local databases as well as the requirement to synchronize with the central server. This operation would eventually extend to mobile devices and on-board systems which can track the location of the items and provide real-time updates of the information, including the status of the conditions, logging, and alerts to individuals who have a need to know.

More Information (URLs)

 

See Figure 1: Cargo Shipping – Scenario.

 

 

Commercial> Use Case 11: Materials Data

Use Case Title

Materials Data

Vertical (area)

Manufacturing, Materials Research

Author/Company/Email

John Rumble, R&R Data Services; jumbleusa@earthlink.net

Actors/Stakeholders and their roles and responsibilities

Product Designers (Inputters of materials data in CAE)

Materials Researchers (Generators of materials data; users in some cases)

Materials Testers (Generators of materials data; standards developers)

Data distributors (providers of access to materials data, often for profit)

Goals

Broaden accessibility, quality, and usability; Overcome proprietary barriers to sharing materials data; Create sufficiently large repositories of materials data to support discovery

Use Case Description

Every physical product is made from a material that has been selected for its properties, cost, and availability. This translates into hundreds of billions of dollars of materials decisions made every year.

In addition, as the Materials Genome Initiative has so effectively pointed out, the adoption of new materials normally takes decades (two to three) rather than a small number of years, in part because data on new materials is not easily available.

All actors within the materials life cycle today have access to very limited quantities of materials data, thereby resulting in materials-related decisions that are non-optimal, inefficient, and costly. While the Materials Genome Initiative is addressing one major and important aspect of the issue, namely the fundamental materials data necessary to design and test materials computationally, the issues related to physical measurements on physical materials (from basic structural and thermal properties, to complex performance properties, to properties of novel nanoscale materials) are not being addressed systematically, broadly (cross-discipline and internationally), or effectively (there are virtually no materials data meetings, standards groups, or dedicated funded programs).

One of the greatest challenges that Big Data approaches can address is predicting the performance of real materials (gram to ton quantities) starting at the atomistic, nanometer, and/or micrometer level of description.

As a result of the above considerations, decisions about materials usage are unnecessarily conservative, often based on older rather than newer materials R&D data, and not taking advantage of advances in modeling and simulations. Materials informatics is an area in which the new tools of data science can have major impact.

Current

Solutions

Compute(System)

None

Storage

Widely dispersed with many barriers to access

Networking

Virtually none

Software

Narrow approaches based on national programs (Japan, Korea, and China), applications (EU Nuclear program), proprietary solutions (Granta, etc.)

Big Data
Characteristics

Data Source (distributed/centralized)

Extremely distributed with data repositories existing only for a very few fundamental properties

Volume (size)

It has been estimated (in the 1980s) that over 500,000 commercial materials had been made in the previous fifty years. The last three decades have seen large growth in that number.

Velocity

(e.g. real time)

Computer-designed and theoretically designed materials (e.g., nanomaterials) are growing in number over time.

Variety

(multiple datasets, mashup)

Many data sets and virtually no standards for mashups

Variability (rate of change)

Materials are changing all the time, and new materials data are constantly being generated to describe the new materials

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

More complex material properties can require many (hundreds of?) independent variables to describe accurately. Virtually no activity exists that is trying to identify and systematize the collection of these variables to create robust data sets.

Visualization

Important for materials discovery. Potentially important to understand the dependency of properties on the many independent variables. Virtually unaddressed.

Data Quality

Except for fundamental data on the structural and thermal properties, data quality is poor or unknown. See Munro’s NIST Standard Practice Guide.

Data Types

Numbers, graphical, images

Data Analytics

Empirical and narrow in scope

Big Data Specific Challenges (Gaps)

1.       Establishing materials data repositories beyond the existing ones that focus on fundamental data

2.       Developing internationally accepted data recording standards that can be used by a very diverse materials community, including developers of materials test standards (such as ASTM and ISO), testing companies, materials producers, and R&D labs

3.       Tools and procedures to help organizations wishing to deposit proprietary materials in data repositories to mask proprietary information, yet to maintain the usability of data

4.       Multi-variable materials data visualization tools, in which the number of variables can be quite high

Big Data Specific Challenges in Mobility

Not important at this time

Security and Privacy

Requirements

The proprietary nature of much of the data makes it very sensitive.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Development of standards; development of large scale repositories; involving industrial users; integration with CAE (don’t underestimate the difficulty of this – materials people are generally not as computer savvy as chemists, bioinformatics people, and engineers)

More Information (URLs)

 

 

 

Commercial> Use Case 12: Simulation Driven Materials Genomics

Use Case Title

Simulation driven Materials Genomics

Vertical (area)

Scientific Research: Materials Science

Author/Company/Email

David Skinner/LBNL/  deskinner@lbl.gov

Actors/Stakeholders and their roles and responsibilities

Capability providers: National labs and energy hubs provide advanced materials genomics capabilities using computing and data as instruments of discovery.

User Community: DOE, industry and academic researchers as a user community seeking capabilities for rapid innovation in materials.

Goals

Speed the discovery of advanced materials through informatically driven simulation surveys.

Use Case Description

Innovation of battery technologies through massive simulations spanning wide spaces of possible design. Systematic computational studies of innovation possibilities in photovoltaics. Rational design of materials based on search and simulation.

Current

Solutions

Compute(System)

Hopper.nersc.gov (150K cores), omics-like data analytics hardware resources.

Storage

GPFS, MongoDB

Networking

10Gb

Software

PyMatGen, FireWorks, VASP, ABINIT, NWChem, BerkeleyGW, varied community codes

Big Data
Characteristics

Data Source (distributed/centralized)

Gateway-like. Data streams from simulation surveys driven on centralized peta/exascale systems. Widely distributed web of dataflows from central gateway to users.

Volume (size)

100TB (current), 500TB within 5 years. Scalable key-value and object store databases needed.

Velocity

(e.g. real time)

High-throughput computing (HTC), fine-grained tasking and queuing. Rapid start/stop for ensembles of tasks. Real-time data analysis for web-like responsiveness.

Variety

(multiple datasets, mashup)

Mashup of simulation outputs across codes and levels of theory. Formatting, registration and integration of datasets. Mashups of data across simulation scales.

Variability (rate of change)

The targets for materials design will become more search and crowd-driven. The computational backend must flexibly adapt to new targets.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues, semantics)

Validation and UQ of simulation with experimental data of varied quality. Error checking and bounds estimation from simulation inter-comparison.

Visualization

Materials browsers as data from search grows. Visual design of materials.

Data Quality (syntax)

UQ in results based on multiple datasets.

Propagation of error in knowledge systems.

Data Types

Key value pairs, JSON, materials file formats

Data Analytics

MapReduce and search that join simulation and experimental data.
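A minimal Python sketch of the "join simulation and experimental data" idea: records keyed on a material identifier are merged so that computed and measured properties can be compared for validation and UQ. The identifiers, field names, and values below are invented for illustration.

# Illustrative join of simulated and experimental records on a material key (invented data).
simulated = [
    {"material_id": "ex-0001", "formula": "LiFePO4", "computed_band_gap_eV": 3.4},
    {"material_id": "ex-0002", "formula": "Si",      "computed_band_gap_eV": 0.6},
]
experimental = [
    {"material_id": "ex-0002", "measured_band_gap_eV": 1.1},
]

by_id = {rec["material_id"]: rec for rec in experimental}
for sim in simulated:
    exp = by_id.get(sim["material_id"])
    if exp:
        delta = sim["computed_band_gap_eV"] - exp["measured_band_gap_eV"]
        print(sim["formula"], "simulation vs. experiment gap difference:", round(delta, 2))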

Big Data Specific Challenges (Gaps)

HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that integrate data from publications, experiments, and simulations to advance goal-driven thinking in materials design.

Big Data Specific Challenges in Mobility

Potential exists for widespread delivery of actionable knowledge in materials science. Many materials genomics “apps” are amenable to a mobile platform.

Security and Privacy

Requirements

Ability to “sandbox” or create independent working areas between data stakeholders. Policy-driven federation of datasets.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

An OSTP blueprint toward broader materials genomics goals was made available in May 2013.

More Information (URLs)

http://www.materialsproject.org

 

 

Defense> Use Case 13: Large Scale Geospatial Analysis and Visualization

Use Case Title

Large Scale Geospatial Analysis and Visualization

Vertical (area)

Defense – but applicable to many others

Author/Company/Email

David Boyd/Data Tactics/ dboyd@data-tactics.com

Actors/Stakeholders and their roles and responsibilities

Geospatial Analysts

Decision Makers

Policy Makers

Goals

Support large scale geospatial data analysis and visualization.

Use Case Description

As the number of geospatially aware sensors and geospatially tagged data sources increases, the volume of geospatial data requiring complex analysis and visualization is growing exponentially. Traditional GIS systems are generally capable of analyzing millions of objects and easily visualizing thousands. Today’s intelligence systems often contain trillions of geospatial objects and need to be able to visualize and interact with millions of objects.

Current

Solutions

Compute(System)

Compute and Storage systems - Laptops to Large servers (see notes about clusters)

Visualization systems - handhelds to laptops

Storage

Compute and Storage - local disk or SAN

Visualization - local disk, flash ram

Networking

Compute and Storage - Gigabit or better LAN connection

Visualization - Gigabit wired connections, Wireless including WiFi (802.11), Cellular (3g/4g), or Radio Relay

Software

Compute and Storage – generally Linux or Win Server with Geospatially enabled RDBMS, Geospatial server/analysis software – ESRI ArcServer, Geoserver

Visualization – Windows, Android, IOS – browser based visualization. Some laptops may have local ArcMap.

Big Data
Characteristics

Data Source (distributed/centralized)

Very distributed.

Volume (size)

Imagery – 100s of Terabytes

Vector Data – 10s of Gigabytes but billions of points

Velocity

(e.g. real time)

Some sensors deliver vector data in NRT (near real time). Visualization of changes should be NRT.

Variety

(multiple datasets, mashup)

Imagery (various formats: NITF, GeoTIFF, CADRG)

Vector (various formats: shape files, KML, text streams). Object types include points, lines, areas, polylines, circles, and ellipses.

Variability (rate of change)

Moderate to high

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Data accuracy is critical and is generally controlled by three factors:

1.       Sensor accuracy (a big issue).

2.       Datum/spheroid accuracy.

3.       Image registration accuracy.

Visualization

Displaying large data sets (millions of points) in a meaningful way on small devices (handhelds) at the end of low-bandwidth networks.

Data Quality

The typical problem is visualization implying quality/accuracy not available in the original data. All data should include metadata for accuracy or circular error probability.

Data Types

Imagery (various formats: NITF, GeoTIFF, CADRG)

Vector (various formats: shape files, KML, text streams). Object types include points, lines, areas, polylines, circles, and ellipses.

Data Analytics

Closest point of approach, deviation from route, point density over time, PCA and ICA
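A small Python sketch of the first analytic listed above, the closest point of approach (CPA) between two objects moving at constant velocity; the positions and velocities are made-up planar coordinates rather than data from this use case.

# Closest point of approach (CPA) for two constant-velocity tracks (made-up 2-D data).
def cpa(p1, v1, p2, v2):
    wx, wy = p1[0] - p2[0], p1[1] - p2[1]          # relative position
    dvx, dvy = v1[0] - v2[0], v1[1] - v2[1]        # relative velocity
    dv2 = dvx * dvx + dvy * dvy
    t = 0.0 if dv2 == 0 else max(0.0, -(wx * dvx + wy * dvy) / dv2)
    dx, dy = wx + t * dvx, wy + t * dvy            # separation at the CPA time
    return t, (dx * dx + dy * dy) ** 0.5

t_star, distance = cpa(p1=(0.0, 0.0), v1=(1.0, 0.0), p2=(10.0, 2.0), v2=(-1.0, 0.0))
print(t_star, distance)   # objects are nearest at t = 5.0, separated by 2.0 units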

Big Data Specific Challenges (Gaps)

Indexing, retrieval and distributed analysis

Visualization generation and transmission

Big Data Specific Challenges in Mobility

Visualization of data at the end of low bandwidth wireless connections.

Security and Privacy

Requirements

Data is sensitive and must be completely secure in transit and at rest (particularly on handhelds)

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Geospatial data requires unique approaches to indexing and distributed analysis.

More Information (URLs)

Applicable Standards: http://www.opengeospatial.org/standards

http://geojson.org/

http://earth-info.nga.mil/publications/specs/printed/CADRG/cadrg.html

 

Geospatial Indexing: Quad Trees, Space Filling Curves (Hilbert Curves) – You can google these for lots of references.
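The indexing approaches named above map 2-D coordinates onto a 1-D key so that ordinary sorted storage can answer spatial queries; below is a minimal Python sketch of a Z-order (Morton) encoding, a close relative of the Hilbert curves mentioned, over a fixed latitude/longitude grid. The grid resolution and example point are illustrative assumptions.

# Z-order (Morton) key sketch: interleave the bits of grid-quantized lat/lon
# so that nearby points tend to receive nearby 1-D keys. Resolution is an assumption.
BITS = 16  # grid resolution per axis (assumption)

def quantize(value, lo, hi):
    return int((value - lo) / (hi - lo) * ((1 << BITS) - 1))

def morton_key(lat, lon):
    y, x = quantize(lat, -90.0, 90.0), quantize(lon, -180.0, 180.0)
    key = 0
    for bit in range(BITS):
        key |= ((x >> bit) & 1) << (2 * bit)
        key |= ((y >> bit) & 1) << (2 * bit + 1)
    return key

print(morton_key(38.8951, -77.0364))   # 1-D key usable as a sortable index column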

Note: There has been some work within DoD related to this problem set. Specifically, the DCGS-A standard cloud (DSC) stores, indexes, and analyzes some Big Data sources. However, many issues still remain with visualization.

 

 

Defense> Use Case 14: Object Identification and Tracking – Persistent Surveillance

Use Case Title

Object identification and tracking from Wide Area Large Format Imagery (WALF) Imagery or Full Motion Video (FMV) – Persistent Surveillance

Vertical (area)

Defense (Intelligence)

Author/Company/Email

David Boyd/Data Tactics/  dboyd@data-tactics.com

Actors/Stakeholders and their roles and responsibilities

1.       Civilian Military decision makers

2.       Intelligence Analysts

3.       Warfighters

Goals

To be able to process and extract/track entities (vehicles, people, packages) over time from the raw image data. Specifically, the idea is to reduce the petabytes of data generated by persistent surveillance down to a manageable size (e.g. vector tracks)

Use Case Description

Persistent surveillance sensors can easily collect petabytes of imagery data in the space of a few hours. It is infeasible for this data to be processed by humans for either alerting or tracking purposes. The data needs to be processed close to the sensor, which is likely forward deployed, since it is too large to be easily transmitted. The data should be reduced to a set of geospatial objects (points, tracks, etc.) which can easily be integrated with other data to form a common operational picture.
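A toy Python sketch of the reduction described above: per-frame detections are associated to existing tracks by nearest neighbor within a distance gate, so raw pixels collapse into compact vector tracks. The gate distance and the sample detections are invented, and real trackers use far more sophisticated association and motion models.

# Toy nearest-neighbor association of per-frame detections to tracks (invented data).
GATE = 5.0  # maximum association distance in pixels (assumption)

tracks = {1: [(100.0, 200.0)], 2: [(400.0, 120.0)]}   # track_id -> list of (x, y) points
detections = [(102.5, 201.0), (398.0, 119.0), (50.0, 50.0)]

next_id = max(tracks) + 1
for det in detections:
    # Find the nearest existing track head.
    best_id, best_dist = None, float("inf")
    for tid, points in tracks.items():
        x, y = points[-1]
        dist = ((det[0] - x) ** 2 + (det[1] - y) ** 2) ** 0.5
        if dist < best_dist:
            best_id, best_dist = tid, dist
    if best_dist <= GATE:
        tracks[best_id].append(det)      # extend an existing track
    else:
        tracks[next_id] = [det]          # start a new track
        next_id += 1

print(tracks)   # compact vector tracks instead of raw imagery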

Current

Solutions

Compute(System)

Various – they range from simple storage capabilities mounted on the sensor, to simple display and storage, to limited object extraction. Typical object extraction systems are currently small (1-20 node) GPU enhanced clusters.

Storage

Currently flat files persisted on disk in most cases. Sometimes RDBMS indexes pointing to files or portions of files based on metadata/telemetry data.

Networking

Sensor comms tend to be Line of Sight or Satellite based.

Software

A wide range of custom software and tools, including traditional RDBMSs and display tools.

Big Data
Characteristics

Data Source (distributed/centralized)

Sensors include airframe-mounted and fixed-position optical, IR, and SAR imagers.

Volume (size)

FMV – 30-60 frames per second at full-color 1080p resolution.

WALF – 1-10 frames per second at 10K x 10K full-color resolution.

Velocity

(e.g. real time)

Real Time

Variety

(multiple datasets, mashup)

Data Typically exists in one or more standard imagery or video formats.

Variability (rate of change)

Little

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

The veracity of extracted objects is critical. If the system fails or generates false positives people are put at risk.

Visualization

Visualization of extracted outputs will typically be as overlays on a geospatial display. Overlay objects should be links back to the originating image/video segment.

Data Quality

Data quality is generally driven by a combination of sensor characteristics and weather (both obscuring factors - dust/moisture and stability factors – wind).

Data Types

Standard imagery and video formats are input. Output should be in the form of OGC compliant web features or standard geospatial files (shape files, KML).

Data Analytics

1.       Object identification (type, size, color) and tracking.

2.       Pattern analysis of object (did the truck observed every Weds. afternoon take a different route today or is there a standard route this person takes every day).

3.       Crowd behavior/dynamics (is there a small group attempting to incite a riot? Is this person out of place in the crowd or behaving differently?).

4.       Economic activity:

a.        Is the line at the bread store, the butcher, or the ice cream store?

b.       Are more trucks traveling north with goods than trucks going south?

c.        Has activity at, or the size of, stores in this marketplace increased or decreased over the past year?

5.       Fusion of data with other data to improve quality and confidence.

Big Data Specific Challenges (Gaps)

Processing the volume of data in NRT to support alerting and situational awareness.

Big Data Specific Challenges in Mobility

Getting data from mobile sensor to processing

Security and Privacy

Requirements

Significant – sources and methods cannot be compromised; the enemy should not be able to know what we see.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Typically this type of processing fits well into massively parallel computing such as that provided by GPUs. A typical problem is integration of this processing into a larger cluster capable of processing data from several sensors in parallel and in NRT.

Transmission of data from sensor to system is also a large challenge.

More Information (URLs)

Motion Imagery Standards - http://www.gwg.nga.mil/misb/

Some of many papers on object identity/tracking: http://www.dabi.temple.edu/~hbling/publication/SPIE12_Dismount_Formatted_v2_BW.pdf

http://csce.uark.edu/~jgauch/library/Tracking/Orten.2005.pdf

http://www.sciencedirect.com/science/article/pii/S0031320305004863

General Articles on the need: http://www.militaryaerospace.com/topics/m/video/79088650/persistent-surveillance-relies-on-extracting-relevant-data-points-and-connecting-the-dots.htm

http://www.defencetalk.com/wide-area-persistent-surveillance-revolutionizes-tactical-isr-45745/


 

 

Defense> Use Case 15: Intelligence Data Processing and Analysis

Use Case Title

Intelligence Data Processing and Analysis

Vertical (area)

Defense (Intelligence)

Author/ Company/Email

David Boyd/Data Tactics/  dboyd@data-tactics.com

Actors/Stakeholders and their roles and responsibilities

Senior Civilian/Military Leadership

Field Commanders

Intelligence Analysts

Warfighters

Goals

1.       Provide automated alerts to Analysts, Warfighters, Commanders, and Leadership based on incoming intelligence data.

2.       Allow Intelligence Analysts to identify in Intelligence data

a.        Relationships between entities (people, organizations, places, equipment)

b.       Trends in sentiment or intent for either general population or leadership group (state, non-state actors).

c.        Location of and possibly timing of hostile actions (including implantation of IEDs).

d.       Track the location and actions of (potentially) hostile actors

3.       Ability to reason against and derive knowledge from diverse, disconnected, and frequently unstructured (e.g. text) data sources.

4.       Ability to process data close to the point of collection and allow data to be shared easily to/from individual soldiers, forward deployed units, and senior leadership in garrison.

Use Case Description

1.       Ingest/accept data from a wide range of sensors and sources across intelligence disciplines (IMINT, MASINT, GEOINT, HUMINT, SIGINT, OSINT, etc.)

2.       Process, transform, or align data from disparate sources in disparate formats into a unified data space to permit:

a.        Search

b.       Reasoning

c.        Comparison

3.       Provide alerts to users of significant changes in the state of monitored entities or significant activity within an area.

4.       Provide connectivity to the edge for the Warfighter (in this case the edge would go as far as a single soldier on dismounted patrol)

Current

Solutions

Compute(System)

Fixed and deployed computing clusters ranging from 1000s of nodes to 10s of nodes.

Storage

10s of Terabytes to 100s of Petabytes for edge and fixed site clusters. Dismounted soldiers would have at most 1-100s of Gigabytes (mostly single digit handheld data storage sizes).

Networking

Networking within and between in-garrison fixed sites is robust. Connectivity to the forward edge is limited and often characterized by high latency and packet loss. Remote comms might be satellite-based (high latency) or even limited to RF line-of-sight radio.

Software

Currently baseline leverages:

1.       Hadoop

2.       Accumulo (Big Table)

3.       Solr

4.       NLP (several variants)

5.       Puppet (for deployment and security)

6.       Storm

7.       Custom applications and visualization tools

Big Data
Characteristics

Data Source (distributed/centralized)

Very distributed

Volume (size)

Some IMINT sensors can produce over a petabyte of data in the space of hours. Other data is as small as infrequent sensor activations or text messages.

Velocity

(e.g. real time)

Much sensor data is real time (full motion video, SIGINT); other data is less real time. The critical aspect is being able to ingest, process, and disseminate alerts in NRT.

Variety

(multiple datasets, mashup)

Everything from text files, raw media, imagery, video, audio, electronic data, human generated data.

Variability (rate of change)

While sensor interface formats tend to be stable, most other data is uncontrolled and may be in any format. Much of the data is unstructured.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues, semantics)

Data provenance (e.g. tracking of all transfers and transformations) must be tracked over the life of the data.

Determining the veracity of “soft” data sources (generally human generated) is a critical requirement.

Visualization

Primary visualizations will be Geospatial overlays and network diagrams. Volume amounts might be millions of points on the map and thousands of nodes in the network diagram.

Data Quality (syntax)

Data Quality for sensor generated data is generally known (image quality, sig/noise) and good.

Unstructured or “captured” data quality varies significantly and frequently cannot be controlled.

Data Types

Imagery, Video, Text, Digital documents of all types, Audio, Digital signal data.

Data Analytics

1.       NRT Alerts based on patterns and baseline changes.

2.       Link Analysis

3.       Geospatial Analysis

4.       Text Analytics (sentiment, entity extraction, etc.)

Big Data Specific Challenges (Gaps)

1.       Moving big (or even moderate-sized) data over tactical networks

2.       Data currently exists in disparate silos which must be accessible through a semantically integrated data space.

3.       Most critical data is either unstructured or imagery/video which requires significant processing to extract entities and information.

Big Data Specific Challenges in Mobility

The outputs of this analysis and information must be transmitted to or accessed by the dismounted forward soldier.

Security and Privacy

Requirements

Foremost. Data must be protected against:

1.       Unauthorized access or disclosure

2.       Tampering

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Wide variety of data types, sources, structures, and quality which will span domains and requires integrated search and reasoning.

 

Healthcare and Life Sciences> Use Case 16: Electronic Medical Record (EMR) Data

Use Case Title

Electronic Medical Record (EMR) Data

Vertical (area)

Healthcare

Author/Company/Email

Shaun Grannis/Indiana University/sgrannis@regenstrief.org

Actors/Stakeholders and their roles and responsibilities

Biomedical informatics research scientists (implement and evaluate enhanced methods for seamlessly integrating, standardizing, analyzing, and operationalizing highly heterogeneous, high-volume clinical data streams); Health services researchers (leverage integrated and standardized EMR data to derive knowledge that supports implementation and evaluation of translational, comparative effectiveness, patient-centered outcomes research); Healthcare providers – physicians, nurses, public health officials (leverage information and knowledge derived from integrated and standardized EMR data to support direct patient care and population health)

Goals

Use advanced methods for normalizing patient, provider, facility and clinical concept identification within and among separate health care organizations to enhance models for defining and extracting clinical phenotypes from non-standard discrete and free-text clinical data using feature selection, information retrieval and machine learning decision-models. Leverage clinical phenotype data to support cohort selection, clinical outcomes research, and clinical decision support.

Use Case Description

As health care systems increasingly gather and consume electronic medical record data, large national initiatives aiming to leverage such data are emerging, and include developing a digital learning health care system to support increasingly evidence-based clinical decisions with timely accurate and up-to-date patient-centered clinical information; using electronic observational clinical data to efficiently and rapidly translate scientific discoveries into effective clinical treatments; and electronically sharing integrated health data to improve healthcare process efficiency and outcomes. These key initiatives all rely on high-quality, large-scale, standardized and aggregate health data. Despite the promise that increasingly prevalent and ubiquitous electronic medical record data hold, enhanced methods for integrating and rationalizing these data are needed for a variety of reasons. Data from clinical systems evolve over time. This is because the concept space in healthcare is constantly evolving: new scientific discoveries lead to new disease entities, new diagnostic modalities, and new disease management approaches. These in turn lead to new clinical concepts, which drive the evolution of health concept ontologies. Using heterogeneous data from the Indiana Network for Patient Care (INPC), the nation's largest and longest-running health information exchange, which includes more than 4 billion discrete coded clinical observations from more than 100 hospitals for more than 12 million patients, we will use information retrieval techniques to identify highly relevant clinical features from electronic observational data. We will deploy information retrieval and natural language processing techniques to extract clinical features. Validated features will be used to parameterize clinical phenotype decision models based on maximum likelihood estimators and Bayesian networks. Using these decision models we will identify a variety of clinical phenotypes such as diabetes, congestive heart failure, and pancreatic cancer.

Current

Solutions

Compute(System)

Big Red II, a new Cray supercomputer at I.U.

Storage

Teradata, PostgreSQL, MongoDB

Networking

Various. Significant I/O intensive processing needed.

Software

Hadoop, Hive, R. Unix-based.

Big Data
Characteristics

Data Source (distributed/centralized)

Clinical data from more than 1,100 discrete logical, operational healthcare sources in the Indiana Network for Patient Care (INPC), the nation's largest and longest-running health information exchange.

Volume (size)

More than 12 million patients, more than 4 billion discrete clinical observations. > 20 TB raw data.

Velocity

(e.g. real time)

Between 500,000 and 1.5 million new real-time clinical transactions added per day.

Variety

(multiple datasets, mashup)

We integrate a broad variety of clinical datasets from multiple sources: free text provider notes; inpatient, outpatient, laboratory, and emergency department encounters; chromosome and molecular pathology; chemistry studies; cardiology studies; hematology studies; microbiology studies; neurology studies; provider notes; referral labs; serology studies; surgical pathology and cytology, blood bank, and toxicology studies.

Variability (rate of change)

Data from clinical systems evolve over time because the clinical and biological concept space is constantly evolving: new scientific discoveries lead to new disease entities, new diagnostic modalities, and new disease management approaches. These in turn lead to new clinical concepts, which drive the evolution of health concept ontologies, encoded in highly variable fashion.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues, semantics)

Data from each clinical source are commonly gathered using different methods and representations, yielding substantial heterogeneity. This leads to systematic errors and bias requiring robust methods for creating semantic interoperability.

Visualization

Inbound data volume, accuracy, and completeness must be monitored on a routine basis using focus visualization methods. Intrinsic informational characteristics of data sources must be visualized to identify unexpected trends.

Data Quality (syntax)

A central barrier to leveraging electronic medical record data is the highly variable and unique local names and codes for the same clinical test or measurement performed at different institutions. When integrating many data sources, mapping local terms to a common standardized concept using a combination of probabilistic and heuristic classification methods is necessary.

Data Types

Wide variety of clinical data types including numeric, structured numeric, free-text, structured text, discrete nominal, discrete ordinal, discrete structured, binary large blobs (images and video).

Data Analytics

Information retrieval methods to identify relevant clinical features (tf-idf, latent semantic analysis, mutual information). Natural Language Processing techniques to extract relevant clinical features. Validated features will be used to parameterize clinical phenotype decision models based on maximum likelihood estimators and Bayesian networks. Decision models will be used to identify a variety of clinical phenotypes such as diabetes, congestive heart failure, and pancreatic cancer.
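A minimal Python sketch of the tf-idf scoring mentioned above, applied to a few invented snippets of free-text notes; real pipelines operate on billions of observations with full NLP preprocessing and standardized terminologies.

# Minimal tf-idf over a few invented free-text note snippets.
import math
from collections import Counter

notes = [
    "patient reports polyuria and elevated fasting glucose",
    "elevated glucose noted, start metformin",
    "no acute distress, routine follow up",
]
tokenized = [note.lower().split() for note in notes]
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    return {term: (count / len(doc)) * math.log(len(notes) / doc_freq[term])
            for term, count in counts.items()}

# Terms with the highest weights in the first note are candidate clinical features.
weights = tfidf(tokenized[0])
print(sorted(weights, key=weights.get, reverse=True)[:3])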

Big Data Specific Challenges (Gaps)

Overcoming the systematic errors and bias in large-scale, heterogeneous clinical data to support decision-making in research, patient care, and administrative use-cases requires complex multistage processing and analytics that demands substantial computing power. Further, the optimal techniques for accurately and effectively deriving knowledge from observational clinical data are nascent.

Big Data Specific Challenges in Mobility

Biological and clinical data are needed in a variety of contexts throughout the healthcare ecosystem. Effectively delivering clinical data and knowledge across the healthcare ecosystem will be facilitated by mobile platforms such as mHealth.

Security and Privacy

Requirements

Privacy and confidentiality of individuals must be preserved in compliance with federal and state requirements including HIPAA. Developing analytic models using comprehensive, integrated clinical data requires aggregation and subsequent de-identification prior to applying complex analytics.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Patients increasingly receive health care in a variety of clinical settings. The subsequent EMR data is fragmented and heterogeneous. In order to realize the promise of a Learning Health Care system as advocated by the National Academy of Science and the Institute of Medicine, EMR data must be rationalized and integrated. The methods we propose in this use-case support integrating and rationalizing clinical data to support decision-making at multiple levels.

More Information (URLs)

Regenstrief Institute (http://www.regenstrief.org); Logical observation identifiers names and codes (http://www.loinc.org); Indiana Health Information Exchange (http://www.ihie.org); Institute of Medicine Learning Healthcare System (http://www.iom.edu/Activities/Quality/LearningHealthcare.aspx)

 

 

 

Healthcare and Life Sciences> Use Case 17: Pathology Imaging/Digital Pathology

Use Case Title

Pathology Imaging/digital pathology

Vertical (area)

Healthcare

Author/Company/Email

Fusheng Wang/Emory University/fusheng.wang@emory.edu

Actors/Stakeholders and their roles and responsibilities

Biomedical researchers on translational research; hospital clinicians on imaging guided diagnosis

Goals

Develop high performance image analysis algorithms to extract spatial information from images; provide efficient spatial queries and analytics, and feature clustering and classification

Use Case Description

Digital pathology imaging is an emerging field where examination of high-resolution images of tissue specimens enables novel and more effective ways for disease diagnosis. Pathology image analysis segments massive (millions per image) spatial objects such as nuclei and blood vessels, represented with their boundaries, along with many extracted image features from these objects. The derived information is used for many complex queries and analytics to support biomedical research and clinical diagnosis. Recently, 3D pathology imaging has been made possible through 3D laser technologies or by serially sectioning hundreds of tissue sections onto slides and scanning them into digital images. Segmenting 3D microanatomic objects from registered serial images could produce tens of millions of 3D objects from a single image. This provides a deep “map” of human tissues for next-generation diagnosis.
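A small Python sketch of the kind of spatial query run over segmented objects: counting nuclei whose centroids fall inside a rectangular region of interest. The coordinates are invented; at scale such queries are distributed with spatial partitioning (e.g. the Hadoop-GIS system referenced under More Information below).

# Toy spatial query: count segmented nuclei whose centroids fall in a region of interest.
nuclei = [  # (object_id, centroid_x, centroid_y) in pixel coordinates (invented)
    ("n1", 1040.5, 2210.0),
    ("n2", 1055.2, 2301.7),
    ("n3", 5210.0, 8120.3),
]
roi = (1000, 2200, 1100, 2400)   # (min_x, min_y, max_x, max_y)

def in_roi(x, y, box):
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]

hits = [oid for oid, x, y in nuclei if in_roi(x, y, roi)]
print(len(hits), hits)   # 2 ['n1', 'n2']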

Current

Solutions

Compute(System)

Supercomputers; Cloud

Storage

SAN or HDFS

Networking

Need excellent external network link

Software

MPI for image analysis; MapReduce + Hive with spatial extension

Big Data
Characteristics

Data Source (distributed/centralized)

Digitized pathology images from human tissues

Volume (size)

1 GB raw image data + 1.5 GB analytical results per 2D image; 1 TB raw image data + 1 TB analytical results per 3D image; about 1 PB of data per moderate-sized hospital per year

Velocity

(e.g. real time)

Once generated, data will not be changed

Variety

(multiple datasets, mashup)

Image characteristics and analytics depend on disease types

Variability (rate of change)

No change

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

High quality results validated with human annotations are essential

Visualization

Needed for validation and training

Data Quality

Depends on the pre-processing of tissue slides (e.g., chemical staining) and on the quality of the image analysis algorithms

Data Types

Raw images are whole slide images (mostly based on BIGTIFF), and analytical results are structured data (spatial boundaries and features)

Data Analytics

Image analysis; spatial queries and analytics; feature clustering and classification (a minimal spatial-query sketch follows this use case)

Big Data Specific Challenges (Gaps)

Extremely large size; multi-dimensional; disease-specific analytics; correlation with other data types (clinical data, -omic data)

Big Data Specific Challenges in Mobility

3D visualization of 3D pathology images is not likely on mobile platforms

Security and Privacy

Requirements

Protected health information has to be protected; public data have to be de-identified

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Imaging data; multi-dimensional spatial data analytics

More Information (URLs)

https://web.cci.emory.edu/confluence/display/PAIS

https://web.cci.emory.edu/confluence/display/HadoopGIS

See Figure 2: Pathology Imaging/Digital Pathology – Examples of 2-D and 3-D pathology images.

See Figure 3: Pathology Imaging/Digital Pathology – Architecture of Hadoop-GIS, a spatial data warehousing system, over MapReduce to support spatial analytics for analytical pathology imaging.
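
As a small illustration of the spatial analytics this use case describes (see the Data Analytics entry), the following hypothetical Python sketch runs a region-of-interest query over a handful of invented nucleus boundaries using shapely. Hadoop-GIS performs this kind of query at scale with spatial indexing and MapReduce partitioning; the sketch only shows the shape of the operation.

```python
# Minimal sketch of a spatial window query over segmented nuclei boundaries,
# the kind of operation Hadoop-GIS runs at scale with Hive spatial extensions.
# The polygons below are invented; a real slide yields millions of objects.
from shapely.geometry import Polygon, box

# Hypothetical nucleus boundaries produced by image segmentation.
nuclei = [
    Polygon([(1, 1), (2, 1), (2, 2), (1, 2)]),
    Polygon([(10, 10), (12, 10), (12, 12), (10, 12)]),
    Polygon([(3, 3), (4, 3), (4, 4), (3, 4)]),
]

# Region-of-interest ("window") query: which nuclei fall inside this slide region?
# At scale this is done with a spatial index (R-tree) and MapReduce partitioning.
roi = box(0, 0, 5, 5)
hits = [n for n in nuclei if n.intersects(roi)]

print(f"{len(hits)} of {len(nuclei)} nuclei intersect the ROI")
print("total nucleus area in ROI:", sum(n.area for n in hits))
```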

 

 

Healthcare and Life Sciences> Use Case 18: Computational Bioimaging

Use Case Title

Computational Bioimaging

Vertical (area)

Scientific Research: Biological Science

Author/Company/Email

David Skinner1, deskinner@lbl.gov

Joaquin Correa1, JoaquinCorrea@lbl.gov

Daniela Ushizima2, dushizima@lbl.gov

Joerg Meyer2, joergmeyer@lbl.gov

1National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory, USA

2Computational Research Division, Lawrence Berkeley National Laboratory, USA

Actors/Stakeholders and their roles and responsibilities

Capability providers: Bioimaging instrument operators, microscope developers, imaging facilities, applied mathematicians, and data stewards.

User Community: DOE, industry and academic researchers seeking to collaboratively build models from imaging data.

Goals

Data delivered from bioimaging are increasingly automated, higher resolution, and multi-modal. This has created a data analysis bottleneck that, if resolved, can advance bioscience discovery through Big Data techniques. Our goal is to solve that bottleneck with extreme-scale computing.

Meeting that goal will require more than computing. It will require building communities around data resources and providing advanced algorithms for massive image analysis. High-performance computational solutions can be harnessed by community-focused science gateways to guide the application of massive data analysis toward massive imaging data sets. Workflow components include data acquisition, storage, enhancement, noise minimization, segmentation of regions of interest, crowd-based selection and extraction of features, object classification, organization, and search (a minimal segmentation sketch follows this use case).

Use Case Description

Web-based one-stop-shop for high performance, high throughput image processing for producers and consumers of models built on bio-imaging data.

Current

Solutions

Compute(System)

Hopper.nersc.gov (150K cores)

Storage

Database and image collections

Networking

10Gb, could use 100Gb and advanced networking (SDN)

Software

ImageJ, OMERO, VolRover, advanced segmentation and feature detection methods from applied math researchers

Big Data
Characteristics

Data Source (distributed/centralized)

Distributed experimental sources of bioimages (instruments). Scheduled high volume flows from automated high-resolution optical and electron microscopes.

Volume (size)

Growing very fast. Scalable key-value and object store databases needed. In-database processing and analytics. 50TB here now, but currently over a petabyte overall. A single scan on emerging machines is 32TB

Velocity

(e.g. real time)

High-throughput computing (HTC), responsive analysis

Variety

(multiple datasets, mashup)

Multi-modal imaging essentially must mash-up disparate channels of data with attention to registration and dataset formats.

Variability (rate of change)

Biological samples are highly variable and their analysis workflows must cope with wide variation.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Data is messy overall, and so is training classifiers.

Visualization

Heavy use of 3D structural models.

Data Quality (syntax)

 

Data Types

Imaging file formats

Data Analytics

Machine learning (SVM and RF) for classification and recommendation services.

Big Data Specific Challenges (Gaps)

HTC at scale for simulation science. Flexible data methods at scale for messy data. Machine learning and knowledge systems that drive pixel based data toward biological objects and models.

Big Data Specific Challenges in Mobility

 

Security and Privacy

Requirements

 

Highlight issues for generalizing this use case (e.g. for ref. architecture)

There is potential in generalizing concepts of search in the context of bioimaging.
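
As a toy illustration of the "segmentation of regions of interest" workflow step listed under Goals, the following hypothetical sketch thresholds a synthetic image, labels connected components with scipy.ndimage, and extracts simple per-object features. The production pipelines rely on far more advanced segmentation and feature-detection methods (see Software).

```python
# Minimal sketch of the "segmentation of regions of interest" workflow step:
# threshold a synthetic image, label connected components, and extract simple
# per-object features for downstream classification (e.g., SVM or random forest).
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
image = rng.normal(loc=0.1, scale=0.05, size=(128, 128))   # synthetic background
image[20:40, 30:55] += 1.0                                  # two synthetic "objects"
image[80:95, 70:100] += 1.0

mask = image > 0.5                          # crude intensity threshold
labels, n_objects = ndimage.label(mask)     # connected-component segmentation

# Per-object features: size (pixel count) and mean intensity.
sizes = ndimage.sum(mask, labels, index=range(1, n_objects + 1))
means = ndimage.mean(image, labels, index=range(1, n_objects + 1))
for i, (s, m) in enumerate(zip(sizes, means), start=1):
    print(f"object {i}: {int(s)} pixels, mean intensity {m:.2f}")
```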

More Information (URLs)

 

 

 

Healthcare and Life Sciences> Use Case 19: Genomic Measurements

Use Case Title

Genomic Measurements

Vertical (area)

Healthcare

Author/Company/Email

Justin Zook / NIST / jzook@nist.gov

Actors/Stakeholders and their roles and responsibilities

NIST/Genome in a Bottle Consortium – public/private/academic partnership

Goals

Develop well-characterized Reference Materials, Reference Data, and Reference Methods needed to assess performance of genome sequencing

Use Case Description

Integrate data from multiple sequencing technologies and methods to develop a highly confident characterization of whole human genomes as Reference Materials, and develop methods to use these Reference Materials to assess the performance of any genome sequencing run (a minimal consensus-calling sketch follows this use case).

Current

Solutions

Compute(System)

72-core cluster for our NIST group, collaboration with >1000 core clusters at FDA, some groups are using cloud

Storage

~40TB NFS at NIST, PBs of genomics data at NIH/NCBI

Networking

Varies. Significant I/O intensive processing needed

Software

Open-source sequencing bioinformatics software from academic groups (UNIX-based)

Big Data
Characteristics

Data Source (distributed/centralized)

Sequencers are distributed across many laboratories, though some core facilities exist.

Volume (size)

40TB NFS is full, will need >100TB in 1-2 years at NIST; Healthcare community will need many PBs of storage

Velocity

(e.g. real time)

DNA sequencers can generate ~300GB compressed data/day. Velocity has increased much faster than Moore’s Law

Variety

(multiple datasets, mashup)

File formats not well-standardized, though some standards exist. Generally structured data.

Variability (rate of change)

Sequencing technologies have evolved very rapidly, and new technologies are on the horizon.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

All sequencing technologies have significant systematic errors and biases, which require complex analysis methods and combining multiple technologies to understand, often with machine learning

Visualization

“Genome browsers” have been developed to visualize processed data

Data Quality

Sequencing technologies and bioinformatics methods have significant systematic errors and biases

Data Types

Mainly structured text

Data Analytics

Processing of raw data to produce variant calls. Also, clinical interpretation of variants, which is now very challenging.

Big Data Specific Challenges (Gaps)

Processing data requires significant computing power, which poses challenges especially to clinical laboratories as they are starting to perform large-scale sequencing. Long-term storage of clinical sequencing data could be expensive. Analysis methods are quickly evolving. Many parts of the genome are challenging to analyze, and systematic errors are difficult to characterize.

Big Data Specific Challenges in Mobility

Physicians may need access to genomic data on mobile platforms

Security and Privacy

Requirements

Sequencing data in health records or clinical research databases must be kept secure/private, though our Consortium data is public.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Some generalizations to medical genome sequencing are given above, but the focus is on the NIST/Genome in a Bottle Consortium work. Currently, labs doing sequencing range from small to very large. Future data could include other ‘omics’ measurements, which could be even larger than DNA sequencing.

More Information (URLs)

Genome in a Bottle Consortium: http://www.genomeinabottle.org
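
A minimal, hypothetical sketch of the core idea of combining evidence from multiple sequencing technologies: keep variants reported by at least a minimum number of call sets. The variant tuples and threshold below are invented; the Consortium's actual integration methods are considerably more sophisticated (arbitration of discordant calls, machine learning).

```python
# Minimal sketch (not the Consortium's actual integration method): form a
# high-confidence call set by keeping variants reported by at least `min_support`
# of several sequencing technologies. The call sets below are invented.
from collections import Counter

# Each call is (chromosome, position, ref_allele, alt_allele).
callsets = {
    "tech_A": {("chr1", 101, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 77, "G", "A")},
    "tech_B": {("chr1", 101, "A", "G"), ("chr2", 77, "G", "A")},
    "tech_C": {("chr1", 101, "A", "G"), ("chr1", 250, "C", "T")},
}

def high_confidence(callsets, min_support=2):
    """Return variants supported by at least `min_support` technologies."""
    support = Counter(v for calls in callsets.values() for v in calls)
    return {v for v, n in support.items() if n >= min_support}

confident = high_confidence(callsets, min_support=2)
for variant in sorted(confident):
    print(variant)
```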

 

 

 
Healthcare and Life Sciences> Use Case 20: Comparative Analysis for (meta) Genomes

Use Case Title

Comparative analysis for metagenomes and genomes

Vertical (area)

Scientific Research: Genomics

Author/Company/Email

Ernest Szeto / LBNL / eszeto@lbl.gov

Actors/Stakeholders and their roles and responsibilities

Joint Genome Institute (JGI) Integrated Microbial Genomes (IMG) project. Heads: Victor M. Markowitz, and Nikos C. Kyrpides. User community: JGI, bioinformaticians and biologists worldwide.

Goals

Provide an integrated comparative analysis system for metagenomes and genomes. This includes an interactive Web UI with core data, backend precomputations, and batch job computation submission from the UI.

Use Case Description

Given a metagenomic sample, (1) determine the community composition in terms of other reference isolate genomes, (2) characterize the function of its genes, (3) begin to infer possible functional pathways, (4) characterize similarity or dissimilarity with other metagenomic samples, (5) begin to characterize changes in community composition and function due to changes in environmental pressures, (6) isolate sub-sections of data based on quality measures and community composition.

Current

Solutions

Compute(System)

Linux cluster, Oracle RDBMS server, large memory machines, standard Linux interactive hosts

Storage

Oracle RDBMS, SQLite files, flat text files, Lucy (a version of Lucene) for keyword searches, BLAST databases, USEARCH databases

Networking

Provided by NERSC

Software

Standard bioinformatics tools (BLAST, HMMER, multiple alignment and phylogenetic tools, gene callers, sequence feature predictors…), Perl/Python wrapper scripts, Linux Cluster scheduling

Big Data
Characteristics

Data Source (distributed/centralized)

Centralized.

Volume (size)

50 TB

Velocity

(e.g. real time)

Front end web UI must be real time interactive. Back end data loading processing must keep up with exponential growth of sequence data due to the rapid drop in cost of sequencing technology.

Variety

(multiple datasets, mashup)

Biological data is inherently heterogeneous, complex, structural, and hierarchical. One begins with sequences, followed by features on sequences, such as genes, motifs, regulatory regions, followed by organization of genes in neighborhoods (operons), to proteins and their structural features, to coordination and expression of genes in pathways. Besides core genomic data, new types of “Omics” data such as transcriptomics, methylomics, and proteomics describing gene expression under a variety of conditions must be incorporated into the comparative analysis system.

Variability (rate of change)

The sizes of metagenomic samples can vary by several orders of magnitude, from several hundred thousand genes to a billion genes (the latter, for example, in a complex soil sample).

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Metagenomic sampling science is currently preliminary and exploratory. Procedures for evaluating assembly of highly fragmented data in raw reads are better defined, but still an open research area.

Visualization

Interactive speed of the web UI on very large data sets is an ongoing challenge. Web UIs still seem to be the preferred interface for most biologists; they are used for basic querying and browsing of data. More specialized tools may be launched from them, e.g., for viewing multiple alignments. The ability to download large amounts of data for offline analysis is another requirement of the system.

Data Quality

Improving the quality of metagenomic assembly is still a fundamental challenge. Improving the quality of reference isolate genomes, in terms of coverage of the phylogenetic tree, gene calling, and functional annotation, is a more mature process, but an ongoing project.

Data Types

Cf. above on “Variety”

Data Analytics

Descriptive statistics, statistical significance in hypothesis testing, discovering new relationships, and data clustering and classification are a standard part of the analytics. The less quantitative part includes the ability to visualize structural details at different levels of resolution. Data reduction, removing redundancies through clustering, and more abstract representations, such as representing a group of highly similar genomes in a pangenome, are all strategies for both data management and analytics (a minimal clustering sketch follows this use case).

Big Data Specific Challenges (Gaps)

The biggest friend for dealing with the heterogeneity of biological data is still the relational database management system (RDBMS). Unfortunately, it does not scale for the current volume of data. NoSQL solutions aim at providing an alternative, but they do not always lend themselves to real-time interactive use or rapid and parallel bulk loading, and they sometimes have issues regarding robustness. Our approach is currently ad hoc and custom, relying mainly on the Linux cluster and the file system to supplement the Oracle RDBMS. The custom solutions often rely on knowledge of the peculiarities of the data, allowing us to devise horizontal partitioning schemes as well as inversion of data organization when applicable.

Big Data Specific Challenges in Mobility

No special challenges. Just world wide web access.

Security and Privacy

Requirements

No special challenges. Data is either public or requires standard login with password.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

A replacement for the RDBMS in Big Data would be of benefit to everyone. Many NoSQL solutions attempt to fill this role, but have their limitations.

More Information (URLs)

http://img.jgi.doe.gov
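
As a toy illustration of the "removing redundancies through clustering" strategy mentioned under Data Analytics, the following hypothetical sketch greedily clusters sequences by k-mer Jaccard similarity and keeps one representative per cluster. Sequences, k, and the similarity threshold are invented; production tools operate very differently at scale.

```python
# Minimal sketch of "removing redundancies through clustering": group sequences
# whose k-mer profiles are highly similar (Jaccard similarity) and keep one
# representative per cluster, a toy stand-in for pangenome-style data reduction.
def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Hypothetical gene sequences; real inputs are millions of genes per sample.
sequences = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "ATGGCGTACGTTAGG",   # near-duplicate of geneA
    "geneC": "TTGACCGGTTAACCG",
}

clusters = []  # each cluster: (representative_id, representative_kmers, members)
for name, seq in sequences.items():
    profile = kmers(seq)
    for rep_id, rep_profile, members in clusters:
        if jaccard(profile, rep_profile) >= 0.8:   # similarity threshold (invented)
            members.append(name)
            break
    else:
        clusters.append((name, profile, [name]))

for rep_id, _, members in clusters:
    print(f"cluster represented by {rep_id}: {members}")
```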

 

 

 

Healthcare and Life Sciences> Use Case 21: Individualized Diabetes Management

Use Case Title

Individualized Diabetes Management

Vertical (area)

Healthcare

Author/Company/Email

Peter Li, Ying Ding, Philip Yu, Geoffrey Fox, David Wild at Mayo Clinic, Indiana University, UIC; dingying@indiana.edu

Actors/Stakeholders and their roles and responsibilities

Mayo Clinic + IU/semantic integration of EHR data

UIC/semantic graph mining of EHR data

IU cloud and parallel computing

Goals

Develop advanced graph-based data mining techniques applied to EHR data to search for individualized patient cohorts and extract their EHR data for outcome evaluation. These methods will push the boundaries of scalability and data mining technologies, and advance knowledge and practice in these areas as well as in the clinical management of complex diseases.

Use Case Description

Diabetes is a growing illness in the world population, affecting both developing and developed countries. Current management strategies do not adequately take into account individual patient profiles, such as co-morbidities and medications, which are common in patients with chronic illnesses. We propose to address this shortcoming by identifying similar patients from a large Electronic Health Record (EHR) database, i.e., an individualized cohort, and evaluating their respective management outcomes to formulate the one solution best suited for a given patient with diabetes.

The project is under development, as described below.

Stage 1: Use the Semantic Linking for Property Values method to convert an existing data warehouse at Mayo Clinic, called the Enterprise Data Trust (EDT), into RDF triples that enable us to find similar patients much more efficiently through linking of both vocabulary-based and continuous values.

Stage 2: Needs efficient parallel retrieval algorithms, suitable for cloud or HPC, using open-source HBase with both indexed and custom search to identify patients of possible interest.

Stage 3: The EHR, as an RDF graph, provides a very rich environment for graph pattern mining. Needs new distributed graph mining algorithms to perform pattern analysis and graph indexing techniques for pattern searching on RDF triple graphs.

Stage 4: Given the size and complexity of the graphs, mining subgraph patterns could generate numerous false positives and numerous false negatives. Needs robust statistical analysis tools to manage the false discovery rate, determine true subgraph significance, and validate these through several clinical use cases.

Current

Solutions

Compute(System)

supercomputers; cloud

Storage

HDFS

Networking

Varies. Significant I/O intensive processing needed

Software

Mayo internal data warehouse called Enterprise Data Trust (EDT)

Big Data
Characteristics

Data Source (distributed/centralized)

distributed EHR data

Volume (size)

The Mayo Clinic EHR dataset is a very large dataset containing over 5 million patients with thousands of properties each and many more that are derived from primary values.

Velocity

(e.g. real time)

not real time but updated periodically

Variety

(multiple datasets, mashup)

Structured data, a patient has controlled vocabulary (CV) property values (demographics, diagnostic codes, medications, procedures, etc.) and continuous property values (lab tests, medication amounts, vitals, etc.). The number of property values could range from less than 100 (new patient) to more than 100,000 (long term patient) with typical patients composed of 100 CV values and 1000 continuous values. Most values are time based, i.e. a timestamp is recorded with the value at the time of observation.

Variability (rate of change)

Data will be updated or added during each patient visit.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Data are annotated based on domain ontologies or taxonomies. Semantics of data can vary from lab to lab.

Visualization

no visualization

Data Quality

Provenance is important to trace the origins of the data and data quality

Data Types

Text and continuous numerical values

Data Analytics

Integrating data into a semantic graph, using graph traversal to replace SQL joins. Developing semantic graph mining algorithms to identify graph patterns, index graphs, and search graphs. Indexed HBase. Custom code to develop new patient properties from stored data. (A minimal similar-patient search sketch follows this use case.)

Big Data Specific Challenges (Gaps)

For individualized cohort, we will effectively be building a datamart for each patient since the critical properties and indices will be specific to each patient. Due to the number of patients, this becomes an impractical approach. Fundamentally, the paradigm changes from relational row-column lookup to semantic graph traversal.

Big Data Specific Challenges in Mobility

Physicians and patients may need access to this data on mobile platforms

Security and Privacy

Requirements

Health records or clinical research databases must be kept secure/private.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Data integration: continuous values, ontological annotation, taxonomy

Graph Search: indexing and searching graph

Validation: Statistical validation
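
A minimal, hypothetical sketch of the similar-patient (individualized cohort) search described in this use case: patients and coded EHR properties form a graph, and similarity is scored by overlap of coded properties. The patients and codes below are invented, and the Jaccard scoring stands in for the project's distributed RDF graph mining over HBase.

```python
# Minimal sketch of finding "similar patients" by overlap of coded EHR properties,
# modeled as a bipartite patient-property graph. The real project mines RDF graphs
# in HBase with distributed algorithms; patients and codes here are invented.
import networkx as nx

G = nx.Graph()
ehr = {
    "patient_1": {"dx:E11.9", "rx:metformin", "lab:hba1c_high"},
    "patient_2": {"dx:E11.9", "rx:insulin", "lab:hba1c_high"},
    "patient_3": {"dx:I50.9", "rx:furosemide"},
}
for patient, codes in ehr.items():
    for code in codes:
        G.add_edge(patient, code)   # bipartite edge: patient -- coded property

def similar_patients(graph, target, patients):
    """Rank other patients by Jaccard overlap of their coded properties."""
    target_codes = set(graph.neighbors(target))
    scores = {}
    for other in patients:
        if other == target:
            continue
        other_codes = set(graph.neighbors(other))
        scores[other] = len(target_codes & other_codes) / len(target_codes | other_codes)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(similar_patients(G, "patient_1", ehr))   # individualized cohort for patient_1
```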

More Information (URLs)

 

 

 

 

Healthcare and Life Sciences> Use Case 22: Statistical Relational AI for Health Care

Use Case Title

Statistical Relational AI for Health Care

Vertical (area)

Healthcare

Author/Company/Email

Sriraam Natarajan / Indiana University / natarasr@indiana.edu

Actors/Stakeholders and their roles and responsibilities

Researchers in informatics and medicine, and practitioners in medicine.

Goals

The goal of the project is to analyze large, multi-modal, longitudinal data. Analyzing different data types such as imaging, EHR, genetic, and natural language data requires a rich representation. This approach employs relational probabilistic models that can handle rich relational data and model uncertainty using probability theory. The software learns models from multiple data types and can integrate the information to reason about complex queries.

Use Case Description

Users can provide a set of descriptions, for instance MRI images and demographic data about a particular subject. They can then query for the onset of a particular disease (say Alzheimer’s), and the system will provide a probability distribution over the possible occurrence of this disease (a minimal inference sketch follows this use case).

Current

Solutions

Compute(System)

A high performance computer (48 GB RAM) is needed to run the code for a few hundred patients. Clusters for large datasets

Storage

A 200 GB – 1 TB hard drive typically stores the test data. The relevant data is retrieved to main memory to run the algorithms. Backend data in database or NoSQL stores

Networking

Intranet.

Software

Mainly Java-based, in-house tools are used to process the data.

Big Data
Characteristics

Data Source (distributed/centralized)

All the data about the users reside in a single disk file. Sometimes, resources such as published text need to be pulled from the internet.

Volume (size)

Variable due to the different amounts of data collected. Typically in the hundreds of GBs for a single cohort of a few hundred people. When dealing with millions of patients, this can be on the order of 1 petabyte.

Velocity

(e.g. real time)

Varied. In some cases, EHRs are constantly being updated. In other controlled studies, the data often comes in batches in regular intervals.

Variety

(multiple datasets, mashup)

This is the key property of medical data sets. The data are typically in multiple tables and need to be merged in order to perform the analysis.

Variability (rate of change)

The arrival of data is unpredictable in many cases, as the data arrive in real time.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Challenging due to different modalities of the data, human errors in data collection and validation.

Visualization

Visualizing the entire input data is nearly impossible, but the data are typically partially visualizable. The models built can be visualized under some reasonable assumptions.

Data Quality (syntax)

 

Data Types

EHRs, imaging, genetic data that are stored in multiple databases.

Data Analytics

 

Big Data Specific Challenges (Gaps)

Data is in abundance in many cases of medicine. The key issue is that there can be too much data (images, genetic sequences, etc.), which can make the analysis complicated. The real challenge lies in aligning the data and merging it from multiple sources in a form that is useful for a combined analysis. Another issue is that sometimes a large amount of data is available about a single subject, but the number of subjects themselves is not very high (i.e., data imbalance). This can result in learning algorithms picking up random correlations between the multiple data types as important features in the analysis. Hence, robust learning methods that can faithfully model the data are of paramount importance. Another aspect of data imbalance is the occurrence of positive examples (i.e., cases). The incidence of certain diseases may be rare, making the ratio of cases to controls extremely skewed and making it possible for the learning algorithms to model noise instead of the actual cases.

Big Data Specific Challenges in Mobility

 

Security and Privacy

Requirements

Secure handling and processing of data is of crucial importance in medical domains.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Models learned from one set of populations cannot be easily generalized across other populations with diverse characteristics. This requires that the learned models can be generalized and refined according to the change in the population characteristics.
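
A minimal, hypothetical sketch of the kind of query described above: given evidence from two modalities, return a probability distribution over disease onset. All probabilities are invented, and the naive-Bayes factorization is a stand-in for the project's relational probabilistic models.

```python
# Minimal sketch of the query in this use case: combine evidence from two modalities
# (an imaging finding and a demographic attribute) into a probability distribution
# over disease onset. All probabilities below are invented, and the naive-Bayes
# factorization is only a stand-in for the project's relational probabilistic models.
prior = {"disease": 0.2, "no_disease": 0.8}

# P(evidence | class) tables, one per modality (assumed conditionally independent).
p_atrophy = {"disease": 0.7, "no_disease": 0.1}      # MRI shows hippocampal atrophy
p_age_over_75 = {"disease": 0.6, "no_disease": 0.3}  # demographic attribute

def posterior(evidence_tables, prior):
    """Combine per-modality likelihoods with the prior and normalize."""
    unnorm = {c: prior[c] for c in prior}
    for table in evidence_tables:
        for c in unnorm:
            unnorm[c] *= table[c]
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}

print(posterior([p_atrophy, p_age_over_75], prior))
# e.g. {'disease': 0.77..., 'no_disease': 0.22...}
```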

More Information (URLs)

 

 

 

Healthcare and Life Sciences> Use Case 23: World Population Scale Epidemiology

Use Case Title

World Population Scale Epidemiological Study

Vertical (area)

Epidemiology, Simulation Social Science, Computational Social Science

Author/Company/Email

Madhav Marathe, Stephen Eubank, or Chris Barrett / Virginia Bioinformatics Institute, Virginia Tech / mmarathe@vbi.vt.edu, seubank@vbi.vt.edu, or cbarrett@vbi.vt.edu

Actors/Stakeholders and their roles and responsibilities

Government and non-profit institutions involved in health, public policy, and disaster mitigation. Social scientists who want to study the interplay between behavior and contagion.

Goals

(a) Build a synthetic global population. (b) Run simulations over the global population to reason about outbreaks and various intervention strategies.

Use Case Description

Prediction and control of pandemics similar to the 2009 H1N1 influenza (a minimal agent-based SIR sketch follows this use case).

Current

Solutions

Compute(System)

Distributed (MPI) based simulation system written in Charm++. Parallelism is achieved by exploiting the disease residence time period.

Storage

Network file system. Exploring database driven techniques.

Networking

Infiniband. High bandwidth 3D Torus.

Software

Charm++, MPI

Big Data
Characteristics

Data Source (distributed/centralized)

Generated from synthetic population generator. Currently centralized. However, could be made distributed as part of post-processing.

Volume (size)

100TB

Velocity

(e.g. real time)

Interactions with experts and visualization routines generate a large amount of real-time data. The data feeding into the simulation is small, but the data generated by the simulation is massive.

Variety

(multiple datasets, mashup)

Variety depends upon the complexity of the model over which the simulation is being performed. It can be very complex if other aspects of the world population, such as type of activity and geographical, socio-economic, and cultural variations, are taken into account.

Variability (rate of change)

Depends upon the evolution of the model and corresponding changes in the code. This is complex and time intensive. Hence low rate of change.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Robustness of the simulation is dependent upon the quality of the model. However, robustness of the computation itself, although non-trivial, is tractable.

Visualization

Would require a very large amount of data movement to enable visualization.

Data Quality (syntax)

Consistent due to generation from a model

Data Types

Primarily network data.

Data Analytics

Summary of various runs and replicates of a simulation

Big Data Specific Challenges (Gaps)

Computation of the simulation is both compute intensive and data intensive. Moreover, due to the unstructured and irregular nature of graph processing, the problem is not easily decomposable; therefore it is also bandwidth intensive. Hence, a supercomputer is more applicable than cloud-type clusters.

Big Data Specific Challenges in Mobility

None

Security and Privacy

Requirements

Several issues at the synthetic population-modeling phase (see social contagion model).

Highlight issues for generalizing this use case (e.g. for ref. architecture)

In general, contagion diffusion of various kinds (information, diseases, social unrest) can be modeled and computed. All of them are agent-based models that utilize the underlying interaction network to study the evolution of the desired phenomena.
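
A minimal, hypothetical sketch of an agent-based epidemic (SIR) simulation on a synthetic contact network, illustrating the disease residence time that the production Charm++/MPI system exploits for parallelism. Network size, infection probability, and recovery time are toy values, not project data.

```python
# Minimal sketch of an agent-based epidemic simulation on a synthetic contact
# network (SIR dynamics). The production system runs MPI/Charm++ over synthetic
# populations of billions; the parameters and network here are toy values.
import random
import networkx as nx

random.seed(1)
G = nx.watts_strogatz_graph(n=500, k=6, p=0.1)   # toy synthetic contact network

state = {node: "S" for node in G}                # "S" susceptible, "I" infected, "R" recovered
for seed in random.sample(list(G), 5):
    state[seed] = "I"

beta, recovery_days = 0.05, 5                    # per-contact infection prob., residence time
days_infected = {n: 0 for n in G}

for day in range(60):
    newly_infected = []
    for node in G:
        if state[node] != "I":
            continue
        for neighbor in G.neighbors(node):
            if state[neighbor] == "S" and random.random() < beta:
                newly_infected.append(neighbor)
        days_infected[node] += 1
        if days_infected[node] >= recovery_days:
            state[node] = "R"                    # disease residence time elapsed
    for node in newly_infected:
        state[node] = "I"

print({s: sum(1 for v in state.values() if v == s) for s in ("S", "I", "R")})
```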

More Information (URLs)

 

 

 

 

Healthcare and Life Sciences> Use Case 24: Social Contagion Modeling

Use Case Title

Social Contagion Modeling

Vertical (area)

Social behavior (including national security, public health, viral marketing, city planning, disaster preparedness)

Author/Company/Email

Madhav Marathe or Chris Kuhlman / Virginia Bioinformatics Institute, Virginia Tech / mmarathe@vbi.vt.edu or ckuhlman@vbi.vt.edu

Actors/Stakeholders and their roles and responsibilities

 

Goals

Provide a computing infrastructure that models social contagion processes (a minimal threshold-contagion sketch follows this use case).

The infrastructure enables different types of human-to-human interactions (e.g., face-to-face versus online media; mother-daughter relationships versus mother-coworker relationships) to be simulated. It takes not only human-to-human interactions into account, but also interactions among people, services (e.g., transportation), and infrastructure (e.g., internet, electric power).

Use Case Description

Social unrest. People take to the streets to voice unhappiness with government leadership. There are citizens that both support and oppose the government. Quantify the degree to which normal business and activities are disrupted owing to fear and anger. Quantify the possibility of peaceful demonstrations and of violent protests. Quantify the potential for government responses ranging from appeasement, to allowing protests, to issuing threats against protestors, to actions to thwart protests. To address these issues, fine-resolution models and datasets are needed.

Current

Solutions

Compute(System)

Distributed processing software running on commodity clusters and newer architectures and systems (e.g., clouds).

Storage

File servers (including archives), databases.

Networking

Ethernet, Infiniband, and similar.

Software

Specialized simulators, open source software, and proprietary modeling environments. Databases.

Big Data
Characteristics

Data Source (distributed/centralized)

Many data sources: populations, work locations, travel patterns, utilities (e.g., power grid) and other man-made infrastructures, online (social) media.

Volume (size)

Easily 10s of TB per year of new data.

Velocity

(e.g. real time)

During social unrest events, human interactions and mobility are key to understanding system dynamics. Data change rapidly; e.g., who follows whom on Twitter.

Variety

(multiple datasets, mashup)

Variety of data seen in a wide range of data sources. Temporal data. Data fusion.

 

Data fusion is a big issue: how to combine data from different sources and how to deal with missing or incomplete data? Multiple simultaneous contagion processes.

Variability (rate of change)

Because of the stochastic nature of events, multiple instances of models and inputs must be run to determine the ranges of outcomes.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Failover of soft real-time analyses.

Visualization

Large datasets; time evolution; multiple contagion processes over multiple network representations. Levels of detail (e.g., individual, neighborhood, city, state, country-level).

Data Quality (syntax)

Checks for ensuring data consistency, corruption. Preprocessing of raw data for use in models.

Data Types

Wide-ranging data, from human characteristics to utilities and transportation systems, and interactions among them.

Data Analytics

Models of behavior of humans and hard infrastructures, and their interactions. Visualization of results.

Big Data Specific Challenges (Gaps)

How to take into account heterogeneous features of 100s of millions or billions of individuals, models of cultural variations across countries that are assigned to individual agents? How to validate these large models? Different types of models (e.g., multiple contagions): disease, emotions, behaviors. Modeling of different urban infrastructure systems in which humans act. With multiple replicates required to assess stochasticity, large amounts of output data are produced; storage requirements.

Big Data Specific Challenges in Mobility

How and where to perform these computations? Combinations of cloud computing and clusters. How to realize most efficient computations; move data to compute resources?

Security and Privacy

Requirements

Two dimensions. First, privacy and anonymity issues for individuals used in modeling (e.g., Twitter and Facebook users). Second, securing data and computing platforms for computation.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Fusion of different data types. Different datasets must be combined depending on the particular problem. How to quickly develop, verify, and validate new models for new applications. What is appropriate level of granularity to capture phenomena of interest while generating results sufficiently quickly; i.e., how to achieve a scalable solution. Data visualization and extraction at different levels of granularity.
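
A minimal, hypothetical sketch of a threshold-based social contagion on a synthetic interaction network: an agent becomes active once the fraction of active contacts exceeds a personal threshold. The network, thresholds, and seeds are invented; the production simulators couple multiple contagions and infrastructure models.

```python
# Minimal sketch of a threshold-based social contagion: a person adopts a behavior
# (e.g., joins a protest) once the fraction of already-active contacts exceeds a
# personal threshold. The network and thresholds are toy values, not project data.
import random
import networkx as nx

random.seed(2)
G = nx.erdos_renyi_graph(n=300, p=0.03)                 # toy interaction network
threshold = {n: random.uniform(0.1, 0.5) for n in G}    # per-agent adoption threshold
active = set(random.sample(list(G), 10))                # initially active agents

changed = True
while changed:                                          # iterate to a fixed point
    changed = False
    for node in G:
        if node in active:
            continue
        neighbors = list(G.neighbors(node))
        if not neighbors:
            continue
        frac = sum(1 for nb in neighbors if nb in active) / len(neighbors)
        if frac >= threshold[node]:
            active.add(node)
            changed = True

print(f"{len(active)} of {G.number_of_nodes()} agents active at fixed point")
```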

More Information (URLs)

 

 

 

 

Healthcare and Life Sciences> Use Case 25: LifeWatch Biodiversity

Use Case Title

LifeWatch – E-Science European Infrastructure for Biodiversity and Ecosystem Research

Vertical (area)

Scientific Research: Life Science

Author/Company/Email

Wouter Los, Yuri Demchenko (y.demchenko@uva.nl), University of Amsterdam

Actors/Stakeholders and their roles and responsibilities

End-users (biologists, ecologists, field researchers)

Data analysts, data archive managers, e-Science Infrastructure managers, EU states national representatives

Goals

Research and monitor different ecosystems, biological species, their dynamics and migration.

Use Case Description

The LifeWatch project and initiative intends to provide integrated access to a variety of data and to analytical and modeling tools served by a variety of collaborating initiatives. Another service offers data and tools in selected workflows for specific scientific communities. In addition, LifeWatch will provide opportunities to construct personalized ‘virtual labs’, also allowing users to enter new data and analytical tools.

New data will be shared with the data facilities cooperating with LifeWatch.

Particular case studies: Monitoring alien species, monitoring migrating birds, wetlands

LifeWatch operates the Global Biodiversity Information Facility and the Biodiversity Catalogue (the Biodiversity Science Web Services Catalogue).

Current

Solutions

Compute(System)

Field facilities TBD

Data center: General Grid and cloud based resources provided by national e-Science centers

Storage

Distributed, historical and trends data archiving

Networking

May require special dedicated or overlay sensor network.

Software

Web Services based, Grid based services, relational databases

Big Data
Characteristics

Data Source (distributed/centralized)

Ecological information from numerous observation and monitoring facilities and sensor network, satellite images/information, climate and weather, all recorded information.

Information from field researchers

Volume (size)

Involves many existing data sets/sources

Collected amount of data TBD

Velocity

(e.g. real time)

Data are analyzed incrementally; processing dynamics correspond to the dynamics of the biological and ecological processes.

However, real-time processing and analysis may be required in case of natural or industrial disasters.

Stream processing of data may also be required.

Variety

(multiple datasets, mashup)

The variety and number of involved databases and observation data are currently limited by available tools; in principle, they are unlimited with the growing ability to process data for identifying ecological changes, factors/reasons, species evolution, and trends.

See below in additional information.

Variability (rate of change)

Structure of the datasets and models may change depending on the data processing stage and tasks

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

In normal monitoring mode, data are statistically processed to achieve robustness.

Some biodiversity research is critically dependent on data veracity (reliability/trustworthiness).

In case of natural and technogenic disasters data veracity is critical.

Visualization

Requires advanced and rich visualization, high definition visualization facilities, visualization data

  • 4D visualization
  • Visualizing effects of parameter change in (computational) models
  • Comparing model outcomes with actual observations (multi dimensional)

Data Quality

Depends on, and is ensured by, the initial observation data.

The quality of analytical data depends on the models and algorithms used, which are constantly improved.

Repeating data analytics should be possible to re-evaluate initial observation data.

Actionable data are human aided.

Data Types

Multi-type.

Relational data, key-value, complex semantically rich data

Data Analytics

Parallel data streams and streaming analytics (a minimal streaming-statistics sketch follows this use case)

Big Data Specific Challenges (Gaps)

Variety, multi-type data: SQL and no-SQL, distributed multi-source data.

Visualization, distributed sensor networks.

Data storage and archiving, data exchange and integration; data linkage: from the initial observation data to processed data and reported/visualized data.

  • Historical unique data
  • Curated (authorized) reference data (i.e. species names lists), algorithms, software code, workflows
  • Processed (secondary) data serving as input for other researchers
  • Provenance (and persistent identification (PID)) control of data, algorithms, and workflows

Big Data Specific Challenges in Mobility

Require supporting mobile sensors (e.g. birds migration) and mobile researchers (both for information feed and catalogue search)

  • Instrumented field vehicles, Ships, Planes, Submarines, floating buoys, sensor tagging on organisms
  • Photos, video, sound recording

Security and Privacy

Requirements

Data integrity, referral integrity of the datasets.

Federated identity management for mobile researchers and mobile sensors

Confidentiality, access control and accounting for information on protected species, ecological information, space images, climate information.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

  • Support of distributed sensor networks
  • Multi-type data combination and linkage; potentially unlimited data variety
  • Data lifecycle management: data provenance, referral integrity and identification
  • Access and integration of multiple distributed databases

More Information (URLs)

http://www.lifewatch.eu/web/guest/home

https://www.biodiversitycatalogue.org/

Note:

Variety of data used in Biodiversity research

Genetic (genomic) diversity

  • DNA sequences and barcodes
  • Metabolomics functions

Species information

  • species names
  • occurrence data (in time and place)
  • species traits and life history data
  • host-parasite relations
  • collection specimen data

Ecological information

  • biomass, trunk/root diameter and other physical characteristics
  • population density etc.
  • habitat structures
  • C/N/P etc. molecular cycles

Ecosystem data

  • species composition and community dynamics
  • remote and earth observation data
  • CO2 fluxes
  • Soil characteristics
  • Algal blooming
  • Marine temperature, salinity, pH, currents, etc.

Ecosystem services

  • productivity (i.e., biomass production/time)
  • fresh water dynamics
  • erosion
  • climate buffering
  • genetic pools

Data concepts

  • conceptual framework of each data
  • ontologies
  • provenance data

Algorithms and workflows

  • software code and provenance
  • tested workflows

Multiple sources of data and information

  • Specimen collection data
  • Observations (human interpretations)
  • Sensors and sensor networks (terrestrial, marine, soil organisms), bird etc. tagging
  • Aerial and satellite observation spectra
  • Field and laboratory experimentation
  • Radar and LiDAR
  • Fisheries and agricultural data
  • Diseases and epidemics
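
A minimal, hypothetical sketch of the streaming analytics mentioned under Data Analytics: per-sensor running statistics (Welford's online algorithm) with a simple deviation-based anomaly flag. The sensor name and readings below are invented.

```python
# Minimal sketch of streaming analytics over sensor observations: maintain running
# statistics per sensor (Welford's online algorithm) and flag anomalous readings,
# e.g., a sudden temperature spike. The sensor name and readings are invented.
import math

class OnlineStats:
    """Running mean/variance without storing the stream (Welford's method)."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0
    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

stream = [("buoy_7", t) for t in (14.1, 14.3, 13.9, 14.2, 14.0, 19.8, 14.1)]
stats = {}
for sensor, value in stream:
    s = stats.setdefault(sensor, OnlineStats())
    if s.n > 3 and s.std() > 0 and abs(value - s.mean) > 4 * s.std():
        print(f"anomaly on {sensor}: {value} (mean so far {s.mean:.1f})")
    s.update(value)
```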

 

 

 

Deep Learning and Social Media> Use Case 26: Large-scale Deep Learning

Use Case Title

Large-scale Deep Learning

Vertical (area)

Machine Learning/AI

Author/Company/Email

Adam Coates / Stanford University / acoates@cs.stanford.edu

Actors/Stakeholders and their roles and responsibilities

Machine learning researchers and practitioners faced with large quantities of data and complex prediction tasks. Supports state-of-the-art development in computer vision as in automatic car driving, speech recognition, and natural language processing in both academic and industry systems.

Goals

Increase the size of datasets and models that can be tackled with deep learning algorithms. Large models (e.g., neural networks with more neurons and connections) combined with large datasets are increasingly the top performers in benchmark tasks for vision, speech, and NLP.

Use Case Description

A research scientist or machine learning practitioner wants to train a deep neural network from a large (>>1TB) corpus of data (typically imagery, video, audio, or text). Such training procedures often require customization of the neural network architecture, learning criteria, and dataset pre-processing. In addition to the computational expense demanded by the learning algorithms, the need for rapid prototyping and ease of development is extremely high.

Current

Solutions

Compute(System)

GPU cluster with high-speed interconnects (e.g., Infiniband, 40gE)

Storage

100TB Lustre filesystem

Networking

Infiniband within HPC cluster; 1G ethernet to outside infrastructure (e.g., Web, Lustre).

Software

In-house GPU kernels and MPI-based communication developed by Stanford CS. C++/Python source.

Big Data
Characteristics

Data Source (distributed/centralized)

Centralized filesystem with a single large training dataset. Dataset may be updated with new training examples as they become available.

Volume (size)

Current datasets typically 1 to 10 TB. With increases in computation that enable much larger models, datasets of 100TB or more may be necessary in order to exploit the representational power of the larger models. Training a self-driving car could take 100 million images.

Velocity

(e.g. real time)

Much faster than real-time processing is required. Current computer vision applications involve processing hundreds of image frames per second in order to ensure reasonable training times. For demanding applications (e.g., autonomous driving) we envision the need to process many thousand high-resolution (6 megapixels or more) images per second.

Variety

(multiple datasets, mashup)

Individual applications may involve a wide variety of data. Current research involves neural networks that actively learn from heterogeneous tasks (e.g., learning to perform tagging, chunking and parsing for text, or learning to read lips from combinations of video and audio).

Variability (rate of change)

Low variability. Most data is streamed in at a consistent pace from a shared source. Due to high computational requirements, server loads can introduce burstiness into data transfers.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

Datasets for ML applications are often hand-labeled and verified. Extremely large datasets involve crowd-sourced labeling and invite ambiguous situations where a label is not clear. Automated labeling systems still require human sanity checks. Clever techniques for large dataset construction are an active area of research.

Visualization

Visualization of learned networks is an open area of research, though it is partly used as a debugging technique. Some visual applications involve visualizing predictions on test imagery.

Data Quality (syntax)

Some collected data (e.g., compressed video or audio) may involve unknown formats, codecs, or may be corrupted. Automatic filtering of original source data removes these.

Data Types

Images, video, audio, text. (In practice: almost anything.)

Data Analytics

Small degree of batch statistical pre-processing; all other data analysis is performed by the learning algorithm itself.

Big Data Specific Challenges (Gaps)

Processing requirements for even modest quantities of data are extreme. Though the trained representations can make use of many terabytes of data, the primary challenge is in processing all of the data during training. Current state-of-the-art deep learning systems are capable of using neural networks with more than 10 billion free parameters (akin to synapses in the brain), and necessitate trillions of floating point operations per training example. Distributing these computations over high-performance infrastructure is a major challenge for which we currently use a largely custom software system (a minimal data-parallel training sketch follows this use case).

Big Data Specific Challenges in Mobility

After training of large neural networks is completed, the learned network may be copied to other devices with dramatically lower computational capabilities for use in making predictions in real time. (E.g., in autonomous driving, the training procedure is performed using a HPC cluster with 64 GPUs. The result of training, however, is a neural network that encodes the necessary knowledge for making decisions about steering and obstacle avoidance. This network can be copied to embedded hardware in vehicles or sensors.)

Security and Privacy

Requirements

None.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Deep Learning shares many characteristics with the broader field of machine learning. The paramount requirements are high computational throughput for mostly dense linear algebra operations, and extremely high productivity. Most deep learning systems require a substantial degree of tuning on the target application for best performance and thus necessitate a large number of experiments with designer intervention in between. As a result, minimizing the turn-around time of experiments and accelerating development is crucial.

 

These two requirements (high throughput and high productivity) are dramatically in contention. HPC systems are available to accelerate experiments, but current HPC software infrastructure is difficult to use which lengthens development and debugging time and, in many cases, makes otherwise computationally tractable applications infeasible.

 

The major components needed for these applications (which are currently in-house custom software) involve dense linear algebra on distributed-memory HPC systems. While libraries for single-machine or single-GPU computation are available (e.g., BLAS, CuBLAS, MAGMA, etc.), distributed computation of dense BLAS-like or LAPACK-like operations on GPUs remains poorly developed. Existing solutions (e.g., ScaLapack for CPUs) are not well-integrated with higher level languages and require low-level programming which lengthens experiment and development time.

More Information (URLs)

Recent popular press coverage of deep learning technology:

http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html

http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html

http://www.wired.com/wiredenterprise/2013/06/andrew_ng/

A recent research paper on HPC for Deep Learning: http://www.stanford.edu/~acoates/papers/CoatesHuvalWangWuNgCatanzaro_icml2013.pdf

Widely-used tutorials and references for Deep Learning:

http://ufldl.stanford.edu/wiki/index.php/Main_Page

http://deeplearning.net/
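
A minimal, hypothetical sketch of the synchronous data-parallel training pattern behind MPI-based systems such as the one described above: each rank computes a gradient on its local data shard, and the gradients are averaged with an allreduce. The linear model and random data are toy stand-ins for a deep network and a real training corpus.

```python
# Minimal sketch of data-parallel training with MPI gradient averaging, the
# communication pattern behind MPI-based training systems. Run with, e.g.,
# `mpiexec -n 4 python train_sketch.py`; the model and data are toy stand-ins.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

rng = np.random.default_rng(rank)         # each rank holds its own shard of data
dim, lr = 32, 0.1
weights = np.zeros(dim)
true_w = np.ones(dim)                     # synthetic target the ranks try to recover

for step in range(100):
    X = rng.normal(size=(64, dim))        # local mini-batch on this rank
    y = X @ true_w + rng.normal(scale=0.01, size=64)
    grad = 2.0 * X.T @ (X @ weights - y) / len(y)

    # Average gradients across all ranks (synchronous data parallelism).
    avg_grad = np.empty_like(grad)
    comm.Allreduce(grad, avg_grad, op=MPI.SUM)
    avg_grad /= size

    weights -= lr * avg_grad              # identical update applied on every rank

if rank == 0:
    print("parameter error:", float(np.linalg.norm(weights - true_w)))
```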

 

 

 

Deep Learning and Social Media> Use Case 27: Large Scale Consumer Photos Organization

Use Case Title

Organizing large-scale, unstructured collections of consumer photos

Vertical (area)

(Scientific Research: Artificial Intelligence)

Author/Company/Email

David Crandall, Indiana University, djcran@indiana.edu

Actors/Stakeholders and their roles and responsibilities

Computer vision researchers (to push forward state of art), media and social network companies (to help organize large-scale photo collections), consumers (browsing both personal and public photo collections), researchers and others interested in producing cheap 3d models (archaeologists, architects, urban planners, interior designers…)

Goals

Produce 3d reconstructions of scenes using collections of millions to billions of consumer images, where neither the scene structure nor the camera positions are known a priori. Use resulting 3d models to allow efficient and effective browsing of large-scale photo collections by geographic position. Geolocate new images by matching to 3d models. Perform object recognition on each image.

Use Case Description

3d reconstruction is typically posed as a robust non-linear least squares optimization problem in which observed (noisy) correspondences between images are constraints and the unknowns are the 6-d camera pose of each image and the 3-d position of each point in the scene. Sparsity and the large degree of noise in the constraints typically make naïve techniques fall into local minima that are not close to the actual scene structure. Typical specific steps are: (1) extracting features from images, (2) matching images to find pairs with common scene structures, (3) estimating an initial solution that is close to the scene structure and/or camera parameters, (4) optimizing the non-linear objective function directly. Of these, (1) is embarrassingly parallel. (2) is an all-pairs matching problem, usually with heuristics to reject unlikely matches early on. We solve (3) using discrete optimization, i.e., probabilistic inference on a graph (Markov Random Field), followed by robust Levenberg-Marquardt in continuous space. Others solve (3) by solving (4) for a small number of images and then incrementally adding new images, using the output of the last round as initialization for the next round. (4) is typically solved with Bundle Adjustment, a non-linear least squares solver that is optimized for the particular constraint structure that occurs in 3d reconstruction problems (a minimal non-linear least squares sketch follows this use case). Image recognition problems are typically embarrassingly parallel, although learning object models involves learning a classifier (e.g., a Support Vector Machine), a process that is often hard to parallelize.

Current

Solutions

Compute(System)

Hadoop cluster (about 60 nodes, 480 core)

Storage

Hadoop DFS and flat files

Networking

Simple Unix

Software

Hadoop Map-reduce, simple hand-written multithreaded tools (ssh and sockets for communication)

Big Data
Characteristics

Data Source (distributed/centralized)

Publicly-available photo collections, e.g. on Flickr, Panoramio, etc.

Volume (size)

500+ billion photos on Facebook, 5+ billion photos on Flickr.

Velocity

(e.g. real time)

100+ million new photos added to Facebook per day.

Variety

(multiple datasets, mashup)

Images and metadata, including EXIF tags (focal distance, camera type, etc.)

Variability (rate of change)

Rate of photos varies significantly, e.g. roughly 10x photos to Facebook on New Years versus other days. Geographic distribution of photos follows long-tailed distribution, with 1000 landmarks (totaling only about 100 square km) accounting for over 20% of photos on Flickr.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues)

Important to make as accurate as possible, subject to limitations of computer vision technology.

Visualization

Visualize large-scale 3-d reconstructions, and navigate large-scale collections of images that have been aligned to maps.

Data Quality

Features observed in images are quite noisy due both to imperfect feature extraction and to non-ideal properties of specific images (lens distortions, sensor noise, image effects added by user, etc.)

Data Types

Images, metadata

Data Analytics

 

Big Data Specific Challenges (Gaps)

Analytics needs continued monitoring and improvement.

Big Data Specific Challenges in Mobility

Many/most images are captured by mobile devices; eventual goal is to push reconstruction and organization to phone to allow real-time interaction with the user.

Security and Privacy

Requirements

Need to preserve privacy for users and digital rights for media.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Components of this use case including feature extraction, feature matching, and large-scale probabilistic inference appear in many or most computer vision and image processing problems, including recognition, stereo resolution, image denoising, etc.

More Information (URLs)

http://vision.soic.indiana.edu/disco
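
A minimal, hypothetical sketch of the robust non-linear least squares refinement in step (4): jointly estimating toy one-dimensional "camera" offsets and "point" positions from noisy relative observations with scipy. Real bundle adjustment optimizes 6-d camera poses and 3-d points with specialized sparse solvers; this only shows the shape of the problem.

```python
# Minimal sketch of the robust non-linear least squares refinement step (4):
# jointly estimate toy 1-D "camera" offsets and "point" positions from noisy
# relative observations. Real bundle adjustment optimizes 6-D camera poses and
# 3-D points with specialized sparse solvers; this is only the shape of the problem.
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
true_cams = np.array([0.0, 2.0, 5.0])          # unknown camera offsets (camera 0 fixed)
true_pts = np.array([1.0, 3.0, 4.0, 7.0])      # unknown point positions

# Observation obs[i, j] = point_i - camera_j + noise (every camera sees every point).
obs = true_pts[:, None] - true_cams[None, :] + rng.normal(scale=0.05, size=(4, 3))

def residuals(params):
    cams = np.concatenate([[0.0], params[:2]])  # fix camera 0 to remove gauge freedom
    pts = params[2:]
    return (pts[:, None] - cams[None, :] - obs).ravel()

x0 = np.zeros(2 + 4)                                 # naive initialization
fit = least_squares(residuals, x0, loss="soft_l1")   # robust loss against outliers

print("estimated cameras:", np.round(np.concatenate([[0.0], fit.x[:2]]), 2))
print("estimated points: ", np.round(fit.x[2:], 2))
```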

 

 

 

Deep Learning and Social Media> Use Case 28: Truthy Twitter Data Analysis

Use Case Title

Truthy: Information diffusion research from Twitter Data

Vertical (area)

Scientific Research: Complex Networks and Systems research

Author/Company/Email

Filippo Menczer, Indiana University, fil@indiana.edu;

Alessandro Flammini, Indiana University, aflammin@indiana.edu;

Emilio Ferrara, Indiana University, ferrarae@indiana.edu;

Actors/Stakeholders and their roles and responsibilities

Research funded by NSF, DARPA, and the McDonnell Foundation.

Goals

Understanding how communication spreads on socio-technical networks. Detecting potentially harmful information spread at the early stage (e.g., deceiving messages, orchestrated campaigns, untrustworthy information, etc.)

Use Case Description

(1) Acquisition and storage of a large volume of continuous streaming data from Twitter (~100 million messages per day, ~500GB data/day increasing over time); (2) near real-time analysis of such data, for anomaly detection, stream clustering, signal classification and online-learning; (3) data retrieval, Big Data visualization, data-interactive Web interfaces, public API for data querying.

Current

Solutions

Compute(System)

Current: in-house cluster hosted by Indiana University. Critical requirement: large cluster for data storage, manipulation, querying and analysis.

Storage

Current: Raw data stored in large compressed flat files, since August 2010. Need to move towards Hadoop/IndexedHBase and HDFS distributed storage. Redis as an in-memory database as a buffer for real-time analysis.

Networking

10GB/Infiniband required.

Software

Hadoop, Hive, Redis for data management.

Python/SciPy/NumPy/MPI for data analysis.

Big Data
Characteristics

Data Source (distributed/centralized)

Distributed – with replication/redundancy

Volume (size)

~30TB/year compressed data

Velocity (e.g. real time)

Near real-time data storage, querying and analysis

Variety (multiple datasets, mashup)

Data schema provided by the social media data source. Currently using Twitter only. We plan to expand by incorporating Google+ and Facebook.

Variability (rate of change)

Continuous real-time data stream incoming from each source.

Big Data Science (collection, curation,

analysis,

action)

Veracity (Robustness Issues, semantics)

99.99% uptime required for real-time data acquisition. Service outages might corrupt data integrity and significance.

Visualization

Information diffusion, clustering, and dynamic network visualization capabilities already exist.

Data Quality (syntax)

Data are structured in standardized formats, and the overall quality is extremely high. We generate aggregated statistics, expand the feature set, etc., generating high-quality derived data.

Data Types

Fully structured data (JSON format) enriched with user metadata, geo-locations, etc.

Data Analytics

Stream clustering: data are aggregated according to topics, metadata, and additional features, using ad hoc online clustering algorithms. Classification: using multi-dimensional time series to generate network, user, geographical, and content features, etc., we classify information produced on the platform. Anomaly detection: real-time identification of anomalous events (e.g., induced by exogenous factors); a minimal anomaly-detection sketch follows this use case. Online learning: applying machine learning/deep learning methods to real-time analysis of information diffusion patterns, user profiling, etc.

Big Data Specific Challenges (Gaps)

Dealing with real-time analysis of large volumes of data. Providing a scalable infrastructure to allocate resources, storage space, etc. on demand if required by data volume growing over time.

Big Data Specific Challenges in Mobility

Implementing low-level data storage infrastructure features to guarantee efficient, mobile access to data.

Security and Privacy Requirements

The data collected by our platform are publicly released by Twitter. However, the data sources incorporate user metadata which, while generally not sufficient to uniquely identify individuals, requires policies for secure data storage and privacy protection.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Definition of high-level data schema to incorporate multiple data-sources providing similarly structured data.

More Information (URLs)

http://truthy.indiana.edu/

http://cnets.indiana.edu/groups/nan/truthy

http://cnets.indiana.edu/groups/nan/despic

 

 

 

Deep Learning and Social Media> Use Case 29: Crowd Sourcing in the Humanities

Use Case Title

Crowd Sourcing in the Humanities as Source for Big and Dynamic Data

Vertical (area)

Humanities, Social Sciences

Author/Company/Email

Sebastian Drude <Sebastian.Drude@mpi.nl>, Max Planck Institute for Psycholinguistics (MPI)

Actors/Stakeholders and their roles and responsibilities

Scientists (Sociologists, Psychologists, Linguists, Political Scientists, Historians, etc.), data managers and analysts, data archives

The general public as data providers and participants

Goals

Capture information (manually entered, recorded multimedia, reaction times, pictures, sensor information) from many individuals and their devices.

Thus capture wide-ranging individual, social, cultural, and linguistic variation along several dimensions (space, social space, time).

Use Case Description

Many different possible use cases: obtain recordings of language usage (words, sentences, meaning descriptions, etc.), answers to surveys, information on cultural facts, transcriptions of pictures and texts; correlate these with other phenomena; detect new cultural practices, behaviors, values, and beliefs; discover individual variation.

Current Solutions

Compute(System)

Individual systems for manual data collection (mostly Websites)

Storage

Traditional servers

Networking

barely used other than for data entry via web

Software

XML technology, traditional relational databases for storing pictures, not much multi-media yet.

Big Data Characteristics

Data Source (distributed/centralized)

Distributed, individual contributors via webpages and mobile devices

Volume (size)

Varies dramatically by project, from hundreds to millions of data records.

Depending on data-type: from gigabytes (text, surveys, experiment values) to hundreds of terabytes (multimedia)

Velocity (e.g. real time)

Depends very much on project: dozens to thousands of new data records per day

Data has to be analyzed incrementally.

Variety (multiple datasets, mashup)

So far mostly homogeneous small datasets; large distributed heterogeneous datasets are expected, which have to be archived as primary data.

Variability (rate of change)

Data structure and content of collections are changing during data lifecycle.

There is no critical variation in the rate of data production or in runtime characteristics.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Noisy data and unreliable metadata are possible; identification and pre-selection of appropriate data is needed.

Visualization

important for interpretation, no special visualization techniques

Data Quality

validation is necessary; quality of recordings, quality of content, spam

Data Types

individual data records (survey answers, reaction times);

text (e.g., comments, transcriptions,…);

multi-media (pictures, audio, video)

Data Analytics

pattern recognition of all kind (e.g., speech recognition, automatic A&V analysis, cultural patterns), identification of structures (lexical units, linguistic rules, etc.)

Big Data Specific Challenges (Gaps)

Data management (metadata, provenance info, data identification with PIDs)

Data curation

Digitizing existing audio-video, photo and documents archives

Big Data Specific Challenges in Mobility

Include data from sensors of mobile devices (position, etc.);

Data collection from expeditions and field research.

Security and Privacy Requirements

Privacy issues may be involved (A/V from individuals), anonymization may be necessary but not always possible (A/V analysis, small speech communities)

Archive and metadata integrity, long term preservation

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Many individual data entries from many individuals, constant flux of data entry, metadata assignment, etc.

Offline vs. online use, to be synchronized later with central database.

Giving significant feedback to contributors.

More Information (URLs)

---

Note: Crowdsourcing has barely begun to be used on a larger scale. With the availability of mobile devices, there is now a huge potential for collecting data from many individuals, also making use of the sensors in mobile devices. This has not been explored on a large scale so far; existing crowdsourcing projects are usually limited in scale and web-based.

 

 

Deep Learning and Social Media> Use Case 30: CINET Network Science Cyberinfrastructure

Use Case Title

CINET: Cyberinfrastructure for Network (Graph) Science and Analytics

Vertical (area)

Network Science

Author/Company/Email

Team led by Virginia Tech, comprising researchers from Indiana University, University at Albany, North Carolina A&T, Jackson State University, University of Houston-Downtown, and Argonne National Laboratory

Point of Contact: Madhav Marathe or Keith Bisset, Network Dynamics and Simulation Science Laboratory, Virginia Bioinformatics Institute, Virginia Tech, mmarathe@vbi.vt.edu / kbisset@vbi.vt.edu

Actors/Stakeholders and their roles and responsibilities

Researchers, practitioners, educators and students interested in the study of networks.

Goals

CINET cyberinfrastructure middleware to support network science. This middleware will give researchers, practitioners, teachers, and students access to a computational and analytic environment for research, education, and training. The user interface provides lists of available networks and network analysis modules (implemented algorithms for network analysis). A user, for example a researcher in the network science area, can select one or more networks and analyze them with the available network analysis tools and modules. A user can also generate random networks following various random graph models. Teachers and students can use CINET in the classroom to demonstrate various graph-theoretic properties and the behaviors of various algorithms. A user is also able to add a network or network analysis module to the system. This feature allows CINET to grow easily and remain up to date with the latest algorithms.

The goal is to provide a common web-based platform for accessing various (i) network and graph analysis tools such as SNAP, NetworkX, Galib, etc. (ii) real-world and synthetic networks, (iii) computing resources and (iv) data management systems to the end-user in a seamless manner.
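
As an illustration of the kind of operation a CINET analysis module performs, here is a minimal sketch that generates a random network from a standard model and runs a few structural analyses on it, using NetworkX (one of the graph libraries listed under Software below). The model parameters and the chosen measures are illustrative only.

```python
# Minimal sketch: generate a random network from a standard model and run a
# few structural analyses on it, the kind of module a CINET user can invoke.
# Uses NetworkX; parameters and measures are illustrative only.
import networkx as nx

# Erdos-Renyi random graph: 1,000 nodes, edge probability 0.01
G = nx.erdos_renyi_graph(n=1000, p=0.01, seed=42)

print("nodes:", G.number_of_nodes())
print("edges:", G.number_of_edges())
print("average clustering coefficient:", nx.average_clustering(G))
print("connected components:", nx.number_connected_components(G))

# Degree distribution summary, a typical classroom demonstration.
degrees = [d for _, d in G.degree()]
print("average degree:", sum(degrees) / len(degrees))
```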

Use Case Description

Users can run one or more structural or dynamic analyses on a set of selected networks. The domain-specific language allows users to develop flexible, high-level workflows to define more complex network analyses.

Current Solutions

Compute(System)

A high-performance computing cluster (Dell C6100), named Shadowfax, with 60 compute nodes and 12 processors (Intel Xeon X5670, 2.93 GHz) per compute node, for a total of 720 processors and 4 GB of main memory per processor.

Shared-memory systems and EC2-based clouds are also used.

Some of the codes and networks can utilize single-node systems and are therefore currently being mapped to the Open Science Grid.

Storage

628 TB GPFS

Networking

Internet, InfiniBand. A loose collection of supercomputing resources.

Software

Graph libraries: Galib, NetworkX.

Distributed Workflow Management: Simfrastructure, databases, semantic web tools

Big Data Characteristics

Data Source (distributed/centralized)

A single network remains in a single disk file accessible by multiple processors. However, during the execution of a parallel algorithm, the network can be partitioned and the partitions are loaded in the main memory of multiple processors.

Volume (size)

Can be hundreds of GB for a single network.

Velocity (e.g. real time)

Two types of change: (i) the networks are very dynamic, and (ii) as the repository grows, we expect rapid growth to over 1,000-5,000 networks and methods in about a year.

Variety (multiple datasets, mashup)

Data sets are varied: (i) directed as well as undirected networks, (ii) static and dynamic networks, (iii) labeled networks, and (iv) networks that can have dynamics running over them.

Variability (rate of change)

The volume of graph-based data is growing at an increasing rate. Moreover, other life sciences domains are increasingly using graph-based techniques to address their problems. Hence, we expect both the data and the computation to grow at a significant pace.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues, semantics)

Challenging due to asynchronous distributed computation. Current systems are designed for real-time synchronous response.

Visualization

As the input graph size grows the visualization system on client side is stressed heavily both in terms of data and compute.

Data Quality (syntax)

 

Data Types

 

Data Analytics

 

Big Data Specific Challenges (Gaps)

Parallel algorithms are necessary to analyze massive networks. Unlike many forms of structured data, network data is difficult to partition. The main difficulty in partitioning a network is that different algorithms require different partitioning schemes for efficient operation. Moreover, most network measures are global in nature and require either (i) substantial duplication of data across the partitions or (ii) very large communication overhead resulting from the required movement of data. These issues become significant challenges for big networks (see the sketch after this section).

Computing dynamics over networks is harder since the network structure often interacts with the dynamical process being studied.

CINET enables a large class of operations across a wide variety of graphs, both in terms of structure and size. Unlike other compute- and data-intensive systems, such as parallel databases or CFD, performance on graph computation is sensitive to the underlying architecture. Hence, a unique challenge in CINET is managing the mapping from workload (graph type + operation) to a machine whose architecture and runtime are conducive to that workload.

Data manipulation and bookkeeping of derived data for users is another big challenge, since unlike enterprise data there are no well-defined and effective models and tools for managing various graph data in a unified fashion.
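
To make the partitioning issue above concrete, here is a minimal sketch that evaluates a naive two-way split of a network by its edge cut, i.e., the number of edges crossing partitions, which is a rough proxy for the communication (or data duplication) overhead a distributed algorithm would incur. It assumes NetworkX; the graph model and the naive partition are illustrative only.

```python
# Minimal sketch of why partitioning is hard: a naive two-way split of a
# network is scored by its edge cut, the number of edges whose endpoints
# land in different partitions and thus require communication or duplication.
# Assumes NetworkX; graph model and partition scheme are illustrative only.
import networkx as nx

G = nx.barabasi_albert_graph(n=2000, m=3, seed=1)

nodes = list(G.nodes())
half = len(nodes) // 2
part_a, part_b = set(nodes[:half]), set(nodes[half:])

cut = nx.cut_size(G, part_a, part_b)  # edges crossing the partition boundary
print("total edges:", G.number_of_edges())
print("edges crossing the naive partition:", cut)
print("fraction requiring communication:", cut / G.number_of_edges())
```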

Big Data Specific Challenges in Mobility

 

Security and Privacy Requirements

 

Highlight issues for generalizing this use case (e.g. for ref. architecture)

HPC as a service. As data volumes grow, an increasingly large number of application domains, such as the biological sciences, need to use HPC systems. CINET can be used to deliver the compute resources necessary for such domains.

More Information (URLs)

http://cinet.vbi.vt.edu/cinet_new/

 

 

Deep Learning and Social Media> Use Case 31: NIST Analytic Technology Measurement and Evaluations

Use Case Title

NIST Information Access Division analytic technology performance measurement, evaluations, and standards

Vertical (area)

Analytic technology performance measurement and standards for government, industry, and academic stakeholders

Author/Company/Email

John Garofolo (john.garofolo@nist.gov)

Actors/Stakeholders and their roles and responsibilities

NIST developers of measurement methods, data contributors, analytic algorithm developers, and users of analytic technologies for unstructured, semi-structured, and heterogeneous data across all sectors.

Goals

Accelerate the development of advanced analytic technologies for unstructured, semi-structured, and heterogeneous data through performance measurement and standards. Focus communities of interest on analytic technology challenges of importance, create consensus-driven measurement metrics and methods for performance evaluation, evaluate the performance of the performance metrics and methods via community-wide evaluations which foster knowledge exchange and accelerate progress, and build consensus towards widely-accepted standards for performance measurement.

Use Case Description

Develop performance metrics, measurement methods, and community evaluations to ground and accelerate the development of advanced analytic technologies in the areas of speech and language processing, video and multimedia processing, biometric image processing, and heterogeneous data processing as well as the interaction of analytics with users. Typically employ one of two processing models: 1) Push test data out to test participants and analyze the output of participant systems, 2) Push algorithm test harness interfaces out to participants and bring in their algorithms and test them on internal computing clusters. Developing approaches to support scalable Cloud-based developmental testing. Also perform usability and utility testing on systems with users in the loop.
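
As an illustration of the second processing model above (participants implement a test-harness interface and their algorithms are run and scored internally), here is a minimal sketch. The interface, baseline system, and accuracy metric are hypothetical illustrations, not an actual NIST API or evaluation protocol.

```python
# Minimal sketch of an evaluation harness: a participant implements a small
# interface, and the evaluator runs the algorithm on held-out test items and
# scores the output against ground truth. Names and metric are hypothetical.
from abc import ABC, abstractmethod
from typing import Iterable, List


class AnalyticSystem(ABC):
    """Interface an evaluation participant would implement (hypothetical)."""

    @abstractmethod
    def process(self, items: Iterable[str]) -> List[str]:
        """Return one label/hypothesis per input item."""


class KeywordBaseline(AnalyticSystem):
    """Trivial participant system used only to exercise the harness."""

    def process(self, items):
        return ["relevant" if "data" in t.lower() else "other" for t in items]


def evaluate(system: AnalyticSystem, test_items, reference_labels) -> float:
    """Score a system's output against ground truth (simple accuracy here)."""
    hypotheses = system.process(test_items)
    correct = sum(h == r for h, r in zip(hypotheses, reference_labels))
    return correct / len(reference_labels)


if __name__ == "__main__":
    items = ["Big data analytics", "Weather report", "Data science methods"]
    truth = ["relevant", "other", "relevant"]
    print("accuracy:", evaluate(KeywordBaseline(), items, truth))
```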

Current Solutions

Compute (System)

Linux and OS-10 clusters; distributed computing with stakeholder collaborations; specialized image processing architectures.

Storage

RAID arrays; data distributed on 1-2 TB drives and occasionally via FTP. Distributed data distribution with stakeholder collaborations.

Networking

Fibre Channel disk storage, Gigabit Ethernet for system-to-system communication, general intranet and Internet resources within NIST, and shared networking resources with its stakeholders.

Software

Perl, Python, C/C++, MATLAB, and R development tools. Test and measurement applications are created from the ground up.

Big Data Characteristics

Data Source (distributed/centralized)

Large annotated corpora of unstructured/semi-structured text, audio, video, images, multimedia, and heterogeneous collections of the above including ground truth annotations for training, developmental testing, and summative evaluations.

Volume (size)

The test corpora exceed 900M Web pages occupying 30 TB of storage, 100M tweets, 100M ground-truthed biometric images, several hundred thousand partially ground-truthed video clips, and terabytes of smaller fully ground-truthed test collections. Even larger data collections are being planned for future evaluations of analytics involving multiple data streams and very heterogeneous data.

Velocity (e.g. real time)

Most legacy evaluations are focused on retrospective analytics. Newer evaluations are focusing on simulations of real-time analytic challenges from multiple data streams.

Variety (multiple datasets, mashup)

The test collections span a wide variety of analytic application types including textual search/extraction, machine translation, speech recognition, image and voice biometrics, object and person recognition and tracking, document analysis, human-computer dialogue, and multimedia search/extraction. Future test collections will include mixed type data and applications.

Variability (rate of change)

Evaluation of tradeoffs between accuracy and data rates as well as variable numbers of data streams and variable stream quality.

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues, semantics)

The creation and measurement of the uncertainty associated with the ground-truthing process – especially when humans are involved – is challenging. The manual ground-truthing processes that have been used in the past are not scalable. Performance measurement of complex analytics must include measurement of intrinsic uncertainty as well as ground truthing error to be useful.
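
One common proxy for quantifying ground-truthing uncertainty is inter-annotator agreement. Here is a minimal sketch that summarizes agreement between two human annotators with Cohen's kappa, assuming scikit-learn; the labels are hypothetical, and this captures only a small part of the uncertainty measurement discussed above.

```python
# Minimal sketch: quantify ground-truthing uncertainty via inter-annotator
# agreement, summarized with Cohen's kappa. Assumes scikit-learn; the two
# annotators' labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same ten video clips (hypothetical data).
annotator_1 = ["event", "event", "none", "event", "none",
               "none", "event", "none", "event", "none"]
annotator_2 = ["event", "none", "none", "event", "none",
               "none", "event", "event", "event", "none"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```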

Visualization

Visualization of analytic technology performance results and diagnostics including significance and various forms of uncertainty. Evaluation of analytic presentation methods to users for usability, utility, efficiency, and accuracy.

Data Quality (syntax)

The performance of analytic technologies is highly impacted by the quality of the data they are employed against with regard to a variety of domain- and application-specific variables. Quantifying these variables is a challenging research task in itself. Mixed sources of data and performance measurement of analytic flows pose even greater challenges with regard to data quality.

Data Types

Unstructured and semi-structured text, still images, video, audio, multimedia (audio+video).

Data Analytics

Information extraction, filtering, search, and summarization; image and voice biometrics; speech recognition and understanding; machine translation; video person/object detection and tracking; event detection; imagery/document matching; novelty detection; a variety of structural/semantic/temporal analytics and many subtypes of the above.

Big Data Specific Challenges (Gaps)

Scaling ground-truthing to larger data, intrinsic and annotation uncertainty measurement, performance measurement for incompletely annotated data, measuring analytic performance for heterogeneous data and analytic flows involving users.

Big Data Specific Challenges in Mobility

Moving training, development, and test data to evaluation participants or moving evaluation participants’ analytic algorithms to computational testbeds for performance assessment. Providing developmental tools and data. Supporting agile developmental testing approaches.

Security and Privacy Requirements

Analytic algorithms working with written language, speech, human imagery, etc. must generally be tested against real or realistic data. It’s extremely challenging to engineer artificial data that sufficiently captures the variability of real data involving humans. Engineered data may provide artificial challenges that may be directly or indirectly modeled by analytic algorithms and result in overstated performance. The advancement of analytic technologies themselves is increasing privacy sensitivities. Future performance testing methods will need to isolate analytic technology algorithms from the data the algorithms are tested against. Advanced architectures are needed to support security requirements for protecting sensitive data while enabling meaningful developmental performance evaluation. Shared evaluation testbeds must protect the intellectual property of analytic algorithm developers.

Highlight issues for generalizing this use case (e.g. for ref. architecture)

Scalability of analytic technology performance testing methods, source data creation, and ground truthing; approaches and architectures supporting developmental testing; protecting intellectual property of analytic algorithms and PII and other personal information in test data; measurement of uncertainty using partially-annotated data; composing test data with regard to qualities impacting performance and estimating test set difficulty; evaluating complex analytic flows involving multiple analytics, data types, and user interactions; multiple heterogeneous data streams and massive numbers of streams; mixtures of structured, semi-structured, and unstructured data sources; agile scalable developmental testing approaches and mechanisms.

More Information (URLs)

http://www.nist.gov/itl/iad/

 

 

 

The Ecosystem for Research> Use Case 32: DataNet Federation Consortium (DFC)

Use Case Title

DataNet Federation Consortium (DFC)

Vertical (area)

Collaboration Environments

Author/Company/Email

Reagan Moore / University of North Carolina at Chapel Hill / rwmoore@renci.org

Actors/Stakeholders and their roles and responsibilities

National Science Foundation research projects: Ocean Observatories Initiative (sensor archiving); Temporal Dynamics of Learning Center (Cognitive science data grid); the iPlant Collaborative (plant genomics); Drexel engineering digital library; Odum Institute for social science research (data grid federation with Dataverse).

Goals

Provide national infrastructure (collaboration environments) that enables researchers to collaborate through shared collections and shared workflows. Provide policy-based data management systems that enable the formation of collections, data grids, digital libraries, archives, and processing pipelines. Provide interoperability mechanisms that federate existing data repositories, information catalogs, and web services with collaboration environments.

Use Case Description

Promote collaborative and interdisciplinary research through federation of data management systems across federal repositories, national academic research initiatives, institutional repositories, and international collaborations. The collaboration environment runs at scale: petabytes of data, hundreds of millions of files, hundreds of millions of metadata attributes, tens of thousands of users, and a thousand storage resources.

Current Solutions

Compute(System)

Interoperability with workflow systems (NCSA Cyberintegrator, Kepler, Taverna)

Storage

Interoperability across file systems, tape archives, cloud storage, object-based storage

Networking

Interoperability across TCP/IP, parallel TCP/IP, RBUDP, HTTP

Software

Integrated Rule Oriented Data System (iRODS)
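
As a rough illustration of how a client interacts with an iRODS data grid, here is a minimal sketch assuming the python-irodsclient package. The host, zone, credentials, paths, and metadata attribute are placeholders, not DFC configuration.

```python
# Minimal sketch of registering a file into an iRODS collection and tagging
# it with metadata, assuming the python-irodsclient package. All connection
# details, paths, and attribute names are placeholders.
from irods.session import iRODSSession

with iRODSSession(host="data.example.org", port=1247,
                  user="alice", password="secret", zone="exampleZone") as session:
    local_file = "results.nc"                          # placeholder local file
    logical_path = "/exampleZone/home/alice/results.nc"

    # Register the file into the shared collection.
    session.data_objects.put(local_file, logical_path)

    # Attach descriptive metadata so the object is discoverable via queries.
    obj = session.data_objects.get(logical_path)
    obj.metadata.add("experiment", "ocean-observatory-demo")
    print("stored", obj.name, "with", len(obj.metadata.items()), "metadata item(s)")
```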

Big Data Characteristics

Data Source (distributed/centralized)

Manage internationally distributed data

Volume (size)

Petabytes, hundreds of millions of files

Velocity (e.g. real time)

Support sensor data streams, satellite imagery, simulation output, observational data, experimental data

Variety (multiple datasets, mashup)

Support logical collections that span administrative domains, data aggregation in containers, metadata, and workflows as objects

Variability (rate of change)

Support active collections (mutable data), versioning of data, and persistent identifiers

Big Data Science (collection, curation, analysis, action)

Veracity (Robustness Issues)

Provide reliable data transfer, audit trails, event tracking, periodic validation of assessment criteria (integrity, authenticity), distributed debugging
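
As an illustration of the periodic integrity validation mentioned above, here is a minimal sketch that recomputes each file's checksum and compares it against the value recorded at ingest. In DFC this kind of check is enforced by iRODS policies; the simple JSON catalog and file paths here are hypothetical stand-ins.

```python
# Minimal sketch of periodic integrity validation: recompute each file's
# SHA-256 checksum and compare it with the value recorded when the file was
# ingested. The JSON "catalog" mapping path -> checksum is hypothetical.
import hashlib
import json
from pathlib import Path


def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def validate(catalog_file: str) -> list:
    """Return the files whose current checksum no longer matches the record."""
    catalog = json.loads(Path(catalog_file).read_text())  # {path: checksum}
    return [p for p, recorded in catalog.items()
            if not Path(p).exists() or sha256(Path(p)) != recorded]


if __name__ == "__main__":
    failures = validate("checksum_catalog.json")  # hypothetical catalog file
    print("files failing integrity check:", failures)
```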

Visualization

Support execution of external visualization systems through automated workflows (GRASS)

Data Quality

Provide mechanisms to verify quality through automated workflow procedures

Data Types

Support parsing of selected formats (NetCDF, HDF5, DICOM), and provide mechanisms to invoke other data manipulation methods

Data Analytics

Provide support for invoking analysis workflows, tracking workflow provenance, sharing of workflows, and re-execution of workflows

Big Data Specific Challenges (Gaps)

Provide standard policy sets that enable a new community to build upon data management plans that address federal agency requirements

Big Data Specific Challenges in Mobility

Capture knowledge required for data manipulation, and apply resulting procedures at either the storage location, or a computer server.

Security and Privacy Requirements

Federate across existing authentication environments through Generic Security Service API and Pluggable Authentication Modules (GSI, Kerberos, InCommon, Shibboleth). Manage access controls on files independently of the storage location.