Big Data at NIST

Table of contents
  1. Story
  2. Emails
  3. Slides
    1. The NITRD Dashboard 2013: Cover Page
    2. The NITRD Dashboard 2013: Data Ecosystem
    3. Search for NIST at New.Data.gov 1
    4. Search for NIST at New.Data.gov 2
    5. Search for National Institute of Standards and Technology (NIST) at New.Data.gov 1
    6. Search for National Institute of Standards and Technology (NIST) at New.Data.gov 2
    7. Google Search for National Institute of Standards and Technology (NIST)
      1. Statistical Reference Datasets (StRD)
      2. NIST Data Gateway
      3. NIST Standard Reference Data
      4. NIST Datasets | 0.3.1 - ATOMS - Scilab
  4. Spotfire Dashboard
  5. Research Notes
    1. NITRD Home Page
    2. NITRD goes Open
    3. The NITRD Dashboard
    4. Data Sources
  6. Big Data at NIST
    1. Background
    2. NIST Big Data Working Group
    3. NIST Big Data Working Group Announcement
    4. NBD-WG Virtual Meeting
    5. Charter
    6. Co-Chairs
      1. Dr. Chaitan Baru
      2. Dr. Robert Marcus
      3. Mr. Wo Chang
    7. Guidelines
    8. Work Plan
    9. Virtual Meeting
  7. Tasks to Develop
    1. Definitions
    2. Taxonomies
    3. Reference Architectures
    4. Technical Roadmap
  8. Documents
    1. Input Listing
    2. Output Listing
    3. Upload Document
  9. Documents
    1. Security and Privacy Subgroup Meeting Agenda for July 24, 2013
    2. Defining the Big Data Architecture Framework (BDAF): Outcome of the Brainstorming session at UvA
    3. Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised
    4. Questions for Subgroups from Deliverables User Perspective
    5. NIST Big Data Roadmap Matrix v1.1
    6. NBD-Requirements Meeting Minutes July 23 2013
    7. NBD Requirements WG Use Case Template-v2
    8. BD Privacy Use Case for Social Media
    9. Big Data Privacy Use Case
    10. Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised
    11. NIST Big Data Roadmap Matrix v.1.1
    12. NBD-Requirements Agenda July 23 2013
    13. Def/Tax Agenda mtg-3 20130722
    14. NIST Genomics Use Case
    15. NBD-WG Meeting Agenda for July 24 Subgroups Joint Meeting
    16. Towards A Big Data Reference Architecture 2011 (Selected Slides)
    17. NBD Requirements Working Group Interim Minutes 16 July 2013
    18. Collection of Architectures and Discussions for NBD Reference Architecture
    19. Technology Roadmap Meeting Agenda for July 19 2013
    20. Reference Architecture, Requirements, Gaps, and Roles
    21. NBD-WG - Thinking Forward
      1. Slide 1 NBD-WG: Thinking Forward
      2. Slide 2 Disclaimer 1
      3. Slide 3 Good Approach
      4. Slide 4 Big Data Current Approach
      5. Slide 5 Semi-new Approach
      6. Slide 6 Big Data Processing Protocol and Markup Language
      7. Slide 7 Subgroups
      8. Slide 8 Definitions and Taxonomies
      9. Slide 9 Taxonomies
      10. Slide 10 Requirements
      11. Slide 11 Security and Privacy
      12. Slide 12 Reference Architecture
      13. Slide 13 Technology Roadmap
      14. Slide 14 How Can You Help?
      15. Slide 15 Disclaimer 2
    22. Existing Standards Efforts Related to Big Data
      1. Abstract
      2. Introduction
      3. INCITS, ISO, and ISO/IEC Standards Efforts
      4. Data Management and Interchange
      5. Metadata
      6. Data Storage and Retrieval
      7. Support for Complex Data Types
      8. Geographical Information Systems
      9. IT Security techniques
      10. Cloud Computing – Distributed application platforms and services
      11. Coding of Audio, Picture, Multimedia, and Hypermedia Information
      12. Apache Software Foundation
      13. W3C
    23. Security and Privacy Subgroup Agenda for July 17, 2013
    24. Agenda for 7/18/2013
    25. NBD Use Case Template
    26. Different Requirements Use Case Templates
    27. Reference Arch Subgroup Minutes 11 July 2013
    28. Requirements/Use Case Working Group Minutes July 9 2013
    29. Requirements/Use Case Working Group Agenda July 16 2013
    30. Reference Architecture Subgroup Agenda for July 11, 2013
    31. Technology Roadmap Subgroup Agenda for July 12, 2013
    32. Def/Tax Agenda mtg-2 20130715
    33. NIST BDWG def-tax Definitions Draft
    34. NIST BDWG DefTax Minutes 20130708
    35. Technology Roadmap Subgroup Charter
    36. Reference Architecture Subgroup Charter
    37. Requirements Subgroup Charter
    38. Security and Privacy Subgroup Charter
    39. Definitions and Taxonomies Subgroup Charter
    40. For presentation: Big Data Ecosystem RA (Microsoft)
      1. Slide 1 Big Data Ecosystem Reference Architecture
      2. Slide 2 RA Objectives
      3. Slide 3 Big Data Ecosystem RA
      4. Slide 4 An Example of Cloud Computing Usage in Big Data Ecosystem
      5. Slide 5 Use Case: Advertising
      6. Slide 6 Use Case: Enterprise Data Warehouse
    41. Evaluation Criteria for Reference Architecture
      1. Slide 1 Evaluation Criteria for Reference Architecture
      2. Slide 2 Criteria for Evaluating Reference Architecture
      3. Slide 3 Brainstorming Big Data Reference Architecture
      4. Slide 4 Criteria Examples
      5. Slide 5 Apache Big Data Framework in Reference Architecture
    42. Big Data Ecosystem RA (Microsoft)
    43. Combined Deliverables Overview Presentation
    44. Security and Privacy Subgroup Meeting Agenda for July 10, 2013
    45. Requirements Subgroup Meeting Agenda for July 9, 2013
    46. Subgroup Definitions and Taxonomies Meeting Agenda
    47. Personal Brainstorming Document
      1. 1. Introduction and Definition
      2. 2. Reference Architecture and Taxonomy
        1. 1. Applications
        2. 2. Design, Develop, Deploy Tools
        3. 3. Security and Privacy
        4. 4. Analytics and Interfaces
          1. 4.1 Complex Analytics
          2. 4.2 Interfaces
          3. 4.3 Visualization
          4. 4.4 Business Intelligence
        5. 5. System and Process Management
        6. 6. Data Processing within the Architecture Framework
          1. 6.1 ETL
          2. 6.2 Data Serialization
        7. 7. Data Governance
        8. 8. Data Stores
          1. 8.1 File Systems
          2. 8.2 Databases
        9. 9. IO External to Architecture Framework and Stream Processing
        10. 10. Infrastructure
          1. 10.1 Appliances
          2. 10.2 Internal Server Farm
          3. 10.3 Data Grids and Fabrics
          4. 10.4 Cloud-based
      3. 3. Requirements, Gap Analysis, and Suggested Best Practices
        1. 1. Data input and output to Big Data File System (ETL, ELT)
        2. 2. Data exported to Databases from Big Data File System
        3. 3. Big Data File Systems as a data resource for batch and interactive queries
        4. 4. Batch Analytics on Big Data File System using Big Data Parallel Processing
        5. 5. Stream Processing and ETL
        6. 6. Real Time Analytics (e.g. Complex Event Processing)
        7. 7. NoSQL (and NewSQL) DBs as operational databases for large-scale updates and queries
        8. 8. NoSQL DBs for storing diverse data types
        9. 9. Databases optimized for complex ad hoc queries
        10. 10. Databases optimized for rapid updates and retrieval (e.g. in memory or SSD)
      4. 4. Future Directions and Roadmap
      5. 5. Security and Privacy - Top 10 Challenges
      6. 6. Conclusions and General Advice
        1. Recommended Roadmap for Getting Started
          1. Figure 3: Road Map for Getting Started
          2. Table 3: Practical Road Map Summary
      7. 7. References
      8. Appendix A. Terminology Glossary
        1. ACID
        2. Analytics
        3. Avatarnode
        4. BASE
        5. Big Data
        6. BSON
        7. BSP (Bulk Synchronous Parallel)
        8. Business Analytics
        9. Cache
        10. (CEP) Complex Event Processing
        11. Consistent Hashing
        12. Descriptive Analytics
        13. Discovery Analytics
        14. ELT (Extract, Load, Transform)
        15. ETL (Extract, Transform Load)
        16. In Memory Database
        17. JSON (JavaScript Object Notation)
        18. MapReduce
        19. MPP (Massively Parallel Processing)
        20. NewSQL
        21. NoSQL
        22. OLAP (Online Analytic Processing)
        23. OLTP (Online Transactional Processing)
        24. Paxos
        25. Predictive Analytics
        26. Prescriptive Analytics
        27. Semi-Structured Data
        28. SSD (Solid State Drive)
        29. Stream Processing
        30. Structured Data
        31. Unstructured Data
        32. Vector Clocks
        33. Web Analytics
      9. Appendix B. Solutions Glossary
        1. Accumulo - (Database, NoSQL, Key-Value) from Apache
        2. Acunu Analytics - (Analytics Tool) on top of Acunu Data Platform based on Cassandra
        3. Aerospike - (Database, NoSQL, Key-Value)
        4. Alteryx - (Analytics Tool)
        5. Ambari - (Hadoop Cluster Management) from Apache
        6. Analytica - (Analytics Tool) from Lumina
        7. ArangoDB - (Database, NoSQL, Multi-model) Open source from Europe
        8. Aster - (Analytics) Combines SQL and Hadoop on top of Aster MPP Database
        9. Avro - (Data Serialization) from Apache
        10. Azkaban - (Process Scheduler) for Hadoop
        11. Azure Table Storage - (Database, NoSQL, Columnar) from Microsoft
        12. Berkeley DB - (Database)
        13. Big Data Appliance - (Integrated Hardware and Software Architecture) from Oracle; includes Cloudera, Oracle NoSQL, Oracle R, and Sun servers
        14. BigML - (Analytics tool) for business end-users
        15. BigQuery - (Query Tool) on top of Google Storage
        16. BigSheets - (BI Tool) from IBM
        17. BigTable - (Database, NoSQL, Column oriented) from Google
        18. Caffeine - (Search Engine) from Google; uses BigTable indexing
        19. Cascading - (Processing Framework) on top of Hadoop
        20. Cascalog - (Query Tool) on top of Hadoop
        21. Cassandra - (Database, NoSQL, Column oriented)
        22. Chukwa - (Monitoring Hadoop Clusters) from Apache
        23. Clojure - (Lisp-based Programming Language) compiles to JVM byte code
        24. Cloudant - (Distributed Database as a Service)
        25. Cloudera - (Hadoop Distribution) including real-time queries
        26. Clustrix - (NewSQL DB) runs on AWS
        27. Coherence - (Data Grid/Caching) from Oracle
        28. Colossus - (File System) Next Generation Google File System
        29. Continuity - (Data fabric layer) Interfaces to Hadoop Processing and data stores
        30. Corona - (Hadoop Processing tool) used internally by Facebook and now open sourced
        31. Couchbase - (Database, NoSQL, Document) with CouchDB and Membase capabilities
        32. CouchDB - (Database, NoSQL, Document)
        33. Data Tamer - (Data integration and curation tools) from MIT
        34. Datameer - (Analytics) built on top of Hadoop
        35. Datastax - (Integration) Built on Cassandra, Solr, Hadoop
        36. Dremel - (Query Tool) interactive for columnar DBs from Google
        37. Drill - (Query Tool) interactive for columnar DBs from Apache
        38. DynamoDB - (Database, NoSQL, Key-Value) from Amazon
        39. Elastic MapReduce - (Cloud-based MapReduce) from Amazon
        40. ElasticSearch - (Search Engine) on top of Apache Lucene
        41. Enterprise Control Language (ECL) - (Data Processing Language) from HPCC
        42. Erasure Codes - (Alternative to file replication for availability) Replicates fragments.
        43. eXtreme Scale - (Data Grid/Caching) from IBM
        44. F1 - (Combines relational and Hadoop processing) from Google built on Google Spanner
        45. Falcon - (Data processing and management platform) from Apache
        46. Flume - (Data Collection, Aggregation, Movement)
        47. FlumeJava - (Java Library) Supports developing and running data-parallel pipelines
        48. Fusion-io - (SSD Storage Platform) can be used with HBase
        49. GemFire - (Data Grid/Caching) from VMware
        50. Gensonix - (NoSQL database) from Scientel
        51. Gephi - (Visualization Tool) for Graphs
        52. Gigaspaces - (Data Grid/Caching)
        53. Giraph - (Graph Processing) from Apache
        54. Google Refine - (Data Cleansing)
        55. Google Storage - (Database, NoSQL, Key-Value)
        56. Graphbase - (Database, NoSQL, Graphical)
        57. Greenplum - (MPP Database, Analytic Tools, Hadoop)
        58. HBase - (Database, NoSQL, Column oriented)
        59. Hadapt - (Combined SQL Layer and Hadoop)
        60. Hadoop Distributed File System - (Distributed File System) from Apache
        61. Hama - (Processing Framework) Uses BSP model on top of HDFS
        62. Hana - (Database, NewSQL) from SAP
        63. Haven - (Analytics Package) from HP
        64. HAWQ - (SQL Interface to Hadoop) from Greenplum and Pivotal
        65. HCatalog - (Table and Storage Management) for Hadoop data
        66. HDF5 - (A data model, library, and file format for storing/managing large complex data)
        67. High Performance Computing Cluster (HPCC) - (Big Data Processing Platform)
        68. Hive - (Data warehouse structure on top of Hadoop)
        69. HiveQL - (SQL-like interface on Hadoop File System)
        70. Hortonworks - (Extensions of Hadoop)
        71. HStreaming - (Real time analytics on top of Hadoop)
        72. Hue - (Open source UI for Hadoop) from Cloudera
        73. Hypertable - (Database, NoSQL, Key-Value) open source runs on multiple file systems
        74. Impala - (Ad hoc query capability for Hadoop) from Cloudera
        75. InfiniDB - (Scale-up analytic database)
        76. Infochimps - (Big Data Storage and Analytics in the Cloud)
        77. InfoSphere BigInsights - (Analytics) from IBM
        78. InnoDB - (Default storage engine for MySQL)
        79. Jaql - (Query Language for Hadoop) from IBM
        80. Kafka - (Publish-and-subscribe for data) from Apache
        81. Karmasphere - (Analytics)
        82. Knox - (Secure gateway to Hadoop) from Apache
        83. Lucidworks - (Search built on Solr/Lucene) and an associated Big Data Platform
        84. Knowledge Graph - (Graphical data store) from Google
        85. Mahout - (Machine Learning Toolkit) built on Apache Hadoop
        86. MapD - (Massively Parallel Database) Open source, runs on GPUs
        87. MapReduce - (Processing algorithm)
        88. MapR - (MapReduce extensions) built on NFS
        89. MarkLogic - (Database, NoSQL, Document) interfaced with Hadoop
        90. Memcached - (Data Grid/Caching)
        91. MemSQL - (In memory analytics database)
        92. MongoDB - (Database, NoSQL, Document) from 10gen
        93. mrjob - (Workflow) for Hadoop from Yelp
        94. MRQL - (Query Language) supports Map-Reduce and BSP processing
        95. Muppet - (Stream Processing) MapUpdate implementation
        96. MySQL - (Database, Relational)
        97. Namenode - (Directory Service for Hadoop)
        98. Neo4j - (Database, NoSQL, Graphical)
        99. Netezza - (Database Appliance)
        100. NuoDB - (MPP Database)
        101. Oozie - (Workflow Scheduler for Hadoop) from Apache
        102. Oracle NoSQL - (Database, Key-Value)
        103. ORC (Optimized Row Columnar) Files - File Format for Hive data in HDFS
        104. Parquet - (Columnar file format for Hadoop) from Cloudera
        105. Pentaho - (Analytic tools)
        106. Percolator - (Data Processing) from Google
        107. Pig - (Procedural framework on top of Hadoop)
        108. Pig Latin - (Interface language for Pig procedures)
        109. Pivotal - (New company utilizing VMware and EMC technologies)
        110. Platfora - (In memory caching for BI on top of Hadoop)
        111. Postgres - (Database Relational)
        112. Precog - (Analytics Tool) for JSON data
        113. Pregel - (Graph Processing) used by Google
        114. Presto - (SQL Query for HDFS) from Facebook
        115. Protocol Buffers - (Serialization) from Google
        116. Protovis - (Visualization)
        117. PureData - (Database Products) from IBM
        118. R - (Data Analysis Tool)
        119. Rainstor - (Combines Hadoop and Relational Processing)
        120. RCFile (Record Columnar File) - File format optimized for HDFS data warehouses
        121. Redis - (Database, NoSQL, Key-Value)
        122. Redshift - (Database, Relational) from Amazon
        123. Resilient Distributed Datasets - (Fault-tolerant in-memory data sharing)
        124. Riak - (Database, NoSQL, Key-Value with built-in MapReduce) from Basho
        125. Roxie - (Query processing cluster) from HPCC
        126. RushAnalytics - (Analytics) from Pervasive
        127. S3 - (Database, NoSQL, Key-Value) from Amazon
        128. S4 - (Stream Processing)
        129. Sawzall - (Query Language for Map-Reduce) from Google
        130. Scala - (Programming Language) Combines functional and imperative programming
        131. Scalebase - (Scalable Front-end to distributed Relational Databases)
        132. SciDB - (Database, NoSQL, Arrays)
        133. scikit-learn - (Machine Learning Toolkit) in Python
        134. Scribe - (Server for Aggregating Log Data) originally from Facebook; may be inactive
        135. SequenceFiles - (File format) Binary key-value pairs
        136. Shark - (Complex Analytics Platform) on top of Spark
        137. Simba - (ODBC SQL Driver for Hive)
        138. SimpleDB - (Database, NoSQL, Document) from Amazon
        139. Skytree - (Analytics Server)
        140. Solr/Lucene - (Search Engine) from Apache
        141. Spotfire - (Visualization and Analytics Tool) from TIBCO
        142. Spanner - (Database, NewSQL) from Google
        143. Spark - (In memory cluster computing system)
        144. Splunk - (Machine Data Analytics)
        145. Spring Data - (Data access tool for Hadoop and NoSQL) in Spring Framework
        146. SQLite - (Software library supporting server-less relational database)
        147. SQLstream - (Streaming data analysis products)
        148. Sqoop - (Data movement) from Apache
        149. Sqrrl - (Security and Analytics on top of Apache Accumulo)
        150. Stinger - (Optimized Hive Queries) from Hortonworks
        151. Storm - (Stream Processing)
        152. Sumo Logic - (Log Analytics)
        153. Tableau - (Visualization)
        154. Tachyon - (File system) from Berkeley
        155. Talend - (Data Integration Product)
        156. TempoDB - (Database, NoSQL, Time Series)
        157. Teradata Active EDW - (Database, Relational)
        158. Terracotta - (In memory data management)
        159. Terraswarm - (Data Acquisition) Sensor Integration
        160. Thor - (Filesystem and Processing Cluster) from HPCC Systems
        161. Thrift - (Framework for cross-language development)
        162. Tinkerpop - (Graph Database and Toolkit)
        163. Vertica - (Database Relational)
        164. Voldemort - (Database, NoSQL, Key-Value)
        165. VoltDB - (Database NewSQL)
        166. Watson from IBM - (Analytic Framework)
        167. WebHDFS - (REST API for Hadoop)
        168. WEKA - (Machine Learning Toolkit) in Java
        169. Wibidata - (Components for building Big Data applications)
        170. YarcData - (Graph Analytics for Big Data)
        171. YARN - (Next Generation Hadoop) from Apache
        172. Yottamine - (Machine Learning Toolkit) Cloud-based
        173. Zettaset Orchestrator - (Management and Security for Hadoop)
        174. ZooKeeper - (Distributed Computing Management)
      10. Appendix C. Use Case Examples
        1. EDW Use Cases
        2. Retail/Consumer Use Cases
        3. Financial Services Use Cases
        4. Web & Digital Media Services Use Cases
        5. Health & Life Sciences Use Cases
        6. Telecommunications Use Cases
        7. Government Use Cases
        8. New Application Use Cases
        9. Fraud Use-Cases
        10. E-Commerce and Customer Service Use-Cases
          1. Big Data Case Studies
      11. Appendix D. Actors and Roles
    48. NIST Special Publication 500-291 - NIST Cloud Computing Standards Roadmap - July 2011
    49. NIST Special Publication 500-292 - NIST Cloud Computing Reference Architecture - September 2011
    50. NIST_Security_Reference_Architecture_2013.05.15_v1
    51. SP800-145 Cloud Definition
    52. NBD-WG July 3rd Meeting Agenda
    53. NIST Big Data Working Group Kick-off Meeting Agenda (w/ minor update)
    54. Big Data Definitions v1
    55. NIST Big Data Working Group Kick-off Meeting Agenda
    56. NIST Big Data Working Group Charter - draft
  10. NEXT

  1. Story
  2. Emails
  3. Slides
    1. The NITRD Dashboard 2013: Cover Page
    2. The NITRD Dashboard 2013: Data Ecosystem
    3. Search for NIST at New.Data.gov 1
    4. Search for NIST at New.Data.gov 2
    5. Search for National Institute of Standards and Technology (NIST) at New.Data.gov 1
    6. Search for National Institute of Standards and Technology (NIST) at New.Data.gov 2
    7. Google Search for National Institute of Standards and Technology (NIST)
      1. Statistical Reference Datasets (StRD)
      2. NIST Data Gateway
      3. NIST Standard Reference Data
      4. NIST Datasets | 0.3.1 - ATOMS - Scilab
  4. Spotfire Dashboard
  5. Research Notes
    1. NITRD Home Page
    2. NITRD goes Open
    3. The NITRD Dashboard
    4. Data Sources
  6. Big Data at NIST
    1. Background
    2. NIST Big Data Working Group
    3. NIST Big Data Working Group Announcement
    4. NBD-WG Virtual Meeting
    5. Charter
    6. Co-Chairs
      1. Dr. Chaitan Baru
      2. Dr. Robert Marcus
      3. Mr. Wo Chang
    7. Guidelines
    8. Work Plan
    9. Virtual Meeting
  7. Tasks to Develop
    1. Definitions
    2. Taxonomies
    3. Reference Architectures
    4. Technical Roadmap
  8. Documents
    1. Input Listing
    2. Output Listing
    3. Upload Document
  9. Documents
    1. Security and Privacy Subgroup Meeting Agenda for July 24, 2013
    2. Defining the Big Data Architecture Framework (BDAF): Outcome of the Brainstorming session at UvA
    3. Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised
    4. Questions for Subgroups from Deliverables User Perspective
    5. NIST Big Data Roadmap Matrix v1.1
    6. NBD-Requirements Meeting Minutes July 23 2013
    7. NBD Requirements WG Use Case Template-v2
    8. BD Privacy Use Case for Social Media
    9. Big Data Privacy Use Case
    10. Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised
    11. NIST Big Data Roadmap Matrix v.1.1
    12. NBD-Requirements Agenda July 23 2013
    13. Def/Tax Agenda mtg-3 20130722
    14. NIST Genomics Use Case
    15. NBD-WG Meeting Agenda for July 24 Subgroups Joint Meeting
    16. Towards A Big Data Reference Architecture 2011 (Selected Slides)
    17. NBD Requirements Working Group Interim Minutes 16 July 2013
    18. Collection of Architectures and Discussions for NBD Reference Architecture
    19. Technology Roadmap_Meeting Agenda for July 19 2013
    20. Reference Architecture, Requirements, Gaps, and Roles
    21. NBD-WG - Thinking Foward
      1. Slide 1 NBD-WG: Thinking Forward
      2. Slide 2 Disclaimer 1
      3. Slide 3 Good Approach
      4. Slide 4 Big Data Current Approach
      5. Slide 5 Semi-new Approach
      6. Slide 6 Big Data Processing Protocol and Markup Language
      7. Slide 7 Subgroups
      8. Slide 8 Definitions and Taxonomies
      9. Slide 9 Taxonomies
      10. Slide 10 Requirements
      11. Slide 11 Security and Privacy
      12. Slide 12 Reference Architecture
      13. Slide 13 Technology Roadmap
      14. Slide 14 How Can You Help?
      15. Slide 15 Disclaimer 2
    22. Existing Standards Efforts Related to Big Data
      1. Abstract
      2. Introduction
      3. INCITS, ISO, and ISO/IEC Standards Efforts
      4. Data Management and Interchange
      5. Metadata
      6. Data Storage and Retrieval
      7. Support for Complex Data Types
      8. Geographical Information Systems
      9. IT Security techniques
      10. Cloud Computing – Distributed application platforms and services
      11. Coding of Audio, Picture, Multimedia, and Hypermedia Information
      12. Apache Software Foundation
      13. W3C
    23. Security and Privacy Subgroup Agenda for July 17, 2013
    24. Agenda for 7/18/2013
    25. NBD Use Case Template
    26. Different Requirements Use Case Templates
    27. Reference Arch Subgroup Minutes 11 July 2013
    28. Requirements/Use Case Working Group Minutes July 9 2013
    29. Requirements/Use Case Working Group Agenda July 16 2013
    30. Reference Architecture Subgroup Agenda for July 11, 2013
    31. Technology Roadmap Subgroup Agenda for July 12, 2013
    32. Def/Tax Agenda mtg-2 20130715
    33. NIST BDWG def-tax Definitions Draft
    34. NIST BDWG DefTax Minutes 20130708
    35. Technology Roadmap Subgroup Charter
    36. Reference Architecture Subgroup Charter
    37. Requirements Subgroup Charter
    38. Security and Privacy Subgroup Charter
    39. Definitions and Taxonomies Subgroup Charter
    40. For presentation: Big Data Ecosystem RA (Microsoft)
      1. Slide 1 Big Data Ecosystem Reference Architecture
      2. Slide 2 RA Objectives
      3. Slide 3 Big Data Ecosystem RA
      4. Slide 4 An Example of Cloud Computing Usage in Big Data Ecosystem
      5. Slide 5 Use Case: Advertising
      6. Slide 6 Use Case: Enterprise Data Warehouse
    41. Evaluation Criteria for Reference Architecture
      1. Slide 1 Evaluation Criteria for Reference Architecture
      2. Slide 2 Criteria for Evaluating Reference Architecture
      3. Slide 3 Brainstorming Big Data Reference Architecture
      4. Slide 4 Criteria Examples
      5. Slide 5 Apache Big Data Framework in Reference Architecture
    42. Big Data Ecosystem RA (Microsoft)
    43. Combined Deliverables Overview Presentation
    44. Security and Privacy Subgroup Meeting Agenda for July 10, 2013
    45. Requirements Subgroup Meeting Agenda for July 9, 2013
    46. Subgroup Definitions and Taxonomies Meeting Agenda
    47. Personal Brainstorming Document
      1. 1. Introduction and Definition
      2. 2. Reference Architecture and Taxonomy
        1. 1. Applications
        2. 2. Design, Develop, Deploy Tools
        3. 3. Security and Privacy
        4. 4. Analytics and Interfaces
          1. 4.1 Complex Analytics
          2. 4.2 Interfaces
          3. 4.3 Visualization
          4. 4.4 Business Intelligence
        5. 5. System and Process Management
        6. 6. Data Processing within the Architecture Framework
          1. 6.1 ETL
          2. 6.2 Data Serialization
        7. 7. Data Governance
        8. 8. Data Stores
          1. 8.1 File Systems
          2. 8.2 Databases
        9. 9. IO External to Architecture Framework and Stream Processing
        10. 10. Infrastructure
          1. 10.1 Appliances
          2. 10.2 Internal Server Farm
          3. 10.3 Data Grids and Fabrics
          4. 10.4 Cloud-based
      3. 3. Requirements, Gap Analysis, and Suggested Best Practices
        1. 1. Data input and output to Big Data File System (ETL, ELT)
        2. 2. Data exported to Databases from Big Data File System
        3. 3 Big Data File Systems as a data resource for batch and interactive queries
        4. 4. Batch Analytics on Big Data File System using Big Data Parallel Processing
        5. 5. Stream Processing and ETL
        6. 6. Real Time Analytics (e.g. Complex Event Processing)
        7. 7. NoSQL (and NewSQL) DBs as operational databases for large-scale updates and queries
        8. 8. NoSQL DBs for storing diverse data types
        9. 9. Databases optimized for complex ad hoc queries
        10. 10. Databases optimized for rapid updates and retrieval (e.g. in memory or SSD)
      4. 4. Future Directions and Roadmap
      5. 5. Security and Privacy - 10 Top Challenges
      6. 6. Conclusions and General Advice
        1. Recommended Roadmap for Getting Started
          1. Figure 3: Road Map for Getting Started
          2. Table 3: Practical Road Map Summary
      7. 7. References
      8. Appendix A. Terminology Glossary
        1. ACiD
        2. Analytics
        3. Avatarnode
        4. BASE
        5. Big Data
        6. BSON
        7. BSP (Bulk Synchronous Parallel)
        8. Business Analytics
        9. Cache
        10. (CEP) Complex Event Processing
        11. Consistent Hashing
        12. Descriptive Analytics
        13. Discovery Analytics
        14. ELT (Extract, Load, Transform)
        15. ETL (Extract, Transform Load)
        16. In Memory Database
        17. JSON (Javascript Object Notation)
        18. MapReduce
        19. MPP (Massively Parallel Processing)
        20. NewSQL
        21. NoSQL
        22. OLAP (Online Analytic Processing)
        23. OLTP (Online Transactional Processing)
        24. Paxos
        25. Predictive Analytics
        26. Presciptive Analytics
        27. Semi-Structured Data
        28. SSD (Solid State Drive)
        29. Stream Processing
        30. Structured Data
        31. Unstructured Data
        32. Vector Clocks
        33. Web Analytics
      9. Appendix B. Solutions Glossary
        1. Accumulo - (Database, NoSQL, Key-Value) from Apache
        2. Acunu Analytics - (Analytics Tool) on top of Aster Data Platform based on Cassandra
        3. Aerospike - (Database NoSQL Key-Value)
        4. Alteryx - (Analytics Tool)
        5. Ambari - (Hadoop Cluster Management) from Apache
        6. Analytica - (Analytics Tool) from Lumina
        7. ArangoDB - (Database, NoSQL, Multi-model) Open source from Europe
        8. Aster - (Analytics) Combines SQL and Hadoop on top of Aster MPP Database
        9. Avro - (Data Serialization) from Apache
        10. Azkaban - (Process Scheduler) for Hadoop
        11. Azure Table Storage - (Database, NoSQL, Columnar) from Microsoft
        12. Berkeley DB - (Database)
        13. BigData Appliance - (Integrated Hardware and Software Architecture) from Oracle includes Cloudera, Oracle NoSQL, Oracle R, and Sun Servers
        14. BigML - (Analytics tool) for business end-users
        15. BigQuery - (Query Tool) on top of Google Storage
        16. BigSheets - (BI Tool) from IBM
        17. BigTable - (Database, NOSQL. Column oriented) from Google
        18. Caffeine - (Search Engine) from Google use BigTable Indexing
        19. Cascading - (Processing) SQL on top of Hadoop from Apache
        20. Cascalog - (Query Tool) on top of Hadoop
        21. Cassandra - (Database, NoSQL, Column oriented)
        22. Chukwa - (Monitoring Hadoop Clusters) from Apache
        23. Clojure - (Lisp-based Programming Language) compiles to JVM byte code
        24. Cloudant - (Distributed Database as a Service)
        25. Cloudera - (Hadoop Distribution) including real-time queries
        26. Clustrix - (NewSQL DB) runs on AWS
        27. Coherence - (Data Grid/Caching) from Oracle
        28. Colossus - (File System) Next Generation Google File System
        29. Continuity - (Data fabric layer) Interfaces to Hadoop Processing and data stores
        30. Corona - (Hadoop Processing tool) used internally by Facebook and now open sourced
        31. Couchbase - (Database, NoSQL, Document) with CouchDB and Membase capabilities
        32. CouchDB - (Database, NoSQL, Document)
        33. Data Tamer - (Data integration and curation tools) from MIT
        34. Datameer - (Analytics) built on top of Hadoop
        35. Datastax - (Integration) Built on Cassandra, Solr, Hadoop
        36. Dremel - (Query Tool) interactive for columnar DBs from Google
        37. Drill - (Query Tool) interactive for columnar DBs from Apache
        38. Dynamo DB - (Database, NoSQL, Key-Value)
        39. Elastic MapReduce - (Cloud-based MapReduce) from Amazon
        40. ElasticSearch - (Search Engine) on top of Apache Lucerne
        41. Enterprise Control Language (ECL) - (Data Processing Language) from HPPC
        42. Erasure Codes - (Alternate to file replication for availability) Replicates fragments.
        43. eXtreme Scale - (Data Grid/Caching) from IBM
        44. F1 - (Combines relational and Hadoop processing) from Google built on Google Spanner
        45. Falcon - (Data processing and management platform) from Apache
        46. Flume - (Data Collection, Aggregation, Movement)
        47. FlumeJava - (Java Library) Supports development and running data parallel pipelines
        48. Fusion-io - (SSD Storage Platform) can be used with HBase
        49. GemFire - (Data Grid/Caching) from VMware
        50. Gensonix - (NoSQL database) from Scientel
        51. Gephi - (Visualization Tool) for Graphs
        52. Gigaspaces - (Data Grid/Caching)
        53. Giraph - (Graph Processing) from Apache
        54. Google Refine - (Data Cleansing)
        55. Google Storage - (Database, NoSQL, Key-Value)
        56. Graphbase - (Database, NoSQL, Graphical)
        57. Greenplum - ( MPP Database. Analytic Tools, Hadoop )
        58. HBase - (Database, NoSQL, Column oriented)
        59. Hadapt - (Combined SQL Layer and Hadoop)
        60. Hadoop Distributed File System - (Distributed File System) from Apache
        61. Hama - (Processing Framework) Uses BSP model on top of HDFS
        62. Hana - (Database, NewSQL) from SAP
        63. Haven - (Analytics Package) from HP
        64. HAWQ - (SQL Interface to Hadoop) from Greenplum and Pivotal
        65. HCatalog - (Table and Storage Management) for Hadoop data
        66. HDF5- (A data model, library, and file format for storing/managing large complex data)
        67. High Performance Computing Cluster (HPCC) - (Big Data Processing Platform)
        68. Hive - (Data warehouse structure on top of Hadoop)
        69. HiveQL - (SQL-like interface on Hadoop File System)
        70. Hortonworks - (Extensions of Hadoop)
        71. HStreaming - (Real time analytics on top of Hadoop)
        72. Hue - (Open source UI for Hadoop) from Cloudera
        73. Hypertable - (Database, NoSQL, Key-Value) open source runs on multiple file systems
        74. Impala - (Ad hoc query capability for Hadoop) from Cloudera
        75. InfiniDB - (Scale-up analytic database)
        76. Infochimps - (Big Data Storage and Analytics in the Cloud)
        77. Infosphere Big Insights - (Analytic) from IBM
        78. InnoDB - (Default storage engine for MYSQL)
        79. Jaql = (Query Language for Hadoop) from IBM
        80. Kafka - (Publish-and-subscribe for data) from Apache
        81. Karmasphere - (Analytics)
        82. Knox - (Secure gateway to Hadoop) from Apache
        83. Lucidworks - (Search built on Solr/Lucene) and an associated Big Data Platform
        84. Knowledge Graph - (Graphical data store) from Google
        85. Mahout - (Machine Learning Toolkit) built on Apache Hadoop
        86. MapD - (Massive Parallel Database) Open Source on top of GPUs
        87. MapReduce - (Processing algorithm)
        88. MapR - (MapReduce extensions) built on NFS
        89. MarkLogic - (Database, NoSQL, Document) interfaced with Hadoop
        90. Memcached - (Data Grid/Caching)
        91. MemSQL - (In memory analytics database)
        92. MongoDB - (Database, NoSQL, Document) from 10gen
        93. mrjob - (Workflow) for Hadoop from Yelp
        94. MRQL - (Query Language) supports Map-Reduce and BSP processing
        95. Muppet - (Stream Processing) MapUpdate implementation
        96. MySql - (Database Relational)
        97. Namenode - Directory service for Hadoop
        98. Neo4j - (Database, NoSQL, Graphical)
        99. Netezza - (Database Appliance)
        100. NuoDB - (MPP Database)
        101. Oozie - (Workflow Scheduler for Hadoop) from Apache
        102. Oracle NoSQL - (Database, Key-Value)
        103. ORC (Optimized Row Columnar) Files - File Format for Hive data in HDFS
        104. Parquet - (Columnar file format for Hadoop) from Cloudera
        105. Pentaho - (Analytic tools)
        106. Percolater - (Data Processing) from Google
        107. Pig - (Procedural framework on top of Hadoop)
        108. Pig Latin - (Interface language for Pig procedures)
        109. Pivotal - (New company utilizing VMware and EMC technologies)
        110. Platfora - (In memory caching for BI on top of Hadoop)
        111. Postgres - (Database Relational)
        112. Precog - (Analytics Tool) for JSON data
        113. Pregel - (Graph Processing) used by Google
        114. Presto - (SQL Query for HDFS) from Facebook
        115. Protocol Buffers - (Serialization) from Google
        116. Protovis - (Visualization)
        117. PureData - (Database Products) from IBM
        118. R - (Data Analysis Tool)
        119. Rainstor - (Combines Hadoop and Relational Processing)
        120. RCFile (Record Columnar File) - File format optimized for HDFS data warehouses
        121. Redis - (Database, NoSQL, Key-Value)
        122. Redshift - (Database Relational) Amazon
        123. Resilient Distributed Datasets - (Fault-tolerant in-memory data sharing) used in Spark
        124. Riak - (Database, NoSQL, Key-Value with built-in MapReduce) from Basho
        125. Roxie - (Query processing cluster) from HPCC
        126. RushAnalytics - (Analytics) from Pervasive
        127. S3 - (Object Storage, Key-Value) from Amazon
        128. S4 - (Stream Processing)
        129. Sawzall - (Query Language for Map-Reduce) from Google
        130. Scala - (Programming Language) Combines functional and imperative programming
        131. Scalebase - (Scalable Front-end to distributed Relational Databases)
        132. SciDB - (Database, NoSQL, Arrays)
        133. scikit learn - (Machine Learning Toolkit) in Python
        134. Scribe - (Server for Aggregating Log Data) originally from Facebook; may be inactive
        135. SequenceFiles - (File format) Binary key-value pairs
        136. Shark - (Complex Analytics Platform) on top of Spark
        137. Simba - (ODBC SQL Driver for Hive)
        138. SimpleDB - (Database, NoSQL, Document) from Amazon
        139. Skytree - (Analytics Server)
        140. Solr/Lucene - (Search Engine) from Apache
        141. Spotfire - (Visualization and Analytics tool) from TIBCO
        142. Spanner - (Database, NewSQL) from Google
        143. Spark - (In memory cluster computing system)
        144. Splunk - (Machine Data Analytics)
        145. Spring Data - (Data access tool for Hadoop and NoSQL) in Spring Framework
        146. SQLite - (Software library supporting server-less relational database)
        147. SQLstream - (Streaming data analysis products)
        148. Sqoop - (Data movement) from Apache
        149. Sqrrl - (Security and Analytics on top of Apache Accumulo)
        150. Stinger - (Optimized Hive Queries) from Hortonworks
        151. Storm - (Stream Processing)
        152. Sumo Logic - (Log Analytics)
        153. Tableau - (Visualization)
        154. Tachyon - (File system) from Berkeley
        155. Talend - (Data Integration Product)
        156. TempoDB - (Database, NoSQL, Time Series)
        157. Teradata Active EDW - (Database, Relational)
        158. Terracotta - (In memory data management)
        159. Terraswarm - (Data Acquisition) Sensor Integration
        160. Thor - (Filesystem and Processing Cluster) from HPCC Systems
        161. Thrift - (Framework for cross-language development)
        162. Tinkerpop - (Graph Database and Toolkit)
        163. Vertica - (Database Relational)
        164. Voldemort - (Database, NoSQL, Key-Value)
        165. VoltDB - (Database NewSQL)
        166. Watson from IBM - (Analytic Framework)
        167. WebHDFS - (REST API for Hadoop)
        168. WEKA - (Machine Learning Toolkit) in Java
        169. Wibidata - (Components for building Big Data applications)
        170. YarcData - (Graph Analytics for Big Data)
        171. YARN - (Resource manager for next-generation Hadoop) from Apache
        172. Yottamine - (Machine Learning Toolkit) Cloud-based
        173. Zettaset Orchestrator - (Management and Security for Hadoop)
        174. ZooKeeper - (Distributed Computing Management)
      10. Appendix C. Use Case Examples
        1. EDW Use Cases
        2. Retail/Consumer Use Cases
        3. Financial Services Use Cases
        4. Web & Digital Media Services Use Cases
        5. Health & Life Sciences Use Cases
        6. Telecommunications Use Cases
        7. Government Use Cases
        8. New Application Use Cases
        9. Fraud Use-Cases
        10. E-Commerce and Customer Service Use-Cases
          1. Big Data Case Studies
      11. Appendix D. Actors and Roles
    48. NIST Special Publication 500-291 - NIST Cloud Computing Standards Roadmap - July 2011
    49. NIST Special Publication 500-292 - NIST Cloud Computing Reference Architecture - September 2011
    50. NIST_Security_Reference_Architecture_2013.05.15_v1
    51. SP800-145 Cloud Definition
    52. NBD-WD July 3rd Meeting Agenda
    53. NIST Big Data Working Kick-off Meeting Agenda (w/ minor update)
    54. Big Data Definitions v1
    55. NIST Big Data Working Kick-off Meeting Agenda
    56. NIST Big Data Working Group Charter - draft
  10. NEXT

Story

Make a Technology Roadmap Project a Data Project

Dominic Sale, the new OMB Chief of Data Analytics & Reporting, said the new Digital Government Strategy is "treating all content as data." So big data = all your content, and here is my matrix for the Third Annual Government Big Data Forum.

 

Four Vs | Concept | Method | Goal | Result
Volume and Velocity | Big Data = All Content | Make all content as data | Federal Digital Government Strategy | Knowledge Base
Veracity | Web-Linked Data | Semantic Web Data | Strong Relationships | Spreadsheets
Value | Unified Data Architecture | Data Integration | Data Ecosystem | Network Visualizations

How do we reliably detect a potential pandemic early enough to intervene?

Example of Big Data for a NIST Big Data Working Group Use Case Question: http://kff.org/global-indicator/h1n1-influenza-swine-flu-cases/

Attended Forum and Workshop, Wrote Stories, and Told Chris Greer I Could Help With This

NIST Cloud Computing AND Big Data Forum & Workshop, January 15-17, Gaithersburg, MD. Wiki

Led to Organizing Our Own Conference

September 10-11: Cloud: SOA, Semantics, & Data Science

Reminds Me of the NIST Cloud Computing Initiative Where I Provided One of the First Use Cases (Desktop in the Cloud) Which Was Adopted by the Japanese Government (METI) and Then Was Asked to Do the Same For Them with Open Government Data

February 12-14, Meetings on Open Government Data for Japan, Washington, DC. Slides Final Report (PDF) Final Report (Word) Interim Report (Word) (Wiki)

Deliverables:

Thinking Forward - How Can You Help?
Provide good use cases - Yes, see OMB work
Provide feedback on Big Data direction - Yes, see Big Data at the Hill
Provide recommendations on Big Data Adoption - Yes, also see Big Data at the Hill
Others... Yes, after I hear this week's conference call

NIST Data Project

My Request: Please make some of the NIST “big data” available in a readily usable format for free. See below

NIST Reply: We’re in the process of developing a platform to make all NIST data more readily discoverable and usable. Stay tuned.

Emails

From: Chang, Wo L [mailto:wchang@nist.gov]
Sent: Thursday, July 18, 2013 4:49 PM
To: 'Brand Niemann'; bigdatadeftax
Subject: RE: [BigdataDefTax] Subgroups report for joint meeting on July 24 from 13:00 - 15:00 EDT

Dear Brand,

Welcome to our big data WG!

Yes, the more understanding of analytic tools, visualizations, and their processing for big data would be our interest.

Best,

--Wo

From: Brand Niemann [mailto:bniemann@cox.net]

Sent: Thursday, July 18, 2013 5:51 AM

To: Chang, Wo L; bigdatadeftax; bigdatareq; bigdatasec; bigdataarch; bigdataroadmap

Subject: RE: [BigdataDefTax] Subgroups report for joint meeting on July 24 from 13:00 - 15:00 EDT

I just joined and have been trying to catch up on the content of Wiki pages and emails.

"The focus of the NIST Big Data Working Group (NBD-WG) is to form a community of interest..., to create vendor-neutral, technology and infrastructure ...,  to pick-and-choose best analytics tools for their processing and visualization requirements on the most suitable computing platforms,.."

is similar to what I am trying to do for the new OMB Analytics and Performance activity.

You need what I used to do (try) for the Federal CIO Council: an evangelist - a community collaboration builder that finds a common language (semantics) and purpose (better use of big data), with preferably co-chairs from government and industry that can work together to hold regular meetings to give participants an environment in which "collective intelligence" can grow (Doug Engelbart).

See: http://semanticommunity.info/CNSTAT/Principles_and_Practices_for_a_Federal_Statistical_Agency#Slide_3_Preface

My past efforts were recognized:

http://fcw.com/articles/2006/10/30/brand-niemann-champions-collaboration.aspx

My current efforts for OMB are just starting again:

http://semanticommunity.info/Data_Science/Free_Data_Visualization_and_Analysis_Tools

Is this relevant to this NIST Big Data activity?

Dr. Brand Niemann

Director and Senior Data Scientist

Semantic Community

http://semanticommunity.info

http://breakinggov.com/author/brand-niemann/

703-268-9314

From: bigdatadeftax-bounces@nist.gov [mailto:bigdatadeftax-bounces@nist.gov]

On Behalf Of Chang, Wo L

Sent: Monday, July 15, 2013 10:00 AM

To: bigdatadeftax; bigdatareq; bigdatasec; bigdataarch; bigdataroadmap

Subject: [BigdataDefTax] Subgroups report for joint meeting on July 24 from 13:00 - 15:00 EDT

Dear All,

Note: I am sending this email to all five subgroups instead of to bigdatawg@nist.gov, since I don't want to flood the general audience on that reflector. You should receive only one email instead of five (one from each subgroup).

Thanks for kicking off and brainstorming different aspects of big data from each subgroup last week!

Before we get into detailed discussion of particular techniques, technologies, and infrastructures, I just want to remind you of the overall goal of the NBD-WG, as I highlighted in RED below under the Background and Scope in the NBD-WG's charter:

Background

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, portability, reusability, and extensibility for big data usage, analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope

The focus of the NIST Big Data Working Group (NBD-WG) is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, reference architectures, and technology roadmaps. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables to enable Big Data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value-add from Big Data service providers and flow of data between the stakeholders in a cohesive and secure manner.

There will be a joint meeting for all subgroups on July 24 (Wed.) from 13:00 - 15:00 EDT, and all subgroup Co-Chairs need to report (the Lead Co-Chair will do the reporting) what has been done in their respective subgroups toward fulfilling the goal of the NBD-WG.

Thanks!

--Wo

_______________________________________________

BigdataDefTax mailing list

BigdataDefTax@nist.gov

https://email.nist.gov/mailman/listinfo/bigdatadeftax

Slides

The NITRD Dashboard 2013: Cover Page

NITRDDashboard-Spotfire.png

The NITRD Dashboard 2013: Data Ecosystem

NITRDDashboard-Spotfire2.png

Search for NIST at New.Data.gov 1

New.Data.govSearchforNIST1.png

Search for NIST at New.Data.gov 2

New.Data.govSearchforNIST2.png

Search for National Institute of Standards and Technology (NIST) at New.Data.gov 1

New.Data.govSearchforNIST3.png

Search for National Institute of Standards and Technology (NIST) at New.Data.gov 2

New.Data.govSearchforNIST4.png

Google Search for National Institute of Standards and Technology (NIST)

NIST Standard Reference Data

http://www.nist.gov/srd/

GoogleSearchforNISTDataSets2.png

NIST Datasets | 0.3.1 - ATOMS - Scilab

http://atoms.scilab.org/toolboxes/nistdataset

GoogleSearchforNISTDataSets4.png

Spotfire Dashboard

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Research Notes

NITRD Home Page

Source: http://www.nitrd.gov/Index.aspx

NITRD goes Open

Source: http://www.nitrd.gov/open/index.aspx

National Coordination Office for Networking and Information Technology Research and Development: Aggregated Federal R&D Budget for Networking and Information Technology

As a commitment to openness and transparency, NCO is making all of its NITRD budget data available to the public and our customers. This information comes from member Federal agencies and is broken down over the following eight program component areas: High End Computing Infrastructure & Application, High End Computing Research & Development, Cyber Security & Information Assurance, Human-Computer Interaction & Information Management, Large Scale Networking, High Confidence Software & Systems, Social Economic & Workforce Implications of IT & IT Workforce Development, and Software Design & Productivity. Interested individuals will now be able to track funding trends and identify agencies with investments in technical areas of interest, which will enable entrepreneurs and grant seekers to better direct their efforts to engage the correct Federal agency.

The NITRD Dashboard

Source: http://itdashboard.nitrd.gov/

My Note: See previous NITRD Dashboard work - http://semanticommunity.info/A_NITRD_Dashboard

The NITRD Dashboard provides a graphical interface for displaying NITRD budget data charts with drill-downs to Program Component Areas by year and agency

NITRDDashboardSlide1.png

NITRDDashboardSlide2.png

NITRDDashboardSlide3.png

NITRDDashboardSlide4.jpg

Data Sources

The data sources driving the NITRD Dashboard are the Agency NITRD Budget Cross Cuts as reported in the NITRD Supplement to the President’s Budget. Data sources are available in Google Fusion Tables format under the following links:

 

To download the NITRD Supplements to the President's Budget and the Agency NITRD Budget Cross Cuts, select a Fiscal Year on the dashboard site. An Excel spreadsheet with data for all years is also available for download.

Big Data at NIST

Source: http://bigdatawg.nist.gov/home.php

Points of Contact
   Chris Greer
     NIST / ITL
     Associate Director,
     Program Implementation

   Wo Chang
     NIST / ITL
     Digital Data Advisor

Background

There is a broad agreement among commercial, academic, and government leaders about the remarkable potential of "Big Data" to spark innovation, fuel commerce, and drive progress. Big Data is the term used to describe the deluge of data in our networked, digitized, sensor-laden, information driven world. The availability of vast data resources carries the potential to answer questions previously out of reach. Questions like: How do we reliably detect a potential pandemic early enough to intervene? Can we predict new materials with advanced properties before these materials have ever been synthesized? How can we reverse the current advantage of the attacker over the defender in guarding against cybersecurity threats?

However, there is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The rate at which data volumes, speeds, and complexity are growing is outpacing scientific and technological advances in data analytics, management, transport, and more.

Despite the widespread agreement on the opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions is confusing potential users and holding back progress. What are the attributes that define Big Data solutions? How is Big Data different from the traditional data environments and related applications that we have encountered thus far? What are the essential characteristics of Big Data environments? How do these environments integrate with currently deployed architectures? What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

NIST Big Data Working Group

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, reusability, and extensibility for big data analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope: The focus of the NBD-WG is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, reference architectures, and a technology roadmap which would enable breakthrough discoveries and innovation by advancing measurement science, standards, and technology in ways that enhance economic security and improve quality of life.

Deliverables:

  • Develop Big Data Definitions
  • Develop Big Data Taxonomies
  • Develop Big Data Reference Architectures
  • Develop Big Data Technology Roadmap

Target Date: The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013. Further milestones will be developed once the WG has initiated its regular meetings.

Participants: The NBD-WG is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.

Meetings: The NBD-WG will hold weekly meetings on Wednesdays from 1300 - 1500 EDT (unless announced otherwise) by teleconference. Please click here for the virtual meeting information.

Questions: General questions to the NBD-WG can be addressed to BigDataInfo@nist.gov

NIST Big Data Working Group Announcement

Source: http://bigdatawg.nist.gov/NIST_BigData_Working_Group_Announcement_Final_banner.pdf (PDF)

You are cordially invited to participate in the NIST Big Data Working Group Kick-off meeting to be held on June 19, 2013, 1300 – 1400 EDT via teleconferencing. The information you’ll need for participating in the kick-off meeting can be found at the bottom of this message.

 
There is a broad agreement among commercial, academic, and government leaders about the remarkable potential of “Big Data” to spark innovation, fuel commerce, and drive progress. Big Data is the term used to describe the deluge of data in our networked, digitized, sensor-laden, information driven world. The availability of vast data resources carries the potential to answer questions previously out of reach.
 
Questions like: How do we reliably detect a potential pandemic early enough to intervene? Can we predict new materials with advanced properties before these materials have ever been synthesized? How can we reverse the current advantage of the attacker over the defender in guarding against cybersecurity threats? However, there is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The rate at which data volumes, speeds, and complexity are growing is outpacing scientific and technological advances in data analytics, management, transport, and more.
 
Despite the widespread agreement on the opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions is confusing potential users and holding back progress. What are the attributes that define Big Data solutions? How is Big Data different from the traditional data environments and related applications that we have encountered thus far? What are the essential characteristics of Big Data environments? How do these environments integrate with currently deployed architectures? What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?
 
The NIST Big Data Working Group (NBD-WG) is being launched to address these questions. The Group is charged with developing over the coming months a consensus definition, taxonomy, reference architecture, and technology roadmap for Big Data that can be embraced by all sectors. The NBD-WG is co-chaired by Chaitan Baru, Bob Marcus, and Wo Chang. Dr. Baru is Distinguished Scientist at the San Diego Supercomputer Center and Director of the Center for Large-scale Data Systems Research. Dr. Marcus is CTO of ET-Strategies and a leader in cloud and data standards efforts with experience in commercial, academic, and government settings. Wo Chang is Digital Data Advisor in the NIST Information Technology Laboratory and an experienced contributor to national and international standards efforts.
 
Participation in the NIST Big Data Working Group is open to everyone. We hope to bring together stakeholder communities across industry, academic, and government sectors representing all of those with interests in Big Data techniques, technologies, and applications. The group needs your input to meet its goals so please join us for the kick-off meeting and contribute your ideas and insights.
 
Meetings
The NIST Big Data Working Group will hold weekly meetings on Wednesdays (unless announced otherwise) by teleconference. The meeting time is 1300 - 1500 EDT. The dial-in information is as follows: Phone: 866-692-4541, Participant Passcode: 312-484-475-560
More information about web conference tools can be found at: http://bigdatawg.nist.gov/virtualmeeting.php.
 
Questions
More details about the NIST Big Data Working Group can be found at http://bigdatawg.nist.gov.
General questions to the NIST Big Data Working Group can be addressed to BigDataInfo@nist.gov.

NBD-WG Virtual Meeting

Source: http://bigdatawg.nist.gov/virtualmeeting.php

 

NBD-WG Virtual Meeting

Group Meetings:
Starting June 19, 2013, the NIST NBD-WG will hold weekly meetings on Wednesdays from 13:00 - 15:00 EDT by teleconference (unless announced otherwise). The dial-in and web conferencing information for the meetings is as follows:
  • Phone (New): 206-402-0823, Participant code: 272-30-504
  • Web conferencing tool with share screen and VoIP audio, click here: NBD-WG (Preferred)
Meeting Schedules:
No. | Date | Time | Agenda | Minutes
1 | June 19, 2013 [Rescheduled to June 26] | 1300 - 1400 EDT | Agenda_1 | Minutes_1
2 | June 26, 2013: WG Kick-off Meeting | 1300 - 1500 EDT | Agenda_2 | Minutes_2
3 | July 3, 2013 | 1300 - 1500 EDT | Agenda_3 | Minutes_3
4 | July 10, 2013 | 1300 - 1500 EDT | Agenda_4 | Minutes_4
5 | July 17, 2013 | 1300 - 1500 EDT | Agenda_5 | Minutes_5
6 | July 24, 2013 | 1300 - 1500 EDT | Agenda_6 | Minutes_6
7 | July 31, 2013 | 1300 - 1500 EDT | Agenda_7 | Minutes_7
8 | August 7, 2013 | 1300 - 1500 EDT | Agenda_8 | Minutes_8
9 | August 14, 2013 | 1300 - 1500 EDT | Agenda_9 | Minutes_9
10 | August 21, 2013 | 1300 - 1500 EDT | Agenda_10 | Minutes_10
11 | August 28, 2013 | 1300 - 1500 EDT | Agenda_11 | Minutes_11
12 | September 4, 2013 | 1300 - 1500 EDT | Agenda_12 | Minutes_12
13 | September 11, 2013 | 1300 - 1500 EDT | Agenda_13 | Minutes_13
14 | September 18, 2013 | 1300 - 1500 EDT | Agenda_14 | Minutes_14
15 | September 25, 2013 | 1300 - 1500 EDT | Agenda_15 | Minutes_15
16 | September 30, 2013 [Save-The-Date] | Workshop #1 | Agenda_16 | Minutes_16
 

Charter

Source: http://bigdatawg.nist.gov/_uploadfil...837324457.docx (Word)

Document: NBD-WG M0001

July 12, 2013

CHARTER for

NIST Big Data Working Group (NBD-WG)

Background

NIST is leading the development of a Big Data Technology Roadmap. This roadmap will define and prioritize requirements for interoperability, portability, reusability, and extensibility for big data usage, analytic techniques and technology infrastructure in order to support secure and effective adoption of Big Data. To help develop the ideas in the Big Data Technology Roadmap, NIST is creating the Public Working Group for Big Data.

Scope

The focus of the NIST Big Data Working Group (NBD-WG) is to form a community of interest from industry, academia, and government, with the goal of developing consensus definitions, taxonomies, reference architectures, and technology roadmaps. The aim is to create vendor-neutral, technology- and infrastructure-agnostic deliverables to enable Big Data stakeholders to pick and choose the best analytics tools for their processing and visualization requirements on the most suitable computing platforms and clusters, while allowing value-add from Big Data service providers and flow of data between the stakeholders in a cohesive and secure manner.

Deliverables

1. Produce a working draft for Big Data Definitions
2. Produce a working draft for Big Data Taxonomies
3. Produce a working draft for Big Data Requirements
4. Produce a working draft for Big Data Security and Privacy Reference Architectures
5. Produce a working draft for Big Data Reference Architectures
6. Produce a working draft for Big Data Technology Roadmap

Target Date

The goal for completion of INITIAL DRAFTs is Friday, September 27, 2013.  Further milestones will be developed once the WG has initiated its regular meetings.

Co-Conveners

Wo Chang, NIST

Robert Marcus, ET-Strategies

Chaitanya Baru, UC San Diego

Subgroups and Descriptions

Definitions and Taxonomies – Survey and study current and future best practices of Big Data; develop concise definitions using factual information, characteristic description, and/or potential usage scenario; develop taxonomies by identifying clear boundaries between actors, roles, activities, components and subcomponents in their respective environments.

Requirements – Collect and organize use cases, general and specified focus from different application domains; identify and document needed requirements with sample scenarios.

Security and Privacy – Gather and analyze security and privacy related requirements from all Big Data stakeholders; develop security and privacy requirements that supplement the general Requirements of Big Data.

Reference Architecture – Gather and study available Big Data architectures from all stakeholders that meet the Big Data taxonomies model; develop Big Data reference architecture that can satisfy various use cases from diversified application domains.

Technology Roadmap – Gather input from NBD subgroups and study the taxonomies for the actors’ roles and responsibility, use cases and requirements, and secure reference architecture; perform gap analysis and produce vision and recommendations.

Subgroups Co-Chairs

Definitions and Taxonomies: *Nancy Grady (SAIC), Natasha Balac (SDSC), Eugene Luster (R2AD)

Requirements: *Geoffrey Fox (U. of Indiana), Joe Paiva (VA), Tsegereda Beyene (Cisco)

Security and Privacy: *Arnab Roy (CSA, Fujitsu), Nancy Landreville (U. of MD), Akhil Manchanda (GE)

Reference Architecture: *Orit Levin (Microsoft), James Ketner (AT&T), Don Krapohl (Augmented Intelligence)

Technology Roadmap: *Carl Buffington (USDA), Dan McClary (Oracle), David Boyd (Data Tactic)

Formation of the Subgroups: each subgroup will have one Lead Co-Chair and two Co-Chairs:

Lead Co-Chair - sends the meeting agenda with consensus from the other Co-Chairs, calls and conducts meetings, and acts as the main point of contact for the subgroup

2 Co-Chairs - help create the approach to meet NBD-WG needs and carry out subgroup tasks.

Meeting Frequency

It is anticipated that each subgroup will meet weekly by teleconference for two hours, Mondays through Fridays from 11:00 AM - 1:00 PM EDT, while the NBD-WG will meet approximately every three weeks on Wednesdays from 1300 - 1500 EDT. The order for subgroup meetings is:

Mondays: Definitions and Taxonomies

Tuesdays: Requirements

Wednesdays: Security and Privacy

Thursdays: Reference Architecture

Fridays: Technology Roadmap

Membership

Participation in the WG and Subgroups is open to all interested parties. There are no membership fees.

Coordination/Interaction

The NIST Big Data Working Group will function in close coordination with other big data related standards and best practices from industry, academia, and government.

Standing Rules

All information exchanged within the WG will be non-proprietary.

All information exchanged within the WG will contain non-PII materials.

WG members should assume that all materials exchanged will be made public.

Documents will be publicly accessible on the NIST Big Data Portal.

Outreach

WG results will be available to all stakeholders in the commercial, academic, and government sectors.

Co-Chairs

Source: http://bigdatawg.nist.gov/cochairs.php


Dr. Chaitan Baru

Dr. Chaitan Baru, Distinguished Scientist and Associate Director Data Initiatives, San Diego Supercomputer Center, University of California San Diego; Director, Center for Large-scale Data Systems Research (CLDS) and Lead of Big Data Benchmarking Community Working Group 
Dr. Baru has played a leadership role in a number of national-scale cyberinfrastructure R&D efforts across a wide range of science disciplines from earth sciences to ecology, biomedical informatics, and healthcare. Dr. Baru's interests are in research and development in the areas of parallel database systems, scientific data management, data analytics, and the challenges of data-driven science and data-driven enterprises.

Prior to joining SDSC, Dr. Baru was one of the group leaders in the development team for IBM's UNIX-based shared-nothing database system (DB2 Parallel Edition, released in Dec 1995). He contributed to the definition of the TPC-D decision support performance benchmark and led a performance team at IBM that produced the first result for TPC-D. Recently, Dr. Baru has helped establish a community effort to create a Big Data Benchmark, leading to the proposed BigData Top100 List (bigdatatop100.org), borrowing benchmarking ideas from the high-performance computing and transaction processing and database communities.

Dr. Robert Marcus

Dr. Robert Marcus, Chief Technology Officer at ET-Strategies, Co-Chair of Big Data Working Group at Cloud Standards Customer Council
Dr. Robert Marcus has been active for many years in software standards. He has worked with the DMTF, SNIA, OGF, OMG, Open Group, OCC, NCOIC, and TM Forum to organize major Cloud Workshops and coordination activities such as Cloud-Standards.org. He is currently the Co-Chair of the Cloud Standards Customer Council's Big Data Working Group. In March 2013, he organized a "Big Data in the Cloud" Conference that brought together executives from government with leaders from the standards, industry association, and vendor communities.

His previous experience includes Director of Technology Transformation and Deployment at General Motors, CTO of Rogue Wave Software, VP of Technical Strategy at the MCC Research Consortium, Director of the Colorado Grid Computing Initiative, Director of Object Technology at American Management Systems, Coordinator of Object Technology at Boeing, Advanced Technology Software Engineer at HP, and Professor of CS at City University of New York. In 2003, he published the book "Great Global Grid: Emerging Technology Strategies".

Mr. Wo Chang

Mr. Wo Chang, Digital Data Advisor for the NIST Information Technology Laboratory (ITL), Chair of ISO/IEC JTC/1 SC29 WG11 (MPEG) Multimedia Preservation AHG
Mr. Wo Chang is Digital Data Advisor for the NIST Information Technology Laboratory (ITL). His responsibilities include, but are not limited to, promoting a vital and growing Big Data community at NIST with external stakeholders in commercial, academic, and government sectors. Mr. Chang currently chairs the ISO/IEC JTC/1 SC29 WG11 (MPEG) Multimedia Preservation AHG.

Prior to joining the ITL Office, Mr. Chang was manager of the Digital Media Group in ITL, and his duties included overseeing several key projects including digital data, long-term preservation and management of EHRs, motion image quality, and multimedia standards. In the past, Mr. Chang was the Deputy Chair for the US National Body for MPEG (INCITS L3.1) and chaired several other key projects for MPEG, including MPQF, MAF, and MPEG-7 Profiles and Levels, and co-chaired the JPEG Search project. Mr. Chang was one of the original members of the W3C SMIL WG and developed one of the SMIL reference software implementations. Furthermore, Mr. Chang also participated in HL7 and ISO/IEC TC215 for health informatics and in the IETF for the development of the SIP, RTP/RTCP, RTSP, and RSVP protocols. Mr. Chang's research interests include digital data preservation, cloud computing, big data analytics, content metadata description, digital file formats, multimedia synchronization, and Internet protocols.

 

Guidelines

Source:http://bigdatawg.nist.gov/guidelines.php

NBD-WG General Guidelines

Public Working Group Guidance:
The National Institute of Standards and Technology (NIST) Big Data Program focuses on the development of community-based guidelines, metrics, and standards that accelerate the adoption of big data. This effort is undertaken in partnership with interested Federal, commercial, and private sector parties through the formation of Public Working Groups. This page summarizes the key provisions for forming and managing these Working Groups.

 Formation: Public Working Groups are established by the NIST Big Data Program to pursue its goals for accelerating the adoption of big data through a combined public-private effort. Input from government, commercial, and academic sector individuals and entities is considered by the Program in evaluating the formation or dissolution of a Public Working Group, including its initial and continuing scope and tasking.

 

 Membership: Participation in a Public Working Group is open and free to all interested parties; there are no membership fees.

 

 Leadership: Public Working Groups are led by NIST staff acting as Chairs and/or Co-Chairs.

 

 Products: The products of a Working Group, prepared at the direction of NIST staff and/or its designated representatives, are subject to NIST review and are intended for public release at NIST's discretion.

Mailing List:
Discussion on this topic will be done through the working group mailing list. The email address for the NBD-WG is BigDataWG@nist.gov. If you are interested in signing up to the mailing list, please go to: http://bigdatawg.nist.gov/newuser.php

If you want to remove yourself from the list, send an email with the word "unsubscribe" in the subject or message body (or both) to BigDataWG-request@nist.gov.

Virtual Meeting

Source: http://bigdatawg.nist.gov/program.php

NBD-WG/Subgroups Virtual Meeting
Group Meetings:
The dial-in and web conferencing information for the meetings are as follows:
  • All Subgroup Meeting Time: 11:00 - 13:00 EDT
  • Phone: 206-402-0823, Participant code: 272-30-504
  • Web conferencing tool with share screen and VoIP audio, click here: NBD-WG (Preferred)
Date Mondays Tuesdays Wednesdays Thursdays Fridays
Subgroup Definitions & Taxonomies Requirements Security & Privacy Reference Architecture Technology Roadmap
June 19 NBD-WG (13:00 - 15:00 EDT)
Kick-off Meeting (Rescheduled to June 26)
Agenda | Minutes
June 26 NBD-WG (13:00 - 15:00 EDT)
Kick-off Meeting
Agenda | Minutes
July 3 NBD-WG (13:00 - 15:00 EDT)
Establish Subgroups with Co-Chairs, Overall WG Direction
Agenda | Minutes
July 8 - 12 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
July 15 - 19 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
July 22 - 26 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
July 24 NBD-WG (13:00 - 15:00 EDT)
Subgroups Report and document status
Agenda | Minutes
July 29 - Aug. 2 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Aug. 5 - 9 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Aug. 12 - 16 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Aug. 14 NBD-WG (13:00 - 15:00 EDT)
Subgroups Report and document status 
Agenda | Minutes
Aug. 19 - 23 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Aug. 26 -30 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Sep. 2 - 6 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Sep. 4 NBD-WG (13:00 - 15:00 EDT)
Subgroups Report and document status
Agenda | Minutes
Sep. 9 - 13 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Sep. 16 - 20 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Sep. 23 - 27 Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes Agenda | Minutes
Sep. 25 NBD-WG (13:00 - 15:00 EDT)
Subgroups final draft report
Agenda | Minutes
Sep. 30 Big Data Workshop (Tentative)[Save-The-Date]

Tasks to Develop

Definitions

Source: http://bigdatawg.nist.gov/_uploadfil...301148665.docx (Word)

Big Data Definitions, v1

(Developed at Jan. 15 - 17, 2013 NIST Cloud/BigData Workshop)

Big Data refers to digital data volume, velocity and/or variety
[,veracity] that:

  • enable novel approaches to frontier questions previously
    inaccessible or impractical using current or conventional
    methods; and/or
  • exceed the capacity or capability of current or conventional
    methods and systems.

From this baseline, please help to improve/enhance the Big Data definitions.
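
My Note: as a rough illustration of how this draft definition might be applied, here is a minimal Python sketch that checks a dataset description against the volume, velocity, and variety criteria. The thresholds are hypothetical placeholders for "the capacity or capability of current or conventional methods and systems", not values from the NBD-WG.

    # Hypothetical limits standing in for "current or conventional methods and systems".
    CONVENTIONAL_LIMITS = {
        "volume_tb": 10.0,           # max terabytes a conventional DBMS handles comfortably (assumed)
        "velocity_gb_per_hr": 50.0,  # max sustained ingest rate (assumed)
        "variety_formats": 3,        # distinct data formats comfortably integrated (assumed)
    }

    def exceeds_conventional(volume_tb, velocity_gb_per_hr, variety_formats):
        """Return the 'Vs' on which a workload exceeds the hypothetical limits."""
        observed = {
            "volume_tb": volume_tb,
            "velocity_gb_per_hr": velocity_gb_per_hr,
            "variety_formats": variety_formats,
        }
        return [v for v, limit in CONVENTIONAL_LIMITS.items() if observed[v] > limit]

    # Example: a 15 PB/year workload (about 15,000 TB) trips the volume criterion.
    print(exceeds_conventional(volume_tb=15000, velocity_gb_per_hr=5, variety_formats=1))
    # -> ['volume_tb']

Under this reading, a workload counts as Big Data when it exceeds the conventional limits on at least one "V"; the definitional work is in agreeing on where those limits actually sit.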

Taxonomies

Source:

Reference Architectures

Source:

Technical Roadmap

Source:

Documents

Input Listing

Source: http://bigdatawg.nist.gov/show_InputDoc.php

 

Input Document Listing
Documents Received: 56
Please review and prepare to discuss the submitted documents before each meeting. 

No. | Date | Subgroup | Title | Revision
M0056 | 2013-07-24 | SecPrivacy | Security and Privacy Subgroup Meeting Agenda for July 24, 2013 | v1
M0055 | 2013-07-24 | All | Defining the Big Data Architecture Framework (BDAF): Outcome of the Brainstorming session at UvA. | v1
M0054 | 2013-07-24 | All | Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised | v1
M0053 | 2013-07-24 | All | Questions for Subgroups from Deliverables User Perspective | v1
M0052 | 2013-07-23 | TechRoadmap | NIST Big Data Roadmap Matrix v1.1 | v1
M0051 | 2013-07-23 | Requirements | NBD-Requirements Meeting Minutes July 23 2013 | v1
M0050 | 2013-07-23 | UseCases | NBD Requirements WG Use Case Template-v2 | v1
M0049 | 2013-07-23 | SecPrivacy | BD Privacy Use Case for Social Media | v1
M0048 | 2013-07-23 | UseCases | Big Data Privacy Use Case | v1
M0047 | 2013-07-24 | All | Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised | v1,v2
M0046 | 2013-07-22 | TechRoadmap | NIST Big Data Roadmap Matrix v.1.1 | v1
M0045 | 2013-07-22 | Requirements | NBD-Requirements Agenda July 23 2013 | v1,v2
M0044 | 2013-07-22 | DefTax | Def/Tax Agenda mtg-3 20130722 | v1
M0043 | 2013-07-22 | UseCases | NIST Genomics Use Case | v1
M0042 | 2013-07-22 | All | NBD-WG Meeting Agenda for July 24 Subgroups Joint Meeting | v1
M0041 | 2013-07-22 | All | Towards A Big Data Reference Architecture 2011 (Selected Slides) | v1
M0040 | 2013-07-19 | Requirements | NBD Requirements Working Group Interim Minutes 16 July 2013 | v1
M0039 | 2013-07-22 | RefArch | Collection of Architectures and Discussions for NBD Reference Architecture | v1,v2
M0038 | 2013-07-19 | TechRoadmap | Technology Roadmap_Meeting Agenda for July 19 2013 | v1
M0037 | 2013-07-19 | All | Reference Architecture, Requirements, Gaps, and Roles | v1
M0036 | 2013-07-18 | All | NBD-WG - Thinking Foward | v1
M0035 | 2013-07-18 | RefArch | Existing Standards Efforts Related to Big Data | v1
M0034 | 2013-07-17 | SecPrivacy | Security and Privacy Subgroup Agenda for July 17, 2013 | v1
M0033 | 2013-07-17 | RefArch | Agenda for 7/18/2013 | v1
M0032 | 2013-07-17 | UseCases | NBD Use Case Template | v1
M0031 | 2013-07-16 | UseCases | Different Requirements Use Case Templates | v1
M0030 | 2013-07-16 | RefArch | Reference Arch Subgroup Minutes 11 July 2013 | v1
M0029 | 2013-07-15 | Requirements | Requirements/Use Case Working Group Minutes July 9 2013 | v1
M0028 | 2013-07-15 | Requirements | Requirements/Use Case Working Group Agenda July 16 2013 | v1
M0027 | 2013-07-15 | RefArch | Reference Architecture Subgroup Agenda for July 11, 2013 | v1
M0026 | 2013-07-15 | TechRoadmap | Technology Roadmap Subgroup Agenda for July 12, 2013 | v1
M0025 | 2013-07-15 | DefTax | Def/Tax Agenda mtg-2 20130715 | v1
M0024 | 2013-07-22 | DefTax | NIST BDWG def-tax Definitions Draft | v1,v2,v3
M0023 | 2013-07-12 | DefTax | NIST BDWG DefTax Minutes 20130708 | v1
M0022 | 2013-07-17 | TechRoadmap | Technology Roadmap Subgroup Charter | v1,v2
M0021 | 2013-07-17 | RefArch | Reference Architecture Subgroup Charter | v1,v2
M0020 | 2013-07-17 | Requirements | Requirements Subgroup Charter | v1,v2
M0019 | 2013-07-17 | SecPrivacy | Security and Privacy Subgroup Charter | v1,v2
M0018 | 2013-07-17 | DefTax | Definitions and Taxonomies Subgroup Charter | v1,v2
M0017 | 2013-07-18 | RefArch | For presentation: Big Data Ecosystem RA (Microsoft) | v1,v2
M0016 | 2013-07-10 | RefArch | Evaluation Criteria for Reference Architecture | v1
M0015 | 2013-07-10 | All | Big Data Ecosystem RA (Microsoft) | v1
M0014 | 2013-07-10 | All | Combined Deliverables Overview Presentation | v1
M0013 | 2013-07-09 | SecPrivacy | Security and Privacy Subgroup Meeting Agenda for July 10, 2013 | v1
M0012 | 2013-07-08 | Requirements | Requirements Subgroup Meeting Agenda for July 9, 2013 | v1
M0011 | 2013-07-08 | DefTax | Subgroup Definitions and Taxonomies Meeting Agenda | v1
M0010 | 2013-07-08 | All | Personal Brainstorming Document | v1
M0009 | 2013-07-03 | All | NIST Special Publication 500-291 - NIST Cloud Computing Standards Roadmap - July 2011 | v1
M0008 | 2013-07-03 | All | NIST Special Publication 500-292 - NIST Cloud Computing Reference Architecture - September 2011 | v1
M0007 | 2013-07-03 | All | NIST_Security_Reference_Architecture_2013.05.15_v1 | v1
M0006 | 2013-07-03 | All | SP800-145 Cloud Definition | v1
M0005 | 2013-07-03 | All | NBD-WD July 3rd Meeting Agenda | v1
M0004 | 2013-06-25 | All | NIST Big Data Working Kick-off Meeting Agenda (w/ minor update) | v1
M0003 | 2013-06-19 | All | Big Data Definitions v1 | v1
M0002 | 2013-06-19 | All | NIST Big Data Working Kick-off Meeting Agenda | v1
M0001 | 2013-07-12 | All | NIST Big Data Working Group Charter - draft | v1,v2

Output Listing

Source:

Upload Document

Source:

Documents

Security and Privacy Subgroup Meeting Agenda for July 24, 2013

Source: http://bigdatawg.nist.gov/_uploadfil...768372564.docx

Defining the Big Data Architecture Framework (BDAF): Outcome of the Brainstorming session at UvA

Source: http://bigdatawg.nist.gov/_uploadfil...7606723276.pdf (PDF)

My Note: 56 slides in PDF

Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised

Source: http://bigdatawg.nist.gov/_uploadfil...8456980532.pdf

Questions for Subgroups from Deliverables User Perspective

Source: http://bigdatawg.nist.gov/_uploadfil...8336273678.doc (Word)

Questions for Subgroups of NIST Big Data Working Group

These questions are meant to be from the perspective of a target customer for the September deliverables, e.g., an IT executive considering a Big Data project.

The starting point is the initial NIST Big Data definition.

------------------------------------------------------------------------

“Big Data refers to digital data volume, velocity and/or variety [,veracity] that:

  • enable novel approaches to frontier questions previously inaccessible or impractical using current or conventional methods; and/or

  • exceed the capacity or capability of current or conventional methods and systems”

------------------------------------------------------------------------

This definition implies that Big Data solutions must use new developments beyond “current or conventional methods and systems”. Previous data architectures, technologies, requirements, taxonomies, etc. can be a starting point but need to be expanded for Big Data solutions. A possible user of Big Data solutions will be very interested in answering the following questions:

General Questions

*  What are the new developments that are included in Big Data solutions?

*  How do the new developments address the issues of needed capacity and capability?

*  What are the strengths and weaknesses of these new developments?

*  What are the new applications that are enabled by Big Data solutions?

*  Are there any best practices or standards for the use of Big Data solutions?

Questions for subgroups

*  What new definitions are needed to describe elements of new Big Data solutions?

*  How can the best Big Data solution be chosen based on use case requirements?

*  What new Security and Privacy challenges arise from new Big Data solutions?

*  How are new Big Data developments captured in new Reference Architectures?

* How will systems and methods evolve to remove Big Data solution weaknesses? 

NIST Big Data Roadmap Matrix v1.1

Source: http://bigdatawg.nist.gov/_uploadfil...889465949.xlsx (Excel)

My Note: Excel for Spotfire!

NBD-Requirements Meeting Minutes July 23 2013

Source: http://bigdatawg.nist.gov/_uploadfil...452918537.docx

NBD Requirements WG Use Case Template-v2

Source: http://bigdatawg.nist.gov/_uploadfil...663277969.docx (Word)

Current Draft: NBD (NIST Big Data) Requirements WG Use Case Template

  • Use Case Title
  • Vertical (area)
  • Author/Company/Email
  • Actors/Stakeholders and their roles and responsibilities
  • Goals
  • Use Case Description
  • Current Solutions
      - Compute (System)
      - Storage
      - Networking
      - Software
  • Big Data Characteristics
      - Data Source (distributed/centralized)
      - Volume (size)
      - Velocity (e.g. real time)
      - Variety (multiple datasets, mashup)
      - Variability (rate of change)
  • Big Data Science (collection, curation, analysis, action)
      - Veracity (Robustness Issues)
      - Visualization
      - Data Quality
      - Data Types
      - Data Analytics
  • Big Data Specific Challenges (Gaps)
  • Big Data Specific Challenges in Mobility
  • Security & Privacy Requirements
  • Highlight issues for generalizing this use case (e.g. for ref. architecture)
  • More Information (URLs)
  • Note: <additional comments>

Note: No proprietary or confidential information should be included
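
My Note: for collecting many submissions, the template fields map naturally onto a structured record. Below is a minimal, hypothetical Python sketch of that mapping; the field names are my paraphrases of the template headings, not an official schema.

    from dataclasses import dataclass, field

    @dataclass
    class UseCase:
        title: str
        vertical: str
        author: str
        actors: str
        goals: str
        description: str
        current_solutions: dict = field(default_factory=dict)  # compute/storage/networking/software
        characteristics: dict = field(default_factory=dict)    # volume/velocity/variety/variability
        science: dict = field(default_factory=dict)            # veracity/visualization/quality/types/analytics
        challenges: str = ""             # Big Data specific challenges (gaps)
        mobility_challenges: str = ""
        security_privacy: str = ""
        generalization_issues: str = ""
        urls: str = ""

    # A partially filled record based on the LHC example below.
    lhc = UseCase(
        title="Particle Physics: Analysis of LHC Data",
        vertical="Fundamental Scientific Research",
        author="Geoffrey Fox, Indiana University",
        actors="Physicists, Systems Staff, Accelerator Physicists, Government",
        goals="Understanding properties of fundamental particles",
        description="CERN LHC Accelerator and Monte Carlo producing events",
        characteristics={"volume": "15 PB/year", "velocity": "real time"},
    )
    print(lhc.title)

A structured representation like this would also make it straightforward to compare submissions across the five subgroups, e.g. by grouping on vertical or on the reported characteristics.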

Examples using previous draft

 

Use Case Title: Particle Physics: Analysis of LHC (Large Hadron Collider) Data (Discovery of Higgs particle)

Vertical: Fundamental Scientific Research

Author/Company/email: Geoffrey Fox, Indiana University, gcf@indiana.edu

Actors/Stakeholders and their roles and responsibilities: Physicists (design and identify need for experiment, analyze data), Systems Staff (design, build, and support distributed computing grid), Accelerator Physicists (design, build, and run accelerator), Government (funding based on long-term importance of discoveries in the field)

Goals: Understanding properties of fundamental particles

Use Case Description: CERN LHC Accelerator and Monte Carlo producing events describing particle-apparatus interaction. Processed information defines physics properties of events (lists of particles with type and momenta).

Current Solutions:
  Compute (System): 200,000 cores running “continuously”, arranged in 3 tiers (CERN, “Continents/Countries”, “Universities”). Uses “High Throughput Computing” (pleasingly parallel).
  Storage: Mainly distributed cached files.
  Analytics (Software): Initial analysis is processing of experimental data specific to each experiment (ALICE, ATLAS, CMS, LHCb), producing summary information. The second step in analysis uses “exploration” (histograms, scatter-plots) with model fits. Substantial Monte Carlo computations to estimate analysis quality.

Big Data Characteristics:
  Volume (size): 15 petabytes per year from accelerator and analysis.
  Velocity: Real time, with some long “shut downs” with no data except Monte Carlo.
  Variety: Many types of events with from two to a few hundred final particles, but all data is a collection of particles after initial analysis.
  Veracity (Robustness Issues): One can lose a modest amount of data without much pain, as errors are proportional to 1/sqrt(events gathered). It is important that the accelerator and experimental apparatus work both well and in an understood fashion; otherwise the data is too “dirty”/“uncorrectable”.
  Visualization: Modest use of visualization outside histograms and model fits.
  Data Quality: Huge effort to make certain the complex apparatus is well understood and “corrections” are properly applied to the data. Often requires data to be re-analysed.

Big Data Specific Challenges (Gaps): Analysis system was set up before clouds. Clouds have been shown to be effective for this type of problem. Object databases (Objectivity) were explored for this use case.

Security & Privacy Requirements: Not critical, although the different experiments keep results confidential until verified and presented.

More Information (URLs): http://grids.ucs.indiana.edu/ptliupages/publications/Where%20does%20all%20the%20data%20come%20from%20v7.pdf

Highlight issues for generalizing this use case (e.g. for ref. architecture):

1. Shall be able to analyze large amounts of data in a parallel fashion

2. Shall be able to process huge amounts of data in a parallel fashion

3. Shall be able to perform analytics and processing on a multi-node (200,000 cores) computing cluster

4. Shall be able to convert legacy computing infrastructure into a generic big data computing environment

Note: <additional comments>
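
My Note: the veracity entry above relies on counting statistics, where the relative error scales as 1/sqrt(N). The short Python sketch below shows why losing a modest fraction of events barely moves the error bar; the event counts are illustrative, not LHC figures.

    from math import sqrt

    def relative_error(n_events):
        # Counting (Poisson) statistics: relative error ~ 1/sqrt(N).
        return 1.0 / sqrt(n_events)

    full = 1_000_000_000          # hypothetical event count
    lost5 = int(full * 0.95)      # same sample with 5% of events lost

    print(f"full sample: {relative_error(full):.6%}")
    print(f"5% lost    : {relative_error(lost5):.6%}")
    # The error grows only by a factor of 1/sqrt(0.95), i.e. about 2.6%.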

       

 

Use Case Title: Netflix Movie Service

Vertical: Commercial Cloud Consumer Services

Author/Company/email: Geoffrey Fox, Indiana University, gcf@indiana.edu

Actors/Stakeholders and their roles and responsibilities: Netflix Company (grow sustainable business), Cloud Provider (support streaming and data analysis), Client user (identify and watch good movies on demand)

Goals: Allow streaming of user-selected movies to satisfy multiple objectives (for different stakeholders) -- especially retaining subscribers. Find the best possible ordering of a set of videos for a user (household) within a given context in real time; maximize movie consumption.

Use Case Description: Digital movies stored in the cloud with metadata; user profiles and rankings for a small fraction of movies for each user. Use multiple criteria: content-based recommender system, user-based recommender system, diversity. Refine algorithms continuously with A/B testing.

Current Solutions:
  Compute (System): Amazon Web Services (AWS) with Hadoop and Pig.
  Storage: Uses Cassandra NoSQL technology with Hive, Teradata.
  Analytics (Software): Recommender systems and streaming video delivery. Recommender systems are always personalized and use logistic/linear regression, elastic nets, matrix factorization, clustering, latent Dirichlet allocation, association rules, gradient boosted decision trees, and others. The winner of the Netflix competition (to improve ratings by 10%) combined over 100 different algorithms.

Big Data Characteristics:
  Volume (size): Summer 2012: 25 million subscribers; 4 million ratings per day; 3 million searches per day; 1 billion hours streamed in June 2012. Cloud storage 2 petabytes (June 2013).
  Velocity: Media and rankings continually updated.
  Variety: Data varies from digital media to user rankings, user profiles, and media properties for content-based recommendations.
  Veracity (Robustness Issues): Success of the business requires excellent quality of service.
  Visualization: Streaming media.
  Data Quality: Rankings are intrinsically “rough” data and need robust learning algorithms.

Big Data Specific Challenges (Gaps): Analytics needs continued monitoring and improvement.

Security & Privacy Requirements: Need to preserve privacy for users and digital rights for media.

More Information (URLs): http://www.slideshare.net/xamat/building-largescale-realworld-recommender-systems-recsys2012-tutorial by Xavier Amatriain; http://techblog.netflix.com/

Note: <additional comments>
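
My Note: as a concrete illustration of one technique named above, here is a minimal matrix-factorization recommender trained by stochastic gradient descent on a few synthetic ratings. This is a toy sketch under invented data and parameters, not Netflix's production system.

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, k = 5, 4, 2                      # tiny synthetic problem
    ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),  # (user, item, rating) triples
               (2, 3, 1.0), (3, 2, 4.5)]

    P = 0.1 * rng.standard_normal((n_users, k))        # latent user factors
    Q = 0.1 * rng.standard_normal((n_items, k))        # latent item factors
    lr, reg = 0.05, 0.02                               # learning rate, L2 regularization

    for epoch in range(200):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                      # prediction error on one observed rating
            P[u] += lr * (err * Q[i] - reg * P[u])     # SGD update of user factors
            Q[i] += lr * (err * P[u] - reg * Q[i])     # SGD update of item factors

    print("predicted rating for user 0, item 0:", round(float(P[0] @ Q[0]), 2))

A production system would add bias terms, train on billions of ratings, and blend this with the other algorithm families listed in the use case.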

BD Privacy Use Case for Social Media

Source: http://bigdatawg.nist.gov/_uploadfil...287217610.docx

Big Data Privacy Use Case

Source: http://bigdatawg.nist.gov/_uploadfil...526614644.docx

Towards A Big Data Reference Architecture 2011 (Selected Slides) Revised

Source: http://bigdatawg.nist.gov/_uploadfil...3966324695.pdf

NIST Big Data Roadmap Matrix v.1.1

Source: http://bigdatawg.nist.gov/_uploadfil...470520573.xlsx

My Note: Excel! Use in Spotfire!

NBD-Requirements Agenda July 23 2013

Source: http://bigdatawg.nist.gov/_uploadfil...215769261.docx

Def/Tax Agenda mtg-3 20130722

Source: http://bigdatawg.nist.gov/_uploadfil...996047870.docx

NBD-WG Meeting Agenda for July 24 Subgroups Joint Meeting

Source: http://bigdatawg.nist.gov/_uploadfil...426754928.docx

Towards A Big Data Reference Architecture 2011 (Selected Slides)

Source: http://bigdatawg.nist.gov/_uploadfil...7271372958.pdf (PDF)

My Note: 41 slides in PDF

NBD Requirements Working Group Interim Minutes 16 July 2013

Source: http://bigdatawg.nist.gov/_uploadfil...912094737.docx (Word)

The discussion of the use case template continued and several useful enhancements were suggested. A new template with additional detail has been proposed after the meeting and is being evaluated.

The chat log follows:

(10:17 AM) Wo Chang (Host, NIST) disconnected.

(10:17 AM) AR joined.

(10:18 AM) Wo Chang (Host, NIST) joined.

(10:22 AM) Wo Chang (Host, NIST): ejdjirewoher;ewjheoh

(10:22 AM) Wo Chang (Host, NIST): dkd;h;ifhd;fhd;d

(10:22 AM) Wo Chang (Host, NIST): dksd;dh;dd

(10:24 AM) William Miller (MaCT USA) joined.

(10:25 AM) Wo Chang (Host, NIST)20 joined.

(10:25 AM) William Miller (MaCT USA) disconnected.

(10:26 AM) William Miller (MaCT USA) joined.

(10:26 AM) Wo Chang (Host, NIST) disconnected.

(10:26 AM) William Miller (MaCT USA): hi wo

(10:34 AM) William Miller (MaCT USA) disconnected.

(10:34 AM) Wo Chang (Host, NIST)20 disconnected.

(10:34 AM) AR disconnected.

(10:49 AM) Wo Chang (Host, NIST) joined.

(10:49 AM) Wo Chang (Host, NIST) disconnected.

(10:51 AM) Wo Chang (Host, NIST) joined.

(10:52 AM) Eugene Luster (DISA CTO/R2AD) joined.

(10:52 AM) Eugene Luster (DISA CTO/R2AD): Good morning!

(10:54 AM) Cherry Tom joined.

(10:56 AM) Wo Chang (Host, NIST): goog morning all!

(10:56 AM) tbeyene joined.

(10:59 AM) Cherry Tom: any audio yet?

(11:00 AM) Orit Levin (Microsoft) joined.

(11:00 AM) Eugene Luster (DISA CTO/R2AD): I am muted.

(11:01 AM) Yuri Demchenko (UvA) joined.

(11:04 AM) Cherry Tom: still no audio on megameeting, should I call in telecon number?

(11:05 AM) William Miller (MaCT USA) joined.

(11:05 AM) Tim Zimmerlin (Automation Technologies) joined.

(11:05 AM) Wo Chang (Host, NIST): we are waiting for Geoffrey to lead the Requirements...

(11:08 AM) Dusty Jackson joined.

(11:08 AM) Geoffrey Fox joined.

(11:08 AM) Nancy Grady (SAIC) joined.

(11:13 AM) Pw Carey (Compliance Partners, LLC) joined.

(11:14 AM) PavithraKenjige joined.

(11:16 AM) Pw Carey (Compliance Partners, LLC): Can we collect real world vendor use cases business cases revolving around 'issues' associted with BIG DATA....perhaps...?

(11:17 AM) tbeyene24 joined.

(11:18 AM) Pw Carey (Compliance Partners, LLC): Yah beat us to it....way tah go....

(11:18 AM) PavithraKenjige disconnected.

(11:19 AM) Dusty Jackson disconnected.

(11:19 AM) PavithraKenjige joined.

(11:19 AM) Pw Carey (Compliance Partners, LLC): Except....for when things go South....do we address how to fix stuff...when things go South...?

(11:21 AM) Pw Carey (Compliance Partners, LLC): We believe NIST did develop a business case template....

(11:22 AM) tbeyene24: I thought so.  it would be if that would be loaded on the wiki site

(11:23 AM) Pw Carey (Compliance Partners, LLC): We need to look for it....then.....yes?

(11:25 AM) PavithraKenjige: what is the music with the audio

(11:26 AM) Pw Carey (Compliance Partners, LLC): Is there such a thing as DISASTER RECOVERY BUSINESS CONTINUITY within BigData....?

(11:30 AM) PavithraKenjige: Disaster recovery needs to address any specific storage and retrievel needs for big data other then that it is at the server level

(11:31 AM) Pw Carey (Compliance Partners, LLC): Do you want to assign folks...?

(11:32 AM) PavithraKenjige: yeah, someone needs to be assigned to address that.. lead can assign someone..

(11:33 AM) Bob Marcus (ET-Strategies) joined.

(11:33 AM) Pw Carey (Compliance Partners, LLC): Here's one use case template from the Cloud Computing WG at NIST....http://www.nist.gov/itl/cloud/use-cases.cfm

(11:35 AM) William Miller (MaCT USA): i would suggest that the characteristic requirements for big data and not do comparison between different implementations

(11:35 AM) Pw Carey (Compliance Partners, LLC): Pw Carey volunteers Pw for any use case that nobody else wants....Respectfully yours, Pw Carey

(11:35 AM) William Miller (MaCT USA): you cannot do a comparison without a criteria

(11:35 AM) William Miller (MaCT USA): the requirements are effectively in the criteria

(11:36 AM) Pw Carey (Compliance Partners, LLC): Correct.....lets leverage as much as we can.....

(11:37 AM) tbeyene: i agree

(11:37 AM) Orit Levin (Microsoft): I can attempt to submit the existing "Ad industry" use case using the template.

(11:38 AM) Pw Carey (Compliance Partners, LLC): For your eyes only....our email address is.....drum roll here.....pwc.pwcarey@gmail.com (GRC Application Security Analyst, CISA, CISSP) Respectfully yours, Pw

(11:39 AM) PavithraKenjige: people don't like drum roll

(11:39 AM) PavithraKenjige: only uite doers..

(11:39 AM) Pw Carey (Compliance Partners, LLC): Bummer

(11:40 AM) Geoffrey Fox: i can add astronomy, search, some genomics

(11:41 AM) PavithraKenjige disconnected.

(11:42 AM) Pw Carey (Compliance Partners, LLC): IBM has built a Traffic Mgt. element to Big Data....correct....?

(11:42 AM) PavithraKenjige joined.

(11:42 AM) Alicia Zuniga-Alvarado/AZA joined.

(11:43 AM) PavithraKenjige disconnected.

(11:43 AM) Pw Carey (Compliance Partners, LLC): P230 IEEE Use Case Criteria....where do we get a copy of this effort....?

(11:43 AM) PavithraKenjige joined.

(11:43 AM) Tim Zimmerlin (Automation Technologies): Library of Congress, NASA, USGS, NOAA have major, significant big data projects. Could NIST approach them for use case info?

(11:48 AM) Pw Carey (Compliance Partners, LLC): Here is one Big Data Application introduction from ComputerWeek:...http://www.computerweekly.com/featur...ces-challenges

(11:49 AM) Geoffrey Fox: there is an excellent set of talks at http://fisheritcenter.haas.berkeley....ata/index.html where many companies describe their big data app. IBM does traffic, GE does engine monitoring plus ebay linkedin etc.

(11:52 AM) Eugene Luster (DISA CTO/R2AD): Areas of Big Data but not necessarily Big Data Cloud required such as Particle Physics, Astrophysics, Genomics, Microbiology Biotech, Nanotechnology, Meteorology, Physical Oceanography, Ecological Management, Customer Behavior, Telecommunication Planning, Societal Administration, Healthcare Administration....there are more but these are not specifically requirements

(11:53 AM) PavithraKenjige disconnected.

(11:55 AM) Pw Carey (Compliance Partners, LLC): Should we include/incorporate a Big Data Cloud Computing Eco-system Lifecycle...the goal being not to be looking in the rearview mirror.....Yes....?

(11:55 AM) PavithraKenjige joined.

(11:57 AM) Pw Carey (Compliance Partners, LLC): These are the perceived gaps within Big Data...and these are the perceived requirements of Big Data.....correct...?

(11:57 AM) Geoffrey Fox disconnected.

(11:58 AM) Pw Carey (Compliance Partners, LLC): Please feel free to sign us up....you all have our permission.....Respectfully yours, Pw

(11:58 AM) Orit Levin (Microsoft): yes :-) Thanks

(11:58 AM) Orit Levin (Microsoft): The gaps and the requirements for the specific case described.

(11:58 AM) Pw Carey (Compliance Partners, LLC): You're welcome.....

(11:59 AM) Pw Carey (Compliance Partners, LLC): Submit our efforts in one week...before the next meeting....eh?

(11:59 AM) tbeyene: yes

(11:59 AM) Pw Carey (Compliance Partners, LLC): Okey Dokey....

(12:02 PM) Pw Carey (Compliance Partners, LLC): How about a Duty & Responsibility Czar to hand out the Team their duties and their responsibilities....?

(12:03 PM) Pw Carey (Compliance Partners, LLC): We'll develop a Use Case for Big Data Fraud Detection...if this is acceptable...for one use case....?

(12:03 PM) Nancy Grady (SAIC) disconnected.

(12:04 PM) Wo Chang (Host, NIST): yes, Pw please submit this use case, thanks.

(12:05 PM) Pw Carey (Compliance Partners, LLC): Ok....will do....word of honor.....Respectfully yours, Pw

(12:07 PM) Nancy Grady (SAIC) joined.

(12:09 PM) Bob Marcus (ET-Strategies): When do you expect to begin making requirements available to other subgroups?

(12:11 PM) Pw Carey (Compliance Partners, LLC): Folks on the call today are: Wo2, Wo Chang (Host, NIST), Eugene Luster (DISA CTO...), Cherry Tom, tbeyene, Orit Levin (Microsoft), Yuri Demchenko (UvA), William Miller (MaCT USA), Tim Zimmerlin (Automation...), Pw Carey (Compliance Partners, LLC), tbeyene24, Bob Marcus (ET-Strategies), Alicia Zuniga-Alvarado/AZA, PavithraKenjige, Nancy Grady (SAIC)....

(12:25 PM) Yuri Demchenko (UvA): need to leave, apologies

(12:25 PM) Yuri Demchenko (UvA) disconnected.

(12:26 PM) Pw Carey (Compliance Partners, LLC): Couldn't we just off-load it to the NSA...(National Student Assn.)....No?

(12:29 PM) Pw Carey (Compliance Partners, LLC): Neat flow chart....

(12:30 PM) _USER_NAME_(_COMPANY_) joined.

(12:36 PM) Pw Carey (Compliance Partners, LLC): That's nice if there are no human errors....neat-o.....

(12:38 PM) Cherry Tom: more applications - new book Big Data: A Revolution That Will Transform How We Live, Work, and Think [Hardcover] Viktor Mayer-Schonberger (Author), Kenneth Cukier

(12:39 PM) Pw Carey (Compliance Partners, LLC): Thanks...we'll look it up......unfortunately...we just saw Minority Report.....however, our thanks are genuine.....Respectfully yours, Pw

(12:42 PM) tbeyene disconnected.

(12:43 PM) Pw Carey (Compliance Partners, LLC): So, who owns the data...the individual or the database....?

(12:44 PM) PavithraKenjige disconnected.

(12:44 PM) PavithraKenjige joined.

(12:45 PM) Pw Carey (Compliance Partners, LLC): One use case could deal with data ownership within a Big Data Eco-system....no?

(12:45 PM) Pw Carey (Compliance Partners, LLC): Including, jurisdictions and 3rd parties.....

(12:46 PM) Pw Carey (Compliance Partners, LLC): I suppose data leakage could become a cottage industry for lawyers....no?

(12:48 PM) PavithraKenjige disconnected.

(12:48 PM) Tim Zimmerlin (Automation Technologies): I suggest that requirements and use cases separate three processes: data ingestion, data extraction, and data presentation.

(12:50 PM) PavithraKenjige joined.

(12:51 PM) Pw Carey (Compliance Partners, LLC): Can we take a hack at 'Data Ownership'....who owns the data....?

(12:51 PM) PavithraKenjige disconnected.

(12:52 PM) Pw Carey (Compliance Partners, LLC): Your suggestion: data ingestion, data extraction & presentation.....is good....do you have any further thoughts....in this regard....?

(12:53 PM) Pw Carey (Compliance Partners, LLC): Such as data security...data GRC....data CIA....and such....?

(12:53 PM) Tim Zimmerlin (Automation Technologies): Pw, I intend to keep it simple and obvious to interested parties.

(12:53 PM) PavithraKenjige joined.

(12:54 PM) Pw Carey (Compliance Partners, LLC): Ok.....we'll settle down....thanks for the hint....Respectfully yours, Pw

(12:55 PM) PavithraKenjige disconnected.

(12:56 PM) PavithraKenjige joined.

(12:59 PM) Pw Carey (Compliance Partners, LLC): Thank you for a very good session....quite good....&...useful, too....Pw

(1:00 PM) Pw Carey (Compliance Partners, LLC): We're receiving a bit of a feed-back...while we have our phone on mute.....

(1:01 PM) Pw Carey (Compliance Partners, LLC): Thank you....best wishes....Respectfully yours, Pw Carey

(1:01 PM) Tim Zimmerlin (Automation Technologies) disconnected.

(1:01 PM) Nancy Grady (SAIC) disconnected.

(1:01 PM) Cherry Tom disconnected.

(1:01 PM) Alicia Zuniga-Alvarado/AZA disconnected.

(1:01 PM) Pw Carey (Compliance Partners, LLC) disconnected.

(1:05 PM) William Miller (MaCT USA) disconnected.

(1:06 PM) Eugene Luster (DISA CTO/R2AD) disconnected.

(1:12 PM) Bob Marcus (ET-Strategies) disconnected.

(1:23 PM) Wo Chang (Host, NIST) disconnected.

(1:27 PM) Wo Chang (Host, NIST) joined.

Collection of Architectures and Discussions for NBD Reference Architecture

Source: http://bigdatawg.nist.gov/_uploadfil...103189105.docx (Word)


Technology Roadmap_Meeting Agenda for July 19 2013

Source: http://bigdatawg.nist.gov/_uploadfil...628691478.docx (Word)

NIST Big Data Working Group (NBD-WD)

NBD-WD-2013/M0038

 

Source:    Technology Roadmap Subgroup

Title:    Technology Roadmap Subgroup Agenda

Date:    July 19, 2013

Author(s):    Carl Buffington (Vistronix), Dan McClary (Oracle), David Boyd (Data Tactic)

Agenda for July 19, 2013

I.       Review of Action Items from Last Week

1)      Send out Draft Charter to the subgroup – Carl

2)      Develop the initial "This Tech Roadmap will..." (Audience, document goals/vision) - Carl

3)      Develop an initial graphical vision of the Tech Roadmap

4)      Collect Charter feedback and finalize charter - Carl

5)      Develop a matrix of Big Data categories to focus on, that features will go under - Bruno

6)      Maturity model evaluation and selection - Dan

7)      Define a target timeline

8)      Identify relevant standards that should be considered for Big Data Categories / Features

9)      Develop the outline

II.      Type of road maps we may consider - Gary

III.     Big Data Too Big for one person to design, build, or operate – Keith

IV.    New Action Items for this week

1)      Develop a timeline for the Tech Roadmap Deliverable

2)      Identify what the skill mix of a good big data team is

3)      Strategy for the socialization of big data within an organization

Reference Architecture, Requirements, Gaps, and Roles

Source: http://bigdatawg.nist.gov/_uploadfil...2377399813.pdf

My Note: Looks like excerpts from previous document

NBD-WG - Thinking Forward

Source: http://bigdatawg.nist.gov/_uploadfil...7458216169.pdf (PDF)

My Note: I looked at this and thought it was very good (15 slides)

Slide 1 NBD-WG: Thinking Forward

NBD-WG-ThinkingFowardWoChang2013-07-18Slide1.png

Slide 2 Disclaimer 1

NBD-WG-ThinkingFowardWoChang2013-07-18Slide2.png

Slide 3 Good Approach

NBD-WG-ThinkingFowardWoChang2013-07-18Slide3.png

Slide 4 Big Data Current Approach

NBD-WG-ThinkingFowardWoChang2013-07-18Slide4.png

Slide 5 Semi-new Approach

NBD-WG-ThinkingFowardWoChang2013-07-18Slide5.png

Slide 6 Big Data Processing Protocol and Markup Language

NBD-WG-ThinkingFowardWoChang2013-07-18Slide6.png

Slide 7 Subgroups

NBD-WG-ThinkingFowardWoChang2013-07-18Slide7.png

Slide 8 Definitions and Taxonomies

NBD-WG-ThinkingFowardWoChang2013-07-18Slide8.png

Slide 9 Taxonomies

NBD-WG-ThinkingFowardWoChang2013-07-18Slide9.png

Slide 10 Requirements

NBD-WG-ThinkingFowardWoChang2013-07-18Slide10.png

Slide 11 Security and Privacy

NBD-WG-ThinkingFowardWoChang2013-07-18Slide11.png

Slide 12 Reference Architecture

NBD-WG-ThinkingFowardWoChang2013-07-18Slide12.png

Slide 13 Technology Roadmap

NBD-WG-ThinkingFowardWoChang2013-07-18Slide13.png

Slide 14 How Can You Help?

NBD-WG-ThinkingFowardWoChang2013-07-18Slide14.png

Slide 15 Disclaimer 2

NBD-WG-ThinkingFowardWoChang2013-07-18Slide15.png

Existing Standards Efforts Related to Big Data

Source: http://bigdatawg.nist.gov/_uploadfil...798723092.docx (Word)

My Note: I looked at this document

NIST Big Data Working Group (NBD-WD)
NBD-WD-2013/M0035

Source:    Technology Roadmap Subgroup
Status:    Informational
Title:    Existing Standards Efforts Related to Big Data
Author:    Keith W. Hare
Senior Consultant, JCC Consulting, Inc.
Vice Chair, ANSI INCITS DM32.2, Databases
Convenor, ISO/IEC JTC1 SC32 WG3 Database Languages

Abstract

As the NIST Big Data Working Group Technology Roadmap Subgroup proceeds, one task is to identify “standardization and adoption priorities through an understanding of what standards are available or under development.” While a complete identification of standardization and adoption priorities depends on the Reference Architecture, this document presents an initial view of related standards organizations and standards.

Introduction

Big Data has generated interest in a wide variety of organizations, including the de jure standards process, industry consortiums, and open source organizations. Each of these organizations operates differently and focuses on different aspects.

The following sections describe work currently in planning and in progress in three organizations:
•    INCITS and ISO de jure standards process
•    Apache Software Foundation
•    W3C – Industry consortium

As a participant in the USA and international data management standards efforts, I am most familiar with that work. However, I have attempted to identify efforts in other organizations as well, and I expect that this document will prompt others to identify areas that I have missed.

INCITS, ISO, and ISO/IEC Standards Efforts

There are two major international standards development organizations working on computer related standards, ISO (International Organization for Standardization) and IEC (International Electrotechnical Commission). Many of the information technology standards are developed by subcommittees (SC) of an ISO/IEC Joint Technical Committee – JTC1.

Within the USA, there are standards committees that correspond to the ISO and JTC1 committees. The USA committees are under the auspices of INCITS (InterNational Committee for Information Technology Standards), an ANSI accredited standards development organization.

Data Management and Interchange

Within ISO, the ISO/IEC JTC1 SC32 committee is responsible for Data Management and Interchange standards. Working groups within SC32 are responsible for a variety of standards in the areas of Metadata and Database languages. SC32 currently has a study group in place looking at Next Generation Analytics and Big Data.

SC32 standards efforts focus on interfaces between software layers, not on the underlying physical storage.

The structures of the committees are:

USA (ANSI) Standards Groups:

•    INCITS DM32 Data Management and Interchange
     Chair: Dr. Donald Deutsch, Oracle
     Secretary: Mr. Michael Gorman, Whitemarsh Information Systems

•    DM32.2 Database
     Chair: Dr. Donald Deutsch, Oracle
     Vice Chair: Mr. Keith W. Hare, JCC Consulting, Inc.
     Secretary: Mr. Michael Gorman, Whitemarsh Information Systems Corporation
     Int'l. Rep: Krishna Kulkarni, IBM Corporation

•    DM32.8 Metadata
     Chair: Mr. Daniel Gillman, Bureau of Labor Statistics
     Int'l. Rep: Mr. Frank Farance, Farance Inc

ISO/IEC Standards Groups:

•    ISO/IEC JTC1 SC32 Data Management and Interchange
     Chair: Mr. Jim Melton, USA
     Secretary: Dr. Timothy D. Schoechle, USA

•    Working Group 1 – e-Business
     Convenor: Wenfeng Sun, China

•    Working Group 2 – Metadata
     Convenor: Denise Warzel, USA

•    Working Group 3 – Database Languages
     Convenor: Mr. Keith W. Hare, USA

•    Working Group 4 – Multimedia
     Convenor: Kohji Shibano, Japan

The following descriptions of SC32 work were extracted from the June 2013 report from SC32 Next Generation Analytics/Big Data working group:

SC32 N2388b “Report of study group on next generation analytics and big data”, Keith Hare - study group rapporteur, June 2013
http://www.jtc1sc32.org/doc/N2351-24..._analytics.pdf

Metadata

The aim of the metadata standards developed within Working Group 2 of SC32 is to provide an information and software services infrastructure that supports communities that wish to interoperate.

The ISO/IEC 11179 series of standards provides specifications for the structure of a metadata registry and the procedures for the operation of such a registry. These standards address the semantics of data (both terminological and computational), the representation of data, and the registration of the descriptions of that data. It is through these descriptions that an accurate understanding of the semantics and a useful depiction of the data are found. These standards promote:

•    Standard description of data
•    Common understanding of data across organizational elements and between organizations
•    Re-use and standardization of data over time, space, and applications
•    Harmonization and standardization of data within an organization and across organizations
•    Management of the components of data
•    Re-use of the components of data

The ISO/IEC 20943 series of technical reports provides a set of procedures for achieving consistency of content within a metadata registry.

The ISO/IEC 19763 series of standards provides specifications for a metamodel framework for interoperability. In this context interoperability should be interpreted in its broadest sense: the capability to communicate, execute programs, or transfer data among various functional units in a manner that requires the user to have little or no knowledge of the unique characteristics of those units (ISO/IEC 2382-1:1993). ISO/IEC 19763 will eventually cover:

•    A core model to provide common facilities
•    A basic mapping model to allow for the common semantics of two models to be registered
•    A metamodel for the registration of ontologies
•    A metamodel for the registration of information models
•    A metamodel for the registration of process models
•    A metamodel for the registration of models of services, principally web services
•    A metamodel for the registration of roles and goals associated with processes and services
•    A metamodel for the registration of form designs

The ISO/IEC 19763 series of standards will also include a technical report describing on-demand model selection based on roles, goals, processes and services and a standard for a registry of registries.

Data Storage and Retrieval

SC32 WG3 has defined the SQL database language in the 9075 family of standards. Additional work is needed to accommodate new types of objects such as JSON.
Additional work may be needed in Call Level Interfaces (CLI) to better support distributed access.

•    ISO/IEC 9075-1:2011 Information technology -- Database languages -- SQL -- Part 1: Framework (SQL/Framework)
•    ISO/IEC 9075-2:2011 Information technology -- Database languages -- SQL -- Part 2: Foundation (SQL/Foundation)
•    ISO/IEC 9075-3:2008 Information technology -- Database languages -- SQL -- Part 3: Call-Level Interface (SQL/CLI)
•    ISO/IEC 9075-4:2011 Information technology -- Database languages -- SQL -- Part 4: Persistent Stored Modules (SQL/PSM)
•    ISO/IEC 9075-9:2008 Information technology -- Database languages -- SQL -- Part 9: Management of External Data (SQL/MED)
•    ISO/IEC 9075-10:2008 Information technology -- Database languages -- SQL -- Part 10: Object Language Bindings (SQL/OLB)
•    ISO/IEC 9075-11:2011 Information technology -- Database languages -- SQL -- Part 11: Information and Definition Schemas (SQL/Schemata)
•    ISO/IEC 9075-13:2008 Information technology -- Database languages -- SQL -- Part 13: SQL Routines and Types Using the Java TM Programming Language (SQL/JRT)
•    ISO/IEC 9075-14:2011 Information technology -- Database languages -- SQL -- Part 14: XML-Related Specifications (SQL/XML)

Support for Complex Data Types

SC32 WG4 has defined standards for complex data storage and retrieval:

•    ISO/IEC 13249-2 SQL/MM Part 2: Full Text provides full information retrieval capabilities and complements SQL and SQL/XML. SQL/XML provides facilities to manage XML structured data while MM Part 2 provides contents based retrieval.
•    ISO/IEC 13249-3 Part 3: Spatial provides all the functionalities required to support geo applications. Most big data applications now include processing of GPS data together with geographic information. Thus Part 3: Spatial is also one of the key components of big data applications. This work is carefully coordinated with ISO TC 211 and the Open GIS Consortium.
•    ISO/IEC 13249-5 Part 5: Still Image provides basic functionalities for image data management.
•    ISO/IEC 13249-6 Part 6: Data Mining provides all the functionalities required to support statistical data mining applications. SQL/OLAP functionality provides simple online analytic processing while MM Part 6 provides sophisticated statistical data mining functionalities.

Geographical Information Systems

From the INCITS web site:

Geographic Information Systems form a distinct class of information systems through their unique requirements for collecting, converting, storing, retrieving, processing, analyzing, creating, and displaying geographic data. The generic nature of GIS, organizing information by location, is interdisciplinary and not specific to any application.

The work of L1, Geographic Information Systems (GIS) consists of adopting or adapting information technology standards and developing digital geographic data standards. Digital geographic data standards are concerned with creating, defining, describing, and processing such data.

The structures of the committees are:

USA (ANSI) Standards Groups:

•    INCITS L1 - Geographical Information Systems
     Chair: Dr. Liping Di, George Mason University
     Int'l. Rep: Dr. Meixia Deng, George Mason University

ISO Standards Groups:

•    ISO/TC 211 – Geographic information/Geomatics
     Chair: Mr. Olaf Østensen, Norway
     Secretary: Ms. Bjørnhild Sæterøy, Norway

IT Security techniques

From the INCITS web site:

The area of work includes standardization in the following areas:

•    Management of information security and systems
•    Management of third party information security service providers
•    Intrusion detection
•    Network security
•    Incident handling
•    IT Security evaluation and assurance
•    Security assessment of operational systems
•    Security requirements for cryptographic modules

•    Protection profiles
•    Role based access control
•    Security checklists
•    Security metrics

Cryptographic and non-cryptographic techniques and mechanisms including:
•    confidentiality
•    entity authentication
•    non-repudiation
•    key management
•    data integrity
•    message authentication
•    hash-functions
•    digital signatures

Future service and applications standards supporting the implementation of control objectives and controls as defined in IS 27001, in the areas of:
•    business continuity
•    outsourcing

Identity management, including:
•    identity management framework
•    role based access control
•    single sign-on

Privacy technologies, including:
•    privacy framework
•    privacy reference architecture
•    privacy
•    anonymity and credentials
•    specific privacy enhancing technologies

The structures of the committees are:

USA (ANSI) Standards Groups:

•    INCITS CS1 - Cyber Security
     Chair: Mr. Dan Benigni, NIST
     Vice Chair: Mr. Sal Francomacaro, NIST
     Int'l. Rep: Mr. Eric Hibbard, Hitachi Data Systems

ISO/IEC Standards Groups:

•    ISO/IEC JTC 1/SC 27 IT Security techniques
     Chair: Dr. Walter Fumy
     Secretary: Mrs. Krystyna Passia

Cloud Computing – Distributed application platforms and services

SC38 is responsible for developing standards for interoperable Distributed Application Platforms and Services including:
•    Web Services
•    Service Oriented Architecture (SOA)
•    Cloud Computing

The structures of the committees are:

USA (ANSI) Standards Groups:

•    INCITS DAPS38 - Distributed Application Platforms & Services (DAPS)
     Chair: Mr. Steve Holbrook, IBM Corporation
     Vice Chair: Mr. Tom Rutt, Fujitsu America Inc.
     Secretary: Mr. Joel Fleck II, Hewlett-Packard Company
     Int'l. Rep: Mr. John Calhoon, Microsoft Corporation

ISO/IEC Standards Groups:

•    ISO/IEC JTC 1/SC 38 Distributed Application Platforms and Services (DAPS)
     Chair: Dr. Donald Deutsch (USA)
     Secretary: Marisa Peacock (USA)

Coding of Audio, Picture, Multimedia, and Hypermedia Information

From the INCITS web site:
Responsible for the standardization of coded representation of audio, picture, multimedia and hypermedia information - and of sets of compression and control functions for use with such information - such as: audio information, bi-level and limited bits-per-pixel still pictures; computer graphics images; moving pictures and associated audio, multimedia and hypermedia information for real-time final form interchange; and audio visual interactive scriptware.

The structures of the committees are:

USA (ANSI) Standards Groups:

•    INCITS L3 - Coding of Audio, Picture, Multimedia, and Hypermedia Information
     Chair: Dr. Arianne Hinds, Cable Television Laboratories (a.hinds@cablelabs.com)
     Int'l. Rep: Dr. Andrew Tescher, Microsoft Corporation (andytescher@comcast.net)

ISO Standards Groups:

•    ISO/IEC JTC 1/SC29 – Coding of audio, picture, multimedia and hypermedia information
     Chair: Mr. Kohtaro Asai
     Secretary: Ms. Yukiko Ogura

Apache Software Foundation

The Apache Software Foundation (http://www.apache.org/) provides support for the Apache community of open-source software projects. There are two categories of projects that are of obvious interest to the NIST Big Data efforts:

•    Big Data – http://projects.apache.org/indexes/c....html#big-data
•    Databases – http://projects.apache.org/indexes/c....html#database

W3C

The W3C is an industry consortium that develops Web standards, including:

•    XML Technology
•    HTML5 Image Description Extension

The W3C has a “Big Data Community Group”. The website http://www.w3.org/community/bigdata/ contains the following description:

This group will explore emerging BIG DATA pipelines and discuss the potential for developing standard architectures, Application Programming Interfaces (APIs), and languages that will improve interoperability, enable security, and lower the overall cost of BIG DATA solutions.

The BIG DATA community group will also develop tools and methods that will enable: a) trust in BIG DATA solutions; b) standard techniques for operating on BIG DATA, and c) increased education and awareness of accuracy and uncertainties associated with applying emerging techniques to BIG DATA.

This community group has not published any reports.

Security and Privacy Subgroup Agenda for July 17, 2013

Source: http://bigdatawg.nist.gov/_uploadfil...872189885.docx

My Note:

Agenda for 7/18/2013

Source: http://bigdatawg.nist.gov/_uploadfil...932687260.docx

My Note:

NBD Use Case Template

Source: http://bigdatawg.nist.gov/_uploadfil...739896264.docx

My Note:

Different Requirements Use Case Templates

Source: http://bigdatawg.nist.gov/_uploadfil...088741235.docx

My Note:

Reference Arch Subgroup Minutes 11 July 2013

Source: http://bigdatawg.nist.gov/_uploadfil...215627041.docx

My Note:

Requirements/Use Case Working Group Minutes July 9 2013

Source: http://bigdatawg.nist.gov/_uploadfil...505587968.docx

My Note:

Requirements/Use Case Working Group Agenda July 16 2013

Source: http://bigdatawg.nist.gov/_uploadfil...920486796.docx

My Note:

Reference Architecture Subgroup Agenda for July 11, 2013

Source: http://bigdatawg.nist.gov/_uploadfil...234363302.docx

My Note:

Technology Roadmap Subgroup Agenda for July 12, 2013

Source: http://bigdatawg.nist.gov/_uploadfil...173957004.docx

My Note:

Def/Tax Agenda mtg-2 20130715

Source: http://bigdatawg.nist.gov/_uploadfil...443761258.docx

My Note:

NIST BDWG def-tax Definitions Draft

Source: http://bigdatawg.nist.gov/_uploadfil...763872254.docx

My Note:

NIST BDWG DefTax Minutes 20130708

Source: http://bigdatawg.nist.gov/_uploadfil...350384080.docx

My Note:

Technology Roadmap Subgroup Charter

Source: http://bigdatawg.nist.gov/_uploadfil...157997524.docx

My Note:

Reference Architecture Subgroup Charter

Source: http://bigdatawg.nist.gov/_uploadfil...192387432.docx

My Note:

Requirements Subgroup Charter

Source: http://bigdatawg.nist.gov/_uploadfil...552150483.docx

My Note:

Security and Privacy Subgroup Charter

Source: http://bigdatawg.nist.gov/_uploadfil...431774731.docx

My Note:

Definitions and Taxonomies Subgroup Charter

Source: http://bigdatawg.nist.gov/_uploadfil...569839699.docx

My Note:

For presentation: Big Data Ecosystem RA (Microsoft)

Source: http://bigdatawg.nist.gov/_uploadfil...4175727836.pdf (PDF)

My Note: I looked at this (6 slides)

Slide 1 Big Data Ecosystem Reference Architecture

PresentationBigDataEcosystemRAMicrosoftOritLevin2013-07-10Slide1.png

Slide 2 RA Objectives

PresentationBigDataEcosystemRAMicrosoftOritLevin2013-07-10Slide2.png

Slide 3 Big Data Ecosystem RA

PresentationBigDataEcosystemRAMicrosoftOritLevin2013-07-10Slide3.png

Slide 4 An Example of Cloud Computing Usage in Big Data Ecosystem

PresentationBigDataEcosystemRAMicrosoftOritLevin2013-07-10Slide4.png

Slide 5 Use Case: Advertising

PresentationBigDataEcosystemRAMicrosoftOritLevin2013-07-10Slide5.png

Slide 6 Use Case: Enterprise Data Warehouse

PresentationBigDataEcosystemRAMicrosoftOritLevin2013-07-10Slide6.png

Evaluation Criteria for Reference Architecture

Source: http://bigdatawg.nist.gov/_uploadfil...4244171934.pdf (PDF)

My Note: This should be included

Slide 1 Evaluation Criteria for Reference Architecture

EvaluationCriteriaforReferenceArchitecture2013-07-10Slide1.png

Slide 2 Criteria for Evaluating Reference Architecture

EvaluationCriteriaforReferenceArchitecture2013-07-10Slide2.png

Slide 3 Brainstorming Big Data Reference Architecture

EvaluationCriteriaforReferenceArchitecture2013-07-10Slide3.png

Slide 4 Criteria Examples

EvaluationCriteriaforReferenceArchitecture2013-07-10Slide4.png

Slide 5 Apache Big Data Framework in Reference Architecture

EvaluationCriteriaforReferenceArchitecture2013-07-10Slide5.png

Big Data Ecosystem RA (Microsoft)

Source: http://bigdatawg.nist.gov/_uploadfil...596737703.docx

My Note:

Combined Deliverables Overview Presentation

Source: http://bigdatawg.nist.gov/_uploadfil...3993569620.pdf (PDF)

My Note: Based on Personal Brainstorming Document below?

Security and Privacy Subgroup Meeting Agenda for July 10, 2013

Source: http://bigdatawg.nist.gov/_uploadfil...587795410.docx

My Note:

Requirements Subgroup Meeting Agenda for July 9, 2013

Source: http://bigdatawg.nist.gov/_uploadfil...302762128.docx

My Note:

Subgroup Definitions and Taxonomies Meeting Agenda

Source: http://bigdatawg.nist.gov/_uploadfil...148313186.docx

My Note:

Personal Brainstorming Document

Source: http://bigdatawg.nist.gov/_uploadfil...6762570643.pdf (PDF)

My Note: Excellent. Should put here!

1. Introduction and Definition

The purpose of this outline is to illustrate how some initial brainstorming documents might be pulled together into an integrated deliverable. The outline will follow the diagram below.
PersonalBrainstormingDocument2013-07-08Figure1.png
 
Section 1 introduces a definition of Big Data. An extended terminology Glossary is found in Appendix A. In Section 2, a Reference Architecture diagram is presented, followed by a taxonomy describing and extending the elements of the Reference Architecture. Section 3 maps requirements from use case building blocks to the Reference Architecture. A description of the requirement, a gap analysis, and a suggested best practice is included with each mapping. In Section 4, future improvements in Big Data technology are mapped to the Reference Architecture. An initial Technology Roadmap is created based on the requirements and gap analysis in Section 3 and the expected future improvements from Section 4. Section 5 is a placeholder for an extended discussion of Security and Privacy. Section 6 gives an example of some general advice. The Appendices provide Big Data terminology and solutions glossaries, Use Case Examples, and some possible Actors and Roles.
 
Big Data Definition - “Big Data refers to the new technologies and applications introduced to handle increasing Volumes of data while enhancing data utilization capabilities such as Variety, Velocity, Variability, Veracity, and Value.”
 
The key attribute is the large Volume of data available that forces horizontal scalability of storage and processing and has implications for all the other V-attributes. It should be noted that the other V-attributes were present before the introduction of “Big Data”. (For example, non-relational databases are not a new idea.) It is the combination of these attributes with required horizontal scalability that requires new technology.
 
Some implications of the V-attributes are given below:
 
Volume - Key driving requirement for robust horizontal scalability of storage/processing
Variety - Driving move to non-relational data models (e.g. key-value)
Variability - Driving need for adaptive infrastructure
Value - Driving need for new querying and analytics tools
Veracity - Driving need for ensuring trust in the accuracy, relevance, and security of data
Velocity - Driving many optimizations such as key-based interfaces, high availability, in-memory databases, columnar storage, SSD storage, and parallel stream processing

2. Reference Architecture and Taxonomy

The Reference Architecture below will help focus the discussion of other deliverables.

PersonalBrainstormingDocument2013-07-08Figure2.png

 

Requirements, capabilities, and gap analysis for Big Data technologies will be mapped to elements of the Reference Architecture. This will enable the development of a detailed Technology Roadmap across the entire Big Data space.
 
A Reference Architecture Taxonomy will provide descriptions and extensions for the elements of the Reference Architecture. Elements in the Reference Architecture Diagram are in bold.
5. System and Process Management
5.1 Systems Management
5.2 Process Management
6. Data Processing within the Architecture Framework
9. IO External to Architecture Framework and Stream Processing

3. Requirements, Gap Analysis, and Suggested Best Practices

In the Requirements discussion, building block components for use cases will be mapped to elements of the Reference Architecture. These components will occur in many use cases across multiple application domains. A short description, possible requirements, a gap analysis, and suggested best practices are provided for each building block.
1. Data input and output to Big Data File System (ETL, ELT)
 
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure3.png
Description: The Foundation Data Store can be used as a repository for very large amounts of data (structured, unstructured, semi-structured). This data can be imported and exported to external data sources using data integration middleware.
 
Possible Requirements: The data integration middleware should be able to do high performance extraction, transformation and load operations for diverse data models and formats.
 
Gap Analysis: The technology for fast ETL to external data sources (e.g. Apache Flume, Apache Sqoop) is available for most current data flows. There could be problems in the future as the size of data flows increases (e.g. LHC). This may require some filtering or summation to avoid overloading storage and processing capabilities.
 
Suggested Best Practices: Use packages that support data integration. Be aware of the possibilities for Extract-Load-Transform (ELT), where transformations can be done using data processing software after the raw data has been loaded into the data store, e.g. Map-Reduce processing on top of HDFS.
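To make the ETL/ELT distinction concrete, here is a minimal plain-Python sketch of the ELT pattern: the raw extract is landed in the store unchanged, and cleansing happens afterwards inside the processing layer. All file contents and field names below are invented for illustration.

import csv

# Hypothetical raw extract; in a real system this would be a file landed
# in the foundation data store (e.g. HDFS) by a tool such as Flume or Sqoop.
raw_extract = "user_id,ts,amount\n42,2013-07-01,9.99\n,2013-07-02,1.50\n"

# Extract-Load: persist the data exactly as received, no transformation yet.
landing_zone = raw_extract.splitlines()

# Transform (after load): cleanse and reshape inside the processing layer,
# the way Map-Reduce jobs transform raw files already resident in HDFS.
reader = csv.DictReader(landing_zone)
transformed = [
    {"user_id": int(r["user_id"]), "event_time": r["ts"], "amount": float(r["amount"])}
    for r in reader
    if r["user_id"]  # cleansing rule: drop records missing the key
]

print(transformed)

The same pipeline run as ETL would apply the cleansing step before the landing step; ELT simply defers it until the data is already in the store.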
2. Data exported to Databases from Big Data File System
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure4.png
Description: A data processing system can extract, transform, and transmit data to operational and analytic databases.
 
Possible Requirements: For good throughput performance on very large data sets, the data processing system will require multi-stage parallel processing
 
Gap Analysis: Technology for ETL is available (e.g. Apache Sqoop for relational databases, MapReduce processing of files). However, if multiple high performance passes through the data are necessary, it will be necessary to avoid rewriting intermediate results to files, as is done by the original implementations of MapReduce.
 
Suggested Best Practices: Consider using data processing that does not need to write intermediate results to files e.g. Spark.
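A sketch of the idea in plain Python: a generator pipeline whose stages hand records to one another in memory, analogous to keeping intermediate results in RAM (as Spark does) rather than writing them back to files between passes. The record layout is invented.

# Stage 1: extract records from the (simulated) big data file system.
def extract(lines):
    for line in lines:
        yield line.strip().split(",")

# Stage 2: transform each record; nothing is written to disk between stages.
def transform(records):
    for user_id, amount in records:
        yield user_id, float(amount)

# Stage 3: "transmit" aggregated results to an operational database,
# stubbed here as a plain dict.
def load(records):
    db = {}
    for user_id, amount in records:
        db[user_id] = db.get(user_id, 0.0) + amount
    return db

source = ["a,1.0", "b,2.5", "a,0.5"]
print(load(transform(extract(source))))  # {'a': 1.5, 'b': 2.5}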
3. Big Data File Systems as a data resource for batch and interactive queries
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure5.png
Description: The foundation data store can be queried through interfaces using batch data processing or direct foundation store access.
 
Possible Requirements: The interfaces should provide good throughput performance for batch queries and low latency performance for direct interactive queries.
 
Gap Analysis: Optimizations will be necessary in the internal format for file storage to provide high performance (e.g. Hortonworks ORC files, Cloudera Parquet)
 
Suggested Best Practices: If performance is required, use optimizations for file formats within the foundation data store. If multiple processing steps are required, use data processing packages that retain intermediate values in memory.
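The batch-versus-direct-access distinction can be illustrated with a toy store in plain Python: a batch query scans every record for throughput, while a direct query reaches one record through a key index for low latency, which is the kind of access the optimized file formats above are designed to serve. Data and names are illustrative.

# A small simulated store of (key, value) records.
records = [("k%d" % i, i * i) for i in range(1000)]

# Batch query: throughput-oriented, touches every record.
def batch_query(store, predicate):
    return [v for k, v in store if predicate(v)]

# Direct access: latency-oriented, one indexed lookup.
index = dict(records)
def direct_query(idx, key):
    return idx[key]

print(len(batch_query(records, lambda v: v % 7 == 0)))  # full scan
print(direct_query(index, "k13"))                       # single lookup: 169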
4. Batch Analytics on Big Data File System using Big Data Parallel Processing
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure6.png
Description: A data processing system augmented by user defined functions can perform batch analytics on data sets stored in the foundation data store.
 
Possible Requirements: High performance data processing is needed for efficient analytics.
 
Gap Analysis: Analytics will often use multiple passes through the data. High performance will require the processing engine to avoid writing intermediate results to files as is done in the original version of MapReduce
 
Suggested Best Practices: If possible, intermediate results of iterations should be kept in memory. Consider moving data to be analyzed into memory or an analytics optimized database.
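As a minimal sketch of batch analytics with a user defined function: a map phase applies the UDF to each input and a reduce phase merges the partial results, with every intermediate kept in memory rather than written to files. The documents are invented.

from collections import Counter
from functools import reduce

docs = ["big data big analytics", "data processing", "big processing"]

# User defined function applied during the map phase.
def tokenize(doc):
    return Counter(doc.split())

# Map phase: one partial result per input, all held in memory.
partials = [tokenize(d) for d in docs]

# Reduce phase: merge the partials without touching the file system.
totals = reduce(lambda a, b: a + b, partials, Counter())
print(totals.most_common(3))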
5. Stream Processing and ETL
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure7.png
Description: Stream processing software can transform, process, and route data to databases and real time analytics
 
Possible Requirements: The stream processing software should be capable of high performance processing of large high velocity data streams.
 
Gap Analysis: Many stream processing solutions are available. In the future, complex analytics will be necessary to enable stream processors to perform accurate filtering and summation of very large data streams.
 
Suggested Best Practices: Parallel processing is necessary for good performance on large data streams.
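The shape of such a stream processor, sketched in plain Python: records are transformed in flight and routed either to a database sink or to a real-time analytics sink. In production this loop would be replicated across many parallel workers; the source and sink names are hypothetical.

def stream():
    # Stand-in for a high-velocity source (sensor feed, clickstream, ...).
    for i in range(10):
        yield {"sensor": i % 2, "value": i * 1.5}

db_sink, analytics_sink = [], []

for event in stream():
    event["value_squared"] = event["value"] ** 2  # transform in flight
    if event["value"] > 9:                        # route hot readings
        analytics_sink.append(event)              # to real-time analytics
    else:
        db_sink.append(event)                     # to the operational database

print(len(db_sink), len(analytics_sink))  # 7 3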
6. Real Time Analytics (e.g. Complex Event Processing)

PersonalBrainstormingDocument2013-07-08Figure6.png

Description: Large high velocity data streams and notifications from in memory operational databases can be analyzed to detect pre-determined patterns, discover new relationships, and provide predictive analytics.
 
Possible Requirements: Efficient algorithms for pattern matching and/or machine learning are necessary.
 
Gap Analysis: There are many solutions available for complex event processing. It would be useful to have standards for describing event patterns to enable portability.
 
Suggested Best Practices: Evaluate commercial packages to determine the best fit for your application.
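A toy example of pre-determined pattern detection: flag any user with three failed logins inside a 60-second window. The event schema is invented; real CEP engines express such patterns declaratively, which is exactly where the portability standards mentioned above would help.

from collections import defaultdict, deque

WINDOW, THRESHOLD = 60, 3
recent = defaultdict(deque)  # user -> timestamps of recent failures

def on_event(user, ts, ok):
    if ok:
        recent[user].clear()       # a success resets the pattern
        return False
    q = recent[user]
    q.append(ts)
    while q and ts - q[0] > WINDOW:
        q.popleft()                # expire events outside the window
    return len(q) >= THRESHOLD     # pattern matched: raise an alert

events = [("alice", 0, False), ("alice", 20, False),
          ("bob", 25, False), ("alice", 45, False)]
print([on_event(*e) for e in events])  # [False, False, False, True]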
7. NoSQL (and NewSQL) DBs as operational databases for large-scale updates and queries
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure7.png
Description: Non-relational databases can be used for high performance on large data volumes (e.g. horizontally scaled). NewSQL databases support horizontal scalability within the relational model.
 
Possible Requirements: It is necessary to decide what level of consistency vs. availability is needed, since the CAP theorem demonstrates that both cannot be fully achieved in horizontally scaled systems.
 
Gap Analysis: The first generation of horizontal scaled databases emphasized availability over consistency. The current trend seems to be toward increasing the role of consistency. In some cases (e.g. Apache Cassandra), it is possible to adjust the balance between consistency and availability.
 
Suggested Best Practices: Horizontally scalable databases are experiencing rapid advances in performance and functionality. Choices should be based on application requirements and evaluation testing. Be very careful about choosing a cutting edge solution that has not been used in applications similar to your use case. SQL (or SQL-like) interfaces will better enable future portability until there are standards for NoSQL interfaces.
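The consistency/availability dial mentioned for Cassandra-style systems is commonly expressed through quorum sizes: with N replicas, read quorum R, and write quorum W, choosing R + W > N guarantees a read overlaps the latest write, while smaller quorums trade freshness for availability. A minimal sketch with invented replica contents:

N = 3
# Each replica maps key -> (version, value); one replica is stale.
replicas = [{"x": (1, "old")}, {"x": (2, "new")}, {"x": (2, "new")}]

def quorum_read(key, R):
    answers = [rep[key] for rep in replicas[:R]]  # consult R replicas
    return max(answers)[1]                        # highest version wins

W = 2  # the last write reached 2 of the 3 replicas
for R in (1, 2):
    fresh = R + W > N  # the quorum-overlap consistency condition
    print(R, quorum_read("x", R), "guaranteed fresh" if fresh else "may be stale")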
8. NoSQL DBs for storing diverse data types
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure8.png
Description: Non-relational databases can store diverse data types (e.g. documents, graphs, heterogeneous rows) that can be retrieved by key or queries.
 
Possible Requirements: The data types to be stored depend on application data usage requirements and query patterns.
 
Gap Analysis: In general, the NoSQL databases are not tuned for analytic applications.
 
Suggested Best Practices: There is a trade off when using non-relational databases. Usually some functionality is given up (e.g. joins, referential integrity) in exchange for some advantages (e.g. higher availability, better performance). Be sure that the trade-off meets application requirements.
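A sketch of what "diverse data types retrieved by key" means in practice: one key-value store holding a document, a graph adjacency list, and a sparse row side by side, with interpretation left entirely to the application. All keys and values are invented.

store = {}

# Heterogeneous values live in the same store with no schema enforcement.
store["user:42"] = {"name": "Ada", "tags": ["admin", "beta"]}  # document
store["graph:42"] = {"follows": ["user:7", "user:9"]}          # graph edges
store["row:42"] = [("col_a", 1), ("col_z", "x")]               # sparse row

def get(key):
    # Retrieval is by key; deciding what the value means is the
    # application's job, not the store's.
    return store[key]

print(get("user:42")["name"])      # Ada
print(get("graph:42")["follows"])  # ['user:7', 'user:9']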
9. Databases optimized for complex ad hoc queries
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure9.png
Description: Interactive ad hoc queries and analytics to specialized databases are key Big Data capabilities
 
Possible Requirements: Analytic databases need high performance on complex queries which require optimizations such as columnar storage, in memory caches, and star schema data models.
 
Gap Analysis: There is a need for embedded analytics and/or specialized databases for complex analytics applications.
 
Suggested Best Practices: Use databases that have been optimized for analytics and/or support embedded analytics. It will often be necessary to move data from operational databases and/or foundation data stores using ETL tools.
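Why columnar storage speeds up these queries: an aggregate over one column reads a single contiguous array instead of every field of every row. A small sketch with made-up data:

rows = [{"id": i, "region": "US", "amount": i * 2.0} for i in range(5)]

# Row-oriented layout: summing one field still visits whole records.
print(sum(r["amount"] for r in rows))  # 20.0

# Column-oriented layout: the same table stored as one array per column.
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
# The aggregate now touches only the column it needs.
print(sum(columns["amount"]))  # 20.0, with far less data scanned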
10. Databases optimized for rapid updates and retrieval (e.g. in memory or SSD)
Example Diagram:
PersonalBrainstormingDocument2013-07-08Figure12.png
Description: Very high performance operational databases are necessary for some large-scale applications.
 
Possible Requirements: Very high performance will often require in memory databases and/or solid state drive (SSD) storage.
 
Gap Analysis: Data retrieval from disk files is extremely slow compared to in-memory, cache, or SSD access. There will be increased need for these faster options as performance requirements increase.
 
Suggested Best Practices: In the future, disk drives will be used for archiving or for non-performance-oriented applications. Evaluate the use of data stores that can reside in memory, caches, or on SSDs.
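A sketch of the memory-versus-disk point: an in-memory cache placed in front of a slow backing store so repeated reads of hot keys never pay the disk latency. The backing store is simulated; names are invented.

from functools import lru_cache
import time

def disk_read(key):
    time.sleep(0.01)  # stand-in for slow spinning-disk latency
    return key.upper()

@lru_cache(maxsize=1024)  # in-memory layer in front of the disk
def cached_read(key):
    return disk_read(key)

cached_read("order:1")  # first access pays the disk cost
cached_read("order:1")  # repeat access is served from memory
print(cached_read.cache_info())  # hits=1, misses=1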

4. Future Directions and Roadmap

In the Big Data Technology Roadmap, the results of the gap analysis should be augmented with a list of future developments that will help close the gaps. Ideally some timelines should be included to aid in project planning. This section lists ongoing improvements mapped to elements of the Reference Architecture, with links for more detail.

5. Security and Privacy - 10 Top Challenges

 
1. Secure computations in distributed programming frameworks
2. Security best practices for non-relational data stores
3. Secure data storage and transactions logs
4. End-point input validation/filtering
5. Real-time security monitoring
6. Scalable and composable privacy-preserving data mining and analytics
7. Cryptographically enforced data centric security
8. Granular access control
9. Granular audits
10. Data provenance

6. Conclusions and General Advice

From Demystifying Big Data by TechAmerica
Recommended Roadmap for Getting Started
Based on these observations, the Big Data Commission recommends the following five-step cyclical approach to successfully take advantage of the Big Data opportunity. These steps are iterative versus serial in nature, with a constant closed feedback loop that informs ongoing efforts. Critical to success is to approach the Big Data initiative from a simple definition of the business and operational imperatives that the government organization seeks to address, and a set of specific business requirements and use cases that each phase of deployment will support. At each phase, the government organization should consistently review progress, the value of the investment, the key lessons learned, and the potential impacts on governance, privacy, and security policies. In this way, the organization can move tactically to address near term business challenges, but operate within the strategic context of building a robust Big Data capability.
Figure 3: Road Map for Getting Started

 

Table 3: Practical Road Map Summary
 
Define: Define the Big Data opportunity, including the key business and mission challenges, the initial use case or set of use cases, and the value Big Data can deliver
• Identify key business challenges, and potential use cases to address
• Identify areas of opportunity where access to Big Data can be used to better serve the citizenry, the mission, or reduce costs
• Ask – does Big Data hold a unique promise to satisfy the use case(s)
• Identify the value of a Big Data investment against more traditional analytic investments, or doing nothing
• Create your overall vision, but chunk the work into tactical phases (time to value within 12-18 month timeframe)
• Don’t attempt to solve all Big Data problems in the initial project – seek to act tactically, but in the strategic context of your key business imperatives
Assess: Assess the organization's currently available data and technical capabilities against the data and technical capabilities required to satisfy the defined set of business requirements and use cases
• Assess the use case across velocity, variety and volume requirements, and determine if they rise to the level of a Big Data initiative, versus a more traditional approach
• Assess the data and data sources required to satisfy the defined use case, versus current availability
• Assess the technical requirements to support accessing, governing, managing and analyzing the data, against current capability
• Leverage the reference architecture defined in the report above to identify key gaps
• Develop an ROI assessment for the current phase of deployment (ROI used conceptually, as the ROI may be better services for customers/citizens and not necessarily a financial ROI)
Plan: Select the most appropriate deployment pattern and entry point, design the "to be" technical architecture, and identify potential policy, privacy and security considerations
• Identify the “entry point” capability as described in the section above
• Identify successful outcomes (success criteria)
• Develop architectural roadmap in support of the selected use case or use cases
• Identify any policy, privacy and security considerations
• Plan iterative phases of deployment
• Develop program management and acquisitions planning
• Identify required skills, resources and staffing
• Plan development, test and deployment platforms (e.g., Cloud, HW)
• If appropriate, Pilot to mitigate business and technical risk
Execute: The gov't agency deploys the current phase Big Data project, maintaining the flexibility to leverage its investment to accommodate subsequent business requirements and use cases
• Deploy the current phase project plan
• Build out the Big Data platform as the plan requires, with an eye toward flexibility and expansion
• Deploy technologies with both the flexibility and performance to scale to support subsequent use cases and corresponding data volume, velocity and variety
Review: The gov't agency continually reviews progress, adjusts the deployment plan as required, and tests business process, policy, governance, privacy and security considerations
• This is a continual process that cuts across the remainder of the roadmap steps
• Throughout the assess and planning stages, continually review plans against set governance, privacy, security policies
• Assess big data objectives against current Federal, state and local policy
• At each stage, assess ROI, and derive lessons learned
• Review deployed architecture and technologies against the needs of the broader organization – both to close any gaps, as well as to identify adjacent business areas that might leverage the developing Big Data capability
• Move toward Big Data Transformation in a continual iterative process

7. References

Demystifying Big Data by TechAmerica
 
Consumer Guide to Big Data from ODCA

Appendix A. Terminology Glossary

The description and links for terms are listed to help in understanding other sections.
ACID
Atomicity, Consistency, Isolation, Durability are a group of properties that together guarantee that database transactions are processed reliably
Analytics
“The discovery and communication of meaningful patterns in data”
Avatarnode
Fault-tolerant extension to Namenode
BASE
Basically Available, Soft state, Eventual consistency semantics
Big Data
“A collection of data set so large and complex that it is difficult to process using on-hand database management tools or traditional data processing applications.”
BSON
Binary coding of JSON
BSP (Bulk Synchronous Parallel)
A programming model for distributed computation that avoids writing intermediate results to files
Business Analytics
“Refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of past business performance to gain insight and drive business planning”
Cache
Intermediate storage between files and memory used to improve performance
(CEP) Complex Event Processing
“Event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances”.
Consistent Hashing
A hashing algorithm that is resilient to dynamic changes
Descriptive Analytics
“The discipline of quantitatively describing the main features of a collection of data.”
Discovery Analytics
Data mining and related processes
ELT (Extract, Load, Transform)
“A process architecture where a bulk of the transformation work occurs after the data has been loaded into the target database in its raw format”
ETL (Extract, Transform Load)
Extracting data from source databases, transforming it, and then loading it into target databases.
In Memory Database
A database that primarily resides in computer main memory.
JSON (Javascript Object Notation)
Hierarchical serialization format derived from Javascript.
MapReduce
A programming model for processing large data sets. It consists of a map phase distributed across resources, followed by a sort phase over intermediate results, and a parallel reduce phase producing a final result.
MPP (Massively Parallel Processing)
“Refers to the use of a large number of processors to perform a set of coordinated computations in parallel”
NewSQL
Big Data databases supporting relational model and SQL
NoSQL
Big Data databases not supporting relational model
OLAP (Online Analytic Processing)
“OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives”
OLTP (Online Transactional Processing)
“A class of information systems that facilitate and manage transaction-oriented applications”
Paxos
A distributed coordination protocol
Predictive Analytics
“Encompasses a variety of techniques that analyze facts to make predictions about future, or otherwise unknown, events”
Prescriptive Analytics
“Automatically synthesizes big data, multiple disciplines of mathematical sciences and computational sciences, and business rules, to make predictions and then suggests decision options to take advantage of the predictions”
Semi-Structured Data
Unstructured data combined with structured data (e.g. metadata)
SSD (Solid State Drive)
“A data storage device using integrated circuit assemblies as memory to store data persistently”
Stream Processing
“Given a set of data (a stream), a series of operations (kernel functions) is applied to each element in the stream”
Structured Data
Schema can be part of the data store or within the application
Unstructured Data
Data stored with no schema and at most implicit structure.
Vector Clocks
An algorithm that generates partial ordering of events in distributed systems
Web Analytics
“The measurement, collection, analysis and reporting of Internet data for purposes of understanding and optimizing web usage.”

Appendix B. Solutions Glossary

Descriptions and links are listed here to provide references for technology capabilities.
Accumulo - (Database, NoSQL, Key-Value) from Apache
Acunu Analytics - (Analytics Tool) on top of Aster Data Platform based on Cassandra
Aerospike - (Database NoSQL Key-Value)
Alteryx - (Analytics Tool)
Ambari - (Hadoop Cluster Management) from Apache
Analytica - (Analytics Tool) from Lumina
ArangoDB - (Database, NoSQL, Multi-model) Open source from Europe
Aster - (Analytics) Combines SQL and Hadoop on top of Aster MPP Database
Avro - (Data Serialization) from Apache
Azkaban - (Process Scheduler) for Hadoop
Azure Table Storage - (Database, NoSQL, Columnar) from Microsoft
BigData Appliance - (Integrated Hardware and Software Architecture) from Oracle includes Cloudera, Oracle NoSQL, Oracle R, and Sun Servers
BigML - (Analytics tool) for business end-users
BigQuery - (Query Tool) on top of Google Storage
BigTable - (Database, NoSQL, Column oriented) from Google
Caffeine - (Search Engine) from Google use BigTable Indexing
Cascading - (Processing) SQL on top of Hadoop from Apache
Cascalog - (Query Tool) on top of Hadoop
Cassandra - (Database, NoSQL, Column oriented)
Chukwa - (Monitoring Hadoop Clusters) from Apache
Clojure - (Lisp-based Programming Language) compiles to JVM byte code
Cloudant - (Distributed Database as a Service)
Cloudera - (Hadoop Distribution) including real-time queries
Clustrix - (NewSQL DB) runs on AWS
Coherence - (Data Grid/Caching) from Oracle
Colossus - (File System) Next Generation Google File System
Continuity - (Data fabric layer) Interfaces to Hadoop Processing and data stores
Corona - (Hadoop Processing tool) used internally by Facebook and now open sourced
Couchbase - (Database, NoSQL, Document) with CouchDB and Membase capabilities
CouchDB - (Database, NoSQL, Document)
Data Tamer - (Data integration and curation tools) from MIT
Datameer - (Analytics) built on top of Hadoop
Datastax - (Integration) Built on Cassandra, Solr, Hadoop
Dremel - (Query Tool) interactive for columnar DBs from Google
Drill - (Query Tool) interactive for columnar DBs from Apache
Dynamo DB - (Database, NoSQL, Key-Value)
Elastic MapReduce - (Cloud-based MapReduce) from Amazon
ElasticSearch - (Search Engine) on top of Apache Lucerne
Enterprise Control Language (ECL) - (Data Processing Language) from HPPC
Erasure Codes - (Alternate to file replication for availability) Replicates fragments.
eXtreme Scale - (Data Grid/Caching) from IBM
F1 - (Combines relational and Hadoop processing) from Google built on Google Spanner
Falcon - (Data processing and management platform) from Apache
Flume - (Data Collection, Aggregation, Movement)
FlumeJava - (Java Library) Supports development and running data parallel pipelines
Fusion-io - (SSD Storage Platform) can be used with HBase
GemFire - (Data Grid/Caching) from VMware
Gensonix - (NoSQL database) from Scientel
Gephi - (Visualization Tool) for Graphs
Gigaspaces - (Data Grid/Caching)
Giraph - (Graph Processing) from Apache
Google Refine - (Data Cleansing)
Google Storage - (Database, NoSQL, Key-Value)
Graphbase - (Database, NoSQL, Graphical)
Greenplum - (MPP Database, Analytic Tools, Hadoop)
HBase - (Database, NoSQL, Column oriented)
Hadapt - (Combined SQL Layer and Hadoop)
Hadoop Distributed File System - (Distributed File System) from Apache
Hama - (Processing Framework) Uses BSP model on top of HDFS
Hana - (Database, NewSQL) from SAP
Haven - (Analytics Package) from HP
HAWQ - (SQL Interface to Hadoop) from Greenplum and Pivotal
HCatalog - (Table and Storage Management) for Hadoop data
HDF5 - (A data model, library, and file format for storing/managing large complex data)
High Performance Computing Cluster (HPCC) - (Big Data Processing Platform)
Hive - (Data warehouse structure on top of Hadoop)
HiveQL - (SQL-like interface on Hadoop File System)
Hortonworks - (Extensions of Hadoop)
HStreaming - (Real time analytics on top of Hadoop)
Hue - (Open source UI for Hadoop) from Cloudera
Hypertable - (Database, NoSQL, Key-Value) open source runs on multiple file systems
Impala - (Ad hoc query capability for Hadoop) from Cloudera
InfiniDB - (Scale-up analytic database)
Infochimps - (Big Data Storage and Analytics in the Cloud)
Infosphere Big Insights - (Analytic) from IBM
InnoDB - (Default storage engine for MYSQL)
Jaql - (Query Language for Hadoop) from IBM
Kafka - (Publish-and-subscribe for data) from Apache
Karmasphere - (Analytics)
Knox - (Secure gateway to Hadoop) from Apache
Knowledge Graph - (Graphical data store) from Google
Lucidworks - (Search built on Solr/Lucene) and an associated Big Data Platform
Mahout - (Machine Learning Toolkit) built on Apache Hadoop
MapD - (Massively Parallel Database) Open source, runs on GPUs
MapReduce - (Processing algorithm); a minimal word-count example appears after this list
MapR - (MapReduce extensions) built on NFS
MarkLogic - (Database, NoSQL, Document) interfaced with Hadoop
Memcached - (Data Grid/Caching)
MemSQL - (In memory analytics database)
MongoDB - (Database, NoSQL, Document) from 10gen
mrjob - (Workflow) for Hadoop from Yelp
MRQL - (Query Language) supports Map-Reduce and BSP processing
Muppet - (Stream Processing) MapUpdate implementation
MySQL - (Database, Relational)
NameNode - (Metadata/directory service for the Hadoop Distributed File System)
Neo4j - (Database, NoSQL, Graphical)
Netezza - (Database Appliance)
NuoDB - (MPP Database)
Oozie - (Workflow Scheduler for Hadoop) from Apache
Oracle NoSQL - (Database, Key-Value)
ORC (Optimized Row Columnar) Files - File Format for Hive data in HDFS
Parquet - (Columnar file format for Hadoop) from Cloudera
Pentaho - (Analytic tools)
Percolator - (Incremental Data Processing) from Google
Pig - (Procedural framework on top of Hadoop)
Pig Latin - (Interface language for Pig procedures)
Pivotal - (New company utilizing VMware and EMC technologies)
Platfora - (In memory caching for BI on top of Hadoop)
Postgres - (Database Relational)
Precog - (Analytics Tool) for JSON data
Pregel - (Graph Processing) used by Google
Presto - (SQL Query for HDFS) from Facebook
Protocol Buffers - (Serialization) from Google
Protovis - (Visualization)
PureData - (Database Products) from IBM
Rainstor - (Combines Hadoop and Relational Processing)
RCFile (Record Columnar File) - File format optimized for HDFS data warehouses
Redis - (Database, NoSQL, Key-Value)
Redshift - (Database Relational) Amazon
Resilient Distributed Datasets - (Fault-tolerant in-memory data sharing)
Riak - (Database, NoSQL, Key-Value with built-in MapReduce) from Basho
Roxie - (Query processing cluster) from HPCC
RushAnalytics - (Analytics) from Pervasive
S4 - (Stream Processing)
Sawzall - (Query Language for Map-Reduce) from Google
Scala - (Programming Language) Combines functional and imperative programming
Scalebase - (Scalable Front-end to distributed Relational Databases)
SciDB - (Database, NoSQL, Arrays)
scikit-learn - (Machine Learning Toolkit) in Python; a short example appears after this list
Scribe - (Server for Aggregating Log Data) originally from Facebook; may be inactive
SequenceFiles - (File format) Binary key-value pairs
Shark - (Complex Analytics Platform) on top of Spark
Simba - (ODBC SQL Driver for Hive)
SimpleDB - (Database, NoSQL, Document) from Amazon
Skytree - (Analytics Server)
Solr/Lucene - (Search Engine) from Apache
Spanner - (Database, NewSQL) from Google
Spark - (In-memory cluster computing system)
Splunk - (Machine Data Analytics)
Spotfire - (Visualization and Analytics Tool) from TIBCO
Spring Data - (Data access tool for Hadoop and NoSQL) in Spring Framework
SQLite - (Software library supporting server-less relational database)
SQLstream - (Streaming data analysis products)
Sqoop - (Data movement) from Apache
Sqrrl - (Security and Analytics on top of Apache Accumulo)
Stinger - (Optimized Hive Queries) from Hortonworks
Sumo Logic - (Log Analytics)
Tableau - (Visualization)
Tachyon - (File system) from Berkeley
Talend - (Data Integration Product)
TempoDB - (Database, NoSQL, Time Series)
Teradata Active EDW - (Database, Relational)
Terracotta - (In memory data management)
Terraswarm - (Data Acquisition) Sensor Integration
Thor - (Filesystem and Processing Cluster) from HPCC Systems
Thrift - (Framework for cross-language development)
Tinkerpop - (Graph Database and Toolkit)
Vertica - (Database Relational)
Voldemort - (Database, NoSQL, Key-Value)
VoltDB - (Database NewSQL)
Watson - (Analytic Framework) from IBM
WebHDFS - (REST API for Hadoop)
WEKA - (Machine Learning Toolkit) in Java
Wibidata - (Components for building Big Data applications)
YarcData - (Graph Analytics for Big Data)
YARN - (Next-generation Hadoop resource management, MapReduce 2.0) from Apache
Yottamine - (Machine Learning Toolkit) Cloud-based
Zettaset Orchestrator - (Management and Security for Hadoop)
ZooKeeper - (Distributed Computing Management)
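
To make the erasure-coding idea above concrete, here is a minimal sketch in Python of single-parity encoding, the principle behind RAID-5 and, with stronger codes such as Reed-Solomon, behind erasure coding in distributed file systems. The function names are illustrative, not from any particular product.

    # Single-parity erasure coding sketch: any one lost fragment can be
    # rebuilt by XOR-ing the surviving fragments with the parity fragment.
    # Real systems use stronger codes (e.g., Reed-Solomon) to survive
    # multiple losses; this only illustrates the idea.

    def xor_bytes(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def make_parity(fragments):
        """Compute one parity fragment over equal-length data fragments."""
        parity = fragments[0]
        for frag in fragments[1:]:
            parity = xor_bytes(parity, frag)
        return parity

    def rebuild_missing(survivors, parity):
        """Recover the single missing fragment from survivors plus parity."""
        missing = parity
        for frag in survivors:
            missing = xor_bytes(missing, frag)
        return missing

    fragments = [b"aaaa", b"bbbb", b"cccc"]
    parity = make_parity(fragments)
    # Simulate losing fragment 1 and recovering it:
    assert rebuild_missing([fragments[0], fragments[2]], parity) == b"bbbb"

Storing four fragments (three data plus one parity) survives any single loss at 1.33x storage overhead, versus 3x for triple replication.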
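
The word-count sketch promised at the MapReduce entry, written with the mrjob library (also listed above), which runs the same job locally or on a Hadoop cluster. This is a minimal illustration of the map and reduce phases, not a tuned production job.

    # wordcount.py - canonical MapReduce word count using mrjob.
    # Run locally:  python wordcount.py input.txt
    # On Hadoop:    python wordcount.py -r hadoop hdfs:///path/to/input
    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Map phase: emit (word, 1) for each word in the line.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Reduce phase: sum all counts emitted for the same word.
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()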
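
And the scikit-learn example referenced above: a few lines showing the toolkit's fit/score pattern on its bundled iris dataset. The classifier choice here is arbitrary; any scikit-learn estimator exposes the same interface.

    # Minimal scikit-learn usage: train a classifier and score it on
    # held-out data using the library's uniform fit/predict/score API.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print("Held-out accuracy:", clf.score(X_test, y_test))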

Appendix C. Use Case Examples

 
“While there are extensive industry-specific use cases, here are some for handy reference:
EDW Use Cases
• Augment EDW by offloading processing and storage
• Support as preprocessing hub before getting to EDW
Retail/Consumer Use Cases
• Merchandizing and market basket analysis
• Campaign management and customer loyalty programs
• Supply-chain management and analytics
• Event- and behavior-based targeting
• Market and consumer segmentations
Financial Services Use Cases
• Compliance and regulatory reporting
• Risk analysis and management
• Fraud detection and security analytics
• CRM and customer loyalty programs
• Credit risk, scoring and analysis
• High speed arbitrage trading
• Trade surveillance
• Abnormal trading pattern analysis
Web & Digital Media Services Use Cases
• Large-scale clickstream analytics
• Ad targeting, analysis, forecasting and optimization
• Abuse and click-fraud prevention
• Social graph analysis and profile segmentation
• Campaign management and loyalty programs
Health & Life Sciences Use Cases
• Clinical trials data analysis
• Disease pattern analysis
• Campaign and sales program optimization
• Patient care quality and program analysis
• Medical device and pharma supply-chain management
• Drug discovery and development analysis
Telecommunications Use Cases
• Revenue assurance and price optimization
• Customer churn prevention
• Campaign management and customer loyalty
• Call detail record (CDR) analysis
• Network performance and optimization
• Mobile user location analysis
Government Use Cases
• Fraud detection
• Cybersecurity
• Compliance and regulatory analysis
http://www.whitehouse.gov/sites/defa...crosites/ostp/big_data_press_release_final_2.pdf
New Application Use Cases
• Online dating
• Social gaming
Fraud Use-Cases
• Credit and debit payment card fraud
• Technical fraud and bad debt
• Medicaid and Medicare fraud
• Property and casualty (P&C) insurance fraud
• Workers’ compensation fraud
E-Commerce and Customer Service Use-Cases
• Cross-channel analytics
• Event analytics
• Recommendation engines using predictive analytics
• Right offer at the right time
• Next best offer or next best action”
 
My Note: The source quoted above discusses some Big Data use case examples.
 
Case Studies from TechAmerica
My Note: See below, from: http://semanticommunity.info/AOL_Government/BIG_DATA_at_the_Hill#Big_Data_Case_Studies
Big Data Case Studies
The Commission has compiled a set of 10 case studies detailing the business or mission challenge faced, the initial Big Data use case, early steps the agency took to address the challenge and support the use case, and the business results. Although the full text of these case studies will be posted at the TechAmerica Foundation Website, some are summarized below.

Table 2: Case Studies High Level Summary


(Each case study lists the agency/organization/company, Big Data project name, underpinning technologies, Big Data metrics, initial Big Data entry point, and public/user benefits.)

Agency: National Archives and Records Administration (NARA)
Project: Electronic Records Archive
Technologies: Metadata, Submission, Access, Repository, Search, and Taxonomy applications for storage and archival systems
Metrics: Petabytes; Terabytes/sec; Semi-structured
Entry Point: Warehouse Optimization, Distributed Info Mgt
Benefits: Provides the Electronic Records Archive and Online Public Access systems for US records and documentary heritage

Agency: TerraEchos
Project: Perimeter Intrusion Detection
Technologies: Streams analytic software, predictive analytics
Metrics: Terabytes/sec
Entry Point: Streaming and Data Analytics
Benefits: Helps organizations protect and monitor critical infrastructure and secure borders

Agency: Royal Institute of Technology of Sweden (KTH)
Project: Traffic Pattern Analysis
Technologies: Streams analytic software, predictive analytics
Metrics: Gigabits/sec
Entry Point: Streaming and Data Analytics
Benefits: Improves traffic in metropolitan areas by decreasing congestion and reducing traffic accident injury rates

Agency: Vestas Wind Energy
Project: Wind Turbine Placement and Maintenance
Technologies: Apache Hadoop
Metrics: Petabytes
Entry Point: Streaming and Data Analytics
Benefits: Pinpoints the optimal location for wind turbines to maximize power generation and reduce energy cost

Agency: University of Ontario Institute of Technology (UOIT)
Project: Medical Monitoring
Technologies: Streams analytic software, predictive analytics, supporting relational database
Metrics: Petabytes
Entry Point: Streaming and Data Analytics
Benefits: Detects infections in premature infants up to 24 hours before they exhibit symptoms

Agency: National Aeronautics and Space Administration (NASA)
Project: Human Space Flight Imagery
Technologies: Metadata, Archival, Search, and Taxonomy applications for tape library systems; GOTS
Metrics: Petabytes; Terabytes/sec; Semi-structured
Entry Point: Warehouse Optimization
Benefits: Provides industry and the public with some of the most iconic and historic human spaceflight imagery for scientific discovery, education, and entertainment

Agency: AM Biotechnologies (AM Biotech)
Project: DNA Sequence Analysis for Creating Aptamers
Technologies: Cloud-based HPC genomic applications and transportable data files
Metrics: Gigabytes; 10^7 DNA sequences compared
Entry Point: Streaming Data & Analytics; Warehouse Optimization; Distributed Info Mgt
Benefits: Creation of unique aptamer compounds to develop improved therapeutics for many medical conditions and diseases

Agency: National Oceanic and Atmospheric Administration (NOAA)
Project: National Weather Service
Technologies: HPC modeling; data from satellites, ships, aircraft, and deployed sensors
Metrics: Petabytes; Terabytes/sec; Semi-structured; ExaFLOPS; PetaFLOPS
Entry Point: Streaming Data & Analytics; Warehouse Optimization; Distributed Info Mgt
Benefits: Provides weather, water, and climate data, forecasts, and warnings for the protection of life and property and enhancement of the national economy

Agency: Internal Revenue Service (IRS)
Project: Compliance Data Warehouse
Technologies: Columnar database architecture; multiple analytics applications; descriptive, exploratory, and predictive analysis
Metrics: Petabytes
Entry Point: Streaming Data & Analytics; Warehouse Optimization; Distributed Info Mgt
Benefits: Provides America's taxpayers top-quality service by helping them understand and meet their tax responsibilities and enforces the law with integrity and fairness to all

Agency: Centers for Medicare & Medicaid Services (CMS)
Project: Medical Records Analytics
Technologies: Columnar and NoSQL databases; Hadoop under evaluation; EHR on the front end, with legacy structured database systems (including DB2 and COBOL)
Metrics: Petabytes; Terabytes/day
Entry Point: Streaming Data & Analytics; Warehouse Optimization; Distributed Info Mgt
Benefits: Protects the health of all Americans and ensures compliant processing of insurance claims
 
 

Appendix D. Actors and Roles

 
The job roles below are mapped to elements of the Reference Architecture; the mapping appears in parentheses at the end of each item (originally highlighted in red). “Here are 7 new types of jobs being created by Big Data:
 
1. Data scientists: This emerging role is taking the lead in processing raw data and determining what types of analysis would deliver the best results. Typical backgrounds, as cited by Harbert, include math and statistics, as well as artificial intelligence and natural language processing. (Analytics)
 
2. Data architects: Organizations managing Big Data need professionals who will be able to build a data model, and plan out a roadmap of how and when various data sources and analytical tools will come online, and how they will all fit together. (Design, Develop, Deploy Tools)
 
3. Data visualizers: These days, a lot of decision-makers rely on information that is presented to them in a highly visual format — either on dashboards with colorful alerts and “dials,” or in quick-to-understand charts and graphs. Organizations need professionals who can “harness the data and put it in context, in layman’s language, exploring what the data means and how it will impact the company,” says Harbert. (Applications)
 
4. Data change agents: Every forward-thinking organization needs “change agents” — usually an informal role — who can evangelize and marshal the necessary resources for new innovation and ways of doing business. Harbert predicts that “data change agents” may be more of a formal job title in the years to come, driving “changes in internal operations and processes based on data analytics.” They need to be good communicators, and a Six Sigma background — meaning they know how to apply statistics to improve quality on a continuous basis — also helps. (Not applicable to Reference Architecture)
 
5. Data engineer/operators: These are the people that make the Big Data infrastructure hum on a day-to-day basis. “They develop the architecture that helps analyze and supply data in the way the business needs, and make sure systems are performing smoothly,” says Harbert. (Data Processing and Data Stores)
 
6. Data stewards: Not mentioned in Harbert’s list, but essential to any analytics-driven organization, is the emerging role of data steward. Every bit and byte of data across the enterprise should be owned by someone — ideally, a line of business. Data stewards ensure that data sources are properly accounted for, and may also maintain a centralized repository as part of a Master Data Management approach, in which there is one “gold copy” of enterprise data to be referenced. (Data Governance)
 
7. Data virtualization/cloud specialists: Databases themselves are no longer as unique as they used to be. What matters now is the ability to build and maintain a virtualized data service layer that can draw data from any source and make it available across organizations in a consistent, easy-to-access manner. Sometimes this is called “Database-as-a-Service.” No matter what it’s called, organizations need professionals who can build and support these virtualized layers or clouds.” (Infrastructure)
 

NIST Special Publication 500-291 - NIST Cloud Computing Standards Roadmap - July 2011

Source: http://bigdatawg.nist.gov/_uploadfil...7425925966.pdf

My Note:

NIST Special Publication 500-292 - NIST Cloud Computing Reference Architecture - September 2011

Source: http://bigdatawg.nist.gov/_uploadfil...7256814129.pdf

My Note:

NIST_Security_Reference_Architecture_2013.05.15_v1

Source: http://bigdatawg.nist.gov/_uploadfil...3376532289.pdf

My Note:

SP800-145 Cloud Definition

Source: http://bigdatawg.nist.gov/_uploadfil...3333767255.pdf

My Note:

NBD-WD July 3rd Meeting Agenda

Source: http://bigdatawg.nist.gov/_uploadfil...1436928078.pdf

My Note:

NIST Big Data Working Kick-off Meeting Agenda (w/ minor update)

Source: http://bigdatawg.nist.gov/_uploadfil...039923755.docx

My Note: See above

Big Data Definitions v1

Source: http://bigdatawg.nist.gov/_uploadfil...301148665.docx

My Note: See above

NIST Big Data Working Kick-off Meeting Agenda

Source: http://bigdatawg.nist.gov/_uploadfil...909880210.docx

My Note: See above

NIST Big Data Working Group Charter - draft

Source: http://bigdatawg.nist.gov/_uploadfil...837324457.docx

My Note: See above
