Table of contents
  1. Story
  2. Slides
    1. Panel on Big Data Challenges and Opportunities
    2. NSF’s Perspective and Role
    3. NSF: Research to Infrastructure Continuum
    4. Big Data in the CISE Context
    5. 2013: Core Techniques and Technologies for Advancing Big Data Science & Engineering
    6. 2014: Critical Techniques and Technologies for Advancing Big Data Science and Engineering
    7. Data Infrastructure Building Blocks
    8. NSF Research Traineeship (NRT)
    9. Initiatives underway…need your inputs
    10. CISE RFI: Big Data Regional Innovation Hubs
    11. Big Data Hubs examples…
    12. RFI National Big Data R&D Initiative
    13. Increase the Nation’s Data Science Capacity – An NSF Priority Goal
    14. Thank You!
  3. Accelerating the Big Data Innovation Ecosystem
  4. Spotfire Dashboard
  5. Research Notes
    1. Definition: NAICS 31-33, Manufacturing
      1. Guide to Data Sources for Manufacturing from the U.S. Census Bureau
    2. Manufacturers’ Shipments, Inventories, & Orders
      1. Overview
      2. 2014-2015 Release Schedule
      3. Annual Survey of Manufactures: Geographic Area Statistics: Statistics for All Manufacturing by State: 2011 REFRESH
    3. Peter Morosoff
  6. The 2014 IEEE International Conference on Big Data
  7. Main Page
  8. Main Conference
    1. Cover Page
    2. Copyright Information & Catalog Numbers
    3. Preface
    4. Organization Team Members
      1. Big Data Steering Committee
      2. Conference Co-Chairs
      3. Program Co-Chairs
      4. Industry and Government Program Co-Chairs
      5. Workshop Co-Chairs
      6. Poster Co-Chairs
      7. Publicity Co-Chairs
      8. Proceedings Co-Chairs
      9. Sponsorship Co-Chairs
      10. Tutorial Co-Chairs
      11. Registration Co-Chairs
      12. Student Travel Award Co-Chairs
    5. Program Committee List
      1. Senior Program Committee Members
      2. Program Committee Members
      3. Industry & Government Program Committee Members
    6. Keynote Speeches
      1. Never-Ending Language Learning
        1. Abstract 
        2. Biography 
        3. Never Ending Language Learning
        4. Rationale
        5. NELL: Never-Ending Language Learner
        6. NELL Today
        7. Summary
        8. Thank You
      2. Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities
        1. Abstract 
        2. Biography 
        3. Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities
        4. Thanks: My Team
        5. I Will Use Applications in 3 Domains to Demonstrate
        6. My 2004-2005 Formulation of SMART DATA-Semagix
        7. In Process, Engaging Both Top and Bottom Brain
        8. W3C Semantic Sensor Network Ontology
        9. Personalized Digital Health Research and Applications at Kno.e.sis
        10. Semantic Annotation Using Background Knowledge
        11. Matching Requests With Offers
        12. Wrapping Up: Two Excellent Videos
        13. Wrapping Up: Take Away 1
        14. Wrapping Up: Take Away 2
        15. Special Thanks
        16. Kno.e.sis
      3. Addressing Human Bottlenecks in Big Data
        1. Abstract 
        2. Biography
        3. Addressing Human Bottlenecks in Big Data
        4. Outline
        5. I Can Give You Power
        6. Data Distributed Programming
        7. Where Does Time Go?
        8. Wrangler 2011: Add Intelligence
        9. Technical Philosophy
        10. Takeaways
    7. Sessions
      1. 1. Big Data Foundations
        1. Regular Papers
          1. BASIC: an Alternative to BASE for Large-Scale Data Management System
          2. BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured BigData
        2. Short Papers
          1. Scaling up M-estimation via sampling designs: the Horvitz-Thompson stochastic gradient descent
          2. Metadata Capital: Simulating the Predictive Value of Self-Generated Health Information (SGHI)
          3. Representative Subsets For Big Data Learning using k-NN graphs
          4. Towards Building and Evaluating a Personalized Location-Based Recommender System
          5. On the Performance of MapReduce: A Stochastic Approach
          6. PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
      2. 2. Big Data Infrastructure
        1. Regular Papers
          1. FusionFS: Toward Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems
          2. BurstMem: A High-Performance Burst Buffer System for Scientific Applications
          3. Partial Rollback-based Scheduling on In-memory Transactional Data Grids
          4. Detecting and Identifying System Changes in the Cloud via Discovery by Example
          5. PigOut: Making Multiple Hadoop Clusters Work Together
          6. Parallel Breadth First Search on GPU Clusters
          7. Optimizing Load Balancing and Data-Locality with Data-aware Scheduling
        2. Short Papers
          1. Online Temporal-Spatial Analysis for Detection of Critical Events in Cyber-Physical Systems
          2. A Cross-job Framework for MapReduce Scheduling
          3. Scheduling MapReduce Tasks on Virtual MapReduce Clusters from a Tenant’s Perspective
          4. FlexDAS: A Flexible Direct Attached Storage for I/O Intensive Applications
          5. A Two-Sided Market Mechanism for Trading Big Data Computing Commodities
          6. MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping
          7. Large-Scale Network Traffic Monitoring with DBStream, a System for Rolling Big Data Analysis
          8. Synthetic Data Generation for the Internet of Things
          9. Evaluating the Performance and Scalability of the Ceph Distributed Storage System
          10. Incremental Window Aggregates over Array Database
          11. BigCache for Big-data Systems
          12. Automated Workload-aware Elasticity of NoSQL Clusters in the Cloud
          13. Distributed Class Dependent Feature Analysis - A Big Data Approach
          14. VENU: Orchestrating SSDs in Hadoop Storage
          15. In-Memory I/O and Replication for HDFS with Memcached: Early Experiences
          16. Enabling Composite Applications through an Asynchronous Shared Memory Interface
          17. k-balanced sorting and skew join in MPI and MapReduce
      3. 3. Big Data Management
        1. Regular Papers
          1. Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems
          2. Examination of Data, Rule Generation and Detection of Phishing URLs using Online Logistic Regression
          3. Main Memory Evaluation of Recursive Queries on Multicore Machines
          4. Predicting Glaucoma Progression using Multi-task Learning with Heterogeneous Features
          5. Provenance-Based Object Storage Prediction Scheme for Scientific Big Data Applications
          6. Synergistic Partitioning in Multiple Large Scale Social Networks
          7. TRISTAN: Real-Time Analytics on Massive Time Series Using Sparse Dictionary Compression
          8. Performance Modeling in CUDA Streams - A Means for High-Throughput Data Processing
        2. Short Papers
          1. Minimizing Data Movement through Query Transformation
          2. Multilevel Partitioning of Large Unstructured Grids
          3. Low Complexity Sensing for Big Spatio-Temporal Data
          4. In-advance Data Analytics for Reducing Time to Discovery
      4. 4. Big Data Search and Mining
        1. Regular Papers
          1. Learning to Estimate Pairwise Distances in Large Graphs
          2. Distributed Adaptive Model Rules for Mining Big Data Streams
          3. Sparse computation for large-scale data mining
          4. Topic Similarity Networks: Visual Analytics for Large Document Sets
          5. Efficient Breadth-First Search on a Heterogeneous Processor
          6. Web-based Visual Analytics for Extreme Scale Climate Science
          7. Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization
          8. Meeting Predictable Buffer Limits in the Parallel Execution of Event Processing Operators
          9. Metadata Extraction and Correction for Large-Scale Traffic Surveillance Videos
          10. Facilitating Twitter Data Analytics: Platform, Language and Functionality
          11. Visual Fusion of Mega-City Big Data: An Application to Traffic and Tweets Data Analysis of Metro Passengers
          12. GRAPHiQL: A Graph Intuitive Query Language for Relational Databases
          13. Evaluating Density-based Motion for Big Data Visual Analytics
          14. Regression Trees for Streaming Data with Local Performance Guarantees
          15. Distributed Algorithms for k-truss Decomposition
          16. PULP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks
          17. Effective Caching Techniques for Accelerating Pattern Matching Queries
          18. Clique Guided Community Detection
          19. Large-scale Distributed Sorting for GPU-based Heterogeneous Supercomputers
          20. Large-scale Logistic Regression and Linear Support Vector Machines Using Spark
          21. NVM-based Hybrid BFS with Memory Efficient Data Structure
          22. Identification of SNP Interactions Using Data-Parallel Primitives on GPUs
        2. Short Papers
          1. Random Walks on Adjacency Graphs for Mining Lexical Relations from Big Text Data
          2. Entity Resolution Using Inferred Relationships and Behavior
          3. Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability
          4. In-Situ Visualization and Computational Steering for Large-Scale Simulation of Turbulent Flows in Complex Geometries
          5. Building k-nn graphs from large text data
          6. Learning to Predict Subject-Line Opens for Large-Scale Email Marketing
          7. MAGE: Matching Approximate Patterns in Richly-Attributed Graphs
          8. Bootstrapping K-means for Big data analysis
          9. Distributed Adaptive Importance Sampling on Graphical Models Using MapReduce
          10. Knowledge-based Clustering of Ship Trajectories Using Density-based Approach
          11. Immersive and collaborative data visualization using virtual reality platforms
          12. The Adaptive Projection Forest: Using Adjustable Exclusion and Parallelism in Metric Space Indexes
          13. Scaling Up Prioritized Grammar Enumeration for Scientific Discovery in the Cloud
      5. 5. Big Data Security & Privacy
        1. Regular Papers
          1. MR-TRIAGE: Scalable Multi-Criteria Clustering for Big Data Security Intelligence Applications
          2. Increasing the Veracity of Event Detection on Social Media Networks Through User Trust Modeling
        2. Short Papers
          1. Empowering users of social networks to assess their privacy risks
          2. A Unified Approach to Network Anomaly Detection
      6. 6. Big Data Applications
        1. Regular Papers
          1. E-Sketch: Gathering Large-scale Energy Consumption Data Based on Consumption Patterns
          2. Hierarchical Management of Large-Scale Malware Data
          3. Random Projection Based Clustering for Population Genomics
          4. Structure Recognition from High Resolution Images of Ceramic Composites
          5. Combining Hadoop and GPU to Preprocess Large Affymetrix Microarray Data
          6. Content-Based Access Control: Use Data Content to Assist Access Control for Large-Scale Content-Centric Databases
          7. Locating Visual Storm Signatures from Satellite Images
          8. Accurate and Efficient Selection of the Best Consumption Prediction Method in Smart Grids
        2. Short Papers
          1. Visualizations for Sense-Making in Financial Market Regulation
          2. Big Automotive Data - Leveraging large volumes of data for knowledge-driven product development
          3. On the Impact of Socio-economic Factors on Power Load Forecasting
          4. Toward Personalized and Scalable Voice-Enabled Services Powered by Big Data
          5. MaPLE: A MapReduce Pipeline for Lattice-based Evaluation and Its Application to SNOMED CT
          6. Dynamic Pre-training of Deep Recurrent Neural Networks for Predicting Environmental Monitoring Data
          7. Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters
          8. Department of Energy Strategic Roadmap for Earth System Science Data Integration
          9. Analyzing the Language of Food on Social Media
          10. Using Geometric Structures to Improve the Error Correction Algorithm of High-Throughput Sequencing Data on MapReduce Framework
          11. Empowering Personalized Medicine with Big Data and Semantic Web Technology: Promises, Challenges, and Use Cases
          12. On Scaling Time Dependent Shortest Path Computations for Dynamic Traffic Assignment
          13. High Volume Geospatial Mapping for Internet-of-Vehicle Solutions with In-Memory Map-Reduce Processing
      7. 7. Industry and Government Papers
        1. Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon
        2. Spatial Computations over Terabyte-Sized Images on Hadoop Platforms
        3. Recall Estimation for Rare Topic Retrieval from Large Corpuses
        4. Lightweight Approximate Top-k for Distributed Settings
        5. Query Revision During Cluster Based Search on Large Unstructured Corpora
        6. Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop
        7. Automating Data Integration with HiperFuse
        8. Recommending Similar Items in Large-scale Online Marketplaces
        9. SE-CDA: A Scalable and Efficient Community Detection Algorithm
        10. Increasing the Accessibility to Big Data Systems via a Common Services API
        11. Big Data Predictive Analytics for Proactive Semiconductor Equipment Maintenance
        12. Future Directions of Humans in Big Data Research
        13. ALOJA: a Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness
        14. Heterogeneous Stream Processing for Disaster Detection and Alarming
        15. Identifying top Chinese network buzzwords from social media big data set based on time-distribution features
        16. Bridging High Velocity and High Volume Industrial Big Data Through Distributed In-Memory Storage & Analytics
        17. Explore Efficient Data Organization for Large Scale Graph Analytics and Storage
        18. An Initial Study of Predictive Machine Learning Analytics on Large Volumes of Historical Data for Power System Applications
        19. In Unity There is Strength: Showcasing a Unified Big Data Platform with MapReduce Over both Object and File Storage
    8. Author Index
  9. Panel with Program Directors: Big Data Challenges and Opportunities
    1. Dr. Chaitanya Baru (NSF)
    2. Dr. Yuan Liu (NIH)
      1. Big Data NIH Funding Opportunities
      2. BD2K Funded Phase I
      3. BISTI Funding Opportunities
      4. Seven High Priority Research Areas
    3. Dr. David Kuehn (DoT)
      1. Data Science Challenges and Opportunities in Highway Transportation
      2. Presentation Outline
      3. Program Status
      4. Connected Highway Systems
      5. Opportunities
      6. Thank You
    4. Dr. Tsengdar Lee (NASA)
      1. NASA's Big Data Challenges in Climate Science
      2. Projected Data Holdings
      3. Turning Observations into Knowledge and Decision Products
      4. Future Directions and Challenges
      5. Research Opportunities
      6. Thank You!
    5. Dr. Sudarsan Rachuri (NIST)
      1. From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems
      2. Outline
      3. Smart Manufacturing Systems Design and Analysis
      4. Manufacturing Big Data Analytics
    6. Mr. Matti Vakkuri (DIGILE)
      1. Data to Intelligence Research Program
      2. Data to Intelligence Program Implementation Facts
      3. Data to Intelligence-Main Results Obtained
      4. Oulu Traffic Pilot
      5. Forest Big Data
      6. Remote and Highly Automated Services
  10. Tutorials
    1. Tutorial 1: Big Data Stream Mining
      1. Presenters
      2. Summary
      3. Content
      4. Short Profiles
    2. Tutorial 2: Big ML Software for Modern ML Algorithms
      1. Presenters
      2. Summary
      3. Content
      4. Petuum
      5. Spark
      6. GraphLab
    3. Tutorial 3: Large-scale Heterogeneous Learning in Big Data Analytics
      1. Presenter
      2. Summary
      3. Content
      4. Short Profile
    4. Tutorial 4: Big Data Benchmarking
      1. Presenters
      2. Summary
      3. Content
      4. Short Profile
  11. Special Session
    1. Special Session 1: From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems
      1. Overview
      2. Organizers
      3. Papers
        1. Toward Smart Manufacturing Using Decision Analytics
        2. An Intelligent Machine Monitoring System Using Gaussian Process Regression for Energy Prediction
        3. Towards a Domain-Specific Framework for Predictive Analytics in Manufacturing
        4. Uncertainty Quantification in Performance Evaluation of Manufacturing Processes
        5. CloudMan: A Platform for Portable Cloud Manufacturing Services
        6. Building a Rigorous Foundation for Performance Assurance Assessment Techniques for “Smart” Manufacturing Systems
        7. A System Architecture for Manufacturing Process Analysis based on Big Data and Process Mining Techniques
    2. Special Session 2: Special Session on Big Data Representation and Processing in Data Science
      1. Overview
      2. Organizers
      3. Papers
        1. Researching Persons & Organizations AWAKE: From Text to an Entity-Centric Knowledge Base
        2. Integrating Existing Large Scale Medical Laboratory Data Into the Semantic Web Framework
        3. Path Knowledge Discovery: Association Mining Based on Multi-Category Lexicons
        4. Stochastic Finite Automata for the Translation of DNA to Protein
        5. Data mining and sharing tool for high content screening large scale biological image data
        6. A Building Performance Evaluation & Visualization System
        7. Statistical Technique for Online Anomaly Detection Using Spark Over Heterogeneous Data from Multi-source VMware Performance Data
        8. Extracting Discriminative Shapelets from Heterogeneous Sensor Data
  12. Workshops
    1. 1. Challenges and Issues on Scholarly Big Data Discovery and Collaboration (SBD 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
      5. Sponsors
    2. 2. The 2nd Workshop on Scalable Machine Learning- Theory and Applications
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    3. 3. 1st International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    4. 6. The Second Workshop on Distributed Storage Systems and Coding for Big Data 
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Program Committee Members
      5. Papers
    5. 7. First IEEE International Workshop on Big Data Security and Privacy (BDSP 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    6. 8. The 2nd Workshop on Big Data in Bioinformatics and Health Care Informatics
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    7. 9. 1st Workshop on Management, Search and Mining of Massive Repositories of Solar Astronomy Data (SABID'14)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    8. 10. Using Big Data to Understand Spatial Connectivity
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    9. 15. Workshop on Advances in Software and Hardware for Big Data to Knowledge Discovery (ASH 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    10. 16. IEEE Big Data Workshop on Semantics for Big Data on the Internet of Things (SemBIoT 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    11. 17. Big Data in Computational Epidemiology
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    12. 18. Large Scale Data Analytics in Transportation and Railway Infrastructure (BIGDATA-TRANSPORTATION 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    13. 19. 2nd Workshop on Scalable Cloud Data Management (SCDM 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    14. 20. Big Humanities Data
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    15. 21. Complexity for Big Data
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    16. 23. IEEE NIST Big Data PWG Workshop on Big Data- Challenges, Practices and Technologies
      1. Introduction to workshop
      2. The candidate list of panels is:
      3. Organizers
      4. Papers
  13. Posters
    1. Addressing Data Veracity in Big Data Applications
    2. Machine Learning and Interactive Visualization applied to TB-sized Images of Stem Cells
    3. Big Data: a new challenge for tourism
    4. OME: A tool for generating and managing metadata to handle BigData
    5. The EMBERS Architecture for Streaming Predictive Analytics
    6. A Novel Approach to Determine Docking Locations Using Fuzzy Logic and Shape Determination
    7. Linked Open Data Mining for Democratization of Big Data 
    8. Building Wrangler: A Transformational Data Intensive Resource for the Open Science Community
    9. CELAR: Automated Application Elasticity Platform
    10. Semantic HMC for Big Data Analysis
    11. A Layer Based Architecture for Provenance in Big Data
    12. B-dIDS: Mining Anomalies in a Big-distributed Intrusion Detection System
    13. Enabling Genomic Analysis on Federated Clouds
    14. Challenges of Data Integration and Interoperability in Big Data
    15. Large Scale Author Name Disambiguation in Digital Libraries
    16. Real-Time Traffic Incident Detection Using Probe-Car Data on the Tokyo Metropolitan Expressway
    17. Differentially Private Models of Tollgate Usage in Metropolitan Areas: The Milan Tollgate Data Set
    18. MoDisSENSE: A Distributed Platform for Social Networking Services over Mobile Devices
    19. A Challenge of Authorship Identification for Ten-thousand-scale Microblog Users
    20. Cognitive Map of Tourist Behavior based on Tripadvisor
    21. Biclustering using Spark-MapReduce
    22. A Summarization Paradigm for Big Data
    23. An open source framework to add spatial extent and geospatial visibility to Big Data
    24. Developing a Cloud Computing Platform for Big Data: The OpenStack Nova case
    25. Cultural Influences and Intelligence in Historical Gazetteers of the Great War
    26. The Bot Will Serve You Now: Automating Access to Archival Materials
    27. Incremental and Parallel Spatial Association Mining
    28. Sharding for Literature Search via Cutting Citation Graphs
    29. Repair Efficient Storage Codes via Combinatorial Configurations
  14. Sponsors
    1. IEEE
    2. IEEE Computer Society
    3. Elsevier
    4. Paradigm4
    5. NSF
    6. Cisco
    7. CCF
  15. NEXT

Data Science for IEEE Big Data

          5. Building k-nn graphs from large text data
          6. Learning to Predict Subject-Line Opens for Large-Scale Email Marketing
          7. MAGE: Matching Approximate Patterns in Richly-Attributed Graphs
          8. Bootstrapping K-means for Big data analysis
          9. Distributed Adaptive Importance Sampling on Graphical Models Using MapReduce
          10. Knowledge-based Clustering of Ship Trajectories Using Density-based Approach
          11. Immersive and collaborative data visualization using virtual reality platforms
          12. The Adaptive Projection Forest: Using Adjustable Exclusion and Parallelism in Metric Space Indexes
          13. Scaling Up Prioritized Grammar Enumeration for Scientific Discovery in the Cloud
      5. 5. Big Data Security & Privacy
        1. Regular Papers
          1. MR-TRIAGE: Scalable Multi-Criteria Clustering for Big Data Security Intelligence Applications
          2. Increasing the Veracity of Event Detection on Social Media Networks Through User Trust Modeling
        2. Short Papers
          1. Empowering users of social networks to assess their privacy risks
          2. A Unified Approach to Network Anomaly Detection
      6. 6. Big Data Applications
        1. Regular Papers
          1. E-Sketch: Gathering Large-scale Energy Consumption Data Based on Consumption Patterns
          2. Hierarchical Management of Large-Scale Malware Data
          3. Random Projection Based Clustering for Population Genomics
          4. Structure Recognition from High Resolution Images of Ceramic Composites
          5. Combining Hadoop and GPU to Preprocess Large Affymetrix Microarray Data
          6. Content-Based Access Control: Use Data Content to Assist Access Control for Large-Scale Content-Centric Databases
          7. Locating Visual Storm Signatures from Satellite Images
          8. Accurate and Efficient Selection of the Best Consumption Prediction Method in Smart Grids
        2. Short Papers
          1. Visualizations for Sense-Making in Financial Market Regulation
          2. Big Automotive Data - Leveraging large volumes of data for knowledge-driven product development
          3. On the Impact of Socio-economic Factors on Power Load Forecasting
          4. Toward Personalized and Scalable Voice-Enabled Services Powered by Big Data
          5. MaPLE: A MapReduce Pipeline for Lattice-based Evaluation and Its Application to SNOMED CT
          6. Dynamic Pre-training of Deep Recurrent Neural Networks for Predicting Environmental Monitoring Data
          7. Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters
          8. Department of Energy Strategic Roadmap for Earth System Science Data Integration
          9. Analyzing the Language of Food on Social Media
          10. Using Geometric Structures to Improve the Error Correction Algorithm of High-Throughput Sequencing Data on MapReduce Framework
          11. Empowering Personalized Medicine with Big Data and Semantic Web Technology: Promises, Challenges, and Use Cases
          12. On Scaling Time Dependent Shortest Path Computations for Dynamic Traffic Assignment
          13. High Volume Geospatial Mapping for Internet-of-Vehicle Solutions with In-Memory Map-Reduce Processing
      7. 7. Industry and Government Papers
        1. Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon
        2. Spatial Computations over Terabyte-Sized Images on Hadoop Platforms
        3. Recall Estimation for Rare Topic Retrieval from Large Corpuses
        4. Lightweight Approximate Top-k for Distributed Settings
        5. Query Revision During Cluster Based Search on Large Unstructured Corpora
        6. Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop
        7. Automating Data Integration with HiperFuse
        8. Recommending Similar Items in Large-scale Online Marketplaces
        9. SE-CDA: A Scalable and Efficient Community Detection Algorithm
        10. Increasing the Accessibility to Big Data Systems via a Common Services API
        11. Big Data Predictive Analytics for Proactive Semiconductor Equipment Maintenance
        12. Future Directions of Humans in Big Data Research
        13. ALOJA: a Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness
        14. Heterogeneous Stream Processing for Disaster Detection and Alarming
        15. Identifying top Chinese network buzzwords from social media big data set based on time-distribution features
        16. Bridging High Velocity and High Volume Industrial Big Data Through Distributed In-Memory Storage & Analytics
        17. Explore Efficient Data Organization for Large Scale Graph Analytics and Storage
        18. An Initial Study of Predictive Machine Learning Analytics on Large Volumes of Historical Data for Power System Applications
        19. In Unity There is Strength: Showcasing a Unified Big Data Platform with MapReduce Over both Object and File Storage
    8. Author Index
  9. Panel with Program Directors: Big Data Challenges and Opportunities
    1. Dr. Chaitanya Baru (NSF)
    2. Dr. Yuan Liu (NIH)
      1. Big Data NIH Funding Opportunities
      2. BD2K Funded Phase I
      3. BISTI Funding Opportunities
      4. Seven High Priority Research Areas
    3. Dr. David Kuehn (DoT)
      1. Data Science Challenges and Opportunities in Highway Transportation
      2. Presentation Outline
      3. Program Status
      4. Connected Highway Systems
      5. Opportunities
      6. Thank You
    4. Dr. Tsengdar Lee (NASA)
      1. NASA's Big Data Challenges in Climate Science
      2. Projected Data Holdings
      3. Turning Observations into Knowledge and Decision Products
      4. Future Directions and Challenges
      5. Research Opportunities
      6. Thank You!
    5. Dr. Sudarsan Rachuri (NIST)
      1. From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems
      2. Outline
      3. Smart Manufacturing Systems Design and Analysis
      4. Manufacturing Big Data Analytics
    6. Mr. Matti Vakkuri (DIGILE)
      1. Data to Intelligence Research Program
      2. Data to Intelligence Program Implementation Facts
      3. Data to Intelligence-Main Results Obtained
      4. Oulu Traffic Pilot
      5. Forest Big Data
      6. Remote and Highly Automated Services
  10. Tutorials
    1. Tutorial 1: Big Data Stream Mining
      1. Presenters
      2. Summary
      3. Content
      4. Short Profiles
    2. Tutorial 2: Big ML Software for Modern ML Algorithms
      1. Presenters
      2. Summary
      3. Content
      4. Petuum
      5. Spark
      6. GraphLab
    3. Tutorial 3: Large-scale Heterogeneous Learning in Big Data Analytics
      1. Presenter
      2. Summary
      3. Content
      4. Short Profile
    4. Tutorial 4: Big Data Benchmarking
      1. Presenters
      2. Summary
      3. Content
      4. Short Profile
  11. Special Session
    1. Special Session 1: From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems
      1. Overview
      2. Organizers
      3. Papers
        1. Toward Smart Manufacturing Using Decision Analytics
        2. An Intelligent Machine Monitoring System Using Gaussian Process Regression for Energy Prediction
        3. Towards a Domain-Specific Framework for Predictive Analytics in Manufacturing
        4. Uncertainty Quantification in Performance Evaluation of Manufacturing Processes
        5. CloudMan: A Platform for Portable Cloud Manufacturing Services
        6. Building a Rigorous Foundation for Performance Assurance Assessment Techniques for “Smart” Manufacturing Systems
        7. A System Architecture for Manufacturing Process Analysis based on Big Data and Process Mining Techniques
    2. Special Session 2: Special Session on Big Data Representation and Processing in Data Science
      1. Overview
      2. Organizers
      3. Papers
        1. Researching Persons & Organizations AWAKE: From Text to an Entity-Centric Knowledge Base
        2. Integrating Existing Large Scale Medical Laboratory Data Into the Semantic Web Framework
        3. Path Knowledge Discovery: Association Mining Based on Multi-Category Lexicons
        4. Stochastic Finite Automata for the Translation of DNA to Protein
        5. Data mining and sharing tool for high content screening large scale biological image data
        6. A Building Performance Evaluation & Visualization System
        7. Statistical Technique for Online Anomaly Detection Using Spark Over Heterogeneous Data from Multi-source VMware Performance Data
        8. Extracting Discriminative Shapelets from Heterogeneous Sensor Data
  12. Workshops
    1. 1. Challenges and Issues on Scholarly Big Data Discovery and Collaboration (SBD 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
      5. Sponsors
    2. 2. The 2nd Workshop on Scalable Machine Learning: Theory and Applications
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    3. 3. 1st International Workshop on High Performance Big Graph Data Management, Analysis, and Mining
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    4. 6. The Second Workshop on Distributed Storage Systems and Coding for Big Data 
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Program Committee Members
      5. Papers
    5. 7. First IEEE International Workshop on Big Data Security and Privacy (BDSP 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    6. 8. The 2nd Workshop on Big Data in Bioinformatics and Health Care Informatics
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    7. 9. 1st Workshop on Management, Search and Mining of Massive Repositories of Solar Astronomy Data (SABID'14)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    8. 10. Using Big Data to Understand Spatial Connectivity
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    9. 15. Workshop on Advances in Software and Hardware for Big Data to Knowledge Discovery (ASH 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    10. 16. IEEE Big Data Workshop on Semantics for Big Data on the Internet of Things (SemBIoT 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    11. 17. Big Data in Computational Epidemiology
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    12. 18. Large Scale Data Analytics in Transportation and Railway Infrastructure (BIGDATA-TRANSPORTATION 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    13. 19. 2nd Workshop on Scalable Cloud Data Management (SCDM 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    14. 20. Big Humanities Data
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    15. 21. Complexity for Big Data
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    16. 23. IEEE NIST Big Data PWG Workshop on Big Data: Challenges, Practices and Technologies
      1. Introduction to workshop
      2. The candidate list of panels is:
      3. Organizers
      4. Papers
  13. Posters
    1. Addressing Data Veracity in Big Data Applications
    2. Machine Learning and Interactive Visualization applied to TB-sized Images of Stem Cells
    3. Big Data: a new challenge for tourism
    4. OME: A tool for generating and managing metadata to handle BigData
    5. The EMBERS Architecture for Streaming Predictive Analytics
    6. A Novel Approach to Determine Docking Locations Using Fuzzy Logic and Shape Determination
    7. Linked Open Data Mining for Democratization of Big Data 
    8. Building Wrangler: A Transformational Data Intensive Resource for the Open Science Community
    9. CELAR: Automated Application Elasticity Platform
    10. Semantic HMC for Big Data Analysis
    11. A Layer Based Architecture for Provenance in Big Data
    12. B-dIDS: Mining Anomalies in a Big-distributed Intrusion Detection System
    13. Enabling Genomic Analysis on Federated Clouds
    14. Challenges of Data Integration and Interoperability in Big Data
    15. Large Scale Author Name Disambiguation in Digital Libraries
    16. Real-Time Traffic Incident Detection Using Probe-Car Data on the Tokyo Metropolitan Expressway
    17. Differentially Private Models of Tollgate Usage in Metropolitan Areas: The Milan Tollgate Data Set
    18. MoDisSENSE: A Distributed Platform for Social Networking Services over Mobile Devices
    19. A Challenge of Authorship Identification for Ten-thousand-scale Microblog Users
    20. Cognitive Map of Tourist Behavior based on Tripadvisor
    21. Biclustering using Spark-MapReduce
    22. A Summarization Paradigm for Big Data
    23. An open source framework to add spatial extent and geospatial visibility to Big Data
    24. Developing a Cloud Computing Platform for Big Data: The OpenStack Nova case
    25. Cultural Influences and Intelligence in Historical Gazetteers of the Great War
    26. The Bot Will Serve You Now: Automating Access to Archival Materials
    27. Incremental and Parallel Spatial Association Mining
    28. Sharding for Literature Search via Cutting Citation Graphs
    29. Repair Efficient Storage Codes via Combinatorial Configurations
  14. Sponsors
    1. IEEE
    2. IEEE Computer Society
    3. Elsevier
    4. Paradigm4
    5. NSF
    6. Cisco
    7. CCF
  15. NEXT

  1. Story
  2. Slides
    1. Panel on Big Data Challenges and Opportunities
    2. NSF’s Perspective and Role
    3. NSF: Research to Infrastructure Continuum
    4. Big Data in the CISE Context
    5. 2013: Core Techniques and Technologies for Advancing Big Data Science & Engineering
    6. 2014: Critical Techniques and Technologies for Advancing Big Data Science and Engineering
    7. Data Infrastructure Building Blocks
    8. NSF Research Traineeship (NRT)
    9. Initiatives underway…need your inputs
    10. CISE RFI: Big Data Regional Innovation Hubs
    11. Big Data Hubs examples…
    12. RFI National Big Data R&D Initiative
    13. Increase the Nation’s Data Science Capacity – An NSF Priority Goal
    14. Thank You!
  3. Accelerating the Big Data Innovation Ecosystem
  4. Spotfire Dashboard
  5. Research Notes
    1. Definition: NAICS 31-33, Manufacturing
      1. Guide to Data Sources for Manufacturing from the U.S. Census Bureau
    2. Manufacturers’ Shipments, Inventories, & Orders
      1. Overview
      2. 2014-2015 Release Schedule
      3. Annual Survey of Manufactures: Geographic Area Statistics: Statistics for All Manufacturing by State: 2011 REFRESH
    3. Peter Morosoff
  6. The 2014 IEEE International Conference on Big Data
  7. Main Page
  8. Main Conference
    1. Cover Page
    2. Copyright Information & Catalog Numbers
    3. Preface
    4. Organization Team Members
      1. Big Data Steering Committee
      2. Conference Co-Chairs
      3. Program Co-Chairs
      4. Industry and Government Program Co-Chairs
      5. Workshop Co-Chairs
      6. Poster Co-Chairs
      7. Publicity Co-chairs
      8. Proceedings Co-Chairs
      9. Sponsorship Co-Chairs
      10. Tutorial Co-Chairs
      11. Registration Co-Chairs
      12. Student Travel Award Co-Chairs
    5. Program Committee List
      1. Senior Program Committee Members
      2. Program Committee Members
      3. Industy & Government Program Committee Members
    6. Keynote Speeches
      1. Never-Ending Language Learning
        1. Abstract 
        2. Biography 
        3. Never Ending Language Learning
        4. Rationale
        5. NELL: Never-Ending Language Learner
        6. NELL Today
        7. Summary
        8. Thank You
      2. Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities
        1. Abstract 
        2. Biography 
        3. Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities
        4. Thanks: My Team
        5. I Will Use Applications in 3 Domains to Demonstrate
        6. My 2004-2005 Fomulation of SMART DATA-Semagix
        7. In Process, Engaging Both Top and Bottom Brain
        8. W3C Semantic Sensor Network Ontology
        9. Personalized Digital Health Research and Applications at Kno.e.sis
        10. Semantic Annotation Using Background Knowledge
        11. Matching Requests With Offers
        12. Wrapping Up: Two Excellent Videos
        13. Wrapping Up: Take Away 1
        14. Wrapping Up: Take Away 2
        15. Special Thanks
        16. Kno.e.sis
      3. Addressing Human Bottlenecks in Big Data
        1. Abstract 
        2. Biography
        3. Addressing Human Bottlenecks in Big Data
        4. Outline
        5. I Can Give You Power
        6. Data Distributed Programming
        7. Where Does Time Go?
        8. Wangler 2011: Add Intelligence
        9. Technical Philosophy
        10. Takeaways
    7. Sessions
      1. 1. Big Data Foundations
        1. Regular Papers
          1. BASIC: an Alternative to BASE for Large-Scale Data Management System
          2. BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured BigData
        2. Short Papers
          1. Scaling up M-estimation via sampling designs: the Horvitz-Thompson stochastic gradient descent
          2. Metadata Capital: Simulating the Predictive Value of Self-Generated Heatlh Information (SGHI)
          3. Representative Subsets For Big Data Learning using k-NN graphs
          4. Towards Building and Evaluating a Personalized Location-Based Recommender System
          5. On the Performance of MapReduce: A Stochastic Approach
          6. PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems
      2. 2. Big Data Infrastructure
        1. Regular Papers
          1. FusionFS: Toward Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems
          2. ​BurstMem: A High-Performance Burst Buffer System for Scientific Applications
          3. Partial Rollback-based Scheduling on In-memory Transactional Data Grids
          4. Detecting and Identifying System Changes in the Cloud via Discovery by Example
          5. ​PigOut: Making Multiple Hadoop Clusters Work Together
          6. Parallel Breadth First Search on GPU Clusters
          7. Optimizing Load Balancing and Data-Locality with Data-aware Scheduling
        2. Short Papers
          1. Online Temporal-Spatial Analysis for Detection of Critical Events in Cyber-Physical Systems
          2. ​A Cross-job Framework for MapReduce Scheduling
          3. Scheduling MapReduce Tasks on Virtual MapReduce Clusters from a Tenant’s Perspective
          4. ​FlexDAS: A Flexible Direct Attached Storage for I/O Intensive Applications
          5. ​A Two-Sided Market Mechanism for Trading Big Data Computing Commodities
          6. ​MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping
          7. Large-Scale Network Traffic Monitoring with DBStream, a System for Rolling Big Data Analysis
          8. ​Synthetic Data Generation for the Internet of Things
          9. ​Evaluating the Performance and Scalability of the Ceph Distributed Storage System
          10. ​Incremental Window Aggregates over Array Database
          11. ​BigCache for Big-data Systems
          12. Automated Workload-aware Elasticity of NoSQL Clusters in the Cloud
          13. ​Distributed Class Dependent Feature Analysis - A Big Data Approach
          14. ​VENU: Orchestrating SSDs in Hadoop Storage
          15. In-Memory I/O and Replication for HDFS with Memcached: Early Experiences
          16. Enabling Composite Applications through an Asynchronous Shared Memory Interface
          17. k-balanced sorting and skew join in MPI and MapReduce
      3. 3. Big Data Management
        1. Regular Papers
          1. Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems
          2. Examination of Data, Rule Generation and Detection of Phishing URLs using Online Logistic Regression
          3. Main Memory Evaluation of Recursive Queries on Multicore Machines
          4. Predicting Glaucoma Progression using Multi-task Learning with Heterogeneous Features
          5. Provenance-Based Object Storage Prediction Scheme for Scientific Big Data Applications
          6. Synergistic Partitioning in Multiple Large Scale Social Networks
          7. TRISTAN: Real-Time Analytics on Massive Time Series Using Sparse Dictionary Compression
          8. Performance Modeling in CUDA Streams - A Means for High-Throughput Data Processing
        2. Short Papers
          1. Minimizing Data Movement through Query Transformation
          2. Multilevel Partitioning of Large Unstructured Grids
          3. Low Complexity Sensing for Big Spatio-Temporal Data
          4. In-advance Data Analytics for Reducing Time to Discovery
      4. 4. Big Data Search and Mining
        1. Regular Papers
          1. Learning to Estimate Pairwise Distances in Large Graphs
          2. Distributed Adaptive Model Rules for Mining Big Data Streams
          3. Sparse computation for large-scale data mining
          4. Topic Similarity Networks: Visual Analytics for Large Document Sets
          5. Efficient Breadth-First Search on a Heterogeneous Processor
          6. Web-based Visual Analytics for Extreme Scale Climate Science
          7. Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization
          8. Meeting Predictable Buffer Limits in the Parallel Execution of Event Processing Operators
          9. Metadata Extraction and Correction for Large-Scale Traffic Surveillance Videos
          10. Facilitating Twitter Data Analytics: Platform, Language and Functionality
          11. Visual Fusion of Mega-City Big Data: An Application to Traffic and Tweets Data Analysis of Metro Passengers
          12. GRAPHiQL: A Graph Intuitive Query Language for Relational Databases
          13. Evaluating Density-based Motion for Big Data Visual Analytics
          14. Regression Trees for Streaming Data with Local Performance Guarantees
          15. Distributed Algorithms for k-truss Decomposition
          16. PULP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks
          17. Effective Caching Techniques for Accelerating Pattern Matching Queries
          18. Clique Guided Community Detection
          19. Large-scale Distributed Sorting for GPU-based Heterogeneous Supercomputers
          20. Large-scale Logistic Regression and Linear Support Vector Machines Using Spark
          21. NVM-based Hybrid BFS with Memory Efficient Data Structure
          22. Identification of SNP Interactions Using Data-Parallel Primitives on GPUs
        2. Short Papers
          1. Random Walks on Adjacency Graphs for Mining Lexical Relations from Big Text Data
          2. Entity Resolution Using Inferred Relationships and Behavior
          3. Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability
          4. In-Situ Visualization and Computational Steering for Large-Scale Simulation of Turbulent Flows in Complex Geometries
          5. Building k-nn graphs from large text data
          6. Learning to Predict Subject-Line Opens for Large-Scale Email Marketing
          7. MAGE: Matching Approximate Patterns in Richly-Attributed Graphs
          8. Bootstrapping K-means for Big data analysis
          9. Distributed Adaptive Importance Sampling on Graphical Models Using MapReduce
          10. Knowledge-based Clustering of Ship Trajectories Using Density-based Approach
          11. Immersive and collaborative data visualization using virtual reality platforms
          12. The Adaptive Projection Forest: Using Adjustable Exclusion and Parallelism in Metric Space Indexes
          13. Scaling Up Prioritized Grammar Enumeration for Scientific Discovery in the Cloud
      5. 5. Big Data Security & Privacy
        1. Regular Papers
          1. MR-TRIAGE: Scalable Multi-Criteria Clustering for Big Data Security Intelligence Applications
          2. Increasing the Veracity of Event Detection on Social Media Networks Through User Trust Modeling
        2. Short Papers
          1. Empowering users of social networks to assess their privacy risks
          2. A Unified Approach to Network Anomaly Detection
      6. 6. Big Data Applications
        1. Regular Papers
          1. E-Sketch: Gathering Large-scale Energy Consumption Data Based on Consumption Patterns
          2. Hierarchical Management of Large-Scale Malware Data
          3. Random Projection Based Clustering for Population Genomics
          4. Structure Recognition from High Resolution Images of Ceramic Composites
          5. Combining Hadoop and GPU to Preprocess Large Affymetrix Microarray Data
          6. Content-Based Access Control: Use Data Content to Assist Access Control for Large-Scale Content-Centric Databases
          7. Locating Visual Storm Signatures from Satellite Images
          8. Accurate and Efficient Selection of the Best Consumption Prediction Method in Smart Grids
        2. Short Papers
          1. Visualizations for Sense-Making in Financial Market Regulation
          2. Big Automotive Data - Leveraging large volumes of data for knowledge-driven product development
          3. On the Impact of Socio-economic Factors on Power Load Forecasting
          4. Toward Personalized and Scalable Voice-Enabled Services Powered by Big Data
          5. MaPLE: A MapReduce Pipeline for Lattice-based Evaluation and Its Application to SNOMED CT
          6. Dynamic Pre-training of Deep Recurrent Neural Networks for Predicting Environmental Monitoring Data
          7. Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters
          8. Department of Energy Strategic Roadmap for Earth System Science Data Integration
          9. Analyzing the Language of Food on Social Media
          10. Using Geometric Structures to Improve the Error Correction Algorithm of High-Throughput Sequencing Data on MapReduce Framework
          11. Empowering Personalized Medicine with Big Data and Semantic Web Technology: Promises, Challenges, and Use Cases
          12. On Scaling Time Dependent Shortest Path Computations for Dynamic Traffic Assignment
          13. High Volume Geospatial Mapping for Internet-of-Vehicle Solutions with In-Memory Map-Reduce Processing
      7. 7. Industry and Government Papers
        1. Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon
        2. Spatial Computations over Terabyte-Sized Images on Hadoop Platforms
        3. Recall Estimation for Rare Topic Retrieval from Large Corpuses
        4. Lightweight Approximate Top-k for Distributed Settings
        5. Query Revision During Cluster Based Search on Large Unstructured Corpora
        6. Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop
        7. Automating Data Integration with HiperFuse
        8. Recommending Similar Items in Large-scale Online Marketplaces
        9. SE-CDA: A Scalable and Efficient Community Detection Algorithm
        10. Increasing the Accessibility to Big Data Systems via a Common Services API
        11. Big Data Predictive Analytics for Proactive Semiconductor Equipment Maintenance
        12. Future Directions of Humans in Big Data Research
        13. ALOJA: a Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness
        14. Heterogeneous Stream Processing for Disaster Detection and Alarming
        15. Identifying top Chinese network buzzwords from social media big data set based on time-distribution features
        16. Bridging High Velocity and High Volume Industrial Big Data Through Distributed In-Memory Storage & Analytics
        17. Explore Efficient Data Organization for Large Scale Graph Analytics and Storage
        18. An Initial Study of Predictive Machine Learning Analytics on Large Volumes of Historical Data for Power System Applications
        19. In Unity There is Strength: Showcasing a Unified Big Data Platform with MapReduce Over both Object and File Storage
    8. Author Index
  9. Panel with Program Directors: Big Data Challenges and Opportunities
    1. Dr. Chaitanya Baru (NSF)
    2. Dr. Yuan Liu (NIH)
      1. Big Data NIH Funding Opportunities
      2. BD2K Funded Phase I
      3. BISTI Funding Opportunities
      4. Seven High Priority Research Areas
    3. Dr. David Kuehn (DoT)
      1. Data Science Challenges and Opportunities in Highway Transportation
      2. Presentation Outline
      3. Program Status
      4. Connected Highway Systems
      5. Opportunities
      6. Thank You
    4. Dr.Tsengdar Lee (NASA)
      1. NASA's Big Data Challenges in Climate Science
      2. Projected Data Holdings
      3. Turning Observations into Knowledge and Decision Products
      4. Future Directions and Challenges
      5. Research Opportunities
      6. Thank You!
    5. Dr. Sudarsan Rachuri (NIST)
      1. From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems
      2. Outline
      3. Smart Manufacturing Systems Design and Analysis
      4. Manufacturing Big Data Analytics
    6. Mr. Matti Vakkuri (DIGILE)
      1. Data to Intelligence Research Program
      2. Data to Intelligence Program Implementation Facts
      3. Data to Intelligence-Main Results Obtained
      4. Oulu Traffic Pilot
      5. Forest Big Data
      6. Remote and Highly Automated Services
  10. Tutorials
    1. Tutorial 1: Big Data Stream Mining
      1. Presenters
      2. Summary
      3. Content
      4. Short Profiles
    2. Tutorial 2: Big ML Software for Modern ML Algorithms
      1. Presenters
      2. Summary
      3. Content
      4. Petuum
      5. Spark
      6. GraphLab
    3. Tutorial 3: Large-scale Heterogeneous Learning in Big Data Analytics
      1. Presenter
      2. Summary
      3. Content
      4. Short Profile
    4. Tutorial 4: Big Data Benchmarking
      1. Presenters
      2. Summary
      3. Content
      4. Short Profile
  11. Special Session
    1. Special Session 1: From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems
      1. Overview
      2. Organizers
      3. Papers
        1. Toward Smart Manufacturing Using Decision Analytics
        2. An Intelligent Machine Monitoring System Using Gaussian Process Regression for Energy Prediction
        3. Towards a Domain-Specific Framework for Predictive Analytics in Manufacturing
        4. Uncertainty Quantification in Performance Evaluation of Manufacturing Processes
        5. CloudMan: A Platform for Portable Cloud Manufacturing Services
        6. Building a Rigorous Foundation for Performance Assurance Assessment Techniques for “Smart” Manufacturing Systems
        7. A System Architecture for Manufacturing Process Analysis based on Big Data and Process Mining Techniques
    2. Special Session 2: Special Session on Big Data Representation and Processing in Data Science
      1. Overview
      2. Organizers
      3. Papers
        1. Researching Persons & Organizations AWAKE: From Text to an Entity-Centric Knowledge Base
        2. Integrating Existing Large Scale Medical Laboratory Data Into the Semantic Web Framework
        3. Path Knowledge Discovery: Association Mining Based on Multi-Category Lexicons
        4. Stochastic Finite Automata for the Translation of DNA to Protein
        5. Data mining and sharing tool for high content screening large scale biological image data
        6. A Building Performance Evaluation & Visualization System
        7. Statistical Technique for Online Anomaly Detection Using Spark Over Heterogeneous Data from Multi-source VMware Performance Data
        8. Extracting Discriminative Shapelets from Heterogeneous Sensor Data
  12. Workshops
    1. 1. Challenges and Issues on Scholarly Big Data Discovery and Collaboration (SBD 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
      5. Sponsors
    2. 2. The 2nd Workshop on Scalable Machine Learning- Theory and Applications
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    3. 3. 1st International Workshop on High Performance Big Graph Data Management Analysis, and Mining
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    4. 6. The Second Workshop on Distributed Storage Systems and Coding for Big Data 
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Program Committee Members
      5. Papers
    5. 7. First IEEE International Workshop on Big Data Security and Privacy (BDSP 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    6. 8. The 2nd Workshop on Big Data in Bioinformatics and Health Care Informatics
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    7. 9. 1st Workshop on Management, Search and Mining of Massive Repositories of Solar Astronomy Data (SABID'14)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    8. 10. Using Big Data to Understand Spatial Connectivity
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    9. 15. Workshop on Advances in Software and Hardware for Big Data to Knowledge Discovery (ASH 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    10. 16. IEEE Big Data Workshop on Semantics for Big Data on the Internet of Things (SemBIoT 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    11. 17. Big Data in Computational Epidemiology
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    12. 18. Large Scale Data Analytics in Transportation and Railway Infrastructure (BIGDATA-TRANSPORTATION 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    13. 19. 2nd Workshop on Scalable Cloud Data Management (SCDM 2014)
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    14. 20. Big Humanities Data
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    15. 21. Complexity for Big Data
      1. Introduction to workshop
      2. Research topics included in the workshop
      3. Organizers
      4. Papers
    16. 23. IEEE NIST Big Data PWG Workshop on Big Data- Challenges, Practices and Technologies
      1. Introduction to workshop
      2. The candidate list of panels is:
      3. Organizers
      4. Papers
  13. Posters
    1. Addressing Data Veracity in Big Data Applications
    2. Machine Learning and Interactive Visualization applied to TB-sized Images of Stem Cells
    3. Big Data: a new challenge for tourism
    4. OME: A tool for generating and managing metadata to handle BigData
    5. The EMBERS Architecture for Streaming Predictive Analytics
    6. A Novel Approach to Determine Docking Locations Using Fuzzy Logic and Shape Determination
    7. Linked Open Data Mining for Democratization of Big Data 
    8. Building Wrangler: A Transformational Data Intensive Resource for the Open Science Community
    9. CELAR: Automated Application Elasticity Platform
    10. Semantic HMC for Big Data Analysis
    11. A Layer Based Architecture for Provenance in Big Data
    12. B-dIDS: Mining Anomalies in a Big-distributed Intrusion Detection System
    13. Enabling Genomic Analysis on Federated Clouds
    14. Challenges of Data Integration and Interoperability in Big Data
    15. Large Scale Author Name Disambiguation in Digital Libraries
    16. Real-Time Traffic Incident Detection Using Probe-Car Data on the Tokyo Metropolitan Expressway
    17. Differentially Private Models of Tollgate Usage in Metropolitan Areas: The Milan Tollgate Data Set
    18. MoDisSENSE: A Distributed Platform for Social Networking Services over Mobile Devices
    19. A Challenge of Authorship Identification for Ten-thousand-scale Microblog Users
    20. Cognitive Map of Tourist Behavior based on Tripadvisor
    21. Biclustering using Spark-MapReduce
    22. A Summarization Paradigm for Big Data
    23. An open source framework to add spatial extent and geospatial visibility to Big Data
    24. Developing a Cloud Computing Platform for Big Data: The OpenStack Nova case
    25. Cultural Influences and Intelligence in Historical Gazetteers of the Great War
    26. The Bot Will Serve You Now: Automating Access to Archival Materials
    27. Incremental and Parallel Spatial Association Mining
    28. Sharding for Literature Search via Cutting Citation Graphs
    29. Repair Efficient Storage Codes via Combinatorial Configurations
  14. Sponsors
    1. IEEE
    2. IEEE Computer Society
    3. Elsevier
    4. Paradigm4
    5. NSF
    6. Cisco
    7. CCF
  15. NEXT

Story

Data Science for IEEE Big Data

I submitted a paper that was not accepted, and Joan Aron submitted a workshop presentation that was accepted and received praise. See: 2014 IEEE International Conference on Big Data, October 27-30, Washington DC. Paper and Panel. Previously I had done a data science data publication for the IEEE Big Data Conference: Data Science for COM.BigData 2014 and for the COM Big Data Conference itself: Data Publication.

Since I was unable to attend but received a nice summary email (The 2014 IEEE International Conference on Big Data), I decided to build a data science publication by mining and curating the content.

The three URLs in the email:

  1. Keynote speech http://cci.drexel.edu/bigdata/bigdat...notespeech.htm My Note: See Keynote Speeches
  2. 4 Tutorials: http://cci.drexel.edu/bigdata/bigdat...4/tutorial.htm My Note: See Tutorials
  3. Panel with funding agency program directors: Big Data Challenges and Opportunities: http://cci.drexel.edu/bigdata/bigdata2014/panel.htm My Note: See Panel with Program Directors: Big Data Challenges and Opportunities

were very useful, especially the slide presentation by Dr. Chaitanya Baru (NSF):

Chaitan Baru currently serves as Senior Advisor for Data Science in the CISE Directorate at the National Science Foundation. He is on assignment from the San Diego Supercomputer Center, UC San Diego, where he is Distinguished Scientist and Associate Director of Data Initiatives. He has served in leadership positions in a number of national-scale cyberinfrastructure R&D initiatives across a wide range of science and engineering disciplines including, earth science, ecology, earthquake engineering, and biomedical informatics. In 2012, he initiated an industry-academia effort to define big data benchmarks via the Workshops for Big Data Benchmarking (WBDB). This has resulted in the recent formation of the SPEC Research Group on Big Data Benchmarking, which he co-chairs. He is co-editor of the Lecture Notes in Computer Science series entitled, Specifying Big Data Benchmarks, published by Springer Verlag. He co-chairs the National Institute for Standards and Technology’s (NIST) Public Working Group on Big Data. He is a member of the teaching faculty for the Masters in Advanced Studies program in Data Science and Engineering (MAS-DSE) in the Computer Science Department at UC San Diego.

Baru has co-edited the book, Geoinformatics: Cyberinfrastructure for the Solid Earth Sciences with Prof. Randy Keller, University of Oklahoma, published by Cambridge University Press (ISBN: 9780521897150).

Baru has a B.Tech (Electronics Engineering) from IIT Madras and an M.E. and Ph.D. (Electrical Engineering) from the University of Florida.

The slides for this panelist's talk can be downloaded here. See Slides

Dr. Baru's slide, CISE RFI: Big Data Regional Innovation Hubs, mentions an RFI: Accelerating the Big Data Innovation Ecosystem. The Federal Big Data Working Group Meetup is a Big Data Regional Innovation Hub!

Some comments on the process I followed for producing a data science data publication are as follows:

  • Copyright Information & Catalog Numbers
    • My Note: I am abstracting for a data science data publication with credit from PDFs and the Web, all of which I assume is open and reusable content that would add value to IEEE content at no cost to users.
  • Author Index
    • My Note: I get this by using Google Chrome Find on this entire page. There is no search across all their web pages and PDFs.
  • Special Session
    • My Note: Need Paper PDF or just Abstract?
  • Posters
    • My Note: Attached 29 PDFs
  • Sessions
    • My Note: These could be further structured by paper and have the PDFs attached and excerpts inserted.

So how would I now identify and mine the actual data within the publications and referenced by the publications?

I can at least do a Google Chrome Find for "data". The result is 613 hits, of which 137 are outside this page in the overall table of contents of my wiki! It looks like I will have to open and skim each paper. Certainly having the PDFs on a thumb drive makes that easier than downloading them all first.
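The browser Find above can be approximated across many papers at once. The following is a minimal sketch, assuming the PDF text has already been extracted into plain strings by some other tool; the function name `count_hits` and the sample file names are hypothetical and only illustrate the idea. Like a browser Find, it matches substrings, so "metadata" counts as a hit for "data".

```python
import re
from collections import Counter


def count_hits(documents, term):
    """Count case-insensitive substring occurrences of `term` per document.

    `documents` maps a (hypothetical) document name to its extracted text.
    Documents with no hits are left out of the result.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = Counter()
    for name, text in documents.items():
        n = len(pattern.findall(text))
        if n:
            hits[name] = n
    return hits


# Illustrative input only: these strings stand in for text extracted from papers.
sample = {
    "paper_a.txt": "Big Data platforms store data; metadata describes the data.",
    "paper_b.txt": "A study of graph analytics.",
}
hits = count_hits(sample, "data")
for name, n in hits.most_common():
    print(f"{name}: {n}")
```

Ranking papers by hit count does not replace skimming them, but it suggests which ones to open first.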

Another example is a find for Professor Sam Madden (one of the leaders of the MIT Big Data Course): GRAPHiQL: A Graph Intuitive Query Language for Relational Databases

An inventory is:

  • Industry & Government Papers: 19 for 12 MB
  • Keynotes: 3 for 1.6 MB My Note: Uploaded
  • Posters: 29 for 45 MB My Note: Uploaded
  • Regular Papers: 49 for 104 MB
  • Short Papers: 55 for 72 MB
  • Special Session: 7 for 24 MB and 8 for 10 MB
  • Workshops: 131 for 284 MB
    • WS#1 Challenges and Issues on Scholarly Big Data Discovery and Collaboration (SBD 2014) 4 for 6 MB
    • WS#2 The 2nd Workshop on Scalable Machine Learning- Theory and Applications 16 for 35 MB
    • WS#3 1st International Workshop on High Performance Big Graph Data Management Analysis, and Mining 11 for 28 MB
    • WS#6 The Second Workshop on Distributed Storage Systems and Coding for Big Data 6 for 8 MB
    • WS#7 First IEEE International Workshop on Big Data Security and Privacy (BDSP 2014) 3 for 4 MB
    • WS#8 The 2nd Workshop on Big Data in Bioinformatics and Health Care Informatics 13 for 44 MB
    • WS#9 Solar Astronomy Big Data (SABID) – 1st Workshop on Management, Search and Mining of Massive Repositories of Solar Astronomy Data (SABID'14) 7 for 15 MB
    • WS#10 Using Big Data to Understand Spatial Connectivity 1 for 2 MB
    • WS#15 Workshop on Advances in Software and Hardware for Big Data to Knowledge Discovery (ASH 2014) 9 for 23 MB
    • WS#16 IEEE Big Data Workshop on Semantics for Big Data on the Internet of Things (SemBIoT 2014) 4 for 6 MB
    • WS#17 Big Data in Computational Epidemiology 4 for 6 MB
    • WS#18 Large Scale Data Analytics in Transportation and Railway Infrastructure (BIGDATA-TRANSPORTATION 2014) 15 for 33 MB
    • WS#19 2nd Workshop on Scalable Cloud Data Management (SCDM 2014) 7 for 13 MB
    • WS#20 Big Humanities Data 16 for 39 MB
    • WS#21 Complexity for Big Data 7 for 15 MB
    • WS#23 IEEE NIST Big Data PWG Workshop on Big Data- Challenges, Practices and Technologies 8 for 7 MB My Note: This is where my joint paper with Dr. Joan Aron is.
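An inventory in this "N for X MB" style can be produced by walking the folder that holds the PDFs. The sketch below is one possible way to do it, not a record of how the list above was built; the function names `inventory` and `format_inventory`, and the assumption that each category is a top-level subfolder of PDF files, are hypothetical.

```python
import os


def inventory(root):
    """Return {subfolder: (pdf_count, total_bytes)} for each folder directly under root."""
    result = {}
    for entry in sorted(os.listdir(root)):
        folder = os.path.join(root, entry)
        if not os.path.isdir(folder):
            continue  # ignore loose files at the top level
        count, size = 0, 0
        for dirpath, _dirs, files in os.walk(folder):
            for name in files:
                if name.lower().endswith(".pdf"):
                    count += 1
                    size += os.path.getsize(os.path.join(dirpath, name))
        result[entry] = (count, size)
    return result


def format_inventory(result):
    """Render one 'Name: N for X MB' line per folder (1 MB taken as 1,000,000 bytes)."""
    return [
        f"{name}: {count} for {size / 1_000_000:.0f} MB"
        for name, (count, size) in result.items()
    ]


# Example call (the path is a placeholder for wherever the PDFs were copied):
# print("\n".join(format_inventory(inventory("/path/to/conference_pdfs"))))
```

As a consistency check on the list above, the sixteen workshop entries sum to 131 files and 284 MB, which matches the Workshops line.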

Interestingly, the Tutorials and Panels are not with these, possibly because they are not considered publications.

I mined for big data for manufacturing ontologies and data sets and found: Special Session 1: From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems and Research Notes.

I mined for manufacturing and building performance data and found:

My goal is to go from high-level manufacturing statistics to individual manufacturing processes.

I built an Excel Spreadsheet Knowledge Base Index for use in TIBCO Spotfire for analytics and visualizations.

The results are shown in the next three screen captures and in the Spotfire Dashboard below.

IEEManufacturingBigData-SpotfireCoverPage.png

IEEManufacturingBigData-SpotfireCensusManufacturingData.png

IEEManufacturingBigData-Spotfire2011SurveyofManufactures.png

Briefly, the story is: I found Big Data and Analytics for Smart Manufacturing Systems in the Special Session, found Census Manufacturing Data from a Web search, and found the 2011 Survey of Manufactures in the Census Manufacturing Data inventory.

Next I need to find big data for individual manufacturing processes, especially for California and Texas, which lead the states in Payroll and Expenditures, respectively.
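A ranking like the one behind the California and Texas observation can be reproduced from a tabular export of the state statistics. This is a minimal sketch under stated assumptions: the rows are already filtered to state-level totals for all manufacturing, the column key `"state"` and the measure names are hypothetical, and the function `leading_state` is illustrative rather than part of Spotfire or any Census tool.

```python
def leading_state(rows, measure):
    """Return the (state, value) pair with the largest value for `measure`.

    `rows` is an iterable of dicts such as csv.DictReader yields. The key
    names "state" and `measure` are assumptions about the export layout.
    Rows whose value is missing or non-numeric (for example, withheld cells)
    are skipped. Returns None if no row has a usable value.
    """
    best = None
    for row in rows:
        raw = str(row.get(measure, "")).replace(",", "").strip()
        try:
            value = float(raw)
        except ValueError:
            continue
        if best is None or value > best[1]:
            best = (row["state"], value)
    return best


# Example call (file name and column names are placeholders):
# import csv
# with open("asm_2011_state_totals.csv", newline="") as f:
#     print(leading_state(csv.DictReader(f), "payroll"))
```

Skipping non-numeric cells matters here because the survey notes that data may not be available for all industries or geographies.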

MORE TO FOLLOW

Slides

Panel on Big Data Challenges and Opportunities

ChaitanyaBaru10272014Slide1.png

NSF’s Perspective and Role

ChaitanyaBaru10272014Slide2.png

NSF: Research to Infrastructure Continuum

ChaitanyaBaru10272014Slide3.png

Big Data in the CISE Context

ChaitanyaBaru10272014Slide4.png

2013: Core Techniques and Technologies for Advancing Big Data Science & Engineering

ChaitanyaBaru10272014Slide5.png

2014: Critical Techniques and Technologies for Advancing Big Data Science and Engineering

ChaitanyaBaru10272014Slide6.png

Data Infrastructure Building Blocks

ChaitanyaBaru10272014Slide7.png

NSF Research Traineeship (NRT)

ChaitanyaBaru10272014Slide8.png

Initiatives underway…need your inputs

ChaitanyaBaru10272014Slide9.png

CISE RFI: Big Data Regional Innovation Hubs

ChaitanyaBaru10272014Slide10.png

RFI National Big Data R&D Initiative

https://www.nitrd.gov/bigdata/rfi/02102014.aspx

ChaitanyaBaru10272014Slide12.png

Increase the Nation’s Data Science Capacity – An NSF Priority Goal

ChaitanyaBaru10272014Slide13.png

Thank You!

cbaru@nsf.gov

ChaitanyaBaru10272014Slide14.png

Accelerating the Big Data Innovation Ecosystem

Source: http://www.nsf.gov/cise/news/2014-bigdata-rfi.jsp

My Note: The Federal Big Data Working Group Meetup is a Big Data Regional Innovation Hub!

Aiming to make the most of the fast-growing volume of digital data, in March 2012, the Obama Administration announced the "Big Data Research and Development Initiative." By improving our ability to extract knowledge and insights from large and complex collections of digital data, the initiative promises to help solve some of the Nation’s most pressing challenges. Beginning in the second year of the National Big Data Initiative, the Administration encouraged multiple stakeholders including federal agencies, private industry, academia, state and local government, non-profits, and foundations, to develop and participate in Big Data innovation projects across the country.

On November 12, 2013, dozens of public and private organizations met at a White House-sponsored "Data to Knowledge to Action" event to announce an inspiring array of Big Data related collaborations. These collaborations were the product and culmination of over a year of activities catalyzed by teams at OSTP, NITRD and federal agencies to support the Big Data innovation ecosystem; such activities included workshops, one-on-one conversations with potential partners, Requests for Information, invited speaker lectures, and more.

To sustain the forward momentum of the Data2Action event, the National Science Foundation is exploring the establishment of a national network of "Big Data Regional Innovation Hubs." These Hubs will help to continue and scale up the kinds of activities and partnerships established at Data2Action as well as stimulate, track, and help sustain new regional and grassroots partnerships around Big Data. Potential roles for Hubs include, but are not limited to:

  • Accelerate the ideation and development of Big Data solutions to specific global and societal challenges by convening stakeholders across sectors to partner in results-driven programs and projects.
  • Act as a matchmaker between the various academic, industry, and community stakeholders to help drive successful pilot programs for emerging Big Data technology.
  • Coordinate across multiple regions of the country, based on shared interests and industry sector engagement to enable dialogue and share best practices.
  • Aim to increase the speed and volume of technology transfer between universities, public and private research centers and laboratories, large enterprises, and SMBs.
  • Facilitate engagement with opinion and thought leaders on the societal impact of Big Data technologies so as to maximize positive outcomes of adoption while reducing unwanted consequences.
  • Support the education and training of the entire Big Data workforce, from data scientists to managers to data end-users.

The NSF seeks input from stakeholders across academia, state and local government, industry, and non-profits across all parts of the Big Data innovation ecosystem on the formation of Big Data Regional Innovation Hubs. Please submit a response of no more than two pages to BIGDATA@nsf.gov outlining:

  1. The goals of interest for a Big Data Regional Hub, with metrics for evaluating the success or failure of the Hub to meet that goal;
  2. The multiple stakeholders that would participate in the Hub and their respective roles and responsibilities;
  3. Plans for initial and long-term financial and in-kind resources that the stakeholders would need to commit to this hub; and
  4. A principal point of contact.

This announcement is posted solely for information and planning purposes; it does not constitute a formal solicitation for grants, contracts, or cooperative agreements.

Please submit responses no later than Nov 1, 2014.

Spotfire Dashboard

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Research Notes

Definition: NAICS 31-33, Manufacturing

Source: http://www.census.gov/econ/manufacturing.html

The Manufacturing sector comprises establishments engaged in the mechanical, physical, or chemical transformation of materials, substances, or components into new products. The assembling of component parts of manufactured products is considered manufacturing, except in cases where the activity is appropriately classified in Sector 23, Construction.

Establishments in the Manufacturing sector are often described as plants, factories, or mills and characteristically use power-driven machines and materials-handling equipment. However, establishments that transform materials or substances into new products by hand or in the worker's home and those engaged in selling to the general public products made on the same premises from which they are sold, such as bakeries, candy stores, and custom tailors, may also be included in this sector. Manufacturing establishments may process materials or may contract with other establishments to process their materials for them. Both types of establishments are included in manufacturing.

Guide to Data Sources for Manufacturing from the U.S. Census Bureau

See Excel Spreadsheet

Manufacturers’ Shipments, Inventories, & Orders

Source: http://www.census.gov/manufacturing/m3/

Overview

The Manufacturers’ Shipments, Inventories, and Orders (M3) survey provides broad-based, monthly statistical data on economic conditions in the domestic manufacturing sector. The survey measures current industrial activity and provides an indication of future business trends.

2014-2015 Release Schedule

Excel  [30kb] See Excel Spreadsheet

Annual Survey of Manufactures: Geographic Area Statistics: Statistics for All Manufacturing by State: 2011 REFRESH

2011 Annual Survey of Manufactures

Source: http://factfinder2.census.gov/faces/...prodType=table

The table contains a total of 2,857 data rows.

See Excel Spreadsheet

Release Date: 12/17/2013

NOTE: For information on confidentiality protection, sampling error, nonsampling error, and definitions, see Survey Methodology. For information on the ASM industry groupings, see Comparability. Data in this table represent those available when this report was created. Data may not be available for all NAICS industries or geographies. Detail items may not add to total due to independent rounding.

Data from the 2011 Annual Survey of Manufactures - REFRESH supersede data originally published in the 2011 Annual Survey of Manufactures (ASM). A typical ASM publishes data for two years: the current year, and an updated version of the prior year. Because the Economic Census only publishes one year of data, the ASM REFRESH is released in years ending in 1 and 6 - those preceding the Economic Census - as a means of updating data for those reference periods.
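The release rule in the note above (REFRESH data for reference years ending in 1 and 6) can be written as a one-line check. This is only an illustration of the stated rule; the function name `is_asm_refresh_year` is hypothetical, and the sketch does not confirm that a REFRESH was actually published for any particular year other than 2011.

```python
def is_asm_refresh_year(year):
    """True for reference years ending in 1 or 6, per the rule quoted above."""
    return year % 10 in (1, 6)


# Reference years between 2001 and 2016 that match the rule.
print([y for y in range(2001, 2017) if is_asm_refresh_year(y)])
# prints [2001, 2006, 2011, 2016]
```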

Peter Morosoff

Here is more information on data in manufacturing.  I am looking for another good paper on this subject.  The papers I have read on data and manufacturing state that manufacturing has been impeded by inconsistent terminology across departments, disciplines, etc.  Thus, there appears to be lots of data needing data science, although it appears from the age of some of the papers that I have seen that manufacturers have been making progress in this area for years.

Supporting Technologies\Ontology\Documents \Ontology-118 Digital Mockup for Manufacturing.doc (Word)

June 18, 2013

Minor revisions August 13, 2014

MEMORANDUM

Subject:  EXPLOITATION OF THE SCIENCE OF ONTOLOGY IN MANUFACTURING

1. Purpose.  To record information on exploitation of the science of ontology in manufacturing provided in a 2009 online paper, “Digital Mock-up to Optimize the Assembly of a Ship Fuel System” (URL: http://www.hypersciences.org/JMSS/Iss.1-2010/JMSS-1-1-2010.pdf). (PDF)

2. Basic Point

a. Ontology is the consistent, perspective-neutral representation of reality across multiple specialties, organizations, and IT systems.  The paper on the digital mock-up describes in 7½ pages how several departments within AVIO (an Italian company that is a world leader in aerospace and marine propulsion) use a perspective-neutral model or representation of a component to (1) improve the collaboration among departments on the design and manufacture of that component and (2) check the engineering from various perspectives (e.g., safety, ease of assembly, and maintainability).  Much of the checking is accomplished with software that uses the model of the component and its parts in simulations of assembly and repair.

b. By explaining that AVIO was improving design, manufacturing, etc., of equipment using (1) a single perspective-neutral model or representation of an entity and (2) a simulation of its model assembly, the paper documents that ontology is being exploited in engineering systems to increase efficiency and organizational effectiveness.  The question, therefore, is not whether ontology is applicable to the Engineered Resilient Systems (ERS) problem but how to maximize the exploitation of ontology to achieve ERS goals.

3. Background

a. Although “Digital Mock-up to Optimize the Assembly of a Ship Fuel System” was submitted for publication in 2009, it was written that year or earlier based on a manufacturing operation that was implemented in 2009 or earlier.  The paper describes how one company was able “to achieve integration among single disciplines and concurrent engineering issues to avoid the case where a product (re)design is good from the perspective of assembly but difficult, time consuming or expensive to manufacture or test.”

b. Searching for the paper on the digital mock-up and drafting this document were prompted by DoD’s S&T ERS emphasis area.  ERS objectives include (1) making “much more informed decisions early in the acquisition process;” (2) making “the engineering design process much more efficient and effective;” and (3) “affect feedback between manufacturability and capability.”

4. Specific Passages from Paper on Digital Mock-ups

a. “In particular, digital mock-up (DMU) allows designers to investigate the assembly feasibility of a product and the constraints imposed by the manufacturing processes.”

b. “Ford Motor Company is today strongly involved in the adoption of virtual manufacturing tools and processes designed to catch possible manufacture concerns by simulating automotive performance.  During the prototypal phase of the 2009 vehicles, Ford Flex, Lincoln MKS and Ford F-150, the program teams were able to reduce more than 80 percent of the potential manufacturing concerns by running more than 10,000 advanced digital pre-assembly engineering checks.  In addition, the engineers were able to reduce design and production tooling issues by 50 percent.”

c. “The analysis of overlaps among components allows to identify mistakes in terms of geometric fit, positioning and orientation which can prevent a correct coupling of parts.”

d. “The whole process is carried on by a multi-disciplinary team of technicians and designers.”

e. “Before DPA (digital pre-assembly), company departments communicated only during official design reviews….With the introduction of DPA, each department involved in product development has access to a common set of product model designs and DPA data and it can thus verify the design from its own point of view….The number of design verifications increases and it’s not limited to official design reviews.”

f. “At AVIO, the introduction of the DPA process and the minimization of physical prototypes have reduced the overall time-to-market and product development costs.”

g. “Facilitating the handoff among different engineering disciplines is crucial for manufacturing firms to be competitive today.  DMU can be a useful approach in the challenge of building tighter interaction between design and manufacturing departments.”

h. “Three points are to be highlighted.  First, the methodology defined supports the explicit definition of knowledge used in the assembly process; second, it was conceived as a methodology to streamline knowledge sharing in a multi-disciplinary design environment; third, it helps disseminate information about the task and process knowledge to concerned users.”

5. Conclusion.  The paper on digital mock-ups explains how AVIO uses a consistent, perspective-neutral representation of the reality it intends to create (i.e., a component) as a basis for simulations that check proposed designs to realize benefits that are goals of the ERS Priority Area.  It follows from this that the ERS effort needs support in understanding ontology and through demonstrations with ICODES on how models and simulation can be used to achieve ERS goals.

Note:  The paper on digital mock-ups does not mention ontology.  This may seem a problem, but the reason becomes clear when one starts to read papers on ontology.  The basic idea of ontology is simple (i.e., a perspective-neutral representation of reality) but even the most general discussion on how to create and use an ontology quickly starts to involve concepts not found in discussions about almost any other subjects.  This means that a paper that discusses ontology almost always has about ten pages of discussion on the basics of ontology.  Thus, the author of a short paper on the benefits of ontology (as opposed to a short paper on the basics of ontology) is well advised to assume, as the author of the digital mock-up paper did, that the reader understands ontology and therefore talk instead about what can be achieved with a perspective-neutral representation.

The 2014 IEEE International Conference on Big Data

Source: November 13, 2014

The IEEE International Conference on Big Data (IEEE Big Data) is a top-tier conference sponsored by IEEE CS in Big Data Research and Application. The 2014 IEEE International Conference on Big Data (IEEE Big Data 2014) was held in Washington DC from Oct 27-30, 2014. It was a great success. Close to 600 registered participants attended the 4-day event.

The highlights of the program are:

  • 18 workshops and 2 special sessions on many important emerging research topics in big data
  • 4 excellent tutorials
  • 3 keynote speeches
  • 46 regular paper and 50 short paper presentations from the main conference
  • 4 sessions of research and application presentations in the industry and government program
  • A panel with funding agency program directors on Big Data Challenges and Opportunities

The tutorials, keynote speeches, and funding agency program director presentations (ppts) are now available for download from the conference website.

(1)    Keynote speech http://cci.drexel.edu/bigdata/bigdat...notespeech.htm My Note: See Keynote Speeches

  • Never-Ending Language Learning, Tom Mitchell - E. Fredkin University Professor, Machine Learning Department, Carnegie Mellon University
  • Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities, Amit Sheth, LexisNexis Ohio Eminent Scholar, Kno.e.sis - Wright State University
  • Addressing Human Bottlenecks in Big Data, Joseph M. Hellerstein, Chancellor's Professor of Computer Science, University of California, Berkeley and Trifacta

(2)    4 Tutorials  http://cci.drexel.edu/bigdata/bigdat...4/tutorial.htm My Note: See Tutorials

  • TUTORIAL 1: Big Data Stream Mining
    • Presenters: Gianmarco De Francisci Morales, Joao Gama, Albert Bifet, and Wei Fan
  • TUTORIAL 2: Big ML Software for Modern ML Algorithms
    • Presenters: Eric P. Xing and Qirong Ho
  • TUTORIAL 3: Large-scale Heterogeneous Learning in Big Data Analytics
    • Presenters: Jun Huan
  • TUTORIAL 4: Big Data Benchmarking
    • Presenters:  Chaitan Baru and Tilmann Rabl

(3)    Panel with funding agency program directors: Big Data Challenges and Opportunities http://cci.drexel.edu/bigdata/bigdata2014/panel.htm My Note: See Panel with Program Directors: Big Data Challenges and Opportunities

  • Chair: Prof. Xiaohua Tony Hu
  • Dr. Chaitanya Baru (NSF)
  • Dr. Yuan Liu (NIH)
  • Dr. David Kuehn (DoT)
  • Dr. Tsengdar Lee (NASA)
  • Dr. Sudarsan Rachuri (NIST)
  • Mr. Matti Vakkuri (DIGILE)

We believe you will like these excellent presentations. IEEE Big Data 2015 will return to Santa Clara, California, in November 2015; we hope to see you there.

Main Page

Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For reprint or republication permission, email to IEEE Copyrights Manager at pubs-permissions@ieee.org. All rights reserved. Copyright 2014 by IEEE.
IEEE Catalog Number: CFP14BGD-USB
ISBN: 978-1-4799-5665-4.

Main Conference

Cover Page

PDF - one page

CoverPage.png

Copyright Information & Catalog Numbers

PDF - one page

My Note: I am abstracting for a data science data publication with credit.

Copyright and Reprint Permission: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy beyond the limit of U.S. copyright law for private use of patrons those articles in this volume that carry a code at the bottom of the first page, provided the per-copy fee indicated in the code is paid through Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923. For reprint or republication permission, email to IEEE Copyrights Manager at pubs-permissions@ieee.org. All rights reserved. Copyright 2014 by IEEE.
IEEE Catalog Number: CFP14BGD-USB
ISBN: 978-1-4799-5665-4.

Preface

PDF - 2 pages

IEEE Big Data 2014 Welcome Message from the Organizers

We are delighted to welcome you to the Second IEEE International Conference on Big Data (IEEE BigData 2014). As an exciting initiative in the area of big data, this conference series is becoming a leading forum for disseminating the latest advances in big data research, development, and applications.

The IEEE Big Data conferences are sponsored by the IEEE Computer Society, and they attract high-quality original research papers on various aspects of big data. This year, we received 264 submissions from 36 countries. After a rigorous peer review process undertaken by the program committee members, 49 regular papers and 55 short papers were accepted, representing acceptance rates of 18.6% for regular papers and 20.8% for short papers, respectively. We are thrilled to congratulate the authors of the accepted papers and sincerely thank all 936 submitting authors for their interest in the conference.
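My Note: The quoted acceptance rates can be checked with a few lines of arithmetic. The sketch below assumes that both rates use the full 264 submissions as the denominator; the welcome message does not say so explicitly, but the assumption reproduces the stated percentages. The function and variable names are illustrative only.

```python
# Figures quoted in the welcome message.
SUBMISSIONS = 264
ACCEPTED_REGULAR = 49
ACCEPTED_SHORT = 55


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage, rounded to one decimal place."""
    return round(100.0 * accepted / submitted, 1)


# Assumption: each category is measured against all 264 submissions.
regular_rate = acceptance_rate(ACCEPTED_REGULAR, SUBMISSIONS)
short_rate = acceptance_rate(ACCEPTED_SHORT, SUBMISSIONS)
combined_rate = acceptance_rate(ACCEPTED_REGULAR + ACCEPTED_SHORT, SUBMISSIONS)

print(f"regular: {regular_rate}%  short: {short_rate}%  combined: {combined_rate}%")
# prints: regular: 18.6%  short: 20.8%  combined: 39.4%
```

Under that assumption, roughly two in five submissions were accepted in some form.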

The IEEE Big Data 2014 Industry and Government program consists of 4 full session presentations. All the presentations reflect perspectives from industry on how to address challenging big data issues relevant to industrial settings. In addition, the conference features 17 workshops covering many emerging research directions in Big Data.

We are honored to feature the following high-profile distinguished speakers for their insightful and inspiring keynotes:

  • Tom Mitchell, the E. Fredkin University Professor, Machine Learning Department, Carnegie Mellon University;
  • Amit Sheth, the LexisNexis Eminent Scholar and founder/executive director of the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) at Wright State University;
  • Joseph M. Hellerstein, a Chancellor's Professor of Computer Science at the University of California, Berkeley.

We are deeply grateful to all who have contributed to this amazing technical program in one way or another. Without close collaboration among authors, reviewers, and organizers, an appealing technical program like this would simply be impossible. In particular, we profoundly appreciate the great contribution of the 31 senior program committee members, 175 program committee members, and many external reviewers. They reviewed and discussed the submissions carefully, critically, and constructively. We thank all our corporate sponsors for their generous support. Last but not least, we thank all attendees who are joining us in Washington, D.C.

We invite you to explore and enjoy this interesting and thought-provoking program. Our goal was not only to give you the opportunity to share ideas with other researchers and practitioners from around the world, but also to foster new research and innovation in big data.

Conference Co-Chairs:
Dr. Charu Aggarwal, IBM T.J. Watson Research, USA
Prof. Nick Cercone, York University, Canada
Prof. Vasant Honavar, Penn State University, USA

Program Co-Chairs:
Prof. Jimmy Lin, University of Maryland, USA
Prof. Jian Pei, Simon Fraser University, Canada

Industry and Government Program Co-Chairs:
Mr. Wo Chang, National Institute of Standards and Technology, USA
Dr. Raghunath Nambiar, Cisco Systems Inc, USA

Workshop Co-Chairs:
Prof. Jun Huan, Univ. of Kansas, USA
Prof. Bamshad Mobasher, DePaul University, USA
Prof. Saumyadipta Pyne, CR Rao AIMSCS, Hyderabad, India

Big Data Steering Committee:
Prof. Xiaohua Tony Hu, Drexel University, USA

Organization Team Members

PDF - 2 pages

Source: http://cci.drexel.edu/bigdata/bigdat...ganization.htm

Big Data Steering Committee

Dr. Amr Awadallah, Cloudera, USA
Dr. Xueqi Cheng, Chinese Academy of Sciences, China
Prof. Moncef Gabbouj, Tampere University of Technology, Finland
Prof. Xiaohua Tony Hu (Chair), Drexel University, USA
Prof. T.Y. Lin, San Jose State University, USA
Prof. Jian Pei, Simon Fraser University, Canada
Prof. Beth Plale, Indiana University, USA
Prof. Vijay Raghavan, University of Louisiana at Lafayette, USA
Prof. Amit Sheth, Wright State University, USA
Prof. Matthew Smith, Leibniz Universität Hannover, Germany
Prof. Shusaku Tsumoto, Shimane University, Japan

Conference Co-Chairs

Dr. Charu Aggarwal, IBM T.J. Watson Research, USA
Prof. Nick Cercone, York University, Canada
Prof. Vasant Honavar, Penn State University, USA

Program Co-Chairs

Prof. Jimmy Lin, University of Maryland, USA
Prof. Jian Pei, Simon Fraser University, Canada

Industry and Government Program Co-Chairs

Mr. Wo Chang, National Institute of Standards and Technology, USA
Dr. Raghunath Nambiar, Cisco Systems Inc, USA

Workshop Co-Chairs

Prof. Jun Huan, Univ. of Kansas, USA
Prof. Bamshad Mobasher, DePaul University, USA
Prof. Saumyadipta Pyne, CR Rao AIMSCS, Hyderabad, India

Poster Co-Chairs

Dr. David G. Belanger, Stevens Institute of Technology, USA
Dr. Saurav Karmakar, Samsung R&D Institute, USA 
Prof. Guangrong Li, Hunan University, China 

Publicity Co-chairs

Prof. Yong Ge, Univ. of North Carolina at Charlotte, USA
Prof. Yihua Huang, Nanjing University, China
Prof. Roberto V. Zicari, Goethe University Frankfurt, Germany

Proceedings Co-Chairs

Dr. Eric Chen, NTT Innovation Institute Inc, USA

Local Arrangement Co-Chairs

Prof. Dechang Chen, Uniformed Services University of the Health Sciences, USA

Sponsorship Co-Chairs

Dr. Ashok Krishnamurthy, University of North Carolina at Chapel Hill, USA
Dr. N. Raju Gottumukkala, University of Louisiana at Lafayette, USA

Tutorial Co-Chairs

Prof. Jimmy Huang, York University, Canada
Prof. Shusaku Tsumoto, Shimane University, Japan

Registration Co-Chairs

Prof. Jianchao Han, California State University, USA
Prof. Yuan An, Drexel University, USA

Student Travel Award Co-Chairs

Prof. Claire Monteleoni, The George Washington Univ., USA

Program Committee List

PDF - 7 page table

Source: http://cci.drexel.edu/bigdata/bigdat.../pcmembers.htm

Senior Program Committee Members

Karl Aberer (Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland)
Divy Agrawal (University of California, Santa Barbara, USA)
Francesco Bonchi (Yahoo! Research, USA)
Lei Chen (Hong Kong University of Science and Technology, China)
Bruce Croft (University of Massachusetts Amherst, USA)
Gautam Das (University of Texas at Arlington, USA)
Shirshanka Das (LinkedIn, USA)
Juliana Freire (NYU Poly, USA)
Rainer Gemulla (Max-Planck-Institut für Informatik, Germany)
Lise Getoor (University of California, Santa Cruz, USA)
Dimitrios Gunopulos (University of Athens, Greece)
Jiawei Han (University of Illinois at Urbana-Champaign, USA)
Bingsheng He (Nanyang Technological University, Singapore)
Jimmy Huang (York University, Canada)
Panos Kalnis (KAUST, Saudi Arabia)
George Karypis (University of Minnesota, USA)
Irwin King (The Chinese University of Hong Kong, China)
Guoliang Li (Tsinghua University, China)
Ee-Peng Lim (Singapore Management University, Singapore)
Xuemin Lin (University of New South Wales, Australia)
Huan Liu (Arizona State University, USA)
Raymond Ng (University of British Columbia, Canada)
M. Tamer Ozsu (University of Waterloo, Canada)
Yannis Papakonstantinou (University of California, San Diego, USA)
Jignesh M. Patel (University of Wisconsin, USA)
Kyuseok Shim (Seoul National University, South Korea)
Vincent Tseng (National Cheng Kung University, Taiwan)
Liqiang Wang (University of Wyoming, USA)
Ji-Rong Wen (Renmin University, China)
Jianliang Xu (Hong Kong Baptist University, China)
Ben Zhao (University of California, Santa Barbara, USA)

Program Committee Members

Arvind Agarwal (Xerox Research Center, USA)
Ankit Agrawal (Northwestern University, USA)
Gagan Agrawal (Ohio State University, USA)
Nour Ali (University of Brighton, UK)
Aris Anagnostopoulos (Sapienza University of Rome, Italy)
Christoforos Anagnostopoulos (Imperial College London, UK)
Rafal Angryk (Montana State University, USA)
Gabriel Antoniu (INRIA, France)
Amy Apon (Clemson University, USA)
Gustavo Batista (University of Sao Paulo, Brazil)
Peter Baumann (Jacobs University, Germany)
Daniel Berrar (Tokyo Institute of Technology, Japan)
Martin Berzins (University of Utah, USA)
Abhinav Bhatele (Lawrence Livermore National Laboratory, USA)
Bishwaranjan Bhattacharjee (IBM Research, USA)
Luc Bougé (ENS Rennes, France)
Mario Bravetti (University of Bologna, Italy)
Hoang Bui (Rutgers University, USA)
Ali Butt (Virginia Tech, USA)
Suren Byna (Lawrence Berkeley National Laboratory, USA)
Ricardo J. G. B. Campello (University of Sao Paulo, Brazil)
Giuliano Casale (Imperial College London, UK)
Carlos Castillo (Qatar Computing Research Institute, Qatar)
Abhishek Chandra (University of Minnesota, USA)
Kevin Chang (University of Illinois at Urbana-Champaign, USA)
Keke Chen (Wright State University, USA)
Shu-Ching Chen (Florida International University, USA)
Yong Chen (Texas Tech University, USA)
Reynold Cheng (University of Hong Kong, China)
Kenneth Chiu (University at Buffalo (SUNY), USA)
Gregory Chockler (University of London, UK)
Frans Coenen (University of Liverpool, UK)
Ayse Coskun (Boston University, USA)
Noel De Palma (University Joseph Fourier, France)
Sven Dietrich (Stevens Institute of Technology, USA)
Ying Ding (Indiana University, USA)
Juzhen Dong (Maebashi Institute of Technology, Japan)
Zhicheng Dou (Microsoft Research Asia, China)
Saso Dzeroski (Jozef Stefan Institute, Slovenia)
Magdalini Eirinaki (San Jose State University, USA)
Trilce Estrada (University of New Mexico, USA)
Yi Fang (Santa Clara University, USA)
Dennis Fetterly (Microsoft Research, USA)
Jose Fortes (University of Florida, USA)
Ian Foster (University of Chicago, USA)
Bernd Freisleben (Philipps-Universität Marburg, Germany)
Johannes Furnkranz (TU Darmstadt, Germany)
Todd Gamblin (Lawrence Livermore National Laboratory, USA)
Jing Gao (University at Buffalo (SUNY), USA)
Aditya K. Ghose (University of Wollongong, Australia)
Harald Gjermundrod (University of Nicosia, Cyprus)
Aris Gkoulalas-Divanis (IBM Research, Ireland)
David Gleich (Purdue University, USA)
Bart Goethals (University of Antwerp, Belgium)
Kartik Gopalan (Binghamton University, USA)
Anastasios Gounaris (Aristotle University of Thessaloniki, Greece)
Tyrone Grandison (Proficiency Labs, USA)
Clemens Grelck (University of Amsterdam, Netherlands)
Amarnath Gupta (San Diego Supercomputing Center, USA)
Vijay K. Gurbani (Bell Laboratories, Alcatel-Lucent, USA)
Mohammad Al Hasan (Indiana University-Purdue University Indianapolis, USA)
Claudia Hauff (Delft University of Technology, Netherlands)
Daqing He (University of Pittsburgh, USA)
H. Howie Huang (George Washington University, USA)
Fabrice Huet (INRIA-I3S-CNRS, France)
Marty Humphrey (University of Virginia, USA)
Alexandru Iosup (Delft University of Technology, Netherlands)
Gareth Jones (Dublin City University, Ireland)
Gideon Juve (USC Information Sciences Institute, USA)
David Kaeli (Northeastern University, USA)
Vana Kalogeraki (Athens University of Economics and Business, Greece)
Jaap Kamps (University of Amsterdam, Netherlands)
Zoi Kaoudi (INRIA Saclay, France)
Yiping Ke (Institute of High Performance Computing, A*Star, Singapore)
Kristian Kersting (University of Bonn, Germany)
Harald Kornmayer (DHBW Mannheim, Germany)
Tevfik Kosar (University at Buffalo (SUNY), USA)
Yannis Kotidis (Athens University of Economics and Business, Greece)
Christopher Kunz (Filoo GmbH, Germany)
Alberto Laender (Federal University of Minas Gerais, Brazil)
Zhiling Lan (Illinois Institute of Technology, USA)
John Lange (University of Pittsburgh, USA)
Alexey Lastovetsky (University College Dublin, Ireland)
Wang-Chien Lee (Penn State University, USA)
Chengkai Li (University of Texas at Arlington, USA)
Cuiping Li (Renmin University, China)
Dong Li (Oak Ridge National Lab, USA)
Lipyeow Lim (University of Hawaii at Manoa, USA)
Zhiqiang Lin (University of Texas at Dallas, USA)
Yan Liu (University of Southern California, USA)
Paul Lu (University of Alberta, Canada)
Claudio Lucchese (Institute of the National Research Council of Italy, Italy)
Matteo Maffei (Saarland University, Germany)
Bradley Malin (Vanderbilt University, USA)
Tiziana Margaria (University of Potsdam, Germany)
J.P. Martin-Flatin (EPFL, Switzerland)
Edgar Meij (University of Amsterdam, Netherlands)
Ningfang Mi (Northeastern University, China)
Shinichi Morishita (University of Tokyo, Japan)
Alessandro Moschitti (University of Trento, Italy)
Sebastien Mosser (Université Nice-Sophia Antipolis, France)
Emmanuel Müller (Karlsruhe Institute of Technology, Germany)
Hidemoto Nakada (National Institute of Advanced Industrial Science and Technology, Japan)
Jian-Yun Nie (University of Montreal, Canada)
Dimitrios Nikolopoulos (Queen's University of Belfast, Ireland)
Alexandros Ntoulas (Zynga Inc., USA)
Silvia Olabarriaga (University of Amsterdam, Netherlands)
Beng Chin Ooi (National University of Singapore, Singapore)
David Padua (University of Illinois at Urbana-Champaign, USA)
Rajesh Parekh (Groupon, USA)
Dino Pedreschi (University of Pisa, Italy)
Peter Pietzuch (Imperial College London, UK)
Evaggelia Pitoura (University of Ioannina, Greece)
Beth Plale (Indiana University, USA)
B. Aditya Prakash (Virginia Tech, USA)
Baojun Qiu (eBay, USA)
Lavanya Ramakrishnan (Lawrence Berkeley National Laboratory, USA)
Maya Ramanath (IIT Delhi, India)
Jan Ramon (Katholieke Universiteit Leuven, Belgium)
Anand Ranganathan (IBM T.J. Watson Research Center, USA)
Chandan K. Reddy (Wayne State University, USA)
Philip Rhodes (University of Mississippi, USA)
Uwe Roehm (University of Sydney, Australia)
Paolo Romano (INESC, Portugal)
Lotfi Ben Romdhane (University of Sousse, Tunisia)
Indrajit Roy (HP Labs, USA)
Rizos Sakellariou (University of Manchester, UK)
Carlos Scheidegger (AT&T Research, USA)
Martin Schulz (Lawrence Livermore National Laboratory, USA)
Haiying Helen Shen (Clemson University, USA)
Suzanne Shontz (Mississippi State University, USA)
Yasin N. Silva (Arizona State University, USA)
Fabrizio Silvestri (Yahoo Labs, Spain)
Alkis Simitsis (HP Labs, USA)
Yogesh Simmhan (University of Southern California, USA)
Dan Simovici (University of Massachusetts, USA)
Shaoxu Song (Tsinghua University, China)
Domenico Talia (University of Calabria, Italy)
Wei Tan (IBM T.J. Watson Research Center, USA)
Evimaria Terzi (Boston University, USA)
Andrew Trotman (University of Otago, New Zealand)
Dimitrios Tsoumakos (Ionian University, Greece)
Mauricio Tsugawa (University of Florida, USA)
Ana Lucia Varbanescu (Delft University of Technology, The Netherlands)
Carlos Varela (Rensselaer Polytechnic Institute, USA)
Maksims Volkovs (University of Toronto, Canada)
Arno Wacker (University of Kassel, Germany)
Xiaojun Wan (Peking University, China)
Andy Ju An Wang (Southern Illinois University, USA)
XiaoFeng Wang (Indiana University, USA)
Ran Wolff (University of Haifa, Israel)
Raymond Chi-Wing Wong (HKUST, China)
Yongwei Wu (Tsinghua University, China)
Yinglong Xia (IBM T.J. Watson Research Center, USA)
Li Xiong (Emory University, USA)
Jun Xu (ICT, Chinese Academy of Sciences, China)
Ramin Yahyapour (University of Göttingen, Germany)
Feng Yan (College of William and Mary, USA)
Haiqin Yang (The Chinese University of Hong Kong, China)
Philip Yu (University of Illinois, USA)
Qi Yu (Rochester Institute of Technology, USA)
Xin Yuan (Florida State University, USA)
Aidong Zhang (University at Buffalo (SUNY), USA)
Kunpeng Zhang (University of Illinois at Chicago, USA)
Rui Zhang (IBM Research - Almaden, USA)
Zhenjie Zhang (Advanced Digital Sciences Center, Singapore)
Ming Zhao (Florida International University, USA)
Jiang Zheng (ABB Corporate Research, USA)
Vincent Zheng (Advanced Digital Sciences Center, Singapore)
Yu Zheng (Microsoft Research Asia, China)
Hill Zhu (Florida Atlantic University, USA)
Ziliang Zong (Texas State University, USA)

Industry & Government Program Committee Members

Chaitanya Baru (San Diego Supercomputer Center, USA)
Milind Bhandarkar (Greenplum, USA)
Carlos Castillo (Qatar Computing Research Institute, Qatar)
Nitesh Chawla (University of Notre Dame, USA)
Haifeng Chen (NEC Laboratories America, USA)
C. Jason Chiang (App ComSci, USA)
Alice Deane (Amazon Web Services, USA)
Julia Deng (Intelligent Automation, Inc., USA)
Debora Donato (StumbleUpon, USA)
Harry Foxwell (Oracle, USA)
Nancy Grady (SAIC, USA)
Caimei Lu (Maritz Research Inc., USA)
Akhil Manchanda (GE, USA)
Bob Marcus (ET-Strategies, USA)
Sanjay Mishra (Verizon Communications, USA)
Felix Njeh (U.S. Army Information Technology Agency, USA)
Quyen Nguyen (NARA, USA)
Clifton Phua (SAS Institute Pte Ltd, Singapore)
Meikel Poess (Oracle, USA)
Matthew Rattigan (Edgeflip, LLC, USA)
John Rogers (HP, USA)
Arnab Roy (Fujitsu, USA)
Ramendra Sahoo (PwC, USA)
Mohak Shah (GE Software Center of Excellence, USA)
Chii-Ren Tsai (Citigroup, USA)
Mark Underwood (Krypton Brothers LLC, USA)
Chonggang Wang (InterDigital, USA)
Carl Willis-Ford (SRA International, Inc., USA)

Keynote Speeches

Source: http://cci.drexel.edu/bigdata/bigdat...notespeech.htm

Last update: 13 Nov. 2014

Never-Ending Language Learning

(Page 1) 1.pdf

Tom Mitchell - E. Fredkin University Professor, Machine Learning Department, Carnegie Mellon University

Abstract 

We will never really understand learning until we can build machines that learn many different things, over years, and become better learners over time. We describe our research to build a Never-Ending Language Learner (NELL) that runs 24 hours per day, forever, learning to read the web. Each day NELL extracts (reads) more facts from the web, into its growing knowledge base of beliefs. Each day NELL also learns to read better than the day before. NELL has been running 24 hours/day for over four years now. The result so far is a collection of 70 million interconnected beliefs (e.g., servedWith(coffee, applePie)) that NELL is considering at different levels of confidence, along with millions of learned phrasings, morphological features, and web page structures that NELL uses to extract beliefs from the web. NELL is also learning to reason over its extracted knowledge, and to automatically extend its ontology. Track NELL's progress at http://rtw.ml.cmu.edu, or follow it on Twitter at @CMUNELL.
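My Note: To make the idea of beliefs held "at different levels of confidence" concrete, here is a minimal sketch of how such relation triples might be stored and filtered. It is not NELL's actual data model: the `Belief` class, the `confident` helper, the second example belief, and every confidence value are invented for illustration. Only the servedWith(coffee, applePie) triple comes from the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Belief:
    """A relation triple plus a confidence score (hypothetical structure)."""
    relation: str
    subject: str
    obj: str
    confidence: float  # invented scale: 0.0 (no support) to 1.0 (certain)


def confident(beliefs: List[Belief], threshold: float) -> List[Belief]:
    """Keep only beliefs whose confidence meets the threshold."""
    return [b for b in beliefs if b.confidence >= threshold]


# Confidence values below are made up for the example.
knowledge_base = [
    Belief("servedWith", "coffee", "applePie", 0.93),
    Belief("servedWith", "tea", "biscuits", 0.55),
]

for belief in confident(knowledge_base, 0.9):
    print(f"{belief.relation}({belief.subject}, {belief.obj})")
# prints: servedWith(coffee, applePie)
```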

Biography 

Tom M. Mitchell founded and chairs the Machine Learning Department at Carnegie Mellon University, where he is the E. Fredkin University Professor. His research uses machine learning to develop computers that are learning to read the web, and uses brain imaging to study how the human brain understands what it reads. Mitchell is a member of the U.S. National Academy of Engineering, a Fellow of the American Association for the Advancement of Science (AAAS), and a Fellow and Past President of the Association for the Advancement of Artificial Intelligence (AAAI). He believes the field of machine learning will be the fastest growing branch of computer science during the 21st century.

The slides for this keynote speech can be downloaded here (PDF).

Never Ending Language Learning

TomMitchell10292014Slide1.png

Rationale

TomMitchell10292014Slide2.png

NELL: Never-Ending Language Learner

TomMitchell10292014Slide3.png

NELL Today

TomMitchell10292014Slide4.png

Summary

TomMitchell10292014Slide5.png

Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities

(Page 2) 2.pdf

Amit Sheth, LexisNexis Ohio Eminent Scholar, Kno.e.sis - Wright State University

Abstract 

Big Data has captured a lot of interest in industry, with the emphasis on the challenges of the four Vs of Big Data: Volume, Variety, Velocity, and Veracity, and their applications to drive value for businesses. Recently, there has been rapid growth in situations where a big data challenge relates to making individually relevant decisions. A key example is personalized digital health, which relates to making better decisions about our health, fitness, and well-being. Consider, for instance, understanding the reasons for and avoiding an asthma attack based on Big Data in the form of personal health signals (e.g., physiological data measured by devices/sensors or Internet of Things around humans, on the humans, and inside/within the humans), public health signals (e.g., information coming from the healthcare system such as hospital admissions), and population health signals (such as Tweets by people related to asthma occurrences and allergens, Web services providing pollen and smog information). However, no individual has the ability to process all these data without the help of appropriate technology, and each human has a different set of relevant data!

In this talk, I will describe Smart Data that is realized by extracting value from Big Data, to benefit not just large companies but each individual. If my child is an asthma patient, for all the data relevant to my child with the four V-challenges, what I care about is simply, “How is her current health, and what is the risk of having an asthma attack in her current situation (now and today), especially if that risk has changed?” As I will show, Smart Data that gives such personalized and actionable information will need to utilize metadata, use domain-specific knowledge, employ semantics and intelligent processing, and go beyond traditional reliance on ML and NLP. I will motivate the need for a synergistic combination of techniques similar to the close interworking of the top brain and the bottom brain in the cognitive models.

For harnessing Volume, I will discuss the concept of Semantic Perception, that is, how to convert massive amounts of data into information, meaning, and insight useful for human decision-making. For dealing with Variety, I will discuss experience in using agreement represented in the form of ontologies, domain models, or vocabularies, to support semantic interoperability and integration. For Velocity, I will discuss somewhat more recent work on Continuous Semantics, which seeks to use dynamically created models of new objects, concepts, and relationships, using them to better understand new cues in the data that capture rapidly evolving events and situations.

Smart Data applications in development at Kno.e.sis come from the domains of personalized health, energy, disaster response, and smart city. I will present examples from a couple of these.

Biography 

Amit P. Sheth (http://knoesis.org/amit) is an educator, researcher, and entrepreneur. He is the LexisNexis Eminent Scholar and founder/executive director of the Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis) at Wright State University. Kno.e.sis conducts research in social/sensor/semantic data and Web 3.0 with real-world applications and multidisciplinary solutions for translational research, healthcare and life sciences, cognitive science, material sciences, and others. Kno.e.sis' activities have resulted in Wright State University being recognized as a top organization in the world in World Wide Web research impact. Prof. Sheth is one of the top authors in Computer Science, World Wide Web, and databases (cf: Microsoft Academic Search; Google H-index). His research has led to several commercial products, many real-world applications, and two earlier companies, with two more in early stages of development. One of these was Taalee/Voquette/Semagix, which was likely the first company (founded in 1999) that developed Semantic Web enabled search and analysis, and semantic application development platforms.

The slides for this keynote speech can be downloaded here.

Smart Data - How you and I will exploit Big Data for personalized digital health and many other activities

AmitSheth10292014Slide1.png

Thanks: My Team

AmitSheth10292014Slide2.png

I Will Use Applications in 3 Domains to Demonstrate

AmitSheth10292014Slide3.png

My 2004-2005 Formulation of SMART DATA-Semagix

AmitSheth10292014Slide4.png

In Process, Engaging Both Top and Bottom Brain

http://brainblogger.com/2013/12/19/t...gnitive-modes/

AmitSheth10292014Slide5.png

W3C Semantic Sensor Network Ontology

AmitSheth10292014Slide6.png

Personalized Digital Health Research and Applications at Kno.e.sis

https://www.youtube.com/watch?v=mATR...ature=youtu.be

AmitSheth10292014Slide7.png

Semantic Annotation Using Background Knowledge

AmitSheth10292014Slide8.png

Matching Requests With Offers

AmitSheth10292014Slide9.png

Wrapping Up: Take Away 1

AmitSheth10292014Slide11.png

Wrapping Up: Take Away 2

AmitSheth10292014Slide12.png

Special Thanks

AmitSheth10292014Slide13.png

Addressing Human Bottlenecks in Big Data

(Page 4) 3.pdf

Joseph M. Hellerstein, Chancellor's Professor of Computer Science, University of California, Berkeley and Trifacta

Abstract 

We live in an era when compute is cheap, data is plentiful, and system software is being given away for free.  Today, the critical bottlenecks in data-driven organizations are human bottlenecks, measured in the costs of software developers, IT professionals, and data analysts.  How can computer science remain relevant in this context?  The Big Data ecosystem presents two archetypal settings for answering this question: NoSQL distributed databases, and analytics on Hadoop.

In the case of NoSQL, developers are being asked to build parallel programs for global-scale systems that cannot even guarantee the consistency of a single register of memory.  How can this possibly be made to work?  I’ll talk about what we have seen in the wild in user deployments, and what we’ve learned from developers and their design patterns.  Then I’ll present theoretical results—the CALM Theorem—that shed light on what’s possible here, and what requires more expensive tools for coordination on top of the typical NoSQL offerings.  Finally, I will highlight some new approaches to writing and testing software—exemplified by the Bloom language—that can help developers of distributed software avoid expensive coordination when possible, and have the coordination logic synthesized for them automatically when necessary.
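My Note: The abstract gives no implementation details, so the following is only an informal sketch of the monotonicity intuition usually associated with the CALM Theorem (CALM stands for Consistency As Logical Monotonicity). A state that only grows, merged by set union, reaches the same result whatever order updates arrive in, so replicas need no coordination. An overwriting register does not have that property. All function names here are hypothetical, and this is not Bloom code.

```python
from itertools import permutations
from typing import FrozenSet, Iterable, Optional


def merge(state: FrozenSet[str], update: FrozenSet[str]) -> FrozenSet[str]:
    """Monotonic merge: set union only ever adds elements."""
    return state | update


def apply_unions(updates: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """Fold a sequence of updates into a grow-only set."""
    state: FrozenSet[str] = frozenset()
    for update in updates:
        state = merge(state, update)
    return state


def apply_overwrites(writes: Iterable[str]) -> Optional[str]:
    """Non-monotonic register: each write discards the previous value."""
    value: Optional[str] = None
    for write in writes:
        value = write
    return value


updates = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "c"})]

# Every delivery order of the set updates yields the same final state.
set_outcomes = {apply_unions(order) for order in permutations(updates)}

# The register's final value depends on which write happens to arrive last.
register_outcomes = {apply_overwrites(order) for order in permutations(["x", "y", "z"])}

print(len(set_outcomes), len(register_outcomes))  # prints: 1 3
```

The single outcome for the union case is the kind of result that can be reached without coordination; the three outcomes for the register show where ordering, and hence coordination, starts to matter.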

In the Hadoop context, the key bottlenecks lie with data analysts and data engineers, who are routinely asked to work with data that cannot possibly be loaded into tools for statistical analytics or visualization.  Instead, they have to engage in time-consuming data “wrangling”—to try and figure out what’s in their data, whip it into a rectangular shape for analysis, and figure out how to clean and integrate it for use.  I’ll discuss what we heard talking with data analysts in both academic interviews and commercial engagements.  Then I’ll talk about how techniques from human-computer interaction, machine learning, and database systems can be brought together to address this human bottleneck, as exemplified by our work on various systems including the Data Wrangler project and Trifacta's platform for data transformation.

Biography

Joseph M. Hellerstein is a Chancellor's Professor of Computer Science at the University of California, Berkeley, whose research focuses on data-centric systems and the way they drive computing. A Fellow of the ACM, his work has been recognized via awards including an Alfred P. Sloan Research Fellowship, MIT Technology Review's TR10 and TR100 lists, Fortune Magazine's "Smartest in Tech" list, and three ACM-SIGMOD "Test of Time" awards. In 2012, Joe co-founded Trifacta Inc (http://www.trifacta.com/company/people/), where he currently serves as Chief Strategy Officer.

The slides for this keynote speech can be downloaded here (PDF).

Addressing Human Bottlenecks in Big Data

JosephHellerstein10292014Slide1.png

Outline

JosephHellerstein10292014Slide2.png

I Can Give You Power

JosephHellerstein10292014Slide3.png

Data Distributed Programming

JosephHellerstein10292014Slide4.png

Where Does Time Go?

JosephHellerstein10292014Slide5.png

Wrangler 2011: Add Intelligence

JosephHellerstein10292014Slide6.png

Technical Philosophy

JosephHellerstein10292014Slide7.png

Takeaways

JosephHellerstein10292014Slide8.png

Sessions

My Note: These could be further structured by paper and have the PDFs attached and excerpts inserted.

1. Big Data Foundations

Regular Papers
BASIC: an Alternative to BASE for Large-Scale Data Management System

(Page 5) BigD306.pdf
Lengdong Wu, Li-Yan Yuan, and Jia-Huai You

BayesWipe: A Multimodal System for Data Cleaning and Consistent Query Answering on Structured BigData

(Page 15) 
Sushovan De, Yuheng Hu, Yi Chen, and Subbarao Kambhampati

Short Papers
Scaling up M-estimation via sampling designs: the Horvitz-Thompson stochastic gradient descent

(Page 25) BigD293.pdf
Stéphan Clémençon, Bertail Patrice, and Emilie Chautru

Metadata Capital: Simulating the Predictive Value of Self-Generated Health Information (SGHI)

(Page 31) BigD327.pdf
Jane Greenberg, Adrian Ogletree, Angela Murillo, Thomas Caruso, and Herbie Huang

Representative Subsets For Big Data Learning using k-NN graphs

(Page 37) BigD343.pdf
Raghvendra Mall, Vilen Jumutc, Rocco Langone, and Johan Suykens

Towards Building and Evaluating a Personalized Location-Based Recommender System

(Page 43) BigD356.pdf
Rubing Duan

On the Performance of MapReduce: A Stochastic Approach

(Page 49) BigD392.pdf
Sarker Ahmed and Dmitri Loguinov

PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems

(Page 55) BigD401.pdf
Khalifeh Aljadda, Mohammed Korayem, Camilo Ortiz, Trey Grainger, John A. Miller, and William York

2. Big Data Infrastructure

Regular Papers
FusionFS: Toward Supporting Data-Intensive Scientific Applications on Extreme-Scale High-Performance Computing Systems

(Page 61) BigD216.pdf
Dongfang Zhao, Zhao Zhang, Xiaobing Zhou, Tonglin Li, Ke Wang, Dries Kimpe, Philip Carns, Rob Ross, and Ioan Raicu

BurstMem: A High-Performance Burst Buffer System for Scientific Applications

(Page 71) BigD271.pdf
Teng Wang, Sarp Oral, Yandong Wang, Brad Settlemyer, Scott Atchley, and Weikuan Yu

Partial Rollback-based Scheduling on In-memory Transactional Data Grids

(Page 80) BigD318.pdf
Junwhan Kim

Detecting and Identifying System Changes in the Cloud via Discovery by Example

(Page 90) BigD423.pdf
Hao Chen, Sastry Duri, Vasanth Bala, Nilton Bila, Canturk Isci, and Ayse Coskun

PigOut: Making Multiple Hadoop Clusters Work Together

(Page 100) BigD426.pdf
Kyungho Jeon, Sharath Chandrashekhara, Feng Shen, Shikhar Mehra, Oliver Kennedy, and Steven Ko

Parallel Breadth First Search on GPU Clusters

(Page 110) BigD434.pdf
Zhisong Fu, Harish Dasari, Bradley Bebee, Martin Berzins, and Bryan Thompson

Optimizing Load Balancing and Data-Locality with Data-aware Scheduling

(Page 119) BigD471.pdf
Ke Wang, Xiaobing Zhou, Tonglin Li, Dongfang Zhao, Michael Lang, and Ioan Raicu

Short Papers
Online Temporal-Spatial Analysis for Detection of Critical Events in Cyber-Physical Systems

(Page 129) BigD227.pdf
Zhang Fu, Magnus Almgren, Olaf Landsiedel, and Marina Papatriantafilou

A Cross-job Framework for MapReduce Scheduling

(Page 135) BigD230.pdf
Xuejie Xiao, Jian Tang, Zhenhua Chen, Jielong Xu, and Chonggang Wang

Scheduling MapReduce Tasks on Virtual MapReduce Clusters from a Tenant’s Perspective

(Page 141) BigD242.pdf
Jia-Chun Lin, Ming-Chang Lee, and Ramin Yahyapour

FlexDAS: A Flexible Direct Attached Storage for I/O Intensive Applications

(Page 147) BigD264.pdf
Takatsugu Ono, Yotaro Konishi, Teruo Tanimoto, Noboru Iwamatsu, Takashi Miyoshi, and Jun Tanaka

A Two-Sided Market Mechanism for Trading Big Data Computing Commodities

(Page 153) BigD270.pdf
Lena Mashayekhy, Mahyar Movahed Nejad, and Daniel Grosu

MMap: Fast Billion-Scale Graph Computation on a PC via Memory Mapping

(Page 159) BigD284.pdf
Zhiyuan Lin, Minsuk Kahng, Kaeser Md. Sabrin, Duen Horng Chau, Ho Lee, and U Kang

Large-Scale Network Traffic Monitoring with DBStream, a System for Rolling Big Data Analysis

(Page 165) BigD288.pdf
Arian Bär, Alessandro Finamore, Pedro Casas, Lukasz Golab, and Marco Mellia

Synthetic Data Generation for the Internet of Things

(Page 171) BigD312.pdf
Jason Anderson, Ken Kennedy, Linh Ngo, Andre Luckow, and Amy Apon

Evaluating the Performance and Scalability of the Ceph Distributed Storage System

(Page 177) BigD315.pdf
Diana Gudu, Marcus Hardt, and Achim Streit

Incremental Window Aggregates over Array Database

(Page 183) BigD347.pdf
Jiang Li, Hideyuki Kawashima, and Osamu Tatebe

BigCache for Big-data Systems

(Page 189) BigD362.pdf
Michel Roger, Yiqi Xu, and Ming Zhao

Automated Workload-aware Elasticity of NoSQL Clusters in the Cloud

(Page 195) BigD364.pdf
Evie Kassela, Christina Boumpouka, Ioannis Konstantinou, and Nectarios Koziris

Distributed Class Dependent Feature Analysis - A Big Data Approach

(Page 201) BigD410.pdf
Khoa Luu, Chenchen Zhu, and Marios Savvides

VENU: Orchestrating SSDs in Hadoop Storage

(Page 207) BigD428.pdf
Krish K.R., M. Safdar Iqbal, and Ali Butt

In-Memory I/O and Replication for HDFS with Memcached: Early Experiences

(Page 213) BigD438.pdf
Nusrat Islam, Xiaoyi Lu, Md. Rahman, Raghunath Rajachandrasekar, and Dhabaleswar Panda

Enabling Composite Applications through an Asynchronous Shared Memory Interface

(Page 219) BigD475.pdf
Douglas Otstott, Noah Evans, Latchesar Ionkov, Ming Zhao, and Michael Lang

k-balanced sorting and skew join in MPI and MapReduce

(Page 225) BigD476.pdf
Silu Huang and Ada Fu

3. Big Data Management

Regular Papers
Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems

(Page 231) BigD215.pdf
Dongfang Zhao, Jian Yin, Kan Qiao, and Ioan Raicu

Examination of Data, Rule Generation and Detection of Phishing URLs using Online Logistic Regression

(Page 241) BigD283.pdf
Mohammed Nazim Feroz and Susan Mengel

Main Memory Evaluation of Recursive Queries on Multicore Machines

(Page 251) BigD337.pdf
Mohan Yang and Carlo Zaniolo

Predicting Glaucoma Progression using Multi-task Learning with Heterogeneous Features

(Page 261) BigD402.pdf
Shigeru Maya, Kai Morino, and Kenji Yamanishi

Provenance-Based Object Storage Prediction Scheme for Scientific Big Data Applications

(Page 271) BigD407.pdf
Dong Dai, Yong Chen, Dries Kimpe, and Rob Ross

Synergistic Partitioning in Multiple Large Scale Social Networks

(Page 281) BigD436.pdf
Songchang Jin, Jiawei Zhang, Philip S. Yu, Shuqiang Yang, and Aiping Li

TRISTAN: Real-Time Analytics on Massive Time Series Using Sparse Dictionary Compression

(Page 291) BigD445.pdf
Alice Marascu, Pascal Pompey, Eric Bouillet, Michael Wurst, Olivier Verscheure, Martin Grund, and Philippe Cudre-Mauroux

Performance Modeling in CUDA Streams - A Means for High-Throughput Data Processing

(Page 301) BigD451.pdf
Hao Li, Di Yu, Anand Kumar, and Yicheng Tu

Short Papers
Minimizing Data Movement through Query Transformation

(Page 311) BigD311.pdf
Patrick Leyshock, David Maier, and Kristin Tufte

Multilevel Partitioning of Large Unstructured Grids

(Page 317) BigD384.pdf
Oyindamola Akande and Philip Rhodes

Low Complexity Sensing for Big Spatio-Temporal Data

(Page 323) BigD440.pdf
Dongeun Lee and Jaesik Choi

In-advance Data Analytics for Reducing Time to Discovery

(Page 329) BigD469.pdf
Jialin Liu, Yin Lu, and Yong Chen

4. Big Data Search and Mining

Regular Papers
Learning to Estimate Pairwise Distances in Large Graphs

(Page 335) BigD210.pdf
Maria Christoforaki and Torsten Suel

Distributed Adaptive Model Rules for Mining Big Data Streams

(Page 345) BigD234.pdf
Anh Thu Vu, Gianmarco De Francisci Morales, Joao Gama, and Albert Bifet

Sparse computation for large-scale data mining

(Page 354) BigD253.pdf
Dorit S. Hochbaum and Philipp Baumann

Topic Similarity Networks: Visual Analytics for Large Document Sets

(Page 364) BigD258.pdf
Arun Maiya and Robert Rolfe

Efficient Breadth-First Search on a Heterogeneous Processor

(Page 373) BigD301.pdf
Mayank Daga, Mark Nutter, and Mitesh Meswani

Web-based Visual Analytics for Extreme Scale Climate Science

(Page 383) BigD303.pdf
Chad Steed, Katherine Evans, John Harney, Brian Jewell, Galen Shipman, Brian Smith, Peter Thornton, and Dean Williams

Geotagging One Hundred Million Twitter Accounts with Total Variation Minimization

(Page 393) BigD304.pdf
Ryan Compton, David Jurgens, and David Allen

Meeting Predictable Buffer Limits in the Parallel Execution of Event Processing Operators

(Page 402) BigD313.pdf
Ruben Mayer, Boris Koldehofe, and Kurt Rothermel

Metadata Extraction and Correction for Large-Scale Traffic Surveillance Videos

(Page 412) BigD316.pdf
Xiaomeng Zhao, Huadong Ma, Haitao Zhang, Yi Tang, and Guangping Fu

Facilitating Twitter Data Analytics: Platform, Language and Functionality

(Page 421) BigD336.pdf
Ke Tao, Claudia Hauff, Geert-Jan Houben, Fabian Abel, and Guido Wachsmuth

Visual Fusion of Mega-City Big Data: An Application to Traffic and Tweets Data Analysis of Metro Passengers

(Page 431) BigD338.pdf
Masahiko Itoh, Daisaku Yokoyama, Masashi Toyoda, Yoshimitsu Tomita, Satoshi Kawamura, and Masaru Kitsuregawa

GRAPHiQL: A Graph Intuitive Query Language for Relational Databases

(Page 441) BigD357.pdf
Alekh Jindal and Samuel Madden

Evaluating Density-based Motion for Big Data Visual Analytics

(Page 451) BigD379.pdf
Ronak Etemadpour, Paul Murray, and Angus Forbes

Regression Trees for Streaming Data with Local Performance Guarantees

(Page 461) BigD382.pdf
Ulf Johansson, Cecilia Sönströd, Henrik Linusson, and Henrik Boström

Distributed Algorithms for k-truss Decomposition

(Page 471) BigD391.pdf
Pei-Ling Chen, Chung-Kuang Chou, and Ming-Syan Chen

PULP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

(Page 481) BigD395.pdf
George Slota, Kamesh Madduri, and Sivasankaran Rajamanickam

Effective Caching Techniques for Accelerating Pattern Matching Queries

(Page 491) BigD398.pdf
Arash Fard, Satya Manda, Lakshmish Ramaswamy, and John A. Miller

Clique Guided Community Detection

(Page 500) BigD419.pdf
Diana Palsetia, Mostofa Patwary, William Hendrix, Ankit Agrawal, and Alok Choudhary

Large-scale Distributed Sorting for GPU-based Heterogeneous Supercomputers

(Page 510) BigD444.pdf
Hideyuki Shamoto, Koichi Shirahata, Aleksandr Drozd, Hitoshi Sato, and Satoshi Matsuoka

Large-scale Logistic Regression and Linear Support Vector Machines Using Spark

(Page 519) BigD454.pdf
Chieh-Yen Lin, Cheng-Hao Tsai, Ching-Pei Lee, and Chih-Jen Lin

NVM-based Hybrid BFS with Memory Efficient Data Structure

(Page 529) BigD455.pdf
Keita Iwabuchi, Hitoshi Sato, Yuichiro Yasui, Katsuki Fujisawa, and Satoshi Matsuoka

Identification of SNP Interactions Using Data-Parallel Primitives on GPUs

(Page 539) BigD460.pdf
Can Altinigneli, Bettina Konte, Dan Rujescu, Christian Boehm, and Claudia Plant

Short Papers
Random Walks on Adjacency Graphs for Mining Lexical Relations from Big Text Data

(Page 549) BigD225.pdf
Shan Jiang and Chengxiang Zhai

Entity Resolution Using Inferred Relationships and Behavior

(Page 555) BigD238.pdf
Jonathan Mugan, Ranga Chari, Laura Hitt, Eric McDermid, Marsha Sowell, Yuan Qu, and Thayne Coffman

Rainbow: A Distributed and Hierarchical RDF Triple Store with Dynamic Scalability

(Page 561) BigD247.pdf
Rong Gu, Yihua Huang, and Wei Hu

In-Situ Visualization and Computational Steering for Large-Scale Simulation of Turbulent Flows in Complex Geometries

(Page 567) BigD252.pdf
Hong Yi, Michel Rasquin, Jun Fang, and Igor Bolotnov

Building k-nn graphs from large text data

(Page 573) BigD287.pdf
Thibault Debatty, Pietro Michiardi, Olivier Thonnard, and Wim Mees

Learning to Predict Subject-Line Opens for Large-Scale Email Marketing

(Page 579) BigD324.pdf
Raju Balakrishnan and Rajesh Parekh

MAGE: Matching Approximate Patterns in Richly-Attributed Graphs

(Page 585) BigD333.pdf
Robert Pienta, Acar Tamersoy, Hanghang Tong, and Duen Horng Chau

Bootstrapping K-means for Big data analysis

(Page 591) BigD339.pdf
Jungkyu Han and Min Luo

Distributed Adaptive Importance Sampling on Graphical Models Using MapReduce

(Page 597) BigD361.pdf
Ahsanul Haque, Swarup Chandra, Latifur Khan, and Charu Aggarwal

Knowledge-based Clustering of Ship Trajectories Using Density-based Approach

(Page 603) BigD376.pdf
Bo Liu, Erico N. de Souza, Stan Matwin, and Marcin Sydow

Immersive and collaborative data visualization using virtual reality platforms

(Page 609) BigD387.pdf
Ciro Donalek, S.G. Djorgovski, Scott Davidoff, Alex Cioc, Anwell Wang, Giuseppe Longo, Jeffrey S. Norris, Jerry Zhang, Elizabeth Lawler, Stacy Yeh, Ashish Mahabal, Matthew Graham, and Andrew Drake

The Adaptive Projection Forest: Using Adjustable Exclusion and Parallelism in Metric Space Indexes

(Page 615) BigD431.pdf
Lee Thompson, Weijia Xu, and Daniel Miranker

Scaling Up Prioritized Grammar Enumeration for Scientific Discovery in the Cloud

(Page 621) BigD448.pdf
Tony Worm and Kenneth Chiu

5. Big Data Security & Privacy

Regular Papers
MR-TRIAGE: Scalable Multi-Criteria Clustering for Big Data Security Intelligence Applications

(Page 627) BigD294.pdf
Yun Shen and Olivier Thonnard

Increasing the Veracity of Event Detection on Social Media Networks Through User Trust Modeling

(Page 636) BigD441.pdf
Todd Bodnar, Conrad Tucker, Kenneth Hopkinson, and Sven Bilén

Short Papers
Empowering users of social networks to assess their privacy risks

(Page 644) BigD331.pdf
Vladimir Estivill-Castro, Md Zahidul Islam, and Peter Hough

A Unified Approach to Network Anomaly Detection

(Page 650) BigD346.pdf
Tara Babaie, Sanjay Chawla, Sebastien Ardon, and Yue Yu

6. Big Data Applications

Regular Papers
E-Sketch: Gathering Large-scale Energy Consumption Data Based on Consumption Patterns

(Page 656) BigD244.pdf
Zhichuan Huang, Hongyao Luo, David Skoda, Ting Zhu, and Yu Gu

Hierarchical Management of Large-Scale Malware Data

(Page 666) BigD260.pdf
Lee Kellogg, Brian Ruttenberg, Alison O'Connor, Michael Howard, and Avi Pfeffer

Random Projection Based Clustering for Population Genomics

(Page 675) BigD277.pdf
Sotiris Tasoulis, Lu Cheng, Niko Välimäki, Nicholas Croucher, Simon Harris, William Hanage, Teemu Roos, and Jukka Corander

Structure Recognition from High Resolution Images of Ceramic Composites

(Page 683) BigD360.pdf
Daniela Ushizima, Talita Perciano, Harinarayan Krishnan, Burlen Loring, Hrishikesh Bale, Dilworth Parkinson, and James Sethian

Combining Hadoop and GPU to Preprocess Large Affymetrix Microarray Data

(Page 692) BigD380.pdf
Sufeng Niu, Guangyu Yang, Nilim Sarma, Pengfei Xuan, Melissa Smith, Pradip Srimani, and Feng Luo

Content-Based Access Control: Use Data Content to Assist Access Control for Large-Scale Content-Centric Databases

(Page 701) BigD383.pdf
Wenrong Zeng, Yuhao Yang, and Bo Luo

Locating Visual Storm Signatures from Satellite Images

(Page 711) BigD421.pdf
Yu Zhang, Stephen Wistar, Jose A. Piedra-Fernández, Jia Li, Michael Steinberg, and James Z. Wang

Accurate and Efficient Selection of the Best Consumption Prediction Method in Smart Grids

(Page 721) BigD432.pdf
Marc Frincu, Charalampos Chelmis, Muhammad Noor, and Viktor Prasanna

Short Papers
Visualizations for Sense-Making in Financial Market Regulation

(Page 730) BigD204.pdf
Mark Paddrik, Richard Haynes, Andrew Todd, William Scherer, and Peter Beling

Big Automotive Data - Leveraging large volumes of data for knowledge-driven product development

(Page 736) BigD232.pdf
Mathias Johanson, Stanislav Belenki, Jonas Jalminger, Magnus Fant, and Mats Gjertz

On the Impact of Socio-economic Factors on Power Load Forecasting

(Page 742) BigD233.pdf
Yufei Han, Xiaolan Sha, Etta Grover-Silva, and Pietro Michiardi

Toward Personalized and Scalable Voice-Enabled Services Powered by Big Data

(Page 748) BigD239.pdf
Jong Hoon Ahnn

MaPLE: A MapReduce Pipeline for Lattice-based Evaluation and Its Application to SNOMED CT

(Page 754) BigD259.pdf
Guo-Qiang Zhang, Wei Zhu, Mengmeng Sun, Shiqiang Tao, Olivier Bodenreider, and Licong Cui

Dynamic Pre-training of Deep Recurrent Neural Networks for Predicting Environmental Monitoring Data

(Page 760) BigD291.pdf
Bun Theang Ong, Komei Sugiura, and Koji Zettsu

Perldoop: Efficient Execution of Perl Scripts on Hadoop Clusters

(Page 766) BigD296.pdf
Jose M. Abuin, Juan C. Pichel, Tomas F. Pena, Pablo Gamallo, and Marcos Garcia

Department of Energy Strategic Roadmap for Earth System Science Data Integration

(Page 772) BigD310.pdf
Dean Williams, Giriprakash Palanisamy, Galen Shipman, Thomas Boden, and Jimmy Voyles

Analyzing the Language of Food on Social Media

(Page 778) BigD350.pdf
Daniel Fried, Mihai Surdeanu, Stephen Kobourov, Melanie Hingle, and Dane Bell

Using Geometric Structures to Improve the Error Correction Algorithm of High-Throughput Sequencing Data on MapReduce Framework

(Page 784) BigD366.pdf
Wei-Chun Chung, Yu-Jung Chang, D. T. Lee, and Jan-Ming Ho

Empowering Personalized Medicine with Big Data and Semantic Web Technology: Promises, Challenges, and Use Cases

(Page 790) BigD409.pdf
Maryam Panahiazar, Vahid Taslimitehrani, Ashutosh Jadhav, and Jyotishman Pathak

On Scaling Time Dependent Shortest Path Computations for Dynamic Traffic Assignment

(Page 796) BigD411.pdf
Amit Gupta, Weijia Xu, Kenneth Perrine, Dennis Bell, and Natalia Ruiz-Juri

High Volume Geospatial Mapping for Internet-of-Vehicle Solutions with In-Memory Map-Reduce Processing

(Page 802) BigD413.pdf
Tao Zhong, Kshitij Doshi, Gang Deng, Xiaoming Yang, and Hegao Zhang

7. Industry and Government Papers

Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon

(Page 808) N216.pdf
Khalifeh Aljadda, Mohammed Korayem, Trey Grainger, and Chris Russell

The authors would like to thank the Big Data team at CareerBuilder for their support with implementing the proposed system within CareerBuilder’s Hadoop ecosystem. The authors would also like to show deep gratitude to the Search Development group at CareerBuilder for their help integrating this system within CareerBuilder’s search engine to enable an improved semantic search experience.

Abstract—Most work in semantic search has thus far focused upon either manually building language-specific taxonomies/ontologies or upon automatic techniques such as clustering or dimensionality reduction to discover latent semantic links within the content that is being searched. The former is very labor intensive and is hard to maintain, while the latter is prone to noise and may be hard for a human to understand or to interact with directly. We believe that the links between similar users’ queries represent a largely untapped source for discovering latent semantic relationships between search terms. The proposed system is capable of mining user search logs to discover semantic relationships between key phrases in a manner that is language agnostic, human understandable, and virtually noise-free.
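
The idea of mining links between similar users' queries can be illustrated with a minimal Python sketch. It assumes a hypothetical log of (user_id, phrase) pairs and scores two key phrases by the number of distinct users who searched for both. The abstract does not describe the paper's actual scoring or noise filtering, so the function name, the log format, and the `min_users` threshold are illustrative assumptions only.

```python
from collections import defaultdict
from itertools import combinations


def related_phrases(search_log, min_users=2):
    """Count, for each pair of phrases, the distinct users who searched both.

    search_log: iterable of (user_id, phrase) pairs (hypothetical format).
    min_users:  assumed support threshold; pairs below it are dropped.
    """
    phrases_by_user = defaultdict(set)
    for user_id, phrase in search_log:
        phrases_by_user[user_id].add(phrase.strip().lower())

    pair_counts = defaultdict(int)
    for phrases in phrases_by_user.values():
        # Sort so that each unordered pair has a single canonical key.
        for a, b in combinations(sorted(phrases), 2):
            pair_counts[(a, b)] += 1

    return {pair: n for pair, n in pair_counts.items() if n >= min_users}


# Made-up log entries, for illustration only.
log = [
    ("u1", "RN"), ("u1", "registered nurse"),
    ("u2", "rn"), ("u2", "registered nurse"),
    ("u3", "rn"), ("u3", "java developer"),
]
print(related_phrases(log))
```

Because the sketch only counts co-occurring strings, it never inspects the language of the phrases, which is one plausible reading of "language agnostic" in the abstract.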

Spatial Computations over Terabyte-Sized Images on Hadoop Platforms

(Page 816) N211.pdf
Peter Bajcsy, Phuong Nguyen, Antoine Vandecreme, and Mary Brady

This work was sponsored by NIST as a part of the Computational Science in Biological Metrology project. We would like to acknowledge all project team members for their contributions.

Abstract—Our objective is to lower the barrier of executing spatial image computations in a computer cluster/cloud environment instead of in a desktop/laptop computing environment. We research two related problems encountered during an execution of spatial computations over terabyte-sized images using Apache Hadoop running on distributed computing resources. The two problems address (a) detection of spatial computations and their parameter estimation from a library of image processing functions, and (b) partitioning of image data for spatial image computations on Hadoop cluster/cloud computing platforms in order to minimize network data transfer. The first problem is solved by designing an iterative estimation methodology. The second problem is formulated as an optimization over three partitioning schemas (physical, logical without overlap and logical with overlap), and evaluated over several system configuration parameters. Our experimental results for the two problems demonstrate 100% accuracy in detecting spatial computations in the Java Advanced Imaging and ImageJ libraries, a speed-up of 5.36 between the default Hadoop physical partitioning and developed logical image partitioning with overlap, and 3.14 times faster execution of logical partitioning with overlap than the one without overlap. The novelty of our work is in designing an extension to Apache Hadoop to run a class of spatial image processing operations efficiently on a distributed computing resource.
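
To make "logical partitioning with overlap" concrete, here is a minimal Python sketch that splits an image into tiles and pads each tile with a halo of neighbouring pixels, so that a spatial kernel can be evaluated per tile without fetching pixels from another partition. This is not the authors' Hadoop extension: the function name, the end-exclusive box convention, and the assumption that the halo equals the kernel radius are all illustrative.

```python
def overlapping_tiles(width, height, tile, halo):
    """Yield (core, padded) boxes as (x0, y0, x1, y1), end-exclusive.

    core   : the pixels this tile is responsible for producing.
    padded : core grown by `halo` pixels on each side, clamped to the image,
             i.e. the pixels the tile must read (assumed: halo = kernel radius).
    """
    for y0 in range(0, height, tile):
        for x0 in range(0, width, tile):
            x1 = min(x0 + tile, width)
            y1 = min(y0 + tile, height)
            padded = (max(x0 - halo, 0), max(y0 - halo, 0),
                      min(x1 + halo, width), min(y1 + halo, height))
            yield (x0, y0, x1, y1), padded


tiles = list(overlapping_tiles(width=10, height=4, tile=5, halo=1))
for core, padded in tiles:
    print("core", core, "reads", padded)
```

Setting `halo=0` gives the "logical without overlap" variant, in which a kernel near a tile border would need pixels held by a neighbouring partition.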

Recall Estimation for Rare Topic Retrieval from Large Corpuses

(Page 825) N201.pdf
Praveen Bommannavar, Alek Kolcz, and Anand Rajaraman

Thanks to Twitter and @WalmartLabs for providing data and resources with which to conduct experiments.

Abstract - The problem of finding documents pertaining to a particular topic finds application in a variety of scenarios. Indeed, the demand for topically pertinent documents has led to myriad companies offering services to find and deliver them (perhaps along with sentiment analysis or clustering) to customers for any topics of interest. The methodologies used to uncover relevant documents range from manually curated keyword filters to trained classification models. Any serious topical analysis requires a sound understanding of key metrics behind the retrieval process, two of the most important being precision and recall. While precision can be easily and inexpensively measured by sampling from classified documents and utilizing (paid) human computation to mark incorrectly classified instances, it is not as straightforward to use the same approach for measuring recall. With most topics occurring relatively sparsely, an unbiased sampling approach becomes prohibitively expensive. In this paper, we introduce a recall measurement procedure requiring only relatively few human judgements. The technique makes use of pairs of sufficiently independent classifiers and the paper provides a detailed discussion of how such classifier pairs can be constructed in practice, with a focus on social media classifiers. We report the performance of the proposed method with simple keyword filters as well as with classifiers of progressive levels of complexity and show that under reasonable conditions, recall can be estimated to within 0.10 absolute error and 15% relative error, and often closer, with a reduction of cost by a factor of as much as 1000x as compared with unbiased sampling.
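
One way to see why a pair of sufficiently independent classifiers helps is the following minimal Python sketch. It assumes that, among truly relevant documents, classifier B's decisions are independent of classifier A's; under that assumption, the fraction of B's relevant hits that A also flags estimates A's recall, and humans only need to judge documents flagged by B. This is a generic capture-recapture style estimator written for illustration; the abstract does not give the paper's exact procedure, and all names here are hypothetical.

```python
def estimate_recall(flagged_by_a, flagged_by_b, is_relevant):
    """Estimate the recall of classifier A using classifier B as a probe.

    flagged_by_a, flagged_by_b: sets of document ids each classifier accepted.
    is_relevant: callable standing in for a (paid) human judgement; it is
                 called only on documents flagged by B.
    """
    judged_relevant = [doc for doc in flagged_by_b if is_relevant(doc)]
    if not judged_relevant:
        raise ValueError("classifier B found no relevant documents to judge")
    also_found_by_a = sum(1 for doc in judged_relevant if doc in flagged_by_a)
    return also_found_by_a / len(judged_relevant)


# Toy data, invented for illustration: documents 1..10 are truly relevant.
truly_relevant = set(range(1, 11))
flagged_by_a = {1, 2, 3, 4, 5, 6, 20}   # true recall of A is 6/10
flagged_by_b = {1, 2, 3, 7, 8, 30}

estimate = estimate_recall(flagged_by_a, flagged_by_b,
                           lambda doc: doc in truly_relevant)
print(estimate)
```

In this toy example only six human judgements are needed (the documents flagged by B), rather than a judgement on every sampled document in the corpus.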

Lightweight Approximate Top-k for Distributed Settings

(Page 835) N217.pdf
Vinay Deolalikar and Kave Eshghi

Abstract—Consider the problem of finding the Top-k records in a relation based on the sum of their attributes. This problem occurs in various settings in big data management, for example in geographically distributed data centers and clouds, both at the application layer and the storage management layer. We propose a lightweight distributed, order and duplication insensitive approach based on order statistics. The salient feature of our algorithm that makes it extremely lightweight is that it only processes and communicates the items most likely to be in the Top-k. We validate the efficacy of our algorithm on a wide range of datasets.
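
The problem statement can be made concrete with a small Python sketch of the exact (and communication-heavy) baseline, not the paper's approximate algorithm. It assumes, for illustration, that each node holds one partial score per record id; the function and variable names are hypothetical.

```python
import heapq
from collections import Counter


def exact_top_k(nodes, k):
    """Exact Top-k by summed score; every node ships every record.

    nodes: list of dicts mapping record id -> that node's partial score
           (assumed layout, for illustration only).
    """
    totals = Counter()
    for partial_scores in nodes:
        totals.update(partial_scores)   # adds scores per record id
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Made-up scores for three records spread over two nodes.
nodes = [
    {"a": 5, "b": 1, "c": 4},
    {"a": 1, "b": 6, "c": 4},
]
print(exact_top_k(nodes, k=2))
```

Record "c" is not the local maximum on either node, yet it has the largest sum. A scheme that forwards only each node's local winner would miss it, which is why choosing the items "most likely to be in the Top-k" is the hard part of a lightweight approach.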

Query Revision During Cluster Based Search on Large Unstructured Corpora

(Page 845) N218.pdf
Vinay Deolalikar

I thank Hernan Laffitte for gathering some of the data presented in this paper.

Abstract—We investigate a frequently occurring issue in search (retrieval) in the age of big unstructured data. Searches conducted on large unstructured corpora result in long results lists. Such results lists are often clustered and reranked for ease of navigation. Should a query be revised during time-critical examinations of such long cluster based reranked lists? This question arises naturally during early stages of commercially important applications of IR such as eDiscovery, but has not yet been given any research attention. Four factors compound the difficulty of this question in the setting of eDiscovery: (a) the query sources (the technical experts) are different from the legal staff that are actually executing the query and using the retrieval system, (b) the retrieved lists for each query tend to be very long, (c) the user might be accessing these retrieved results through a clustering interface, and (d) all decisions must be transparent and easy to explain due to the litigious nature of the application. Analogous difficulties arise in other applications involving search over large unstructured corpora.

We introduce a framework to help users make the decision of “whether to revise.” Our framework consists of two components. First, we introduce a “limited view” which is a summary of a long cluster-based reranked list. This is the first input to the user. This provides the user a summary of the long cluster-based list. Second, we construct query predictors for this limited view, and provide their prediction as a second input to the user. This prediction is used to corroborate the inspection of the summary limited view. The proposed combination of a limited view and query performance prediction can assist search staff in determining whether to pursue an expensive query revision or not, as well as save precious time by precluding inspections of lists with very few relevant documents during the early stages of commercially important applications such as eDiscovery.

Astro: A Predictive Model for Anomaly Detection and Feedback-based Scheduling on Hadoop

(Page 854) N223.pdf
Chaitali Gupta, Mayank Bansal, Tzu-Cheng Chuang, Ranjan Sinha, and Sami Ben-romdhane

We thank eBay Inc. for providing support for this research work. We also thank Debashis Saha, Balamurali Meduri, Chris Severs, Karthik Chandrasekar, and Ravinder Purumala for their time and valuable suggestions.

Abstract— The sheer growth in data volume and Hadoop cluster size makes it a significant challenge to diagnose and locate problems in a production-level cluster environment efficiently and within a short period of time. Oftentimes, the distributed monitoring systems are not capable of detecting a problem well in advance when a large-scale Hadoop cluster starts to deteriorate in performance or becomes unavailable. Thus, incoming workloads, scheduled between the time when the cluster starts to deteriorate and the time when the problem is identified, suffer from longer execution times. As a result, both reliability and throughput of the cluster decrease significantly. In this paper, we address this problem by proposing a system called Astro, which consists of a predictive model and an extension to the Hadoop scheduler. The predictive model in Astro takes into account a rich set of cluster behavioral information that is collected by monitoring processes and models it using machine learning algorithms to predict future behavior of the cluster. The Astro predictive model detects anomalies in the cluster and also identifies a ranked set of metrics that have contributed the most towards the problem. The Astro scheduler uses the prediction outcome and the list of metrics to decide whether it needs to move and reduce workloads from the problematic cluster nodes or to prevent additional workload allocations to them, in order to improve both throughput and reliability of the cluster. The results demonstrate that the Astro scheduler improves usage of cluster compute resources significantly by 64.23% compared to traditional Hadoop. Furthermore, the runtime of the benchmark application was reduced by 26.68% during the time of anomaly, thus improving the cluster throughput.

Automating Data Integration with HiperFuse

(Page 863) N219.pdf
Eric Huang, Andres Quiroz, and Luca Ceriani

Abstract— Integrating heterogeneous datasets has been a significant barrier to many analytics tasks, due to the variety in structure and level of cleanliness of raw datasets requiring one-off ETL code. We propose HiperFuse, which significantly automates the data integration process by providing a declarative interface, robust type inference, extensible domain-specific data models, and a data integration planner which optimizes for plan completion time.

Recommending Similar Items in Large-scale Online Marketplaces

(Page 868) N230.pdf
Jayasimha Reddy Katukuri, Tolga Konik, Rajyashree Mukherjee, and Santanu Kolay

We thank Abhinaya Sinha, Riyaaz Shaik, Kranthi Chalasani, Aravind Ragipindi, Murali Padavala, Venkat Sundaranatha and the Merchandizing Team at eBay for contributions to this research with implementation and ideas.

Abstract—We are proposing a new similarity based recommendation system for large-scale dynamic marketplaces. Our solution consists of an offline process, which generates long-term cluster definitions grouping short-lived item listings, and an online system, which utilizes these clusters to first focus on important similarity dimensions and next conducts a trade-off between further similarity and other quality factors such as seller trustworthiness. Our system generates these clusters from several hundred million item listings using a large Hadoop map-reduce based system. The clusters are learned using user queries as the main information source and are therefore biased towards how users conceptually group items. Our system is deployed on several eBay sites at large scale and has increased user-engagement and business metrics compared to the previous system. We show that utilizing user queries helps capture similarity better. We also present experiments demonstrating that adapting the ranking function, which controls the trade-off between similarity and quality, to a specific context improves recommendation performance.

SE-CDA: A Scalable and Efficient Community Detection Algorithm

(Page 877) N213.pdf
Dhaval C. Lunagariya, Somayajulu D.V.L.N., and Radha Krishna P.

Abstract-- Detecting communities is of great importance in various disciplines such as social media, biology and telephone networks, where systems are often represented as graphs. A community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group. Communities have different properties such as node degree, betweenness, centrality, cluster coefficient and modularity. Discovering communities from social networks of big data scale on a single sequential machine is a tedious task. In this paper, we present a Scalable Community Detection Algorithm which relaxes the performance issues due to many I/Os.

We adopt Girvan–Newman’s modularity based hierarchical community detection algorithm in a bottom-up approach and propose an approximation algorithm for community detection in a distributed environment. We developed our approach using the MapReduce and Giraph computing platforms. Experimental results demonstrate that the proposed approach is more efficient than the standard MapReduce approach and scales easily to graphs of any size.
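
Modularity, the quantity that such hierarchical algorithms try to increase, can be computed directly for a given partition. The following minimal Python sketch uses the standard Newman–Girvan definition for an undirected, unweighted graph without self-loops: Q = Σ_c [ L_c / m − (d_c / 2m)² ], where m is the number of edges, L_c the number of edges inside community c, and d_c the total degree of its nodes. It is a single-machine illustration only, not the distributed SE-CDA algorithm, and the function name is hypothetical.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity of a partition.

    edges:       list of (u, v) pairs, undirected, no self-loops assumed.
    communities: list of node sets that together cover every node once.
    """
    m = len(edges)
    community_of = {node: index
                    for index, nodes in enumerate(communities)
                    for node in nodes}
    internal_edges = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal_edges[cu] += 1
    return sum(l_c / m - (d_c / (2 * m)) ** 2
               for l_c, d_c in zip(internal_edges, degree_sum))


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_whole = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(q_split, q_whole)
```

Splitting the graph at the bridge scores 5/14, while treating the whole graph as one community scores 0, so a modularity-driven method would prefer the two-triangle partition.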

Increasing the Accessibility to Big Data Systems via a Common Services API

(Page 883) N209.pdf
Rohan Malcolm, Cherrelle Morrison, Tyrone Grandison, Sean Thorpe, Kimron Christie, Akim Wallace, Damian Green, Julian Jarrett, and Arnett Campbell

Abstract — Despite the plethora of polls, surveys, and reports stating that most companies are embracing Big Data, there is slow adoption of Big Data technologies, like Hadoop, in enterprises. One of the primary reasons for this is that companies have significant investments in legacy languages and systems and the process of migrating to newer (Big Data) technologies would represent a substantial commitment of time and money, while threatening their short-term service quality and revenue goals. In this paper, we propose a possible solution that enables existing infrastructure to access Big Data systems via a services application programming interface (API); minimizing the migration drag and (possibly negative) business repercussions.

Big Data Predictive Analytics for Proactive Semiconductor Equipment Maintenance

(Page 893) N232.pdf
Sathyan Munirathinam

Abstract— The manufacturing industry generates about a third of all data today, and modern semiconductor manufacturing is one of the largest contributors to this tsunami of data. Terabytes of data are generated on a daily basis during ~500 steps in semiconductor chip processing. During this complex manufacturing process, equipment downtime may cause a significant loss of productivity and profit. In this paper, we are going to explore the predictive analytical algorithms and big data techniques in helping to achieve near-zero equipment downtime in the fabrication unit and to improve OEE (Overall Equipment Effectiveness), which is a key machine manufacturing productivity metric.
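
For readers unfamiliar with OEE, the following minimal Python sketch shows the commonly used definition, OEE = availability × performance × quality, and how downtime feeds into it. The abstract does not say how the paper computes OEE, so the parameter names and the shift figures below are invented for illustration.

```python
def oee(planned_minutes, downtime_minutes, ideal_cycle_minutes,
        total_units, good_units):
    """Overall Equipment Effectiveness as availability * performance * quality."""
    run_minutes = planned_minutes - downtime_minutes
    availability = run_minutes / planned_minutes
    performance = (ideal_cycle_minutes * total_units) / run_minutes
    quality = good_units / total_units
    return availability * performance * quality


# Hypothetical shift: 480 planned minutes, 60 minutes of downtime,
# 1 minute ideal cycle time, 378 units produced, 360 of them good.
baseline = oee(480, 60, 1.0, 378, 360)

# Same production rate and yield with only 10 minutes of downtime
# (unit counts scaled with run time: 423 produced, ~402.86 good).
improved = oee(480, 10, 1.0, 423, 423 * 360 / 378)
print(baseline, improved)
```

In this made-up example, cutting downtime from 60 to 10 minutes while holding rate and yield constant raises OEE because the availability factor rises from 0.875 to about 0.979.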

Future Directions of Humans in Big Data Research

(Page 903) N202.pdf
Celeste Lyn Paul, Chris Argenta, William Elm, and Alex Endert

Abstract—The goal of the 1st Workshop on Human-Centered Big Data Research was to explore the multi-disciplinary challenges of researching humans in Big Data environments. This paper summarizes the outcomes of the workshop and aims to define potential future work in this area.

ALOJA: a Systematic Study of Hadoop Deployment Variables to Enable Automated Characterization of Cost-Effectiveness

(Page 905) N222.pdf
Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Nikola Vujic, Daron Green, José Blakeley, Sergio Mendoza, Yolanda Becerra, Jordi Torres, Eduard Ayguadé, and Jesús Labarta

This work is partially supported by the BSC-Microsoft Research Centre, the Spanish Ministry of Education (TIN2012-34557), the MINECO Severo Ochoa Research program (SEV-2011-0067) and the Generalitat de Catalunya (2014-SGR-1051).

Abstract—This article presents the ALOJA project, an initiative to produce mechanisms for an automated characterization of cost-effectiveness of Hadoop deployments and reports its initial results. ALOJA is the latest phase of a long-term collaborative engagement between BSC and Microsoft which, over the past 6 years, has explored a range of different aspects of computing systems, software technologies and performance profiling. While during the last 5 years Hadoop has become the de-facto platform for Big Data deployments, still little is understood of how the different layers of the software and hardware deployment options affect its performance. Early ALOJA results show that Hadoop’s runtime performance, and therefore its price, are critically affected by relatively simple software and hardware configuration choices, e.g., number of mappers, compression, or volume configuration. Project ALOJA presents a vendor-neutral repository featuring over 5000 Hadoop runs, a test bed, and tools to evaluate the cost-effectiveness of different hardware, parameter tuning, and Cloud services for Hadoop. As few organizations have the time or performance profiling expertise, we expect our growing repository will help Hadoop customers meet their Big Data application needs. ALOJA seeks to provide both knowledge and an online service with which users can make better informed configuration choices for their Hadoop compute infrastructure, whether this be on-premise or cloud-based.

The initial version of ALOJA’s Web application and sources are available at http://hadoop.bsc.es

Heterogeneous Stream Processing for Disaster Detection and Alarming

(Page 914) N224.pdf
Francois Schnitzler, Thomas Liebig, Shie Mannor, Gustavo Souto, Sebastian Bothe, and Hendrik Stange

This work is funded by the EU FP7 INSIGHT2 project (Intelligent Synthesis and Real-time Response using Massive Streaming of Heterogeneous Data), 318225.

Abstract—We present a novel approach for event recognition in massive streams of heterogeneous data driven by privacy policies and big data event processing. New technologies in mobile computing combined with sensing infrastructures distributed in a city or country are generating massive, polystructured spatio-temporal data. With a view on emergencies and disasters, these various data sources enable early response and offer situative insights when integrated in an on-line incident recognition system.

Our hereby presented system architecture integrates multifaceted sensing and distributed event detection to identify, label and increase confidence in detected incidents. A higher flexibility than existing event detection approaches is achieved by combination of the data streams at a round table. At the round table the data flow adjusts itself during execution of the real-time detection system. This offers more robustness in case streams appear or disappear. The developed architecture is used in nation-wide and city-level incident recognition scenarios.

Identifying top Chinese network buzzwords from social media big data set based on time-distribution features

(Page 924) N236.pdf
Yongli Tang, Tingting He, Bo Li, and Xiaohua Hu

This work was supported by the Major Project of National Social Science Fund (No. 12&2D223), the Natural Science Foundation of China (No. 61300144), the Natural Science Foundation of Hubei Province (No. 2011CDA034), the Major Project of State Language Commission in the Twelfth Five-year Plan Period (No. ZDI125-1), the Project in the National Science & Technology Pillar Program in the Twelfth Five-year Plan Period (No. 2012BAK24B01), the Program of Introducing Talents of Discipline to Universities (No. B07042), and the self-determined research funds of CCNU from the colleges’ basic research and operation of MOE (No. CCNU13A05014, No. CCNU13C01001, No. CCNU13F010).

Abstract—Buzzwords are the main embodiment of internet culture and play an important role in public opinion analysis, social focus tracking and language evolution study. At present, questionnaires are widely used as a standard method to obtain network buzzwords, which is subjective and costly. In this paper, we propose a novel algorithm relying on the time-distribution feature of words and a KL-divergence measure to estimate words’ popularity so as to identify buzzwords in a specific period. The time-distribution feature simply states the fact that buzzwords’ usage has a sharp increase during a very short period, which is then modeled formally with the KL-divergence measure. Compared with the traditional method, which requires substantial manual effort, the automatic algorithm presented here is clearly more efficient. Moreover, buzzwords identified in this manner are not affected by individuals’ subjective opinions, so they better reflect language usage in practice. When applying the algorithm to a social media big data set, our experimental results show that the proposed approach can accurately identify buzzwords in a certain period, with results highly consistent with manually tagged results.
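
The abstract does not give the exact formulation, so the following is only an illustrative sketch of the general idea: a word whose usage is concentrated in a short period has a time distribution far from uniform, and KL divergence can measure that distance. The choice of a uniform reference distribution, the function names, and the sample counts are all assumptions made here, not the authors' method.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length.
    Terms with p_i == 0 contribute nothing, by convention."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def burst_score(counts):
    """Score how bursty a word is from its per-period usage counts.
    Assumption: compare the word's time distribution to a uniform one."""
    total = sum(counts)
    if total == 0:
        return 0.0
    p = [c / total for c in counts]
    q = [1 / len(counts)] * len(counts)
    return kl_divergence(p, q)


# Hypothetical weekly counts: a steady word versus a word that spikes once.
steady = [10, 10, 10, 10]
spiky = [0, 0, 40, 0]
print(burst_score(steady), burst_score(spiky))
```

A steadily used word scores near zero, while a word used only in one period scores the maximum for that number of periods, so ranking words by this score surfaces candidates with sharp, short-lived increases.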

Bridging High Velocity and High Volume Industrial Big Data Through Distributed In-Memory Storage & Analytics

(Page 932) N203.pdf
Jenny Weisenberg Williams, Kareem Aggour, John Interrante, Justin McHugh, and Eric Pool

Abstract—With an exponential increase in time series sensor data generated by an ever-growing number of sensors on industrial equipment, new systems are required to efficiently store and analyze this “Industrial Big Data.” To actively monitor industrial equipment there is a need to process large streams of high velocity time series sensor data as it arrives, and then store that data for subsequent analysis. Historically, separate systems would meet these needs, with neither system having the ability to perform fast analytics incorporating both just-arrived and historical data. In-memory data grids are a promising technology that can support both near real-time analysis and mid-term storage of big datasets, bridging the gap between high velocity and high volume big time series sensor data. This paper describes the development of a prototype infrastructure with an in-memory data grid at its core to analyze high velocity (>100,000 points per second), high volume (TB’s) time series data produced by a fleet of gas turbines monitored at GE Power & Water’s Remote Monitoring & Diagnostics Center.

Explore Efficient Data Organization for Large Scale Graph Analytics and Storage

(Page 942) N215.pdf

Yinglong Xia, Ilie G. Tanase, Lifeng Nai, Wei Tan, Yanbin Liu, Jason Crawford, and Ching-Yung Lin

Abstract—Many Big Data analytics essentially explore the relationship among interconnected entities, which are naturally represented as graphs. However, due to the irregular data access patterns in graph computations, it remains a fundamental challenge to deliver highly efficient solutions for large scale graph analytics. Such inefficiency restricts the utilization of many graph algorithms in Big Data scenarios. To address the performance issues in large scale graph analytics, we develop a graph processing system called System G, which explores efficient graph data organization for parallel computing architectures. We discuss various graph data organizations and their impact on data locality during graph traversals, which results in varying cache performance behavior on the processor side. In addition, we analyze data parallelism from the architecture’s perspective and experimentally show the efficiency of System G based graph analytics. We present experimental results for commodity multicore clusters and IBM PERCS supercomputers to illustrate the performance of System G for large scale graph analytics.

An Initial Study of Predictive Machine Learning Analytics on Large Volumes of Historical Data for Power System Applications

(Page 952) N228.pdf
Jiang Zheng and Aldo Dagnino

Abstract — Nowadays large volumes of industrial data are being actively generated and collected in various power system applications. Industrial Analytics in the power system field requires more powerful and intelligent machine learning tools, strategies, and environments to properly analyze the historical data and extract predictive knowledge. This paper discusses the situation and limitations of current approaches, analytic models, and tools utilized to conduct predictive machine learning analytics for very large volumes of data where the data processing causes the processor to run out of memory. Two industrial analytics cases in the power systems field are presented. Our results indicate the feasibility of forecasting substation fault events and power load using machine learning algorithms written in the MapReduce paradigm or machine learning tools specific to Big Data.

In Unity There is Strength: Showcasing a Unified Big Data Platform with MapReduce Over both Object and File Storage

(Page 960) N207.pdf
Rui Zhang, Dean Hildebrand, and Renu Tewari

Abstract—Big data platforms often need to support emerging data sources and applications while accommodating existing ones. Since different data and applications have varying requirements, multiple types of data stores (e.g. file-based and object-based) frequently co-exist in the same solution today without proper integration. Hence cross-store data access, key to effective data analytics, cannot be achieved without laborious application re-programming, prohibitively expensive data migration, and/or costly maintenance of multiple data copies.

We address this vital issue by introducing a first unified big data platform over heterogeneous storage. In particular, we present a prototype joining Apache Hadoop MapReduce with OpenStack’s open-source object store Swift and IBM’s cluster file system GPFS™. A sentiment analysis application using 3 months of real Twitter data is employed to test and showcase our prototype. We have found that our prototype achieves 50% data capacity savings, eliminates data migration overhead, and offers stronger reliability and enterprise support. Through our case study, we have learned important theoretical lessons concerning performance and reliability, as well as practical ones related to platform configuration. We have also identified several potentially high-impact research directions.

Author Index

PDF - 7 pages

My Note: I search this with Google Chrome Find on this entire page. There is no search across all their web pages and PDFs.

Panel with Program Directors: Big Data Challenges and Opportunities

Source: http://cci.drexel.edu/bigdata/bigdata2014/panel.htm

Chair: Prof. Xiaohua Tony Hu

  1. Dr. Chaitanya Baru (NSF)
  2. Dr. Yuan Liu (NIH)
  3. Dr. David Kuehn (DoT)
  4. Dr. Tsengdar Lee (NASA)
  5. Dr. Sudarsan Rachuri (NIST)
  6. Mr. Matti Vakkuri (DIGILE)

Dr. Chaitanya Baru (NSF)

Chaitan Baru currently serves as Senior Advisor for Data Science in the CISE Directorate at the National Science Foundation. He is on assignment from the San Diego Supercomputer Center, UC San Diego, where he is Distinguished Scientist and Associate Director of Data Initiatives. He has served in leadership positions in a number of national-scale cyberinfrastructure R&D initiatives across a wide range of science and engineering disciplines including earth science, ecology, earthquake engineering, and biomedical informatics. In 2012, he initiated an industry-academia effort to define big data benchmarks via the Workshops for Big Data Benchmarking (WBDB). This has resulted in the recent formation of the SPEC Research Group on Big Data Benchmarking, which he co-chairs. He is co-editor of the Lecture Notes in Computer Science series entitled Specifying Big Data Benchmarks, published by Springer Verlag. He co-chairs the National Institute of Standards and Technology’s (NIST) Public Working Group on Big Data. He is a member of the teaching faculty for the Masters in Advanced Studies program in Data Science and Engineering (MAS-DSE) in the Computer Science Department at UC San Diego.

Baru has co-edited the book, Geoinformatics: Cyberinfrastructure for the Solid Earth Sciences with Prof. Randy Keller, University of Oklahoma, published by Cambridge University Press (ISBN: 9780521897150).

Baru has a B.Tech (Electronics Engineering) from IIT Madras and an M.E. and Ph.D. (Electrical Engineering) from the University of Florida.

This panelist's slides can be downloaded here. My Note: See Slides Above

Dr. Yuan Liu (NIH)

Dr. Yuan Liu is the Chief of the Office of International Activities, and the Director of Computational Neuroscience and Neuroinformatics Program at the National Institute of Neurological Disorders and Stroke (NINDS), National Institutes of Health (NIH).

Dr. Liu leads the NINDS’ international activities, which focus on fostering international and global health research, training and collaborations. She also oversees the computational neuroscience and neuroinformatics program, which promotes collaborations between experimental, computational and informatics neuroscientists to advance the understanding of nervous system structure and function, and the mechanisms underlying nervous system disorders.

In addition, Dr. Liu has been serving as the NINDS representative on over 30 international, interagency and trans-NIH committees and working groups that develop international and computational biology and bioinformatics related programs, including many inter-agency initiatives (e.g., CRCNS, IMAG, and BigData), trans-NIH programs (e.g., Roadmap, BISTI, and BD2K), and Blueprint for Neuroscience activities (e.g., NIF, NITRC and Human Connectome Project). For her achievement and contribution, she received several NIH Director’s Awards and NIH Blueprint Neuroscience Research Directors Awards.

Dr. Liu received her bachelor's and master's degrees in neurophysiology from Peking University in P.R. China, and her Ph.D. in neuroscience, under the mentorship of Prof. John G. Nicholls, from the Biozentrum, Universität Basel in Switzerland. Her research career was focused on the area of neurophysiology at single channel, synaptic and systems levels. Between 1999 and 2004, she managed the research portfolio centered on channels, synapses and circuit grants at NINDS. Prior to joining the NINDS, Dr. Liu was Program Director for Basic Neuroscience Research.

This panelist's slides can be downloaded here. (PDF)

Big Data NIH Funding Opportunities

YuanLiu11292014Slide1.png

BISTI Funding Opportunities

http://www.bisti.nih.gov/funding/

YuanLiu11292014Slide3.png

Seven High Priority Research Areas

YuanLiu11292014Slide4.png

Dr. David Kuehn (DoT)

David Kuehn is the Program Manager for the Federal Highway Administration (FHWA) Exploratory Advanced Research Program. The Program Manager serves as the senior advisor to agency leadership on the communication and coordination of exploratory advanced research activities and fosters partnerships with other Federal agencies, national scientific societies and organizations, and the academic community in support of the Program. The program focuses on longer term and higher risk research with the potential for transformational improvements to the transportation system. David entered federal service as a Presidential Management Fellow. Before working at the federal level, David worked in local government and as a consultant in southern California. He holds a Master of Public Administration from the University of Southern California and a B.A. from the University of California, Irvine, and is a member of the American Institute of Certified Planners (AICP).

This panelist's slides can be downloaded here. (PDF)

Data Science Challenges and Opportunities in Highway Transportation

DavidKuehn10292014Slide1.png

Presentation Outline

DavidKuehn10292014Slide2.png

Program Status

DavidKuehn10292014Slide3.png

Connected Highway Systems

DavidKuehn10292014Slide4.png

Opportunities

DavidKuehn10292014Slide5.png

Dr. Tsengdar Lee (NASA)

Dr. Tsengdar Lee manages the High-End Computing Program from NASA Headquarters. He is responsible for maintaining the high-end computing capability to support the agency's aeronautics research, human exploration, scientific discovery, and space operations missions. Lee is also the manager of the NASA Weather Data Analysis Program, focusing on the transition of research results into the operational forecast centers and the acceleration of operational use of research data. Two major activities include the multi-agency Joint Center for Satellite Data Assimilation and the Short-term Prediction Research and Transition Center.

In 2011, Lee served as Acting Chief Technology Officer (CTO) for Information Technology (IT) in the NASA Office of the Chief Information Officer. In this capacity, Lee funded agency-wide IT research and advanced prototyping and created NASA's IT Labs. He also chaired the CTO-IT Council.

Lee joined NASA in 2001 as the High-End Computing Program Manager for the Earth Science Enterprise. He was responsible for the Earth science computational modeling needs, primarily focusing on weather and climate modeling. Between 2002 and 2006, Lee also managed the Earth Science Global Modeling Program. He funded research efforts to study the global climate change, weather forecasting, and hurricane prediction problems.

Prior to 2001, Lee held positions as Senior Technical Advisor with Northrop Grumman Information Technology and Senior Staff Engineer with Litton PRC. He worked on the Advanced Weather Information Processing System (AWIPS) project for the National Weather Service. He was responsible for the rapid development, integration, and commercialization of the AWIPS client-server system. Lee also was a principal engineer on the effort to develop the AWIPS network monitoring and control system.

He was a Research Scientist and worked on the dispersion problem of bio-chemical agents during his short tenure with the Science Applications International Corporation between 1994 and 1996. Lee received two graduate degrees from Colorado State University, a PhD in Atmospheric Science in 1992 and an MS in Civil Engineering in 1988. Trained as a short-term weather modeler, his work focused on the integration of weather and ancillary geographical information data into weather models to produce reliable forecasts. His research pioneered the modeling of land surface hydrology’s impact on weather forecasting.

This panelist's slides can be downloaded here. (PDF)

NASA's Big Data Challenges in Climate Science

TsengdarLee10292014Slide1.png

Turning Observations into Knowledge and Decision Products

TsengdarLee10292014Slide3.png

Future Directions and Challenges

TsengdarLee10292014Slide4.png

Dr. Sudarsan Rachuri (NIST)

Dr. Sudarsan Rachuri is the Program Manager for the Smart Manufacturing Design and Analysis program at NIST. Prior to joining NIST, he was a research professor at George Washington University. His primary research objectives are to develop and transfer knowledge to industry about information models for sustainable and smart manufacturing, green products, big data analytics for manufacturing, system level analysis, and knowledge representation. His specific focus is on identifying integration and technology issues that promote industry acceptance of information models and standards that will enable designers to develop products that are sustainable and manufactured using smart technologies in a distributed and collaborative environment. Dr. Rachuri's primary areas of interest are smart and sustainable manufacturing, scientific computing, CAD/CAM/CAE, design for sustainability, data analytics, object-oriented modeling, and ontology.

Dr. Rachuri is an ASME Fellow, having been elected in 2012 for his significant contributions in the areas of information and semantic modeling of product life cycle management, and the application of measurement science for sustainable manufacturing.

This panelist's slides can be downloaded here.

From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems

sudarsan@nist.gov

SudarsanRachuri10292014Slide 1.png

Outline

SudarsanRachuri10292014Slide 2.png

Smart Manufacturing Systems Design and Analysis

SudarsanRachuri10292014Slide 3.png

Manufacturing Big Data Analytics

SudarsanRachuri10292014Slide 4.png

Mr. Matti Vakkuri (DIGILE)

Mr. Matti Vakkuri, Program Director, Big Data, Tieto & Focus Area Director, DIGILE’s Data-to-Intelligence Programme

Mr. Matti Vakkuri graduated from the Finnish Military Academy in 1993. He has 20 years of experience in management, leadership, business development, security, quality, human resource management, project management, program management, offering development, crisis management and consulting in both the governmental and private sectors.

In his current position at Tieto, his tasks are to enable the power of Big Data and its enormous impact on customers’ businesses; develop and ramp up the offering, sales, and delivery capabilities; build competences in Big Data, Hadoop, and data sciences; assure cross-organizational collaboration and networking; and evaluate partners, suppliers, and competitors in the Big Data market. His tasks also include advocating and lobbying for Big Data’s possibilities internally and externally in business operations, research, and product development. Since April 2013, in addition to his job at Tieto, he has held a part-time position as Focus Area Director (the Head of the Program) for Digile’s Data to Intelligence research program. The program is focused on Big Data, data reserves and user-centric service development. The aim of the program is to develop, together with companies and research institutions, intelligent tools and methods for managing, refining and utilizing diverse data. The results of the program enable innovative business models and services. One of the program’s targets is to develop methods for Big Data analytics that handle complexity through fusion of heterogeneous data sources, and use adaptivity, context-sensitivity, scalability, and user relevance as the main methodological objectives.

Since January 2014 he has been a full member of the Big Data working group of Finland’s Ministry of Transport and Communications, which wrote the draft of Finland’s national Big Data strategy in June 2014.

His motto is "Management by leadership".

This panelist's slides can be downloaded here. (PDF)

Data to Intelligence Research Program

http://www.datatointelligence.fi/

MattiVakkuri10292014Slide1.png

Data to Intelligence Program Implementation Facts

MattiVakkuri10292014Slide2.png

Data to Intelligence-Main Results Obtained

MattiVakkuri10292014Slide3.png

Oulu Traffic Pilot

http://beta.oulutraffoc/fi

MattiVakkuri10292014Slide4.png

Forest Big Data

MattiVakkuri10292014Slide5.png

Remote and Highly Automated Services

MattiVakkuri10292014Slide6.png

Tutorials

Source: http://cci.drexel.edu/bigdata/bigdat...4/tutorial.htm 

Tutorial 1: Big Data Stream Mining

PDF

Presenters

Gianmarco De Francisci Morales, Joao Gama, Albert Bifet, and Wei Fan

Summary

The challenge of deriving insights from big data has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. This tutorial is a gentle introduction to mining big data streams. The first part introduces data stream learners for classification, regression, clustering, and frequent pattern mining. The second part discusses data stream mining on distributed engines such as Storm, S4, and Samza.

Content

Fundamentals and Stream Mining Algorithms

 Stream mining setting

 Concept drift

 Classification and Regression

 Clustering

 Frequent Pattern mining

Distributed Big Data Stream Mining

 Distributed Stream Processing Engines

 Classification

 Regression
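
To make the concept-drift item above more concrete, here is a deliberately simplified sketch of drift detection on a stream of prediction errors. It is not one of the tutorial's algorithms and not an established detector such as those shipped with MOA; the class name, window size, and threshold are hypothetical. It keeps a sliding window of 0/1 errors and flags drift when the error rate in the newer half of the window exceeds the rate in the older half by more than a threshold.

```python
from collections import deque


class WindowDriftDetector:
    """Toy drift detector (illustrative assumption, not a published method)."""

    def __init__(self, window_size=100, threshold=0.3):
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, error):
        """Add one 0/1 prediction error; return True if drift is flagged."""
        self.window.append(error)
        if len(self.window) < self.window.maxlen:
            return False  # not enough evidence yet
        items = list(self.window)
        half = len(items) // 2
        old_rate = sum(items[:half]) / half
        new_rate = sum(items[half:]) / (len(items) - half)
        if new_rate - old_rate > self.threshold:
            self.window.clear()  # start fresh after the change
            return True
        return False


# Hypothetical stream: the model is always right, then the concept changes
# and every prediction is wrong.
detector = WindowDriftDetector(window_size=20, threshold=0.3)
stream = [0] * 50 + [1] * 10
flags = [detector.update(e) for e in stream]
print("drift flagged at positions:", [i for i, f in enumerate(flags) if f])
```

In a real stream learner, a flagged drift would typically trigger retraining or replacement of the current model; the fixed window and threshold here trade detection speed against false alarms.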

Short Profiles

Gianmarco De Francisci Morales's Profile

Gianmarco De Francisci Morales is a Research Scientist at Yahoo Labs Barcelona. He received his Ph.D. in Computer Science and Engineering from the IMT Institute for Advanced Studies of Lucca in 2012. His research focuses on large scale data mining and big data, with a particular emphasis on web mining and Data Intensive Scalable Computing systems. He is an active member of the open source community of the Apache Software Foundation working on the Hadoop ecosystem, and a committer for the Apache Pig project. He is the co-leader of the SAMOA project, an open-source platform for mining big data streams.

Joao Gama's Profile

Joao Gama is a Researcher at LIAAD, University of Porto, working in the Machine Learning group. His main research interest is in Learning from Data Streams. He has published more than 80 articles. He served as Co-chair of ECML 2005, DS09, ADMA09 and a series of Workshops on KDDS and Knowledge Discovery from Sensor Data with ACM SIGKDD. He is serving as Co-Chair of the upcoming ECML-PKDD 2015. He is the author of a recent book on Knowledge Discovery from Data Streams.

Albert Bifet's Profile

Albert Bifet is a Research Scientist at Huawei. He is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. He is one of the leaders of the MOA and SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams.

Wei Fan's Profile   

Wei Fan is the associate director of Huawei Noah’s Ark Lab. He received his PhD in Computer Science from Columbia University in 2001. His main research interests and experiences are in various areas of data mining and database systems, such as stream computing, high performance computing, extremely skewed distribution, cost-sensitive learning, risk analysis, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, time series analysis, bioinformatics, social network analysis, novel applications and commercial data mining systems. His co-authored paper received the ICDM’2006 Best Application Paper Award, and he led the team that used his Random Decision Tree method to win the 2008 ICDM Data Mining Cup Championship. He received the 2010 IBM Outstanding Technical Achievement Award for his contribution to IBM InfoSphere Streams. He is an associate editor of ACM Transactions on Knowledge Discovery from Data (TKDD). Since he joined Huawei in August 2012, he has led his colleagues to develop Huawei StreamSMART, a streaming platform for online and real-time processing, query and mining of very fast streaming data. In addition, he also led his colleagues to develop a real-time processing and analysis platform for Mobile Broad Band (MBB) data.

IEEE Big Data 2014 Big Data Stream Mining Tutorial website: https://sites.google.com/site/bigdatastreamminingtutorial/    

The slides can be downloaded here.

Tutorial 2: Big ML Software for Modern ML Algorithms

PDF

Presenters

Eric P. Xing and Qirong Ho

Summary

Many Big Data practitioners are familiar with classical Machine Learning techniques such as Naive Bayes, Decision Trees, K-means, PCA, and Collaborative Filtering (to name but a few), and their implementations on popular Big Data systems such as Hadoop. Going beyond these classic techniques, a new generation of ML algorithms (for example, topic models, nonparametric Bayesian models, deep neural networks, and sparse regression) has been gaining popularity in both academia and industry, because they improve performance on existing tasks like recommendation and prediction, or even enable completely new ones such as topical visualization and image object detection. Initially, these algorithms were the exclusive privilege of large companies with the engineering resources to build their own cluster implementations from scratch. Today, however, new open-source software platforms, such as GraphLab, Petuum and Spark, have democratized some or all of these advanced algorithms, putting them within reach of individual researchers and data analysts who do not mind getting their hands a little dirty. In this tutorial, you will learn about these emerging ML algorithms, the software platforms that can run them today, the ML-centric theory, principles and design of an ideal parallel ML system and how today’s platforms fit that ideal, and the open research opportunities that have sprouted in this space between advanced ML and distributed systems.

Content

Advanced, emerging ML algorithms:

 e.g. Deep Neural Networks, topic models, sparse regression

Open source software platforms that can run some or all of these algorithms at scale:

 e.g. GraphLab, Petuum and Spark

Principles, design and theory of an algorithmic and systems interface to Big ML

 Pros and cons of each platform: when should you favor one over the other

Research opportunities in the space between advanced ML and distributed systems

A list of references for the 3 systems:

Petuum

More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server

http://www.cs.cmu.edu/~epxing/papers/2013/SSPTable_NIPS2013.pdf

Exploiting bounded staleness to speed up Big Data analytics

http://www.cs.cmu.edu/~epxing/papers/2014/Cui_etal_ATC14.pdf

Fugue: Slow-Worker-Agnostic Distributed Learning for Big Models on Big Data

http://www.cs.cmu.edu/~epxing/papers/2014/Kumar_Beutel_Ho_Xing_aistat14.pdf

We have one more paper whose camera-ready version is still being prepared. I'd be happy to send a link once it is ready.

Spark

Spark: Cluster Computing with Working Sets

http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Discretized Streams: Fault-Tolerant Streaming Computation at Scale

http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf

Tutorial 3: Large-scale Heterogeneous Learning in Big Data Analytics

PDF

Presenter

Jun Huan

Summary

Heterogeneous learning deals with data from complex real-world applications such as social networks, biological networks, and the internet of things, among others. The heterogeneity could be found in multi-task learning, multi-view learning, multi-label learning, and multi-instance learning. In this talk we will present our and other groups’ recent progress in designing and implementing large-scale heterogeneous learning algorithms, including multi-task learning, multi-view learning, and transfer learning algorithms. The applications of this work in social network analysis and bioinformatics will be discussed as well.

Content

We cover recent progress on the following aspects:

Multi-task learning (MTL) aims to train multiple related learning tasks together to reduce generalization error. MTL has been widely utilized in many application domains, including bioinformatics, social network analysis, and image processing, among others.

Multi-view learning (MVL) aims to identify a model where data are collected from different sources (a.k.a. views). There is an intense discussion on how and to what extent multi-view may help.

Multi-label learning (MLL) aims to build classifiers that assign multiple labels to an instance. It has wide applications in image annotation, recommender systems, etc.

We cover the theoretic foundation of MTL/MVL/MLL learning algorithms using penalized maximum likelihood estimation, Bayesian MTL, and Gaussian process. We also cover the related algorithms such as MTL with known task relationship, multi-task & multi-view learning, learning with structured input and output. We also want to discuss a very important but less investigated area of scaling those learning algorithms to large-scale data. We plan to cover a few platforms that are suitable to support large-scale heterogeneous learning. Applications of heterogeneous learning in Bioinformatics, Health care informatics, Drug Discovery, Social network analysis will be reviewed.

Short Profile

Dr. Jun (Luke) Huan's Profile

Dr. Jun (Luke) Huan is a Professor in the Department of Electrical Engineering and Computer Science at the University of Kansas. He directs the Bioinformatics and Computational Life Sciences Laboratory at the KU Information and Telecommunication Technology Center (ITTC) and the Cheminformatics core at the KU Specialized Chemistry Center, funded by NIH. He holds courtesy appointments at the KU Bioinformatics Center and the KU Bioengineering Program, an adjunct professorship in the Department of Internal Medicine at the KU Medical School, and a visiting professorship from GlaxoSmithKline plc. Dr. Huan received his Ph.D. in Computer Science from the University of North Carolina.

Dr. Huan works on data science, machine learning, data mining, big data, and interdisciplinary topics including bioinformatics. He has published more than 80 peer-reviewed papers in leading conferences and journals and has graduated more than ten graduate students, including six PhDs. Dr. Huan serves on the editorial boards of several international journals, including the Springer Journal of Big Data, Elsevier Journal of Big Data Research, and the International Journal of Data Mining and Bioinformatics. He regularly serves on the program committees of top-tier international conferences on machine learning, data mining, big data, and bioinformatics.

Dr. Huan's research is recognized internationally. He was a recipient of the prestigious National Science Foundation Faculty Early Career Development Award in 2009. His group won the Best Student Paper Award at the IEEE International Conference on Data Mining in 2011 and the Best Paper Award (runner-up) at the ACM International Conference on Information and Knowledge Management in 2009. His work has appeared in mass media including Science Daily, R&D magazine, and EurekAlert (sponsored by AAAS). Dr. Huan's research has been supported by NSF, NIH, DoD, and the University of Kansas.

The tutorial presentation slides can be downloaded here. (updated on Nov. 05, 2014)

Tutorial 4: Big Data Benchmarking

PDF

Presenters

Chaitan Baru and Tilmann Rabl

Summary

This tutorial will introduce the audience to the broad set of issues involved in defining big data benchmarks, for creating auditable industry-standard benchmarks that consider performance as well as price/performance. Big data benchmarks must capture the essential characteristics of big data applications and systems, including heterogeneous data, e.g. structured, semi-structured, unstructured, graphs, and streams; large-scale and evolving system configurations; varying system loads; processing pipelines that progressively transform data; workloads that include queries as well as data mining and machine learning operations and algorithms. Different benchmarking approaches will be introduced, from micro-benchmarks to application-level benchmarking.
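To make the price/performance idea concrete, here is a minimal sketch that ranks systems by cost per unit of throughput. The function name, the system names, and all figures are hypothetical; real industry-standard benchmarks define their own pricing and throughput rules, which are not reproduced here.

```python
def price_performance(total_system_cost, throughput):
    """Return cost per unit of measured throughput; lower is better."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return total_system_cost / throughput


# Hypothetical systems: (name, total cost in dollars, measured throughput).
systems = [
    ("small-cluster", 50_000.0, 200.0),
    ("large-cluster", 400_000.0, 1_000.0),
]

# Rank by price/performance rather than by raw throughput alone.
ranked = sorted(systems, key=lambda s: price_performance(s[1], s[2]))
```

In this made-up example the larger cluster has five times the throughput but a worse price/performance ratio, which is why a benchmark that reports both numbers can rank systems differently.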

Since May 2012, five workshops on Big Data Benchmarking have been held, with participation from industry and academia. One of the outcomes of these meetings has been the creation of the industry’s first big data benchmark, viz., TPCx-HS, the Transaction Processing Performance Council’s benchmark for Hadoop Systems. During these workshops, a number of other proposals have been put forward for more comprehensive big data benchmarking. The tutorial will present and discuss salient points and essential features of such benchmarks that have been identified in these meetings by experts in big data as well as benchmarking. Two key approaches are now being pursued: one, called BigBench, is based on extending the TPC Decision Support (TPC-DS) benchmark with big data application characteristics. The other, called Deep Analytics Pipeline, is based on modeling processing that is routinely encountered in real-life big data applications. Both will be discussed.

We conclude with a discussion of a number of future directions for big data benchmarking.

Content

Introduction

 Introduction to benchmarking: What are TPC and SPEC; what is each organization’s role and approach to benchmarking.

 Characteristics of good industry standard benchmarks: Why has TPC-C lasted so long? Brief overview of TPCx-HS.

 Overview of big data benchmarking approaches: From micro-benchmarks to application-level pipelines.

 Applications scenarios and use cases: Big data scenarios and use cases that help define the application-level benchmark.

BigBench: In-depth discussion of an example, proposed big data benchmark

 Data generation: Synthetic data generation for big data.

 The Benchmarking process: Steps involved in setting up, executing, and verifying end-to-end benchmarks.

 Benchmark metrics: Existing metrics for industry standards, and appropriate metrics for big data benchmarks.

 Discussion of performance results on a small 6-node cluster at Intel and a large, 540-node cluster at Pivotal.

 Possible extensions to BigBench

Benchmarking challenges and future directions

 Modeling system failures in the benchmark; extrapolating from one scale factor to the next; benchmarking for new application scenarios, e.g. the Internet of Things.

Q&A.

Short Profile

Dr. Chaitan Baru and Dr. Tilmann Rabl's Profile

The tutorial presenters are Dr. Chaitan Baru from the San Diego Supercomputer Center, UC San Diego, and Dr. Tilmann Rabl from the Middleware Systems Research Group, University of Toronto.

Dr. Baru and Dr. Rabl have been collaborating since 2012 on the topic of big data benchmarking. They were both instrumental in starting the Workshops on Big Data Benchmarking series, and serve on the Steering Committee for these workshops. Five workshops have been held so far, in May 2012 (San Jose), December 2012 (Pune, India), July 2013 (Xi’an, China), October 2013 (San Jose), and August 2014 (Potsdam, Germany). They are co-editors of the Springer-Verlag Lecture Notes in Computer Science series on Specifying Big Data Benchmarks. They have also co-authored three papers:

  • Big Data Benchmarking and the BigData Top100 List, C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, T. Rabl, Big Data Journal, Vol.1, No.1, Mary Ann Liebert Inc. Publishers, http://online.liebertpub.com/toc/big/1/1.
  • Setting the Direction for Big Data Benchmark Standards, C. Baru, M. Bhandarkar, R. Nambiar, M. Poess, T. Rabl, TPC Technical Conference, VLDB 2012, Aug 27-30, Istanbul, Turkey. http://link.springer.com/chapter/10....642-36727-4_14.
  • Discussion of BigBench: A Proposed Industry Standard Performance Benchmark for Big Data, Baru, Bhandarkar, Curino, Danisch, Frank, Gowda, Jacobsen, Jie, Kumar, Nambiar, Poess, Raab, Rabl, Ravi, Sachs, Sen, Yi, Youn, Proceedings of the TPC Technical Conference, VLDB 2014, September, Hangzhou, China.

Furthermore, Dr. Rabl is the Chair of the recently formed SPEC Research Group on Big Data Benchmarking, and Dr. Baru is co-Chair. Thus, even though the tutorial instructors are from different institutions, they have worked together closely for several years, and have a continuing working relationship.

The tutorial presentation slides can be downloaded here.

Special Session

My Note: Need Paper PDF or just Abstract?

Special Session 1: From Data to Insight: Big Data and Analytics for Smart Manufacturing Systems

Overview

Smart manufacturing systems require capabilities and technologies for designing and improving overall system performance through diagnostic and prognostic assessment based on (big) data analytics. The right insight derived from big data can lead to the right actions for enhancing sustainability, productivity, flexibility, and competitive advantage, and for enabling sustainable and agile manufacturing.

This special session invites contributions from different engineering disciplines to bring out common issues and specific challenges that address predictive modeling for Smart Manufacturing Systems. An initial set of topics includes (but is not limited to): Manufacturing Key Performance Indicators (KPIs); Methods, Technologies, and IT Infrastructure for Big Data and Data Analytics; Standards and Protocols; and Business Best Practices for Smart Manufacturing Systems.

Figure 1. The concept and main topics of the special session

Organizers

  • Dr. Sudarsan Rachuri, Program Manager of Smart Manufacturing Systems Design and Analysis, National Institute of Standards and Technology
  • Dr. Juyeon Lee, IT Converged Process R&BD Group, Korea Institute of Industrial Technology
  • Prof. Kincho H. Law, Engineering Informatics, Civil and Environmental Engineering, Stanford University

Papers

Toward Smart Manufacturing Using Decision Analytics

(Page 967) DtoI208.pdf
Alexander Brodsky, Mohan Krishnamoorthy, Daniel Menasce, Guodong Shao, and Sudarsan Rachuri

The work represented here was partially funded through cooperative agreement #70NANB12H277 between George Mason University and the National Institute of Standards and Technology (NIST). The authors thank Kevin Lyons and Sharon Kemmerer from NIST for their valuable comments and suggestions which helped improve the paper.

Abstract—This paper is focused on decision analytics for smart manufacturing. We consider temporal manufacturing processes with stochastic throughput and inventories. We demonstrate the use of the recently proposed concept of the decision guidance analytics language to perform monitoring, analysis, planning, and execution tasks. To support these tasks we define the structure of and develop modular reusable process component models, which represent data, decision/control variables, computation of functions, constraints, and uncertainty. The tasks are then implemented by posing declarative queries of the decision guidance analytics language for data manipulation, what-if prediction analysis, decision optimization, and machine learning.

An Intelligent Machine Monitoring System Using Gaussian Process Regression for Energy Prediction

(Page 978) DtoI203.pdf
Raunak Bhinge, Jinkyoo Park, Nishant Biswas, Moneer Helu, David Dornfeld, Kincho Law, and Sudarsan Rachuri

The authors acknowledge the support in part of the Smart Manufacturing Systems Design and Analysis Program at the National Institute of Standards and Technology (NIST), Grant Numbers 70NANB12H225 and 70NANB12H273 awarded to University of California, Berkeley, and to Stanford University respectively. In addition, the authors appreciate the support of the Machine Tool Technologies Research Foundation (MTTRF) and System Insights for the equipment used in this research. Certain commercial systems are identified in this paper. Such identification does not imply recommendation or endorsement by NIST; nor does it imply that the products identified are necessarily the best available for the purpose.

Abstract—Recent advances in machine automation and sensing technology offer new opportunities for continuous condition monitoring of an operating machine. This paper describes an intelligent machine monitoring framework that integrates and utilizes data collection, management, and analytics to derive an adaptive predictive model for the energy usage of a milling machine. This model is designed using a Gaussian Process (GP) regression algorithm, which is a flexible regression method that also provides an uncertainty estimate. To improve computational efficiency, we propose a Collective Gaussian Process (CGP) in which the overall energy prediction is made by constructing local GP models weighted by probability distribution functions obtained using the Gaussian Mixture Model (GMM) technique. Finally, we demonstrate the ability of the proposed monitoring framework to construct an energy prediction model to predict the energy used to machine a part.
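The point that GP regression returns an uncertainty estimate along with the prediction can be seen in a minimal one-dimensional sketch. This is plain GP regression with a squared-exponential kernel, not the paper's Collective Gaussian Process; the function names, hyperparameter values, and toy data are assumptions made for illustration.

```python
import math


def rbf(a, b, length=1.0, signal=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return signal ** 2 * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_predict(xs, ys, x_new, noise=1e-3):
    """Return the GP posterior mean and variance at x_new (zero prior mean)."""
    n = len(xs)
    # Kernel matrix of the training inputs plus observation-noise variance.
    k_train = [[rbf(xs[i], xs[j]) + (noise ** 2 if i == j else 0.0)
                for j in range(n)] for i in range(n)]
    k_star = [rbf(x, x_new) for x in xs]
    alpha = solve(k_train, ys)       # (K + noise^2 I)^-1 y
    weights = solve(k_train, k_star)  # (K + noise^2 I)^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    variance = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, weights))
    return mean, max(variance, 0.0)


# Toy data standing in for (machining parameter, energy) observations.
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 0.8, 0.9, 0.1]
mean_near, var_near = gp_predict(train_x, train_y, 1.0)
mean_far, var_far = gp_predict(train_x, train_y, 100.0)
```

Near an observed input the predictive variance collapses toward the noise level, while far from the data it returns to the prior variance. That behavior is what lets a monitoring system flag predictions it should not trust.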

Towards a Domain-Specific Framework for Predictive Analytics in Manufacturing

(Page 987) DtoI205.pdf
David Lechevalier, Anantha Narayanan, and Sudarsan Rachuri

The work represented here was partially funded through a cooperative agreement #70NANB13H159 between the University of Maryland and NIST.

Abstract— Data analytics is proving to be very useful for achieving productivity gains in manufacturing. Predictive analytics (using advanced machine learning) is particularly valuable in manufacturing, as it leads to production improvement with respect to the cost, quantity, quality and sustainability of manufactured products by anticipating changes to the manufacturing system states. Many small and medium manufacturers do not have the infrastructure, technical capability or financial means to take advantage of predictive analytics. A domain-specific language and framework for performing predictive analytics for manufacturing and production frameworks can counter this deficiency. In this paper, we survey some of the applications of predictive analytics in manufacturing and we discuss the challenges that need to be addressed. Then, we propose a core set of abstractions and a domain-specific framework for applying predictive analytics on manufacturing applications. Such a framework will allow manufacturers to take advantage of predictive analytics to improve their production.

Uncertainty Quantification in Performance Evaluation of Manufacturing Processes

(Page 996) DtoI206.pdf
Saideep Nannapaneni and Sankaran Mahadevan

The research reported in this paper was supported by funds from the National Institute of Standards and Technology (Cooperative Agreement No. 70NANB14H036, Project Monitor: Dr. Sudarsan Rachuri). The mathematical models for the illustrative examples were provided by NIST staff (Systems Integration Division, Engineering Laboratory). The support is gratefully acknowledged.

Abstract –– This paper proposes a systematic framework using Bayesian networks to integrate all the available information for uncertainty quantification (UQ) in the performance evaluation of a manufacturing process. Energy consumption, one of the key metrics of sustainability, is used to illustrate the proposed methodology. The evaluation of energy consumption is not straight-forward due to the presence of uncertainties in different variables in the process and occurring at different stages in the process. Both aleatory and epistemic sources of uncertainty are considered in the UQ methodology. A dimension reduction approach through variance-based global sensitivity analysis is proposed to reduce the number of variables in the system and facilitate scalability to high-dimensional problems. The proposed methodologies for uncertainty quantification and dimension reduction are demonstrated using two examples – an injection molding process and a welding process.
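Variance-based global sensitivity analysis ranks inputs by the share of output variance each one explains, so that low-ranked inputs can be fixed or dropped. The sketch below estimates first-order Sobol indices with a Jansen-type pick-freeze Monte Carlo estimator. It is a generic illustration, not the paper's procedure: the function name, the assumption of independent inputs uniform on [0, 1), and the linear toy model are all hypothetical.

```python
import random


def first_order_sobol(model, n_inputs, n_samples=20000, seed=0):
    """Estimate first-order Sobol indices for independent uniform [0, 1) inputs."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    ya = [model(row) for row in a]
    yb = [model(row) for row in b]
    # Total output variance, estimated from both sample sets pooled together.
    pooled = ya + yb
    mean = sum(pooled) / len(pooled)
    variance = sum((y - mean) ** 2 for y in pooled) / len(pooled)
    indices = []
    for i in range(n_inputs):
        # Rows of A with column i taken from B: they share only input i with B.
        y_mixed = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # Jansen-type estimator: V_i = Var(Y) - mean((f(B) - f(A_B^i))^2) / 2.
        half_msd = sum((u - v) ** 2 for u, v in zip(yb, y_mixed)) / (2 * n_samples)
        indices.append((variance - half_msd) / variance)
    return indices


# Toy model: the first input carries 80% of the variance, the second 20%.
sobol = first_order_sobol(lambda x: 4.0 * x[0] + 2.0 * x[1], n_inputs=2)
```

For this linear toy model the exact indices are 0.8 and 0.2, so the estimates can be checked against a known answer before the same routine is applied to a process model whose sensitivities are unknown.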

CloudMan: A Platform for Portable Cloud Manufacturing Services

(Page 1006) DtoI204.pdf
Soheil Qanbari, Samira Mahdi Zadeh, Soroush Vedaie, and Schahram Dustdar

The research leading to these results is sponsored by the Doctoral College of Adaptive Distributed Systems at the Vienna University of Technology as well as the Pacific Controls Cloud Computing Lab (PC3L), a joint lab between Pacific Controls, Dubai, and the Distributed Systems Group at the Vienna University of Technology.

Abstract—Cloud manufacturing refers to an “as a Service” production model that exploits on-demand access to a distributed pool of diversified manufacturing services and resources. It forms elastic and reconfigurable production lines, which enhance efficiency by allowing optimal resource allocation in response to demand changes and market dynamics. This paper studies these challenges and proposes a portable cloud manufacturing platform, entitled “CloudMan”, aiming at achieving a portable deployment of cloud manufacturing services to any compliant distributed production line in the cloud. The stakeholders of CloudMan are detailed together with their API requirements, in which each stakeholder has an interest. Having this rigorous analysis in mind, we present a holistic architecture for CloudMan, as it considers the manufacturing data, material and event flow from sensors and shop floors, through services to end products. In architecting such a platform, there is a lack of an agreed standard for the portability and orchestration of manufacturing services, as well as their definition. The proposed platform incorporates OASIS Topology and Orchestration Specification for Cloud Applications (TOSCA) policies, plans and templates as a mechanism for dynamic configuration, portability and deployment of manufacturing services across multiple collaborating manufacturers. Thereby, the architecture provides a set of abstraction levels for various types of manufacturing services, which encapsulate and address specific requirements to satisfy the needs of stakeholders.

Building a Rigorous Foundation for Performance Assurance Assessment Techniques for “Smart” Manufacturing Systems

(Page 1015) DtoI201.pdf
Utpal Roy, Yunpeng Li, and Bicheng Zhu

The work presented in this paper is part of an ongoing project at Syracuse University (SU), funded by the National Institute of Standards and Technology (NIST). No approval or endorsement of any commercial products by the National Institute of Standards and Technology (NIST) is intended or implied.

Abstract – The highly networked and real-time data analysis features of smart manufacturing systems (SMS) require different information infrastructure, data analytics technology, and performance assurance methodologies. The main purpose of this paper is to (i) explore the complete product-process performance assurance space to identify the key performance indicators that help evaluate and quantify system performance at different abstraction levels, (ii) discuss models and methodologies for data analytics, and (iii) suggest a digital factory-based simulation technique to evaluate those key indicators for performance prediction. The paper presents a systematic and rigorous approach towards establishing these performance assurance methodologies applicable to complex value chains of smart manufacturing systems by extensively exploring all possible product and process related performance issues. A hypercube information model is proposed for the purpose of formal representation of the highly dimensional and correlated information among different actors in a smart manufacturing system, thus providing a rigorous foundation for the performance assurance space. The relevant taxonomy and an ontology-based framework are then developed for formal representation of the entities, activities and knowledge involved in the performance assurance domain. It provides a detailed insight into the PA space and defines appropriate measures which can be applied to predict and improve system performance assurances.

A System Architecture for Manufacturing Process Analysis based on Big Data and Process Mining Techniques

(Page 1024) DtoI207.pdf
Hanna Yang, Minjeong Park, Minsu Cho, Minseok Song, and Seongjoo Kim

This work was supported by the Industrial Core Technology Development program (10045047) and Technology Diffusion Platform for C-MES project (10033159) funded by the Ministry of Trade, Industry and Energy (MOTIE, Korea).

Abstract— Interest in manufacturing process management and analysis is increasing, but it is difficult to conduct process analysis due to the increase of manufacturing data. Therefore, we suggest a manufacturing data analysis system that collects event logs from so-called big data and analyzes the collected logs with process mining. There are two kinds of big data generated from manufacturing processes: structured data and unstructured data. Usually, manufacturing process analysis is conducted using only structured data; however, the proposed system uses both structured and unstructured data to enhance the process analysis results. The system automatically discovers a process model and conducts various performance analyses on the manufacturing processes.
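A common first step in discovering a process model from event logs is to count which activity directly follows which within each case. The sketch below builds such a directly-follows count from a tiny log. It is a generic process-mining illustration, not the system described in the paper; the (case_id, timestamp, activity) schema, the function name, and the lot and activity names are hypothetical.

```python
from collections import Counter, defaultdict


def directly_follows(event_log):
    """Count how often activity a is immediately followed by b within a case.

    event_log: iterable of (case_id, timestamp, activity) tuples
    (a hypothetical schema chosen for this example).
    """
    traces = defaultdict(list)
    for case_id, timestamp, activity in event_log:
        traces[case_id].append((timestamp, activity))
    counts = Counter()
    for events in traces.values():
        # Order each case by timestamp, then count consecutive activity pairs.
        ordered = [activity for _, activity in sorted(events)]
        counts.update(zip(ordered, ordered[1:]))
    return counts


# Toy manufacturing log; lot-2 arrives out of order, lot-3 skips welding.
log = [
    ("lot-1", 1, "cut"), ("lot-1", 2, "weld"), ("lot-1", 3, "inspect"),
    ("lot-2", 1, "cut"), ("lot-2", 3, "inspect"), ("lot-2", 2, "weld"),
    ("lot-3", 1, "cut"), ("lot-3", 2, "inspect"),
]
graph = directly_follows(log)
```

The resulting counts form the edges of a directly-follows graph. The single cut-to-inspect edge stands out against the dominant cut, weld, inspect path, which is the kind of deviation a process analyst would then investigate.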

Special Session 2: Special Session on Big Data Representation and Processing in Data Science

Overview

Goal: The ultimate long-term goal of this special session is to uncover the underlying computing paradigms, models, or theories that achieve high “velocity” processing of large “volume” and “variety” of data without the compromise of “veracity” and “value”.

Big data are everywhere! This special session focuses on learning from the practices of real world engineering applications, in particular for multimodal data such as biomedical, video, and some traditional textual data coupled with metadata. Multimodal data are data of multiple modes or “variety,” that is, they contain heterogeneous signals and/or come from heterogeneous sources. Participants should propose solutions and approaches to the analysis and/or design of data with heterogeneity in the context of big data. 

Related Topics

  • Knowledge management and mining
  • Big data analytics: data visualization, statistical and exploratory analytics
  • Quantitative and qualitative uncertainty
  • Approximation: approximate retrieval, reasoning and proof
  • Parallel computing theory in Big Data, e.g., MapReduce applications for Data Science
  • Granular and rough computing for heterogeneous data

Organizers

  • Asmi Shah, PhD (MGH Visiting Scientist, Harvard Medical School)
  • Brendan Jou (Electrical Engineering, Columbia University)
  • Program Committee
    • Wesley Chu, PhD (Computer Science, UCLA)
    • Liangliang Cao, PhD (Research Staff Member, IBM T. J. Watson Research)
    • Tomoyuki Higuchi, Ph.D (The Institute of Statistical Mathematics)
    • Howard Ho, PhD (Manager, IBM Almaden Research Center)
    • Urban Liebel, PhD (CTO, Acquifer AG, Karlsruhe Institute of Technology)
    • Shusaku Tsumoto, MD, PhD (Medicine, Shimane University)
    • S. Felix Wu, PhD (Computer Science, UC Davis)
    • Ying Xie, PhD (Computer Science, Kennesaw State University)
    • Justin Zhan, PhD (Computer Science, North Carolina A&T State University)
  • Session Chair
    • T. Y. Lin, PhD (Computer Science, San Jose State University)

Papers

Researching Persons & Organizations AWAKE: From Text to an Entity-Centric Knowledge Base

(Page 1030) BDRP206.pdf
Elizabeth Boschee, Marjorie Freedman, Saurabh Khanwalker, Anoop Kumar, Amit Srivastava, and Ralph Weischedel

This paper is based upon work supported by the DARPA DEFT Program. The views expressed are those of the author(s) and do not reflect the official policy or position of the Department of Defense or the U.S. Government. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The authors would also like to thank Alex Zamanian, Manaj Srivastava, Connor Stokes, Artan Sameqi, Fred Choi, Jonathan Watson, Paul Martin, and Ray Tomlinson.

Abstract—We describe a pilot experiment building a capability to automatically read documents, develop a knowledge base, support analytics, and visualize the information found. The capability allows someone researching a topic of interest to focus on analysis and synthesis rather than on reading. We show how information from multiple modalities (speech, text, structured databases) and multiple approaches (ontology driven and open information extraction) can be fused to create a resource about both previously known and novel entities. We describe an extensible framework for language understanding tools that allows for scalability, plug-and-play of alternative components, and incorporation of additional input streams, including video, images, and foreign language text.

Integrating Existing Large Scale Medical Laboratory Data Into the Semantic Web Framework

(Page 1040) BDRP210.pdf
Newres Al Haider, Samina Abidi, William van Woensel, and Syed SR Abidi

The original relational dataset that we use consists of 7 tables. These contain information related to the patient, his or her address and doctor, as well as the encounter in which they were referred to a laboratory test. With regard to the direct lab data, information about the lab order as well as the results of the lab tests is given. The personal patient information in these datasets was anonymized to protect the patients’ privacy. Table I contains information about the various tables and columns of the original database, as well as the number of records in each table.

Abstract—Semantic Web technologies have been shown to have great potential in many different domains, to facilitate knowledge representation, exchange and reasoning, in a formal and yet both human and machine understandable way. In particular, within the health domain, they enable knowledge integration and understanding by explicitly defining and linking concepts and relationships using ontologies to information within clinical knowledge bases. This additional metadata also allows for automated decision support and semantic based analytics to be implemented, that facilitate improved healthcare at a lower cost. Unfortunately many existing datasets in healthcare environments are still stored in relational databases, as opposed to using semantic technologies. Due to this, the link with explicit metadata is often lacking or non-existent. Furthermore, both the databases and the clinical terminologies can be considerably large, making the mapping and subsequent uses of the information a difficult process. In a full-fledged decision support system the level and accuracy of the mapping can greatly influence the effectiveness of any subsequent analysis and decision support tasks. This is especially true in clinical scenarios, where very large and complex sets of terms need to be mapped to relational databases. In this paper we aim to provide a general approach for interlinking relational data with clinical ontology based metadata that allows for a fine-grained evaluation, with respect to the mapping’s impact on analytics. We evaluate our approach by mapping information from clinical terminologies, such as SNOMED CT, to a large laboratory dataset contained in a relational database, with the goal of creating a full-fledged, semantically enabled, analytics and decision support system.

Path Knowledge Discovery: Association Mining Based on Multi-Category Lexicons

(Page 1049) BDRP213.pdf
Chen Liu, Wesley W Chu, Fred Sabb, D. Stott Parker, and Joseph Korpela

This work was supported by USPHS grants, including the NIH Roadmap Initiative Consortium for Neuropsychiatric Phenomics and including linked awards UL1DE019580, RL1LM009833. We thank Jianming He, Ying Wang, Jiacheng Yang, Jianwen Zhou, Xiuming Chen and Jiajun Lu of the CoBase research group in the UCLA Computer Science Department for their work on the PhenoMining tools and PhenoWiki+ implementation. We would also like to thank Professors Robert Bilder, Carrie Bearden, and Joseph Ventura from the Consortium for Neuropsychiatric Phenomics for testing of the tools and for their stimulating discussions during the development of this work.

Abstract—Transdisciplinary research is a rapidly expanding part of science and engineering, demanding new methods for connecting results across fields. In biomedicine for example, modeling complex biological systems requires linking knowledge across multiple levels of science, from genes to disease. The move to multilevel research requires new strategies; in this paper we present path knowledge discovery, a novel methodology for linking published research findings. Path knowledge discovery consists of two integral tasks: 1) association path mining among concepts in a multipart lexicon that crosses disciplines, and 2) fine-granularity knowledge-based content retrieval along the path(s) to permit deeper analysis. Implementing this methodology has required development of innovative measures of association strength for pairwise associations, as well as the strength for sequences of associations, in addition to powerful lexicon-based association expansion to increase the scope of matching. In our discussions, we describe the validation of the methodology using a published heritability study from cognition research, and we obtain comparable results. We show how path knowledge discovery can greatly reduce a domain expert’s time (by several orders of magnitude) when searching and gathering knowledge from the published literature, and can facilitate derivation of interpretable results.

Stochastic Finite Automata for the Translation of DNA to Protein

(Page 1060) BDRP202.pdf
Tsau-Young Lin and Asmi Shah

The authors would like to thank Nikhil Kalantri, Parnika Achrekar, Qun Yu, Yue Lu, and Ahmad Marzdasht Yazdankhah for their contributions towards improving the software in general. The results presented here are based on these accumulated improvements.

Abstract—The use of Statistical Finite Automata (SFA) has been explored in the field of understanding DNA sequences; many studies focus on local patterns, namely partial representations of DNA sequences. In this paper, we focus on global and complete representations to understand the patterns in whole DNA sequences. Obviously, DNA sequences are not random. Based on Kolmogorov complexity theory, there should be some simple Turing machines that write out such sequences; here simple means the complexity of the Turing machine is lower than that of the data. The primary goal of this paper is to approximate such simple Turing machines by SFA. We use SFA, via the ALERGIA algorithm (in the light of granular computing), to capture and analyze the translation process (DNA to protein) based on the amino acids’ chemical property, viz., polarity. This, in turn, enables the understanding of interspecies DNA comparisons and the creation of phylogeny – the ‘tree of life’.

Data mining and sharing tool for high content screening large scale biological image data

(Page 1068) BDRP215.pdf
Asmi Shah, Gayathri Gopalakrishnan, Adithya Rajendran, and Urban Liebel

The authors thank Christine Wittmann, Clemens Grabher, and Jochen Gehrig for providing data. This work was supported by BOLD – Marie Curie Initial Training Network [238821 to A.H.S.].

Abstract—The constantly developing high content and high throughput screening (HCS & HTS) microscopy and next generation sequencing technologies routinely produce experiment datasets in the terabyte (TB) or petabyte (PB) range, resulting in millions of data files that can vary from simple numbers to signals and images. If the collaborators working on the same project are spread over large geographical distances, data sharing, interactive visualization and collaborative annotation techniques become important determinants of the success of a research project. On the other hand, there are hundreds of bioinformatic and cheminformatic databases, billions of documents in available literature, and many image based biological repositories, which need to be referred to simultaneously to make sense of the acquired data. To draw conclusions from such increasingly complex and large scale data sources, the scientific community must be provided with simple to use methods to retrieve, analyze, visualize, annotate, and crosslink these data sources on a common platform in an efficient manner. However, these modern biological experiments and the subsequent analyses are completed with the use of an array of different software suites and automated tools. Constant feedback from the experimenter is needed to change experimental paradigms for the follow-up experiments. Such a software platform to address the HCS data in these aspects does not exist yet, to the best of our knowledge. We have developed a simple to use software package called “AskMe” for users to publish their large scale biological experiment data on the web by use of data mining and visualization concepts. With AskMe, scientists can share these HCS datasets easily with their collaborators or make them publicly accessible to the whole scientific community. From the initial stages of experiments, AskMe can ease the experimental analysis process by mining data and providing useful visualizations. Moreover, integration and crosslinks to other databases also allow easy evaluation of the data generated. By these principles, we bring the tools to the data and make the data access transparent to the users without any capacity tradeoff.

A Building Performance Evaluation & Visualization System

(Page 1077) BDRP203.pdf
Georgios Stavropoulos, Stelios Krinidis, Dimosthenis Ioannidis, Konstantinos Moustakas, and Dimitris Tzovaras

This work was supported by the EU funded Adapt4EE ICT STREP (FP7-288150).

Abstract—This paper presents a novel big data knowledge processing and mining system for building performance evaluation that utilizes visual analytics. A large dataset comprising building information, energy consumption, environmental measurements, human presence and behavior, and business processes is exploited for the building performance evaluation. Building performance evaluation is one of the most important factors in engineering that leads to building renovation and construction with low energy consumption and gas emissions in conjunction with comfort, utility and durability. For this purpose, business processes occurring in the building are correlated with the energy consumption and the human flows in the spatiotemporal domain, modeling the dynamic behavior of the building. These models lead to the extraction of useful semantic information and the detection of spatiotemporal patterns that are important for the evaluation of the building performance. Furthermore, a number of novel visual analytics techniques allow the end-users to process data in different temporal resolutions and with different temporal filters, helping them detect patterns that may be difficult to detect otherwise. The proposed visual analytics techniques support design and energy management decisions by visualizing the building measurements regarding business and comfort aspects. To do so, the proposed system includes a variety of techniques and components, selected to offer quick identification of focal points and evaluation of the building performance. Considering the increasing interest and the green building goals of almost all world governments, including the EU, the suggested methodology and application could be a very useful tool for the Architecture and Engineering community working on Building Performance Simulation and Analysis, and all related communities in the Architecture, Engineering and Construction (AEC) industry.

Statistical Technique for Online Anomaly Detection Using Spark Over Heterogeneous Data from Multi-source VMware Performance Data

(Page 1086) BDRP204.pdf
Mohiuddin Solaimani, Mohammed Iftekhar, Latifur Khan, and Bhavani Thuraisingham

This material is based upon work supported by NSF Award No. CNS 1229652 and NSF Award No. DUE 1129435.

Abstract—Anomaly detection refers to the identification of patterns in a dataset that do not conform to expected patterns. Depending on the domain, the non-conformant patterns are assigned various tags, e.g., anomalies, outliers, exceptions, malware, and so forth. Online anomaly detection aims to detect anomalies in data flowing in a streaming fashion. Such stream data is commonplace in today’s cloud data centers, which house a large array of virtual machines (VMs) producing vast amounts of performance data in real time. A sophisticated detection mechanism will likely entail collation of data from heterogeneous sources with diversified data formats and semantics. Therefore, detection of performance anomalies in this context requires a distributed framework with high throughput and low latency. Apache Spark is one such framework that represents the bleeding edge amongst its contemporaries. In this paper, we have taken up the challenge of anomaly detection in VMware-based cloud data centers. We have employed a Chi-square based statistical anomaly detection technique in Spark. We have demonstrated how to take advantage of the high processing power of Spark to perform anomaly detection on heterogeneous data using statistical techniques. Our approach is optimally designed to cope with the heterogeneity of input data streams, and the experiments we conducted testify to its efficacy in online anomaly detection.
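
The abstract names a Chi-square based statistical test but gives no details, so the following is only a minimal single-machine sketch of the general idea, not the authors' Spark implementation. It assumes performance readings have already been bucketed into categories; the bucket names, baseline probabilities, and sample windows are hypothetical. A window is flagged when its Pearson chi-square statistic against the baseline exceeds the critical value for the chosen significance level.

```python
from collections import Counter


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over the expected categories."""
    return sum((observed.get(k, 0) - e) ** 2 / e for k, e in expected.items())


def is_anomalous(window, baseline_probs, critical_value):
    """Flag a window whose category counts deviate from the baseline distribution.

    Categories that appear in the window but not in the baseline are ignored
    (a simplification made for this sketch).
    """
    counts = Counter(window)
    n = len(window)
    expected = {k: p * n for k, p in baseline_probs.items()}
    return chi_square_statistic(counts, expected) > critical_value


# Hypothetical CPU-load buckets for one VM; the probabilities are illustrative only.
BASELINE = {"low": 0.6, "medium": 0.3, "high": 0.1}
# Chi-square critical value for 2 degrees of freedom at alpha = 0.05.
CRITICAL_DF2_ALPHA_05 = 5.991

normal_window = ["low"] * 6 + ["medium"] * 3 + ["high"] * 1
spike_window = ["low"] * 2 + ["medium"] * 2 + ["high"] * 6

if __name__ == "__main__":
    print(is_anomalous(normal_window, BASELINE, CRITICAL_DF2_ALPHA_05))
    print(is_anomalous(spike_window, BASELINE, CRITICAL_DF2_ALPHA_05))
```

In a streaming setting the same test would be applied to each incoming window, which is the part a framework such as Spark would distribute.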

Extracting Discriminative Shapelets from Heterogeneous Sensor Data

(Page 1095) BDRP208.pdf
Om Patri, Abhishek Sharma, Haifeng Chen, Guofei Jiang, Anand Panangadan, and Viktor Prasanna

Abstract—We study the problem of identifying discriminative features in Big Data arising from heterogeneous sensors. We highlight the heterogeneity in sensor data from engineering applications and the challenges involved in automatically extracting only the most interesting features from large datasets. We formulate this problem as that of classification of multivariate time series and design shapelet-based algorithms for this task. We design a novel approach, called Shapelet Forests (SF), which combines shapelet extraction with feature selection. We evaluate our proposed method against other approaches for mining shapelets from multivariate time series using data from real-world engineering applications. Quantitative analysis of the experiments shows that SF performs better than the baseline approaches and achieves high classification accuracy. In addition, the method enables identification of noisy sensors from multivariate data and discounts their use for classification.
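
For readers unfamiliar with shapelets, the sketch below shows only the basic primitive that shapelet methods are commonly built on: the distance from a candidate shapelet to a time series, taken as the minimum Euclidean distance over all subsequences of the same length. It is not the Shapelet Forests algorithm from the paper; the function name and sample values are hypothetical, and normalization of subsequences is omitted for brevity.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any
    same-length contiguous subsequence of the series."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        dist = math.sqrt(
            sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        )
        best = min(best, dist)
    return best


if __name__ == "__main__":
    # Illustrative readings from one sensor; the bump [1, 2, 1] occurs exactly once.
    readings = [0, 0, 1, 2, 1, 0]
    print(shapelet_distance(readings, [1, 2, 1]))
```

With multivariate data, one such distance per sensor and shapelet can serve as a feature for a downstream classifier, which is where a feature selection step would come in.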

Workshops

My Note: These came from the thumb drive of files

1. Challenges and Issues on Scholarly Big Data Discovery and Collaboration (SBD 2014)

Ingemar J. Cox ingemar.cox@di.ku.dk

Introduction to workshop

Academics and researchers worldwide continue to produce large numbers of scholarly documents, including papers, books, and technical reports, and associated data such as tutorials, proposals, lab notebooks, and course materials. For example, PubMed has over 20 million documents, 10 million unique names, and 70 million name mentions. Google Scholar has recently been estimated to contain about 100 million documents. Besides these traditional databases, there are also online social reference sites such as CiteULike or BibSonomy. This enables researchers to study scholarly collaboration at a very large scale. For example, researchers who are interested in scholarly collaboration can apply data mining techniques (including text mining, network mining, and social network analysis) within a distributed framework such as Hadoop to discover patterns and structures of collaboration, and potentially develop better tools for searching and recommending academic information. Collaboration among scholars is recognized as a feature in scientific discovery. The ever increasing diversity of disciplines and complexity of research problems, particularly in multi-disciplinary research, require collaboration. Besides the traditional venues of collaboration where scholars typically meet annually at conferences or meetings, the Internet provides a wide range of platforms for scholars to engage with other scholars. These new platforms include academic sites such as Academia.edu, ResearchGate and Mendeley, more interactive social sites such as Twitter and Facebook, and Wiki-style virtual collaboration sites. These services allow scholars to share academic resources, exchange opinions, follow each other’s research, keep up with current research trends, and most importantly, build their professional networks. Researchers, with encouragement from funding agencies, increasingly realize that scholarly achievements should not merely be the final published articles. The datasets used in a study and many other intermediary results are equally important for supporting research. Therefore, a set of rapidly developing research topics, such as research data management, data curation/stewardship, and data sharing policy, are becoming important issues for research communities.

Research topics included in the workshop

  • Access and Analytical tools for studying scholarly discovery and collaboration.
  • Online platforms for scholarly discovery and collaboration.
  • Data curation and management for scholarly discovery and collaboration.
  • Integration of all aspects of scholarly discovery and collaboration

Organizers

  • Cornelia Caragea, University of North Texas, USA
  • Ingemar J. Cox, University of Copenhagen, Denmark
  • Sujatha Das G. Institute for Infocomm Research, A*STAR, Singapore
  • C. Lee Giles, Pennsylvania State University, USA
  • Daqing He, University of Pittsburgh, USA
  • Madian Khabsa, Pennsylvania State University, USA
  • Yu-Ru Lin, University of Pittsburgh, USA
  • Christina Lioma, University of Copenhagen, Denmark
  • Natasa Milic-Frayling, Microsoft Research Cambridge, UK

Papers

Why Name Ambiguity Resolution Matters for Scholarly Big Data Research (Page 1)
Jinseok Kim

Evolution of Scientific Collaboration Networks (Page 7)
Gaurav Madaan

The OceanLink Project (Page 14)
Tom Narock

Managing the Academic Data Lifecycle: A Case Study of HPCC (Page 22)
Michael Payne, Linh Ngo, Flavio Villanustre, and Amy Apon

​Sponsors

2. The 2nd Workshop on Scalable Machine Learning: Theory and Applications

Zenglin Xu zenglin@gmail.com 

Introduction to workshop

Big data are encountered in various areas, including Internet search, social networks, finance, business sectors, meteorology, genomics, complex physics simulations, and biological and environmental research. Machine learning, an important tool for big data analytics, is playing an increasingly important role in the big data era. However, the characteristics of large volume, high velocity, variety and veracity bring challenges to current machine learning techniques. It is therefore desirable to discuss (1) how to scale up existing machine learning techniques for modeling and analyzing big data from various domains; (2) how to design new machine learning algorithms for various parallel/distributed machine learning platforms (such as Hadoop, GraphLab, and Spark); and (3) how to design universal machine learning interfaces for GPUs or cloud computing architectures.

Research topics included in the workshop

  • Distributed data analytics architectures
  • Theory and algorithms of data reduction techniques for big data
  • Theory and algorithms of large-scale matrix approximation
  • Heterogeneous learning on big multimodal data
  • Temporal analysis and spatial analysis in big data
  • Scalable machine learning in large graphs
  • Novel applications of scalable machine learning in big data

Organizers

Program Committee Members

  • Rafal A. Angryk, Georgia State University 
  • Polo Chau, Georgia Institute of Technology 
  • Qirong Ho, Carnegie Mellon University
  • Yi Fang, Santa Clara University 
  • Fengjun Li, Kansas University 
  • Weike Pan, Shenzhen University 
  • Shuo Xiang, Arizona State University 
  • Feng Yan, Facebook Inc. 
  • Dan Zhang, Facebook Inc. 
  • Xinhua Zhang, NICTA 
  • Bin Zhao, Carnegie Mellon University
  • Shandian Zhe, Purdue University 
  • Jiayu Zhou, Samsung Research America

Papers

Scalable Big Data Computing for the Personalization of Machine Learned Models and its Application to Automatic Speech Recognition Service (Page 1)
Jong Hoon Ahnn

Computing Fuzzy Rough Approximations in Large Scale Information Systems (Page 9)
Hasan Asfoor, Rajagopalan Srinivasan, Gayathri Vasudevan, Nele Verbiest, Chris Cornelis, Matthew Tolentino, Ankur Teredesai, and Martine De Cock

Fast Learning for Big Data Applications using Parameterized Multilayer Perceptron (Page 17)
Chandra B and Rajesh Kumar Sharma

Calculating Feature Importance in Data Streams with Concept Drift using Online Random Forest (Page 23)
Andrew Cassidy and Frank Deviney

Towards Scalable Graph Computation on Mobile Devices (Page 29)
Yiqi Chen, Zhiyuan Lin, Robert Pienta, Minsuk Kahng, and Duen Horng Chau

Boosting Stochastic Newton Descent for Bigdata Mining and Classification (Page 36)
Roberto D'Ambrosio, Wafa Belhajali, and Michel Barlaud

Feature Selection for Text Clustering in Limited Memory Using Monte Carlo Wrapper (Page 42)
Vinay Deolalikar

WS^2F: A Weakly Supervised Framework for Data Stream Filtering (Page 50)
Cailing Dong and Arvind Agarwal

An Improved Memory Management Scheme for Large Scale Graph Computing Engine GraphChi (Page 58)
Yifang Jiang, Kai Chen, Yi Zhou, Diao Zhang, Qu Zhou, and Jianhua He

Fast Algorithm for Computing Weighted Projection Quantiles and Data Depth for High-Dimensional Large Data Clouds (Page 64)
Ujjal Mukherjee and Ansu Chatterjee

FS^3: A sampling based method for top-k frequent subgraph mining (Page 72)
Tanay Kumar Saha and Mohammad Hasan

A Clustering Based Scalable Hybrid Approach for Web Page Recommendation (Page 80)
Mohammad Sharif and Vijay Raghavan

Multiresolution analysis of incomplete rankings with applications to prediction (Page 88)
Eric Sibony, Stéphan Clémençon, and Jérémie Jakubowicz

Pairwise Topic Model via Relation Extraction (Page 96)
Xiaoli Song, Yue Shang, Yuan Ling, Mengwen Liu, and Xiaohua Hu

A Multi-View Two-level Classification Method for Generalized Multi-instance Problems (Page 104)
Xiaoguang Wang, Xuan Liu, Stan Matwin, Nathalie Japkowicz, and Hongyu Guo

Applying Instance-weighted Support Vector Machines to Class Imbalanced Datasets (Page 112)
Xiaoguang Wang, Xuan Liu, Stan Matwin, and Nathalie Japkowicz

3. 1st International Workshop on High Performance Big Graph Data Management, Analysis, and Mining

Fengguang Song fgsong@cs.iupui.edu

Introduction to workshop

Modern Big Data increasingly appears in the form of complex graphs and networks. Examples include the physical Internet, the world wide web, online social networks, phone networks, and biological networks. In addition to their massive sizes, these graphs are dynamic, noisy, and sometimes transient. They also conform to all five Vs (Volume, Velocity, Variety, Value and Veracity) that define Big Data. However, many graph-related problems are computationally difficult, and thus big graph data brings unique challenges, as well as numerous opportunities for researchers, to solve various problems that are significant to our communities. Big graph problems are currently solved using several complementary paradigms. The most popular approach is perhaps by exploiting parallelism, through specialized algorithms for supercomputers, shared-memory multicore and manycore systems, and heterogeneous CPU-GPU systems. However, since real-world graphs are sparse and highly irregular, there are very few parallel implementations that can actually deliver high performance. The major challenges to scaling and efficiency include irregular data dependencies, poor locality, and high synchronization costs of current approaches. In addition to parallelism, researchers are developing approximation algorithms that use sampling for compressing and summarizing graph data. Streaming algorithms are also being considered for scenarios where the rate of updates is too fast to process the entire graph in a single pass. Further, out-of-core algorithms are necessary for massive graphs that do not fit in the main memory of a typical system. Researchers can use graph-based solutions for solving problems from many diverse disciplines, including routing and transportation, social networks, bioinformatics, computational science, health care, security and intelligence analysis. 
This workshop aims to bring together researchers from different paradigms solving big graph problems under a unified platform for sharing their work and exchanging ideas. We are soliciting novel and original research contributions related to big graph data management, analysis, and mining (algorithms, software systems, applications, best practices, performance). Significant work-in-progress papers are also encouraged. 

Research topics included in the workshop

  • Parallel algorithms for big graph analysis on HPC systems 
  • Heterogeneous CPU-GPU solutions to solve big graph problems 
  • Extreme-scale computing for large graph, tensor, and network problems 
  • Sampling and summarization of large graphs 
  • Graph algorithms for large-scale scientific computing problems 
  • Graph clustering, partitioning, and classification methods 
  • Scalable graph topology measurement: diameter approximation, eigenvalues, triangle and graphlet counting 
  • Parallel algorithms for computing graph kernels 
  • Inference on Large graph data 
  • Graph evolution and dynamic graph models 
  • Graph databases, novel querying and indexing strategies for RDF data 
  • Novel applications of big graph problems in bioinformatics, health care, security, and social networks 
  • New software systems and runtime systems for big graph data mining 
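
As a small illustration of one topology measurement named in the list above, triangle counting, here is a minimal serial sketch using adjacency sets. It is not taken from any workshop paper; the function name and example graph are hypothetical, and node labels are assumed to be mutually comparable (for example, integers). Parallel, sampled, or out-of-core variants build on the same neighbor-intersection idea.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs.

    Each triangle {u, v, w} is counted once by enforcing u < v < w.
    Self-loops are skipped; duplicate edges collapse in the adjacency sets.
    """
    adjacency = {}
    for u, v in edges:
        if u == v:
            continue
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    total = 0
    for u, neighbors in adjacency.items():
        for v in neighbors:
            if u < v:
                # Common neighbors of u and v that are larger than v close a triangle.
                total += sum(1 for w in neighbors & adjacency[v] if w > v)
    return total


if __name__ == "__main__":
    # Complete graph on four nodes: every choice of three nodes forms a triangle.
    k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(count_triangles(k4))
```

The set intersections touch neighbor lists in an irregular order, which is a simple instance of the poor locality the workshop introduction mentions.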

Organizers

Workshop Organizers

  • Mohammad Al Hasan, Indiana University - Purdue University 
  • Kamesh Madduri, The Pennsylvania State University 
  • Fengguang Song, Indiana University - Purdue University 

Program Committee Members

  • Nesreen Ahmed, Purdue University 
  • Medha Atre, University of Pennsylvania 
  • Mohammad Al Hasan, Indiana University Purdue University
  • Kamesh Madduri, Pennsylvania State University
  • David Mizell, YarcData / Cray, Inc.
  • Xia Ning, NEC Laboratories
  • Siva Rajamanickam, Sandia National Lab
  • Saeed Salem, North Dakota State University
  • Manu Shantharam, San Diego Supercomputer Center
  • Fengguang Song, Indiana University Purdue University
  • Guangming Tan, Chinese Academy of Sciences, China
  • Chen Tian, Huawei Technologies USA
  • Stanimire Tomov, University of Tennessee Knoxville
  • Jeff Vetter, Oak Ridge National Laboratory and Georgia Tech
  • Daniel Waddington, Samsung Research America
  • Qian (Jane) You, Amazon
  • Mohammed J. Zaki, Rensselaer Polytechnic Institute

Papers

Connecting the dots: Triangle completion and related problems on large data sets using GPUs (Page 1)
Amlan Chatterjee, Sridhar Radhakrishnan, and Chandra N. Sekharan

Park: An Efficient Algorithm of k-core Decomposition on Multicore Processors (Page 9)
Naga Shailaja Dasari, Ranjan Desh, and Zubair M

A Partitioning Approach to Scaling Anomaly Detection in Graph Streams (Page 17)
William Eberle and Lawrence Holder

Fractional Greedy and Partial Restreaming Partitioning: New Methods for Massive Graph Partitioning (Page 25)
Ghizlane Echbarthi and Hamamache Kheddouci

Global Graphs: A Middleware for Large Scale Graph Processing (Page 33)
S M Faisal, Srinivasan Parthasarathy, and P Sadayappan

Toward an Efficient, Highly Scalable Maximum Clique Solver for Massive Graphs (Page 41)
Ronald Hagan, Charles Phillips, Kai Wang, Gary Rogers, and Michael Langston

Extending SPARQL with graph functions (Page 46)
David Mizell, Kristyn Maschhoff, and Steve Reinhardt

Change Detection in Temporally Evolving Computer Networks: A Big Data Framework (Page 54)
Josephine Namayanja and Vandana Janeja

Detecting Communities Around Seed Nodes in Complex Networks (Page 62)
Christian L. Staudt, Henning Meyerhenke, and Yassine Marrakchi

Access-averse Framework for Computing Low-rank Matrix Approximations (Page 70)
Ichitaro Yamazaki, Theo Mary, Jakub Kurzak, Stanimire Tomov, and Jack Dongarra

Architecture-Aware Graph Repartitioning for Data-Intensive Scientific Computing (Page 78)
Angen Zheng, Alexandros Labrinidis, and Panos Chrysanthis

6. The Second Workshop on Distributed Storage Systems and Coding for Big Data 

Zhu Bing zhubing@sz.pku.edu.cn

Introduction to workshop

Mass storage is critical in the era of Big Data. Dispersing a huge data file across a large-scale distributed storage system is necessary in order to enhance reliability and availability. By introducing redundancy in the system, we can protect data integrity against node failures. As node failures occur frequently in large-scale storage systems, a considerable volume of Internet traffic is dedicated to the repair of failed storage nodes. Several classes of distributed storage codes, such as regenerating codes and locally repairable codes, have been introduced recently to reduce this overhead and disk input/output cost. Nevertheless, substantial research work remains for advancing distributed storage coding and systems in both theory and applications. This workshop will provide an excellent platform for computer systems researchers and data scientists to exchange ideas and experience on what coding techniques and distributed storage systems can offer to big data applications, and to understand the challenges that we need to tackle to realize their full potential.
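
To make the redundancy and repair idea concrete, the sketch below uses the simplest erasure code, a single XOR parity block, as a stand-in. It is not a regenerating code or a locally repairable code, and the function names and sample blocks are hypothetical. It assumes all blocks have equal length and that at most one block is lost.

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(result):
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


def encode(data_blocks):
    """Return the data blocks followed by one XOR parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def repair(stored_blocks, lost_index):
    """Rebuild the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stored_blocks) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    # Three data blocks, each imagined to live on a different storage node.
    stored = encode([b"big ", b"data", b"2014"])
    print(repair(stored, 1))  # rebuilds the second data block
```

Note that repairing a single lost block here reads every surviving block. That read and transfer cost is the repair overhead which, according to the introduction above, regenerating codes and locally repairable codes aim to reduce.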

Research topics included in the workshop

  • Cloud storage and distributed storage system
  • Erasure Codes for Big Data
  • Cloud computing systems for big data applications
  • Cloud software and hardware support for big data
  • Distributed I/O (wide-area, grid, peer-to-peer)
  • Experience and empirical evaluation of deployed systems
  • Solid-state drive (e.g., flash, PCM) in large-scale storage
  • RAID and erasure coding
  • Repair bandwidth and regenerating codes
  • Locally repairable codes
  • Storage management and security
  • Power-aware storage architectures and technologies
  • File system design
  • Deduplication
  • Key-value and NoSQL storage
  • Memory-only storage systems
  • Reliability, availability, and disaster recovery
  • Scalable resource management for big data
  • Big data in private and public Clouds

Organizers

Program Chairs

  • Shane Canon, Lawrence Berkeley National Laboratory 
  • Ian Foster, Argonne National Laboratory 
  • Chaitanya Baru, San Diego Supercomputer Center 

Program Committee Members

  • Rajdeep Bhowmik, Cisco Systems, Inc. 
  • Reagan Moore, University of North Carolina 
  • Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory 
  • Galen Shipman, Oak Ridge National Laboratory 
  • Dantong Yu, Brookhaven National Laboratory 

Papers

A New Zigzag MDS Code with Optimal Encoding and Efficient Decoding (Page 1)
Jun Chen, Hui Li, Hanxu Hou, Bing Zhu, Tai Zhou, Lijia Lu, and Yumeng Zhang

Parity Declustering for Fault-Tolerant Storage Systems via t-designs (Page 7)
Son Hoang Dau, Yan Jia, Chao Jin, Weiya Xi, and Kheong Sann Chan

An Efficient Scheme to Ensure Data Availability for a Cloud Service Provider (Page 15)
Seungmin Kang, Bharadwaj Veeravalli, Khin Mi Mi Aung, and Chao Jin

A C Library of Repair-Efficient Erasure Codes for Distributed Data Storage Systems (Page 21)
Chao Tian

ReCT: Improving MapReduce Performance under Failures with Resilient Checkpointing Tactics (Page 27)
Hao Wang, Haopeng Chen, and Fei Hu

STORE: Data Recovery with Approximate Minimum Network Bandwidth and Disk I/O in Distributed Storage Systems (Page 33)
Tai Zhou, Hui Li, Bing Zhu, Yumeng Zhang, Hanxu Hou, and Jun Chen

7. First IEEE International Workshop on Big Data Security and Privacy (BDSP 2014)

Tyrone W A Grandison tgrandison@proficiencylabs.com

Introduction to workshop

Big Data is characterized by the integration of a significant amount of data, of varying modalities or types, at a pace that cannot be handled by traditional data management systems. This has sparked innovation in the collection, processing and storage of this data. The analytic systems built to leverage Big Data have yielded (and hold even greater promise to uncover) remarkable insights that enable a host of new applications that were not thought possible prior to the era of Big Data. However, with this capacity to contribute to and benefit the greater good comes the responsibility to protect the subjects referenced in the data sets. In this context, the old adage is correct: “With great power comes great responsibility.” Ultimately, the data subjects own the data and they stand to suffer most significantly from the data’s compromise. Thus, there need to be advances in techniques for 1) ingesting Big Data in a secure and privacy-preserving manner, 2) performing Big Data analysis in a secure environment and in a privacy-preserving manner, and 3) storing and enforcing retention policy securely (and in private modes) for Big Data systems. If these solutions are not in place, then the willingness of people to contribute their data to be included in a Big Data system decreases. Additionally, Big Data professionals need to perform risk analyses, as they relate to security and privacy, to get a realistic view of the safety of the landscape. There is a lot of work to be done in this emerging field. This workshop is a venue for researchers and practitioners to come together and tackle these challenges in a supportive and stimulating environment.

Research topics included in the workshop

  • Security and/or Privacy Technologies for collecting, processing and storing Big Data. 
  • Theoretical Foundations of Security and/or Privacy of Big Data software, protocols, systems and infrastructure. 
  • Security and/or Privacy analysis of Big Data software, protocols, systems and infrastructure. 
  • Big Data Forensic Analysis. 
  • Trust Management theory and software for Big Data. 
  • Accountability Theory and Technologies for Big Data. 
  • Integrating Legal considerations in Big Data Security and Privacy solutions and technologies. 
  • User studies on Security and Privacy for Big Data software, protocols, systems and infrastructure. 
  • Usable Security and Privacy for Big Data software, protocols, systems and infrastructure. 

Organizers

Program Chairs

  • Shane Canon, Lawrence Berkeley National Laboratory 
  • Ian Foster, Argonne National Laboratory 
  • Chaitanya Baru, San Diego Supercomputer Center 

Program Committee Members

  • Rajdeep Bhowmik, Cisco Systems, Inc. 
  • Reagan Moore, University of North Carolina 
  • Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory 
  • Galen Shipman, Oak Ridge National Laboratory 
  • Dantong Yu, Brookhaven National Laboratory 

Papers

Privacy-aware Filter-based Feature Selection (Page 1)
Yasser Jafer, Stan Matwin, and Marina Sokolova

Secure Data Storage in Distributed Cloud Environments (Page 6)
Renata Jordão, Valério Martins, Fábio Buiati, Rafael Timóteo de Sousa Júnior, and Flávio Elias de Deus

Location Prediction Attacks Using Tensor Factorization and Optimal Defenses (Page 13)
Takao Murakami and Hajime Watanabe

8. The 2nd Workshop on Big Data in Bioinformatics and Health Care Informatics

Jun Huan jhuan@ku.edu

Introduction to workshop

In a recent McKinsey & Co. study, it is claimed that the healthcare industry in the U.S. alone could potentially save $450 billion a year with the help of advanced analytics, but healthcare organizations continue to struggle with managing and leveraging the vast stores of data they are building up. By 2011, U.S. healthcare organizations had generated 150 exabytes (that is, 150 billion gigabytes) of data. A leading provider such as Kaiser Permanente alone might have as much as 44 petabytes of patient data just from its Electronic Health Record (EHR) system, or 4,400 times the amount of information held at the Library of Congress. Add to this the insurance sector, the independent laboratories, and individual health records, and the number is increasingly astounding both in terms of volume and variety of data sources. There is a common theme emerging in the healthcare industry: big data enables unprecedented opportunity for aggregation and integration, leading to cost-effective and improved patient care. It is the aim of this workshop to bring together big data practitioners, researchers, students, clinicians, health IT experts, and data scientists to share ideas on how to improve the state of our healthcare systems by delivering on the promise of big data infrastructure investments. We observe the same theme of data-driven science in biological and biomedical research. For example, a next-generation sequencing experiment may easily generate terabytes of raw data. In biological and biomedical imaging, large volumes of data are generated. How to store, archive, index, manage, learn, mine, and visualize those data is clearly a challenge to the research community.

Research topics included in the workshop

Bioinformatics and Biomedical Informatics:

  • Next-generation sequencing (NGS) data storage and analysis 
  • Large scale biological network construction and learning 
  • Population-based bioinformatics 
  • Genome structural change detection 
  • Large-scale bio-image and medical-image analysis 
  • Big data in molecular simulation and protein structure prediction 
  • Big data in systems biology 
  • Big data in precision medicine and stratified medicine 
  • Big data in drug discovery, development, and post-market surveillance 
  • Big data in semantics and bio-text mining 

Healthcare Systems: 

  • Real-time aspects of healthcare data infrastructure 
  • Security and privacy for clinical data in big data infrastructures 
  • Health IT implementations and demonstrations 
  • Case studies for healthcare analysis in distributed environments 
  • Benchmarking of big data infrastructure in healthcare 
  • Novel data analysis algorithms that enable integrated discovery of knowledge from structured and unstructured Electronic Medical Records (EMR) 
  • Analysis and visualizing for summarizing large patient data in EMRs 
  • Novel algorithms and applications dealing with noisy, incomplete, but large EMR data 
  • Integrating genomic data in today’s medicine to improve human health 
  • Data science and modeling for health analysis 
  • Advances in new storage models for data variety (records, images, Magnetic Resonance Imaging (MRI), scans) for hospitals 
  • Big data challenges in accountable care settings 
  • Extracting meaning from multi-structured big data in real time to improve outcome 
  • Combining information from imaging (RIS, PACS), Electronic Health Records (EHR), laboratories, genomics to give coherent diagnosis and treatment 
  • Leveraging social networks for data aggregation 
  • Smart visualizations for big data streams 
  • Analysis of big data from home monitoring devices 
  • Design patterns and anti-patterns for development of solutions for big data 

Analysis of Big Medical Data: 

  • Real-time analysis of big medical data in the course of precision medicine 
  • Analysis of longitudinal and time-series data to discover new correlations 
  • Co-registration of patient data acquired over several time-points in their life 
  • Identification of important metadata that has to be tracked over a longitudinal duration 
  • Software platforms for enabling easy access to the patient’s medical and clinical history 
  • Gap-handling in history-taking 
  • Quality improvement and noise-handling on longitudinal data 
  • Missing functionality in current clinical decision support systems using longitudinal data 

Organizers

Program Chairs

  • Shane Canon, Lawrence Berkeley National Laboratory 
  • Ian Foster, Argonne National Laboratory 
  • Chaitanya Baru, San Diego Supercomputer Center 

Program Committee Members

  • Rajdeep Bhowmik, Cisco Systems, Inc. 
  • Reagan Moore, University of North Carolina 
  • Lavanya Ramakrishnan, Lawrence Berkeley National Laboratory 
  • Galen Shipman, Oak Ridge National Laboratory 
  • Dantong Yu, Brookhaven National Laboratory

Papers

Information Gateway for Integrated Pharmacogenomics Data – IGIPD(Page 1)
Pavan Kumar A, Janaki Chintalapati, Neeharika N, Payal Saluja, Mangala N, and Prahlada Rao B.B

Predicting a Biological Response of Molecules from Their Chemical Properties Using Diverse and Optimized Ensembles of Stochastic Gradient Boosting Machine(Page 10)
Tarek Abdunabi and Otman Basir

Understanding the Effects of Concussion using Big Data(Page 18)
Jesus Caban, Gerard Riedy, Terry Oakes, Geoff Grammer, and Thomas DeGraba

Duplicate Drug Discovery Using Hadoop(Page 24)
Shao Hua Cheng, Yu Shian Chiu, Shih Yao Dai, and Hui-I Hsiao

Towards Integrating the Detection of Genetic Variants into an In-Memory Database(Page 27)
Cindy Fähnrich, Matthieu-P. Schapranow, and Hasso Plattner

A General Supervised Approach to Segmentation of Clinical Texts(Page 33)
Kavita Ganesan and Michael Subotin

A Fast and Memory-Efficient Algorithm for Learning and Retrieval of Phenotypic Dynamics in Multivariate Cohort Time Series(Page 41)
Shamim Nemati and Mohammad Ghassemi

Big Data in Genomics: An Overview(Page 45)
Adhiraaj Sethi, Raghunath Nambiar, and Ruchie Bhardwaj

Protective Effects of Rheumatoid Arthritis in Septic ICU Patients(Page 50)
Mallory Sheth, Abdullah Chahin, Roger Mark, and Natasha Markuzon

Workload Characterization for MG-RAST Metagenomic Data Analytics Service in the Cloud(Page 56)
Wei Tang, Jared Bischof, Narayan Desai, Kanak Mahadik, Wolfgang Gerlach, Travis Harrison, Andreas Wilke, and Folker Meyer

TIDE: Inter-chromosomal Translocation and Insertion Detection using Embeddings(Page 64)
Rosanne Vetro, Roshanak Farhoodi, Nurit Haspel, David Weisman, Jennifer Rosen, and Dan Simovici

Low Redundancy Feature Selection with Grouped Variables and Its Application to Healthcare Data(Page 71)
Hang Wu, Ji-jiang Yang, and Jianqiang Li

Pharmacological Class Data Representation in the Web Ontology Language (OWL)(Page 74)
Qian Zhu and Cui Tao

9. 1st Workshop on Management, Search and Mining of Massive Repositories of Solar Astronomy Data (SABID'14)

Rafal Angryk  angryk@cs.gsu.edu

Introduction to workshop

With the launch of NASA’s Solar Dynamics Observatory (SDO) mission on 02/11/2010, researchers in solar physics have entered the era of Big Data. The Atmospheric Imaging Assembly (AIA) instrument on SDO provides imaging data and the Helioseismic and Magnetic Imager (HMI) instrument on SDO provides magnetic field data. Both instruments record data at a high spatial resolution and time cadence, amounting to about 1 Petabyte of scientific data each year. The Big Data challenges in Solar Astronomy are expected to grow even further with the launch of the NSF-funded Daniel K. Inouye Solar Telescope (DKIST), currently under construction in Hawaii. This telescope is expected to generate 3-5 Petabytes of data per year.

Research topics included in the workshop

Foundations for Solar Astronomy Big Data Management: 

  • New Computational Models for Storage, Distribution and Processing of Solar Astronomy Data 
  • Evaluation of Information Quality for the Solar Astronomy Data from Telescopes, as well as Derived Data Products (Meta-Data)
  • New Scientific Standards for Solar Information Processing and Information Quality Evaluation 
  • System Architectures, Design and Deployment of Solar Data Archives, Portals and Services 
  • Data Management and Stream Computing for Solar Astronomy Data in Cloud and Distributed Environments 
  • Integration of Heterogeneous Solar Information from Multiple Data Repositories 

Solar Data Retrieval, Visualization, Recognition and Data-Driven Analyses: 

  • New Computational Models for Search, Retrieval, and Analysis of Solar Astronomy Data 
  • Scalable Algorithms and Systems for Solar Activity Recognition (e.g. Computer Vision) from Solar Big Data Repositories 
  • Solar Astronomy Data Search Architectures, their Scalability, Efficiency, and Real-life Usefulness, Visualization Tools for Large-scale Solar Astronomy Data 
  • Computational Modeling and Solar Data Integration 
  • Cloud, Distributed, and Stream Data Mining for Big Velocity Solar Astronomy Data 
  • Semantic-based Solar Data Mining for Big Variety Solar Information 
  • Multimedia, Multi-structured, and Spatiotemporal Solar Data Mining 
  • New Programming Models for Solar Data, including and beyond Hadoop and MapReduce 

Computer Applications related to Solar Astronomy Big Data (Big Value in Solar Data): 

  • Complex Solar Weather Applications in Science, Engineering, Education, Navigation, Power Grids, and Telecommunication for Government, Public and Private Industry Sectors

  • Real-life Case Studies of Big Value Creation through Solar Data Analytics (e.g. Space Weather) 
  • Experiences with Big Data Project Deployments in Solar Physics 
  • Solar Astronomy Data and Knowledge Distribution in the Social Web 

Organizers

Workshop Co-Chairs

  • Rafal A. Angryk, Georgia State University 
  • Piet C. Martens, Montana State University 

Program Committee Members

  • Rafal Angryk, Georgia State University 
  • Juan Banda, Stanford University 
  • Laura Boucheron, New Mexico State University 
  • André Csillaghy, University of Applied Sciences North Western Switzerland 
  • Veronique Delouille, Royal Belgium Observatory in Brussels 
  • Olac Fuentes, University of Texas at El Paso 
  • Karthik Ganesan Pillai, Cerner Corporation 
  • Joseph Gurman, NASA Goddard Space Flight Center 
  • Russell Hewett, MIT 
  • Jack Ireland, NASA Goddard Space Flight Center 
  • Thomas Lee, University of California-Davis 
  • Piet Martens, Montana State University 
  • Daniel Müller, European Space Agency 
  • Rami Qahwaji, University of Bradford 
  • Kevin Reardon, National Solar Observatory 
  • Pete Riley, Predictive Science Inc. 
  • Nathan Stein, University of Pennsylvania 
  • Hassan Ugail, University of Bradford 
  • Raju Vatsavai, Oak Ridge National Laboratory 
  • Ricardo Vilalta, University of Houston 
  • Tim Wylie, University of Alberta 
  • Alec Young, NASA Goddard Space Flight Center 

Papers

Spatiotemporal Indexing Techniques for Efficiently Mining Spatiotemporal Co-occurrence Patterns(Page 1)
Berkay Aydin, Dustin Kempton, Vijay Akkineni, Shaktidhar Gopavaram, Karthik Ganesan Pillai, and Rafal Angryk

Scalable Solar-Image Retrieval with Lucene(Page 11)
Juan Banda and Rafal Angryk

Stream Mining for Solar Physics: Applications and Implications for Big Solar Data(Page 18)
Karl Battams

A computer vision approach to mining big solar data(Page 27)
Simon Felix and André Csillaghy

Iterative Refinement of Multiple Targets Tracking of Solar Events(Page 36)
Dustin Kempton, Karthik Ganesan Pillai, and Rafal Angryk

Improved data exploitation for DKIST and high-resolution solar observations(Page 45)
Kevin Reardon and Steve Berukoff

Massive Labeled Solar Image Data Benchmarks for Automated Feature Recognition(Page 53)
Michael Schuh and Rafal Angryk

10. Using Big Data to Understand Spatial Connectivity

Clio Andris clio@mit.edu

Introduction to workshop

The revolutionary changes in geo-spatial data and information and communication technologies (ICTs) have enabled the collection and generation of a large amount of data that provide both spatial and non-spatial information. For instance, location-based services, social media, and mobile phone data can provide location information, such as the city of an agent who makes a telephone call. A subset of these datasets contains two sets of geo-located information per record (such as an origin and a destination), which provides evidence of inter-place connectivity through the transfer of goods, humans or information. A focus on the analysis of big connectivity data is important because flows of human movement and information not only tie places together, but also impact the places themselves. These inter-place connections transmit and receive human artifacts such as political opinions, capital, or culture. The presence of trends in these data can help predict future dynamics of flows that can inform infrastructure development for travel and telecommunications. Additionally, connectivity data may be able to tell us more about the nature of social capital over geographic space (such as two tech-industry-focused cities connecting often). Therefore, these data can be used to learn about phenomena within and across the built environment.

Research topics included in the workshop

  • Intercity or international migration flows 
  • Area-to-area commuter flows 
  • International commodity and trade flows 
  • International remittance flows 
  • Geo-located online social network friendship connections 
  • Point-to-point telecommunications flows or friendship connections 
  • Transportation statistics, such as inter-city flight magnitudes 
  • Origins and destinations of out-of-town vacations, holidays and travel 

Organizers

  • Clio Andris, Penn State

Papers

Spatial Data Analysis of Complex Urban Systems(Page 1)
Farideddin Peiravian, Amirhassan Kermanshah, and Sybil Derrible

15. Workshop on Advances in Software and Hardware for Big Data to Knowledge Discovery (ASH 2014)

Weijia Xu xwj@tacc.utexas.edu

Introduction to workshop

Hailed by some as the fourth paradigm in science, data-intensive science has brought a profound transformation to scientific research. Indeed, data-driven discovery has already happened in various research fields, such as earth sciences, medical sciences, biology and physics, to name just a few. It is expected that a vast volume of scientific data captured by new instruments will be publicly accessible for the purposes of continued and deeper data analysis. Big Data analytics will result in the development of many new theories and discoveries but will also require substantial computational resources in the process. However, many domain sciences still mostly rely on traditional experimental paradigms. It is often a major challenge to transform a solution obtained on a standalone server into a massively parallel one running on tens, hundreds, or even thousands of servers. It is crucial to make the latest technology advancements in software and hardware accessible and usable to domain scientists, especially those in fields that traditionally lack computation and programming but have nonetheless become the driving forces of scientific discovery. Fueled by big data analytics needs, new computing and storage technologies are also in rapid development and pushing for new high-end hardware for big data problems. This new hardware brings new opportunities for performance improvement but also new challenges. While those technologies have the potential to greatly improve the capabilities of big data analytics, such potential is often not fully realized. Due to the cost and sophistication of those technologies, and limited initial application support, the new technologies often seem remote to end users and are not fully utilized in academia years after their invention. It is therefore very important to make those technologies understood and accessible by data scientists in a timely manner. 
Meanwhile, comprehensive analytic software packages and programming environments have become increasingly popular as open-source platforms for data analysis. Most data scientists have had experience with small to medium data and are now facing the challenges posed by Big Data. Such software not only provides a collection of analytic methods but also has the potential to utilize new hardware transparently and reduce the effort required of end users. For example, R has traditionally been the programming language preferred by data scientists. Recently members of the R and HPC communities have tried to step up to big data with R, resulting in methods for effectively adapting R to a variety of high-performance and high-throughput computing technologies. Parallel to these developments, a family of software frameworks (e.g., Apache Spark, Airavata) has been developed for executing and managing computational jobs and workflows on distributed computing resources, while providing web-based science gateways to assist domain scientists to compose, manage, execute, and monitor big data applications and workflows composed of these services. This workshop on Advances in Software and Hardware for Big Data to Knowledge Discovery (ASH) aims to connect the latest hardware and software developments with the end users of big data. It focuses on the accessibility and applicability of the latest hardware and software to practical domain problems and hence directly facilitates domain researchers' data-driven discovery. The issues in discussion include performance evaluation, optimizations, accessibility and usability of new technologies. The participants will consist of computer scientists, domain users, service providers, as well as technology inventors in industry. The constituents of the workshop will advance direct and productive communication between cyber-infrastructure specialists and data scientists who normally work separately. 

Research topics included in the workshop

  • Adopting the latest hardware technology for Big Data analytics 
  • Application and use cases in using cyber-infrastructure for Big Data in sciences and engineering 
  • Performance tuning with new hardware infrastructure and software platform 
  • Advances in hardware technology 
  • Novel software platforms and models for big data collection management and analysis 
  • Search and data retrieval on large-scale data sets 
  • Service oriented architectures to enable data science 
  • Science gateway for domain big data research 
  • Big Data and interactive analysis languages (e.g., R, Python, and Matlab) 

Organizers

Workshop Organizers

  • Chair: Weijia Xu, Texas Advanced Computing Center, University of Texas at Austin 
  • Co-Chair: Hui Zhang, IU/Pervasive Technology Institute 

Program Committee Members

  • Eli Collins, Cloudera 
  • Cheqing Jin, East China Normal University 
  • Xu Liu, Rice University 
  • Xiaoyi Lu, Ohio State University 
  • Rui Mao, National High Performance Computing Center at Shenzhen, Shenzhen University 
  • Nirav Merchant, University of Arizona 
  • George Ostrouchov, ORNL/NICS University of Tennessee 
  • Marlon Pierce, Indiana University 
  • Keshav Pingali, University of Texas at Austin 
  • Andrew Purtell, Intel 
  • Smriti Ramakrishnan, Oracle 
  • Ananth Sankaranarayanan, Intel 
  • J. Ray Scott, Pittsburgh Supercomputing Center, Carnegie Mellon University 
  • Li Shen, Indiana University School of Medicine 
  • Raminder Singh, Indiana University 
  • Dan Stanzione, Texas Advanced Computing Center 
  • Steve Wong, Houston Methodist Hospital 
  • Hongfeng Yu, University of Nebraska-Lincoln 

Papers

Human activity recognition in big data smart home context(Page 1)
Sabrina Azzi, Cindy Dallaire, Abdenour Bouzouane, Bruno Bouchard, and Sylvain Giroux

A Visualized Data Analysis for Bogus Business Entities Detection(Page 9)
Hilary Cheng, Yi Chuan Lu, and Chih-Cheng Hsu

Advanced Planning and Control of Manufacturing Processes in Steel Industry through Big Data Analytics(Page 16)
Julian Krumeich, Dirk Werth, Jens Schimmelpfennig, Sven Jacobi, and Peter Loos

An Open Schema for XML Data in Hive (Page 25)
Wuheng Luo, Bo Liu, and Allie Watfa

Parallel and Quantitative Sequential Pattern Mining for Large-scale Interval-based Temporal Data(Page 32)
Guangchen Ruan, Hui Zhang, and Beth Plale

A Big Data Aggregation, Analysis and Exploitation Integrated Platform for Increasing Social Management Intelligence(Page 40)
Greg Sand, Leonidas Tsitouras, George Dimitrakopoulos, and Vassillis Chatzigiannakis

Investigating the Accuracy of the openFDA API using the FDA Adverse Event Reporting System (FAERS)(Page 48)
Jennifer Shin

A CCG Virtual System for Big Data Application Communication Costs Analysis(Page 54)
Yongen Yu, Hongbo Zou, Wei Tang, and Liwei Liu

High-Frequency Financial Statistics with Parallel R and Intel Xeon Phi Coprocessors(Page 61)
Jian Zou and Hui Zhang

16. IEEE Big Data Workshop on Semantics for Big Data on the Internet of Things (SemBIoT 2014)

Kemafor Ogan kogan@ncsu.edu

Introduction to workshop

The internet is rapidly evolving into an “Internet of Things” (IoT) that is estimated to have more than 50 billion connected smart devices (in homes, cars, hospitals, human wearables, etc.) by the year 2020. This number excludes the massive number of “human sensors” that also actively interact on the Internet using social media. These “things” on the Internet will continuously sense their environments, interpret user actions, and instruct their environments on how to react to perceived external events. Example applications abound in patient and elderly care, smart homes, buildings, automobiles, industrial plants, the power grid, etc., which will be monitored and controlled by different types of sensors and actuators. Generation and transmission of data from the large number of devices on the IoT will be continuous, giving rise to amounts of data that completely dwarf our current notion of “Big Data”. In addition to the volume and velocity of data, the wide variety of devices, the heterogeneity of the data they produce, and their geographic distribution add significant complexities to IoT Big Data management. Further, there is the issue of dynamicity because devices may join and leave the network at different times due to connectivity issues or mobility. 

Research topics included in the workshop

  • Techniques for establishing semantic interoperability between a large number of data sources 
  • Reasoning and querying over distributed semantic data streams 
  • Distributed ontological reasoning with massive fact bases 
  • Semantic indexing techniques for discovery and search in dynamic data source environments 
  • Real time large scale stream data mining / analytics 
  • Lightweight metadata modeling for resource-constrained environments 
  • Security and privacy of data on the Internet of Things 
  • Benchmarks 
  • Algorithms and data analysis methods for extracting meaning and value from the Internet of Things 
  • Use case analysis involving the application of social media and Linked Data methodologies in real world scenarios 
  • Data management for targeted IoT applications (e.g. smart health, transportation, disaster response) 
  • Applications and use-cases

Organizers

Workshop Chairs

  • Kemafor Anyanwu, North Carolina State University, USA 
  • Payam Barnaghi, University of Surrey, UK 
  • Rajendra Akerkar, Western Norway Research Institute, Norway 

Program Committee Members

  • Emanuele Della Valle, Politecnico di Milano, Milano, Italy 
  • Stratos Idreos, Harvard University, USA 
  • Axel Ngonga, University of Leipzig, Germany 
  • Frieder Ganz, Adobe, Germany 
  • Raffaele Giaffreda, CREATE-NET, Italy 
  • Axel Polleres, Vienna University of Economics and Business, Austria 
  • Daniel Puschmann, University of Surrey, UK 
  • Prateek Jain, IBM T J Watson Research Center, USA 
  • Krishnaprasad Thirunarayan, Wright State University, USA 
  • Hong-Linh Truong, Vienna University of Technology, Austria 
  • Minh-Son Dao, NICT, Japan 
  • Bin Guo, Northwestern Polytechnical University, China 
  • Alessandra Mileo, INSIGHT Centre for Data Analytics, Ireland 
  • Heri Ramampiaro, NTNU, Trondheim, Norway 
  • Yong Liu, Microsoft, USA 
  • Seyed Amir Hoseini Tabatabai, University of Surrey, UK 
  • Anand Ranganathan, IBM Watson Center, USA 
  • Claudia D'Amato, University of Bari, Italy 
  • Achille Fokoue, IBM T.J. Watson Research Center, USA 
  • Hyeongsik Kim, North Carolina State University, USA 
  • Dumitru Roman, SINTEF, Norway 
  • Mikhail Simonov, Istituto Superiore Mario Boella, Italy 
  • Fulvio Corno, Politecnico di Torino, Italy 
  • Padmashree Ravindra, North Carolina State University, USA 

Papers

Situation Aware Computing for Big Data (Page 1)
Eric Chan, Dieter Gawlick, Adel Ghoneimy, and Zhen Liu

Topic-Specific Post Identification in Microblog Streams(Page 7)
Shanika Karunasekera, Aaron Harwood, Sameendra Samarawickrama, Kotagiri Ramamohanarao, and Garry Robins

Handling smart environment devices, data and services at the semantic level with the FI-WARE core platform(Page 14)
Fano Ramparany, Fermin Galan Marquez, Javier Soriano, and Tarek Elsaleh

An IoT/IoE Enabled Architecture Framework for Precision On Shelf Availability: Enhancing Proactive Shopper Experience(Page 21)
Rajesh Vargheese and Hazim Dahir

17. Big Data in Computational Epidemiology

Jiangzhuo Chen chenj@vbi.vt.edu

Introduction to workshop

Computational epidemiology aims to understand the spread of diseases and efficient strategies to mitigate their outbreak. It studies dynamics in socio-technical systems, where disease spread co-evolves with public health interventions as well as individual behavior. It has evolved from ODE models to networked models which apply agent-based modeling and simulation methodologies. Computation of such high-resolution models involves processing data sets that are massive, disparate, heterogeneous, evolving (at an ever-increasing rate), and potentially unstructured and of varying quality. The workshop brings together researchers from epidemiology, data science, computational science, and health IT domains to tap the potential of emerging technologies in data-intensive computations and analytical processing to advance the state of the art in computational epidemiology. The central theme of how to manage, integrate, analyze, and visualize a vast array of datasets has wider applications in the bio- and physical-simulation and informatics-based sciences such as immunology, high energy physics, and medical informatics.

Research topics included in the workshop

  • Collection and generation of large scale epidemiological datasets 
  • Management, provenance, storage, and archival of surveillance, synthetic, and experiment data 
  • Analytics of spatial, temporal, relational, and semi-structured data 
  • Mining social media data and other online data for public health 
  • Simulation driven statistical methods for knowledge discovery and forecasting 
  • Cloud, streaming, and high performance data intensive science 
  • Semantic web tools, informatics, inference, and integration of public health data 
  • Privacy in the big data era 
  • Agent mining, multi-agent systems, agent based modeling, and behavior modeling in epidemiology 

Organizers

Program co-chairs: Jiangzhuo Chen, Sandeep Gupta.

Program Committee Members

  • Kaja Abbas, Virginia Tech 
  • Feng Chen, University at Albany, State University of New York 
  • Courtney D. Corley, Pacific Northwest National Laboratory 
  • Xizhou Feng, Marquette University 
  • Bryan Lewis, Virginia Tech 
  • Sumiko Mekaru, Boston Children's Hospital 
  • Elaine Nsoesie, Boston Children's Hospital 
  • Shrideep Pallickara, Colorado State University 
  • Naren Ramakrishnan, Virginia Tech

Papers

Learning Machines for Computational Epidemiology(Page 1)
Magnus Boman and Daniel Gillblad

Epidemiological Modeling of Bovine Brucellosis in India(Page 6)
Gloria Kang, L Gunaseelan, and Kaja Abbas

Big data problems on discovering and analyzing causal relationships in epidemiological data(Page 11)
Yiheng Liang and Armin Mikler

Spatial Big Data Analytics of Influenza Epidemic in Vellore, India(Page 19)
Daphne Lopez, M. Gunasekaran, B. Senthil Murugan, Harpreet Kaur, and Kaja Abbas

18. Large Scale Data Analytics in Transportation and Railway Infrastructure (BIGDATA-TRANSPORTATION 2014)

Nii Attoh-Okine  okine@udel.edu

Introduction to workshop

Currently, the increased use of sensors, imaging systems and other emerging techniques in transportation and railway infrastructure testing, monitoring and control is enabling massive amounts of information and data to be generated at unprecedented scales. Data are generated in such volumes that it is very difficult to draw appropriate conclusions. Processing these large data sets, which are on the order of terabytes to petabytes, demands new tools and new algorithmic, probabilistic and statistical techniques. Despite advances in large-scale storage and high computational power, mining and drawing inferences from large-scale data in critical infrastructure poses several challenges: (a) standardization of data format, (b) accurate modeling, (c) clustering and classifying, (d) integrating data from independent sources, (e) uncovering hidden patterns and information, (f) hidden correlation and, finally, (g) interpretation. 

Research topics included in the workshop

  • Data Collection and Management 
  • Designing Big Data Systems 
  • Tools and Technologies 
  • Deep Analytics 
  • Predictive Analytics 
  • Large Scale Computing Platform 
  • Cluster Computing and Cloud Computing 
  • Grid Computing 
  • Heterogeneous Computing 
  • Randomized Algorithms 
  • Mining of Large Data 
  • Tensor Methods and Analysis 
  • Graphical Methods and Analysis 
  • Image Processing Tools 
  • Case Studies 
  • Railway Engineering 
  • Bridge Infrastructure 
  • Monitoring 
  • Pavement Monitoring 
  • Pavement Field Studies 
  • Traffic Studies 
  • ITS

Organizers

Organizing Committee

  • Nii Attoh-Okine, Chair, University of Delaware, USA 
  • Bilal M. Ayyub, University of Maryland, USA 
  • Badiru Adedeji, AFIT, USA 
  • Allan Zarembski, University of Delaware, USA 
  • John Sobanjo, Florida State University, USA 
  • Alexander Appea, Virginia DOT, USA 
  • Hugh Thompson, Federal Railroad Administration, USA

Papers

Multiway Analysis of Bridge Structural Types in the National Bridge Inventory (NBI)(Page 1)
Offei Adarkwa, Thomas Schumacher, and Nii Attoh-Okine

Big Data Challenges in Railway Engineering(Page 7)
Nii Attoh-Okine

Efficient Traffic Speed Forecasting Based on Massive Heterogenous Historical Data(Page 10)
Xing-Yu Chen, Hsing-Kuo Pao, and Yuh-Jye Lee

Topological Models of Document-Query Sets in Retrieval for Enterprise Information Management(Page 18)
Vinay Deolalikar

A Dynamic Programming Approach for 4D Flight Route Optimizations(Page 24)
Christian Kiss-Tóth and Gabor Takacs

Impact Analysis of Extreme Events on Flows in Spatial Networks(Page 29)
Amirhassan Kermanshah, Alireza Karduni, Farideddin Peiravian, and Sybil Derrible

Applications of Linked Data in the Rail Domain(Page 35)
Christopher Morris, John Easton, and Clive Roberts

Metaheuristics in Big Data: An Approach to Railway Engineering(Page 42)
Silvia Galvan Nunez and Nii Attoh-Okine

Facilitating Maintenance Decisions on the Dutch Railways Using Big Data: The ABA Case Study(Page 48)
Alfredo Núñez, Jurjen Hendriks, Zili Li, Bart De Schutter, and Rolf Dollevoet

Spatial Data Analysis of Complex Urban Systems(Page 54)
Farideddin Peiravian, Amirhassan Kermanshah, Sybil Derrible, and Clio Andris

Evaluating Structural Engineering Finite Element Analysis Data Using Multiway Analysis(Page 60)
Matija Radovic and Jennifer McConnell

Multi-Objective Optimization for Resilient Airline Networks Using Socioeconomic-Environmental Data(Page 68)
Hidefumi Sawai and Aki-Hiro Sato

Predicting flight arrival times with a multistage model(Page 78)
Gabor Takacs

Ontology-driven Data Integration for Railway Asset Monitoring Applications(Page 85)
Jonathan Tutcher

Some Examples of Big Data in Railroad Engineering(Page 96)
Allan Zarembski

19. 2nd Workshop on Scalable Cloud Data Management (SCDM 2014)

Felix Gessert felix.gessert@gmail.com

Introduction to workshop

The Second Workshop on Scalable Cloud Data Management (SCDM 2014) is a workshop co-located with the IEEE BigData Conference 2014 and created to tackle the manifold new topics in the area of cloud data management. SCDM 2013 was successfully held in conjunction with IEEE BigData 2013 in Santa Clara. 

Research topics included in the workshop

  • Database as a Service, Multi-tenancy 
  • Elasticity and Scalability for Cloud Data Management Systems 
  • New Protocols, Service Interfaces and Data Models for Cloud Databases 
  • Polyglot Persistence, NoSQL, Schemaless Data Modeling, Integration 
  • Data-Centric Web-Services, RESTful Data Services 
  • Database Architectures for Mobile and Web Clients 
  • Content Delivery Networks, Caching, Load-Balancing, Web-scale workloads 
  • Virtualization for Cloud databases, Storage Structures and Indexing 
  • Frameworks and Systems for Parallel and Distributed Computing 
  • Scalable Machine Learning, Analytics and Data Science 
  • Resource and Workload Management in Cloud Databases 
  • Tunable and Eventual Consistency, Latency 
  • High Availability, Reliability, Failover 
  • Transactional Models for Cloud Databases 
  • Query Languages and Processing, Programming Models 
  • Consistency, Replication and Partitioning 
  • CAP, Data Structures and Algorithms for Eventually Consistent Stores 

Organizers

Program Chairs

  • Norbert Ritter, University of Hamburg 
  • Felix Gessert, University of Hamburg 

Program Committee Members

  • David Carrera, Barcelona Supercomputing Center (BSC), Spain 
  • Rong Chang, IBM T.J. Watson Research Center, USA 
  • Keke Chen, Wright State University, USA 
  • Shiping Chen, CSIRO, Australia 
  • Alfredo Cuzzocrea, ICAR-CNR and University of Calabria, Italy 
  • Ernesto Damiani, Università degli Studi di Milano, Italy 
  • Changyu Dong, University of Strathclyde, UK 
  • Schahram Dustdar, Technical University of Vienna, Austria 
  • Sameh Elnikety, Microsoft Research, USA 
  • Ephraim Feig, Independent Consultant, USA 
  • Nils Gruschka, University of Applied Sciences, Kiel, Germany 
  • Shigeru Hosono, NEC Knowledge Discovery Research Lab., Japan 
  • Ching Hsien (Robert) Hsu, Chung Hua University, Taiwan 
  • Harald Kosch, University of Passau, Germany 
  • Alptekin Küpçü, Koç University, Turkey 
  • Zhiqiang Lin, University of Texas at Dallas, USA 
  • Joe Loyall, BBN Technologies, USA 
  • Gregorio Martinez, University of Murcia, Spain 
  • Sébastien Mosser, Université Nice-Sophia Antipolis, France 
  • Jiannan Ouyang, University of Pittsburgh, USA 
  • Nohhyun Park, University of Minnesota, USA 
  • Peter Pietzuch, Imperial College London, UK 
  • Aravind Prakash, Syracuse University, USA 
  • Smriti R. Ramakrishnan, Oracle, USA 
  • Tilmann Rabl, University of Toronto, Canada 
  • Christoph Reich, Furtwangen Hochschule University, Germany 
  • Sherif Sakr, University of New South Wales, Australia 
  • Holger Schwarz, Universität Stuttgart, Germany 
  • Russell Sears, Microsoft, USA 
  • Andreas Thor, University of Leipzig, Germany 
  • Venkatanathan Varadarajan, University of Wisconsin-Madison, USA 
  • Liqiang Wang, University of Wyoming, USA 
  • Qi Yu, Rochester Institute of Technology, USA 
  • Yuan Yu, Microsoft Research, USA 

Papers

Taking an Electronic Ticketing System to the Cloud: Design and Discussion(Page 1)
Filipe Araujo, Marilia Curado, Pedro Furtado, and Raul Barbosa

A Contention Aware Hybrid Evaluator for Schedulers of Big Data Applications in Computer Clusters(Page 11)
Shouvik Bardhan and Daniel Menasce

RuleMR: Classification Rule Discovery with MapReduce(Page 20)
Vasilis Kolias, Constantinos Kolias, Ioannis Anagnostopoulos, and Eleftherios Kayafas

A Relational Database Schema on the Transactional Key-Value Store Scalaris(Page 29)
Nico Kruber, Florian Schintke, and Michael Berlin

Community Structure Analysis in Big Climate Data(Page 38)
Michael McGuire and Nam Nguyen

The Best of Two Worlds: Integrating IBM InfoSphere Streams with Apache YARN(Page 47)
Zubair Nabi, Rohit Wagle, and Eric Bouillet

Temporal Bipartite Projection and Link Prediction for Online Social Networks(Page 52)
Tsunghan Wu, Sheau-Harn Yu, Wanjiun Liao, and Cheng-Shang Chang

20. Big Humanities Data

Mark Hedges mark.hedges@kcl.ac.uk

Introduction to workshop

This workshop will address applications of “big data” in the humanities, arts, culture, and social science, and the challenges and possibilities that such increased scale brings for scholarship in these areas. The use of computational methods in the humanities is growing rapidly, with the increasing quantities of born-digital primary sources (such as archives of emails and social media) and the large-scale digitisation programmes applied to libraries and archives. This has resulted in a range of experiments with new methodologies and new applications. At the same time, humanities and culture research is itself challenged by interpretative issues raised by applying such data-driven methods for answering humanities research questions. Moreover, the questions and concerns raised by the humanities themselves have consequences for the interpretation in general of “big data” and the uses to which it is put, and the challenges of producing quality – meaning, knowledge and value – from quantity. The workshop will thus also address complementary research that uses the humanities and its methods to provide a critical appraisal of “big data” in other areas, both inside and outside academia. 

Research topics included in the workshop

  • Text- and data-mining of historical and archival material. 
  • Social media analysis, including sentiment analysis 
  • Cultural analytics 
  • Social analytics 
  • Cyber-infrastructures for the humanities (for instance, cloud computing) 
  • NoSQL databases and their applications in the humanities 
  • Big data and the construction of memory and identity 
  • Big data and archival practice 
  • Corpora and collections of big data 
  • Linked Data and Big Data 
  • Constructing big data for research in the humanities

Organizers

Program Chairs

  • Dr. Mark Hedges, Department of Digital Humanities, King’s College London, UK 
  • Dr. Tobias Blanke, Department of Digital Humanities, King’s College London, UK 
  • Prof. Richard Marciano, College of Information Studies – “Maryland’s iSchool”, University of Maryland, USA 

Papers

The DEEP FILM Access Project: Ontology and Metadata Design for Digital Film Production Assets(Page 1)
Sarah Atkinson, Roger Evans, and Jos Lehmann

Mining Microdata: Economic Opportunity and Spatial Mobility in Britain and the United States, 1850-1881(Page 5)
Peter Baskerville, Lisa Dillon, Kris Inwood, Evan Roberts, Steven Ruggles, and John Robert Warren

Mining Mobile Youth Cultures(Page 15)
Tobias Blanke, Giles Greenway, Jennifer Pybus, and Mark Cote

Scientific Findings as Big Data for Research Synthesis: The metaBUS Project(Page 19)
Frank Bosco, Krista Uggerslev, and Piers Steel

Scaling Historical Text Re-uses(Page 28)
Marco Büchler, Emily Franzini, Greta Franzini, and Maria Moritz

Revolutionary Entities: Turning Data into Knowledge to Drive Personalized Exploration of The Irish Rising of 1916(Page 37)
Owen Conlan, Alexander O'Connor, Órla Ní Loinsigh, Gary Munnelly, Séamus Lawless, and Rachel Murphy

Understanding the Role of Medical Experts during a Public Health Crisis: Digital Tools and Library Resources for Research on the 1918 Spanish Influenza(Page 44)
E. Thomas Ewing, Samah Gad, Naren Ramakrishnan, and Jeffrey S. Reznick

A metadata infrastructure for the analysis of parliamentary proceedings(Page 52)
Richard Gartner

Scaled Entity Search: A Method for Media Historiography and Response to Critiques of Big Humanities Data Research(Page 56)
Eric Hoyt, Kit Hughes, Derek Long, Kevin Ponto, and Anthony Tran

On the Coverage of Science in the Media: A Big Data Study on the Impact of the Fukushima Disaster(Page 65)
Thomas Lansdall-Welfare, Saatviga Sudhahar, Giuseppe Veltri, and Nello Cristianini

Integrating Data Mining and Data Management Technologies for Scholarly Inquiry(Page 72)
Ray Larson, Paul Watry, Richard Marciano, John Harrison, Chien-Yi Hou, Luis Aguilar, Shreyas ., and Jerome Fuselier

The Exceptional and the Everyday: 144 Hours in Kiev(Page 77)
Lev Manovich, Alise Tifentale, Mehrdad Yazdani, and Jay Chow

Dealing with Heterogeneous Big Data When Geoparsing Historical Corpora(Page 85)
C.J. Rupp, Paul Rayson, Ian Gregory, Andrew Hardie, Amelia Joulain, and Daniel Hartmann

BigExcel: A Web-Based Framework for Exploring Big Data in Social Sciences(Page 89)
Muhammed Asif Saleem, Blesson Varghese, and Adam Barker

Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution(Page 97)
Xin Wang, Ang Sun, Hakan Kardes, Siddharth Agrawal, Lin Chen, and Andrew Borthwick

A Computational Pipeline for Crowdsourced Transcriptions of Ancient Greek Papyrus Fragments(Page 105)
Alex Williams, John Wallin, Haoyu Yu, Marco Perale, Hyrum Carroll, Anne-Francoise Lamblin, Lucy Fortson, Dirk Obbink, Chris Lintott, and James Brusuelas

21. Complexity for Big Data

Guozhu Dong guozhu.dong@wright.edu

Introduction to workshop

Complexity will lead to both big challenges and opportunities in big data research. Complexity in big data can be caused by many factors, including: a large number of features, which are related to each other through a rich variety of relationships ranging from simple to complicated; a large number of heterogeneous dimensions, which offer a variety of kinds of insights and require different treatments, in addition to a large number of data records; a large variety of heterogeneous types, such as vectors, sequences, (labeled) graphs, images, and multimedia, in addition to a large number of instances; and a rich set of logical, semantic, and ontological relationships. Complexity has often been used to successfully characterize various kinds of complicated subjects, for example in computational complexity, descriptive complexity, and Kolmogorov complexity. Since big data have many complicated parts with intricate relationships to each other, the study of complexity for big data has the potential to be highly successful. Research on complexity of big data needs to consider the following facts, among others: big data can be used for different purposes such as data integration, cross-domain fertilization, and data mining; structures among various parts of big data can be explicit or hidden, and can be described using a rich variety of patterns and models.

Research topics included in the workshop

  • new concepts for complexity for Big Data 
  • new techniques for handling complexity for Big Data 
  • analysis of complex relationships 
  • type logic acquisition 
  • scalability of type logic reasoning 
  • dealing with semantic heterogeneity 
  • schema learning 
  • dimensionality-reduction techniques 
  • query and search over multiple data silos 
  • long-term data preservation 
  • capture and use of provenance information 
  • analysis of similarity and differences among heterogeneous parts 

Organizers

Workshop Organisers

  • Guozhu Dong, Wright State University, USA 
  • Pascal Hitzler, Wright State University, USA 
  • Cliff Joslyn, Pacific Northwest National Laboratory, USA 

Program Committee Members

  • James Bailey, University of Melbourne 
  • Longbing Cao, University of Technology Sydney 
  • Michel Dumontier, Stanford University 
  • Prateek Jain, IBM 
  • Krzysztof Janowicz, University of California, Santa Barbara 
  • Craig Knoblock, University of Southern California 
  • Steffen Lamparter, Siemens 
  • Huan Liu, Arizona State University 
  • Axel Polleres, Vienna University of Economics and Business 
  • Erik Wilde, EMC Corporation 

Papers

“Sustainable Assemblage for Energy (SAE)” inside Intelligent Urban Areas: How massive heterogeneous data could help to reduce energy footprints and promote sustainable practices and an ecological transition(Page 1)
Philippe Calvez and Eddie Soulier

A Model Architecture for Big Data applications using Relational Databases(Page 9)
Erin-Elizabeth Durham, Andrew Rosen, and Robert Harrison

Optimizing Graph Queries with Graph Joins and Sprinkle SPARQL(Page 17)
Eric L. Goodman, Edward Jimenez, Cliff Joslyn, David Haglin, Sinan al-Saffar, and Dirk Grunwald

Vessel Route Anomaly Detection with Hadoop MapReduce(Page 25)
Xuan Liu, Xiaoguang Wang, Bo Liu, and Stan Matwin

HBGSim: A Structural Similarity Measurement over Heterogeneous Big Graph(Page 31)
Jiazhen Nian, Shan Jiang, and Yan Zhang

Knowledge Based Dimensionality Reduction for Technical Text Mining(Page 39)
Walid Shalaby, Wlodek Zadrozny, and Sean Gallagher

A Distributed Instance-weighted SVM Algorithm on Large-scale Imbalanced Datasets(Page 45)
Xiaoguang Wang, Xuan Liu, and Stan Matwin

23. IEEE NIST Big Data PWG Workshop on Big Data: Challenges, Practices and Technologies

Nancy Grady NANCY.W.GRADY@saic.com

Introduction to workshop

There is broad agreement among commercial, academic, and government leaders about the remarkable potential of "Big Data" to spark innovation, fuel commerce, and drive progress. Big Data is the term used to describe the deluge of data in our networked, digitized, sensor-laden, information-driven world. The availability of vast data resources carries the potential to answer questions previously out of reach. Questions like: How do we reliably detect a potential pandemic early enough to intervene? Can we predict new materials with advanced properties before these materials have ever been synthesized? How can we reverse the current advantage of the attacker over the defender in guarding against cybersecurity threats? However, there is also broad agreement on the ability of Big Data to overwhelm traditional approaches. The rate at which data volumes, speeds, and complexity are growing is outpacing scientific and technological advances in data analytics, management, transport, and more. Despite the widespread agreement on the opportunities and current limitations of Big Data, a lack of consensus on some important, fundamental questions is confusing potential users and holding back progress. What are the attributes that define Big Data solutions? How is Big Data different from the traditional data environments and related applications that we have encountered thus far? What are the essential characteristics of Big Data environments? How do these environments integrate with currently deployed architectures? What are the central scientific, technological, and standardization challenges that need to be addressed to accelerate the deployment of robust Big Data solutions?

The candidate list of panels is:

  • Current Practice and Lessons Learned 
  • Technology Landscape 
  • Business and Government Needs 
  • Security and Privacy 
  • The Road Forward 
  • Global Data Sharing and Reuse for Analytics

Organizers

Program Co-Chairs

  • Chaitan Baru, University of California San Diego 
  • Robert Marcus, ET-Strategies 
  • Wo Chang, NIST Information Technology Laboratory (ITL)

Papers

A Layer Based Architecture for Provenance in Big Data(Page 1)
Rajeev Agrawal, Ashiq Imran, Cameron Seay, and Jessie Walker

Sharing Best Practices for the Implementation of Big Data Applications in Government and Science (Page 8)
Joan L. Aron and Brand Niemann See Panel

Big Data: Challenges, Practices and Technologies(Page 11)
Nancy W. Grady, Mark Underwood, Arnab Roy, and Wo L. Chang

Big Data Machine Learning and Graph Analytics: Current State and Future Challenges(Page 16)
H. Howie Huang and Hang Liu

A Standard for Benchmarking Big Data Systems(Page 18)
Raghunath Nambiar

Posters

My Note: This finds the attachments! Need to search for screen capture for each and data set URLs.

Addressing Data Veracity in Big Data Applications

Page 1

Saima Aman, Charalampos Chelmis, and Viktor Prasanna

P230_4754.pdf

Machine Learning and Interactive Visualization applied to TB-sized Images of Stem Cells

Page 4

Julien Amelot, Peter Bajcsy, Anne Plant, and Mary Brady

P222_6003.pdf

Big Data: a new challenge for tourism

Page 5

Gaël Chareyron, Jerome Da-Rugna, and Thomas Raimbault

P223_1967.pdf

OME: A tool for generating and managing metadata to handle BigData

Page 8

Ranjeet Devarakonda, Biva Shrestha, and Giriprakash Palanisamy

P210_3886.pdf

The EMBERS Architecture for Streaming Predictive Analytics

Page 11

Andy Doyle, Graham Katz, Kristen Summers, Chris Ackermann, Ilya Zavorin, Zunsik Lim, Sathappan Muthiah, Liang Zhao, Chang-Tien Lu, Patrick Butler, Rupinder Paul Khandpur, Youssef Fayed, and Naren Ramakrishnan

P237_3482.pdf

A Novel Approach to Determine Docking Locations Using Fuzzy Logic and Shape Determination

Page 14

Erin-Elizabeth Durham, Chinua Umoja, J.T. Torrance, Andrew Rosen, and Robert Harrison

P235_1278.pdf

Linked Open Data Mining for Democratization of Big Data 

Page 17

Roberto Espinosa, Larisa Garriga, Jose Jacobo Zubcoff, and Jose-Norberto Mazon

P229_6096.pdf

Building Wrangler: A Transformational Data Intensive Resource for the Open Science Community

Page 20

Niall Gaffney, Christopher Jordan, Tommy Minyard, and Dan Stanzione

P214_8562.pdf

CELAR: Automated Application Elasticity Platform

Page 23

Ioannis Giannakopoulos

P239_9003.pdf

Semantic HMC for Big Data Analysis

Page 26

Thomas Hassan, Rafael Peixoto, Christophe Cruz, Aurélie Bertaux, and Nuno Silva

P225_3761.pdf

A Layer Based Architecture for Provenance in Big Data

Page 29

Ashiq Imran, Rajeev Agrawal, Jessie Walker, and Anthony Gomes

P232_3097.pdf

B-dIDS: Mining Anomalies in a Big-distributed Intrusion Detection System

Page 32

Vandana P. Janeja, Ali Azari, Josephine Namayanja, and Brian Heilig

P219_2195.pdf

Enabling Genomic Analysis on Federated Clouds

Page 35

Fan Jiang, Michael Shoffner, Claris Castillo, and Charles Schmitt

P216_1944.pdf

Challenges of Data Integration and Interoperability in Big Data

Page 38

Anirudh Kadadi, Rajeev Agrawal, Christopher Nyamful, and Rahman Atiq

P217_6516.pdf

Large Scale Author Name Disambiguation in Digital Libraries

Page 4

Madian Khabsa, Pucktada Treeratpituk, and C. Lee Giles

bigdata14_khabsa.pdf

Real-Time Traffic Incident Detection Using Probe-Car Data on the Tokyo Metropolitan Expressway

Page 43

Akira Kinoshita, Atsuhiro Takasu, and Jun Adachi

P215_8936.pdf

Differentially Private Models of Tollgate Usage in Metropolitan Areas: The Milan Tollgate Data Set

Page 46

Nick Manfredi, Darakhshan J. Mir, Shannon Lu, and Dominick Sanchez

P209_1866.pdf

MoDisSENSE: A Distributed Platform for Social Networking Services over Mobile Devices

Page 49

Ioannis Mytilinis, Ioannis Giannakopoulos, Ioannis Konstantinou, Katerina Doka, and Nectarios Koziris

P226_8102.pdf

A Challenge of Authorship Identification for Ten-thousand-scale Microblog Users

Page 52

Syunya Okuno, Hiroki Asai, and Hayato Yamana

P212_6966.pdf

Cognitive Map of Tourist Behavior based on Tripadvisor

Page 55

Thomas Raimbault, Gaël Chareyron, and Corinne Krzyzanowski-Guillot

P224_3100.pdf

Biclustering using Spark-MapReduce

Page 58

Tugdual Sarazin, Mustapha Lebbah, and Hanane Azzag

P202_5000.pdf

A Summarization Paradigm for Big Data

Page 61

Zubair Shah and Abdun Mahmood

P220_6357.pdf

An open source framework to add spatial extent and geospatial visibility to Big Data

Page 64

Biva Shrestha, Ranjeet Devarakonda, and Giriprakash Palanisamy

P201_1623.pdf

Developing a Cloud Computing Platform for Big Data: The OpenStack Nova case

Page 67

Jose Teixeira

P205_6851.pdf

Cultural Influences and Intelligence in Historical Gazetteers of the Great War

Page 70

Robert Warren and Bo Liu, Language

P218_7692.pdf

The Bot Will Serve You Now: Automating Access to Archival Materials

Page 73

Joshua Westgard

P204_5303.pdf

Incremental and Parallel Spatial Association Mining

Page 75

Jin Soung Yoo and Douglas Boulware

P221_8558.pdf

Sharding for Literature Search via Cutting Citation Graphs

Page 77

Haozhen Zhao

P236_9924.pdf

Repair Efficient Storage Codes via Combinatorial Configurations

Page 80

Bing Zhu, Hui Li, and Kenneth Shum

P206_3584.pdf

Sponsors

IEEE

IEEE Computer Society

Elsevier

Paradigm4

NSF

Cisco

CCF
