Table of contents
  1. Story
  2. Slides
    1. Section 1: Introduction
      1. Unit 1: Course Introduction
      2. Unit 2: Course Motivation
    2. Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
      1. Unit 3: Part I: Data Science generics and Commercial Data Deluge
      2. Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology
      3. Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics
    3. Section 3: Technology Training
      1. Unit 6 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
    4. Section 4 - Physics Case Study
      1. Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators
      2. Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)
      3. Unit 9 - Part III: Random Variables, Physics and Normal Distributions (Python Track)
      4. Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem (Python Track)
    5. Section 5: Technology Training
      1. Unit 11: Using Plotviz Software for Displaying Point Distributions in 3D
    6. Section 6 - e-Commerce and LifeStyle Case Study
      1. Unit 12 - Part I: Recommender Systems: Introduction
      2. Unit 13 - Part II: Recommender Systems: Examples and Algorithms
      3. Unit 14 - Part III: Item-based Collaborative Filtering and its Technologies
      4. Unit 15 - Part IV: k Nearest Neighbor Algorithm (Python Track)
      5. Unit 16 - Part V: Clustering
    7. Section 7 - Infrastructure and Technologies for Big Data X-Informatics
      1. Unit 17 - Parallel Computing: Overview of Basic Principles with familiar Examples
      2. Unit 18 - X-Informatics Cloud Technology Part I: Introduction
      3. Unit 19 - X-Informatics Cloud Technology Part II: Introduction
      4. Unit 20 - X-Informatics Cloud Technology Part III: Introduction
    8. Section 8 - Web Search Informatics
      1. Unit 21 - X-Informatics Web Search and Text Mining I
      2. Unit 22 - X-Informatics Web Search and Text Mining II
    9. Section 9 - Technology for X-Informatics
      1. Unit 23 - PageRank (Python Track)
      2. Unit 24 - K-means (Python Track)
      3. Unit 25 - MapReduce
      4. Unit 26 - Kmeans and MapReduce Parallelism (Python Track)
    10. Section 10 - Health Informatics
      1. Unit 27 - Health Informatics
    11. Section 11 - Sensor Informatics
      1. Unit 28 - Sensors Informatics
    12. Section 12 - Radar Informatics
      1. Unit 29 - Radar Informatics
    13. Section 13 - Spotfire
      1. Spotfire-Cover Page
      2. Spotfire-Plot Viz 3 Data Sets
      3. Spotfire-ShGTM3D Solvents Property
      4. Spotfire-10Kx10Ksampled.nw.4D.sqrt
  3. Spotfire Dashboard
  4. Research Notes
  5. Syllabus
    1. Section 1: Introduction
      1. Unit 1: Course Introduction
        1. Overview
        2. Big Data Ecosystem in One Sentence
        3. The Course: Overview
        4. The Approach
        5. The Course Structure
        6. The Topics of Big Data & X-Informatics
        7. 0: Motivation
        8. 3 Units: X-Informatics Introduction: What is Big Data, Data Analytics and X-Informatics? Parts I-III
        9. Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
        10. 4 Units: X-Informatics Physics Use Case, Discovery of Higgs Particle; Counting Events and Basic Statistics Parts I-IV
        11. Using Plotviz Software for Displaying Point Distributions in 3D
        12. 3 Units: X-Informatics Case Study: e-Commerce and Life Style Informatics: Recommender Systems Parts I-III
        13. 2 Units: X-Informatics Technologies: K-Nearest Neighbor Algorithms and Clustering
        14. Parallel Computing: Overview of Basic Principles with Familiar Examples
        15. 3 Units: X-Informatics Cloud Technology Parts I-III
        16. 2 Units: X-Informatics Web Search and Text Mining Parts I and II
        17. X-Informatics Technologies
        18. X-Informatics: Health Informatics
        19. X-Informatics: Sensors
        20. X-Informatics: Radar Informatics (with application to glaciology)
      2. Unit 2: Course Motivation
        1. Overview
        2. Introduction
        3. Data Deluge
        4. Jobs
        5. Industry Trends
        6. Computing Model: Industry Adopted Clouds Which Are Attractive for Data Analytics
        7. Research Model: 4th Paradigm; From Theory to Data Driven Science?
        8. Data Science Process
        9. Physics Informatics Looking for Higgs Particle with Large Hadron Collider LHC
        10. Recommender Systems
        11. Web Search Information Retrieval
        12. Cloud Applications in Research
        13. Parallel Computing and MapReduce
        14. Data Science Education
        15. Conclusions
    2. Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
      1. Unit 3: Part I: Data Science generics and Commercial Data Deluge
        1. Overview
        2. X-Informatics and Its Motto
        3. Jobs
        4. Data Deluge: General Structure
        5. Data Science Process
        6. Data Deluge: Internet
        7. Data Deluge: Business
      2. Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology
        1. Overview
        2. Data Deluge: Science & Research
        3. Data Deluge: Implications for Scientific Method
        4. Data Deluge: Long Tail of Science
        5. Data Deluge: Internet of Things
      3. Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics
        1. Overview
        2. Clouds
        3. Features of Data Deluge
        4. Processing Big Data
        5. Data Science Process
        6. Data Analytics
    3. Section 3: Technology Training
      1. Unit 6 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
        1. Overview
        2. Python for X-Informatics: Introduction
        3. Python for X-Informatics: Introduction to Canopy
        4. Getting Necessary Packages
    4. Section 4 - Physics Case Study
      1. Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators
        1. Overview
        2. Physics Informatics: Looking for Higgs Particle and Counting Errors Introduction
        3. Physics Informatics: Looking for Higgs Particle Experiments
        4. Physics Informatics: Accelerator Picture Gallery of Big Science
      2. Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)
        1. Overview
        2. Event Counting
        3. Examples of Event Counting: With Python Examples of Signal Plus Background
      3. Unit 9 - Part III: Random Variables, Physics and Normal Distributions (Python Track)
        1. Overview
        2. Fundamental Idea: Random Variables
        3. Physics and Random Variables
        4. Statistics of Events with Normal Distributions
      4. Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem (Python Track)
        1. Overview
        2. Random Numbers: Generators and Seeds
        3. Random Numbers: Binomial Distribution
        4. Random Numbers: Accept-Reject
        5. Random Numbers: Monte Carlo Method
        6. Random Numbers: Poisson Distribution
        7. Random Numbers: Central Limit Theorem
    5. Section 5: Technology Training
      1. Unit 11: Using Plotviz Software for Displaying Point Distributions in 3D
        1. Overview
        2. Data Sets
        3. Motivation and Introduction to Use
        4. Example of Use I: Cube and Structured Dataset
        5. Example of Use II: Proteomics and Synchronized Rotation
        6. Example of Use III: More Features and Larger Proteomics Sample
        7. Example of Use IV: Tools and Examples
        8. Example of Use V: Final Examples
    6. Section 6 - e-Commerce and LifeStyle Case Study
      1. Unit 12 - Part I: Recommender Systems: Introduction
        1. Overview
        2. Recommender Systems as an Optimization Problem
        3. Recommender Systems Introduction
        4. Kaggle Competitions
        5. Examples of Recommender Systems
        6. Netflix on Recommender Systems
      2. Unit 13 - Part II: Recommender Systems: Examples and Algorithms
        1. Overview
        2. Recap of Recommender Systems
        3. Examples of Recommender Systems
        4. Case Study of Yahoo Recommender Systems
        5. Vector Space Formulation of Recommender Systems
        6. User-based nearest-neighbor collaborative filtering
      3. Unit 14 - Part III: Item-based Collaborative Filtering and its Technologies
        1. Overview
        2. Item-based Collaborative Filtering
        3. Technologies: k-Nearest Neighbors and High Dimensional Spaces
      4. Unit 15 - Part IV: k Nearest Neighbor Algorithm (Python Track)
        1. Overview
        2. Python K-Nearest Neighbor Algorithms
        3. Visualization
        4. Testing K Nearest-Neighbor Algorithms
      5. Unit 16 - Part V: Clustering
        1. Overview
        2. Kmeans Clustering
        3. Clustering of Recommender System Example
        4. Clustering Into More Than Three Clusters
        5. Local Optima in Clustering
        6. Clustering in General
        7. Heuristics
    7. Section 7 - Infrastructure and Technologies for Big Data X-Informatics
      1. Unit 17 - Parallel Computing: Overview of Basic Principles with familiar Examples
        1. Overview
        2. Decomposition
        3. Parallel Processing in Society
        4. Parallel Processing for Hadrian’s Wall
      2. Unit 18 - X-Informatics Cloud Technology Part I: Introduction
        1. Overview
        2. Cyberinfrastructure for e-more or less anything or more or less anything-Informatics
        3. What is Cloud Computing: Introduction
        4. Gartner’s Technology Landscape
        5. Simple Examples
      3. Unit 19 - X-Informatics Cloud Technology Part II: Introduction
        1. Overview
        2. What is Cloud Computing In More Detail
        3. As a Service and Platform Model: IaaS PaaS SaaS
        4. Platform as a Service
        5. Data in the Cloud
      4. Unit 20 - X-Informatics Cloud Technology Part III: Introduction
        1. Overview
        2. Cloud (Data Center) Architectures
        3. Cloud Industry Players
        4. Cloud Applications
        5. Security
        6. Comments on Fault Tolerance and Synchronicity Constraints
        7. Big Data Processing From Application Perspective Technology Discussed Earlier
    8. Section 8 - Web Search Informatics
      1. Unit 21 - X-Informatics Web Search and Text Mining I
        1. Overview
        2. Data Mining Survey
        3. DIKW
        4. Web Search Solution in General Starting with History
        5. Information Retrieval (Web Search) Technology
        6. Boolean Query
        7. Fuzzy Index
        8. Vector Space Model
        9. Probabilistic Models
        10. Frequency Versus Bayes
      2. Unit 22 - X-Informatics Web Search and Text Mining II
        1. Overview
        2. Data Analytics for Web Search
        3. Document Preparation
        4. Inverted Index
        5. Index Construction
        6. Query Structure and Processing
        7. Link Structure Analysis including PageRank
        8. Summary-Issues
        9. Crawling the Web
        10. Web Advertising and Search
        11. Clustering and Topic Models
    9. Section 9 - Technology for X-Informatics
      1. Unit 23 - PageRank (Python Track)
        1. Overview
        2. PageRank in Python
          1. Calculate PageRank from Web Linkage Matrix
          2. Calculate PageRank of a real page
      2. Unit 24 - K-means (Python Track)
        1. Overview
        2. Kmeans in Python
        3. Analysis of 4 Artificial Clusters
      3. Unit 25 - MapReduce
        1. Overview
        2. MapReduce
          1. Introduction
          2. Advanced Topics
      4. Unit 26 - Kmeans and MapReduce Parallelism (Python Track)
        1. Overview
        2. MapReduce Kmeans in Python
    10. Section 10 - Health Informatics
      1. Unit 27 - Health Informatics
        1. Overview
        2. Big Data and Health
        3. McKinsey Report on the big data revolution in US healthcare
        4. Microsoft Report on Big Data in Health
        5. EU Report on Redesigning Health in Europe for 2020
        6. Clouds and Health
        7. Genomics, Proteomics and Information Visualization
    11. Section 11 - Sensor Informatics
      1. Unit 28 - Sensors Informatics
        1. Overview
        2. Internet of Things
        3. Sensor Clouds
        4. Radar Data Gathered by Sensors
        5. Ubiquitous/Smart Cities
        6. Smart Grid
    12. Section 12 - Radar Informatics
      1. Unit 29 - Radar Informatics
        1. Overview
        2. Motivation
        3. Outline

Data Science for Big Data Application and Analytics MOOC


        3. Technologies: k-Nearest Neighbors and High Dimensional Spaces
      4. Unit 15 - Part IV: k Nearest Neighbor Algorithm(Python Track)
        1. Overview
        2. Python K-Nearest Neighbor Algorithms
        3. Visualization
        4. Testing K Nearest-Neighbor Algorithms
      5. Unit 16 - Part V: Clustering
        1. Overview
        2. Kmeans Clustering
        3. Clustering of Recommender System Example  
        4. Clustering Into More Than Three Clusters
        5. Local Optima in Clustering
        6. Clustering in General
        7. Heuristics
    7. Section 7 - Infrastructure and Technologies for Big Data X-Informatics
      1. Unit 17 - Parallel Computing: Overview of Basic Principles with familiar Examples
        1. Overview
        2. Decomposition
        3. Parallel Processing in Society
        4. Parallel Processing for Hadrian’s Wall
      2. Unit 18 - X-Informatics Cloud Technology Part I: Introduction
        1. Overview
        2. Cyberinfrastructure for e-more or less anything or more or less anything-Informatics
        3. What is Cloud Computing: Introduction
        4. Gartner’s Technology Landscape
        5. Simple Examples
      3. Unit 19 - X-Informatics Cloud Technology Part II: Introduction
        1. Overview
        2. What is Cloud Computing In More Detail
        3. As a Service and Platform Model: IaaS PaaS SaaS
        4. Platform as a Service
        5. Data in the Cloud
      4. Unit 20 - X-Informatics Cloud Technology Part III: Introduction
        1. Overview
        2. Cloud (Data Center) Architectures
        3. Cloud Industry Players
        4. Cloud Applications
        5. Security
        6. Comments on Fault Tolerance and Synchronicity Constraints
        7. Big Data Processing From Application Perspective Technology Discussed Earlier
    8. Section 8 - Web Search Informatics
      1. Unit 21- X-Informatics Web Search and Text Mining I
        1. Overview
        2. Data Mining Survey
        3. DIKW
        4. Web Search Solution in General Starting with History
        5. Information Retrieval (Web Search) Technology
        6. Boolean Query
        7. Fuzzy Index
        8. Vector Space Model
        9. Probabilistic Models
        10. Frequency Versus Bayes
      2. Unit 22 - X-Informatics Web Search and Text Mining II
        1. Overview
        2. Data Analytics for Web Search
        3. Document Preparation
        4. Inverted Index
        5. Index Construction
        6. Query Structure and Processing
        7. Link Structure Analysis including PageRank
        8. Summary-Issues
        9. Crawling the Web
        10. Web Advertising and Search
        11. Clustering and Topic Models
    9. Section 9 - Technology for X-Informatics
      1. Unit 23 - PageRank (Python Track)
        1. Overview
        2. PageRank in Python
          1. Calculate PageRank from Web Linkage Matrix
          2. Calculate PageRank of a real page
      2. Unit 24 - K-means (Python Track)
        1. Overview
        2. Kmeans in Python
        3. Analysis of 4 Artificial Clusters
      3. Unit 25 - MapReduce
        1. Overview
        2. MapReduce
          1. Introduction
          2. Advanced Topics
      4. Unit 26 - Kmeans and MapReduce Parallelism (Python Track)
        1. Overview
        2. MapReduce Kmeans in Python
    10. Section 10 - Health Informatics
      1. Unit 27 - Health Informatics
        1. Overview
        2. Big Data and Health
        3. McKinsey Report on the big data revolution in US healthcare
        4. Microsoft Report on Big Data in Health
        5. EU Report on Redesigning Health in Europe for 2020
        6. Clouds and Health
        7. Genomics, Proteomics and Information Visualization
    11. Section 11 - Sensor Informatics
      1. Unit 28 - Sensors Informatics
        1. Overview
        2. Internet of Things
        3. Sensor Clouds
        4. Radar Data Gathered by Sensors
        5. Ubiquitous/Smart Cities
        6. Smart Grid
    12. Section 12 - Radar Informatics
      1. Unit 29 - Radar Informatics
        1. Overview
        2. Motivation
        3. Outline

Story

Data Science for Big Data Application and Analytics MOOC

The email said: Check out Professor Geoffrey Fox's MOOC starting December 1st. Folks, here is the link to the "Big Data Applications and Analytics" MOOC: https://bigdatacourse.appspot.com/preview. For big data novices, this is a gentle introduction. For big data experts, this exposes Professor Fox's perspectives, insights and source references without all the details and mathematical models.

I downloaded the ZIP file of Course Syllabus (PDF), Slides (PDF) and Python Files and explored them. Then I mined and structured the content into MindTouch to build a Knowledge Base of the essence of Professor Geoffrey Fox's MOOC so I could make it part of the Federal Big Data Working Group Meetup MOOC.

I asked Professor Fox: Are there data sets used in the course? His reply was: Only a few small sample datasets and simple Monte Carlo sets. So I set out to find them and reuse them in Spotfire Statistics and Visualizations.

Important Point About Semantic Web and Big Data: Features of Data Deluge

MORE TO FOLLOW

Slides

Section 1: Introduction

Unit 1: Course Introduction

GeoffreyFox09222013Slide7.png

Unit 2: Course Motivation

GeoffreyFox09222013Slide141.png

Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?

Unit 3: Part I: Data Science generics and Commercial Data Deluge

GeoffreyFox09222013Slide147.png

Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology

GeoffreyFox09222013Slide204.png

Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics

GeoffreyFox09222013Slide226.png

Section 3: Technology Training

Unit 6 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib

Overview

This section gives an overview of the Python tools needed for this course. These are powerful tools that every data scientist who wishes to use Python should know.

Python for X-Informatics: Introduction

Python Packages and Tools

  • Canopy IDE and IPython
  • NumPy
  • Matplotlib
  • SciPy

Virtual Box Appliance

  • Virtual Box Appliance with all packages set up
  • Tutorial to use the virtual box appliance
  • You can set up the necessary packages manually or use the virtual box appliance

Python for X-Informatics: Introduction to Canopy

Download Canopy

Getting Necessary Packages

NumPy is a Python package that represents data in a vectorized format. A number of other packages are built on top of NumPy and use its data structures and operations. The Express version of Canopy (the one recommended for this class) comes with its own version of NumPy.

Matplotlib is a popular plotting package used for creating graphs and plots for data visualization. It can be used with IPython in an interactive manner, with interactive features such as zooming and panning. It also supports the popular graphics formats such as JPEG, GIF and TIFF, and can be used to plot and modify images. If you are familiar with MATLAB, you can do almost all the kinds of plotting possible in MATLAB using Matplotlib.

SciPy is a powerful Python package built on top of NumPy. It has a number of algorithms pertaining to different fields such as calculus, statistics, signal processing and image processing.

Section 4 - Physics Case Study

Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators

GeoffreyFox09222013Slide300.png

Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)

GeoffreyFox09222013Slide316.png

Unit 9 - Part III: Random Variables, Physics and Normal Distributions(Python Track)

GeoffreyFox09222013Slide341.png

Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem(Python Track)

GeoffreyFox09222013Slide375.png

Section 5: Technology Training

Unit 11: Using Plotviz Software for Displaying Point Distributions in 3D

http://salsahpc.indiana.edu/pviz3/#screenshots My Note: Main Data Source and Documentation (PDF Papers)

GeoffreyFox09222013Slide410.png

Section 6 - e-Commerce and LifeStyle Case Study

Unit 12 - Part I: Recommender Systems: Introduction

GeoffreyFox09222013Slide487.png

Unit 13- Part II: Recommender Systems: Examples and Algorithms

GeoffreyFox09222013Slide501.png

Unit 14 - Part III: Item-based Collaborative Filtering and its Technologies

GeoffreyFox09222013Slide563.png

Unit 15 - Part IV: k Nearest Neighbor Algorithm(Python Track)

 

GeoffreyFox09222013Slide573.png

Unit 16 - Part V: Clustering

GeoffreyFox09222013Slide616.png

Section 7 - Infrastructure and Technologies for Big Data X-Informatics

Unit 17 - Parallel Computing: Overview of Basic Principles with familiar Examples

GeoffreyFox09222013Slide644.png

Unit 18 - X-Informatics Cloud Technology Part I: Introduction

GeoffreyFox09222013Slide665.png

Unit 19 - X-Informatics Cloud Technology Part II: Introduction

GeoffreyFox09222013Slide746.png

Unit 20 - X-Informatics Cloud Technology Part III: Introduction

GeoffreyFox09222013Slide804.png

Section 8 - Web Search Informatics

Unit 21- X-Informatics Web Search and Text Mining I

GeoffreyFox09222013Slide841.png

Unit 22 - X-Informatics Web Search and Text Mining II

GeoffreyFox09222013Slide934.png

Section 9 - Technology for X-Informatics

Unit 23 - PageRank (Python Track)

GeoffreyFox09222013Slide1026.png

Unit 24 - K-means (Python Track)

 

GeoffreyFox09222013Slide1040.png

Unit 25 - MapReduce

GeoffreyFox09222013Slide1076.png

Unit 26 - Kmeans and MapReduce Parallelism (Python Track)

GeoffreyFox09222013Slide1096.png

Section 10 - Health Informatics

Unit 27 - Health Informatics

GeoffreyFox09222013Slide1120.png

Section 11 - Sensor Informatics

Unit 28 - Sensors Informatics

GeoffreyFox09222013Slide1190.png

Section 12 - Radar Informatics

Unit 29 - Radar Informatics

X-Informatics: Radar Informatics (with application to glaciology)
Jerome E. Mitchell
School of Informatics and Computing
Indiana University Bloomington

Motivation

  • CHANGING (DISAPPEARING OUTLET GLACIERS)
  • DATA ANALYTICS INVOLVES IMAGE PROCESSING AFTER DETAILED PROCESSING OF RADAR DATA
  • EXAMPLE OF LARGE (PETABYTE TODAY) GEOLOCATED DATASET

Outline

  • Background – Remote Sensing
  • Background – Global Climate Change
  • Ice Sheet Science
  • Radar Overview
  • Radar Basics
  • Radar Informatics

Section 13 Spotfire

Spotfire-Cover Page

Web Player

FoxMOOC-Spotfire-CoverPage.png

Spotfire-Plot Viz 3 Data Sets

Web Player

FoxMOOC-Spotfire-Plot Viz 3 Data Sets.png

Spotfire-ShGTM3D Solvents Property

Web Player

FoxMOOC-Spotfire-ShGTM3D Solvents Property.png

Spotfire-10Kx10Ksampled.nw.4D.sqrt

Web Player

FoxMOOC-Spotfire-10Kx10Ksampled.nw.4D.sqrt.png

Spotfire Dashboard

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Research Notes

Thank you. Are there data sets used in the course?

Only a few small sample datasets and simple Monte Carlo sets

Check Out Prof. Geoffrey Fox's MOOC Starting Dec 1st

Folks, Here is the link to the "Big Data Applications and Analytics" MOOC: https://bigdatacourse.appspot.com/preview

For big data novices, this is a gentle introduction. For big data experts, this exposes Prof. Fox's perspectives and insights and source references without all the details and mathematical models. I have already peeked into the MOOC lectures.

As always, big data requires federated, distributed data sets. IBM tends to highlight the V's because IBM has the V's (except data veracity because provenance and curation are required) covered. Oracle tends to highlight the distinction among SQL, NoSQL, and unstructured (e.g., document stores, object stores) data sets because Oracle has different products for all three. I emphasize the requirement to process data in place while sharing data sets based on policies based on governance based on Common Law. Federated identity, capabilities, policies, logging, auditing, alarms, workload management, and prioritization cross any bureaucratic boundaries in order to safely share distributed data sets globally.

Syllabus

Syllabus: Big Data Application and Analytics By Dr. Geoffrey Fox https://www.dropbox.com/s/ce3mw81scn...lides.pdf?dl=1 (PDF)

Section 1: Introduction

This section contains Units 1 and 2.

Overview

This section has a technical overview of the course followed by a broad motivation for the course.

The course overview covers its content and structure. It presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons. The course covers a mix of applications (the X in X-Informatics) and the technologies needed to support the field electronically, i.e. to process the application data. The overview ends with a discussion of course content at the highest level. The course starts with a longish Motivation unit summarizing clouds and data science, followed by units describing applications (X = Physics, e-Commerce, Web Search and Text Mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications). The course uses either Python or Java, and there are side MOOCs discussing the Python and Java tracks.

The course motivation starts with striking examples of the data deluge from research, business and the consumer. The growing number of jobs in data science is highlighted. Geoffrey describes industry trends in both clouds and big data. Then the cloud computing model developed at amazing speed by industry is introduced. The 4 paradigms of scientific research are described, with the growing importance of the data-oriented version. He covers 3 major X-Informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.

Unit 1: Course Introduction

Go To Unit 1: https://bigdatacourse.appspot.com/unit?unit=1

GeoffreyFox09222013Slide3.png

Overview

Geoffrey gives a short introduction to the course covering its content and structure. He presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour, and those are further subdivided into 5-15 minute lessons.

The course covers a mix of applications (the X in X-Informatics) and the technologies needed to support the field electronically, i.e. to process the application data. The introduction ends with a discussion of course content at the highest level. The course starts with a longish Motivation unit summarizing clouds and data science, followed by units describing applications (X = Physics, e-Commerce, Web Search and Text Mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications).

The course uses either Python or Java, and there are side MOOCs discussing the Python and Java tracks.

Big Data Ecosystem in One Sentence

GeoffreyFox09222013Slide4.png

The Course: Overview

GeoffreyFox09222013Slide5.png

The Approach

GeoffreyFox09222013Slide7.png

The Course Structure

GeoffreyFox09222013Slide8.png

The Topics of Big Data & X-Informatics

GeoffreyFox09222013Slide9.png

0: Motivation

GeoffreyFox09222013Slide11.png

3 Units: X-Informatics Introduction: What is Big Data, Data Analytics and X-Informatics? Parts I-III

GeoffreyFox09222013Slide12.png

Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib

GeoffreyFox09222013Slide13.png

4 Units: X-Informatics Physics Use Case, Discovery of Higgs Particle; Counting Events and Basic Statistics Parts I-IV

GeoffreyFox09222013Slide14.png

Using Plotviz Software for Displaying Point Distributions in 3D

GeoffreyFox09222013Slide15.png

3 Units: X-Informatics Case Study: e-Commerce and Life Style Informatics: Recommender Systems Parts I-III

GeoffreyFox09222013Slide16.png

2 Units: X-Informatics Technologies: K-Nearest Neighbor Algorithms and Clustering

GeoffreyFox09222013Slide17.png

Parallel Computing: Overview of Basic Principles with Familiar Examples

GeoffreyFox09222013Slide18.png

3 Units: X-Informatics Cloud Technology Parts I-III

GeoffreyFox09222013Slide19.png

2 Units: X-Informatics Web Search and Text Mining Parts I and II

GeoffreyFox09222013Slide20.png

X-Informatics Technologies

GeoffreyFox09222013Slide21.png

X-Informatics: Health Informatics

GeoffreyFox09222013Slide22.png

X-Informatics: Sensors

GeoffreyFox09222013Slide23.png

X-Informatics: Radar Informatics (with application to glaciology)

GeoffreyFox09222013Slide24.png

Unit 2: Course Motivation

Go To Unit 2: https://bigdatacourse.appspot.com/unit?unit=2

Overview

Geoffrey motivates the study of X-Informatics by describing data science and clouds. He starts with striking examples of the data deluge from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data.

He introduces the cloud computing model developed at amazing speed by industry. The 4 paradigms of scientific research are described, with the growing importance of the data-oriented version. He covers 3 major X-Informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.

Introduction

GeoffreyFox09222013Slide30.png

GeoffreyFox09222013Slide31.png

GeoffreyFox09222013Slide33.png

Data Deluge

GeoffreyFox09222013Slide40.png

GeoffreyFox09222013Slide44.png

Jobs

GeoffreyFox09222013Slide50.png

Industry Trends

GeoffreyFox09222013Slide61.png

Computing Model: Industry Adopted Clouds Which Are Attractive for Data Analytics

GeoffreyFox09222013Slide72.png

Research Model: 4th Paradigm; From Theory to Data Driven Science?

GeoffreyFox09222013Slide75.png

GeoffreyFox09222013Slide76.png

GeoffreyFox09222013Slide77.png

Data Science Process

GeoffreyFox09222013Slide79.png

Physics Informatics Looking for Higgs Particle with Large Hadron Collider LHC

GeoffreyFox09222013Slide83.png

Recommender Systems

GeoffreyFox09222013Slide91.png

Web Search Information Retrieval

GeoffreyFox09222013Slide103.png

Cloud Applications in Research

GeoffreyFox09222013Slide113.png

Parallel Computing and MapReduce

GeoffreyFox09222013Slide119.png

Data Science Education

GeoffreyFox09222013Slide139.png

Conclusions

GeoffreyFox09222013Slide141.png

Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?

This section contains Units 3, 4 and 5.

Unit 3: Part I: Data Science generics and Commercial Data Deluge

Go To Unit 3: https://bigdatacourse.appspot.com/unit?unit=3

GeoffreyFox09222013Slide147.png

Overview

Geoffrey starts with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. This unit offers a look at the phenomenon described as the Data Deluge starting with its broad features. Then he discusses data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline. Then more detail is given on the flood of data from Internet and Industry applications with eBay and General Electric discussed in most detail.

X-Informatics and Its Motto

GeoffreyFox09222013Slide147.png

Jobs

GeoffreyFox09222013Slide154.png

Data Deluge: General Structure

GeoffreyFox09222013Slide156.png

Data Science Process

GeoffreyFox09222013Slide166.png

Data Deluge: Internet

GeoffreyFox09222013Slide169.png

Data Deluge: Business

GeoffreyFox09222013Slide169.png

GeoffreyFox09222013Slide178.png

Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology

Go To Unit 4: https://bigdatacourse.appspot.com/unit?unit=4

Overview

Geoffrey continues the discussion of the data deluge with a focus on scientific research. He takes a first peek at data from the Large Hadron Collider, considered later as Physics Informatics, and gives some biology examples. He discusses the implications of data for the scientific method, which is changing as the data-intensive methodology joins observation, theory and simulation as basic methods.

He discusses the long tail of science: many users with individually modest data adding up to a lot. The last lesson emphasizes how everyday devices -- the Internet of Things -- are ???

Data Deluge: Science & Research

GeoffreyFox09222013Slide192.png

Data Deluge: Implications for Scientific Method

GeoffreyFox09222013Slide204.png

Data Deluge: Long Tail of Science

GeoffreyFox09222013Slide44.png

Data Deluge: Internet of Things

GeoffreyFox09222013Slide209.png

Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics

Go To Unit 5: https://bigdatacourse.appspot.com/unit?unit=5

Overview

Geoffrey gives an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft, with new centers holding up to a million servers. The benefits of clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of cloud computing with a comparison to supercomputing.

He discusses features of the data deluge with a salutary example where more data did better than more thought. Examples are given of end to end systems to process big data. He introduces data science and one part of it -- data analytics -- the large algorithms that crunch the big data to give big wisdom. There are many ways to describe data science and several are discussed to give a good composite picture of this emerging field.

Clouds

GeoffreyFox09222013Slide221.png

Features of Data Deluge

GeoffreyFox09222013Slide226.png

Processing Big Data

GeoffreyFox09222013Slide231.png

Data Science Process

GeoffreyFox09222013Slide239.png

Data Analytics

http://cra.org/ccc/docs/nitrdsymposium/pdfs/keyes.pdf My Note: Page not found!

GeoffreyFox09222013Slide249.png

Section 3: Technology Training

This section contains Unit 6.

Overview

This section gives an overview of the Python tools needed for this course. These are powerful tools that every data scientist who wishes to use Python should know. This section covers:

  • Canopy: an IDE for Python developed by Enthought. The aim of this IDE is to bring the various Python libraries under one single framework or "canopy", hence the name.
  • NumPy: a popular library on top of which many other libraries (like pandas and SciPy) are built. It provides a way of vectorizing data. This helps to organize data in a more intuitive fashion and lets us use the various matrix operations that are popular in the machine learning community.
  • Matplotlib: a data visualization package. It allows you to create graphs, charts and other such diagrams. It supports images in JPEG, GIF and TIFF formats.
  • SciPy: a library built on top of NumPy with a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.

Unit 6 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib

Go To Unit 6: https://bigdatacourse.appspot.com/unit?unit=6

Overview

This section gives an overview of the Python tools needed for this course. These are powerful tools that every data scientist who wishes to use Python should know.

Python for X-Informatics: Introduction

Python Packages and Tools

  • Canopy IDE and IPython
  • NumPy
  • Matplotlib
  • SciPy

Virtual Box Appliance

  • Virtual Box Appliance with all packages set up
  • Tutorial to use the virtual box appliance
  • You can set up the necessary packages manually or use the virtual box appliance

Python for X-Informatics: Introduction to Canopy

Download Canopy

Getting Necessary Packages

NumPy is a Python package that represents data in a vectorized format. A number of other packages are built on top of NumPy and use its data structures and operations. The Express version of Canopy (the one recommended for this class) comes with its own version of NumPy.

Matplotlib is a popular plotting package used for creating graphs and plots for data visualization. It can be used with IPython in an interactive manner, with interactive features such as zooming and panning. It also supports the popular graphics formats such as JPEG, GIF and TIFF, and can be used to plot and modify images. If you are familiar with MATLAB, you can do almost all the kinds of plotting possible in MATLAB using Matplotlib.

SciPy is a powerful Python package built on top of NumPy. It has a number of algorithms pertaining to different fields such as calculus, statistics, signal processing and image processing.
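
As a rough illustration of what "vectorized" means here, the following minimal sketch stores a few numbers in a NumPy array and operates on all of them at once. The sample values and variable names are invented for illustration and are not taken from the course files.

```python
import numpy as np

# Hypothetical sample: five measurements stored as a NumPy array.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Vectorized arithmetic applies to every element at once, with no explicit loop.
squared = values ** 2

# A small matrix and a matrix-vector product, the kind of operation
# used throughout machine learning calculations.
matrix = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
vector = np.array([1.0, 1.0])
product = matrix @ vector

print(squared)         # element-wise squares
print(values.mean())   # arithmetic mean of the sample
print(product)         # matrix-vector product
```

The same array type is what packages such as SciPy and Matplotlib accept as input, which is why NumPy is listed first.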

Section 4 - Physics Case Study

This section contains Units 7, 8, 9 and 10.

Overview

This section starts by describing the LHC accelerator at CERN and evidence found by the experiments suggesting the existence of a Higgs boson. The huge number of authors on a paper and remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs boson production with various shapes of signal, various backgrounds and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation of why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random numbers with their generators and seeds lead to a discussion of the binomial and Poisson distributions and of the Monte Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.

Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators

Go To Unit 7: https://bigdatacourse.appspot.com/unit?unit=7

Overview

In this short unit Geoffrey describes the LHC accelerator at CERN and evidence found by experiments including ATLAS suggesting the existence of a Higgs boson. The huge number of authors on a paper and remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery.

Physics Informatics: Looking for Higgs Particle and Counting Errors Introduction

GeoffreyFox09222013Slide298.png

Physics Informatics: Looking for Higgs Particle Experiments

GeoffreyFox09222013Slide300.png

Physics Informatics: Accelerator Picture Gallery of Big Science

GeoffreyFox09222013Slide306.png

Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)

Go To Unit 8: https://bigdatacourse.appspot.com/unit?unit=8

Overview

This unit is devoted to Python experiments in which Geoffrey looks at histograms of Higgs boson production with various shapes of signal, various backgrounds and various event totals.

Event Counting

GeoffreyFox09222013Slide316.png

GeoffreyFox09222013Slide318.png

Examples of Event Counting: With Python Examples of Signal Plus Background

GeoffreyFox09222013Slide331.png

Unit 9 - Part III: Random Variables, Physics and Normal Distributions(Python Track)

Go To Unit 9: https://bigdatacourse.appspot.com/unit?unit=9

Overview

Geoffrey introduces random variables and some simple principles of statistics and explains why they are relevant to Physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given.
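A small sketch of why the normal distribution matters for counting: the fraction of outcomes within k standard deviations, and a naive significance estimate for an excess of events. The `naive_significance` helper is hypothetical and assumes the background count fluctuates like a Poisson variable, so that its standard deviation is about the square root of the count.

```python
import math
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations."""
    return standard.cdf(k) - standard.cdf(-k)

def naive_significance(excess, background):
    """Excess over background in units of sqrt(background).

    Assumption: the background behaves like a Poisson count, so its
    standard deviation is approximately sqrt(background).
    """
    return excess / math.sqrt(background)

print(round(fraction_within(1), 4))   # 0.6827
print(round(fraction_within(2), 4))   # 0.9545
print(naive_significance(60, 400))    # 3.0
```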

Fundamental Idea: Random Variables

GeoffreyFox09222013Slide339.png

Physics and Random Variables

GeoffreyFox09222013Slide341.png

Statistics of Events with Normal Distributions

GeoffreyFox09222013Slide365.png

Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem(Python Track)

Go To Unit 10: https://bigdatacourse.appspot.com/unit?unit=10

Overview

Geoffrey discusses Random Numbers with their Generators and Seeds. He introduces the Binomial and Poisson Distributions. Monte-Carlo and accept-reject methods are discussed. The Central Limit Theorem concludes the discussion. Python examples and Physics applications are given.

Random Numbers: Generators and Seeds

GeoffreyFox09222013Slide375.png
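To illustrate generators and seeds, the following minimal sketch uses Python's built-in `random` module (the course's own examples may use a different generator). The same seed reproduces the same sequence, while a different seed gives a different one. The `draw` helper is a hypothetical name.

```python
import random

def draw(seed, n=3):
    """Return n pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)  # independent generator with its own state
    return [rng.random() for _ in range(n)]

same_a = draw(seed=42)
same_b = draw(seed=42)
other = draw(seed=43)

print(same_a == same_b)  # identical seed, identical sequence
print(same_a == other)   # different seed, different sequence
```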

Random Numbers: Binomial Distribution

GeoffreyFox09222013Slide389.png

Random Numbers: Accept-Reject

GeoffreyFox09222013Slide393.png
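A minimal sketch of the accept-reject idea, assuming a uniform proposal over a bounded interval and a known upper bound on the target density. The target f(x) = 2x on [0, 1] is an illustrative choice, not taken from the course; its exact mean is 2/3.

```python
import random

def accept_reject(pdf, lo, hi, pdf_max, n, seed=0):
    """Draw n samples from `pdf` on [lo, hi] by accept-reject.

    Assumes pdf(x) <= pdf_max everywhere on the interval.
    """
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        x = rng.uniform(lo, hi)        # candidate point
        y = rng.uniform(0.0, pdf_max)  # uniform height under the bounding box
        if y < pdf(x):                 # keep the candidate only if under the curve
            samples.append(x)
    return samples

samples = accept_reject(lambda x: 2.0 * x, 0.0, 1.0, 2.0, n=20000)
mean = sum(samples) / len(samples)
print(round(mean, 2))  # close to the exact mean 2/3
```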

Random Numbers: Monte Carlo Method

GeoffreyFox09222013Slide397.png
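A classic small Monte Carlo example, given here as an illustration rather than as the course's own: estimating pi from the fraction of random points in the unit square that fall inside the quarter circle. The function name `estimate_pi` is hypothetical.

```python
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n random points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point lies inside the quarter circle
            inside += 1
    return 4.0 * inside / n

pi_estimate = estimate_pi(100000)
print(pi_estimate)
```

The statistical error of such an estimate shrinks like one over the square root of the number of points, which is why Monte Carlo runs often need many samples.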

Random Numbers: Poisson Distribution

GeoffreyFox09222013Slide401.png
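The link between the Binomial and Poisson distributions can be checked numerically. The sketch below, with illustrative values n = 1000 and p = 0.003 (assumptions, not course values), compares the two probability mass functions when n is large and p is small.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected number is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

n, p = 1000, 0.003
lam = n * p  # the Poisson mean matches the binomial mean
max_diff = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(10))
for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))
```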

Random Numbers: Central Limit Theorem

GeoffreyFox09222013Slide405.png
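The Central Limit Theorem can be seen in a few lines. This sketch, whose sample sizes are illustrative assumptions, averages uniform random numbers and checks that the averages cluster around 0.5 with a spread close to sqrt(1/12/n), as expected for a uniform variable with variance 1/12.

```python
import math
import random
import statistics

def sample_means(n_per_mean, n_means, seed=0):
    """Return n_means averages, each over n_per_mean uniform random numbers."""
    rng = random.Random(seed)
    return [
        sum(rng.random() for _ in range(n_per_mean)) / n_per_mean
        for _ in range(n_means)
    ]

means = sample_means(n_per_mean=30, n_means=5000)
observed_mean = statistics.mean(means)
observed_sd = statistics.stdev(means)
expected_sd = math.sqrt(1.0 / 12.0 / 30)

print(round(observed_mean, 3), round(observed_sd, 4), round(expected_sd, 4))
```

A histogram of `means` looks close to a normal curve even though each individual draw is uniform.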

Section 5: Technology Training

This section contains Unit 11

Unit 11: Using Plotviz Software for Displaying Point Distributions in 3D

Go To Unit 11: https://bigdatacourse.appspot.com/unit?unit=11

Overview

Geoffrey introduces Plotviz, a data visualization tool developed at Indiana University to display 2- and 3-dimensional data. The motivation is that the human eye is very good at pattern recognition and can "see" structure in data. Although most Big Data is higher dimensional than 3, all can be transformed by dimension reduction techniques to 3D. He gives several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, he describes the download and software dependencies of Plotviz.

Motivation and Introduction to Use

GeoffreyFox09222013Slide410.png

Example of Use I: Cube and Structured Dataset
Example of Use II: Proteomics and Synchronized Rotation
Example of Use III: More Features and Larger Proteomics Sample
Example of Use IV: Tools and Examples
Example of Use V: Final Examples

Section 6 - e-Commerce and LifeStyle Case Study

This section contains Units 12, 13, 14, 15 and 16

Overview

Recommender systems operate under the hood of such widely recognized sites as Amazon, eBay, Monster and Netflix, where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs. The Kaggle competition site is explored, together with competitions held to improve the success of the Netflix and other recommender systems. Attention is paid to models that are used to compare how changes to the systems affect their overall performance. Geoffrey muses how the humble ranking has become such a dominant driver of the world’s economy. More examples of recommender systems are given from Google News, retail stores and, in depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites. The formulation of recommendations in terms of points in a space or bag is given, where bags of item properties, user properties, rankings and users are useful. Detail is given on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items. Items are viewed as points in a space of users in item-based collaborative filtering. The Cosine Similarity is introduced, along with the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed. A simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions is given. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concept of training and testing sets is introduced, with the training set prelabelled.
Recommender systems are used to discuss clustering, with k-means based clustering methods used and their results examined in Plotviz. The original labelling is compared to the clustering results and an extension to 28 clusters is given. General issues in clustering are discussed, including local optima, the use of annealing to avoid them and the value of heuristic algorithms.

Unit 12 - Part I: Recommender Systems: Introduction

Go To Unit 12: https://bigdatacourse.appspot.com/unit?unit=12

Overview

Geoffrey introduces Recommender systems as an optimization technology used in a variety of applications and contexts online. They operate in the background of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs, to the benefit of both.

There follows an exploration of the Kaggle competition site, other recommender systems and Netflix, as well as competitions held to improve the success of the Netflix recommender system. Finally attention is paid to models that are used to compare how changes to the systems affect their overall performance. Geoffrey muses how the humble ranking has become such a dominant driver of the world’s economy.

Recommender Systems as an Optimization Problem

GeoffreyFox09222013Slide447.png

Recommender Systems Introduction

GeoffreyFox09222013Slide454.png

Kaggle Competitions

http://www.kaggle.com

Examples of Recommender Systems

GeoffreyFox09222013Slide469.png

Netflix on Recommender Systems

GeoffreyFox09222013Slide487.png

Unit 13- Part II: Recommender Systems: Examples and Algorithms

Go To Unit 13: https://bigdatacourse.appspot.com/unit?unit=13

Overview

Geoffrey continues the discussion of recommender systems and their use in e-commerce. More examples are given from Google News, retail stores and, in depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites. Then the formulation of recommendations in terms of points in a space or bag is given. Here bags of item properties, user properties, rankings and users are useful. Then we go into detail on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items.
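As a sketch of how the Pearson correlation can quantify similarity between two users, the code below treats each user as a dictionary of item ratings and correlates them over the items both have rated. Averaging over co-rated items only is one common convention and is an assumption here; the user names and ratings are made up for illustration.

```python
import math

def pearson(u, v):
    """Pearson correlation between two users over their co-rated items.

    u and v map item name -> rating. Returns 0.0 when the correlation
    is undefined (fewer than two common items or zero variance).
    """
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return 0.0
    mean_u = sum(u[i] for i in common) / len(common)
    mean_v = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mean_u) * (v[i] - mean_v) for i in common)
    var_u = sum((u[i] - mean_u) ** 2 for i in common)
    var_v = sum((v[i] - mean_v) ** 2 for i in common)
    if var_u == 0 or var_v == 0:
        return 0.0
    return num / math.sqrt(var_u * var_v)

alice = {"A": 5, "B": 3, "C": 1}
bob = {"A": 4, "B": 3, "C": 2, "D": 5}   # same taste as alice, milder scores
carol = {"A": 1, "B": 3, "C": 5}         # opposite taste

print(pearson(alice, bob))    # 1.0
print(pearson(alice, carol))  # -1.0
```

In user-based collaborative filtering, users with a high correlation to the target user would then be used to predict ratings for items the target has not yet seen.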

Recap of Recommender Systems

GeoffreyFox09222013Slide501.png

Examples of Recommender Systems

GeoffreyFox09222013Slide504.png

GeoffreyFox09222013Slide505.png

Case Study of Yahoo Recommender Systems

GeoffreyFox09222013Slide513.png

Vector Space Formulation of Recommender Systems

GeoffreyFox09222013Slide536.png

User-based nearest-neighbor collaborative filtering

GeoffreyFox09222013Slide538.png

Unit 14 - Part III: Item-based Collaborative Filtering and its Technologies

Go To Unit 14: https://bigdatacourse.appspot.com/unit?unit=14

Overview

Geoffrey moves on to item-based collaborative filtering, where items are viewed as points in a space of users. The Cosine Similarity is introduced, along with the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed.
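A minimal sketch of cosine similarity between items, each represented as a vector of ratings across the same ordered list of users. Using 0 for "not rated" is a simplifying assumption, and the item names and ratings are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an item nobody rated is similar to nothing
    return dot / (norm_a * norm_b)

# Each item is a point in the space of four users (0 means not rated).
item_ratings = {
    "film_x": [5, 4, 0, 1],
    "film_y": [4, 5, 0, 1],
    "film_z": [0, 1, 5, 4],
}

sim_xy = cosine_similarity(item_ratings["film_x"], item_ratings["film_y"])
sim_xz = cosine_similarity(item_ratings["film_x"], item_ratings["film_z"])
print(round(sim_xy, 3), round(sim_xz, 3))
```

Here `film_x` and `film_y` are liked by the same users and so come out more similar than `film_x` and `film_z`.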

Item-based Collaborative Filtering

GeoffreyFox09222013Slide554.png

Technologies: k-Nearest Neighbors and High Dimensional Spaces

GeoffreyFox09222013Slide563.png

Unit 15 - Part IV: k Nearest Neighbor Algorithm(Python Track)

Go To Unit 15: https://bigdatacourse.appspot.com/unit?unit=15

Overview

Geoffrey discusses a simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concept of training and testing sets is introduced, with the training set prelabelled.
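The following is a small standard-library sketch of k Nearest Neighbor classification on an artificial 3D data set. It is not the course's code; the blob centers, spread and test points are illustrative assumptions.

```python
import math
import random
from collections import Counter

def knn_predict(train, point, k=3):
    """Label `point` by majority vote among its k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Artificial prelabelled training set: three Gaussian blobs in 3D.
rng = random.Random(0)
centers = {0: (0.0, 0.0, 0.0), 1: (5.0, 5.0, 5.0), 2: (0.0, 5.0, 0.0)}
train = [
    (tuple(rng.gauss(c, 1.0) for c in center), label)
    for label, center in centers.items()
    for _ in range(30)
]

# Testing set: points whose true blob is known by construction.
test_points = [(0.2, -0.1, 0.3), (4.8, 5.1, 5.2), (0.1, 4.7, -0.2)]
predictions = [knn_predict(train, p) for p in test_points]
print(predictions)
```

Comparing `predictions` with the known labels of the testing set gives the error rate of the classifier.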

Python K-Nearest Neighbor Algorithms

GeoffreyFox09222013Slide573.png

Visualization

GeoffreyFox09222013Slide584.png

Testing K Nearest-Neighbor Algorithms

GeoffreyFox09222013Slide588.png

Unit 16 - Part V: Clustering

Go To Unit 16: https://bigdatacourse.appspot.com/unit?unit=16

Overview

Geoffrey uses the example of a recommender system to discuss clustering. The details of methods are not discussed, but k-means based clustering methods are used and their results examined in Plotviz. The original labelling is compared to the clustering results and an extension to 28 clusters is given. General issues in clustering are discussed, including local optima, the use of annealing to avoid them and the value of heuristic algorithms.

Kmeans Clustering

GeoffreyFox09222013Slide596.png

Clustering of Recommender System Example  

GeoffreyFox09222013Slide600.png

Clustering Into More Than Three Clusters

GeoffreyFox09222013Slide605.png

Local Optima in Clustering

GeoffreyFox09222013Slide614.png

Clustering in General

GeoffreyFox09222013Slide616.png

Heuristics

GeoffreyFox09222013Slide624.png

Section 7 - Infrastructure and Technologies for Big Data X-Informatics

This section contains Units 17, 18, 19 and 20

Overview

The section starts with the central role of Parallel computing in Clouds and Big Data, which is decomposed into lots of "Little data" running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition. Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing are introduced. This includes virtualization and the important "as a Service" components, and we go through several different definitions of cloud computing. Gartner’s Technology Landscape includes the hype cycle and priority matrix and covers clouds and Big Data. Two simple examples of the value of clouds for enterprise applications are given, with a review of different views as to the nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). Features in Grid and cloud computing and data are treated. Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models are discussed, followed by the Cloud Industry stakeholders and applications on the cloud, including data intensive problems and comparison with high performance computing. Remarks on Security, Fault Tolerance and Synchronicity issues in the cloud follow. Big Data Processing from an application perspective, with commercial examples including eBay, concludes the section.

Unit 17 - Parallel Computing: Overview of Basic Principles with familiar Examples

Go To Unit 17: https://bigdatacourse.appspot.com/unit?unit=17

Overview

Geoffrey describes the central role of Parallel computing in Clouds and Big Data, which is decomposed into lots of "Little data" running in individual cores. Many examples are given and it is stressed that issues in parallel computing are seen in day to day life for communication, synchronization, load balancing and decomposition.

Decomposition

GeoffreyFox09222013Slide631.png

GeoffreyFox09222013Slide632.png

GeoffreyFox09222013Slide639.png

Parallel Processing in Society

GeoffreyFox09222013Slide644.png

Parallel Processing for Hadrian’s Wall

GeoffreyFox09222013Slide644.png

Unit 18 - X-Informatics Cloud Technology Part I: Introduction

Go To Unit 18: https://bigdatacourse.appspot.com/unit?unit=18

Overview

Geoffrey discusses Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing. This includes virtualization and the important "as a Service" components, and we go through several different definitions of cloud computing.

Gartner’s Technology Landscape includes the hype cycle and priority matrix and covers clouds and Big Data. Geoffrey reviews the nature of the 48 technologies in the 2012 emerging technology hype cycle. Gartner has specific predictions for cloud computing growth areas. The unit concludes with two simple examples of the value of clouds for enterprise applications.

Cyberinfrastructure for e-more or less anything or more or less anything-Informatics

GeoffreyFox09222013Slide665.png

GeoffreyFox09222013Slide666.png

What is Cloud Computing: Introduction

GeoffreyFox09222013Slide679.png

Gartner’s Technology Landscape

GeoffreyFox09222013Slide702.png

Simple Examples

GeoffreyFox09222013Slide708.png

Unit 19 - X-Informatics Cloud Technology Part II: Introduction

Go To Unit 19: https://bigdatacourse.appspot.com/unit?unit=19

Overview

Geoffrey covers different views as to the nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). The unit discusses features introduced in Grid computing and features introduced by clouds. The unit concludes with the treatment of data in the cloud from an architecture perspective.

What is Cloud Computing In More Detail

GeoffreyFox09222013Slide726.png

As a Service and Platform Model: IaaS PaaS SaaS

GeoffreyFox09222013Slide746.png

Platform as a Service

GeoffreyFox09222013Slide753.png

Data in the Cloud

GeoffreyFox09222013Slide763.png

Unit 20 - X-Informatics Cloud Technology Part III: Introduction

Go To Unit 20: https://bigdatacourse.appspot.com/unit?unit=20

Overview

Geoffrey opens with a discussion of Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models. A discussion of Cloud Industry stakeholders is followed by applications on the cloud, including data intensive problems and comparison with high performance computing. Remarks on Security, Fault Tolerance and Synchronicity issues in the cloud follow. Big Data Processing from an application perspective, with commercial examples including eBay, concludes the unit.

Cloud (Data Center) Architectures

GeoffreyFox09222013Slide785.png

Cloud Industry Players

GeoffreyFox09222013Slide791.png

GeoffreyFox09222013Slide792.png

Cloud Applications

GeoffreyFox09222013Slide798.png

GeoffreyFox09222013Slide804.png

Security

GeoffreyFox09222013Slide806.png

Comments on Fault Tolerance and Synchronicity Constraints

GeoffreyFox09222013Slide813.png

Big Data Processing From Application Perspective Technology Discussed Earlier

GeoffreyFox09222013Slide824.png

Section 8 - Web Search Informatics

Overview

This section starts with an overview of data mining and puts our study of classification, clustering and exploration methods in context. We examine the problem to be solved in web and text search and note the relevance of history with libraries, catalogs and concordances. An overview of web search is given describing the continued evolution of search engines and the relation to the field of Information Retrieval. The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced, along with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model, while the distinction between Bayesian and frequency views of probability distribution completes the first unit. Geoffrey then gives an overview of the different steps (data analytics) in web search and goes through the key steps in detail, starting with document preparation. An inverted index is described and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow. This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it and issues in web advertising and search follow. The use of clustering and topic models completes the section.

Unit 21- X-Informatics Web Search and Text Mining I

Go To Unit 21: https://bigdatacourse.appspot.com/unit?unit=21

Overview

Geoffrey starts this unit with an overview of data mining, noting our study of classification, clustering and exploration methods. We examine the problem to be solved in web and text search and note the relevance of history with libraries, catalogs and concordances. An overview of web search is given describing the continued evolution of search engines and the relation to the field of Information Retrieval.

The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced, along with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model, while the distinction between Bayesian and frequency views of probability distribution completes this unit.
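As an illustration of the Bag of Words model and TF-IDF weighting, here is a minimal sketch. TF-IDF has many variants; this one uses the raw term count multiplied by log(N / document frequency), which is an assumption rather than the specific formula used in the course, and the tiny documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} bag per document.

    Weight = (count of term in document) * log(N / number of documents
    containing the term), where N is the number of documents.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: counts[t] * math.log(n_docs / doc_freq[t])
                        for t in counts})
    return vectors

docs = ["big data clouds", "web data search", "data mining search"]
vectors = tf_idf(docs)
print(vectors[0])
```

A term such as "data" that appears in every document gets weight 0, while a rare term such as "clouds" gets the largest weight, which is the point of the inverse document frequency factor.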

Data Mining Survey

GeoffreyFox09222013Slide830.png

DIKW

GeoffreyFox09222013Slide841.png

Web Search Solution in General Starting with History

GeoffreyFox09222013Slide853.png

Information Retrieval (Web Search) Technology

GeoffreyFox09222013Slide870.png

Boolean Query

GeoffreyFox09222013Slide875.png

Fuzzy Index

GeoffreyFox09222013Slide884.png

Vector Space Model

GeoffreyFox09222013Slide892.png

Probabilistic Models

GeoffreyFox09222013Slide913.png

Frequency Versus Bayes

GeoffreyFox09222013Slide918.png

GeoffreyFox09222013Slide926.png

Unit 22 - X-Informatics Web Search and Text Mining II

Go To Unit 22: https://bigdatacourse.appspot.com/unit?unit=22

Overview

Geoffrey starts with an overview of the different steps (data analytics) in web search and then goes through the key steps in detail, starting with document preparation. An inverted index is described and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow.

This is followed by Link Structure Analysis including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it and issues in web advertising and search follow. The use of clustering and topic models completes the unit.
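A minimal sketch of an inverted index and a Boolean AND query over it. Real document preparation (stemming, stop words and so on) is skipped here; splitting on whitespace is a simplifying assumption and the documents are invented.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Return sorted ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

docs = ["big data on clouds", "web search engines", "big web data"]
index = build_inverted_index(docs)

print(boolean_and(index, "big", "data"))  # [0, 2]
print(boolean_and(index, "web", "big"))   # [2]
```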

Data Analytics for Web Search

GeoffreyFox09222013Slide934.png

Document Preparation

GeoffreyFox09222013Slide936.png

Inverted Index

GeoffreyFox09222013Slide947.png

Index Construction

GeoffreyFox09222013Slide952.png

Query Structure and Processing

GeoffreyFox09222013Slide962.png

Link Structure Analysis including PageRank

GeoffreyFox09222013Slide979.png

Summary-Issues

GeoffreyFox09222013Slide989.png

Crawling the Web

GeoffreyFox09222013Slide992.png

Web Advertising and Search

GeoffreyFox09222013Slide1004.png

Clustering and Topic Models

GeoffreyFox09222013Slide1006.png

Section 9 - Technology for X-Informatics

This section contains Units 23, 24, 25 and 26

Overview

Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show that results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the "hill" between different solutions and the rationale for running K-means many times and choosing the best answer. Then we introduce MapReduce with the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given. The SciPy K-means code is modified to support a MapReduce execution style. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the "parallel" maps run sequentially. This simple 2 map version can be generalized to scalable parallelism. Python is used to calculate PageRank from the Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The section concludes with a calculation of PageRank for general web pages by extracting the secret from Google.

Unit 23 - PageRank (Python Track)

Go To Unit 23: https://bigdatacourse.appspot.com/unit?unit=23

Overview

Geoffrey uses Python to calculate PageRank from the Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit concludes with a calculation of PageRank for general web pages by extracting the secret from Google.
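One common way to find the leading eigenvector is power iteration. The sketch below is a standard-library illustration of that formulation, not the course's code; the damping factor 0.85, the iteration count and the four-page link graph are assumptions. Every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, n_iter=100):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to. A page with
    no outgoing links spreads its rank evenly over all pages.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(n_iter):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links[p] or pages
            share = damping * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page C, which receives links from A, B and D, ends up with the highest rank, while D, which nothing links to, keeps only the baseline share.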

PageRank in Python
Calculate PageRank from Web Linkage Matrix

GeoffreyFox09222013Slide1026.png

Calculate PageRank of a real page

GeoffreyFox09222013Slide1028.png

Unit 24 - K-means (Python Track)

Go To Unit 24: https://bigdatacourse.appspot.com/unit?unit=24

Overview

Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering. After a simple example we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show that results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the "hill" between different solutions and the rationale for running K-means many times and choosing the best answer.
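To show the idea of running K-means many times and keeping the best answer, here is a pure-Python sketch of the basic K-means iteration with random restarts. It deliberately does not use the SciPy routine the unit is based on; the cluster centers, sizes and number of runs are illustrative assumptions.

```python
import math
import random

def kmeans(points, k, rng, n_iter=50):
    """One K-means run from a random start; returns (centers, cost)."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[j].append(p)
        new_centers = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # converged, possibly to a local optimum
            break
        centers = new_centers
    cost = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, cost

# Four artificial clusters with distinct centers.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
points = [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
          for cx, cy in true_centers for _ in range(50)]

runs = [kmeans(points, 4, rng) for _ in range(20)]
costs = [cost for _, cost in runs]
best_centers, best_cost = min(runs, key=lambda run: run[1])
print(sorted(round(c) for c in costs))
print(best_centers)
```

Runs that stop at a clearly higher cost are the local optima discussed above; keeping the lowest-cost run is the simple remedy.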

Kmeans in Python

GeoffreyFox09222013Slide1040.png

Analysis of 4 Artificial Clusters

GeoffreyFox09222013Slide1043.png

Unit 25 - MapReduce

Go To Unit 25: https://bigdatacourse.appspot.com/unit?unit=25

Overview

Geoffrey’s introduction to MapReduce describes the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given.
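The basic MapReduce pattern can be sketched in a few lines with word counting, a standard illustration that may differ from the example used in the unit. Everything runs sequentially in one process here; a real runtime would distribute the maps and reduces. The function names are hypothetical.

```python
from collections import defaultdict

def word_mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.lower().split()]

def sum_reducer(key, values):
    """Reduce step: combine all values emitted for one key."""
    return key, sum(values)

def map_reduce(inputs, mapper, reducer):
    """Run map, group the pairs by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

lines = ["big data", "big clouds", "data data"]
word_counts = map_reduce(lines, word_mapper, sum_reducer)
print(word_counts)  # {'big': 2, 'data': 3, 'clouds': 1}
```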

MapReduce
Introduction

GeoffreyFox09222013Slide1076.png

Advanced Topics

GeoffreyFox09222013Slide1081.png

Unit 26 - Kmeans and MapReduce Parallelism (Python Track)

Go To Unit 26: https://bigdatacourse.appspot.com/unit?unit=26

Overview

Geoffrey modifies the SciPy K-means code to support a MapReduce execution style and runs it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the "parallel" maps run sequentially. Geoffrey stresses that this simple 2 map version can be generalized to scalable parallelism.
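The mapper and reducer roles in K-means can be sketched as follows. This is a pure-Python illustration rather than the modified SciPy code from the unit: each of two maps handles one block of points and emits partial sums per center, and the reducer combines them into new centers. The points, blocks and starting centers are made-up values.

```python
import math

def kmeans_map(block, centers):
    """Mapper: per-center (coordinate sums, count) for one block of points."""
    dim = len(centers[0])
    partial = {j: ([0.0] * dim, 0) for j in range(len(centers))}
    for p in block:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        sums, count = partial[j]
        partial[j] = ([s + x for s, x in zip(sums, p)], count + 1)
    return partial

def kmeans_reduce(partials, centers):
    """Reducer: merge the partial sums from all maps into new centers."""
    new_centers = []
    for j, old in enumerate(centers):
        total, n = [0.0] * len(old), 0
        for partial in partials:
            sums, count = partial[j]
            total = [t + s for t, s in zip(total, sums)]
            n += count
        new_centers.append(tuple(t / n for t in total) if n else old)
    return new_centers

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
blocks = [points[0::2], points[1::2]]  # two maps, each sees half the data
centers = [(1.0, 1.0), (9.0, 9.0)]
for _ in range(5):
    # The "parallel" maps run one after another here.
    partials = [kmeans_map(block, centers) for block in blocks]
    centers = kmeans_reduce(partials, centers)
print(centers)  # [(0.0, 1.0), (10.0, 11.0)]
```

Because the reducer only needs sums and counts, the number of maps can grow with the data without changing the reduce logic.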

MapReduce Kmeans in Python

GeoffreyFox09222013Slide1096.png

Section 10 - Health Informatics

This section contains Unit 27

Unit 27 - Health Informatics

Go To Unit 27: https://bigdatacourse.appspot.com/unit?unit=27

Overview

Geoffrey starts by discussing general aspects of Big Data and Health, including data sizes and different areas including genomics, EBI, radiology and the Quantified Self movement. We survey an April 2013 McKinsey report on the Big Data revolution in US health care, a Microsoft report in this area and a European Union report on how Big Data will allow patient-centered care in the future. Some remarks on Cloud computing and Health focus on security and privacy issues. The final topic is Genomics, Proteomics and Information Visualization.

Big Data and Health

GeoffreyFox09222013Slide1105.png

McKinsey Report on the big data revolution in US healthcare

GeoffreyFox09222013Slide1120.png

Microsoft Report on Big Data in Health

GeoffreyFox09222013Slide1122.png

EU Report on Redesigning Health in Europe for 2020

GeoffreyFox09222013Slide1125.png

Clouds and Health

GeoffreyFox09222013Slide1132.png

Genomics, Proteomics and Information Visualization

GeoffreyFox09222013Slide1159.png

Section 11 - Sensor Informatics

This section contains Unit 28

Unit 28 - Sensors Informatics

Go To Unit 28: https://bigdatacourse.appspot.com/unit?unit=28

Overview

Geoffrey starts with the Internet of Things giving examples like monitors of machine operation, QR codes, surveillance cameras, scientific sensors, drones and self driving cars and more generally transportation systems. Sensor clouds control these many small distributed devices. More detail is given for radar data gathered by sensors; ubiquitous or smart cities and homes including U-Korea; and finally the smart electric grid.

Internet of Things

GeoffreyFox09222013Slide1175.png

Sensor Clouds

GeoffreyFox09222013Slide1181.png

Radar Data Gathered by Sensors

GeoffreyFox09222013Slide1190.png

Ubiquitous/Smart Cities

U-Korea (U = Ubiquitous).
See: http://www.academia.edu/1731078/Korean_Ubiquitous_Society_Visions

Smart Grid

GeoffreyFox09222013Slide1204.png

Section 12 - Radar Informatics

This section contains Unit 29

Unit 29 - Radar Informatics

Go To Unit 29: https://bigdatacourse.appspot.com/unit?unit=29

Overview

The changing global climate is suspected to have long-term effects on much of the world’s inhabitants. Among the various effects, the rising sea level will directly affect many people living in low-lying coastal regions. While the ocean’s thermal expansion has been the dominant contributor to rises in sea level, the potential contribution of discharges from the polar ice sheets in Greenland and Antarctica may pose a more significant threat due to their unpredictable response to the changing climate. The Radar-Informatics unit provides a glimpse into the processes fueling global climate change and explains what methods are used for ice data acquisition and analysis.

X-Informatics: Radar Informatics (with application to glaciology)
Jerome E. Mitchell
School of Informatics and Computing
Indiana University Bloomington

Motivation
  • CHANGING (DISAPPEARING OUTLET GLACIERS)
  • DATA ANALYTICS INVOLVES IMAGE PROCESSING AFTER DETAILED PROCESSING OF RADAR DATA
  • EXAMPLE OF LARGE (PETABYTE TODAY) GEOLOCATED DATASET
Outline
  • Background – Remote Sensing
  • Background – Global Climate Change
  • Ice Sheet Science
  • Radar Overview
  • Radar Basics
  • Radar Informatics