Table of contents
- Story
- Slides
- Section 1: Introduction
- Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
- Section 3: Technology Training
- Section 4 - Physics Case Study
- Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators
- Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)
- Unit 9 - Part III: Random Variables, Physics and Normal Distributions (Python Track)
- Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem (Python Track)
- Section 5: Technology Training
- Section 6 - e-Commerce and LifeStyle Case Study
- Section 7 - Infrastructure and Technologies for Big Data X-Informatics
- Section 8 - Web Search Informatics
- Section 9 - Technology for X-Informatics
- Section 10 - Health Informatics
- Section 11 - Sensor Informatics
- Section 12 - Radar Informatics
- Section 13: Spotfire
- Spotfire Dashboard
- Research Notes
- Syllabus
- Section 1: Introduction
- Unit 1: Course Introduction
- Overview
- Big Data Ecosystem in One Sentence
- The Course: Overview
- The Approach
- The Course Structure
- The Topics of Big Data & X-Informatics
- 0: Motivation
- 3 Units: X-Informatics Introduction: What is Big Data, Data Analytics and X-Informatics? Parts I-III
- Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
- 4 Units: X-Informatics Physics Use Case, Discovery of Higgs Particle; Counting Events and Basic Statistics Parts I-IV
- Using Plotviz Software for Displaying Point Distributions in 3D
- 3 Units: X-Informatics Case Study: e-Commerce and Life Style Informatics: Recommender Systems Parts I-III
- 2 Units: X-Informatics Technologies: K-Nearest Neighbor Algorithms and Clustering
- Parallel Computing: Overview of Basic Principles with Familiar Examples
- 3 Units: X-Informatics Cloud Technology Parts I-III
- 2 Units: X-Informatics Web Search and Text Mining Parts I and II
- X-Informatics Technologies
- X-Informatics: Health Informatics
- X-Informatics: Sensors
- X-Informatics: Radar Informatics (with application to glaciology)
- Unit 2: Course Motivation
- Overview
- Introduction
- Data Deluge
- Jobs
- Industry Trends
- Computing Model: Industry Adopted Clouds Which Are Attractive for Data Analytics
- Research Model: 4th Paradigm; From Theory to Data Driven Science?
- Data Science Process
- Physics Informatics Looking for Higgs Particle with Large Hadron Collider LHC
- Recommender Systems
- Web Search Information Retrieval
- Cloud Applications in Research
- Parallel Computing and MapReduce
- Data Science Education
- Conclusions
- Unit 1: Course Introduction
- Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
- Section 3: Technology Training
- Section 4 - Physics Case Study
- Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators
- Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)
- Unit 9 - Part III: Random Variables, Physics and Normal Distributions (Python Track)
- Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem (Python Track)
- Section 5: Technology Training
- Section 6 - e-Commerce and LifeStyle Case Study
- Section 7 - Infrastructure and Technologies for Big Data X-Informatics
- Section 8 - Web Search Informatics
- Section 9 - Technology for X-Informatics
- Section 10 - Health Informatics
- Section 11 - Sensor Informatics
- Section 12 - Radar Informatics
- Section 1: Introduction
Story
Data Science for Big Data Application and Analytics MOOC
The email said: Check Out Professor Geoffrey Fox's MOOC Starting December 1st. Folks, Here is the link to the "Big Data Applications and Analytics" MOOC: https://bigdatacourse.appspot.com/preview. For big data novices, this is a gentle introduction. For big data experts, this exposes Professor Fox's perspectives and insights and source references without all the details and mathematical models.
I downloaded the ZIP file of Course Syllabus (PDF), Slides (PDF) and Python Files and explored them. Then I mined and structured the content into MindTouch to build a Knowledge Base of the essence of Professor Geoffrey Fox's MOOC so I could make it part of the Federal Big Data Working Group Meetup MOOC.
I asked Professor Fox: Are there data sets used in the course? His reply was: Only a few small sample datasets and simple Monte Carlo sets. So I set about finding them and reusing them in Spotfire Statistics and Visualizations.
Important Point About Semantic Web and Big Data: Features of Data Deluge
MORE TO FOLLOW
Slides
Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
Section 3: Technology Training
Unit 6 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
Overview
This section gives an overview of the Python tools needed for this course. These are powerful tools that every data scientist who wishes to use Python should know.
Python for X-Informatics: Introduction
Python Packages and Tools
- Canopy IDE and IPython
- NumPy
- Matplotlib
- SciPy
Virtual Box Appliance
- Virtual Box Appliance with all packages set up
- Tutorial to use the Virtual Box appliance
- You can set up the necessary packages manually or use the Virtual Box appliance
Python for X-Informatics: Introduction to Canopy
Download Canopy
- https://www.enthought.com/store/
- Versions
- Express
- Basic
- Professional
Getting Necessary Packages
NumPy is a Python package which can be used to represent data in a vectorized format. A number of other packages are built on top of NumPy and use its data structures and operations. The Express version of Canopy (the one recommended for this class) comes with its own version of NumPy.
Matplotlib is a popular plotting package used for creating graphs and plots for data visualization. It can be used with IPython in an interactive manner and proves to be very useful, with interactive features like zooming and panning. It also supports all the popular graphics formats like JPEG, GIF, TIFF, etc., and can be used to plot and modify images. If you are familiar with MATLAB, you can do almost all kinds of plotting possible in MATLAB using Matplotlib.
SciPy is a powerful Python package built on top of NumPy. It has a number of algorithms pertaining to different fields like calculus, statistics, signal processing, image processing, etc.
Section 4 - Physics Case Study
Section 5: Technology Training
Unit 11: Using Plotviz Software for Displaying Point Distributions in 3D
http://salsahpc.indiana.edu/pviz3/#screenshots My Note: Main Data Source and Documentation (PDF Papers)
Section 6 - e-Commerce and LifeStyle Case Study
Section 7 - Infrastructure and Technologies for Big Data X-Informatics
Section 8 - Web Search Informatics
Section 9 - Technology for X-Informatics
Section 12 - Radar Informatics
Unit 29 - Radar Informatics
X-Informatics: Radar Informatics (with application to glaciology)
Jerome E. Mitchell
School of Informatics and Computing
Indiana University Bloomington
- CHANGING (DISAPPEARING OUTLET GLACIERS)
- DATA ANALYTICS INVOLVES IMAGE PROCESSING AFTER DETAILED PROCESSING OF RADAR DATA
- EXAMPLE OF LARGE (PETABYTE TODAY) GEOLOCATED DATASET
Outline
- Background – Remote Sensing
- Background – Global Climate Change
- Ice Sheet Science
- Radar Overview
- Radar Basics
- Radar Informatics
Section 13: Spotfire
Spotfire Dashboard
For Internet Explorer users and those wanting a full-screen display, use the Web Player. Get the Spotfire for iPad app.
Research Notes
Thank you. Are there data sets used in the course?
Only a few small sample datasets and simple Monte Carlo sets
Check Out Prof. Geoffrey Fox's MOOC Starting Dec 1st
Folks, Here is the link to the "Big Data Applications and Analytics" MOOC: https://bigdatacourse.appspot.com/preview
For big data novices, this is a gentle introduction. For big data experts, this exposes Prof. Fox's perspectives and insights and source references without all the details and mathematical models. I have already peeked into the MOOC lectures.
As always, big data requires federated, distributed data sets. IBM tends to highlight the V's because IBM has the V's (except data veracity because provenance and curation are required) covered. Oracle tends to highlight the distinction among SQL, NOSQL, and unstructured (e.g., document stores, object stores) data sets because Oracle has different products for all three. I emphasize the requirement to process data in place while sharing data sets based on policies based on governance based on Common Law. Federated identity, capabilities, policies, logging, auditing, alarms, workload management, prioritization cross any bureaucratic boundaries in order to safely share distributed data sets globally.
Syllabus
Syllabus: Big Data Application and Analytics By Dr. Geoffrey Fox https://www.dropbox.com/s/ce3mw81scn...lides.pdf?dl=1 (PDF)
Section 1: Introduction
This section contains Units 1 and 2.
Overview
This section has a technical overview of the course followed by a broad motivation for it.
The course overview covers its content and structure. It presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour; these are further subdivided into 5-15 minute lessons. The course covers a mix of applications (the X in X-Informatics) and the technologies needed to support the field electronically, i.e. to process the application data. The overview ends with a discussion of course content at the highest level. The course starts with a longish Motivation unit summarizing clouds and data science, then units describing applications (X = Physics, e-Commerce, Web Search and Text Mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications). The course uses either Python or Java, and there are side MOOCs discussing the Python and Java tracks.
The course motivation starts with striking examples of the data deluge, drawn from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data. Then the cloud computing model, developed at amazing speed by industry, is introduced. The four paradigms of scientific research are described, with the growing importance of the data-oriented version. He covers three major X-Informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.
Unit 1: Course Introduction
Go To Unit 1: https://bigdatacourse.appspot.com/unit?unit=1
Overview
Geoffrey gives a short introduction to the course covering its content and structure. He presents the X-Informatics fields (defined values of X) and the rallying cry of the course: Use Clouds running Data Analytics Collaboratively processing Big Data to solve problems in X-Informatics (or e-X). The course is set up as a MOOC divided into units that vary in length but are typically around an hour; these are further subdivided into 5-15 minute lessons.
The course covers a mix of applications (the X in X-Informatics) and the technologies needed to support the field electronically, i.e. to process the application data. The introduction ends with a discussion of course content at the highest level. The course starts with a longish Motivation unit summarizing clouds and data science, then units describing applications (X = Physics, e-Commerce, Web Search and Text Mining, Health, Sensors and Remote Sensing). These are interspersed with discussions of infrastructure (clouds) and data analytics (algorithms like clustering and collaborative filtering used in applications).
The course uses either Python or Java and there are Side MOOCs discussing Python and Java tracks.
3 Units: X-Informatics Introduction: What is Big Data, Data Analytics and X-Informatics? PartsI-III
4 Units: X-Informatics Physics Use Case, Discovery of Higgs Particle; Counting Events and Basic Statistics Parts I-IV
3 Units: X-Informatics Case Study: e-Commerce and Life Style Informatics: Recommender Systems Parts I-III
Unit 2: Course Motivation
Go To Unit 2: https://bigdatacourse.appspot.com/unit?unit=2
Overview
Geoffrey motivates the study of X-Informatics by describing data science and clouds. He starts with striking examples of the data deluge, drawn from research, business and the consumer. The growing number of jobs in data science is highlighted. He describes industry trends in both clouds and big data.
He introduces the cloud computing model, developed at amazing speed by industry. The four paradigms of scientific research are described, with the growing importance of the data-oriented version. He covers three major X-Informatics areas: Physics, e-Commerce and Web Search, followed by a broad discussion of cloud applications. Parallel computing in general and particular features of MapReduce are described. He comments on data science education and the benefits of using MOOCs.
Section 2: Overview of Data Science: What is Big Data, Data Analytics and X-Informatics?
This section contains Units 3, 4 and 5.
Overview
Overview
This section gives an overview of data science: the Data Deluge, the DIKW pipeline, clouds, and data analytics.
It starts with X-Informatics and its rallying cry, and the growing number of jobs in data science. The phenomenon described as the Data Deluge is examined, first in its broad features and then through the flood of data from Internet and industry applications, with eBay and General Electric discussed in most detail. The discussion then turns to scientific research: data from the Large Hadron Collider, biology examples, the data-intensive methodology joining observation, theory and simulation as basic methods, and the long tail of sciences. The section ends with a technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft, the most critical features of cloud computing compared with supercomputing, end-to-end systems for processing big data, and an introduction to data science and data analytics, the algorithms that crunch big data to give big wisdom.
Unit 3: Part I: Data Science generics and Commercial Data Deluge
Go To Unit 3: https://bigdatacourse.appspot.com/unit?unit=3
Overview
Geoffrey starts with X-Informatics and its rallying cry. The growing number of jobs in data science is highlighted. This unit offers a look at the phenomenon described as the Data Deluge, starting with its broad features. He then discusses data science and the famous DIKW (Data to Information to Knowledge to Wisdom) pipeline. More detail is then given on the flood of data from Internet and industry applications, with eBay and General Electric discussed in most detail.
Unit 4 - Part II: Data Deluge and Scientific Applications and Methodology
Go To Unit 4: https://bigdatacourse.appspot.com/unit?unit=4
Overview
Geoffrey continues the discussion of the data deluge with a focus on scientific research. He takes a first peek at data from the Large Hadron Collider, considered later as physics informatics, and gives some biology examples. He discusses the implications of data for the scientific method, which is changing as the data-intensive methodology joins observation, theory and simulation as basic methods.
We discuss the long tail of sciences: many users with individually modest data adding up to a lot. The last lesson emphasizes how everyday devices, the Internet of Things, are contributing to the data deluge.
Unit 5 - Part III: Clouds and Big Data Processing; Data Science Process and Analytics
Go To Unit 5: https://bigdatacourse.appspot.com/unit?unit=5
Overview
Geoffrey gives an initial technical overview of cloud computing as pioneered by companies like Amazon, Google and Microsoft, with new centers holding up to a million servers. The benefits of clouds in terms of power consumption and the environment are also touched upon, followed by a list of the most critical features of cloud computing, with a comparison to supercomputing.
He discusses features of the data deluge, with a salutary example where more data did better than more thought. Examples are given of end-to-end systems to process big data. He introduces data science and one part of it, data analytics: the algorithms that crunch the big data to give big wisdom. There are many ways to describe data science, and several are discussed to give a good composite picture of this emerging field.
Section 3: Technology Training
This section contains Unit 6.
Overview
This section gives an overview of the Python tools needed for this course. These are powerful tools that every data scientist who wishes to use Python should know. This section covers:
Canopy: an IDE for Python developed by Enthought. The aim of this IDE is to bring the various Python libraries under one single framework, or ''canopy''; that is why the name.
NumPy: a popular library on top of which many other libraries (like Pandas and SciPy) are built. It provides a way of vectorizing data. This helps organize data in a more intuitive fashion and also lets us use the various matrix operations popular in the machine learning community.
Matplotlib: a data visualization package. It allows you to create graphs, charts and other such diagrams. It supports images in JPEG, GIF and TIFF formats.
SciPy: a library built on top of NumPy with a number of off-the-shelf algorithms and operations implemented. These include algorithms from calculus (like integration), statistics, linear algebra, image processing, signal processing, machine learning, etc.
Unit 6 - Python for Big Data and X-Informatics: NumPy, SciPy, MatPlotlib
Go To Unit 6: https://bigdatacourse.appspot.com/unit?unit=6
Overview
This section gives an overview of the Python tools needed for this course. These are powerful tools that every data scientist who wishes to use Python should know.
Python for X+Informatics: Introduction
Python Packages and Tools
- Canopy IDE and IPython
- NumPy
- Matplotlib
- SciPy
Virtual Box Appliance
- Virtual Box Appliance with all packages set up
- Tutorial to use the Virtual Box appliance
- You can set up the necessary packages manually or use the Virtual Box appliance
Python for X-Informatics: Introduction to Canopy
Download Canopy
- https://www.enthought.com/store/
- Versions
- Express
- Basic
- Professional
Getting Necessary Packages
NumPy is a Python package which can be used to represent data in a vectorized format. A number of other packages are built on top of NumPy and use its data structures and operations. The Express version of Canopy (the one recommended for this class) comes with its own version of NumPy.
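A minimal sketch of the vectorized style this describes (illustrative data, not code from the course):

```python
import numpy as np

# Represent data as vectors instead of Python lists:
# arithmetic applies element-wise, with no explicit loop.
ratings = np.array([4.0, 2.0, 5.0, 3.0])
centered = ratings - ratings.mean()   # subtract the mean from every entry

# Matrix operations come for free as well.
weights = np.array([0.1, 0.2, 0.3, 0.4])
weighted_sum = ratings @ weights      # dot product
```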
Matplotlib is a popular plotting package used for creating graphs and plots for data visualization. It can be used with IPython in an interactive manner and proves to be very useful, with interactive features like zooming and panning. It also supports all the popular graphics formats like JPEG, GIF, TIFF, etc., and can be used to plot and modify images. If you are familiar with MATLAB, you can do almost all kinds of plotting possible in MATLAB using Matplotlib.
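A small example of the plotting described above (our own illustration, not course material; the non-interactive Agg backend and the file name sine.png are our choices so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen; interactive backends also work
import matplotlib.pyplot as plt
import numpy as np

# Plot one cycle of a sine wave and save it as an image file.
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.legend()
plt.savefig("sine.png")        # PNG here; JPEG, TIFF, etc. also supported
```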
SciPy is a powerful Python package built on top of NumPy. It has a number of algorithms pertaining to different fields like calculus, statistics, signal processing, image processing, etc.
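A brief sketch of the kind of off-the-shelf routines SciPy provides (our own illustration, not course material):

```python
import numpy as np
from scipy import integrate, stats

# Calculus: numerically integrate sin(x) from 0 to pi (exact answer: 2).
area, err = integrate.quad(np.sin, 0, np.pi)

# Statistics: probability mass within one standard deviation of the
# mean of a standard normal distribution (about 0.68).
p = stats.norm.cdf(1) - stats.norm.cdf(-1)
```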
Section 4 - Physics Case Study
This section contains Units 7, 8, 9 and 10.
Overview
This section starts by describing the LHC accelerator at CERN and evidence found by the experiments suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery. The next unit is devoted to Python experiments looking at histograms of Higgs Boson production with various shapes of signal, various backgrounds and various event totals. Then random variables and some simple principles of statistics are introduced, with an explanation of why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given. Random numbers, with their generators and seeds, lead to a discussion of the Binomial and Poisson distributions, and of Monte Carlo and accept-reject methods. The Central Limit Theorem concludes the discussion.
Unit 7 - Part I: Bumps in Histograms, Experiments and Accelerators
Go To Unit 7: https://bigdatacourse.appspot.com/unit?unit=7
Overview
In this short unit Geoffrey describes the LHC accelerator at CERN and evidence found by the ATLAS experiment suggesting the existence of a Higgs Boson. The huge number of authors on a paper, remarks on histograms and Feynman diagrams are followed by an accelerator picture gallery.
Unit 8 - Part II: Python Event Counting for Signal and Background (Python Track)
Go To Unit 8: https://bigdatacourse.appspot.com/unit?unit=8
Overview
This unit is devoted to Python experiments, with Geoffrey looking at histograms of Higgs Boson production with various shapes of signal, various backgrounds and various event totals.
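A toy sketch of such a counting experiment, assuming a flat background with a Gaussian signal bump (illustrative numbers, not the course's actual notebook):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# A Gaussian "signal" bump near 126 on top of a flat "background",
# binned into a histogram, as in the unit's counting experiments.
background = rng.uniform(110, 140, size=10000)     # flat background events
signal = rng.normal(loc=126, scale=2, size=300)    # bump at ~126
events = np.concatenate([background, signal])

counts, edges = np.histogram(events, bins=30, range=(110, 140))

# Bins near the bump should rise above the flat background level
# (about 10000/30, i.e. roughly 333 events per bin).
peak_bin = np.argmax(counts)
```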
Unit 9 - Part III: Random Variables, Physics and Normal Distributions (Python Track)
Go To Unit 9: https://bigdatacourse.appspot.com/unit?unit=9
Overview
Geoffrey introduces random variables and some simple principles of statistics and explains why they are relevant to physics counting experiments. The unit introduces Gaussian (normal) distributions and explains why they are seen so often in natural phenomena. Several Python illustrations are given.
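A short illustration (not the course's code) of sampling a Gaussian random variable and checking its familiar properties empirically:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Sample a standard Gaussian (normal) random variable many times.
samples = rng.normal(loc=0.0, scale=1.0, size=100000)

mean = samples.mean()                              # should be close to 0
std = samples.std()                                # should be close to 1
within_one_sigma = np.mean(np.abs(samples) < 1.0)  # about 0.68
```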
Unit 10 - Part IV: Random Numbers, Distributions and Central Limit Theorem (Python Track)
Go To Unit 10: https://bigdatacourse.appspot.com/unit?unit=10
Overview
Geoffrey discusses random numbers, their generators and seeds. The unit introduces the Binomial and Poisson distributions. Monte Carlo and accept-reject methods are discussed. The Central Limit Theorem concludes the discussion. Python examples and physics applications are given.
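A sketch of the Central Limit Theorem at work, using sums of uniform random numbers (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Central Limit Theorem: the sum of many independent uniform random
# numbers approaches a Gaussian, whatever the original distribution.
n_terms = 48
sums = rng.uniform(0, 1, size=(100000, n_terms)).sum(axis=1)

# A Uniform(0,1) variable has mean 1/2 and variance 1/12, so the sum
# should have mean n/2 = 24 and standard deviation sqrt(n/12) = 2.
expected_mean = n_terms / 2
expected_std = np.sqrt(n_terms / 12)
```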
Section 5: Technology Training
This section contains Unit 11.
Unit 11: Using Plotviz Software for Displaying Point Distributions in 3D
Go To Unit 11: https://bigdatacourse.appspot.com/unit?unit=11
Overview
Geoffrey introduces Plotviz, a data visualization tool developed at Indiana University to display 2 and 3 dimensional data. The motivation is that the human eye is very good at pattern recognition and can "see" structure in data. Although most Big Data is higher dimensional than 3, all of it can be transformed by dimension reduction techniques to 3D. He gives several examples to show how the software can be used and what kind of data can be visualized. This includes individual plots and the manipulation of multiple synchronized plots. Finally, he describes the download and software dependencies of Plotviz.
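Although Plotviz itself is a separate tool, the dimension-reduction step mentioned above can be sketched with PCA in NumPy (our illustration, not part of Plotviz or the course's code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Project high-dimensional data down to 3D with PCA so that a 3D
# viewer (such as Plotviz) can display it.
data = rng.normal(size=(200, 10))       # 200 points in 10 dimensions
centered = data - data.mean(axis=0)

# Eigenvectors of the covariance matrix give the principal directions.
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top3 = eigvecs[:, np.argsort(eigvals)[::-1][:3]]  # 3 largest components

points_3d = centered @ top3             # 200 points, now in 3D
```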
Data Sets
- http://salsahpc.indiana.edu/
- http://salsahpc.indiana.edu/pviz3/
- http://salsahpc.indiana.edu/pviz3/#downloads
- http://salsahpc.indiana.edu/pviz3/da...iz3_sample.zip
- http://salsahpc.indiana.edu/pviz3/#samples
- http://salsahpc.indiana.edu/pviz3/da...iz3_sample.zip
- http://salsahpc.indiana.edu/pviz3/#screenshots My Note: Main Data Source and Documentation (PDF Papers)
- http://salsabiology.blogspot.com/p/introduction.html
Example of Use I: Cube and Structured Dataset
Example of Use II: Proteomics and Synchronized Rotation
Example of Use III: More Features and Larger Proteomics Sample
Example of Use IV: Tools and Examples
Example of Use V: Final Examples
Section 6 - e-Commerce and LifeStyle Case Study
This section contains Units 12, 13, 14, 15 and 16.
Overview
Recommender systems operate under the hood of such widely recognized sites as Amazon, eBay, Monster and Netflix, where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs. Kaggle competitions were held to improve the success of the Netflix and other recommender systems. Attention is paid to models that are used to compare how changes to the systems affect their overall performance. Geoffrey muses how the humble ranking has become such a dominant driver of the world's economy. More examples of recommender systems are given from Google News, retail stores and, in depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites. The formulation of recommendations in terms of points in a space or bag is given, where bags of item properties, user properties, rankings and users are useful. Detail is given on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items. Items are viewed as points in a space of users in item-based collaborative filtering. The cosine similarity is introduced, along with the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed. A simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions are given. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concepts of a training set and a testing set are introduced, with the training set pre-labelled.
Recommender systems are used to discuss clustering, with k-means based clustering methods used and their results examined in Plotviz. The original labelling is compared to clustering results, and an extension to 28 clusters is given. General issues in clustering are discussed, including local optima, the use of annealing to avoid them, and the value of heuristic algorithms.
Unit 12 - Part I: Recommender Systems: Introduction
Go To Unit 12: https://bigdatacourse.appspot.com/unit?unit=12
Overview
Geoffrey introduces Recommender systems as an optimization technology used in a variety of applications and contexts online. They operate in the background of such widely recognized sites as Amazon, eBay, Monster and Netflix where everything is a recommendation. This involves a symbiotic relationship between vendor and buyer whereby the buyer provides the vendor with information about their preferences, while the vendor then offers recommendations tailored to match their needs, to the benefit of both.
There follows an exploration of the Kaggle competition site, other recommender systems and Netflix, as well as competitions held to improve the success of the Netflix recommender system. Finally attention is paid to models that are used to compare how changes to the systems affect their overall performance. Geoffrey muses how the humble ranking has become such a dominant driver of the world’s economy.
Kaggle Competitions
Unit 13- Part II: Recommender Systems: Examples and Algorithms
Go To Unit 13: https://bigdatacourse.appspot.com/unit?unit=13
Overview
Geoffrey continues the discussion of recommender systems and their use in e-commerce. More examples are given from Google News, retail stores and, in depth, Yahoo!, covering the multi-faceted criteria used in deciding recommendations on web sites. Then the formulation of recommendations in terms of points in a space or bag is given. Here bags of item properties, user properties, rankings and users are useful. Then we go into detail on basic principles behind recommender systems: user-based collaborative filtering, which uses similarities in user rankings to predict their interests, and the Pearson correlation, used to statistically quantify correlations between users viewed as points in a space of items.
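A small sketch of user-based collaborative filtering with the Pearson correlation (made-up ratings; the helper function is ours, not from the course):

```python
import numpy as np

# Users are points in a space of items; the Pearson correlation
# quantifies how similarly two users rank those items.
def pearson(u, v):
    u, v = np.asarray(u, float), np.asarray(v, float)
    uc, vc = u - u.mean(), v - v.mean()
    return (uc @ vc) / (np.sqrt(uc @ uc) * np.sqrt(vc @ vc))

alice = [5, 3, 4, 4]   # ratings of four items
bob   = [3, 1, 2, 3]   # similar tastes, lower on the scale
carol = [1, 5, 2, 4]   # quite different tastes

# Pearson ignores the overall rating level, so Alice and Bob come out
# highly similar even though Bob rates everything lower.
sim_ab = pearson(alice, bob)
sim_ac = pearson(alice, carol)
```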
Unit 14 - Part III: Item-based Collaborative Filtering and its Technologies
Go To Unit 14: https://bigdatacourse.appspot.com/unit?unit=14
Overview
Geoffrey moves on to item-based collaborative filtering, where items are viewed as points in a space of users. The Cosine Similarity is introduced, along with the difference between implicit and explicit ratings and the k Nearest Neighbors algorithm. General features like the curse of dimensionality in high dimensions are discussed.
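The Cosine Similarity used here can be written directly from its definition: the dot product of two item vectors divided by the product of their lengths. The sketch below is illustrative (the list-of-ratings format, with 0 meaning "not rated", is an assumption of this example, not the course's code).

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two items, each viewed as a point in the
    space of users: x[i] and y[i] are user i's ratings (0 = not rated)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0
```

Parallel rating vectors give similarity 1, orthogonal ones give 0, so two items liked by the same users in the same proportions are maximally similar.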
Unit 15 - Part IV: k Nearest Neighbor Algorithm (Python Track)
Go To Unit 15: https://bigdatacourse.appspot.com/unit?unit=15
Overview
Geoffrey discusses a simple Python k Nearest Neighbor code and its application to an artificial data set in 3 dimensions. Results are visualized in Matplotlib in 2D and with Plotviz in 3D. The concepts of training and testing sets are introduced, with the training set prelabelled.
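The core of such a code fits in a few lines: classify a query point by majority vote among its k nearest prelabelled training points. The sketch below is a hypothetical stand-in for the unit's code (the `(point, label)` training format is an assumption here).

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (point, label) pairs with prelabelled points.
    Classify `query` by majority vote among its k nearest neighbours
    under Euclidean distance."""
    nearest = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A test set is then scored by comparing `knn_predict` against the held-out true labels.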
Unit 16 - Part V: Clustering
Go To Unit 16: https://bigdatacourse.appspot.com/unit?unit=16
Overview
Geoffrey uses the example of a recommender system to discuss clustering. The details of the methods are not discussed, but k-means based clustering methods are used and their results examined in Plotviz. The original labelling is compared to the clustering results, and an extension to 28 clusters is given. General issues in clustering are discussed, including local optima, the use of annealing to avoid them, and the value of heuristic algorithms.
Section 7 - Infrastructure and Technologies for Big Data X-Informatics
This section contains Units 17, 18, 19 and 20.
Overview
This section describes the central role of parallel computing in Clouds and Big Data, which is decomposed into lots of ‘’Little data’’ running in individual cores. Many examples are given, and it is stressed that issues in parallel computing such as communication, synchronization, load balancing and decomposition are seen in day-to-day life. Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing are introduced. This includes virtualization and the important ‘’as a Service’’ components, and we go through several different definitions of cloud computing. Gartner’s Technology Landscape includes the hype cycle and priority matrix and covers clouds and Big Data. Two simple examples of the value of clouds for enterprise applications are given, with a review of different views as to the nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). Features in Grid and cloud computing and data are treated. Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models are discussed, followed by the Cloud Industry stakeholders and applications on the cloud, including data-intensive problems and a comparison with high performance computing. Remarks on Security, Fault Tolerance and Synchronicity issues in the cloud follow. Big Data processing from an application perspective, with commercial examples including eBay, concludes the section.
Unit 17 - Parallel Computing: Overview of Basic Principles with Familiar Examples
Go To Unit 17: https://bigdatacourse.appspot.com/unit?unit=17
Overview
Geoffrey describes the central role of parallel computing in Clouds and Big Data, which is decomposed into lots of ‘’Little data’’ running in individual cores. Many examples are given, and it is stressed that issues in parallel computing such as communication, synchronization, load balancing and decomposition are seen in day-to-day life.
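The decomposition idea can be made concrete with a tiny sketch (an illustration, not from the course; the function names are invented here): one big sum is split into ‘’little’’ chunks, each processed by its own worker, with communication only at the final combine step. This is where the load-balancing and synchronization issues mentioned above arise.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(data, n):
    """Decompose one big list into n roughly equal 'little data' pieces."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum(data, workers=4):
    """Each worker sums its own chunk independently; the partial sums are
    then combined. Communication happens only at the final reduction."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunked(data, workers)))
    return sum(partials)
```

Equal-sized chunks give good load balance here because every element costs the same; real applications must decompose more carefully when work per element varies.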
Unit 18 - X-Informatics Cloud Technology Part I: Introduction
Go To Unit 18: https://bigdatacourse.appspot.com/unit?unit=18
Overview
Geoffrey discusses Cyberinfrastructure for e-moreorlessanything or moreorlessanything-Informatics and the basics of cloud computing. This includes virtualization and the important ‘’as a Service’’ components, and we go through several different definitions of cloud computing.
Gartner’s Technology Landscape includes the hype cycle and priority matrix and covers clouds and Big Data. Geoffrey reviews the nature of the 48 technologies in the 2012 emerging-technology hype cycle. Gartner has specific predictions for cloud computing growth areas. The unit concludes with two simple examples of the value of clouds for enterprise applications.
Unit 19 - X-Informatics Cloud Technology Part II: Introduction
Go To Unit 19: https://bigdatacourse.appspot.com/unit?unit=19
Overview
Geoffrey covers different views as to the nature of Cloud Computing. This IaaS (Infrastructure as a Service) discussion is followed by PaaS and SaaS (Platform and Software as a Service). The unit discusses features introduced in Grid computing and features introduced by clouds. It concludes with the treatment of data in the cloud from an architecture perspective.
Unit 20 - X-Informatics Cloud Technology Part III: Introduction
Go To Unit 20: https://bigdatacourse.appspot.com/unit?unit=20
Overview
Geoffrey opens up with a discussion of Cloud (Data Center) Architectures with physical setup, Green Computing issues and software models. A discussion of Cloud Industry stakeholders is followed by applications on the cloud, including data-intensive problems and a comparison with high performance computing. Remarks on Security, Fault Tolerance and Synchronicity issues in the cloud follow. Big Data processing from an application perspective, with commercial examples including eBay, concludes the unit.
Section 8 - Web Search Informatics
Overview
This section starts with an overview of data mining and puts our study of classification, clustering and exploration methods in context. We examine the problem to be solved in web and text search and note the relevance of history, with libraries, catalogs and concordances. An overview of web search is given, describing the continued evolution of search engines and the relation to the field of Information Retrieval. The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced, along with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model, while the distinction between Bayesian and frequency views of probability distributions completes the first unit. Geoffrey then gives an overview of the different steps (data analytics) in web search and goes through key steps in detail, starting with document preparation. An inverted index is described, and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow. This is followed by Link Structure Analysis, including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it and issues in web advertising and search follow. The use of clustering and topic models completes the section.
Unit 21- X-Informatics Web Search and Text Mining I
Go To Unit 21: https://bigdatacourse.appspot.com/unit?unit=21
Overview
Geoffrey starts this unit with an overview of data mining, putting our study of classification, clustering and exploration methods in context. We examine the problem to be solved in web and text search and note the relevance of history, with libraries, catalogs and concordances. An overview of web search is given, describing the continued evolution of search engines and the relation to the field of Information Retrieval.
The importance of recall, precision and diversity is discussed. The important Bag of Words model is introduced, along with both Boolean queries and the more general fuzzy indices. The important vector space model follows, revisiting the Cosine Similarity as a distance in this bag. The basic TF-IDF approach is discussed. Relevance is discussed with a probabilistic model, while the distinction between Bayesian and frequency views of probability distributions completes this unit.
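The vector space model with TF-IDF weights can be sketched briefly (an illustrative example, not the course's code; the token-list document format and function names are assumptions here): each document becomes a sparse vector of term frequency times inverse document frequency, and the Cosine Similarity compares documents in this bag-of-words space.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists (bags of words). Returns one sparse
    vector (dict term -> tf * idf) per document, with idf = log(N / df)."""
    n = len(docs)
    df = Counter()                       # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                # term frequency within the document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine Similarity between two sparse term vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Terms appearing in every document get idf = log(1) = 0 and so stop contributing, which is exactly the down-weighting of uninformative common words TF-IDF is meant to provide.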
Unit 22 - X-Informatics Web Search and Text Mining II
Go To Unit 22: https://bigdatacourse.appspot.com/unit?unit=22
Overview
Geoffrey starts with an overview of the different steps (data analytics) in web search and then goes through key steps in detail, starting with document preparation. An inverted index is described, and then how it is prepared for web search. The Boolean and Vector Space approaches to query processing follow.
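The inverted index and a Boolean AND query over it can be sketched as follows (an illustrative toy, not the unit's code; the dict-of-token-lists document format is an assumption of this sketch). The index maps each term to its posting list of document ids; an AND query simply intersects posting lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict doc_id -> list of tokens. Returns term -> sorted posting
    list of doc ids, the core data structure of web search."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            index[t].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def boolean_and(index, terms):
    """Boolean AND query: intersect the posting lists of all query terms."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

Vector-space query processing walks the same posting lists but accumulates TF-IDF scores per document instead of taking a strict intersection.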
This is followed by Link Structure Analysis, including Hubs, Authorities and PageRank. The application of PageRank ideas as reputation outside web search is covered. The web graph structure, crawling it and issues in web advertising and search follow. The use of clustering and topic models completes the unit.
Section 9 - Technology for X-Informatics
This section contains Units 23, 24, 25 and 26.
Overview
Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering. After a simple example, we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show that the results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the ‘’hill’’ between different solutions and the rationale for running K-means many times and choosing the best answer. Then we introduce MapReduce with the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given. The SciPy K-means code is modified to support a MapReduce execution style. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the ‘’parallel’’ maps run sequentially. This simple 2-map version can be generalized to scalable parallelism. Python is used to calculate PageRank from the Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. This is concluded by a calculation of PageRank for general web pages by extracting the secret from Google.
Unit 23 - PageRank (Python Track)
Go To Unit 23: https://bigdatacourse.appspot.com/unit?unit=23
Overview
Geoffrey uses Python to calculate PageRank from the Web Linkage Matrix, showing several different formulations of the basic matrix equations for finding the leading eigenvector. The unit is concluded by a calculation of PageRank for general web pages by extracting the secret from Google.
Unit 24 - K-means (Python Track)
Go To Unit 24: https://bigdatacourse.appspot.com/unit?unit=24
Overview
Geoffrey uses the K-means Python code in the SciPy package to show real code for clustering. After a simple example, we generate 4 clusters with distinct centers and various choices of sizes, using Matplotlib for visualization. We show that the results can sometimes be incorrect and sometimes make different choices among comparable solutions. We discuss the ‘’hill’’ between different solutions and the rationale for running K-means many times and choosing the best answer.
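The unit itself uses SciPy's K-means; the pure-Python sketch below (an illustration, not that code) shows the restart idea discussed above: each run of Lloyd's algorithm can get stuck on a local optimum, so we run it several times and keep the answer with the lowest distortion.

```python
import math
import random

def kmeans(points, k, iters=50):
    """One run of Lloyd's algorithm from random initial centers.
    Returns (centers, distortion), where distortion is the total squared
    distance of points to their nearest center."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    distortion = sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)
    return centers, distortion

def best_of(points, k, restarts=10):
    """Run K-means several times and keep the lowest-distortion answer,
    reducing the chance of being stuck on a poor local optimum."""
    return min((kmeans(points, k) for _ in range(restarts)), key=lambda r: r[1])
```

The ‘’hill’’ between solutions is visible here: no single run can move a center across a gap between clusters, but independent restarts sample different basins.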
Unit 25 - MapReduce
Go To Unit 25: https://bigdatacourse.appspot.com/unit?unit=25
Overview
Geoffrey’s introduction to MapReduce describes the basic architecture and a homely example. The discussion of advanced topics includes an extension to Iterative MapReduce from Indiana University called Twister and a generalized Map Collective model. Some measurements of parallel performance are given.
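The basic architecture can be illustrated with the classic word-count example (a standard illustration, not necessarily the unit's own example; all function names here are invented for this sketch): mappers emit key-value pairs, the runtime shuffles them into groups by key, and reducers combine each group.

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word on one input line."""
    return [(w.lower(), 1) for w in line.split()]

def shuffle(pairs):
    """Group values by key, as the MapReduce runtime does between phases."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(key, values):
    """Reduce phase: combine all the counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(k, vs) for k, vs in shuffle(mapped).items())
```

Because each mapper sees only its own line and each reducer only its own key, both phases parallelize naturally; only the shuffle requires communication.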
Unit 26 - K-means and MapReduce Parallelism (Python Track)
Go To Unit 26: https://bigdatacourse.appspot.com/unit?unit=26
Overview
Geoffrey modifies the SciPy K-means code to support a MapReduce execution style and runs it in this short unit. This illustrates the key ideas of mappers and reducers. With an appropriate runtime this code would run in parallel, but here the ‘’parallel’’ maps run sequentially. Geoffrey stresses that this simple 2-map version can be generalized to scalable parallelism.
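The mapper/reducer split for K-means can be sketched as follows (an illustration in the spirit of the unit, not its actual code; 2-D points and two data splits standing in for two mapper tasks are assumptions of this sketch). Each mapper assigns its split's points to the nearest center; the reducer averages the points gathered for each center.

```python
import math

def kmeans_map(split, centers):
    """Mapper: for one data split, emit (center_index, (point, 1)) pairs."""
    out = []
    for p in split:
        i = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        out.append((i, (p, 1)))
    return out

def kmeans_reduce(pairs, k):
    """Reducer: average the 2-D points assigned to each center index."""
    sums = [[0.0, 0.0] for _ in range(k)]
    counts = [0] * k
    for i, (p, c) in pairs:
        sums[i][0] += p[0]
        sums[i][1] += p[1]
        counts[i] += c
    return [(s[0] / n, s[1] / n) if n else tuple(s)
            for s, n in zip(sums, counts)]

def kmeans_mapreduce(splits, centers, iters=10):
    """One K-means iteration = map over all splits, then reduce. Here the
    'parallel' maps run sequentially; a MapReduce runtime would run them
    on separate workers."""
    for _ in range(iters):
        pairs = [pair for split in splits for pair in kmeans_map(split, centers)]
        centers = kmeans_reduce(pairs, len(centers))
    return centers
```

Adding more splits (mappers) changes nothing in the logic, which is why this simple 2-map version generalizes to scalable parallelism.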
Section 10 - Health Informatics
This section contains Unit 27.
Unit 27 - Health Informatics
Go To Unit 27: https://bigdatacourse.appspot.com/unit?unit=27
Overview
Geoffrey starts by discussing general aspects of Big Data and Health, including data sizes and different areas such as genomics, EBI, radiology and the Quantified Self movement. We survey an April 2013 McKinsey report on the Big Data revolution in US health care, a Microsoft report in this area and a European Union report on how Big Data will allow patient-centered care in the future. Some remarks on Cloud computing and Health focus on security and privacy issues. The final topic is Genomics, Proteomics and Information Visualization.
Section 11 - Sensor Informatics
This section contains Unit 28.
Unit 28 - Sensor Informatics
Go To Unit 28: https://bigdatacourse.appspot.com/unit?unit=28
Overview
Geoffrey starts with the Internet of Things, giving examples like monitors of machine operation, QR codes, surveillance cameras, scientific sensors, drones and self-driving cars, and more generally transportation systems. Sensor clouds control these many small distributed devices. More detail is given for radar data gathered by sensors; ubiquitous or smart cities and homes, including U-Korea; and finally the smart electric grid.
Ubiquitous/Smart Cities
U-Korea (U = Ubiquitous).
See: http://www.academia.edu/1731078/Korean_Ubiquitous_Society_Visions
Section 12 - Radar Informatics
This section contains Unit 29.
Unit 29 - Radar Informatics
Go To Unit 29: https://bigdatacourse.appspot.com/unit?unit=29
Overview
The changing global climate is suspected to have long-term effects on many of the world’s inhabitants. Among the various effects, rising sea level will directly affect the many people living in low-lying coastal regions. While the ocean’s thermal expansion has been the dominant contributor to rises in sea level, the potential contribution of discharges from the polar ice sheets in Greenland and Antarctica may pose a more significant threat due to their unpredictable response to the changing climate. The Radar Informatics unit provides a glimpse into the processes fueling global climate change and explains what methods are used for ice data acquisition and analysis.
X-Informatics: Radar Informatics (with application to glaciology)
Jerome E. Mitchell
School of Informatics and Computing
Indiana University Bloomington
Motivation
- Changing (disappearing) outlet glaciers
- Data analytics involves image processing after detailed processing of radar data
- Example of a large (petabyte today) geolocated dataset
Outline
- Background – Remote Sensing
- Background – Global Climate Change
- Ice Sheet Science
- Radar Overview
- Radar Basics
- Radar Informatics