##### Table of contents

- Story
- Slides
- Slide 1 Data Science for Statistics.com
- Slide 2 Chronology 1
- Slide 3 Chronology 2
- Slide 4 Chronology 3
- Slide 5 Data Science Process 1
- Slide 6 Data Science Process 2
- Slide 7 Data Science Process 3
- Slide 8 Data Science Process 4
- Slide 9 The initial mapping is as follows:
- Slide 10 Conclusions and Recommendations
- Slide 11 Data Science for Statistics.com: MindTouch Knowledge Base
- Slide 12 TIBCO Spotfire User’s Guide Knowledge Base
- Slide 13 TIBCO Spotfire Titanic Competition: Cover Page
- Slide 14 TIBCO Spotfire Titanic Competition: Train Data Set
- Slide 15 TIBCO Spotfire Titanic Competition: Train Model Relationships
- Slide 16 TIBCO Spotfire Titanic Competition: Test Data Set
- Slide 17 TIBCO Spotfire Titanic Competition: Test Model Relationships
- Slide 18 TIBCO Spotfire Titanic Competition: Gender Data Sets Exploratory Data Analysis
- Slide 19 Results 1
- Slide 20 Results 2

- Spotfire Dashboard
- Research Notes
- statistics.com
- NEXT

## Story

**Data Science for Statistics.com**

Janet Dobbins, Vice President of Marketing and Communications, Statistics.com, attended one of our Meetups. She invited me to lunch with Peter Bruce, the founder and president of Statistics.com.

I said: Thank you for an excellent lunch. Did we identify any collaborative opportunities? Do you want me to look at your training courses and suggest something?

Janet replied: You are such a wealth of knowledge and experience! I would love to collaborate with you, admittedly, I am not exactly sure how. Do you have any ideas?

I replied: I think I would get some ideas from looking at your courses and see where I might be able to promote/supplement them to our Meetup members. What do you think about that idea for starters?

Janet replied: That would be wonderful. Here's a link to the catalog: http://www.statistics.com/course-catalog.

Would you mind if I added you to our mailing list for reminders of upcoming courses?

I replied: Thank you. I gave it a quick look, and the most interesting thing I found was this statement on your site: There is nothing "massively open" about our courses

You have probably heard of "massively open online courses" -- "MOOC's." That's not us. We focus on the human element in the education process. Our instructors are world-renowned experts who are available to your employees throughout each course to answer questions and lead discussions. Our administrators and teaching assistants pay close attention to all the students in every course -- we guide them through the registration process, check on them if they are not participating, and provide help with questions on course work and homework assignments. We offer a level of personal service that cannot be matched by other online courses with thousands of students.

Most Data Science training I am familiar with, including mine, is done with MOOCs.

The most popular Data Science Training that I am aware of is:

- http://www.datasciencecentral.com/gr...apprenticeship
- http://www.datasciencecentral.com/gr...-certification

Please add me to your mailing list.

I further replied: I just analyzed the Kaggle Titanic: Machine Learning from Disaster Competition:

https://www.kaggle.com/c/titanic

for our upcoming Meetup and it reminded me that I probably could show how most, if not all, of your courses could be applied to something like this and present it at the Meetup.

Janet replied: That would be terrific! Thank you so much!

So I data mined the Statistics.com Data Catalog and browsed the Statistics.com web pages to get a general idea of what they offer and my findings were:

- "the source for statistics education" that has added data science education;
- 110+ courses for novices, experts, and those in between. Most courses are 4 weeks long and put you in direct contact with leading experts and authors;
- topics in Data Science (predictive modeling, text mining, programming, the challenges of "Big Data" and unstructured data);
- topics in Statistics, as used in research and data analysis (biostatistics, inference, statistical models, Bayesian techniques, study design);
- Need advice on which course to take? Contact us with your goals and background, and one of our instructors will provide some suggestions.

The Statistics.com business model is to offer many online courses to a worldwide audience (the map display on their home page is impressive, but not a good example of data science), so that they and their instructors make money.

This is in marked contrast to a MOOC (/muːk/), a massive open online course: an online course aimed at unlimited participation and open access via the web. Many MOOCs provide interactive user forums to support community interactions between students, professors, and teaching assistants (TAs). Their principal characteristics are:

- Focus on Scalability
- Focus on Community and Connections
- No registration and Free of charge
- Self paced Learning community with Real-time interaction using Open content

Data Science has a Data Mining Standard (CRISP-DM, the Cross-Industry Standard Process for Data Mining). It is used in a George Mason University Data Science Graduate Class: Data Mining for the Masses, which follows the CRISP-DM Standard and provides data sets and hands-on exercises that I have repurposed into a MOOC for the Federal Big Data Working Group Meetup.

So what if one could cover most, or even all, of the topics and methods in the 110 courses of the Statistics.com Data Catalog by doing the following:

- Select one or more interesting data sets (e.g. Kaggle Titanic: Machine Learning from Disaster Competition)
- Select a free trial comprehensive Data Science and Statistics Platform Tool (e.g. TIBCO Spotfire)
- Produce a Data Science Product that tells a story that could be published, presented, used in a job interview, used in teaching, etc.
- Produce more Data Science Products that use interesting data and tools to learn any of the 110 remaining topics and methods in the Statistics.com Data Catalog and publish them as a MOOC!

So I am going to do one example, map it to the 110 courses in the Statistics.com Data Catalog, present it to the Federal Big Data Working Group Meetup, and see if there are members who would like to crowdsource this to develop the MOOC.

The results are shown in the Slides (screen captures) of the Spotfire Dashboard below.

The initial mapping is as follows:

- Data Science:
  - Data Mining and Prediction:
  - Data Analytics (Spotfire has)
  - Text Analytics (Semantic Insights does this)
  - Using R (Spotfire has)
  - IT/Programming (Spotfire has)
  - Spatial Analytics (Spotfire has)
  - Operations Research and Risk (Spotfire has)

- Statistics:
  - Stats for Credit (Spotfire has)
  - Introductory (Spotfire has)
  - Review/Prep (Spotfire has)
  - Statistical Modeling (Spotfire has)
  - Biostatistics (Spotfire has)
  - Bayesian (Spotfire has)
  - Methods (Spotfire has)
  - Clinical Trials (Spotfire has special version)
  - Engineering (Spotfire has)
  - Environmental (Spotfire did this for me at the US EPA)
  - Social Science (NodeXL does most of these)
  - Survey Statistics (Spotfire has)

To refine and extend this, I would need to see the actual data sets and hands-on exercises for the 110 courses in the Statistics.com Data Catalog. My previous analysis of SAS Public Data Sets for Training found them to be of limited use in the real, practical world.

My conclusion and recommendation is to get an interesting data set with a well-defined purpose (a question to be answered) and a comprehensive tool like Spotfire, and to learn Data Science and Statistics by doing exploratory data analysis and studying examples like the one provided below.

- Slide 13 TIBCO Spotfire Titanic Competition: Cover Page - Explains the Kaggle Competition, Titanic Data, and Resources. See Research Notes
- Slide 14 TIBCO Spotfire Titanic Competition: Train Data Set - Shows Train Data Set Multiple Bar Charts for Categories of Survived (0: died and 1: lived), Sex (female and male), and Pclass (1, 2, or 3). More males in Pclass 3 died than almost all the rest of those who died, and many more females than males lived in all three Pclasses.
- Slide 15 TIBCO Spotfire Titanic Competition: Train Model Relationships - For each combination of columns, the tool calculates a p-value, representing the degree to which the first column predicts values in the second column. A low p-value indicates a probable strong connection between two columns. Pclass and Fare are the most strongly correlated with Survived.
- Slide 16 TIBCO Spotfire Titanic Competition: Test Data Set - Shows Test Data Set Multiple Bar Charts for Categories of Sex (female and male) and Pclass (1, 2, or 3). Nearly twice as many males as females were in Pclass 3.
- Slide 17 TIBCO Spotfire Titanic Competition: Test Model Relationships - For each combination of columns, the tool calculates a p-value, representing the degree to which the first column predicts values in the second column. A low p-value indicates a probable strong connection between two columns. Pclass and Fare are the most strongly correlated.
- Slide 18 TIBCO Spotfire Titanic Competition: Gender Data Sets Exploratory Data Analysis - Shows EDA Bar Graphs of the Gender Data Sets by Survived (0:died and 1: lived). About twice as many died as lived.
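
The column-relationship p-values described for Slides 15 and 17 can be illustrated outside Spotfire. The notes above do not say which test Spotfire applies, so the following is only a minimal sketch that assumes a Pearson chi-square test of independence on a 2x2 table of Sex against Survived. The rows are hypothetical toy counts, not the Kaggle train.csv values, and the function names `crosstab` and `chi_square_2x2` are made up for this example.

```python
import math
from collections import Counter

# Hypothetical toy rows of (Sex, Survived); NOT the real Kaggle train.csv counts.
rows = (
    [("female", 1)] * 30 + [("female", 0)] * 10
    + [("male", 1)] * 10 + [("male", 0)] * 50
)


def crosstab(pairs):
    """Count how often each (first column, second column) pair occurs."""
    return Counter(pairs)


def chi_square_2x2(counts, row_levels, col_levels):
    """Pearson chi-square test of independence for a 2x2 table.

    Returns (statistic, p_value). With 1 degree of freedom the p-value
    equals erfc(sqrt(statistic / 2)), so only the standard library is used.
    """
    total = sum(counts[(r, c)] for r in row_levels for c in col_levels)
    row_sums = {r: sum(counts[(r, c)] for c in col_levels) for r in row_levels}
    col_sums = {c: sum(counts[(r, c)] for r in row_levels) for c in col_levels}
    stat = 0.0
    for r in row_levels:
        for c in col_levels:
            # Count expected in this cell if the two columns were unrelated.
            expected = row_sums[r] * col_sums[c] / total
            stat += (counts[(r, c)] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


counts = crosstab(rows)
stat, p_value = chi_square_2x2(counts, ["female", "male"], [0, 1])
print(f"chi-square = {stat:.2f}, p-value = {p_value:.2g}")
```

The same cross-tabulation is what the bar charts on Slides 14 and 16 display visually. A small p-value here is evidence that the two columns are related; it does not by itself measure how strong the relationship is.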

MORE EXAMPLES AND MOOC TO FOLLOW

## Slides

### Slide 1 Data Science for Statistics.com

### Slide 3 Chronology 2

### Slide 7 Data Science Process 3

Federal Big Data Working Group Meetup

### Slide 11 Data Science for Statistics.com: MindTouch Knowledge Base

Semantic Community Data Science Data Science for Statistics.com

## Spotfire Dashboard

For Internet Explorer users and those wanting a full-screen display, use: Web Player. Get the Spotfire for iPad App.

## Research Notes

### Titanic: Machine Learning from Disaster

### Competition Details

Predict survival on the Titanic (using Excel, Python, R, and Random Forests)

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This Kaggle "Getting Started" Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning. The data is highly structured, and we provide tutorials of increasing complexity for using Excel, Python, pandas in Python, and a Random Forest in Python (see links in the sidebar). We also have links to tutorials using R instead. Please use the forums freely and as much as you like. There is no such thing as a stupid question; we guarantee someone else will be wondering the same thing!

### Make a submission

https://www.kaggle.com/account/set-username

Note: You must specify a username

Welcome to your new home URL: www.kaggle.com/brandniemann

### Evaluation

https://www.kaggle.com/c/titanic/details/evaluation

The historical data has been split into two groups, a 'training set' and a 'test set'. For the training set, we provide the outcome ( 'ground truth' ) for each passenger. You will use this set to build your model to generate predictions for the test set.

For each passenger in the test set, you must predict whether or not they survived the sinking ( 0 for deceased, 1 for survived ). Your score is the percentage of passengers you correctly predict.
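
The scoring rule above is plain classification accuracy. As a minimal sketch, the code below scores a simple gender-based rule (predict that females survived and males did not) against known outcomes. The passenger records are hypothetical, the function names `gender_baseline` and `accuracy` are made up for this example, and Kaggle's own scoring code is not shown in these notes.

```python
def gender_baseline(passenger):
    """Predict 1 (survived) for females and 0 (deceased) for males."""
    return 1 if passenger["Sex"] == "female" else 0


def accuracy(predictions, truth):
    """Fraction of passengers whose predicted outcome matches the truth."""
    if not truth or len(predictions) != len(truth):
        raise ValueError("predictions and truth must be non-empty and equal in length")
    correct = sum(p == t for p, t in zip(predictions, truth))
    return correct / len(truth)


# Hypothetical labelled passengers; NOT rows from the Kaggle files.
passengers = [
    {"PassengerId": 1, "Sex": "male", "Survived": 0},
    {"PassengerId": 2, "Sex": "female", "Survived": 1},
    {"PassengerId": 3, "Sex": "female", "Survived": 0},
    {"PassengerId": 4, "Sex": "male", "Survived": 0},
    {"PassengerId": 5, "Sex": "male", "Survived": 1},
]

predictions = [gender_baseline(p) for p in passengers]
truth = [p["Survived"] for p in passengers]
score = accuracy(predictions, truth)
print(f"Score: {score:.0%}")
```

In the competition the true outcomes are known only for the training set, so a rule like this would be checked there before its predictions for the test set are submitted.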

The Kaggle leaderboard has a public and private component. 50% of your predictions for the test set have been randomly assigned to the public leaderboard ( the same 50% for all users ). Your score on this public portion is what will appear on the leaderboard. At the end of the contest, we will reveal your score on the private 50% of the data, which will determine the final winner. This method prevents users from 'overfitting' to the leaderboard.
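
The public/private split can be sketched in the same spirit. The text says only that a fixed random 50% of the test set feeds the public leaderboard, so everything else here is an assumption for illustration: the seed, the toy passenger ids and outcomes, and the function names `split_public_private` and `score_subset`.

```python
import random


def split_public_private(passenger_ids, seed=0):
    """Assign half of the test-set ids to a public group and the rest to a private one.

    Sorting first and using a fixed seed makes the assignment identical
    on every call, mirroring "the same 50% for all users".
    """
    ids = sorted(passenger_ids)
    rng = random.Random(seed)
    rng.shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


def score_subset(predictions, truth, subset):
    """Accuracy computed only over the passenger ids in `subset`."""
    correct = sum(predictions[i] == truth[i] for i in subset)
    return correct / len(subset)


# Hypothetical test-set outcomes and one user's predictions, keyed by passenger id.
truth = {i: i % 2 for i in range(1, 11)}
predictions = {i: 1 if i <= 6 else 0 for i in range(1, 11)}

public_ids, private_ids = split_public_private(truth.keys())
public_score = score_subset(predictions, truth, public_ids)
private_score = score_subset(predictions, truth, private_ids)
print(f"public: {public_score:.0%}, private: {private_score:.0%}")
```

Because the private half is never scored during the contest, repeatedly tuning a model against the public score cannot leak information about the final ranking.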

### Competition Rules

### Prizes

https://www.kaggle.com/c/titanic/details/prizes

The confidence to go forward and compete for some serious $$$$$$

### Frequently Asked Questions

https://www.kaggle.com/c/titanic/details/frequently-asked-questions

### Further Reading / Watching

https://www.kaggle.com/c/titanic/details/further-reading-watching

Want more? Here are some links to some tutorial pages that might interest you!

### Getting Started With Excel

Getting Started with Excel: Kaggle's Titanic Competition

https://www.kaggle.com/c/titanic/details/getting-started-with-excel

For those who are not experienced with handling large data sets, logging into the Kaggle website for the first time may be slightly daunting. Many of these competitions have a six figure prize and data which can, at times, be extremely involved. Here at Kaggle, we understand that this may seem like an insurmountable barrier to entry, so we have created a "getting started" competition to guide you through the initial steps required to get your first decent submission on the board.

## statistics.com

Source: http://www.statistics.com/course-catalog/

the source for statistics education

We offer 110+ courses for novices, experts, and those in between. Most courses are 4 weeks long and put you in direct contact with leading experts and authors. Whether you want to learn the fundamentals, polish your skills, master new methods or tackle new cutting-edge topics, we have a course for you.

On the left are topics in Data Science (predictive modeling, text mining, programming, the challenges of "Big Data" and unstructured data).

On the right are topics in Statistics, as used in research and data analysis (biostatistics, inference, statistical models, Bayesian techniques, study design).

Need advice on which course to take? Contact us with your goals and background, and one of our instructors will provide some suggestions.

*Evaluated and recommended for college credit by ACE CREDIT®

### Data Science

#### Data Analytics

#### IT/Programming

#### Spatial Analytics

#### Operations Research and Risk

### Statistics

#### Stats for Credit

- Biostatistics for Credit *
- Categorical Data *
- Cluster Analysis *
- Financial Risk *
- Forecasting Analytics *
- Hadoop *
- Intro Stats for Credit *
- Natural Lang. Processing (NLP) *
- Optimization - Advanced *
- Optimization - Intro *
- Predictive Analytics 1 *
- Predictive Analytics 2 *
- Predictive Analytics 3 *
- Python *
- R Programming Interm *
- R Programming Intro 1 *
- R Programming Intro 2 *
- Regression *
- Risk Simulation and Queuing *
- SQL *
- Sentiment Analysis *
- Social Network Analysis *
- Spatial Statistics - GIS *
- Survival Analysis *
- Text Mining *
- Visualization *
