Data Science for


Janet Dobbins, Vice President of Marketing and Communications, attended one of our Meetups. She invited me to lunch with Peter Bruce, the founder and president of

I said: Thank you for an excellent lunch. Did we identify any collaborative opportunities? Do you want me to look at your training courses and suggest something?

Janet replied: You are such a wealth of knowledge and experience!  I would love to collaborate with you, admittedly, I am not exactly sure how. Do you have any ideas?

I replied: I think I would get some ideas from looking at your courses and seeing where I might be able to promote/supplement them to our Meetup members. What do you think about that idea for starters?

Janet replied: That would be wonderful.  Here's a link to the catalog:
Would you mind if I added you to our mailing list for reminders of upcoming courses?

I replied: Thank you. I gave it a quick look and the most interesting thing I found was: There is nothing "massively open" about our courses

You have probably heard of "massively open online courses" -- "MOOC's."   That's not us.  We focus on the human element in the education process.  Our instructors are world-renowned experts who are available to your employees throughout each course to answer questions and lead discussions.  Our administrators and teaching assistants pay close attention to all the students in every course -- we guide them through the registration process, check on them if they are not participating, and provide help with questions on course work and homework assignments.  We offer a level of personal service that cannot be matched by other online courses with thousands of students.

Most Data Science training I am familiar with, including mine, is done with MOOCs.
The most popular Data Science Training that I am aware of is:

Please add me to your mailing list.

I further replied: I just analyzed the Kaggle Titanic: Machine Learning from Disaster Competition:
for our upcoming Meetup and it reminded me that I probably could show how most, if not all, of your courses could be applied to something like this and present it at the Meetup.

Janet replied: That would be terrific!  Thank you so much!

So I data mined the Data Catalog and browsed the web pages to get a general idea of what they offer and my findings were:

  • "the source for statistics education" that has added data science education;
  • 110+ courses for novices, experts, and those in between. Most courses are 4 weeks long and put you in direct contact with leading experts and authors;
  • topics in Data Science (predictive modeling, text mining, programming, the challenges of "Big Data" and unstructured data);
  • topics in Statistics, as used in research and data analysis (biostatistics, inference, statistical models, Bayesian techniques, study design);
  • Need advice on which course to take? Contact us with your goals and background, and one of our instructors will provide some suggestions.

The business model is to offer many online courses to a worldwide audience (the map display on their home page is impressive, but not a good example of data science), so that they and their instructors make money.

This is in marked contrast to a MOOC (/muːk/): a massive open online course which is an online course aimed at unlimited participation and open access via the web. Many MOOCs provide interactive user forums to support community interactions between students, professors, and teaching assistants (TAs). Their principal characteristics are:

  • Focus on Scalability
  • Focus on Community and Connections
  • No registration and Free of charge
  • Self paced Learning community with Real-time interaction using Open content

Data Science has a data mining standard, CRISP (CRISP-DM, the Cross-Industry Standard Process for Data Mining), that is used in a George Mason University Data Science graduate class: Data Mining for the Masses. That material follows the CRISP standard and provides data sets and hands-on exercises, which I have repurposed into a MOOC for the Federal Big Data Working Group Meetup.

So what if one could cover most, or even all, of the topics and methods in the 110 courses of the Data Catalog by doing the following:

  • Select one or more interesting data sets (e.g. Kaggle Titanic: Machine Learning from Disaster Competition)
  • Select a free trial comprehensive Data Science and Statistics Platform Tool (e.g. TIBCO Spotfire)
  • Produce a Data Science Product that tells a story that could be published, presented, used in a job interview, used in teaching, etc.
  • Produce more Data Science Products that use interesting data and tools to learn any of the remaining topics and methods in the 110 courses of the Data Catalog, and publish them as a MOOC!

So I am going to do one example, map it to the 110 courses in the Data Catalog, present it to the Federal Big Data Working Group Meetup, and see whether any members would like to crowdsource this to develop the MOOC.

The results are shown in the Slides (screen captures) of the Spotfire Dashboard below.

The initial mapping is as follows:

  • Data Science:
    • Data Mining and Prediction:
    • Data Analytics (Spotfire has)
    • Text Analytics (Semantic Insights does this)
    • Using R (Spotfire has)
    • IT/Programming (Spotfire has)
    • Spatial Analytics (Spotfire has)
    • Operations Research and Risk (Spotfire has)
  • Statistics:
    • Stats for Credit (Spotfire has)
    • Introductory (Spotfire has)
    • Review/Prep (Spotfire has)
    • Statistical Modeling (Spotfire has)
    • Biostatistics (Spotfire has)
    • Bayesian (Spotfire has)
    • Methods (Spotfire has)
    • Clinical Trials (Spotfire has special version)
    • Engineering (Spotfire has)
    • Environmental (Spotfire did this for me at the US EPA)
    • Social Science (NodeXL does most of these)
    • Survey Statistics (Spotfire has)
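
The mapping above can also be held as data, which makes it easy to tally how many course categories each tool covers and which ones still lack a tool. The following is only a sketch: the triples are transcribed from the list, while the names MAPPING, coverage_by_tool, and unmapped are hypothetical helpers introduced here for illustration.

```python
from collections import Counter

# (area, course category, tool) triples transcribed from the list above.
# None marks the one category that the list leaves without a tool.
MAPPING = [
    ("Data Science", "Data Mining and Prediction", None),
    ("Data Science", "Data Analytics", "Spotfire"),
    ("Data Science", "Text Analytics", "Semantic Insights"),
    ("Data Science", "Using R", "Spotfire"),
    ("Data Science", "IT/Programming", "Spotfire"),
    ("Data Science", "Spatial Analytics", "Spotfire"),
    ("Data Science", "Operations Research and Risk", "Spotfire"),
    ("Statistics", "Stats for Credit", "Spotfire"),
    ("Statistics", "Introductory", "Spotfire"),
    ("Statistics", "Review/Prep", "Spotfire"),
    ("Statistics", "Statistical Modeling", "Spotfire"),
    ("Statistics", "Biostatistics", "Spotfire"),
    ("Statistics", "Bayesian", "Spotfire"),
    ("Statistics", "Methods", "Spotfire"),
    ("Statistics", "Clinical Trials", "Spotfire"),  # special version
    ("Statistics", "Engineering", "Spotfire"),
    ("Statistics", "Environmental", "Spotfire"),
    ("Statistics", "Social Science", "NodeXL"),
    ("Statistics", "Survey Statistics", "Spotfire"),
]


def coverage_by_tool(mapping):
    """Count how many course categories each tool is mapped to."""
    return Counter(tool for _, _, tool in mapping if tool is not None)


def unmapped(mapping):
    """List the course categories that have no tool assigned yet."""
    return [category for _, category, tool in mapping if tool is None]


coverage = coverage_by_tool(MAPPING)
still_open = unmapped(MAPPING)
```

Keeping the mapping in one table like this would let Meetup members extend it course by course as the crowdsourced version grows.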

To refine and extend this, I would need to see the actual data sets and hands-on exercises for the 110 courses in the Data Catalog. My previous analysis of SAS Public Data Sets for Training found them to be of limited use in the real, practical world.

My conclusion and recommendation is to get an interesting data set with a well-defined purpose (a question to be answered) and a comprehensive tool like Spotfire, and then learn Data Science and Statistics by doing exploratory data analysis and studying examples like the one provided below.




Slide 1 Data Science for

Semantic Community

Data Science

Data Science for


Slide 4 Chronology 3


Slide 5 Data Science Process 1


Slide 6 Data Science Process 2



Slide 8 Data Science Process 4


Slide 9 The initial mapping is as follows:


Slide 10 Conclusions and Recommendations

SAS Public Data Sets


Slide 11 Data Science for MindTouch Knowledge Base



Slide 12 TIBCO Spotfire User’s Guide Knowledge Base


Slide 13 TIBCO Spotfire Titanic Competition: Cover Page

Research Notes


Slide 14 TIBCO Spotfire Titanic Competition: Train Data Set


Slide 15 TIBCO Spotfire Titanic Competition: Train Model Relationships


Slide 16 TIBCO Spotfire Titanic Competition: Test Data Set


Slide 17 TIBCO Spotfire Titanic Competition: Test Model Relationships


Slide 18 TIBCO Spotfire Titanic Competition: Gender Data Sets Exploratory Data Analysis


Slide 19 Results 1


Slide 20 Results 2


Spotfire Dashboard


Research Notes

Titanic: Machine Learning from Disaster

Competition Details

Predict survival on the Titanic (using Excel, Python, R, and Random Forests)

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

This Kaggle "Getting Started" Competition provides an ideal starting place for people who may not have a lot of experience in data science and machine learning. The data is highly structured, and we provide tutorials of increasing complexity for using Excel, Python, pandas in Python, and a Random Forest in Python (see links in the sidebar). We also have links to tutorials using R instead. Please use the forums freely and as much as you like. There is no such thing as a stupid question; we guarantee someone else will be wondering the same thing!

Get the Data

Note: You must register

Make a submission

Note: You must specify a username
Welcome to your new home URL:


The historical data has been split into two groups, a 'training set' and a 'test set'.  For the training set, we provide the outcome ( 'ground truth' ) for each passenger.  You will use this set to build your model to generate predictions for the test set.
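
As a rough illustration of building a model on the training set and applying it to the test set, here is a minimal sketch of a group-majority baseline: it predicts, for each test passenger, the most common outcome among training passengers of the same sex. This is not Kaggle's tutorial code. The rows are made-up toy data, the column names Sex and Survived are assumed to match the competition files, and fit_group_majority and predict are hypothetical helper names.

```python
from collections import defaultdict


def fit_group_majority(train_rows, feature="Sex", target="Survived"):
    """Learn the majority outcome (0 or 1) for each value of `feature`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [deceased, survived]
    for row in train_rows:
        counts[row[feature]][row[target]] += 1
    # A tie falls back to 0 (deceased).
    return {value: int(c[1] > c[0]) for value, c in counts.items()}


def predict(model, test_rows, feature="Sex", default=0):
    """Predict an outcome for each test row; unseen values get `default`."""
    return [model.get(row[feature], default) for row in test_rows]


# Toy training set with known outcomes (hypothetical, not the real data).
train = [
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 1},
    {"Sex": "female", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 0},
    {"Sex": "male", "Survived": 1},
]
# Toy test set: the outcome column is absent and must be predicted.
test = [{"Sex": "male"}, {"Sex": "female"}, {"Sex": "female"}]

model = fit_group_majority(train)
predictions = predict(model, test)
```

A baseline this simple is mainly useful as a reference point: any more elaborate model should at least beat it.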

For each passenger in the test set, you must predict whether or not they survived the sinking ( 0 for deceased, 1 for survived ).  Your score is the percentage of passengers you correctly predict.
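
The scoring rule described above amounts to plain accuracy. A minimal sketch follows, assuming predictions and true outcomes are given as equal-length lists of 0s and 1s; the function name accuracy and the five sample outcomes are hypothetical.

```python
def accuracy(predicted, actual):
    """Return the percentage of passengers predicted correctly."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one passenger to score")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


# Hypothetical outcomes for five passengers (0 = deceased, 1 = survived).
actual = [0, 1, 1, 0, 0]
predicted = [0, 1, 0, 0, 1]
score = accuracy(predicted, actual)  # three of five correct
```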

The Kaggle leaderboard has a public and private component.  50% of your predictions for the test set have been randomly assigned to the public leaderboard ( the same 50% for all users ).  Your score on this public portion is what will appear on the leaderboard.  At the end of the contest, we will reveal your score on the private 50% of the data, which will determine the final winner.  This method prevents users from 'overfitting' to the leaderboard.
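
To make the public/private idea concrete, the sketch below splits test passengers into two fixed halves and scores a submission on each half separately. How Kaggle actually draws its split is not described here, so the fixed seed, the function names split_leaderboard and percent_correct, and the ten-passenger toy data are all assumptions for illustration.

```python
import random


def split_leaderboard(passenger_ids, seed=2012):
    """Randomly assign half the ids to a public set and the rest to a private set.

    A fixed seed gives every user the same split (the seed value is arbitrary).
    """
    ids = sorted(passenger_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])


def percent_correct(predictions, truth, subset):
    """Score a submission on one subset of passenger ids."""
    correct = sum(1 for pid in subset if predictions[pid] == truth[pid])
    return 100.0 * correct / len(subset)


# Hypothetical ground truth and a naive "everyone died" submission.
truth = {pid: pid % 2 for pid in range(1, 11)}
predictions = {pid: 0 for pid in truth}

public_ids, private_ids = split_leaderboard(truth)
public_score = percent_correct(predictions, truth, public_ids)    # shown during the contest
private_score = percent_correct(predictions, truth, private_ids)  # revealed at the end
```

Because the private half is never used for feedback, tuning a model until the public score stops improving says nothing certain about the private score, which is the overfitting risk the text mentions.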


The confidence to go forward and compete for some serious $$$$$$

Further Reading / Watching

Want more? Here are some links to some tutorial pages that might interest you!

Getting Started With Excel

Getting Started with Excel: Kaggle's Titanic Competition

For those who are not experienced with handling large data sets, logging into the Kaggle website for the first time may be slightly daunting. Many of these competitions have a six figure prize and data which can, at times, be extremely involved. Here at Kaggle, we understand that this may seem like an insurmountable barrier to entry, so we have created a "getting started" competition to guide you through the initial steps required to get your first decent submission on the board.


the source for statistics education

We offer 110+ courses for novices, experts, and those in between. Most courses are 4 weeks long and put you in direct contact with leading experts and authors. Whether you want to learn the fundamentals, polish your skills, master new methods or tackle new cutting-edge topics, we have a course for you.

On the left are topics in Data Science (predictive modeling, text mining, programming, the challenges of "Big Data" and unstructured data).

On the right are topics in Statistics, as used in research and data analysis (biostatistics, inference, statistical models, Bayesian techniques, study design).

Need advice on which course to take? Contact us with your goals and background, and one of our instructors will provide some suggestions.

*Evaluated and recommended for college credit by ACE CREDIT®


