Table of contents
- Story
- Slides
- Spotfire Dashboard
- Research Notes
- Edward Tufte
- The 5 Most Influential Data Visualizations of All Time
- Slide 1 The 5 Most Influential Data Visualizations of All Time
- Slide 2 About These Visualizations
- Slide 3 5 London Cholera Map - John Snow 1
- Slide 4 5 London Cholera Map - John Snow 2
- Slide 5 Gapminder - Rosling
- Slide 6 3 March on Moscow - Charles Minard 1
- Slide 7 3 March on Moscow - Charles Minard 2
- Slide 8 2 War Mortality - Florence Nightingale 1
- Slide 9 2 War Mortality - Florence Nightingale 2
- Slide 10 1 Chart of Biography - Joseph Priestley 1
- Slide 11 1 Chart of Biography - Joseph Priestley 2
- Slide 12 1 Chart of Biography - Joseph Priestley 3
- Slide 13 1 Chart of Biography - Joseph Priestley 4
- Slide 14 1 Chart of Biography - Joseph Priestley 5
- Slide 15 John Tukey 1977
- Slide 16 Want To Learn More?
- Slide 17 Learn More
- Expanding to 1 Million Rows!
- Road Safety Data
- Medicare Provider Charge Data: Inpatient
- Related Links
- Stop-and-Frisk Data
- SQF Codebook
Story
Tableau Public Data Sets for DC Data Science
The most recent DC Data Science meetup was held July 30, 6:30 PM to 8:30 PM, at GWU, Funger Hall, Room 103, 2201 G St. NW, Washington, DC. The July Meetup was billed as featuring the work of not one, not two, but nine members of the Data Science community (only eight actually presented) giving lightning talks: highlights of work presented at recent or upcoming academic conferences, demos of tools and techniques that people are excited about, and more, all in bite-sized packages of less than 10 minutes.
The Agenda was:
6:30pm -- Networking, Empanadas, and Refreshments
7:00pm -- Introduction
7:15pm -- Presentations
8:30pm -- Post presentation conversations
8:45pm -- Adjourn for Data Drinks (Tonic, 22nd & G St.)
Before and during the Data Science DC Meetup, several attendees suggested that the presenters provide their slides and data sets (unless classified or confidential) so that others could review and audit their results (see my Research Notes). I summarize the status of this below.
Title | Presenter | Slides | Data | Comments |
Next-Gen Sequencing Data Analysis | Harsha Rajasimha, Jeeva Informatics | NA | NA | Did not present |
Fundamental Graphic Types | Jon Schwabish, CBO | Used | None | Mostly entertainment |
Growth Hacker | Robert Dempsey, Intridea | Used | LinkedIn company data | See Footnotes Below |
Hadoop Summit / GraphLab Review | Kevin Coogan, Amalgamood | Used | None | Excellent summary of two conferences |
GitHub API and Graph Databases | Patrick Wagstrom, IBM Research | Used | Julia download (360 MB executable file) | Interesting network analysis |
Incentivized Sharing in Social Networks | Elena Zheleva, LivingSocial | Used | Confidential | Interesting statistical work |
Random Forests and Credit Scoring | Dhruv Sharma, FDIC | Used | ? | Interesting statistical work |
Analyzing Data Exhaust | Natalie Robb, WaveLength Analytics | Used | Confidential | Interesting data science work |
Consensus Clustering | Tommy Jones, STPI | Used | ? | Interesting statistical work |
The one that stands out is highlighted in bold above and will be analyzed in a future blog.
As a Tableau Public user, I also received an email and a link to a web page on August 2nd announcing:
"We’re excited to announce today that we’ve extended the capabilities of Tableau Public tenfold. You can now connect to files and spreadsheets that contain up to 1 million rows of data.
Additionally, to accommodate these larger data files, your account’s storage space has been increased from 50 megabytes to 1 gigabyte. And there’s no action required on your part – these changes have already gone into effect. We like you that much.
We’ve taken this step because we noticed that many of the publicly available data sources go beyond the previous limit of 100,000 rows, including:
- 2012 UK Road Safety data (145K accidents)
- 2011 U.S. Medicare Inpatient Charge data (170K rows)
- 2011 NYPD “Stop-and-Frisk” data (685K incidents)"
I decided to work with these publicly available data sources to make them analytic-ready for the DC Data Science Community as follows:
Data Sets | Downloads (MB) | Rows | Columns |
2011 NYPD “Stop-and-Frisk” data (685K incidents) | 197 MB Excel CSV | 685,724 | 99 |
2012 UK Road Safety data (145K accidents) | 20 MB Excel CSV; 8 MB Excel CSV; 15 MB Excel CSV | 145,571; 195,723; 265,877 | 32; 14; 21 |
2011 U.S. Medicare Inpatient Charge data (170K rows) | 0.3 MB Excel Worksheet; 13 MB Excel Worksheet | 100 and 5,025; 8 and 13 and 163,065 | 4 and 5; 1 and 2 and 11 |
Totals | 253.3 | 1,456,086 | |
Spotfire File Size | 50.1 | | |
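The row limits quoted in the announcement (100,000 before, 1 million now) can be compared against the row counts in the table. The following is a minimal sketch only: the table labels are mine, and for each download only the larger tables are listed.

```python
# Row limits quoted in the Tableau Public announcement.
OLD_LIMIT = 100_000
NEW_LIMIT = 1_000_000

# Row counts copied from the table above; the labels are illustrative only.
row_counts = {
    "NYPD Stop-and-Frisk 2011": 685_724,
    "UK Road Safety 2012, table 1": 145_571,
    "UK Road Safety 2012, table 2": 195_723,
    "UK Road Safety 2012, table 3": 265_877,
    "Medicare Inpatient 2011, largest sheet": 163_065,
}


def classify(rows):
    """Describe how a table with `rows` rows fares under the two limits."""
    if rows <= OLD_LIMIT:
        return "fit under the old limit"
    if rows <= NEW_LIMIT:
        return "needs the new limit"
    return "exceeds the new limit"


status = {name: classify(rows) for name, rows in row_counts.items()}

for name, rows in row_counts.items():
    print(f"{name}: {rows:,} rows -> {status[name]}")
```

Each of these tables is larger than the old limit and smaller than the new one, which is consistent with the reason Tableau gave for raising it.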
Last month, I worked with a suggested big data set posted at Data Science DC and presented the data science results in my Wiki. I plan to do this as a regular service to the data science community.
The slides below show the results of working in Spotfire with all three of the publicly available data sources cited by Tableau, which together total nearly 1.5 million rows. An example of small, adjacent, dynamically linked visualizations on a single page is shown below.
This shows that the highest number of deaths (42) (Scatter Plot at top) involved two vehicles near the center of England (Map Chart in the middle) at around 6 am (Line Chart on the bottom).
IN PROCESS
Footnotes
Good morning everyone,
Thanks for all of your fantastic questions after the lightning talks. Here are links to everything I mentioned in my talk:
Slides - http://www.slideshare.net/intri...
Surfiki - http://surfiki.com/
Refine (Open Source) - https://github.com/intridea/surf...
Python script for scraping LinkedIn company data (Open Source) - https://github.com/intridea/link...
If you have any questions, give me a call: 888-968-4332 x517 robert.dempsey@intridea.com
Slides
Spotfire Dashboard
For Internet Explorer users and those wanting full-screen display, use: Web Player. Get Spotfire for iPad App.
Research Notes
Upload UK (43 MB) and Medicare (13.3 MB)
Upload Stop-and-Frisk (197 MB)?
Organize Knowledge Base
Download Files and Examine
Data Dictionaries? Codebook Source: http://www.nyclu.org/files/SQF_Codebook.pdf (PDF)
Need Boundary Files -
Import All Files With Maximum Memory Available (Mashup)
Google Search for Definitions for UK Road Safety Data Since UK Data.gov Did Not Provide That
Found: https://www.gov.uk/government/organi...out/statistics
Scroll Down: Statistical series
Found: https://www.gov.uk/government/organi...ety-statistics
Found: See key releases, notes and guidance, the pre-release access list and technical information
https://www.gov.uk/transport-statist...ent-and-safety
Found: STATS19 road accident injury statistics – report form [PDF, 628KB, 4 pages]
https://www.gov.uk/government/upload...23/stats19.pdf
Found: Reported Road Casualties in Great Britain: notes, definitions, symbols and conventions [PDF, 23.4KB]
Edward Tufte
Source: http://www.ft.com/intl/cms/s/2/dda1cb5c-f4c0-11e2-a62e-00144feabdc0.html#axzz2aBV8tVGi
By Clive Cookson, FT science editor
The graphics guru has shown us how to visualise data with simplicity, clarity and elegance
The 5 Most Influential Data Visualizations of All Time
Source: http://www.tableausoftware.com/asset/5-most-influential-visualizations (PDF)
Discover 5 powerful, beautiful visualizations that changed how people thought about the world.
Data visualization allows us all to see and understand our data more deeply. That understanding breeds good decisions.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” – John Tukey, 1977
Without data visualization and data analysis, we are all more prone to misunderstandings and missed opportunities.
This visual presentation will show you 5 powerful, beautiful visualizations that changed how people thought about the world.
Slide 5 Gapminder - Rosling
http://en.wikipedia.org/wiki/Hans_Rosling
https://www.youtube.com/watch?v=hVimVzgtD6w
http://www.gapminder.org/world/#;example=75
Slide 7 3 March on Moscow - Charles Minard 2
http://upload.wikimedia.org/wikipedia/commons/2/29/Minard.png
Expanding to 1 Million Rows!
Sources: Email and Web: http://www.tableausoftware.com/public/blog/2013/08/one-million-rows-2072
From: Tableau Public [mailto:tableau-public@tableausoftware.com]
Sent: Friday, August 02, 2013 9:32 AM
To: bniemann@cox.net
Subject: Expanding to 1 Million Rows!
What else do we have in store for “Data Month”?
In celebration of “Data Month”, the Tableau Public team has a great line-up of useful blog posts for you around the subject of data. Keep an eye on this blog as we go in-depth on the following subjects:
- Online data sources
- Data scraping
- Data shaping
- Best practices for 1M rows
We'll publish a few "vizzes" in the coming weeks that put the new data limit to the test, and you can bet our Viz of the Day page will include some great examples from the community of authors, too.
As always, thanks for stopping by, Ben
Viz of the Day
Road Safety Data
Source: http://data.gov.uk/dataset/road-accidents-safety-data
These files provide detailed road safety data about the circumstances of personal injury road accidents in GB from 2005, the types (including Make and Model) of vehicles involved and the consequential casualties. The statistics relate only to personal injury accidents on public roads that are reported to the police, and subsequently recorded, using the STATS19 accident reporting form.
Also includes:
- Results of breath-test screening data from recently introduced digital breath testing devices, as provided by Police Authorities in England and Wales
- Results of blood alcohol levels (milligrams / 100 millilitres of blood) provided by matching coroners’ data (provided by Coroners in England and Wales and by Procurators Fiscal in Scotland) with fatality data from the STATS19 police data of road accidents in Great Britain. Cases where the blood alcohol level for a fatality is "unknown" are a consequence of an unsuccessful match between the two data sets.
Selected
Road Safety - Accidents 2012 My Note: Excel
Road Safety - Casualties 2012 My Note: Excel
Road Safety - Vehicles 2012 My Note: Excel
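Because the accident, vehicle, and casualty records come as three separate files, they have to be joined before they can be analyzed together. Here is a minimal sketch of such a join. It assumes the three files share an accident identifier column, called Accident_Index here; that column name and the tiny in-memory samples are assumptions for illustration, not taken from the files themselves.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-ins for the three 2012 CSV downloads. The column names,
# including the shared Accident_Index key, are assumptions for illustration.
ACCIDENTS_CSV = """Accident_Index,Number_of_Vehicles
A1,2
A2,1
"""
VEHICLES_CSV = """Accident_Index,Vehicle_Reference
A1,1
A1,2
A2,1
"""
CASUALTIES_CSV = """Accident_Index,Casualty_Reference
A1,1
A2,1
A2,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def group_by_key(rows, key):
    """Group rows by the value of one column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def join_accidents(accidents, vehicles, casualties, key="Accident_Index"):
    """Attach the matching vehicle and casualty rows to each accident."""
    vehicles_by_key = group_by_key(vehicles, key)
    casualties_by_key = group_by_key(casualties, key)
    joined = []
    for accident in accidents:
        record = dict(accident)
        record["vehicles"] = vehicles_by_key.get(accident[key], [])
        record["casualties"] = casualties_by_key.get(accident[key], [])
        joined.append(record)
    return joined


joined = join_accidents(
    read_rows(ACCIDENTS_CSV), read_rows(VEHICLES_CSV), read_rows(CASUALTIES_CSV)
)
for record in joined:
    print(record["Accident_Index"], len(record["vehicles"]), len(record["casualties"]))
```

Grouping the vehicle and casualty rows once, before the loop over accidents, keeps the join to a single pass over each file, which matters at a few hundred thousand rows per file.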
Additional Information
Openness score | |
Theme | Transportation |
Secondary Theme(s) | Health |
Temporal coverage | 1/1/2005 - 31/12/2012 |
Geographic coverage | Great Britain (England, Scotland, Wales) |
Update frequency | annual |
Temporal granularity | day |
Mandate | No value |
Date added computed | No value |
Date updated computed | No value |
Developer Tools
The information on this page (the dataset metadata) is also available in JSON format: /api/2/rest/package/road-accidents-safety-data
Read more about this site's CKAN API: About the API
This dataset has a permanent URI: http://data.gov.uk//dataset/road-accidents-safety-data
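The relative JSON path above can be resolved against the site's host to get a full metadata URL. This sketch assumes the host is data.gov.uk (as in the Source link) and that the endpoint returns a JSON object. The field names in the sample payload are hypothetical, and the fetch function is defined but not called so that the example runs offline.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://data.gov.uk/"
METADATA_PATH = "/api/2/rest/package/road-accidents-safety-data"

# Absolute URL for the dataset metadata, built from the relative path above.
metadata_url = urljoin(BASE_URL, METADATA_PATH)


def fetch_metadata(url=metadata_url, timeout=30):
    """Download and decode the JSON metadata (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(metadata, fields=("title", "update_frequency")):
    """Pick a few fields from a metadata dict, skipping any that are absent."""
    return {name: metadata[name] for name in fields if name in metadata}


# Hypothetical payload, used so the sketch runs without a network call.
sample = json.loads('{"title": "Road Safety Data", "update_frequency": "annual"}')
print(metadata_url)
print(summarize(sample))
```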
Medicare Provider Charge Data: Inpatient
The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.
Hospitals determine what they will charge for items and services provided to patients and these charges are the amount the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount and DRG outlier amount.
For these DRGs, average charges and average Medicare payments are calculated at the individual hospital level. Users will be able to make comparisons between the amount charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.
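The hospital-to-hospital comparison described above can be sketched as follows. Every record below is made up for illustration (DRG labels, hospital names, and dollar amounts alike), and the tuple layout is an assumption rather than the layout of the CMS files.

```python
from collections import defaultdict

# Hypothetical records: (DRG label, hospital, average charge, average payment).
# None of these values come from the CMS files.
records = [
    ("DRG A", "Hospital 1", 20000.0, 5000.0),
    ("DRG A", "Hospital 2", 50000.0, 6000.0),
    ("DRG B", "Hospital 1", 12000.0, 4000.0),
    ("DRG B", "Hospital 2", 18000.0, 4500.0),
]


def charge_spread_by_drg(rows):
    """For each DRG, return (lowest, highest, highest/lowest) average charge."""
    charges = defaultdict(list)
    for drg, _hospital, avg_charge, _avg_payment in rows:
        charges[drg].append(avg_charge)
    return {
        drg: (min(values), max(values), max(values) / min(values))
        for drg, values in charges.items()
    }


def charge_to_payment_ratio(rows):
    """Map (DRG, hospital) to average charge divided by average payment."""
    return {
        (drg, hospital): avg_charge / avg_payment
        for drg, hospital, avg_charge, avg_payment in rows
    }


spread = charge_spread_by_drg(records)
ratios = charge_to_payment_ratio(records)
for drg, (low, high, factor) in spread.items():
    print(f"{drg}: {low:,.0f} to {high:,.0f} ({factor:.1f}x)")
```

Restricting the input rows to hospitals in one local market before calling these functions would give the within-market comparison, while passing all rows gives the nationwide one.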
Data are being made available in Microsoft Excel (.xlsx) format and comma separated values (.csv) format.
Inpatient Charge Data, FY2011, Microsoft Excel version My Note: Excel
Inpatient Charge Data, FY2011, Comma Separated Values (CSV) version
National and State level summaries are also available here:
National and State Summaries of Inpatient Charge Data, FY2011, Microsoft Excel version My Note: Excel
National and State Summaries of Inpatient Charge Data, FY2011, Comma Separated Values (CSV) version
Inquiries regarding this data can be sent to MedicareProviderChargeData@cms.hhs.gov.
Related Links
- Page last Modified: 06/02/2013 8:09 PM
- Help with File Formats and Plug-Ins
Stop-and-Frisk Data
Source: http://www.nyclu.org/content/stop-and-frisk-data
Summary
An analysis by the NYCLU revealed that innocent New Yorkers have been subjected to police stops and street interrogations more than 4 million times since 2002, and that black and Latino communities continue to be the overwhelming target of these tactics. Nearly nine out of 10 stopped-and-frisked New Yorkers have been completely innocent, according to the NYPD’s own reports:
- In 2002, New Yorkers were stopped by the police 97,296 times.
80,176 were totally innocent (82 percent).
- In 2003, New Yorkers were stopped by the police 160,851 times.
140,442 were totally innocent (87 percent).
77,704 were black (54 percent).
44,581 were Latino (31 percent).
17,623 were white (12 percent).
83,499 were aged 14-24 (55 percent).
- In 2004, New Yorkers were stopped by the police 313,523 times.
278,933 were totally innocent (89 percent).
155,033 were black (55 percent).
89,937 were Latino (32 percent).
28,913 were white (10 percent).
152,196 were aged 14-24 (52 percent).
- In 2005, New Yorkers were stopped by the police 398,191 times.
352,348 were totally innocent (89 percent).
196,570 were black (54 percent).
115,088 were Latino (32 percent).
40,713 were white (11 percent).
189,854 were aged 14-24 (51 percent).
- In 2006, New Yorkers were stopped by the police 506,491 times.
457,163 were totally innocent (90 percent).
267,468 were black (53 percent).
147,862 were Latino (29 percent).
53,500 were white (11 percent).
247,691 were aged 14-24 (50 percent).
- In 2007, New Yorkers were stopped by the police 472,096 times.
410,936 were totally innocent (87 percent).
243,766 were black (54 percent).
141,868 were Latino (31 percent).
52,887 were white (12 percent).
223,783 were aged 14-24 (48 percent).
- In 2008, New Yorkers were stopped by the police 540,302 times.
474,387 were totally innocent (88 percent).
275,588 were black (53 percent).
168,475 were Latino (32 percent).
57,650 were white (11 percent).
263,408 were aged 14-24 (49 percent).
- In 2009, New Yorkers were stopped by the police 581,168 times.
510,742 were totally innocent (88 percent).
310,611 were black (55 percent).
180,055 were Latino (32 percent).
53,601 were white (10 percent).
289,602 were aged 14-24 (50 percent).
- In 2010, New Yorkers were stopped by the police 601,285 times.
518,849 were totally innocent (86 percent).
315,083 were black (54 percent).
189,326 were Latino (33 percent).
54,810 were white (9 percent).
295,902 were aged 14-24 (49 percent).
- In 2011, New Yorkers were stopped by the police 685,724 times.
605,328 were totally innocent (88 percent).
350,743 were black (53 percent).
223,740 were Latino (34 percent).
61,805 were white (9 percent).
341,581 were aged 14-24 (51 percent).
- In 2012, New Yorkers were stopped by the police 532,911 times.
473,644 were totally innocent (89 percent).
284,229 were black (55 percent).
165,140 were Latino (32 percent).
50,366 were white (10 percent).
About the Data
Every time a police officer stops a person in NYC, the officer is supposed to fill out a form to record the details of the stop. Officers fill out the forms by hand, and then the forms are entered manually into a database. There are 2 ways the NYPD reports this stop-and-frisk data: a paper report released quarterly and an electronic database released annually.
The paper reports – which the N.Y.C.L.U. releases every three months – include data on stops, arrests, and summonses. The data are broken down by precinct of the stop and race and gender of the person stopped. The paper reports provide a basic snapshot on stop-and-frisk activity by precinct and are available here.
The electronic database includes nearly all of the data recorded by the police officer after a stop. The data include the age of person stopped, if a person was frisked, if there was a weapon or firearm recovered, if physical force was used, and the exact location of the stop within the precinct. Having the electronic database allows researchers to look in greater detail at what happens during a stop. Below are CSV files containing data from the 2011 electronic database.
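The two headline claims in the Summary (more than 4 million stops of innocent people since 2002, and nearly nine out of 10 stops) can be checked by adding up the yearly figures listed there. A minimal sketch, using only the stop totals and "totally innocent" counts quoted in the Summary:

```python
# (year, total stops, stops the NYCLU counts as "totally innocent"),
# copied from the yearly list in the Summary.
yearly = [
    (2002, 97_296, 80_176),
    (2003, 160_851, 140_442),
    (2004, 313_523, 278_933),
    (2005, 398_191, 352_348),
    (2006, 506_491, 457_163),
    (2007, 472_096, 410_936),
    (2008, 540_302, 474_387),
    (2009, 581_168, 510_742),
    (2010, 601_285, 518_849),
    (2011, 685_724, 605_328),
    (2012, 532_911, 473_644),
]

total_stops = sum(stops for _year, stops, _innocent in yearly)
total_innocent = sum(innocent for _year, _stops, innocent in yearly)
innocent_share = total_innocent / total_stops

print(f"Total stops 2002-2012: {total_stops:,}")
print(f"Totally innocent:      {total_innocent:,}")
print(f"Innocent share:        {innocent_share:.1%}")
```

The yearly figures add up to 4,889,838 stops, of which 4,302,948 (about 88 percent) are counted as totally innocent, which agrees with both claims.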
Downloadable Files
Click here to download a compressed (.zip) CSV file of the 2012 database. This file is easily imported into most statistical packages, including the freeware R. It contains 101 variables and 532,911 observations, each of which represents a stop conducted by an NYPD officer. Variables include race, gender and age of the person stopped as well as the location, time and date of the stop.
Click here to download a PDF file of documents and notes that may clarify the dataset. The PDF includes a list and description of variables, a blank stop-and-frisk reporting form (UF-250) and other notes provided by the NYPD.
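After downloading the CSV file, a quick sanity check is to count its observations and variables and compare them with the figures quoted above. A minimal sketch using Python's standard csv module. The three column names in the sample and the filename in the commented-out usage are hypothetical.

```python
import csv
import io


def csv_shape(handle):
    """Return (observations, variables) for CSV data with a header row."""
    reader = csv.reader(handle)
    header = next(reader)
    observations = sum(1 for _row in reader)
    return observations, len(header)


# Hypothetical three-column sample; the real file is described as having
# 101 variables and 532,911 observations.
sample = io.StringIO("year,pct,race\n2012,1,B\n2012,2,W\n")
print(csv_shape(sample))

# For the downloaded file (the filename is an assumption):
# with open("2012.csv", newline="") as handle:
#     print(csv_shape(handle))
```

Counting rows through the reader, rather than loading the whole file into a list, keeps memory use flat even for files with several hundred thousand rows.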
SQF Codebook
Source: http://www.nyclu.org/files/SQF_Codebook.pdf (PDF)
List of Variables
See Spreadsheet
Crime Codes
See Spreadsheet