Table of contents
  1. Story
  2. Slides
    1. NITRD
      1. NITRD and the NCO
      2. Member Agencies
      3. NITRD PCAs
      4. NITRD SSGs
      5. FY 2012 Budget Estimates
      6. NITRD Strategic 2012 Plan
      7. The Future is Bright
    2. Accelerating Innovation in Big Data: From Data to Knowledge to Action
      1. The Promise of Big Data
        1. Era of Data and Information
        2. Era of "Big Data" in Healthcare
        3. Conceptualizing Big Data for Science
        4. From Hypothesis-driven to Data-driven Discovery
        5. Smart Sensing, Reasoning and Decision
        6. What is possible?
      2. What’s Different? Why Now?
        1. Why Now?
        2. Three Shifts in the Way We Analyze Information
      3. Federal Big Data R&D Initiative
        1. Federal Big Data R&D Initiative
        2. NSF Framework for Investments
        3. NSF 14-543 Due June 9, 2014
        4. Building a Big Data R&D Pipeline
        5. Complex Policy Setting
        6. Public Access: Memorandum
        7. Public Access: Implementation Plans
        8. Data to Knowledge to Action
        9. Materials Genome Initiative
        10. Cognitive Science and Neuroscience
      4. Some Success Stories…
        1. Enabling the Patient, Curing Diseases, and Saving Lives
          1. Data at the Forefront of Diagnosis
        2. Transformative Implications for Commerce
          1. UPS Uses Data Analytics to Deliver Faster and More Sustainability
          2. Helping Small Businesses
        3. Big Data for a Sustainable Future
          1. Bringing NASA Data Down to Earth
        4. Accelerating the Pace of Discovery in Science and Engineering
          1. Extracting Knowledge from Data
          2. Social Media and Big Data
      5. Barriers, Challenges and Opportunities
          1. Barriers, Challenges...
        1. Big Data and Security
          1. Evolution of Cyber Threats
        2. Big Data and Privacy
          1. Big Data and Privacy
        3. Predictive Analytics
          1. The Power and Dangers of Predictive Analytics
          2. Dangers: Predictions Based on Correlations to Make Causal Decisions
        4. Education and Workforce Development
          1. NSF Research Traineeship (NRT)
          2. Imagine a Day...
    3. Credits
  3. Spotfire Dashboard
  4. Slides
    1. Slide 1 Baseball Salaries for 2013
    2. Slide 2 2013 MLB Standard Pitching
    3. Slide 3 2013 Pitcher Statistics
    4. Slide 4 Data Ecosystem
  5. Research Notes
  6. STM Innovations Seminar U.S. 2014
    1. Program
    2. Results for #stminnovations2014
  7. Professional Baseball Pitchers’ Performance and its Effect on Salary
    1. Abstract
    2. 1 Introduction
    3. 2 Methods
      1. 2.1 Sampling
      2. 2.2 Software Used
    4. 3 Results
      1. Figure 1: Model candidates and their BIC Values
      2. Table 1: Predictor summary statistics
      3. Table 2: Coefficients for the regression of the model given in (4)
      4. Figure 2: Plot of the regression
      5. Figure 3: Residual analysis for (4)
      6. Table 3: Coefficients for the regression of the model given in (4) with minimum wage players excluded
      7. Figure 4: Plot of the regression with minimum wage players excluded
      8. Figure 5: Residual analysis after removing minimum wage players
    5. 4 Conclusion
    6. References
      1. [1]
      2. [2]
      3. [3]
      4. [4]
      5. [5]
      6. [6]
      7. [7]
      8. [8]
    7. 5 Code
  8. NEXT

NSF Big Data Publications

Story

NSF Big Data Publications

Dr. George O. Strawn, NITRD Co-chair and NCO Director, and Dr. Farnam Jahanian, Assistant Director, National Science Foundation, have the NITRD Big Data Senior Steering Group and the Federal Big Data R&D Initiative in common. Their recent presentations at the 2014 Ontology Summit are shown below.

Dr. Jahanian said: "Implementation plans for public access (to scientific research data) could vary by discipline, and new business models for universities, libraries, publishers, and scholarly and professional societies could emerge."

At the summit, they asked me about digital objects and meetups, respectively. First, what I told Dr. Jahanian about our meetups:

Our next Meetups are May 6th and 20th, 6:30-9 p.m. at Excelerate Solutions, 8405 Greensboro Dr., Suite 930, McLean, VA 22102.

Brief History: I founded this meetup because of my experience working with George Strawn (and Susi Iacona and Wendy Wigen) on a presentation to their Big Data Senior Steering Group and my work with Congressional staff on the Data Act.

I worked for many years while a government employee on Federal CIO Council activities under George’s direction and most recently on what he thought was “the killer semantic web application for the government” called Semantic Medline. We now have Semantic Medline running on the new YarcData Graph Appliance and it was the demonstration for our kickoff Meetup in January.

George and NIH (Phil Bourne and Peter Lyster) hosted the March Meetup at NSF on Joint NSF-NIH Biomedical Big Data Research: Euretos BRAIN with Barend Mons.

Our Meetup mission statement is:

  • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
  • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
  • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products; and
  • Meetup: The world's largest network of local groups, which aims to revitalize local communities and help people around the world self-organize, much like the MOOCs (Massive Open Online Courses) being considered by the White House.

Our framework for working with CODATA is:

  • Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data"
    • Example: Federal Big Data Working Group Meetups
  • Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
    • Example: Semantic Community Data Science Knowledge Base and Big Data Science for CODATA.
  • Mine prominent scientific journals for data policy, data bases, and data results that can be reused.
    • Example: CODATA Data Science Journal and International Journal of Digital Earth
  • Provide data stories and presentation materials for public education and conferences
    • Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing

Second, what I told Dr. Strawn about digital objects:

Re your statement: "Existing identity management systems (e.g., VIVO), each of which uses its own version of unique identifiers as part of the process by which identity is established, may be accommodated within the Digital Object Architecture (DOA)."

I wrote a recent story on VIVO's Evolution from Semantic Web Application to DuraSpace and then submitted an abstract to the upcoming VIVO 2014 Conference. The abstract notes that DuraSpace is doing essentially the same thing as Semantic Community: trying to make all of the VIVO and IVMOOC "research objects" well-defined digital objects.

In addition, I prepared NIH Data Publication 1 for my upcoming discussion with Dr. Phil Bourne and I am thinking about doing NSF Data Publication 1 using your and Dr. Jahanian's slides from the 2014 Ontology Summit as the basis. END OF DR. STRAWN COMMUNICATION

Now I have already done a number of pilot data publications with NITRD, NSF, CODATA, EarthCube, and VIVO content, and could mine for some more content. However, the recent STM Annual Event on Bright Research, Smart Articles and the new Author Ego-System gave me a new perspective. STM is the International Association of Scientific, Technical & Medical Publishers: The Voice of Academic and Professional Publishing. STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees an insight into the hottest innovations and vital technological trends and developments which will define STM publishing for years to come.

The Plenary Session on The Smart Article: The role of Publishers in Semantic Annotation and Knowledge Representation and Searching for Data and Finding New Science description really interested me:

Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become; Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference?  Will human analysis be eliminated and, if so, up to what point….where are the new opportunities for publishers. Come and listen to two experts in data mining and actionable articles, both well known from FORCE11.

I mined the Tweets and selected the following:

  • Common thread running through talks: metadata is extremely valuable, more valuable than the data/article/research itself
  • L Hunter: "With enough data you don't need semantic search. You can just use statistics."
  • Publishers could (and should?) deliver knowledge representation thru broad knowledge, multiple information sources, etc
  • L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing
  • STM Tech trend 2: return of the author, Scholarly Author Ego System
  • Kevin Boyak notes that books are cited more than journal articles
  • Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences
  • Content vs context in data analysis
  • Kevin Boyack up now on the mapping of science and the analytics of publishing
  • "Looking for Data: Finding New Science": http://t.co/eok3ma37vO
  • Tech trend 3: new players changing the game. see http://ow.ly/3jPdvY
  • Tech trend 2: the return to the author
  • Tech trend 1: the machine is the new reader. Highlights from the Future Lab team
  • A baseball metrics talk to open. With perfect timing, the latest submission to the @writelatex gallery is an article on baseball!: https://www.writelatex.com/articles/...ect-on-salary/
  • How publishers can be worth the money? Also, joined a thought provoking “Future Lab Committee” this morning

This made me recall that I had sent knowledge representations on Graph Databases, Semantic Search, Data Science for VIVO, etc. to Dr. Strawn, which he really liked, and I realized that this is how senior science managers and I like information distilled and presented. In fact, I have a Data Science Knowledge Base that contains a number of Data Science books, especially for teaching Data Science. So I need to do more. The state-of-the-art wiki (MindTouch) I use supports book publishing: I can export one or more pages to PDF and MySQL formats.

The STM Annual Event Opening Keynotes: Analytics and Metrics used one of the innovators, writeLaTeX, as an example of advanced authoring: Professional Baseball Pitchers' Performance and its Effect on Salary. I have decided to use this as an example of the advanced scientific data analysis and publishing tools I use, namely MindTouch, Excel, and Spotfire. ADD LINK AND SCREEN CAPTURES BELOW This example supports STEM education at the high school and college level. The data came from [2], [3], [4], [5], and [6], but the Excel file was not provided, so I created a spreadsheet. In addition, one cannot interact with the data in their publication, and their publication is not reusable, as mine is here.

Stephen Wolfram's recent Blog: Something Very Big Is Coming: Our Most Important Technology Project Yet said:

Computational knowledge. Symbolic programming. Algorithm automation. Dynamic interactivity. Natural language. Computable documents. The cloud. Connected devices. Symbolic ontology. Algorithm discovery. These are all things we’ve been energetically working on—mostly for years—in the context of Wolfram|Alpha, Mathematica, CDF and so on.

There’ll be the Wolfram Data Science Platform, that allows one to connect to all sorts of data sources, then use the kind of automation seen in Wolfram|Alpha Pro, then pick out and modify Wolfram Language programs to do data science—and then use CDF to set up reports to generate automatically, on a schedule, through an API, or whatever.

There’ll be the Wolfram Publishing Platform that lets you create documents, then insert interactive elements using the Wolfram Language and its free-form linguistics—and then deploy the documents, on the web using technologies like CloudCDF, that instantly support interactivity in any web browser, or on mobile using the Wolfram Cloud App.

We’ve also been building the Wolfram Course Authoring Platform, that does major automation of the process of going from a script to all the elements of an online course—then lets one deploy the course in the cloud, so that students can have immediate access to a Wolfram Language sandbox, to be able to explore the material in the course, do exercises, and so on.

I am anxious to see what comes out, but glad to know I have a way to do advanced big data publications for NSF and other agencies and organizations now!

MORE TO FOLLOW

Slides

NITRD

Slides

GeorgeStrawn04282014Slide1.png

NITRD and the NCO

GeorgeStrawn04282014Slide2.png

Member Agencies

GeorgeStrawn04282014Slide3.png

NITRD PCAs

GeorgeStrawn04282014Slide4.png

NITRD SSGs

GeorgeStrawn04282014Slide5.png

FY 2012 Budget Estimates

GeorgeStrawn04282014Slide6.png

NITRD Strategic 2012 Plan

GeorgeStrawn04282014Slide7.png

The Future is Bright

GeorgeStrawn04282014Slide8.png

Accelerating Innovation in Big Data: From Data to Knowledge to Action

Slides

FarnamJahanian04282014Slide1.png

The Promise of Big Data

Era of Data and Information

FarnamJahanian04282014Slide8.png

Era of "Big Data" in Healthcare

FarnamJahanian04282014Slide13.png

Conceptualizing Big Data for Science

FarnamJahanian04282014Slide15.png

From Hypothesis-driven to Data-driven Discovery

FarnamJahanian04282014Slide16.png

Smart Sensing, Reasoning and Decision

FarnamJahanian04282014Slide17.png

What is possible?

FarnamJahanian04282014Slide19.png

What’s Different? Why Now?

Why Now?

FarnamJahanian04282014Slide21.png

Three Shifts in the Way We Analyze Information

FarnamJahanian04282014Slide23.png

Federal Big Data R&D Initiative

Federal Big Data R&D Initiative

FarnamJahanian04282014Slide26.png

NSF Framework for Investments

FarnamJahanian04282014Slide27.png

NSF 14-543 Due June 9, 2014

FarnamJahanian04282014Slide28.png

Building a Big Data R&D Pipeline

FarnamJahanian04282014Slide30.png

Complex Policy Setting

FarnamJahanian04282014Slide31.png

Public Access: Memorandum

FarnamJahanian04282014Slide32.png

Public Access: Implementation Plans

FarnamJahanian04282014Slide33.png

Data to Knowledge to Action

FarnamJahanian04282014Slide34.png

Materials Genome Initiative

FarnamJahanian04282014Slide35.png

Cognitive Science and Neuroscience

FarnamJahanian04282014Slide36.png

Some Success Stories…

Enabling the Patient, Curing Diseases, and Saving Lives
Data at the Forefront of Diagnosis

FarnamJahanian04282014Slide39.png

Transformative Implications for Commerce
UPS Uses Data Analytics to Deliver Faster and More Sustainability

FarnamJahanian04282014Slide41.png

Helping Small Businesses

FarnamJahanian04282014Slide42.png

Big Data for a Sustainable Future
Bringing NASA Data Down to Earth

FarnamJahanian04282014Slide44.png

Accelerating the Pace of Discovery in Science and Engineering
Extracting Knowledge from Data

FarnamJahanian04282014Slide47.png

Social Media and Big Data

FarnamJahanian04282014Slide48.png

Barriers, Challenges and Opportunities

Barriers, Challenges...

FarnamJahanian04282014Slide50.png

Big Data and Security
Evolution of Cyber Threats

FarnamJahanian04282014Slide54.png

Big Data and Privacy
Big Data and Privacy

FarnamJahanian04282014Slide59.png

Predictive Analytics
The Power and Dangers of Predictive Analytics

FarnamJahanian04282014Slide62.png

Dangers: Predictions Based on Correlations to Make Causal Decisions

FarnamJahanian04282014Slide63.png

Education and Workforce Development
NSF Research Traineeship (NRT)

FarnamJahanian04282014Slide67.png

Imagine a Day...

FarnamJahanian04282014Slide69.png

Credits

Copyrighted material used under Fair Use. If you are the copyright holder and believe your material has been used unfairly, or if you have any suggestions, feedback, or support, please contact: ciseitsupport@nsf.gov.

Except where otherwise indicated, permission is granted to copy, distribute, and/or modify all images in this document under the terms of the GNU Free Documentation license, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation license” at http://commons.wikimedia.org/wiki/Co...tation_License.

The inclusion of a logo does not express or imply the endorsement by NSF of the entities' products, services, or enterprises.

Spotfire Dashboard

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App

Slides

Slide 1 Baseball Salaries for 2013

BaseballSalaries-Spotfire-Baseball Salaries for 2013.png

Slide 2 2013 MLB Standard Pitching

BaseballSalaries-Spotfire-2013 MLB Standard Pitching.png

Slide 3 2013 Pitcher Statistics

BaseballSalaries-Spotfire-2013 Pitcher Statistics.png

Slide 4 Data Ecosystem

BaseballSalaries-Spotfire-Data Ecosystem.png

Research Notes

STM Innovations Seminar U.S. 2014

Source: http://www.stm-assoc.org/events/stm-...inar-u-s-2014/

International Association of Scientific, Technical & Medical Publishers

The Voice of Academic and Professional Publishing

Bright Research, Smart Articles and the new Author Ego-System

STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees an insight into the hottest innovations and vital technological trends and developments which will define STM publishing for years to come.

Don’t miss the Flash Session with a line-up of crazy-5-minute talks that provide the newest of what publishers are preparing to launch.

Program

13:15

Welcome and Introduction by Gerry Grenier, IEEE, chair of the STM Future Lab 

Opening Keynotes: Analytics and Metrics

Moderator: Dave Martinsen, ACS 

Science in Sports,

Prof David W. Smith, Professor of Biological Sciences and baseball researcher and historian, University of Delaware

Mapping of Science and the Analytics of Publishing,

Kevin Boyack, President SciTech Strategies, Inc.

If we say ‘data’ we also automatically say ‘metrics and analytics’. New analytics tools have surged innovative new applications using the increasing abundance of data. In sports the use of new metrics and analytics creates a whole new area of science, as shown by baseball historian and researcher professor David W. Smith (who is also professor of Biological Sciences). In the area of STM publishing, analysis of publication data helps visualize the mapping of science. Pioneer Kevin Boyack of Scitech Strategies shows several real life examples.

14:30

Panel: STM Tech Trends 2014

Moderator Christopher Kenneally (CCC)

The new technology trends for 2014 and beyond, and how they will impact STM publishers, explained by a panel of four from STM’s Future Lab Committee: Andrea Powell (CABI), Howard Ratner (Chorus), Heather Ruland Staines (SIPX), and Dave Martinsen (ACS).

15:10

Break

15:30

Plenary: The Smart Article

Moderator: Eefke Smit, STM

The role of Publishers in Semantic Annotation and Knowledge Representation

Professor Larry Hunter, University of Colorado, Denver, Computational BioScience Program

Searching for Data and Finding New Science

Anita de Waard, Elsevier, Vice-President Research Data Collaborations

Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become; Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference?  Will human analysis be eliminated and, if so, up to what point….where are the new opportunities for publishers. Come and listen to two experts in data mining and actionable articles, both well known from FORCE11.

16:15

Panel: The new Author Ego-System

Moderator: Ann Michael, DeltaThink

Within the STM scholarly ecosystem, publishers must consider and address an evolving set of author expectations. This panel-session will explore four types of innovative author-focused services: Collaborative Authoring Tools, Readership Maximization, Alternative Metrics, and Author Payment Systems. Panelists will briefly describe the underlying author need these services are meant to address before engaging in a lively and participatory discussion with each other and the audience. With Mike Buschman (Plum Analytics), Sarah Tegen (ACS), Jake Kelleher (CCC), Melinda Kenneway (GrowKudos.com).

17:00

Flash Session

Moderator: Terry Hulbert, consultant

Six ultra short talks on new start-ups, new launches, new services that help improve scholarly communication, with Colwiz, Newstex, Digimarc, WriteLatex, Algosharing, CHORUS:

Stuart Cooper, Colwiz

Ed Colleran, Newstex

Burt Slavin, Digimarc

John Hammersley, WriteLatex

Simon Adar, AlgoSharing

Howard Ratner, CHORUS

17:30

Welcome Reception for STM Spring Conference & Innovations Seminar attendees

Results for #stminnovations2014

Source: https://twitter.com/search?f=realtim...s2014&src=typd

  1. Griffiths: Funding-Intellectual property ownership-Access-Validation-Provenance-Compliance, notification, registration
  2. Common thread running through talks: metadata is extremely valuable, more valuable than the data/article/research itself
  3. L Hunter: "With enough data you don't need semantic search. You can just use statistics." Interested to hear more.
  4. Publishers could (and should?) deliver knowledge representation thru broad knowledge, multiple information sources, etc
  5. L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing
  6. Larry Hunter on “How publishers can be worth the money”
  7. STM Tech trend 2: return of the author, Scholarly Author Ego System
  8. Kevin Boyak notes that books are cited more than journal articles - would love to hear more about that
  9. Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences.
  10. Kevin Boyack up now on the mapping of science and the analytics of publishing.
  11. Looking forward to my panel session - representing Kudos - on the author egosystem at
  12. Tech trend 3: new players changing the game. see for more info.
  13. Tech trend 1: the machine is the new reader. Highlights from the Future Lab team at the STM Assoc
  14. A baseball metrics talk to open ... quoting on slide 3... Yeah, this suits me well!
  15. My talk at : How publishers can be worth the money? Also, joined a thought provoking “Future Lab Committee” this morning

Professional Baseball Pitchers’ Performance and its Effect on Salary

Source: https://www.writelatex.com/articles/...dsqqqnrynm.pdf (PDF)

Charles Hills and Marshall Gregory
April 11, 2014

Abstract

In this study we identify factors that affect a Major League Baseball (MLB) pitcher’s salary. We are interested in knowing whether ability is a good indicator of compensation. To test this we created a model to predict the salaries of pitchers in the MLB.

1 Introduction

Money is a major driving factor in professional baseball and a major consideration for team managers looking to make changes to their rosters. Baseball is not a fair game: in most professional sports, teams are limited to a salary cap (e.g., the NFL has a salary cap of $133 million per team [1]). In baseball, however, there are no such limitations; team payrolls are limited only by their owners’ willingness to pay.

These payrolls may be determined by the amount of money generated by ticket sales or by the sale of team paraphernalia and royalties. The MLB sets no required ticket price, so each team can charge as much or as little as it wants for tickets. Popular teams with large fanbases are generally able to charge more for tickets or sell tickets in greater volume than less popular teams. Additionally, team payroll may be correlated with the market size of the team's home city [?].

This leads to major discrepancies in the amount teams are able to pay their players and the caliber of players they are able to recruit. In 2013, the Houston Astros had the lowest payroll in baseball at $26.1 million. The New York Yankees, the highest paying team in the league, paid out a staggering $228.1 million - over eight times as much. Alex Rodriguez, the highest paid player on the Yankees and in the league, earned $28 million in 2013: more than every player on the Astros team combined!

With this in mind, it is clear that low-budget teams (often called "small market teams") should be seeking out players who will play for a lower salary but still perform. By contrast, large market teams should only accept the best players, and will lure them in with exorbitant salaries. We evaluated the salaries of 345 pitchers in 2013 to see if this is true.

2 Methods

2.1 Sampling

We used a sample of 345 pitchers for this study. Pitchers were chosen as they play a crucial role on the team and are relatively easy to compare to one another. We only considered pitchers who have been playing for three consecutive years (i.e. 2011, 2012, and 2013), as the salaries of rookie pitchers cannot be predicted without statistics from prior years. We also removed from consideration pitchers who make less than $700,000; these players’ salaries are dictated by the MLB price floor, not their (not-so-great) performances.

2.2 Software Used

All statistics and plots were done in R. Charts and formatting were done in LaTeX. Data was gathered and arranged in Microsoft Excel.

My Note: The data came from [2], [3], [4], [5], and [6], but the Excel was not provided, so I created a spreadsheet

3 Results

Since we want to know if ability is the driving factor behind a pitcher’s salary, we chose earned run average (ERA) to quantify this. ERA is defined as

ERA = 9 × (earned runs allowed / innings pitched) (1)

where "earned runs" are runs not scored on a fielding error [7]. A lower score signifies a pitcher who allows fewer runs, so a lower ERA is better. Since runs are all that matter at the end of a game, this is a good indicator of a pitcher’s performance. It is also worth noting that this is not a count statistic, since it is an average per nine innings. The only major drawback is that ERA does not take quality of opponents into account, but since every pitcher faces hundreds of opponents across different teams, this isn’t significant.
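As a quick illustration of definition (1), here is a minimal R sketch. The function name `era` and its argument names are illustrative and do not come from the paper's code, and the pitcher in the example is hypothetical.

```r
# Earned run average: earned runs allowed, scaled to a nine-inning game.
# Names are illustrative; they are not taken from the paper's script.
era <- function(earned_runs, innings_pitched) {
  stopifnot(all(innings_pitched > 0))
  9 * earned_runs / innings_pitched
}

# Hypothetical pitcher: 70 earned runs over 180 innings.
era(70, 180)  # 3.5
```

Because the statistic is a rate, a pitcher with twice the innings and twice the earned runs gets the same ERA, which is why the text calls it "not a count statistic".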

We defined our testing hypotheses using ERA:

H0 : Beta1 = 0
H1 : Beta1 ≠ 0 (2)

where Beta1 is the coefficient for ERA predicting salary.

In order to test these hypotheses, we created a model by choosing from 36 predictors, including ERA. We chose from all possible models with 5 or fewer predictors based on their Bayesian information criterion (BIC). This selection method helped us deal with multicollinearity and the computational time required to deal with a large number of potential models. BIC is defined as

BIC = -2 ln(L̂) + k * (ln(n) + ln(2π)), (3)

where n is the number of data points (in our case 345), k is the number of regressors (this penalizes models using many regressors), and L̂ is the maximum value of the likelihood function for the model. Minimizing BIC, we found several good candidates:

Figure 1: Model candidates and their BIC Values

ERA was not chosen in any of the models. Note that models were generated using an exhaustive algorithm (i.e. during each step all models were considered).

CharlesHillsandMarshallGregory04112014Figure1.png
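The exhaustive search described above can be sketched in base R on synthetic data. Everything in this block is hypothetical (the data, the variable names `a` to `d`, and the helper `best_subset_bic`); the paper itself used `regsubsets()` from the leaps package on 36 real predictors. The sketch scores models with R's built-in `BIC()`, whose penalty is k ln(n) with k counting every estimated parameter, so its values need not match the exact form written in (3).

```r
# Exhaustive best-subset search scored by BIC, on synthetic data.
# All names and data here are hypothetical, for illustration only.
set.seed(1)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y <- 2 * X$a - 3 * X$b + rnorm(n)   # only a and b carry signal

best_subset_bic <- function(X, y, max_size) {
  dat <- cbind(X, y = y)
  results <- list()
  for (k in seq_len(max_size)) {
    # Every combination of k predictors is fitted: this is the "exhaustive" part.
    for (vars_k in combn(names(X), k, simplify = FALSE)) {
      fit <- lm(reformulate(vars_k, response = "y"), data = dat)
      results[[length(results) + 1]] <- data.frame(
        model = paste(vars_k, collapse = " + "),
        size = k,
        bic = BIC(fit),
        stringsAsFactors = FALSE
      )
    }
  }
  out <- do.call(rbind, results)
  out[order(out$bic), ]             # lowest BIC first
}

ranking <- best_subset_bic(X, y, max_size = 3)
head(ranking, 3)
```

With 4 candidate predictors and at most 3 per model, the loop fits 14 models. The same brute-force approach with 36 predictors and up to 5 per model would fit several hundred thousand, which is why a dedicated routine such as `regsubsets()` is the practical choice.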

 

Based on this criterion, we chose the predictor variables age (AGE), age squared (AGE2), games started (GS), innings pitched as a starter (IP.S), and saves (SV). In order to test our hypothesis, we also include ERA:

Table 1: Predictor summary statistics

Predictor   Min    Mean    Max    Std. Dev.   Obs.
ERA         1.44   3.75    8.11   1.04        345
AGE         21     29      41     3.71        345
AGE2        441    854.9   961    224.50      345
GS          0      12.4    34     7.856       345
IP.S        0      75.83   236    80.29       345
SV          0      3.518   46     7.86        345

   
Using these variables we predict salary using a regression in the form

ln(SALARY) = Beta0 + Beta1 ERA + Beta2 AGE + Beta3 AGE2 + Beta4 GS + Beta5 IP.S + Beta6 SV (4)

We use ln(SALARY) to deal with heteroscedasticity.

There are no negative salaries, so we expect Beta0 to be positive.

As previously discussed, a low ERA indicates a more skilled pitcher, so we expect Beta1 to be negative.

Since professional athletes tend to get better after their rookie year up until a "peak", and then decline with age, we expect the age predictors will create a concave-down parabola peaking somewhere in the mid to late twenties. Therefore, we expect Beta2 to be positive and Beta3 to be negative.

Valuable pitchers will start more games, so we expect Beta4 to be positive. This will also increase the value of IP.S (along with the stamina required to pitch more innings per game), so we predict Beta5 will be positive.

A pitcher who finishes a game records a save if at least one of three conditions is satisfied:

  • his team is ahead by less than four runs when he enters the game and he pitches for an entire inning
  • he enters the game when the opposing team has the potential to tie the game with the next at-bat
  • he pitches for at least three innings

A pitcher cannot record a win and a save in the same game [7]. Since a good relief pitcher will rack up more saves and have more opportunities to do so, we expect Beta6 to be positive.
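The three conditions listed above can be written as a small predicate. This is a sketch of the rule exactly as worded in this section, not of the official scoring rules; the function and argument names are illustrative, and the caller is assumed to have already checked that the pitcher finished the game and is not the winning pitcher.

```r
# Save conditions as worded in the text (names are illustrative).
# Assumes the pitcher finished the game and did not record the win.
qualifies_for_save <- function(lead_on_entry, innings_pitched,
                               tying_run_next_at_bat) {
  # Condition 1: ahead by fewer than four runs on entry, and a full inning pitched.
  cond_lead <- lead_on_entry > 0 && lead_on_entry < 4 && innings_pitched >= 1
  # Condition 2: the opposing team could tie the game with the next at-bat.
  cond_tying <- isTRUE(tying_run_next_at_bat)
  # Condition 3: at least three innings pitched.
  cond_length <- innings_pitched >= 3
  cond_lead || cond_tying || cond_length
}

qualifies_for_save(2, 1, FALSE)      # TRUE  (condition 1)
qualifies_for_save(1, 1 / 3, TRUE)   # TRUE  (condition 2)
qualifies_for_save(5, 3, FALSE)      # TRUE  (condition 3)
qualifies_for_save(5, 1, FALSE)      # FALSE (no condition met)
```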

Running the regression, we found the following coefficient values:

Table 2: Coefficients for the regression of the model given in (4)

Predictor   Coefficient   Std. Error   t-value   p-value
Intercept   27.254145     1.933914     14.093    < 2e-16
ERA         0.042576      0.028177     1.511     0.13172
AGE         -0.736468     0.131323     -5.608    4.26e-08
AGE2        0.010664      0.002214     4.817     2.21e-06
GS          0.042898      0.038828     1.105     0.270020
IP.S        -0.017646     0.006203     -2.845    0.004715
SV          -0.068462     0.006258     -10.940   < 2e-16

 

Here, Beta3 went against our intuition. Stranger yet, the coefficients on GS and IP.S (Beta4 and Beta5) have opposite signs, although IP.S is directly dependent on GS. This combination of indicators provides an interesting metric that values pitchers who pitch many innings with few starts: these pitchers either have the endurance to pitch deeper into the game or are not pulled early from the game as often.

We also predicted Beta6 incorrectly. This coefficient is likely negative because relief pitchers, who record the saves, earn less on average than starting pitchers.

Figure 2: Plot of the regression

CharlesHillsandMarshallGregory04112014Figure2.png

 

 

The model fits the data well and has an R2 value of 0.64, but there are multiple problems which are illustrated by the residual plots:

Figure 3: Residual analysis for (4)

Note the straight line on the residual plot and the "V shape" on the scale-location plot. Both indicate systematic residuals.

CharlesHillsandMarshallGregory04112014Figure3.png

Both of these problems can be explained by the large group of players seen in Figure 2 at the bottom end of salaries. These players are earning the MLB minimum salary ($500,000) [8]. Since these players’ salaries are dictated by the salary floor and not by skill, we removed them and reran the model:

Table 3: Coefficients for the regression of the model given in (4) with minimum wage players excluded

Predictor   Coefficient   Std. Error   t-value   p-value
Intercept   8.293866      1.784664     4.647     5.51e-06
ERA         -0.051700     0.039049     -1.324    0.186755
AGE         0.394701      0.119750     3.296     0.001126
AGE2        -0.005752     0.001978     -2.907    0.003979
GS          -0.062597     0.033955     -1.844    0.066467
IP.S        0.019358      0.005356     3.614     0.000366
SV          0.047817      0.005014     9.537     < 2e-16
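Because model (4) is quadratic in AGE, the fitted age effect Beta2 AGE + Beta3 AGE2 has a turning point at AGE = -Beta2 / (2 Beta3). The sketch below applies that formula to the AGE and AGE2 coefficients reported in Tables 2 and 3; the helper name `vertex_age` is illustrative and is not part of the paper's code.

```r
# Turning point of b_age * AGE + b_age2 * AGE^2.
# It is a maximum when b_age2 < 0 and a minimum when b_age2 > 0.
vertex_age <- function(b_age, b_age2) -b_age / (2 * b_age2)

# Table 3 (minimum-salary players excluded): concave down, so a peak.
peak_table3 <- vertex_age(0.394701, -0.005752)
round(peak_table3, 1)    # 34.3

# Table 2 (all 345 pitchers): concave up, so a trough rather than a peak.
trough_table2 <- vertex_age(-0.736468, 0.010664)
round(trough_table2, 1)  # 34.5
```

Under the Table 3 fit, the age term is therefore largest at roughly 34, somewhat later than the mid to late twenties anticipated in the discussion of Beta2 and Beta3.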

 

Figure 4: Plot of the regression with minimum wage players excluded

The transformation did not significantly improve our R2 value, but this is a much more valid model:

CharlesHillsandMarshallGregory04112014Figure4.png

Figure 5: Residual analysis after removing minimum wage players

The only potential problem is the slight movement in standardized residuals, but this is not a big issue considering the data. A transformation other than log might fix this.

CharlesHillsandMarshallGregory04112014Figure5.png

We failed to reject H0 at the 0.05 significance level for both data sets.
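The decision for the full data set can be reproduced from the ERA row of Table 2 alone. The sketch below recomputes the t statistic as coefficient / standard error and the two-sided p-value from the t distribution. The residual degrees of freedom are an assumption (345 observations minus the 7 estimated coefficients), and the helper name `coef_test` is illustrative.

```r
# Two-sided t-test of H0: Beta1 = 0 from a reported estimate and standard error.
# df is an assumption: 345 observations minus 7 estimated coefficients.
coef_test <- function(estimate, std_error, df, alpha = 0.05) {
  t_value <- estimate / std_error
  p_value <- 2 * pt(-abs(t_value), df)
  list(t = t_value, p = p_value, reject_h0 = p_value < alpha)
}

# ERA row of Table 2 (all 345 pitchers).
era_full <- coef_test(0.042576, 0.028177, df = 345 - 7)
era_full$p          # close to the 0.13172 reported in Table 2
era_full$reject_h0  # FALSE: H0 is not rejected at the 0.05 level
```

The same check is not run for Table 3 here because the number of pitchers left after removing the minimum-salary players is not stated.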

4 Conclusion

ERA was not chosen as a strong predictor in our BIC predictor selection, and it was not found to be significant in our model. Every other predictor selected is either a count statistic or age. The count statistics are related to the amount a player is chosen to play, instead of directly measuring his ability; while a better player is certainly likely to play more often, there could be an additional effect of coaches trying to "get their money’s worth" out of highly-paid players.

We hypothesized that high ability should yield a high salary, but this is not strictly the case for the data we observed. One explanation for this could be contract restrictions: contracts can sometimes block players from being paid the salary they deserve.

Another variable likely to be significant which we omitted is attendance per game. Since ticket sales generate revenue for teams, a more likable or exciting pitcher may be worth more than a highly skilled one. This might help to explain why age was so significant, since older players have had more time to gather a large fanbase.

We can see that the data does indicate some correlation between salary and performance, but this effect is not as direct as we had expected.

5 Code

# Get predictor variables and salary data
X <- read.csv("predictors.csv")
d <- read.csv("PitcherData.csv")
SALARY <- d$SALARY

# Choose predictor variables: exhaustive search over models with up to 5 predictors
# library(alr3)  # loaded in the original script; nothing below calls it
library(leaps)   # regsubsets()
library(car)     # subsets(), powerTransform()
ss <- regsubsets(as.matrix(X), SALARY, nvmax = 5)
rs <- summary(ss)
subsets(ss, statistic = "bic", legend = FALSE, xlim = c(1, 6))
title("BIC values")

# Create linear model (named model so that it does not mask the lm() function)
attach(X)
model <- lm(SALARY ~ AGE + AGE2 + GS + IP.Start + SV)
sink("lmoutput1.txt")
print(summary(model))
sink()  # stop redirecting output to the file
b <- coef(model)
fit <- b[1] + b[2] * AGE + b[3] * AGE2 + b[4] * GS + b[5] * IP.Start + b[6] * SV

# Plot this model: actual salaries as points, predictions as a red line
par(mar = c(4, 4, 2, 2))
plot(SALARY[order(fit)], ylim = c(0, 25000000), xlab = "Pitcher",
     ylab = "Salary (Millions of Dollars)", axes = FALSE)
box()
axis(2, at = seq(0, 25000000, 5000000), labels = seq(0, 25, 5))
par(new = TRUE)
plot(fit[order(fit)], ylim = c(0, 25000000), col = "red", type = "l", lwd = 2,
     axes = FALSE, ylab = "", xlab = "")
title("Pitcher Salaries vs Predictions")
legend("topleft", legend = c("Actual Salaries", "Predicted Salaries"),
       pch = c(1, NA), lty = c(0, 1), lwd = c(0, 2), col = c("black", "red"))

# Plot residuals
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
plot(model)

# Box-Cox transformation of salary with lambda = 0.25
powerTransform(model)
bcSALARY <- 4 * (SALARY^0.25 - 1)
bclm <- lm(bcSALARY ~ AGE + AGE2 + GS + IP.Start + SV)
sink("lmoutput2.txt")
print(summary(bclm))
sink()
bb <- coef(bclm)
bcfit <- bb[1] + bb[2] * AGE + bb[3] * AGE2 + bb[4] * GS + bb[5] * IP.Start + bb[6] * SV

# Plot bc model
par(mfrow = c(1, 1), mar = c(4, 4, 2, 2))
plot(bcSALARY[order(bcfit)], ylim = c(111, 279), xlab = "Pitcher",
     ylab = "Salary (Transformed Dollars)", axes = FALSE)
box()
axis(2)
par(new = TRUE)
plot(bcfit[order(bcfit)], ylim = c(111, 279), col = "red", type = "l", lwd = 2,
     axes = FALSE, ylab = "", xlab = "")
title("Transformed Pitcher Salaries vs Predictions")
legend("topleft", legend = c("Actual Salaries", "Predicted Salaries"),
       pch = c(1, NA), lty = c(0, 1), lwd = c(0, 2), col = c("black", "red"))

# Plot bc residuals
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
plot(bclm)

NEXT
