Story
NSF Big Data Publications
Dr. George O. Strawn, NITRD Co-chair and NCO Director, and Dr. Farnam Jahanian, Assistant Director, National Science Foundation, both work on the NITRD Big Data Senior Steering Group and the Federal Big Data R&D Initiative. Their recent presentations at the 2014 Ontology Summit are shown below.
Dr. Jahanian said: "Implementation plans for public access (to scientific research data) could vary by discipline, and new business models for universities, libraries, publishers, and scholarly and professional societies could emerge."
At the summit, they asked me about digital objects and meetups, respectively. First, what I told Dr. Jahanian about our meetups:
Our next Meetups are May 6th and 20th, 6:30-9 p.m. at Excelerate Solutions, 8405 Greensboro Dr., Suite 930, McLean, VA 22102.
Brief History: I founded this meetup because of my experience working with George Strawn (and Susi Iacona and Wendy Wigen) on a presentation to their Big Data Senior Steering Group and my work with Congressional staff on the Data Act.
I worked for many years while a government employee on Federal CIO Council activities under George’s direction and most recently on what he thought was “the killer semantic web application for the government” called Semantic Medline. We now have Semantic Medline running on the new YarcData Graph Appliance and it was the demonstration for our kickoff Meetup in January.
George and NIH (Phil Bourne and Peter Lyster) hosted the March Meetup at NSF on Joint NSF-NIH Biomedical Big Data Research: Euretos BRAIN with Barend Mons.
Our Meetup mission statement is:
- Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
- Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
- Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products; and
- Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like the MOOCs (Massive Open Online Courses) being considered by the White House.
Our framework for working with CODATA is:
- Organize a Community of Data Scientists and Related Fields to focus on treating all of your content as "Big Data"
- Example: Federal Big Data Working Group Meetups
- Follow the Cross Industry Standard Process for Data Mining (CRISP-DM; Shearer, 2000) consisting of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment
- Example: Semantic Community Data Science Knowledge Base and Big Data Science for CODATA.
- Mine prominent scientific journals for data policy, data bases, and data results that can be reused.
- Example: CODATA Data Science Journal and International Journal of Digital Earth
- Provide data stories and presentation materials for public education and conferences
- Example: CODATA International Workshop on Big Data for International Scientific Programmes, June 8-9, in Beijing
Second, what I told Dr. Strawn about digital objects:
Re your statement: "Existing identity management systems (e.g., VIVO), each of which uses its own version of unique identifiers as part of the process by which identity is established, may be accommodated within the Digital Object Architecture (DOA)."
I wrote a recent story on VIVO's evolution from Semantic Web application to DuraSpace, and then submitted an abstract to the upcoming VIVO 2014 Conference about DuraSpace doing essentially the same thing as Semantic Community: trying to make all of the VIVO and IVMOOC "research objects" well-defined digital objects.
In addition, I prepared NIH Data Publication 1 for my upcoming discussion with Dr. Phil Bourne and I am thinking about doing NSF Data Publication 1 using your and Dr. Jahanian's slides from the 2014 Ontology Summit as the basis. END OF DR. STRAWN COMMUNICATION
Now I have already done a number of pilot data publications with NITRD, NSF, CODATA, EarthCube, and VIVO content, and could mine for some more content. However, the recent STM Annual Event on Bright Research, Smart Articles and the new Author Ego-System gave me a new perspective. STM is the International Association of Scientific, Technical & Medical Publishers: The Voice of Academic and Professional Publishing. STM is at the leading edge of the latest technology trends within publishing. This annual US event brings together the industry's most established thinkers and bright up-and-coming future stars to give attendees insight into the hottest innovations and the vital technological trends and developments that will define STM publishing for years to come.
The description of the Plenary Session, The Smart Article: The Role of Publishers in Semantic Annotation and Knowledge Representation and Searching for Data and Finding New Science, really interested me:
Increasingly the research article becomes computable, adding research data, algorithms and smart searching. How intelligent will the article become? Can it find you so you no longer need to search for it? Can it test assertions? Generate new hypotheses? Can articles generate new articles without human interference? Will human analysis be eliminated and, if so, up to what point? Where are the new opportunities for publishers? Come and listen to two experts in data mining and actionable articles, both well known from FORCE11.
I mined the Tweets and selected the following:
- Common thread running through talks: metadata is extremely valuable, more valuable than the data/article/research itself
- L Hunter: "With enough data you don't need semantic search. You can just use statistics."
- Publishers could (and should?) deliver knowledge representation thru broad knowledge, multiple information sources, etc
- L Hunter: Knowledge Representation (publishers) look at Alzforum collaborative knowledge sharing
- STM Tech trend 2: return of the author, Scholarly Author Ego System
- Kevin Boyak notes that books are cited more than journal articles
- Kevin Boyack of SciTech shares data that shows books are 2 to 4x more cited than journal articles in sciences
- Content vs context in data analysis
- Kevin Boyack up now on the mapping of science and the analytics of publishing
- "Looking for Data: Finding New Science": http://t.co/eok3ma37vO
- Tech trend 3: new players changing the game. see http://ow.ly/3jPdvY
- Tech trend 2: the return to the author
- Tech trend 1: the machine is the new reader. Highlights from the Future Lab team
- A baseball metrics talk to open. With perfect timing, the latest submission to the @writelatex gallery is an article on baseball!: https://www.writelatex.com/articles/...ect-on-salary/
- How publishers can be worth the money? Also, joined a thought provoking “Future Lab Committee” this morning
This made me recall that I had sent knowledge representations on Graph Databases, Semantic Search, Data Science for VIVO, etc. to Dr. Strawn, which he really liked, and I realized this was a way senior science managers and I like information distilled and presented. In fact I have a Data Science Knowledge Base that contains a number of Data Science Books, especially for teaching Data Science. So I need to do more. The state-of-the-art wiki (MindTouch) I use supports book-building and publishing, where I can export one or more pages to PDF and MySQL formats.
The STM Annual Event Opening Keynotes on Analytics and Metrics used one of the innovators, writeLaTeX, as an example of advanced authoring: Professional Baseball Pitchers' Performance and its Effect on Salary. I have decided to use this as an example of the advanced scientific data analysis and publishing tools I use, namely MindTouch, Excel, and Spotfire. ADD LINK AND SCREEN CAPTURES BELOW This example supports STEM education at the high school and college level. The data came from [2], [3], [4], [5], and [6], but the Excel was not provided, so I created a spreadsheet. In addition, one cannot interact with the data in their publication, and their publication is not reusable, like mine is here.
Stephen Wolfram's recent blog post, Something Very Big Is Coming: Our Most Important Technology Project Yet, said:
Computational knowledge. Symbolic programming. Algorithm automation. Dynamic interactivity. Natural language. Computable documents. The cloud. Connected devices. Symbolic ontology. Algorithm discovery. These are all things we’ve been energetically working on—mostly for years—in the context of Wolfram|Alpha, Mathematica, CDF and so on.
There’ll be the Wolfram Data Science Platform, that allows one to connect to all sorts of data sources, then use the kind of automation seen in Wolfram|Alpha Pro, then pick out and modify Wolfram Language programs to do data science—and then use CDF to set up reports to generate automatically, on a schedule, through an API, or whatever.
There’ll be the Wolfram Publishing Platform that lets you create documents, then insert interactive elements using the Wolfram Language and its free-form linguistics—and then deploy the documents, on the web using technologies like CloudCDF, that instantly support interactivity in any web browser, or on mobile using the Wolfram Cloud App.
We’ve also been building the Wolfram Course Authoring Platform, that does major automation of the process of going from a script to all the elements of an online course—then lets one deploy the course in the cloud, so that students can have immediate access to a Wolfram Language sandbox, to be able to explore the material in the course, do exercises, and so on.
I am anxious to see what comes out, but glad to know I have a way to do advanced big data publications for NSF and other agencies and organizations now!
MORE TO FOLLOW
Professional Baseball Pitchers’ Performance and its Effect on Salary
Source: https://www.writelatex.com/articles/...dsqqqnrynm.pdf (PDF)
Charles Hills and Marshall Gregory
April 11, 2014
Abstract
In this study we identify factors that affect a Major League Baseball (MLB) pitcher’s salary. We are interested in knowing whether ability is a good indicator of compensation. To test this we created a model to predict the salaries of pitchers in the MLB.
1 Introduction
Money is a major driving factor in professional baseball and a major consideration for team managers looking to make changes to their rosters. Baseball is not a fair game: in most professional sports, teams are limited to a salary cap (e.g., the NFL has a salary cap of $133 million per team [1]). In baseball, however, there are no such limitations; team payrolls are limited only by their owners’ willingness to pay.
These payrolls may be determined by the amount of money generated by ticket sales or by the sale of team paraphernalia and royalties. The MLB sets no required ticket price, so each team can charge as much or as little as it wants for tickets. Popular teams with large fanbases are generally able to charge more for tickets or sell tickets in greater volume than less popular teams. Additionally, team payroll may be correlated with the market size of their home city [?].
This leads to major discrepancies in the amount teams are able to pay their players and the caliber of players they are able to recruit. In 2013, the Houston Astros had the lowest payroll in baseball at $26.1 million. The New York Yankees, the highest paying team in the league, paid out a staggering $228.1 million - over eight times as much. Alex Rodriguez, the highest paid player on the Yankees and in the league, earned $28 million in 2013: more than every player on the Astros team combined!
With this in mind, it is clear that low-budget teams (often called "small market teams") should be seeking out players who will play for a lower salary but still perform. By contrast, large market teams should only accept the best players, and will lure them in with exorbitant salaries. We evaluated the salaries of 345 pitchers in 2013 to see if this is true.
2 Methods
2.1 Sampling
We used a sample of 345 pitchers for this study. Pitchers were chosen as they play a crucial role on the team and are relatively easy to compare to one another. We only considered pitchers who had been playing for three consecutive years (i.e., 2011, 2012, and 2013), as the salaries of rookie pitchers cannot be predicted without statistics from prior years. We also removed from consideration pitchers who make less than $700,000; these players' salaries are dictated by the MLB price floor, not their (not-so-great) performances.
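My Note: The sampling rules above can be sketched as a simple filter. This is an illustration only, written in Python rather than the paper's R, and the record layout (the years_active and salary fields) is hypothetical, not the authors' actual data format.

```python
# Sketch of the sampling filter: three consecutive seasons (2011-2013)
# and a salary at or above the $700,000 cutoff. Field names are
# hypothetical, not from the authors' data set.
pitchers = [
    {"name": "A", "years_active": [2011, 2012, 2013], "salary": 5_000_000},
    {"name": "B", "years_active": [2012, 2013], "salary": 1_200_000},
    {"name": "C", "years_active": [2011, 2012, 2013], "salary": 500_000},
]

def in_sample(p):
    # Require all three consecutive seasons and a salary above the floor.
    has_three_years = all(y in p["years_active"] for y in (2011, 2012, 2013))
    above_floor = p["salary"] >= 700_000
    return has_three_years and above_floor

sample = [p["name"] for p in pitchers if in_sample(p)]
# Only pitcher "A" satisfies both criteria.
```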
2.2 Software Used
All statistics and plots were done in R. Charts and formatting were done in LaTeX. Data was gathered and arranged in Microsoft Excel.
My Note: The data came from [2], [3], [4], [5], and [6], but the Excel was not provided, so I created a spreadsheet
3 Results
Since we want to know whether ability is the driving factor behind a pitcher's salary, we chose earned run average (ERA) to quantify ability. ERA is defined as
ERA = 9 × Earned runs allowed / Innings pitched (1)
where "earned runs" are runs not scored on a fielding error [7]. A lower score signifies a pitcher who allows fewer runs, so a lower ERA is better. Since runs are all that matter at the end of a game, this is a good indicator of a pitcher's performance. It is also worth noting that this is not a count statistic, since it is an average per nine innings. The only major downside is that ERA does not take the quality of opponents into account, but since every pitcher pitches to hundreds of opponents across different teams, this is not significant.
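My Note: Equation (1) is straightforward to compute; a minimal Python sketch (the example numbers are made up for illustration):

```python
def era(earned_runs, innings_pitched):
    """Earned run average per equation (1): 9 * earned runs / innings pitched."""
    return 9 * earned_runs / innings_pitched

# A made-up pitcher allowing 50 earned runs over 120 innings
# has an ERA of 3.75, which happens to match the sample mean in Table 1.
print(era(50, 120))
```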
We defined our testing hypotheses using ERA:
H_{0}: Beta_{1} = 0
H_{1}: Beta_{1} ≠ 0 (2)
where Beta_{1} is the coefficient for ERA predicting salary.
In order to test these hypotheses, we created a model by choosing from 36 predictors, including ERA. We chose from all possible models with 5 or fewer predictors based on their Bayesian information criterion (BIC). This selection method helped us deal with multicollinearity and with the computational time required to evaluate a large number of potential models. BIC is defined as
BIC = -2 ln(L̂) + k (ln(n) + ln(2π)) (3)
where n is the number of data points (in our case 345), k is the number of regressors (this penalizes models using many regressors), and L̂ is the maximum value of the likelihood function for the model. Minimizing BIC, we found several good candidates:
Figure 1: Model candidates and their BIC Values
ERA was not chosen in any of the models. Note that models were generated using an exhaustive algorithm (i.e. during each step all models were considered).
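My Note: The selection criterion in equation (3) can be sketched directly; this Python version takes the log-likelihood, regressor count, and sample size as inputs (the log-likelihood values below are made up for illustration, only n = 345 comes from the paper):

```python
import math

def bic(log_likelihood, k, n):
    """BIC per equation (3): -2 ln(L-hat) + k * (ln(n) + ln(2*pi))."""
    return -2 * log_likelihood + k * (math.log(n) + math.log(2 * math.pi))

# With equal likelihood, a model with more regressors is penalized more,
# which is why the exhaustive search favors small models.
small = bic(-500.0, 3, 345)
large = bic(-500.0, 6, 345)
assert large > small
```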
Based on this criterion, we chose the predictor variables age (AGE), age squared (AGE2), games starting (GS), innings pitched as a starter (IP.S), and saves (SV). In order to test our hypothesis, we also include ERA:
Table 1: Predictor summary statistics
Predictor | Min | Mean | Max | Std. Dev. | Obs. |
ERA | 1.44 | 3.75 | 8.11 | 1.04 | 345 |
AGE | 21 | 29 | 41 | 3.71 | 345 |
AGE^{2} | 441 | 854.9 | 961 | 224.50 | 345 |
GS | 0 | 12.4 | 34 | 7.856 | 345 |
IP.S | 0 | 75.83 | 236 | 80.29 | 345 |
SV | 0 | 3.518 | 46 | 7.86 | 345 |
Using these variables we predict salary using a regression of the form
ln(SALARY) = Beta_{0} + Beta_{1}ERA + Beta_{2}AGE + Beta_{3}AGE^{2} + Beta_{4}GS + Beta_{5}IP.S + Beta_{6}SV (4)
We use ln(SALARY) to deal with heteroscedasticity.
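My Note: As a rough illustration, the fitted model can be evaluated with the coefficients reported in Table 2 to predict a salary for a hypothetical pitcher. This is a sketch in Python, not the authors' R code, and the input values are made up (chosen near the Table 1 means):

```python
import math

# Coefficients from Table 2 (full data set; dependent variable is ln(SALARY)).
beta = {
    "intercept": 27.254145, "ERA": 0.042576, "AGE": -0.736468,
    "AGE2": 0.010664, "GS": 0.042898, "IP.S": -0.017646, "SV": -0.068462,
}

def predict_salary(era, age, gs, ip_s, sv):
    """Evaluate the fitted regression and exponentiate ln(SALARY) to dollars."""
    ln_salary = (beta["intercept"] + beta["ERA"] * era + beta["AGE"] * age
                 + beta["AGE2"] * age ** 2 + beta["GS"] * gs
                 + beta["IP.S"] * ip_s + beta["SV"] * sv)
    return math.exp(ln_salary)

# A made-up pitcher: ERA 3.75, age 29, 12 games started,
# 76 innings pitched as a starter, 4 saves.
salary = predict_salary(3.75, 29, 12, 76, 4)
```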
There are no negative salaries, so we expect Beta_{0} to be positive.
As previously discussed, a low ERA indicates a more skilled pitcher, so we expect Beta_{1} to be negative
Since professional athletes tend to get better after their rookie year up until a "peak", and then decline with age, we expect the age predictors will create a concave-down parabola peaking somewhere in the mid to late twenties. Therefore, we expect Beta_{2} to be positive and Beta_{3} to be negative.
Valuable pitchers will start more games, so we expect Beta_{4} to be positive. This will also increase the value of IP.S (along with the stamina required to pitch more innings per game), so we predict Beta_{5} will be positive.
A pitcher who finishes a game records a save if at least one of three conditions is satisfied:
- his team is ahead by less than four runs when he enters the game and he pitches for an entire inning
- he enters the game when the enemy team has the potential to tie the game with the next at-bat
- he pitches for at least three innings
A pitcher cannot record a win and a save in the same game [7]. Since a good relief pitcher will rack up more saves and have more opportunities to do so, we expect Beta_{6} to be positive.
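My Note: The three save conditions can be expressed as a simple predicate. This Python sketch encodes the conditions exactly as stated above (the official MLB save rule has additional nuances), with hypothetical boolean inputs; it assumes the pitcher finished the game without earning the win:

```python
def records_save(lead_runs, pitched_full_inning, tying_run_due_up, innings_pitched):
    """True if a finishing, non-winning pitcher meets any of the three
    save conditions listed in the text."""
    return ((lead_runs < 4 and pitched_full_inning)  # entered ahead by < 4 runs
            or tying_run_due_up                      # tying run could score next at-bat
            or innings_pitched >= 3)                 # pitched at least 3 innings

# Ahead by 3 runs and pitched a full inning -> save.
# Ahead by 5, one inning, tying run not due up -> no save.
```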
Running the regression, we found the following coefficient values:
Table 2: Coefficients for the regression of the model given in (4)
Predictor | Coefficient | Std. Error | t-value | p-value |
Intercept | 27.254145 | 1.933914 | 14.093 | < 2e-16 |
ERA | 0.042576 | 0.028177 | 1.551 | 0.13172 |
AGE | -0.736468 | 0.131323 | -5.608 | 4.26e-08 |
AGE^{2} | 0.010664 | 0.002214 | 4.817 | 2.21e-06 |
GS | 0.042898 | 0.038828 | 1.105 | 0.270020 |
IP.S | -0.017646 | 0.006203 | -2.845 | 0.004715 |
SV | -0.068462 | 0.006258 | -10.940 | < 2e-16 |
Here, Beta_{2} and Beta_{3} went against our intuition. Stranger yet, Beta_{4} and Beta_{5} have opposite signs, although IP.S is directly dependent on GS. This combination of indicators provides an interesting metric that values pitchers who pitch many innings with few starts: these are pitchers who either have the endurance to pitch deeper into the game or are not pulled from the game early as often.
We also predicted Beta_{6} incorrectly. This coefficient is likely negative because relief pitchers who record saves earn less on average than starting pitchers.
Figure 2: Plot of the regression
The model fits the data well and has an R^{2} value of 0.64, but there are multiple problems which are illustrated by the residual plots:
Figure 3: Residual analysis for (4)
Note the straight line on the residual plot and the "V shape" on the scale-location plot. Both indicate systematic residuals.
Both of these problems can be explained by the large group of players seen in Figure 2 at the bottom end of salaries. These players are earning the MLB minimum wage ($500,000) [8]. Since these players' salaries are dictated by the salary price floor and not by skill, we removed them and reran the model:
Table 3: Coefficients for the regression of the model given in (4) with minimum wage players excluded
Predictor | Coefficient | Std. Error | t-value | p-value |
Intercept | 8.293866 | 1.784664 | 4.647 | 5.51e-06 |
ERA | -0.051700 | 0.039049 | -1.324 | 0.186755 |
AGE | 0.394701 | 0.119750 | 3.296 | 0.001126 |
AGE^{2} | -0.005752 | 0.001978 | -2.907 | 0.003979 |
GS | -0.062597 | 0.033955 | -1.844 | 0.066467 |
IP.S | 0.019358 | 0.005356 | 3.614 | 0.000366 |
SV | 0.047817 | 0.005014 | 9.537 | < 2e-16 |
Figure 4: Plot of the regression with minimum wage players excluded
The transformation did not significantly improve our R^{2} value, but this is a much more valid model:
Figure 5: Residual analysis after removing minimum wage players
The only potential problem is the slight movement in standardized residuals, but this is not a big issue considering the data. A transformation other than log might fix this.
We failed to reject H_{0} at a 0.05 significance level in this model for both data sets.
4 Conclusion
ERA was not chosen as a strong predictor in our BIC predictor selection, and it was not found to be significant in our model. Every other predictor selected is either a count statistic or age. The count statistics are related to the amount a player is chosen to play, instead of directly measuring his ability; while a better player is certainly likely to play more often, there could be an additional effect of coaches trying to "get their money's worth" out of highly-paid players.
We hypothesized that high ability should yield a high salary, but this is not strictly the case for the data we observed. One explanation for this could be contract restrictions: contracts can sometimes block players from being paid the salary they deserve.
Another variable likely to be significant which we omitted is attendance per game. Since ticket sales generate revenue for teams, a more likable or exciting pitcher may be worth more than a highly skilled one. This might help to explain why age was so significant, since older players have had more time to gather a large fanbase.
We can see that the data does indicate some correlation between salary and performance, but this effect is not as direct as we had expected.
References
My Note: See Spreadsheet for Data from these References
5 Code
# Get predictor variables and salary data
X <- read.csv("predictors.csv")
d <- read.csv("PitcherData.csv")
SALARY <- d$SALARY

# Choose predictor variables by best-subsets selection on BIC
library(alr3)
library(leaps)
library(car)
ss <- regsubsets(as.matrix(X), SALARY, nvmax = 5)
rs <- summary(ss)
subsets(ss, statistic = c("bic"), legend = FALSE, xlim = c(1, 6))
title("BIC values")

# Create linear model
attach(X)
lm <- lm(SALARY ~ AGE + AGE2 + GS + IP.Start + SV)
sink("lmoutput1.txt")
summary(lm)
sink()
fit <- lm$coefficients[1] + lm$coefficients[2] * AGE + lm$coefficients[3] * AGE2 +
  lm$coefficients[4] * GS + lm$coefficients[5] * IP.Start + lm$coefficients[6] * SV

# Plot this model
par(mar = c(4, 4, 2, 2))
plot(SALARY[order(fit)], ylim = c(0, 25000000), xlab = "Pitcher",
     ylab = "Salary (Millions of Dollars)", axes = FALSE)
box()
axis(2, at = seq(0, 25000000, 5000000), label = seq(0, 25, 5))
par(new = TRUE)
plot(fit[order(fit)], ylim = c(0, 25000000), col = "red", type = "l",
     lwd = 2, axes = FALSE, ylab = "", xlab = "")
title("Pitcher Salaries vs Predictions")
legend("topleft", legend = c("Actual Salaries", "Predicted Salaries"),
       pch = c(1, 26), lty = c(0, 1), lwd = c(0, 2), col = c("black", "red"))

# Plot residuals
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
plot(lm)

# Box-Cox power transformation of salary
powerTransform(lm)
bcSALARY <- 4 * (SALARY^0.25 - 1)
bclm <- lm(bcSALARY ~ AGE + AGE2 + GS + IP.Start + SV)
sink("lmoutput2.txt")
summary(bclm)
sink()
bcfit <- bclm$coefficients[1] + bclm$coefficients[2] * AGE + bclm$coefficients[3] * AGE2 +
  bclm$coefficients[4] * GS + bclm$coefficients[5] * IP.Start + bclm$coefficients[6] * SV

# Plot the transformed model
par(mfrow = c(1, 1), mar = c(4, 4, 2, 2))
plot(bcSALARY[order(bcfit)], ylim = c(111, 279), xlab = "Pitcher",
     ylab = "Salary (Transformed Dollars)", axes = FALSE)
box()
axis(2)
par(new = TRUE)
plot(bcfit[order(bcfit)], ylim = c(111, 279), col = "red", type = "l",
     lwd = 2, axes = FALSE, ylab = "", xlab = "")
title("Transformed Pitcher Salaries vs Predictions")
legend("topleft", legend = c("Actual Salaries", "Predicted Salaries"),
       pch = c(1, 26), lty = c(0, 1), lwd = c(0, 2), col = c("black", "red"))

# Plot transformed-model residuals
par(mfrow = c(2, 2), mar = c(4, 4, 2, 2))
plot(bclm)