Table of contents
  1. Story
  2. Slides
    1. Slide 1 Data Science for NSF Data Science Workshop 2015
    2. Slide 2 NSF Data Science Workshop 2015
    3. Slide 3 Workshop Knowledge Base and Data Science Data Publication
    4. Slide 4 Workshop Knowledge Base
    5. Slide 5 Data Mining - Science - Questions - Publication Process
    6. Slide 6 Workshop White Paper Conclusions
    7. Slide 7 NIH Data Commons
    8. Slide 8 The NIH Commons: Overview
    9. Slide 9 OSTP/NSF Data Science Meetup of Meetups
    10. Slide 10 We Already Do This!
    11. Slide 11 The Journey to Data and Meetup 1
    12. Slide 12 The Journey to Data and Meetup 2
    13. Slide 13 The Journey to Data and Meetup 3
    14. Slide 14 My Data Mining Notes in Notepad That Helped Structure What Follows Next
    15. Slide 15 Welcome to NCBI
    16. Slide 16 NCBI Download: FTP and Aspera
    17. Slide 17 NCBI Download Tools
    18. Slide 18 GenBank Release 209.0 Is Now Available Via FTP
    19. Slide 19 NCBI Data & Software
    20. Slide 20 GenBank Overview
    21. Slide 21 Sample GenBank Record
    22. Slide 22 Nucleic Acids Research Article
    23. Slide 23 Welcome to the VaDE
    24. Slide 24 VaDE All Associations
    25. Slide 25 all_gwas_snap.csv
    26. Slide 26 Table 1 Statistics of the Population-Based Reproducibility Assessment
    27. Slide 27 VaDE Supplementary Data
    28. Slide 28 Table S1 List of Variations That Are Associated with Various Traits
    29. Slide 29 PubMed Article on Beta-Glucan and Cancer
    30. Slide 30 Knowledge Base Index and Tables in Spreadsheet
    31. Slide 31 Knowledge Base Index and Tables in Spotfire Visualizations
    32. Slide 32 Conclusions and Recommendations
  3. Spotfire Dashboard
  4. Research Notes
    1. DRAFT Potential Sessions for Meetup Meeting: November 5-6, 2015
      1. Day 1 - November 5th, 2015 (Half Day)
      2. Day 2 - November 6th, 2015 (Full Day)
    2. Help Students Mine for Genomic Data
  5. Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?
    1. Abstract
    2. Background
    3. How Does RF Work?
      1. Figure 1 Training of an individual tree of an RFM
    4. Important Variables For Class Prediction
    5. Proximity Scores Allow Determining Similarity Between Samples
    6. RF Implementations
    7. RF In The Life Sciences
      1. Table 1 Random Forest use in Life Sciences publications ordered by data type or origin
    8. Neglected RF Properties
    9. Proximity
    10. Local Importance
    11. Conditional Relationships and Variable Interactions
      1. Figure 2 Concept visualization of how relations between variables and samples could be represented following the dissection of the trees in a random forest
    12. Conclusion
    13. Key points
    14. Supplementary Data
    15. Biographies
    16. References
  6. VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms
    1. Abstract
    2. Introduction
    3. THE VaDE Databases
      1. Collection of association data
        1. Figure 1 Data flow in VaDE database
      2. Standardize ontology
      3. Reproducibility assessment of SNP-trait association
        1. Table 1. Statistics of the population-based reproducibility assessment (VaDE version 1.3)
    4. Collection of functional genomic data
      1. Web interface
        1. Figure 2 Screen shot of VaDE contents
      2. Future perspectives
    5. Supplementary Data
    6. Acknowledgements
    7. Footnotes
    8. Funding
    9. References
      1. 1
      2. 2
      3. 3
      4. 4
      5. 5
      6. 6
      7. 7
      8. 8
      9. 9
      10. 10
      11. 11
      12. 12
      13. 13
      14. 14
      15. 15
      16. 16
      17. 17
      18. 18
      19. 19
      20. 20
      21. 21
      22. 22
  7. What's VaDE?
    1. Outline of VaDE
    2. Detailed Information of Pages and Utilization
      1. 1 What’s VaDE?
        1. Figure 1 Flow of the VaDE database construction
      2. 2 Major pages in VaDE and their links
        1. Figure 2 Links among the major pages in VaDE: (A) Reproduced Associations  page, (B) All Associations page, (C) SNP Functional Annotations page, (D) Genome Browser page
      3. 3 Detailed information of pages and search system
        1. 3.1 Top page
          1. Figure 3-1 Top page: (A) search window for reproduced associations in each trait and population, (B) search window for all SNP-trait associations in a country
        2. 3.2 Reproduced Associations page
          1. Figure 3-2 Reproduced Associations page
        3. 3.3 All Associations page
          1. Figure 3-3 All Associations page
        4. 3.4 SNP Functional Annotations page
          1. Figure 3-4 SNP Functional Annotations page
        5. 3.5 Genome Browser page
          1. Figure 3-5 Genome Browser page
          2. Table 3-5-1 Chromatin in state segmentation by HMM from ENCODE/Broad
          3. Table 3-5-2 Chromatin in Core Marks segmentation by HMM from Roadmap Project
      4. 4 Additional information
    3. Database construction
      1. Collection of SNP-trait association data
      2. Standardize ontology of population, disease, and gene
      3. Reproducibility assessment of SNP-trait association
      4. Collection of functional genomic data
    4. Release History
      1. 2015-6-16
      2. 2015-3-27 VaDE version 1.5
      3. 2014-12-16
      4. 2014-11-20 VaDE version 1.4
      5. 2014-08-21 VaDE version 1.3
      6. 2014-06-26 VaDE version 1.2
      7. 2014-06-06 VaDE version 1.1
      8. 2014-04-24 VaDE version 1.0
    5. Site policy
      1. Disclaimer
      2. Terms of Use
      3. Copyright
      4. Reference
    6. About us
      1. VaDE Development Team
      2. Presentation
      3. Contact
      4. Acknowledgement
      5. Funding
      6. Credit
  8. GERT PhD Program in Big Data and Data Science at the UW
  9. Consent Form – Interviews (individuals)
  10. White Paper Review Criteria Data Science Workshop, 2015
    1. Merit Review Criteria
    2. Review Instructions
  11. White Papers
    1. Genomic Data Science
      1. Background
      2. Problem Statement
      3. Broader Impacts
      4. References
    2. Big Data: From correlation to causation
      1. Background
      2. Problem Statement
      3. Broader Impacts
      4. References
    3. Shape mapping in genome-wide association studies
      1. Background
      2. Problem Statement
        1. Figure 1. The procedure of extracting shape information from a leaf image
      3. Broader Impacts
      4. References
  12. Workshop Overview
    1. Want to participate? Get engaged!
    2. White paper submission guidelines
    3. What happens next?
    4. Get Started
    5. Questions?
    6. Conference Organizers
    7. Banner image credits
  13. Agenda
    1. Day 1: Wednesday, August 5, 2015
    2. Day 2: Thursday, August 6, 2015
    3. Day 3: Friday, August 7, 2015
    4. Team breakout session room assignments
    5. Y-team assignments
  14. Mentors, Observers, Ethnographers & Organizers
    1. Anthony Arendt
      1. Polar Science Center University of Washington
    2. Ginger Armbrust
      1. Oceanography University of Washington
    3. Magda Balazinska
      1. Computer Science & Engineering University of Washington
    4. Chaitanya Baru
      1. Computer and Information Science and Engineering National Science Foundation
    5. Dave Beck
      1. Chemical Engineering University of Washington
    6. Rahul Biswas
      1. Astronomy University of Washington
    7. Alvin Cheung
      1. Computer Science & Engineering University of Washington
    8. Andy Connolly
      1. Astronomy University of Washington
    9. Oren Etzioni
      1. Allen Institute for Artificial Intelligence
    10. Rob Fatland
      1. Microsoft Research Microsoft
    11. Brittany Fiore-Gartland
      1. Human-Centered Design and Engineering University of Washington
    12. Will Gagne-Maynard
      1. Oceanography University of Washington
    13. Jeff Gardner
      1. Google
    14. Andrew Gartland
      1. Vaccine and Infectious Disease Division Fred Hutchinson Cancer Research Center
    15. Joseph Hellerstein
      1. eScience Institute University of Washington
    16. Tim Hesterberg
      1. Google
    17. Laura Norén
      1. Center for Data Science New York University
    18. Earnestine Psalmonds
      1. Division of Graduate Education National Science Foundation
    19. Raghu Ramakrishnan
      1. Microsoft
    20. Renata Rawlings-Goss
      1. AAAS S&T Policy Fellow National Science Foundation
    21. Darwin Schweitzer
      1. Algorithms and Data Science Group Microsoft
    22. Valentina Staneva
      1. eScience Institute University of Washington
    23. Alejandro Suarez
      1. AAAS S&T Policy Fellow National Science Foundation
    24. Anissa Tanweer
      1. Communication University of Washington
    25. Kristin Tolle
      1. Microsoft Research Microsoft
    26. David Williams
      1. Biology University of Washington
    27. Jennifer Worrell
      1. Computer Science & Engineering University of Washington
    28. Fen Zhao
      1. Strategic Innovation National Science Foundation
  15. Posters
    1. Yazhong Wang
    2. Julie van der Hoop
    3. Fred Morstatter
    4. Austin Arrington
    5. Chris Cacciapaglia
    6. James Morton
    7. Arif Khan
    8. Alexandra Munoz
    9. Colin Raffel
    10. Ashlynn Daughton
    11. Benjamin Weinstein
    12. Rayna Harris
    13. Raghava Mutharaju
    14. Dorian Rosen
    15. Dennis Linders
    16. Charmgil Hong
    17. Haroon Raja
    18. Bharathi Asokarajan
    19. William Kearney
    20. Saurabh Jha
    21. Victoria Villar
    22. Timothy Jones
    23. Daniel Cook
    24. Xin Yang
    25. Sepideh Pourazarm
    26. Susmit Shannigrahi
    27. Erika Helgeson
    28. Qi Song
    29. Chengrui Li
    30. Jianbo Ye
    31. Berk Ustun
    32. Mahdi Ahmadi
    33. Sayamindu Dasgupta
    34. Abhijit Bendale
    35. Justin Brandenburg
    36. Kevin Keys
    37. Yang Wang
    38. Ethan Rudd
    39. Wei Xie
    40. Marina Kogan
    41. Taisuke Imai
    42. Lincoln Sheets
    43. Nancy (Xin Ru) Wang
    44. Kin Gwn Lore
    45. Anuj Karpatne
    46. Jesse Hoff
    47. Zachary Foster
    48. Daqing Yun
    49. Kelly Spendlove
    50. Can Hu
  16. Team Assignments
  17. Team 1
  18. NEXT

NSF Data Science Workshop 2015


Story

Data Science on the NSF Data Science Workshop 2015

My goal is to organize a Federal Big Data Working Group Meetup on this topic, with an agenda for September 28th whose presenters and content support the OSTP/NSF Data Science Meetup of Meetups on November 5-6 (see Research Notes).

As background, I want to write and submit a White Paper that follows the six steps of the Data Mining Standard (CRISP-DM):

  • Business Understanding - Workshop Purpose
  • Data Understanding - Data Sets Used by Students?
  • Data Preparation - Etc.
  • Modeling
  • Evaluation
  • Deployment

I want to follow the Data Science Process:

  • Data Preparation
  • Data Ecosystem
  • Data Story

I want to answer the four essential data science questions:

  • How was the data collected?
  • Where is the data stored?
  • What are the data results? and
  • Why should we believe the data results?

I want my White Paper to be a Data Science Data Publication:

  • Knowledge Base
  • Spreadsheet Index
  • Web & PDF Tables to Spreadsheet
  • Data Browser
  • Dynamically Linked Adjacent Visualizations
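
As a rough illustration of the Spreadsheet Index item above, the following Python sketch writes a small index of knowledge-base entries as CSV text (which a spreadsheet can open) and searches it by keyword. The column names (`title`, `source`, `url`) and the helper names `write_index` and `search_index` are assumptions made for this example, not part of any workshop tooling; the two URLs are the NCBI pages cited in this story.

```python
import csv
import io

# Hypothetical index rows: one row per knowledge-base entry.
ENTRIES = [
    {"title": "GenBank", "source": "NCBI",
     "url": "http://www.ncbi.nlm.nih.gov/genbank/"},
    {"title": "NCBI Data & Software", "source": "NCBI",
     "url": "http://www.ncbi.nlm.nih.gov/guide/data-software/"},
]

# Assumed column layout for the index.
FIELDS = ["title", "source", "url"]


def write_index(entries):
    """Return the index as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def search_index(csv_text, keyword):
    """Return rows in which any field contains the keyword (case-insensitive)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    needle = keyword.lower()
    return [row for row in reader
            if any(needle in value.lower() for value in row.values())]


index_csv = write_index(ENTRIES)
matches = search_index(index_csv, "genbank")
print([row["title"] for row in matches])
```

Keeping the index as plain CSV means the same file could feed both a spreadsheet and a visualization tool; a real index would likely need more columns (for example, a description or a table name).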

For an example of an end product for the USDA, see recent: Slides

Since the three white papers from the NSF Data Science Workshop did not describe any actual work with data sets, I decided to use their content to find a data set. The first reference in the first white paper was GenBank. When I shared that, I got a response from a member of the OSTP/NSF Data Science Meetup of Meetups planning team who works directly with it:

I do a lot of genomics data stuff (I work for NCBI, which is the largest genomic database in the world [we make Genbank, which is the first citation in the genomics data challenge summary]).  

I think I might be able to help focus the genomics data challenge a bit more.

I responded:

I looked at: http://www.ncbi.nlm.nih.gov/genbank/

And found: http://www.ncbi.nlm.nih.gov/guide/data-software/

And wondered where it would be good to start?

This is like a Data Commons that Vivien Bonazzi talked about at our last meetup: A NIH – Semantic Medline Data Science Data Publication Commons (Click See All). 

I could build a searchable index in a spreadsheet and Spotfire with your guidance, like I have done for other NIH data sets.

He responded:

We also have bigger databases (in terms of data size) like SRA, dbGaP and GEO.

Here’s a third party attempt at normalizing the SRA metadata:

http://www.bioconductor.org/packages...tml/SRAdb.html

We also provide a run selector tool for visualization in SRA, if you go to the send to menu.

We’ve also done some hackathoning with such data

https://github.com/DCGenomics?tab=repositories

To come full circle, the RNA_mapping repo here:

https://github.com/NCBI-Hackathons

May be the preamble for a collaboration with the NIH Data Science Data Commons.

The NIH Commons Framework is:

  • Discoverability: Search and Find
  • Open APIs: Data and Tools
  • Unique IDs: For Digital Research Objects
  • Containers: For Packaging Applications
  • Computing Platform: Cloud & HPC

Interestingly, Semantic Community and the Federal Big Data Working Group Meetup already do this in our Data Science Data Publications! To illustrate that, let's do another one with the NIH Genomics Data for the White Paper.

The results are shown below in the:

which support the NIH Commons Framework. QED

Bottom Line: You do not need to process the really big genomic data (TBs) if someone has extracted the associations into a nice database (19 MB) like VaDE has done! Thank you to the Team at the Biomedical Informatics Laboratory, Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine.

MORE TO FOLLOW

Slides

Slides

Slide 1 Data Science for NSF Data Science Workshop 2015

Semantic Community

Data Science

NSF Data Science Workshop 2015

BrandNiemann08242015Slide1.PNG

Slide 2 NSF Data Science Workshop 2015

http://depts.washington.edu/dswkshp/

BrandNiemann08242015Slide2.PNG

Slide 3 Workshop Knowledge Base and Data Science Data Publication

Semantic Community

Data Science

NSF Data Science Workshop 2015

BrandNiemann08242015Slide3.PNG

Slide 4 Workshop Knowledge Base

BrandNiemann08242015Slide4.PNG

Slide 5 Data Mining - Science - Questions - Publication Process

BrandNiemann08242015Slide5.PNG

Slide 6 Workshop White Paper Conclusions

BrandNiemann08242015Slide6.PNG

Slide 8 The NIH Commons: Overview

https://datascience.nih.gov/commons

BrandNiemann08242015Slide8.PNG

Slide 9 OSTP/NSF Data Science Meetup of Meetups

BrandNiemann08242015Slide9.PNG

Slide 10 We Already Do This!

BrandNiemann08242015Slide10.PNG

Slide 11 The Journey to Data and Meetup 1

http://www.ncbi.nlm.nih.gov/genbank/

BrandNiemann08242015Slide11.PNG

Slide 14 My Data Mining Notes in Notepad That Helped Structure What Follows Next

http://semanticommunity.info/%40api/...?origin=mt-web

BrandNiemann08242015Slide14.PNG

Slide 15 Welcome to NCBI

http://www.ncbi.nlm.nih.gov/ 

BrandNiemann08242015Slide15.PNG

Slide 17 NCBI Download Tools

http://www.ncbi.nlm.nih.gov/home/tools.shtml

BrandNiemann08242015Slide17.PNG

Slide 18 GenBank Release 209.0 Is Now Available Via FTP

http://www.ncbi.nlm.nih.gov/news/08-...k-release-209/

BrandNiemann08242015Slide18.PNG

Slide 19 NCBI Data & Software

http://www.ncbi.nlm.nih.gov/guide/data-software/

BrandNiemann08242015Slide19.PNG

Slide 20 GenBank Overview

http://www.ncbi.nlm.nih.gov/genbank/

BrandNiemann08242015Slide20.PNG

Slide 22 Nucleic Acids Research Article

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383886/

BrandNiemann08242015Slide22.PNG

Slide 23 Welcome to the VaDE

http://bmi-tokai.jp/VaDE/

BrandNiemann08242015Slide23.PNG

Slide 24 VaDE All Associations

http://bmi-tokai.jp/VaDE/all-gwas-snp/

BrandNiemann08242015Slide24.PNG

Slide 25 all_gwas_snap.csv

all_gwas_snp.csv

BrandNiemann08242015Slide25.PNG

Slide 26 Table 1 Statistics of the Population-Based Reproducibility Assessment

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383886/table/tbl1/

BrandNiemann08242015Slide26.PNG

Slide 28 Table S1 List of Variations That Are Associated with various Traits

http://nar.oxfordjournals.org/content/suppl/2014/10/30/gku1037.DC1/Table_S1_nagai3.docx

BrandNiemann08242015Slide28.PNG

Slide 29 PubMed Article on Beta-Glucan and Cancer

http://www.ncbi.nlm.nih.gov/pubmed/19573626

BrandNiemann08242015Slide29.PNG

Slide 30 Knowledge Base Index and Tables in Spreadsheet

NSFNIHGeonomic.xlsx

BrandNiemann08242015Slide30.PNG

Slide 31 Knowledge Base Index and Tables in Spotfire Visualizations

Web Player

BrandNiemann08242015Slide31.PNG

Slide 32 Conclusions and Recommendations

BrandNiemann08242015Slide32.PNG

Spotfire Dashboard

For Internet Explorer Users and Those Wanting Full Screen Display Use: Web Player Get Spotfire for iPad App


Research Notes

DRAFT Potential Sessions for Meetup Meeting: November 5-6, 2015

National Data Science Organizers Workshop

Potential Sessions for November 5-6, 2015

Please note: This will be in-person by invitation and remote for all

Day 1 - November 5th, 2015 (Half Day)

12:00 pm (Pre-conference Lunch with Big Data Regional Hubs Leaders)

1:30 pm Session 1: Data Science for the Nation

Keynote: What are the National Priorities?, White House Office of Science and Technology Policy - Deputy Director for Technology and Innovation

Impacts of Data Science on National Priorities

  • Data Kind: Speaker
  • Data Science for Social Good: Speaker
  • Federal Meetup: Speaker

Discussion: Using Meetups to explore National Challenges

5:00 pm: Evening Event at AAAS: Grassroots Data Science Across the Nation

Lightning Talks: Every group gets 10 slides and 3 minutes

Highlight past events in National Priority Areas or of national interest, state plans for the future, and give challenges, and ideas for how a Network of Data Science Organizers can solve national problems.

Networking Reception: Highlight AAAS, S&T Fellows, and Affinity Groups

Day 2 - November 6th, 2015 (Full Day)

8:00 am Session 2: Exposing Data

Available Datasets: Speakers

  • Socrata Open Data Portal demo: Speaker
  • Open Data.gov / Open Data Working Group: Speaker

Exposing data resources

  • Meetup Contributions

Product Creation: Connecting data sources among regions.

10:00 am Break

10:30 am Session 3: Coordination and Support of Data Science Meetups

Resources for Meetups:

  • Federal Support for Meetup groups

Coordination mechanisms:

  • YouTube Channel, Podcast, White Papers, listserv
  • Meetup of Data Science Meetup groups Online

Discussion: Mechanisms to spread good ideas among regions.

12:30 pm Lunch Speaker

1:30 pm Session 4: The National Priority Challenge

National Priority Challenge-Speaker

  • National Data Science Challenges and Hackathons: Proposed by steering committee
  • Research Data Alliance (RDA): P8 venue to announce specific challenge (2016)

Working Session: Launching National Priority Challenge 2016

5:45 pm Closing Remarks: TBA

Help Students Mine for Geonomic Data

http://www.ncbi.nlm.nih.gov/
The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.

Download: Transfer NCBI Data to Your Computer

http://www.ncbi.nlm.nih.gov/home/download.shtml
The majority of NCBI data are available for downloading, either directly from the NCBI FTP site or by using software tools to download custom datasets.

NCBI Announcements:

GenBank release 209.0 is now available via FTP
19 Aug 2015
http://www.ncbi.nlm.nih.gov/news/08-...nk-release-209

GenBank release 209.0 (8/14/2015) has 187,066,846 non-WGS, non-CON records containing 199,823,644,287 base pairs of sequence data. In addition, there are 302,955,543 WGS records containing 1,163,275,601,001 base pairs of sequence data, as well as 87,827,013 TSA records containing 69,360,654,413 base pairs of sequence data.

All Databases and Downloads

http://www.ncbi.nlm.nih.gov/guide/data-software/

GenBank Overview

http://www.ncbi.nlm.nih.gov/genbank/

VarySysDB Disease Edition (VaDE) is a literature based genetic trait and genomic information database. Provides genomic polymorphisms associated to diseases, traits, and pharmacogenomics.

http://bmi-tokai.jp/VaDE/

VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383886/

Amazing Trail!

http://nar.oxfordjournals.org/conten...2014/10/30/gku1037.DC1

VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms

SUPPLEMENTARY DATA
Files in this Data Supplement:

SUPPLEMENTARY DATA

http://nar.oxfordjournals.org/conten...2014/10/30/gku1037.DC1/Table_S1_nagai3.docx (Word)

VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms

http://nar.oxfordjournals.org/conten...r.gku1037.full

Pop Out Table
http://nar.oxfordjournals.org/conten...expansion.html

Table 1. Statistics of the population-based reproducibility assessment (VaDE version 1.3)

Population                   Number of associations   Number of unique variants   Number of unique traits
European                     2446                     2196                        288
African                      67                       56                          19
East Asian                   557                      456                         102
South-East Asian             10                       10                          5
South Asian                  26                       13                          3
West Asian                   4                        2                           2
Caribbean/Central American   22                       15                          9
North American               0                        0                           0
South American               0                        0                           0
Oceanian                     0                        0                           0

 

Similar articles in PubMed
The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.
[Nucleic Acids Res. 2014]
VarySysDB: a human genetic polymorphism database based on all H-InvDB transcripts.
[Nucleic Acids Res. 2009]
GWAS Integrator: a bioinformatics tool to explore human genetic associations reported in published genome-wide association studies.
[Eur J Hum Genet. 2011]
A SNP-centric database for the investigation of the human genome.
[BMC Bioinformatics. 2004]
[Single nucleotide polymorphisms(SNPs)and SNP databases].
[Zhonghua Yi Xue Yi Chuan Xue Z...]

Recent Activity

VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms
Nucleic Acids Research. 2015 Jan 28; 43(Database issue): D868
NCBI Large Data Download Best Practices
VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms.
Nucleic Acids Res. 2015 Jan;43(Database issue):D868-72. doi: 10.1093/nar/gku1037. Epub 2014 Oct 31.
PubMed
Maitake beta-glucan enhances granulopoiesis and mobilization of granulocytes by increasing G-CSF production and modulating CXCR4/SDF-1 expression.
Int Immunopharmacol. 2009 Sep;9(10):1189-96. doi: 10.1016/j.intimp.2009.06.007. Epub 2009 Jun 30.
PubMed
The NHGRI GWAS Catalog, a curated resource of SNP-trait associations

Data mining in the Life Sciences with Random Forest: a walk in the park or lost in the jungle?

Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3659301/

Wouter G. Touw, Jumamurat R. Bayjanov, Lex Overmars, Lennart Backus, Jos Boekhorst, Michiel Wels and Sacha A. F. T. van Hijum (corresponding author)


This article has been cited by other articles in PMC.

Abstract

In the Life Sciences ‘omics’ data is increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled, i.e. sophisticated computational approaches are required to extract the complex non-linear trends present in omics data. Classification techniques allow training a model based on variables (e.g. SNPs in genetic association studies) to separate different classes (e.g. healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited for the analysis of these large data sets. In the Life Sciences, RF is popular because RF classification models have a high prediction accuracy and provide information on importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example: within a class of cancer patients certain SNP combinations may be important for a subset of patients that have a specific subtype of cancer, but not important for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF as these are implicitly taken into account by the algorithm during the creation of the classification model. This review details some RF properties that, to the best of our knowledge, are rarely or never used but that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.

Keywords: Random Forest, variable importance, local importance, conditional relationships, variable interaction, proximity

Background

Development of high-throughput techniques and accompanying technology to manage and mine large-scale data has led to a revolution of Systems Biology in the last decade [1–3]. ‘Omics’ technologies such as genomics, transcriptomics, proteomics, metabolomics, epigenomics and metagenomics allow rapid and parallel collection of massive amounts of different types of data for the same model system. Software tools to manage [4], visualize [5] and integratively analyse omics-scale data are crucial to deal with its inherent complexity and ultimately uncover new biology. For example, knowledge on both gene expression and protein abundance may better explain a phenotype than gene expression or protein abundance separately. Particularly machine learning algorithms play a central role in the process of knowledge extraction [6, 7]. They are applied for supervised pattern recognition in data sets: typically they are used to train a classification model that allows separating samples of different classes (e.g. healthy or ill) based on variables (e.g. SNPs in a Genome-Wide Association Study or GWAS), and to estimate which variables were important for this task (see below).

The Random Forest (RF) algorithm [8] has become very popular for pattern recognition in omics-scale data, mainly because RF provides two aspects that are very important for data mining: high prediction accuracy and information on variable importance for classification. The prediction performance of RF compares well to other classification algorithms [7] such as support vector machines (SVMs, [9, 10]), artificial neural networks [11–13], Bayesian classifiers [14, 15], logistic regression [16], k-nearest-neighbours [17], discriminant analysis such as Fisher’s linear discriminant analysis [18] and regularized discriminant analysis [19], partial least squares (PLS, [20]) and decision trees such as classification and regression trees (CARTs, [21]). The theoretical and practical aspects of many of those algorithms and their application in biology have been discussed elsewhere (for example [6, 22, 23]). SVM and RF are arguably the most widely used classification techniques in the Life Sciences. Comparisons between the prediction accuracy of SVM and RF have been made several times [e.g. 24–29]. Although the performance of carefully tuned SVMs is generally slightly better than RF [24], RF offers unique advantages over SVM (see below). Further comparisons between SVM and RF will not be discussed here.

Life Science data sets typically have many more variables than samples. This problem is known as the ‘curse of dimensionality’ or the small n large p problem [30]. For instance, genomics, transcriptomics, proteomics and GWAS data sets suffer from this problem with in general thousands of measurements of genes, transcripts, proteins or SNPs determined for only dozens of samples [31–33]. RF effectively handles these data sets by training many decision trees using subsets of the data. Furthermore, RF has the potential to unravel variable interactions, which are ubiquitous in data sets generated in the Life Sciences. Interactions can for example be expected between SNPs in GWAS [34], between microbiota in metagenomics [35], between physicochemical properties of peptides in proteomic biomarker discovery studies [36] and between cellular levels of gene-products in gene-expression studies [25]. Additionally, the combinations of variables that together define molecules, e.g. mass spectrometry m/z ratios or Nuclear Magnetic Resonance chemical shifts, can distinguish phenotypes in metabolomics and metabonomics [37]. A final example includes combinations of several protein characteristics influencing the success rate in structural genomics [38]. In summary, its versatility makes RF a very suitable technique to investigate high-throughput data in this omics era.

Recent reviews aimed towards a more specialized audience have discussed the use of RF in (i) a broad scientific context [7], (ii) genomics research [39] and (iii) genetic association studies [40]. Here, we focus on the application of RF for supervised classification in the Life Sciences. In addition to reviewing the different uses of RF, we provide ideas to make this algorithm even more suitable for uncovering complex interactions from omics data. First, we introduce the general characteristics of RF for the reader who is not familiar with RF, followed by its use to tackle problems in data analysis. We also discuss rarely used properties of RF that allow determining interaction between variables. RF even has the potential to characterize these interactions for sample subclasses (e.g. groups of patients for which a SNP combination is predictive, while for a different group of patients the same SNP combination is not). Here, we discuss several research strategies that may allow exploiting RF to its full potential.

How Does RF Work?

Predictive RF models (from now on referred to as RFM) are non-parametric, hard to over-train, relatively robust to outliers and noise and fast to train. The RF algorithm can be used without tuning of algorithm parameters, although a better classification model can often easily be obtained by optimization of very few parameters (see below) [8]. RF trains an ensemble of individual decision trees based on samples, their class designation and variables. Every tree in the forest is built using a random subset of samples and variables (Figure 1), hence the name RF. The RF description by Breiman serves as a general reference for this section [8, 41].

Figure 1 Training of an individual tree of an RFM

The tree is built based on a data matrix (shown within the ellipses). This matrix consists of samples (S1–S10; e.g. individuals) belonging to two classes (encircled crosses or encircled plus signs; e.g. ...

Figure 1:

bbs034f1Figure1.jpg

Suppose a forest of decision trees (e.g. CARTs) is constructed based on a given data set. For each tree, a different training set is created by randomly sampling samples (e.g. patient samples) from the data set with replacement, resulting in a training set, or ‘bootstrap’ set, containing about two-thirds of the samples in the original data set. The remaining samples in the original data set are the ‘out-of-bag’ (OOB) samples. The tree is grown using the bootstrap data set by recursive partitioning (Figure 1). For every tree ‘node’, variables are randomly selected from the set of all variables and evaluated for their ability to split the data (Figure 1). The variable resulting in the largest decrease in impurity is chosen to separate the samples at each ‘parent node’, starting at the top node, into two subsets, ending up in two distinct ‘child nodes’. In RF, the impurity measure is the Gini impurity. A decrease in Gini impurity is related to an increase in the amount of order in the sample classes introduced by a split in the decision tree. After the bootstrap data has been split at the top node, the splitting process is repeated. The partitioning is finished when the final nodes, ‘terminal nodes’ or ‘leaves’, are either (i) ‘pure’, i.e. they contain only samples belonging to the same class or (ii) contain a specified number of samples. A classification tree is usually grown until the terminal nodes are pure, even if that results in terminal nodes containing a single sample. The tree is thus grown to its largest extent; it is not ‘pruned’. After a forest has been fully grown, the training process is completed. The RFM can subsequently be used to predict the class of a new sample. Every classification tree in the forest casts an unweighted vote for the sample after which the majority vote determines the class of the sample.
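The building blocks described above (bootstrap sampling with OOB samples, the Gini impurity decrease of a split, and the unweighted majority vote) can be sketched in a few lines. This is a minimal illustration in plain Python, not the code of any RF package; the function names are hypothetical. Drawing n indices with replacement leaves roughly 63% of the distinct samples in the bag, which matches the "about two-thirds" figure.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def impurity_decrease(parent, left, right):
    """Decrease in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; unseen indices are out-of-bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


def majority_vote(votes):
    """Unweighted majority vote over the class predictions of all trees."""
    return Counter(votes).most_common(1)[0][0]
```

A split that separates two classes perfectly removes all impurity: `impurity_decrease(["a", "a", "b", "b"], ["a", "a"], ["b", "b"])` equals the parent impurity of 0.5.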

Although a single tree from the RFM is a weak classifier because it is trained on a subset of the data, the combination of all trees in a forest is a strong classifier [8]. Random selection of candidate variables for splitting ensures a low correlation between trees and prevents over-training of an RFM. Therefore, trees in an RFM need not be pruned, in contrast to classical decision trees that do not use random selection of variables [8]. The expected error rate of a classifier on new samples is usually estimated by cross-validation procedures, such as leave-one-out or K-fold cross-validation [42]. In K-fold cross-validation, the original data are randomly partitioned into K subsets (folds). Each of the K folds is once used as a test set while the other K − 1 folds are used as training data to construct a classifier. The average of the K error rates is the expected error rate of the classification of new samples when the classifier is built with all samples. In leave-one-out cross-validation, a single sample at a time is left out of the training set and used for testing. General cross-validation procedures are unnecessary to predict the classification performance of a given RFM. A cross-validation is already built-in, as each tree in the forest has its own training (bootstrap) and test (OOB) data.
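For comparison with the built-in OOB estimate, the K-fold procedure described above can be sketched as follows. This is an illustrative outline only: `train_and_score` is a hypothetical callback that trains any classifier on the training indices and returns its error rate on the test indices. Setting `k` equal to the number of samples gives leave-one-out cross-validation.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition the sample indices into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_error(n_samples, k, train_and_score, seed=0):
    """Average the K error rates, using each fold once as the test set.

    `train_and_score(train_idx, test_idx)` is a hypothetical callback that
    builds a classifier on train_idx and returns its error rate on test_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    errors = []
    for i, test_idx in enumerate(folds):
        # The other K - 1 folds together form the training data.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        errors.append(train_and_score(train_idx, test_idx))
    return sum(errors) / k
```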

Important Variables For Class Prediction

In addition to an internal cross-validation, RF also calculates estimates of variable importance for classification [8]. Importance estimates can be very useful to interpret the relevance of variables for the data set under study. The importance scores can for example be used to identify biomarkers [36] or as a filter to remove non-informative variables [25]. Two types of RF variable importance measure are frequently used. The first, the mean decrease in classification accuracy, is based on permutation. For each tree, the classification accuracy of the OOB samples is determined both with and without random permutation of the values of the variable. The prediction accuracy after permutation is subtracted from the prediction accuracy before permutation and averaged over all trees in the forest to give the permutation importance value. The second importance measure is the Gini importance of a variable and is calculated as the sum of the Gini impurity decrease of every node in the forest for which that variable was used for splitting. The use of different variable importance measures is discussed below in more detail.
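Both importance measures can be written down compactly. The sketch below is a simplified illustration rather than the implementation of any particular package. It assumes each tree is represented by a hypothetical triple `(predict, oob_rows, oob_labels)`, where `predict` maps one row of variable values to a class, and that every split in the forest has been recorded as a `(variable, impurity_decrease)` pair.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted class equals the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if predict(row) == label)
    return hits / len(rows)


def permuted_copy(rows, var, rng):
    """Copy `rows` with the values of column `var` shuffled across samples."""
    values = [row[var] for row in rows]
    rng.shuffle(values)
    permuted = []
    for row, value in zip(rows, values):
        new_row = list(row)
        new_row[var] = value
        permuted.append(new_row)
    return permuted


def forest_permutation_importance(trees, var, seed=0):
    """Mean over trees of (OOB accuracy before permutation - accuracy after)."""
    rng = random.Random(seed)
    scores = []
    for predict, oob_rows, oob_labels in trees:
        before = accuracy(predict, oob_rows, oob_labels)
        after = accuracy(predict, permuted_copy(oob_rows, var, rng), oob_labels)
        scores.append(before - after)
    return sum(scores) / len(scores)


def gini_importance(splits, var):
    """Sum of the Gini impurity decreases of all nodes that split on `var`."""
    return sum(decrease for variable, decrease in splits if variable == var)
```

A variable that no tree uses gets a permutation importance of exactly zero, because shuffling its values cannot change any prediction.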

The importance of variables for classification of a single sample is provided by RF as the local importance. It thus shows a direct link between variables and samples. As discussed in more detail below, the differences in local importance between samples can for example be used to detect variables that are important for a subset of samples of the same class (e.g. the important variables for a subtype of cancer in a data set with cancer patients and healthy subjects as classes). The local importance score is derived from all trees for which the sample was not used to train the tree (and is therefore OOB). The percentage of correct votes for the correct class in the permuted OOB data is subtracted from the percentage of votes for the correct class in the original OOB data to assign a local importance score for the variable of which the values were permuted. The score reflects the impact on correct classification of a given sample: negative, 0 (the variable is neutral) and positive. Local importances are rarely used and noisier than global importances, but a robust estimation of local importance values can be obtained by running the same classification several times [43] and for instance averaging the local importance scores.
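As a rough illustration of the local importance score, assume that for one sample and one variable we have recorded, for every tree in which the sample was OOB, whether the tree voted for the correct class before and after permuting that variable. The function names are hypothetical, and the score is expressed as a fraction rather than a percentage.

```python
def local_importance(correct_original, correct_permuted):
    """Local importance of one variable for one sample.

    Both arguments hold one boolean per tree for which the sample was OOB:
    True if that tree voted for the correct class. The score is the fraction
    of correct votes on the original data minus the fraction of correct votes
    after permuting the variable.
    """
    if len(correct_original) != len(correct_permuted):
        raise ValueError("both vote lists must cover the same trees")
    n_trees = len(correct_original)
    return (sum(correct_original) - sum(correct_permuted)) / n_trees


def mean_local_importance(scores_per_run):
    """Average the score over repeated runs to reduce noise."""
    return sum(scores_per_run) / len(scores_per_run)
```

A positive score means permuting the variable hurt the classification of this sample, zero means the variable is neutral, and a negative score means the sample was classified better with the variable scrambled.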

Proximity Scores Allow Determining Similarity Between Samples

RF not only generates variable-related information such as variable importance measures, but also calculates the proximity between samples. The proximity between similar samples is high. For proximity calculations, all samples in the original data set are classified by the forest. The proximity between two samples is calculated as the number of times the two samples end up in the same terminal node of a tree, divided by the number of trees in the forest. Provided sufficient variables are included in the RFM, outliers or mislabelled samples can be defined as samples whose proximity to all other samples from the same class is small. Identification of outliers or mislabelled samples serves as important feedback for the biologist who, if necessary, can correct for experimental mistakes. Similarly, subclasses can in principle be identified by finding samples that have similar proximities to all other samples of the same class. Subclasses in a data set with healthy and diseased subjects can for example be severe and mild subtypes of the disease. Proximity scores also allow the identification of prototypes, representative samples of a group of samples. The variable values of prototypes may explain how those variables relate to the classification of the group. Proximity scores may also be used to construct multidimensional scaling (MDS) plots. MDS plots aim to visualize the dissimilarity (calculated as 1 – proximity) between samples typically in a two-dimensional plot, so that the distances between data points are proportional to the dissimilarities. A good class separation may be obtained by plotting the first two scaling coordinates against each other, provided they capture sufficient information.
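The proximity calculation lends itself to a short sketch. It assumes, as a hypothetical input format, a table `leaf_ids` in which `leaf_ids[t][i]` is the identifier of the terminal node that sample `i` reaches in tree `t`. The helper for spotting candidate outliers follows the definition in the text (small proximity to the other samples of the same class); no threshold is implied.

```python
def proximity_matrix(leaf_ids):
    """Proximity of samples i and j: the share of trees in which they share a leaf."""
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]


def dissimilarity(prox):
    """Dissimilarity used for MDS plots: 1 - proximity."""
    return [[1.0 - p for p in row] for row in prox]


def mean_within_class_proximity(prox, labels, i):
    """Average proximity of sample i to the other samples of its own class.

    Returns None if sample i is the only member of its class. A low value
    flags a candidate outlier or mislabelled sample.
    """
    others = [j for j in range(len(labels)) if j != i and labels[j] == labels[i]]
    if not others:
        return None
    return sum(prox[i][j] for j in others) / len(others)
```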

RF Implementations

The RF algorithm is available in many different open source software packages. Conveniently, the ‘randomForest’ package [44] is available as an R implementation [45] of the original RF code by Breiman and Cutler [41]. It is probably the most referred RF implementation because it is easy to use and the user benefits from other R data processing functionality. Recently, a framework for tree growing called Random Jungle (RJ) was developed [46]. It is currently the fastest implementation of RF, allows parallel computation of trees and is therefore very suited for the analysis of genome-wide data. The Willows package was also designed for tree-based analysis of genome-wide data by maximizing the use of computer memory [47]. The WEKA workbench [48] is a data mining environment that includes several machine learning algorithms including RF. The workbench allows for easy pre-processing of data and comparison between RF and other algorithms.

RF In The Life Sciences

Table 1 lists a non-exhaustive, yet in our opinion representative, number of studies that applied RF in different areas of the Life Sciences. A summary of the use of RF features in these areas is also provided in Table 1. The publications include many highly cited papers and papers that we included because they describe noteworthy use of RF properties. A detailed overview of the use of RF in these publications as well as meta data on them can be found in Supplementary Table S1. See Spreadsheet and Spreadsheet

Table 1 Random Forest use in Life Sciences publications ordered by data type or origin

Table 1:

See Spreadsheet

Three-quarters of the studies exploited the variable importance output of the RF algorithm (Table 1). For example, information on variable importance has been used to identify risk-associated SNPs in a genome-wide association study [56], to determine important genes and pathways for the classification of micro-array gene-expression data [27] and to identify factors that can be used to predict protein–protein interactions [29]. Very few studies report on the use of an iterative variable selection procedure [25] to select the most relevant variables and optimize the prediction accuracy of the RFM, although the classification accuracy improved when such a protocol was applied [24, 25, 68, 98] (Supplementary Table S1). In several data mining pipelines, important variables were selected from an RFM, which were subsequently used in other analysis techniques [50, 71].

Improving prediction accuracy has also been researched. In addition to a better separation of the samples of different classes, the variables of an accurate RFM are likely to be more relevant than those of a less accurate RFM. The number of variables to select for the best split at each node, mtry, was already marked as a tuning parameter by Breiman [6]. Varying the number of trees in the forest may also improve the OOB-error. One-fourth of the papers tuned and optimized the value of mtry and the number of trees. A single study not only regulated the size of the forest but also the size of the trees by varying the minimal node size [25]. The improvement of the prediction accuracy however was negligible. In contrast, Segal reported a better prediction accuracy may be achieved by regulation of the tree size via limiting the number of splits or the size of nodes for which splitting is allowed [99]. Boulesteix et al. [100] also recommended tuning tree depth and minimal node size in the context of genetic association studies. Alternative voting schemes, such as weighted voting, may improve classification accuracy [101] too, but have not been applied in the papers listed in Table 1.

Zhang and Wang pointed out that the interpretation of an RFM may be less practical than the interpretation of a single decision tree classifier due to the many trees in a forest. In a single tree, it is clear in which level of the tree and with what cut-off a variable is used to make a split. In a forest, a variable may or may not be present in a given tree, and if it is present, it may be so at different levels in the tree and have different cut-offs. They proposed to shrink a full forest to a smaller forest having a manageable number of trees and a level of prediction accuracy similar to the original RFM [102]. The smallest forest is one of the attempts to modify RF or use RF in combination with other methods in order to increase the prediction accuracy or model interpretability of RFMs (Table 1). Several other modifications were reviewed by Verikas et al. [7]. RF has not only been used in combination with other techniques, but several studies also combined multiple RFMs in a pipeline for better classification results (Table 1, [55, 72, 87]). RF has also been used in conjunction with dimension reduction techniques [33, 54]. For example, RF has been applied after PLS (PLS-RF, [33]). Sampson and colleagues argued the loadings (relative contribution of variables to the variability in the data) produced by PLS allow for meaningful interpretation of the association between variables and disease. De Lobel et al. [54] have used RF as a pre-screening method to remove noisy SNPs before multifactor-dimensionality reduction in genetic association studies. Additionally, RF has been incorporated in a transductive confidence machine [95], a framework that allows the prediction of classifiers to be complemented with a confidence value that can be set by the user prior to classification [103].

Neglected RF Properties

RF has several properties that allow extracting relevant trends from data with complex variable relations, such as omics data sets. Nevertheless, to our knowledge these properties have not yet been exploited to their full extent, and only a few studies have explored their potential. Below we discuss the most important ones.

Proximity

Proximity values are a measure of similarity between samples. A few studies used proximity values to detect outliers [27, 73, 74], resulting in an RFM before and after removal of outliers. The OOB prediction accuracy may improve after removing the outliers [74]. However, a comparison between the OOB errors of the first and the second model was not reported in all cases [73].
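
As an illustration of how proximities can flag outliers, the sketch below computes the proximity of two samples as the fraction of trees in which they share a terminal node, and a raw outlier score as the number of samples divided by the summed squared proximities to the other samples of the same class. This follows our reading of the description by Breiman and Cutler [41]; the subsequent per-class normalization is omitted, and the terminal-node assignments are hypothetical.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node reached by sample i in tree t.

    The proximity of samples i and j is the fraction of trees in which
    both samples end up in the same terminal node.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for leaves in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if leaves[i] == leaves[j]:
                    prox[i][j] += 1.0 / n_trees
    return prox

def raw_outlier_scores(prox, labels):
    """n_samples / sum of squared proximities to the other samples of the
    same class. Large values suggest outliers (normalization omitted)."""
    n = len(labels)
    scores = []
    for i in range(n):
        total = sum(prox[i][j] ** 2
                    for j in range(n)
                    if j != i and labels[j] == labels[i])
        scores.append(float("inf") if total == 0 else n / total)
    return scores

# Hypothetical terminal-node assignments: 4 trees, 4 samples of one class.
leaf_ids = [
    [1, 1, 1, 2],
    [3, 3, 3, 4],
    [5, 5, 6, 7],
    [8, 8, 8, 8],
]
labels = ["a", "a", "a", "a"]
prox = proximity_matrix(leaf_ids)
scores = raw_outlier_scores(prox, labels)
print(scores.index(max(scores)))  # 3: the last sample rarely shares a leaf
```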

In addition to outlier detection, studies listed in Table 1 used proximity scores in MDS plots [27, 67, 96] and for class discovery from RF clustering results [91]. Analogous to their role in clustering, proximity scores in supervised classification also have the potential to reveal subclasses of data samples and even to identify the corresponding prototypic variable values. However, we did not come across literature examples in which the RF proximity measure was used to identify subclasses or variable prototypes.

Local Importance

The global variable importance generated by RF captures the classification impact of variables on all samples. The local variable importance is an estimate of the importance of a variable for the classification of a single sample. Local importance may therefore reveal specific variable importance patterns within groups of samples that may not be evident from global importance values. In other words, variables that are important for a subset of samples from the same class could show a clear local importance signal, while this signal would be lost in the global measure. Nevertheless, only one study in the Life Sciences reported the use of local importances in data analysis (Table 1). In this study, the local importance measure was exploited to predict microRNAs (miRNAs) that are significantly associated with the modification of expression of specific mRNAs [76]. Local importance instead of global importance was used in a regression RF analysis because the authors assumed that only a subset of miRNAs would significantly contribute to the regression fit. Recently, we developed PhenoLink, a method that links phenotypes to omics data sets [43]. Local importances were applied for variable selection using two criteria: (i) a removal criterion, which removes variables that do not positively contribute to the classification, i.e. variables with a negative or neutral local importance for the majority of class samples; and (ii) a selection criterion, which retains variables with a positive local importance for at least a few samples (typically 3) or for a percentage of samples (at least 10%) of a class. Classification of a metabolomics data set consisting of 9303 headspace (gas-phase) GC-MS metabolomics-based measurements (variables) for 45 different bacterial samples resulted in a classification (OOB) error of 71% (results not shown). After removal of 8587 ‘garbage’ variables the classification error was reduced to 18%. This dramatic reduction of classification error arises because the ‘garbage’ variables make it more difficult for RF to recognize the informative variables. The positive selection criterion resulted in the same classification error but with an additional 210 variables removed, leaving a total of 506 variables relevant for separating the bacterial samples based on headspace metabolites. PhenoLink was used effectively to remove redundant or even confusing variables and to detect variables that were important for a subset of samples in a number of studies ranging from gene-trait matching and metabolomics-transcriptomics matching to the identification of biomarkers based on a variety of data sources [43]. Altogether, utilization of local importances is promising for many omics data sets and has the potential to uncover variables important for subsets of samples.
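
The two criteria can be sketched as simple filters on a matrix of local importances. The code below is an illustration only and not the PhenoLink implementation: how the criteria are evaluated per class and combined is an assumption (a variable is dropped when, in every class, a strict majority of samples has a local importance of zero or less; it is retained when at least `min_count` samples of some class have a positive local importance), and all names and values are hypothetical.

```python
def class_values(row, labels, cls):
    """Local importances of one variable for the samples of one class."""
    return [x for x, lab in zip(row, labels) if lab == cls]

def passes_removal(row, labels):
    """Assumed reading of criterion (i): a variable survives unless, in every
    class, a strict majority of samples has a local importance <= 0."""
    for cls in set(labels):
        vals = class_values(row, labels, cls)
        n_nonpos = sum(1 for x in vals if x <= 0)
        if 2 * n_nonpos <= len(vals):
            return True  # this class is not dominated by <= 0 values
    return False

def passes_selection(row, labels, min_count=3):
    """Assumed reading of criterion (ii): some class has at least min_count
    samples with a positive local importance."""
    for cls in set(labels):
        vals = class_values(row, labels, cls)
        if sum(1 for x in vals if x > 0) >= min_count:
            return True
    return False

# Hypothetical local importances: 3 variables x 8 samples (4 per class).
labels = ["A"] * 4 + ["B"] * 4
local_imp = [
    [0.2, 0.3, 0.1, 0.0, -0.1, 0.0, 0.0, 0.0],   # positive for 3 samples of A
    [0.2, 0.1, 0.0, -0.1, 0.0, 0.0, -0.2, 0.0],  # positive for 2 samples of A
    [0.0, -0.1, 0.0, 0.0, 0.0, 0.0, 0.1, -0.3],  # mostly neutral or negative
]
after_removal = [v for v, row in enumerate(local_imp)
                 if passes_removal(row, labels)]
after_selection = [v for v in after_removal
                   if passes_selection(local_imp[v], labels)]
print(after_removal)    # [0, 1]
print(after_selection)  # [0]
```

The variable counts reported above are internally consistent: 9303 − 8587 = 716 variables remain after the removal criterion, and 716 − 210 = 506 remain after the selection criterion.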

Conditional Relationships and Variable Interactions

For data sets generated in the Life Sciences, e.g. for metabolomics and proteomics measurements, gene expression data and GWAS studies, variables (e.g. SNPs in genetic association studies) are typically important for a subset of samples of the same class (e.g. patients) and conditional relations between variables might be important for a subset of samples. For example, certain SNPs or SNP combinations may be important for the first subgroup of patients and not important for the second subgroup.

Variable interactions have been reported to increase the global variable importance value [56]. The importance value itself however only provides the combined importance of the variable and all its interactions with other variables, but does not specify the actual variable interactions. Interactions between two variables can be inferred from a classification tree if a variable systematically makes a split on the other variable more likely or less likely than expected compared to variables without interactions. A recent paper reviewed the ability to identify SNP interactions by variations of logic regression, RF and Bayesian logistic regression [52]. For RF, an interaction importance measure was defined. However, the actual SNP interactions were not identified by the interaction importance, but rather by a relatively high variable importance measure. As Chen and colleagues discussed, the problem with their interaction importance measure was that two interacting SNPs need to be jointly selected in a tree branch relatively often. Furthermore, in the branches further down the tree the interaction of SNP A and B may have to be prominent in the presence of other variables in order to show a signal in the interaction importance [52].
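
The idea of counting how often two variables are jointly selected in a tree branch can be sketched as follows. Trees are represented here as nested tuples, which is a hypothetical encoding chosen for brevity and not the output format of any RF package. The counts alone do not indicate an interaction: they would still have to be compared with an expectation for non-interacting variables, for instance obtained from forests grown on permuted data.

```python
from collections import Counter

def ancestor_pairs(tree, ancestors=()):
    """Yield (ancestor_variable, descendant_variable) for every split node.

    A tree is either a leaf (any non-tuple value) or a tuple
    (variable, left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        return
    variable, left, right = tree
    for a in ancestors:
        yield (a, variable)
    for child in (left, right):
        yield from ancestor_pairs(child, ancestors + (variable,))

def count_pairs(forest):
    """Count, over all trees, how often variable A splits above variable B."""
    counts = Counter()
    for tree in forest:
        counts.update(ancestor_pairs(tree))
    return counts

# Hypothetical forest of three small trees.
forest = [
    ("snpA", ("snpB", "case", "control"), "control"),
    ("snpA", "control", ("snpB", "case", "control")),
    ("snpC", ("snpA", ("snpB", "case", "control"), "control"), "case"),
]
counts = count_pairs(forest)
print(counts[("snpA", "snpB")])  # 3
print(counts[("snpB", "snpA")])  # 0
```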

Interactions between variables will often go hand in hand with conditional dependencies between the variables, i.e. variable B contributes to classification given that variable A is present above B in the tree. Conditional relations between variables are implicitly taken into account by the conditional inference forest algorithm (cforest, implemented in the party package [104–106] in R). cforest is a variant of RF that has been designed for unbiased variable selection (discussed below) [107]. Like RF, cforest generates a variable importance measure. Variable importance measures are currently a subject of debate, and rankings produced using permutation importance may be preferred over Gini importance rankings when variables: (i) are correlated [105, 108–110], (ii) vary in their scale of measurement (e.g. continuous and categorical variables) [104, 110] and (iii) vary in their number of categories [104, 110]. These variable characteristics are common in Life Science data sets, e.g. for patient parameters (for instance a categorical variable such as the dichotomous variable ‘has dog’: yes, no; another discrete variable such as ‘number of children’: 0, 1, 2, 3, 4; and a continuous variable ‘IgG blood level’: 0–20 g/l) and gene expression (continuous) versus SNP data (categorical). In combination with subsampling instead of bootstrap sampling, the splitting criterion of cforest has been reported to be less biased than the RF criterion [105]. The algorithm to determine the conditional importance measure generated by cforest explicitly takes into account the conditional relationships. However, as in RF, conditional relationships are still implicit in the importance value output of cforest.
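
The principle behind permutation importance can be illustrated with a few lines of Python: the values of one variable are shuffled and the resulting drop in accuracy is recorded. This is a simplified sketch under stated assumptions. RF computes the measure per tree on the OOB samples, and the conditional importance of cforest permutes within groups defined by correlated variables; neither refinement is implemented here. The `predict` function is a hypothetical stand-in for a fitted model.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, column, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_repeats

def predict(row):
    """Hypothetical stand-in for a fitted model: only column 0 is used."""
    return int(row[0] > 0.5)

X = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
y = [0, 1, 0, 1, 0, 1]

print(permutation_importance(predict, X, y, column=1))  # 0.0, column 1 is never used
print(permutation_importance(predict, X, y, column=0))  # positive drop in accuracy
```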

Analysis of individual RFM tree structures might be a good strategy to investigate interactions between variables. If variable A precedes variable B significantly more often than expected for variables without interactions, B is likely conditionally dependent on A. Recently, in a GWAS study the genetic variants underlying age-related macular degeneration (AMD) were investigated [111]. The authors analysed tree structures and proposed an importance measure based on associations between a variable (SNP) and the response variable (trait), conditional on other variables (other SNPs). For a given SNP, the forest was searched for nodes where that SNP was used as a splitting variable. A conditional Chi-square statistic was calculated for each of those nodes using SNPs that preceded the SNP in the same tree. The maximal conditional Chi-square (MCC) importance was defined as the highest Chi-square value of all nodes where the SNP was used as a splitting variable. The MCC value thus quantifies the relationship between a phenotype and a SNP given its preceding SNPs in the RFM.
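
Because no public MCC implementation is available, the following is only a simplified sketch of the idea as described above, not the algorithm of [111]. It assumes that each node is given as the splitting SNP plus the (genotype, phenotype) pairs of the samples that reach the node; conditioning on the preceding SNPs is thus represented by the subset of samples at the node. A plain Pearson chi-square statistic is used, and the data structure and example values are hypothetical.

```python
from collections import Counter

def chi_square(pairs):
    """Pearson chi-square statistic for a list of (genotype, phenotype) pairs."""
    n = len(pairs)
    row_totals = Counter(g for g, _ in pairs)
    col_totals = Counter(p for _, p in pairs)
    observed = Counter(pairs)
    stat = 0.0
    for g, row_n in row_totals.items():
        for p, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(g, p)] - expected) ** 2 / expected
    return stat

def mcc_importance(nodes, snp):
    """Highest chi-square value over all nodes where `snp` is the splitting variable.

    Each node is (splitting_snp, samples_at_node); samples_at_node holds the
    (genotype at splitting_snp, phenotype) pairs of the samples that passed
    the preceding splits in that tree.
    """
    stats = [chi_square(samples) for name, samples in nodes if name == snp]
    return max(stats) if stats else 0.0

# Hypothetical nodes collected from a forest.
nodes = [
    ("snp1", [(0, "case"), (0, "case"), (1, "control"), (1, "control")]),
    ("snp1", [(0, "case"), (1, "case"), (0, "control"), (1, "control")]),
    ("snp2", [(0, "case"), (1, "control")]),
]
print(mcc_importance(nodes, "snp1"))  # 4.0
```

In the first node the genotype separates cases from controls perfectly (chi-square 4.0), whereas in the second node it carries no information (chi-square 0.0); the maximum over both nodes is reported.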

The interactions between alleles of patients or healthy people in these SNPs were shown in a tree-like graph. The effects of the conditional relationships between variables for all samples of a given class are directly visible in these graphs. Partial dependence plots [112] may reveal the same information as they show how the classification of a data set is altered as a function of a subset of variables (usually one or two) after accounting for the average effects of all other variables in the model. CARTscans [113] allow visualization of conditional dependencies on categorical variables. However, multidimensional partial dependence plots or CARTscans have to be manually inspected to derive concrete interactions between variables.
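
How a one-dimensional partial dependence curve is obtained can be sketched as follows: the variable of interest is set to a fixed value for all samples, the model predictions are averaged, and this is repeated for each value on a grid. The `predict_proba` function is a hypothetical stand-in for a fitted classifier and the data are invented for illustration.

```python
def partial_dependence(predict_proba, X, column, grid):
    """Average prediction over all samples when `column` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[column] = value
            total += predict_proba(modified)
        curve.append(total / len(X))
    return curve

def predict_proba(row):
    """Hypothetical model: the genotype in column 0 only matters when the
    measurement in column 1 exceeds 10 (an interaction)."""
    return 0.75 if (row[0] == 1 and row[1] > 10) else 0.25

X = [[0, 5], [1, 12], [0, 15], [1, 8]]
print(partial_dependence(predict_proba, X, column=0, grid=[0, 1]))  # [0.25, 0.5]
```

The averaged curve shows that genotype 1 raises the predicted probability, but it hides that the effect is restricted to samples with a high value in column 1, which is why such plots still need manual inspection, or a two-dimensional grid, to derive concrete interactions.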

The MCC importance can probably also be applied to other high-throughput data with numerous noisy and only a few important variables, as long as the node size is sufficient [111]. To date, however, no publicly available MCC implementation exists. Importantly, none of the above-described studies allows deriving a minimum set of variables and their interactions required to classify a given data set. Such a minimum set is essential in reducing the complexity of a biomarker and increasing its interpretability. In addition, it could very well be that variable interactions are relevant only for a subset of samples of the same class. Generating this potentially crucial information for a given data set would require supplementing, for instance, the MCC algorithm of Wang and co-workers with a clustering of samples based on, e.g. local variable importance or RF proximity scores, and subsequently selecting the variables and/or variable interactions that explain the classification of a given subset of samples of the same class. A publicly available and validated MCC implementation might therefore be promising for the discovery of variable interactions in proteomics, metabolomics, genomics and transcriptomics data using RF, especially if the implementation would also include the determination of variable interactions for subsets of samples and visualization tools that support interpretation of such complex relationships.

For inspiration, we provide a concept visualization of interacting variables, relevant for subsets of samples, different from the visualizations discussed earlier. The visualization might be a typical result from extensive omics data mining from the trees in an RFM (Figure 2). Linking the samples of the same subclass using evidence-based graphs, much like those from STRING [114], could furthermore allow the viewer to see and understand the (other) biological connection(s) between samples that are found to be linked by (interacting) variables identified in this data-driven approach.

Figure 2 Concept visualization of how relations between variables and samples could be represented following the dissection of the trees in a random forest

In this hypothetical case, a supervised classification was performed on samples from two classes (encircled ...

Conclusion

The RF algorithm has been widely used in the Life Sciences. It is suited for both regression and classification tasks, for example the prediction of disease state of patients (samples) using expression characteristics of genes (variables). However, RF has predominantly been used in a straightforward way as a classifier without preceding variable selection and parameter tuning, or as a variable filter prior to using other prediction algorithms. RF is an elegant and powerful algorithm allowing the extraction of additional relevant knowledge from omics data, such as conditional relations between variables and interactions between variables for subsets of samples. Exploiting local importances, proximity values and analysis of individual trees could prove to be a compass for unlocking this information from complex omics data.

Key points

  • RF is widely used in the Life Sciences because RF classification models are versatile, have a high prediction accuracy and provide additional information such as variable importances.
  • RF is often used as a black box, without parameter optimization, variable selection or exploitation of proximity values and local importances.
  • RF is a unique and valuable tool to analyse variable interactions and conditional relationships for data sets in which (combinations of) variables are important for subsets of samples, typically for omics data generated in the Life Sciences.

Supplementary Data

Supplementary data are available online at http://bib.oxfordjournals.org/.

Biographies

  • Wouter Touw is a master's student of Molecular Life Sciences at the Radboud University of Nijmegen, the Netherlands. He specializes in bioinformatics and structural biology.
  • Jumamurat Bayjanov is a postdoctoral researcher at the Radboud University Medical Centre, the Netherlands. He is involved in analyzing next-generation sequence data and developing machine-learning tools.
  • Lex Overmars is a PhD student at the Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre. His research focuses on the analysis of prokaryotic regulatory elements. 
  • Lennart Backus is developing phylogenomics techniques for sequence-based prediction of microbial interactions at the Radboud University Medical Centre in a PhD project funded by TI Food and Nutrition.
  • Jos Boekhorst is a bioinformatician at NIZO food research. He uses computational tools to unravel links between microbes, food, health and disease. 
  • Michiel Wels is group leader bioinformatics at NIZO food research and is involved in applying bioinformatics approaches to different food-related research questions. 
  • Sacha van Hijum is a senior scientist bioinformatics at NIZO food research and group leader of the bacterial genomics group, Centre for Molecular and Biomolecular Informatics at the Radboud University Medical Centre. Bioinformatics research at the bacterial genomics group focuses on establishing the relation between microbial consortia and health.

References

1. Ideker T, Galitski T, Hood L. A new approach to decoding life: systems biology. Annu Rev Genomics Hum Genet. 2001;2:343–72. [PubMed]
2. Kitano H. Systems biology: a brief overview. Science. 2002;295:1662–4. [PubMed]
3. Chuang H-Y, Hofree M, Ideker T. A decade of systems biology. Annu Rev Cell Dev Biol. 2010;26:721–44. [PMC free article] [PubMed]
4. Ghosh S, Matsuoka Y, Asai Y, et al. Software for systems biology: from tools to integrated platforms. Nat Rev Genet. 2011;12:821–32. [PubMed]
5. Gehlenborg N, O’Donoghue SI, Baliga NS, et al. Visualization of omics data for systems biology. Nat Methods. 2010;7:S56–68. [PubMed]
6. Larranaga P. Machine learning in bioinformatics. Brief Bioinform. 2006;7:86–112. [PubMed]
7. Verikas A, Gelzinis A, Bacauskiene M. Mining data with random forests: a survey and results of new tests. Pattern Recognit. 2011;44:330–49.
8. Breiman L. Random Forests. Mach Learn. 2001;45:5–32.
9. Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory - COLT ’92, 1992; pp. 144–52.
10. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
11. McCulloch WS, Pitts W. A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys. 1943;5:115–33. [PubMed]
12. Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958;65:386–408. [PubMed]
13. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature. 1986;323:533–36.
14. Friedman N, Geiger D, Goldszmidt M. Bayesian network classifiers. Mach Learn. 1997;29:131–63.
15. Minsky M. Steps toward artificial intelligence. Proc IRE. 1961;49:8–30.
16. Kleinbaum DG, Kupper LL, Chambless LE. Logistic regression analysis of epidemiologic data: theory and practice. Commun Stat Theory. 1982;11:485–547.
17. Fix E, Hodges JL. Discriminatory analysis-nonparametric discrimination: consistency properties. Int Stat Rev. 1989;57:238–47.
18. Fisher RA. The use of multiple measurements in taxonomic problems. Ann Hum Genet. 1936;7:179–88.
19. Friedman JH. Regularized discriminant analysis. J Am Stat Assoc. 1989;84:165–75.
20. Wold H. Soft modeling by latent variables: the nonlinear iterative partial least squares approach. In: Perspectives in Probability and Statistics, Papers in Honour of M. S. Bartlett. 1975.
21. Breiman L, Friedman JH, Olshen RA, et al. Classification and regression trees. The Wadsworth Statistics Probability Series. 1984;19:368.
22. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edn. New York: Springer-Verlag; 2009.
23. Tarca AL, Carey VJ, Chen X-wen, et al. Machine learning and its applications to biology. PLoS Comput Biol. 2007;3:e116. [PMC free article] [PubMed]
24. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008;9:319. [PMC free article] [PubMed]
25. Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:3. [PMC free article] [PubMed]
26. Jiang P, Wu H, Wang W, et al. MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Nucleic Acids Res. 2007;35:W339–44. [PMC free article] [PubMed]
27. Pang H, Lin A, Holford M, et al. Pathway analysis using random forests classification and regression. Bioinformatics. 2006;22:2028–36. [PubMed]
28. Bao L, Cui Y. Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information. Bioinformatics. 2005;21:2185–90. [PubMed]
29. Qi Y, Bar-Joseph Z, Klein-Seetharaman J. Evaluation of different biological data and computational classification methods for use in protein interaction. Bioinformatics. 2006;500:490–500. [PMC free article] [PubMed]
30. Bellman RE, Rand Corporation. Dynamic Programming. Princeton: Princeton University Press; 1957. p. 342.
31. Somorjai RL, Dolenko B, Baumgartner R. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics. 2003;19:1484–91. [PubMed]
32. Bureau A, Dupuis J, Falls K, et al. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol. 2005;28:171–82. [PubMed]
33. Sampson DL, Parker TJ, Upton Z, et al. A comparison of methods for classifying clinical samples based on proteomics data: a case study for statistical and machine learning approaches. PLoS One. 2011;6:e24973. [PMC free article] [PubMed]
34. Moore JH, Asselbergs FW, Williams SM. Bioinformatics challenges for genome-wide association studies. Bioinformatics. 2010;26:445–55. [PMC free article] [PubMed]
35. Arumugam M, Raes J, Pelletier E, et al. Enterotypes of the human gut microbiome. Nature. 2011;473:174–80. [PMC free article] [PubMed]
36. Fusaro VA, Mani DR, Mesirov JP, et al. Prediction of high-responding peptides for targeted protein assays by mass spectrometry. Nat Biotechnol. 2009;27:190–8. [PMC free article] [PubMed]
37. Nicholson JK, Connelly J, Lindon JC, et al. Metabonomics: a platform for studying drug toxicity and gene function. Nat Rev Drug Discov. 2002;1:153–61. [PubMed]
38. Goh C-S, Lan N, Douglas SM, et al. Mining the structural genomics pipeline: identification of protein properties that affect high-throughput experimental analysis. J Mol Biol. 2004;336:115–30. [PubMed]
39. Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99:323–9. [PMC free article] [PubMed]
40. Goldstein BA, Polley EC, Briggs FBS. Random forests for genetic association studies. Stat Appl Genet Mol Biol. 2011;10:1–34. [PMC free article] [PubMed]
41. Breiman L, Cutler A. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/
42. Stone M. Cross-validatory choice and assessment of statistical predictions. J Roy Stat Soc B Met. 1974;36:111–47.
43. Bayjanov JR, Molenaar D, Tzeneva V, Siezen RJ, van Hijum SAFT. PhenoLink – a web-tool for linking phenotype to ∼omics data for bacteria: application to gene-trait matching for Lactobacillus plantarum strains. BMC Genomics. 2012;13:170. [PMC free article] [PubMed]
44. Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2:18–22.
45. R Development Core Team. R. A Language and Environment for Statistical Computing. 2012.
46. Schwarz DF, König IR, Ziegler A. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics. 2010;26:1752–8. [PMC free article] [PubMed]
47. Zhang H, Wang M, Chen X. Willows: a memory efficient tree and forest construction package. BMC Bioinformatics. 2009;10:130. [PMC free article] [PubMed]
48. Frank E, Hall M, Trigg L, et al. Data mining in bioinformatics using Weka. Bioinformatics. 2004;20:2479–81. [PubMed]
49. Alvarez S, Diaz-Uriarte R, Osorio A, et al. A predictor based on the somatic genomic changes of the BRCA1/BRCA2 breast cancer tumors identifies the non-BRCA1/BRCA2 tumors with BRCA1 promoter hypermethylation. Clin Cancer Res. 2005;11:1146–53. [PubMed]
50. Briggs FBS, Bartlett SE, Goldstein BA, et al. Evidence for CRHR1 in multiple sclerosis using supervised machine learning and meta-analysis in 12,566 individuals. Hum Mol Genet. 2010;19:4286–95. [PMC free article] [PubMed]
51. Caporaso JG, Lauber CL, Costello EK, et al. Moving pictures of the human microbiome. Genome Biol. 2011;12:R50. [PMC free article] [PubMed]
52. Chen CCM, Schwender H, Keith J, et al. Methods for identifying SNP interactions: a review on variations of logic regression, random forest and Bayesian logistic regression. IEEE/ACM Trans Comput Biol Bioinf. 2011;8:1580–91. [PubMed]
53. Christensen BC, Houseman EA, Godleski JJ, et al. Epigenetic profiles distinguish pleural mesothelioma from normal pleura and predict lung asbestos burden and clinical outcome. Cancer Res. 2009;69:227–34. [PMC free article] [PubMed]
54. De Lobel L, Geurts P, Baele G, et al. A screening methodology based on Random Forests to improve the detection of gene-gene interactions. Eur J Hum Genet. 2010;18:1127–32. [PMC free article] [PubMed]
55. Dutilh BE, Jurgelenaite R, Szklarczyk R, et al. FACIL: fast and accurate genetic code inference and logo. Bioinformatics. 2011;27:1929–33. [PMC free article] [PubMed]
56. Lunetta KL, Hayward LB, Segal J, et al. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet. 2004;5:32. [PMC free article] [PubMed]
57. Ma D, Xiao J, Li Y, et al. Feature importance analysis in guide strand identification of microRNAs. Comput Biol Chem. 2011;35:131–6. [PubMed]
58. Meijerink M, van Hemert S, Taverne N, et al. Identification of genetic loci in Lactobacillus plantarum that modulate the immune response of dendritic cells using comparative genome hybridization. PLoS One. 2010;5:e10632. [PMC free article] [PubMed]
59. Rödelsperger C, Guo G, Kolanczyk M, et al. Integrative analysis of genomic, functional and protein interaction data predicts long-range enhancer-target gene interactions. Nucleic Acids Res. 2011;39:2492–502. [PMC free article] [PubMed]
60. Roshan U, Chikkagoudar S, Wei Z, et al. Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest. Nucleic Acids Res. 2011;39:e62. [PMC free article] [PubMed]
61. Tsou JA, Galler JS, Siegmund KD, et al. Identification of a panel of sensitive and specific DNA methylation markers for lung adenocarcinoma. Mol Cancer. 2007;6:70. [PMC free article] [PubMed]
62. van Hemert S, Meijerink M, Molenaar D, et al. Identification of Lactobacillus plantarum genes modulating the cytokine response of human peripheral blood mononuclear cells. BMC Microbiol. 2010;10:293. [PMC free article] [PubMed]
63. Vingerhoets J, Tambuyzer L, Azijn H, et al. Resistance profile of etravirine: combined analysis of baseline genotypic and phenotypic data from the randomized, controlled Phase III clinical studies. AIDS. 2010;24:503–14. [PubMed]
64. Enot DP, Beckmann M, Draper J. On the interpretation of high throughput MS based metabolomics fingerprints with random forest. Metabolomics. 2006:226–35.
65. Gupta S, Aires-de-Sousa J. Comparing the chemical spaces of metabolites and available chemicals: models of metabolite-likeness. Mol Divers. 2007;11:23–36. [PubMed]
66. Pino Del Carpio D, Basnet RK, De Vos RCH, et al. Comparative methods for association studies: a case study on metabolite variation in a Brassica rapa core collection. PLoS One. 2011;6:e19624. [PMC free article] [PubMed]
67. Finehout EJ, Franck Z, Choe LH, et al. Cerebrospinal fluid proteomic biomarkers for Alzheimer’s disease. Ann Neurol. 2007;61:120–9. [PubMed]
68. Hettick JM, Kashon ML, Slaven JE, et al. Discrimination of intact mycobacteria at the strain level: a combined MALDI-TOF MS and biostatistical analysis. Proteomics. 2006;6:6416–25. [PubMed]
69. Munro NP, Cairns DA, Clarke P, et al. Urinary biomarker profiling in transitional cell carcinoma. Int J Cancer. 2006;119:2642–50. [PubMed]
70. Gunther EC, Stone DJ, Gerwien RW, et al. Prediction of clinical drug efficacy by classification of drug-induced genomic expression profiles in vitro. Proc Natl Acad Sci USA. 2003;100:9608–13. [PMC free article] [PubMed]
71. Guo L, Ma Y, Ward R, et al. Constructing molecular classifiers for the accurate prognosis of lung adenocarcinoma. Clin Cancer Res. 2006;12:3344–54. [PubMed]
72. Nannapaneni P, Hertwig F, Depke M, et al. Defining the structure of the general stress regulon of Bacillus subtilis using targeted microarray analysis and Random Forest classification. Microbiology. 2012;158:696–707. [PubMed]
73. Riddick G, Song H, Ahn S, et al. Predicting in vitro drug sensitivity using Random Forests. Bioinformatics. 2011;27:220–4. [PMC free article] [PubMed]
74. Tsuji S, Midorikawa Y, Takahashi T, et al. Potential responders to FOLFOX therapy for colorectal cancer by Random Forests analysis. Br J Cancer. 2011:1–7. [PMC free article] [PubMed]
75. Wang X, Simon R. Microarray-based cancer prediction using single genes. BMC Bioinformatics. 2011;12:391. [PMC free article] [PubMed]
76. Wuchty S, Arjona D, Li A, et al. Prediction of associations between microRNAs and gene expression in glioma biology. PLoS One. 2011;6:e14681. [PMC free article] [PubMed]
77. Bordner AJ. Predicting protein-protein binding sites in membrane proteins. BMC Bioinformatics. 2009;10:312. [PMC free article] [PubMed]
78. Chen X-wen, Jeong JC. Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics. 2009;25:585–91. [PubMed]
79. Dybowski JN, Heider D, Hoffmann D. Prediction of co-receptor usage of HIV-1 from genotype. PLoS Comput Biol. 2010;6:e1000743. [PMC free article] [PubMed]
80. Han P, Zhang X, Norton RS, et al. Large-scale prediction of long disordered regions in proteins using random forests. BMC Bioinformatics. 2009;10:8. [PMC free article] [PubMed]
81. Heider D, Verheyen J, Hoffmann D. Predicting Bevirimat resistance of HIV-1 from genotype. BMC Bioinformatics. 2010;11:37. [PMC free article] [PubMed]
82. Hillenmeyer ME, Ericson E, Davis RW, et al. Systematic analysis of genome-wide fitness data in yeast reveals novel gene function and drug action. Genome Biol. 2010;11:R30. [PMC free article] [PubMed]
83. Li Y, Fang Y, Fang J. Predicting residue-residue contacts using random forest models. Bioinformatics. 2011:1–7. [PubMed]
84. Li Y, Wen Z, Xiao J, et al. Predicting disease-associated substitution of a single amino acid by analyzing residue interactions. BMC Bioinformatics. 2011;12:14. [PMC free article] [PubMed]
85. Lin N, Wu B, Jansen R, et al. Information assessment on predicting protein-protein interactions. BMC Bioinformatics. 2004;5:154. [PMC free article] [PubMed]
86. Marino SR, Lin S, Maiers M, et al. Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantation. Bone Marrow Transplant. 2012;47:217–26. [PMC free article] [PubMed]
87. Medema MH, Zhou M, van Hijum SAFT, et al. A predicted physicochemically distinct sub-proteome associated with the intracellular organelle of the anammox bacterium Kuenenia stuttgartiensis. BMC Genomics. 2010;11:299. [PMC free article] [PubMed]
88. Nayal M, Honig B. On the nature of cavities on protein surfaces: application to the identification of drug-binding sites. Proteins. 2006;63:892–906. [PubMed]
89. Nimrod G, Szilágyi A, Leslie C, et al. Identification of DNA-binding proteins using structural, electrostatic and evolutionary features. J Mol Biol. 2009;387:1040–53. [PMC free article] [PubMed]
90. Radivojac P, Vacic V, Haynes C, et al. Identification, analysis, and prediction of protein ubiquitination sites. Proteins. 2010;78:365–80. [PMC free article] [PubMed]
91. Shi T, Seligson D, Belldegrun AS, et al. Tumor classification by tissue microarray profiling: random forest clustering applied to renal cell carcinoma. Mod Pathol. 2005;18:547–57. [PubMed]
92. Slabbinck B, De Baets B, Dawyndt P, et al. Towards large-scale FAME-based bacterial species identification using machine learning techniques. Syst Appl Microbiol. 2009;32:163–76. [PubMed]
93. Springer C, Adalsteinsson H, Young MM, et al. PostDOCK: a structural, empirical approach to scoring protein ligand complexes. J Med Chem. 2005;48:6821–31. [PubMed]
94. Tognazzo S, Emanuela B, Rita FA, et al. Probabilistic classifiers and automated cancer registration: an exploratory application. J Biomed Inform. 2009;42:1–10. [PubMed]
95. Wang H, Lin C, Yang F, et al. Hedged predictions for traditional Chinese chronic gastritis diagnosis with confidence machine. Comput Biol Med. 2009;39:425–32. [PubMed]
96. Wiseman SM, Melck A, Masoudi H, et al. Molecular phenotyping of thyroid tumors identifies a marker panel for differentiated thyroid cancer diagnosis. Ann Surg Oncol. 2008;15:2811–26. [PubMed]
97. Zhang G, Li H, Fang B. Discriminating acidic and alkaline enzymes using a random forest model with secondary structure amino acid composition. Process Biochem. 2009;44:654–60.
98. Kim Y, Wojciechowski R, Sung H, et al. Evaluation of random forests performance for genome-wide association studies in the presence of interaction effects. BMC Proc. 2009;3:S64. [PMC free article] [PubMed]
99. Segal M. Machine learning benchmarks and random forest regression. Technical Report, Center for Bioinformatics & Molecular Biostatistics, University of California, San Francisco, 2004; pp. 1–14.
100. Boulesteix A-L, Bender A, Lorenzo Bermejo J, et al. Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations. Brief Bioinform. 2012;13:292–304. [PubMed]
101. Robnik-Sikonja M. Improving Random Forests. In: Boulicaut JF, et al., editors. Machine Learning: ECML 2004 Proceedings. Vol. 3201. Berlin: Springer; 2004. pp. 359–70.
102. Zhang H, Wang M. Search for the smallest random forest. Stat Interface. 2009;2:381. [PMC free article] [PubMed]
103. Gammerman A, Vovk V, Vapnik V. Learning by transduction. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, 1998; pp. 148–55.
104. Strobl C, Boulesteix A-L, Zeileis A, et al. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007;8:25. [PMC free article] [PubMed]
105. Strobl C, Boulesteix A-L, Kneib T, et al. Conditional variable importance for random forests. BMC Bioinformatics. 2008;9:307. [PMC free article] [PubMed]
106. Hothorn T, Bühlmann P, Dudoit S, et al. Survival ensembles. Biostatistics. 2006;7:355–73. [PubMed]
107. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15:651–74.
108. Nicodemus KK, Malley JD. Predictor correlation impacts machine learning algorithms: implications for genomic studies. Bioinformatics. 2009;25:1884–90. [PubMed]
109. Nicodemus KK, Malley JD, Strobl C, et al. The behaviour of random forest permutation-based variable importance measures under predictor correlation. BMC Bioinformatics. 2010;11:110. [PMC free article] [PubMed]
110. Nicodemus KK. Letter to the Editor: On the stability and ranking of predictors from random forest variable importance measures. Brief Bioinform. 2011;12:369–73. [PMC free article] [PubMed]
111. Wang M, Chen X, Zhang H. Maximal conditional chi-square importance in random forests. Bioinformatics. 2010;26:831–7. [PMC free article] [PubMed]
112. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29:1189–232.
113. Nason M, Emerson S, LeBlanc M. CARTscans: a tool for visualizing complex models. J Comput Graph Stat. 2004;13:807–25.
114. Szklarczyk D, Franceschini A, Kuhn M, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39:D561–8. [PMC free article] [PubMed]

VaDE: a manually curated database of reproducible associations between various traits and human genomic polymorphisms

Source: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383886/

Yoko Nagai, Yasuko Takahashi, and Tadashi Imanishi

Abstract

Genome-wide association studies (GWASs) have identified numerous single nucleotide polymorphisms (SNPs) associated with the development of common diseases. However, it is clear that genetic risk factors of common diseases are heterogeneous among human populations. Therefore, we developed a database of genomic polymorphisms that are reproducibly associated with disease susceptibilities, drug responses and other traits for each human population: ‘VarySysDB Disease Edition’ (VaDE; http://bmi-tokai.jp/VaDE/). SNP-trait association data were obtained from the National Human Genome Research Institute GWAS (NHGRI GWAS) catalog and RAvariome, and we added detailed information on sample populations by curating the original papers. In addition, we collected and curated original papers and registered the detailed SNP-trait association information in VaDE. We then evaluated the reproducibility of associations in each population by counting the number of studies that reported a significant association. VaDE provides literature-based SNP-trait association data and functional genomic region annotation for SNP functional research. The SNP functional annotation data include experimental data from the ENCODE project, H-InvDB transcripts and the 1000 Genomes Project. A user-friendly web interface was developed to assist quick search, easy download and fast swapping among viewers. We believe that our database will contribute to the future establishment of personalized medicine and increase our understanding of genetic factors underlying diseases.

Introduction

Genome-wide association studies (GWASs) have identified numerous single nucleotide polymorphisms (SNPs) that are associated with the development of multifactorial diseases, such as coronary artery disease, rheumatoid arthritis, type 2 diabetes mellitus and cancers (1). However, because GWASs rely on statistical evaluation, false positives cannot be completely eliminated and may contaminate the data. In addition, it is becoming clear that genetic risk factors of common diseases are not universal but are heterogeneous among human populations. For example, disease-associated SNPs have different effects and frequencies in different populations, such as European and East Asian populations (2). According to our population-based rheumatoid arthritis association database, RAvariome, 30 of 79 rheumatoid arthritis-associated SNPs are unique to East Asian populations (3).

Existing SNP-trait association databases, such as a catalog of published GWASs of the National Human Genome Research Institute (NHGRI GWAS catalog) (4), GWASdb (5), GWAS Central (6), HuGE Navigator (7), dbGaP (8) and PharmGKB (9), were developed by collecting data from published articles or are repository databases that set up an infrastructure for researchers. However, these databases do not have sufficient information on the human subjects, especially the ancestry of populations examined.

In this study, we introduce the VarySysDB Disease Edition (VaDE) database, which provides various trait-related, human genetic risk information based on an assessment of reproducibility of the association for each human population. The SNP-trait association data were collected from the NHGRI GWAS catalog, RAvariome and from manual curation of the literature. In addition to subject ethnic information from NHGRI GWAS catalog, we curated ancestral population and nationality of subjects from the original articles. Furthermore, functional information of genomic regions for each SNP was integrated to permit the identification of functional SNPs. Experimental data, such as ChIP-seq, DNase I hypersensitivity experiments, regulatory motifs, chromatin state segmentation, RefSeq genes, H-InvDB transcripts and linkage disequilibrium (LD) data in three major human populations were integrated and linked to the SNPs. We also installed a genome browser to visualize these data.

The VaDE Database

Collection of association data

Figure 1 illustrates the data flow in the VaDE database. There are two types of SNP-trait association data in VaDE. The first was collected from the NHGRI GWAS catalog up to 20 January 2014. The literature curation process of the NHGRI GWAS catalog is described in detail at http://www.genome.gov/gwastudies/, and we largely followed it. In addition to items from the NHGRI GWAS catalog, we added items such as subject ancestral population, subject nationality, number of studies in the article and number of significant studies reported in the article. NHGRI GWAS catalog data were carefully checked and corrected with reference to the original articles. The association results of the articles used a variety of genetic models; therefore, we integrated associations of allelic or additive models and excluded results of dominant and recessive models. As of October 2014 (VaDE version 1.3), 4169 pieces of data from the NHGRI GWAS catalog have been modified and 169 have been deleted from the 15 542 NHGRI GWAS entries.

Figure 1 Data flow in VaDE database

The data resources are shown at the top and bottom. VaDE database contents are shown in the middle.

The second type of data comes from comprehensive manual curation of selected diseases, which is continuously carried out by the VaDE team. The aim of the project is to collect comprehensive association data from non-European population studies, and to assess data quality and reproducibility mechanically to avoid curator bias. Extracted SNP-trait associations were not limited to those with P-values <1.0 × 10⁻⁵ because such a limit could exclude recent large-scale studies. As of October 2014 (VaDE version 1.3), 7530 pieces of association data for rheumatoid arthritis (based on our previous database, RAvariome) and hypertension were included in the VaDE database. Sixty-seven pieces of type 2 diabetes-associated data were provided through collaboration with the Tokyo Medical and Dental University (10). These comprise highly curated datasets, and with continued curation, the amount of data will continue to increase. After the NHGRI GWAS catalog data and the VaDE data were merged, 90 duplicated entries were excluded from the database.

Standardize ontology

The World Factbook of the CIA and the Composition of Regions of the United Nations Statistics Division were used to standardize the vocabulary of nations and populations. Subject population information was classified into the following 10 populations: European (including European American, European Australian, European New Zealander, West European, North European, South European and East European), African (including African American), East Asian, South-East Asian, South Asian, West Asian, Caribbean/Central American (such as Caribbean Hispanic, Latino, Caribbean, Hispanic), North American (such as Native American, American Indian, Native Alaskan, Pima Indian, Tohono O'odham Indian, Native Indian), South American and Oceanian.
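
As an illustration of this kind of vocabulary standardization, the following minimal Python sketch maps the sub-population labels listed above to the 10 population names. The labels are taken from the text; the dictionary layout and the function name standardize_population are assumptions made for illustration and are not VaDE's actual implementation.

```python
# Hypothetical lookup table: the labels come from the text above, but the
# data layout and function names are illustrative assumptions only.
TOP_LEVEL = [
    "European", "African", "East Asian", "South-East Asian", "South Asian",
    "West Asian", "Caribbean/Central American", "North American",
    "South American", "Oceanian",
]

SUB_LABELS = {
    "European": [
        "European American", "European Australian", "European New Zealander",
        "West European", "North European", "South European", "East European",
    ],
    "African": ["African American"],
    "Caribbean/Central American": [
        "Caribbean Hispanic", "Latino", "Caribbean", "Hispanic",
    ],
    "North American": [
        "Native American", "American Indian", "Native Alaskan",
        "Pima Indian", "Tohono O'odham Indian", "Native Indian",
    ],
}


def build_index():
    """Build a case-insensitive map from any known label to its population."""
    index = {name.lower(): name for name in TOP_LEVEL}
    for population, labels in SUB_LABELS.items():
        for label in labels:
            index[label.lower()] = population
    return index


_INDEX = build_index()


def standardize_population(label):
    """Return one of the 10 population names, or None for an unknown label."""
    return _INDEX.get(label.strip().lower())
```

Returning None for unknown labels, rather than guessing, leaves unmatched entries visible so that a curator can resolve them by hand.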

Traits and diseases were classified into the following five categories: disease, multiple diseases, medical trait, general trait and pharmacogenomics. Diseases were further classified based on the WHO International Classification of Diseases, 10th Revision by using the Unified Medical Language System. SNP location and nearest gene annotation are based on dbSNP137. A reported gene is the gene named in the original article. A related gene is a normalized form of the gene names reported across many articles, chosen so that each SNP has a single, unique gene representation.

Reproducibility assessment of SNP-trait association

To assess the reproducibility of a SNP-trait association, association data were excluded when a P-value was not reported, when the association was not significant or when the sample came from multiple populations. For each SNP, the total number of studies and the number of significant results among them were counted when the SNP was tested in other independent samples, such as in a replication study. For each combination of population, trait, SNP and risk allele, the total number of significant studies was counted across independent articles to confirm reproducibility in the population. SNP-trait associations whose reproducibility was confirmed by more than one independent study, within a single article or between articles, are summarized in the Reproduced Associations page.
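
The counting procedure described above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions: the Association record, its field names and the function reproduced_associations are hypothetical, and the significant flag is assumed to have been set during curation. It is not the code used to build VaDE.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Association:
    """One study result (hypothetical record layout, for illustration only)."""
    pubmed_id: str
    study_id: str                  # identifies an independent sample in the article
    populations: Tuple[str, ...]   # populations the sample was drawn from
    trait: str
    snp: str
    risk_allele: str
    p_value: Optional[float]       # None when no P-value was reported
    significant: bool              # assumed to be decided during curation


def reproduced_associations(records, min_studies=2):
    """Count significant independent studies per (population, trait, SNP, allele).

    Records are skipped when no P-value is reported, when the association is
    not significant, or when the sample spans multiple populations. Only
    combinations supported by at least min_studies independent studies are
    returned, mapped to their study counts.
    """
    studies = defaultdict(set)
    for r in records:
        if r.p_value is None or not r.significant or len(r.populations) != 1:
            continue
        key = (r.populations[0], r.trait, r.snp, r.risk_allele)
        # A set keeps each independent study from being counted twice.
        studies[key].add((r.pubmed_id, r.study_id))
    return {k: len(v) for k, v in studies.items() if len(v) >= min_studies}
```

Keying on the risk allele as well as the SNP means that two studies reporting opposite risk alleles for the same SNP are not treated as confirming each other.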

Finally, VaDE (version 1.3) provided 15 283 NHGRI GWAS catalog data and 7530 association data from manual curation. By an assessment of reproducibility of each combination of disease, population, variant (SNP, haplotype or HLA allele) and risk allele, 3140 reproducible associations were found within or between human populations. Table 1 shows the statistics of the total number of reproducible associations, unique variants and unique traits for each human population. Only 67 associations were reproducible between two different populations (Supplementary Table S1).

Table 1. Statistics of the population-based reproducibility assessment (VaDE version 1.3)

Population | Number of associations | Number of unique variants | Number of unique traits
European | 2446 | 2196 | 288
African | 67 | 56 | 19
East Asian | 557 | 456 | 102
South-East Asian | 10 | 10 | 5
South Asian | 26 | 13 | 3
West Asian | 4 | 2 | 2
Caribbean/Central American | 22 | 15 | 9
North American | 0 | 0 | 0
South American | 0 | 0 | 0
Oceanian | 0 | 0 | 0

Collection of functional genomic data

Experimental data, such as ChIP-seq, DNase I hypersensitivity experiments and regulatory motif data from the ENCODE project (11), chromatin state segmentation from ENCODE/Broad (12,13), chromatin state segmentation from the NIH Roadmap Epigenomics Mapping Consortium (14) and RefSeq gene annotation (15), were downloaded from HaploReg v2 (16). Annotation of H-InvDB transcripts to SNPs was based on VarySysDB (17,18). LD data between a SNP and the 3000 SNPs upstream and downstream of it, based on the 1000 Genomes Project (19), were provided by the Center for Statistical Genetics, University of Michigan. Detailed information on the method is available at http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-02-14.html. To visualize the location and relation between SNPs and functional genomic regions, we developed a genome browser based on the UCSC Genome Browser (20).

Web interface

The VaDE database comprises four pages: the Reproduced Associations page, the All Associations page, the SNP Functional Annotations page and the Genome Browser page (Figure 2).

Figure 2 Screen shot of VaDE contents

(A) Reproduced Associations page, (B) All Associations page, (C) SNP Functional Annotations page and (D) Genome browser.

On the Reproduced Associations page, a user can search association data by trait/disease name, gene name (gene symbol), SNP ID (dbSNP rs number) or HLA allele name and population name (Figure 2A). By filtering the number of populations, a user can search association data whose reproducibility was confirmed in more than one population. In the left section, a list of the number of studies according to human population is displayed for every record of combination of trait/disease, gene, variant and allele. In the right section, information of selected SNP-trait associations from the latest study is shown as a representative result for each population.

In the All Associations page, manually curated SNP-trait association data are provided (Figure 2B). In the search field, a user can search for the country where the samples were collected. In the left section, in addition to trait/disease, gene, SNP-allele and population, P-value of SNP-trait association data and OR/beta, the reported article's PubMed ID and number of studies in the article are provided. The number of studies in the article is not counted when the data is not significant, does not report a P-value or is a multiple population analysis.

The SNP Functional Annotations page and the Genome Browser page provide SNP-related functional genomic regions and LD information of European, African and East Asian populations (Figure 2C and D). The Genome Browser page provides the location of SNPs, the LD block of the SNPs, genes and functional genomic regions. The LD block shown in the Genome Browser is limited to SNPs in the VaDE All Associations page. The LD block is defined by the farthest SNPs, both upstream and downstream, that are linked to the focus SNP with r² > 0.8. Using the Genome Browser, a user can easily determine the location and distance between a focus SNP and functional genomic regions.
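
The LD block definition above can be expressed as a short Python sketch. The function name ld_block and the input layout (a list of position and r² pairs for SNPs near the focus SNP) are assumptions for illustration; only the r² > 0.8 threshold and the "farthest linked SNP on each side" rule come from the text.

```python
def ld_block(focus_pos, linked, threshold=0.8):
    """Return (start, end) of the LD block around a focus SNP.

    linked is an iterable of (position, r2) pairs for neighbouring SNPs.
    The block runs from the farthest upstream SNP to the farthest downstream
    SNP whose r2 with the focus SNP is strictly greater than the threshold.
    If no SNP passes the threshold on one side, the block ends at the focus
    SNP itself on that side.
    """
    in_ld = [pos for pos, r2 in linked if r2 > threshold]
    start = min(in_ld + [focus_pos])
    end = max(in_ld + [focus_pos])
    return start, end


# Made-up positions and r2 values, for illustration only.
example = [(900, 0.95), (700, 0.81), (600, 0.50), (1200, 0.90), (1500, 0.80)]
block = ld_block(1000, example)
```

Note that a SNP inside the returned interval may itself have r² below the threshold; under this definition only the two boundary SNPs determine the block.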

All four pages are linked by hyperlinks. In the Reproduced Associations page, hyperlinks in the trait/disease column, gene name column and SNP-allele column take the user to the All Associations page and run a search with the clicked term. For example, if a user clicks ‘rs887829-T’ in the SNP-allele column, the result of searching for ‘rs887829-T’ in the All Associations page is shown (Figure 2A and B). In the All Associations page, users can jump to the SNP Functional Annotations page by clicking the SNP rs number. Figure 2C shows the list of SNPs that are correlated with rs887829 (r² > 0.8 in European, African or Asian populations) and the functional genomic regions that overlap with the SNP location. Clicking the location column of rs887829 takes the user to the Genome Browser page (Figure 2D).

Future perspectives

GWASs are currently producing large amounts of data worldwide, and will continue to do so. GWASs report genomic polymorphisms that are associated with various human phenotypes in numerous scientific papers. Also, it is becoming clear that not only SNPs but also copy number variations and other structural variants are associated with human phenotypes. The VaDE development team will continue to collect these data, which will be released regularly. However, the VaDE development team alone cannot survey all GWAS papers published in all major journals. Thus, in the near future, we plan to develop a data submission system by which GWAS researchers can submit their own results to VaDE, which may facilitate feedback from the user community.

Currently, VaDE viewers provide users with several hyperlinks to major public databases, such as PubMed and dbSNP (21). In the future, we plan to offer further hyperlinks to external databases concerning human genomic polymorphisms using the Hyperlink Management System (22) that is an automated ID mapping system. This will enable VaDE users to obtain more data about genomic polymorphisms dispersed worldwide, and to analyze them in an integrated manner. Through the development of an integrated database of genomic polymorphisms, VaDE, we will continue to provide the research community with reliable association data to support the development of future preventive medicines and basic research on human phenotypes.

Supplementary Data

Supplementary Data are available at NAR Online. My Note: See Spreadsheet

Acknowledgements

We express our thanks to Drs Noriko Sato, Nay Chi Htun and Masaaki Muramatsu of the Tokyo Medical and Dental University for providing us with curated data of type 2 diabetes. We also thank Drs Takato Matsui and Junichi Takeda and members of the Support Center for Medical Research and Education, Tokai University for manual curation of the literature and Kensuke Numakura for valuable discussions and disease classification. Finally, we thank Nobuo Obi, Takuya Habara and Kentaro Mamiya for technical support for the database development.

Footnotes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors.

Funding

JSPS KAKENHI [258055, 268046]. Funding for open access charge: Tokai University School of Medicine.

Conflict of interest statement. None declared.

References

1. Visscher P.M., Brown M.A., McCarthy M.I., Yang J. Five years of GWAS discovery. Am. J. Hum. Genet. 2012;90:7–24. [PMC free article] [PubMed]
2. Okada Y., Wu D., Trynka G., Raj T., Terao C., Ikari K., Kochi Y., Ohmura K., Suzuki A., Yoshida S., et al. Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature. 2014;506:376–381. [PMC free article] [PubMed]
3. Nagai Y., Imanishi T. RAvariome: a genetic risk variants database for rheumatoid arthritis based on assessment of reproducibility between or within human populations. Database (Oxford) 2013;2013:bat073. [PMC free article] [PubMed]
4. Welter D., MacArthur J., Morales J., Burdett T., Hall P., Junkins H., Klemm A., Flicek P., Manolio T., Hindorff L., et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 2014;42:D1001–D1006. [PMC free article] [PubMed]
5. Li M.J., Wang P., Liu X., Lim E.L., Wang Z., Yeager M., Wong M.P., Sham P.C., Chanock S.J., Wang J. GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res. 2012;40:D1047–D1054. [PMC free article] [PubMed]
6. Beck T., Hastings R.K., Gollapudi S., Free R.C., Brookes A.J. GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies. Eur. J. Hum. Genet. 2014;22:949–952. [PMC free article] [PubMed]
7. Yu W., Gwinn M., Clyne M., Yesupriya A., Khoury M.J. A navigator for human genome epidemiology. Nat. Genet. 2008;40:124–125. [PubMed]
8. Tryka K.A., Hao L., Sturcke A., Jin Y., Wang Z.Y., Ziyabari L., Lee M., Popova N., Sharopova N., Kimura M., et al. NCBI's Database of Genotypes and Phenotypes: dbGaP. Nucleic Acids Res. 2014;42:D975–D979. [PMC free article] [PubMed]
9. McDonagh E.M., Whirl-Carrillo M., Garten Y., Altman R.B., Klein T.E. From pharmacogenomic knowledge acquisition to clinical applications: the PharmGKB as a clinical pharmacogenomic biomarker resource. Biomark. Med. 2011;5:795–806. [PMC free article] [PubMed]
10. Sato N., Htun N.C., Daimon M., Tamiya G., Kato T., Kubota I., Ueno Y., Yamashita H., Fukao A., Kayama T., et al. Likelihood ratio-based integrated personal risk assessment of type 2 diabetes. Endocrinol. J. 2014. EJ14-0271.
11. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. [PMC free article] [PubMed]
12. Ernst J., Kellis M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 2010;28:817–825. [PMC free article] [PubMed]
13. Ernst J., Kheradpour P., Mikkelsen T.S., Shoresh N., Ward L.D., Epstein C.B., Zhang X., Wang L., Issner R., Coyne M., et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–49. [PMC free article] [PubMed]
14. Chadwick L.H. The NIH Roadmap Epigenomics Program data resource. Epigenomics. 2012;4:317–324. [PMC free article] [PubMed]
15. Pruitt K.D., Brown G.R., Hiatt S.M., Thibaud-Nissen F., Astashyn A., Ermolaeva O., Farrell C.M., Hart J., Landrum M.J., McGarvey K.M., et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014;42:D756–D763. [PMC free article] [PubMed]
16. Ward L.D., Kellis M. HaploReg: a resource for exploring chromatin states, conservation, and regulatory motif alterations within sets of genetically linked variants. Nucleic Acids Res. 2012;40:D930–D934. [PMC free article] [PubMed]
17. Imanishi T., Itoh T., Suzuki Y., O'Donovan C., Fukuchi S., Koyanagi K.O., Barrero R.A., Tamura T., Yamaguchi-Kabata Y., Tanino M., et al. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2004;2:e162. [PMC free article] [PubMed]
18. Shimada M.K., Matsumoto R., Hayakawa Y., Sanbonmatsu R., Gough C., Yamaguchi-Kabata Y., Yamasaki C., Imanishi T., Gojobori T. VarySysDB: a human genetic polymorphism database based on all H-InvDB transcripts. Nucleic Acids Res. 2009;37:D810–D815. [PMC free article] [PubMed]
19. Abecasis G.R., Auton A., Brooks L.D., DePristo M.A., Durbin R.M., Handsaker R.E., Kang H.M., Marth G.T., McVean G.A. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. [PMC free article] [PubMed]
20. Karolchik D., Barber G.P., Casper J., Clawson H., Cline M.S., Diekhans M., Dreszer T.R., Fujita P.A., Guruvadoo L., Haeussler M., et al. The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 2014;42:D764–D770. [PMC free article] [PubMed]
21. Sherry S.T., Ward M.H., Kholodov M., Baker J., Phan L., Smigielski E.M., Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29:308–311. [PMC free article] [PubMed]
22. Imanishi T., Nakaoka H. Hyperlink Management System and ID Converter System: enabling maintenance-free hyperlinks among major biological databases. Nucleic Acids Res. 2009;37:W17–W22. [PMC free article] [PubMed]

What's VaDE?

Source: http://bmi-tokai.jp/VaDE/document/

Outline of VaDE

The VarySysDB Disease Edition (VaDE) is a database of human genome polymorphisms involved in traits such as disease susceptibilities or drug responses, collected from a number of academic papers.

Most of the data registered in VaDE are genome polymorphisms associated with diseases or drug responses. It also contains a number of genome polymorphisms associated with general traits such as height or weight. VaDE provides a wealth of information about these genome polymorphisms, such as odds ratio, β value, sample population and P-value. Furthermore, VaDE evaluates the reproducibility of associations across multiple independent studies.

By using VaDE, you can easily search for and obtain reliable information on genome polymorphisms associated with disease susceptibility. This information can be used in research on predicting disease risks, which may lead to applications in preventive medicine in the future. In addition, data registered in VaDE can be used in a wide range of fields such as drug discovery, forensic medicine and anthropology, so the role of this database will become increasingly important.

Detailed Information of Pages and Utilization

Please refer to the following VaDE manual.

VaDE Manual version 1.1 released. PDF and Word

December 15, 2014

VaDE
Database manual version 1.1

Biomedical Informatics Laboratory Department of Molecular Life Science
Division of Basic Medical Science and Molecular Medicine Tokai University School of Medicine 

1 What’s VaDE?

The VarySysDB Disease Edition (VaDE) is a database of human genome polymorphisms involved in traits such as disease susceptibilities or drug responses, collected from a number of academic papers.

Recently, many genome-wide association studies (GWASs) have been performed and have identified various disease-associated genomic polymorphisms. These data are valuable for medical research. However, using them has been difficult for general life scientists because the information is scattered across numerous academic articles. We therefore started a project in 2013 to construct a database of human genome polymorphisms involved in various traits. In principle, the information has been obtained from a large number of collected GWAS articles. The VaDE database was created by integrating these data with VarySysDB, a previously built database of functional information on human genome polymorphisms.

Most of the data registered in VaDE are genomic polymorphisms associated with diseases or drug responses. It also contains a number of genomic polymorphisms associated with general traits such as height or weight. VaDE provides a wealth of information about these genomic polymorphisms, such as odds ratios, β values, sample populations and P-values. Furthermore, VaDE evaluates the reproducibility of associations across multiple independent studies.

By using VaDE, you can easily search for and obtain reliable information on genomic polymorphisms associated with disease susceptibility. This information can be used in research on predicting disease risks, which may lead to applications in preventive medicine in the future. In addition, data registered in VaDE can be used in a wide range of fields such as drug discovery, forensic medicine and anthropology, so the role of this database will become increasingly important.
The VaDE database address is http://bmi-tokai.jp/VaDE/.

Figure 1 Flow of the VaDE database construction

2 Major pages in VaDE and their links

Figure 2 Links among the major pages in VaDE: (A) Reproduced Associations  page, (B) All Associations page, (C) SNP Functional Annotations page, (D) Genome Browser page

There are hyperlinks among all major pages of VaDE, so you can move to pages that provide more detailed SNP information in a step-wise manner. Each page provides the following information: (A) a list of reproducible SNP-trait associations in each population, (B) a list of detailed information of SNP-trait associations, (C) a list of SNPs in high linkage disequilibrium with the selected SNP and their functional information, and (D) Genome Browser.

3 Detailed information of pages and search system

3.1 Top page
Figure 3-1 Top page: (A) search window for reproduced associations in each trait and population, (B) search window for all SNP-trait associations in a country

[Page description] On the Top page (A), you can search by trait/disease name or population name and move to the Reproduced Associations page. On the Top page (B), you can search by country name on the world map and move to the All Associations page. Search phrases must be written in English (the same applies hereafter).

3.2 Reproduced Associations page
Figure 3-2 Reproduced Associations page

[Page description] The Reproduced Associations page provides information on reproduced trait/disease-associated SNPs reported in two or more studies with independent samples for each population. In the left section, a list of reproducible SNPs is displayed with trait/disease, reported gene, SNP-allele, population examined, and the number of significant studies (GWAS: P-value <1.0×10⁻⁵, replication study: P-value <0.05). When you select an item in the list, you move to the All Associations page with a search for that item (refer to the next page). In the right section, detailed information on the selected SNP-trait associations is shown, using the study with the largest number of cases as the representative result for each population. There are links to PubMed, dbSNP, SNPedia, and ICD-10. You can download all the data by clicking the CSV or TSV button.

[Search method] You can search association data by trait/disease name, gene name (gene symbol), SNP ID (dbSNP rs number), population name, and the condition of reproducibility (in one region or in multiple regions). Figure 3-2 shows the result of a search for Rheumatoid arthritis and East Asian.
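
The two significance thresholds quoted above (P-value <1.0×10⁻⁵ for a GWAS, P-value <0.05 for a replication study) can be sketched in Python as follows. The function names and the "gwas"/"replication" labels are assumptions for illustration; the manual does not describe how VaDE implements this check.

```python
GWAS_P_THRESHOLD = 1.0e-5         # from the manual: GWAS P-value < 1.0e-5
REPLICATION_P_THRESHOLD = 0.05    # from the manual: replication P-value < 0.05


def is_significant(p_value, study_type):
    """Apply the threshold that matches the study type.

    study_type is a hypothetical label, either "gwas" or "replication".
    A missing P-value (None) is never counted as significant.
    """
    if p_value is None:
        return False
    if study_type == "gwas":
        return p_value < GWAS_P_THRESHOLD
    if study_type == "replication":
        return p_value < REPLICATION_P_THRESHOLD
    raise ValueError(f"unknown study type: {study_type!r}")


def count_significant(studies):
    """Count significant studies in an iterable of (p_value, study_type) pairs."""
    return sum(1 for p, kind in studies if is_significant(p, kind))
```

For example, a P-value of 1.0×10⁻⁴ would count as significant for a replication study but not for a GWAS.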

3.3 All Associations page
Figure 3-3 All Associations page

[Page description] The All Associations page provides information on all trait/disease-associated SNPs registered in VaDE regardless of reproducibility. In the left section, a list of SNPs is displayed with trait/disease, reported gene, SNP-allele, population examined, P-value, odds ratio (OR)/beta-value, PubMed ID of the original article and the number of significant studies (GWAS: P-value <1.0×10⁻⁵, replication study: P-value <0.05). When you select a SNP-allele, you move to the SNP Functional Annotations page with a search for that SNP-allele (refer to the next page). In the right section, detailed information on each SNP-trait association is shown. There are links to PubMed, dbSNP, SNPedia, and ICD-10. You can download all the data by clicking the CSV or TSV button.

[Search method] You can search association data by trait/disease name, gene name (gene symbol), SNP ID (dbSNP rs number), population name, upper limit of P-value, lower limit of OR/beta-value, PubMed ID, and country name.
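
The same kind of filtering can be applied offline to a downloaded CSV file. The sketch below is only an illustration: the column names (p_value, or_beta, population and so on) and the sample rows are invented placeholders, not the actual layout or contents of a VaDE download, so the column names would need to be adapted to the real file header.

```python
import csv
import io

# Invented placeholder rows; the real VaDE CSV header and values will differ.
SAMPLE_CSV = """trait,snp_allele,population,p_value,or_beta,pubmed_id
Rheumatoid arthritis,rs0000001-A,East Asian,2.0e-9,1.45,00000001
Rheumatoid arthritis,rs0000002-G,European,3.0e-4,1.10,00000002
Hypertension,rs0000003-T,East Asian,,1.30,00000003
"""


def _to_float(text):
    """Parse a number, returning None for empty or malformed cells."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def filter_associations(rows, max_p=None, min_effect=None, population=None):
    """Keep rows that satisfy every filter that was supplied.

    max_p is an upper limit on the P-value, min_effect a lower limit on the
    OR/beta value, and population an exact population name. Rows with a
    missing value in a filtered column are dropped.
    """
    selected = []
    for row in rows:
        p = _to_float(row.get("p_value"))
        effect = _to_float(row.get("or_beta"))
        if max_p is not None and (p is None or p > max_p):
            continue
        if min_effect is not None and (effect is None or effect < min_effect):
            continue
        if population is not None and row.get("population") != population:
            continue
        selected.append(row)
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
hits = filter_associations(rows, max_p=1.0e-5, population="East Asian")
```

To run this against a real download, replace io.StringIO(SAMPLE_CSV) with an opened file handle and rename the columns to match the file.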

3.4 SNP Functional Annotations page
Figure 3-4 SNP Functional Annotations page

[Page description] The SNP Functional Annotations page provides information on the query SNP, SNPs in high linkage disequilibrium (LD) with the query SNP, and the functional genomic regions overlapping each SNP location. In the left section, a list of SNPs in LD (r² > 0.8) is displayed with their location, r² in each major population (European, Asian, and African), nearest gene, SNP position, and functional region (enhancer, promoter, DNase I, or motif). When you select the location of a target SNP, you move to the Genome Browser page with a search for that SNP and location (refer to the next page). In the right section, detailed information on each SNP is shown. You can download the list by clicking the CSV or TSV button.

[Search method] You can search these data by SNP ID (dbSNP rs number). You can also change the query SNP by selecting another SNP in LD.

3.5 Genome Browser page
Figure 3-5 Genome Browser page

[Page description] The Genome Browser page, which incorporates the UCSC Genome Browser, shows positional relationships on the genome centered on a query SNP. You can find information on all registered SNPs near the query SNP, LD blocks in each major population (European, Asian, and African), genes, and functional genomic regions.
 

Table 3-5-1 Chromatin in state segmentation by HMM from ENCODE/Broad

My Note: Need to Add Colors

Color | State | High frequency chromatin marker (frequency over 50%)
1 | Active Promoter | H3K4me2, H3K4me3, H3K27ac, H3K9ac
2 | Weak Promoter | H3K4me1, H3K4me2, H3K4me3
3 | Inactive/poised Promoter | H3K27me3, H3K4me2
4 | Strong enhancer | H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac
5 | Strong enhancer | H3K4me1, H3K4me2, H3K27ac
6 | Weak/poised enhancer | H3K4me1, H3K4me2
7 | Weak/poised enhancer | H3K4me1
8 | Insulator | CTCF
9 | Transcriptional transition | H3K36me3(low), H4K20me1(low), H3K4me1(low)
10 | Transcriptional elongation | H3K36me3(low)
11 | Weak transcribed | H3K36me3(very low), H4K20me1(very low)
12 | Polycomb-repressed | H3K27me3(low)
13 | Heterochromatin; low signal | (no signal)
14 | Repetitive/Copy Number Variation | (low freq. of all chromatin marks)
15 | Repetitive/Copy Number Variation | (high freq. of all chromatin marks)


Color coding of chromatin segmentation in the Genome Browser is shown in Table 3-5-1 and Table 3-5-2.

Table 3-5-2 Chromatin in Core Marks segmentation by HMM from Roadmap Project

My Note: Need to Add Colors

Color | State
1 | TSS_poised
2 | TSS_flanking_more_upstream
3 | TSS_active
4 | TSS_weak
5 | TSS_flanking_downstream
6 | TSS_flanking_more_downstream
7 | Transcription
8 | Transcription_weak
9 | Transcription_Enhancer-like
10 | Transcription_Enhancer-like_(short_genes)
11 | Enhancer_weak_1
12 | Enhancer_weak_2
13 | Enhancer_active
14 | Enhancer_active_with_weakK4me1_strong_K27ac
15 | Enhancer_poised
16 | Repressed_polycomb_weak
17 | Repressed_polycomb
18 | H3K9me3_K27me3
19 | Zinc_finger_genes_H3K36me3_K9me3
20 | Heterochromatin_at_repeats
21 | Heterochromatin
22 | Quiescent_1
23 | Quiescent_2
24 | Quiescent_3
25 | Quiescent_low_H3K9ac

4 Additional information

Further information on the VaDE database and its use is presented in the following paper.

Nagai Y, Takahashi Y, and Imanishi T (2014) VaDE: a manually-curated database of reproducible associations between various traits and human genomic polymorphisms. Nucleic Acids Research, Database Issue 2014; doi:10.1093/nar/gku1037.

Statistics of the VaDE database are also presented on the online Document page (http://bmi-tokai.jp/VaDE/document/).
If you have any questions, please contact us by email at the address below: vade@ml.tokai-u.jp

Copyright 2014 Tokai University School of Medicine


Database construction

DETAILED INFORMATION OF DATABASE CONSTRUCTION

Collection of SNP-trait association data

There are two kinds of SNP-trait association data in the VaDE database: 1. data collected from the catalog of published genome-wide association studies of the National Human Genome Research Institute (NHGRI GWAS catalog), http://www.genome.gov/gwastudies/, and 2. original VaDE data produced by comprehensive manual curation.

1. We manually curated the data from the NHGRI GWAS catalog to check and modify the data, add items, and change the format. 2. Comprehensive manual curation is currently underway by the VaDE team. Some data provided by collaborators have been imported through a confirmation process by the VaDE team. Details of each data set can be found in the RELEASE HISTORY below.

Because the association results in articles use a variety of genetic models, we integrated associations from allelic or additive models and excluded results from dominant and recessive models.

Standardize ontology of population, disease, and gene

In addition to association analysis results, VaDE provides information on the ancestral population, ethnicity, and nationality of the subjects in each study. The vocabulary for nations and populations was standardized by referring to the CIA World Factbook and the Composition of Regions of the United Nations Statistics Division.

Subject population information was classified into the following ten populations.

  • European (including European American, European Australian, European New Zealander, West European, North European, South European, and East European)
  • African (including African American)
  • East Asian
  • South-East Asian
  • South Asian
  • West Asian
  • Caribbean/Central American (including Latino, Hispanic, and Caribbean Hispanic)
  • North American (including Native American and Native Alaskan)
  • South American
  • Oceanian

Traits and diseases were classified into the following five categories: disease, multi diseases, medical trait, general trait, and pharmacogenomics (PGx). Diseases were further sub-categorized based on ICD-10 by automatic linking and manual curation using the Unified Medical Language System (UMLS).

SNP locations were based on dbSNP137. Reported genes were those reported in original articles.

Reproducibility assessment of SNP-trait association

To assess the reproducibility of SNP-trait associations, association data were excluded when the p-value was not significant (GWAS: p ≥ 1.0 × 10^-5; replication study: p ≥ 0.05) or not reported, or when the sample consisted of multiple populations. After excluding these data, we examined reproducibility.

We counted the total number of significant results for each stage when multiple studies were conducted with independent samples, as in a two-stage GWAS design. For each combination of population, trait, SNP, and risk allele, the total number of significant studies was counted across independent articles to confirm reproducibility. SNP-trait associations whose reproducibility was confirmed by more than one independent study, within or between articles, are summarized on the Reproduced Associations page.
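The assessment described above can be sketched in a few lines. This is an illustration under stated assumptions, not the VaDE team's code: the record fields (study_type, p_value, multi_population, and so on) are hypothetical, each record is assumed to represent one study or stage with an independent sample, and the example records are synthetic.

```python
from collections import defaultdict

GWAS_P_LIMIT = 1.0e-5        # significance threshold for GWAS results
REPLICATION_P_LIMIT = 0.05   # significance threshold for replication studies


def is_significant(rec):
    """Apply the exclusion rules: unreported p, multi-population sample, or p too large."""
    p = rec["p_value"]
    if p is None or rec["multi_population"]:
        return False
    limit = GWAS_P_LIMIT if rec["study_type"] == "GWAS" else REPLICATION_P_LIMIT
    return p < limit


def reproduced_associations(records):
    """Count significant studies per (population, trait, SNP, risk allele); keep counts > 1."""
    counts = defaultdict(int)
    for rec in records:
        if is_significant(rec):
            key = (rec["population"], rec["trait"], rec["snp"], rec["risk_allele"])
            counts[key] += 1
    return {key: n for key, n in counts.items() if n > 1}


def make(snp, study_type, p_value, multi_population=False):
    """Build a synthetic record for the example below."""
    return {"population": "East Asian", "trait": "Trait A", "snp": snp,
            "risk_allele": "A", "study_type": study_type,
            "p_value": p_value, "multi_population": multi_population}


example = [
    make("rs0000001", "GWAS", 1.0e-8),
    make("rs0000001", "replication", 0.01),
    make("rs0000002", "GWAS", 1.0e-4),                        # not significant
    make("rs0000003", "GWAS", 1.0e-9, multi_population=True),  # excluded sample
    make("rs0000004", "GWAS", 1.0e-9),                        # only one study
]
print(reproduced_associations(example))
```

Only rs0000001 survives in this example, because it is the only association with more than one significant independent study.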

Collection of functional genomic data

Experimental data such as ChIP-seq, DNase I hypersensitivity experiments, regulatory motifs data from ENCODE project, chromatin state segmentation from ENCODE/Broad, chromatin state segmentation from NIH Roadmap Epigenomics Mapping Consortium, and RefSeq gene annotation were downloaded from Haploreg v2. Annotation of H-InvDB transcript to SNP was based on VarySysDB. Linkage disequilibrium (LD) between SNP and upstream and downstream 3000 SNPs based on 1000 Genomes Project were provided by Center for Statistical Genetics, University of Michigan. Detailed information of method is described at http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G.2012-02-14.html. To visualize location and relation between SNPs and functional genomic region, we developed Genome Browser based on UCSC Genome Browser.

Release History

2015-6-16

  • Enhancements: Expanded the keyword of the target for a trait/disease search (e.g. cancer)
  • Bug fixes: Fixed some of the problems of data processing

2015-3-27 VaDE version 1.5

  • Data update: Added the association data of the NHGRI GWAS catalog updated data from 9th August 2014 until 15th December 2014
  • Data update: Modified a part of the existing association data
  • Data update: Updated for the trait/disease list
  • Bug fixes: Fixed some of the problems of the trait/disease list function
Items of data
Category | # of kinds | # of records | # of original articles
Disease | 421 | 15,057 | 1,084
Trait | 872 | 9,949 | 671
PGx | 83 | 752 | 85
All | 1,376 | 25,758 | 1,815
Information of data source
Data source | # of records | Description
NHGRI GWAS catalog | 18,231 | Association results with a p-value smaller than 1E-5 were extracted. The detailed curation method is provided on the web site of the NHGRI GWAS catalog. Sample population (ancestry), country and ethnicity were added by the VaDE team. Duplicated data were excluded after the NHGRI GWAS catalog and VaDE data were merged.
VaDE | 7,527 | 1) All significant and non-significant associations reported in the original articles were extracted. Currently, association data for hypertension and rheumatoid arthritis are included. 2) Association analysis data for type 2 diabetes in Japanese, based on Sato N et al. (PMID:25069673), provided by Tokyo Medical and Dental University.

2014-12-16

  • Enhancements: Add search function by country name on the Reproduced Associations page
  • Bug fixes: Fixed some of the problems of data processing

2014-11-20 VaDE version 1.4

  • Data update: Add association data of NHGRI GWAS catalog updated data from 21st January 2014 until 8th August 2014
  • Enhancements: Add detailed information of sample for each stage in the multi-stage research
  • Bug fixes: Fixed some of the problems of download function
Items of data
Category | # of kinds | # of records | # of original articles
Disease | 344 | 13,873 | 965
Trait | 455 | 8,118 | 648
PGx | 64 | 588 | 68
Others | 218 | 1,967 | 84
All | 1,081 | 24,546 | 1,722
Information of data source
Data source | # of records | Description
NHGRI GWAS catalog | 17,019 | Association results with a p-value smaller than 1E-5 were extracted. The detailed curation method is provided on the web site of the NHGRI GWAS catalog. Sample population (ancestry), country and ethnicity were added by the VaDE team. Duplicated data were excluded after the NHGRI GWAS catalog and VaDE data were merged.
VaDE | 7,527 | 1) All significant and non-significant associations reported in the original articles were extracted. Currently, association data for hypertension and rheumatoid arthritis are included. 2) Association analysis data for type 2 diabetes in Japanese, based on Sato N et al. (PMID:25069673), provided by Tokyo Medical and Dental University.

2014-08-21 VaDE version 1.3

  • Data update: Add association data of type2 diabetes
  • Data update: Count significant studies per literature
  • Enhancements: Altered reproducibility assessment based on significant studies
Information of data source
Data source | # of records | Description
NHGRI GWAS catalog | 15,283 | Association results with a p-value smaller than 1E-5 were extracted. The detailed curation method is provided on the web site of the NHGRI GWAS catalog. Sample population (ancestry), country and ethnicity were added by the VaDE team. Duplicated data were excluded after the NHGRI GWAS catalog and VaDE data were merged.
VaDE | 7,530 | 1) All significant and non-significant associations reported in the original articles were extracted. Currently, association data for hypertension and rheumatoid arthritis are included. 2) Association analysis data for type 2 diabetes in Japanese, based on Sato N et al. (PMID:25069673), provided by Tokyo Medical and Dental University.

2014-06-26 VaDE version 1.2

  • Released genome browser and supported Japanese language
  • Data update: Add association data of hypertension and rheumatoid arthritis that is manually curated by our team
  • Enhancements: Enhanced record click function to show detail information, search system, categorized trait search system, and inner-links
  • Bug fixes: Fixed outer-link URL
Information of data source
Data source | # of records | Description
NHGRI GWAS catalog | 15,448 | Association results with a p-value smaller than 1E-5 were extracted. The detailed curation method is provided on the web site of the NHGRI GWAS catalog. Sample population (ancestry), country and ethnicity were added by the VaDE team. Association data overlapping with VaDE's manually curated data were excluded.
VaDE | 7,464 | All significant and non-significant associations reported in the original articles were extracted. Currently, association data for hypertension and rheumatoid arthritis are included.

2014-06-06 VaDE version 1.1

  • Launched bmi-tokai.jp domain name

2014-04-24 VaDE version 1.0

  • Pre-released VaDE database

Site policy

Disclaimer

The information on VarySysDB Disease Edition (VaDE) is presented for the purpose of academic research. It does not provide professional medical advice, diagnosis, treatment, or care. We do not accept any liability for any injury, loss or damage incurred by use of the information provided on this website.

Every effort is taken to ensure that the information contained in this website is accurate and complete. However, accuracy cannot be guaranteed, because rapid advances in medicine may cause information contained here to become outdated, invalid or subject to debate.

Terms of Use

All of the pages in this database can be freely viewed at no charge. You are also free to establish links to all of the pages.

However, download and use of all data is limited to academic research and non-commercial use. For commercial use, please contact us via the CONTACT section below.

This database is licensed under Creative Commons Attribution-NonCommercial 2.1 Japan. A summary follows.

  • You are free to:

    Share - copy and redistribute the material in any medium or format
    Adapt - remix, transform, and build upon the material
  • The licensor cannot revoke these freedoms as long as you follow the license terms.

  • Under the following terms:

    Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
    NonCommercial - You may not use the material for commercial purposes.

For more details, please see this site.

https://creativecommons.org/licenses/by-nc/2.1/jp/deed.en

Copyright

The copyright of this database belongs to Biomedical Informatics Laboratory (Imanishi Laboratory), Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine.

Reference

If you publish your work as a research paper using the data in this database, please refer to the following paper.

Nagai Y, Takahashi Y, and Imanishi T (2014) VaDE: a manually-curated database of reproducible associations between various traits and human genomic polymorphisms. Nucleic Acids Research, Database Issue 2014;doi:10.1093/nar/gku1037.

About us

VaDE Development Team

Biomedical Informatics Laboratory (Imanishi Laboratory), Department of Molecular Life Science, Division of Basic Medical Science and Molecular Medicine, Tokai University School of Medicine

  • Tadashi Imanishi, Professor
  • Yoko Nagai, Ph.D.
  • Yasuko Takahashi, Ph.D.

Presentation

Contact

Comments, suggestions, ideas, and bug reports are welcome. Your feedback is highly appreciated and will help us improve our database in the future. Additionally, we are looking for research institutions and private companies that wish to support the development of this database. Please contact us by email at the address below.

vade[@]ml.tokai-u.jp

Acknowledgement

We express our greatest thanks to Drs. Noriko Sato, Nay Chi Htun, and Masaaki Muramatsu of Tokyo Medical and Dental University for providing us with curated data of type-2 diabetes. We thank Nobuo Obi, Takuya Habara, and Kentaro Mamiya for technical support of database development. We also thank Dr. Takato Matsui, Dr. Junichi Takeda and members of Support Center for Medical Research and Education, Tokai University for manual curation of literatures, and Kensuke Numakura for valuable discussion and disease classification.

Funding

This database project is partially supported by JSPS KAKENHI, Grant-in-Aid for Publication of Scientific Research Results (Grant Numbers 258055 and 268046) in 2013-2014.

Credit

Icons: Font Awesome Created by Dave Gandy.

IGERT PhD Program in Big Data and Data Science at the UW

Source: http://data.uw.edu/education/IGERT/index.html

The path to deep scientific discoveries is changing rapidly. Most disciplines, from physical to life sciences, have entered an era, where discovery is now no longer limited by the collection and processing of data, but by the management, analysis, and visualization of this information. From studying the building blocks of life, to understanding the nature of our universe, transformative breakthroughs are increasingly dependent on our ability to interrogate complex data streams from instruments distributed on a global scale.

To be successful, the next generation of scientists needs to be deep in both their own field as well as computer science and statistics. Similarly, the next generation of computer scientists and statisticians needs to understand deeply the real needs of domain scientists. Students can no longer develop tools and models in isolation, because the resulting “hammers” fail to meet the growing needs of the data-enabled sciences.

To address these challenges and to educate the next generation of scientists, the University of Washington offers PhD programs specialized in Big Data and Data Science in various departments as well as an integrative program that crosses department boundaries.

More information: My Note:  I explored these links quickly

Integrative Graduate Education and Research Traineeship (IGERT) in Big Data and Data Science

Participating Departments and Department-Specific Programs

IGERT Program Pre-requisites (visit this page to learn about possible background courses in preparation for the main IGERT courses).

IGERT Leadership

IGERT Students

Consent Form – Interviews (individuals)

Source: http://depts.washington.edu/dswkshp/nyu.pdf (PDF)

New York University
A private university in the public service
Courant Institute of Mathematical Sciences
Center for Data Science
726 Broadway, 7th Floor
New York, NY 10003
Phone: (212) 998-3367

You have been invited to take part in a research study about emerging and historical practices in data analysis. This study will be conducted by Laura Norén, Center for Data Science, Courant Institute of Mathematical Sciences, New York University as a part of her Postdoctoral research. Her faculty sponsor is Professor Jennifer Hill, Department of Humanities and Social Sciences in the Professions, Steinhardt School of Education, New York University.

If you agree to be in this study, you will be asked to do the following:

1. Participate in a semi-structured interview about your work experience, especially any data-driven projects you have worked on and data management practices you use.

2. Allow the researcher to take photos of your desk space and request screen shots of your computer screen. You may decline to allow photos or provide screen shots without any impact on yourself or your firm.

Your interviews will be audio-recorded. You may review these recordings and request that all or any portion of the recordings be destroyed. The researcher may request to take photos of your desk space or ask for screenshots of your computer during the interview. You may review these photos and screenshots and ask that all or any portion of these images be destroyed.

Participation in this interview will take 60 minutes.

There are no known risks associated with your participation in this research beyond those of everyday life. Although you will receive no direct benefits, this research may help the investigator understand emerging and historical practices in data analysis and the way organizational structure and data practices are shaped by power dynamics.

Confidentiality of your research records will be strictly maintained by assigning pseudonyms to you and your firm as well as by obscuring specific identifying details associated with you or your firm prior to publication. Interview audio recordings and transcripts will be stored on password-protected computers. No images containing faces or personally identifying details will be saved. Images will be stored on a password-protected computer.

Participation in this study is voluntary. You may refuse to participate or withdraw at any time without penalty. You have the right to skip or not answer any questions you prefer not to answer for any reason or for no reason.

If there is anything about the study or your participation that is unclear or that you do not understand, if you have questions or wish to report a research-related problem, you may contact Laura Norén at 646-302-2152, laura.noren@nyu.edu, 726 Broadway, 7th Floor #792, or the faculty sponsor, Jennifer Hill at 212-992-7677, jennifer.hill@nyu.edu, 246 Greene Street Floor 8W.

For questions about your rights as a research participant, you may contact the University Committee on Activities Involving Human Subjects, New York University, 665 Broadway, Suite 804, New York, New York, 10012, at ask.humansubjects@nyu.edu or (212) 998-4808.

You have received a copy of this consent document to keep.

Agreement to Participate in the interview with Laura Norén

Subject’s Name (Print)

Subject’s Signature

Date

White Paper Review Criteria Data Science Workshop, 2015

Source: http://depts.washington.edu/dswkshp/...ewCriteria.pdf (PDF)

Merit Review Criteria

The NSF sponsored Data Science Workshop seeks to invite graduate students to participate in team problem solving efforts around Grand Challenge topics from across NSF sponsored research and engineering domains. The invitees are selected based on the strength of their submitted white papers. In addition, the authors of the highest scoring white papers will be selected for talks during the workshop and the topics of their white papers will be used to seed the Grand Challenge problem set tackled by teams during the workshop.

The white papers should describe a significant problem, challenge or growth opportunity in a domain that is Data Science / Big Data driven or related. It is also possible to structure a white paper around a proposed new method or tool with cross-cutting interdisciplinary application potential focused on Big Data / Data Science techniques. The white papers should contain at least three sections (Background, Problem Statement, and Broader Impacts) and be no more than three pages long including figures and references.

Each white paper will be reviewed individually using two common NSF Merit Review Criteria: Intellectual Merit encompasses the potential to advance knowledge; and Broader Impacts encompasses the potential benefit to other areas of research and society at large.

Example considerations to be used in the review for both criteria:

1. What is the potential for a solution to the described challenge or for the proposed tool / method to:

a. Advance knowledge and understanding within its own field or across different fields; and

b. Benefit society or advance desired societal outcomes

2. To what extent is the described challenge a significant inhibitor to progress in research or engineering; Or alternately, how is the proposed tool / method novel and transformative?

3. Does the author present a solution sketch for the described challenge or is there a use case sketch for the proposed tool / method?

4. What is the relevance to the NSF funding areas?

Review Instructions

Reviewers need not weight Intellectual Merit and Broader Impacts criteria equally. Reviewers will assign a rating to each review criterion (Excellent, Very Good, Good, Fair or Poor) and provide constructive written comments that support the rating. Reviewers are evaluating individual white papers based on the above criteria and should avoid judgments that may be influenced by implicit biases. Panelists must be aware of how implicit biases might affect their evaluations, and make every effort to avoid biases, or the perception thereof, in the evaluation of the white paper and in their comments. Comments should be thorough and substantial, informative, non-inflammatory, non-discriminatory, helpful to the white paper author and should support the ratings assigned.

White Papers

Source: http://depts.washington.edu/dswkshp/white-papers/

Search all white papers here. My Note: This does not look useful, so put them in-line for Google Chrome Find

Three Newest White Papers My Note: Were there more than these three?

Genomic Data Science

PDF

Problems regarding the speed, cost and hardware that are required for analysing and sharing the big genomic data are among the major challenges in Genomic Data Science. On the other hand, the area of genomics provides Data Science with not only great challenges but also great promises.

Author
Seydanur Tikir
Georgia State University
Department of Biology

Background

In the 20th century, computational biologists were not concerned about lacking powerful computing clusters. The arrival of next-generation sequencing technologies a decade ago introduced a major challenge for genomic data science. The genome of a single person is stored as 100 gigabytes of data. Full sequence data have been reported and archived for thousands of humans as well as many thousands of species [1]. As the cost of sequencing decreases, the number of genomes is increasing exponentially [2].

In order to understand genomic variation and biological pathways extensively, it is necessary to analyse millions of genomes. On the other hand, the challenges we face in Genomic Data Science are increasing as genomic big data continues to grow exponentially. The storage and analysis of big data, as well as security and privacy issues, have been a big concern in Genomic Data Science. On the other hand, the area of genomics provides Data Science with not only great challenges but also great promises.

Problem Statement

A single next-generation sequencer can generate 40 Gb of data per day, which amounts to a huge volume of raw data across the thousands of sequencers worldwide. Storing and analyzing this data is even more challenging than generating it. In next-generation sequencing technologies, each sequenced base is assigned a quality score showing the chance that the base is correct. The size of these quality scores, which are generally captured as image files, is currently much greater than that of the raw data. The cost of storing such large data poses a major financial obstacle for small research groups. A key to solving this problem is improving the ways quality values are recorded and archived, which will increase the ease and speed of analyses. Considering that sequencing is much cheaper than storage, a migration to cloud computing in the genomics era is also a smart way to deal with this issue. [3]
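As a small illustration of what a per-base quality score encodes, the sketch below assumes the widely used Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10), and the common FASTQ encoding in which each score is stored as one ASCII character offset by 33. The white paper does not name a specific encoding, so both conventions are assumptions here, and the function names are illustrative.

```python
def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Convert a FASTQ-style quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in qual]


scores = decode_quality_string("II5!")
for q in scores:
    print(q, phred_to_error_prob(q))
```

Because every base carries its own score, the quality track is at least as long as the sequence itself, which is why compact ways of recording quality values matter for storage cost.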
 
When the challenges of Data Science are considered, the main issues raised are analyzing and sharing the gigantic data, as well as the associated problems of speed, cost and hardware. Having said that, I think that a bigger problem for Data Scientists is that we spend our time figuring out how to analyze the data faster, rather than thinking about putting the tools we have to efficient use. As a bioinformatician who deals with a huge amount of Next Generation Sequencing data, I am not able to compare my data with hundreds of other published data sets easily and regularly. Such an analysis would reveal marvelous discoveries in my research. The question here is “Can we routinely compare our data with the other published bulk?” When I ask myself that question, the word “routinely” lights up in my mind. We need to build new tools for automatic analysis and comparison of genomic data as new data are published. Such an automatic system would allow data scientists to spend their time evaluating already analyzed data rather than analyzing them. The invention of such tools will be a shift toward moving computation to the data rather than moving the data to the computation.
 
Another challenge in the world of Big Genomic Data is data sharing and privacy. Genomic data is generally integrated with data from different sources, especially in genome-wide association studies in medical research. The public databases that provide human genomic data may store personal information of patients. [4] Even where this information is not publicly available, it is possible to hack these databases. To overcome this issue, the suggested policies and procedures should be modified.
 
The mission of data science needs to be carried forward through combined developments in research, education and economics. Smart collaboration is required among data scientists and genomic researchers. There are different kinds of data scientists, such as those who ask great questions with a good view of the big picture, and those who analyze the data better with great computational skills. Collaboration between two data scientists, one from each of these groups, will be more productive than collaboration among a dozen data scientists from a single group. In this regard, collaboration between data scientists and wet-lab scientists is of further importance. I believe that organizing workshops and holding discussion groups to produce solutions is crucial to handling the challenges in Data Science.

Broader Impacts

The area of genomics provides Data Science with not only great challenges but also great promises. If we can achieve a considerable improvement on the issues such as speed, hardware and software problems in Big Genomic Data, we can analyze and compare the genomes much more efficiently. Such an improvement will advance our understanding of genomic variation and biological pathways extensively. The inventions in medicine will chiefly depend on our power to manage and analyze big genomic data.

References

[2] Davies K. The $1000 Genome. The Revolution in DNA Sequencing and the New Era of Personalized Medicine. Free Press, NY, USA, 2010.
[3] Wall DP, Kudtarkar P, Fusaro VA, Pivovarov R, Patil P, Tonellato PJ: Cloud computing for comparative genomics. BMC Bioinformatics 2010, 11:259.
[4] D. Greenbaum, A. Sboner, X. Mu, and M. Gerstein, “Genomics and privacy: Implications of the new reality of closed data for the field,” PLoS Computational Biology, vol. 7, no. 12, 2011.

Big Data: From correlation to causation

PDF

Data science and big data analytics can contribute to fruitful future scientific inquiries as they allow researchers to investigate new underexplored scientific questions that could not be studied before and to dive deeper into existing scientific questions in order to elicit better answers. In this short paper, I outline the opportunities that arise from leveraging existing econometric methods in combination with big data. In particular, I discuss how the four major features that characterize big data (also known as “4 V’s”) can greatly enhance standard econometric techniques and how big data analytics can vastly benefit from these techniques. More importantly, I outline the possibilities of leveraging existing econometric methods in combination with big data in order to make causal inferences from observational data. For data science and big data analytics to become more useful towards examining causal relations nowadays, I argue that we need to draw on the substantial knowledge base created in the economics and social science fields over the years in order to infer interesting causal effects as simply analyzing large amounts of data does not necessarily help us make better data-driven decisions.

Author
Vilma Todri
New York University
Department of Information, Operations and Management Sciences

Background

Big data analytics can contribute to fruitful future scientific inquiries as they allow researchers to investigate new underexplored scientific questions that could not be studied before and to dive deeper into existing scientific questions in order to elicit better answers. In this short paper, I outline the opportunities that arise from leveraging existing econometric methods in combination with big data. In particular, I discuss how the four major features that characterize big data (also known as “4 V’s”) can greatly enhance standard econometric techniques and how big data analytics can vastly benefit from these techniques. More importantly, I outline the possibilities of leveraging existing econometric methods in combination with big data in order to make causal inferences from observational data. For data science and big data analytics to become more useful towards examining causal relations nowadays, I argue that we need to draw on the substantial knowledge base created in the economics and social science fields over the years in order to infer interesting causal effects as simply analyzing large amounts of data does not necessarily help us make better data-driven decisions.

Problem Statement

Data science and big data analytics have the potential to contribute to different research paradigms and facilitate inter-disciplinary research thanks to the characteristics of variety (i.e., variety of data sources) and veracity (i.e., quality of the data) of information. Even though big data alone is usually not sufficient to generate causal inferences, it can be particularly useful for leveraging research designs and statistical techniques that allow data-driven analysis to move from correlations to causation. First, big data can greatly enhance our ability to account for previously unobserved confounders as well as to employ alternative controls and proxies. This is particularly important as consideration of confounding is fundamental to the design and analysis of studies of causal effects (Greenland et al. 1999). Hence, big data can alleviate concerns of potential endogeneity that undermine researchers’ efforts to extract causal relationships from non-experimental data. For instance, researchers can leverage unstructured data through big data and machine learning techniques that allow them to control for various factors that remained unobserved in previous research. Furthermore, the variety and veracity of big data can facilitate the identification of exogenous variations (Einav and Levin 2014) and the utilization of natural experiments in order to generate causal inferences from observational data. For instance, Ghose and Todri (2015) exploit the granular information on the viewability of advertising exposures (i.e., information on whether a user has viewed the display advertisement that was served to him/her) that enables the employed natural experiment research design.
The circumstances that affect the viewability of a display advertising impression serve as an exogenous shock to the firm’s targeting and simulate a natural experiment creating two groups of users: those who view the display ad and those who do not, while both groups are automatically targeted by the same marketing campaign fulfilling certain targeting criteria. Such natural experiment frameworks can avoid the self-selection and other treatment selection biases that are important concerns in analyzing non-experimental data for making causal inferences. Similarly, the access to rich information regarding the advertising exposures (i.e., timing of exposures, location of targeted consumer, etc.) allows the authors to use the exogenous shocks of hyper-local granular weather information in order to employ the instrumental variable panel data method and address concerns for time-varying unobserved confounders. In particular, employing a variety of sources, they use information on hyper-local weather for very granular time intervals (i.e., 20-minute time intervals) matched with individual-level advertising exposures as an instrument because weather data is correlated with the (potentially endogenous) explanatory variable of a viewable impression (i.e., browsing the Internet is an activity that competes with other outdoor activities the users might be enjoying) but is not otherwise correlated with the dependent variable under study.
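The instrumental variable logic described above can be illustrated with a minimal sketch on simulated data. This is not the panel-data method of the cited working paper: it is the simplest single-instrument estimator (a ratio of covariances), and the coefficients 0.8, 3.0 and the true effect 2.0, as well as all function names, are assumptions chosen only for illustration.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def ols_slope(x, y):
    """Naive regression slope of y on x; biased when x is endogenous."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Single-instrument IV estimate: cov(z, y) / cov(z, x)."""
    return covariance(z, y) / covariance(z, x)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulated data (hypothetical): z shifts x but reaches y only through x."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0, 1)  # instrument (e.g. a weather shock)
        ui = rng.gauss(0, 1)  # unobserved confounder
        xi = 0.8 * zi + ui + rng.gauss(0, 1)  # endogenous regressor
        yi = true_effect * xi + 3.0 * ui + rng.gauss(0, 1)  # outcome
        z.append(zi)
        x.append(xi)
        y.append(yi)
    return z, x, y


z, x, y = simulate()
ols_estimate = ols_slope(x, y)
iv_estimate = iv_slope(z, x, y)
print(f"OLS estimate: {ols_estimate:.3f}")
print(f"IV estimate:  {iv_estimate:.3f}")
```

Because the confounder raises both x and y in this simulation, the naive slope overstates the effect, while the instrument-based ratio should land close to the assumed true value.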
 
Apart from facilitating scientific inquiry through the variety and veracity of data, big data analytics methods can further contribute to fruitful future scientific inquiries through the other two unique aspects of big data, velocity and volume, which can lead to more precise and more accurate estimation of various effects. The volume of data refers to the amount of data and determines the potential of the data under consideration, while the velocity of big data refers to the speed of data generation. As far as the volume of big data in analytics is concerned, it can enable the identification of especially small effects, when they exist, resolving to a large extent the issue of statistical power. Furthermore, standard econometric methods can benefit from big data analytics due to the velocity and volume of the data. For instance, big data analytics enhance the quality of the various matching methods since they enable researchers to reliably estimate the treatment effect using the observed counterfactuals that exist in big data sets without the need to depend on model-based extrapolations due to lack of sufficient data. Similarly, instrumental variable methods can benefit from big data for the estimation of effects beyond the local average treatment effect of the intervention. In particular, the increased volume and velocity of observational data can be more informative about the causal effect of treatment thanks to the larger number of individuals whose treatment status has been manipulated with the exogenous shock.
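The link between data volume and the ability to detect small effects can be made concrete with a short sketch. It uses the standard normal-approximation sample size formula for a two-sided comparison of two group means, n per group = 2((z_{1-alpha/2} + z_{power}) / d)^2, where d is the standardized effect size. The function name and the default alpha and power values are assumptions for illustration, not part of the white paper.

```python
import math
from statistics import NormalDist


def required_n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    mean difference `effect_size` with a two-sided test (normal approximation)."""
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)  # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Smaller effects need disproportionately more observations.
for d in (0.5, 0.1, 0.01):
    print(f"effect size {d}: about {required_n_per_group(d)} observations per group")
```

The required sample grows with the inverse square of the effect size, which is why effects that are invisible in samples of a few thousand can become detectable at the scale of big data.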
 
Moreover, big data analytics can enable theorizing and hypothesis development at an unprecedented level of granularity (Varian 2014). Utilizing the granularity of big data, researchers can gain a deeper understanding of phenomena that have been previously studied at a more aggregate level. For instance, big data analytics in many cases could allow researchers to further deepen the exploratory analysis by integrating various relevant contextual factors in their analysis and, hence, have the potential to vastly facilitate hypothesis development as well as data-driven theorizing in future research. For instance, leveraging big data, Ghose and Todri (2015) find that hyper-local contextual information, such as weather conditions, affects the browsing activity of consumers. This finding could itself be a fruitful topic for future research. Besides, the granularity of advertising exposures revealed that on average 55% of the display ads are not rendered viewable. This is particularly important as previous academic research on display advertising assumed that an impression was always viewable to the user whenever loaded and might have led prior research to overestimate or underestimate the true effect of display advertising.
 
Finally, big data analytics allow researchers to investigate phenomena and study research questions that could not be explored before due to the unavailability of or the limited access to certain information. In particular, one could take advantage of specific institutional details and micro-level variation that would be difficult to isolate and exploit with more aggregate level data. For instance, researchers could tap into the granularity of big data and further develop theories that are more granular. Besides, as data science and big data analytics generate novel findings, they also have the potential to spur further academic interest, leading to novel scientific inquiry and hypothesis generation and creating a positive “feedback loop”. For instance, Ghose and Todri (2015) discovered that one of the effects of advertising is that customers engage in both active and passive information gathering processes. Based on this finding that was enabled by big data analytics, future research could investigate and identify the underlying mechanisms that are related to this effect and cause the corresponding user behavior.

Broader Impacts

This paper elaborates on the opportunities for fruitful collaboration between classical econometric techniques and big data in order to make causal inferences and estimations that are more accurate. While experimentation is the gold standard for establishing causal effects, it is not always feasible, ethical or cost-efficient and, therefore, researchers across disciplines have to rely on non-experimental data in order to extract knowledge and advance the theory of their respective scientific fields. Hence, the proposed methods of combining econometric techniques with big data in order to extract causal inferences from observational data are of critical importance when cause and effect are the central questions of the scientific research.

References

Einav, L., and Levin, J. 2014. "Economics in the Age of Big Data," Science (346:6210), p. 1243089.
 
Ghose, A., and Todri, V. 2015. "Towards a Digital Attribution Model: Measuring Display Advertising Effects on Online Search Behavior," Working Paper.
 
Greenland, S., Robins, J.M., and Pearl, J. 1999. "Confounding and Collapsibility in Causal Inference," Statistical Science, pp. 29-46.
 
Varian, H.R. 2014. "Big Data: New Tricks for Econometrics," The Journal of Economic Perspectives, pp. 3-27.

Shape mapping in genome-wide association studies

PDF and Word

Genome-wide association studies (GWAS) interrogate the whole genome at large scale to characterize the complex genetic architecture of biological traits. Currently, GWAS focus mainly on the associations between single-nucleotide polymorphisms (SNPs) and traits such as human diseases. We are particularly interested in the associations between SNPs and biological shapes such as leaf shapes. This is interdisciplinary research involving geometrical analysis, statistical shape analysis, and genetic analysis, and it faces some unprecedented challenges.

Author
Xiaotian Dai
Utah State University
Dept. of Math and Statistics

Background

Genome-wide association studies (GWAS) interrogate the whole genome at large scale to characterize the complex genetic architecture of biological traits. Currently, GWAS focus mainly on the associations between single-nucleotide polymorphisms (SNPs) and traits such as human diseases. We are particularly interested in the associations between SNPs and biological shapes such as leaf shapes. This is interdisciplinary research involving geometrical analysis, statistical shape analysis, and genetic analysis, and it faces some unprecedented challenges.

Problem Statement

In the scenario of big data, there are two major obstacles in shape genetic mapping studies.
 
Firstly, rapid progress in genotyping technology is generating large numbers of SNPs and inspiring new findings in biological studies. When the number of SNPs increases dramatically to millions but the sample size is still limited to thousands, traditional p-value-based statistical approaches suffer from loss of power. Feature screening has proved to be an effective and powerful approach to handling ultrahigh-dimensional data statistically, yet it has not received enough attention in GWAS. In addition to the large number of SNPs, the univariate measures used to rank features are mainly based on individual marginal effects, without considering the mutual interactions with other features. It is very computationally expensive to include every possible combination of interaction effects in a regression model as the number of predictors (SNPs) becomes very large, which can also easily cause traditional regression methods to lose power.
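As a rough illustration of what marginal feature screening looks like, the sketch below ranks simulated SNPs by the absolute Pearson correlation with a univariate trait and keeps the top few. The white paper does not say which screening statistic it has in mind, so the choice of Pearson correlation, the simulated genotypes, the causal SNP indices (3 and 17) and all function names are assumptions made only for this example.

```python
import random


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def screen_features(genotypes, trait, keep):
    """Rank SNP columns by |marginal correlation| with the trait; return top indices."""
    scores = [abs(pearson(column, trait)) for column in genotypes]
    ranked = sorted(range(len(genotypes)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


# Simulated data (hypothetical): 200 individuals, 500 SNPs coded as 0/1/2
# minor-allele counts, with only two SNPs truly affecting the trait.
rng = random.Random(1)
n_samples, n_snps = 200, 500
causal = {3: 1.5, 17: -1.5}
genotypes = [
    [sum(rng.random() < 0.3 for _ in range(2)) for _ in range(n_samples)]
    for _ in range(n_snps)
]
trait = [
    sum(beta * genotypes[j][i] for j, beta in causal.items()) + rng.gauss(0, 1)
    for i in range(n_samples)
]

selected = screen_features(genotypes, trait, keep=10)
print("SNPs kept after screening:", sorted(selected))
```

Each SNP is scored on its own here, which keeps the cost linear in the number of SNPs but also reproduces the limitation noted above: a SNP that matters only through its interaction with another SNP can be screened out.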
 
Secondly, it is impossible to represent biological shapes using a univariate response variable as in traditional regression models. Some geometric models can transform leaf shapes or other biological shapes into functional data based on their geometrical measurements, such as the radius-centroid-contour (RCC) model discussed in Fu et al. (2013). According to the RCC model, we can assign points on the boundary of a leaf or other organism with equal radial angle Theta, where Theta = 2PI/n and n is the number of points assigned. Then we can quantify the shape as a functional RCC curve defined as a radius function of radial angle Theta at the centroid. The whole process of shape extraction is illustrated in Figure 1. However, many of the traditional genetic mapping methods cannot handle high-dimensional responses. One of the most commonly used approaches in other research is principal component analysis (Ziezold, 1994). Nevertheless, it would be more innovative if the functional data representing shapes could be directly included in a regression model as responses.
Figure 1. The procedure of extracting shape information from a leaf image
The RCC curve (D), as a function of the radial angle, can uniquely represent a shape (A).
My Note: I could not capture this figure to insert it here. See original PDF
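Since the figure is not reproduced here, a small sketch may help show what an RCC curve is. It assumes the shape boundary is given as a simple polygon whose centroid sees the whole boundary (a star-shaped outline); it computes the area centroid and then the distance from the centroid to the boundary at n equally spaced radial angles. This is one reading of the description above, not the implementation of Fu et al. (2013), and the function names and the square test shape are assumptions for illustration.

```python
import math


def polygon_centroid(vertices):
    """Area centroid of a simple polygon given as a list of (x, y) vertices."""
    area2 = 0.0  # twice the signed area (shoelace formula)
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def radius_at_angle(vertices, centroid, theta):
    """Distance from the centroid to the boundary along direction theta."""
    cx, cy = centroid
    dx, dy = math.cos(theta), math.sin(theta)
    best = None
    for (px, py), (qx, qy) in zip(vertices, vertices[1:] + vertices[:1]):
        ex, ey = qx - px, qy - py  # edge direction
        denom = dx * ey - dy * ex
        if abs(denom) < 1e-12:  # ray is parallel to this edge
            continue
        wx, wy = px - cx, py - cy
        t = (wx * ey - wy * ex) / denom  # distance along the ray
        s = (wx * dy - wy * dx) / denom  # position along the edge, 0..1
        if t > 0 and -1e-12 <= s <= 1 + 1e-12:
            best = t if best is None else max(best, t)
    if best is None:
        raise ValueError("ray from centroid does not hit the boundary")
    return best


def rcc_curve(vertices, n):
    """Radii at the n equally spaced angles 2*pi*k/n, k = 0..n-1."""
    centroid = polygon_centroid(vertices)
    return [radius_at_angle(vertices, centroid, 2 * math.pi * k / n) for k in range(n)]


# Hypothetical test shape: a square of side 2 centered at the origin.
square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
curve = rcc_curve(square, 8)
print([round(r, 4) for r in curve])
```

The resulting list of n radii is the discretized functional response; with n in the hundreds, this is the high-dimensional response that standard genetic mapping models are not built to handle.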
 
The problems described above are not only obstacles in genetic studies but also challenges in practical data science problems. It would be very promising to develop a new machine learning algorithm that handles high-dimensional response and predictor spaces simultaneously.

Broader Impacts

As mentioned above, the major motivation of this white paper is to develop a machine learning algorithm to handle high-dimensional response and predictor space simultaneously. This algorithm can be applied to genetic mapping studies. Nevertheless, it has many other applications in a wide range of practical problems. In the scenario of big data, the existing statistical models face many unprecedented challenges due to the complexity of both the data collected and the problems we want to solve.

References

Fu, G., Bo, W., Pang, X., Wang, Z., Chen, L., Song, Y., Zhang, Z., Li, J. and Wu, R. (2013). Mapping shape quantitative trait loci using a radius-centroid-contour model. Heredity, 110, 511–519.
 
Ziezold, H. (1994). Mean figures and mean shapes applied to biological figure and shape distributions in the plane. Biometrical Journal, 36, 491–510. 

Workshop Overview

Source: http://depts.washington.edu/dswkshp/

The workshop has ended. Thank you all for your spectacular participation!

This NSF-sponsored 2.5-day workshop, held August 5th – 7th on the University of Washington, Seattle campus, will bring together 100 graduate students from diverse domain sciences and engineering with Data Scientists from industry and academia to discuss and collaborate on Big Data / Data Science challenges. In addition to keynote presentations from high profile speakers, the participants will present posters covering their own research and work collaboratively to begin to solve some of the Grand Challenge problems facing Data Enabled Science & Engineering disciplines. After the workshop, the output from the collaborative teams will be published in an open access environment. Through the shared work at the workshop and beyond, the participants will form lasting, collaborative relationships with their peers and the senior academic partners and industry participants including those from Amazon, Google and Microsoft.

Want to participate? Get engaged!

To get started, create a profile on this web site. Next, use your profile to submit a white paper in PDF format that describes a Big Data / Data Science challenge faced by your scientific or engineering discipline or an idea for a new tool or method addressing a Big Data / Data Science problem. Finally, review some of the current submissions, make comments and vote on your favorites.

The workshop Grand Challenge topics will be selected from the highest scoring white paper submissions. During the workshop, attendees will form teams to work on the Grand Challenges.

White paper submission guidelines

The white papers should be no more than three written pages in length including figures and references, should contain the following sections: Background, Problem Statement, and Broader Impacts, and should be formatted according to the supplied template.  Each white paper author will need to provide at least three keywords and three possible reviewers. The white papers will be reviewed according to NSF fellowship application standards.

The white paper submission deadline is June 22nd, 2015.  Invitees will be notified between July 1st and 8th, 2015.

What happens next?

The white papers are sent out for review.  When the reviews are returned, you will receive a copy and you will be notified if you have been invited to attend.  We will provide you with the registration information and further details.

If you are selected for attendance, you must bring a poster to present at one of the two poster presentation sessions.  The authors of the very highest scoring white papers will be invited to give lightning talks of a few slides during the plenary session to describe their challenges or methods.

Get Started

Questions?

Contact us: dswkshp@uw.edu

Conference Organizers

  • Chair David A. C. Beck, PhD (UW, Chemical Engineering)
  • Ginger Armbrust, PhD (UW, Oceanography)
  • Magdalena Balazinska, PhD (UW, Computer Science & Engineering)
  • Andrew Connolly, PhD (UW, Astronomy)
  • Jennifer Worrell:  Administration

Banner image credits

  • Jeffrey Heer, Interactive Data Lab (UW, Computer Science & Engineering)
  • Ariel Rokem, eScience Institute (UW)
  • Dave Beck, eScience Institute (UW, Chemical Engineering)
  • Want your name here?  Submit a banner image (1600×300) to dswkshp@uw.edu

Agenda

Source: http://depts.washington.edu/dswkshp/agenda/

A map of the various event locations is available here.
Profiles of the Mentors, Observers, Ethnographers, and Organizers are available here. My Note: See below

Day 1: Wednesday, August 5, 2015

  Begin | End | Event | Location
  5:00 | 7:00 | Icebreaker Social | WRF Data Science Studio, 6th floor, Physics & Astronomy Tower
 

Day 2: Thursday, August 6, 2015

  Begin | End | Event | Location
  8:00 | 8:30 | Breakfast | Hogness Auditorium Lobby (Health Sciences Lobby)
  8:30 | 8:45 | Intro | Health Sciences A420 – Hogness Auditorium
  8:45 | 9:45 | Plenary Session – Academia Keynote: Oren Etzioni, “Big Data has been very, very good to me” (My Note: See below) | Health Sciences A420 – Hogness Auditorium
  9:45 | 10:00 | break | Hogness Auditorium Lobby (Health Sciences Lobby)
  10:00 | 11:00 | Panel Discussion: Perspectives on Data Science from Academia & Industry | Health Sciences A420 – Hogness Auditorium
  11:00 | 12:00 | Lightning Talks | Health Sciences A420 – Hogness Auditorium
  12:00 | 12:30 | Lunch | Hogness Auditorium Lobby (Health Sciences Lobby)
  12:30 | 1:30 | Poster Session I (poster list) (My Note: See below) | Hogness Auditorium Lobby (Health Sciences Lobby)
  1:30 | 3:00 | X-Team Breakout Sessions (More info here) (My Note: See below) | Health Sciences Rooms
  3:00 | 3:15 | break | Hogness Auditorium Lobby (Health Sciences Lobby)
  3:15 | 5:00 | X-Team Breakout Sessions (More info here) (My Note: See below) | Health Sciences Rooms
  5:15 | 5:30 | Free time & walk to Vista Café | Vista Café
  5:30 | 9:00 | Dinner with Mentors, Observers, Ethnographers & Organizers | Vista Café
 

Day 3: Friday, August 7, 2015

  Begin | End | Event | Location
  8:00 | 8:30 | Breakfast | Hogness Auditorium Lobby (Health Sciences Lobby)
  8:30 | 8:45 | Intro | Health Sciences A420 – Hogness Auditorium
  8:45 | 9:45 | Plenary Session – Industry Keynote: Raghu Ramakrishnan (My Note: See below) | Health Sciences A420 – Hogness Auditorium
  9:45 | 10:00 | break | Hogness Auditorium Lobby (Health Sciences Lobby)
  10:00 | 12:00 | X-team Breakout Report Presentations (In X-team # order) | Health Sciences A420 – Hogness Auditorium
  12:00 | 12:30 | Lunch | Hogness Auditorium Lobby (Health Sciences Lobby)
  12:30 | 1:30 | Poster Session II (poster list) (My Note: See below) | Hogness Auditorium Lobby (Health Sciences Lobby)
  1:30 | 3:00 | Y-team Breakout Sessions (More info here) (My Note: See below) | Health Sciences Rooms
  3:00 | 3:15 | break | Hogness Auditorium Lobby (Health Sciences Lobby)
  3:15 | 5:15 | Y-team Breakout Report Presentations (In Y-team # order) | Health Sciences A420 – Hogness Auditorium
  5:15 | 5:30 | Closing Remarks | Health Sciences A420 – Hogness Auditorium
  5:30 | 6:00 | Free time & walk to WAC | Waterfront Activities Center (WAC)
  6:00 | 8:00 | Appetizer Social | Waterfront Activities Center

 

Mentors, Observers, Ethnographers & Organizers

Source: http://depts.washington.edu/dswkshp/...rs-organizers/

 

Anthony Arendt

 

Polar Science Center
University of Washington

Mentor

 
Anthony Arendt is a Senior Research Scientist with the Polar Science Center at the Applied Physics Laboratory, where he conducts research on the response of glaciers and ice sheets to changing climate. Anthony joined the eScience Institute in June 2015 and provides expertise in relational databases, geospatial data analytics, and development of lightweight cloud computing solutions for scientific research. Anthony is researching new ways to integrate satellite observations with large scale hydrological models. He is developing APIs to interface with these models and provide web-accessible decision support tools to environmental stakeholders. Anthony was previously a Research Professor at the University of Alaska Fairbanks, where he conducted field studies of Alaska glaciers and taught classes in Geographic Information Systems. He has worked at NASA’s Goddard Space Flight Center and was a Visiting Researcher at Microsoft Research.

Ginger Armbrust

 

Oceanography
University of Washington

Organizer

 
E. Virginia Armbrust is a Professor in the School of Oceanography at the University of Washington and Director of the School of Oceanography. She received her A.B. from Stanford University in 1980 and her PhD from Massachusetts Institute of Technology and Woods Hole Oceanographic Institution in 1990. She carried out postdoctoral research training at Washington University before joining the faculty at the University of Washington in 1996. Dr. Armbrust’s research focuses on marine phytoplankton, particularly marine diatoms, which are responsible for about 20% of global photosynthesis. She has pioneered the use of environmental genomics and transcriptomics, combined with metabolomics, to understand how natural diatom communities are shaped by the environment and by their interactions with other microbes. Most recently, she has identified chemical signals that form the basis of cross-kingdom communication. Her group developed ship-board instrumentation that now permits the fine-scale continuous mapping of distributions, growth rates and loss rates of different groups of phytoplankton. Armbrust is a Fellow of the American Academy of Microbiology and the American Association for the Advancement of Science, and a member of the Washington State Academy of Science.

Magda Balazinska

 

Computer Science & Engineering
University of Washington

Organizer

 
Magdalena Balazinska is an Associate Professor in the department of Computer Science and Engineering at the University of Washington and the Jean Loup Baer Professor of Computer Science and Engineering. She’s the director of the IGERT PhD Program in Big Data and Data Science. She’s also a Senior Data Science Fellow of the University of Washington eScience Institute. Magdalena’s research interests are in the field of database management systems. Her current research focuses on big data management, scientific data management, and cloud computing. Magdalena holds a Ph.D. from the Massachusetts Institute of Technology (2006). She is a Microsoft Research New Faculty Fellow (2007), received an NSF CAREER Award (2009), a 10-year most influential paper award (2010), an HP Labs Research Innovation Award (2009 and 2010), a Rogel Faculty Support Award (2006), a Microsoft Research Graduate Fellowship (2003-2005), and multiple best-paper awards.

Chaitanya Baru

 

Computer and Information Science and Engineering
National Science Foundation

Observer

 
Chaitan Baru is Senior Advisor for Data Science in the Computer and Information Science and Engineering Directorate at the National Science Foundation. He also co-chairs the federal inter-agency Big Data Senior Steering Group, consisting of 18 federal R&D agencies, which was formed to help coordinate Big Data R&D activities across agencies, as part of the White House Big Data R&D Initiative launched in 2012. Baru leads the BIGDATA research program at NSF and is also involved in the Big Data Regional Innovation Hubs, a new NSF initiative. He is on assignment from the San Diego Supercomputer Center, UC San Diego, where he is a Distinguished Scientist and Associate Director of Data Initiatives. He is interested in applied and applications-oriented research in scientific data management, big data, and database systems. Information about his research activities is available at http://acid.sdsc.edu/users/chaitan-baru.

Dave Beck

 

Chemical Engineering
University of Washington

Organizer

 
David Beck is an Assistant Research Professor in the department of Chemical Engineering at the University of Washington (UW). He is a participating faculty in the IGERT PhD Program in Big Data & Data Science and a Senior Data Science Fellow of the eScience Institute at UW. Dr. Beck holds a Ph.D. from UW (2006) in Biomedical Structure & Design / Medicinal Chemistry and a BS in Computer Science from Drexel University. His research interests center on using computing & Data Science methods in systems, synthetic and structural biology for problems in biogeochemical cycling, wastewater + environment, the microbiome of built environments, and human health.

Rahul Biswas

 

Astronomy
University of Washington

Mentor

 
My main scientific interests are in the area of cosmology, concentrated around quantifying the phenomenology of the late-time accelerated expansion of the universe, often described in terms of properties of “dark energy,” the substance assumed to drive this acceleration. Mostly, I use observations of Type Ia supernovae from large astronomical surveys for studying cosmology, or work on large scale structure studies that are most relevant for clusters of galaxies. Two of the key challenges for the cosmological studies of this type are: (a) the data tend to be large and low signal-to-noise, (b) the incomplete knowledge (obviously rich research frontiers in themselves) of astrophysical objects used to draw cosmological inferences. To address such issues, my research involves methods of drawing statistical inferences from such datasets, improving the models of astrophysical objects used in such inferences by using simulations or data, and accounting for the observational system. At the University of Washington, I am closely associated with the upcoming LSST project, where a current focus is building OSS tools for the survey. I am also working on analyses to forecast the performance of components of LSST for particular survey strategies as a step towards optimizing the survey strategy for scientific output.

Alvin Cheung

 

Computer Science & Engineering
University of Washington

Mentor

 
I am an assistant professor in the Department of Computer Science & Engineering at the University of Washington, affiliated with the database and programming systems research groups. I am also an affiliate member of the eScience Institute. My research interests include program analysis, improving database application performance, and building big systems in general. Some current research themes include: query processing across heterogeneous systems, helping end users build database applications, and improving application performance across heterogeneous architectures. Before that, I was a graduate student in the MIT database group and the computer-aided programming group, working with Professors Sam Madden and Armando Solar-Lezama. I worked on tools that make use of programming language techniques to improve application performance.

Andy Connolly

 

Astronomy
University of Washington

Organizer

 
Professor Andy Connolly studies cosmology and the formation of structure within our universe using large astronomical surveys such as the Sloan Digital Sky Survey and the Large Synoptic Survey Telescope (LSST). He currently works on helping develop algorithms and software for analyzing the petabytes of imaging data that will come out of the LSST. He also runs the LSST simulation group which generates high-fidelity simulations of the universe that are used to scale the scientific algorithms that will be used in the LSST. Beyond his scientific research he is interested in using technology to increase access to scientific data and to improve the educational experiences of students.

Oren Etzioni

 


Allen Institute for Artificial Intelligence

Keynote Speaker

 
Dr. Oren Etzioni is Chief Executive Officer of the Allen Institute for Artificial Intelligence. He has been a Professor at the University of Washington’s Computer Science department since 1991, receiving several awards including Seattle’s Geek of the Year (2013), the Robert Engelmore Memorial Award (2007), the IJCAI Distinguished Paper Award (2005), AAAI Fellow (2003), and a National Young Investigator Award (1993). He was also the founder or co-founder of several companies including Farecast (sold to Microsoft in 2008) and Decide (sold to eBay in 2013), and the author of over 100 technical papers that have garnered over 23,000 citations. The goal of Oren’s research is to solve fundamental problems in AI, particularly the automatic learning of knowledge from text. Oren received his Ph.D. from Carnegie Mellon University in 1991, and his B.A. from Harvard in 1986.

Rob Fatland

 

Microsoft Research
Microsoft

Mentor

 
Dr. Rob Fatland works as a senior research program manager at Microsoft Research, specifically on the use of both cloud and traditional technology in service to environmental science. His work currently includes adaptation of cloud + Maker/IOT technology to marine science as well as data sharing to enhance global carbon cycle research. He is actively working with the data science community to articulate best practices in ‘long tail’ lightweight data management. Dr. Fatland received a B.S. in Physics from the California Institute of Technology in 1987 and a Ph.D. in Geophysics from the University of Alaska Fairbanks in 1998. He worked for 6 years at NASA-JPL in radar remote sensing. More recent work at Vexcel Corporation and Microsoft has included 4+D data visualization, deploying sensor networks in harsh environments and geospatial information challenges from remote sensing. In his spare time he engages in freelance STEAM outreach.

Brittany Fiore-Gartland

 

Human-Centered Design and Engineering
University of Washington

Ethnographer

 
Brittany Fiore-Gartland is a Postdoctoral Fellow in the eScience Institute and the Department of Human-Centered Design and Engineering. Her research is concerned with the social and organizational dimensions of the data-intensive transformations occurring across many sectors of work. This research agenda includes studying how communities make sense of and value data and what is organizationally required to support emerging data intensive practices and collaborations. She is part of a team of researchers at UW, NYU, and UC Berkeley conducting ethnographic research on the Data Science Environment funded by the Gordon & Betty Moore and Alfred P. Sloan Foundations. She leads a team of researchers with Cecilia Aragon in the Human-Centered Data Science Lab to understand the cultural changes that are reshaping how data science work is accomplished and the implications for institutions supporting this work. As part of her ethnographic practice, she works with communities to bridge communication gaps and develop value-informed and adaptive organizational practice. She holds a Ph.D. in Communication from the University of Washington and an M.A. in Anthropology from Columbia University.

Will Gagne-Maynard

 

Oceanography
University of Washington

Mentor

 
Will works with the River Systems Research group in the School of Oceanography at UW. His research focuses on applying various -omics techniques to tropical river systems and helping to integrate results into basin-wide models.

Jeff Gardner

 


Google

Mentor

 
Jeff Gardner has a BA in Physics from Cornell University and PhD in Astronomy from UW. For his thesis, he simulated large chunks of the Universe on massively parallel supercomputers in order to study the formation of structure and the evolution of galaxies. He then moved out of pure astrophysics into computational science, working as a Research Scientist for the Pittsburgh Supercomputing Center at Carnegie Mellon. He came back to UW in 2008 as a Senior Research Scientist in Physics with an Affiliate Research Assistant Professor appointment in Astronomy. Jeff’s research focused on computational and data science as applied to the physical sciences. He worked with a number of groups in Physics, Astronomy, and Computer Science. In 2010 he moved to the Office of Research to become Assistant Director for the then-nascent eScience Institute. He is now a software engineer at Google where he works on cloud computing products that can impact scientific research.

Andrew Gartland

 

Vaccine and Infectious Disease Division
Fred Hutchinson Cancer Research Center

Mentor

 
Dr. Andrew Fiore-Gartland earned his Ph.D. studying the neural mechanisms that underlie visual processing in the retina. In his research he used computational models of molecular and cellular processes to analyze and interpret the data generated by his experiments. Upon completing his degree he sought a field in which he could be more connected to the application and social impact of his work. As a Post-Doctoral fellow in Peter Gilbert’s group he shifted his focus to the immune system and has been applying his statistical and computational skills towards understanding the mechanisms of novel HIV, TB, influenza and dengue vaccines, including extensive analysis of the immune data generated by the RV144 HIV vaccine trial. He has developed statistical methods for detecting evidence of T-cell induced viral escape in the genetic sequences of “breakthrough” viruses in vaccine trials. Now as a Staff Scientist he continues his work at the HIV Vaccine Trials Network adding capacity to incorporate novel assays into cohesive statistical analyses. He is also interested in the determinants of T-cell immunodominance and in understanding the heterogeneity of the immune response to vaccination with the specific goal of identifying innate correlates of adaptive immunogenicity and immune correlates of protection. Andrew also recently joined the eScience Institute at the University of Washington where he is focusing on the development of open-source software for computational biology research and on the training of data scientists at the University.

Joseph Hellerstein

 

eScience Institute
University of Washington

Mentor

 
Joseph L. Hellerstein is Senior Data Science Fellow in the eScience Institute and Affiliate Professor of Computer Science, both at the University of Washington, Seattle, Washington. Previously, Dr. Hellerstein managed the Computational Discovery Department at Google (2008-2014), was a Principal Architect at Microsoft Corp. in Redmond, WA (2006 to 2008), and founded/directed the Adaptive Systems Department at the IBM Thomas J. Watson Research Center in Hawthorne, NY (1984 to 2006). Dr. Hellerstein received the PhD in computer science from the University of California at Los Angeles. He has published approximately 200 peer-reviewed papers, 30 patents, and two books. He has taught at Columbia University and the University of Washington, and has served on numerous program committees and government advisory panels. Dr. Hellerstein is a recipient of the 2007 IEEE/IFIP Stokesberry Award, and is a Fellow of the IEEE.

Tim Hesterberg

 


Google

Mentor

 
Dr. Tim Hesterberg is a Senior Quantitative Analyst at Google. He previously worked at Insightful (S-PLUS), Franklin & Marshall College, and Pacific Gas & Electric Co. He received his Ph.D. in Statistics from Stanford University, under Brad Efron. He is the author of the “Resample” package for R; Chihara and Hesterberg, “Mathematical Statistics with Resampling and R” (2011); and “What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Statistics Curriculum”, arXiv, 2014. http://arxiv.org/abs/1411.5279

Laura Norén

 

Center for Data Science
New York University

Ethnographer

 
Laura Norén is a Moore/Sloan Postdoctoral Associate at the Center for Data Science at New York University where she also holds adjunct professorships in the Stern School of Business and the Department of Media, Culture and Communication. Norén’s research focuses on collaboration and sociotechnical systems within organizations. She is currently studying data scientists and data analysts across a range of workplaces. Her dissertation addressed creative collaboration among graphic designers, architects, electric vehicle engineers, and fine dining kitchen staff. She has also published on collaboration within communities across a range of technological/infrastructural settings including food bloggers and taxi drivers.

Earnestine Psalmonds

 

Division of Graduate Education
National Science Foundation

Observer

 
Earnestine Psalmonds Easter is a program director in the Division of Graduate Education, Directorate for Education and Human Resources, National Science Foundation. Her current responsibilities include serving as program director for the EHR Workforce Development Core Research, Historically Black Colleges and Universities Undergraduate Programs broadening participation research track, and co-coordinator for the Research Experiences for Undergraduates (REU) program. She is the lead program officer for the new directorate-wide STEM Professional Workforce Development Core area. She represents the NSF on the Office of Science and Technology Policy interagency working group on broadening participation. As senior program officer and visiting scholar in the Policy and Global Affairs Division, National Academies, she served as study director for two National Academies reports, including a congressionally mandated study focused on the underrepresentation of minorities in science and engineering. She has held numerous administrative positions in higher education and served on boards of directors for state and national organizations. She received her baccalaureate and master’s degrees from Tuskegee University and her Ph.D. from Georgia State University.

Raghu Ramakrishnan

 


Microsoft

Keynote Speaker

 
Dr. Raghu Ramakrishnan is a researcher in the areas of database and information management. He is a Technical Fellow at Microsoft. He has been a Vice President and Research Fellow for Yahoo! Inc., and a Professor of Computer Sciences at the University of Wisconsin–Madison. Ramakrishnan received a bachelor’s degree from IIT Madras in 1983, and a Ph.D. from the University of Texas at Austin in 1987. He has been selected as a Fellow of the ACM and a Packard fellow, and has done pioneering research in the areas of deductive databases, data mining, exploratory data analysis, data privacy, and web-scale data integration. The focus of his work in 2007 was community-based information management. With Johannes Gehrke, he authored the popular textbook Database Management Systems, also known as the “Cow Book”.

Renata Rawlings-Goss

 

AAAS S&T Policy Fellow
National Science Foundation

Observer

 
Dr. Renata Rawlings-Goss is a Big Data AAAS science policy fellow at the National Science Foundation working on Big Data policies and priority goals. She sits on the NITRD inter-agency Big Data Senior Steering group with partners from NSF, NIH, DARPA, DOE, NIST, NOAA, NASA, USGS, DHS, DOD, and other agencies. Additionally, Renata participates in the implementation of NSF priority goals for increased activity and workforce in data science. She interacts with industry partners and the White House Office of Science and Technology Policy in the formation of public-private partnerships around big data, data science and the “Internet of Things”. Dr. Rawlings-Goss is a biophysicist who completed her doctoral work at the University of Michigan-Ann Arbor. She then worked with the Center for Computational Medicine, where she developed new predictive statistics for patient-monitored diabetes. Subsequently, she became a Penn-Port fellow in the department of genetics at the University of Pennsylvania, where her research interests included data-driven analysis of genetic/expression variation among worldwide populations for diseases such as cancer.

Darwin Schweitzer

 

Algorithms and Data Science Group
Microsoft

Mentor

 
Darwin is a Senior Program Manager at Microsoft focused on Data Science education, where he is part of the Algorithms and Data Science group in Information Management and Machine Learning. His data experience has been gained through a number of diverse roles at Microsoft as well as at other technology companies like IBM and Business Objects. Roles have included instructor, practitioner, data architect, technical lead, consultant, and teaching assistant, and he has worked for companies in a variety of industries (technology, healthcare, financial services, insurance, pharmaceuticals, travel, education, non-profit, and utilities) including local PacWest organizations like the UW, WaMu, Expedia, and SnoPUD. The one commonality in his career has been Data and Education. Darwin is an aspiring Data Scientist and dedicated lifelong learner who contributes to continuing education as a Cloud Data Management & Analytics instructor at the University of Washington and as a teaching assistant at Henry M. Jackson High School in Mill Creek, WA, where he helps students learn Java and prepare for the AP Computer Science exam (http://tealsk12.org). In his spare time he likes to travel, hike, read (technology or books about US Presidents), listen to Blues and Jazz, and enjoy an occasional round of golf. Darwin hopes to help drive Data Science education, increase the broad adoption of Data Science products and services, and build the Data Science community.

Valentina Staneva

 

eScience Institute
University of Washington

Mentor

 
Valentina Staneva started as a data scientist at the eScience Institute in March 2015. Prior to joining the University of Washington, she was a PhD student in the Applied Mathematics & Statistics Department at Johns Hopkins University. Her research was with the Center for Imaging Science and was devoted to developing methods for tracking deforming objects in videos and statistical estimation of their dynamics. Valentina has a bachelor's degree in Mathematics from Concord University, and between her undergraduate and graduate studies she spent 1.5 years working at Los Alamos National Laboratory on problems in imaging, optimization and compressed sensing. She has broad interests in extracting information from different types of data, and building tools for it.

Alejandro Suarez

 

AAAS S&T Policy Fellow
National Science Foundation

Observer

 
Alejandro is a computational physicist with a background in surface science and the chemical modification of graphene. After obtaining his B.S. in applied physics from Rensselaer Polytechnic Institute (RPI), he obtained a Bunton-Waller fellowship to attend Penn State University (PSU) for his graduate studies. While at PSU, Alejandro studied the electronic properties of graphene, a two-dimensional allotrope of carbon with novel physical and mechanical properties. By better understanding how graphene could be chemically modified, Alejandro helped advance research for incorporating graphene into modern electronic devices. While at PSU, Alejandro was also awarded an NSF GK-12 fellowship. As part of the fellowship, Alejandro worked closely with a sixth grade educator in Harrisburg, PA to enhance the scientific curriculum of the classroom and develop skills in communicating science to diverse audiences. After obtaining his Ph.D., Alejandro received an NRC postdoctoral associateship with the U.S. Naval Research Laboratory to study the effect of substrates on the chemical reactivity of graphene. Alejandro is currently a AAAS S&T fellow in the Computer and Information Sciences and Engineering (CISE) directorate at NSF, where he plans to study the use and future impact of data science and software development in scientific research.

Anissa Tanweer

 

Communication
University of Washington

Ethnographer

 
Anissa Tanweer is a PhD student in the Department of Communication at the University of Washington and a research assistant in the Human-Centered Data Science Lab. She is broadly interested in social dimensions of information and communication technologies, and her work focuses on the role of big data and data science in the production of knowledge and the formation of policy. She works with a team of researchers at UW led by Professor Cecilia Aragon that are conducting ethnographic research on the Data Science Environment funded by the Gordon & Betty Moore and Alfred P. Sloan Foundations.

Kristin Tolle

 

Microsoft Research
Microsoft

Mentor

 
Kristin M. Tolle is the Director of the Data Science Initiative in Microsoft Research Outreach, Redmond, WA. Since joining Microsoft in 2000, Dr. Tolle has acquired numerous patents and worked for several product teams including the Natural Language Group, Visual Studio, and the Microsoft Office Excel Team. Since joining Microsoft Research’s outreach program in 2006, she has run several major initiatives, from biomedical computing and environmental science to more traditional computer and information science programs around natural user interactions and data curation. She also directed the development of the Microsoft Translator Hub and the Environmental Science Services Toolkit. She is also one of the editors and authors of one of the earliest books on data science, The Fourth Paradigm: Data Intensive Scientific Discovery. Her current focus is developing an outreach program to engage with academics on data science in general and more specifically around using data to create meaningful and useful user experiences across devices and platforms. Prior to joining Microsoft, Tolle was an Oak Ridge Science and Engineering Research Fellow for the National Library of Medicine and a Research Associate at the University of Arizona Artificial Intelligence Lab managing the group on medical information retrieval and natural language processing. She earned her Ph.D. in Management of Information Systems with a minor in Computational Linguistics. Dr. Tolle’s present research interests include global public health as related to climate change, mobile computing to enable field scientists and inform the public, sensors used to gather ecological and environmental data, and integration and interoperability of large heterogeneous environmental data sources. She collaborates with several major research groups in Microsoft Research including eScience, computational science laboratory, computational ecology and environmental science, and the sensing and energy research group.

David Williams

 

Biology
University of Washington

Mentor

 
I work on understanding the control of biological motion, scaling from the molecular dynamics of individual motor molecules to the kinematics of animal movement. I use molecular models of muscle that incorporate interesting new protein movement mechanisms to understand force regulation in single sarcomeres, the sub-cellular unit that shortens when muscle contracts. These models show us how muscle’s geometric properties control the force it generates. I validate these models with X-ray imaging and, as a WRF fellow at the University of Washington, am developing software tools to automate and introduce objectivity into this technique.

Jennifer Worrell

 

Computer Science & Engineering
University of Washington

Organizer

 
Jennifer Worrell is the Program Manager for the Data Science IGERT Program and the Database Group in the UW Computer Science and Engineering department. She has a background in psychology and anthropology, and has spent the last ten years managing multi-million dollar awards for various groups within the department.

Fen Zhao

 

Strategic Innovation
National Science Foundation

Observer

 
Fen Zhao is a Staff Associate, Strategic Innovation. Dr. Zhao focuses on building public – private partnerships around CISE’s Big Data, next generation internet, and cybersecurity R&D portfolios. Prior to joining CISE OAD, Dr. Zhao was a AAAS Fellow at the White House Office of Science and Technology Policy working on national security S&T issues. Before her work in the public sector, Dr. Zhao was an associate with McKinsey and Company’s Risk Management Practice, serving public sector clients in the mortgage and debt markets. Fen received a PhD in Applied Physics from Stanford University and her BS from MIT. Her doctoral research was conducted at the Kavli Institute for Particle Astrophysics and Cosmology at SLAC National Accelerator Labs, where she developed supercomputing astrophysical simulations of magnetic fields within the early universe.

Posters

Source: http://depts.washington.edu/dswkshp/session1.html

Yazhong Wang

Physics
Rutgers

Poster 1
X-team 10

Whitepaper:
Cosmological experiments in condensed matter system
We studied topological defects in hexagonal manganites to help understand the evolution of our Universe based on the Kibble-Zurek mechanism. In this work, we have to account for the defect density over a large scale and for the coordinates of each vortex core on optical images. This data analysis takes us several months. We are seeking a more efficient way to do this work, which would be a great convenience in future research.

Julie van der Hoop

Joint Program in Oceanography (Biology)
Massachusetts Institute of Technology - Woods Hole Oceanographic Institution

Poster 2
X-team 1

Whitepaper:
Integrating animal sensing systems
The next breakthroughs in wearable technology, for humans or animals, require integrated sensing systems.

Fred Morstatter

School of Computing, Informatics, and Decision Systems Engineering
Arizona State University

Poster 3
X-team 1

Whitepaper:
Discovering Bias in Big Social Media Data
One fundamental problem with social media mining is getting access to representative, reliable data. While companies like Facebook have massive amounts of data, they do not share this data with the research community at large. The few sites that do share their data do so through APIs that allow the researcher access to a portion of the overall data generated on the site. Twitter, one example of a social media site that shares its data, allows researchers access to at most 1% of all of the posts generated on the site each day through its API. Twitter is perhaps the most lenient when it comes to sharing data with the research community. While Twitter’s APIs come as a welcome relief to those in the area of social media mining, their ability to represent the true activity on the social media site has become a concern to researchers in recent years. The problem of finding representative samples of social media is a widely accepted and necessary problem that researchers must address in order to ensure the veracity of their research results. Herein we define the problem and outline two state-of-the-art solutions.

Austin Arrington

Environmental Science / Ecosystem Restoration 
SUNY College of Environmental Science and Forestry

Poster 4
X-team 4

Whitepaper:
Color Analysis of Crowdsourced Images for Ecological Monitoring
Remote sensing technology, such as satellite imagery, is a powerful tool for studying spatial ecology. However, understanding spatial ecology often requires finer scales than is afforded by satellite imagery, and the need for “ground-truthing” still exists. Leveraging “Big Data,” or more specifically, geo-tagged and time-stamped images provided through open source online networks, may offer a solution to help better understand scale and pattern in ecological systems.

Chris Cacciapaglia

Biological sciences
PhD student

Poster 5
X-team 4

Whitepaper:
Climate change refuges in the oceans
Identify coral reef refugia in the Pacific, Indian and Atlantic Oceans under differing climate change scenarios using climate-envelope models in accordance with high-resolution environmental data at a global scale.

James Morton

Computer Science
University of California San Diego

Poster 6
X-team 9

Whitepaper:
Uncovering the Unknown: A New Approach in Analyzing Microbiome Data
In microbiome studies, the process of normalizing samples is still in the midst of immense debate. We argue that the most straightforward approach for normalizing samples is to calculate the proportions of species for each sample. In this paper we introduce a novel statistic for estimating multinomial proportions when the total number of possible species is unknown. Here we will show why proportions estimated from observed species abundances are poor estimators of the true proportions and how coverage estimators can enhance the accuracy of true proportion estimators.

Arif Khan

Computer Science
Purdue University

Poster 7
X-team 5

Whitepaper:
Large Scale Adaptive Anonymity via Parallel Approximate b-Matching
Data privacy is a necessary feature for data science applications. We discuss the potential of k-Anonymity, a privacy algorithm, in the context of big data. We show some of the limitations of k-Anonymity and propose a heuristic solution to solve those problems. We also present the applicability of k-Anonymity to different domains of Data Sciences.

Alexandra Munoz

Environmental Medicine
New York University

Poster 8
X-team 9

Whitepaper:
Re-evaluating the paradigmatic presuppositions of molecular biology in the context of big data
Every piece of information that is extracted in data analysis also assumes a model – without the model the data would not tell you anything – there would be no context through which to relate the variables and the magnitude of the values would be meaningless. The molecular landscape is modeled in a DNA-centric manner that prioritizes certain types of information (singularities) over others (dynamic processes) and in turn constructs a system in which certain avenues of causality are not being fully integrated into the model. In turn, this paper critiques the current model and points to a direction for alternative exploration. The motivation for this work is to model the complexity of cancer in a new way, in an effort to expand the search area for the solution to cancer.

Colin Raffel

Electrical Engineering
Columbia University

Poster 9
X-team 8

Whitepaper:
Learning Efficient Representations for Sequence Retrieval
We explore the problem of matching sequences of high-dimensional vectors to entries in very large sequence databases. When utilizing dynamic time warping distance to compare sequences, the local distance calculations can be prohibitively expensive when the data's dimensionality and intrinsic sampling rate are high. We therefore motivate the need for methods which can learn efficient representations for sequence comparison and discuss potential applications of these techniques.

Ashlynn Daughton

Systems Analysis and Surveillance 
Los Alamos National Lab

Poster 10
X-team 6

Whitepaper:
Use of Historic Disease Data to Facilitate Awareness and Inform Control Measures
Infectious diseases have recognizable patterns that have been documented for decades, but have not been fully exploited. We have developed an application that seeks to use similarities in historic infectious disease outbreak data to inform situational awareness of current outbreaks for a wide number of infectious diseases, even in contexts where minimal data is available.

Benjamin Weinstein

Ecology and Evolution
Stony Brook University

Poster 11
X-team 10

Whitepaper:
A pipeline for combining crowd-sourced images and computer vision to monitor plant flowering
Using images gathered from the Flickr photo sharing site to collect data on the timing of flowering plants in Mt. Rainier National Park.

Rayna Harris

Integrative Biology, Center for Computational Biology and Bioinformatics
The University of Texas at Austin

Poster 12
X-team 2

Whitepaper:
White Paper: Integrative Neuroscience
The brain is a fascinatingly complex organ that has been the subject of intense study for centuries, but many mysteries about the brain still remain. Current brain initiatives call for multi-scale integration of the activity and structure of the brain in order to elucidate and link neural circuit dynamics to brain function. As cutting edge technologies are being developed, neuroscience critically depends on developing data repositories, analysis tools and theories for integrating real-time genomic, connectomic, optogenetic, electrophysiological, and behavioral data.

Raghava Mutharaju

Computer Science
Wright State University

Poster 13
X-team 6

Whitepaper:
Distributed Reasoning over Ontology Streams and Large Knowledge Base
With the rapid increase in the velocity and the volume of data, it is becoming increasingly difficult to effectively analyze the data so as to extract knowledge from it. Use of background knowledge (domain knowledge captured in the form of ontologies) and reasoning (to correlate and infer facts) can prove to be useful in tackling the Big Data monster. But existing reasoning approaches are not scalable. In this paper, we present a distributed reasoning solution that can scale with the data.

Dorian Rosen

Materials Science Engineering
University of Utah

Poster 14
X-team 7

Whitepaper:
Data Mining and Machine Learning to Guide Novel Thermoelectric Development
This white paper describes the possible uses of thermoelectric materials, and addresses the problems associated with conducting high-risk studies to synthesize novel compounds from chemical white space. By data mining the ever-increasing number of materials science publications, a comprehensive database is being constructed. Newly-developed machine-learning systems are being used to predict the thermoelectric properties of hypothetical materials, and bridge the gap between computational tools and experimental needs.

Dennis Linders

iSchool
University of Maryland, College Park

Poster 15
X-team 1

Whitepaper:
The Smart City as a Platform for Collaboration on Climate Change
Cities are at the forefront of the fight against climate change, because their concentration of resources provides the most environmentally-friendly way of delivering a high quality of life. Sustainable cities combine this advantage with a society-wide commitment to a low-carbon lifestyle. Yet the traditional tools of public administration are poorly equipped to facilitate this collaborative approach. Fortunately, advancements in Information and Communication Technologies (ICT) hold tremendous potential to address these shortcomings. Most promisingly, innovative urban leaders have begun to reshape both government and governance around a vision of a “Smart City” that collects vast amounts of data on the state and performance of its communities and then translates this data into actionable insights. Yet the adoption of these “smart city” innovations remains best described as experimental, as blind aspirations continue to far exceed validated best practices or proven implementation strategies. To bridge this gap, the proposed research project will conduct holistic case studies of three pioneering "smart cities" to identify effective business models for using "smart" infrastructure, data science, and connected citizens to promote community-wide action on climate change.

Charmgil Hong

Department of Computer Science
University of Pittsburgh

Poster 16
X-team 5

Whitepaper:
Multivariate Conditional Outlier Detection and Its Clinical Application
This paper summarizes our research that aims at developing automated methods of multivariate conditional outlier detection, and applying the methods to support clinical decision making. In particular, we are interested in identifying statistically unusual patient care patterns corresponding to medical errors based on data stored in electronic medical record (EMR) systems. We describe the problems and objectives of the research, and outline our model-based outlier detection approach. We also discuss the future directions and expected impacts of the research.

Haroon Raja

Electrical and Computer Engineering Department
Rutgers, The State University of New Jersey

Poster 17
X-team 5

Whitepaper:
Cloud K-SVD: A Dictionary Learning Algorithm for Big, Distributed Data
This paper studies the problem of data-adaptive representations for big, distributed data. It is assumed that a number of geographically-distributed, interconnected sites have massive local data and they are interested in collaboratively learning a low-dimensional geometric structure (dictionary) underlying these data.

Bharathi Asokarajan

Data Science and Analytics
University of Oklahoma

Poster 18
X-team 10

Whitepaper:
Pixel oriented visualization – An aid to analyze large-scale text data
Classics scholars work with text data that is not just Big, but also Interesting and Complex. We are developing new pixel-based text visualization techniques that display the hierarchical structure of primary texts with their rich apparatus metadata in an accessible and comparable fashion. As a part of this, we are investigating new ways to support focus+context interactions across multiple scales of text. These visualization designs will help scholars to engage effectively and efficiently with the long and deep provenance of knowledge that surrounds some of humanity's most important historical works. We anticipate that successful application of new interactive visualization techniques for text analysis to a complex domain like classics will provide a clear direction for application to scholarship and learning on text, language, and communication in a wide variety of domains.

William Kearney

Earth and Environment
Boston University

Poster 19
X-team 8

Whitepaper:
Deriving process knowledge from data in coastal ecohydrology
Scientists interested in developing robust predictive models should aim for a synthetic modeling approach which combines the predictive power of empirical models with the process-driven understanding of physical models. I examine how this synthetic approach can improve the representation of processes in empirical models of salt marsh hydrology.

Saurabh Jha

Computer Science
University of Illinois at Urbana Champaign

Poster 20
X-team 6

Whitepaper:
LASE: Log Analysis and Storage Engine for Resiliency Study
There is a need to build exascale computers to further the progress of scientific studies and meet the ever growing demands of computing power. One of the critical problems (if not the most critical problem) in reaching the exascale computing goal by the end of the decade is “designing fault tolerant applications and systems that can reach sustained petaflops to exaflops of performance”. Due to the high number of errors and failures in such complex systems, it has become important to understand the reasons for errors and failures and take proactive action to contain the errors to support exascale computation. Logs serve as an important source of information for such a study. However, due to the nature and scale of these logs, it has become difficult if not impossible to process and extract meaningful information from them. LASE is a log analysis and storage engine that brings various techniques together in a unified framework that can handle petabytes of data and assist in building models for failure diagnosis, prediction and anomaly detection.

Victoria Villar

Astronomy and Astrophysics
Harvard

Poster 21
X-team 3

Whitepaper:
Classification of Intermediate-Luminosity Astronomical Transients
Stars materialize, live and die following a lifecycle that depends on both intrinsic properties and environmental factors. Their transient outbursts, interactions and deaths all encode important information about stellar evolution. Future large surveys, such as LSST, will produce 30+ TB of data daily which astronomers can use to study these transients. This paper describes possible classification techniques for analyzing the LSST dataset of intermediate-luminosity transients.

Timothy Jones

Biological Sciences
Louisiana State University

Poster 22
X-team 2

Whitepaper:
Seeing is believing with data visualization
This whitepaper outlines state of the art biodiversity identification practices and new visual methods using Kingdom Plantae and its children as models.

Daniel Cook

Molecular Biosciences
Northwestern University

Poster 23
X-team 1

Whitepaper:
Improving data-management and integration within resequencing-pipelines
A powerful strategy for identifying the genetic basis of phenotypes is to perform genome-wide association (GWA) analysis. GWA studies that utilize massively parallel sequencing rely on population resequencing pipelines to identify genetic variants. Resequencing pipelines require precise data handling and integration of data generated across a large series of steps from multiple programs to identify issues, confounding factors in analysis, or to identify interesting associations. However, integrating this data is challenging as it requires extensive file parsing, manipulation, and merging. Here, I propose the development of a database schema resembling an entity-attribute-value (EAV) model for storage of summary data generated at different steps within resequencing pipelines and a set of tools enabling integration of this system. This system improves data-handling within resequencing pipelines and facilitates comparison of variables across tools, samples, and between pipeline configurations.

Xin Yang

Computational Science
Middle Tennessee State University

Poster 24
X-team 10

Whitepaper:
Spatial Regularization for Multitask Learning and Application in fMRI Data Analysis
fMRI data has an extremely complicated structure. Hence efficient and accurate models that incorporate spatial and spectral information are necessary for accurately detecting neuronal activity. In this paper, we use the General Linear Model to formulate the fMRI data, where each voxel is treated as a task, and propose a class of spatial Multi-task Learning models, which incorporate spatial information provided by each task’s neighborhood. Simulation and real application results show satisfactory performance from spatial Multi-task Learning algorithms.

Sepideh Pourazarm

Systems Engineering
Boston University

Poster 25
X-team 7

Whitepaper:
Improving Traffic Management Using Big Data
We study the routing problem for vehicle flows through a road network that includes both battery-powered Electric Vehicles (EVs) and Non-Electric Vehicles (NEVs). We seek to optimize a system-centric (as opposed to user-centric) objective aiming to minimize the total elapsed time for all vehicles to reach their destinations considering both traveling times and recharging times for EVs when the latter do not have adequate energy for the entire journey. We are validating the efficiency of our algorithm using real traffic data in terms of “average speed” on the road segments in Eastern Massachusetts provided by the City of Boston.

Susmit Shannigrahi

Computer Sc. Dept
Colorado State University

Poster 26
X-team 7

Whitepaper:
Named Data Networking for Large Scientific Data Management
This paper discusses how using Named Data Networking (NDN) reduces the complexities of large scientific data management. Scientific data collections require safe archiving and easy retrieval while maintaining data provenance and integrity. The large size and distributed nature of these datasets complicate the already challenging data management task. NDN, an NSF project for investigating future Internet architectures, replaces IP endpoints with hierarchical content names. NDN implicitly overcomes many of the challenges associated with managing scientific data. We describe a framework developed with NDN to reduce such challenges.

Erika Helgeson

Department of Biostatistics
University of North Carolina

Poster 27
X-team 1

Whitepaper:
Nonparametric Cluster Significance Testing
We describe a proposed method for testing the statistical significance of putative clusters. Cluster analysis is an unsupervised learning strategy that can be used to identify groups of observations in data sets of unknown structure. Few methods are available that can assess the strength of clusters identified in a data set. The methods that are available often rely on distributional assumptions or are not optimized for high dimensional settings. We propose a novel non-parametric method for testing the null hypothesis that no clusters are present in a given data set which can be used in both high and low dimensional settings with optimal accuracy.

Qi Song

School of Electrical Engineering and Computer Science
Washington State University

Poster 28
X-team 6

Whitepaper:
Knowledge Search Made Easy: Effective Knowledge Graph Summarization and Applications
The rising Big Data tide requires powerful techniques to effectively search for useful knowledge in information systems such as knowledge bases and knowledge graphs. Accessing and searching complex knowledge graphs is difficult for end users due to query ambiguity, data heterogeneity and large-scale data. We propose to develop effective knowledge summarization techniques to make the knowledge search process easy for end users. The knowledge graph summaries not only help users understand the complex knowledge data and search results, but can also suggest reasonable queries and support fast knowledge search. Our research will benefit a number of knowledge discovery applications including web and scientific search, social network analysis, cyber security and health informatics.

Chengrui Li

Department of Statistics and Biostatistics
Rutgers University

Poster 29
X-team 3

Whitepaper:
A Sequential Split-Conquer-Combine Approach for Analysis of Big Spatial Data
The task of analyzing massive spatial data is extremely challenging. In this paper we propose a sequential split-conquer-combine (SSCC) approach for analysis of dependent big data and illustrate it using a Gaussian process model, along with theoretical support. This SSCC approach can substantially reduce computing time and computer memory requirements. We also show that the SSCC approach is oracle in the sense that the result obtained using the approach is asymptotically equivalent to the one obtained from performing the analysis on the entire data in a super-super computer. The methodology is illustrated numerically using both simulation and a real data example of a computer experiment on modeling room temperatures.

Jianbo Ye

Information Sciences and Technology
The Pennsylvania State University

Poster 30
X-team 10

Whitepaper:
Clustering Distributions at Scale: A New Tool for Data Sciences
We introduce a fast and parallel tool for clustering large-scale discrete distributions under the optimal transport distance. The significant computational cost of optimal transport has left the problems of machine learning from such unstructured data almost untouched until today. Our proposed optimization method successfully resolves the scalability bottleneck of previous methods, and hence is readily applicable for analyzing large distributional datasets without first specifying the form with which the distribution of the data must comply.

Berk Ustun

Electrical Engineering and Computer Science
MIT

Poster 31
X-team 7

Whitepaper:
Learning Tailored Risk Scores from Large-Scale Datasets
Risk scores are simple models that let users assess risk by adding, subtracting and multiplying a few small numbers. These models are widely used in medicine and crime prediction but difficult to learn from data because they need to be accurate, sparse, and use integer coefficients. We formulate the risk score problem as a mixed integer non-linear programming problem, and present a cutting-plane algorithm to solve it for datasets with large sample sizes.
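
Illustrative note (not part of the whitepaper): a risk score of the kind described above can be evaluated with a few lines of Python. The feature names, point values, and intercept below are invented for illustration, and the logistic mapping from score to predicted risk is an assumption about a common form of such models, not something this abstract states.

```python
import math


def risk_score(features, points):
    """Sum small integer points for the features that are present (value 1) or counted."""
    return sum(points[name] * value for name, value in features.items() if name in points)


def predicted_risk(score, intercept):
    """Map an integer score to a probability with a logistic link (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(score + intercept)))


# Hypothetical scoring sheet: every coefficient is a small integer.
example_points = {"age_over_60": 2, "prior_event": 3, "on_treatment": -1}
```

For a person with `age_over_60` and `prior_event` set to 1 and `on_treatment` set to 0, the score is 2 + 3 = 5. The learning problem described in the abstract is the harder part: choosing those few integer points so that the resulting model is both sparse and accurate.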

Mahdi Ahmadi

Mechanical and Energy Engineering
University of North Texas

Poster 32
X-team 1

Whitepaper:
Big Data Mining Methods for Accurate Spatial Interpolation of Ozone Pollution
This paper explains the importance of using big data methods to extract accurate spatial interpolation functions for ozone pollution prediction.

Sayamindu Dasgupta

Media Arts and Sciences
Massachusetts Institute of Technology

Poster 33
X-team 3

Whitepaper:
Large-scale analysis of novice programmer trajectories in an open-ended programming community
This white paper outlines some of the opportunities and challenges in analyzing trajectories of young novice programmers as they create, share, and remix media-rich programming projects, as well as participate socially in the Scratch online community (https://scratch.mit.edu). Scratch is open-ended by design, where anyone with a web browser can create a wide variety of programming projects, ranging from games to science simulations, from interactive stories to computational music programs. This open-ended context poses a number of challenges for the large-scale analysis and measurement of learning outcomes. Addressing these challenges holds promise not just for understanding the use of Scratch as a learning environment; as the learn-to-code movement in the United States and elsewhere gathers momentum, methods and strategies formulated for Scratch data research also have the potential to be useful for research on other similar tools and environments that teach young people programming.

Abhijit Bendale

Computer Science
University of Colorado at Colorado Springs

Poster 34
X-team 10

Whitepaper:
Towards Open World Recognition
With the advent of rich classification models and high computational power, visual recognition systems have found many operational applications. Recognition in the real world poses multiple challenges that are not apparent in controlled lab environments. The datasets are dynamic and novel categories must be continuously detected and then added. At prediction time, a trained system has to deal with myriad unseen categories. Operational systems require minimal downtime, even to learn. To handle these operational issues, we present the problem of Open World Recognition and formally define it. We prove that thresholding sums of monotonically decreasing functions of distances in linearly transformed feature space can balance “open space risk” and empirical risk. Our theory extends existing algorithms for open world recognition.

Justin Brandenburg

Computational Social Science
George Mason University

Poster 35
X-team 3

Whitepaper:
Replicating Cyber-attack Patterns of Behavior using Bipartite Network Analysis and Agent-Based Modeling
Introducing a method of evaluating cyber traffic behavior via bipartite graph analysis and implementing agent-based modeling to simulate and test network capability.

Kevin Keys

Biomathematics
UCLA

Poster 36
X-team 7

Whitepaper:
Parsimonious model selection in genome-wide association studies
This white paper sketches an issue with model selection in multiple regression analysis of genome-wide association studies. Based on our current research, we suggest a remedy to perform these large analyses on a desktop machine.

Yang Wang

System and Information Engineering
University of Virginia

Poster 37
X-team 8

Whitepaper:
Maintained Individual Data Distributed Likelihood Estimation (MIDDLE)
The Maintained Individual Data Distributed Likelihood Estimation (MIDDLE) paradigm will construct and validate a revolutionary model for accomplishing health science human-subject research with networked devices. The MIDDLE paradigm is that data can be privately maintained by participants on their personal devices and never revealed to researchers, while statistical models are fit and scientific hypotheses are tested.

Ethan Rudd

Computer Science
University of Colorado at Colorado Springs

Poster 38
X-team 10

Whitepaper:
The Extreme Value Machine
This paper describes a scalable, non-linear model called the Extreme Value Machine (EVM), an analog to the Support Vector Machine (SVM) derived from statistical Extreme Value Theory. The EVM is far more scalable than a kernelized SVM, exhibits comparable accuracy on closed set datasets (where all classes are known at test time), and avoids the need for a parameter grid search. This allows the EVM model to scale to large datasets that are computationally infeasible for non-linear SVMs. Moreover, unlike SVMs, our EVM model performs well in the open-set regime (when unknown classes are present at test time), achieving state-of-the-art results on open-set datasets.

Wei Xie

Department of Electrical Engineering & Computer Science
Vanderbilt University

Poster 39
X-team 2

Whitepaper:
Collaborative data science without violating privacy: a case study from genome research
Data privacy is an important issue for many data science disciplines involving human subjects. Here we take genome research as a representative case study to illustrate the privacy concerns and countermeasures. We develop a novel cryptography-based method to enable collaborative studies via meta-analysis without violating privacy. We also show the relevance of this method to the wider research community of data science.

Marina Kogan

Computer Science
University of Colorado Boulder

Poster 40
X-team 5

Whitepaper:
Why Data Science Needs to Attend to Contextual Behavior: The Case of Crisis Informatics
Crisis informatics is the study of how people converge, spread information, and cooperate around the tasks they deem important on social media in crisis. The socio-behavioral focus of crisis informatics necessitates that research methodology account for the social context of users’ activity. On the other hand, the volume of social media data requires the use of data science approaches, which in their current form often decontextualize the social activity. I propose several methodological innovations that would propel big data methods towards attending to the highly situated and contextual nature of social activity in crisis.

Taisuke Imai

Division of the Humanities and Social Sciences
California Institute of Technology

Poster 41
X-team 3

Whitepaper:
Detecting Habitual Behavior in Natural Consumer Choice Data
Habit is a process by which a stimulus automatically generates an impulse toward action, based on learned association between stimulus and response. In this project we seek to identify habitual choices and shifts from habit to model-directed behavior using big and broad data sets of natural consumer decision making such as online shopping, online stock trading, and commuter route choice.

Lincoln Sheets

Informatics Institute
University of Missouri

Poster 42
X-team 3

Whitepaper:
Data Mining to Predict Healthcare Utilization in Managed Care Patients
Systematic association mining of clinical attributes from the electronic health records of adult primary care patients to discover predictors of high healthcare utilization.

Nancy (Xin Ru) Wang

Computer Science and Engineering
University of Washington

Poster 43
X-team 6

Whitepaper:
Decoding neural signals with natural multimodal data
This paper outlines our project that will use deep and unsupervised techniques to analyze a large multimodal natural (non-experimental) dataset, including simultaneous video, audio and ECoG/EEG signals for computational neuroscience and brain-computer interface applications. This project combines techniques from multiple fields in order to fully leverage the multimodality of the dataset.

Kin Gwn Lore

Mechanical Engineering
Iowa State University

Poster 44
X-team 1

Whitepaper:
Pattern Discovery from Large-scale Computational Fluid Dynamic Data using Deep Learning
This paper outlines our research in solving an inverse fluid dynamics design problem using large-scale simulation data. The forward problem of sculpting fluid flow by placing a set of pillars in a fluid channel has been simulated and experimentally validated. We now explore the applicability of machine learning models (specifically deep learning) in the inverse problem to serve as a map between user-defined flow shapes and the corresponding sequence of pillars in the design of microfluidic devices.

Anuj Karpatne

Computer Science and Engineering
University of Minnesota

Poster 45
X-team 4

Whitepaper:
Global Monitoring of Inland Water Dynamics: A Data-driven Approach
Freshwater, which is only available in inland water bodies such as lakes, reservoirs, and rivers, is becoming increasingly scarce across the world, and this scarcity is posing a global threat to human sustainability. Global monitoring of inland water bodies is necessary for policy-makers and the scientific community to address this problem. The promise of data-driven approaches coupled with the availability of remote sensing data presents opportunities as well as challenges for global monitoring of inland water bodies. My research aims at developing predictive models that address the challenges in analyzing remote sensing data for creating the first global monitoring system of inland water dynamics.

Jesse Hoff

Animal Science/Genetics
University of Missouri

Poster 46
X-team 2

Whitepaper:
Incorporation of Genomic Data in US Cattle Breeding and Production
We are analyzing appropriate practices for routine incorporation of genetic information in both the selection and care of US livestock. The rapid decrease in cost of genetic data from chips or short read data has led to the accumulation of large, profitable data sets that provide dense quantification of the genetic component of livestock production. Unlike human or model organism genetics, the regulatory environment surrounding livestock genomics has enabled immediate application of cutting-edge genomic technology in a commercial setting. Low-cost genomic data informs high-leverage decision making at farms ranging in size and technical expertise, enabled by well-shared central genomic data repositories that inform genomic breeding models. We seek to develop tools that lower costs and allow genomic information to provide value across the whole livestock production cycle, from reproduction to immunity to ecological sustainability and carcass quality.

Zachary Foster

Botany and Plant Pathology
Oregon State University

Poster 47
X-team 9

Whitepaper:
Automated website generation for reproducible and shareable data science
Describes the potential benefits and challenges of using literate programming for embedded documentation in data science projects and introduces a new R package under development that generates website representations of project folders. It uses the names of files/folders and options specified in configuration files to infer a menu hierarchy and organize the content of files. Literate programming documents are executed and their output is integrated into the website along with PDF files, images, and other HTML files in the project.

Daqing Yun

Department of Computer Science
University of Memphis

Poster 48
X-team 8

Whitepaper:
An Integrated Transport Solution to Big Data Movement in High-performance Networks
We propose and develop an integrated transport solution to big data movement in high-performance networks in support of data- and network-intensive scientific applications.

Kelly Spendlove

Mathematics
Rutgers University

Poster 49
X-team 10

Whitepaper:
Determining Periodicity In Data
In the last few years high-throughput technologies have enabled the efficient and inexpensive collection of massive amounts of data. In many cases the data are high dimensional and being generated by some nonlinear system. In such a situation one is interested in both the geometry of the data and the action of the unknown nonlinear system. One of the most fundamental problems in analyzing nonlinear systems is determining whether the system is periodic. However, the past few decades of dynamical systems theory have shown nonlinear systems can exhibit extremely complex behavior with respect to both system variables and parameters. Such complex behavior proven in theoretical work has to be contrasted with the capabilities of application; in the case of modeling multiscale processes, for instance, measurements may be of limited precision, parameters are rarely known exactly and nonlinearities are often not derived from first principles. This contrast suggests that extracting a robust characterization of the periodic behavior is of greater importance than a detailed understanding of the fine structure. For such a characterization, we propose an approach which incorporates Takens’ embedding theorem, persistent homology and diffusion maps.

Can Hu

Department of Statistics and Biostatistics
Rutgers, The State University of New Jersey

Poster 50
X-team 3

Whitepaper:
Advanced Data Analytics of Railroad Infrastructure Degradation to Improve Transportation Safety
This white paper introduces some possible models to capture the track geometry degradation.

Team Assignments

Source: http://depts.washington.edu/dswkshp/y-teams.html

 

Last Name First Name Y-team
Ahmadi Mahdi 10
Alshammari Sultanah 4
Anantharam Pramod 5
Anderson Jennings 8
Arrington Austin 5
Asokarajan Bharathi 10
Balaji Janani 1
Bendale Abhijit 1
Bhoorasingh Pierre 2
Borhani Alireza 2
Brandenburg Justin 1
Cacciapaglia Chris 4
Cesare Nina 7
Cook Daniel 10
Counihan Jessica 10
Dasgupta Sayamindu 9
Daughton Ashlynn 5
Diaz Rodriguez Natalia 2
Eilering Bradford 1
Esfandiari Mohammadreza 4
Faisal Fazle 8
Farah Subrina 5
Farmanesh Babak 9
Feng Long 5
Fluet-Chouinard Etienne 7
Foster Zachary 1
Gadiraju Krishna Karthik 8
Geng Xinli 7
Georges Alex 10
Gittelman Rachel 3
Goodrich Timothy 1
Goodwin Zane 3
Gray Vanessa 9
Harris Rayna 9
Helgeson Erika 6
Hoff Jesse 8
Hong Charmgil 4
Hu Yirui 4
Hu Can 3
Hughes Shiree 1
Iaconangelo Charles 8
Imai Taisuke 2
Jha Saurabh 9
Jones Timothy 10
Karpatne Anuj 2
Kearney William 3
Kejriwal Mayank 8
Keys Kevin 4
Khan Arif 6
Kogan Marina 9
Lee Ryan 4
Li Jia 3
Li Chengrui 4
Linders Dennis 3
Little Joshua 2
Long Matthew 3
Lore Kin Gwn 5
Mahajan Sushant 4
Marti Hazan 6
Melander Josh 9
Miri Pardis 2
Morstatter Fred 7
Morton James 7
Munoz Alexandra 5
Mutharaju Raghava 3
Pourazarm Sepideh 6
Qiu Jiangxiao 6
Raffel Colin 1
Raja Haroon 7
Rajendrababu Ishwarya 10
Rosen Dorian 10
Rudd Ethan 5
Salehi Sadghiani Nima 3
Shannigrahi Susmit 9
Sheets Lincoln 6
Song Qi 10
Spendlove Kelly 9
Sridhar Sowmya 6
Talukder Nilothpal 5
Tao Jin 8
Trigg Shelly 2
Tunney Robert 6
Ustun Berk 7
van der Hoop Julie 8
Villar Victoria 10
Vimal Solomon 9
Wang Yazhong 8
Wang Fei 6
Wang Yang 7
Wang Nancy (Xin Ru) 8
Waterhouse Lynn 1
Weinstein Benjamin 3
Welch Joshua 2
White Colin 5
Xie Wei 7
Yang Xin 7
Yao Yisha 4
Ye Jianbo 6
Yun Daqing 2
Zhou Lisheng 1

 

Team 1

My Note: What to do with this content? Is it worth mining? Are there data sets and results?

Member profiles: X-team (Same as poster information above)

X-team Working documents: Notes (6 pages) Presentation (15 slides with notes) Report (none)

Y-team Working documents: Notes (3 pages) Presentation (4 slides)
