Semantic Data Science for DoD

Table of contents
  1. Story
  2. Research Notes
  3. Postscript
  4. Global Top 25 University Business Incubators 2013
    1. Introduction
    2. UBI Network
      1. About
      2. Upcoming Studies
        1. Benchmark 2013 Update
        2. State of Incubation at Top 100 Universities
        3. Sample Research
      3. Join the UBI Network
  5. Solve for Success: A Full-Day Workshop
  6. Slides
    1. Data Science: Achieve Rationalized Data
    2. Today
    3. Agenda
    4. About the Speakers 1
    5. About the Speakers 2
    6. PSIKORS Institute: Introduction
    7. PSIKORS Institute: History
    8. PSIKORS Institute: Why?
    9. The Ψ–KORS™ System Model
    10. Corporate NoSQL
    11. PSIKORS Institute Access
    12. Data Science and Data Rationalization
    13. Quote
    14. What is Data Science?
    15. Big or Small Data: It Is the Quality That Counts
    16. Intertwined Complex Information
    17. Data Science As A Form of Science Study
    18. Data Science From A Practitioner
    19. PSIKORS Institute Data Science Principles
    20. Data Science Roadmap
    21. Foundations 1
    22. Foundations 2
    23. Focus on Data Rationalization
    24. Data Rationalization
    25. Healthcare.gov Problems
    26. The State of Data for Reporting and Analytics
    27. Data Rationalization Matrix
    28. Data Rationalization Issues
    29. US Navy HR: Complicated Mixture of Commercial, Custom, Legacy, Services Applications, Data Stores
    30. Healthcare.gov Architecture
    31. Aligning Data Values With Business Context
    32. Semantic Conflict in Real Estate Models
    33. Data Value Semantic Errors = Inconsistent, Difficult to Merge, Report, Analyze
    34. Real Estate Data Set
    35. PSI Data Rationalization/Virtualization Platform
    36. DataStar Key Capabilities
    37. The Ψ–KORS™ System Model
    38. Clarifying Data Use and Meaning
    39. Corporate NoSQL
    40. Benefits of Corporate NoSQL
    41. DoD CIO IT Asset Decision Support: Kevin Garrison
    42. DoD CIO IT Asset Decision Support: DODCIO IT Asset Common Data Model
    43. DataStar Analysis 1
    44. DataStar Analysis 2
    45. Common Data Model Purpose
    46. CDM Strategy
    47. Design Characteristics
    48. Information Sources
    49. Core Data for Better Management
    50. Defining Initial Dimensions
    51. Coverage of Dimensions Between PSI-KORS and Troux
    52. Common Data Model for analysis, and acts as core for DITPR PLUS asset specific data sets
    53. Data Model
    54. The Need for Meaning in BI and Data Analysis: Dr. Ramon Barquin
    55. Data Sleuthing for IT Management: Jim Knox, Terry Halvorsen, Dept of Navy CIO
    56. Corporate Data Rationalization with DataStar
    57. Data Rationalization Project
    58. Corporate Overview
    59. DataStar’s Corporate Value Impact
    60. Relevance to All Organizations
    61. Initial Results: First 30 Days
    62. Clarifying Data Use and Meaning
    63. Outcome of First 30 Days
    64. Final Outcome
    65. Architecture 1
    66. Architecture 2
    67. Ad Hoc Topics
  7. Delivering Agile Data
  8. Emails
    1. Geoffrey Malafsky, 12/23/2013 1:00 PM
    2. Brand Niemann, 12/23/2013 10:08 AM
    3. Geoffrey Malafsky, 12/21/2013 6:12 PM
    4. Kingsley Idehen, 12/21/2013 3:18 PM
    5. Geoffrey Malafsky, 12/20/2013 6:14 PM
    6. Kingsley Idehen, 12/20/2013 5:59 PM
    7. Geoffrey Malafsky, 12/20/2013 3:34 PM
    8. Geoffrey Malafsky, 12/20/2013 2:01 PM
    9. Eric Kavanagh, 12/20/2013 1:34 PM
    10. Jim Hendler, 12/20/2013 9:22 AM
    11. Geoffrey Malafsky, December 19, 2013 6:27 PM
    12. Kingsley Idehen, Thu 12/19/2013 4:51 PM
    13. Geoffrey Malafsky, Thu 12/19/2013 1:11 PM
    14. Dennis Wisnosky, Thu 12/19/2013 1:05 PM
    15. Geoffrey Malafsky, 12/19/2013 1:03 PM
    16. Eric Kavanagh, 12/19/2013 12:02 PM
    17. Geoffrey Malafsky, 12/19/2013 11:37 AM
    18. Brand Niemann, 12/19/2013 11:18 AM
    19. Geoffrey Malafsky, 12/19/2013 10:58 AM
    20. Kingsley Idehen, 12/19/2013 10:12 AM
    21. Jim Hendler, 12/18/2013 5:10 AM
    22. Brand Niemann, 12/18/2013 4:36 AM
    23. Brand Niemann, 12/18/2013 2:34 PM
    24. Geoffrey Malafsky, 12/18/2013 9:29 AM
    25. Brand Niemann, 12/17/2013 4:51 PM
    26. Geoffrey Malafsky, December 17, 2013 8:55 AM
    27. Eric Kavanagh, 12/13/2013 4:43 PM
    28. Eric Kavanagh, 12/12/2013 1:53 PM

Story

Data Science Answers Three Fundamental Questions With Specific Integrated Results:

  • Where does the data come from? (statistical data collection versus ad hoc found data assembly)
  • Where is it stored and retrieved? (unstructured, relational, or graph)
  • What are the results? (statistics, visualizations, analytics)

Data Science: Achieve Rationalized Data

  • Add web page (requested) and slides (done)
  • Email discussion below (summarize: define acronyms, abbreviations, jargon, etc.) requested for the Federal Big Data Working Group Meetup Panel Discussion (better as tutorial now)
  • Framework You Can Use: "EIW failed and DataStar endorsed now"
  • Web Site: http://psikorsinstitute.net/ (see current below - new January 1, 2014)
  • My three questions: Harmonizing data versus integrating the data?; Integrating unstructured and structured data?; and Using RDF and OWL?
  • My lesson learned: Fix legacy versus build new like Army Weapons System of System 2012
  • My Data Science for Healthcare.gov and BeInformed (Semantic Data Science Precedes Use of Semantic Technology)

Inside Analysis

Semantic Data Science for DoD:

  • Semantic Data Science Team end-to-end solution (cf. Modus Operandi for Johns Hopkins Informatics)
  • Import data tables, create data dictionaries, and identify relations (ecosystem, federation, and relations)
  • I have worked on this for the past 2-3 years for senior DoD officials (Dennis Wisnosky, DoD DCMO; Terry Edwards, Army SoSE; and Walt Okon, DoD CIO) to semantically harmonize a data ecosystem for integration and applications that answer the three questions above - see the examples below

 

Title | Link | Story | Comments
A DoD DCMO Enterprise Information Web | Spotfire | Story | See Chronology
Military Departments Can Improve Their Enterprise Architecture Programs | MindTouch | | See Digital Government Strategy
Air Force OneSource | Spotfire | Background |
Army Weapon Systems Handbook 2012 | Spotfire | Story | Gall's Law
Build DoD in the Cloud | Spotfire | 3rd Annual DoD SOA & Semantic Technology Symposium |
2011 DOD IG Semiannual Report to Congress | MindTouch | | See Digital Government Strategy
Enterprise Information Web for Semantic Interoperability at DoD | Spotfire | Story and EIW RFI Two Hour Oral Presentation | Request for Information (RFI) for Enterprise Information Web (EIW)
DoDCIO-Spotfire | Spotfire (NA) | See Footnote 1 | Data Sources: MindTouch and Dictionary
DoDCIOIT Enterprise Strategic Management Demo-Spotfire | Spotfire (NA) | See Footnote 2 | Data Sources: MindTouch and Excel
Spotfire Data Federation | | |
Spotfire Information Designer | | |
Bigdata SYSTAP Literature Survey of Graph Databases | | Linked Data | "Also, these data are independently maintained and not truly semantically integrated, as discussed below."

Footnote 1:

Purpose: We are interested in learning about Taxonomy and Enterprise Vocabulary for fundamental architectural elements to enable interoperability and provide consistent understanding of shared architecture information across enterprise. Source: Walt Okon, October 4, 2011, Email.

Footnote 2:

1. Create an inventory of sample DoD IT Enterprise Strategic Management documents for: Data Center Consolidation, Architecture, Portfolio Artifacts, Strategy, and Taxonomy

2. Build that inventory in an Excel spreadsheet so it supports faceted search in a Spotfire dashboard (a minimal illustrative sketch follows this footnote).

3. Provide a sample knowledgebase of each of the four types of documents: Word (DoD CIO Campaign Plan), PDF (DoD IT Enterprise Strategy and Roadmap), PowerPoint (Update at USACE IM/IT Symposium, Austin, TX), and Excel (DoD Dictionary)
4. Provide the multiple sample knowledgebases in a Spotfire dashboard so they can be seen, compared, merged, harmonized, sorted, searched, downloaded, and shared on mobile devices (e.g. iPad).
5. Scale the previous architectural pattern with more content volume and types if necessary.

Goal: Federate the DoD IEA and BEA. Note: This pilot shows how that could be done.

End of Footnote 2
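
For illustration only, here is a minimal sketch of the inventory-with-facets idea described in the footnote above, using pandas as a stand-in for the Excel/Spotfire tooling; the column names and facet values are hypothetical, not the actual pilot's schema.

    import pandas as pd

    # Hypothetical sketch: a document inventory with a facet column, filtered
    # the way a faceted search in a dashboard would filter it. The titles come
    # from the footnote above; the columns are invented for illustration.
    inventory = pd.DataFrame([
        {"title": "DoD CIO Campaign Plan", "format": "Word", "facet": "Strategy"},
        {"title": "DoD IT Enterprise Strategy and Roadmap", "format": "PDF", "facet": "Strategy"},
        {"title": "Update at USACE IM/IT Symposium", "format": "PowerPoint", "facet": "Architecture"},
        {"title": "DoD Dictionary", "format": "Excel", "facet": "Taxonomy"},
    ])

    # A facet selection is just a filter on one or more columns.
    strategy_docs = inventory[inventory["facet"] == "Strategy"]
    print(strategy_docs[["title", "format"]])

    # The inventory can then be written out for use in a spreadsheet or dashboard.
    inventory.to_csv("dod_it_inventory.csv", index=False)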

A statement in the Linked Data section of the SYSTAP Literature Survey of Graph Databases seems very appropriate: "Also, these data are independently maintained and not truly semantically integrated, as discussed below." The Challenges section it points to says: "The Linked Open Data Cloud is an impressive example of what can be achieved with standards for graph data interchange and simple access protocols. However, there are significant problems introduced by the success of RDF, SPARQL, and linked data. First, these data sets are not truly semantically integrated – instead they tend to share some schemas and provide some level of cross-linking. As more and more graph data sets are published (or shared privately by corporations), it becomes increasingly important to find ways to automatically semantically align and mine the relationships in these data. Otherwise we will find ourselves with too much data and too little information that we can actually use. Second, as more and more data sets are combined, it becomes increasingly impossible for even expert users to write queries that properly integrate all of the available data that bears on their problem. This leads quickly to published and semantically disclosed information that nevertheless cannot be adequately leveraged."

My comment: If data sets are not collected with integration in mind, then it is very difficult, if not impossible, to semantically integrate them afterwards. Efforts to build new semantically interoperable systems are more likely to succeed by adding common key fields, etc., through collaborative agreements that doing so is in the common interest of all the data set stewards.
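
A minimal sketch of that point, with invented field names and pandas standing in for whatever integration tooling is actually used: once two stewards agree to carry a common key field, merging their data sets is a mechanical join rather than an after-the-fact semantic alignment exercise.

    import pandas as pd

    # Hypothetical sketch: two data sets whose stewards agreed on a common key.
    personnel = pd.DataFrame({
        "person_id": ["P001", "P002"],        # agreed common key field
        "name": ["A. Sailor", "B. Officer"],
    })
    assignments = pd.DataFrame({
        "person_id": ["P001", "P002"],        # same key carried by another system
        "unit": ["Unit Alpha", "Unit Bravo"],
    })

    # With the shared key, integration becomes a simple join.
    merged = personnel.merge(assignments, on="person_id", how="inner")
    print(merged)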

MORE TO FOLLOW

Research Notes

MY NOTE: Distil 19 Emails BIG JOB FOR LATER

The Invitation: Eric Kavanagh, 12/12/2013 1:53 PM

My close colleague Dr. Geoffrey Malafsky will lead a team of data scientists for an in-depth discussion of how to reconcile disparate information systems. Dr. Malafsky earned his stripes by saving a $300 million data reconciliation project several years ago using the most elegant methodology I've seen.

The Description: Solve for Success: A Full-Day Workshop:

In this one-day seminar, we provide a framework and methodology to help you address critical business challenges in a practical, cost-effective way, regardless of which tools you use, or your data maturity.

The Acceptance: Eric Kavanagh, 12/13/2013 4:43 PM

Feel free to bring any examples of issues you're facing right now, including actual data models, Visio diagrams or anything else. Part of the day will focus on solving real-world problems for attendees.

The Response to My Issues: Geoffrey Malafsky, December 17, 2013 8:55 AM

The afternoon was most enlightening as it dealt with the real use case behind many of the questions you asked. But I want to clarify some issues so we continue to understand each other's perspectives on semantics and how we are each using it to solve challenges in the data field.

I frequently state that there is no evidence of large scale operational success of semantic technology over the past 20 years. Many attempts; large investments; many proud public claims; but no actual production use in heterogeneous data environments. This includes what is widely considered a boondoggle failure of Wisnosky's Semantic Web effort EIW which was killed (with my overt cheering) by DCMO in early 2013.

My Initial Invitation to Dialogue: Brand Niemann, 12/17/2013 4:51 PM

I appreciate your detailed explanation below and what comes to mind is that this would make an excellent panel session for our new Federal Big Data Working Group Meetup, that I have invited Eric to participate in using your Healthcare.gov critique.

So if you are willing, I would like to share your thoughts with a few experts from Semantic Web-Semantic Technology and invite you to suggest someone else who shares your views (DoD, Navy, etc.), or even a different view from yours and the SW-ST experts, to participate in a panel discussion, so we can have an email discussion before the panel, say in early February.

My Expanded Invitation to Dialogue in Preparation for a Panel: Brand Niemann, 12/18/2013 4:36 AM

Dennis, Kingsley, and Jim, Please see the email from Geoffrey Malafsky below and provide your responses in preparation for a panel discussion that I would like to have for the new Federal Big Data Working Group Meetup.

MY NOTE: Distil 19 Emails BIG JOB FOR LATER

My Conclusion Email: Brand Niemann, 12/23/2013 10:08 AM

Decided not ready, need to see data sets, and need to focus on Healthcare.gov instead.

Review of Presentation: Slides

Organize My Thoughts: Story

Wrote a Postscript and Global Top 25 University Business Incubators 2013

Postscript

This workshop was held in the GMU Enterprise Center, where I read in the Fall 2013 magazine about its new president, Angel Cabrera, and the Mason IDEA of being innovative, diverse, entrepreneurial, and accessible. George Mason University is one of the fastest growing and most culturally diverse universities in the United States. The university's student enrollment has increased to more than 32,000 students, who major in a wide variety of undergraduate and graduate programs.

The Mason Enterprise Center Was Honored as Top Global University Business Incubator by the UBI Index.

Based in Stockholm, Sweden, the UBI Index is a European research initiative with an international research team specializing in incubation and entrepreneurship indexes. This was UBI’s first global benchmark study of university business incubators.

The UBI Index Global Benchmark 2013 considers university business incubators based on three categories:

  • value for the ecosystem (job creation, economy enhancement, and talent retention)
  • value for the client (competence development, access to funds, network enhancement, hardware and services, and post-incubation relationship)
  • attractiveness (incubator offer, internal environment, and post-incubation performance)

I am going to try to get the UBI Index data set for the Data Science Course I am preparing to teach at GMU in January 2014.

Global Top 25 University Business Incubators 2013

Source: http://ubiindex.com/global-top-list-2013/

 

300px_GlobalTop25
 
# | Incubator | University | Country
1 | Rice Alliance for Technology and Entrepreneurship | Rice University | United States
2 | VentureLab | Georgia Institute of Technology | United States
3 | UB Technology Incubator | University at Buffalo, State University of New York | United States
4 | NDRC | Dublin City University; Dun Laoghaire Institute of Art, Design and Technology; National College of Art & Design; Trinity College Dublin; University College Dublin | Ireland (Republic)
4 | SETsquared | Bath, Bristol, Exeter, Southampton, Surrey | United Kingdom
4 | Innovation Centre Sunshine Coast | University of the Sunshine Coast | Australia
4 | Tech 20/20 | University of Tennessee | United States
4 | ATP Innovations | University of Sydney; University of New South Wales; University of Technology, Sydney; Australian National University | Australia
4 | STING | KTH Royal Institute of Technology | Sweden
10 | NCTU Innovation Incubation Center | National Chiao Tung University | Taiwan
11 | Youngstown Business Incubator | Youngstown State University (not officially, but works closely) | United States
11 | I3P | Politecnico di Torino | Italy
11 | Vermont Center for Emerging Technologies | University of Vermont and 9 others | United States
11 | Jon Brumley Texas Venture Labs | University of Texas at Austin | United States
15 | Startup Sauna | Aalto University | Finland
16 | InNOLEvation Accelerator | Florida State University | United States
17 | TEC Edmonton | University of Alberta | Canada
17 | ASU Venture Catalyst | Arizona State University | United States
19 | Oxford Entrepreneurs Incubation Centre | Oxford Entrepreneurs; Oxford University | United Kingdom
19 | Mason Enterprise Center | George Mason University | United States
19 | NYU-Poly Incubator | Polytechnic Institute of NYU | United States
19 | profund – Die Gründungsförderung der Freien Universität Berlin | Freie Universität Berlin | Germany
23 | Sid Martin Biotechnology Incubator | University of Florida | United States
23 | Advanced Science and Commercialization Center | University of Kentucky | United States
25 | Incubateur HEC | HEC Paris | France
25 | StarTau | Tel Aviv University | Israel
25 | Flagship Enterprise Center | Anderson University | United States
25 | Digital Media Zone | Ryerson University | Canada

Introduction

UBI2013CoverPage.png

Source: PDF

Want to start or transform an incubator? Read on!

 
Everyone who knows incubation up close would agree that if done right, business incubation has the potential to transform businesses. But the question is what is the right method? One way of finding it is to observe and benchmark the top performing incubators globally and learn from their savoir faire. Surprisingly, no serious professional effort has been made on a global scale to highlight these best practices. Doing so will not only help aspiring incubators, policy makers and organizations that support entrepreneurship but also revive the debate and aid the incubation industry to move forward. This is what this professional benchmark report is all about!
 
As it is impossible to cover all types of incubators in a single benchmark, this first report in a series of other upcoming benchmarks focuses on university business incubators. One of the main reasons being that a multitude of incubators are found in universities around the world. According to the NBIA’s 2012 State of the Incubation Industry report, academic institutions are the single largest sponsors (32%) of all incubators in North America. Consequently, we feel it is legitimate to start with university business incubators in our first benchmark.
 
For everyone interested in university incubators, this benchmark report digs deep into 3 fundamental queries. Firstly, which are the top performing university business incubators globally? Secondly, how do they perform in comparison to average incubators? And thirdly, what lessons can average incubators draw from top performing incubators to improve their own performance? To find the answers, the benchmark report closely examines 150 university business incubators in 22 countries using a unique and comprehensive assessment framework developed by UBI Index. The resulting insights are packed into this report, which features a list of the top performing university business incubators, their success traits, best practices, detailed benchmarks and much more ...
 
UBI Index
Stockholm, Sweden
July 2013

UBI Network

Did you know that . . .
Top incubators have 3.3 times higher sales revenue generated by graduates per year than global average incubators?
(Dhruv Bhatli, Global Benchmark 2013 report)

 

Our passion is entrepreneurship
We would love to hear your thoughts and perception about incubation of tomorrow and the Global Benchmark of Top University Business Incubators. 

Insights from our study - what are your thoughts?

 

Do you know that top list incubators have a 2x higher number of clients receiving VC or angel funding per year compared to the global average? According to our study, almost 7 clients of top list incubators receive VC or angel funding.


Why don't we schedule a 20-minute call and talk about some of the insights and highlights from our research? We would also love to hear about your thoughts and concerns and take the opportunity to tell you what we are planning next.

How's your schedule for next week?

Best regards from a cloudy Stockholm,
/An Nguyen, Head of Operations
UBI Index

About

UBI Index is a Stockholm-based startup which provides expertise in starting and transforming incubators. It was founded by Joel Eriksson Enquist, who began it as his master's thesis at the Chalmers School of Entrepreneurship; Dhruv Bhatli, a Paris-based researcher and entrepreneur; Lars-Henrik Friis Molin, a serial entrepreneur with over 20 successful companies to his credit, such as Universum group and Job line; and Ali Amin, a dynamic student and entrepreneur. For more info: http://ubiindex.com.

Upcoming Studies

Benchmark 2013 Update
  • This benchmark update consists of detailed qualitative insights into the performance of top list incubators. Additionally, a few top cases are presented with a step-by-step analysis of their performance
  • With additional incubators and more data, this is a complementary benchmark to complete your 2013 collection
... much more
State of Incubation at Top 100 Universities
  • Being a top university does not necessarily make a university incubation friendly. What, then, is the state of incubation at the top 100 universities in the world?
  • This report assesses these universities on a range of criteria, e.g. preparedness, infrastructure, efforts to promote incubation etc. to identify best practices and provide insights about the state of incubation at the top 100 universities in the world.
... much more
Sample Research
UBI2013SampleResearch.png

Join the UBI Network

  • Receive Benchmark Reports & Updates
  • Get invitation to participate in future Benchmarks
  • Be the first to get pre-release access to all future incubation research papers
  • Collaborate & network with peers from all over the world
  • Receive tools for self assessment and benchmarking
  • Get invitations to quarterly events worldwide and incubation webinars
  • Receive monthly newsletter on best practice
  • Get preferential pricing on personalized services
 

Solve for Success: A Full-Day Workshop

Deep in the heart of your information systems, serious problems await. Learn how to find and resolve them before they harm your business! Our team of data scientists at the PSIKORS Institute will lead a workshop explaining the nature of the challenges to achieving rationalized data for meaningful data analytics, decision making, and enterprise applications. We will cover the critical factors impeding almost all organizations, and how to overcome them without organizational turmoil or following strategies shown time and again to fail. Dr. Geoffrey Malafsky, the chief Data Scientist, will lead the interactive day joined by experts, including a speaker from the Department of Defense. In this one-day seminar, we provide a framework and methodology to help you address critical business challenges in a practical, cost-effective way, regardless of which tools you use, or your data maturity.

Program Topics:

  • Big Data: The Good, the Bad and the Really Ugly
  • Decision Support -- How to Get Promoted, not Fired
  • User Stories
  • Rational Data as the Key to Success
  • Lunch RoundTable Discussion: From Healthcare.gov to part numbers – how small data value conflicts propagate into data models, warehouses, governance, ETL, BI, and executive decision making
  • PSIKORS Methodology: The What, How and Why
  • In Practice: Showcases of Before and After Rationalizing with PSIKORS
  • Wrap up: Community use and collaboration of PSIKORS and the DataStar platform from the PSIKORS Institute
  • Next Steps: How to bring rationalized data into your organization – motive, opportunity, action

Some complimentary passes are available to both Government and corporate personnel who are willing to provide a detailed use case example of their data rationalization challenges that can be discussed during the workshop. Submit your request and synopsis to Eric Kavanagh. Group discounts are available at registration.

Slides

Slides (PDF)

Data Science: Achieve Rationalized Data

GeoffreyMalafsky12122013Slide1.png

Today

GeoffreyMalafsky12122013Slide2.png

Agenda

GeoffreyMalafsky12122013Slide3.png

About the Speakers 1

GeoffreyMalafsky12122013Slide4.png

About the Speakers 2

GeoffreyMalafsky12122013Slide5.png

PSIKORS Institute: Introduction

GeoffreyMalafsky12122013Slide6.png

PSIKORS Institute: History

GeoffreyMalafsky12122013Slide7.png

PSIKORS Institute: Why?

GeoffreyMalafsky12122013Slide8.png

The Ψ–KORS™ System Model

GeoffreyMalafsky12122013Slide9.png

Corporate NoSQL

GeoffreyMalafsky12122013Slide10.png

PSIKORS Institute Access

http://psikorsinstitute.net

GeoffreyMalafsky12122013Slide11.png

Data Science and Data Rationalization

GeoffreyMalafsky12122013Slide12.png

Quote

GeoffreyMalafsky12122013Slide13.png

What is Data Science?

GeoffreyMalafsky12122013Slide14.png

Big or Small Data: It Is the Quality That Counts

GeoffreyMalafsky12122013Slide15.png

Intertwined Complex Information

GeoffreyMalafsky12122013Slide16.png

Data Science As A Form of Science Study

GeoffreyMalafsky12122013Slide17.png

Data Science From A Practitioner

GeoffreyMalafsky12122013Slide18.png

PSIKORS Institute Data Science Principles

GeoffreyMalafsky12122013Slide19.png

Data Science Roadmap

GeoffreyMalafsky12122013Slide20.png

Foundations 1

GeoffreyMalafsky12122013Slide21.png

Foundations 2

GeoffreyMalafsky12122013Slide22.png

Focus on Data Rationalization

GeoffreyMalafsky12122013Slide23.png

Data Rationalization

GeoffreyMalafsky12122013Slide24.png

Healthcare.gov Problems

GeoffreyMalafsky12122013Slide25.png

The State of Data for Reporting and Analytics

GeoffreyMalafsky12122013Slide26.png

Data Rationalization Matrix

GeoffreyMalafsky12122013Slide27.png

Data Rationalization Issues

GeoffreyMalafsky12122013Slide28.png

US Navy HR: Complicated Mixture of Commercial, Custom, Legacy, Services Applications, Data Stores

GeoffreyMalafsky12122013Slide29.png

Healthcare.gov Architecture

GeoffreyMalafsky12122013Slide30.png

Aligning Data Values With Business Context

GeoffreyMalafsky12122013Slide31.png

Semantic Conflict in Real Estate Models

GeoffreyMalafsky12122013Slide32.png

Data Value Semantic Errors = Inconsistent, Difficult to Merge, Report, Analyze

GeoffreyMalafsky12122013Slide33.png

Real Estate Data Set

GeoffreyMalafsky12122013Slide34.png

PSI Data Rationalization/Virtualization Platform

GeoffreyMalafsky12122013Slide35.png

DataStar Key Capabilities

GeoffreyMalafsky12122013Slide36.png

The Ψ–KORS™ System Model

GeoffreyMalafsky12122013Slide37.png

Clarifying Data Use and Meaning

GeoffreyMalafsky12122013Slide38.png

Corporate NoSQL

GeoffreyMalafsky12122013Slide39.png

Benefits of Corporate NoSQL

GeoffreyMalafsky12122013Slide40.png

DoD CIO IT Asset Decision Support: Kevin Garrison

GeoffreyMalafsky12122013Slide41.png

DoD CIO IT Asset Decision Support: DODCIO IT Asset Common Data Model

GeoffreyMalafsky12122013Slide42.png

DataStar Analysis 1

GeoffreyMalafsky12122013Slide43.png

DataStar Analysis 2

GeoffreyMalafsky12122013Slide44.png

Common Data Model Purpose

GeoffreyMalafsky12122013Slide45.png

CDM Strategy

GeoffreyMalafsky12122013Slide46.png

Design Characteristics

GeoffreyMalafsky12122013Slide47.png

Information Sources

GeoffreyMalafsky12122013Slide48.png

Core Data for Better Management

GeoffreyMalafsky12122013Slide49.png

Defining Initial Dimensions

GeoffreyMalafsky12122013Slide50.png

Coverage of Dimensions Between PSI-KORS and Troux

GeoffreyMalafsky12122013Slide51.png

Common Data Model for analysis, and acts as core for DITPR PLUS asset specific data sets

GeoffreyMalafsky12122013Slide52.png

Data Model

GeoffreyMalafsky12122013Slide53.png

The Need for Meaning in BI and Data Analysis: Dr. Ramon Barquin

GeoffreyMalafsky12122013Slide54.png

Data Sleuthing for IT Management: Jim Knox, Terry Halvorsen, Dept of Navy CIO

GeoffreyMalafsky12122013Slide55.png

Corporate Data Rationalization with DataStar

GeoffreyMalafsky12122013Slide56.png

Data Rationalization Project

GeoffreyMalafsky12122013Slide57.png

Corporate Overview

GeoffreyMalafsky12122013Slide58.png

DataStar’s Corporate Value Impact

GeoffreyMalafsky12122013Slide59.png

Relevance to All Organizations

GeoffreyMalafsky12122013Slide60.png

Initial Results: First 30 Days

GeoffreyMalafsky12122013Slide61.png

Clarifying Data Use and Meaning

GeoffreyMalafsky12122013Slide62.png

Outcome of First 30 Days

GeoffreyMalafsky12122013Slide63.png

Final Outcome

GeoffreyMalafsky12122013Slide64.png

Architecture 1

GeoffreyMalafsky12122013Slide65.png

Architecture 2

GeoffreyMalafsky12122013Slide66.png

Ad Hoc Topics

GeoffreyMalafsky12122013Slide67.png

 

Delivering Agile Data

 


Phasic Systems Inc has developed a unique agile methodology and core technology suite, DataStar, to solve the toughest data and management problems organizations face. DataStar is a cost-effective and efficient tool which uses agile business-centric modeling and integration. Agility focuses on producing value quickly instead of following structured methodologies. Our agile technology, together with our facilitated methodology, supporting products and integration, delivers the agile facilitated solutions other companies simply can’t. It is easy to learn, easy to use and can handle every business variation using simplified standards-based semantics. 

As consultants and developers to many large organizations, we understand that the problem most organizations face is their inability to identify, access, retrieve and utilize data effectively to support business operations, reporting and decision-making. This inability stems from the fact that their data is dispersed over a vast number of disparately developed systems and applications. Since data continues to proliferate (producing multiple versions with conflicting meanings) in response to business needs, the management of data has become an intractable problem. Without new approaches and supporting tools, organizations cannot:
  • achieve the advantages of the Cloud
  • capitalize on their data
  • conduct data standardization and consolidation
  • provide streamlined services to ensure customer satisfaction
  • generate accurate reports to maintain regulatory compliance
  • develop successful business strategies

DataStar Discovery 
 

DataStar Discovery is an Agile data governance and standards methodology and management system. It uses a simplified, guided semantic framework of intuitive everyday concepts (e.g. organization, process, technology) to rapidly and easily capture critical information from documents, web sites, and people to build and deploy common data standards and models. It gets to the core knowledge and decisions necessary to identify, analyze, and define what key business terms mean in their use case context, and how to implement them in a high performance enterprise data model. It is this use case context (the semantics) that is missing from other approaches and which almost universally prevents common data standards from being adopted, practical, or both. Learn more ... 

"We accomplished more in three months with DataStar than we did in the past three years." Dept of Navy CIO

DataStar Unifier 

DataStar Unifier connects Agile data warehousing and integration, i.e. Aggregation, directly to governance with automatic use of managed data models, allowed values, master code lists, and XML schemas. Existing integration tools and database servers can be used but made many times more Agile, cost efficient, and adaptive to changing business requirements using PSI's patent-pending technology for Corporate NoSQL. Corporate NoSQL is a PSI invention that blends the best of traditional data modeling with the significant speed and flexibility benefits of NoSQL. Traditional data model techniques are rigid, brute force endeavors requiring complete knowledge of business requirements, data standards, and use case contexts before designing entities and attributes. The aggregated data is automatically loaded into existing database servers (e.g. Oracle, Microsoft, IBM) or a NoSQL system (e.g. Cassandra, Mongo). Learn more ...


Consolidated & corrected 1 million HR records in Oracle as the only trustworthy data – VADM K. Moran (Ret)
Corporate NoSQL™ 

PSI created Corporate NoSQL™ to take the best of traditional data modeling and the new NoSQL movement. It is object-based, uses business terminology, easily handles all semantic variations and updates without any changes to logical or physical models, directly implements governance concepts in models and physical storage, and can be as ‘dimensional’ or ‘normalized’ as desired. Each table comes from governance concepts and there can be multiple tables for concepts with important variations, such as “customer” with variations for a common enterprise definition, a legacy or proprietary application, and different departments. Since Corporate NoSQL uses NoSQL vocabulary to map attributes to data, there is no limit to how many variations can be used and how often they are changed without any changes to logical or physical models. Each table has keys and a type-value pair where the ‘type’ element holds the vocabulary instance and the ‘value’ element holds its instance value. Learn more ... 
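
As an editor's illustration only (not PSI's actual schema; the table and column names are hypothetical), the key plus type-value layout described above can be sketched like this: each row carries the entity key, a 'type' naming the governed vocabulary term, and a 'value' holding its instance value, so new variations become new rows rather than model changes.

    import sqlite3

    # Hypothetical sketch of a key + type/value table as described above.
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE customer_data (
            customer_key TEXT,   -- entity key from governance
            source       TEXT,   -- e.g. enterprise, legacy app, department
            type         TEXT,   -- governed vocabulary term
            value        TEXT    -- instance value for that term
        )
    """)
    con.executemany(
        "INSERT INTO customer_data VALUES (?, ?, ?, ?)",
        [
            ("C001", "enterprise", "customer_name", "Acme Corp"),
            ("C001", "legacy_app", "cust_nm", "ACME CORPORATION"),
            ("C001", "enterprise", "customer_status", "active"),
        ],
    )
    # Adding a new vocabulary variation is just another row; the model never changes.
    for row in con.execute(
            "SELECT source, value FROM customer_data WHERE type = 'customer_name'"):
        print(row)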

Only consolidated data approved as trustworthy - VADM K. Moran (Ret)

Emails

Geoffrey Malafsky, 12/23/2013 1:00 PM

Feb 4 is good.

I think that is an excellent topic. Within that topic, I think we are technically and strategically aligned but are bound by respective focal points. That is, I state that a required precursor to the semantic tech is missing thus... while someone else believes either this precursor is not important or is assumed to be available.

Most of the data on slide 48 is FOUO at this time. I cannot distribute it but I may be able to walk through some examples in a web conference, or even better, coffee shop discussion. Some of it I can show since it is open source:  FedBizOpps and FPDS (former has RFPs, latter has procurement award data).  The key data that really opened up the solution and got DoD CIO excited was cross-correlating to the actual contracts and invoices per item per vendor per purchasing office.

Little to no correlation in raw data. DataStar performed the analysis to show the factual lack of alignment and led to Corporate NoSQL Common Data Model with integration and correlation. This is the crux of my belief that the really big problems with HC.gov are yet to even be recognized, and the semantic layers cannot possibly do this rationalization since the knowledge of how to merge, modify, eliminate data values is unknown by the people managing the source systems.

Since I always like to close with an example, I had breakfast this morning with someone who is a CIO of a major Fed agency. Simply shaking his head in disbelief that he cannot get sub-agencies to even acknowledge the serious problems in their data values while they want to spend much more money on ERPs that everyone agrees are flailing. He sums it up by saying no one wants to deal with the hard challenge so it is easy to spend money throwing a fad at the situation, whether that happens to be ERP, NoSQL, Hadoop, semantics, Informatica, or whatever.

I think the technical communities, and especially anyone actually working on large systems like HC.gov, have an obligation to provide ground truth as to the prerequisites and scope boundaries of each technology and not just be happy they were selected to play.

Brand Niemann, 12/23/2013 10:08 AM

Geoff, Thank you for your comments. I agree this is not ready for a broadcast, but that is up to Eric.

I would like to see the data sets mentioned in slide 48 on Information Sources. Are they available?

I think we should start with this:

What Went Wrong with the Obamacare Web Site, and How Can It Be Fixed? and Why the First Rollout of HealthCare.gov Crashed, an Architectural Assessment, by Eric Kavanagh, Inside Analysis, and Kees van Mansom, Be Informed, Healthcare.gov Demo Video

that I posted as a Possible Future Presentation at: http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group

since you have done the critique and I have done the data science leading to the Be Informed pilot by Kees van Mansom.

Are you available on February 4th at 7 p.m.?

I am copying my co-organizer and host, Kate, to see if that date is available and Kees Mansom to see if he is available (or we can just run his video). Brand

Geoffrey Malafsky, 12/21/2013 6:12 PM

I am glad we were able to calm things down and have a bit of a dialogue. Certainly I learned a lot about Jim and Kingsley's perspectives, and Kingsley's technology.

For a possible joint broadcast by you two, I still think there is a foundational piece missing. That is, Kingsley provides very good information below but I do not think he is addressing the actual issue I raised and I think I spent a fair amount of time articulating and giving examples for. Maybe this is just a symptom of the poor knowledge transfer in email. However, if we cannot focus on the same specific issue then the broadcast is not useful and will degenerate into each tribe saying things their respective tribes already know and want to hear (hmmm, sounds a lot like modern politics). 

I completely defer to you two but I think you will have to solve this challenge before the broadcast or a joint panel can be meaningful.

Kingsley Idehen, 12/21/2013 3:18 PM

Co-references via owl:sameAs relations, indirect co-references via owl:InverseFunctionalProperty relations, etc., are examples of relation semantics that RDF enables one to incorporate into data representation (be it a transient or materialized view). The beauty of RDF is that these semantics don't sit confined in a silo (application, service, etc.).

Data needs to flow, and the flow should be comprised of packets that include human- and machine-comprehensible relation semantics. Sadly, it's taking way too long for everyone to understand that this is the meat of the matter, and that RDF is simply an open specification for addressing these issues in a uniform manner. There's no reason to be draconian about RDF if its essence is properly explained, which simply hasn't been the norm, hence the many misconceptions and poor applications etc.
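
For readers unfamiliar with the relation semantics mentioned above, a small hedged sketch (editor-supplied; rdflib and the example URIs are purely illustrative, not part of this exchange): an owl:sameAs statement records that two identifiers from different sources denote the same entity, and any RDF-aware consumer can read and act on that relation.

    from rdflib import Graph

    # Hypothetical sketch: two identifiers declared to co-refer via owl:sameAs.
    # The URIs are invented; requires the rdflib package.
    ttl = """
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix ex:   <http://example.org/hr/> .
    @prefix navy: <http://example.org/navy/> .

    ex:sailor42  owl:sameAs  navy:person_00042 .
    ex:sailor42  ex:rating   "IT2" .
    """

    g = Graph()
    g.parse(data=ttl, format="turtle")

    # Any consumer of the graph can discover and follow the co-reference.
    for s, p, o in g:
        print(s, p, o)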

Our break came when I realized there was no technical solution to this problem.

There are technical solutions to these problems. It's just data.

Also, there was no practical business process solution.

Again, there are, once the applications are constructed with a proper understanding of what data actually is. The trouble is that applications are constructed as silos from the onset, a majority of the time.

Many policies, and projects had come and gone without effect. Indeed, the creator of the still used Functional Areas and Functional Data Managers in the Navy works for PSI, Tim Traverso. I took a big step back and thought about how I could capture the requisite knowledge from reluctant humans. I treated the problem as a chemical kinetics situation and realized I could play one facet of the situation against another and adapt on the fly if one went in direction A vs B. Thus, our technology grew.

The solution had to be the same as engineering any complete complex system: continuous knowledge collection through nudging awareness and understanding forward incrementally while using technology to lower cycle time from knowledge capture to operational system to a few days with management approval at every step. This could not be achieved with the existing stable of tools and methods, whether downscale data modeling or upscale semantic modeling because the knowledge needed would be blocked. In addition, standard database tools and especially integration were so opaque and complicated that their cycle time was very high and their logic divorced from the upstream business rules that were continuously evolving.

The end goal was to merge the data such that accurate, auditable values for core data entities were produced, used, and approved at the same time that the large fraction of non-core values were gracefully handled but explicitly defined and scoped. This is what we did and do now. Note that Enterprise Architecture as a field has failed to deliver this solution.

Applications have dominated the enterprise and in doing simply perpetuated fundamental problems associated with the silo mindset.

Once this occurs, I am happy to send the data to a semantic framework, traditional DB, or typical BI tool. That is not my concern but what we do enhances anything that requires good, accurate, meaningful data to do its functions.

I am sensing the description of yet another silo since there's no mention of a standard for data representation endowed with machine- and human-comprehensible relation semantics. Data is all about entity relationships and the relations they facilitate. That's what data is all about.

Thus, our approach to virtualization is to not put an access and brokering layer on top of existing sources as we do not want to propagate bad data leading to bad decision making.

Virtualization doesn't imply propagation of misleading data. It's the mechanism for homogenizing disparity.

Rather we want to use simplified and highly visible knowledge in an intuitive model that connects capture and representation to execution in existing servers while lowering the need for costly hardware. Hence Corporate NoSQL's use of NoSQL within traditional table structures.

NoSQL doesn't deal with use of URIs to facilitate data flow and representation in a uniform manner. Two NoSQL systems are incapable of transmitting data between themselves, for the reasons I just mentioned.

A couple of months ago a DeNodo person told me after one of the broadcasts I did with Eric that every single one of their clients struggles with data rationalization using their virtualization tool.

One size doesn't fit all. Never has and never will. A failed product doesn't define an entire domain of specialization.

Kingsley

Geoffrey Malafsky, 12/20/2013 6:14 PM

Excellent ideas.

Clarity on our primary use case (we do have a virtualization engine in our platform using Corporate NoSQL as the model structure and automatic synchronization to data models, and code dictionaries enforced into ETL logic):

We are not dealing with views on top of extant sources. We are dealing with semantically adjudicating wide disparities in value spaces for supposedly similar data elements. Simple use case is the example of National Stock Number with and without prefix/suffix. More challenging is the following real world example:

- Navy HR systems have multiple records per sailor labeled as active
- values within many elements in these records are widely variable including job (called rating and rate), seniority, credentials, etc.
- dates on these records overlap so there is no clear choice among them as to which one is accurate
- reverse engineering cause and actions on these records showed decades of not poor processes but an overt decision to keep all bad data
- no documentation exists on what, why, how, who to do about it
- 10 years and >$400M (not including $850M on DIMHRS, which addressed this issue but never made it to touching real data values) with many consultants, tools de Gartner, and no improvements whatsoever
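
A hedged, editor-supplied sketch of the kind of profiling check that surfaces the first two problems listed above (the column names and records are invented; pandas stands in for whatever profiling tool is actually used): group by sailor, count records flagged active, and test for overlapping date ranges.

    import pandas as pd

    # Hypothetical sketch: find sailors with multiple, overlapping "active" records.
    records = pd.DataFrame([
        {"sailor_id": "S001", "status": "active", "rating": "IT2",
         "start": "2010-01-01", "end": "2012-06-30"},
        {"sailor_id": "S001", "status": "active", "rating": "IT1",
         "start": "2011-03-01", "end": "2013-01-15"},  # overlaps the first record
        {"sailor_id": "S002", "status": "active", "rating": "HM3",
         "start": "2012-05-01", "end": "2013-12-31"},
    ])
    records["start"] = pd.to_datetime(records["start"])
    records["end"] = pd.to_datetime(records["end"])

    active = records[records["status"] == "active"].sort_values(["sailor_id", "start"])
    # More than one active record per sailor is already a red flag ...
    dupes = active.groupby("sailor_id").filter(lambda g: len(g) > 1)
    # ... and overlapping date ranges mean there is no obvious authoritative record.
    dupes["overlaps_prior"] = dupes.groupby("sailor_id")["end"].shift() > dupes["start"]
    print(dupes[["sailor_id", "rating", "start", "end", "overlaps_prior"]])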

Our break came when I realized there was no technical solution to this problem. Also, there was no practical business process solution. Many policies, and projects had come and gone without effect. Indeed, the creator of the still used Functional Areas and Functional Data Managers in the Navy works for PSI, Tim Traverso. I took a big step back and thought about how I could capture the requisite knowledge from reluctant humans. I treated the problem as a chemical kinetics situation and realized I could play one facet of the situation against another and adapt on the fly if one went in direction A vs B. Thus, our technology grew.

The solution had to be the same as engineering any complete complex system: continuous knowledge collection through nudging awareness and understanding forward incrementally while using technology to lower cycle time from knowledge capture to operational system to a few days with management approval at every step. This could not be achieved with the existing stable of tools and methods, whether downscale data modeling or upscale semantic modeling because the knowledge needed would be blocked. In addition, standard database tools and especially integration were so opaque and complicated that their cycle time was very high and their logic divorced from the upstream business rules that were continuously evolving.

The end goal was to merge the data such that accurate, auditable values for core data entities were produced, used, and approved at the same time that the large fraction of non-core values were gracefully handled but explicitly defined and scoped. This is what we did and do now. Note that Enterprise Architecture as a field has failed to deliver this solution. Once this occurs, I am happy to send the data to a semantic framework, traditional DB, or typical BI tool. That is not my concern but what we do enhances anything that requires good, accurate, meaningful data to do its functions.

Thus, our approach to virtualization is to not put an access and brokering layer on top of existing sources as we do not want to propagate bad data leading to bad decision making. Rather we want to use simplified and highly visible knowledge in an intuitive model that connects capture and representation to execution in existing servers while lowering the need for costly hardware. Hence Corporate NoSQL's use of NoSQL within traditional table structures.

A couple of months ago a DeNodo person told me after one of the broadcasts I did with Eric that every single one of their clients struggles with data rationalization using their virtualization tool.

Kingsley Idehen, 12/20/2013 5:59 PM

You describe what I see as a classic enterprise data virtualization challenge. RDF Views over RDBMS data sources and SOA web services is a challenge. The problem (from my vantage point) is that Virtuoso hasn't really been thrown at this problem. For starters, this functionality isn't part of our Open Source Edition, and rightly so, as the issues are technically challenging ++

We have the following baked into our Virtual DBMS engine:

1. location aware query cost optimization that works over ODBC and JDBC connections
2. vectorized query execution at the core engine level and over the ODBC or JDBC channels (both of these standards offer this capability, but it isn't generally implemented by RDBMS vendors; we also make ODBC and JDBC drivers for all the major DBMS engines + ODBC two-way bridges)
3. transient and materialized RDF views over ODBC and JDBC data sources
4. different kinds of replication.

We are going to showcase more of this in 2014 as we've been a little preoccupied integrating column-wise storage into the product. That and vectorized execution are why we will be going beyond our 150 Billion Triples benchmark record to 500 Billion triples. [1][2].

Links:

[1] http://bit.ly/14ULX2F -- 150 Billion Triples Benchmark Report
[2] http://bit.ly/11pPq6F -- Old report covering how we bulk load data into the LOD Cloud
[3] http://bit.ly/13fnIbr -- SPARQL and SQL query editor that includes the ability to query ODBC sources attached to Virtuoso (Blue Links are live i.e., resolve to the Web and Green Links only resolve locally within DB)
[4] http://bit.ly/18ovowq -- Querying DBpedia from SQL using SPASQL (SPARQL inside SQL) .

Kingsley

Geoffrey Malafsky, 12/20/2013 3:34 PM

Accepted and cooling down from me as well.

But, to introduce some more information about the basis of my view on EIW, I just left a meeting this morning at DODCIO with people involved in dealing with the remnants of EIW and whose tenure overlapped DW's. Straight from them is (paraphrasing without editorial comment): EIW failed; it only showed what we already knew from DMDC reports and did so with much higher labor and system costs without any additional analytical or reporting benefit. They had information that the EIW approach was not addressing real data and information needs and was avoiding the tough challenges of incompatible data and large scale organizational resistance.

In fact, for my own ongoing work with DODCIO on rationalizing data from myriad program, financial, and asset inventory systems one major issue is how to circumvent the ubiquitous organizational resistance, and what I call civil disobedience, whereby the critical data necessary to merge is hidden. So, my technology is addressing this by using a common model with simple, clear semantic definitions of each entity but one in which I welcome extensions as a way to remove the ability of these groups to use standard civil disobedience excuses like we don't have resources, we have to do it this way, yada yada. This is a key ingredient of our approach and the PSIKORS model. Then, I cross-correlate data across data sources (much like evidence extraction) and move the combined data into the common model (in Corporate NoSQL - our tech). While this lacks the richer computable semantics using RDF and other tools, it adds in the critical capability to get human knowledge extracted, viewed, adjudicated, and implemented while removing most of the typical organizational impediments. That is, I seek to solve the overall challenge that includes humans, and trade off technology for this.

In fact, I came up with this approach by cherry-picking parts of approaches from DARPA and other R&D programs like Rapid Knowledge Engineering, your many fine programs, as well as techniques from my research days, where a little friction at the outset can be channeled to produce faster, more useful, and more widely accepted insights from a group of strong-willed, highly knowledgeable people.

Jim Hendler, 12/20/2013 2:50 PM

Jeff, understood, I was commenting on the earlier scaling issue - email doesn't always have answers in serial...  But the main point I'm making, which the others have too, is your original (and later) attacks are against a strawman that doesn't exist ... I'll bow out now and let these cooler heads prevail - I just note who threw the first gratuitous stone (at EIW) 

Geoffrey Malafsky, 12/20/2013 2:01 PM

I defer to Eric.

Jim, the silly response about SPAWAR is profoundly ludicrous.

The other issues are not in contention. Get off your iPhone and buy a real computer with a larger screen.

Eric did a good job of pointing out the very different issues; you are reacting to a non-issue.

Eric Kavanagh, 12/20/2013 1:34 PM

Yikes!

Seems we're talking about two areas of application here: 1) enterprise information management, which is Geoffrey's focus; and, 2) consumer technology, which has a vastly broader audience but decidedly different requirements.

Perhaps we should do a research-oriented Webcast in 2014, something like this:

A Defining Moment for Semantics? The Innovators Explain!

Brand and I could co-host the event, which would maybe last 75 minutes and feature 10-minute presentations by 3-4 experts, then some moderated discussion about the granular details.

We'd be happy to publish at InsideAnalysis.com, and maybe write something for Techopedia as well, with whom we've been working closely these days.

Let me know what y'all think...

Merry Christmas & Happy Holidays to you all!

Eric

Jim Hendler, 12/20/2013 9:22 AM

Yeah, guess SPAWAR is a pretty big deal and those are impressive numbers - Google's knowledge graph is only working at the petabytes-per-day level, Facebook has only a little more than a billion users - so maybe we do know a little about size

  What I agree with you on is that the mid-90s view of logical AI doing sound and complete reasoning wouldn't work at this level - in fact, my DARPA program was based on that. There is still a strand in the sem web community looking in this direction, but most of us have moved on to scalable solutions, heuristics, partial correctness, lightweight ontologies, etc.

  You are maybe not incorrect in your views of the 20-year-old technology, but you missed the past decade - which was built to a large degree to handle the info sharing problems that were reported a few years later in the 9/11 report

Geoffrey Malafsky, December 19, 2013 6:27 PM

Excellent response, thank you.

I do not disagree with any of your premises. I only differ on whether, for the use case and total lifecycle being addressed, the overhead of RDF is worth it. In what I am doing, the answer was no.

Indeed, because of performance issues I stopped transferring data among service interfaces in XML and now use solely JSON. Same data payload (meaningful info), but about 2x smaller transfer size. For total transactional latency, this is important.
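
As a rough back-of-the-envelope check of that size difference, here is a small Python sketch (the record and its field names are invented, and the exact ratio obviously depends on the payload and nesting):

import json
from xml.etree.ElementTree import Element, SubElement, tostring

record = {"personId": 12345, "lastName": "Smith", "payGrade": "GS-12", "command": "SPAWAR"}

json_bytes = json.dumps(record).encode("utf-8")           # JSON payload

root = Element("person")                                   # equivalent XML payload
for key, value in record.items():
    SubElement(root, key).text = str(value)
xml_bytes = tostring(root)

print(len(json_bytes), "bytes as JSON vs", len(xml_bytes), "bytes as XML")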

My platform, DataStar, allows management, correlation, analysis and editing of any number of data models, glossaries, code dictionaries, and business models (in the PSIKORS metamodel) as the key feature of rationalizing data. That is, most places do not even know what artifacts they have, nor do they have accurate versions of them. I just went through a major win for a client where DataStar automatically compared 7 versions of their major company data warehouse and identified the dissimilarities and the points of assumption that were false. The results were detailed and easy to understand, so they were immediately shown to the senior managers, who could make adjudication decisions on the spot. This is what we enable that is missing from almost all current approaches. We then convert that business decision into implementable data models (using Corporate NoSQL) that can go into existing DB servers since it is a methodology. But, I have to move the models from the server to the client and render them in a browser with JS, so any extra processing or transmission load is a big problem. I have unfortunately become very familiar with the many paths to memory leaks in IE.
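
The flavor of that model comparison can be sketched in a few lines of Python (the table and column names are hypothetical, and this is not DataStar's algorithm, just the kind of dissimilarity listing described):

# Two hypothetical versions of a warehouse model, as table -> column sets
v6 = {"CUSTOMER": {"CUST_ID", "CUST_NAME", "REGION_CD"},
      "ORDERS":   {"ORDER_ID", "CUST_ID", "ORDER_DT"}}
v7 = {"CUSTOMER": {"CUST_ID", "CUST_NAME", "REGION_CODE"},
      "ORDERS":   {"ORDER_ID", "CUST_ID", "ORDER_DT", "STATUS_CD"}}

for table in sorted(set(v6) | set(v7)):
    dropped = v6.get(table, set()) - v7.get(table, set())
    added = v7.get(table, set()) - v6.get(table, set())
    if dropped or added:
        # Flag columns that changed between versions for human adjudication
        print(f"{table}: only in v6 {sorted(dropped)}, only in v7 {sorted(added)}")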

Thus, without a doubt RDF could have done something useful here but the front end tools and the back end storage would make the process too opaque to get the understanding and decision making by the necessary people, and communicate the result in such a way that the IT group agrees. This is how we were able to garner 100% unanimous agreement in their steering committee which I had never seen before in either the IT realm, or the Nanotech/MEMS realm from whence I come.

I also agree with you that for Identity Mgmt, the additional information in RDF could be very useful. Yet, for other reasons, I have gone the other route. Every transaction in DataStar is checked multiple times both in the browser and on the web server and then again on the app server for correctness of the request and also consistency of the sessionID. If the latter does not match what I store (encrypted) for each user's login, DataStar sends a message and forces a logout. This is to maintain security at each transaction and to guard against unexpected commands buried in text strings. I can answer your next question: I do not use JS eval at all for this reason.
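
The shape of that per-transaction session check can be illustrated with a short Python sketch (the store and token names are hypothetical; this is not DataStar's code, just the pattern of re-verifying the session identifier at each tier and forcing a logout on mismatch):

import hashlib
import hmac

# Hypothetical server-side record of the (hashed) session token issued at login
SESSION_STORE = {"jdoe": hashlib.sha256(b"token-issued-at-login").hexdigest()}

def session_is_consistent(user: str, presented_token: str) -> bool:
    """Re-check the presented session token against the stored one."""
    expected = SESSION_STORE.get(user)
    if expected is None:
        return False
    presented = hashlib.sha256(presented_token.encode()).hexdigest()
    return hmac.compare_digest(expected, presented)   # constant-time compare

if not session_is_consistent("jdoe", "token-issued-at-login"):
    raise PermissionError("Session mismatch: force logout")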

I agree wholeheartedly with your assertion about misuse of RDF (as well as other technologies). This is the crux of my critique. I am, and have been, an excited supporter of the promise of semantic technologies. But, I have seen it misapplied and over-touted to the point that it falls into the category of "typical IT fad", of which there are many. IT as a field seems unable to shake off weak engineering and stop squandering large sums of money and time chasing approaches that, if we were talking about an electrical engineering signal processing application, would never make it past the first meeting. I am flummoxed by all the dollars going to Informatica when most of the ETL is pt A --> pt B.

Finally, you capture the very different perspectives and use cases we deal with:

"That gives you what? Do all the entities relevant in the domain have HTTP URIs that resolve to their descriptors? Is this data reusable by any tool capable of making an HTTP GET without loss of semantic fidelity? "

Those are not relevant issues for creating and managing the enterprise-scale structured data environment to feed applications (e.g. DIMHRS), reports, and analytics. It is the combining, existence, and use of the data itself that is the focus. It is getting the data together that is the main hurdle, not using it once it is together. To use the same example, DIMHRS failed because it could not even begin to combine data.

This is within each service, and within each command in each service, let alone across the DoD. Even today, the Army has 6 ERPs and the number is growing. At least 2 of these act as data brokers under the naive idea that if they do not know how to combine data from 4 ERPs, #5 will make it easier.

A specific example to close with: the National Stock Number is supposed to be a well-governed standard across logistics and ERP systems, and is sometimes used as a primary key. The actual values stored in different systems vary, without documentation as to why, such that sometimes there is a prefix, sometimes a suffix, and sometimes not. Hence, ERP and logistics data is impeded from being meaningfully integrated, and the ERP programs are bleeding large dollars. No semantic technology can solve this because there is no knowledge of what to do about it, when to apply rules, etc.
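
To make the cleanup step concrete, here is a minimal Python sketch; the sample values and the "first 13-digit run" rule are hypothetical illustrations, not the actual systems' formats. It strips local prefixes/suffixes and reduces each stored value to the canonical 4-2-3-4 digit NSN form:

import re

def normalize_nsn(raw: str) -> str:
    """Reduce a stored National Stock Number variant to its canonical
    13-digit form (4-digit FSC + 9-digit NIIN), dropping local prefixes/suffixes."""
    digits = re.sub(r"\D", "", raw)          # keep digits only
    match = re.search(r"\d{13}", digits)     # assumed rule: first 13-digit run is the NSN
    if not match:
        raise ValueError(f"No 13-digit NSN found in {raw!r}")
    nsn = match.group()
    return f"{nsn[0:4]}-{nsn[4:6]}-{nsn[6:9]}-{nsn[9:13]}"

# The same item as it might be stored in three hypothetical systems
for raw in ["5935-00-123-4567", "NSN5935001234567", "5935001234567-A1"]:
    print(raw, "->", normalize_nsn(raw))

The hard part, of course, is exactly what the paragraph above says: knowing which rule applies to which source, which no amount of reasoning machinery supplies on its own.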

Kingsley Idehen, Thu 12/19/2013 4:51 PM

On 12/19/13 10:57 AM, Geoffrey Malafsky wrote:

> Seems like we agree far more than we disagree. Also, I fully accept your more accurate definition of RDF. My point has to do with the latter parts, which you also comment on. Namely, whether we discuss semantics or any other technology, it actually does a disservice to exuberantly tout capabilities that the technology just cannot deliver.

The thing is, RDF is a specification for implementing solutions that address the profoundly difficult issue of heterogeneous data integration. Unfortunately, narratives went awry, in many different directions, so you ended up with the misconception that RDF was a magic bullet and RDF/XML the only kind of gun compatible with this magic bullet.

The real RDF, i.e., the W3C specification that enables the uniform construction of webby structured data, endowed with machine- and human-comprehensible entity relationship semantics, is extremely capable of dealing with the challenge of heterogeneous data integration, and it scales up to the Internet.

> Semantic technology for all of its promise and power cannot provide the guaranteed, auditable, exact record x among all the others based just on reasoning algorithms. Never going to happen.

There's a lot packed into the paragraph above. I'll try to break this down a little. Today, privacy is the biggest challenge the world faces; the problem is really about:

1. identity

2. identifiers

3. structured data (e.g., identity card representation)

4. authentication protocols (i.e., protocols that use the structured representation of an identity card to verify the card's subject or identity principal)

5. trust webs (clouds or networks).

Basically, how do enterprises and/or service providers constrain access to resources without compromising privacy?

What is privacy? Basically, self calibration of one's vulnerabilities in a given realm.

Through RDF it is possible to address the identity challenge without conflating any of the items outlined in the list above. Basically, using existing standards, it is possible for people and organizations to create verifiable identifiers that can be used to craft resource access controls (basically what NIST refers to as "Attribute-Based Access Control" or ABAC). Likewise, actual identifier verification is based on loosely coupled authentication protocols, all leading up to the ability to add Trust Webs/Networks/Clouds to what we already have: the Internet (computer network cloud), World Wide Web (document network cloud), and Linked Open Data (data network cloud).

It is all possible because RDF brings logic into actual data representation [1][2]

[1] http://slidesha.re/18CtxGK -- Blogic Presentation by Pat Hayes

[2] http://bit.ly/16EVFVG -- Venn diagram Illustrating intersections of Identifiers, Linked Data, and RDF.

More responses further down ......

> It can be augmented with heuristics and empirical rules, which is fine, but then should not be passed off as a wow factor result of probabilistic reasoning. My background is as a Nanotechnology research scientist who then went on to be a tech expert for troubled DARPA, ONR, DOE programs spanning novel solid state devices to real-time information fusion. I have watched and joined the excitement of the promise of computational reasoning as a major aid to language based computing. I still do. But, for large serious data environments where the source data is not well defined, cannot be easily made to align, and is then encumbered with significant organizational resistance, semantic technology is a very poor choice because it starts with the assumption that the knowledge and authority exists to create well defined entities and the mappings from sources to these entities. This rarely if ever occurs and is the primary reason the semantic approach has failed in almost all implementations going back to the '90s in Intel.

> Some of the very expensive failures along the way, using a little to a lot of semantics:  DoD DIMHRS (no data rationalization whatsoever, $850M);  NGA GeoScout (all talk, we showed up the Lockheed results with an Innovision IB prototype we built using the PSIKORS model, $1B); FBI Trilogy (actually a rather simple use case to solve but failed due to the reality of organizational civil disobedience and a naive view that IT can solve that, $250M);  plus many others. 

> Soon to be wasting money: NIEM as a standard. No chance of success. No one in DoD commands, with the exception of the always well defined and documented METOC community, has any idea of how to map their sources to any type of model let alone an abstraction. I have seen this data very closely. This is why Wiznosky's program could not succeed and why it should have been framed from the beginning to address this issue rather than the Revelytix (smart folks) SPARQL framework. This could have followed but the initial issue is rationalization.

You can't judge any specification on what manifests in a single implementation. Too often, one failure associated with RDF (rightly or wrongly so) ends up being an indictment of the entire realm. I never understood this unique treatment of RDF -- bearing in mind my own experience with many data-related standards over a 20+ year timeframe.

> I am on a project in DoDCIO now where I successfully brought together and rationalized data from 10 disparate completely misaligned sources  using PSIKORS and Corporate NoSQL. No way anything using RDF and reasoning could do that.

You can't make blanket statements like that, especially when they are also inaccurate and quite trivial to refute, with the right RDF product in hand.

If I told you an RDF DBMS could toy with 50 Billion triples and be accessible to the Web for anyone to hammer, I bet you wouldn't believe that claim. If I told you it could happen at the 150 and even 500 Billion scales I bet you wouldn't believe that either.

Memory and CPUs are cheap today. DBMS technology has advanced at the storage and query execution levels.

> It came down to using PSIKORS as a simple way to figure out (no one told me) the context of various data (not from data dictionaries) and then I implement this in a new semantic model (Corporate NoSQL) and put the results into pretty pictures in QlikView in a web system.

That gives you what? Do all the entities relevant in the domain have HTTP URIs that resolve to their descriptors? Is this data reusable by any tool capable of making an HTTP GET without loss of semantic fidelity?

> The one place I disagree is with your statement: "There are no scenarios associated with modern data representation, access, integration, and management where reasoning over entity relationship semantics isn't a virtue. "

> The reason I disagree is the issue of deterministic data collated from disparate sources.

That's a zillion times easier to handle if you are equipped with the right RDF tool. For instance, in Virtuoso, reasoning and inference can be constrained. You can control the behavior, be it forward-chaining or backward-chaining. These are issues we implemented on day one, once we decided RDF provided an open standard for exposing what we had been developing.

> I do not want any fungible results at all.

You are but one entity :-) The trick is having a solution that caters for heterogeneity at the data, user, and operator levels. It is all part of the challenge.

> I want completely auditable results with traceable logic and documented business rules of --both-- values and merging (aka  integration).

An engine that makes sophisticated use of RDF is your friend here. We have such technology.

> An example is my corporate year end taxes. No reasoning over entity relationships is desired nor allowed.

Then how are you going to grok the realm known as the tax code? You have to understand the terms and the semantics that connect the terms in the collection of statements that are ultimately processed by a database.

> I have had enough conversations with the IRS. This distinction is completely lost on the managers (technical included) who blindly go forth chasing either traditional approaches that have way more than enough examples to show they do not work (Informatica for all), or applying semantic technology without doing what every scientist and engineer should do every day and on every task they have: what is the scope of cost vs performance for various approaches to solving challenge x of N? I can solve almost any challenge in multiple ways. While not elegant, I can do all of the reasoning executed on RDF with simple metadata fields stored in JSON using a set of high level code routines.

You can use JSON to construct data that still bears RDF model based relation semantics. The data format doesn't matter one bit. It's about the relation semantics and their visibility to data processors that understand RDF.
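
For instance, a plain JSON document can carry RDF relation semantics simply by adding a JSON-LD @context. A minimal sketch, assuming Python with the rdflib library (version 6 or later, which parses JSON-LD natively) and using example.com identifiers purely as placeholders:

from rdflib import Graph

# An ordinary JSON document; the @context maps its keys to IRIs
doc = """
{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": {"@id": "http://xmlns.com/foaf/0.1/knows", "@type": "@id"}
  },
  "@id": "http://example.com/people/alice",
  "name": "Alice",
  "knows": "http://example.com/people/bob"
}
"""

g = Graph()
g.parse(data=doc, format="json-ld")   # recover the triples from the JSON
for triple in g:
    print(triple)

The keys stay ordinary JSON for any consumer that ignores the context, while an RDF-aware processor recovers the same triples it would get from Turtle or RDF/XML.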

> Indeed, I wrote quantum mechanical wave function analysis routines in Fortran to back up my experiments. The question is whether this is the best use of time and money and produces the best result.

> I have a client in a large financial and risk services corporation that is deploying a brand new computational application for daily risk assessments as an Excel application deployed in Sharepoint. Not how I would do it, but it works.

All the stuff in Excel can be transformed into RDF model based Linked Data. That can be in the form of conventional ETL or through views over the data. You just need the right tools and the right people. The RDF community isn't homogeneous; there are a variety of profiles within the community, as is always the case.

> While I do not disagree with you, I do want to clarify use cases: "What's important is the flow of data across heterogeneous databases (or stores). You need to create views over data sources and then orchestrate how data flows across through these data sources using applications and service layering. "

> Not always.

Data has to flow. That's the key to individual and/or enterprise agility. Imagine if man didn't develop vital skills such as:

1. language

2. literacy

3. cooking.

> I do not care about flow if my goal is to centralize newly quality corrected, semantically integrated data. Indeed, I do not want to orchestrate flow if I do not trust the accuracy and meaningfulness of the data.

If you have a web of trust, linked data, and RDF based semantics, the concerns above are zapped, just like that. This isn't speculation, you just have to use the right tools.

> I specialize in data rationalization. I think that statement above is where I critique both semantic technology and traditional DB approaches: the false assumption that the existing data is accurate and trustworthy as is, or that a group of people with enough time and  money know how to define rules to make it so. This is not the case.

There is no such thing as perfect data, what we need is the ability to construct data with context-fluidity in mind. This is where RDF is excellent and unrivaled.

Kingsley

Geoffrey Malafsky, Thu 12/19/2013 1:11 PM

Fine, I do know about DIMHRS. I was developing my technology at the time on contract to SPAWAR and very successfully merging all the grossly discordant Navy HR data for 1 million people, which had never been done before. We had technical interchange meetings with the DIMHRS technical leads, who both admitted and were stymied by the fact that the basic impediment was the complete inability to even know how to merge data from different sources, such as 3 versions of PeopleSoft, let alone do it within the program. It is this lack of appreciation for the depth of the underlying real challenges that makes semantic technology unserious. I am standing on the tracks reading email and in no danger of being run over.

Dennis Wisnosky, Thu 12/19/2013 1:05 PM

I agree 100% with Dr. Hendler with respect to both Semantic Technology and what he says about the US Government.  I was most happy to get out of the Beltway for exactly those reasons.

Re EIW, Jim is the only person in this discussion close enough to have any idea what was going on. My name is not even spelled correctly by the commenters who know nothing about EIW and who are also misinformed about DIMHRS.

EIW did not fail. Every 90 days, EIW delivered exactly what it promised at a cost that was de minimis. Every experiment that we conducted with Semantic Technology worked exactly as planned. Today, the DoD Business Enterprise Architecture (BEA) continues to follow the path of the Business Enterprise Ontology Framework (BEOF) that was modeled on the EIW experience. This is the only practical way to achieve the Business Architecture federation that is necessary for the Department to finally figure out how to aggregate data with assured provenance.

Every bashing of this technology that I have seen is from people who have either a personal axe to grind or a substantial investment in their own idiosyncratic methods and/or technology. Therefore, there is an enormous bias not to change.

Within the Beltway, I fought this attitude every day for five years. I have literally never seen this in the truly commercial semantic technology applications that Jim and Kingsley have mentioned.

My advice to those who continue to naively whisper about the past rather than listen to the howling whistle of the future is to get on this train or prepare to be run over.


Geoffrey Malafsky, 12/19/2013 1:03 PM

Thanks, but actually I do know about it, have worked with some, and have been working on the actual large data for some time, deep into the data itself, the business analysis portion, and all sorts of technical approaches and tools. Planning for the future is not the same as making it happen. Also, there are plenty of lessons learned about prior attempts for a serious engineering effort to do better. This does not happen in Govt IT and did not happen in EIW. Govt actually does accept new ideas and spends a lot of money on them. This is not news.

Intelligence, possibly with your help, tried to use ontological approaches in the '90s and did not get far. Indeed, many large Govt programs like the notion of semantic techs but do not do the due diligence around them. I suggest that you, as a highly respected expert, could do more to guide them to realistic uses and learn from the many past failures.

Additionally, I say again if you spent more time reading you would see that I carefully described the use cases I was addressing and the factual examples of what failed and why.

I have been a technical reviewer and made corrective plans on technologies such as info fusion, MEMS, and Nanotechnology, encountering world-class scientists each time. Every time, the verbiage and personalities are the same and could be reruns of each other. Our current discussion fits right in. However, once the facts (pesky critters, those) are enumerated and a no-fault corrective strategy disclosed, everyone is happy and glad progress is made.

Indeed, I did one such project at Los Alamos/Sandia Labs on their Nanotechnology program for nuclear bomb fuses with me versus >80 scientists, some of whom I knew well from my own research days at NRL.

Incredible resistance. But, I marched them through the facts and existing lessons learned about what they were proposing and why it was neither reasonable nor supportable. As usual, they grew to acknowledge this and then made the dramatic leap to restructuring that part of the program, making everyone, including Congressional oversight, happy. The issue they faced was a deep technical issue regarding atomic-scale stiction and how they expected to solve it. Short answer: atoms are like toddlers and do not stay put despite your best efforts. The moral is that even the best people get caught up in belief over science sometimes, and a focused, objective, helpful critique can benefit everyone.

One example that stuck with me, and that I used in my workshop Monday, may interest you.

I was brought into the DARPA program Small Unit Operations Situational Awareness System because it was flailing. They had reached a point of spending a great deal of political capital bringing in operational people and Grey Beards, and the nascent technology was not doing well. My task was to find out what was going wrong with the information technology portion and produce a corrective plan. There were many other aspects to the program, such as autonomous peer-to-peer parallel computing, mobile power management, and node health monitoring and adjustments. The program was based on someone's proposition that a non-linear Kalman filter could be used for real-time, power-sensitive, peer-to-peer fusion with much greater precision of the result. One of the excellent companies often used in DARPA programs was highly resistant to me looking at their code, but that is what I did. The fix was easy to find and propose. The section of their code where the non-linear matrix was supposed to be calculated was commented out with "TBD" and a simple linear extrapolation inserted. The team was flustered, embarrassed, but not chastised.

So, I continue to say you are not grasping the points being made.

Eric Kavanagh, 12/19/2013 12:02 PM

Gents, thanks for the lively discussion!

On a tangential but interesting note regarding Open Government, I have an immediate need for speakers at a conference in Abu Dhabi, April 28-30. A gentleman named Richard Kerby from the United Nations is helping the government there put this conference together, and I said I'd submit a handful of names and bios to him today. Is anyone interested? All expenses are paid, but no speaker fees.

I think Brand knows that I was very active several years ago, post-Katrina, in promoting the concept of transparency in federal spending. So for me, this is a great opportunity to keep the fire burning. This is the piece I wrote for Public Manager back in 2006, which followed a couple pieces at TDWI.org: http://www.thepublicmanager.com/articles/docs/TPM_Kavanagh.pdf

Geoffrey Malafsky, 12/19/2013 11:37 AM

Thanks for the reply Jim, but I think you are overlooking the major issue. It is not whether semantics are being used somewhere successfully. That is not in contention. The entire discussion, well articulated, is about large scale disparate structured data environments and their needs to rationalize existing data into enterprise assets. I suggest you spend some time reading when you get to a laptop/desktop.

I have similarly spent a lot of years educating people, including those irrationally exuberant about any number of technologies that do not work in some misapplied, expensive, and high-profile cases.

Brand Niemann, 12/19/2013 11:18 AM

Thanks for the good discussion. Jim Hendler's response is included below.

Geoffrey Malafsky, 12/19/2013 10:58 AM

Seems like we agree far more than we disagree. Also, I fully accept your more accurate definition of RDF. My point has to do with the latter parts, which you also comment on. Namely, whether we discuss semantics or any other technology, it actually does a disservice to exuberantly tout capabilities that the technology just cannot deliver. Semantic technology for all of its promise and power cannot provide the guaranteed, auditable, exact record x among all the others based just on reasoning algorithms. Never going to happen. It can be augmented with heuristics and empirical rules, which is fine, but then should not be passed off as a wow factor result of probabilistic reasoning. My background is as a Nanotechnology research scientist who then went on to be a tech expert for troubled DARPA, ONR, DOE programs spanning novel solid state devices to real-time information fusion. I have watched and joined the excitement of the promise of computational reasoning as a major aid to language based computing. I still do. But, for large serious data environments where the source data is not well defined, cannot be easily made to align, and is then encumbered with significant organizational resistance, semantic technology is a very poor choice because it starts with the assumption that the knowledge and authority exists to create well defined entities and the mappings from sources to these entities. This rarely if ever occurs and is the primary reason the semantic approach has failed in almost all implementations going back to the '90s in Intel.

Some of the very expensive failures along the way, using a little to a lot of semantics:  DoD DIMHRS (no data rationalization whatsoever, $850M);  NGA GeoScout (all talk, we showed up the Lockheed results with an Innovision IB prototype we built using the PSIKORS model, $1B); FBI Trilogy (actually a rather simple use case to solve but failed due to the reality of organizational civil disobedience and a naive view that IT can solve that, $250M);  plus many others.

Soon to be wasting money:  NIEM as a standard.  No chance of success.

No one in DoD commands, with the exception of the always well defined and documented METOC community, has any idea of how to map their sources to any type of model let alone an abstraction. I have seen this data very closely. This is why Wiznosky's program could not succeed and why it should have been framed from the beginning to address this issue rather than the Revelytix (smart folks) SPARQL framework. This could have followed but the initial issue is rationalization.

I am on a project in DoDCIO now where I successfully brought together and rationalized data from 10 disparate, completely misaligned sources using PSIKORS and Corporate NoSQL. No way anything using RDF and reasoning could do that. It came down to using PSIKORS as a simple way to figure out (no one told me) the context of various data (not from data dictionaries) and then I implement this in a new semantic model (Corporate NoSQL) and put the results into pretty pictures in QlikView in a web system.

The one place I disagree is with your statement:

"There are no scenarios associated with modern data representation, access, integration, and management where reasoning over entity relationship semantics isn't a virtue. "

The reason I disagree is the issue of deterministic data collated from disparate sources. I do not want any fungible results at all. I want completely auditable results with traceable logic and documented business rules of --both-- values and merging (aka integration). An example is my corporate year end taxes. No reasoning over entity relationships is desired nor allowed. I have had enough conversations with the IRS. This distinction is completely lost on the managers (technical included) who blindly go forth chasing either traditional approaches that have way more than enough examples to show they do not work (Informatica for all), or applying semantic technology without doing what every scientist and engineer should do every day and on every task they have: what is the scope of cost vs performance for various approaches to solving challenge x of N? I can solve almost any challenge in multiple ways. While not elegant, I can do all of the reasoning executed on RDF with simple metadata fields stored in JSON using a set of high level code routines. Indeed, I wrote quantum mechanical wave function analysis routines in Fortran to back up my experiments. The question is whether this is the best use of time and money and produces the best result.

I have a client in a large financial and risk services corporation that is deploying a brand new computational application for daily risk assessments as an Excel application deployed in Sharepoint. Not how I would do it, but it works.

While I do not disagree with you, I do want to clarify use cases:

"What's important is the flow of data across heterogeneous databases (or stores). You need to create views over data sources and then orchestrate how data flows across through these data sources using applications and service layering. "

Not always. I do not care about flow if my goal is to centralize newly quality corrected, semantically integrated data. Indeed, I do not want to orchestrate flow if I do not trust the accuracy and meaningfulness of the data.

I specialize in data rationalization. I think that statement above is where I critique both semantic technology and traditional DB approaches: the false assumption that the existing data is accurate and trustworthy as is, or that a group of people with enough time and money know how to define rules to make it so. This is not the case.

Kingsley Idehen, 12/19/2013 10:12 AM

On 12/18/13 4:36 AM, Brand Niemann wrote:

> Dennis, Kingsley, and Jim, Please see the email from Geoffrey Malafsky below and provide your responses in preparation for a panel discussion that I would like to have for the new Federal Big Data Working Group Meetup.

All,

My responses (please excuse typos, this is my only opening to reply) are inline...

> 1) RDF et al: I frequently hear people, and I inferred the same from your questions, misunderstand the role of RDF in semantics. RDF is a format for storing information and as such is comparable to XML, UMLx XMIy, CWM, XML, and JSON.

RDF isn't a format. That's the consequence of a poor fragment of a broken narrative that went viral, so to speak. RDF is an acronym for Resource Description Framework. As a framework, it naturally comprises a number of items, specifically:

1. Data Model -- covers model theory and abstract syntax for structured data representation

2. Data Formats -- covers concrete syntaxes (notations) for actual content creation and serialization.

>   It is what goes into the formatted structures that determines whether it is useful, meaningful, or relevant to actual semantics but the storage by itself and the specification of the formatting rules is not semantics.

Again, the idea that a structured data representation format (e.g., RDF/XML) uniquely determines the semantic fidelity discernible from document content is just another misconception that arose from the same poor narratives that positioned RDF as an interchange format.

RDF model theory establishes the following:

1. entity denotation mechanism -- i.e., use of IRIs to denote each member of an entity relationship

2. entity relationship semantics -- i.e., use of the basic subject->predicate->object statement structure to represent how entities are related (connected/linked) via a predicate

As a consequence of the basic building blocks outlined above, RDF provides a mechanism for building basic terminology vocabularies and/or full-blown worldview expressing ontologies. Basically, you have the same RDF statements driving:

1. entity type (or class) definition -- so called TBox

2. relation type definition -- so called RBox

3. facts about instances of the aforementioned types above -- so called ABox.
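
As a small illustration of those three roles sharing the same statement form, here is a sketch in Python using the rdflib library (the example.com namespace and the Employee/worksFor terms are made-up placeholders):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.com/ns#")
g = Graph()

# TBox: class (entity type) definition
g.add((EX.Employee, RDF.type, RDFS.Class))
# RBox: relation (property) definition
g.add((EX.worksFor, RDF.type, RDF.Property))
g.add((EX.worksFor, RDFS.domain, EX.Employee))
# ABox: facts about an instance of the type above
g.add((EX.alice, RDF.type, EX.Employee))
g.add((EX.alice, EX.worksFor, EX.AcmeCorp))
g.add((EX.alice, RDFS.label, Literal("Alice")))

print(g.serialize(format="turtle"))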

At this juncture, I would like to assume that irrespective of assigned label (be it "RDF" or "something else"), what I am outlining is something fundamental to how observation is captured and expressed for reuse i.e., this is what data representation has always been about, in one guise or another.

I continue ..

>   Nor is using RDF an entry requirement to be a semantic technology.

"RDF" the literals aren't important. The model theory associated with the W3C's RDF specification is what outlines an open standard (uniform) for structured data representation that's endowed with human- and machine-comprehensible entity relationship semantics. This is all about enabling us recreate digital sentences using patterns that predate silicon based computing. It's basic sentence structure and grammar tweaked for the the World Wide Web.

> Thus, questions about RDF should be framed in terms of what storage format is the best choice for the intended use of the information and technology, not as is often done, in regard to whether the technology is or is not a semantic technology. This view is false.

RDF isn't a storage format. You can produce structured data bearing documents, based on the RDF model using a plethora of markup languages:

W3C standards:

1. Turtle

2. JSON-LD

3. RDF/XML

4. RDF/JSON.

Others:

1. CSV

2. JSON

3. TriX

4. ASN.1

5. Anything else that enables 3-tuple representation.

> 2) Similarly, the use of SPARQL, XQuery, SQL, or indexed search engines (such as the very good semantic search engine dtSearch that I use) is also a matter of preference determined solely by the intended use and required performance characteristics of the technology.

We need to clarify a few things with regard to query languages. A query language is something implemented by distinct parties:

1. client side query tools

2. server side data management systems.

When data is created and persisted to documents, it ultimately needs to be aggregated, indexed, and stored in a database management system or store. Thus, when structured data bearing documents are created using RDF there is a natural need for a query language based on the same fundamental model theory.

SPARQL is the query language for performing create, read, update, and delete (CRUD) operations against RDF based structured data. It serves a role similar to what SQL offers to relational databases that manage data (entity relationships) modeled as records in a table (i.e., tables represent relations). Naturally, it goes further than SQL because in RDF relations take the form of 3-tuple (triple) statements that share a common predicate; some refer to this as a "relational property graph" (or graph) variant of a relational database, with regard to database management system types.
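
A tiny sketch of that CRUD role, using Python's rdflib as a stand-in triple store (the example.com terms are placeholders; a server-side engine would do the same over SPARQL Protocol endpoints):

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.com/ns#")
g = Graph()
g.add((EX.alice, RDF.type, EX.Employee))
g.add((EX.alice, EX.payGrade, Literal("GS-12")))

# Read: a SPARQL SELECT over the triples, analogous to a SQL SELECT over rows
results = g.query("""
    PREFIX ex: <http://example.com/ns#>
    SELECT ?person ?grade
    WHERE { ?person a ex:Employee ; ex:payGrade ?grade . }
""")
for person, grade in results:
    print(person, grade)

# Update/Delete: SPARQL Update rewrites the matching statements in place
g.update("""
    PREFIX ex: <http://example.com/ns#>
    DELETE { ?p ex:payGrade "GS-12" }
    INSERT { ?p ex:payGrade "GS-13" }
    WHERE  { ?p ex:payGrade "GS-12" }
""")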

> There is absolutely nothing inherently preferable about SPARQL.

SPARQL combines performance and relation semantics exploitation on a scale that's simply impossible (yes, I said that) to achieve with a pure SQL database management system. I can demonstrate this claim to you on a whim, whenever, using our Virtuoso DBMS engine, which supports both SQL and SPARQL.

> If the technology is primarily focused on semantic reasoning and uses RDF, then SPARQL is a good choice.

There are no scenarios associated with modern data representation, access, integration, and management where reasoning over entity relationship semantics isn't a virtue.

>   If it is using graph theory based networks, then SPARQL is a bad choice. If it focused on deterministic governance controlled data analysis and I/O, then SPARQL is not even relevant.

RDF model theory is a kind of graph theory. Triples are directed graphs. I don't buy the claim above one bit. I have a live 50 billion+ triples instance of Virtuoso that's accessible to the entire planet for ad-hoc querying using SPARQL or even SQL (since we can even co-mingle both languages). You can perform backward-chained inference, transitive closures, and all kinds of graph analytics (topic clusters, distances that constitute relationships, and more).
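
For a flavor of the transitive-closure point, here is a minimal SPARQL property-path query run in Python with rdflib (the tiny org chart and the partOf predicate are invented for the example; on a server-side engine such as Virtuoso the same SPARQL would run against the live store):

from rdflib import Graph, Namespace

EX = Namespace("http://example.com/org#")
g = Graph()
g.add((EX.branch, EX.partOf, EX.division))
g.add((EX.division, EX.partOf, EX.command))
g.add((EX.command, EX.partOf, EX.service))

# The '+' property path walks partOf transitively to every ancestor
results = g.query("""
    PREFIX ex: <http://example.com/org#>
    SELECT ?ancestor WHERE { ex:branch ex:partOf+ ?ancestor }
""")
print([str(row.ancestor) for row in results])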

> What is relevant is the total lifecycle of performance from initial load, indexing (if used), updates, transaction based access time, joins (god forbid), etc.

What's important is the flow of data across heterogeneous databases (or stores). You need to create views over data sources and then orchestrate how data flows across through these data sources using applications and service layering.

Thus far, I've deliberately made no mention of Linked Data because RDF and Linked Data aren't the same thing. RDF is a powerful mechanism for creating Linked Data which is itself a powerful solution for data virtualization across heterogeneous data sources, when the right product is used for the job at hand.

Remember, RDF requires entity denotation using IRIs, which means different kinds of identifiers can be used to craft RDF based structured data. In a sense, RDF mandates that statements (sentences) are constructed from words. On the other hand, RDF based Linked Data mandates that statements are created from terms (i.e., word specializations that combine denotation and reference).

HTTP scheme based URIs provide a low-cost and high-impact (due to their ubiquity) mechanism for creating terms, and when applied to RDF based structured data representation (i.e., triple based statement construction) you end up with a Linked Data Network (Cloud or Graph) abstraction atop the current World Wide Web (or internal HTTP based networks).

> This is where I do not think MarkLogic was a good choice for HC.gov because the use case is deterministic with real-time computations of analytic results (financial status).

No comment about MarkLogic.

> This use case is a very poor fit for backends that are in any way non-deterministic, such as any semantic reasoning or probabilistic engine, or high transactional latency storage structures and engines such as standard RDBMS engines using standard relational data models.

Not so if you have a hybrid DBMS like Virtuoso.

> Note that it is the combination of model and engine that determines efficacy for high load deterministic performance.

Yep!

> I have always known that ML has modified from pure XML and Xquery to get their performance up. HC.gov merely shows that their improvements are not sufficient for the required environment.

XQuery isn't the tool for this kind of job. I'll just leave it at that.

>   But the same is true of a traditional Oracle model and engine.

Yes.

> HC.gov should not be using anything associated with semantic reasoning or RDF as these are entirely poor fits for the technology requirements.

Here I disagree. You can have a hybrid DBMS engine. We know it works because we built one and it underlies key aspects of the massive Linked Open Data Cloud.

> 3) Data rationalization is about the values and the continuous visible auditable control of these values in large heterogeneous data environments. This is the primary challenge facing almost all enterprise scale environments.

Yep!

> The notion of building a new system occurs in some large programs, such as HC.gov and the ill-fated DIMHRS, but this is not really a practical option in most places due to the combination of cost and organizational authority zones. In many cases that have tried to build the grand new data system, they find out that they cannot actually improve the data in the new system because they cannot rationalize the data from the old systems due to lack of knowledge on all the tentacles of values and relationships among values that are largely undocumented. Hence, you end up with another $100M data system or poor data.

Yes.

You need a Virtual DBMS engine: one that combines models and is designed to work with a plethora of data formats, access protocols, and query languages. This is why we embarked on the Virtuoso project circa 1998. This is what led us to the W3C and its Semantic Web effort, i.e., they provided standards for implementing the solution we envisaged a long time ago, based on our backgrounds (DBMS engine and Data Access Middleware specialists).

> Finally, a personal truth be told moment. I frequently state that there is no evidence of large scale operational success of semantic technology over the past 20 years.

That's inaccurate. We have the following:

1. massive Linked Open Data Cloud (really Big Data based on RDF)

2. global explosion of Open Data across governments

3. growing exploitation at the enterprise level -- we have many enterprise customers taking advantage of this technology.

> Many attempts; large investments; many proud public claims; but no actual production use in heterogeneous data environments.

There are few publicly available success stories, but that doesn't mean they don't exist. That said, I get your point, since I am frustrated with the overall appreciation and exploitation of Virtual DBMS technology driven by RDF based Linked Data. The tendency is to produce a lot of toy tools that collapse once challenged, and in the process the concept as a whole gets written off.

The problem here is that we have a misguided expectation that technology that solves serious problems should cost $0.00. The pursuit is disguised as a request for Open Source editions of these products, knowing full well that the "free beer" (as opposed to "free speech") part of the Open Source meme is really what most pursue.

> This includes what is widely considered a boondoggle failure of Wiznosky's Semantic Web effort EIW which was killed (with my overt cheering) by DCMO in early 2013.

I'll leave Dennis to speak to that; I have no idea what happened there :-)

Links:

1. http://lov.okfn.org/dataset/lov/ -- Linked Open Vocabularies Cloud

2. http://bit.ly/1cbSTNb -- Linked Data Illustrated using Linked Data (all the links in this illustration are live)

3. http://bit.ly/1jlHsq9 -- Semantically enhanced Linked Open Data illustrated

4. http://bit.ly/1c1KwCn -- G+ note explaining how easy (contrary to general misconceptions) it is to create and deploy RDF based Linked Data

5. http://bit.ly/WmKlJ0 -- SPARQL and Reasoning examples

6. http://bit.ly/OEBP7N -- Various posts about reasoning that include live examples

7. http://lod.openlinksw.com -- live 50B+ triples LOD Cloud Cache we maintain (a Virtuoso instance)

8. http://lod.openlinksw.com/sparql -- SPARQL endpoint

9. http://dbpedia.org/sparql -- DBpedia SPARQL endpoint.

--

Regards,

Kingsley Idehen               

Founder & CEO

OpenLink Software

Company Web: http://www.openlinksw.com

Personal Weblog: http://www.openlinksw.com/blog/~kidehen

Twitter Profile: https://twitter.com/kidehen

Google+ Profile: https://plus.google.com/+KingsleyIdehen/about

LinkedIn Profile: http://www.linkedin.com/in/kidehen

Jim Hendler, 12/18/2013 5:10 AM

I've spent too many hours educating folks like this to waste a whole lot more. With Facebook and Google using semantic models, with SPARQL becoming the standard for graph queries, and with triple stores now scaling on commodity clusters, those of us who use these technologies are at no disadvantage, and often at an advantage. I recently read that govt procurement lags as much as ten years behind commercial markets, and these technologies are just now making it in the commercial markets, so I figure in another decade I can check back in and see how these folks are playing catch-up against the technologies of choice.

  JH

PS: I should note that this is not meant to endorse in any way the MarkLogic stuff, or for that matter to argue against it - it's just one player in a growing ecosystem.

Brand Niemann, 12/18/2013 4:36 AM

Dennis, Kingsley, and Jim, Please see the email from Geoffrey Malafsky below and provide your responses in preparation for a panel discussion that I would like to have for the new Federal Big Data Working Group Meetup.

Thank you, Brand

Dr. Brand Niemann

Director and Senior Data Scientist

Semantic Community

http://semanticommunity.info

703-268-9314

Brand Niemann, 12/18/2013 2:34 PM

That is why I suggested you find some others who share your views so we have balance.

Geoffrey Malafsky, 12/18/2013 9:29 AM

Yes, I am always interested in doing these things. I asked DODCIO about it and he responded with what I expected: in this environment there cannot be formal participation with attribution to the DoD - legal does not allow this except in a few cases anymore. But, he could be a drive-by individual with knowledge and experience to share who happens to work in DODCIO.

The second issue that troubles him, and I assume Eric as well, is that there is a chance that this will degenerate into the high priests of semantics holding up the infidels for community stoning. He does not want to be involved in that situation. I enjoy such encounters, although it would not be a useful expenditure of time. So, I defer to you on whether this can be arranged as a point-counterpoint situation that can inject some much-needed new perspective into the semantic community, or whether I would have to rent a costume and come dressed as a lamb for slaughter.

Brand Niemann, 12/17/2013 4:51 PM

Geoff and Eric, Thank you for inviting me to attend and for answering my questions at the beginning, because I had to take my wife for a pre-op appointment and wasn't sure I would be able to return, which I wasn't.

I appreciate your detailed explanation below, and what comes to mind is that this would make an excellent panel session for our new Federal Big Data Working Group Meetup, which I have invited Eric to participate in using your Healthcare.gov critique.

http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group

http://www.meetup.com/Federal-Big-Data-Working-Group

I have been working on this as well with Be Informed and we just produced a video: http://vimeo.com/82093897 http://semanticommunity.info/Healthcare.gov

So if you are willing, I would like to share your thoughts with a few experts from Semantic Web-Semantic Technology and invite you to suggest someone else who shares your views (DoD, Navy, etc.), or even holds a different view from yours and the SW-ST experts', to participate in a panel discussion, so we can have an email discussion before the panel, say in early February.

Are you interested?

Brand

Dr. Brand Niemann

Director and Senior Data Scientist

Semantic Community

http://semanticommunity.info

703-268-9314

Geoffrey Malafsky, December 17, 2013 8:55 AM

Brand, Thank you for attending the morning portion of the workshop. I think you would have found the afternoon most enlightening, as it dealt with the real use case behind many of the questions you asked. But, I want to clarify some issues so we continue to understand each other's perspective on semantics and how we are each using it to solve challenges in the data field.

1) RDF et al: I frequently hear people, and I inferred the same from your questions, misunderstand the role of RDF in semantics. RDF is a format for storing information and as such is comparable to XML, UMLx XMIy, CWM, XML, and JSON.  It is what goes into the formatted structures that determines whether it is useful, meaningful, or relevant to actual semantics but the storage by itself and the specification of the formatting rules is not semantics. Nor is using RDF an entry requirement to be a semantic technology. Thus, questions about RDF should be framed in terms of what storage format is the best choice for the intended use of the information and technology, not as is often done, in regard to whether the technology is or is not a semantic technology. This view is false.

2) Similarly, the use of SPARQL, XQuery, SQL, or indexed search engines (such as the very good semantic search engine dtSearch that I use) is also a matter of preference determined solely by the intended use and required performance characteristics of the technology. There is absolutely nothing inherently preferable about SPARQL. If the technology is primarily focused on semantic reasoning and uses RDF, then SPARQL is a good choice. If it is using graph theory based networks, then SPARQL is a bad choice. If it focused on deterministic governance controlled data analysis and I/O, then SPARQL is not even relevant. What is relevant is the total lifecycle of performance from initial load, indexing (if used), updates, transaction based access time, joins (god forbid), etc. This is where I do not think MarkLogic was a good choice for HC.gov because the use case is deterministic with real-time computations of analytic results (financial status). This use case is a very poor fit for backends that are in any way non-deterministic, such as any semantic reasoning or probabilistic engine, or high transactional latency storage structures and engines such as standard RDBMS engines using standard relational data models. Note that it is the combination of model and engine that determines efficacy for high load deterministic performance. I have always known that ML has modified from pure XML and Xquery to get their performance up. HC.gov merely shows that their improvements are not sufficient for the required environment. But the same is true of a traditional Oracle model and engine. HC.gov should not be using anything associated with semantic reasoning or RDF as these are entirely poor fits for the technology requirements.

3) Data rationalization is about the values and the continuous visible auditable control of these values in large heterogeneous data environments. This is the primary challenge facing almost all enterprise scale environments. The notion of building a new system occurs in some large programs, such as HC.gov and the ill-fated DIMHRS, but this is not really a practical option in most places due to the combination of cost and organizational authority zones. In many cases that have tried to build the grand new data system, they find out that they cannot actually improve the data in the new system because they cannot rationalize the data from the old systems due to lack of knowledge on all the tentacles of values and relationships among values that are largely undocumented.

Hence, you end up with another $100M data system or poor data.

Finally, a personal truth be told moment. I frequently state that there is no evidence of large scale operational success of semantic technology over the past 20 years. Many attempts; large investments; many proud public claims; but no actual production use in heterogeneous data environments. This includes what is widely considered a boondoggle failure of Wiznosky's Semantic Web effort EIW which was killed (with my overt cheering) by DCMO in early 2013.

Eric Kavanagh, 12/13/2013 4:43 PM

Hi Brand,

Thanks so much for your interest in our Data Science Workshop this coming Monday. I'm just touching base to make sure you have all the relevant details. And will anyone else be coming with you? We're trying to make sure we get the right amount of food for breakfast, lunch and a snack.

Monday, Dec. 16 -- 9 a.m. till 5 p.m.

Mason Enterprise Center

4031 University Drive

Room 122A

Fairfax, VA 22030

Feel free to bring any examples of issues you're facing right now, including actual data models, Visio diagrams or anything else. Part of the day will focus on solving real-world problems for attendees.

Thanks again, and let me know if you have any questions!

Cheers!

Eric Kavanagh

512.426.7725 (cell)

512.847.7020 (office)

http://twitter.com/Eric_Kavanagh

Eric Kavanagh, 12/12/2013 1:53 PM

Hi Brand, I have a few free passes left for an all-day workshop in Fairfax on Monday, Dec. 16. My close colleague Dr. Geoffrey Malafsky will lead a team of data scientists for an in-depth discussion of how to reconcile disparate information systems. Dr. Malafsky earned his stripes by saving a $300 million data reconciliation project several years ago using the most elegant methodology I've seen. Here are details for the event: http://insideanalysis.com/event-registration/

Use PSI100 (letters PSI number 100) as a coupon code to avoid the fee. Topics to be covered will include a look at how to resolve the data quality issues facing Healthcare.gov, how to reconcile any number of data models or information systems; plus a workshop in which participants are encouraged to share their most daunting challenges, then watch Malafsky's team identify, distill and rationalize key data assets in tens of minutes.

Feel free to pass this along to a colleague who might be interested. If you'd like to see an event like this in your area -- or even as an on-site -- please just let me know. We're planning 2014 right now.

Thanks!

Eric Kavanagh
512.847.7020 (office)
512.426.7725 (cell)
http://twitter.com/Eric_Kavanagh
