Table of contents
  1. 1.0 Background 
  2. 2.0 The Data Reference Model 
    1. 2.1 Public History  
      1. Figure 1: DRM Version 1.0 Three Part Structure (DRM 2.0 Below)
    2. 2.2 Key Decisions of the DRM 2.0 Final Writing Team
      1. 2.2.1 Audience for the DRM 2.0 
      2. 2.2.2 The Primacy of Data Sharing  
      3. 2.2.3 The Technical Approach  
  3. 3.0 Deficiencies in the DRM 2.0 
    1. 3.1 Defects in the Abstract Model and its Data Descriptions
      1. 3.1.1 Massive Data Collections  
      2. 3.1.2 Schema Mismatch
      3. 3.1.2.1 Syntactic Mismatch 
        1. 3.1.2.2 Entity-Attribute Mismatch 
        2. 3.1.2.3 Attribute-Value Mismatch 
        3. 3.1.2.4 The Role of Relationships 
        4. 3.1.2.5 Summary 
      4. 3.1.3 Data Context: Topics, Taxonomies and Ontologies  
  4. 4.0 The Motivation for the use of "3" 
    1. 4.1 The DRM Version 3.0
      1. Figure 2: The Data Reference Model 3.0, Web 3.0, and SOAs
    2. 4.2 Service Oriented Architectures  
    3. 4.3 The Web 3.0 
  5. 5.0 Language and Semantic Interoperability 
    1. 5.1 WordNet
      1. 5.1.1 Nouns and their Relationships  
      2. 5.1.2 Verbs and their Relationships  
      3. 5.1.3 Other WordNet Word Relationships  
      4. 5.1.4 WordNet as the Lexical Ontology  
    2. 5.2 Well Formed Lexical Ontologies  
    3. 5.3 Suggestions on How It Should Be Done  
  6. 6.0 The Global Change Master Directory is the Model 
    1. Figure 3: Network of Topics and Directory Page Documents
  7. 7.0 Language Computer Corporation's Parsing Suite 
  8. 8.0 CYCORP and IKRIS: Knowledge Management is HERE! 
  9. 9.0 Conclusion 
  10. Footnotes 
    1. 1
    2. 2
    3. 3

Federal SICoP White Paper 3


 

Semantic Interoperability Community of Practice (SICoP), Best Practices Committee, Federal Chief Information Officers Council
 
Operationalizing the Semantic Web/Semantic Technologies: A roadmap for agencies on how they can take advantage of semantic technologies and begin to develop Semantic Web implementations.
 
Advanced Intelligence Community R&D Meets the Semantic Web (ARDA AQUAINT Program)
 
White Paper Series Module 3, June 18, 2007, Version 1.0
 
Executive Editors and Co-Chairs:
 
Brand Niemann, U.S. EPA, and SICoP Co-Chair
 
Mills Davis, Project10X, and SICoP Co-Chair
 
Principal Author:
 
Lucian Russell, Expert Reasoning & Decisions LLC (lucianrussell@verizon.net)
 
Contributors: Bryan Aucoin (baucoin@enterrasolutions.com)
 

1.0 Background 

At the request of Dr. Lucian Russell the Semantic Interoperability Community of Practice (SICoP) organized a special meeting on February 6th 2007 to consider the issue of "Building DRM 3.0 and Web 3.0 for Managing Context Across Multiple Documents and Organizations." The purpose of this workshop was to explore further the implications for the Data Reference Model (DRM) of the IKRIS presentation at the October 16th 2006 SICoP workshop: CWelty10102006.ppt
 
The acronym IKRIS stands for Interoperable Knowledge Representation for Intelligence Support. The presentation was given by Co-Principal Investigator Dr. Chris Welty: http://domino.research.ibm.com/comm/research_people.nsf/pages/welty.index.html
 
The presentation describes the IKRIS project, one of a number of unclassified projects funded over the last several years by the Advanced Research and Development Activity (ARDA) of the Intelligence Community. ARDA programs are now within the Disruptive Technology Office (DTO) of the Office of the Director of National Intelligence (DNI). The IKRIS project developed IKRIS Knowledge Language (IKL). This language can be used to translate among a number of different powerful knowledge representations. It encompasses First Order Predicate Calculus (ISO's Common Logic) and has the necessary extensions to include non-monotonic logic; it also admits some Second Order Predicate Calculus expressions.
 
Given that this higher level interoperable representation of knowledge was announced April 19th 2006, its capabilities were unknown to the writing team that produced the Data Reference Model (DRM) Version 2.0. The latter document was built upon an understanding of Computer Science that can, at best, be described as a 2004 baseline. The DRM Version 2.0 discusses the topics of Data Description, Data Context and Data Exchange (Chapters 3, 4 and 5), called standardization areas. It provides an Abstract Model (Chapter 2) that contains abstract Entities and Relationships among them, but shows that such Entities are distinctly allocated to the standardization areas.
 
The IKRIS Knowledge Language (IKL), however, creates the ability to specify DRM Entities and Relationships in a new, more powerful way, one that favorably changes the cost/benefit ratio of information sharing by orders of magnitude. This will lead to a new Abstract Model, but its details are as yet unknown. The workshop's goal was to initiate the process of identifying the elements of that new Abstract Model, which will in turn lead to a DRM Version 3.0.
 
The SICoP February 6th morning session was organized as a Special Conference to explore the implications of the existence of IKRIS. It was special because it brought together two members of the writing team of the DRM 2.0, the manager of a key government program whose artifacts were the basis for much of the substance of the DRM 2.0 guidance sections, and representatives of three of the world's outstanding research organizations:
 

Dr. Christiane Fellbaum, Princeton University: WordNet
Dr. John Prange, Language Computer Corporation
Dr. Michael Witbrock, Cycorp

In the session they discussed their work, which collectively opens up a new way of envisioning Artifacts and Services for Data Sharing.

2.0 The Data Reference Model 

If the idea of redefining the Abstract Model of the DRM seems "extreme", this history will show that the DRM 2.0 was already a change from DRM 1.0. Hence there is precedent.

2.1 Public History  

The writing team for the Data Reference Model Version 1.0 completed their work in December 2003 and the document was released in September 2004. It contained the notion of the three Standardization areas, shown in Figure 1:

Figure 1: DRM Version 1.0 Three Part Structure (DRM 2.0 Below)


Source: Expanding E-Government, Improved Service Delivery for the American People Using Information Technology, December 2005, pp. 2-3. http://www.whitehouse.gov/omb/budintegration/expanding_egov_2005.pdf

• Data Sharing

• Data Description

• Data Context

 
Together they defined the categories that would contain the artifacts and services to enable information sharing.
 
In the DRM 1.0 http://www.whitehouse.gov/omb/egov/documents/fea-drm1.PDF the following description (DRM 1.0 Page 4) of the three areas was provided:
 
• Categorization of Data: The DRM establishes an approach to the categorization of data through the use of a concept called Business Context. The business context represents the general business purpose of the data. The business context uses the FEA Business Reference Model (BRM) as its categorization taxonomy.
 
• Exchange of Data: The exchange of data is enabled by the DRM's standard message structure, called the Information Exchange Package. The information exchange package represents an actual set of data that is requested or produced from one unit of work to another. The information exchange package makes use of the DRM's ability to both categorize and structure data.
 
• Structure of Data: To provide a logical approach to the structure of data, the DRM uses a concept called the Data Element. The data element represents information about a particular thing, and can be presented in many formats. The data element is aligned with the business context, so that users of an agency's data understand the data's purpose and context. The data element is adapted from the ISO/IEC 11179 standard.
 
A project to complete the DRM Version 2.0 was initiated in 2005, and the resulting evolution of understanding is shown in the lower portion of Figure 1. In December 2005 the DRM 2.0 was released: http://www.whitehouse.gov/omb/egov/documents/DRM_2_0_Final.pdf. In this version the description had evolved to the point that the concepts were renamed Standardization Areas and were re-characterized. Also, it featured Communities of Interest (COI). The text states:
 
• Data Context is a standardization area within the DRM. A COI should agree on the context of the data needed to meet its shared mission business needs. A COI should be able to answer basic questions about the Data Assets that it manages. "What are the data (subject areas) that the COI needs? What organization(s) is responsible for maintaining the data? What is the linkage to the FEA Business Reference Model (BRM)? What services are available to access the data? What database(s) is used to store the data?" Data Context provides the basis for data governance with the COI. 
 
• Data Description is a standardization area within the DRM. A COI should agree on meaning and the structure of the data that it needs in order to effectively use the data.
 
• Data Sharing is a standardization area within the DRM. A COI should have common capabilities to enable information to be accessed and exchanged. Hence the DRM provides guidance for the types of services that should be provisioned within a COI to enable this information sharing." (DRM Version 2.0 Page 6)
 
There are differences. The Information Exchange Package survived as the Exchange Package. Data Content and Business Context were replaced by the more general Data Context and Data Description. The new DRM totally lacks the notion of the Data Element.

 

2.2 Key Decisions of the DRM 2.0 Final Writing Team

The text of the DRM Version 2.0 was on the one hand the product of a large collaborative effort over several months and on the other that of a small number of specialists collaborating for two weeks. The work of the former was documented in the DRM Wiki that tracked the evolution of the Standard. The work of the latter has hitherto not been made public. The team did not change the Abstract Model, nor the excellent work on examples (Sections 3.6, 4.6 and 5.6) provided by the Department of the Interior. Some information was consolidated, some was removed to Appendices, some was posted for reference on the Wiki and some descriptive information was re-written. As the final product may have deviated from some people's expectations, the reasoning underlying the key decisions is provided below.

2.2.1 Audience for the DRM 2.0 

In a conference call prior to the assembling of the final writing team the following principle was enunciated by Dr. Russell: To be a Federal Enterprise Architecture (FEA) standard the Data Reference Model had to be both a reference and a model.
 
1. For a document to be a reference it must, by means of a compare and contrast process, be usable to render a judgment as to whether a data artifact or a process to enable sharing is or is not in compliance with the standards in the document.
 
2. For the DRM's logical artifacts and their relationships to be a model they had to be an abstraction that was (a) simpler than their implementations but (b) common to all of them.
 
Once the team was assembled the first decision that had to be made was "What is the audience for the DRM?" If the document was to describe a reference and a model the audience had to be one that was interested in such a document and could use it. Hence, the DRM Version 2.0, Chapter 2, Section 2.1 - Target Audience and Stakeholders - states:
 
"The target audience for DRM 2.0 is:

  • Enterprise architects

  • Data architects"
 
That decision meant that management issues had to be addressed separately. This was the focus of a companion volume, the Data Reference Model Management Guide. The final editing for this volume was completed in the Winter of 2006, but the volume has not yet been released by the OMB.
 
The fact that the DRM 2.0 was aimed at a technical audience also explains the prominence of the Communities of Interest. If one looks at the Data Description segment of the Abstract Model, Figure 2-5 in Chapter 2 Section 2.3, it might appear that it was designed to describe Relational Databases on the one hand and files on the other. The Data Context segment would then define terms for a keyword search. Such an interpretation would be too limiting. The model is meant for all types of data, as is stated in Chapter 3 (Data Description), Section 3.3 Guidance:
 
"… the government's data holdings encompass textual material, fixed field databases, web page repositories, multimedia files, scientific data, geospatial data, simulation data, manufactured product data and data in other more specialized formats. Whatever the type of data, however, COIs specializing in them have developed within the government and external stakeholder organizations." (DRM 2.0 Page 20)
 
At the Feb 6th workshop the Global Change Master Directory project at NASA described a site with a nearly two-decade record of supporting successful data sharing among stakeholders for nearly 20 petabytes of data! The many agencies and their specialists who use this site obviously know what they are doing. A similar community is the Geospatial Community of Practice at http://colab.cim3.net/cgi-bin/wiki.pl?GeoSpatialCommunityofPractice. The Introduction and Guidance sections in Chapters 3, 4 and 5 of the DRM Version 2.0 were written to ensure that successful data sharing practices within the government would be allowed to continue.

 

2.2.2 The Primacy of Data Sharing  

The writing committee accepted the principle that, whatever was written about the artifacts mentioned in Data Description and Data Context, the role of such artifacts and their instantiations was to support a variety of services needed for effective Data Sharing. The details of how this was to be accomplished for any specific data collection were then left to the individual COIs. The services would be defined as needed by the COI, and the Data Description and Data Context artifacts would be generated so as to meet the needs of those services.

2.2.3 The Technical Approach  

The decision was made not to change the abstract model during November's writing session but the team recommended that it be reviewed later by a panel of experts in the relevant Computer Science disciplines. Where there were reservations about the adequacy of the model the team decided to address relevant issues by making changes in the descriptive wording of two Chapter Sections, Introduction and Guidance.
 
With respect to the Data Exchange section the key author, Bryan Aucoin, made the assertion that the concept of a document was sufficiently broad in his view to allow a complex inner structure to be present. This was a very important decision. In reality there is no such thing as a file containing data that has no structure. However, the term "Unstructured Data Resource" was part of the Abstract Model in the Data Description Chapter. The team agreed that Bryan's document could be mapped to either a Semi-Structured Data Resource or an Unstructured Data Resource. The distinction made in the DRM was that a Structured Data Resource was one whose structure was static and hence could be "factored out" into a Data Schema. This approach allows one to categorize as "unstructured" all data files whose structure (a) is defined depending on the context provided by the file's data, or (b) is known only to the application programs that are used to process them. Examples of type (a) files are the groups of files containing nuclear power plant safety analysis simulation data, e.g. NASTRAN input and output files. Examples of type (b) files are the various types of image files, identified by their unique suffixes. These are intended to be used by application programs that recognize image data, e.g. ".jpg" files.
 
The wording in the Data Context section was also carefully constructed to allow the word Topic room for interpretation. Although it looks similar to a simplistic "keyword", provision was made to allow it to be represented by a more complex artifact. This is because of a major ARDA/DTO funded effort to improve the lexical database WordNet to help with the automatic identification of meanings expressed by polysemous words in context.

3.0 Deficiencies in the DRM 2.0 

The DRM 2.0 reflected a gap between the needs of organizations that wished to share data and the availability of mechanisms to meet those needs. This is because the Basic and Applied Research areas of the underlying Computer Science disciplines were at the time deficient in results that would allow those needs to be addressed. Lacking a firm conceptual basis for mechanisms that would meet them, those needs were left unmet. The COIs' data specialists, however, were well aware of the defects within their domains, and hence were expected to deal with them as best they could. The defects are primarily in three areas:

 

3.1 Defects in the Abstract Model and its Data Descriptions

The Abstract Model is described as if it were a Structured Data Resource, using the well known concepts Entity, Relationship, Attribute and Data Type. This set of concepts is some 30 years old. Less well known, however, is that because they were initially developed to help document computer systems' databases, they have been found to be of limited value in supporting data sharing. Specifically, they do not help one to address three issues: (1) Large Data Collections, (2) Schema Mismatch and (3) the lack of linguistic precision with respect to Topics.

3.1.1 Massive Data Collections  

A technical solution that works for data collections that are gigabytes in size does not necessarily scale to data collections that are a million times larger. That is the size of the government's data collection; it is measured in petabytes (10**15 bytes). To access this amount of data entails developing a set of linked directories, each with its own set of abstract topics and embedded index pointers to other directories. Fortunately there is a template for managing such large collections of data: the Global Change Master Directory project, hosted by NASA and described in section 6 of this report. It has a network of Data Assets, Data Resources and Topics that have allowed it to address the data sharing needs of multiple government agencies and public users. The DRM wording allows it to be constructed, but as there are no explicit DRM concepts that reflect the index structure one has to infer the solution. This report brings attention to the GCMD as a template for data sharing. It is a highly significant best practice.
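The linked-directory idea can be sketched in a few lines of Python. This is an illustration only: the directory names, topics and assets below are invented, not drawn from the GCMD. Each directory carries its own topics and assets plus index pointers to other directories, so a large collection is navigated by topic rather than scanned whole.

```python
# Hypothetical directory network: topics, assets, and index pointers.
directories = {
    "climate": {
        "topics": {"temperature", "precipitation"},
        "assets": ["noaa_temp_grid"],
        "links": ["oceans"],          # index pointers to other directories
    },
    "oceans": {
        "topics": {"salinity", "temperature"},
        "assets": ["argo_float_profiles"],
        "links": [],
    },
}

def find_assets(topic, start, seen=None):
    """Follow index pointers, collecting assets from directories listing the topic."""
    seen = seen or set()
    if start in seen:
        return []
    seen.add(start)
    d = directories[start]
    hits = list(d["assets"]) if topic in d["topics"] else []
    for nxt in d["links"]:
        hits += find_assets(topic, nxt, seen)
    return hits

print(find_assets("temperature", "climate"))
# → ['noaa_temp_grid', 'argo_float_profiles']
```

The point of the sketch is that discovery cost grows with the number of directories traversed, not with the petabytes of data they index.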

3.1.2 Schema Mismatch

The implicit idea in Chapter 3 is that the Data Description, notably the Data Schema, should be an information-rich artifact that allows agencies to provision services (described in Chapter 5) which (a) discover and (b) enable access to identically structured data. It is further hoped that a broader spectrum of data can be made accessible. Section 3.2.1, "What is Data Description and Why is it Important", looks hopefully towards the future (DRM Version 2.0 Page 19):

Semantic Interoperability (1): Implementing information sharing infrastructures between discrete content owners (even with using service-oriented architectures or business process modeling approaches) still has to contend with problems with different contexts and their associated meanings. Semantic interoperability is a capability that enables enhanced automated discovery and usage of data due to the enhanced meaning (semantics) that are provided for data….    
    
To overcome problems of such interoperability for Structured Data Resources, however, one must confront the decades-old problem of Schema Mismatch. This is the condition that causes two Structured Data Resources to contain the same data but have it represented by two different incompatible schemas.
 
How difficult a problem is it to overcome Schema Mismatch? There are three sub-problems:

1. Syntactic Mismatch

2. Entity-Attribute Mismatch

3. Attribute-Value Mismatch

 

3.1.2.1 Syntactic Mismatch 

Syntactic mismatch is often thought of as contrasting date formats for attributes, or character strings vs. integers in a field like "Social Security Number", i.e. dashes in between number groups. There is a very complete description of all of these in the article by Won Kim and Jungyun Seo, "Classifying Schematic and Data Heterogeneity in Multidatabase Systems" (IEEE Computer 24(12): 12-18, 1991). The list is extensive, but the approach of using an XML database as an intermediate repository could be useful in this case. It also applies to data value representations, e.g. "1" vs. ".9999998", which would have to be handled by special rules.
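A minimal sketch of such normalization rules follows; the particular formats and the tolerance are illustrative choices, not taken from the Kim and Seo article.

```python
from datetime import datetime

def norm_date(s):
    """Reconcile two common date syntaxes into one canonical ISO form."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(s, fmt).date().isoformat()
        except ValueError:
            pass
    raise ValueError("unrecognized date syntax: " + s)

def norm_ssn(s):
    """Dashed character string vs. plain digit string."""
    return s.replace("-", "")

def values_match(a, b, tol=1e-6):
    """A special rule for value representations, e.g. "1" vs. "0.9999998"."""
    return abs(float(a) - float(b)) <= tol

assert norm_date("07/04/2006") == norm_date("2006-07-04") == "2006-07-04"
assert norm_ssn("123-45-6789") == "123456789"
assert values_match("1", "0.9999998")
```

Each rule is trivial in isolation; the difficulty Kim and Seo catalog is that a real multidatabase system needs an extensive, curated set of them.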

 
3.1.2.2 Entity-Attribute Mismatch 
Entity-Attribute Mismatch is a matter of data modeling, a choice in the semantics of the domain being modeled. Computer science courses at the undergraduate and even Master's level have taught that in data modeling there is an underlying triple of Entity (or Object), Attribute and Value. It has been known for over 20 years, however, that an Entity in one Structured Data Resource (SDR) can be an Attribute in another. Therefore the schemas of the two Structured Data Resources would have a different number of Entities, and the Entities would have a different number of Attributes. Moreover, the names of each are often made as short as possible to make the human-computer interface less cumbersome when ad hoc queries are formulated.
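The point can be made concrete with a toy example (the department/employee domain is invented for illustration): the same facts appear in one SDR with Department as an Entity and in another with it folded into an Attribute, so the two schemas differ in their number of Entities and Attributes.

```python
# SDR A models Department as an Entity; SDR B folds it into an
# attribute of Employee.  Same facts, different schemas.
sdr_a = {
    "Department": [{"dept_id": 1, "name": "Water"}],
    "Employee":   [{"emp_id": 7, "dept_id": 1}],
}
sdr_b = {
    "Employee": [{"emp_id": 7, "dept": "Water"}],
}

def a_to_b(a):
    """Collapse A's Department entity into an attribute, giving B's shape."""
    depts = {d["dept_id"]: d["name"] for d in a["Department"]}
    return {"Employee": [{"emp_id": e["emp_id"], "dept": depts[e["dept_id"]]}
                         for e in a["Employee"]]}

assert a_to_b(sdr_a) == sdr_b
```

Writing such a mapping by hand is easy; discovering that it is needed, from the schemas alone, is the hard part of the mismatch problem.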
3.1.2.3 Attribute-Value Mismatch 

The above problem is quite manageable, however, compared to the next one: Attribute-Value interchangeability. In the paper Language Features for Interoperability of Databases with Schematic Discrepancies the authors Ravi Krishnamurthy, Witold Litwin and William Kent (Proceedings of the 1991 SIGMOD Conference, pp. 40-49.) discuss for the first time: "A less addressed problem is that of schematic discrepancies … when one database's data (values) correspond to metadata (schema elements) in others … "

To quote from the paper further:

"Example: Consider three stock databases. All contain the closing price for each day of each stock in the stock market. The schemata for the three databases are as follows:

database euter (2):

relation r : {(date, stkCode, clsPrice) ...}

database chwab (2):

relation r : {(date, stk1, stk2, ...) ...}

database ource (2):

relation stk1 : {(date, clsPrice) ...},

relation stk2 : {(date, clsPrice) ...},

The euter database consists of a single relation that has a tuple per day per stock with its closing price. The chwab database also has a single relation, but with one attribute per stock, and one tuple per day, where the value of the attribute is the closing price of the stock. The ource database has, in contrast, one relation per stock that has a tuple per day with its closing prices (3). For now we consider that the stkCode values in euter are the names of the attributes and relations in the other databases (e.g., stk1, stk2). … These schematically disparate databases have similar purposes although they may deal with different stocks, dates, or closing prices…"
 
It is now clear that a service that addresses the interoperability of data by accessing data elements is needed for data sharing. However, the DRM Schema construct neglects even to mention Data Values.
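The paper's three schemas can be reproduced directly in Python (the dates and prices below are invented), together with the kind of data-element-level normalization such a service would have to perform: each database is reduced to the same set of (date, stock, price) facts, even though in two of them the stock codes are metadata rather than values.

```python
# The three stock databases of Krishnamurthy, Litwin and Kent,
# with invented sample data.
euter = [("2007-06-18", "stk1", 10.0), ("2007-06-18", "stk2", 20.0)]
chwab = [{"date": "2007-06-18", "stk1": 10.0, "stk2": 20.0}]
ource = {"stk1": [("2007-06-18", 10.0)], "stk2": [("2007-06-18", 20.0)]}

def from_euter(db):
    # stock codes are ordinary data values here
    return set(db)

def from_chwab(db):
    # attribute names are stock codes: metadata stands for data
    return {(r["date"], k, v) for r in db for k, v in r.items() if k != "date"}

def from_ource(db):
    # relation names are stock codes: again metadata stands for data
    return {(d, stk, p) for stk, rows in db.items() for d, p in rows}

assert from_euter(euter) == from_chwab(chwab) == from_ource(ource)
```

The asserts show the three schematically disparate databases carry identical facts; a schema-only comparison would never reveal this, which is exactly why Data Values belong in the model.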
 
There was a suggestion during 2005 that agencies document their Data Schemas using XSCHEMA and transmit them to a central repository. Given the Computer Science results cited above many additional steps would have to be taken to effectively support data sharing in this manner. That is why the Discovery Services approach was adopted in DRM Version 2.0.
 
As for how to supply the information necessary to detect and perhaps remedy the mismatches, one should start by looking back at Data Dictionaries. There was considerable work done on Data Dictionaries some 20 years ago, but as the work producing them did not appear to be cost effective they were not widely implemented. The key would be to use the Document associated with the schema, which, being semi-structured, could contain the Dictionary as a Structured Data Resource as well as text. A project to re-visit this approach in the light of the technologies just described was initiated at George Mason University on June 11th by a student under the guidance of Prof. Edgar Sibley.
 
The Data Description section of the DRM could be changed in the next version to supply such information. Currently it would have to be supplied by a linked document, but as we shall see later such documents can be generated in a way that is extremely useful even at this time.

3.1.2.4 The Role of Relationships 
This is really a sub-topic of schema mismatch. It addresses the fact that a "Relationship" in the Abstract Model is used primarily as a human-readable markup technique to help understand the model. It is not actionable and schema mismatch problems are not ameliorated by using Relationships.
 
The Relationship idea was introduced in 1976 by Peter Chen in the ACM Transactions on Database Systems, Volume 1, Issue 1. It has been shown useful as a documentation tool for software engineers to relate the data to a specification of its exact nature. About 10 years later an article in ACM Computing Surveys that compared Semantic Data Models proposed relating it to the concept of Abstract Classes, but this is not necessarily viable, as some relationships are proposed to have attributes (which Abstract Classes lack). The best attempt to understand relationships within the context of the relational model was the final work of E.F. Codd, "The Relational Model for Database Management: Version 2", which is now, unfortunately, only available in the used book market. Despite their unclear role in data models, however, Relationships persisted in the Unified Modeling Language (UML). Of course this too is a software or system specification language (http://www.omg.org), so the retention of the ideas is not surprising. Recently it has been used to name functional relationships between classes, but that would contradict earlier assertions that there could be Relationships among three Classes.
 
Whatever critique is provided today, however, in fairness one must say that putting Relationships into a Data Model was a good idea at the time. It was the first step in the process of finding ways to add semantics to Data Models, semantics that the Relational Model lacked. However, it is not a construct that would likely persist in a new Abstract Model.
3.1.2.5 Summary 
A more powerful abstract model, one that could overcome the above-mentioned deficiencies of the current abstract model, would contain new concepts and Entities that could be used by services to overcome issues like Schema Mismatch and enable more data sharing.

3.1.3 Data Context: Topics, Taxonomies and Ontologies  

The Data Context section shows Topics that are in Taxonomies. These are shown in the master model in Chapter 2 Section 2.1. This diagram of the Abstract Model also shows that a document as Unstructured Data can be a Digital Data Resource which can have a topic. It is not limited to having one topic, though each topic must be in a taxonomy.
 
This construct is limiting. The builders of the lexical database WordNet found that there are important classes of relationships among words and concepts that go beyond the "Is-A" relationship that characterizes a taxonomy. One significant example is the "Part-Of" relationship. Although this relationship is often used in OWL-DL Ontologies (which model classes of information described by nouns), there is a new, significant use that can be made of it in networks of verbs. Processes are made up of other (sub)processes, which are "Part-Of" the higher process. Given that the IKRIS project has shown that all the Process description formalisms can be mapped interoperably to one another using IKL, one should look to expand the types of structured formalisms to which Topics are related. In fact this is a critical advance that must be made when creating new Data Descriptions for Structured Data Bases. The processes that generate the data in those databases need to be formally described, and the data in the Structured Data Resources mapped to the initial, transient or end events that are the cause of their inclusion.
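A toy illustration (all terms below are invented, not taken from WordNet) of why Topics need more than a taxonomy: a network can carry both "Is-A" and "Part-Of" edges, and "Part-Of" chains through (sub)processes just as it does through physical parts.

```python
# A tiny relation network mixing taxonomic (is-a) and meronymic
# (part-of) edges, over both nouns and process names.
edges = {
    ("wheel", "part-of", "car"),
    ("engine", "part-of", "car"),
    ("car", "is-a", "vehicle"),
    ("mix reagents", "part-of", "run assay"),
    ("run assay", "part-of", "laboratory analysis"),
}

def related(term, rel):
    """Transitive closure of one relation type, starting at term."""
    out, frontier = set(), {term}
    while frontier:
        nxt = {b for (a, r, b) in edges if r == rel and a in frontier}
        frontier = nxt - out
        out |= nxt
    return out

assert related("car", "is-a") == {"vehicle"}
assert "laboratory analysis" in related("mix reagents", "part-of")
```

A taxonomy alone could answer the first query but not the second; the second is exactly the sub-process chain the text argues must be attached to Data Descriptions.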

4.0 The Motivation for the use of "3" 

The morning session mentioned the DRM 3.0, the Web 3.0 and SOAs. Presenting adequate material to justify such an introduction of terminology was a rather ambitious undertaking. For anything to be a "3.0", however, there must be promise that problems in a version "2.0" can be alleviated if not completely corrected by the next version. What allows the "3.0" designation to be reasonable is the expectation that the necessary technology can now be built, given the Computer Science advances due to the ARDA/DTO sponsored projects.

4.1 The DRM Version 3.0

No activity is underway to replace the Data Reference Model Version 2.0, but now in 2007 one can make significantly more progress toward reasoning about data resources than was even considered possible in 2005 when the Abstract Model was developed. Where the Abstract Model shows three standardization areas, it is now possible to consider a reduction to two areas, where Intelligent Awareness encompasses Data Description and Data Context. The issues are discussed below.
 
First let us consider a justification for keeping a two-tier model rather than rolling up everything into an amorphous "knowledge model", as shown in the PowerPoint slide image below. Whatever Computer Science and enabling technology may be able to bring to bear on the problems of data sharing, there is just too much data to share. This leads to the necessity of a Query Oriented Architecture (QOA), one where automated services in a Service Oriented Architecture can dynamically assemble data to meet a query's needs. There is a reason that this term is unfamiliar and an even better one why it should now be considered.
 
Queries have been a staple of data management technology for a long time. Looking back to the foundations of data management, the key reason that the relational model of data was chosen was its ability to generate a View. The data in the database was organized into Base Relations, and then through an SQL statement the data could be culled, sorted, merged and joined into a very large number of new collections; each one so generated was a View. The generating SQL statement, however, was a Query, and as any query could be posed against relational data there was no need to single out Queries.
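The base-relation/query/view mechanism can be sketched with a small self-contained example; the table, column, and view names here are invented for illustration:

```python
import sqlite3

# Create an in-memory database holding one Base Relation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, dept TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Ada", "ENG", 90), ("Bob", "ENG", 70), ("Cy", "HR", 60)],
)

# Any SELECT statement is a Query; naming its result yields a View.
conn.execute(
    "CREATE VIEW eng_staff AS "
    "SELECT name, salary FROM employee WHERE dept = 'ENG'"
)

# The View can now be queried as if it were a base relation:
# the data has been culled (WHERE) and sorted (ORDER BY).
rows = conn.execute("SELECT name FROM eng_staff ORDER BY salary DESC").fetchall()
print(rows)  # [('Ada',), ('Bob',)]
```

The point is that the View is not stored data at all; it is defined entirely by its generating query.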

Figure 2: The Data Reference Model 3.0, Web 3.0, and SOAs

LRussellFigure2.gif  

The data that was modeled, however, was the data of business. Data about the real world was handled differently, and no general-purpose query systems existed that encompassed such data. That is not to say there was no attempt to do so, however. In the mid-1980s a DARPA contract with the Computer Corporation of America led researchers to the Probe Data Model (PDM). This was described in a technical report and many academic articles, and the resulting papers described the many difficulties encountered in extending SQL to real-world data. One major issue was the discontinuity between the domain of business forms, with their discrete data, and the Real World, where one encounters continuous data. The latter could only be measured at points in time, and hence sampled. The SQL "join" became the "Apply-Append" operator, and in being so generalized it was no longer of interest - although it handles missing data, the data was still missing.
 
The Real World, however, remains of interest and the U.S. Government collects a considerable amount of data about it. Not every aspect of the real world of interest, however, can be pre-determined, and data that is stored may or may not reflect the interests of the person who believes it might be useful. To bridge the gap between users of data and collectors of data, a service needs to be available to reason about the relevance of data sources. This notion of relevance is well developed in the field of Information Retrieval that enables Search Engines, but is only intuitively understood in a more general sense. The QOA is a means of making some headway towards a formalization. It is one that addresses specifically the needs of assembling data from Structured Data Resources as well as other Data Resources that contain non-textual data. We must add textual data to make that other data understandable to a discovery service.
 
A QOA is one where a user can specify an interest in data to a system and the system will dynamically examine the data it has and identify its precise semantics. This description is then available as input to the querying person. It can then be re-used to create more detailed questions, or else to change the intended semantics so that a better answer is retrieved the next time. The first issue to consider is what a "query" really is, and to do so we look again at the best-known query language, SQL.
 
One of the lesser known aspects of Relational Database theory is that under a closed world assumption - the database is all the data - SQL queries are expressions in First Order Logic. This was proven in 1978-84 in a series of papers by Raymond Reiter, and later researchers showed the same holds for Object Oriented databases. So a query is a proof of some assertion, one that uses the world of data stored in a database to determine the assertion's truth or falsity. Given the limited context of an SQL database it is possible to find logical assertions about those Entities or Objects that are modeled. When multiple relational databases are involved, however, there is no uniform naming convention, so even identifying what data exists that would become part of a proof is problematic. If non-relational data is considered as well, the problem appears insuperable. Fortunately there is a way to approach the problem: use language, in our case English, as the more expansive query language.
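The closed-world reading of a query as a First Order Logic expression can be illustrated with a minimal sketch, where a finite set of tuples stands in for the whole database (the relation and the predicates are invented):

```python
# Closed world assumption: these tuples are ALL the facts there are.
employee = {("Ada", "ENG"), ("Bob", "ENG"), ("Cy", "HR")}

def exists(pred, relation):
    """Existential quantifier evaluated over a finite relation."""
    return any(pred(t) for t in relation)

def forall(pred, relation):
    """Universal quantifier evaluated over a finite relation."""
    return all(pred(t) for t in relation)

# "There exists an employee in HR" -- provably true in this world.
some_hr = exists(lambda t: t[1] == "HR", employee)

# "Every employee is in ENG" -- provably false in this world.
all_eng = forall(lambda t: t[1] == "ENG", employee)

print(some_hr, all_eng)
```

Under the closed world assumption both answers are definite proofs, not guesses: the database is the entire universe of discourse.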
 
In the year 2007 there are many computer scientists and practitioners who have no knowledge of the roots of database technology, specifically the early work of the people who tried to understand data modeling. When data modeling was introduced into software engineering in the 1970s, the idea was to examine documents describing how a computer system was to behave and strip away everything that was not essential. People were taught to take a page of text, use a highlighter on every noun, and ignore all the rest of the words. These nouns were then grouped into Entities and Attributes. As an afterthought the verbs were put next to various lines in a graph between the Entities and called Relationships. The Relationships were documentation devices and the names were not actionable.
 
In order to bring to light the actual meaning of data collections, however, the original text descriptions need to be restored or regenerated, and some new ones will have to be created. These will be parsed and rendered into a knowledge base artifact associated with the relevant Data Resource. The English Language has been examined in great detail over the last 15 years. Instead of looking at the words in a sentence as a sequence of multi-meaning character strings, one can now look at the words in a sentence as a set of sense-disambiguated tokens. Moreover, as we can see in the February 6th presentations by LCC and CYCORP, it is now possible to take these same natural language texts and translate them into logical assertions. This means that the way to build the knowledge base needed for a Query Oriented Architecture is to let the query be formed in Natural Language (English).
 
The queries are the questions that have been investigated in the AQUAINT program. Systems developed in the AQUAINT program parse Natural Language queries and interact with knowledge collections. These can be described as a network of Data Context and Data Description artifacts. Services will reason about the query using the new knowledge bases and these artifacts will allow the service to identify relevant data in the databases. This data can be dynamically accessed and the relevant query results assembled for presentation to the user.
 

In the DRM 2.0 the Data Description Chapter did not mention how language should or should not be used to support Data Sharing Services. It was only in the Data Context Chapter that any Natural Language artifacts were named: topics organized in a taxonomy. As we mentioned above, that is only one type of organization of concepts in English. We need to use all of them and then go farther.

 

4.2 Service Oriented Architectures  

There is a considerable amount being written about Service Oriented Architectures. Much of it describes how the services should be organized to support the business of an organization. This, however, is not new. In addition one reads about how XML and Web services should be used. This is good because it brings to light what the services are actually doing and allows a record of the messages passed among them to be stored in a form that is readable by people. However, when one digs deep into the "fine print" of Service Oriented Architectures one comes up against the Service Description. This is where we see that we are back to the same old low-level computer technology: Application Programming Interfaces (APIs) and a text document that describes how to invoke them. The data, input and output, may be described with tags, but this does not mean that the content of the tags is understood in any way other than character-string matching. Fortunately, this does not have to be the case any longer.
 
The IKRIS project made a major advance in that it demonstrated that the various linguistic and logical formalisms used for process specifications are interoperable, e.g. NIST's Process Specification Language (PSL) and Cycorp's CYC-L. Process specifications are important because a service is a process: it takes in input and after a time it provides an output. A service will be implemented by multiple sub-processes, however. These may occur in series or in parallel. The sub-processes may be one of a set of alternatives, and any sub-process may occur one time, many times or not at all. A process specification is the correct way to describe a service (another of the formalisms shown to be equivalent was OWL-S, the computer services Ontology). The specification's process steps' input and output data can be cross-referenced to data in Structured Data Resources. Within the process one can also describe the logical or metric relationships among the data items, i.e. data integrity constraints.
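The structural vocabulary described here - steps composed in series, in parallel, or as alternatives, with inputs and outputs that can be cross-referenced to data - can be sketched as a small illustrative data structure. This is a sketch only, with invented names; it is not PSL, CYC-L, or OWL-S syntax:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """An atomic process step with data it consumes and produces."""
    name: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

@dataclass
class Process:
    """A composite process; its parts are 'Part-Of' it."""
    name: str
    mode: str                                  # "series" | "parallel" | "choice"
    parts: list = field(default_factory=list)  # Steps or sub-Processes

    def leaf_steps(self):
        """Flatten the composition tree to its atomic steps."""
        out = []
        for p in self.parts:
            out.extend(p.leaf_steps() if isinstance(p, Process) else [p])
        return out

# A hypothetical data-producing service: validation runs in parallel
# between a serial receive and load.
ingest = Process("ingest", "series", [
    Step("receive", inputs=["raw_file"], outputs=["records"]),
    Process("validate", "parallel", [
        Step("check_schema", inputs=["records"]),
        Step("check_ranges", inputs=["records"]),
    ]),
    Step("load", inputs=["records"], outputs=["table_rows"]),
])

step_names = [s.name for s in ingest.leaf_steps()]
print(step_names)  # ['receive', 'check_schema', 'check_ranges', 'load']
```

Each step's inputs and outputs are exactly the hooks that would be cross-referenced to data items in Structured Data Resources.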
 
The only way that a process can be described in Natural Language, however, is to include a precise vocabulary of time, and use it to create time markups. This has been the objective of the TIMEML language project within AQUAINT, one that has made significant progress towards that goal. The other aspect of processes is that the topic words or concepts that describe them are verbs. Without correctly understanding how to use verbs, the words that describe changes in either state or spatio-temporal position, one cannot specify processes and services. The TIMEML technology allows text documents such as process descriptions to be read by linguistic tools and appropriate time markups inserted to make temporal relationships precise. These are available to reasoning services, which can then detect semantic equivalences among different data descriptions.
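The flavor of such time markup can be suggested with a simplified, TimeML-style fragment. This is illustrative only and not a conformant TimeML document, though the EVENT, TIMEX3 and TLINK element names are drawn from that language; the sentence content is invented:

```python
import xml.etree.ElementTree as ET

# Simplified TimeML-flavored markup: two events, one date expression,
# and a temporal link asserting that e1 happened BEFORE e2.
doc = """<TEXT>
  The system <EVENT eid="e1" class="OCCURRENCE">ingested</EVENT> the data
  <TIMEX3 tid="t1" type="DATE" value="2007-02-06">on February 6</TIMEX3>,
  then <EVENT eid="e2" class="OCCURRENCE">published</EVENT> the report.
  <TLINK eventID="e1" relatedToEvent="e2" relType="BEFORE"/>
</TEXT>"""

root = ET.fromstring(doc)
events = [e.get("eid") for e in root.iter("EVENT")]
link = root.find("TLINK")
print(events)             # ['e1', 'e2']
print(link.get("relType"))  # 'BEFORE'
```

A reasoning service that reads such markup gets the temporal ordering of events as machine-processable assertions rather than as prose.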

4.3 The Web 3.0 

The Web 3.0 is now being called "The Semantic Web". In January, Wikipedia was refusing to let anyone create a definition of it, but now a number of discussion points have emerged. As it relates to this discussion, however, it would be reasonable to believe we have a Web 3.0 when there are Web services available that can take the content of the Web 2.0 and create order within the chaos. Web 2.0 artifacts are characterized as being generated by people, and full of cross references, bookmarks and recommendations. Blogs, videos and shared images help make the content more diverse and personal.
 
At the center of such a wealth of input is a core of facts and relationships about facts. This leads to a number of different contexts being established in which the facts are considered true, suspect, or false. On any matter where there is argumentation, the alleged facts used in an argument can be linked back to Web-resident sources. The sources' credibility can be evaluated using ancillary evidence and believed or not. Over time, those assertions that have merit emerge as a common Knowledge Base. Realizing this requires managing a Knowledge Base that is vast and variegated, yet manageable. The decades-long work of CYCORP provides guidance as to how this knowledge should be organized.

5.0 Language and Semantic Interoperability 

In recent years much has been written about the projected use of "Semantic Technologies" to enable more sharing of information using Web technology, but the grounding of such discussions on a firm computer science foundation has left much to be desired. To understand semantics, however, requires understanding language, and in our case the English Language. Therefore, on February 6th, following two presentations that discussed the history and promise of the DRM, the session introduced Princeton Professor Christiane Fellbaum, the Principal Investigator and lead scientist of the WordNet Project. Her presentation was solicited to explain what we now know about English and to introduce to a wider audience the cross-annotated and disambiguated map of the English language that is now available.

5.1 WordNet

For a history of the project see http://wordnet.princeton.edu/ and purchase the book (ISBN 0-262-06197-X). To quote from the Website:
 
"WordNet® is a large lexical database of English, developed under the direction of George A. Miller. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser."  
 
WordNet must rank high in the list of books that are cited but have not been read. As part of the AQUAINT program of linguistic technology enhancement, begun by ARDA in 2001, WordNet's initial collection of synsets and word relationships has moved from a "best efforts" database to a precise reference collection of word-concept mappings. Words are grouped into sets of roughly synonymous words, called synsets, each of which expresses a concept. English words are polysemous (have multiple meanings), and so appear in multiple synsets. Their meanings can be disambiguated through the other synset members, through the semantic relations among synsets, and through the definitions that accompany each synset. A recently completed major enhancement of WordNet was the manual disambiguation of the definitions, or glosses: the nouns, verbs, adjectives and adverbs in the glosses were linked to the appropriate synset in WordNet. This not only further clarifies the meaning of each synset, but also creates a corpus of sentences (the definitions) that are semantically annotated against a large lexical database. The glosses were translated into logical form, and the variables in these forms are linked to specific WordNet senses. These data constitute a valuable resource for automated reasoning and inferencing.
 
For those who have not read the book, it turns out that there are multiple networks among words, which means that the project could accurately be renamed "Wordnets". These are discussed in the presentation stored on the Wiki, but the key points are listed below:

5.1.1 Nouns and their Relationships  

What are nouns? Nouns name things, and in English things can be either abstract or concrete. Whether the one or the other, however, one must be able to distinguish an instance of one thing from another. This means that there will be a number of distinguishing properties for the thing that is named by a noun.
 
For nouns, there are several important relationships beyond the ones familiar to Library Science, i.e. synonyms, general terms, and specific terms. The synonyms give rise to the conception of the "synsets" as one means of disambiguating polysemous words. For nouns the substitutions are very likely to cause little shift in meaning (in contrast to what happens with verbs). The more general terms are called hypernyms and the less general ones hyponyms. Linguistically this relationship shows up in the form "an A is a B".
 
WordNet, however, has shown that there is another use of "Is-A" common in English, the telic relationship where the prior phrase is an abbreviation of "is used as a". One can see the difference in the WordNet example of the two phrases "a chicken is a bird" and "a chicken is a food". The former is an instance of the hypernym/hyponym relationship between "bird" and "chicken" and the latter is a telic relationship.
 
In addition to the above there is the "Is A Part-of" relationship, where more general words are holonyms and more specific words are meronyms. There is more, so, as Prof. Fellbaum said at the workshop, "read the book!"
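A tiny hand-built fragment - not actual WordNet data - makes these noun relationships concrete, including the two distinct uses of "is a" in the chicken example:

```python
# Toy lexical fragment; real WordNet is far larger and sense-disambiguated.
hypernym = {"chicken": "bird", "bird": "animal"}  # "a chicken is a bird"
telic    = {"chicken": "food"}                    # "a chicken is (used as) a food"
meronym  = {"wing": "bird"}                       # "a wing is a part of a bird"

def is_a(word, concept):
    """Follow the hypernym chain upward: the taxonomic Is-A."""
    while word in hypernym:
        word = hypernym[word]
        if word == concept:
            return True
    return False

print(is_a("chicken", "animal"))  # True, via "bird"
print(is_a("chicken", "food"))    # False: "food" is telic, not taxonomic
print(telic.get("chicken"))       # 'food'
print(meronym["wing"])            # 'bird', the Part-Of relation
```

The point of separating the three dictionaries is that conflating them - as a single "is a" link would do - loses exactly the distinctions WordNet records.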

5.1.2 Verbs and their Relationships  

What are verbs? Verbs name events or states that have temporal properties; often they label a change of state or a displacement in space; in either case the change occurs over time.
 
For verbs there are relationships that are similar to those of nouns, but they are not identical! For verbs the synsets and the hypernym/hyponym relationships are different. First of all, whereas for nouns replacing a word in a sentence with a synonym is unlikely to change the meaning of the sentence very much, the WordNet book says the same is not true for verbs: the shifts in meaning are less subtle.
 
The second important difference is that although a given verb synset can have a single hypernym, unlike with nouns there is not just one type of hyponym. Rather, the relationship of more specific words is one of entailment. WordNet identifies four types of entailment, distinguished by temporal properties. One of these is the troponym, a name for a verb that has the same starting and ending time as the verb that is its hypernym. The difference is that the troponym qualifies the manner in which the action takes place. If the hypernym is "to walk" then a troponym is "to skip".
 
Of particular significance to Semantic Interoperability, however, is the holonym/meronym relationship between verbs. In this case the holonymic verb describes a process of which the meronymic verb is a sub-process. A process is what is behind a service, which means that creating accurate Service Descriptions requires accurate process descriptions.
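These verb relationships can be sketched the same way, again with hand-picked illustrative entries rather than WordNet data:

```python
# Troponymy: the troponym refines the MANNER of its hypernym over the
# same time interval ("to skip" is a manner of "to walk").
troponym_of = {"skip": "walk", "march": "walk"}

# Verb meronymy: the meronymic verb names a sub-process of the
# holonymic verb's process ("chew" and "swallow" are parts of "eat").
subprocess_of = {"chew": "eat", "swallow": "eat"}

def entails(specific, general):
    """A troponym entails its hypernym: if X skips, then X walks."""
    return troponym_of.get(specific) == general

print(entails("skip", "walk"))   # True
print(entails("walk", "skip"))   # False: entailment is one-directional
parts = sorted(v for v, h in subprocess_of.items() if h == "eat")
print(parts)                     # ['chew', 'swallow']
```

The sub-process table is exactly the shape of information a Service Description needs: which verbs (processes) are Part-Of which larger process.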

5.1.3 Other WordNet Word Relationships  

The relationships among words described above are not the only ones in WordNet. Under a continuation of the AQUAINT program (Phase III), WordNet is being expanded with other relationships, such as those that relate nouns to their verbal expressions. However, there is a point at which the number of relationships within the language must be capped, i.e. there needs to be a core set, because human language can express many, many relationships about the real world and the world of ideas. Within them one should look for abstractions of linguistic structures. In choosing them, some of the concepts that are used to describe the real world must be, and have been, included, e.g. Seas. However, language that describes the specifics of the real world should specify the many relationships within the world using the core set of relationships. Thus inland seas and land-locked seas are described by phrases within a language that reflect real-world situations; they are not part of the core terminology, whereas the word "sea" is a linguistic concept. This issue is addressed further in the discussion below.

5.1.4 WordNet as the Lexical Ontology  

Although much has been said about "Ontologies", we now introduce a distinction with a difference. Ontologies can be described in two ways:

• using the W3C recommended Web Ontology Language OWL, specifically the Description Logic portion of it, and 

• using a free-floating sense where anything which is specified can be part of the Ontology. The structure of the Ontology is then provided by First and Higher Order logics; it is the model CYCORP uses.  
 
There are several debates in progress in the subject area of the form and content of Ontologies. One of them is what constitutes the "best upper Ontology". This question at the present time is really a philosophic one, but the idea is simple: how do we divide up everything known to humankind into a finite collection of categories. Although there may be new answers in the future, we can look at a sub-set of that ideal Ontology, the set of all things which are described by a word or compact noun phrase in English. Specifically we look at all of such words and phrases that are in WordNet. Because WordNet limits its contents largely to those concepts for which there is a word (a "lexeme"), WordNet is a Lexical Ontology for English.

 

5.2 Well Formed Lexical Ontologies  

Based on the discussion of WordNet it should be clear that a verb is not a noun. Yet perhaps because of the flexibility of English this principle is overlooked when people try to create networks of words and corresponding collections of data that they call Ontologies. Some words are both nouns and verbs, though their synsets are very different. However, when the verb sense is required there is always the potential for "noun-ifying" a verb and sticking the word into an "Ontology". Another practice is to take parts of sentences, assigning nouns to nodes and connecting them with arcs that have phrases written above them. The result is called a "semantic net" or a "knowledge diagram" of something, but because these assignments are ad hoc the diagramming step adds no machine-processable knowledge. The markups are good for people to read, but unless they have a precise meaning within a formalism they cannot be used systematically by enhanced discovery services and thus do not aid Semantic Interoperability.

5.3 Suggestions on How It Should Be Done  

Data that is to be made available for Data Sharing should be fully described in a Document that is a Data Description. In addition to any metadata describing the document creation there should be a set of topics extracted that provide a Data Context.
 
In the case of Data Descriptions where there is a SCHEMA, this DOCUMENT should be written in disambiguated Natural Language and include an IKRIS-understandable description of all processes that generate the data. All data elements would be identified as to where in the process's steps they are used, created, changed and deleted.
 
Ontologies in the OWL-DL sense should be created or referenced for each data item as needed, but class names should only be nouns. Non-lexical terms should only be specified as a specialization of a lexical term and specific inclusion/exclusion rules should be provided.
 
Logical constraints in the form of rules that hold among data values would also be explicitly stated and be made available to a Constraint Discovery Service.

6.0 The Global Change Master Directory is the Model 

The presentation material of the Global Change Master Directory (GCMD) is rather complete in describing how it is organized today. The detailed presentation is available on the Wiki page for the Feb 6th meeting. The GCMD is used by many government agencies to share data; in fact it is used to describe 18 Petabytes of data. It surely has no peer as a successful template for data sharing and for building the network of DRM 2.0 artifacts needed for data sharing. Why it was overlooked until the last minute, however, is an interesting study in the complexity of Computer Science today.
 
The Master Directory is obscure in part because it is a great success. The project's goal is to provide a means of identifying the LARGE data collections devoted to a particular concept. In order to do this there is a need to understand what a concept is, how concepts should be organized, what categories of information should be used to collect descriptive data, and what inter-relationships among concepts should be supported. One reason that the GCMD works is that the people involved in its specification and design are highly educated and very motivated. It would be an incorrect assumption, however, to conclude that development of the directory has been either a straightforward activity or without controversy. The project started in 1988 and there has been considerable evolution.
 
The GCMD was known to Dr. Russell when he joined the DRM 2.0 writing team, and had it been possible to do so it would have become part of the explicit Abstract Model. Instead, through appropriate language in the Introduction and Guidance Sections of Data Description and Data Context Chapters it is there implicitly. This is important because it provides a coherent means of bringing together different concepts for Discovery Services and the data stores' description that will need to be processed to identify those that will be the subject of a Data Access Service.
 
The first area of success is in concept representation. The GCMD has to organize both data concerning the physical nature of the real world, measurements of natural phenomena, and the impact of the activity of human social institutions within the world. This challenge was met during the mid-1990s when, to support the merger of two major data-set collection indexes, two nearly orthogonal taxonomies had to be accommodated and inter-related. One represented the data collections that had been developed to record physical phenomena. The other was a collection of data describing the impact of global climate change on human activity.
 
The government is an institution that holds data about the physical world, the world of social behavior and the world of individual behaviors. The concept space so encompassed is vast, and in addition to encompassing the words indexed in WordNet it extends to a vast additional technical vocabulary. To effectively develop any Taxonomy artifact mentioned in the Data Context Chapter of the DRM will require considerable reduction of that vocabulary to a smaller number of concepts. The success of the GCMD in meeting this challenge is evident. It should then be a model for how any such effort should be undertaken. The DRM 2.0 shows the way.

Figure 3: Network of Topics and Directory Page Documents

LRusselFigure3.png

In Section 4.2.2 "Purpose of the Data Context Section of the DRM Abstract Model" the way is shown in the text that follows:

"Context often takes the form of a set of terms, i.e. words or phrases, that are themselves organized in lists, hierarchies, or trees; they may be referred to as "context items". Collectively, Data Context can also be called "categorization" or "classification". In this case the groupings of the context items can be called "categorization schemes" or "classification schemes." More complex Data Context artifacts may also be generated, e.g. networks of classification terms or schemes.

Classification schemes can include simple lists of terms (or terms and phrases) that are arranged using some form of relationship, such as:

- sets of equivalent terms,

- a hierarchy or a tree relationship structure, or

- a general network."

The intent of the above language was to open the door to the use of WordNet's exact synsets.
 
The second means of organization is in the pointing of the TOPIC artifact to the DATA ASSET. The latter is defined generally, but can be a DOCUMENT, and that document can have links in it that a DISCOVERY SERVICE can access. These links can be to other TOPICS, in one or more (lexical) Ontologies, or to other Documents that are pages in the Directory.
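The kind of navigation this enables can be sketched as a small graph of Topics and directory-page Documents, in the spirit of Figure 3 (all names here are invented):

```python
# Invented network of TOPIC and DOCUMENT nodes with their outgoing links.
links = {
    "topic:oceans": ["doc:sea-surface-temp", "topic:climate"],
    "topic:climate": ["doc:co2-records"],
    "doc:sea-surface-temp": ["topic:climate"],
    "doc:co2-records": [],
}

def discover(start):
    """Everything a Discovery Service could reach by following links."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(links.get(node, []))
    return sorted(seen)

reachable = discover("topic:oceans")
print(reachable)
```

Starting from one Topic, the service reaches both directory-page Documents and the related Topic, which is exactly the traversal the GCMD supports today.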
 
This is similar to the way that the Data Collections of the government are indexed in the GCMD. This is the template for the rest of the government. It should start at the top and continue all the way down to individual databases, document collections and file collections. It has worked before and with the new tools available today the words in the topics and directory page documents can be specified unambiguously.

7.0 Language Computer Corporation's Parsing Suite 

Language Computer Corporation (LCC) is a company that specializes in human language understanding Research and Development. It was founded 11 years ago in Dallas, Texas and established a second office in Columbia, MD in mid-2006. It employs about 70 research scientists and engineers. The research funding comes primarily from DNI/DTO, NSF, AFRL, DARPA and several individual Government Agencies. It has working software products. Its technology has been transferred to individual Government Organizations, Defense contractors and, more recently, Commercial Customers. It was a major component of the AQUAINT project and has aggressively incorporated many of the advances reported by the nearly 20 research teams into its own product suite. The Wiki page with the conference briefing is Part 2 of the morning session. The lead for the presentation is:

"What can be done today to extract logical relationships from text sources? LCC is there as the company that specializes in extracting logical representations from language, which can generate new knowledge from well-crafted Data Descriptions."  
 
The products described in the briefing fall into three categories.

• Information Extraction

o CiceroLite and other Cicero Products

o Extracting Rich Knowledge from Text

• Polaris: Semantic Parser

o XWN KB: Extended WordNet Knowledge Base

o Jaguar: Knowledge Extraction from Text

o Context and Events: Detection, Recognition & Extraction

• Cogex: Reasoning and Inferencing over Extracted Knowledge

o Semantic Parsing & Logical Forms

o Lexical Chains & On-Demand Axioms 

o Logic Prover 

The description of the several products' capabilities was laid out to show that there is a current capability for understanding the content of documents that was heretofore thought to be a decade away. The products' success in NIST trials was also documented.
 
In achieving the impressive successes cited above, a very important capability had to be created: handling linguistic entailment caused by an event. The idea is simple: for the assertion that a person X performed act Y at time T to be true, there may be many other linguistic assertions that other prior acts were accomplished. That means one verb at a given time entails the truth of many assertions stated using other verbs cited at a prior time. Specifically, this means that data describing an act Y can be taken as relevant to whether prior data described by other verbs is also relevant. This is done by understanding the meaning of words and their inter-relationships.
 
LCC was selected to participate in the meeting because of its linguistic focus: in addition to being able to identify the exact knowledge that is within text, its tools can also extract assertions which may not have a factual basis, e.g. opinions. The theory of language shows that statements in human language do more than just communicate facts; at least the following categories exist:

• Assertives: Statements that tell others (truly or falsely) how things are  
 
• Directives: Statements telling people to do things.  
 
• Commissives: Statements where we agree to commit ourselves to do things  
 
• Declarations: Statements where we bring about changes in our world due to an utterance. 

• Expressives: Statements of personal feelings and attitudes.

The LCC approach to language allows it to extend into all of the above. This is important because although much of language can be translated into logical formats, many of the qualitative features are not so reducible. In addition, sometimes new concepts emerge that do not have an exact definition agreed by all parties. Such clusters of related words can be identified by LCC's tools, e.g. the summarization suite not mentioned in the presentation.

8.0 CYCORP and IKRIS: Knowledge Management is HERE! 

Cycorp is the successor to the CYC project, started by Dr. Doug Lenat in the mid-1980s at the Microelectronics and Computer Technology Corporation (MCC). To quote their website: "Cycorp was founded in 1994 to research, develop, and commercialize Artificial Intelligence. Cycorp's vision is to create the world's first true artificial intelligence, having both common sense and the ability to reason with it." It currently has an Ontology (in the general sense) with 15,000 predicates, 300,000 concepts, and 3,200,000 assertions about them. It is the only such Ontology in existence. It can be extended, and many collaborators are actively working on such extensions.
 
Cycorp was a heavy contributor to the IKRIS project. This is important because the company had 20 years of experience in specifying and managing knowledge about the real world. That experience was of great value in providing a testable foundation for many ideas. The company also participated in the AQUAINT program and performed other ARDA work as well.
 
Over the years the CYC Ontology was very well regarded but not well understood by many. Initially the project used LISP as its programming language, and this seemed to make many of its features inaccessible for other applications. Although this is no longer an issue, there was still a lingering doubt as to whether the formalism peculiar to CYC was usable outside of the CYC system itself. IKRIS showed that this is not the case: CYC's formalisms are interoperable. This is an extremely valuable result as one seeks to move forward with the DRM 3.0, the Web 3.0 and SOAs.
 
The reason that CYC was brought to the SICoP meeting is that any knowledge that is needed for any advanced application can be represented now in CYC and translated to any other equally powerful formalism at any time using IKL (OWL is not a First Order Logic, so it is less expressive).  This means that the processes associated with verbs can be specified now - without waiting. It means that SOAs can be specified without having to wait on the (now interoperable) SOA-S specification. It means that the role of data elements in a SOA can be specified with respect to where they are in a large Ontology and what states or sample points they represent in the processes that interact with them. It means that we can start building the Core Knowledge Base for the Federal government today.
 
CYC pioneered the idea of a Micro-theory (MT), which is realized in IKL in the CONTEXT clause. This is extremely important as it allows the representation of statements in non-monotonic logic. Monotonic logics are the ones used in mathematics, where Pythagoras' Theorem and other such assertions are proved. Non-monotonic logic allows one to capture the results of induction, such as Newton's Law of Gravity. The difference is that Pythagoras' Theorem is always true, but Newton's Law of Gravity must be amended when velocities close to the speed of light are considered.
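The micro-theory idea can be sketched in a few lines of Python. This is an illustration under assumed names only — neither CYC nor IKL uses this syntax — but it shows the essential non-monotonic behaviour: a more specific context inherits from a general one and may override its defaults.

```python
# Hypothetical micro-theory sketch: facts are keyed by context, and a
# more specific context (RelativisticMt) inherits from a general one
# (NewtonianMt) while overriding some of its defaults.
facts = {
    "NewtonianMt": {
        "gravity_law": "F = G*m1*m2/r**2",
        "mass_is_additive": True,
    },
    "RelativisticMt": {
        "gravity_law": "Einstein field equations",  # overrides the default
    },
}
context_parent = {"RelativisticMt": "NewtonianMt"}  # inheritance chain

def lookup(context, key):
    """Walk from the most specific context up through its parents."""
    while context is not None:
        if key in facts.get(context, {}):
            return facts[context][key]
        context = context_parent.get(context)
    return None

print(lookup("NewtonianMt", "gravity_law"))      # the default law
print(lookup("RelativisticMt", "gravity_law"))   # the amended law
print(lookup("RelativisticMt", "mass_is_additive"))  # inherited default
```

Adding the relativistic context amends, rather than contradicts, the Newtonian one — exactly the behaviour a monotonic logic cannot express.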
 
The critical formalism that bridges human language and the theories of science is the counterfactual conditional. These are used to state the laws of science. For example, to say that "glass is brittle" is to assert that "were a glass to be struck by a hammer, it would shatter". Induction, however, is only as good as the last test case. The assertion by Northern Hemisphere peoples that "all swans are white" was shown to be false once contact was made with Australia. In that vein, one asks: "Is that particular glass brittle? It is intact and not shattered!" The answer is not a law but an assertion about the counterfactual: the glass being hit by a hammer. The conditional is the "IF" statement. In IKRIS this is accomplished by a CONTEXT and a Process specification (hitting the glass).
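The dispositional reading of "brittle" can be made concrete with a small sketch (an illustration only; the IKRIS CONTEXT mechanism is far richer). A disposition is represented not as an observed state but as a rule evaluated in a hypothetical context where the antecedent holds.

```python
# Hypothetical sketch: "brittle" as a counterfactual disposition.
# The object need never actually be struck; the property is the rule
# itself, evaluated against a hypothetical event.
def brittle(obj_name):
    """Return a disposition: IF struck by a hammer THEN shatters."""
    def outcome(event):
        return "shatters" if event == "struck_by_hammer" else "intact"
    return outcome

glass = brittle("this particular glass")
print(glass("struck_by_hammer"))  # hypothetical context: it would shatter
print(glass("left_on_shelf"))     # actual context: it remains intact
```

The glass is brittle in both cases; only the context in which the conditional is evaluated differs.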
 
The preceding might seem to some to be hair-splitting, but some hairs have to be split. The inability of the world's greatest logicians in the 1930s to successfully unify language and logic was due in part to the lack of a single formalism that would include the counterfactual conditional. That formalism is now present and agreed to by all, and the door is thereby opened to the extensive use of automated reasoning processes to aid in data sharing. That does not mean it will be easy, but it is now worth the trouble to undertake.
 
Another aspect of IKRIS's power is that it has constructs that act as a universal solvent for the problems of schema mismatch. Databases and metadata assertions concerning Entities, Attributes, and Values can be restated in IKL, and proofs about transformations among their different combinations can be stated. This is a breakthrough for a field that thought the problem was dead in 1991!
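The "universal solvent" idea can be illustrated with a minimal Python sketch (assumed data and names, not IKL): two mismatched schemas — a one-row-per-entity table with attributes as columns, and an entity-attribute-value table — can both be restated as the same set of logical assertions, making their equivalence checkable.

```python
# Hypothetical sketch: restating two mismatched schemas as one common
# set of (entity, attribute, value) assertions.
row_schema = {"ticker": "IBM", "price": 91.5}  # attributes as columns
eav_schema = [("IBM", "ticker", "IBM"),        # one assertion per row
              ("IBM", "price", 91.5)]

def row_to_triples(entity_key, row):
    """Restate a column-per-attribute row as entity-attribute-value triples."""
    entity = row[entity_key]
    return {(entity, attr, val) for attr, val in row.items()}

def eav_to_triples(rows):
    """An EAV table is already a set of assertions."""
    return set(rows)

# Once both are in the common form, the mismatch dissolves:
print(row_to_triples("ticker", row_schema) == eav_to_triples(eav_schema))
```

In IKL the common form would be logical sentences rather than Python tuples, and the transformations themselves could be stated and proved correct, but the dissolving move is the same.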
 
The CYC presentation is posted at the same Web site as the February 6th SICoP meeting materials referenced previously. It should be studied carefully.

9.0 Conclusion 

IKRIS was announced on April 19th, 2006, although not publicly until a couple of weeks later. The problems that heretofore inhibited Data Sharing efforts have been removed in theory, and a panoply of advanced tools is now available to start removing the barriers in practice. The following promises to be realized in the Web 3.0 are among those stated on the current "Web 3.0" page of the Wikipedia (http://en.wikipedia.org/wiki/Web_3.0):

• An evolutionary path to artificial intelligence 

• The realization of the Semantic Web and SOA

The answer is "yes, these promises can be realized," thanks to IKRIS and the adjunct linguistic technologies from the projects sponsored by ARDA/DTO.
 
We can start today!

Footnotes 

1. From Adaptive Information, by Jeffrey T. Pollock and Ralph Hodgson, John Wiley and Sons, Inc., ISBN 0-471-48854-2, 2004, p. 6.

2. Any similarity of the names of the databases (i.e., (R)euter, (S)chwab, and (S)ource) to popularly known names is purely coincidental.

3. This source schema may seem contrived to a database researcher but, lo and behold, it is a popular schema among stock market data vendors.
