An Information Model for Domain Specifications in HL7

Gunther Schadow

The Regenstrief Institute

October 7, 1998

Preface

After having sowed the seeds of arguments through my initial comment on domain specifications, I feel obliged to follow on with further concrete and constructive contributions.

Although I am not "chairing" the domain specification group, I suggest that we follow up on and conclude the discussion of the domain specification information model. As of now, we have two competing models out there, and we need to merge them back into one that we all agree on.

We can use the fundamental idea behind the whole MDF approach to derive everything else from that model: domain specifications are instance graphs of the domain specification model. Any domain specification syntax will be a "wire form" of a domain specification instance graph. I guess that our domain spec. syntax will look different from any of our HL7 message syntaxes. It would look more like a programming language; in particular, it would include expressions to do all this set arithmetic. However, I still think that a sound information model is the starting point and any discussion about syntax is premature. So, here goes.

The information model

In order to foster an agreement, I have reworked my simplistic model so as to include the details used by Stan and shown in Woody's model. I used Stan's proposed ASN.1 and Woody's reverse-engineered model as input and mapped every data element to my model. It turned out that my model wasn't so simplistic after all, as you can easily find out by looking at my original model.

Figure 1: The information model for domain specification that supports Stan's ASN.1 and Woody's UML model.
The Rose model can be found here.



The following table shows a complete data element mapping between Stan's ASN.1, Woody's UML and my UML:

STAN	WOODY	GUNTHER

CompleteDomain	CompleteDomain	Domain
c.D.Name	name	name
c.D.Id	id	unique id
c.D.Version	version	version
c.D.Descript	description	comments
editNote	editNote	editNotes
domains (SET OF)	has_domain (1..*)

CountryDomainSpec.	JurisDomainSpec.	(is a Domain)
country	juris	realm of use
cn.D.Name	name	(inherited)
cn.D.Id	id	(inherited)
cn.D.Version	version	(inherited)
cn.D.Descript	description	(inherited)
editNote	editNote	(inherited)
valueSets (SET OF)	has_valueSets (1..*)

ValueSet	ValueSet	(is a Domain)
scheme	schemeID	(System.unique id)
schemeVersion	schemeVersion	(System.version)
v.S.Name	name	(inherited)
v.S.Id	id	(inherited)
v.S.Version	version	(inherited)
v.S.Descript	description	(inherited)
editNote	editNote	(inherited)
content	includes
	excludes
Content		Derivative
includeItem	Inclusion	operator
includeClass	type	union
excludeItem	value	difference
excludeClass
	Exclusion
Item	type	Enumeration
visibleString	value	items-Term
Class
visibleString	type: {item, class}	Subsystem
	value: String	name
(IncludeAll)

While Stan's abstract syntax and Woody's model are structurally equivalent (except at the bottom), my model is structurally quite different. Through inheritance I need to specify far fewer attributes, so I have spare complexity to spend on enhancements. My enhancements are:

Term is shown explicitly as a class.
A Term may have a Definition, which allows us to reuse the domain specification machinery to actually define and maintain HL7 code tables.
The external coding System is shown in a separate class. I felt like mentioning the organization here to give the due credit.
The subsystem is explicitly shown. This is what Stan calls "class". Of course, this does not shortcut the difficulty of how these subsystems are defined. However, SNOMED axes and the chapters of ICD or CPT4 can be immediately captured as subsystems of their respective systems. Subsystems do have substructures.
The enumeration allows building Stan's "classes" and "items" by enumerating the terms as items from a base domain. This also allows us to specify an entire HL7 code table de novo.
The derivative class handles all set algebra. Currently we will need the operators "union" and "difference", but in the near future we might want to expand this to include "intersection". These operators are sufficiently defined in textbooks on set theory, so it is safe not to model their detailed meaning here.
The see also association captures the references to other domains, as in Stan's examples: phenotypic gender referring to clinical gender and genotypic gender.
The updates link allows better management of versioning: the assumption is that each domain has at most one predecessor, while it can have more than one successor. This versioning model is implemented in most source code revision control tools, such as SCCS, RCS, or CVS.
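
The versioning assumption behind the updates link can be sketched in a few lines of Python. This is only an illustration, not part of the model: the class name DomainVersion, its fields, and the version strings are all hypothetical, chosen to show "at most one predecessor, any number of successors."

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class DomainVersion:
    """One version of a domain (hypothetical names, for illustration only)."""
    name: str
    version: str
    # The "updates" link: a version updates at most one predecessor ...
    predecessor: Optional["DomainVersion"] = None
    # ... but may itself be updated by several successors (branches).
    successors: List["DomainVersion"] = field(default_factory=list)

    def derive(self, version: str) -> "DomainVersion":
        """Create a new version that updates this one."""
        child = DomainVersion(self.name, version, predecessor=self)
        self.successors.append(child)
        return child

    def history(self) -> List[str]:
        """Version identifiers from the first version up to this one."""
        chain = []
        node = self
        while node is not None:
            chain.append(node.version)
            node = node.predecessor
        return list(reversed(chain))


v1 = DomainVersion("gender", "1.0")
v2 = v1.derive("1.1")
v2_branch = v1.derive("1.0.1")   # a second successor of the same version
v3 = v2.derive("1.2")
print(v3.history())
```

Because every version has at most one predecessor, the history of any version is a simple chain, even though the versions as a whole form a tree.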

The remainder of this document gives definitions and explanations for the things in the information model of Figure 1.

Term and Domain

Domain is clearly the subject class of our problem space. A domain is a set of terms. Domain specifications are about specifying the domains such that we can tell whether a given term belongs to a given domain.

Note that the class term is not necessarily instantiated in a database sense. Of course we do not want to store millions of terms in our domain specification database; in fact, we are not even allowed to do so, or we risk being sued for copyright infringement. But the problem space does include terms, so we have to make them explicit in our model.

In fact, the element-set aggregation is a virtual relationship. It is not actually instantiated, and you cannot follow this link in a database sense. However, you can prove whether or not a given term belongs to a given domain, i.e., whether such a virtual association exists between those two given instances. The element-set aggregation between term and domain is thus not technically "useful," although it is quite important in order to nail down the semantics. There are code systems in which you cannot enumerate all terms because they are infinite sets (e.g., the code for units of measures is such an infinite set). A domain need not be a finitely enumerable set; because of this, the enumeration is defined as a subclass of domain. But see below.

There are different specializations of domains. For example, Stan's proposal and Woody's meta model together talk about CompleteDomain, Country- or Juris-Domain, ValueSet, Class, Item and Scheme. These are all sets, and thus appear in the model of Figure 1 as connected to a common generalization, the domain. This allows us to deal with the set operations (e.g., inclusion, exclusion) in a general way.
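
One way to picture the virtual element-set relationship is as a membership test rather than stored links. The following Python sketch assumes that reading; the names Domain, PatternDomain and contains are hypothetical, and the pattern is a made-up rule standing in for a domain with infinitely many terms. It is not a definition of any real code system.

```python
import re
from abc import ABC, abstractmethod


class Domain(ABC):
    """A set of terms, known only through its membership test."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def contains(self, term: str) -> bool:
        """Decide whether the given term belongs to this domain."""

    def __contains__(self, term: str) -> bool:
        # Lets callers write: term in domain
        return self.contains(term)


class PatternDomain(Domain):
    """Hypothetical domain whose terms are given by a rule, not a list."""

    def __init__(self, name: str, pattern: str):
        super().__init__(name)
        self._pattern = re.compile(pattern)

    def contains(self, term: str) -> bool:
        return self._pattern.fullmatch(term) is not None


# Made-up rule: the symbol "m", optionally followed by a nonzero integer exponent.
# Infinitely many terms satisfy it, so the domain cannot be enumerated.
powers_of_metre = PatternDomain("powers of a metre", r"m(-?[1-9][0-9]*)?")

print("m2" in powers_of_metre, "kg" in powers_of_metre)
```

No term is ever stored, yet for any given term the question "does it belong to this domain?" has a definite answer.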

System

An externally defined coding system (appears as "scheme" in Stan's proposal). Examples are: SNOMED, READ, ICD, ICPM, ICPC, CPT4, LOINC, ISO 639-1, IUPAC enzyme codes, .... The general idea is that these are coding systems published by an organization that provides meaning to the codes, maintains the system, etc. The essential attributes of the class System are:

I call this "code system," because it implies that these sets are not just sets of symbols; only through those code systems are the symbols given meaning. Without those systems, nothing else makes sense in domain specifications.

Subsystem

Most code systems have an inner structure. Structure can generally be expressed in set algebra using the subset relation. A subsystem is such a subset of the overall system, defined by the organization that published the system. For example, SNOMED axes are subsystems, and ICD, ICPC and ICPM chapters and sections are subsystems. Although LOINC does not have chapters, it has many subsystems given by the logic of LOINC. For example, LOINC allows us to define the class of all tests on specimen BLDV (venous blood).

Stan's "class" concept is possible only through the structure of the underlying coding system. This is a very tricky part. Remember, without further knowledge, the elements in a set are not ordered. Therefore, there is no semantics of a range such as "everything from cough to pneumonia." Some systems are kind of ordered, e.g., ICD, but others, such as LOINC, are not ordered. There is no useful meaning to "the set of all LOINC codes from 3456 to 6758." Only the unique structure of the code system allows us to conceive "classes" and express them in ways that are unique to the coding system.
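
To make the point concrete, here is a small Python sketch of a subsystem defined through the structure of its code system rather than through a code range. Everything in it is an assumption for illustration: TOY_SYSTEM is an invented stand-in for an external code system, the codes T-001 to T-003 and their properties are placeholders (not real LOINC content), and the Subsystem class is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Invented stand-in for an external code system: code -> structural properties.
TOY_SYSTEM: Dict[str, Dict[str, str]] = {
    "T-001": {"specimen": "BLDV", "analyte": "glucose"},
    "T-002": {"specimen": "UR", "analyte": "glucose"},
    "T-003": {"specimen": "BLDV", "analyte": "sodium"},
}


@dataclass(frozen=True)
class Subsystem:
    """A subset of the system, selected by a rule over the system's structure."""
    name: str
    rule: Callable[[Dict[str, str]], bool]

    def contains(self, code: str) -> bool:
        props = TOY_SYSTEM.get(code)
        # Codes unknown to the system belong to none of its subsystems.
        return props is not None and self.rule(props)


venous_blood = Subsystem(
    "tests on specimen BLDV",
    lambda props: props["specimen"] == "BLDV",
)

print([code for code in TOY_SYSTEM if venous_blood.contains(code)])
```

The rule looks at properties that only the code system itself provides; the numeric or alphabetic order of the codes plays no role.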

Derivative

A derived domain is usually a subset of another set. Defining derivatives is what the inclusion and exclusion business is all about. Derivatives can exist on different levels: e.g., in the realm of an institution (hospital department), a country (USA), a treaty (EU), and so on. Thus, each derivative can be defined on another derivative in a larger realm (e.g., a German hospital narrows the domain defined by the EU, which in turn may be a subset of a WHO code).

While system and subsystem are defined externally, derivatives must be explicitly defined. This is what domain specifications are all about. Thus, inclusion and exclusion apply only to derivatives.

This means, we define a derivative as a domain that comes about by applying an operator to two (or more) other domains as the operands of the operation. The operators defined right now are union (+) and difference (-) as explained here. But this construct provides the flexibility to do more things. For instance, we could use it to specify a domain as the common subset of two derivatives defined based on a common system (intersection).

Note that in a complex set algebra expression, such as

A + ( B - C ) + { a, b },

each subexpression makes up its own derivative. To "save space" in the database, such a derivative need not be fully equipped with name, version, etc. There is an unarguable truth, however: every subexpression in an expression of set algebra is itself a set. I exploited this truth to come up with a simple model. If we want to "save space," we can say that an operator can have more than two operands so that

A + B + C + D + ( E - F - G - H )

makes up only three derivative sets:

(+ A B C D)
(- E F G H)
the union of both.
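
The counting above can be checked with a small expression tree. The following Python sketch is an illustration under assumptions: the class names Enumeration and Derivative follow the model, but the fields, the contains method, the helper count_derivatives, and the contents of the operand sets A to H are all made up for the example. An n-ary difference is read as "the first operand minus all the others."

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Enumeration:
    items: FrozenSet[str]

    def contains(self, term: str) -> bool:
        return term in self.items


@dataclass(frozen=True)
class Derivative:
    operator: str                     # "union" or "difference"
    operands: Tuple["Node", ...]      # two or more operand domains

    def contains(self, term: str) -> bool:
        if self.operator == "union":
            return any(op.contains(term) for op in self.operands)
        if self.operator == "difference":
            first, *rest = self.operands
            return first.contains(term) and not any(op.contains(term) for op in rest)
        raise ValueError(f"unknown operator: {self.operator}")


Node = Union[Enumeration, Derivative]


def count_derivatives(node: Node) -> int:
    """Number of derivative sets needed to represent the expression."""
    if isinstance(node, Enumeration):
        return 0
    return 1 + sum(count_derivatives(op) for op in node.operands)


def enum(*terms: str) -> Enumeration:
    return Enumeration(frozenset(terms))


# Made-up operand sets standing in for the domains A .. H.
A, B, C, D = enum("a"), enum("b"), enum("c"), enum("d")
E, F, G, H = enum("e", "f", "g", "x"), enum("f"), enum("g"), enum("h")

# A + B + C + D + ( E - F - G - H )
expression = Derivative("union", (
    Derivative("union", (A, B, C, D)),
    Derivative("difference", (E, F, G, H)),
))

print(count_derivatives(expression))
```

With strictly binary operators the same expression would need seven derivatives instead of three, which is the "space" that n-ary operators save.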

Some people have argued for doing all exclusions after all inclusions; in fact, this seems to have reached a state of agreement. If you want to adhere to this order, you require that all set expressions conform to this pattern:

( I1 + I2 + ... + In ) - ( E1 + E2 + ... + En )

I don't think it is wise to restrict the allowed expressions in this way, especially not at the present time, when we do not yet have enough experience with how domain specifications will be used and what the impact of this constraint is. We might regret this unnecessary constraint later.
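
A tiny worked example shows why the order matters. It uses Python's built-in sets with made-up terms; the names A, B, C and their contents are assumptions for illustration only. Naively regrouping A + ( B - C ) into the inclusions-first pattern ( A + B ) - C changes the result whenever A and C overlap.

```python
# Made-up term sets; "shared" is a term that occurs in all three.
A = {"a1", "shared"}
B = {"shared", "b1", "b2"}
C = {"shared", "b1"}

unrestricted = A | (B - C)   # A + ( B - C )
regrouped = (A | B) - C      # ( A + B ) - C, the inclusions-first pattern

print(sorted(unrestricted))
print(sorted(regrouped))
```

The term "shared" survives in the first expression because A contributes it after C was subtracted from B, but it is lost in the second. So an author who means the first expression cannot simply move the exclusion to the end.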

Enumeration

While large code systems are impractical to enumerate, and while some cannot be enumerated at all, enumeration is useful for small domains. E.g., if all we want is just the phenotypical gender based on SNOMED, all we need to do is to enumerate those codes from SNOMED: male, female, etc. Notably, my above examples using curly braces, such as { a, b, c }, specified sets by enumeration. Special cases are the empty set {} and the set with just one element { a }.

The one-element set is what Stan called "Item." Note that in order to deal with a single element in a set algebra, we first have to "upgrade" the element a into the trivial set { a }; then we can build unions and differences as defined. However, items are also the terms that constitute the enumeration.
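
The "upgrade" of an item can be sketched with Python's frozenset. The helper names enumeration and item are hypothetical, and the term "unknown" is a placeholder for whatever the "etc." in the gender example stands for; none of these are real SNOMED codes.

```python
from typing import FrozenSet


def enumeration(*terms: str) -> FrozenSet[str]:
    """A domain given by listing its terms."""
    return frozenset(terms)


def item(term: str) -> FrozenSet[str]:
    """Upgrade a single term a to the trivial one-element set { a }."""
    return frozenset({term})


# Placeholder terms for the phenotypical gender example.
gender = enumeration("male", "female", "unknown")

# Once upgraded, an item takes part in unions and differences like any set.
without_unknown = gender - item("unknown")
with_other = gender | item("other")
empty = enumeration()            # the empty set {}

print(sorted(without_unknown), sorted(with_other), len(empty))
```

The same term can thus play both roles: as an element it constitutes the enumeration, and as a one-element set it becomes an operand of the set algebra.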

To be continued...