V3DT Minutes of conference call, Thu, Oct 22, 1998.

The HL7 version 3 data type task group held its third conference call on Thursday, October 22, 1998, from 11:00 AM to 12:30 PM EDT.

Attendees were:

Anthony Julian,
Greg Thomas,
Joann Larson,
Larry Reis,
Mark Shafarman,
Mark Tucker,
Matt Huges,
Mike Henderson,
Randy Marbach,
Stan Huff,
Gunther Schadow.

This time we had our first round of discussion about symbols, identifiers, coded elements, and the like. Although we were a bit afraid of getting into a tangle of controversial issues, we did make quite significant progress again.

OVERVIEW OF THE PROBLEM SPACE

	CONCEPT	INSTANCE
REAL WORLD	Coded using mostly externally defined code systems: ICD9, ICD10, SNOMED, DSM-III, DSM-IV, ICPC, LOINC, ICPM, CPT4, etc.	Examples: person names (old PN), organization names (old XON), location descriptors (old AD and PL), legal id numbers (SSN, DLN, etc.)
TECHNICAL	Examples: message type, order status code, participation type code, MIME media type.	Examples: message ids, Service catalog items, RIM instances (order numbers), phone numbers, e-mail addresses, URLs

REAL-WORLD CONCEPTS are concepts that scientists and ordinary people deal with in their minds and formulate in words (this sounds fuzzy, but that's what it is!). Communication must rely on commonly agreed terminology or standard code systems. Those are mostly defined by external (i.e. non-HL7) organizations, such as those organizations representing domain experts in a particular medical specialty.

There is currently a lot of overlap, competition and complementation among code systems. It does not seem as if this apparent disorganization could ever change, because medicine and human life in the real world are always changing. Thus, the communication of real world concepts will always have to deal with issues of translating codes and selecting the best matching "synonymous" code from different code systems.

TECHNICAL CONCEPTS are labels for well-defined concepts, such as protocols. For example: if we say "HTTP" we refer to the hypertext transfer protocol, an Internet standard that is defined quite rigorously. If we ultimately want to know what HTTP is, we can read the specification. However, most often we are not so much interested in what "HTTP" is or in what its meaning is; we just want to use it. So we select an appropriate machinery (e.g. a web browser) and use HTTP.

With Technical Concepts there is no use for different vocabulary, no use for using both "HTTP" and "HypTexTranProt" to refer to the same technical concept. This is not to say that people could not use different names or abbreviations for HTTP, but it means that there is no point in letting everyone choose his own terminology for the exact same technical concepts.

REAL WORLD INSTANCES are individual people, organizations or things that we can meet, point at, think of, go to, etc. The strongest "definition" we can ever make is to point at those people or things, touch them, or take them into our hands and show them. But in documents and human communication we commonly use names and some officially assigned identifiers (e.g. social security number or driver license number). Places are named using residential addresses or other kinds of locators (e.g., building->tract->floor->room->bed).

Things are most often pointed to (e.g. "give me this screwdriver"), or described (e.g., "give me the long screwdriver ... no, the stronger one"). In larger contexts, where we can neither point to things nor unambiguously describe them, we just assign arbitrary inventory numbers to the things.

In general, identifiers for Real World Instances are quite rich in intricacies and we will address those later. The common approach for data types is already laid out by HL7 v2.x: i.e. PN, XON, DLN, AD, PL, and the like.

TECHNICAL INSTANCES are instances that are useful in some technical sense. Just as with Technical Concepts, we are less interested in knowing what exactly those instances are. Rather, we name technical instances because we want to use them. In the case of HL7 most of those technical instances will be particular data instances, such as messages, order numbers, service catalog items, or any other instance of a RIM class that we can refer to.

But Technical Instances are also things like telephone numbers and e-mail addresses or Uniform Resource Locators (URL) to Web pages, images, or chat rooms. The general idea is that what you do with a phone number is rarely to search the phone book in order to find the address where a given telephone is located so that you can meet some person there. This would be to find out what a given telephone number means. In most cases, we choose to use those telephone numbers directly by simply picking up the next phone and dialing that number.

The same is true for database records or data instances on computer systems. We do not go and analyze memory dumps of computer systems in order to find out what a given Technical Instance really is; we just use it in some machinery that, for instance, lets us query for a given record entry or change that record entry.

For the rest of the conference we concentrated on Concepts, both technical and of the real world, and on technical instances.

TECHNICAL INSTANCES

Mark Tucker contributed these conceptualizations on what one can do with an identifier that refers to something that exists within computer systems. It appears that those identifiers can have three levels of quality. They can be

unique (globally)
un-ravelable
de-referenceable

Unique

Suppose you are given two identifiers. What you can always do is compare them literally (i.e. character by character). Suppose it turns out that these identifiers are literally equal. What do you know? You know that they both refer to the same identical instance if and only if you can be sure that the literal match of both identifiers is not accidental, i.e. not the result of some naming conflict.

By narrowing down namespaces we can achieve uniqueness of identifiers quite easily. This is, for example, why in computer programming local variables in procedures are safer than global variables. The really important quality of uniqueness is that identifiers are globally unique. Global uniqueness is generally achieved by a structure defined in the following piece of BNF:

<identifier> ::= <name> <namespace>

<namespace> ::= <identifier>

Obviously this is a recursive structure, i.e. every namespace is itself identified by a name in its parent namespace. This recursion up the namespace hierarchy must somehow be terminated. This is done by assigning one globally unique namespace, where names are valid without the reference to another namespace.

The uniqueness of an identifier does not imply, however, that a given instance could not have several names. Thus if you compare unique identifiers literally and you find that they do not match, you know nothing. Both identifiers can still refer to the same instance.
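The recursive structure and the asymmetry of the literal comparison can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the class name `Identifier`, the function `same_instance`, and the namespace names `root` and `RADIOLOGY` are hypothetical and were not part of the discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identifier:
    """<identifier> ::= <name> <namespace>, where a namespace is itself an identifier."""
    name: str
    # None marks the one globally unique root namespace that terminates the recursion.
    namespace: Optional["Identifier"] = None


def same_instance(a: Identifier, b: Identifier) -> Optional[bool]:
    """Return True on a literal match; return None (unknown) otherwise.

    A mismatch proves nothing, because one instance may carry several names.
    """
    return True if a == b else None


root = Identifier("root")                  # hypothetical root namespace
lab = Identifier("OUTPATIENT.LAB", root)
radiology = Identifier("RADIOLOGY", root)  # hypothetical second namespace

print(same_instance(Identifier("1234", lab), Identifier("1234", lab)))        # literal match
print(same_instance(Identifier("1234", lab), Identifier("1234", radiology)))  # no match: unknown
```

Note that the function never returns False: with unique identifiers alone, a mismatch does not tell us that two instances differ.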

Un-ravelable

An identifier is "unravelable" if we can analyze its pieces, and for each piece, we can find someone to talk to.

Internet domain names (DNS) are unravelable expressions. For example we can unravel the string "falcon.iupui.edu" from the right, where "edu" is maintained by Internic (the organization that assigns top level Internet domains). When the Indiana University Purdue University Indianapolis (IUPUI) registered its domain name "iupui" with the Internic, they had to name an official person who is responsible for "iupui". That person knows what "falcon" is.

ISO Object Identifiers (OID)¹ are unravelable too. ISO OIDs are unraveled from the left. For example, "1.2.840.10008.421292.87828.333433.001" stands for ISO (1) ISO member body (2) USA (840) DICOM Standard (10008) AGFA (421292) ... The leftmost numbers are registered with gigantic organizations. Eventually, a company like AGFA gets a number allocated, say, 421292. It then creates machines, one of which has the number 87828. That machine allocates numbers to an imaging study (333433) that contains a series of images (001).

In unraveling an ISO OID we walk down the path in basically the same way as with DNS names. DICOM has registered people within the US member body of ISO (ANSI). AGFA has registered people with DICOM. They, or someone in the radiology department, could probably tell you that 87828 is the CT machine in the trauma center. Finally, the machine itself allocates identifiers at "computer speeds" to things like studies and images.

HL7 filler orders are somewhat unravelable. For example, you are given the filler order "1234^OUTPATIENT.LAB". If you could figure out what department the symbol "OUTPATIENT.LAB" referred to, then you could call them up, and ask them about item "1234".

As we can see, the quality that an identifier is unravelable is a result of the way the namespaces are managed. Both ISO OIDs and Internet domain names are organized through hierarchical namespaces.
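The three examples above differ only in the direction and separator used when walking from the top-level authority down to the item. A minimal Python sketch, assuming hypothetical helper names (`unravel_dns`, `unravel_oid`, `unravel_filler_order`), makes this explicit. Each function returns the pieces ordered from the most general authority to the most specific item.

```python
def unravel_dns(name: str) -> list[str]:
    """Domain names unravel from the right: the last label is the top-level authority."""
    return list(reversed(name.split(".")))


def unravel_oid(oid: str) -> list[str]:
    """ISO OIDs unravel from the left: the first arc is the top-level authority."""
    return oid.split(".")


def unravel_filler_order(value: str) -> list[str]:
    """'<number>^<department>': the department is whom to ask about the number."""
    number, _, department = value.partition("^")
    return [department, number]


print(unravel_dns("falcon.iupui.edu"))
print(unravel_oid("1.2.840.10008.421292.87828.333433.001"))
print(unravel_filler_order("1234^OUTPATIENT.LAB"))
```

The code only splits the strings. Finding "someone to talk to" for each piece still depends on the registries behind the namespaces, which is exactly what makes an identifier unravelable in practice.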

De-referenceable Identifiers

An identifier is "dereferenceable" if there is a machinery that resolves the identifier for you rather than requiring you to go the rather painful way of unraveling. For Internet domain names there is such a machinery dedicated to resolving names: the domain name service (DNS). The Internet name server next to you will resolve the address for you quite seamlessly. There is a whole infrastructure of domain name services, which is why it takes so long to get an answer from a DNS server if you type in a wrong domain name: your DNS server asks another server, which asks another server, and so on.

For ISO OIDs there is no such easy way of dereferencing. In some cases there may be catalog services that resolve a subspace of the whole gigantic OID namespace.

A telephone number is a perfectly unique and dereferenceable identifier if we start at the root of the namespace provided by the global telephone system. Fax numbers are usually written in a standardized way, where for instance "+49308153355" used to be my old fax and phone number in Germany, while "+13176307960" is my office phone number in the U.S. All you need to do to dereference such a phone number is to pick up your phone, dial the prefix for international calls ("+"), dial the other digits, and be done with it.

Uniform Resource Locators (URL) are another example of dereferenceable identifiers. For instance,

"http://aurora.rg.iupui.edu/~schadow/v3dt"

is our V3DT project homepage. Your browser and the Internet do everything for you after you type in this URL. URLs start by naming the protocol to use; the rest of the URL is a literal that the protocol is supposed to understand. For example, I can view the same homepage as a local file using the URL

"file:/home/schadow/public_html/v3dt/index.html"

In general, for an identifier to be dereferenceable it need not be practically unravelable. For instance, a telephone number is for all everyday purposes not unravelable (only law enforcement is given this privilege). You may be able to figure out a country code (1 for the U.S.) and an area code (317 for Indianapolis), but you will have a pretty hard time finding the number 6307960 in the phone book of Indianapolis.

The important point about dereferencing identifiers is that you do not get down to their "meaning" in the real 3D world through the process of dereferencing. I.e. unless you come into my office, you will never see my machine, "Aurora", featuring the above homepage. And the machinery that dereferences URLs seamlessly does not bring you into my office. All you can do is look at what the Internet/HTTP/Browser machinery brings to your screen as a result of dereferencing the URL identifier. Likewise with the telephone you can call me, but you cannot creep through the wire to see my telephone.

WHAT DOES THIS MEAN TO HL7

I do not remember that we brought this point to closure. There are some concrete propositions that were more or less implicit in our discussion but that we were probably not prepared enough to spell out clearly.

Proposition 1:

HL7 identifiers for technical instances are to be unique.

For identifiers to be unique we have to manage the global namespace. Most importantly every identifier must be explicitly linked to the root of the namespace hierarchy.

Since HL7 has acquired a branch in the tree of ISO OIDs, we are free to use OIDs much as DICOM does, heavily and directly.

Many existing HL7 systems do not assign purely numerical identifiers for the technical instances in their realm. For instance they may use alphanumeric keys into any data file. We might not want to force people to adopt a pure OID scheme for identifiers.

We can, however, assign OIDs to everyone who writes applications for HL7 and everyone who maintains HL7 communications. On that basis people would be free to attach their own naming scheme to their standard OID. If they want, they may use OIDs in their realm, but they may also use freeform identifiers.

Thus, HL7 identifiers for technical instances could be defined as pairs of OID and a Character String to be used for locally defined codes. In particular the HL7 standard would not allow identifiers to be sent without the OID.

Proposition 2:

HL7 identifiers for technical instances should be unravelable if they are not dereferenceable.

This proposition is satisfied if we pursue the data type described above, which uses an OID and an optional freeform identifier that is meaningful only in the namespace designated by the OID and that may never be communicated in HL7 without the OID.

There are issues, however:

Who maintains the assignment of OIDs to HL7 users?
This need not be outsourced; HL7 HQ could do this as a service to its members and for a nominal registration fee for non-members. We can learn from the DICOM and Internic experience how easily this is done.
A database of HL7 OIDs raises privacy issues: How much information is made publicly available about the owners of OID subspaces? Remember that you need at least some information in order to unravel an OID.

Proposition 3:

If HL7 identifiers for technical instances are meant to be dereferenceable, they should be declared as such, and the machinery needed to do the job should be specified.

It almost appears as if we want to have two different data types for technical instances:

One data type would be for unique unravelable identifiers. Those would not be meant to be dereferenceable. These would be used as object identifiers for RIM class instances, things like medical record number, placer and filler order number, service catalog item number, etc.
This data type could for simplicity be constructed of only two components: the required object identifier (an ISO OID) and an optional extension of type Character String.
Another data type would be for dereferenceable identifiers. These would be shaped almost like a Uniform Resource Locator (URL), i.e. having the two components protocol and address, where the format of the address would be determined only by the protocol. Telephone number, e-mail address, and the locator for the reference pointer type would be of this data type.
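As a rough illustration of these two candidate data types, here is a minimal Python sketch. It is not a definition agreed on in the call: the class names `InstanceIdentifier` and `Locator`, the `parse` helper, the placeholder OID "1.2.3.4", and the "tel" protocol label are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceIdentifier:
    """Unique, unravelable identifier: a required ISO OID plus an optional extension."""
    oid: str
    extension: Optional[str] = None  # only meaningful inside the namespace of the OID

    def __post_init__(self) -> None:
        # Accept only dotted-decimal OIDs, so no identifier exists without its OID.
        if not all(arc.isdigit() for arc in self.oid.split(".")):
            raise ValueError(f"not a dotted-decimal OID: {self.oid!r}")


@dataclass(frozen=True)
class Locator:
    """Dereferenceable identifier: a protocol plus an address only that protocol interprets."""
    protocol: str
    address: str

    @classmethod
    def parse(cls, text: str) -> "Locator":
        protocol, sep, address = text.partition(":")
        if not sep or not protocol:
            raise ValueError(f"missing protocol in {text!r}")
        return cls(protocol, address)


# "1.2.3.4" is a made-up placeholder, not a registered HL7 arc.
order = InstanceIdentifier("1.2.3.4", "1234")
phone = Locator.parse("tel:+13176307960")  # "tel" is an assumed protocol label
page = Locator.parse("http://aurora.rg.iupui.edu/~schadow/v3dt")
print(order, phone, page, sep="\n")
```

The sketch leaves the address uninterpreted on purpose, reflecting the idea that only the named protocol determines its format.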

Identifiers may be stable over time or may become invalid. E.g. ISO OIDs are supposed to be stable over time, but Internet domain names and especially URLs can become invalid rather quickly. Telephone numbers can change too. This is another argument for relying on the ISO OID for our unravelable unique identifier for technical instances. But this goes beyond just writing up the conference minutes. A concise proposal will follow for the next conference.

THE "CODE VALUE" DATA TYPE

We define the Code Value data type that will be our basic building block for referring to concepts, both technical and real world concepts. A Code Value is all we can know about a given Symbol, i.e. the literal and the code system that defines a given literal. For example the pair:

< "text/html", "MIME-TYPE">

would refer to the technical concept of an HTML media type, while

< "784.0", "ICD9 CM">

would refer to the real world concept of "headache" as defined by ICD9 (i.e., a concept that in ICD9 does not include "tension headache", 307.81).

The exact structure has more parts:

Note that the definition of this data type has further evolved: [version 2] [version 3]

component name	type/domain	optionality	description
value	Character String	required	this is the plain symbol, like "784.0"
code system	a code by itself	required, can be fixed by context	denotes the code system that defined the plain symbol
code system version	Character String	conditional	a version descriptor defined specifically for the given code system
print name	Character String	optional	a sensible name for the code as a courtesy to an interpreter of the message. THE PRINTNAME BEARS NO MEANING, it can never be sent alone and it can never alter the meaning of the code value
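The component table can be read as a record type. The following Python sketch is one possible reading, not an agreed definition: the class and field names are hypothetical, the code system is simplified to a plain string (which is itself one of the open issues), and the two print names are invented for the example. It shows how "the print name bears no meaning" could be enforced by excluding that component from comparison.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class CodeValue:
    value: str                                 # required: the plain symbol, e.g. "784.0"
    code_system: str                           # required, may be fixed by context
    code_system_version: Optional[str] = None  # conditional
    # Optional courtesy text; compare=False keeps it out of equality and hashing.
    print_name: Optional[str] = field(default=None, compare=False)


a = CodeValue("784.0", "ICD9 CM", print_name="Headache")
b = CodeValue("784.0", "ICD9 CM", print_name="headache, unspecified")
html = CodeValue("text/html", "MIME-TYPE")

print(a == b)     # differing print names do not change the meaning
print(a == html)  # different value and code system
```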

OPEN ISSUES ON THE CODE VALUE

The code system obviously is by itself a technical concept identifier. If we are going to use the data type Code Value for concept identifiers, we have a recursive type definition. Not that recursion is bad in general, but the question is: what terminates the recursion?
If HL7 maintains a list of coding schemes and defines symbols for every one of those schemes, one could be tempted to circumvent this problem of recursion by defining the component named code system as a simple Character String. However, we should be prudent here: what happens if HL7 outsources its code of coding systems? What happens if there are multiple codes of coding systems (e.g. suppose the CEN coding system registry standard becomes an ISO norm)?
We can decide that HL7 will for all times maintain its registry of coding systems. And if HL7 will outsource the maintenance of the registry of coding systems in the future, that we would always require only one backward compatible registry to be used. If we say this we could shortcut any recursion by using a Character String here. But we should know what we are doing!
The code system version is used as a refinement of the code system descriptor. Logically, any version information is useful only together with the code system identifier. We would usually reflect this in a nested structure such as
<value <system, version> print-name>.
Stan Huff did not want this kind of nesting; Mark Tucker and I think that we should not be worried about nesting in any way. However, we do not want this to be a controversial issue, so we agreed on flattening the structure here. It is quite an exceptional situation anyway.
What is the hard difference between a code system name and a version? For instance, is "ICD" the name and "9" or "10" the version? If so, what about the derivatives of ICD-9 (e.g., ICD-9-CM) and ICD-10 (e.g., ICD-10-PCS)? What about the minor versions where a few codes are taken out or brought in every now and then? If we define all coding systems in a special HL7-maintained table, why do we not just define new symbols for every new major and minor version coming out? How can we ensure that the stuff people will put into the version component is standardized and interoperably useful?
A possible answer to some of this is: whenever a code system changes in an incompatible way, like between ICD-9 and ICD-10, it creates a new code system, not just another version, regardless of what the maintaining organization calls that update. WHO speaks of the "International Classification of Diseases, 9th revision", but HL7 still considers this another coding system, not just another revision or version of basically the same code system. By contrast, when LOINC updates from revision "1.0j" to "1.0k", HL7 would consider this to be another version of LOINC.
HL7 would still have to make sure that the true version identifier of LOINC 1.0j is exactly one of "1.0J," "1.0j," "1.0-J," "1.0 j," and not just any of those. While the organization that maintains a code system will have its own version numbering scheme, it will not define unambiguous exact string representations for its revisions. And we cannot expect it to do that. So we have to maintain a list of the versions or a set of clearly defined rules on how the version identifying string is formed.
Anthony Julian was worried about unregistered local coding schemes. It would be a nightmare if every idiosyncratic coding scheme that changes every so often had to be registered in a central repository. The administrators of HL7 communications are often not in control of the code systems used at their sites and would be reluctant to take on the burden of registering with HL7 before use while still dealing with all the headaches that those changes involve at their sites.
The answer could be to say that locally defined coding systems do not have any meaning outside the defining organization. Thus there is no point in registering anyway. As long as the coding system identifiers do not collide with the HL7 defined code system identifiers, it wouldn't matter if there are code system name conflicts between different sites for their local code systems.
Traditionally, HL7 defined the letter "L" to stand for any local system, or, if more than one local code system exists at a given site, to name those "99zzz" where z would be a digit. We can loosen this constraint a little bit by saying that every code system name starting with "99" be local.

The unfilled circles in the above item list are reminders that we will have to check consensus on those.
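The naming convention for local code systems mentioned in the last item can be written as a one-line check. This is a minimal sketch of the loosened rule as stated there; the function name `is_local_code_system` and the sample name "99LAB" are hypothetical.

```python
def is_local_code_system(name: str) -> bool:
    """True for the traditional "L" and for any code system name starting with "99"."""
    return name == "L" or name.startswith("99")


for name in ("L", "99123", "99LAB", "LOINC", "ICD9 CM"):
    print(name, is_local_code_system(name))
```

Under the traditional, stricter rule only "L" and "99" followed by three digits would qualify. The loosened rule also accepts names such as "99LAB".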

DATA TYPE FOR REAL WORLD CONCEPTS

Stan Huff and everyone else agreed that the old CE data type and its interim proposed successors (with various names LCE/CWE and CE/CNE) were basically one pair of Code Values as defined above plus a free text string that could be used to convey the original text in an uncoded fashion.

Neither Stan Huff nor anyone else objected that the new data type for real world concepts could be defined as a general collection of Code Values with one, two, or more codes.

We agreed that there is an important distinction to make in the semantics of a collection of Code Values. Two such semantic flavors exist:

A collection of quasi-synonyms, i.e. codes that have been selected from different coding systems in order to convey the same meaning.
A collection of codes, possibly from the same coding system, that modify the overall meaning.

We recognize that both flavors of collections of code values will have to be supported by the new data type for real world concepts. An example from HL7 v2.x is the "specimen source code" in the OBR-Segment, which was such a conglomerate of quasi-synonyms and modifiers.

We are not afraid to define the new data type for real world concepts as a rich nested structure, as long as we are very specific about the meaning of such a structure.

Stan Huff wants to see the new data type for real world concepts keep track of the systems which perform translations on those codes. Thus every code value could be annotated by whom, when and how a particular quasi-synonymic code value was added to the collection of quasi-synonyms.

I want to make sure that the new data type for real world concepts keeps track of the order in which translations were performed and of the quality of those translations.
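Since nothing more specific was discussed, the following Python sketch is purely illustrative of how the requirements voiced so far might fit together. Every name in it is an assumption: `ConceptDescriptor`, `Translation`, the local code "HA01" in a local system "99LOC", the translator name, and the numeric quality score were all invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

CodeValue = Tuple[str, str]  # simplified to (value, code system)


@dataclass
class Translation:
    code: CodeValue
    translated_by: Optional[str] = None  # which system added this quasi-synonym
    source: Optional[CodeValue] = None   # the code it was translated from
    quality: Optional[float] = None      # hypothetical score between 0 and 1


@dataclass
class ConceptDescriptor:
    # Quasi-synonyms are appended in the order the translations were performed.
    synonyms: List[Translation] = field(default_factory=list)
    # Modifiers change the overall meaning and may share one coding system.
    modifiers: List[CodeValue] = field(default_factory=list)
    original_text: Optional[str] = None

    def add_translation(self, code: CodeValue, translated_by: str,
                        source: CodeValue, quality: Optional[float] = None) -> None:
        self.synonyms.append(Translation(code, translated_by, source, quality))


concept = ConceptDescriptor(original_text="headache")
concept.synonyms.append(Translation(("HA01", "99LOC")))  # originally entered local code
concept.add_translation(("784.0", "ICD9 CM"), translated_by="hypothetical-mapper",
                        source=("HA01", "99LOC"), quality=0.9)
print([t.code for t in concept.synonyms])
```

Keeping quasi-synonyms and modifiers in separate lists is one way to stay specific about the meaning of the structure. Whether the final data type separates them this way remains open.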

Stan Huff and Mark Shafarman want to see clearly how the "exception handling" would be dealt with. The distinction between "Code without exceptions" and "Code with exceptions" was proposed before, and we should make sure that we capture the requirements that this proposal tries to address. Stan Huff also mentioned that he recognizes the general applicability of those "exception handling" mechanisms to other HL7 data fields that are not declared to be of this coded data type.

We did not yet discuss anything more specific.

TECHNICAL CONCEPTS

There was a pending notion that the data type for technical concepts could just be the Code Value although this was not confirmed explicitly.

RESOLUTIONS

We have consensus that Code Value will be the basic building block for concepts, both technical concepts and real world concepts. We somehow agreed that the division of the problem space into the four fields is valid. Nota bene: There are no hard conclusions yet, but we are well on the path to closure on three of the four fields pretty soon. And this includes real world concepts, believed to be the most controversial part.

ISSUES

Almost everything discussed above is an open issue, except for the things listed under Resolutions.

Some specific open issues on the Code Value are listed above. We will have to check the proposed solutions for consensus, or we will have to negotiate other solutions.

For the next call

The next call is on Thursday, October 29, at 11 EST (yes, the world switched back to normal time!) on the usual number 1-800 869-6684. On the agenda is continuing the promising discussion of last time, pushing forward to closure as much as we can without running over important issues. I will prepare a detailed but concise proposal with check boxes for the next conference, to be released, say, Monday. Attendees are kindly requested to watch out for this and prepare so that we get all issues on the table to reach a true consensus.

Thank you and regards,

-Gunther Schadow

Footnotes

¹ For more information on ISO Object Identifiers go to http://www.alvestrand.no/objectid