V3DT conference call notes for Mon, Jan 11, 1999.

The HL7 version 3 data type task group held its thirteenth conference call on Monday, January 11, 1999, 11:00 to 12:30 EST.

Attendees were:

Agenda items were:

  1. collections [slide]
  2. incomplete information (what Stan calls "exception") [slide] [slide] [slide]
  3. update semantics [slide]
  4. review of the uncertainty debate time boxed to 10 minutes.

Our time is running out. We have 10 more work days to go until the HL7 meeting in Orlando. Including today's items, we have been able to walk once through almost all the major issues. However, there are lots of detail issues that we have deferred or not even mentioned yet (e.g., week days, ordinal data, what we end up doing with currency). We did not make the mapping from all old types to new types, and we would not be able to do so at this point, since we are still lacking major data types, i.e. the revised person name and address type, the real world instance identifier, and probably more.

Presentation of our Work in Orlando

Anyway, today we are closing up the list of issues. For the rest of the time, we will concentrate on the presentation of our work to the Control/Query TC as well as to the TSC.

The CQ meeting on data types will be held on Wednesday AM. It will give a quite detailed report on what we did. We'll likely need the whole morning session, but with a definite end at noon, so that we won't continue significant parts in the afternoon.

That presentation will be given by several people, including Joann and/or Mike of Kaiser, Stan Huff, probably Mark Tucker, and myself.

Mark Shafarman will schedule the presentation for the TSC meeting. That would be either Monday (retreat) or Tuesday (meeting) night and would last just 15 minutes.

It doesn't make much sense to divide that time among multiple speakers, so I would do that one, unless Mark Shafarman wants to talk as well.

DATA TYPE "BOOLEAN"

Boolean
The boolean type stands for the values of two-valued logic. A boolean value can be either true or false.
PRIMITIVE TYPE

Although the Boolean data type seems like a detail issue, it is quite important. A Boolean value can either be true or false. While Boolean values are the very basic values of all digital information processing machinery, the Boolean data type is useful even in the highest spheres of abstract data analysis.

Use cases for the Boolean type are all RIM attributes with the "attribute type" suffix "_ind" (indicators).

The HL7 v2.x position on booleans was that of an ID data type with a special table that included only the values "Y" and "N". Since the successor data type for ID is Code Value, we could continue to serve the use case for booleans with Code Value constrained to the "Y/N" table.

The reason not to do that is that booleans are just the simplest data type possible and are useful on virtually all levels of abstraction, so that it would be a move toward simplicity to define an explicit boolean data type to be used for all indicators. It is so much easier to use booleans in program decisions, as the following example in a fictitious programming language shows:

    VAR X : BOOLEAN;
    ...
    IF X THEN
      (* X is true *)
    ELSE
      (* X is false *)
    END IF;

By contrast, to deal with an arbitrary Code Value you would first have to check whether the code table used matches the Y_N_TABLE, and then you would have to treat every possible case, including the case that the given value is neither "Y" nor "N".

    VAR X : CodeValue;
    ...
    IF X.codeSystem == CodeSystem.Y_N_TABLE THEN
      IF X.value == "Y" THEN
        (* X is true *)
      ELSE
        IF X.value == "N" THEN
          (* X is false *)
        ELSE
          (* EXCEPTION: X is neither true nor false *)
        END IF;
      END IF;
    END IF;

Why would we not want to use boolean data types?

Backwards compatibility to v2.x has never been (and should not be) the major issue for design decisions for v3.0. However, through type conversions we can actually allow for backwards compatibility. Thus, a Boolean would convert to a Code Value by using the Y/N table. Any Code Value with the coding system set to the Y/N table can be converted to a boolean.

Note: We should, however, not define a conversion from Integer Number to Boolean on the basis of 0 = false, 1 = true. While the Y/N table's semantics is clearly to represent boolean values, the mapping of booleans to numbers is not semantically suggested, nor is the mapping style determined by semantics (e.g. one could map false to -1 and true to 0, or false to 0 and true to non-zero just as well).
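
The two conversions described above could be sketched roughly as follows in Python. This is only an illustration under assumptions: the names CodeValue, Y_N_TABLE, boolean_to_code_value and code_value_to_boolean are hypothetical and not part of any HL7 definition, and a real Code Value carries more components than shown here. In line with the note above, no conversion from integers is offered.

```python
from dataclasses import dataclass

# Hypothetical identifier standing in for the v2.x Y/N table.
Y_N_TABLE = "Y_N_TABLE"


@dataclass(frozen=True)
class CodeValue:
    """Simplified stand-in for a Code Value: a code system plus a code."""
    code_system: str
    value: str


def boolean_to_code_value(flag: bool) -> CodeValue:
    """Convert a Boolean to a Code Value drawn from the Y/N table."""
    return CodeValue(Y_N_TABLE, "Y" if flag else "N")


def code_value_to_boolean(cv: CodeValue) -> bool:
    """Convert a Y/N-table Code Value to a Boolean; reject anything else."""
    if cv.code_system != Y_N_TABLE:
        raise ValueError(f"not a Y/N table code: {cv.code_system!r}")
    if cv.value == "Y":
        return True
    if cv.value == "N":
        return False
    raise ValueError(f"value {cv.value!r} is neither 'Y' nor 'N'")
```

A receiver that only understands v2.x style codes would call boolean_to_code_value before sending; the reverse direction has to be prepared for the two error cases, which is exactly the extra work the Boolean type avoids.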

Some people might think that using the Y/N table to capture Boolean semantics is more flexible, because they could later extend the table to cover other (exceptional) values. For instance, some might want to add the value P for "perhaps" and U for "unknown". I call those two extensions to the Y/N table "generally applicable", since they are conceivably valid for all cases where the Y/N table is used. However, those extensions of the Y/N table are not necessary in the context of this data type proposal, since "perhaps" is covered by all the mechanisms to define uncertainty, and the "unknown" exception is covered by the incomplete information mechanisms defined further below.

Other people might still think that the Y/N table should be used to allow for subsequent extensions. An example might be the patient death indicator, where Y/true means the patient is dead and N/false means that the patient is alive. Now one could make the case that a patient after the diagnosis of "brain death" might be kept in a vegetative state until some organ transplantation. This would be a status between life and death that falls neither in the category of uncertainty nor in that of incomplete information. So, one might need to extend the Y/N table by "B" for "brain death".

Clearly, such extensions of the Y/N table could be made only at one point of use of the Y/N table, e.g. only the death indicator would use the Y/N table extended by "B" for "brain death". This means that death indicator no longer would be defined as a code from the Y/N table, but from a "death code" table. According to the MDF, the attribute type suffix "_ind" would have to be changed to "_cd".

If "death indicator" had been defined as a Boolean in version 3.0 and later had to become a code from the "death code" table, one could either simply change the data type definition between versions or, instead, add another field, such as "death detail status", to be used if "death indicator" is true. Those changes in the use of the field require RIM changes regardless of whether we used the Boolean data type or not.

If nothing else, a Boolean data type could help sharpen the analytic work of the committees, because it would be absolutely clear whether or not there can be other values aside from the two opposites represented by true and false.

Collections

HL7 v2.x used the word "repeating" to describe certain qualities of the definition of fields and segments. This reflected the observation that "repeated" stuff could occur multiple times in the message. However, obviously there must be a reason why someone would make the decision that a segment or a field is to be repeatable in a message. It turns out that there are different reasons to make that decision. It was never clear from the HL7 spec. what the meaning of repeatability was in every instance.

The stuff that could repeat was either a segment or a field. For the purpose of this discussion we will consider the v3 equivalent of a segment to be a class, whereas the v3 equivalent of a field is an attribute.

When segments repeated in v2.x, this expressed what in v3 is a relationship (cardinality) between classes. When fields were declared "repeatable", this expressed a relationship between an attribute and its data values. We will concentrate here on the relationship between attributes and data values rather than on inter-class relationships, although what we say here is equally valid for class relationships.

In general, when things end up being "repeatable" we have a collection of things.

Consider the example of Patient "telephone number" (tel) that might be declared as a "repeatable" field in version 2. The meaning of this is obviously that a patient has several telephones; we usually say a patient has a "set" of telephone numbers. The word "set" implies that (1) it would not be meaningful if a given telephone number occurred twice, and (2) the order of telephone numbers does not matter.

Obviously from those criteria we can generate a table of all possible combinations:

                   unordered   ordered
    no multiples   SET         *
    multiples      BAG         LIST

The ordered sequence without multiples is marked by an asterisk since this case is rarely considered in the computer science literature.

SET
collection of elements with no notion of order or duplicate element values. The number of distinguished elements in the set is called its "cardinality".
LIST (or SEQUENCE)
ordered collection of elements where the same value can occur more than once at different positions in the ordered collection. The notion of a LIST can be derived from the notion of a set if we extend each element by a position counter. The number of elements in the list is referred to as the "length" of the list.
BAG
unordered collection of elements where each element can occur more than once. A BAG can be constructed if we extend each element with an occurrence counter. The total number of things in the BAG can be called its "size".
VECTOR
A LIST with a specific length. Every position in that list represents one "dimension". These need not be the dimensions of 3D space. Elements of a vector need not be numbers. Vectors are just a quantitative restriction on the LIST type, i.e. the list must have a particular length. Lists can be restricted in other ways, e.g. to lengths between 1 and 5, but those things are not vectors.
MATRIX
A two dimensional VECTOR, implemented as a VECTOR of a specific number of VECTORs of a specific length. We do not yet have a use case for matrices. Matrices are used for vector transformations or to describe network structures. Images could be thought of as matrices, but this is not the only way to think of images.
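
For readers who think in code, the collection flavors defined above map fairly directly onto common library types. The following Python sketch is only an illustration; the sample values and the helper make_vector are hypothetical, and using collections.Counter as a BAG is one possible choice among several.

```python
from collections import Counter

# SET: no order, no duplicates. Its size is the "cardinality".
phones = {"555-0100", "555-0199", "555-0100"}  # the duplicate collapses
cardinality = len(phones)

# LIST: ordered, the same value may occur at several positions.
steps = ["A", "B", "A"]
length = len(steps)

# BAG: unordered, each element carries an occurrence counter.
fruit = Counter(["apple", "orange", "apple"])
size = sum(fruit.values())


def make_vector(values, dimension):
    """Return a VECTOR: a LIST restricted to exactly `dimension` elements."""
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    return tuple(values)


point = make_vector([1.0, 2.0, 3.0], dimension=3)
```

Here cardinality is 2, length is 3 and size is 3, which shows how the same three input values end up as three different collections depending on the flavor chosen.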

We want to do away with language that speaks of "repeated attributes" and want to promote clarity regarding what specific semantic flavor of collections is meant.

Consider waveforms, where "repeatedness" became quite tricky in v2.x. Now we can define a sample of an n-channel waveform signal as a list of n-dimensional vectors, where each vector stands for a particular time.
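
As a rough illustration of the waveform idea, the following Python sketch builds a list of n-dimensional vectors and checks that every vector has the same number of channels. The function name make_waveform and the sample values are made up for this example; nothing here is a proposed message format.

```python
def make_waveform(samples, channels):
    """Return a LIST of `channels`-dimensional VECTORs, one per time step."""
    waveform = []
    for position, sample in enumerate(samples):
        if len(sample) != channels:
            raise ValueError(
                f"sample {position} has {len(sample)} values, expected {channels}"
            )
        waveform.append(tuple(sample))
    return waveform


# Two time steps of a hypothetical 3-channel signal.
signal = make_waveform([(0.1, 0.0, -0.2), (0.3, 0.1, -0.1)], channels=3)
```

The outer list is ordered by time, so LIST semantics is appropriate here; the inner vectors are ordered by channel.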

One question was always associated with collections in HL7: how do we update those collections? We can distinguish the following cases:

  1. The elements of the collection have instance identifiers. Thus we can change some values of those elements. For example, if we have a list of individual Practitioners, and if one practitioner changes her last name, we can simply change the last name of that individual instance. The only requirement is that the list elements have identity.
  2. The elements of the collection have no identity. Changing the value of any given element is in fact changing the collection itself. Although we could change the value of the third element of a list of numbers, in a set or bag of numbers there is no "third element". The only thing we can do here is remove an old element from the collection and add a new element. Thus the question boils down to: How do we change the collections themselves?

One solution is to allow collections to be updated only through separate trigger events with explicit message structures that would specify exactly what would be changed in which way. While this strategy works fine for high level RIM objects, such as Encounter_practitioner, Clinical_observations, etc., for things like "set of stakeholder phone numbers" it is a bit too much of a burden to define specific trigger events.

But even if we had a trigger event "change patient phone numbers", it is not clear how we would specify what exactly should be changed.

For v2.x the answer always was: you send a snapshot of the collection as you want it to be, and the recipient simply throws away whatever he knows and remembers only what you just said. This works somewhat in situations with one master information producer and several slave information consumers, but it is totally insufficient for collaborative information management. For example, my message could wipe out all the telephone numbers that you already know. The proposed solution is described below under update semantics.
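
The wipe-out problem with snapshots can be shown in a few lines of Python. The phone numbers and variable names are invented for this sketch.

```python
# What the receiver already knows about the patient's phone numbers.
receiver = {"555-0100", "555-0111"}

# The sender never learned about 555-0111 and wants to add one number.
sender_view = {"555-0100"}
new_number = "555-0122"

# v2.x snapshot style: the receiver throws away its own set and keeps
# only what the sender said.
after_snapshot = sender_view | {new_number}

# Incremental style: only the stated change is applied to the
# receiver's own set.
after_include = receiver | {new_number}
```

After the snapshot, 555-0111 is gone even though nobody asked for its removal; after the incremental include it is still there.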

Incomplete Information

In v2.x we had the special values not present (||) or null (|""|) that could be sent instead of any other value in almost every field in a message. The semantics of those special values were twofold: (1) not present expressed that information was missing, and (2) null was able to remove existing information at the side of the receiver, so that this information was missing afterwards. We will factor this "update" component out into update semantics below. Here we only deal with the representation of incomplete information.

No Information
A No Information value can occur in place of any other value to express that specific information is missing and how or why it is missing. This is like a NULL in SQL but with the ability to specify a certain flavor of missing information.
    component name   type/domain   optionality   description
    flavor           Code Value    optional      The flavor of the null value. Can be interpreted as the reason why the information is missing.

The "flavor" of the null value can be interpreted as the reason why the information is missing. For the time being we keep the list of possible flavors of null subject to open discussion. The number of different flavors of null values in existing systems ranges between 1 (SQL) and 70 (reported by Angelo Rossi-Mori).

Stan Huff's CE proposal contains the following null values:
    U      unknown             no information at all, i.e. nothing more is known about the circumstances of missing information.
    UASK   asked but unknown   the person asked could not supply the information (why?)
    NAV    not available       the person asked does have the information somewhere, but it is not available right now (e.g. oh, I wrote down what the doctor said last time, but I didn't bring this piece of paper with me).
    NA     not applicable      e.g. an answer to "gestational age" for a patient who is not pregnant.
    NASK   not asked           the person who should collect that information forgot to ask.

My criticism of Stan's list is mainly that I don't see any attempt to systematize the null values or to be exhaustive about them. However, now that we have defined a fairly general data type for no information, and as we factored update semantics into its own method, I regard this issue as less important. In most cases, all that people need is the No Information without the flavor component.
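
A No Information value with an optional flavor could look roughly like this in Python. This is a simplified sketch: the flavor is held as a plain string rather than a full Code Value, the class and attribute names are hypothetical, and the flavor codes are just the five from Stan's list, which is still open to discussion.

```python
from dataclasses import dataclass
from typing import Optional

# Flavor codes from the CE proposal quoted in these notes.
FLAVORS = {
    "U": "unknown",
    "UASK": "asked but unknown",
    "NAV": "not available",
    "NA": "not applicable",
    "NASK": "not asked",
}


@dataclass(frozen=True)
class NoInformation:
    """Stands in place of any value; the flavor says why it is missing."""
    flavor: Optional[str] = None  # optional, as in the definition

    def __post_init__(self):
        if self.flavor is not None and self.flavor not in FLAVORS:
            raise ValueError(f"unknown flavor: {self.flavor!r}")

    def describe(self) -> str:
        if self.flavor is None:
            return "no information"
        return f"no information ({FLAVORS[self.flavor]})"


gestational_age = NoInformation("NA")
phone = NoInformation()  # the common case: no flavor at all
```

Keeping the flavor optional reflects the observation that most uses need only the plain No Information value.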

Update Semantics

Update semantics deals with the problem of what a receiver is supposed to do with information in the message. That information may be equal to prior information in the receiver's data base, in which case no questions arise. But what if the information is different?

We can categorize the cases into the following taxonomy:

  1. IGNORE: Ignore the value altogether.
  2. VERIFY: Verify whether the value supplied matches the prior value. If the values do not match, raise an exception.
  3. REPLACE: Replace the value in the data base with the new value supplied in the message. Replace operations may be of the following kinds:
    1. REPLACE VALUE: Change an old value to a new value
    2. DELETE: Change an old value to a No Information value (i.e. a null value).
  4. EDIT COLLECTION: If the data is of some collection type, we change the collection in specific ways depending on the kind of collection:
    1. A SET can be updated in one of the following ways:
      1. include elements: build the union of the set and another set.
      2. exclude elements: build the difference of the set and another set.
    2. A LIST can be updated in one of the following ways:
      1. add element
        1. append
        2. prepend
        3. insert at given position
        4. insert at element with given value
          1. before
          2. after
      2. replace (replace value, set to no information)
        1. by position
        2. by value
          1. first occurrence
          2. last occurrence
          3. n-th occurrence
          4. all occurrences
      3. delete element entirely, changing the positions of all other elements after the deleted one.
        1. by position
        2. by value
          1. first occurrence
          2. last occurrence
          3. n-th occurrence
          4. all occurrences
    3. A BAG can be updated in one of the following ways:
      1. include elements: build the union of the bag and another bag.
      2. exclude elements: build the difference of this bag and another bag.
      3. exclude all of one element: e.g., if a bag contains 5 apples and 3 oranges, you could exclude all oranges without having to know that you actually remove 3 oranges.
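
The SET and BAG operations from the taxonomy above can be sketched in Python as follows. The function names are hypothetical. One point is an interpretation on my part: for bags, "union" is read here as adding up occurrence counts; a reading where the larger count wins would also be defensible (collections.Counter offers that as the | operator).

```python
from collections import Counter


def set_include(current, elements):
    """SET include: union of the set and another set."""
    return current | elements


def set_exclude(current, elements):
    """SET exclude: difference of the set and another set."""
    return current - elements


def bag_include(current, elements):
    """BAG include: occurrence counts are added (assumed reading of 'union')."""
    return current + elements


def bag_exclude(current, elements):
    """BAG exclude: counts are subtracted; elements reaching zero vanish."""
    return current - elements


def bag_exclude_all(current, element):
    """BAG exclude all of one element, without knowing how many there are."""
    result = Counter(current)
    result.pop(element, None)
    return result


basket = Counter(apple=5, orange=3)
no_oranges = bag_exclude_all(basket, "orange")
```

The last line is the apples-and-oranges case from the list: the sender removes all oranges without knowing that there were three.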

In principle, the update mechanism will send an information action code along with each message element instance (MEI). The information action code should be part of the meta model definition of message element instances.
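
For a single (non-collection) value, a receiver's handling of the action code might look like the following Python sketch. The enum Action, the function apply_action and the use of None as a stand-in for a No Information value are all assumptions made for illustration; the actual codes and their encoding in the meta model are not defined yet.

```python
from enum import Enum


class Action(Enum):
    IGNORE = "ignore"
    VERIFY = "verify"
    REPLACE = "replace"
    DELETE = "delete"


# Stand-in for a No Information value in this sketch.
NO_INFORMATION = None


def apply_action(action, stored, supplied):
    """Return the value the receiver should hold after processing."""
    if action is Action.IGNORE:
        return stored
    if action is Action.VERIFY:
        if stored != supplied:
            raise ValueError(f"verify failed: {stored!r} != {supplied!r}")
        return stored
    if action is Action.REPLACE:
        return supplied
    if action is Action.DELETE:
        return NO_INFORMATION
    raise ValueError(f"unsupported action: {action!r}")
```

For example, apply_action(Action.REPLACE, "Smith", "Jones") yields "Jones", while the same values with Action.VERIFY raise an exception.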

It turns out that updating a list is the most difficult thing, since positions are relevant in the list. The problem is concurrent updates: You never know exactly what the list looks like at the receiver's data base when your update message is being processed. For example: if you think the list is (A, B, C) and you want to insert an element D to come before C you may send an (INSERT-AT 3 'D) to insert D at position 3 (and shift C to position 4). However, if someone rearranged the list to (C, B, A) just before your update arrives, the receiver would insert the D between B and A and you get (C, B, D, A). You could have sent an update expression (INSERT-BEFORE 'C 'D) which, at the receiver's side would update (A, B, C) to (A, B, D, C) but also (C, B, A) to (D, C, B, A).
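
The example above can be replayed in Python. The functions insert_at and insert_before are hypothetical stand-ins for the INSERT-AT and INSERT-BEFORE update expressions; positions are counted from 1 as in the text.

```python
def insert_at(items, position, new):
    """Insert `new` at the 1-based `position`, shifting later elements."""
    result = list(items)
    result.insert(position - 1, new)
    return result


def insert_before(items, anchor, new):
    """Insert `new` directly before the first occurrence of `anchor`."""
    result = list(items)
    result.insert(result.index(anchor), new)
    return result


expected = ["A", "B", "C"]      # what the sender believes
rearranged = ["C", "B", "A"]    # what the receiver actually has

by_position_expected = insert_at(expected, 3, "D")
by_position_rearranged = insert_at(rearranged, 3, "D")
by_value_expected = insert_before(expected, "C", "D")
by_value_rearranged = insert_before(rearranged, "C", "D")
```

Both expressions give (A, B, D, C) on the list the sender expects, but on the rearranged list they give (C, B, D, A) and (D, C, B, A) respectively. Neither result is wrong in itself; which one is intended depends on what the sender meant.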

The sender of an update expression has to be very sure whether he wants the new element to appear in a particular position or in a particular sequence relationship with a particular other element, and has to keep in mind that concurrent edits to the same data at the receiver's side can render the sender's assumptions invalid.

For the technical committees this means that LIST collection semantics should only be chosen if the order really matters semantically from the perspective of pure abstract application logic. If the order is probably not important enough to justify the headache around concurrent updates, the committee should choose the SET or BAG flavor. Most collections that I come across are SETs. Bags are very rare. If the collection element type is a class like, e.g., Health_issue, the ranking can (and should) be represented explicitly by a ranking number rather than by implying LIST semantics on some association.

Also note that there are partially ordered collections that often capture the application logic much better than totally ordered lists. Partially ordered collections are collections where elements may have the same ranking, so that you can not always decide whether one element has higher rank than another.
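
A SET of elements with an explicit ranking number, where equal ranks are allowed, might be sketched like this in Python. The class HealthIssue, its attributes, the sample issues and the convention that a smaller number means a higher rank are all assumptions for the sake of the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthIssue:
    name: str
    rank: int  # explicit ranking number; ties are allowed


def outranks(a, b):
    """True only if `a` ranks strictly higher than `b` (smaller number)."""
    return a.rank < b.rank


def group_by_rank(issues):
    """Group issue names by rank; members of one group are not ordered."""
    groups = {}
    for issue in issues:
        groups.setdefault(issue.rank, set()).add(issue.name)
    return groups


hypertension = HealthIssue("hypertension", 1)
diabetes = HealthIssue("diabetes", 1)
hay_fever = HealthIssue("hay fever", 2)

# The collection itself stays a SET, so updates need no positions.
issues = {hypertension, diabetes, hay_fever}
groups = group_by_rank(issues)
```

Neither of the two rank 1 issues outranks the other, which is the situation a totally ordered LIST cannot express.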

With SETs, concurrent updates are not a problem, because the only thing you do is add or remove values to and from the SET, independent of the prior contents of the set. Updating a BAG is equally straightforward. Therefore selecting SET and BAG semantics should be encouraged. SET is often exactly the right semantic kind of collection from the perspective of pure abstract application logic, without implementation considerations.


Next conference call is next Wednesday, January 13, 1999, 2:00 PM EST.

Agenda items are:

  1. review of issues coming up while preparing the presentation
  2. e.g. slight reopening of the uncertainty can

regards

-Gunther