V3DT conference call minutes of Thu, Oct 8, 1998.

The HL7 version 3 data type task group held its first conference call on Thursday, October 8, 1998, from 11 AM to 12 noon EDT.

Attendees were:

I General Procedure

We scheduled the next four conference calls for Thursdays, October 15, 22, and 29 and November 5, at 11 AM Eastern time (daylight saving time or standard time, whichever applies on the respective date). We will try to stick to this time of the week throughout the working period.

There are only 16 weeks until the January HL7 meeting, and there are lots of details to resolve. That is why we will start at a weekly pace and loosen up later, depending on how much work we have accomplished.

II Approach to the v3 data type design

Mark Tucker proposed starting from the existing 2.3 data types. Working stewardship of those existing types would be assigned to people, who would propose the fate of their types in version 3: either drop the type entirely or carry it on in a way to be specified.

Mark Shafarman added that he has built a list of proposed version 3 data types, both in tabular form and using Rational Rose. This material can be downloaded from the CQ section of the Duke WWW site.

My concern here is that I intended to redesign the data types top down, starting from requirements and well-defined semantics, and only then going down to specifying the structure of the types (all of this in a reasonably tight time frame -- no worry about the "European" approach). Mark's suggested approach from v2.3 goes the reverse way, bottom up. Mark's concern that we account for every v2.3 data type is addressed on page 17 of the current working draft (revision 1.5), or on the slides:

http://aurora.rg.iupui.edu/~schadow/v3dt/show-43.html

http://aurora.rg.iupui.edu/~schadow/v3dt/show-44.html

which show the existing old data types and the tentative new data types in a synopsis:
VERSION 3 (tentative)           VERSION 2.3 (existing)

Text
  string                        ST
  text                          TX, FT, ED, (HTML, ...)

Things
  real world concept            CE, CF, ID, IS
  real world instances
    person name                 PN, XPN
    organization name           XON
    id number                   CK, CX, DLN
    general location            PL
    residential address         AD, XAD
  technical concept             ID, IS
  technical instance            TN, XTN, EI, HD, RP

Quantities
  integer                       NM
  rational                      SN
  float                         NM
  measurement                   CQ, MO
  point in calendar             TS, DT
  calendar modulo               TM, ID (weekday for VH)

Generic Types
  collections                   (all old "repeated" stuff)
  list                          NA, MA
  set                           QIP
  bag
  interval                      SN, DR, RI
  uncertainty                   (TS)
  incompleteness                not present || vs. null |""|
  update semantics              not present || vs. null |""|
  historical dimension          FC

Remaining Types
  two technical concepts        PT
  merge                         CN, XCN
  put into RIM                  CN, XCN, PPN, TQ
                                QSC, QIP, RCD, SCV

The table of contents of the working draft document shows the basic areas we will have to look at. It reflects the same structure given above. We would start by looking at the three basic areas of data types (text, things, and quantities) to identify the minimal set of data types that gives complete coverage of the unique requirements of those semantic areas.

III Character string data type

See also the continuation of this thread on Character Strings leading to the conclusion.

Then we got right into the first topic: text. The working draft document covers this area fairly completely (everyone is urged to read and comment on section 2.1, beginning on page 20). Our discussion today stayed within subsections 2.1.1 and 2.1.2: character sets and character encodings.

The tentative propositions are:

  1. HL7 assumes a string of characters as one of its basic data types.
  2. HL7 assumes the semantics of Unicode for its character string type. The semantics of Unicode is:
    There is one uniquely identifiable entity for every single character used by any language anywhere in the world.
  3. The task of encoding this "uniquely identifiable entity" so that it can be communicated over the wire between different systems is by and large assigned to the ITS.

ad 1. There was little discussion about proposition 1. It is mentioned because it implies that we will not define an explicit type "char" as most programming languages do. We never did, and there is hardly any need for the concept of a single character on the HL7 application layer.

ad 2 and 3. The important point about sticking to "the semantics of Unicode" is that interpreting characters on the application level is free from any context information. If I wrote an HL7 interface toolkit, I would be expected to expose uniquely identified character entities on my API. Applications using my HL7 interface toolkit should have no obligation to deal with the character set switching escape sequences we defined in the past, unless they want to.

HL7 continues to make no assumptions about the internal workings of HL7 applications. So if people want to deal with lower-layer issues like character representation on their application layer, they can do so by selecting an ITS implementation that does not do the mapping to uniquely identifiable character entities for them.

The impact on HL7 interfacing is not dramatic: we continue to allow different character representations as needed by our "customers", but the issues that arise from using different character encodings are to be handled by the ITS (specifications). For example, XML has its own way to deal with character encoding, and CORBA might have another. An HL7 ITS only has to make sure that whatever technically goes over the wire can be mapped to Unicode.
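As a minimal sketch of this requirement (Python is an illustrative choice here, not something the draft prescribes), the hypothetical helper `to_code_points` stands in for what an ITS layer might do: it takes the bytes that went over the wire together with the name of their encoding and hands the application plain Unicode code points. The sample name and the three encodings are assumptions made for the example.

```python
def to_code_points(wire_bytes, encoding):
    """Hypothetical ITS-layer helper: decode wire bytes into Unicode code points."""
    return [ord(ch) for ch in wire_bytes.decode(encoding)]

# The same sample text as it might arrive in three different wire encodings.
text = "M\u00fcller"
wire_forms = {
    "utf-8": text.encode("utf-8"),
    "utf-16": text.encode("utf-16"),
    "iso-8859-1": text.encode("iso-8859-1"),
}

# On the wire the byte sequences differ ...
bytes_differ = len(set(wire_forms.values())) == len(wire_forms)

# ... but after the mapping the application sees identical code points.
decoded = {name: to_code_points(data, name) for name, data in wire_forms.items()}
same_on_api = len({tuple(points) for points in decoded.values()}) == 1

print(bytes_differ, same_on_api)
```

The application code never has to know which of the three encodings was used; that knowledge stays inside the helper.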

The XML specification [http://www.w3.org/TR/1998/REC-xml-19980210] states:

2.2 Characters

A parsed entity contains text, a sequence of characters, which may represent markup or character data. A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646. The use of "compatibility characters", as defined in section 6.8 of [Unicode], is discouraged.

[...]

The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in "4.3.3 Character Encoding in Entities".

Other interesting readings are to be found at:

http://www.unicode.org/

ftp://ftp.isi.edu/in-notes/rfc2376.txt

ISO/IEC 10646: ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).

Unicode: The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.

ISSUES

A. On Mark Shafarman's question of how to select character encodings in XML on a per-element basis, Gunther said that in XML you choose one character encoding per document and do not switch encodings later within the document. David Webber objected to this and said that there are ways to switch between character encodings and binary materials within an XML document. David wants to provide evidence for this claim.

Meanwhile here is another quotation from the XML specification section 4.3.3 (recommended reading):

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is an error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, for an encoding declaration to occur other than at the beginning of an external entity, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
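The effect of the encoding declaration can be tried out with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` module; the element name `name` and its content are made up for illustration. It parses the same one-element document in three encodings and then feeds the parser Latin-1 bytes without a declaration, which the rule quoted above treats as an error.

```python
import xml.etree.ElementTree as ET

name = "M\u00fcller"  # sample content, chosen for illustration

def make_doc(declared, codec):
    """Serialize a one-element document with an encoding declaration."""
    xml_text = '<?xml version="1.0" encoding="%s"?><name>%s</name>' % (declared, name)
    return xml_text.encode(codec)

docs = [
    make_doc("UTF-8", "utf-8"),
    make_doc("ISO-8859-1", "iso-8859-1"),
    make_doc("UTF-16", "utf-16"),  # Python's "utf-16" codec prepends a Byte Order Mark
]

# Whatever the declared encoding, the parser reports the same characters.
parsed = [ET.fromstring(doc).text for doc in docs]

# Latin-1 bytes with no declaration and no Byte Order Mark are read as UTF-8
# and rejected, because the byte 0xFC is not valid there.
try:
    ET.fromstring(b"<name>M\xfcller</name>")
    undeclared_latin1_accepted = True
except ET.ParseError:
    undeclared_latin1_accepted = False

print(parsed, undeclared_latin1_accepted)
```

Note that each declaration here applies to the whole document; the sketch does not attempt to switch encodings inside it.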

B. Mark Tucker doubts that character sets and character encoding can be contained purely as an ITS layer issue. We will have to talk about the impact of the Unicode proposition on HL7 applications. Here is my take on this:

HL7 would take a standpoint similar to that of XML:

http://www.w3.org/TR/1998/REC-xml-19980210#charsets

http://www.w3.org/TR/1998/REC-xml-19980210#charencoding

But HL7 would be even more abstract. The basic application level requirement is the following:

If HL7 application A claims to store data in "high fidelity," A would have to make sure that every Unicode character it gets from B would be reported back as the same character.

Whatever happens within the systems A and B is out of HL7's scope. The only thing that counts is, for instance, that if B sends a DEVANAGARI OM to application A and application A reports that data back to B, then B sees the same DEVANAGARI OM again. -- This kind of character fidelity is, however, not a necessary requirement of all HL7 applications. HL7 just has to make sure that it supports a seamless way for "high fidelity" applications to exchange their characters without forcing them to deal with special HL7 escape sequences on the application layer.

See http://charts.unicode.org/Unicode.charts/normal/U0900.html for what the DEVANAGARI OM looks like. It is at Unicode code point U+0950.

C. Larry Reis reminded us of the need to provide guidance for existing HL7 applications and a smooth upgrade path.

The UTF-8 encoding serves that need for a smooth upgrade path. If you want to comply with Unicode but all your application can handle is the 7-bit US-ASCII character set, you can claim to use the UTF-8 character encoding and nothing changes for you. You continue to send the lower 7 bits of one byte per character. For, say, the 25-year-old Regenstrief Medical Record System (RMRS), written in VAX BASIC, to comply with the requirement of a "high-fidelity" application, all you have to make sure is that:
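The "high fidelity" requirement amounts to a round-trip test, which can be sketched as follows. Both store classes are hypothetical stand-ins for an application A; how a real application stores its data is out of scope, and UTF-8 storage is only one possible choice.

```python
import unicodedata

OM = "\u0950"  # the character discussed above

class HighFidelityStore:
    """Hypothetical application A that keeps what it receives as UTF-8 bytes."""

    def __init__(self):
        self._rows = []

    def receive(self, text):
        self._rows.append(text.encode("utf-8"))
        return len(self._rows) - 1  # row id used to ask for the data later

    def report(self, row_id):
        return self._rows[row_id].decode("utf-8")

class AsciiOnlyStore(HighFidelityStore):
    """Hypothetical low-fidelity application: non-ASCII characters become '?'."""

    def receive(self, text):
        self._rows.append(text.encode("ascii", errors="replace"))
        return len(self._rows) - 1

def round_trips(store, text):
    """B sends text to the store and checks what comes back."""
    return store.report(store.receive(text)) == text

print(unicodedata.name(OM))
print(round_trips(HighFidelityStore(), OM))
print(round_trips(AsciiOnlyStore(), OM))
```

The second store is still a legitimate application; it simply cannot claim "high fidelity".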

  1. your communication is 8 bit clean,
  2. your data base storage is 8 bit clean,
  3. you do not use the 8th bit for string delimiters internally, and
  4. your screens won't garble up when they are sent 8 bit UTF-8 encoded sequences.

Obviously, for the RMRS the problems would be located at 3 and 4, while 1 and 2 are fine. In this case, i.e. if your environment is not fully 8 bit clean, you can use the UTF-7 encoding. UTF-7 offers backwards compatibility similar to that of UTF-8 (US-ASCII letters and digits are sent unchanged) but does not use the 8th bit. So you won't have conflicts with your internal use of the 8th bit, and your communication can strip the 8th bit if it wants to.
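The difference between the two encodings can be checked with the codecs that ship with Python. This is only a sketch; the sample strings are made up for illustration.

```python
ascii_text = "plain ASCII text"  # sample 7-bit content, made up for illustration
om = "\u0950"                    # DEVANAGARI OM, a character outside US-ASCII

# Pure US-ASCII text has exactly the same bytes in US-ASCII and in UTF-8.
utf8_same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Beyond US-ASCII, UTF-8 sets the 8th bit, so the path must be 8 bit clean.
utf8_uses_high_bit = any(b > 0x7F for b in om.encode("utf-8"))

# UTF-7 never sets the 8th bit and still round-trips the character.
utf7_bytes = om.encode("utf-7")
utf7_is_7bit = all(b <= 0x7F for b in utf7_bytes)
utf7_round_trips = utf7_bytes.decode("utf-7") == om

# Caveat: UTF-7 uses '+' as its shift character, so a literal '+' is escaped.
plus_escaped = "1+1".encode("utf-7")

print(utf8_same_as_ascii, utf8_uses_high_bit, utf7_is_7bit, utf7_round_trips, plus_escaped)
```

The last line shows why UTF-7 is not byte-for-byte identical to US-ASCII for every character, even though it stays within 7 bits.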

For Europeans who used ISO Latin-1, backwards compatibility is not as easy. I fought a three-week argument with the Unicode folks to persuade them to adopt a more flexible UTF character encoding that would allow backwards compatibility with Latin-1 and the other ISO 8859 character sets. However, I did not succeed; the idea does not fly. Notably, it is the European participants who do not think that such a UTF is a good idea.

Generally, our ITS specifications can include the mapping from other character codes into Unicode, and HL7 toolkits and APIs can deal with the mapping of Unicode characters back to whatever character encoding is used by a given application. The XML approach lays out the kind of flexibility that we could embrace: [cf. XML 4.3.3 Character Encoding in Entities, http://www.w3.org/TR/1998/REC-xml-19980210#charencoding]
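To illustrate why Latin-1 needs an explicit mapping where US-ASCII does not, and what the mapping back to an application's encoding might look like, here is a small sketch. The function `from_unicode` is a hypothetical toolkit helper, not part of any HL7 specification.

```python
legacy_bytes = "M\u00fcller".encode("iso-8859-1")  # how a Latin-1 system stores the name

# Unlike US-ASCII, Latin-1 bytes above 0x7F are not valid UTF-8 as they stand.
try:
    legacy_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

# Mapping into Unicode requires naming the source encoding explicitly.
as_unicode = legacy_bytes.decode("iso-8859-1")

def from_unicode(text, target_encoding):
    """Hypothetical toolkit helper: map Unicode back to an application's encoding.

    Returns None when the target encoding cannot represent the text, so the
    caller can decide what to do instead of silently losing characters.
    """
    try:
        return text.encode(target_encoding)
    except UnicodeEncodeError:
        return None

back_to_latin1 = from_unicode(as_unicode, "iso-8859-1")
om_in_latin1 = from_unicode("\u0950", "iso-8859-1")

print(valid_as_utf8, back_to_latin1 == legacy_bytes, om_in_latin1)
```

Returning `None` is just one design choice; a toolkit could equally raise an error or substitute a replacement character, depending on how much fidelity the application claims.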

RESOLUTIONS

We will elaborate on section 2.1.7, Requirements to the ITS, so as to address all of Larry's concerns and more.

We will add another section, 2.1.8 Impact on HL7 Applications, so as to address Mark Tucker's concerns.

I was assigned to do this. Basically, I will add the things discussed in this posting.

I urge everyone to print out and read section 2.1, from page 20 to 27 of the working draft document. You can ignore everything else for now, but please read these seven pages. It is important that you comment on things you did not understand or things you did not like. You can find the draft and the slides at: http://aurora.rg.iupui.edu/~schadow/v3dt/

For the next call, please make sure you have downloaded the most recent version of Netscape Communicator, 4.07, along with Netscape Conference. You do not need to struggle with audio. I'd like to get the conference whiteboard feature up and running within the next two calls. It's free and it's easy, so please take ten minutes to click on Netscape-Help-Software Updates, press the greenish "start" button, and select Netscape Communicator 4.07 at the top and Conference further below. Go through the downloading procedure and say yes to all defaults. After 15 minutes you should be all set.

Thank you and regards,

-Gunther Schadow