V3DT Minutes of conference call, Thu, Oct 15, 1998.

The HL7 version 3 data type task group held its second conference call on Thursday, October 15, 1998, from 11 AM to 12 noon EDT.

Attendees were:

Anthony Julian,
Carlos Sanroman,
Greg Thomas,
Joann Larson,
Larry Reis,
Laticia Fitzpatrick,
Mark Shafarman,
Mark Tucker,
Mike Henderson,
Randy Marbach,
David Webber,
Robin Zimmerman,
Stan Huff,
Gunther Schadow.

This time we were again talking about the basic data types for text and multimedia data.

I Character strings (cont'd.)

See also the beginning of this thread on Character Strings that sets out the problem space.

Larry Reis said that his fears that the Unicode proposition would make it very difficult for legacy system vendors to achieve HL7 compliance are now resolved. The Unicode Transformation Format UTF-8 does allow those systems that assume a character is always one byte to continue to operate properly.

Mark Tucker's concern that the Unicode proposition would not be just an ITS layer issue but would have more impact on the application layer could be resolved as well. Initially Mark thought that an HL7 application would have to deal with a string as a pair of ( encoding, data bytes ) where encoding would be set to UTF-16, UTF-8, 7-bit ASCII, or whatever. We now understand that it is the task of the ITS layer software to convert any incoming character encoding into the encoding that the application can handle.

Examples:

For most Java-based applications the ITS layer would most likely convert the incoming UTF-8 byte format to 16-bit-per-character strings. This is a basic functionality of the Java core API.
Most UNIX-based C and C++ character functions treat one character as an int (16 or 32 bits depending on the CPU native word size). However, the quick and easy approach in C is to use a char * as a string, which is just an array of 8-bit characters.

For those and many other environments that stick to the equation 1 char = 1 byte, the application would like to handle UTF-8 strings, where the normal US-ASCII characters are represented as single bytes. Those applications would tell their ITS software that it should convert everything to UTF-8.
For a super-old system that has a packed array of char where a character has only 7 bits, or that for some other reason strips off the 8th bit, UTF-8 would not be the appropriate internal character format. Those applications would want to use UTF-7 instead.
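
As a rough illustration of what such an ITS layer does, the following Python sketch decodes incoming bytes into an abstract character string and re-encodes it in whatever encoding the application asked for. The function names its_decode and its_encode_for_app are hypothetical and not part of any HL7 specification; the codecs are those of the Python standard library.

```python
def its_decode(wire_bytes: bytes, wire_encoding: str) -> str:
    """Turn bytes received on the wire into an abstract character string."""
    return wire_bytes.decode(wire_encoding)


def its_encode_for_app(text: str, app_encoding: str) -> bytes:
    """Re-encode a character string in the one encoding the application handles."""
    return text.encode(app_encoding)


# Assumption: the sender happened to choose UTF-16 on the wire.
incoming = "M\u00fcller".encode("utf-16-be")
text = its_decode(incoming, "utf-16-be")

# An application that sticks to "1 char = 1 byte" for US-ASCII asks for UTF-8.
utf8_form = its_encode_for_app(text, "utf-8")

# A system that strips the 8th bit asks for UTF-7: every byte stays below 128.
utf7_form = its_encode_for_app(text, "utf-7")
```

In both cases the application only ever sees one encoding; which encoding arrived on the wire is known to the ITS layer alone.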

The key issue is that the ITS layer always performs some translation of the character encoding, according to the encoding of incoming messages and the needs of the application. Although HL7's scope is the message format only, we do recommend that implementors of ITS layers be aware of this character encoding feature, which they should implement. What is important is that the notion of different character encodings does not exist on the application layer. No HL7 specification would be valid that makes any assumptions about character encodings or encoding-related escape sequences on the application layer. We do not even assume that all applications use Unicode. Again the only assumption is that:

There is one uniquely identifiable entity for every single character used by any language anywhere in the world.

We could resolve Mark Shafarman's issue about character encoding and XML. The basic fear was whether there would be some interference between the use of Unicode and its encoding formats and the use of XML. We would not want the Unicode proposition to have negative side effects on our deployment of XML.

The answer was that XML is itself Unicode-aware: per the XML specification, every XML parser is required to support the UTF-8 and UTF-16 encodings.

Mark Shafarman's next sharp-minded question was whether the obvious upper limit of 16 bits per character is enough in the long run. The answer is that all of current Unicode is accommodated in 16 bits, i.e., 65,536 character code positions are enough to encode the contemporary languages, including Chinese, Japanese and Korean ideographs. Those three languages contribute more than 20,000 character positions, but this is still well under half of the positions available.

(BTW: Acknowledgements should go to those three countries, who made a considerable effort of joint standardisation work. Given the historical and political problems in this important corner of the world, this is an almost invaluable achievement. Imagine we had to reserve space for 60,000 ideographs!)

As Unicode expands its scope further into historical scripts (Egyptian or Sumerian) and into such curiosities as the Klingon alphabet, the code would claim another 16 bits. While the UTF-8 and UTF-16 formats can cope with 32-bit characters, many 16-bit-per-character environments (including Java) would fail. Fortunately, the Sumerian and Klingon languages will not have to be supported by HL7 for even the widest foreseeable future.
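
To make the 16-bit limit concrete, here is a small Python sketch. It uses a code point above U+FFFF from the cuneiform range, which was only assigned in a later Unicode version than the one discussed on the call, so treat it as an illustration rather than as part of the minutes.

```python
# One character beyond the 16-bit range (U+12000, a cuneiform sign
# assigned in a later Unicode version; used here only as an illustration).
sign = "\U00012000"

utf8_bytes = sign.encode("utf-8")
utf16_bytes = sign.encode("utf-16-be")

# The abstract string still holds exactly one character ...
char_count = len(sign)                  # 1
# ... which UTF-8 represents in four bytes ...
utf8_length = len(utf8_bytes)           # 4
# ... and UTF-16 as a surrogate pair, i.e. two 16-bit code units.
utf16_units = len(utf16_bytes) // 2     # 2
```

An environment that equates one character with one 16-bit unit would count two characters here, which is the kind of failure mentioned above.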

Summary of the string data type

The character string data type for HL7 is a primitive data type. We will not define any data type for the character itself because there is hardly any use for single characters in medical informatics. Therefore a character string is a primitive data type in HL7. Just as it always used to be.

A string of characters has, of course, an implicit length. But length is an ambiguous concept of little use here. Remember, we only say that each character of any contemporary language of the world is one uniquely identifiable entity. We do not say "how many bytes" this entity would consume in anyone's database. The length of the string is the number of characters in the string, not the number of bytes needed to represent that string in any character encoding.
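
The difference between character count and byte count can be seen in a few lines of Python. This is only a sketch; the sample string is our own and has no special meaning in HL7.

```python
greeting = "Gr\u00fc\u00dfe"   # five characters, two of them outside US-ASCII

char_length = len(greeting)                      # 5 characters
utf8_bytes = len(greeting.encode("utf-8"))       # 7 bytes
utf16_bytes = len(greeting.encode("utf-16-be"))  # 10 bytes
```

Only the first number is a property of the string on the application layer; the other two depend on the encoding chosen by the ITS layer or a database.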

The most important difference to the old v2.x ST data type is that there are no escape sequences defined because those are no longer needed. This is achieved through "the Unicode proposition" and through the clean separation of the character string on the application layer and the bytes on the transport layer.

We no longer need to deal with any problem of a character in the string colliding with some delimiter used by any ITS. The application can happily put vertical bars "|", carets "^", or ampersands "&" into the string without interfering with the underlying message encoding. Of course, the ITS has to deal with escape sequences, but this is done transparently to the application layer. This is, of course, also true of SGML special characters, such as less-than "<", quote """, and again the ampersand "&". There is no need for the application to deal with the SGML escape sequences ("&lt;", "&quot;", and "&amp;" respectively).

Again, we acknowledge that HL7 does not specify the internal working of an HL7 application, nor do we specify the functions of an API. A particular implementation can violate all the rules of layering and of good software engineering and still be HL7 compliant, as long as this does not lead to a different behavior of the HL7 interface.

For example, system XYZ of vendor SICK-TOS is written as a monolithic PDP-11 assembler program that behaves perfectly according to the HL7 spec. It would be a miracle, but it would be HL7 compliant. But even a super-fancy modern system SANI-NET is not compliant with HL7 if it fails every time it receives an ampersand "&" character in a string. This is more important than it may seem: suppose the system SANI-NET "supports" two HL7 ITS interfaces, one for XML and one for CORBA. If it receives "&" via CORBA, it should emit "&amp;" on the XML wire. And if it receives "&amp;" on the XML wire, it should emit "&" on the CORBA wire.
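
The XML half of this round trip might look like the following Python sketch. The function names to_xml_wire and from_xml_wire are hypothetical; escape and unescape come from the standard library module xml.sax.saxutils. The CORBA side is assumed to carry the string unchanged and is therefore not shown.

```python
from xml.sax.saxutils import escape, unescape


def to_xml_wire(text: str) -> str:
    """ITS layer, outbound: escape the SGML special characters."""
    return escape(text, {'"': "&quot;"})


def from_xml_wire(wire_text: str) -> str:
    """ITS layer, inbound: restore the characters the sender meant."""
    return unescape(wire_text, {"&quot;": '"'})


# What the application hands over, for example after receiving it via CORBA.
app_string = 'Smith & Jones say "x < y"'

on_the_wire = to_xml_wire(app_string)
back_in_app = from_xml_wire(on_the_wire)
```

The application string is the same before and after the trip; only the wire form contains escape sequences.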

On the HL7 application layer, we do not bother with any of those ITS-specific encoding issues. The layering approach gives this to us: HL7 is on layer 7, and ITS are layers 6, 5, 4, and 3-1.

II Multimedia-enabled free text.

Mark Shafarman opened up the box on the other type for text, the one that is supposed to be multimedia capable. His question was:

Do we want to allow an entire (structured XML) document to be sent within a free text field?

From the history of HL7 v2.x we can tell that people will use free text data types to send a whole structured document. That's what we have seen all too often with an FT-typed result in the OBX segment. A whole textual microbiology report can thus be squeezed into one OBX; a whole path report may fit into one CF data type and have a single code attached to it.

I think that everyone agreed that we do not want to promote burying data in subdocuments in HL7 text fields instead of using the appropriate HL7 data elements and multiple result data structures for a complex result.

On the other hand, Stan Huff made the point that we do not want to preclude sending a whole document in a free text field in general. There are conceivable use cases where this may make sense. What comes to mind is that a report can be coded and expanded into multiple Clinical_observation instances but we do want to send this data along with the original report written with some word processor, just in case.

Mark Shafarman raised the issue of whether an embedded HTML document could interfere with the embedding HL7 message in XML wire form. This is an important issue that we will have to address.

Perhaps, XML's CDATA sections would be one way to go, escaping all SGML significant tokens in the embedded document is another. In any case, this is an ITS layer issue to make sure that the embedded data does not interfere with the outer message.

What is important for the application layer of HL7 is that those sub-documents would be encapsulated. Subdocuments would generally not be melting with the enclosing message. To melt an XML subdocument with its enclosing HL7 message in XML wire form seems to be a compelling temptation. We do not preclude this in principle, but we do not accept any backpressure on the HL7 application layer because of the syntactic intricacies of such a melting approach.

Encapsulating a document into an HL7 message means that the processing takes place in two steps. First the HL7 message is transformed from the wire-form into an internal representation (what we call the Message Element Instance (MEI) graph.) And second the program finds the multimedia-enabled free text MEI. It then determines what media handler (e.g., word processor or HTML browser) to invoke and hands to it the embedded data.
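
The second step could be sketched in Python as below. Everything here is hypothetical: the FreeText tuple stands in for the multimedia-enabled free text MEI, the first step (wire form to MEI graph) is assumed to have happened already, and the two handlers merely return a description of what a real handler would display.

```python
from typing import Callable, Dict, NamedTuple


class FreeText(NamedTuple):
    """Stand-in for a multimedia-enabled free text MEI."""
    media_type: str
    data: bytes


def show_plain_text(data: bytes) -> str:
    # Assumption: plain text arrives as UTF-8 bytes.
    return "text viewer: " + data.decode("utf-8")


def show_html(data: bytes) -> str:
    return "HTML browser: %d bytes" % len(data)


# Table of media handlers known to this (hypothetical) application.
HANDLERS: Dict[str, Callable[[bytes], str]] = {
    "text/plain": show_plain_text,
    "text/html": show_html,
}


def hand_to_media_handler(mei: FreeText) -> str:
    """Pick a handler by media type and pass it the embedded data unchanged."""
    handler = HANDLERS.get(mei.media_type.lower())
    if handler is None:
        raise ValueError("no handler for media type " + mei.media_type)
    return handler(mei.data)
```

Note that the embedded data is never parsed as part of the HL7 message itself; it is only handed on.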

We design the multimedia-enabled free text data type pretty much along the lines of MIME and HL7 v2.3's ED type. We have the following two semantic components:

the media type descriptor
the data

The media type descriptor of MIME RFC 2046 consists of two parts:

the "top level media type"
the media subtype

MIME media types and subtypes are defined by the Internet Assigned Numbers Authority (IANA). Currently defined media types can be found in http://www.isi.edu/in-notes/iana/assignments/media-types/

The following top level media types are defined:

NAME	PURPOSE
`text`	textual information
`image`	image data
`audio`	audio data
`video`	video data
`application`	some other kind of data
`multipart`	data consisting of multiple MIME entities
`message`	an encapsulated message
`model`	"an electronically exchangeable behavioral or physical representation within a given domain" [RFC 2077]

There are currently more than 160 different MIME media subtypes defined with the list growing quite fast. There is no sense in listing them here. But more important, we are afraid of allowing just everything. Two concrete issues have been brought up:

Larry Reis was wondering whether we are going to overload too many different uses into one single data type.
Everyone seems to recognize the concern that this is a lot of optionality that leads to interoperability problems.

Larry's point is important to keep in mind: we initially started out to build a flexible and powerful free text data type and now we see things like video, application, even message listed in the table of media types. Should we not rather define one type for text, one for image, one for video, etc.?

Remember, the argument [slide 15ff] that led from text to multimedia is that free text is information sent from one human to another human. The receiving human being will - if she has a method to render and see the information - be able to interpret this data.

To understand the full range of meaning of the word "text" we should have a look into Webster's dictionary:

Main Entry: text
Pronunciation: 'tekst
Function: noun
Etymology: Middle English, from Middle French texte, from Medieval Latin textus, from Latin, texture, context, from texere to weave -- more at TECHNICAL
Date: 14th century
1 a (1) : the original words and form of a written or printed work (2) : an edited or emended copy of an original work b : a work containing such text
2 a : the main body of printed or written matter on a page b : the principal part of a book exclusive of front and back matter c : the printed score of a musical composition
3 a (1) : a verse or passage of Scripture chosen especially for the subject of a sermon or for authoritative support (as for a doctrine) (2) : a passage from an authoritative source providing an introduction or basis (as for a speech) b : a source of information or authority
4 : THEME, TOPIC
5 a : the words of something (as a poem) set to music b : matter chiefly in the form of words that is treated as data for processing by computerized equipment <a text-editing typewriter>
6 : a type suitable for printing running text
7 : TEXTBOOK
8 a : something written or spoken considered as an object to be examined, explicated, or deconstructed b : something likened to a text <the surfaces of daily life are texts to be explicated -- Michiko Kakutani> <he ceased to be a teacher as he became a text -- D. J. Boorstin>

Our multimedia extension remains text in the sense of Webster's definitions 5 b and 8. Clearly, word processor documents can contain images such as drawings or photographs. Modern documents can embed video sequences and animations as well. Dictation (audio) is the most important form of pre-written medical narratives. In this sense, an image alone can be text. My slides are full of text, but they are brought to you using GIF, a popular and quite efficient format for low color images.

It seems to be very difficult to draw a sharp line between text made up of written words and graphics, images, etc. Also, the flexibility to send an HL7 message off a radiologist's workstation with dictation in it seems to be compelling.

Mark Tucker interjected here that some systems just do not want to get audio because they can only show text to their users (e.g., behind VT-100 terminals). Clearly this is a matter of application conformance statements to say "I will not handle audio".

Stan Huff said he would not like to see those distinctions made on the level of data types. And indeed, with domain specifications for codes, we can apply constraints on the media types, whether at the time of HL7 message specification (HMD) or for a given application conformance statement. And maybe even in the RIM. For instance, the IMSIG will eventually define a class "Image" which would conceivably contain an attribute "image_data". They certainly do not want to see text or audio here, but only images (and maybe a video clip of a coronary angiography).

HL7's task remains to provide guidance on what media types to prefer in certain use cases. There are four different classes we can assign to any IANA defined MIME type.

mandatory: Every HL7 application must support at least those media types if it supports a given use. There should be one mandatory media type for each general medium (e.g. text, audio, video, etc.). Without a very minimal greatest common denominator we cannot guarantee interoperability. The set of mandatory media types would be very, very small, however, so as not to prevent legacy systems from playing the game. And no HL7 application would be forced to support any given medium (other than text). Only if you do audio must you support the mandatory audio media type.
recommended: Other media types would be recommended for a particular purpose. For any given purpose there should be only very few additional recommended media types, and the conditions and assumptions of such a recommendation should be made very clear.
other: By default any media type would fall into the category other. This means HL7 does not endorse, but does not forbid, the use of this media type. Given that there will be a mandatory or recommended type for most practically relevant use cases, these types should be used very conservatively.
deprecated: Media types that for a given reason should not be used, because there are better alternatives or because of certain risks (e.g., security risks).

The following is a tentative list of media types that is pretty complete to cover most current and some future use cases:

media type	class	use case
Text
`text/plain`	mandatory default	for any plain text. This is our former TX data type.
`text/x-hl7-ft`	recommended for compatibility	this represents the old FT data type. Its use is recommended only for backwards compatibility with HL7 v2.x systems.
`text/html`	recommended mandatory?	for any marked-up text, sufficient for most textual reports, platform independent and widely deployed.
`application/pdf`	recommended	for text, in case absolute control over layout is required. Platform independent, widely deployed, open specification.
`text/sgml text/xml`	other	the demand is not yet very high. There is a risk that SGML/XML is too powerful to allow a sharing of general SGML/XML documents between different applications.
`text/rtf`	other	this format is widely used, but it has its compatibility problems and is quite dependent on the word processor. It may be useful if word-processor-editable text should be shared.
`application/msword`	deprecated	this format is very prone to compatibility problems. If sharing of editable text is required, `text/plain`, `text/html` or `text/rtf` should be used instead.
Audio
`audio/basic`	mandatory	this is the absolute minimum that should be supported by any system claiming to be audio capable.¹
`audio/k32adpcm`	recommended for compression	this allows compressing audio data. It is an Internet standard specification [RFC 2421]. Its implementation base is unclear.
Image
`image/png`	mandatory	portable network graphics (PNG), a widely supported lossless image compression standard with open source code available.
`image/gif`	other	GIF is a nice format that is supported by almost everyone. But it is patented, and the patent holder, Compuserve, has initiated nasty lawsuits in the past. No use to discourage this format, but we cannot raise an encumbered format to a mandatory status.
`image/jpeg`	mandatory	This format is required for high compression with loss of data, which is almost unnoticeable for high color photographs.
`image/g3fax`	recommended for FAX	this is recommended only for fax applications. The format is not well compressed and G3 software is not very widespread.
`image/tiff`	other	although TIFF (Tag Image File Format) is an official standard, it has a lot of interoperability problems in practice. There are too many different versions that are not handled by all software alike.
`image/x-dicom`	other	not sure whether there is an interoperable image file format in DICOM. I know of Papyrus, but is it a DICOM standard?
Video
`video/mpeg`	mandatory	this is an international standard, widely deployed, highly efficient for high color video; open source code exists; highly interoperable.
`video/x-avi`	deprecated	the AVI file format is just a wrapper for many different "codecs"; it is a source of lots of interoperability problems.
Other
`model/vrml`	recommended	this is an openly standardized format for 3D models that can be useful for virtual reality type applications and is used in biochemical research (visualization of the steric structure of macromolecules)
`multipart`	deprecated	this is a format that depends on the MIME standard; we only want to use MIME multimedia type definitions, not the MIME message format
`message`	deprecated	this is used to encapsulate e-mail messages in delivery reports and e-mail gateways; we do not need this for HL7. HL7 is itself a messaging standard that defines its own means of delivery, and HL7 is not used for e-mail.
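
An application or conformance tool could look up the class of a media type along the following lines. This Python sketch simply transcribes the tentative table above; the names CLASSIFIED, DEPRECATED_TOP_LEVEL and classify are our own, and the table itself is subject to change.

```python
# Classes from the tentative list; anything not listed defaults to "other".
CLASSIFIED = {
    "text/plain": "mandatory",
    "text/x-hl7-ft": "recommended",
    "text/html": "recommended",
    "application/pdf": "recommended",
    "application/msword": "deprecated",
    "audio/basic": "mandatory",
    "audio/k32adpcm": "recommended",
    "image/png": "mandatory",
    "image/jpeg": "mandatory",
    "image/g3fax": "recommended",
    "video/mpeg": "mandatory",
    "video/x-avi": "deprecated",
    "model/vrml": "recommended",
}

# Whole top level media types that the list deprecates.
DEPRECATED_TOP_LEVEL = {"multipart", "message"}


def classify(media_type: str) -> str:
    """Return mandatory, recommended, other or deprecated for a MIME type."""
    normalized = media_type.strip().lower()
    top_level, _, _subtype = normalized.partition("/")
    if top_level in DEPRECATED_TOP_LEVEL:
        return "deprecated"
    return CLASSIFIED.get(normalized, "other")
```

For example, classify("image/gif") falls through to the default class "other", matching the table.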

The data part of the multimedia-enabled free text type would not be a character string but a block of raw bits. ASN.1 calls this an "octet string," which is the same as a "byte string." The important point is that the byte string would not be subject to interpretation as characters, but must be passed through from one application's memory into the other application's memory absolutely unchanged.

The ITS layer therefore has two separate tasks: to transport Unicode character strings, and to transport byte strings. These two tasks are separate. This cannot be overemphasized. The problem is that we as HL7 people are used to looking at binary data on top of character string data. The use of hexadecimal digits in escape sequences gave us some ability to transport raw bytes in HL7 messages. The ED data type used base64 encoding of bytes as characters. However, this only makes sense for character-based encoding rules such as the traditional HL7 rules and XML. For an efficient CORBA ITS layer implementation, this would certainly be different. CORBA allows you to transfer bytes without trouble.

Thus, just as character encoding is an ITS layer issue, the encoding of bytes is an ITS layer issue too. On the HL7 application layer we care only about the unchanged communication of a byte string.
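
For a character-based ITS, the byte transport could be sketched in Python as follows. The function names are hypothetical, and base64 is only one possible choice for such an ITS; a CORBA ITS would presumably skip this step and pass the bytes as they are.

```python
import base64


def char_its_encode_bytes(data: bytes) -> str:
    """Character-based ITS, outbound: represent raw bytes as base64 characters."""
    return base64.b64encode(data).decode("ascii")


def char_its_decode_bytes(wire_chars: str) -> bytes:
    """Character-based ITS, inbound: recover exactly the bytes that were sent."""
    return base64.b64decode(wire_chars)


# Every possible byte value, including those that are no valid characters.
payload = bytes(range(256))

wire_form = char_its_encode_bytes(payload)
received = char_its_decode_bytes(wire_form)
```

What matters on the application layer is only that received equals payload, byte for byte.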

We discovered some issues:

When we use this multimedia type consisting of media descriptor and binary data to convey plain text, we raise the issue of character encoding into the application layer.

We certainly do not want to remove the character set and character encoding headache from the character string data type just to reintroduce this problem in the free text data type. One possible solution would be for the ITS layer to detect the special case of text/plain media and then perform the character set translation accordingly, using the same machinery that handles character strings.
MIME types define "parameters." For instance, the text media type has the "charset" parameter specifying the character set and character encoding used. Parameters are generally name-value pairs that could be conveyed in a data type.

The IANA also maintains a registry of character sets. For Unicode characters in UTF-8 encoding we would use the parameter: charset = UTF-8
With text/plain we have the issue of how lines are terminated. This should be standardized. The proper interpretation of the ASCII and Unicode standards suggests that line terminators consist of the two control characters carriage return and line feed. This is also the Internet standard way of doing things. It is natural on MS-DOS descendants, and it is easy for Unix systems, which natively use a single line feed as an end of line, to comply with this requirement.
It is often useful to compress binary data, e.g. using the gzip (deflate) byte stream compression algorithm. Using a media type of application/gzip is obviously not useful. The current (unofficial?) way to deal with this in the MIME e-mail world is to say "Content-encoding: gzip" in one of the MIME headers. We could use our general parameter list for this purpose. However, since almost all data can be subject to byte stream compression (except GIF, JPEG and MPEG, which are already maximally compressed), it seems to be worthwhile to define a separate component for compression.
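
The last two issues can be illustrated with a short Python sketch. The helper to_crlf is hypothetical, and the choice of UTF-8 for the bytes is an assumption on our part; gzip.compress and gzip.decompress are standard library functions.

```python
import gzip
import re


def to_crlf(text: str) -> str:
    """Normalize any line terminator (CR LF, lone CR, lone LF) to CR LF."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text)


# A note written on a Unix system, with single line feeds.
note = "Please call me ASAP!\n" * 200
normalized = to_crlf(note)

# Assumption: text/plain data travels as UTF-8 bytes (charset = UTF-8).
data = normalized.encode("utf-8")

# Byte stream compression as a separate step on top of the data.
compressed = gzip.compress(data)
restored = gzip.decompress(compressed)
```

Because compression acts on the byte stream only, it is independent of the media type, which is the argument for giving it a component of its own.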

Summary of the multimedia-enabled free text data type.

The multimedia-enabled free text data type consists of the following components:

component name	type/domain	optionality	description
media descriptor	MIME type codes defined by IANA	optional, defaults to `text/plain`	used to select an appropriate method to render the data.
parameter set	set of name value pairs defined by IANA	optional, default for `text/plain` is `UTF-8`	used to pass media type specific parameters.
compression	IANA defined code	optional, default: none	an optional byte stream compression like gzip (deflate)
data	byte string	required	the data itself, these are bytes, not characters!

Note that the type specification is not yet formalized because we do not yet have a defined type for those codes, nor have we defined the generic collection types for the SET OF name value pairs.
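
Until the formal notation exists, the four components and their defaults can be pictured with an informal Python sketch. The class name MultimediaFreeText and its field names are hypothetical, and representing the parameter set as a dictionary with a "charset" entry is our own reading of the table, not a decision of the group.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class MultimediaFreeText:
    """Informal picture of the multimedia-enabled free text components."""
    data: bytes                                   # required: bytes, not characters
    media_descriptor: str = "text/plain"          # optional, defaults to text/plain
    parameters: Dict[str, str] = field(default_factory=dict)
    compression: Optional[str] = None             # optional, default: none

    def __post_init__(self) -> None:
        if not isinstance(self.data, (bytes, bytearray)):
            raise TypeError("data must be a byte string, not a character string")
        # Assumed reading of the table: plain text defaults to charset UTF-8.
        if self.media_descriptor == "text/plain":
            self.parameters.setdefault("charset", "UTF-8")


note = MultimediaFreeText("Please call me ASAP!".encode("utf-8"))
```

With only the data given, note ends up as text/plain with charset UTF-8 and no compression.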

RESOLUTIONS

We basically agreed on having two types: character string and multimedia enabled free text. The working document discusses the use case for either type. The decision should be quite clear. Computer processable strings are character strings. For instance in person names, addresses, identification numbers, symbols, etc. formatting makes no sense whatsoever. For all free text, like notes and comments, OBR placer and filler fields, the new free text type is to be used.

No one really saw any problems with the variations in length of such a free text type. Instances of a free text type can vary from a few bytes ("Please call me ASAP!") into the gigabyte range (image, video).

We recognized that there will be a reference data type defined for use with huge data blocks. Issue: should the free text type be allowed to be replaced by a reference? Another issue is that video streams do not fit into a single message; an external stream protocol (such as RealVideo) would be used.

ISSUES

We will have to come back to some of the details. We will have to formalize the notation for composite datatype definition. But there are amazingly few problems left behind our first two conferences. Thanks everyone for making such progress possible!

For the next call

We will attack the field of coded values for the first time. We do not expect to make progress on this as fast as we proceeded on strings and free text. The next round will try to uncover all the difficult issues and may not solve any of them. Unfortunately we do not yet have the conference whiteboard running. Mark Tucker and I found that having a whiteboard to scribble on is quite helpful in expressing the unexpressible.

The next three conference calls are on Thursdays, October 22, 29 and November 5, at 11 AM Eastern time (daylight saving time or standard time, whichever applies on the respective date). We will try to stick to this time of the week throughout the working period.

Thank you and regards,

-Gunther Schadow

Footnotes

¹ The MIME specification [RFC 2046] says:

The initial subtype of "basic" is specified to meet this requirement by providing an absolutely minimal lowest common denominator audio format. It is expected that richer formats for higher quality and/or lower bandwidth audio will be defined by a later document.
The content of the "audio/basic" subtype is single channel audio encoded using 8bit ISDN mu-law [PCM] at a sample rate of 8000 Hz.

This format is standardized by: CCITT, Fascicle III.4 - Recommendation G.711. Pulse Code Modulation (PCM) of Voice Frequencies. Geneva, 1972.