HL7 v3.0 Data Types
Implementable Technology Specification (ITS)
Extensible Markup Language (XML)
Regenstrief Institute for Health Care
We consistently use XML tag names for the names of components, not
for their types. For example, for a Patient object that has a component
dateOfBirth of type PointInTime, we would use the name
dateOfBirth as the XML tag. Thus
PointInTime would not occur as an XML tag.
XML allows data to appear as content of an element (i.e. data between a start tag and an end tag) or as attributes (i.e. given as part of the start tag). XML attributes can only contain unstructured data, i.e. character strings and character string literals (e.g. for numbers or point in time values). XML elements can contain a character string literal, tagged component elements, or both.
This ITS does not distinguish between XML elements and
attributes. All components of HL7 data types can occur as attributes
or elements. The name of the attribute is the same as the name of the
element and defined with the component. One component of each data
type can appear as the element's content. This
content-component usually is the required (i.e. non-optional)
"main" information of the data type, which is often named "value" or
"data." If all other components are optional ("*" or "#IMPLIED"),
the content-component may occur in an attribute.
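The lookup rule described above can be sketched as follows. This is only an illustration using Python's standard `xml.etree.ElementTree`; the function name `get_component`, its `content_component` parameter, and the sample element `Q` are assumptions made for the example and are not defined by this ITS.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def get_component(elem: ET.Element, name: str,
                  content_component: Optional[str] = None) -> Optional[str]:
    """Return a component's literal value, however the sender encoded it."""
    # 1. The component may be given as an attribute of the element.
    if name in elem.attrib:
        return elem.attrib[name]
    # 2. Or as a child element with the same name.
    child = elem.find(name)
    if child is not None:
        return (child.text or "").strip()
    # 3. Or, for the content-component only, as the element's character data.
    if name == content_component:
        text = (elem.text or "").strip()
        if text:
            return text
    # Not provided: the caller applies the default or no-information value.
    return None


# Three hypothetical encodings of the same value with components VAL and UNIT.
forms = [
    '<Q VAL="35" UNIT="S"/>',
    '<Q><VAL>35</VAL><UNIT>S</UNIT></Q>',
    '<Q UNIT="S">35</Q>',
]
results = [
    (get_component(ET.fromstring(f), "VAL", "VAL"),
     get_component(ET.fromstring(f), "UNIT", "VAL"))
    for f in forms
]
```

All three forms yield the same pair of component values, which is the point of not distinguishing attributes from elements.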
An XML element represents a value with a name. The name (e.g., dateOfBirth) is represented by the XML tag name. The value is found as attributes or content data or child elements of the XML element. HL7 specifies the data type to be expected for each value. If the given value conforms to the element's expected type, a message need not contain any explicit type information.
The actual type of an element may differ from the expected
type if there is a conversion rule from the given type to the expected
type. If the given value is of an unexpected type, that type must be
explicitly specified. The type of an element can be specified using
the TY attribute. The TY attribute determines all the features of a value, i.e. all the expected attributes, child elements, and content data, as well as their representation.
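As a small sketch of this rule: a receiver uses the TY attribute when it is present and falls back to the expected type otherwise. The function name `effective_type`, the element contents, and the choice of TS as expected type and ST as explicit type are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def effective_type(elem: ET.Element, expected_type: str) -> str:
    """Return the explicit TY attribute if given, else the expected type."""
    return elem.get("TY", expected_type)


# Hypothetical instances: the first conforms to the expected type,
# the second declares a different type explicitly.
implicit = ET.fromstring("<dateOfBirth>19990101</dateOfBirth>")
explicit = ET.fromstring('<dateOfBirth TY="ST">around 1950</dateOfBirth>')

implicit_type = effective_type(implicit, "TS")
explicit_type = effective_type(explicit, "TS")
```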
Why don't we name that attribute "T"?
The name of XML elements always represents the name of a variable
or component. However, sometimes values are not assigned to names but
may appear anonymously. Anonymous values usually occur as elements of
collections, such as sets or lists. Anonymous values are supplied
Document Type Definitions (DTD) were once the suggested way to define SGML and XML documents. However, the DTD language has severe deficiencies, especially after the XML "simplification".
DTDs have essentially no notion of the difference between data names and data types.
XML invented the idiosyncratic difference between "attributes" and "elements." Both attributes and elements serve essentially the same need, i.e. to label values. But each does this in a different way. The distinction between attributes and elements (once motivated by the idea of a main document text with sporadic annotational information) provides no value; instead it forces the designer to engage in unreasonable decisions and tradeoffs.
DTDs allow the specification of syntax, but the XML syntax is totally free of semantics (not even the simplest macro expansion facility is provided.)
Using DTDs (with XML instead of SGML) requires adherence to a fixed relative positioning of elements. This is not only unnecessary, it jeopardizes one of the key benefits of a tagged representation over a positional representation (such as the traditional HL7 encoding rules.)
The DTD language does not allow constraints to be specified.
This ITS specification will define data type representations using tables that show the representational components of HL7 data types for XML and associate certain properties with each component. These properties define how each component is to be represented in XML, e.g., whether it can appear as an attribute, as an element, or as literal character data of an attribute or an element.
This section outlines algorithms for both building and analyzing XML expressions of HL7 data values. The meaning of the data type tables, though explained in plain English, is ultimately defined by the prototype algorithms.
The following is a prototype table showing all the columns of the data type definition tables and all possible property values.
|long name||tag name||R||C||type||A|
|long name||tag name||O||type||a|
|long name||tag name||D||X||type|
|long name||tag name||X||type|
Each row in the table defines one component of the XML representation of HL7 data types. The HL7 Data Type Specification uses similar tables to help define the data types. However, the components shown in the ITS independent specification are semantic components, not representational components. By and large, each semantic component of a data type has a corresponding representational component in the ITS. However, sometimes a semantic component requires more than one representational component in an ITS (e.g., "encoding" of binary data). Other times, a semantic component is conveyed implicitly as part of a representational component corresponding to another semantic component (e.g., precision of floating point values).
The columns of the table mean the following:
The long name of the component. Usually the name used in the ITS independent HL7 Data Type Specification. Otherwise the component is defined in this ITS specification.
The name used for the XML representation. Since we do not distinguish between XML attributes and elements, we use the same name for tags and attributes.
Occurrence (e.g. optional/required). The values used are:
|value||meaning|
|R||required, component must be present but, if appropriate, can be assigned a no-information (NULL) value.|
|O||optional, component need not be present, in which case it assumes a default value or a no-information value.|
|D||component must be explicitly provided if no default is defined at the point of use of the data type.|
|X||special occurrence rule applies as specified in the notes column.|
This column specifies whether the component can appear as a character data content of a value element. Usually one and only one component has the C property. Also usually, it is a fairly essential and required component that has the C property.
|value||meaning|
|C||component can appear as character data content of an element.|
|(blank)||component cannot appear as character data content of an element.|
|X||a special rule applies as specified in the notes column.|
The default type of the component. This is the expected type and the default for the TY attribute of the component if none is present. If the type provided is different from this expected type and is not a string literal for the expected type, an explicit TY attribute is required.
Specifies whether the component can appear as an
attribute. In general, all components can appear as an attribute
(little-a property). In addition, one component, usually the one
having the C property, can appear in an attribute of an enclosing
element (big-A property). For example, if type
is defined as having components
has the big-A property
is a component of
defined as of type
can be used instead of
The A-property values are
|value||meaning|
|A||big-A: component can be used stand-alone for the entire data type, i.e. the data type can be used for an attribute value giving only this component's value.|
|a||little-a: component can be represented as an attribute of an element representing a value of this data type.|
This column contains important usage notes and constraints.
A simple boolean value is specified using
the entity reference "&true;" for true and
"&false;" for false.
To what do those entities expand? Can we reserve special
literals starting with "#" (e.g. #T and #F)?
A simple no information value is assumed where an XML element or attribute is not provided.
Alternatively, an explicit no information value can be provided
using the entity reference "&null;". Different flavors of
no-information will be defined as
How? Can we reserve special literals starting with
"#" (e.g. #NULL, #UNK, #IINF)?
The Character String is the only "datatype" that XML fully supports. XML character data meets all the requirements for HL7 Character Strings.
XML character data can appear as attribute data or as the content data of XML elements. Character data of XML elements can contain other elements (mixed content, PCDATA). Character data of XML attributes can be subject to further constraints.
The characters less than ("<"), ampersand ("&"), and
double quote (""")
have special meaning in XML and often cannot
appear literally as character data. Therefore these characters should
always be rewritten as "&lt;", "&amp;", and "&quot;" respectively.
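A sender could apply this rewriting with the Python standard library, for instance. The helper name `escape_hl7_string` and the sample text are made up for this sketch. Note that `xml.sax.saxutils.escape` also rewrites ">" as "&gt;", which is harmless.

```python
from xml.sax.saxutils import escape


def escape_hl7_string(value: str) -> str:
    # escape() rewrites "&", "<" and ">" by default; the extra
    # entity map adds the double quote.
    return escape(value, {'"': "&quot;"})


escaped = escape_hl7_string('pH < 7 & "acidic"')
```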
XML character strings can contain end-of-line tokens. This is
consistent with the definition of HL7 character strings. However, in
XML, end-of-line tokens are always normalized to ASCII line-feed (LF)
characters, while in HL7 we did not specify such behavior. Thus, XML
character strings do not distinguish between a carriage-return
(CR), the sequence CR+LF, and the simple LF character.
XML only supports character data, not binary data. Thus, binary data must be encoded in characters. Many such encodings exist, such as hexadecimal digits (4 bit/char), uu-encoding (6 bit/char), base64 encoding (6 bit/char), or base 85 encoding (6.4 bit/char). Base64 encoding has become popular as it is used with MIME e-mail. Base64 is not necessarily the best encoding, as it is less dense than, e.g., base 85. But there is no strong point in supporting multiple encodings, which is why currently only base64 is allowed for true binary data.
Since the BIN data type is used primarily in the free text (FTX) data type, another encoding is defined for the encoding of text data. Both encodings are specified normatively in the following subsections.
|BASE64||D||Base 64 encoding of 24 bits in four characters.|
|TEXT||Text encoding (used only for actual character data.)|
Thus the following examples are legal
The following definition is by and large a verbatim copy of the Internet standard RFC 2045. Modifications include removing text that speaks about MIME, e-mail, or SMTP, which do not affect the specification. However, the original terms of RFC 2045 have been loosened to avoid overspecification. Notably:
Base64 data does not need to be broken into lines at all (or can be broken at line length different from 76 characters per line.)
Characters not in the base64 alphabet are completely ignored and not taken as indication of error.
Padding is recommended but not required. Correct padding is always assumed by the receiver at the end of a base64 data block.
No assumption about line breaks of the originally encoded data is made; specifically, the suggestion to convert to the Internet-canonical CR+LF line breaks has been dropped.
The Base64 Content-Transfer-Encoding is designed to represent arbitrary sequences of octets in a form that need not be humanly readable. The encoding and decoding algorithms are simple, but the encoded data are consistently only about 33 percent larger than the unencoded data.
A 65-character subset of US-ASCII is used, enabling 6 bits to be represented per printable character. (The extra 65th character, "=", is used to signify a special processing function.)
The encoding process represents 24-bit groups of input bits as output strings of 4 encoded characters. Proceeding from left to right, a 24-bit input group is formed by concatenating 3 8bit input groups. These 24 bits are then treated as 4 concatenated 6-bit groups, each of which is translated into a single digit in the base64 alphabet. When encoding a bit stream via the base64 encoding, the bit stream must be presumed to be ordered with the most-significant-bit first. That is, the first bit in the stream will be the high-order bit in the first 8bit byte, and the eighth bit will be the low-order bit in the first 8bit byte, and so on.
Each 6-bit group is used as an index into an array of 64 printable characters. The character referenced by the index is placed in the output string. These characters, identified in Table 1, below, are selected so as to be universally representable.
Any characters outside of the base64 alphabet are to be ignored in base64-encoded data. The encoded output stream can be represented in several lines, where about 76 characters per line is advisable. All line breaks or other characters not found in Table 1 must be ignored by decoding software.
Special processing is performed if fewer than 24 bits are available at the end of the data being encoded. A full encoding quantum is always completed at the end of a body. When fewer than 24 input bits are available in an input group, zero bits are added (on the right) to form an integral number of 6-bit groups. Padding at the end of the data is performed using the "=" character. Since all base64 input is an integral number of octets, only the following cases can arise: (1) the final quantum of encoding input is an integral multiple of 24 bits; here, the final unit of encoded output will be an integral multiple of 4 characters with no "=" padding, (2) the final quantum of encoding input is exactly 8 bits; here, the final unit of encoded output will be two characters followed by two "=" padding characters, or (3) the final quantum of encoding input is exactly 16 bits; here, the final unit of encoded output will be three characters followed by one "=" padding character.
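The grouping and padding rules above can be traced in a short sketch. It spells out the 24-bit grouping by hand for illustration; the function name `b64encode_sketch` is an assumption, and a real implementation would normally call a library routine such as Python's `base64.b64encode`.

```python
# The 64-character alphabet of RFC 2045: index 0 is "A", index 63 is "/".
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")


def b64encode_sketch(data: bytes) -> str:
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pad the final group with zero bits on the right to 24 bits.
        group = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        # Split the 24 bits into four 6-bit indexes, most significant first.
        chars = [ALPHABET[(group >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        # 3 octets -> 4 chars, 2 octets -> 3 chars + "=", 1 octet -> 2 chars + "==".
        keep = len(chunk) + 1
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)
```

The three padding cases correspond to inputs whose length modulo 3 is 0, 1, and 2 octets.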
Because it is used only for padding at the end of the data, the occurrence of any "=" characters may be taken as evidence that the end of the data has been reached (without truncation in transit). If the data block ends prematurely without padding, the receiver should go ahead assuming the necessary padding characters anyway. Thus, padding is not strictly necessary.
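A receiver following the loosened rules might look like the sketch below: it stops at the first "=", ignores everything outside the alphabet, and supplies missing padding itself. The name `lenient_b64decode` is an assumption, and dropping a single dangling character (which carries fewer than 8 bits) is a choice made for this sketch, not something this specification prescribes.

```python
import base64
import string

B64_CHARS = set(string.ascii_letters + string.digits + "+/")


def lenient_b64decode(text: str) -> bytes:
    # Treat the first "=" as the end of the data.
    end = text.find("=")
    if end != -1:
        text = text[:end]
    # Ignore line breaks and any other character outside the alphabet.
    chars = [c for c in text if c in B64_CHARS]
    # A single leftover character carries only 6 bits, less than one octet.
    if len(chars) % 4 == 1:
        chars.pop()
    data = "".join(chars)
    # Assume whatever padding the sender left out.
    data += "=" * (-len(data) % 4)
    return base64.b64decode(data)
```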
Text encoding is used only when the data for a binary data block should be sent unobscured in the XML message. Text encoded data is not inert to XML specific transformation. Notably, text encoding is subject to the rules of character encoding, white space handling, and end of line rewriting.
Text encoded data is handled by the receiver as follows.
Transform bytes to characters according to the character encoding selected for the enclosing XML entity (e.g., the enclosing HL7 message).
Transform or discard white-space according to XML rules.
Transform end-of-line sequences to LF according to XML rules.
Resolve all XML entity references.
Transform characters to bytes according to the rules of the selected character encoding. This can be the same as the original encoding, but it can also be another encoding selected by other means.
Characters that do not exist in a particular character set or character encoding result in undefined behaviour (e.g. characters might be silently dropped or replaced.)
For this reason, text encoding for binary data should be used only if the distortions entailed by the above-mentioned transformations will not affect the meaning or usefulness of the data.
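The kind of distortion meant here can be demonstrated with a reduced sketch of the receiver steps. It covers only character decoding, end-of-line rewriting, and re-encoding; white-space handling and entity resolution are left to the XML parser. The function name and its parameters are assumptions, and replacing unmappable characters with "?" is just one of the possible behaviours.

```python
import re


def receive_text_encoded(raw: bytes, xml_encoding: str,
                         target_encoding: str) -> bytes:
    # Bytes to characters, using the encoding of the enclosing XML entity.
    chars = raw.decode(xml_encoding)
    # End-of-line rewriting as done by an XML parser: CR+LF and CR become LF.
    chars = re.sub(r"\r\n?", "\n", chars)
    # Characters back to bytes. Characters missing from the target encoding
    # are replaced here; other receivers may drop them or fail.
    return chars.encode(target_encoding, errors="replace")


original = b"line1\r\nline2\rline3"
received = receive_text_encoded(original, "ascii", "ascii")
```

The received bytes differ from the original ones even though no character was lost, which is why this encoding is unsuitable for true binary data.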
|media descriptor||MEDIA||O||CV||a||mandatory table|
|encoding||ENC||O||CV||a||mandatory table default depends on MEDIA|
|GZIP||gzip (deflate) algorithm|
|UTF-8||Unicode UTF-8 (backwards compatible to US-ASCII)|
|US-ASCII||7 bit US ASCII (ANSI X3.4)|
|UTF-7||Unicode UTF-7 (almost backwards compatible to US-ASCII)|
|UTF-16||Unicode in 16 bit per character encoding (subject to byte order problems)|
|ISO-10646-UCS-2||Unicode in 16 bit per character encoding (subject to byte order problems)|
|ISO-10646-UCS-4||Unicode in 32 bit per character encoding (subject to byte order problems)|
|ISO-8859-1||ISO Latin-1 (Western European)|
|ISO-2022-JP||Japanese character encoding|
|Shift_JIS||Japanese character encoding|
|EUC-JP||Japanese character encoding|
The Code Value is defined as having many components, yet the most important component is the "value" component, which is a character string.
The code system is often specified as mandatory or as a default by the attribute or data type component declared as a Code Value. The code system version is often implicit, being some recent version, and the version is not very important anyway, since versions of code systems are supposed to be largely backwards-compatible.
Finally, the print name is by its definition redundant information.
In other words, a code value can often be represented by a mere character string. While an implementation of HL7 data types should fill in the appropriate components of the code value, they don't need to be sent automatically, nor do we have to bother implying them by XML specific (and DTD dependent) means, such as #FIXED attributes. Thus, a code value can often be sent as a simple flat XML attribute instead of requiring an XML element with substructures. This not only saves bandwidth (quite dramatically as we shall see) it also adds to clarity of the message.
If more of the code value's components need to be specified, the code value can be sent as an empty element with only attributes. Expanding the XML attributes to elements is possible, but rarely necessary.
|code system||SYS||D||ST||a||must be explicitly provided when there is no default code system specified for some element|
|code system version||VER||O||ST||a|
|print name||PNM||O||X||ST||a||Content data is assigned to the print name instead of the value if VAL is provided explicitly.|
|replacement||REPL||X||X||ST||a||Only if value is %null; or %null.other;. Content data is assigned to the replacement instead of the value if VAL is set explicitly to %null; or %null.other;|
The print name and even the replacement text can appear in the content position if the value VAL is specified as an attribute (and, for the replacement, explicitly as NULL). Thus, the following expressions are allowed:
It is up to us to decide whether we want to trade the slightly more complex logic for the possibility of a nicer look and feel of a message. However, even if the nice look and feel is possible, it requires additional logic at the sender's side to actually use it.
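To give an impression of the additional logic on the receiving side, here is a sketch of the content rule for VAL and PNM (the replacement text rule is left out). The function name, the element name `hairColor`, and the pairing of code B001 with the print name "blond" are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def read_code_value(elem: ET.Element) -> dict:
    content = (elem.text or "").strip()
    cv = {"VAL": elem.get("VAL"), "SYS": elem.get("SYS"), "PNM": elem.get("PNM")}
    if content:
        if cv["VAL"] is None:
            # No explicit VAL attribute: the content is the value.
            cv["VAL"] = content
        elif cv["PNM"] is None:
            # VAL given explicitly: the content is taken as the print name.
            cv["PNM"] = content
    return cv


short = read_code_value(ET.fromstring("<hairColor>B001</hairColor>"))
nice = read_code_value(ET.fromstring('<hairColor VAL="B001">blond</hairColor>'))
```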
Unlike the ITS independent specification we present the XML representation in the reverse order, i.e. bottom up. Note that the purpose of all the types in this subsection (Code Phrase, Code Translation) is to serve the definition of the Concept Descriptor.
A code phrase is simply a collection of Code Values. Thus Code Phrase does not need any special XML definition.
However, often all the code values in a code phrase may come from the same code system and sending the code system over and over again for each CV is a waste. Therefore the code phrase can assign a default coding system which is then applied to all code values found in the code phrase. "Default coding system" means that code values without explicitly specified code system will "inherit" this default code system. Individual code values may still override the default code system.
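The inheritance of the default code system could be resolved by the receiver roughly as follows. Code values are modeled as plain dictionaries keyed by the tag names SYS and VAL; the function name and the code system names are hypothetical.

```python
from typing import Optional


def resolve_code_systems(phrase_sys: Optional[str], code_values: list) -> list:
    resolved = []
    for cv in code_values:
        # An explicit SYS on the code value overrides the phrase default.
        resolved.append({**cv, "SYS": cv.get("SYS") or phrase_sys})
    return resolved


phrase = resolve_code_systems(
    "HAIR-EXAMPLE",  # hypothetical default code system of the code phrase
    [{"VAL": "B001"}, {"VAL": "G001"}, {"VAL": "H001", "SYS": "OTHER-EXAMPLE"}],
)
systems = [cv["SYS"] for cv in phrase]
```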
The third row in the table is explained below.
|code system||SYS||O||ST||a||allows to set a default code system for all the code values|
|code system version||VER||O||ST||a||allows to set a default code system version for all the code values|
|values||(VAL)||1..*||E||CV*||(a*)||can only appear as attribute if default code system is set, the XML tag name will never be used for an element. See text.|
The following is the same code phrase in various forms. As can be seen, the amount of tag verbiage can be significantly reduced according to the rules explained in the introduction.
B001 G001 H001 blond slight gray homogene B001 G002 H001
The code values of the code phrase can be contracted into one VAL attribute. In this case, VAL contains white space delimited code value literals (XML NMTOKENS). This reduces the above example even more:
Beware! There is a potential flaw in the NMTOKEN hack: the individual code value may contain white space. There will be no white space in code values in 99% of the cases, however, we can not be totally sure. Should we not allow this nice short form, only because of those unknown 1%?
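The flaw is easy to see in a sketch of how a receiver would take the contracted VAL attribute apart. The function name and the second sample value, a code containing a space, are hypothetical.

```python
def split_val_tokens(val_attribute: str) -> list:
    # str.split() with no argument splits on any run of white space.
    return val_attribute.split()


codes = split_val_tokens("B001 G001 H001")
# A single code containing a space is silently broken into two tokens.
broken = split_val_tokens("B001 LIGHT BROWN")
```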
The code translation extends code phrase, which means it inherits all the features of code phrase and adds four more attributes. The reference to the origin of translation is done through XML ID/IDREF attributes. Note that IDs must be unique within one message but need not be unique across messages (IDREF cannot point outside of a message).
|origin||ORG||R||XML:IDREF||a!||must be an attribute|
|referable identifier||ID||R||XML:ID||a!||must be an attribute|
We do not show examples of code translations since code translations never occur outside of a concept descriptor, whose representation is specified next.
|translations||0..*||E||CDXL*||may only occur as an attribute if only one translation is given and if the default code system is also set|
The following examples show the same CD value in various forms, from the most verbose to the most terse form.
the patient's hair had an ashy-blondish color the patient's hair had an ashy-blondish color
If you had only one hair color code, you could send
Note that none of this makes use of any type conversion rule except for converting a string (VAL) to a code value. If you never have original text and you always have only one code, you can send a CV for a CD as follows:
the patient's hair had an ashy-blondish color
The Object Identifier (OID) is a simple string literal in the
canonical number-dot-number form. No white space is permitted and no
characters besides the digits "0" (zero) through "9" and "." (period) are allowed.
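A syntactic check of the number-dot-number form could be sketched like this. The pattern only tests the character-level rule stated here; it does not check leading zeros or the value ranges of the first arcs. The function name is an assumption, and the sample OID is just an example value.

```python
import re

# One or more decimal numbers separated by single dots, nothing else.
OID_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)*")


def is_oid_literal(text: str) -> bool:
    return OID_PATTERN.fullmatch(text) is not None
```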
As an alternative we could define a string literal for TIIs which would thus fit into only one XML attribute:
Note that this creates a problem when the extension (EXT) itself contains an at sign ("@"). But it is nice and handy, so maybe it is worth the little additional trouble?
Many TIIs will occur as elements of sets of TIIs. This will look as follows:
The ITS independent definition of the Technical Instance Locator needs revision to (1) specify the table of protocol codes (to be IETF-URL + extensions), and (2) be more specific about the phone number and what happened to TN (and why it happened).
All names and identifiers for real world instances are partly RIM classes and partly data types. This ITS specifies only those parts that are data types. The RIM classes are handled by the generic MET/MEI mechanism specified elsewhere.
The Real World Instance Identifier is only a RIM class and thus will not be handled here.
The person name is simply a list of person name parts, so no special XML specification is necessary for the Person Name (PN).
Every person name part has a character string value and a set of coded classifiers. Again, we will rely on character string to code value conversion with the mandatory default classifier code. Just as we did with the code phrase, we will make use of XML's "NMTOKENS" attribute type that allows us to enumerate all classifiers in one XML attribute.
|classifiers||(C)||O||E||CV*||a*||mandatory table. The XML tag name (C) will never be used in an element.|
Irma C. Jongeneel - de Haas Irma C. Jongeneel - de Haas Irma C. Jongeneel-de Haas
|bad address flag||BAD||O||BL||a||default is %null; or %null.unknown;|
The first example shows an expanded regular form (not even fully
expanded); the second example is much more concise, using the various
means of folding elements into XML attributes and character data content,
and type casting between ST and CV.
An Organization Name (ON) is simply a set of Organization Name Variants (ONXV). No special XML specification is needed.
The example, again, shows the full blown form first followed by the
maximally reduced form:
Integer numbers (INT) are represented as decimal digit character strings.
... 4 ...
The positive integer infinity (Aleph0) is
&int.inf;, the negative integer infinity
is represented as
Floating point numbers (FPN) are represented as decimal digit
character strings, optionally with "
suffix, according to the ITS independent specification.
The precision is determined according to the rules of significant digits.
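One way to read the precision off a literal is to count its significant digits, as in the sketch below. It assumes plain decimal literals without exponent suffix and relies on Python's `decimal.Decimal`, which keeps trailing zeros. The function name is hypothetical, and integers with trailing zeros (e.g. "100") remain ambiguous under this simple count.

```python
from decimal import Decimal


def significant_digits(literal: str) -> int:
    # Decimal keeps trailing zeros, so "0.250" has one more digit than "0.25".
    return len(Decimal(literal).as_tuple().digits)
```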
... 0.25... ...
The positive real infinity (Aleph1) is
&fpn.inf;, the negative real infinity
is represented as
|value||VAL||R||X||FPN||a||may appear as data content only if the UNIT attribute is given.|
|unit||UNIT||O||CD||a||Default code system is the Unified Code for Units of Measure (UCUM) in its case insensitive form. Default is "1" (the unity).|
If the default code system is used, the CD can be replaced by a CV or even a character string. Note that in subsequent revisions CD may turn into CV and UCUM be required as mandatory code system. Here we will always assume that UCUM is used.
In addition, a character string literal form is defined for the
entire PQ data type. A PQ character string consists of an FPN
character string followed by whitespace, and a unit expression
according to the UCUM specification. For example "
means eight hours, "
35 S" means 35 seconds, "2.5
DYN.S/CM5" is 2.5 dyne seconds per centimeter to the fifth power, and
"15 /MIN" means fifteen per minute. Some whitespace must
be present between number and unit.
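Splitting such a literal into its two parts is straightforward, as the following sketch shows. The function name `parse_pq` is an assumption; the unit expression is passed through unchecked, and the infinity entities are not handled.

```python
from decimal import Decimal


def parse_pq(literal: str):
    # Split at the first run of white space: number first, unit after.
    parts = literal.split(None, 1)
    if len(parts) != 2:
        raise ValueError("expected '<number> <unit>': %r" % literal)
    number, unit = parts
    return Decimal(number), unit.strip()
```

A literal without white space, such as "35S", is rejected, in line with the requirement that some whitespace be present.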
Examples, as usual in the order of increasing conciseness.
... ... ... ... ...
Monetary Amount (MO) has a representation similar to that of Physical Quantity.
|value||VAL||R||X||FPN||a||may appear as data content only if the UNIT attribute is given.|
|currency unit||UNIT||R||CD||a||Default code system is ISO 4217. There is no default unit value, i.e. the currency unit must be specified.|
If the default code system is used, the CD can be replaced by a CV or even a character string. Note that in subsequent revisions CD may turn into CV and ISO 4217 be required as mandatory code system. Here we will always assume that ISO 4217 is used.
In addition, a character string literal form is defined for the
entire MO data type. A MO character string consists of an FPN
character string followed by whitespace, and a currency unit symbol
according to ISO 4217. For example "50 USD" means fifty
U.S. dollars, "85 DEM" means 85 Deutsche Mark, and
"250 FRF" is 250 French francs. Presumably "55 EUR"
will mean fifty-five euros. Some whitespace must be present between
number and unit.
Examples, as usual in the order of increasing conciseness.
... ... ... ... ...
The data type for Point in Time (TS) is represented as a character string according to the HL7 v2.3 standard. Although the ITS independent specification discusses issues of adopting ISO 8601, we will stick to the old HL7 form since there was nothing wrong with it.
The example shows my birth date and time, precise to the hour (I don't know the minute) in the Middle European Time zone (UTC+1)
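A parser for such literals might be sketched as follows. It assumes the nested-optional layout YYYY[MM[DD[HH[MM[SS[.S[S[S[S]]]]]]]]][+/-ZZZZ]; that layout is our reading of the HL7 v2 form and should be checked against the HL7 text. The function name and the sample date used with it are hypothetical.

```python
import re

TS_PATTERN = re.compile(
    r"(?P<year>\d{4})"
    r"(?:(?P<month>\d{2})"
    r"(?:(?P<day>\d{2})"
    r"(?:(?P<hour>\d{2})"
    r"(?:(?P<minute>\d{2})"
    r"(?:(?P<second>\d{2})(?:\.(?P<fraction>\d{1,4}))?)?)?)?)?)?"
    r"(?P<zone>[+-]\d{4})?"
)


def parse_ts(literal: str) -> dict:
    match = TS_PATTERN.fullmatch(literal)
    if match is None:
        raise ValueError("not a TS literal: %r" % literal)
    # Components that were not sent stay absent; that is how the
    # precision of the value is preserved.
    return {k: v for k, v in match.groupdict().items() if v is not None}


# A hypothetical point in time, precise to the hour, in UTC+1.
hour_precision = parse_ts("1999010113+0100")
```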
To be specified according to the ITS independent specification, which is not ready yet.
|low closed||LCL||O||Boolean||aF||may be folded into the LOW element as an attribute OPEN or CLOSE|
|low closed||LCL||O||I||Boolean||aF||may be folded into the HIGH element as an attribute OPEN or CLOSE|
Annotation is one of those generic data types that merely add some additional data to any other data type. We will save a level of nesting by implementing this in a way that only adds an element to any other data type.
History (HIST) is simply a SET of History Item. Note that the ITS independent spec. should be changed to relax LIST to SET.
History item (HXIT) is a possible extension of any other type by adding an element that captures the validity interval.
To be defined.