This is Info file ProtoGen.info, produced by Makeinfo-1.64 from the input file ProtoGen.texi.

   This text describes the implementation of HL7 that is being done at the Universitätsklinikum Steglitz in Berlin. It is meant as a report about the work in general as well as a manual for the software that is about to be developed.

   Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

   Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.

   Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Free Software Foundation.

   Copyright (C) 1994, 1995, 1996 Gunther Schadow


File: ProtoGen.info, Node: Top, Next: Introduction, Prev: (dir), Up: (dir)

ProtoGen/HL7 -- An Implementation of Health Level 7
****************************************************

This paper describes the attempt to implement HL7 at the Universitätsklinikum Steglitz in Berlin as part of a larger hospital communication project. The aim of that project is to integrate the heterogeneous computer systems which exist in the various departments of the clinic. This project requires a standardized communication protocol.

   The standardization of medical communication protocols is an ongoing effort that is being taken on by many different standards groups. Naturally, the various standards groups have different scopes, depending on the special field in which each standard has its origin. Since late 1990, the US standards groups have been associated in the Healthcare Informatics Standards Planning Panel (HISPP), which is coordinated by the American National Standards Institute (ANSI). Its goal is to promote the convergence of the different standards, a process which is still at its beginning. However, as early as 1987 there was one standards group, Health Level 7 (HL7), whose objective was to cover a wide range of data interchange in health care, including ADT, finance, and a variety of ancillary fields such as laboratory medicine or radiology. In Europe, the Technical Board of the European Standardization Committee (CEN) established the Technical Committee for Medical Informatics (TC 251) in March 1990; one of its Working Groups (WG.3) is dedicated to health care communications.

   Since the implementation of HL7 presented here is embedded into a project that seeks the integration of different standards, it is conceptualized from a point of view that may differ from that of other implementors who concentrate on HL7 alone. The difference will become evident in the methods used to build the implementation itself as well as in the final linkage between the communication protocol and the application software. Where an implementation centered on HL7 alone might stay completely within the concepts and terminology of HL7, this project strives for greater generality in order to provide concepts and methods that can serve as bridges between the diverging standards being released in the same field.

   An implementation of a communication standard is essentially a data structure, a builder and a parser.
The data structure holds the data objects and reflects their relations to each other as assumed by the standard. The builder produces a valid representation of the data structure that can be transported via electronic data interchange media. Finally, the parser transforms this representation back into the data structure.

   This implementation tries to adopt the models and methods of today's informatics technology. Its design is object oriented and is implemented in C++. The code of the implementation is produced by a compiler whose input is a database which specifies the standard. The database is in turn compiled semi-automatically by scanning the text of the HL7 standard as it is released by the HL7 Working Group.

   This method of generating the implementation directly from the textual description has the following advantages: correctness and customizability of both the parser/builder methods and the definition of the standard itself.

   Since neither the specifying database nor the program code which implements the standard is written manually, but both are produced by a compiler, the output is assured to be correct unless the compiler itself is incorrect. An incorrect compiler, however, will result in erroneous code, but the errors tend to occur systematically rather than accidentally and are thus discovered more easily. Thus the method of ProtoGen/HL7 can be used to generate an HL7 reference implementation. On the other hand, it does reveal problems and errors in the standard definition itself, since a compiler will not guess the intended meaning of an ambiguous or obviously incorrect passage as a human programmer would probably do.

   The input database for the ProtoGen/HL7 compiler is customizable and allows the user to add new data elements quite easily. For example, a special Z-segment that is used by a site can be added to the standard just by adding a single line to the database. The implementation for this special segment is generated fully automatically, as is the implementation of the whole standard; a sketch of such an entry is given below. On the other hand, the methods of the parser/builder may be customized in order to provide for different encoding rules, or to speed up the implementation or tune it for size. These customizations are, however, not needed very often and can currently only be done with a deeper understanding of the compiler itself, since they require the compiler program to be modified. Nevertheless, there are various applications that can be worth the effort, ranging from modifications to HL7 coding or processing rules up to the implementation of a completely different standard. Since the UN/EDIFACT encoding rules are quite similar to those of HL7, an EDIFACT module could be added to ProtoGen with little to moderate effort.

   Finally, the object oriented model of the HL7 standard that this implementation assumes helps to stay compatible with the new developments in the HL7 Working Group (version 3.0) in particular, as well as with the health care data interchange standards of the future in general, such as MEDIX.
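   To make the customization step concrete, the following is a minimal sketch of what such a site-defined entry might look like, in the Prolog database form described in the body of this paper. The segment id `zpn', the functor names, and the argument order are assumptions chosen for illustration only; the authoritative facts are whatever the `bin/tb2pl' script actually generates.

     %% hypothetical site-defined Z-segment (illustration only):
     %% one fact registers the segment, one fact per field describes
     %% its contents; functor names and arities are assumed here
     segment(zpn, "site defined patient notes").
     field(zpn, 1, 60, st, optional, norepeat, _, _, "note text").
     field(zpn, 2,  8, dt, optional, norepeat, _, _, "note date").

   From such entries the compiler would generate the corresponding C++ segment class in the same way it does for the segments defined by the standard itself.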
* Menu:

* Introduction::
* From Text to the Database::
* The HL7 database::
* Generating C++ code::
* Integration into the system::
* Bibliography::


File: ProtoGen.info, Node: Introduction, Next: From Text to the Database, Prev: Top, Up: Top

Introduction
************

A communication protocol for medical applications tends to be rather complex, since medical communication covers a variety of diverse issues, such as ADT (admission, discharge, and transfer), accounting, reporting and documentation of findings, querying and ordering, etc. In order to cope with this kind of complexity it is necessary to find methods of deriving the implementation automatically from the specification of the standard. Besides the problem of complexity, modern software engineering has pointed out the need to automate the production and management of software. These concerns are reflected in the modern technologies by which communication protocols are implemented today, and which are formulated in the OSI standards.

   In this introduction we take a first look at HL7 and at this implementation of it, asking how we can cope efficiently with the complexity mentioned above.

* Menu:

* A view on HL7::
* A view on this implementation::


File: ProtoGen.info, Node: A view on HL7, Next: A view on this implementation, Prev: Introduction, Up: Introduction

A view on HL7
=============

HL7 is a communication protocol for medical applications that tries to conform to the OSI concepts of networking. In particular, HL7 even attaches its name to the OSI reference model (Health Level 7); it should therefore be expected that HL7 complies easily with the concepts and methods suggested by OSI. Indeed, the distinction between abstract syntax and encoding rules is reflected in the HL7 document: for each message, an abstract message definition is given, consisting of three parts which are logically related to the elements that HL7 considers to constitute transactions: (1) messages, (2) segments, and (3) fields, `data elements', or types. The transactions (messages) are conceptually embedded into the environment in which they occur by means of `trigger events'. On the other hand, encoding rules are defined in order to have an interim standard until the OSI suite of protocols is sufficiently supported. Finally, several lower layer protocols (LLP) are specified, which belong to the session and transport layers and which in turn include or use resources associated with still lower layers. These various protocols all meet different needs, which arise from the use of different transport media and their properties, such as reliability.

   However, in the course of this implementation attempt it turned out that the HL7 specification does not make the distinctions above as strictly as it claims to. This poses problems for an implementor who tries to go with OSI. We will point out these problems in this paper at the places where they become evident (*note On the abstractness of abstract syntax in HL7::.)

   The idiosyncratic notation used in the HL7 document is another problem that hampers the use of OSI standard methods. Probably many of the problems of HL7 could have been avoided if a standard notation such as ASN.1 had been adopted in time.
This problem may partly stem from a kind of misinterpretation of ASN.1: in the HL7 document it is repeatedly stated that the scope of ASN.1 is the basic encoding rules; however, ASN.1 is an abstract syntax notation, which is more appropriately compared with the `abstract message notation' used by HL7. The encoding rules stay completely hidden in ASN.1 (in contrast to the HL7 message definitions). It seems rather likely that ASN.1 was primarily avoided because of the difficulty "to explain to application programmers, who are not schooled in recursive languages in general, or ASN.1 in particular". This is certainly an anachronism, since tools like lex and yacc have been widespread and extensively used for many years. However, to be fair, we have to note that HL7 was first documented back in 1987, when the ASN.1 standard had only just been released.


File: ProtoGen.info, Node: A view on this implementation, Prev: A view on HL7, Up: Introduction

A view on this implementation
=============================

Today there exist some tools which are dedicated to the OSI concept, most notably the ISODE, which is an almost complete development environment for OSI applications. Thus it seemed at first glance appropriate to transform the HL7 notation into ASN.1 and feed the latter code to the various tools of the ISODE, including the structure generator (pepsy) and the remote operation stub generator (rosy). However, there is one single point which inhibits this rather convenient way to go: ISODE supports only the OSI basic encoding rules (BER). Furthermore, there is currently no other ASN.1 compiler that supports the HL7 encoding rules. Even though the HL7 encoding rules were meant as an `interim standard' for use until OSI standards became available, they are now a major point that makes it hard for the HL7 world to migrate towards OSI. Most existing HL7 applications use the HL7 encoding rules, and for compatibility, new HL7 applications will keep doing so. Because the HL7 encoding rules have several disadvantages (KUPERMAN (1991), p. 179) which will prevent them from surviving for many years, adding support for the HL7 encoding rules to existing ASN.1 compilers (ISODE or SNACC), though possible, does not seem to be worth the effort.

   This implementation has therefore developed (and is still developing) its own means by which the program objects are automatically derived from the documenting text of the standard. Since the author of this paper happened to have the text of the HL7 specification in ASCII files, there was a reasonable chance of going this way from the very beginning.

   Before we start going along that way step by step, we will first take a look at it from above. Figure 1 shows the process of generating the implementation. Let us start from the end and answer the question of where we want to go: we want to end up with a set of definitions expressed in a common programming language, plus an object library which provides the actions corresponding to the definitions (the implementation in the narrower sense). These shall be usable by applications that are to be built with, or extended by, the HL7 interface. If this implementation is ever to pay off its effort, there is a need for compatibility in terms of the platforms on which the programs are to run. The programming language that was chosen is C++.
The important advantages of C, namely its widespread availability on small computers like PCs as well as on workstations and minicomputers, together with its naturally embedded operating system interfaces, especially on UNIX platforms, provide a base for compatible as well as efficient code. These advantages are shared by C++, since C++ is derived from C. Moreover, C++ provides new and powerful techniques of software design which we want to make use of here.

   The steps towards the C++ code are taken by various means and methods. First we extract meaningful information from the HL7 document; this is done with AWK. Since most of this information is presented either in tables or in a simple formalism capable of expressing the message syntax, it is appropriate to process this information further by a method which is capable of handling both relations and grammars. Prolog, on the one hand, expresses information in terms of relations; on the other hand, it is powerful (and often used) for handling grammatical data. Moreover, analysis and generation of corresponding data can be achieved by the same code. The tasks that are taken over by Prolog are the check for consistency of the information gathered from the document, and the final generation of the C++ code. The C++ code is then compiled and packed into an object archive (a library). The C++ definitions in the header (`*.h') files provide the interface for application programs that will make use of HL7. It should be possible to make this available to C programs (and to programs in other programming languages as well).

   Even though we try to provide an automatic tool for producing code from the standard's document that is reasonably general and tolerates future changes of HL7, we do not intend to build a general purpose compiler for the notation in which HL7 is defined. Remember that this notation is mainly unstructured text which nevertheless contains machine-extractable structured information, which we try to capture. Furthermore, the way this structured information is presented is unique to HL7 and not backed by any standard, so there would be little use for a general purpose compiler for that notation. We rather keep the work as specific to HL7 as possible, in order not to waste effort on unused generality. This implies that a considerable amount of the code has to be written manually. Here again the advantages of C++ (over C) help to minimize the software management overhead.


File: ProtoGen.info, Node: From Text to the Database, Next: The HL7 database, Prev: Introduction, Up: Top

From Text to the Database
*************************

This chapter describes how we actually scan the HL7 document to capture useful information. First we have to take a look at the HL7 document itself, its format and what structured information we may find therein. Then we describe the interim format in which all information that can be expressed in tables is stored. Finally we describe the methods used to generate Prolog predicates from these tables. However, as noted above, tables are not the only style in which information is presented in the HL7 document; the message syntax is expressed in a simple formal language that can be translated purely lexically into Prolog expressions.

   To accomplish the tasks of this step we use AWK, which is a common tool for text file processing, available on virtually any UNIX system. So is sed, which we sometimes use to post-process our output in order to get rid of nasty junk characters.
Where AWK or sed is not available, the C sources of the GNU versions of these (strongly recommendable) tools can be freely copied from any major FTP server near you.

* Menu:

* The HL7 document::
* The Intermediate Table Format::
* Extracting tables::
* Extracting segment definitions::
* Generating Prolog predicates from tables::
* Extracting message definitions::


File: ProtoGen.info, Node: The HL7 document, Next: The Intermediate Table Format, Prev: From Text to the Database, Up: From Text to the Database

The HL7 document
================

This section may be quite specific to the kind of text files that were available at the time the work on this project started. Namely, there were 11 Microsoft Word 5.0 files, one for each chapter (files `kap[1-7].txt') or appendix (files `anh_[a-d].txt'). The chapter files can all be processed the same way, while each appendix requires its own handling. Appendix C can be treated like a chapter file. Since appendix B describes the lower level protocols and is completely different in both format and purpose, we will ignore it here until we come to the point where we implement the lower layer protocols. Appendix D is an automatically (?) generated index, which we don't need at all. So before we start discussing the treatment of the chapters, only appendix A needs to be looked at more closely.

   Appendix A is a summary full of tables, which presents a most welcome source of various information for us. This appendix was first treated manually. It was easy to convert all the tables into the interim table format described in the next section. Actually, the latter format was developed while working manually on appendix A. Since appendix A was broken up into well formatted tables even before any AWK script was written to do the work of extracting tables, there still exists no automatic extraction script that could be applied to appendix A. Once our general method has proved to be useful and appropriate for the next version of HL7 as well, this gap will be filled.

   Fortunately, a Microsoft Word 5.0 file is essentially an ASCII text file with only a few special characters written directly into the text. The bulk of the information on the print format is appended in binary form to each text file. Even though this information is completely obscure to us, it can safely be ignored. What remains accessible to us, and is sufficient, is the ASCII text. Thus we don't have to bother with all the layout information like indentation, fonts, character styles, etc.

   Now, what information do we have to expect, and how is it presented in the ASCII text of our files? Each chapter consists of sections which describe messages, segments or fields. The sections are recognized by their specific layout (e.g. tables) and by keywords, which have proved to be very helpful for us here. It can even be stated that the strict usage of keywords (e.g. `FIELD NOTES:') in the text is what made this whole work possible at all. We will describe the extraction of the items in the appropriate sections below.

   As you can easily see from the above filenames, these were not the original files, which would surely have English acronyms. Indeed, these files had been worked on by the German section of the HL7 consortium. This (visible) work consists merely of annotations, mainly German translations of the terminology; sometimes remarks have been added.
However, this would have been of little interest to us if these efforts had not messed up the format of the files. This in turn has been the cause of severe problems, some of which could not be solved satisfactorily by a general mechanism, with the consequence that files which were produced defective had to be edited manually. There would have been little sense in wasting time trying to solve a problem which will hopefully disappear with the next version of HL7 or even with the next set of text files we may have available.

   It became quite obvious that a WYSIWYG text processor like Microsoft Word for Windows -- though easy to use -- is prone to cause severe problems, especially if more than one author is working on the same file. A great deal of discipline and a common method (a standard) is required regarding the use of the resources for formatting the text (e.g. print formats vs. arbitrary formatting of marked blocks). It is not easy to always keep up this kind of discipline if one can achieve one's immediate ends rather simply. Moreover, the format of a text in a WYSIWYG editor tends to become obscure even though the writer feels in complete control of it; this gives rise to surprises. After all, it seems that text processing methods which appear rather unfriendly or outdated at first view (like *roff and TeX) do pay off the extra effort in the long run, particularly for long texts and when used in a work group.


File: ProtoGen.info, Node: The Intermediate Table Format, Next: Extracting tables, Prev: The HL7 document, Up: From Text to the Database

The Intermediate Table Format
=============================

This section describes how we store most of the information drawn from the HL7 text files. Since there are different items to extract (i.e. segments and tables) which are all essentially tables but differ in their embedding and appearance in the text, it seems appropriate to have an easy-to-generate interim table format into which the different extraction procedures write their information, rather than generating the final representation directly. From these tables we can get our final representation easily by applying a common AWK script to all of them, regardless of what they were derived from. This has the advantage that we are now free to translate them into a programming language other than Prolog, or to import them into a database management system, etc. Note that when we talk about `tables' in this section, we mean this interim format; do not confuse it with the tables found in the HL7 text.

   The following is a set of rules telling us how these tables are built:

  1. The first line of the table is its name.

  2. The table is ended by at least one empty line.

  3. Each row of the table makes up exactly one line; the length of the line is, however, not limited to a specific number of characters. The line is terminated by the system's native line terminator, i.e. the one that AWK's printf escape sequence `\n' expands into.

  4. Columns are colon separated. A colon appears *between* two columns, not before the first and not after the last one. Where there are two consecutive colons, or a colon at the beginning or at the end of the row, the corresponding field is treated as empty.

  5. The second line of the table gives the titles of the columns.

  6. The third line of the table specifies the data type to expect in each column. This can be one of the following keywords:

     `sym'
          a symbol, which will become an identifier in the target language (Prolog).
     `num'
          a number.

     `str'
          a string, i.e. a sequence of characters, which will appear enclosed in string delimiters (`"').

  7. The fourth line of the table is the first row of data of the table; this and any immediately following line will be treated as table data.

   However, there are more complex tables, which contain subtables, all of which have the same format (e.g. number and types of columns). These complex tables are generated, for example, from the table of `TABLE VALUES'(1) which appears in appendix A. The complex tables are basically the same as described above; notably, rules 1-6 of the definition above still apply. Here are the additional rules which apply to the complex tables:

  7. The fourth line must start with `-' and defines the titles of the columns of every subtable.

  8. The fifth line must also start with `-' and declares the data types of the columns of every subtable.

  9. Any other line that does not start with a `-' starts a new subtable.

 10. Any other line that starts with a `-' is a row of the subtable started most recently.

   The meaning of the rows is slightly different from, and extends, that of the simple table. The idea is that we generate two relations from the complex table. One is the main table (i.e. the table that results if we delete every line that starts with a `-'), while the other is a relation that is constructed from the main table and the subtables. Let R = {t1, ..., tc} be a relation of cardinality c, where each tuple is ti = (ri1, ..., rin) for i running from 1 to c. R corresponds to the main table. For each ti there is a relation Si(si1, ..., sim) which corresponds to a subtable. If ri1 is a key of R, then T(ri1, si1, ..., sim) is a relation which is equivalent to Si, and which corresponds to the second relation we produce from the complex table.

   To give examples rather than exhaust the reader with definitions, first we present a simple table:

     DATA TYPE
     DATA TYPE:DESCRIPTION:LENGTH
     sym:str:num
     AD:ADDRESS:
     DT:DATE:8
     ...
     PN:PERSON NAME:48
     TX:TEXT:

   Here is an example of a complex table:

     TABLE
     TABLE#:DESCRIPTION
     num:str
     -VALUE:DESCRIPTION:VALUE#
     -sym:str:num
     0001:SEX
     -F:Female:000345
     -M:Male:000344
     0002:MARITAL STATUS
     -D:Divorced:000350
     -M:Married:000348
     -S:Single:000349

   From the latter table we will produce two relations, as if they had been defined as follows, first the main table and then the constructed table:

     TABLE
     TABLE#:DESCRIPTION
     num:str
     0001:SEX
     0002:MARITAL STATUS

     VALUE
     TABLE#:VALUE:DESCRIPTION:VALUE#
     num:sym:str:num
     0001:F:Female:000345
     0001:M:Male:000344
     0002:D:Divorced:000350
     0002:M:Married:000348
     0002:S:Single:000349

   Note from the examples that the title of the derived table is the title of the first column of the subtable.

   The filename conventions for these interim tables are not uniform. A simple table ends in `.tb'; however, a table which was derived from a segment definition is named with a trailing `.stb'.

   ---------- Footnotes ----------

   (1) yet another meaning of `table'


File: ProtoGen.info, Node: Extracting tables, Next: Extracting segment definitions, Prev: The Intermediate Table Format, Up: From Text to the Database

Extracting tables
=================

Most tables which are scattered throughout the chapters are compiled into appendix A, so they do not have to be rescanned here. Actually, there seems to be no table in the chapters which is not found again in appendix A, but we cannot be sure of this. What we do is simply scan for the headings of the tables, merely in order to catch the number of each table.
Afterwards, we can check whether there is any more information to get which we have not already taken from appendix A. For this kind of extraction we do not even need AWK. We let sed(1) run once over each chapter file with the following command, which appears as the only command in the `bin/chptbl' shell script:

     sed -n -e "s/^TABLE \([0-9][0-9]*\).*/chapterTable(\1)./p" *.txt

   This causes any line like the following example

     TABLE 0002 MARITAL STATUS

   to be output as

     chapterTable(0002).

   The latter is a Prolog predicate, which is then used to check for tables not yet known from appendix A.


File: ProtoGen.info, Node: Extracting segment definitions, Next: Generating Prolog predicates from tables, Prev: Extracting tables, Up: From Text to the Database

Extracting segment definitions
==============================

Segment definitions are tables as well; they describe one field per row. In the header of the table there is the sequence of keywords `SEQ', `LEN', etc. The tables are again surrounded by empty lines. Following each table there is a section which starts with the keyword `FIELD NOTES:', followed on the same line by the id of the segment. Even though we take the segment id from the subsection headline which precedes the table, we reassure ourselves of the correctness of our assumption with the help of the `FIELD NOTES:' construct, which turned out to be very reliable. Please note that we cannot define, at the outset, a completely secure method by which we find every piece of information without fail; rather, we refine our method as far as seems reasonable by trial and error, trying to catch most misinterpretations by taking advantage of the redundancies found in any human-readable text.

   The extraction of segment definitions is done by `bin/exseg', which in turn runs AWK with `bin/exseg.awk' on its first parameter (which typically is a file name), pipes the output through several sed processes and prints its output to the standard output stream (which is typically redirected into a file). We build the interim table with the name `SEGMENT', followed by the column specification lines, which are the same for every segment. Then we convert the body of the table into the canonical format, replacing VTs (vertical tab characters) by colons. We are in some trouble here, because some VTs have been mutated into consecutive spaces, so we cannot safely discriminate the columns. The only reasonable way to cope with this kind of problem seemed to be to manually edit the defective output. It is hoped that the next version will be free of these problems, so that there is no use wasting time extending generality to cover faulty conditions.

   There is reason to extract the field notes as well. We will find that there are fields declared as being of type ID which actually are a kind of composite type. This applies to fields that deal with patient location. Actually, we do not use the field notes a lot, but they might become useful if we were to store the whole information in a database or a hypertext file. Field notes are extracted with `bin/exfld', which is itself an AWK script.


File: ProtoGen.info, Node: Generating Prolog predicates from tables, Next: Extracting message definitions, Prev: Extracting segment definitions, Up: From Text to the Database

Generating Prolog predicates from tables
========================================

From the interim tables we can finally build Prolog predicates. Since Prolog predicates are essentially relations, and relations can be regarded as tables, this conversion is rather straightforward.
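   To make the target form concrete before stating the conversion rules, the following is a rough sketch of the facts that the simple DATA TYPE table shown in the previous section would turn into. The functor name `data_type' is an assumption made for this illustration; the authoritative output is whatever `bin/tb2pl' actually writes, including its descriptive comment header.

     %% sketch only: data_type(DataType, Description, Length),
     %% derived from the DATA TYPE interim table above; the functor
     %% name is assumed, empty columns become the anonymous variable
     data_type(ad, "address", _).
     data_type(dt, "date", 8).
     data_type(pn, "person name", 48).
     data_type(tx, "text", _).

   The remarks below explain the conventions (lower casing, anonymous variables, comment headers) that such output follows.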
It is curious that all formal information, identifiers as well as descriptive strings, is printed in all uppercase in the HL7 document. We, however, turn everything into lower case, for the following reasons:

   * text in all-uppercase is hard to read.

   * text in all-uppercase is hard to type.

   * Prolog wants symbols to start with a lower case letter.

   The functor symbol is derived from the name of the table by deleting any special characters and replacing consecutive white space by a single underscore (`_') character. Even though we could just as well surround the whole title of the table with `'' characters, thus marking the sequence as a symbol, we make the conversion for the convenience of the work to come. Since we will have to refer to these symbols and do not want to bother with special characters and the number of consecutive spaces, we rather convert the names into a canonical form.

   It happens quite often that one or more attributes of a tuple remain undefined, i.e. nothing is found between two colons. In this case we set the value to undefined, which we can do in Prolog by using the anonymous variable `_'. We could as well use a specific symbol which would assume the meaning of nil, like `null', `nil', `''' or even `[]'. However, the anonymous variable will be bound to anything during unification, thus anything will be allowed at this attribute and the predicate will still succeed.

   Finally, we begin each group of predicates with a few comment lines of descriptive information about functor name and arity, column types and column headings.

   This job is done by the `bin/tb2pl' shell script, which calls `bin/tb2pl.awk' and collects the temporary output of `bin/tb2pl.awk'. These temporary files are created during the processing of subtables.


File: ProtoGen.info, Node: Extracting message definitions, Prev: Generating Prolog predicates from tables, Up: From Text to the Database

Extracting message definitions
==============================

Messages are syntactically defined in the HL7 document using the formal language described below in this section. We can recognize message definitions by their unique layout, which comprises three columns, normally separated by ASCII VT characters, except for the cases where the format was damaged or originally inconsistent. The first column contains the code in the formal language, the second column contains remarks, and the third column contains the chapter. The first row contains the message id and, in its last column, the word "Chapter" or "Appendix", which we use as a keyword. Finally, these tables are separated from the rest of the text by one empty line, both at the beginning and at the end. The following is an example of a message definition as we find it in the file `kap2.txt'; all literal VTs have been replaced by ` - '.

     WRP - Widget Report - Chapter
     MSH - Message Header - II
     MSA - Message Acknowledgement - II
     { WDN - Widget Description - XX
     WPN - Widget Portion - XX
     { [WPD] } - Widget Portion Detail - XX
     }

   The syntax of the formal language is as follows:

     <message definition> ::= <segment list>
     <segment list>       ::= <segment> | <segment> <segment list>
     <segment>            ::= <segment id> | `[' <segment list> `]' | `{' <segment list> `}'
     <segment id>         ::= <char> <char> <char>
     <char>               ::= <letter> | <digit>
     <letter>             ::= `A' ... `Z'
     <digit>              ::= `0' ... `9'

   At this point we can, however, ignore the syntax;(1) we rather make the following textual changes:

  1. remove any white space

  2. change all upper case to lower case

  3. any opening bracket (`[') is replaced by `opt('

  4. any opening curly brace (`{') is replaced by `rep('

  5. any closing bracket (`]') or curly brace (`}') is replaced by a closing parenthesis (`)')
  6. append a comma (`, ') unless

       * there was no preceding token at all, or

       * the preceding token was an opening parenthesis

  7. remove any newline character (i.e. print everything on a single line)

   The first id that was read, i.e. the one that happens to be on the header line of the table, becomes the message id. The rest of the first line, up to the keyword `Chapter' or `Appendix', is recognized as the description. For the body of the table, however, we use only the first column, which we handle as described above. Things would have been easier if there were not some message definition tables that break these rules. Some tables are formatted without the VT between the columns, which made it very hard to get rid of the other columns while keeping the integrity of the first column.

   The definition of each message is then stored in a single Prolog predicate message/4:

     message(wrp,'',"widget report",[msh, msa, rep(wdn, wpn, rep(opt(wpd)))]).

   The second argument of message/4 is the event type code, which further qualifies a message type. However, an event type code is not always specified (which we have to discuss later). This code -- if given at all -- can be found in the most recent heading of a subsection. One of the keywords `Event Code' or `Trigger Event' precedes the three-character id of the event type code, which is then written as the second argument of the message predicate. We should rather have scanned the most recent subsection heading for the description of the message, since this may uniquely describe one message referenced by a pair of message id and event type code. For now we have redundant descriptions that poorly specify the messages, which certainly has to be fixed soon.

   ---------- Footnotes ----------

   (1) though for other reasons than not to disappoint application programmers who are not schooled in recursive languages (sorry, couldn't resist)


File: ProtoGen.info, Node: The HL7 database, Next: Generating C++ code, Prev: From Text to the Database, Up: Top

The HL7 database
****************

Having finished the previous work, we have arrived at a point where we can take a rest of sorts, reflecting on what information we have gained and how it is structured. What we did was to extract sections of the files and translate them into a different syntactical presentation. However, what we did *not* do up to this point is consider what is meant by all the tables and grammars and how they are mutually related. This will be done in this chapter, where we try to clarify the properties of the data model and the concepts of transactions.

   First we take a look at the relationships of the HL7 conceptual entities that explicitly appear in the specification. We will optimize this model by removing redundancies and -- perhaps most importantly -- gain an overview of what HL7 defines. Then we check the database that we compiled in the last step for errors and inconsistencies, and we prove an important assumption made in the preceding section. Finally we focus on the contents of the relations, asking what in particular is defined and how it is defined.
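   As a rough illustration of what such a consistency check looks like in Prolog, the sketch below tests the one-to-one correspondence between `Field' and `Data Element' tuples that is examined in the following sections. The functor names, arities and argument positions are placeholders chosen for this illustration; the actual predicates are those of the ProtoGen Prolog sources.

     %% sketch only: fact names and argument positions are assumed
     field_without_element(Seg, Num) :-
         field(Seg, Num, _, _, _, _, _, DatEl, _),
         \+ data_element(DatEl, _, _, _, _, _, _, _, _).

     element_without_field(DatEl) :-
         data_element(DatEl, _, _, _, _, _, _, _, _),
         \+ field(_, _, _, _, _, _, _, DatEl, _).

   The database is consistent in this respect if both of these queries fail, i.e. if neither a field without a matching data element nor a data element without a matching field can be found.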
* Menu:

* Dependencies::           Dependencies
* Errors::                 Errors
* Consistency check::      Consistency check
* The data item numbers::  The data item numbers
* On the abstractness of abstract syntax in HL7::  On the abstractness of abstract syntax in HL7
* On trigger events::      On trigger events
* On the null value::      On the null value


File: ProtoGen.info, Node: Dependencies, Next: Errors, Prev: The HL7 database, Up: The HL7 database

Dependencies
============

In the following table we give a synopsis of all the relations that we have obtained so far. We have compressed the names of the items here, trying to be specific and descriptive as well as short, in order not to lose the overview. Primary keys of the relations are marked by a preceding asterisk (`*'), while other candidate keys are marked with a plus (`+'). If a primary key is made up of more than one attribute, the attributes are set in parentheses with a single preceding asterisk. The name of the relation is followed by an `[A]', which means we have drawn this relation from appendix A, or by a `[C]', telling us that it was drawn from the chapters (including appendix C(1)).

     Functional Area [A]   *FunArId, +Chptr, +FunArDscrptn
     Message Type [A](2)   *MsgId, +MsgDscrptn, FunArId
     Message [C]           *(MsgId, EvntTpCd), +MsgDscrptn, +MsgDef
     Segment [A]           *SgmntId, +SgmntDscrptn, FunArId
     Segment [C]           *SgmntId, +SgmntDscrptn
     Data Element [A]      *DatElNum, +DatElDescrptn, SgmntID, FunArId, MaxLn, DtTypId, Opt, Rep, TblNum
     Field [C]             *(SgmntId, FldNum), MaxLn, DtTypId, Opt, Rep, TblNum, +DatElNum, +FldDscrptn
     Data Type [A]         *DtTypId, +DtTypDscrptn, Ln
     Table [A]             *TblNum, +TblDscrptn, TblClss
     Table [A]             *TblNum, +TblDscrptn
     Table Value [A]       *(TblNum, ValId), ValDscrptn, +ValNum
     Field Notes [C]       *(SgmntId, FldNum), +FldDscrptn, FldNteTxt

   The rows of the above table are sorted to ease the reader's orientation. One thing thereby becomes immediately obvious: there is sometimes more than one relation with the same name. The two relations titled `Segment', for example, are nevertheless not the same relation, because they do not have the same arity.(3) This notwithstanding, it is still obvious that these relations have some domains in common. We can simplify our set of relations by rewriting it such that any two relations which correspond in this way are replaced by a third one, defined over every domain which is part of either the first or the second relation.

   However, as was just said, different names of two tables do not guarantee that they do not correspond in the same way. Consider `Data Element' and `Field'. Both are defined over the same domains, though in a different order, except for `FunArId' (i.e. the column titled `owner' in appendix A, which lists the functional area to which the data element belongs) and `FldNum'. The question now is whether the tuples of both relations can be mapped one to one. We will see below (*note Consistency check::.) that they can.

   The name `Field' was given in `exseg.awk', since the word `field' is used throughout the HL7 specification to designate the parts of which the segments are built (more than 500 occurrences). However, `data element' is sometimes (42 occurrences) used as well. Why are there two names for the same thing? One answer might be that `data element' is used where we refer to an atom of data regardless of the context in which it occurs, while `field' is used for such an atom in the context of a certain place in a certain segment. Thus a data element is the *contents* of a field. Indeed, the relation `Data Element' doesn't have a domain which could designate a certain place in a segment.
However, why is there an attribute for repeatability and optionality then? We would not expect an object to be optional per se, whereas a certain field in a segment may well be empty sometimes. Repeatability, from this point of view, is likewise not a property of a data element, though it depends on how we think of a repetition: does the field repeat, or does its contents repeat within the field? If the first were true, then a segment would not necessarily have a fixed number of fields;(4) if the second is true, then there must be something in between a field and an atom of data, and we could say that an `occurrence' is not identical to a data element. This resembles LISP's point of view: LISP would regard a field of a segment as one half of a pair, which can be a list (i.e. another pair) or an atomic data item. If we have a look at the encoding rules,(5) we notice that repetition is realized with a special delimiter; this confirms our view of repetition as happening on a level in between a data element and a field, which we might call the level of `occurrence'.(6)

   In order not to digress too much, we decide not to consider data element and field as different things, provided we can prove the one-to-one relation between them. We perform a rewrite on both of them which is similar to the one we made for `Segment' or `Table'. We will carry out this proof when we check the consistency of the database we have acquired. For now, assume that the proof will succeed.

   Figure 2 shows a sort of entity relationship model of the database before we removed multiple occurrences. Each relation of the database is graphed as an entity (a name in a box) which has a relation (a line linking it) to another entity. Note the different notions of `relation'; to avoid confusion we will speak of a `link' when we mean relationships or dependencies between relations. At each contact between a line and a box there is a number, `1' or `n'. The graph can be "read" by following each line with the words "<number> <entity> is linked to <number> <entity>", where each <number> is the one written at the box of the respective <entity>.

   Let's have a look whether there is more to refine. Where there is a one-to-one link, as between `Table/2' and `Table/3', we can merge the two relations into one; that is what we have already planned to do. However, there is more: there is a pair of parallel one-to-many links, one going from `Functional Area' via `Segment/3' and `Field' to `Data Element', and the other going directly from `Functional Area' to `Data Element'. We notice from the table above that this parallel link is caused only by the `FunArId' domain. Thus we can consider removing that domain from the relation at the many-end of the link in order to remove this indirect redundancy, provided it is not part of a key there, which it is not. Note that whether we may commit this simplification depends on the link between `Field' and `Data Element' being one-to-one. If it were a many-to-one link, i.e. if one data element could appear in several fields, we must not do this. Our simplified database looks as sketched in figure 3.
The table below shows it in detail:

     Functional Area   *FunArId, +Chptr, +FunArDscrptn
     Message Type      *MsgId, +MsgDscrptn, FunArId
     Message           *(MsgId, EvntTpCd), +MsgDscrptn, +MsgDef
     Segment           *SgmntId, +SgmntDscrptn, FunArId
     Field             *(SgmntId, FldNum), MaxLn, DtTypId, Opt, Rep, TblNum, +DatElNum, +FldDscrptn
     Data Type         *DtTypId, +DtTypDscrptn, Ln
     Table             *TblNum, +TblDscrptn, TblClss
     Table Value       *(TblNum, ValId), ValDscrptn, +ValNum
     Field Notes       *(SgmntId, FldNum), +FldDscrptn, FldNteTxt

   ---------- Footnotes ----------

   (1) see above (*note The HL7 document::.)

   (2) The redundant appearance of the chapter and of the description of the functional area has been erased from all tables of appendix A in which this redundancy appears.

   (3) This holds only as long as we do not have two relations with the same arity but different domains; we tie our concept of equality of relations to that of Prolog, which identifies predicates by name and arity.

   (4) notwithstanding the fact that trailing missing fields are regarded as `not present'

   (5) which should be irrelevant when we discuss conceptual issues. In fact, as we will see below (*note On the abstractness of abstract syntax in HL7::.), there is only a weak distinction between abstract concepts and representation issues in HL7.

   (6) which is somewhat tautological


File: ProtoGen.info, Node: Errors, Next: Consistency check, Prev: Dependencies, Up: The HL7 database

Errors
======

Now we will finally feed our database into Prolog, just to see whether what we generated was syntactically correct. We will not bother the reader with the results of our own typos here; these had been corrected beforehand. Rather, this section reveals the first severe errors in the HL7 specification itself. Here is what Prolog complains about:

     [WARNING: (/usr/share/doc/HL-7/kap4.msg:5) Syntax error: Operator expected]
     [WARNING: (/usr/share/doc/HL-7/kap7.msg:11) Syntax error: Operator expected]

   Further investigation points us directly into the documents: at the definition of the order message we find what is extracted below. Note the matching of brackets and braces: the opening bracket before PID and the one below ORC are never closed properly. So where do we have to assume the closing brackets to be placed? The author of this document was not able to correct this until he could consult version 2.2 (ballot 1) for the answer. In the table below, corrected brackets are set between asterisks.

     ORM                       ORDER MESSAGE               Chapter
     MSH                       Message Header              II
     [ { NTE} ]                Notes and Comments          II
     [ PID                     Patient Identification      III
       [{NTE}]                 Notes and Comments          II
       [ PV1 ] *]*             Patient Visit               III
     { ORC                     Common Order                IV
       [ Any Order Segment     E.g., ORO, OBR, RX1         IV
         [ { NTE } ]
         [ { OBX }             Results Segment             VII
           [ { NTE} ]]         Notes and Comments          II
       *]*
       [ BLG ]                 Billing segment             IV
     }

   The second error is due to the same kind of grouping problem; this time, however, there is one closing brace too many. It is again hard to guess the correct grouping. Apart from a minor mismatch in the PID group, the correct grouping can be concluded from version 2.2; the corrections are again set between asterisks, with the original text shown after `was:', even though there is at least one mismatch in v2.2 too.
     ORF              Observational Report         Chapter
     MSH              Message Header               II
     MSA              Message Acknowledgement      II
     { QRD            Query Definition             V
       [ QRF ]        Query Filter                 V
       [ PID ]        PATIENT ID                   III
       [{NTE*}]*}                                  was: [{NTE]}}
       { [ ORC ]      Order common
         OBR          Observation request          VII
         {[NTE]}      Notes and comments           II
         {[OBX]       Result                       VII
          {[NTE]}     Notes and comments           II
         }
       }
     * *                                           was: }
     [DSC]            Continuation Pointer         V

   While we cannot be sure who caused this error, whether it was the original editor or someone in the German section of HL7, this is further evidence of the need to edit this document with different methods. Emacs, for example, is an editor which shows parenthesis matching and warns about mismatched parentheses while they are typed. This is more convenient than WYSIWYG, since while typing a document, control of correctness and consistency should take precedence over immediate control of layout.(1)

   ---------- Footnotes ----------

   (1) not to repeat here what was stated about control of layout above (*note The HL7 document::.)
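   For the record, once the bracket corrections shown in the ORM table above are applied, that definition translates into the message/4 form introduced earlier roughly as shown below. This is a sketch rather than the literal output of the extraction scripts: the event type code (second argument) and the atom standing for `any order segment' are left open, since the extracted text does not fix them.

     %% sketch of the corrected ORM definition in message/4 form; the
     %% event code and the `any_order_segment' placeholder are assumed
     message(orm, _, "order message",
             [msh,
              opt(rep(nte)),
              opt(pid, opt(rep(nte)), opt(pv1)),
              rep(orc,
                  opt(any_order_segment, opt(rep(nte)),
                      opt(rep(obx), opt(rep(nte)))),
                  opt(blg))]).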