PROPOSAL TO THE HL7 WORKING GROUP

	       AN IMPROVED HL7 QUERY PROCESSING METHOD

			   GUNTHER SCHADOW
		     <schadow@ukbf.fu-berlin.de>

			 DRAFT 2, 11/01/1996
	      supersedes DRAFT 1, 08/01/1995


INTRODUCTION

Query processing with HL7 is still an open issue. While version 2.2
moved the query section into the control section and defined more
codes for specifying the query by means of the ``what filter'', there
is still a disproportion between the query management information and
the query specification information transferred with the HL7 QRY
message. Display vs. record oriented query response, immediate vs.
deferred response and interactive continuation are explained in depth
in the section about queries, but the general concept of uttering a
request for some specific information was left out.

The original query mechanism provided by HL7 follows a concept of
formulating queries by filters. Filters modify a set of objects by
letting pass only those objects that have certain characteristics
defined by the filter. Applied to queries, there must be a set of all
possible answers to all possible questions, from which the filter
specifications (i.e. What Filter, Who Filter etc.) take away what the
requestor does not need to know. It is as yet unclear how the set of
all possible answers is constructed.

Version 2.3 introduced numerous modes for querying and for formatting
replies. These require no fewer than seven new segments and four
different new means of formulating queries: EQL is an envelope for the
SQL-derived ``embedded query language'', VTQ uses a ``selection
criteria'' formula with operators and conjunctions, while ERQ uses a
sloppily defined two-component type with an implicit equality
operator, whose second part can represent either a nested composite or
a list. Finally, the SPR uses a stored procedure call with arguments.
The first three of these methods have in common that they use the data
element numbers in order to reference field values. The underlying
concept obviously is that of a relational model, where segments are
regarded as relations and fields as their attributes.

The current HL7 standard explicitly disclaims providing a standard
way of formulating queries that is aimed at interoperability between
systems. Thus, the last paragraph of subsection 2.15.2 states:

  ``In particular, there is no implication that a specific system 
    must support generalized queries to comply with the Standard. 
    Rather, these transactions provide a format, or a set of tools 
    to support queries to the extent desired by the institution. 
    The resources available and local policies will influence the 
    type of queries that are implemented.''

And later in subsection 2.21.1 and 2 item b):

  ``The format chosen for the query segments are very general.
    This might be read by prospective implementors to imply that
    the requirement for using the Standard is the ability to
    respond to a wide variety of inquiries.  This is not the
    intent.  The format here can be used with specific
    restrictions in any interface.''

The addition of new information that can be requested will most
probably imply the addition of items to the What Filter, or new stored
procedures. New items of sub filters will have to be defined as well,
as the need for new specific questions arises. The filter codes or
stored procedure names will eventually get very sophisticated, not to
say complex. In the end, any information transmittable by means of HL7
messages must be queryable by means of specific filter codes,
resulting in a mapping of the complete HL7 standard onto filter codes!

If this is not intended, the standard tells us to use the EQL, VTQ or
ERQ terms instead. Here you can specify the query by means of the data
element numbers, which is already a mapping of parts of the standard
to a meta-code. Originally, the data element numbers served the
administrative needs of the HL7 standard; they could be used (but were
not necessary) in setting up a data dictionary of HL7. Now these
meta-codes become part of the language, and HL7 starts to talk not
only about health care data, but also about itself! Generally, whether
query filter codes, stored procedure names, or data element numbers
are used, all of these methods introduce meta-talk into HL7
communication, suggesting that there is no other way to ask a
question than by meta-communication.

In short, the dilemma of the current HL7 query domain can be summarized
as follows: The query mechanism is not specific enough to be called a
standard, since it leaves too much open, but as it becomes specific it
will become very complex. In theory it will -- once complete -- double
the effort needed to implement an HL7 interface: to implement first
order HL7 and then to implement second order meta-HL7.

However, the basic problem with all four approaches is that they are
not only costly, but also weak and therefore very inefficient. The
methods are weak because there is a lot that can be expressed with the
messages of the HL7 standard but nevertheless cannot be asked for
using HL7 queries. Actually it is not even possible to ask the most
natural queries of daily health care. In the example I will confine
myself to the EQL method because it claims to be the most general and
powerful alternative. In short, given that the relations of the EQL
are the segments, you can only produce selections and projections but
no joins. Sounds harmless? Then consider the following question:

  ``What is yesterday's hemoglobin value of Carolyn Evans?''

  SELECT OBX.@00573, OBX.@00574 FROM PID, OBX
    WHERE PID.@00108.1 = 'EVANS' AND PID.@00108.2 = 'CAROLYN'
      AND OBX belongs-to PID
      AND OBX.@00571.1 = '718-7' AND OBX.@00571.3 = 'LN'

could be the EQL query of choice, if the equivalence operator
``belongs-to'' existed to build the natural join. However, it does
not exist, and the segments have no way to reference each other by
means of attributes which could be subject to a join operation. Do we
need more examples in order to see that there is something really
wrong with the many expensive query methods suggested by version 2.3?

Aside from the  event replay mechanism, which I  admit to be  the only
one  that does make at   least some sense,   try to ask the  following
question: 

  ``Who changed beds with John Doe last Monday?''

This question clearly relates to the ADT^A17 message (swap patients)
and is un-askable, because two PID segments have to be joined based on
their incidental occurrence within the same transaction. Since there
is no notion of a message or segment group in EQL, nor is there a
common field of two segments occurring within the same message, there
can be no such question. But -- to avoid being misunderstood here --
this is not to say that there is anything wrong with the normal HL7
messages and segments which would have to be changed in order to
facilitate querying. The bad news is that querying as it is suggested
by today's HL7 is weak and inefficient; the good news is that there is
an easy and straightforward alternative that comes with the single
overhead of 1 (one) field in the message header and a straightforward
implementation technique. I think that there is a more efficient
approach to querying than the current use of eleven segments, four
query languages, numerous codes and all the meta-talk that was
introduced into HL7 communication.


A SIMPLE SOLUTION 

A simple solution to  the dilemma pointed out above  can be drawn from
what  anybody knows and deals with  painlessly  and naturally. What do
you mean? I mean look at natural languages! Compare the indicative and
interrogative forms  of their grammars. Indicative   forms are used to
predicate  something, to give  away information. In  terms of HL7, the
unsolicited update is an indicative form. Interrogative forms are used
to request some information. An  interrogative sentence is uttered  in
order to   solicit an indicative  sentence that  fills  in the  gap of
information of the requestor. The one who asks provides as much
information as    possible  and  marks  his   informational  gaps with
interrogative pronouns. The polite answer  will be in a whole sentence
that repeats most  of the words that have  been already  spoken in the
dialog. 

The following example from the English language shall illustrate this.
First I give an indicative form that carries some patient information:

S1: ``John Doe was admitted in the emergency room at 1:15 AM.''

Now  a question that queries some  information that the above sentence
S1 gives: 

S2: ``Who was admitted in the emergency room at 1:15 AM?''

The interrogative pronoun ``who'' marks the missing information, while
the rest of the sentence S2 gives as much information as is known, in
order to be as specific as possible. The formulation of the query is
straightforward. To answer, the interrogative form simply has to be
filled in at the places marked by the interrogative pronouns.

To  picture the disadvantages  of   the original HL7  query  mechanism
within the context of this example I give the following sentence: 

  ``It is now 6:30 PM and I want to ask my 315th question. Please 
    give me an immediate answer in whole sentences not exceeding 21 
    words or 234 characters. Leave out any information which is 
    not associated with admission of patients, do not tell me 
    anything about patients named other than "John Doe" and just 
    tell me about what happened at 1:15.  Ah, yes, the answer 
    should be somehow concerned with the emergency room.''

It is obvious that the formulation of the query is currently anything
but straightforward and that there is too much room for secondary
management information of unspecified use. The same is true for using
embedded SQL, which would sound as follows:

  ``I have a question with the tag `TAG001' to which I gave the name 
    `MY_FIRST_EQL_QUERY'. Please answer in whole sentences: 
    $\prod_{PID.@00108} ( PID \bowtie_{PID \simeq PV1} ( 
    \sigma_{@00144='EMERGENCY ROOM' \wedge @00174='199510140115'} PV1 
    ) )$.''

The actual question is provided here in a relational algebra form
printable by TeX in order to emphasize the point: The question is
formulated in a completely different language than the one in which I
normally speak to you. That way, I degrade the normal language to
serve just for envelope information and not for my real intention: to
ask a fairly simple question. By the way: this is another example of
an impossible join operation, which means that the algebraic
tech-speak I am using cannot even serve my needs. Assume that I have
eliminated that minor problem by bilateral negotiations or interface
engines.

I promise I will not continue with this nonsense, but keep in mind
that I just translated the current querying practice of HL7 into
English. Apart from the querying chapter, the HL7 standard consists of
a set of messages concerning certain problem domains (ADT, Finance,
Result Reporting, Order Entry). Most of the messages represent
indicative forms, i.e. they are used to deliver information. The order
entry domain can be considered an exception to this rule, since it is
really an imperative form. However, the imperative form does not use
special semantic elements (like interrogative pronouns) in addition to
the mere mark that a sentence is imperative rather than indicative.
Therefore Order Entry can be considered an indicative message within
this context.

The following example shows the English sentence S1 given above in HL7
form using the usual HL7 encoding rules:

M1:

MSH|^~\&|||||||ADT^A01|
EVN|A01|199510140115
PID|||||Doe^John
PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

The message contains only a minimal set of data in order to reflect
the sentence S1: ``John Doe was admitted in the emergency room at 1:15
AM''. Now let us construct a simple query message asking for what
sentence S2 asks for, i.e. the name of whoever was admitted at the
specified source and time.

Deriving a simple query message from this standard HL7 message can be
as straightforward as the derivation of the English query sentence S2
from the indicative sentence S1: A normal message is sent, which is
marked as a query message rather than a simple indicative message.
This mark, the ``question mark'', might be positioned in the MSH
segment as an additional field of the segment. Then the message
supplies all information about the subjects in question that is
available.

M2:

MSH|^~\&|||||199510141830||ADT^A01||||||||||||||||||QRY
EVN|A01|199510141815
PID|||||
PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

The additional field ``QRY'' at the end of the MSH segment marks the
whole message as a query message. Since the event type is an A01 --
admit a patient -- we ask for patient admissions. The PID segment
leaves the patient name empty, indicating that it is the patient name
we want to know. The PV1 segment gives the rest of the information
that we already have in order to specify the question completely.
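
The derivation of a query message from an indicative message can be
sketched in a few lines of Python. This is an illustration only and
not part of the proposal: the function names are hypothetical, and
appending the marker as the last MSH field is an assumption, since the
text only asks for ``an additional field''.

```python
# Illustrative sketch: derive a query message from an indicative message.
# Assumption: the query marker is simply appended as the last MSH field.

def parse(message):
    """Split an HL7 v2 message into a list of field lists, one per segment."""
    return [line.split("|") for line in message.strip().splitlines()]

def render(segments):
    """Inverse of parse()."""
    return "\n".join("|".join(fields) for fields in segments)

def make_query(message, unknowns, marker="QRY"):
    """Blank the fields we ask about and mark the message as a query.

    unknowns is a list of (segment id, split index) pairs,
    e.g. ("PID", 5) for the patient name.
    """
    segments = parse(message)
    for fields in segments:
        if fields[0] == "MSH":
            fields.append(marker)      # the "question mark"
        for seg_id, index in unknowns:
            if fields[0] == seg_id and index < len(fields):
                fields[index] = ""     # the informational gap
    return render(segments)

# The indicative message M1 (PV1-14 admit source, PV1-44 admit date/time).
M1 = "\n".join([
    r"MSH|^~\&|||||||ADT^A01|",
    "EVN|A01|199510140115",
    "PID|||||Doe^John",
    "PV1" + "|" * 14 + "EMERGENCY ROOM" + "|" * 30 + "199510140115",
])

query = make_query(M1, [("PID", 5)])
print(query)
```

The sketch leaves the EVN and MSH time stamps untouched; a real sender
would presumably set them to the time of the query.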

The strength and simplicity of this querying method are obvious: 

1. It provides exact specification of queries in any domain that is 
   covered by the HL7 standard. 

2. It is very simple by being derived from the standard HL7 messages 
   with only one minimal change for the question mark. 

3. Particularly it does not require the mapping of any HL7 object into 
   some filter or EQL specification codes as is required by the 
   existing QRY message. 

4. It is applicable to any future evolution of the HL7 standard with no 
   extra effort. 

5. It is easy to implement in existing applications because it can reuse 
   existing database interfaces and does not require an extra query 
   processor. 


Advantage number 5 has not been shown yet, but I think it is a very
important feature of the proposed querying method that accounts for
another large part of its efficiency. Suppose you have written an HL7
interface to your database that works as follows:

You have written a function:

  insert_hl7(DBHandle db, HL7Message msg)

which transforms the HL7 message:

MSH|^~\&|||||||ADT^A01|
EVN|A01|199510140115
PID|||||Doe^John
PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

to an SQL term like:

  INSERT 
    INTO MSH, EVN, PID, PV1 
    VALUES (
		MSH.KEY   = GENKEY(),
		MSH.MTYPE = 'ADT^A01',
		EVN.KEY	  = MSH.KEY,
		EVN.ETYPE = 'A01',
		EVN.TIME  = '199510140115',
		PID.KEY	  = MSH.KEY,
		PID.NAME  = 'Doe^John',
		PV1.KEY	  = MSH.KEY,
		PV1.ADSRC = 'EMERGENCY ROOM',
		PV1.ATIME = '199510140115'
    );

which is   then fed to the  database  server for insertion. It  is now
simple to use the  existing source code in order  to derive the  query
processor 

  HL7Message select_hl7(DBHandle db, HL7Message qry) 

that takes an HL7 query message and returns an HL7 response message. It
transforms

MSH|^~\&|||||199510141830||ADT^A01||||||||||||||||||QRY
EVN|A01|199510141815
PID|||||
PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

to an SQL term like:

  SELECT
	MSH.MTYPE : mtype,
	EVN.ETYPE : etype,
	EVN.TIME  : time,
	PID.NAME  : name, 
	PV1.ADSRC : adsrc,
	PV1.ATIME : atime,
	...
     FROM MSH, EVN, PID, PV1
     WHERE
		MSH.MTYPE = 'ADT^A01'
	AND	EVN.KEY   = MSH.KEY
	AND	EVN.ETYPE = 'A01'
	AND	EVN.TIME  = '199510140115'
	AND	PID.KEY   = MSH.KEY
	AND	PV1.KEY   = MSH.KEY
	AND	PV1.ADSRC = 'EMERGENCY ROOM'
	AND	PV1.ATIME = '199510140115';

where the variables mtype, etype, time, name, adsrc, atime, ... are
written into the response message which is returned to the originator
of the query. Now, is that straightforward? Indeed it is, and it does
not require any nifty data dictionary (of course the same example
works as well with a data dictionary). So there is no longer a need to
deny the general availability of querying capabilities for HL7
interfaces, since it is so simple. Similar examples can be constructed
for non-SQL databases and even flat file storage. It is because
querying is no longer magic, but uses mechanisms parallel to updating,
that it is so simple.
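
The parallel between updating and querying can be sketched with
Python and its standard sqlite3 module. This is an illustration under
simplifying assumptions, not the interface described above: all mapped
fields are stored in one hypothetical table `adt` instead of one table
per segment, the EVN time is left out, and the query returns rows
rather than a finished HL7 response message. The point is only that
both functions share the same field-to-column mapping.

```python
import sqlite3

# Hypothetical mapping: (segment id, split index) -> column of table "adt".
COLUMNS = {
    ("MSH", 8): "mtype",    # MSH-9  message type
    ("EVN", 1): "etype",    # EVN-1  event type
    ("PID", 5): "name",     # PID-5  patient name
    ("PV1", 14): "adsrc",   # PV1-14 admit source
    ("PV1", 44): "atime",   # PV1-44 admit date/time
}

def extract(message):
    """Return {column: value} for every mapped field; missing fields are ''."""
    segments = {}
    for line in message.strip().splitlines():
        fields = line.split("|")
        segments[fields[0]] = fields
    row = {}
    for (seg_id, index), column in COLUMNS.items():
        fields = segments.get(seg_id, [])
        row[column] = fields[index] if index < len(fields) else ""
    return row

def create(db):
    db.execute("CREATE TABLE adt (%s)" % ", ".join(COLUMNS.values()))

def insert_hl7(db, msg):
    """Store the mapped fields of an indicative message."""
    row = extract(msg)
    marks = ", ".join("?" for _ in row)
    db.execute(f"INSERT INTO adt ({', '.join(row)}) VALUES ({marks})",
               list(row.values()))

def select_hl7(db, qry):
    """Use the filled-in fields of a query message as the WHERE clause."""
    known = {c: v for c, v in extract(qry).items() if v}
    where = " AND ".join(f"{c} = ?" for c in known) or "1 = 1"
    columns = list(COLUMNS.values())
    cursor = db.execute(f"SELECT {', '.join(columns)} FROM adt WHERE {where}",
                        list(known.values()))
    return [dict(zip(columns, values)) for values in cursor.fetchall()]

PV1 = "PV1" + "|" * 14 + "EMERGENCY ROOM" + "|" * 30 + "199510140115"
M1 = "\n".join([r"MSH|^~\&|||||||ADT^A01|", "EVN|A01|199510140115",
                "PID|||||Doe^John", PV1])
M2 = "\n".join([r"MSH|^~\&|||||199510141830||ADT^A01", "EVN|A01|199510141815",
                "PID|||||", PV1])

db = sqlite3.connect(":memory:")
create(db)
insert_hl7(db, M1)
answers = select_hl7(db, M2)
print(answers)
```

The empty patient name in the query contributes no condition, so the
stored admission is found and its name can be written into the
response.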


A NOTE ON EVENT REPLAY 

There is some similarity between this approach and the event replay
approach of HL7 v2.3; however, there are differences. Querying in
event replay mode is nearly as weak as it is in the other modes,
except that there is the notion of a message. Still it is
meta-communication instead of first order communication. The result
message for event replay also adds unnecessary overhead to HL7
communication. Do we really need a QAK segment? Can't that data go
into the message header? I propose not to double the number of
syntactically different messages, as is implied by an extra event
replay response message for every HL7 message/event.


FINE POINTS OF QUERY SPECIFICATION 

The interrogative semantics that was added to the HL7 messages
requires two points to be examined more carefully. This is however
not of immediate importance, since the querying mechanism as proposed
is already many times more efficient and powerful than the querying
techniques we have now. And this is reason enough to consider it. What
follows is some reasoning about querying for ranges and using
conjunctions other than `AND' for the clauses, but, again, this is of
secondary importance. The main point is that such things can be done.

Again, let's have a look at the simple English interrogative sentence
S2 above. It seems to be a problem that the admission time was given
as exactly 1:15 AM. For the common scenarios this is certainly an
overspecification resulting in no answer, because patients might have
been admitted at 1:16 or 1:14 but not at 1:15 AM. It is obvious that
we need a method to specify a range of time rather than an exact time.
The same is true for a query that searches for any patient with a
preprandial blood glucose level higher than 200 mg/dl or similar
questions that might be of interest for studies.

By specifying a  range of time or  blood glucose concentration we mean
to give a set of possible values. 1:00 AM -- 1:30 AM is a set of times
between  the given    boundaries  and  > 200   mg/dl    is a  set   of
concentrations greater  than  200 mg/dl.  Other sets  like complements
(all times except 1:00 AM  -- 1:30 AM)  and supersets and combinations
thereof are possible. Qualitative values that are normally represented
by codes can build sets by enumerating their elements. Thus, we must
find a way to give sets in place of any specific value.

A  syntax  to  specify    these  sets would   contain   the  elements:
enumeration, range  and  complement. A  BNF  specification of   such a
syntax would be as follows: 

set         : item | item superset_op set ;
item        : range | element ;
range       : range_op boundary
	    | boundary range_op
	    | boundary range_op boundary ;
element	    : value ;
boundary    : value ;

With the  terminal symbols  set to concrete  symbols  according to the
following list we can build example terms of set specifications: 

superset_op : `,' ;
range_op    : `-' ;

Thus the set given by

    199510140100-199510140130

is the time range between 1:00 and 1:30 AM on October 14, 1995. The
specification 

    410^ACUTE MYOCARDIAL INFARCT*^I9C,415.1^PULMONARY EMBOLISM/INFARCT^I9C

means the superset of both diseases. Other forms are possible for the
coded value (CE typed data) by omitting the clear text values and
giving only the codes:

    410,415.1^^I9C 

A range  of  blood  glucose values  higher  than  200 mg/dl could   be
specified as follows in an OBX segment: 

    OBX|...|200-|mg/dl|... 

For numerical values, the usage of the `>' and `<' operators is
already mentioned, which would look like:

    OBX|...|>200|mg/dl|... 

Finally, a more complex set which selects blood glucose levels less
than 5 mg/dl or more than 200 mg/dl is given here:
 
    OBX|...|-5,200-|mg/dl|... 

This proposal is not  complete  with regard  to  a specific syntax  to
specify sets of  values. What was to be  shown in this  discussion was
that  such a  syntax  can be  defined (again increasing  the number of
encoding characters)  and that it should  be defined. In fact, the OBX
example shows that there is need for a general range/set specification
method even from other aspects than querying. 
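
To show that the grammar given above is easy to process, here is a
minimal Python sketch of a parser and a membership test for numeric
values. It is an illustration only and rests on assumptions the
proposal does not make: values are unsigned numbers, range boundaries
are inclusive, and `,' and `-' never occur inside a value. The names
`parse_set` and `matches` are hypothetical.

```python
# Sketch of the set syntax: set = item {"," item}, item = range | element.
# Assumptions: unsigned numeric values, inclusive boundaries.

def parse_set(spec):
    """Parse e.g. '-5,200-' into [(None, 5.0), (200.0, None)].

    Each item becomes a (low, high) pair; None marks an open end and a
    single element is stored as (value, value).
    """
    items = []
    for item in spec.split(","):
        if "-" in item:
            low, high = item.split("-", 1)
            items.append((float(low) if low else None,
                          float(high) if high else None))
        else:
            value = float(item)
            items.append((value, value))
    return items

def matches(spec, value):
    """True if value lies in at least one item of the set specification."""
    return any((low is None or low <= value) and
               (high is None or value <= high)
               for low, high in parse_set(spec))

print(parse_set("-5,200-"))
print(matches("199510140100-199510140130", 199510140115))
```

A query processor could translate each (low, high) pair into a
comparison in its own query language instead of testing values one by
one.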

The second  point of examination is  a direct  consequence of the last
one: Where  there are ranges  and sets supplied for  query parameters,
the result  usually will be  a set  of  elements  for which the  query
specification is true. This set should be transmitted as a sequence of
response messages. If we do not want to add segments that would
identify the messages of a query response to the existing message
definitions, this information should be carried in the MSH segment as
the continuation pointer. The exact contents of this field or the
definition of more fields is subject to future drafts of this
proposal.
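
As a purely hypothetical illustration of carrying this information in
the MSH segment, the following Python sketch chains a sequence of
response messages through the continuation pointer field (MSH-14 in
HL7 version 2). The pointer format used here, a query identifier
followed by the number of the next message, is an assumption made for
the example only, since the proposal leaves the contents of the field
open.

```python
# Sketch: link the messages of one query response through MSH-14.
# Assumption: the pointer is "<query id>.<number of the next message>",
# and it is left empty in the last message of the sequence.

CONTINUATION = 13  # split index of MSH-14 (MSH-1 is the separator itself)

def chain_responses(responses, query_id):
    """Return copies of the responses with the continuation pointer set."""
    chained = []
    for number, message in enumerate(responses, start=1):
        lines = message.splitlines()
        msh = lines[0].split("|")
        msh += [""] * (CONTINUATION + 1 - len(msh))  # pad up to MSH-14
        is_last = number == len(responses)
        msh[CONTINUATION] = "" if is_last else f"{query_id}.{number + 1}"
        chained.append("\n".join(["|".join(msh)] + lines[1:]))
    return chained

responses = [
    "\n".join([r"MSH|^~\&|||||||ADT^A01|", "PID|||||Doe^John"]),
    "\n".join([r"MSH|^~\&|||||||ADT^A01|", "PID|||||Evans^Carolyn"]),
]
for message in chain_responses(responses, "Q1"):
    print(message.splitlines()[0])
```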