PROPOSAL TO THE HL7 WORKING GROUP

AN IMPROVED HL7 QUERY PROCESSING METHOD

GUNTHER SCHADOW

DRAFT 2, 11/01/1996
supersedes DRAFT 1, 08/01/1995

INTRODUCTION

Query processing with HL7 is still an open issue. While version 2.2 moved the query section into the control section and defined more codes for specifying the query by means of the ``what filter'', there is still a disproportion between the query management information and the query specification information transferred with the HL7 QRY message. Display vs. record oriented query response, immediate vs. deferred response, and interactive continuation are explained in depth in the section about queries, but the general concept of uttering a request for some specific information was left out.

The original query mechanism provided by HL7 follows a concept of formulating queries by filters. Filters modify a set of objects by letting pass only those objects that have certain characteristics defined by the filter. Applied to queries, there must be a set of all possible answers to all possible questions, from which the filter specifications (i.e. What Filter, Who Filter etc.) take away what the requestor does not need to know. It is as yet unclear how the set of all possible answers is constructed.

Version 2.3 invented numerous modes for querying and for formatting replies. These require no fewer than seven new segments and four different new means of formulating queries: EQL is an envelope for the SQL-derived ``embedded query language''; VTQ uses a ``selection criteria'' formula with operators and conjunctions; ERQ uses a sloppily defined two-component type with an implicit equality operator, whose second part can represent either a nested composite or a list. Finally, SPR uses a stored procedure call with arguments. The first three methods have in common that they use the data element numbers to reference field values.
The underlying concept obviously is that of a relational model, where segments are regarded as relations and fields as their attributes.

The current HL7 standard explicitly disclaims providing a standard way of formulating queries that is aimed at interoperability between systems. Thus, the last paragraph of subsection 2.15.2 states:

``In particular, there is no implication that a specific system must support generalized queries to comply with the Standard. Rather, these transactions provide a format, or a set of tools to support queries to the extent desired by the institution. The resources available and local policies will influence the type of queries that are implemented.''

And later in subsection 2.21.1 and 2 item b):

``The format chosen for the query segments are very general. This might be read by prospective implementors to imply that the requirement for using the Standard is the ability to respond to a wide variety of inquiries. This is not the intent. The format here can be used with specific restrictions in any interface.''

The addition of new information that can be requested will most probably imply the addition of items to the What Filter, or of new stored procedures. New sub-filter items will have to be defined as the need for new specific questions arises. The filter codes or stored procedure names will eventually get very sophisticated, not to say complex. In the end, any information transmittable by means of HL7 messages must be queryable by means of specific filter codes, resulting in a mapping of the complete HL7 standard onto filter codes!

If this is not intended, the standard tells us to use the SQL, VTQ or ERQ terms instead. Here you can specify the query by means of the data element numbers, which is already a mapping of parts of the standard to a meta-code. Originally, the data element numbers served the administrative needs of the HL7 standard; they could be used (but were not required) in setting up a data dictionary of HL7.
Now these meta-codes become part of the language, and HL7 starts to talk not only about health care data but also about itself! Generally, whether query filter codes, stored procedure names, or data element numbers are used, all of these methods introduce meta-talk into HL7 communication, suggesting that there is no other way to ask a question than by meta-communication.

In short, the dilemma of the current HL7 query domain can be summarized as follows: the query mechanism is not specific enough to be called a standard, because it contains too many ``left over to'' clauses, but as it becomes specific it will become very complex. In theory it will, once complete, double the effort needed to implement an HL7 interface: first to implement first-order HL7 and then to implement second-order meta-HL7.

However, the basic problem with all four approaches is that they are not only costly but also weak and therefore very inefficient. The methods are weak because there is a lot that can be expressed with the messages of the HL7 standard but nevertheless cannot be asked for using HL7 queries. Actually, it is not even possible to ask the most natural questions of daily health care. In the example I will confine myself to the EQL method because it claims to be the most general and powerful alternative. Put concisely: given that the relations of EQL are the segments, you can only produce selections and projections, but no joins. Sounds harmless? Then consider the following question: ``What was yesterday's hemoglobin value of Carolyn Evans?''

    SELECT OBX.@00573, OBX.@00574
    FROM PID, OBX
    WHERE PID.@00108.1 = 'EVANS'
      AND PID.@00108.2 = 'CAROLYN'
      AND OBX belongs-to PID
      AND OBX.@00571.1 = '718-7'
      AND OBX.@00571.3 = 'LN'

could be the EQL query of choice, if the equivalence operator ``belongs-to'' existed to build the natural join. However, it does not exist, and the segments have no way to reference each other by means of attributes which could be subject to a join operation.
Do we need more examples in order to see that there is something really wrong with the many expensive query methods suggested by version 2.3? Aside from the event replay mechanism, which I admit to be the only one that makes at least some sense, try to ask the following question: ``Who changed beds with John Doe last Monday?'' This question clearly relates to the ADT^A17 message (swap patients) and is unaskable, because two PID segments have to be joined based on their incidental occurrence within the same transaction. Since there is no notion of a message or segment group in EQL, nor is there a common field of two segments occurring within the same message, there can be no such question.

But, to avoid being misunderstood here, this is not to say that there is anything wrong with the normal HL7 messages and segments that would have to be changed in order to facilitate querying. The bad news is that querying as suggested by today's HL7 is weak and inefficient; the good news is that there is an easy and straightforward alternative that comes with the single overhead of 1 (one) field in the message header and a straightforward implementation technique. I think that there is a more efficient approach to querying than the current use of eleven segments, four query languages, numerous codes and all the meta-talk that was introduced into HL7 communication.

A SIMPLE SOLUTION

A simple solution to the dilemma pointed out above can be drawn from what everybody knows and deals with painlessly and naturally. What do you mean? I mean: look at natural languages! Compare the indicative and interrogative forms of their grammars. Indicative forms are used to predicate something, to give away information. In terms of HL7, the unsolicited update is an indicative form. Interrogative forms are used to request some information. An interrogative sentence is uttered in order to solicit an indicative sentence that fills the requestor's gap of information.
The one who asks provides as much information as possible and marks his informational gaps with interrogative pronouns. The polite answer will be a whole sentence that repeats most of the words that have already been spoken in the dialog.

The following example from the English language shall illustrate this. First I give an indicative form that carries some patient information:

    S1: ``John Doe was admitted in the emergency room at 1:15 AM.''

Now a question that asks for some of the information that sentence S1 gives:

    S2: ``Who was admitted in the emergency room at 1:15 AM?''

The interrogative pronoun ``who'' marks the missing information, while the rest of sentence S2 gives as much information as is known, in order to be as specific as possible. The formulation of the query is straightforward. The interrogative form is simply to be filled in at the places marked by the interrogative pronouns.

To picture the disadvantages of the original HL7 query mechanism within the context of this example, I give the following sentence:

``It is now 6:30 PM and I want to ask my 315th question. Please give me an immediate answer in whole sentences not exceeding 21 words or 234 characters. Leave out any information which is not associated with admission of patients, do not tell me anything about patients named other than "John Doe" and just tell me about what happened at 1:15. Ah, yes, the answer should be somehow concerned with the emergency room.''

It is obvious that the formulation of the query is currently anything but straightforward and that there is too much room for secondary management information of unspecified use. The same is true for using embedded SQL, which would sound as follows:

``I have a question with the tag `TAG001' to which I gave the name `MY_FIRST_EQL_QUERY'.
Please answer in whole sentences: $\Pi_{PID.@00108} ( PID \bowtie_{PID \simeq PV1} ( \sigma_{@00144='EMERGENCY ROOM' \wedge @00174='199510140115'} PV1 ) )$.''

The actual question is given here in relational algebra, in a form printable by TeX, in order to emphasize the point: the question is formulated in a completely different language from the one in which I normally speak to you. That way, I degrade the normal language to serve just for envelope information and not for my real intention: to ask a fairly simple question. By the way, this is another example of an impossible join operation, which means that the algebraic tech-speak I am using cannot even serve my needs; assume that I have eliminated that minor problem by bilateral negotiations or interface engines. I promise I will not continue with this nonsense, but keep in mind that I have just translated the current querying practice of HL7 into English.

Apart from the querying chapter, the HL7 standard consists of a set of messages concerning certain problem domains (ADT, Finance, Result Reporting, Order Entry). Most of the messages represent indicative forms, i.e. they are used to deliver information. The order entry domain can be considered an exception to this rule, since it is really an imperative form. However, the imperative form does not use special semantic elements (like interrogative pronouns) beyond the mere mark that a sentence is imperative rather than indicative. Therefore an Order Entry message can be considered an indicative message within this context.

The following example shows the English sentence S1 given above in HL7 form, using the usual HL7 encoding rules:

    M1: MSH|^~\&|||||||ADT^A01|
        EVN|A01|199510140115
        PID|||||Doe^John
        PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

The message contains only a very minimal set of data in order to reflect the sentence S1: ``John Doe was admitted in the emergency room at 1:15 AM''.
Now let us construct a simple query message asking for what sentence S2 asks for, i.e. the name of the person who was admitted at the specified source and time. Deriving a simple query message from this standard HL7 message can be as straightforward as deriving the English query sentence from the indicative sentence (S1): a normal message is sent, which is marked as a query message rather than a simple indicative message. This mark, the ``question mark'', might be positioned in the MSH segment as an additional field. The message then supplies all the information that is available about the subject in question.

    M2: MSH|^~\&|||||199510141830||ADT^A01||||||||||||||||||QRY
        EVN|A01|199510141815
        PID|||||
        PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

The ``QRY'' mark in the additional field at the end of the MSH segment marks the whole message as a query message. Since the event type is A01 (admit a patient), we ask for patient admissions. The PID segment leaves the patient name empty, indicating that it is the patient name we want to know. The PV1 segment gives the rest of the information that we already have in order to specify the question completely.

The strength and simplicity of this querying method is obvious:

1. It provides exact specification of queries in any domain that is covered by the HL7 standard.

2. It is very simple, being derived from the standard HL7 messages with only one minimal change for the question mark.

3. In particular, it does not require the mapping of any HL7 object into filter or EQL specification codes, as is required by the existing QRY message.

4. It is applicable to any future evolution of the HL7 standard with no extra effort.

5. It is easy to implement in existing applications because it can reuse existing database interfaces and does not require an extra query processor.
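The derivation of a query message from an indicative message can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names (parse, render, to_query) are hypothetical, and the position of the extra MSH field (MARK_INDEX) is an arbitrary choice, since this proposal does not fix it.

```python
QUERY_MARK = "QRY"
MARK_INDEX = 26  # assumed position of the extra MSH field; not fixed by the proposal


def parse(message):
    """Split an HL7 message into segments, each a list of fields."""
    return [line.split("|") for line in message.strip().splitlines()]


def render(segments):
    """Join lists of fields back into message text, one segment per line."""
    return "\n".join("|".join(fields) for fields in segments)


def to_query(message, unknown):
    """Blank the (segment, field index) pairs in `unknown` and mark MSH as a query."""
    segments = parse(message)
    for fields in segments:
        name = fields[0]
        for seg_name, index in unknown:
            if seg_name == name and index < len(fields):
                fields[index] = ""  # the informational gap
        if name == "MSH":
            # Pad the header so that the mark lands at MARK_INDEX.
            fields.extend([""] * (MARK_INDEX + 1 - len(fields)))
            fields[MARK_INDEX] = QUERY_MARK
    return render(segments)


# The indicative message M1 from the text.
M1 = "\n".join([
    r"MSH|^~\&|||||||ADT^A01|",
    "EVN|A01|199510140115",
    "PID|||||Doe^John",
    "PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115",
])

# Ask for the patient name: field 5 of the PID segment becomes the gap.
M2 = to_query(M1, unknown=[("PID", 5)])
print(M2)
```

Everything the requestor already knows stays in place; only the unknown field is emptied and the header is marked.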
Advantage number 5 has not been shown yet, but I think it is a very important feature of the proposed querying method, one that accounts for a good deal of its efficiency. Suppose you have written an HL7 interface to your database that works as follows. You have written a function

    insert_hl7(DBHandle db, HL7Message msg)

which transforms the HL7 message

    MSH|^~\&|||||||ADT^A01|
    EVN|A01|199510140115
    PID|||||Doe^John
    PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

to an SQL term like

    INSERT INTO MSH, EVN, PID, PV1 VALUES (
        MSH.KEY   = GENKEY(),
        MSH.MTYPE = 'ADT^A01',
        EVN.KEY   = MSH.KEY,
        EVN.ETYPE = 'A01',
        EVN.TIME  = '199510140115',
        PID.KEY   = MSH.KEY,
        PID.NAME  = 'Doe^John',
        PV1.KEY   = MSH.KEY,
        PV1.ADSRC = 'EMERGENCY ROOM',
        PV1.ATIME = '199510140115'
    );

which is then fed to the database server for insertion. It is now simple to use the existing source code in order to derive the query processor

    HL7Message select_hl7(DBHandle db, HL7Message qry)

that takes an HL7 query message and returns an HL7 response message. It transforms

    MSH|^~\&|||||199510141830||ADT^A01||||||||||||||||||QRY
    EVN|A01|199510141815
    PID|||||
    PV1||||||||||||||EMERGENCY ROOM||||||||||||||||||||||||||||||199510140115

to an SQL term like

    SELECT MSH.MTYPE : mtype,
           EVN.ETYPE : etype,
           EVN.TIME  : time,
           PID.NAME  : name,
           PV1.ADSRC : adsrc,
           PV1.ATIME : atime,
           ...
    FROM MSH, EVN, PID, PV1
    WHERE MSH.MTYPE = 'ADT^A01'
      AND EVN.KEY   = MSH.KEY
      AND EVN.ETYPE = 'A01'
      AND EVN.TIME  = '199510140115'
      AND PID.KEY   = MSH.KEY
      AND PV1.KEY   = MSH.KEY
      AND PV1.ADSRC = 'EMERGENCY ROOM'
      AND PV1.ATIME = '199510140115';

where the variables mtype, etype, time, name, adsrc, atime, ... are written into the response message which is returned to the originator of the query. Now, is that straightforward? Indeed it is, and it does not require any nifty data dictionary (of course, the same example works as well with a data dictionary).
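The parallel between insert_hl7 and select_hl7 can be tried out with a small Python sketch on top of SQLite. It is not a definitive implementation: the table layout, the column names, the msg_key column and the SCHEMA mapping are assumptions made for this illustration, each message is assumed to hold at most one segment of each kind, and only exact matches are supported. A filled field in the query becomes a WHERE clause, an empty field is simply returned. The query below leaves the EVN time empty so that it does not act as a constraint.

```python
import sqlite3

# Hypothetical mapping: segment -> {field index: column name}.
SCHEMA = {
    "MSH": {1: "enc", 8: "mtype"},
    "EVN": {1: "etype", 2: "time"},
    "PID": {5: "name"},
    "PV1": {14: "adsrc", 44: "atime"},
}


def segment(name, fields):
    """Build a segment string from {field index: value}."""
    out = [name] + [""] * max(fields)
    for index, value in fields.items():
        out[index] = value
    return "|".join(out)


def parse(message):
    """Return {segment name: list of fields}."""
    rows = (line.split("|") for line in message.strip().splitlines())
    return {fields[0]: fields for fields in rows}


def field(fields, index):
    return fields[index] if index < len(fields) else ""


def create_tables(db):
    for seg, cols in SCHEMA.items():
        names = ", ".join(f"{col} TEXT" for col in cols.values())
        db.execute(f"CREATE TABLE {seg} (msg_key INTEGER, {names})")


def insert_hl7(db, message):
    """Store each segment as one row; rows of one message share msg_key."""
    segments = parse(message)
    key = db.execute("SELECT COALESCE(MAX(msg_key), 0) + 1 FROM MSH").fetchone()[0]
    for seg, cols in SCHEMA.items():
        values = [field(segments.get(seg, [seg]), index) for index in cols]
        marks = ", ".join("?" for _ in values)
        names = ", ".join(cols.values())
        db.execute(f"INSERT INTO {seg} (msg_key, {names}) VALUES (?, {marks})",
                   [key] + values)


def select_hl7(db, query):
    """Filled query fields become constraints; matches come back as messages."""
    segments = parse(query)
    select, where, args = [], [], []
    for seg, cols in SCHEMA.items():
        if seg != "MSH":
            where.append(f"{seg}.msg_key = MSH.msg_key")  # join on the message key
        for index, col in cols.items():
            select.append(f"{seg}.{col}")
            value = field(segments.get(seg, [seg]), index)
            if value:
                where.append(f"{seg}.{col} = ?")
                args.append(value)
    sql = (f"SELECT {', '.join(select)} FROM {', '.join(SCHEMA)} "
           f"WHERE {' AND '.join(where)}")
    return [to_message(row) for row in db.execute(sql, args)]


def to_message(row):
    """Write one result row back into HL7 segments, in SCHEMA order."""
    values = iter(row)
    return "\n".join(segment(seg, {index: next(values) for index in cols})
                     for seg, cols in SCHEMA.items())


db = sqlite3.connect(":memory:")
create_tables(db)
insert_hl7(db, "\n".join([
    segment("MSH", {1: r"^~\&", 8: "ADT^A01"}),
    segment("EVN", {1: "A01", 2: "199510140115"}),
    segment("PID", {5: "Doe^John"}),
    segment("PV1", {14: "EMERGENCY ROOM", 44: "199510140115"}),
]))
query = "\n".join([
    segment("MSH", {1: r"^~\&", 8: "ADT^A01", 26: "QRY"}),
    segment("EVN", {1: "A01"}),
    segment("PID", {5: ""}),
    segment("PV1", {14: "EMERGENCY ROOM", 44: "199510140115"}),
])
responses = select_hl7(db, query)
print(responses[0])
```

Both functions walk the same mapping, which is the point of the argument: the query processor is the insertion code with the INSERT replaced by a SELECT.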
So there is no longer a need to deny the general availability of querying capabilities for HL7 interfaces, since it is so simple. Similar examples can be constructed for non-SQL databases and even for flat file storage. Querying is so simple because it is no longer magic but uses mechanisms parallel to updating.

A NOTE ON EVENT REPLAY

There is some similarity between this approach and the event replay approach of HL7 v2.3; however, there are differences. Querying in event replay mode is nearly as weak as it is in the other modes, except that there is the notion of a message. Still, it is meta-communication instead of first-order communication. The result message for event replay also adds unnecessary overhead to HL7 communication. Do we really need a QAK segment? Can't that data go into the message header? I propose not to double the number of syntactically different messages, as is implied by an extra event replay response message for every HL7 message/event.

FINE POINTS OF QUERY SPECIFICATION

The interrogative semantics added to the HL7 messages requires two points to be examined more carefully. This is, however, not of immediate importance, since the querying mechanism as proposed is already many times more efficient and powerful than the querying techniques we have now. And this is reason enough to consider it. What follows is some reasoning about querying for ranges and about using conjunctions other than `AND' for the clauses, but again, this is of secondary importance. The main point is that such things can be done.

Again, let us have a look at the simple English interrogative sentence S2 above. It seems to be a problem that the admission time was given as exactly 1:15 AM. For common scenarios this is certainly an overspecification resulting in no answer, because patients might have been admitted at 1:16 or 1:14 but not at 1:15 AM. It is obvious that we need a method to specify a range of time rather than an exact time.
The same is true for a query that searches for any patient with a preprandial blood glucose level higher than 200 mg/dl, or similar questions that might be of interest for studies.

By specifying a range of time or blood glucose concentration we mean to give a set of possible values. 1:00 AM to 1:30 AM is a set of times between the given boundaries, and > 200 mg/dl is a set of concentrations greater than 200 mg/dl. Other sets like complements (all times except 1:00 AM to 1:30 AM), supersets and combinations thereof are possible. Qualitative values that are normally represented by codes can build sets by enumerating their elements. Thus, we must find a way to give sets in place of any specific value. A syntax to specify these sets would contain the elements enumeration, range and complement. A BNF specification of such a syntax would be as follows:

    set      : item
             | item superset_op set
             ;
    item     : range
             | element
             ;
    range    : range_op boundary
             | boundary range_op
             | boundary range_op boundary
             ;
    element  : value ;
    boundary : value ;

With the terminal symbols set to concrete symbols according to the following list, we can build example terms of set specifications:

    superset_op : `,' ;
    range_op    : `-' ;

Thus the set given by

    199510140100-199510140130

is the time range between 1:00 and 1:30 AM on October 14, 1995. The specification

    410^ACUTE MYOCARDIAL INFARCT*^I9C,415.1^PULMONARY EMBOLISM/INFARCT^I9C

means the superset (union) of both diseases. Other forms are possible for the coded value (CE typed data) by leaving out the clear text values and giving only the codes:

    410,415^^I9C

A range of blood glucose values higher than 200 mg/dl could be specified as follows in an OBX segment:

    OBX|...|200-|mg/dl|...

For numerical values the usage of the `>' and `<' operators is already mentioned, which would look like:

    OBX|...|>200|mg/dl|...

Finally, a more complex set which selects blood glucose levels less than 5 mg/dl or more than 200 mg/dl is given here:

    OBX|...|-5,200-|mg/dl|...
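To show that the set syntax is easy to process, here is a minimal Python sketch of a parser for the BNF above, with `,' as superset_op and `-' as range_op. The function names (parse_set, contains) are hypothetical, boundaries are treated as inclusive (the proposal does not say whether they are), and values that themselves contain `-' or `,' are not handled. Complements are not covered because the BNF above does not define them.

```python
def parse_set(text):
    """Parse 'item,item,...' into a list of (low, high) pairs.

    None stands for a missing (unbounded) boundary; a single element
    is returned as a pair with equal low and high.
    """
    items = []
    for item in text.split(","):          # superset_op
        if "-" in item:                   # range_op
            low, high = item.split("-", 1)
            if not low and not high:
                raise ValueError("a range needs at least one boundary")
            items.append((low or None, high or None))
        elif item:
            items.append((item, item))
        else:
            raise ValueError("empty element")
    return items


def contains(spec, value, key=float):
    """True if `value` lies in any item of the set `spec`.

    `key` converts the strings for comparison: float for numeric
    results, str for HL7 time stamps of equal length.
    """
    v = key(value)
    for low, high in parse_set(spec):
        if (low is None or key(low) <= v) and (high is None or v <= key(high)):
            return True
    return False


print(parse_set("-5,200-"))
print(contains("-5,200-", "250"))
print(contains("199510140100-199510140130", "199510140115", key=str))
```

A query processor could translate each (low, high) pair into a pair of comparison clauses joined by AND, and the pairs of one set into clauses joined by OR.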
This proposal is not complete with regard to a specific syntax for specifying sets of values. What this discussion was meant to show is that such a syntax can be defined (again increasing the number of encoding characters) and that it should be defined. In fact, the OBX example shows that there is a need for a general range/set specification method even for purposes other than querying.

The second point of examination is a direct consequence of the first: where ranges and sets are supplied for query parameters, the result will usually be a set of elements for which the query specification is true. This set should be transmitted as a sequence of response messages. If we do not want to add segments to the existing message definitions in order to identify the messages of a query response, this information should be carried in the MSH segment as the continuation pointer. The exact contents of this field, or the definition of more fields, is subject to future drafts of this proposal.