This chapter is quite specific to the kind of text files that were available when work on this project started. There were 11 Microsoft Word 5.0 files, one for each chapter (files `kap[1-7].txt') or appendix (files `anh_[a-d].txt').
The chapter files can all be processed in the same way, while each appendix requires its own handling. Appendix C can be treated like a chapter file. Since appendix B describes the lower level protocols and is completely different in both format and purpose, we will ignore it here until we come to the point where we implement the lower layer protocols. Appendix D is an automatically (?) generated index, which we don't need at all. That leaves only appendix A to be looked at more closely before we discuss the treatment of the chapters.
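The division of labour described above can be summarized in a small sketch. It is written in Python purely for illustration and is not part of the project's scripts; the function name `classify` and the handling labels are hypothetical, and only the filename patterns are taken from the text.

```python
import re

# Filename patterns as given in the text: kap[1-7].txt and anh_[a-d].txt.
CHAPTER_RE = re.compile(r"^kap([1-7])\.txt$")
APPENDIX_RE = re.compile(r"^anh_([a-d])\.txt$")

# Hypothetical labels for the handling each appendix receives.
APPENDIX_HANDLING = {
    "a": "manual-tables",  # converted by hand into the interim table format
    "b": "deferred",       # lower level protocols, treated later
    "c": "chapter",        # can be treated like a chapter file
    "d": "skip",           # generated index, not needed
}


def classify(filename):
    """Return the (hypothetical) handling label for one input file."""
    if CHAPTER_RE.match(filename):
        return "chapter"
    match = APPENDIX_RE.match(filename)
    if match is None:
        return "unknown"
    return APPENDIX_HANDLING[match.group(1)]
```

With this sketch, `classify("anh_c.txt")` yields the same label as any chapter file, which mirrors the statement that appendix C needs no special treatment.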
Appendix A is a summary full of tables, which makes it a most welcome source of various information for us. This appendix was first treated manually. It was easy to convert all the tables into the interim table format described in the next section; in fact, that format was developed while manually working on appendix A. Because appendix A had already been broken up into well-formatted tables before any AWK script for extracting tables was written, there is still no automatic extraction script that could be applied to it. Once our general method has proved useful and appropriate for the next version of HL7 as well, this gap will be filled.
Fortunately, a Microsoft Word 5.0 file is essentially an ASCII text file, with only a few special characters written directly into the text. The bulk of the print format information is appended to each text file in binary form. Although this information is completely obscure to us, it can safely be ignored. What remains accessible to us, and is sufficient, is the ASCII text. Thus we don't have to bother with layout information such as indentation, fonts and character styles.
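As a rough illustration of what "keeping only the ASCII text" can mean, the following Python sketch filters a byte string down to printable ASCII and line breaks. It is an assumption-laden simplification, not the method actually used here: the function name `ascii_text` is hypothetical, the text does not say where the binary part begins, and a filter like this would also drop any non-ASCII characters (such as German umlauts) and may let through stray printable bytes from the binary part.

```python
def ascii_text(raw: bytes) -> str:
    """Keep printable ASCII, tabs and line breaks; drop every other byte.

    Crude sketch: it does not locate the start of the appended binary
    formatting data, it only discards bytes that cannot be plain text.
    """
    kept = bytearray()
    for byte in raw:
        if 32 <= byte < 127 or byte in (9, 10, 13):
            kept.append(byte)
    # Normalize DOS line endings so later line-oriented tools see "\n".
    return kept.decode("ascii").replace("\r\n", "\n")
```

A line-oriented tool can then work on the result without having to know anything about the formatting data.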
Now, what information should we expect, and how is it presented in the ASCII text of our files? Each chapter consists of sections which describe messages, segments or fields. The sections are recognized by their specific layout (e.g. tables) and by keywords, which have proved very helpful to us here. It can even be said that the strict usage of keywords (e.g. `FIELD NOTES:') in the text is what made this whole work possible at all. We'll describe the extraction of the items in the appropriate section below.
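To make the role of the keywords concrete, here is a minimal Python sketch of keyword-driven splitting. It is only an illustration of the idea and not the extraction procedure described later; the function name `split_on_keywords` is hypothetical, and `FIELD NOTES:` is the only keyword taken from the text. Any further keywords would have to be supplied by the caller.

```python
def split_on_keywords(lines, keywords=("FIELD NOTES:",)):
    """Group lines into blocks, each starting at a line that begins
    with one of the given keywords.

    Lines before the first keyword are ignored; empty lines are dropped.
    Returns a list of (keyword, [text lines]) pairs.
    """
    blocks = []
    current = None
    for line in lines:
        stripped = line.strip()
        hit = next((k for k in keywords if stripped.startswith(k)), None)
        if hit is not None:
            # Text following the keyword on the same line opens the block.
            current = (hit, [stripped[len(hit):].strip()])
            blocks.append(current)
        elif current is not None:
            current[1].append(stripped)
    return [(key, [text for text in body if text]) for key, body in blocks]
```

The sketch relies on exactly the property praised above: if the keywords are not used strictly, a block silently runs on into the next one.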
As you can easily see from the above filenames, these were not the original files, which would surely have had English abbreviations. Indeed, these files were being worked on by the German section of the HL7 consortium.
This (visible) work consists merely of annotations, mainly German translations of the terminology; occasionally some remarks have been added. This would have been of little interest to us if these efforts had not messed up the format of the files. That in turn caused severe problems, some of which could not be solved satisfactorily by a general mechanism, so that the defective output files had to be edited manually. There would have been little sense in wasting time on a problem which will hopefully disappear in the next version of HL7, or even in the next set of text files we may have available.
It became quite obvious that a WYSIWYG text processor like Microsoft Word for Windows, though easy to use, is prone to cause severe problems, especially if more than one author is working on the same file. A great deal of discipline and a common method (a standard) are required in the use of formatting resources (e.g. Print Formats vs. arbitrary formatting of marked blocks). It is not easy to keep up this kind of discipline when one can achieve one's immediate ends so simply. Moreover, the format of a text in a WYSIWYG editor tends to become obscure even though the writer feels in complete control of it, and this gives rise to surprises. After all, it seems that text processing methods which appear rather unfriendly or outdated at first sight (like *roff and TeX) do pay off the extra effort in the long run, particularly for long texts and when used in a work group.