V3DT conference call notes for Mon, Dec 14, 1998.

The HL7 version 3 data type task group has had its nineth conference call on Monday, December 14, 1998, 11:00 to 13:00 EDT.

Attendees were:

Mike Henderson,
Stan Huff
Anthony Julian
Joann Larson,
Randy Marbach
Larry Reis
Mark Shafarman,
Greg Thomas,
Mark Tucker,
Robin Zimmerman,
Gunther Schadow.

Agenda items were:

hang over items: generic data type for intervals (aka. "ranges"), [slide]
generic data type for uncertainty (aka. "probability distributions") [slide]

AGENDA ITEM 1: hang over items: generic data type for intervals (aka. "ranges")

After the follow up discussions on intervals between Stan and myself, the interval data type is finished now. An interval is represented by the structure reported in the last notes. As always, various constraints can be made on the data types. I.e., the components of the interval data structure can be constrained to certain allowable values or combinations of values only. As a notable special case, one could constrain intervals such that any allowable value would have to have an unknown (or infinite) boundary at one side.

Mark Shafarman suggested to explain the meaning of "open" and "closed" for intervals for non-mathematicians. So here goes:

[n, m] is a closed interval. A value x is an element of the interval if and only if x is greater or equal than n and less or equal than m. That is, the boundaries are included in the interval.

]n, m[ is an open interval. A value x is an element of the interval if and only if x is greater than n and less than m. That is, the boundaries n and m are not included in the interval.

Obviously an interval can be closed on one side while open on the other side.

An interval can only be closed at a finite boundary. That is, if a boundary is an infinity or unknown, the interval can not be closed at that boundary.

End of explanation.

AGENDA ITEM 2: generic data type for uncertainty (aka. "probability distributions")

The initial discussions and preparation of this conference ended up with a definition of two sets of data types for probability distributions for different basic information:

Non parametric probability distribution over a discrete and finite base data type (was called "Histogram").
Parametric probability distribution over an infinite base data type (was called "probability distribution").

Some confusion came up about the use of the term "histogram". It implies a graphical representation of the distribution data, even though the graphical presentation is not considered essential to this data type. We could not find a better short word for "histogram" though. Am convinced now that the term that I chose was indeed not good enough and in lack of a short name, I use the full name above:

We had a long discussion on how to express uncertainty of information that didn't even leave the question untouched whether or not we want to have data types for uncertain information. I was far too involved in the dispute that I could represent the arguments with any objectivity. (Sometimes it is good to have an uninvolved note-writer). I try to scetch the main arguments anyway. In what follows I try to summarize the issues and couterarguments brought up against the suggested uncertainty data types. I tried to withhold my arguments for those types here.

Stan Huff reported what forms of uncertainty he has identified from his experience:

A pathologist sais: "There is a 30% probability that this lesion is malignant."
A pathologist sais: "This lesion is malignant." A medical record system may find out from case-based reasoning (experience) that if pathologist A discovered malignancy, he was right in 80% of the cases, whereas if pathologist B makes the same statement, he was right in only 70% of the cases.
A pathologist sais: "This lesion is probably malignant." Again from experience, a system can say that if the word "probably" was used the chance of malignancy is 40% (whereas if this pathologiest had said "could be" the chance would have been only 10%).

Stan concluded that may want to distinguish whether a probability was issued by the "user" or a such a system that keeps track of experiences with the pathologist's judgement.

Stan further concluded that an expression of an uncertain discrete value (e.g., malignancy) should include both, a coded qualifier of confidence and a numeric probability, where each may be assessed by different entities.

However, Stan expressed that coded probabilities should not generally be used, only in cases where no numeric probability is available (e.g. coding of narratives). Agreement on that from Mark Shafarman and myself: coded probabilities have no precise meaning (I say, they have no reliable meaning at all). Not even the order of coded confidence qualifiers is clear in all cases (e.g., is "A is likely" more or less likely that "probably A"?) However, Stan said, that such coded confidence qualifiers at least uncover the ambiguity that exists (whether we want it or not.)

So the question became, do we want to have a third generic data type for uncertain information that allows you to send the information along with a coded confidence qualifier (shown below)?

But Stan was not really comfortable with that either. He thought that somehow the information about uncertainty woul be given in other structures. Mark Shafarman used the term "meta-observations" for those other structures. Stan said that this should probably be done with "CMEDs" rather than "Data Types".

Stan also wanted to drop the data type called "Histogram" in the last notes and did not want to define the data type for coded uncertainties either. Plus, he wanted to get rid of the probability distribution called "guess" in the last notes, for this overlaps with the narrative qualifiers. I suggested to Stan that his point would be more consistent if he would also vote against the probability distribution type, since it should not matter for him whether the base data type is continuous or discrete and finite. He accepted this and so his position was against any kind of generic data type for uncertainty. This information should rather be given as "CMEDs".

A few clarifications were needed on what CMEDs actually are (and what it means for Stan). Stan characterized CMEDs as some higher level structure that is not considered a data type. In the original sense of the MDF, however, a CMED was a piece of message structure (HMD) that was usually derived from the RIM. The RIM dependency was, however, not essential (as a message header was considered a CMED, but without associated RIM classes). The recent work on HMDs, message meta-models and XML brought forth a notion that everything from the biggest message down to a primitive data type was defined as "Message Element Types" (MET). Expressly, all those things were considered data types on some level. Some METs are derived from the RIM (top-down) while some are build bottom up (HL7 data types). Others, such as the message header, are built top-down, but are not (currently) reflected in the RIM. Behind that background a deferal to constructing a CMED needs clarification.

Do we mean that the issues at hand should be resolved in the RIM? Indeed the RIM inherited some of the things that we deal with. There are probabilities in Clinical_observation (ex OBX) and in the Helath_issue class (ex DG1, PRL), there are numeric vs. coded probabilities in the Health_issue class. There is a way to link observations together and create things that some people like to call "meta-observations". The problem here is that those attributes and methods for uncertainty have never been systematized the way that we are currenntly systematize them.

The other problem is that while "meta-observation" constructs are possible in the clinical part of the RIM, they are not possible in the administrative part of the RIM. This stirs up the old mudd again whether "slotted data" or "OBX type" tag-value data should be used. Stan objected against generic data types for uncertain values to be used in the clinical area of the RIM, since other means should be used here instead (Templates?). His argument goes: why do you want to predefine structures for uncertainty of clinical informations in other ways than you define structures for normal ranges, abnormal flags, methods, etc. That the administartive side of the RIM could use generic data types for uncertainty, in Stans oppinion (as I understand it) only counts against the structure of the administrative side of the RIM with using "slotted" data structures.

Stan's further concern was that he thought the technical committees (especially Order/Results) would not appreciate to get a ready-made structure for uncertaint information from CQ. Those things should rather be addressed out of the Order/Results (or other TCs).

The other issue we had on the table was whether probabilities are associated with a value in a one-to-one or one-to-many way. This question was particularly important for Mark Shafarman. This touches Stan's distinction between "user" assessed proabilities vs. probabilities assigned to statements from the experience of some system. Clearly the answer seems to be that one information may be believed at a different level of confidence by different people. Bayesian probabilities are subjective, thus any probability is valid only in the context of the one who issued the probability.

I suggested a criterion for when we want to stop using data types for probability assignments and when we want to start using RIM structures: whenever we have to attribute the confidence to a different entity than the value. In other words, as long as value and probability is stated together by the originator of a clinical statement, it is fair and quite efficient to use a data type rather than a structure of RIM instances. On the other hand, it would blow up the data type structures if we had to specify who made up the probability if it was not from the same entity that also reported the value.

It is hard to determine exactly where we were as we stopped the discussion and where we are heading to. Mark Shafarman, again, called for a further analysis of the intended use of those structures for uncertainty. And may be, he said, we may end up defining different methods for different use cases.

Stan wanted to come up with constructive alternatives "rather than just issues" next time.

Sorting out the issues (in preparation of the next conference)

There are many seemingly opposite alternatives: slotted values vs. everything is an OBX (and the rest are Templates); data types vs. CMEDs; uncertainty being an accidential companion information (just like "abnormal flags") vs. uncertainty as enssential part of the observation value.

I'd like to fix some invariants that I think should guide our discussion:

I1: The call for uncertainty extensions (and other annotations) to HL7 v2.x data types often came from the administrative side.

I2: The structure of the RIM may seem inconsistent when comparing the slotted style of the administrative (left) half of the RIM with the more tag-value (OBX) style of the clinical (right) half of the RIM. However, the structure of the RIM is not subject to our discussion. There may or may not be sufficient arguments for Stan's "second radical proposition" (i.e. everything is an OBX), but it is not CQ's business to decide on that.

I3: The technical committees may or may not appreciate CQ doing some methodological investigations (and conclusions) for them. As an active member of order results with good relationships to patient care, and also from having been talking with PAFM people I have to say that TCs have a lot of work to do and addressing data types or other fundamental logical issues, such as uncertainty, do not have a high priority on the agendas of the respective TCs. Also in the past, those TCs have consulted CQ for almost all their data types (TQ being an exception), which illustrates that CQ is generally trusted in providing the fundamental logical infrastructure for other TCs. As opposed to Stan's views it is my sense that the domain TCs in fact do appreciate CQ doing work in defing high level data types even if those would shed a new light on certain RIM-attributes maintained by the TCs.

I4: The only way to standardize information in Stan's OBX-only view of the world would be templates (in that sense, I see Stan's consistent agenda shine through many of his arguments.) However, from the preliminary shape of the template discussion it seems clear that defining the OBX segment structure of templates is nothing else than defining slots for values, it's just on a super-RIM layer instead of a RIM layer. The bottomline for our arguments is that the same generic type for probabilities can be used in both, the administrative data "slotted" on the RIM-level as well as the data whose "slotting" is defered to templates.

I5: I think that this is the time to talk about uncertainty in information (clinical, administrative or whatever provinience). What we are doing here is defining abstract structures of information, those may even end up as RIM classes if there is the need for it. I don't think that we should not cover uncertainty only because someone may want to do that work later.

Fact is, nobody has formally addressed this issue in HL7 so far and there is express need for it. So, I want to continue to deal with the real issues, rather than to defend whether or not we should deal with those things here. If Stan has a better model to deal with uncertainty that is applicable to templates, then I want to see this model specified and may be even translated to how things work in our existing world of slotted RIM data elements.

The real issues

I6: Uncertainty assessments (probabilities) are subjective. Thus they depend on who states them. If I am 70% sure that what I see in the microscope are malignant cells, I express my views. If some experienced pathologist sais that probability for malignancy is 70%, she expresses her view. Any receiver of that information must draw his own conclusions based on his trust in my or the pathologist's judgement.

Practically, a receiver might apply a penalty of 0.5 to what I say, whereas the pathologist's views would be trusted at a level of 0.95. Thus from my statement, the receiver may infer a probability of 35% for malignancy while the pathologists statement may be transformed to 67%. If the receiver has both of our statements, he may want to apply a noisy-or and infer his probability as 1-(1-35%)(1-67%) = 79%.

The bottomline is: the new value-probability-pair would be part of a new observation assessed by the receiver of both mine and the pathologists statements penalized and combined by the receiver. Because the receiver drafted his judgement about the case from information received by others, this may or may not be called a "meta-observation".

When the receiver communicates this data further along to someone else, he may or may not quote both input-statements plus his own judgement. In any case, the receiver of that information would again penalize and combine what he has got based on his trust in the judgement of the originators of the incoming statements.

I7: It generally doesn't matter whether a probability was issued by a human "user" or by any kind of decision support "system". The same rules apply: the probability is subjective and the receiver has a responsibility to value the uncertain information he received. Knowing the originator of the uncertain statement is essential (as it is always essential to know who said what), but knowing just the category "user" vs. "system" does not help.

I8: A data type should not include implied associations between RIM classes. Thus, one uncertain value should not be attributed to some Healthcare_provider instance of the RIM. For example, we should not build a data type composed of the triple <value, probability, originator>, where originator would be a forreign key to some Stakeholder or Healthcare_provider. Rather, the uncertain value would be included in a RIM class instance, where the attribution or responsibility of the statement is clear from that context of the RIM class.

The basic building blocks of uncertain information should thus be similar to what has been suggested for the last call and as defined and clarified below. One instance of uncertain information must be attributed to a responsible entity (Doctor or decision support system) just like a "supposedly certain" information must be attributed. Thus attribution is both outside of this data type and is not an argument against a generic data type for uncertain information.

GENERIC DATA TYPE FOR "NON-PARAMETRIC PROBABILITY DISTRIBUTION" OVER A DISCRETE BASE DATA TYPE.

Note that this proposed data type has evolved from and has superceded a data type named Histogram.

Non-Parametric Probability Distribution (NPPD).
Generic data type to specify an uncertain discrete value as a set of <value, probability> pairs (uncertain discrete valuess). The values are considered alternatives and are rated with probabilities for each of the values to apply. Those values that are in the set of possible alternative values but not mentioned in the non-parametric probability distribution data structure will have the rest probability distributed equally over all unmentioned values. That way the base data type can even be infinite (with the unmentioned values being just neglected).
GENERIC TYPE
parameter name	allowed types	description
T	Discrete	Any data type that is discrete can be used. Usually we would use non-parametric probability distributions for unordered types only and only if we assign probabilities to a "small" set of possible values. For other cases one may prefer parametric probability distributions.
SET OF Uncertain Discrete Value using Probabilities<T>

GENERIC DATA TYPE "UNCERTAIN DISCRETE VALUE USING PROBABILITIES"

Note that this proposed data type has evolved from and has superceded a data type named Histogram Item.

Uncertain Discrete Value using Probabilities (UDV-P).
Generic data type to specify one uncertain value as a pair of <value, probability>.
GENERIC TYPE
parameter name	allowed types	description
T	DiscreteType	Any data type that is discrete can be used.
component name	type/domain	optionality	description
value	T	required	The value to which a probability is assigned.
probability	Floating Point Number 0.0 to 1.0.	required	The probability assigned to the value.

NOTES

Type cast rules allow conversion between and uncertain discrete value using probabilities (UDV-P) and non-parametric probability distribution (NPPD) and vice versa.

NOTE probabilities don't have to be "exact", especially one does not need to carry out a series of experiments (samples) in order to specify a probability. Probabilities are always estimated. Bayesian probability theory equals the notion of "probability" with "belief". The probability is thus an assesment of the subjective belief of the originator of a statement. Some subjective numeric probability is often better than a mere indicator that a value is "estimated".

Probabilities are always subjective. Just like any other information, uncertain information needs to be seen in the context of who gave that information (attribution). A recipient updates his knowledge about a case from the received uncertain information based on how much confidence he has in the judgement of the originator of the information.

Both elements in the value-probability-pair are part of the statement made by one specific originator. Along a chain of communication, one value may be reported by different entities and assigned a different probability by each of them.

These data types do not allow to make specific attributions to originators of the information. The rules of attribution are the same whether information is given as uncertain or certain/precise. In particular, in case information is given in an instance of a Service_event class, the attribution is given by the Stakeholder designated as the active participation of type "originator of the information". For "slotted" data elements, implicite attribution defaults to the sending system.

The values in a discrete probability distribution are generally considered alternatives. It is understood that only one of the possible alternative values may truly apply. Because we may not know which value it is, we may state proabilities for multiple values. This does not mean that the values would in some way be "mixed." However, when Rough Sets theory or Fuzzy Logic is used as the underlying theory of uncertainty, the difference between "alternative" and "mixed" becomes blur. Friedman and Halpern (1995) have shown that all of those theories for uncertainty (probability, rough sets, fuzzy logic, Dempster-Shafer) can be subsumed under a theory of "plausibiliy". This theory of plausibility would of course be open as to whether or not a distribution is considered over alternative values as opposed to a mixture of the values.

However, probability is the most widely understood and deployed theory (although fuzzy logic decision support systems are used in clinical medicine). If some value should be represented as a "mixture" of a set of categorial values, other means should be investigated before resorting to "plausibility" theory. For instance, suppose we have to decide about a color in the code system "red, orange, yellow, green, blue, purple". Probabilistically all those values would be alternatives and thus a given color may be stated "orange with a probability of 60%", but the alternatives red and yellow are also considered with probabilities 20% and 15% resp. More naturally we would like to "mix" the colours saying that the color we see is 60% orange, 20% red, 15% yellow and 5% green. We could use fuzzy logic to do that, but a more easy to understand approach would be to use a more appropriate color model than the list of discrete codes. A more appropriate color model would, for instance, be the RGB system, where every color is represented as a mixture of the three base colors red, green and blue (or magenta, yellow, and cyan in subtractive color-mixing).

An example for a discrete probability would be a differential diagnosis as a result of a decision support system. For instance, for a patient with chest discomfort, it might find the following probability distribution:

This is a very compact representation of information that could of course be communicated separately using Clinical_observation class instances (or OBX-segments in v2.3). There are advantages of using the data type for non-parametric probability distribution though:

it is much more compact;
it is immediately clear that the stated values are alternatives assessed by one originator of the observation;
it is clearly specified from the definition of the data type that there is a rest-probability of 0.1% that is not assigned to any of the other diagnoses.

Those facts would be hard to discover from a bunch of OBX segments. If there was an OBX-template for non parametric probability distribution it would have to be defined quite similar to the definition of these data types (if not we should investigate on the differences). The resulting message would still be longer, the possible template does not exist yet (and will not exist before the end of this year) and does not have any proven advantage over the short form shown in the example.

The more tangible alternative is to construct a set of Health_issue class instances linked together. Again, the result would be longer. However, it would be the result of choice if one wishes to track down more precisely how the alternative differential diagnoses have been confirmed or otherwise clinically addressed. For the purpose of patient care the expanded set of Health_issue instances would be clearly more useful, but as an excerpt summary of a decision support process, the short form is useful too.

GENERIC DATA TYPE "UNCERTAIN VALUE USING NARRATIVE EXPRESSIONS OF CONFIDENCE"

Uncertain Value using narrative expressions of confidence (UV-N).
Generic data type to specify one uncertain value as a pair of <value, qualifier>. The qualifier is a coded representation of the confidence as used in narrative utterances, such as "probably", "likely", "may be", "would be supported", "consistent with", "approximately", etc.
GENERIC TYPE
parameter name	allowed types	description
T		Any data type that is allowed here, discrete or continuous.
component name	type/domain	optionality	description
value	T	required	The value to which an uncertainty qualifier is assigned.
confidence	Concept Descriptor	required	The confidence assigned to the value.

Like it or not (I don't like it) we do have the use case that data is known to be just estimated and we may want to signal that the data should be relied in with caution, without having any numeric probability. This occurs most frequently when textual reports are coded.

We also have to deal with narrative expressions of uncertainty that are heard everywere; and we may want to capture those ambiguous and largely undefined qualifiers of confidence. This is almost like an annotation to a value considered to be understood mainly by humans.

We do not specify a closed list of codes to be used. Jim Case has an action item to submit a dozen or so of qualifiers he commonly has seen, others are invited to contribute as well.

No special effort is made to assign numeric probabilities to the codes nor even to specify an order in the set of codes. Doing the translation to numeric probabilities is not trivial, as there may be linear or logarithmic scales useful in different circumstances.

We generally discurage to use narrative probabilities rather than numeric ones. People should be reminded over and over again that probabilities are subjective measures of belief and that an "inexact" numeric probability is much more useful than a statement that "X is likely to be true".

GENERIC DATA TYPE "PARAMETRIC PROBABILITY DISTRIBUTION"

Note that this proposed data type has evolved from and has superceded a data type named Probability Distribution.

Parametric Probability Distribution
Generic data type to specify an uncertain value of an ordered data type using a parametric method. That is, a distribution function and its parameters are specified. Aside from the specific parameters of the distribution a mean and standard deviation is always specified to help maintain interoperability is receiving applications can not deal with a certain the probability distribution. The base data type may be discrete or continuous. Discrete ordered types are mapped to natural numbers by setting their "smallest" possible value to 1, the second to 2, and so on. The order of non-numeric types must be unambiguosliy defined.
GENERIC TYPE
parameter name	allowed types	description
T	OrderedType	Any ordered type (anything that is unambiguously mapped to numbers) can be the basis of an uncertain quantity. Examples are Integer Number, Floating Point Number, and Measurement.
component name	type/domain	optionality	description
mean	T	required	The mean (expected value or first moment) of the probability distribution. The mean is used to standardize the data for computing the distribution. The mean is also what a receiver is most interested in. Applications that can not deal with distributions can still get the idea about the described quantity by looking at its mean.
standard deviation	dif(T)	required	The standard deviation (square-root of variance or square-root of second moment) of the probability distribution. The standard deviation is used to standardize the data for computing the distribution. Applications that can not deal with distributions can still get the idea about the confidence level by looking at the standard deviation.
type	Code Value	required	The type of probability distribution. Possible values are as shown in the attached table.
parameters	Choice	required	The parameters of the probability distribution. The number of parameters, their names and types depend on the selected distribution and described in the attached table.

The distribution type "guess" has been removed from the set of distribution types specified earlier as suggested by Stan Huff. This is now dealt with using the generic data type Uncertain Value using Narrative Expressions of Confidence

The common expression "Age: 75±10 years" now must be mapped to a proper distribution type. Suggested mappings are to a uniform distribution with the "±10" being the width of the distribution; or to a normal distribution with "±10" being translated to standard deviation = 5 years.

Distribution types, their mean and parameters.
type description and parameters

symbol name or meaning type constraint or comment

DISTRIBUTIONS OF DISCRETE RANDOM VARIABLES

binominal Used for n identical trials with each outcomes being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of successes observed during n trials.

n number of trials Integer n > 1

p probability of success Float p between 0 and 1

E mean E = n p

V variance V = n p( 1 - p )

geometric Used for identical trials with each outcomes being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials until the first success is observed.

p probability of success Float p between 0 and 1

E mean E = 1 / p

V variance V = ( 1 - p ) / p²

negative
binominal Used for identical trials with each outcomes being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials needed until the rth success occurs.

p probability of success Float p between 0 and 1

r number of successes Integer r > 2

E mean E = r / p

V variance V = n r (N - r) (N - n) / ( N³ - N² )

hypergeometric Used for a set of N items, where r items share a certain property P. The described random variable is the number of items with property P in a random sample of n items.

N the total number of items Integer N > 1

r number of items with property P Integer r > 1

n sample size Integer n > 1

E mean E = (n r) / N

V variance V = r(1 - p) / p²

Poisson Describes the number of events observed in one unit that occur at an average of lambda per unit. For example, the number of incidents of a certain disease observed in a period of time given the average incidence of E. The poisson distribution only has one parameter, which is the mean. The standard distribution is the sqare-root of the mean.

E mean

V variance V = E

DISTRIBUTIONS OF CONTINUOUS RANDOM VARIABLES

uniform The uniform distribution assigns a constant probability density over a range of possible outcomes. No parameters besides mean E and standard deviation s are required. Width of the interval is sqrt(12 V) = 2 sqrt(3) s. Thus, the uniform distribution assigns probability densities f(x) > 0 for values E - sqrt(3) s >= x <= E + sqrt(3) s and f(x) = 0 otherwise.

E mean E = (low + high) / 2

V variance V = (high - low)² / 12

normal
Gaussian The well-known bell-shaped normal distribution. Because of the central limit theorem the normal distribution is the distribution of choice for an unbounded random variable that is an outcome of a combination of many stochastic processes. Even for values bounded on a single side (i.e. greater than 0) the normal distribution may be accurate enough if the mean is "far away" from the bound of the scale measured in terms of standard deviations.

E mean often symbolized µ

V variance often symbolized sigma²

gamma Used for data that is skewed and bounded to the right, i.e. where the maximum of the distribution curve is located near the origin. Many biological measurements, such as enzymes in blood, have a gamma distribution.

alpha Float alpha > 0

beta Float beta > 0

E mean E = alpha beta

V variance V = alpha beta²

chi-square Used to describe the sum of squares of random variables which occurs when a variance (second moment) is estimated (rather than presumed) from the sample. The chi-square distribution is a special type of gamma distribution with parameter beta = 2 and alpha = E / beta. The only parameter of the chi-square distribution is thus the mean and must be a natural number, so called the number of degrees of freedom (which is the number of independent parts in the sum).

n number of degrees of freedom Integer n > 0

E mean E = n

V variance V = 2 n

Student-t Used to describe the quotient of a standard normal random variable and the square-root of a chi-square random variable. The t-distribution has one parameter n which is the number of degrees of freedom.

n number of degrees of freedom Integer n > 0

E mean E = 0 (the mean of a standard normal random variable is always 0)

V variance V = n / ( n - 2 )

F Used to describe the quotient of two chi-square random variables. The F-distribution has two parameters n₁ and n₂ which are the numbers of degrees of freedom of the numerator and denominator variable respectively.

n numerator's number of degrees of freedom Integer m > 0

m denominator's number of degrees of freedom Integer m > 0

E mean E = m / ( m - 2 )

V variance V = 2m² (m + n - 2) / ( n(m - 2)²(m - 4) )

logarithmic
normal The logarithmic normal (log-normal) distribution is often used to transform skewed random variable X into a normal form U = ln X. The log-normal distribution has the same parameters as the normal distribution.

µ mean of the resulting normal distribution Float

sigma standard deviation Float

E mean of the original skewed distribution E = e ^{µ + 0.5
sigma²}

V variance of the original skewed distribution V = e ^{2µ + sigma²} ( e^sigma² - 1 )

beta The beta distribution is used for data that is bounded on both sides and may or may not be skewed. Two parameters are available to adjust the curve.

alpha Float alpha > 0

beta Float beta > 0

E mean T E = alpha / ( alpha + beta )

V variance T V = alpha beta / ((alpha + beta)²(alpha + beta + 1))

**Distribution types, their mean and parameters.**
type	description and parameters
	symbol	name or meaning	type	constraint or comment
DISTRIBUTIONS OF DISCRETE RANDOM VARIABLES
binominal	Used for n identical trials with each outcomes being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of successes observed during n trials.
n	number of trials	Integer	n > 1
p	probability of success	Float	p between 0 and 1
E	mean		E = n p
V	variance		V = n p( 1 - p )
geometric	Used for identical trials with each outcomes being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials until the first success is observed.
p	probability of success	Float	p between 0 and 1
E	mean		E = 1 / p
V	variance		V = ( 1 - p ) / p²
negative binominal	Used for identical trials with each outcomes being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials needed until the rth success occurs.
p	probability of success	Float	p between 0 and 1
r	number of successes	Integer	r > 2
E	mean		E = r / p
V	variance		V = n r (N - r) (N - n) / ( N³ - N² )
hypergeometric	Used for a set of N items, where r items share a certain property P. The described random variable is the number of items with property P in a random sample of n items.
N	the total number of items	Integer	N > 1
r	number of items with property P	Integer	r > 1
n	sample size	Integer	n > 1
E	mean		E = (n r) / N
V	variance		V = r(1 - p) / p²
Poisson	Describes the number of events observed in one unit that occur at an average of lambda per unit. For example, the number of incidents of a certain disease observed in a period of time given the average incidence of E. The poisson distribution only has one parameter, which is the mean. The standard distribution is the sqare-root of the mean.
E	mean
V	variance		V = E
DISTRIBUTIONS OF CONTINUOUS RANDOM VARIABLES
uniform	The uniform distribution assigns a constant probability density over a range of possible outcomes. No parameters besides mean E and standard deviation s are required. Width of the interval is sqrt(12 V) = 2 sqrt(3) s. Thus, the uniform distribution assigns probability densities f(x) > 0 for values E - sqrt(3) s >= x <= E + sqrt(3) s and f(x) = 0 otherwise.
E	mean		E = (low + high) / 2
V	variance		V = (high - low)² / 12
normal Gaussian	The well-known bell-shaped normal distribution. Because of the central limit theorem the normal distribution is the distribution of choice for an unbounded random variable that is an outcome of a combination of many stochastic processes. Even for values bounded on a single side (i.e. greater than 0) the normal distribution may be accurate enough if the mean is "far away" from the bound of the scale measured in terms of standard deviations.
E	mean		often symbolized µ
V	variance		often symbolized sigma²
gamma	Used for data that is skewed and bounded to the right, i.e. where the maximum of the distribution curve is located near the origin. Many biological measurements, such as enzymes in blood, have a gamma distribution.
alpha		Float	alpha > 0
beta		Float	beta > 0
E	mean		E = alpha beta
V	variance		V = alpha beta²
chi-square	Used to describe the sum of squares of random variables which occurs when a variance (second moment) is estimated (rather than presumed) from the sample. The chi-square distribution is a special type of gamma distribution with parameter beta = 2 and alpha = E / beta. The only parameter of the chi-square distribution is thus the mean and must be a natural number, so called the number of degrees of freedom (which is the number of independent parts in the sum).
n	number of degrees of freedom	Integer	n > 0
E	mean		E = n
V	variance		V = 2 n
Student-t	Used to describe the quotient of a standard normal random variable and the square-root of a chi-square random variable. The t-distribution has one parameter n which is the number of degrees of freedom.
n	number of degrees of freedom	Integer	n > 0
E	mean		E = 0 (the mean of a standard normal random variable is always 0)
V	variance		V = n / ( n - 2 )
F	Used to describe the quotient of two chi-square random variables. The F-distribution has two parameters n₁ and n₂ which are the numbers of degrees of freedom of the numerator and denominator variable respectively.
n	numerator's number of degrees of freedom	Integer	m > 0
m	denominator's number of degrees of freedom	Integer	m > 0
E	mean		E = m / ( m - 2 )
V	variance		V = 2m² (m + n - 2) / ( n(m - 2)²(m - 4) )
logarithmic normal	The logarithmic normal (log-normal) distribution is often used to transform skewed random variable X into a normal form U = ln X. The log-normal distribution has the same parameters as the normal distribution.
µ	mean of the resulting normal distribution	Float
sigma	standard deviation	Float
E	mean of the original skewed distribution		E = e ^{µ + 0.5 sigma²}
V	variance of the original skewed distribution		V = e ^{2µ + sigma²} ( e^sigma² - 1 )
beta	The beta distribution is used for data that is bounded on both sides and may or may not be skewed. Two parameters are available to adjust the curve.
alpha		Float	alpha > 0
beta		Float	beta > 0
E	mean	T	E = alpha / ( alpha + beta )
V	variance	T	V = alpha beta / ((alpha + beta)²(alpha + beta + 1))

NOTES

The mean component is mentioned explicitely. This component will be used in type casting a probability distribution over type T to a simple value of type T in a case where a receiving application can not deal with or is not interested in probability distributions.

The literature on statistics commonly lists the mean and standard deviation as dependent on the parameters of the probability distributions (e.g. the mean of a binominal distribution with paramters n and p is np. Because we choose to mention the mean (to help in roughly grasping the "value") and standard deviation (to scale and help with interoperability) the parameters of the distributions may be defined in terms of the mean and standard deviation instead

For example, in the table above, the uniform distribution was specified based on the mean and standard deviation component without further parameters. This does not mean that the standard deviation component contains the half-width of the uniform distribution.

Those depenedncies between the explicit components mean and standard deviation are not always resolved. If there is redundancy, it is an error if the specified mean and standard deviation contradict what can also be derived from the distribution and its parameters.

We do not resolve redundancy since it seems to be useful to let people specify parameters in the natural way rather than dependent on mean and standard deviation.

The type dif(T) is the data type of the difference of two values of type T. Often, T is the same as dif(T). For the data type T = Point in time, dif(T) is not Point in time but a physical Measurement in the dimension of time (i.e. units seconds, hour, minutes, etc.). This concept is generalizeable since it describes the relationship between corresponding measurements on ratio-scales vs. interval-scales (e.g., absolute (Kelvin) temperatures vs. Celsius temperatures).

Most distributions are given in a form where only natural numbers or real numbers are acceptable. If distributions of measurements (with units) are to be specified, we need a way to remove the units for the purpose of generating the distribution and then reapply the units. For instance, if Q = µ u is a measured quantity with numeric magnitude µ and unit u, then we can bind the quotient Q / u to the random variable and calculate the distribution. For each calculated number x_i, we regain a quantity with unit as Q_i = x_i u.

Most distributions are given in a "standard" form, that is with mean or left boundary equals 0 and standard deviation equals 1 etc. Therefore one has to standardize the quantity to be described first. This is similar to the problem of removing and reapplying units. The method is also similar and can be unified: a transformation transforms the numeric value to a standard form and later re-transforms the standard form to the numeric value. Two issues must be considered:

translation, i.e. moving the mean (or left boundary) into the origin (zero-point)
scaling the value to adjust the standard deviation to one.

This means, that any transformation of a value x to a normalized value y can be described as:

y = ( x - o ) / s

We can combine the way we deal with the units and the standardization of the value into one formula:

y = ( Q_i - µ u ) / ( s u )

Here µ u is the expected value (mean) E expressed in the base type T (i.e. a Measurement). This is further justification that we should indeed carry the mean µ u and the standard deviation s u as an explicit components, so that scaling can be done accordingly. The product s u is the standard deviation (square root of the variance) of the described value. The standard deviation is a component that an application might be interested in even if it can not deal with a "chi-square" distribution function.

It would be awesome if we could define and implement an algebra for uncertain quantities. However, the little statistical understanding that I have tells me that it is a non-trivial task to tell the distribution type and parameter from a sum, or product of two distributions or from the inverse of a distribution.

Next conference call is next Monday, December 21, 1998, 11 EST.

Agenda items are:

review of the uncertainty debate time boxed to 10 minutes.
incomplete information (what Stan calls "exception") [slide] [slide] [slide]
historic information [slide]
annotated information (This has been an idea of Mark Tucker that seems to be very useful and straight forward. It would be the v3 counterpart of the NTE segment.)
update semantics [slide]
collections [slide]

regards

-Gunther Schadow