V3DT conference call notes for Mon, Dec 7, 1998.

These notes also contain the preparation for the next conference call, so please read them and think about them.

The HL7 version 3 data type task group held its eighth conference call on Monday, December 7, 1998, 11 to 12:30 AM EDT.

Attendees were:

Jim Case,
Mike Henderson,
Stan Huff,
Anthony Julian,
Joann Larson,
Randy Marbach,
Mark Shafarman,
Greg Thomas,
Mark Tucker,
Robin Zimmerman,
Gunther Schadow.

Agenda items were:

comments on last conference and notes.
generic data type for intervals (aka. "ranges"), [slide]
generic data type for uncertainty (aka. "probability distributions") [slide]

AGENDA ITEM 1: Comments on last conference and notes.

Comments on the last conference notes came from Mark Shafarman. He suggested that:

the denominator component of the Ratio must not be zero. This is a technical correction and the notes of the last conference have been updated accordingly.
The disposition of the different categories of "units" (i.e. physical units, monetary units, dosage forms) may not be closed. People tend to write expressions of the form
number x unit
The units do have certain common properties (i.e. you can not add meters and seconds or apples and oranges). But there are also important differences. Stan Huff said that in the case of counts reported as number x things counted, the measurement name should contain the "things counted" rather than the units (e.g. in lab data bases one frequently finds units such as "red blood cells" vs. "white blood cells", which is redundant given that the measurement name is reported properly.) We have to revisit this topic again later.

GENERIC DATA TYPES FOR QUANTITIES

AGENDA ITEM 2: Generic data type for intervals (aka. "ranges")

GENERIC DATA TYPE "INTERVAL"

Interval
Generic data type that can express a range of values. Ranges of values are most abundant as ranges of absolute time, used for ordering and scheduling. Note that an interval is not to be used to specify confidence intervals for uncertain values.
GENERIC TYPE
parameter name	allowed types	description
T	OrderedType	Any ordered type can be the basis of an interval. It does not matter whether the base type is discrete or continuous or whether any algebraic operators are defined for that type.
component name	type/domain	optionality	description
low	T	optional	The lower boundary.
low open	Boolean	required	Indicates whether the interval is closed or open at the lower boundary. For a boundary to be closed, a finite boundary must be provided, i.e. unspecified or infinite boundaries are always open.
high	T	optional	The upper boundary.
high open	Boolean	required	Indicates whether the interval is closed or open at the high boundary. For a boundary to be closed, a finite boundary must be provided, i.e. unspecified or infinite boundaries are always open.
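The four components in the table above can be sketched as a small Python class. This is only an illustration, not a proposed binding: the class name, the attribute names, the use of `None` for an unspecified boundary, and the `contains` method are assumptions made for the example. The sketch treats an unspecified boundary as unbounded for the membership test, which is a simplification, and it does not check for infinite values because the base type is left generic.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")  # stands in for the notes' OrderedType; must support < and ==


@dataclass(frozen=True)
class Interval(Generic[T]):
    """Hypothetical rendering of the low / low open / high / high open components."""
    low: Optional[T] = None     # None models an unspecified boundary (assumption)
    low_open: bool = True
    high: Optional[T] = None    # None models an unspecified boundary (assumption)
    high_open: bool = True

    def __post_init__(self) -> None:
        # Rule from the table: a closed boundary requires a boundary value,
        # so an unspecified boundary is always open.
        if self.low is None and not self.low_open:
            raise ValueError("an unspecified low boundary must be open")
        if self.high is None and not self.high_open:
            raise ValueError("an unspecified high boundary must be open")

    def contains(self, x: T) -> bool:
        """Membership test; an unspecified boundary is treated as unbounded here."""
        if self.low is not None:
            if x < self.low or (self.low_open and x == self.low):
                return False
        if self.high is not None:
            if x > self.high or (self.high_open and x == self.high):
                return False
        return True
```

With this sketch, `Interval(low=40, low_open=False, high=60, high_open=True)` stands for the half-open interval [40; 60[, and `Interval(high=6)` stands for "< 6".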

NOTES

There was a considerable discussion on whether or not the interval notation is appropriate for things like a set A

A = { x_i | x_i < n }

i.e. the set of numbers x_i with all x_i less than n.

Stan Huff wants to specify a range as a composite of

one (and only one) number n
a code for a relational operator for less than, equal, greater than, etc.

The arguments for Stan's alternative are

people "commonly expect to see" values such as Glucose: <6 mg/dl as a lab result,
a result for diastolic blood pressure such as <60 mm Hg would be more appropriate than 40-60 mm Hg, or that 40-60 mm Hg would be a less appropriate expression that would have to be forbidden anyway,
one can "calculate" with entities consisting of a number and a relational operator better than with intervals,
the operator-number pair construct would be "more intuitive",
in the interval notation people would erroneously use only the low boundary component even though they want to express exact values like 5 or a range bounded only on the right side.

The arguments for the above stated interval construct were:

it is a general form to specify any continuous set of quantities,
it readily contains the alternative suggested by Stan Huff,
it is a uniform representation that is easier to compute with, because you always find the upper boundary at the same place.

In this discussion, I felt that we (myself included) mixed two orthogonal axes of concepts and a couple of independent issues all together, which was not very helpful.

First: As always, we must distinguish between the surface-level representation of a type and the semantics of the type. As always, this caveat was disregarded.

Second: There are three different uses of ranges on the table that may be intuitively described as

a set of values, where each value may apply under some circumstances (e.g. an order scheduled to begin at 3:15 and end at 4 o'clock);
one single value supposed to assume just one value from the range of values given (e.g. a measurement which turns out to be off the lower absolute limit and therefore can be reported only as a range with an upper boundary);
one single value whose set of possible values is partitioned into equivalence classes because the exact differences are not interesting or not measurable (e.g. in microbiologic susceptibility testing, we may have a parameter "OXACILLIN SUSC" where only the following equivalence classes are of interest: > 8.0 µg/ml (not susceptible); 4.0±2 µg/ml (limited susceptibility); and < 2.0 µg/ml (susceptible)).

Miscellaneous issues are:

What is the rationale for a constraint saying that "blood pressure may be less than 60 but not between 40 and 60"?
To what extent can we "calculate" with ranges as we can calculate with real numbers?
Which semantic representation is most easily accessible by computers? The uniform one? The one with less components? Do we care at all?
How many specialized data types do we want for sets of quantities?

Are "range" and "interval" synonyms?

About synonymity between "range" and "interval" let's ask Webster:

range: 1 a (1) : a series of things in a line, [...]
interval: 1 a : a space of time between events or states [...]
3 : a set of real numbers between two numbers either including or excluding one or both of them

In common language those are not synonyms. A range is the ordered "line of things" while the common notion of an interval is the gap between two things. However, "interval" is used in mathematics for things being aligned in a set. In any case, there is nothing about the word "range" that would suggest a line of things with only one end. Thus, we can use both terms synonymously.

We do distinguish between surface form and semantic components with intervals as with any other data type. We will specify a literal form for interval expressions that is tuned toward intuitiveness rather than uniformity. (I still believe that a computer is helped more by uniformity than by intuitiveness, since a computer has no intuition.) Here is a mapping between surface forms (string literals) and the uniform interval form:

literal	interval form
`<=` n	]unk; n] (low open)
`>=` n	[n; unk[ (high open)
`<` n	]unk; n[ (low open and high open)
`>` n	]n; unk[ (low open and high open)
`=` n	[n; n]
`[n,m]`	[n; m]
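The mapping in the table above can be sketched as a small parser. This is a minimal illustration under stated assumptions: the function name `parse_interval_literal`, the tuple layout `(low, low_open, high, high_open)`, the restriction to numeric boundaries, and the use of `None` for "unk" are all choices made for the example and are not part of the notes.

```python
import re
from typing import Optional, Tuple

# (low, low_open, high, high_open); None plays the role of "unk" (assumption)
IntervalForm = Tuple[Optional[float], bool, Optional[float], bool]

_OPERATOR = re.compile(r"^\s*(<=|>=|<|>|=)\s*(\S+)\s*$")
_BRACKETS = re.compile(r"^\s*\[\s*([^,\s]+)\s*,\s*([^,\s\]]+)\s*\]\s*$")


def parse_interval_literal(text: str) -> IntervalForm:
    """Map a surface literal to the uniform interval form of the table."""
    m = _OPERATOR.match(text)
    if m:
        op, n = m.group(1), float(m.group(2))
        if op == "<=":
            return (None, True, n, False)   # ]unk; n]
        if op == ">=":
            return (n, False, None, True)   # [n; unk[
        if op == "<":
            return (None, True, n, True)    # ]unk; n[
        if op == ">":
            return (n, True, None, True)    # ]n; unk[
        return (n, False, n, False)         # "=" n  ->  [n; n]
    m = _BRACKETS.match(text)
    if m:
        # [n,m]  ->  [n; m], closed on both sides
        return (float(m.group(1)), False, float(m.group(2)), False)
    raise ValueError(f"not a recognized interval literal: {text!r}")
```

For example, `parse_interval_literal("<= 6")` yields `(None, True, 6.0, False)`, i.e. an unspecified (open) low boundary and a closed high boundary at 6.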

Acceptable values for low and high boundaries are all values of the base data type (including positive infinity and negative infinity for the usual numeric types and derivatives). In addition, one of the boundaries may be undefined, which is shown in the above table as "unk". In an XML ITS that chooses to transfer the uniform semantic interval form, the boundary would not be sent at all instead of sending a special value for "unk". If the XML ITS decides to always send literals where available, there is no need to send "unk".

We are not yet clear about the use cases of the interval data type. One use case is certainly to specify the time of validity of some instance, or time a service was active, etc.

In order to decide whether intervals are to be used for reporting measurements we should first look at uncertainty.

AGENDA ITEM 3: generic data type for uncertainty (aka. "probability distributions")

We were not able to address this agenda item last time. It will be the first item on the next conference. The following notes are the preparation of the next conference call. So please read and think about it.

Uncertainty may exist for all kinds of information. Information is the selection of a signal (value) from a set of possible signals (values). Uncertain information is the selection of several values from a set of possible signals where we assign to every signal a probability (i.e. belief that the given information applies). We may distinguish four cases:

There are only two possible values where one is the negation of the other (boolean). In that case we need to specify a probability p for only one value (preferably the value meaning "true"). The probability of the other value is then 1 - p.
The set of possible values may have no total order. In that case we have to send pairs of <value, probability>.
The set of possible values may have a total order but is discrete. In that case, we can send <value, probability> pairs too. In addition, however, there is a mapping of the set to the set of natural numbers, and we can specify a discrete probability distribution (e.g., binomial, geometric, Poisson) and the necessary parameters of those distributions.
The set of possible values may have a total order but is continuous. In that case, we can not send <value, probability> pairs. But we can select a continuous probability distribution (e.g., normal, uniform, gamma, chi-square) and its necessary parameters.

GENERIC DATA TYPE "HISTOGRAM"

Note that this proposed data type has been superseded by another type named Non-Parametric Probability Distribution.

Histogram
Generic data type to specify an uncertain discrete value as a set of <value, probability> pairs (Histogram Items). Those values that are in the set of possible values but not mentioned in the histogram will have the rest probability distributed equally over all unmentioned values.
GENERIC TYPE
parameter name	allowed types	description
T	Discrete	Any data type that is discrete can be used in a histogram. Usually we would use a histogram for unordered types only and only if we assign probabilities to a "small" set of possible values.
component name	type/domain	optionality	description
SET OF Histogram Item<T>
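The rule that unmentioned values share the rest probability equally can be sketched as follows. This is an illustration only: the function name `complete_histogram`, the dictionary representation of histogram items, and the explicit `domain` argument listing all possible values are assumptions made for the example.

```python
from typing import Dict, Hashable, Iterable


def complete_histogram(items: Dict[Hashable, float],
                       domain: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Return a probability for every value in `domain`.

    `items` holds the <value, probability> pairs that were actually sent.
    Values of `domain` not mentioned in `items` share the rest probability
    (1 minus the sum of the mentioned probabilities) equally.
    """
    domain = list(domain)
    mentioned = sum(items.values())
    if mentioned > 1.0 + 1e-9:
        raise ValueError("probabilities sum to more than 1")
    unmentioned = [v for v in domain if v not in items]
    full = dict(items)
    if unmentioned:
        share = max(0.0, 1.0 - mentioned) / len(unmentioned)
        for v in unmentioned:
            full[v] = share
    return full
```

As a hypothetical usage, sending the items A: 0.5 and B: 0.25 over the domain A, B, O, AB leaves a rest probability of 0.25, so O and AB each receive 0.125.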

GENERIC DATA TYPE "HISTOGRAM ITEM"

Note that this proposed data type has been superseded by another type named Uncertain Discrete Value with specializations: Uncertain Discrete Value using Probabilities and Uncertain Discrete Value using Narrative Expressions of Confidence.

Histogram Item
Generic data type to specify one uncertain value as a pair of <value, probability>.
GENERIC TYPE
parameter name	allowed types	description
T	DiscreteType	Any data type that is discrete can be used in a histogram item.
component name	type/domain	optionality	description
value	T	required	The value to which a probability is assigned.
probability	Floating Point Number 0.0 to 1.0.	required	The probability assigned to the value.

NOTES

Type cast rules allow sending a histogram item for a histogram. The resulting histogram contains only the histogram item originally sent.

If and only if a histogram contains one and only one histogram item, the probability value may be omitted. The meaning of this construct is to indicate that the value was estimated, but no assessment of the probability is wanted.

NOTE: Probabilities don't have to be "exact"; in particular, one does not need to carry out a series of experiments (samples) in order to specify a probability. Probabilities are always estimated. Bayesian probability theory equates the notion of "probability" with "belief". The probability is thus an assessment of the subjective belief of the originator of a statement. Some subjective numeric probability is often better than a mere indicator that a value is "estimated".

GENERIC DATA TYPE "Probability Distribution"

Note that this proposed data type and its table have been significantly updated [version 2]. It has further been superseded by a data type called Parametric Probability Distribution.

Probability Distribution
Generic data type to specify an uncertain value of an ordered data type. The base data type may be discrete or continuous. Discrete ordered types are mapped to natural numbers by setting their "smallest" possible value to 1, the second to 2, and so on. The order of non-numeric types must be unambiguously defined.
GENERIC TYPE
parameter name	allowed types	description
T	OrderedType	Any ordered type (anything that is unambiguously mapped to numbers) can be the basis of an uncertain quantity.
component name	type/domain	optionality	description
mean	T	required	The mean (expected value or first moment) of the probability distribution. Although the mean can quite easily be derived from the distribution type and its parameters, it should be specified explicitly.
type	Code Value	required	The type of probability distribution. Possible values are as discussed in the text.
parameters	Choice	required	The parameters of the probability distribution. The number of parameters, their names and types depend on the selected distribution.

The mean component is mentioned explicitly. This component will be used in type casting a probability distribution over type T to a simple value of type T in a case where a receiving application cannot deal with or is not interested in probability distributions.

The literature on statistics commonly lists the mean as dependent on the parameters of the probability distribution (e.g. the mean of a binomial distribution with parameters n and p is np). Because we choose to mention the mean explicitly (to help in roughly grasping the "value"), the parameters of the distributions may instead be defined in terms of the mean.

**Distribution types, their mean and parameters.** Symbols used in accordance with
[Mendenhall W, et al. *Mathematical statistics with applications.* 4^th edition. Wadsworth; Belmont (CA); 1989]
type	description and parameters
	symbol	name or meaning	type	constraint or comment
guess	Used to indicate that the mean is just a guess without any closer specification of its probability. This pseudo distribution does not have any parameter aside from the expected value.
E	mean	T
distributions of discrete random variables
binomial	Used for n identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of successes observed during n trials.
n	number of trials	Integer	n > 1
p	probability of success	Float	p between 0 and 1
E	mean	T	E = n p
geometric	Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials until the first success is observed.
p	probability of success	Float	p between 0 and 1
E	mean	T	E = 1 / p
negative binomial	Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials needed until the rth success occurs.
p	probability of success	Float	p between 0 and 1
r	number of successes	Integer	r > 2
E	mean	T	E = r / p
hypergeometric	Used for a set of N items, where r items share a certain property P. The described random variable is the number of items with property P in a random sample of n items.
N	the total number of items	Integer	N > 1
r	number of items with property P	Integer	r > 1
n	sample size	Integer	n > 1
E	mean	T	E = (n r) / N
Poisson	Describes the number of events observed in one unit that occur at an average of lambda per unit. For example, the number of incidents of a certain disease observed in a period of time given the average incidence of E. The Poisson distribution only has one parameter, which is the mean.
E	mean	T
distributions of continuous random variables
uniform	A uniform probability density over some interval.
E	mean	T
w	width of the interval	dif(T)
normal Gaussian	The well-known bell-shaped normal distribution. Because of the central limit theorem the normal distribution is the distribution of choice for an unbounded random variable that is an outcome of a combination of many stochastic processes. Even for values bounded on a single side (i.e. greater than 0) the normal distribution may be accurate enough if the mean is "far away" from the bound of the scale measured in terms of standard deviations.
E	mean	T	often symbolized µ
sigma	standard deviation	dif(T)
gamma	Used for data that is skewed and bounded to the right, i.e. where the maximum of the distribution curve is located near the origin. Many biological measurements, such as enzymes in blood, have a gamma distribution.
alpha		Float	alpha > 0
beta		Float	beta > 0
E	mean	T	E = alpha x beta
chi-square	Used to describe the sum of squares of random variables which occurs when a variance (second moment) is estimated (rather than presumed) from the sample. The chi-square distribution is a special type of gamma distribution with parameter beta = 2 and alpha = E / beta. The only parameter of the chi-square distribution is thus the mean and must be a natural number, so called the number of degrees of freedom (which is the number of independent parts in the sum).
E	mean (number of degrees of freedom)	Integer	E > 0
Student-t	Used to describe the quotient of a standard normal random variable and the square-root of a chi-square random variable. The t-distribution has one parameter n which is the number of degrees of freedom.
n	number of degrees of freedom	Integer	n > 0
E	mean	Integer	E = 0 (the mean of a standard normal random variable is always 0)
F	Used to describe the quotient of two chi-square random variables. The F-distribution has two parameters n₁ and n₂ which are the numbers of degrees of freedom of the numerator and denominator variable respectively.
n₁	numerator's number of degrees of freedom	Integer	n₁ > 0
n₂	denominator's number of degrees of freedom	Integer	n₂ > 0
E	mean	Integer	E = n₂ / ( n₂ - 2 )
logarithmic normal	The logarithmic normal (log-normal) distribution is often used to transform skewed random variable X into a normal form U = ln X. The log-normal distribution has the same parameters as the normal distribution.
µ	mean of the resulting normal distribution	Float
sigma	standard deviation	Float
E	mean of the original skewed distribution	T	E = e ^{µ + 0.5 sigma²}
beta	The beta distribution is used for data that is bounded on both sides and may or may not be skewed. Two parameters are available to adjust the curve.
alpha		Float	alpha > 0
beta		Float	beta > 0
E	mean	T	E = alpha / ( alpha + beta )

NOTES

The type dif(T) is the data type of the difference of two values of type T. For the data type T = Point in time, dif(T) is not Point in time but a physical Measurement in the dimension of time (i.e. with units such as seconds, minutes, hours, etc.).
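The distinction between T and dif(T) has a familiar analogue in Python's standard library, shown here purely as an illustration: subtracting two `datetime` values (points in time) yields a `timedelta` (a duration), not another `datetime`. The variable names and the 15-minute standard deviation are invented for the example.

```python
from datetime import datetime, timedelta

# Two values of type T = point in time
start = datetime(1998, 12, 7, 11, 0)
end = datetime(1998, 12, 7, 12, 30)

# Their difference is of type dif(T): a duration, not a point in time
duration = end - start

# A mean of type T combined with a standard deviation of type dif(T)
# gives back a value of type T (here: the mean plus two standard deviations).
mean = start
standard_deviation = timedelta(minutes=15)   # hypothetical value
two_sd_later = mean + 2 * standard_deviation
```

This is why the standard deviation and the interval width in the distribution table are typed dif(T) rather than T.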

A problem is that most distributions are given in a form where only natural numbers or real numbers are acceptable. If distributions of measurements (with units) are to be specified, we need a way to remove the units for the purpose of generating the distribution and then reapply the units. For instance, if Q = µ u is a measured quantity with numeric magnitude µ and unit u, then we can bind the quotient Q / u to the random variable and calculate the distribution. For each calculated number x_i, we regain a quantity with unit as Q_i = x_i u.

Another problem is that most distributions are given in a standard form, that is with mean or left boundary equals 0 and standard deviation equals 1 etc. Therefore one has to standardize the quantity to be described first. This is similar to the problem of removing and reapplying units. The method is similar: a transformation transforms the numeric value to a standard form and later re-transforms the standard form to the numeric value. Two issues must be considered:

translation, i.e. moving the mean (or left boundary) into the origin (zero-point)
scaling the value to adjust the standard deviation to one.

This means that any transformation of a value x to a normalized value y can be described as follows, where o is the offset (translation) and s is the scale factor:

y = ( x - o ) / s

We can combine the way we deal with the units and the standardization of the value into one formula:

y = ( Q_i - µ u ) / ( s u )

Here µ u is the expected value (mean) E expressed in the base type T (i.e. a Measurement). This is further justification that we should indeed carry the mean as a separate component. But we should also give s u as an explicit component, so that scaling can be done accordingly. The product s u is the standard deviation (square root of the variance) of the described value. The standard deviation is a component that an application might be interested in even if it can not deal with a chi-square distribution function.
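The combined formula y = ( Q_i - µ u ) / ( s u ) and its inverse Q_i = µ u + y s u can be sketched in a few lines. This is a minimal sketch under assumptions: the `Measurement` type, the function names, and the requirement that all three quantities already share the same unit string are inventions for the example; no unit conversion is attempted.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    """Hypothetical stand-in for a measured quantity Q = magnitude x unit."""
    magnitude: float
    unit: str


def standardize(q: Measurement, mean: Measurement, sd: Measurement) -> float:
    """y = (Q - mean) / sd; the units cancel, leaving a pure number."""
    if not (q.unit == mean.unit == sd.unit):
        raise ValueError("this sketch does not convert between units")
    if sd.magnitude <= 0:
        raise ValueError("standard deviation must be positive")
    return (q.magnitude - mean.magnitude) / sd.magnitude


def destandardize(y: float, mean: Measurement, sd: Measurement) -> Measurement:
    """Q = mean + y * sd; the unit is reapplied to the result."""
    if mean.unit != sd.unit:
        raise ValueError("this sketch does not convert between units")
    return Measurement(mean.magnitude + y * sd.magnitude, mean.unit)
```

With a mean of 100 mg/dl and a standard deviation of 10 mg/dl (made-up numbers), a value of 120 mg/dl standardizes to y = 2.0, and destandardizing 2.0 returns 120 mg/dl.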

Thus, the data type and distribution table should be redefined as follows:

Note that this proposed data type and its table have evolved from an earlier definition [version 1]. It has further been superseded by a data type called Parametric Probability Distribution.

Probability Distribution
Generic data type to specify an uncertain value of an ordered data type. The base data type may be discrete or continuous. Discrete ordered types are mapped to natural numbers by setting their "smallest" possible value to 1, the second to 2, and so on. The order of non-numeric types must be unambiguously defined.
GENERIC TYPE
parameter name	allowed types	description
T	OrderedType	Any ordered type (anything that is unambiguously mapped to numbers) can be the basis of an uncertain quantity. Examples are Integer Number, Floating Point Number, and Measurement.
component name	type/domain	optionality	description
mean	T	required	The mean (expected value or first moment) of the probability distribution. The mean is used to standardize the data for calculating the distribution. The mean is also what a receiver is most interested in. Applications that can not deal with distributions can still get the idea about the described quantity by looking at its mean.
standard deviation	dif(T)	required	The standard deviation (square-root of variance or square-root of second moment) of the probability distribution. The standard deviation is used to standardize the data for calculating the distribution. Applications that can not deal with distributions can still get the idea about the confidence level by looking at the standard deviation.
type	Code Value	required	The type of probability distribution. Possible values are as shown in the adjacent table.
parameters	Choice	required	The parameters of the probability distribution. The number of parameters, their names and types depend on the selected distribution and are described in the adjacent table.

Distribution types, their mean and parameters.
type description and parameters

symbol name or meaning type constraint or comment

guess Used to indicate that the mean is just a guess without any closer specification of its probability. This pseudo distribution may not have any parameter aside from the mean. This is the only case where the standard deviation parameter may be omitted. The distribution type guess thus allows utterances such as: "Age: 70 years (estimated)" or "Age: 75±10 years".

distributions of discrete random variables

binomial Used for n identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of successes observed during n trials.

n number of trials Integer n > 1

p probability of success Float p between 0 and 1

E mean E = n p

V variance V = n p( 1 - p )

geometric Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials until the first success is observed.

p probability of success Float p between 0 and 1

E mean E = 1 / p

V variance V = ( 1 - p ) / p²

negative binomial Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials needed until the rth success occurs.

p probability of success Float p between 0 and 1

r number of successes Integer r > 2

E mean E = r / p

V variance V = r ( 1 - p ) / p²

hypergeometric Used for a set of N items, where r items share a certain property P. The described random variable is the number of items with property P in a random sample of n items.

N the total number of items Integer N > 1

r number of items with property P Integer r > 1

n sample size Integer n > 1

E mean E = (n r) / N

V variance V = n r (N - r) (N - n) / ( N³ - N² )

Poisson Describes the number of events observed in one unit that occur at an average of lambda per unit. For example, the number of incidents of a certain disease observed in a period of time given the average incidence of E. The Poisson distribution only has one parameter, which is the mean. The standard deviation is the square root of the mean.

E mean

V variance V = E

distributions of continuous random variables

uniform The uniform distribution assigns a constant probability density over a range of possible outcomes. No parameters besides mean E and standard deviation s are required. The width of the interval is sqrt(12 V) = 2 sqrt(3) s. Thus, the uniform distribution assigns probability densities f(x) > 0 for values E - sqrt(3) s <= x <= E + sqrt(3) s and f(x) = 0 otherwise.

E mean E = (low + high) / 2

V variance V = (high - low)² / 12

normal (Gaussian) The well-known bell-shaped normal distribution. Because of the central limit theorem, the normal distribution is the distribution of choice for an unbounded random variable that is the outcome of a combination of many stochastic processes. Even for values bounded on a single side (e.g. greater than 0) the normal distribution may be accurate enough if the mean is "far away" from the bound of the scale, measured in terms of standard deviations.

E mean often symbolized µ

V variance often symbolized sigma²

gamma Used for data that is bounded on the left (at zero) and skewed to the right, i.e. where the maximum of the distribution curve is located near the origin. Many biological measurements, such as enzymes in blood, have a gamma distribution.

alpha Float alpha > 0

beta Float beta > 0

E mean E = alpha beta

V variance V = alpha beta²

chi-square Used to describe the sum of squares of random variables, which occurs when a variance (second moment) is estimated (rather than presumed) from the sample. The chi-square distribution is a special type of gamma distribution with parameters beta = 2 and alpha = E / beta. The only parameter of the chi-square distribution is thus the mean, which must be a natural number and is called the number of degrees of freedom (the number of independent parts in the sum).

n number of degrees of freedom Integer n > 0

E mean E = n

V variance V = 2 n

Student-t Used to describe the quotient of a standard normal random variable and the square root of a chi-square random variable divided by its number of degrees of freedom. The t-distribution has one parameter n, which is the number of degrees of freedom.

n number of degrees of freedom Integer n > 0

E mean E = 0 for n > 1 (the mean of a standard normal random variable is always 0)

V variance V = n / ( n - 2 ) for n > 2

F Used to describe the quotient of two chi-square random variables, each divided by its number of degrees of freedom. The F-distribution has two parameters n and m, which are the numbers of degrees of freedom of the numerator and denominator variable respectively.

n numerator's number of degrees of freedom Integer n > 0

m denominator's number of degrees of freedom Integer m > 0

E mean E = m / ( m - 2 ) for m > 2

V variance V = 2m² (m + n - 2) / ( n(m - 2)²(m - 4) ) for m > 4

logarithmic normal The logarithmic normal (log-normal) distribution describes a skewed random variable X whose logarithm U = ln X is normally distributed. The log-normal distribution has the same parameters as the normal distribution, namely the mean µ and standard deviation sigma of U.

µ mean of the resulting normal distribution Float

sigma standard deviation Float

E mean of the original skewed distribution E = e^{µ + 0.5 sigma²}

V variance of the original skewed distribution V = e^{2µ + sigma²} ( e^{sigma²} - 1 )

beta The beta distribution is used for data that is bounded on both sides and may or may not be skewed. Two parameters are available to adjust the curve.

alpha Float alpha > 0

beta Float beta > 0

E mean E = alpha / ( alpha + beta )

V variance V = alpha beta / ((alpha + beta)²(alpha + beta + 1))

**Distribution types, their mean and parameters.**

There are dependencies between the explicit components mean and standard deviation and the parameters of the distribution. Those dependencies are not always resolved. If we want to give mean and standard deviation explicitly, there will often be redundancy in the parameters. However, it seems useful to let people specify parameters in the natural way rather than in terms of mean and standard deviation.
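The redundancy can be made concrete with a small sketch. The Python below is only an illustration and not part of any proposal: the function name `mean_and_sd`, the type names, and the parameter dictionary layout are assumptions made here. It derives mean and standard deviation from the natural parameters of three distribution types, using the formulas from the table.

```python
import math

def mean_and_sd(dist_type, params):
    """Derive (mean, standard deviation) from natural parameters.

    Hypothetical helper: the type names and parameter keys are assumed
    for illustration and follow the symbols used in the table.
    """
    if dist_type == "binomial":
        n, p = params["n"], params["p"]
        mean, variance = n * p, n * p * (1 - p)
    elif dist_type == "gamma":
        alpha, beta = params["alpha"], params["beta"]
        mean, variance = alpha * beta, alpha * beta ** 2
    elif dist_type == "poisson":
        # The Poisson distribution has a single parameter: V = E.
        mean = variance = params["E"]
    else:
        raise ValueError(f"unsupported distribution type: {dist_type}")
    return mean, math.sqrt(variance)

print(mean_and_sd("binomial", {"n": 100, "p": 0.5}))
print(mean_and_sd("gamma", {"alpha": 4.0, "beta": 2.0}))
```

Whenever such a derivation exists, a sender who gives n and p (or alpha and beta) has already fixed mean and standard deviation, so sending them as well is redundant and the two could disagree.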

For the uniform distribution, note that the standard deviation component is not the half-width of the interval; the half-width is sqrt(3) s.
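As an illustration, the following Python sketch converts between the interval bounds and the mean and standard deviation of a uniform distribution, using the relation width = 2 sqrt(3) s from the table. The function names are made up for this example.

```python
import math

def uniform_bounds(mean, sd):
    """Return (low, high) of the uniform distribution with the given mean and sd."""
    half_width = math.sqrt(3) * sd  # the half-width is sqrt(3) * sd, not sd itself
    return mean - half_width, mean + half_width

def uniform_mean_sd(low, high):
    """Return (mean, sd) of the uniform distribution on [low, high]."""
    return (low + high) / 2, (high - low) / math.sqrt(12)

low, high = uniform_bounds(10.0, 1.0)
print(low, high)
print(uniform_mean_sd(low, high))
```

A standard deviation of 1 thus corresponds to an interval of roughly 10 ± 1.73, not 10 ± 1.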

It would be awesome if we could define and implement an algebra for uncertain quantities. However, the little statistical understanding that I have tells me that it is a non-trivial task to determine the distribution type and parameters of a sum or product of two distributions, or of the inverse of a distribution.
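One case that is well understood may serve as a starting point: for independent quantities, means add and variances add, and the sum of two independent normal distributions is again normal. The Python sketch below covers only that case. The class name `Uncertain`, its fields, and the label "unknown" are assumptions for illustration. For sums of other distribution types the mean and standard deviation are still computed as shown, but the type of the result generally differs from that of the operands, so the sketch does not guess it.

```python
import math
from dataclasses import dataclass

@dataclass
class Uncertain:
    """Hypothetical uncertain quantity: mean, standard deviation, distribution type."""
    mean: float
    sd: float
    dist_type: str

def add_independent(a, b):
    """Add two independent uncertain quantities.

    Means and variances add for independent quantities. The result type is
    reported as normal only if both operands are normal; otherwise it is
    reported as "unknown".
    """
    mean = a.mean + b.mean
    sd = math.sqrt(a.sd ** 2 + b.sd ** 2)
    both_normal = a.dist_type == "normal" and b.dist_type == "normal"
    return Uncertain(mean, sd, "normal" if both_normal else "unknown")

total = add_independent(Uncertain(70.0, 3.0, "normal"), Uncertain(5.0, 4.0, "normal"))
print(total)
```

Products and inverses have no comparably simple rule, which is the non-trivial part mentioned above.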

The next conference call is next Monday, December 14, 1998, at 11 EST. Although it is not formally scheduled, I assume that people are fine with the usual time of Mondays at 11 EST. If not, please complain ASAP!

Agenda items are:

generic data type for uncertainty (aka. "probability distributions") [slide]
hang over items: generic data type for intervals (aka. "ranges"), [slide]
hang over items: ordinals, [slide]
Calendar modulus expressions.

regards

-Gunther Schadow