The HL7 version 3 data type task group held its eighth conference call on Monday, December 7, 1998, from 11:00 AM to 12:30 PM EST.
Attendees were:
Agenda items were:
number x unit
The units do have certain common properties (e.g. you cannot add meters and seconds, or apples and oranges). But there are also important differences. Stan Huff said that in the case of counts reported as number x things counted, the measurement name should contain the "things counted" rather than the units (e.g. in lab databases one frequently finds units such as "red blood cells" vs. "white blood cells", which is redundant if the measurement name is reported properly). We have to revisit this topic later.
Interval | |||
---|---|---|---|
Generic data type that can express a range of values. Ranges of values are most abundant as ranges of absolute time, used for ordering and scheduling. Note that an interval is not to be used to specify confidence intervals for uncertain values. | |||
GENERIC TYPE | |||
parameter name | allowed types | description | |
T | OrderedType | Any ordered type can be the basis of an interval. It does not matter whether the base type is discrete or continuous or whether any algebraic operators are defined for that type. | |
component name | type/domain | optionality | description |
low | T | optional | The lower boundary. |
low open | Boolean | required | Indicates whether the interval is closed or open at the lower boundary. For a boundary to be closed, a finite boundary must be provided, i.e. unspecified or infinite boundaries are always open. |
high | T | optional | The upper boundary. |
high open | Boolean | required | Indicates whether the interval is closed or open at the high boundary. For a boundary to be closed, a finite boundary must be provided, i.e. unspecified or infinite boundaries are always open. |
There was a considerable discussion on whether or not the interval notation is appropriate for things like a set A
A = { x_{i} | x_{i} < n }
i.e. the set of numbers x_{i} with all x_{i} less than n.
Stan Huff wants to specify a range as a composite of
In this discussion, I felt that we (myself included) mixed two orthogonal axes of concepts and a couple of independent issues all together, which was not very helpful.
First: As always, we must distinguish between any surface-level representation of a type and the semantics of the type. As always, this caveat was disregarded.
Second: There are three different uses of ranges on the table that may be intuitively described as
Miscellaneous issues are:
About synonymity between "range" and "interval" let's ask Webster:
We do distinguish between surface form and semantic components with intervals as with any other data type. We will specify a literal form for interval expressions that is tuned toward intuitiveness rather than uniformity. (I still believe that a computer is better served by uniformity than by intuitiveness, since a computer has no intuition.) Here is a mapping between surface forms (string literals) and the uniform interval form:
literal | interval form |
---|---|
<= n | ]unk; n] (low open) |
>= n | [n; unk[ (high open) |
< n | ]unk; n[ (low open and high open) |
> n | ]n; unk[ (low open and high open) |
= n | [n; n] |
[n,m] | [n; m] |
Acceptable values for low and high boundaries are all values of the base data type (including positive infinity and negative infinity for the usual numeric types and derivatives). In addition, one of the boundaries may be undefined, which is shown in the above table as "unk". In an XML ITS that chooses to transfer the uniform semantic interval form, the boundary would simply not be sent, rather than sending a special value for "unk". If the XML ITS decides to always send literals where available, there is no need to send "unk".
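A minimal sketch of how the literal forms listed above could be mapped to the uniform interval form, written in Python. The names `Interval` and `parse_literal` are hypothetical, `None` stands in for "unk", and only numeric base types are handled; none of this is part of the proposal itself.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Uniform interval form; None stands for an unspecified ("unk") boundary."""
    low: Optional[float]
    low_open: bool
    high: Optional[float]
    high_open: bool

    def __post_init__(self):
        # An unspecified boundary can never be closed.
        if self.low is None and not self.low_open:
            raise ValueError("unspecified low boundary must be open")
        if self.high is None and not self.high_open:
            raise ValueError("unspecified high boundary must be open")


def parse_literal(text: str) -> Interval:
    """Map a literal such as '<= 5' or '[2,7]' to the uniform interval form."""
    text = text.strip()
    m = re.fullmatch(r"\[\s*(\S+?)\s*,\s*(\S+?)\s*\]", text)
    if m:
        return Interval(float(m.group(1)), False, float(m.group(2)), False)
    m = re.fullmatch(r"(<=|>=|<|>|=)\s*(\S+)", text)
    if not m:
        raise ValueError(f"unrecognized interval literal: {text!r}")
    op, n = m.group(1), float(m.group(2))
    if op == "<=":
        return Interval(None, True, n, False)
    if op == ">=":
        return Interval(n, False, None, True)
    if op == "<":
        return Interval(None, True, n, True)
    if op == ">":
        return Interval(n, True, None, True)
    return Interval(n, False, n, False)  # "= n" is the degenerate interval [n; n]
```

The constructor check mirrors the rule from the Interval table that unspecified boundaries are always open.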
We are not yet clear about the use cases of the interval data type. One use case is certainly to specify the time of validity of some instance, or time a service was active, etc.
In order to decide whether intervals are to be used for reporting measurements we should first look at uncertainty.
We were not able to address this agenda item last time. It will be the first item on the next conference call. The following notes are in preparation for that call, so please read and think about them.
Uncertainty may exist for all kinds of information. Information is the selection of a signal (value) from a set of possible signals (values). Uncertain information is the selection of several values from a set of possible signals, where we assign to every signal a probability (i.e. belief that the given information applies). We may distinguish four cases:
Note that this proposed data type has been superseded by another type named Non-Parametric Probability Distribution.
Histogram | |||
---|---|---|---|
Generic data type to specify an uncertain discrete value as a set of <value, probability> pairs (Histogram Items). Values that are in the set of possible values but not mentioned in the histogram share the remaining probability, distributed equally over all unmentioned values. | |||
GENERIC TYPE | |||
parameter name | allowed types | description | |
T | DiscreteType | Any data type that is discrete can be used in a histogram. Usually we would use a histogram for unordered types only and only if we assign probabilities to a "small" set of possible values. | |
component name | type/domain | optionality | description |
SET OF Histogram Item<T> |
Note that this proposed data type has been superseded by another type named Uncertain Discrete Value with specializations: Uncertain Discrete Value using Probabilities and Uncertain Discrete Value using Narrative Expressions of Confidence.
Histogram Item | |||
---|---|---|---|
Generic data type to specify one uncertain value as a pair of <value, probability>. | |||
GENERIC TYPE | |||
parameter name | allowed types | description | |
T | DiscreteType | Any data type that is discrete can be used in a histogram item. | |
component name | type/domain | optionality | description |
value | T | required | The value to which a probability is assigned. |
probability | Floating Point Number 0.0 to 1.0 | required | The probability assigned to the value. |
Type cast rules allow sending a histogram item for a histogram. The resulting histogram contains only the histogram item originally sent.
If and only if a histogram contains one and only one histogram item, the probability value may be omitted. The meaning of this construct is to indicate that the value was estimated, but no assessment of the probability is wanted.
NOTE: Probabilities don't have to be "exact"; in particular, one does not need to carry out a series of experiments (samples) in order to specify a probability. Probabilities are always estimated. Bayesian probability theory equates the notion of "probability" with "belief". The probability is thus an assessment of the subjective belief of the originator of a statement. Some subjective numeric probability is often better than a mere indicator that a value is "estimated".
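To make the histogram rules concrete, here is a small Python sketch under simplifying assumptions: a histogram is represented as a plain dictionary from value to probability, and the function names `full_distribution` and `cast_item_to_histogram` are made up for illustration. It shows the equal distribution of the remaining probability over unmentioned values, and the type cast from a single histogram item to a histogram.

```python
from typing import Dict, Hashable, Iterable, Optional


def full_distribution(
    items: Dict[Hashable, float], domain: Iterable[Hashable]
) -> Dict[Hashable, float]:
    """Expand a histogram over `domain`, spreading the remaining probability
    equally over all values the histogram does not mention."""
    domain = list(domain)
    mentioned = sum(items.values())
    if mentioned > 1.0 + 1e-9:
        raise ValueError("probabilities sum to more than 1")
    unmentioned = [v for v in domain if v not in items]
    share = (1.0 - mentioned) / len(unmentioned) if unmentioned else 0.0
    return {v: items.get(v, share) for v in domain}


def cast_item_to_histogram(
    value: Hashable, probability: Optional[float] = None
) -> Dict[Hashable, Optional[float]]:
    """Type cast: a single histogram item becomes a one-item histogram.
    probability=None means "estimated, but no probability was assessed"."""
    return {value: probability}


# Example with a four-valued domain: A and B are mentioned, AB and O are not.
blood_groups = ["A", "B", "AB", "O"]
dist = full_distribution({"A": 0.5, "B": 0.25}, blood_groups)
# AB and O each receive half of the remaining 0.25.
```

Note that a one-item histogram with an omitted probability cannot be expanded this way, since nothing is known about how belief is spread over the other values.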
Note that this proposed data type and its table have been significantly updated [version 2]. They have further been superseded by a data type called Parametric Probability Distribution.
Probability Distribution | |||
---|---|---|---|
Generic data type to specify an uncertain value of an ordered data type. The base data type may be discrete or continuous. Discrete ordered types are mapped to natural numbers by setting their "smallest" possible value to 1, the second to 2, and so on. The order of non-numeric types must be unambiguously defined. | |||
GENERIC TYPE | |||
parameter name | allowed types | description | |
T | OrderedType | Any ordered type (anything that is unambiguously mapped to numbers) can be the basis of an uncertain quantity. | |
component name | type/domain | optionality | description |
mean | T | required | The mean (expected value or first moment) of the probability distribution. Although the mean can be quite easily derived from the distribution type and its parameters, it should be specified explicitly. |
type | Code Value | required | The type of probability distribution. Possible values are as discussed in the text. |
parameters | Choice | required | The parameters of the probability distribution. The number of parameters, their names and types depend on the selected distribution. |
The mean component is mentioned explicitly. This component will be used in type casting a probability distribution over type T to a simple value of type T in a case where a receiving application cannot deal with or is not interested in probability distributions.
The literature on statistics commonly lists the mean as dependent on the parameters of the probability distribution (e.g. the mean of a binomial distribution with parameters n and p is np). Because we choose to mention the mean (to help in roughly grasping the "value"), the parameters of the distributions may be defined in terms of the mean.
type | description and parameters | |||
---|---|---|---|---|
symbol | name or meaning | type | constraint or comment | |
guess | Used to indicate that the mean is just a guess without any closer specification of its probability. This pseudo distribution does not have any parameter aside from the expected value. | |||
E | mean | T | ||
distributions of discrete random variables | ||||
binomial | Used for n identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of successes observed during n trials. | |||
n | number of trials | Integer | n > 1 | |
p | probability of success | Float | p between 0 and 1 | |
E | mean | T | E = n p | |
geometric | Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials until the first success is observed. | |||
p | probability of success | Float | p between 0 and 1 | |
E | mean | T | E = 1 / p | |
negative binomial | Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials needed until the rth success occurs. | |||
p | probability of success | Float | p between 0 and 1 | |
r | number of successes | Integer | r > 2 | |
E | mean | T | E = r / p | |
hypergeometric | Used for a set of N items, where r items share a certain property P. The described random variable is the number of items with property P in a random sample of n items. | |||
N | the total number of items | Integer | N > 1 | |
r | number of items with property P | Integer | r > 1 | |
n | sample size | Integer | n > 1 | |
E | mean | T | E = (n r) / N | |
Poisson | Describes the number of events observed in one unit, where events occur at an average rate of lambda per unit. For example, the number of incidents of a certain disease observed in a period of time given the average incidence of E. The Poisson distribution has only one parameter, which is the mean. | |||
E | mean | T | ||
distributions of continuous random variables | ||||
uniform | A uniform probability density over some interval. | |||
E | mean | T | ||
w | width of the interval | dif(T) | ||
normal Gaussian | The well-known bell-shaped normal distribution. Because of the central limit theorem the normal distribution is the distribution of choice for an unbounded random variable that is an outcome of a combination of many stochastic processes. Even for values bounded on a single side (i.e. greater than 0) the normal distribution may be accurate enough if the mean is "far away" from the bound of the scale measured in terms of standard deviations. | |||
E | mean | T | often symbolized µ | |
sigma | standard deviation | dif(T) | ||
gamma | Used for data that is skewed to the right and bounded on the left, i.e. where the maximum of the distribution curve is located near the origin. Many biological measurements, such as enzymes in blood, have a gamma distribution. | |||
alpha | Float | alpha > 0 | ||
beta | Float | beta > 0 | ||
E | mean | T | E = alpha x beta | |
chi-square | Used to describe the sum of squares of random variables which occurs when a variance (second moment) is estimated (rather than presumed) from the sample. The chi-square distribution is a special type of gamma distribution with parameter beta = 2 and alpha = E / beta. The only parameter of the chi-square distribution is thus the mean, which must be a natural number, the so-called number of degrees of freedom (which is the number of independent parts in the sum). | |||
E | mean (number of degrees of freedom) | Integer | E > 0 | |
Student-t | Used to describe the quotient of a standard normal random variable and the square-root of a chi-square random variable. The t-distribution has one parameter n which is the number of degrees of freedom. | |||
n | number of degrees of freedom | Integer | n > 0 | |
E | mean | Integer | E = 0 (the mean of a standard normal random variable is always 0) | |
F | Used to describe the quotient of two chi-square random variables. The F-distribution has two parameters n_{1} and n_{2} which are the numbers of degrees of freedom of the numerator and denominator variable respectively. | |||
n_{1} | numerator's number of degrees of freedom | Integer | n_{1} > 0 | |
n_{2} | denominator's number of degrees of freedom | Integer | n_{2} > 0 | |
E | mean | Integer | E = n_{2} / ( n_{2} - 2 ) | |
logarithmic normal | The logarithmic normal (log-normal) distribution is often used to transform skewed random variable X into a normal form U = ln X. The log-normal distribution has the same parameters as the normal distribution. | |||
µ | mean of the resulting normal distribution | Float | ||
sigma | standard deviation | Float | ||
E | mean of the original skewed distribution | T | E = e^{µ + 0.5 sigma^{2}} | |
beta | The beta distribution is used for data that is bounded on both sides and may or may not be skewed. Two parameters are available to adjust the curve. | |||
alpha | Float | alpha > 0 | ||
beta | Float | beta > 0 | ||
E | mean | T | E = alpha / ( alpha + beta ) |
The type dif(T) is the data type of the difference of two values of type T. For the data type T = Point in time, dif(T) is not Point in time but a physical Measurement in the dimension of time (i.e. with units such as seconds, minutes, hours, etc.).
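The same distinction between T and dif(T) shows up in Python's standard library, which serves here only as an analogy: subtracting two `datetime` values (points in time) yields a `timedelta` (a duration), not another `datetime`.

```python
from datetime import datetime, timedelta

# Two points in time (type T).
start = datetime(1998, 12, 7, 11, 0)
end = datetime(1998, 12, 7, 12, 30)

# Their difference is a duration (type dif(T)), not a point in time.
duration = end - start
print(type(duration).__name__)  # timedelta
print(duration)                 # 1:30:00

# Adding a dif(T) value to a T value yields a T value again.
print(start + duration == end)  # True
```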
A problem is that most distributions are given in a form where only natural numbers or real numbers are acceptable. If distributions of measurements (with units) are to be specified, we need a way to remove the units for the purpose of generating the distribution and then reapply the units. For instance, if Q = µ u is a measured quantity with numeric magnitude µ and unit u, then we can bind the quotient Q / u to the random variable and calculate the distribution. For each calculated number x_{i}, we regain a quantity with unit as Q_{i} = x_{i} u.
Another problem is that most distributions are given in a standard form, that is, with mean or left boundary equal to 0, standard deviation equal to 1, etc. Therefore one has to standardize the quantity to be described first. This is similar to the problem of removing and reapplying units. The method is similar: a transformation maps the numeric value to a standard form, and the inverse transformation later maps the standard form back to the numeric value. Two issues must be considered:
y = ( x - o ) / s
We can combine the way we deal with the units and the standardization of the value into one formula:
y = ( Q_{i} - µ u ) / ( s u )
Here µ u is the expected value (mean) E expressed in the base type T (i.e. a Measurement). This is further justification that we should indeed carry the mean as a separate component. But we should also give s u as an explicit component, so that scaling can be done accordingly. The product s u is the standard deviation (square root of the variance) of the described value. The standard deviation is a component that an application might be interested in even if it can not deal with a chi-square distribution function.
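The combined formula and its inverse can be sketched in a few lines of Python. The `Measurement` class and the function names are hypothetical, and the sketch assumes all quantities already share the same unit (no unit conversion is attempted).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    magnitude: float
    unit: str


def standardize(q: Measurement, mean: Measurement, sd: Measurement) -> float:
    """y = (Q - mu u) / (s u): a unit-free number in standard form."""
    if not (q.unit == mean.unit == sd.unit):
        raise ValueError("this sketch does no unit conversion")
    return (q.magnitude - mean.magnitude) / sd.magnitude


def destandardize(y: float, mean: Measurement, sd: Measurement) -> Measurement:
    """Q = mu u + y (s u): re-apply offset, scale, and unit."""
    if mean.unit != sd.unit:
        raise ValueError("this sketch does no unit conversion")
    return Measurement(mean.magnitude + y * sd.magnitude, mean.unit)


# "Age: 75 +/- 10 years": where does 90 years fall in standard form?
mean = Measurement(75.0, "a")
sd = Measurement(10.0, "a")
y = standardize(Measurement(90.0, "a"), mean, sd)  # 1.5 standard deviations
back = destandardize(y, mean, sd)                  # Measurement(90.0, "a")
```

Because the units cancel in the quotient, `y` can be fed into any distribution given in standard form, and the result is mapped back with the same mean and standard deviation.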
Thus, the data type and distribution table should be redefined as follows:
Note that this proposed data type and its table have evolved from an earlier definition [version 1]. They have further been superseded by a data type called Parametric Probability Distribution.
Probability Distribution | |||
---|---|---|---|
Generic data type to specify an uncertain value of an ordered data type. The base data type may be discrete or continuous. Discrete ordered types are mapped to natural numbers by setting their "smallest" possible value to 1, the second to 2, and so on. The order of non-numeric types must be unambiguously defined. | |||
GENERIC TYPE | |||
parameter name | allowed types | description | |
T | OrderedType | Any ordered type (anything that is unambiguously mapped to numbers) can be the basis of an uncertain quantity. Examples are Integer Number, Floating Point Number, and Measurement. | |
component name | type/domain | optionality | description |
mean | T | required | The mean (expected value or first moment) of the probability distribution. The mean is used to standardize the data for calculating the distribution. The mean is also what a receiver is most interested in. Applications that can not deal with distributions can still get the idea about the described quantity by looking at its mean. |
standard deviation | dif(T) | required | The standard deviation (square-root of variance or square-root of second moment) of the probability distribution. The standard deviation is used to standardize the data for calculating the distribution. Applications that can not deal with distributions can still get the idea about the confidence level by looking at the standard deviation. |
type | Code Value | required | The type of probability distribution. Possible values are as shown in the adjacent table. |
parameters | Choice | required | The parameters of the probability distribution. The number of parameters, their names and types depend on the selected distribution and are described in the adjacent table. |
type | description and parameters | |||
---|---|---|---|---|
symbol | name or meaning | type | constraint or comment | |
guess | Used to indicate that the mean is just a guess without any closer specification of its probability. This pseudo distribution may not have any parameter aside from the mean. This is the only case where the standard deviation parameter may be omitted. The distribution type guess thus allows utterances such as: "Age: 70 years (estimated)" or "Age: 75±10 years". | |||
distributions of discrete random variables | ||||
binomial | Used for n identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of successes observed during n trials. | |||
n | number of trials | Integer | n > 1 | |
p | probability of success | Float | p between 0 and 1 | |
E | mean | E = n p | ||
V | variance | V = n p( 1 - p ) | ||
geometric | Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials until the first success is observed. | |||
p | probability of success | Float | p between 0 and 1 | |
E | mean | E = 1 / p | ||
V | variance | V = ( 1 - p ) / p^{2} | ||
negative binomial | Used for identical trials with each outcome being one of two possible values (called success or failure) with constant probability p of success. The described random variable is the number of trials needed until the rth success occurs. | |||
p | probability of success | Float | p between 0 and 1 | |
r | number of successes | Integer | r > 2 | |
E | mean | E = r / p | ||
V | variance | V = r (1 - p) / p^{2} | ||
hypergeometric | Used for a set of N items, where r items share a certain property P. The described random variable is the number of items with property P in a random sample of n items. | |||
N | the total number of items | Integer | N > 1 | |
r | number of items with property P | Integer | r > 1 | |
n | sample size | Integer | n > 1 | |
E | mean | E = (n r) / N | ||
V | variance | V = n r (N - r) (N - n) / ( N^{3} - N^{2} ) | ||
Poisson | Describes the number of events observed in one unit, where events occur at an average rate of lambda per unit. For example, the number of incidents of a certain disease observed in a period of time given the average incidence of E. The Poisson distribution has only one parameter, which is the mean. The standard deviation is the square root of the mean. | |||
E | mean | |||
V | variance | V = E | ||
distributions of continuous random variables | ||||
uniform | The uniform distribution assigns a constant probability density over a range of possible outcomes. No parameters besides mean E and standard deviation s are required. The width of the interval is sqrt(12 V) = 2 sqrt(3) s. Thus, the uniform distribution assigns probability densities f(x) > 0 for values E - sqrt(3) s <= x <= E + sqrt(3) s and f(x) = 0 otherwise. | |||
E | mean | E = (low + high) / 2 | ||
V | variance | V = (high - low)^{2} / 12 | ||
normal Gaussian | The well-known bell-shaped normal distribution. Because of the central limit theorem the normal distribution is the distribution of choice for an unbounded random variable that is an outcome of a combination of many stochastic processes. Even for values bounded on a single side (i.e. greater than 0) the normal distribution may be accurate enough if the mean is "far away" from the bound of the scale measured in terms of standard deviations. | |||
E | mean | often symbolized µ | ||
V | variance | often symbolized sigma^{2} | ||
gamma | Used for data that is skewed to the right and bounded on the left, i.e. where the maximum of the distribution curve is located near the origin. Many biological measurements, such as enzymes in blood, have a gamma distribution. | |||
alpha | Float | alpha > 0 | ||
beta | Float | beta > 0 | ||
E | mean | E = alpha beta | ||
V | variance | V = alpha beta^{2} | ||
chi-square | Used to describe the sum of squares of random variables which occurs when a variance (second moment) is estimated (rather than presumed) from the sample. The chi-square distribution is a special type of gamma distribution with parameter beta = 2 and alpha = E / beta. The only parameter of the chi-square distribution is thus the mean, which must be a natural number, the so-called number of degrees of freedom (which is the number of independent parts in the sum). | |||
n | number of degrees of freedom | Integer | n > 0 | |
E | mean | E = n | ||
V | variance | V = 2 n | ||
Student-t | Used to describe the quotient of a standard normal random variable and the square-root of a chi-square random variable. The t-distribution has one parameter n which is the number of degrees of freedom. | |||
n | number of degrees of freedom | Integer | n > 0 | |
E | mean | E = 0 (the mean of a standard normal random variable is always 0) | ||
V | variance | V = n / ( n - 2 ) | ||
F | Used to describe the quotient of two chi-square random variables. The F-distribution has two parameters n and m which are the numbers of degrees of freedom of the numerator and denominator variable respectively. | |||
n | numerator's number of degrees of freedom | Integer | n > 0 | |
m | denominator's number of degrees of freedom | Integer | m > 0 | |
E | mean | E = m / ( m - 2 ) | ||
V | variance | V = 2m^{2} (m + n - 2) / ( n(m - 2)^{2}(m - 4) ) | ||
logarithmic normal | The logarithmic normal (log-normal) distribution is often used to transform skewed random variable X into a normal form U = ln X. The log-normal distribution has the same parameters as the normal distribution. | |||
µ | mean of the resulting normal distribution | Float | ||
sigma | standard deviation | Float | ||
E | mean of the original skewed distribution | E = e^{µ + 0.5 sigma^{2}} | ||
V | variance of the original skewed distribution | V = e^{2µ + sigma^{2}} ( e^{sigma^{2}} - 1 ) | ||
beta | The beta distribution is used for data that is bounded on both sides and may or may not be skewed. Two parameters are available to adjust the curve. | |||
alpha | Float | alpha > 0 | ||
beta | Float | beta > 0 | ||
E | mean | T | E = alpha / ( alpha + beta ) | |
V | variance | T | V = alpha beta / ((alpha + beta)^{2}(alpha + beta + 1)) |
There are dependencies between the explicit components mean and standard deviation and the parameters of the distribution. Those dependencies are not always resolved. If we want to give mean and standard deviation explicitly, there will often be redundancy in the parameters. However, it seems to be useful to let people specify parameters in the natural way rather than dependent on mean and standard deviation.
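One conceivable way to live with this redundancy is for a receiver to recompute mean and standard deviation from the natural parameters and compare them with the explicit components. The following Python sketch does this for two of the distributions in the table; the function names and the tolerance are assumptions made for illustration, not something the group has decided.

```python
import math
from typing import Tuple


def binomial_moments(n: int, p: float) -> Tuple[float, float]:
    """Mean and standard deviation implied by E = n p, V = n p (1 - p)."""
    return n * p, math.sqrt(n * p * (1 - p))


def gamma_moments(alpha: float, beta: float) -> Tuple[float, float]:
    """Mean and standard deviation implied by E = alpha beta, V = alpha beta^2."""
    return alpha * beta, math.sqrt(alpha) * beta


def is_consistent(mean: float, sd: float,
                  implied: Tuple[float, float], tol: float = 1e-9) -> bool:
    """True if the explicit components agree with the parameter-derived ones."""
    return (math.isclose(mean, implied[0], rel_tol=tol)
            and math.isclose(sd, implied[1], rel_tol=tol))


# A sender states mean 50 and standard deviation 5 for binomial(n=100, p=0.5).
ok = is_consistent(50.0, 5.0, binomial_moments(100, 0.5))
# A standard deviation of 7 would contradict the parameters.
bad = is_consistent(50.0, 7.0, binomial_moments(100, 0.5))
```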
For the uniform distribution, note that the standard deviation component does not contain the half-width of the interval; with standard deviation s, the half-width is sqrt(3) s.
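As a quick check of that relationship, the interval boundaries of a uniform distribution can be recovered from mean and standard deviation alone. This is a small Python sketch with a hypothetical function name, using V = (high - low)^{2} / 12 from the table.

```python
import math
from typing import Tuple


def uniform_bounds(mean: float, sd: float) -> Tuple[float, float]:
    """Recover (low, high) of a uniform distribution from mean and sd."""
    half_width = math.sqrt(3.0) * sd  # since sd = (high - low) / sqrt(12)
    return mean - half_width, mean + half_width


low, high = uniform_bounds(10.0, 2.0)
# Round trip through the table formulas for E and V.
mean_again = (low + high) / 2
variance_again = (high - low) ** 2 / 12
```

For mean 10 and standard deviation 2 the boundaries lie about 3.46 on either side of the mean, clearly more than the standard deviation itself.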
It would be awesome if we could define and implement an algebra for uncertain quantities. However, the little statistical understanding that I have tells me that it is a non-trivial task to tell the distribution type and parameter from a sum, or product of two distributions or from the inverse of a distribution.
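To illustrate why this is hard, here is a Python sketch of the one easy case. For independent quantities, means add and variances add regardless of distribution type, but the type of the result is only obvious in special cases, such as the sum of two normal distributions being normal again. The class `Uncertain` and the function `add_independent` are hypothetical, and independence of the two quantities is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Uncertain:
    mean: float
    sd: float
    type: str  # e.g. "normal", "uniform"; "unknown" if it cannot be told


def add_independent(a: Uncertain, b: Uncertain) -> Uncertain:
    """Sum of two independent uncertain quantities."""
    mean = a.mean + b.mean
    sd = math.hypot(a.sd, b.sd)  # variances add: sqrt(sd_a^2 + sd_b^2)
    # Only the normal + normal case is resolved in this sketch.
    kind = "normal" if a.type == b.type == "normal" else "unknown"
    return Uncertain(mean, sd, kind)


total = add_independent(Uncertain(1.0, 3.0, "normal"),
                        Uncertain(2.0, 4.0, "normal"))
mixed = add_independent(Uncertain(1.0, 3.0, "uniform"),
                        Uncertain(2.0, 4.0, "uniform"))
```

The sum of two uniform quantities already falls outside the table (it is not uniform), which is why the sketch reports "unknown" there; products and inverses are harder still.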
Next conference call is next Monday, December 14, 1998, 11 EST. Although not formally scheduled, I assume that people are fine with the usual time of Mondays 11 EST. If not, please complain ASAP!
Agenda items are:
regards
-Gunther Schadow