Sufficient statistic
In statistics, one often considers a family of probability distributions for a random variable X (often a vector whose components are scalar-valued random variables, frequently independent), parameterized by a scalar- or vector-valued parameter, which we will call θ. A quantity T(X) that depends on the (observable) random variable X but not on the (unobservable) parameter θ is called a statistic. Sir Ronald Fisher sought to make precise the intuitive idea that a statistic may capture all of the information in X that is relevant to the estimation of θ. A statistic that does so is called a sufficient statistic.
Mathematical definition
The precise definition is this:
- A statistic T(X) is sufficient for θ precisely if the conditional probability distribution of the data X given the statistic T(X) does not depend on θ.
A mathematically equivalent test, known as the factorization criterion, is often used instead. If the probability of observing X is f(X;θ), then T satisfies the factorization criterion if, and only if, functions g and h can be found such that

f(X;θ) = g(T(X),θ) h(X).
This is a product of a function of X alone (h), and a function of θ and T(X) alone. The crucial point is that h(X) is independent of θ, and g(T(X),θ) depends on X only through the value of T(X). The way to think about this is to consider varying X in such a way as to maintain a constant value of T(X) and ask whether such a variation has any effect on inferences one might make about θ. If the factorization criterion above holds, the answer is "none" because the dependence of the likelihood function f on θ is unchanged.
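To spell out the one-line algebra behind that remark (added here for completeness): if x and x′ are two possible data values with T(x) = T(x′), the factorization makes their likelihood ratio free of θ, so both data values support exactly the same inferences about θ.

```latex
% Likelihood ratio of two data values x, x' with T(x) = T(x'):
\frac{f(x;\theta)}{f(x';\theta)}
  = \frac{g(T(x),\theta)\, h(x)}{g(T(x'),\theta)\, h(x')}
  = \frac{h(x)}{h(x')}
  \qquad \text{(independent of } \theta \text{).}
```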
Examples
- If X1, ..., Xn are independent Bernoulli-distributed random variables with expected value p, then the sum T(X) = X1 + ... + Xn is a sufficient statistic for p.
This is seen by considering the joint probability distribution

Pr(X1 = x1, X2 = x2, ..., Xn = xn).

Because the observations are independent, this can be written as

Pr(X1 = x1) Pr(X2 = x2) ... Pr(Xn = xn) = p^x1 (1-p)^(1-x1) p^x2 (1-p)^(1-x2) ... p^xn (1-p)^(1-xn)

and, collecting powers of p and 1-p, gives

p^(x1 + ... + xn) (1-p)^(n - (x1 + ... + xn)) = p^T(x) (1-p)^(n - T(x)),
which satisfies the factorization criterion, with h(x) being just the constant function 1. Note the crucial feature: the unknown parameter (here p) interacts with the observations only via the statistic T(x) (here the sum Σ xi).
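To make this concrete, here is a minimal numerical check (an illustration, not part of the original text): for a small Bernoulli sample, the exact conditional distribution of the full data vector given its sum turns out to be the same for every value of p, which is precisely the definition of sufficiency.

```python
# Exact check that the conditional distribution of Bernoulli data given
# its sum does not depend on p (the definition of sufficiency).
from itertools import product
from math import comb

def conditional_given_sum(n, t, p):
    """Exact P(X = x | X1 + ... + Xn = t) for each binary vector x of length n."""
    joint = {x: p ** sum(x) * (1 - p) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    prob_t = sum(prob for x, prob in joint.items() if sum(x) == t)
    return {x: joint[x] / prob_t for x in joint if sum(x) == t}

n, t = 4, 2
for p in (0.2, 0.5, 0.9):
    cond = conditional_given_sum(n, t, p)
    # Each admissible arrangement has probability 1 / C(n, t), whatever p is.
    assert all(abs(v - 1 / comb(n, t)) < 1e-12 for v in cond.values())
print("conditional distribution given the sum is identical for all p tested")
```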
- If X1, ..., Xn are independent and uniformly distributed on the interval [0,θ], then T(X) = max(X1, ..., Xn) is sufficient for θ; a sketch of the factorization is given below.
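A brief sketch of why, via the factorization criterion (spelled out here for completeness, using 1[·] for an indicator function): the joint density of the sample factors as

```latex
f(x_1,\dots,x_n;\theta)
  = \prod_{i=1}^{n} \frac{1}{\theta}\,\mathbf{1}[0 \le x_i \le \theta]
  = \underbrace{\theta^{-n}\,\mathbf{1}\!\left[\max_i x_i \le \theta\right]}_{g(T(x),\,\theta)}
    \;\underbrace{\mathbf{1}\!\left[\min_i x_i \ge 0\right]}_{h(x)},
```

so the factor involving θ depends on the data only through T(x) = max(x1, ..., xn), while h(x) does not involve θ at all.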
The Rao-Blackwell theorem
Since the conditional distribution of X given T(X) does not depend on θ, neither does the conditional expected value of g(X) given T(X), where g is any (sufficiently well-behaved) function. Consequently, that conditional expected value is itself a statistic, and so is available for use in estimation. If g(X) is any kind of estimator of θ, then the conditional expectation of g(X) given T(X) is typically a better estimator of θ, and is never worse in terms of mean squared error; one way of making that statement precise is the Rao-Blackwell theorem. Sometimes one can very easily construct a very crude estimator g(X), and then evaluate its conditional expected value given T(X) to obtain an estimator that is in various senses optimal.
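As an illustration (a simulation sketch using the Bernoulli example from above; the sample size, value of p, and crude estimator are chosen arbitrarily): the first observation X1 is an unbiased but very crude estimator of p, and its conditional expectation given the sufficient statistic T(X) = X1 + ... + Xn is, by symmetry, T(X)/n, the sample mean.

```python
# Simulation sketch of Rao-Blackwellisation in the Bernoulli example above.
# Crude estimator g(X) = X1; its conditional expectation given the sufficient
# statistic T(X) = X1 + ... + Xn is E[X1 | T] = T/n, the sample mean.
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def simulate(p=0.3, n=20, reps=100_000, seed=0):
    rng = random.Random(seed)
    crude, rao_blackwell = [], []
    for _ in range(reps):
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        crude.append(x[0])                # crude estimator: first observation only
        rao_blackwell.append(sum(x) / n)  # E[X1 | sum] = sample mean
    print(f"crude:          mean {mean(crude):.4f}, variance {variance(crude):.4f}")
    print(f"Rao-Blackwell:  mean {mean(rao_blackwell):.4f}, variance {variance(rao_blackwell):.4f}")

simulate()
```

Both estimators have expected value p; conditioning on T(X) leaves the mean unchanged but shrinks the variance from roughly p(1-p) to p(1-p)/n.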