A New Look at the Central Limit Theorem

Its definition, its many applications, its deep connection with inverse probability, and a glimpse at its history

For all its heft, the Central Limit Theorem has a singularly succinct definition. It says, simply, the following: the standardized sum (or mean) of a sample of i.i.d. random variables with finite variance converges in distribution to the standard normal N(0,1). Built around this central idea is a modest-sized lattice of variations and special cases. But the main theme of the theorem remains intact.

The to-the-point pithiness of the CLT’s definition hides a myriad of different uses that become apparent only when you carefully unpack the words in the definition and put them to use.

Also hidden behind the CLT’s definition is a long vein of discovery that reaches back more than three centuries. With dozens of mathematicians contributing to its development, the Central Limit Theorem created a veritable gold rush amongst researchers during the 17th, 18th, 19th, and 20th centuries.

Throughout the history of mathematical thought, seldom have so many researchers across so many centuries contributed so heavily to the development of a single idea. But the CLT isn’t just any old idea. It is the gold standard of statistics.

The Central Limit Theorem in action

It can be mesmerizing to see the CLT in action. In the video that follows, you’ll see the CLT at work on a random sample of size ‘n’ drawn from a population that is exponentially distributed. The simulation spawns 1000 different random samples of size 10 each. It calculates the mean of each of those 1000 samples and plots the frequency distribution of these 1000 means. This distribution looks absolutely nothing like a normal distribution. But as the sample size is increased from 10 to 20, 30, 40, 50, and so on, you’ll see how eagerly the sample means want to arrange themselves into a normal distribution.

The Central Limit Theorem In Action (Video by Author)
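If you’d like to reproduce the simulation yourself, here is a minimal sketch in Python using NumPy and Matplotlib. It follows the same recipe as the video: 1000 random samples per sample size, drawn from an exponential population, with one histogram of sample means per sample size. The exponential scale parameter and the plotting details are my own illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

num_samples = 1000                    # number of random samples per panel
sample_sizes = [10, 20, 30, 40, 50]   # grow n and watch the means turn normal

fig, axes = plt.subplots(1, len(sample_sizes), figsize=(15, 3), sharey=True)
for ax, n in zip(axes, sample_sizes):
    # Draw 1000 samples of size n from an exponential population and
    # compute each sample's mean
    sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)
    ax.hist(sample_means, bins=30, density=True, color="steelblue")
    ax.set_title(f"n = {n}")
    ax.set_xlabel("sample mean")

plt.tight_layout()
plt.show()
```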

What exactly is the Central Limit Theorem?

The CLT has a shape-shifting definition that is often adjusted to suit the context. Let’s deconstruct it.

We’ll begin in the comfortingly familiar embrace of a random sample. Let’s represent a sample of size n drawn randomly (with replacement) from an underlying population using the notation (X_1, X_2, …, X_n). Since each element X_i in the sample is chosen independently and randomly (with replacement) from the population, (X_1, X_2, …, X_n) is a set of n independent, identically distributed (i.i.d.) random variables. Let’s further assume that the population has a mean μ and a finite, positive variance σ².

Let X_bar_n be the sample mean or sample sum. It’s defined as follows:

The mean or sum of n i.i.d. random variables (Image by Author)

Some statistics texts denote the sample mean as X_bar_n and the sample sum as S_n.

Since X_bar_n is a function of n random variables, X_bar_n is itself a random variable with its own mean and variance represented using the notations E(X_bar_n) and Var(X_bar_n) respectively.

Suppose that X_bar_n is the mean. Since (X_1, X_2, …, X_n) are i.i.d. variables, it can be shown that:

E(X_bar_n) = μ

Var(X_bar_n) = σ²/n
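In case you are wondering where those two identities come from, here is the short derivation. It uses only the linearity of expectation and, for the second identity, the independence of the X_i:

```latex
E(\bar{X}_n) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
             = \frac{1}{n}\sum_{i=1}^{n} E(X_i)
             = \frac{n\mu}{n} = \mu

Var(\bar{X}_n) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
               = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i)
               = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
```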

Now let’s define a new random variable Z_n as follows:

The standardized version of X_bar_n (Image by Author)

Z_n, when defined this way, is called the standardized sample mean. It’s the distance between the sample mean and the population mean, expressed as a (possibly fractional) number of standard deviations of the sample mean. There is another way you can calculate Z_n. If you transform each data point X_i of the original sample using the formula (X_i − μ)/(σ/√n) and take the simple mean of the transformed sample, you’ll get the standardized sample mean Z_n.
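Here is a quick numerical check of that claim. The exponential population, its parameters, and the sample size below are arbitrary choices made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population: exponential with scale 2, so mean μ = 2 and std σ = 2
mu, sigma, n = 2.0, 2.0, 50
sample = rng.exponential(scale=2.0, size=n)

# Way 1: standardize the sample mean directly
z1 = (sample.mean() - mu) / (sigma / np.sqrt(n))

# Way 2: standardize each data point, then take the simple mean
z2 = ((sample - mu) / (sigma / np.sqrt(n))).mean()

print(z1, z2)   # identical, up to floating point error
```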

Given the above apparatus, the Central Limit Theorem makes all of the following equivalent statements about Z_n:

As the sample size grows, the Cumulative Distribution Function (CDF) of Z_n starts looking more and more like the CDF of the standard normal random variable N(0,1) and it becomes identical to the CDF of N(0,1) as the sample size approaches ∞.

As the sample size grows to infinity, the CDF of Z_n approaches the CDF of N(0,1).

For large sample sizes, Z_n is approximately a standard normal N(0,1) random variable.

Shaped as equations, these statements look like this:

The Central Limit Theorem expressed in several different equivalent ways (Image by Author)

In (a), the integral on the L.H.S. is the CDF of Z_n while that on the R.H.S. is the CDF of N(0, 1).

In (b) and (c), P(Z_n ≤ z) is just another way to specify the CDF of Z_n. Φ(z) is the notation customarily used for the CDF of the standard normal random variable N(0, 1).

In (c), the wavy equals sign (or half-equals, ‘≃’) means asymptotically equal to.

In (d), the ‘d’ over the arrow means converges in distribution.

Notice how the CLT assumes nothing about the shape of the probability distributions of X_1, X_2, …, X_n. They don’t need to be normally distributed, and that little fact greatly expands the applicability of the CLT. X_1, X_2, …, X_n just need to be identically distributed and independent (and have a finite variance), and even these two restrictions have been relaxed in certain special versions of the CLT.
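To see that freedom in action, here is a small sketch that checks statement (b) empirically. The population below is a heavily skewed exponential, yet the empirical CDF of Z_n lands almost exactly on Φ(z). The choice of population, the sample size, and the number of replications are all illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

mu, sigma, n, reps = 1.0, 1.0, 100, 20_000   # exponential(1) has mean 1, std 1

# Build 20,000 realizations of Z_n, each from a sample of a skewed population
samples = rng.exponential(scale=1.0, size=(reps, n))
z = (samples.mean(axis=1) - mu) / (sigma / np.sqrt(n))

# Compare the empirical CDF of Z_n with Φ(z) at a few points
for point in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = np.mean(z <= point)
    print(f"z = {point:+.1f}   P(Z_n <= z) ≈ {empirical:.4f}   Φ(z) = {norm.cdf(point):.4f}")
```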

What can you do with the Central Limit Theorem?

The CLT makes its presence felt everywhere. In simple Frequentist inference problems, in the method of Least Squares estimation, in Maximum Likelihood estimation, in Bayesian inference, in the analysis of time series models: no matter what area of statistics you are mucking around in, you’ll soon notice that there is some form of the CLT at work. And everywhere you run into it, you can use it to your benefit.

Let’s look at some of those benefits.

Calculating the probability of observing a particular sample mean

This is a straightforward use case of the CLT. I’ll illustrate it with an example. Suppose your broadband connection has an average speed of 400 Mbps with a variance of 100 Mbps². Thus μ = 400 Mbps and σ² = 100 Mbps². If you measure the broadband speed at 25 random times of the day, what is the probability that your sample mean will lie between 395 Mbps and 405 Mbps? In other words, what is the following probability:

What is the probability of the mean observed bandwidth lying between 395 and 405 Mbps? (Image by Author)

In the above probability, the vertical bar ‘|’ means ‘conditioned upon’.

The solution lies in the insight that as per the CLT the standardized sample mean has an approximately standard normal distribution.

The standardized mean converges in distribution to N(0,1) (Image by Author)

You can use this fact to work out the solution using the following steps:

(Image by Author)

We see that in 98.76% of random samples of 25 measurements each, the average sample bandwidth will lie between 395 and 405 Mbps.
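If you’d rather not reach for a Z table, the same number drops out of a couple of lines of SciPy, using the μ, σ², and n assumed above:

```python
import numpy as np
from scipy.stats import norm

mu, sigma_sq, n = 400.0, 100.0, 25
se = np.sqrt(sigma_sq / n)              # standard error = 10 / 5 = 2 Mbps

# P(395 <= X_bar <= 405) = Φ((405 − μ)/se) − Φ((395 − μ)/se)
prob = norm.cdf((405 - mu) / se) - norm.cdf((395 - mu) / se)
print(f"{prob:.4f}")                    # ≈ 0.9876
```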

This is a textbook use-case of the CLT. To see why, recall how I started the example with the wide-eyed assumption that you would know the real bandwidth μ and variance σ² of your internet connection. Well, would you? You see, μ and σ² are the population values and in practice you aren’t likely to know the population value of any parameter.

Building a confidence interval around the unknown population mean

A more common and practical use of the CLT is to build a (1 − α)100% confidence interval around the unknown population mean μ or variance σ². Continuing with our bandwidth example, let’s assume (correctly this time) that you don’t know μ and σ². Say that after taking 25 measurements at random times of the day, you have gathered a random sample of size 25 with a sample mean of 398.5 Mbps and a sample variance of 85 Mbps². Given this observed sample mean and variance, what is an interval [μ_low, μ_high] such that the probability of the real, unknown mean μ lying within this interval is 95%?

Here n = 25, X_bar_n = 398.5, and S² = 85. The unknowns are μ and σ². You are seeking the interval [μ_low, μ_high] such that:

The 95% confidence interval for the population mean (Image by Author)
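We’ll derive this interval properly, via Laplace’s route, in the articles that follow. Purely as a preview, here is what the standard textbook normal-approximation interval looks like for these numbers, with the sample standard deviation √85 standing in for the unknown σ. Treat it as a sketch, not as the approach this series will take:

```python
import numpy as np
from scipy.stats import norm

n, x_bar, s_squared = 25, 398.5, 85.0
se = np.sqrt(s_squared / n)                 # estimated standard error ≈ 1.844 Mbps

z_crit = norm.ppf(0.975)                    # ≈ 1.96 for a 95% interval
mu_low, mu_high = x_bar - z_crit * se, x_bar + z_crit * se
print(f"[{mu_low:.2f}, {mu_high:.2f}]")     # ≈ [394.89, 402.11]
```

(With only 25 measurements and σ unknown, a textbook would usually swap the normal critical value for Student’s t; I’ve kept the normal here only to stay close to the CLT.)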

Notice how this probability is the inverse of the probability in the previous case. In the previous case you were seeking P(X_bar_n|n, μ, σ²), i.e. the probability of X_bar_n given n, μ, and σ². Here you are (effectively) seeking P(μ|n, X_bar_n, σ²), i.e. the probability of μ given n, X_bar_n, and σ².

P(μ|n, X_bar_n, σ²), or in short P(μ|X_bar_n), is called the inverse probability. Incidentally, because your connection speed is a real number, P(μ|X_bar_n) is the probability density function of the real mean μ conditioned upon the observed mean X_bar_n.

So what exactly are we saying here? Are we saying that there are an infinite number of real means possible, and each one of them has a probability density that lies somewhere on a probability density function whose shape is fixed the moment you observe a sample mean?

Yes, that’s exactly what we are saying. If this problem is starting to smell like the Many-Worlds Interpretation of quantum mechanics, you are not alone!

Don’t feel too bad if you feel at sea about inverse probability. It’s indeed a profound concept. Mathematicians had been banging their heads against the problem of inverse probability for over a century until the French mathematician Pierre-Simon Laplace and the English Presbyterian minister Thomas Bayes independently cracked it open in the second half of the 1700s (the problem, obviously, not their heads). In a sequel to this article I’ll detail Laplace’s delightful approach to inverse probability, which he formulated nearly two centuries before the Many-Worlds interpretation of quantum mechanics first met the pages of a science journal. Laplace isn’t called the French Newton for no reason.

Pierre-Simon Laplace (1749–1827), and Thomas Bayes (1701–1761) (Public domain images)

By the way, speaking of head injuries, much of Laplace’s work happened from the mid-1700s to the early 1800s against the backdrop of tremendous upheaval in French society. While Laplace shrewdly managed to keep himself safe from the turmoil and the ever-changing power equations that followed the French Revolution of 1789–1799, many of his fellow scientists found themselves involuntarily relieved of all earthly obligations.

12 November 1793: Jean Sylvain Bailly, French astronomer, mathematician, freemason, and political leader, staring at his immediate future (Public domain image).

Laplace’s solution to inverse probability provides a very natural and satisfying rationale for constructing a confidence interval around the unknown population mean using — you guessed it — the Central Limit Theorem.

Now, I could simply tell you how to employ the CLT to build a confidence interval around the unknown mean. It would take only a few minutes: I could show you the solution steps in the manner adopted by most textbooks and be done with it. But take this from me: it will be infinitely more enjoyable to follow Laplace’s route to the confidence interval, meandering as it does through his method of computing inverse probability. And that’s precisely how we’ll build the confidence interval; it’s all coming up in subsequent articles.

Meanwhile, here’s the third use case for the CLT.

Normality of errors and the soundness of least squares estimation

In the early 1800s, Laplace used the CLT to argue that measurement errors are normally distributed. In retrospect, his argument was strikingly simple. A measurement error can be thought of as the sum of the effects of innumerable causes. Laplace assumed that the causes are independent, uniformly distributed random variables. If the error in a given measurement is the sum of a random sample of causes, then as per the CLT their standardized sum must converge in distribution to the standard normal distribution. This is called the hypothesis of elementary errors.
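A small simulation makes the hypothesis of elementary errors concrete: add up a couple of hundred independent, uniformly distributed ‘causes’ and the standardized sum already behaves like N(0,1). The number of causes, the Uniform(−0.5, 0.5) choice, and the replication count are all arbitrary illustrative assumptions:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(7)

num_causes, reps = 200, 10_000

# Each simulated 'error' is the sum of 200 independent Uniform(-0.5, 0.5) causes
errors = rng.uniform(-0.5, 0.5, size=(reps, num_causes)).sum(axis=1)

# Standardize: a Uniform(-0.5, 0.5) cause has mean 0 and variance 1/12
z = errors / np.sqrt(num_causes / 12.0)

# A Kolmogorov–Smirnov test against N(0,1) should detect no significant departure
print(kstest(z, "norm"))
```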

Meanwhile, in 1809, Johann Carl Friedrich Gauss (1777–1855), in what is present-day Germany, formulated the method of least squares estimation. His whole technique rested upon the assumption that the measurement errors were normally distributed, an assumption for which Gauss provided a fabulously loopy justification that even he didn’t believe in. But Laplace rushed to Gauss’s rescue with his hypothesis of elementary errors to justify Gauss’s assumption that the errors are normally distributed, thereby giving legitimacy to Gauss’s least squares estimation technique.

Use in time-series models

Finally, certain advanced forms of the CLT, such as the Martingale Central Limit Theorem and Gordin’s Central Limit Theorem, are used to build confidence intervals for the predictions of time series models. Without these special forms of the CLT, it’s not possible to easily analyze the accuracy of the predictions coming out of such models.

A deeper look at the origins of the CLT

The internet speed test is an intriguing example. If you wanted to improve the accuracy of your sample average, your instinct might be to take measurements more often during the day. But how often is often enough? Once an hour? Once a second? Once a femtosecond? The trouble is that the sample space is indexed by a real-valued quantity, time, which makes the sample space uncountably infinite.

In such cases the actual value of the mean, if it even exists, is truly immeasurable. Nobody, at least no one down here on Earth, can know what the real value is. But is there a way we can nudge nature into revealing the true value? We’d be happy to know even an estimate of the true value, so long as we have some way to measure the accuracy of that estimate.

In 1687, a very famous scientist pondered exactly such questions. His meditations led him to discover the (Weak) Law of Large Numbers. They also set in motion a train of inquiry that lasted more than a century and led to the discovery of De Moivre’s theorem, the normal curve, Laplace’s method for inverse probability, and finally the Central Limit Theorem. That scientist was Jacob Bernoulli.

Join me next week when I’ll cover Jacob Bernoulli’s discovery of the Weak Law of Large Numbers. The WLLN forms the keystone of the Central Limit Theorem. Yank out the WLLN and the weighty edifice of the CLT collapses into a pile of rubble. We’ll see how. Stay tuned.

References and Copyrights

Books and Papers

Bernoulli, Jakob (2005) [1713], On the Law of Large Numbers, Part Four of Ars Conjectandi (English translation by Oscar Sheynin), Berlin: NG Verlag, ISBN 978-3-938417-14-0

Seneta, Eugene, A Tricentenary History of the Law of Large Numbers, Bernoulli 19(4), 1088–1121, September 2013. https://doi.org/10.3150/12-BEJSP12

Fischer, H., A History of the Central Limit Theorem: From Classical to Modern Probability Theory, Springer Science & Business Media, October 2010

Hald, A., A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935, Springer, 2006

Stigler, S. M., The History of Statistics: The Measurement of Uncertainty Before 1900, Harvard University Press, September 1986

Images and Videos

All images and videos in this article are copyright Sachin Date under CC-BY-NC-SA unless a different source and copyright are mentioned underneath the image or video.

Thanks for reading! If you liked this article, please follow me for more content on statistics and statistical modeling.
