Abraham De Moivre, His Famous Theorem, and the Birth of the Normal Curve

The life and times of Abraham De Moivre, his famous theorem, and how it set the stage for the discovery of the Central Limit Theorem

It was around 1689 that one of the biggest discoveries in probability unfolded in the Swiss Confederation. The protagonist was Jacob Bernoulli and the discovery was the Weak Law of Large Numbers. Bernoulli showed that the mean of a randomly selected sample converges in probability to the mean of the population. Unfortunately, Bernoulli proved the WLLN in the context of a binomial experiment, and in the 1600s there was not one soul in Europe who knew how to efficiently approximate, let alone calculate, the big, bulky factorials that constitute binomial probabilities. And that meant that for all its ground-breaking importance, the Weak Law of Large Numbers enjoyed little practical utility.

It would be nearly three decades after Jacob Bernoulli’s death in 1705 that a self-exiled Frenchman in England by the name of Abraham De Moivre would crack the problem of binomial approximation, and in doing so gift us the normal distribution curve.

This is the story of De Moivre, his theorem, and his legacy.

Abraham De Moivre

Abraham De Moivre (Public domain)

Abraham De Moivre was born in 1667 in Vitry-le-François, a neat little commune in the Champagne province of the Kingdom of France.

Historical hillsides of Champagne (CC BY-SA 4.0)

De Moivre spent a good portion of his childhood bouncing around an assortment of Protestant and Catholic schools in what must have been a less than happy school life for a math prodigy. At any rate, born to Protestant parents in a kingdom that was speedily turning Catholic under the iron thumb of Louis XIV, by age twenty, Abraham De Moivre faced a choice: convert, defy, or flee. De Moivre chose to flee, to England.

And thus it was that in the summer of 1687, a young De Moivre undertook a treacherous flight that took him out of Paris, across France, over the Channel, and to the safety of England. That very year, and who knows, perhaps during that same summer, some 250 miles east of Paris in Basel, Switzerland, a 32-year-old Jacob Bernoulli was carefully pouring his thoughts into Part 4 of his magnum opus, Ars Conjectandi: the very thoughts that two decades later would occupy De Moivre’s fertile mind in England, ideas that would turn into a consuming occupation for De Moivre and lead him to produce some of the finest breakthroughs of 18th-century mathematics.

Ars Conjectandi (Public domain)

In the summer of 1687, when De Moivre made it to England, he was already proficient in mathematics. Not only did he soon start earning a living as a math tutor, he also speedily penetrated the intellectual life of 17th-century Britain. By age 25, he was friends with Edmond Halley of Halley’s comet fame, and soon with Isaac Newton himself. By age 30, he was inducted into the Royal Society. At some point, he became acquainted with the works of the Bernoullis (of Jacob Bernoulli in particular, who had died by then, and of Jacob’s brother Johann), and of Johann’s well-to-do friend and fellow Frenchman, the mathematician and expert in probability Pierre Rémond de Montmort, who was himself later inducted into the Royal Society. De Moivre’s fertile mind had finally found the environment it needed to flourish. The dragoons of Louis XIV were now far away and harmless. The strain of a childhood spent in constant fear was but a distant memory.

From left to right: Louis XIV, and the excesses of his dragonnades policy portrayed in the art of the period. On the left side of the third pane is scribbled “Whoever can resist me is very strong,” while on the right side the artist writes “Strength surpasses reason.”

It would be several years into his new life in England that a middle-aged De Moivre would take an abiding interest in Jacob Bernoulli’s work on the Law of Large Numbers. To see what his interest led to, let’s visit Bernoulli’s theorem and the thought experiment that led Bernoulli to its discovery.

Jacob Bernoulli’s binomial thought experiment, and Abraham De Moivre’s big idea

In Ars Conjectandi, Bernoulli had imagined a large urn containing r black tickets and s white tickets. Both r and s are unknown to you, and so is the true fraction p = r/(r+s) of black tickets in the urn. Now suppose you draw n tickets from the urn randomly with replacement and your random sample contains X_bar_n black tickets. Here, X_bar_n is the sum of n i.i.d. random variables. Thus, X_bar_n/n is the fraction of black tickets that you observe. In essence, X_bar_n/n is your estimate of the true value of p.

The number of black tickets X_bar_n found in a random sample of black and white tickets has the familiar binomial distribution. That is:

X_bar_n ~ Binomial(n, p)

where n is the sample size, and p = r/(r+s) is the true probability of a single ticket being a black ticket. Of course, p is unknown to you since in Bernoulli’s experiment, the numbers of black tickets (r) and white tickets (s) are unknown to you.

Since X_bar_n is binomially distributed, its expected value is E(X_bar_n) = np and its variance is Var(X_bar_n) = np(1 − p). Again, since p is unknown, both the mean and variance of X_bar_n are also unknown.

Also unknown to you is the absolute difference between your estimate of p and the true value of p. This difference is the estimation error |X_bar_n/n − p|.

Bernoulli’s great discovery was to show that as the sample size n becomes very large, the odds of the error |X_bar_n/n − p| being smaller than any arbitrarily small positive number ϵ of your choosing become incredibly large. As an equation, his discovery can be stated as follows:

Bernoulli’s theorem (Image by Author)

The above equation is the Weak Law of Large Numbers. In this equation:

P(|X_bar_n/n − p| ≤ ϵ) is the probability of the estimation error being at most ϵ.
P(|X_bar_n/n − p| > ϵ) is the probability of the estimation error being greater than ϵ.
c is any positive number of your choosing, no matter how large. Bernoulli showed that for a sufficiently large sample size n, the first of these two probabilities exceeds c times the second.
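A quick computational aside: the WLLN is easy to watch in action on a computer. Below is a minimal simulation sketch of Bernoulli’s urn in Python (the true fraction p = 0.75 is a hypothetical choice that, in Bernoulli’s setup, the experimenter would not actually know), showing the observed fraction X_bar_n/n homing in on p as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(42)
p_true = 0.75  # hypothetical true fraction of black tickets (unknown to Bernoulli's experimenter)

# Draw n tickets with replacement for increasing n, and watch the observed
# fraction X_bar_n / n settle near p_true, just as the WLLN promises.
for n in [10, 100, 1_000, 10_000, 100_000]:
    draws = rng.random(n) < p_true        # True means a black ticket was drawn
    x_bar_n = draws.sum()                 # number of black tickets in the sample
    error = abs(x_bar_n / n - p_true)     # the estimation error |X_bar_n/n - p|
    print(f"n = {n:>6}:  X_bar_n/n = {x_bar_n / n:.4f},  error = {error:.4f}")
```

In any single run the error need not shrink monotonically; the theorem only says that large errors become increasingly improbable as n grows.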

The WLLN can be stated in three other forms highlighted in the blue boxes below. These alternate forms result from doing some simple algebraic gymnastics as follows:

Alternate forms of the Weak Law of Large Numbers (Image by Author)

Now notice the probability in the third blue-colored box:
P(μ − δ ≤ X_bar_n ≤ μ + δ) = (1 − α)

Or, plugging back μ = np:
P(np − δ ≤ X_bar_n ≤ np + δ) = (1 − α)

Since X_bar_n ~ Binomial(n,p), it is straightforward to express this probability as a difference of two binomial probabilities as follows:

P(np-δ ≤ X_bar_n ≤ np+δ) where X_bar_n ~ Binomial(n,p) (Image by Author)

But it is at this point that things stop being straightforward. For large n, the factorials inside the two summations become enormous and all but impossible to calculate by hand. Imagine having to calculate 20!, let alone 100! or 1000!. What is needed is a good approximation technique for n!. In Ars Conjectandi, Jacob Bernoulli made a few weak attempts at approximating these probabilities, but the quality of his approximations left a lot to be desired.
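Today, arbitrary-precision arithmetic lets us compute such sums exactly, which makes it easy to appreciate what Bernoulli was up against. Here is a minimal sketch in Python (using hypothetical urn numbers that will reappear in the worked illustration later: n = 1000, p = 3/4, δ = 39) that evaluates P(np − δ ≤ X_bar_n ≤ np + δ) exactly. Every call to comb(n, k) quietly handles factorials such as 1000!, which were hopeless to evaluate by hand in the 1700s.

```python
from fractions import Fraction
from math import comb

def binomial_interval_prob(n: int, p: Fraction, lo: int, hi: int) -> Fraction:
    """Exact P(lo <= X <= hi) for X ~ Binomial(n, p), via rational arithmetic."""
    q = 1 - p
    return sum(comb(n, k) * p**k * q**(n - k) for k in range(lo, hi + 1))

# Hypothetical urn: 1000 draws, true fraction of black tickets 3/4, delta = 39.
n, p, delta = 1000, Fraction(3, 4), 39
mu = int(n * p)                                   # expected count np = 750
print(float(binomial_interval_prob(n, p, mu - delta, mu + delta)))  # ~0.996
```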

Abraham De Moivre’s big idea

In the early 1700s, when De Moivre first began looking at Bernoulli’s work, he immediately sensed the need for a fast, high-quality approximation technique for the factorial terms in the two summations. Without such a technique, Bernoulli’s great accomplishment was like a big, beautiful kite without a string: a law of great beauty but of little practical use.

De Moivre recast the problem as an approximation for the sum of successive terms in the expansion of (a + b) raised to the nth power. This expansion, known as the binomial formula, goes as follows:

The formula for (a+b) raised to the nth power (Image by Author)

De Moivre’s reasons for recasting the probabilities in the WLLN in terms of the binomial formula were arrestingly simple. It was known that if the sample sum X_bar_n has a binomial distribution, the probability of X_bar_n being less than or equal to the sample size n can be expressed as a sum of (n+1) probabilities as follows:

The formula for P(X_bar_n ≤ n) (Image by Author)

If you compare the coefficients of the terms on the R.H.S. of the above equation with the coefficients of the terms in the expansion of (a+b) raised to n, you’ll find them to be remarkably similar. And so De Moivre theorized that if you find a way to approximate the factorial terms in the expansion of (a+b) raised to n, you will have paved the way for approximating P(X_bar_n ≤ n), and thus also the probability lying at the heart of the Weak Law of Large Numbers, namely:

P(np − δ ≤ X_bar_n ≤ np + δ) = (1 − α)

De Moivre’s theorem

For over ten years, De Moivre toiled on the approximation problem, creating increasingly accurate approximations of the factorial terms. By 1733, he had largely concluded his work, publishing what came to be called De Moivre’s theorem (or, less accurately, the De Moivre-Laplace theorem).

At this point, I could simply state De Moivre’s theorem, but that would spoil half the fun. Instead, let’s follow De Moivre’s train of thought and work through the calculations leading up to the formulation of his great theorem.

Our requirement is for a fast, high accuracy approximation technique for the probability that lies at the heart of Bernoulli’s theorem, namely:

P(|X_bar_n/n − p| ≤ ϵ)

Or equivalently, its transformed version:
P(np − δ ≤ X_bar_n ≤ np + δ)

Or in the most general form, the following probability:
P(x_1 ≤ X ≤ x_2)

In this final form, we have assumed that X is a discrete random variable that has a binomial distribution. Specifically, X ~ Binomial(n,p).

The probability P(x_1 ≤ X ≤ x_2) can be expressed as follows:

Formula for probability P(x_1 ≤ X ≤ x_2) (Image by Author)

Now let p, q be two real numbers such that:
0 ≤ p ≤ 1, and 0 ≤ q ≤ 1, and q = (1 − p).

Since X ~ Binomial(n,p), E(X) = μ = np, and Var(X) = σ² = npq.

Let’s create a new random variable Z as follows:

A few variable definitions (Image by Author)

Z is clearly the standardized version of X: by construction, Z has zero mean and unit variance. De Moivre’s theorem, as we are about to see, says that for large n the distribution of Z approaches that of a standard normal random variable. Loosely:

If X ~ Binomial(n,p), then for large n, Z ~ N(0, 1), approximately

Keep this in mind, for we’ll revisit this fact in just a minute.
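As a quick sanity check on this standardization, here is a small simulation sketch (the values n = 500 and p = 0.4 are hypothetical): we draw many copies of X, form Z, and confirm that Z has roughly zero mean and unit variance, and already behaves much like a standard normal variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.4                               # hypothetical values of n and p
x = rng.binomial(n, p, size=200_000)          # many independent draws of X ~ Binomial(n, p)
z = (x - n * p) / np.sqrt(n * p * (1 - p))    # the standardized variable Z

print(z.mean(), z.var())            # ~0 and ~1, by construction
print((np.abs(z) <= 1.96).mean())   # ~0.95, just as for a standard normal variable
```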

With the above framework in place, De Moivre showed that for very large values of n, the probability:

P(x_1 ≤ X ≤ x_2)

can be approximated by evaluating the following specific type of integral:

P(x1 <= X <= x2) asymptotically converges to the area under the curve exp(-z²/2) from z1 to z2. (Image by Author)

The ≃ sign means that the L.H.S. asymptotically equals the R.H.S. In other words, as the sample size n grows to ∞, the L.H.S. converges to the R.H.S.

Did you notice something familiar about the integral on the R.H.S.? It’s the formula for the area under a standard normal variable’s probability density curve from z_1 to z_2.

Area under the PDF of N(0,1) from z_1=-1 to z_2=+1 (Image by Author)

And the formula inside the integral is the probability density function of the standard normal random variable Z:

PDF of the standard normal random variable Z (Image by Author)

Let’s split apart the integral on the R.H.S. as a difference of two integrals as follows:

P(z1 ≤ Z ≤ z2) = P(Z ≤ z2) — P(Z ≤ z1) (Image by Author)

The two new integrals on the R.H.S. are, respectively, the cumulative probabilities P(Z ≤ z_2) and P(Z ≤ z_1).

The cumulative distribution function P(Z ≤ z) of a standard normal random variable is represented using the standard notation:

Φ(z)

Therefore, the integral on the L.H.S. of the above equation is equal to:

Φ(z_2) − Φ(z_1).

Bringing it all together, we can see that the probability:

P(x_1 ≤ X ≤ x_2)

asymptotically converges to Φ(z_2) − Φ(z_1):

Image by Author
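Here is a minimal numerical check of this convergence, assuming SciPy is available (the values of p and k below are hypothetical). For a fixed interval of ±k standard deviations around np, the exact binomial probability creeps toward Φ(k) − Φ(−k) as n grows:

```python
import math
from scipy.stats import binom, norm

p, k = 0.3, 1.5   # hypothetical success probability and interval half-width (in sigmas)

for n in [10, 100, 1_000, 10_000]:
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    lo, hi = math.ceil(mu - k * sigma), math.floor(mu + k * sigma)
    exact = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)   # exact P(lo <= X <= hi)
    approx = norm.cdf(k) - norm.cdf(-k)                     # Phi(k) - Phi(-k)
    print(f"n = {n:>6}:  exact = {exact:.5f},  normal approx = {approx:.5f}")
```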

Now recall how we defined Z as the standardized X:

The standardized X (Image by Author)

And thus we also have the following:

(Image by Author)

While formulating his theorem, De Moivre defined the limits x_1 and x_2 as follows:

(Image by Author)

Substituting these values of x_1 and x_2 in the previous set of equations, we get:

(Image by Author)

And therefore, De Moivre showed that for very large n:

De Moivre’s Theorem (Image by Author)

Remember, what De Moivre really wanted was to approximate the probability on the L.H.S. of Bernoulli’s theorem:

Bernoulli’s Theorem (Image by Author)

Which he succeeded in doing by making the following simple substitutions:

Image by Author

Which produces the following asymptotic equality:

De Moivre’s approximation for Bernoulli’s theorem (Image by Author)

In a single elegant stroke, De Moivre showed how to approximate the probability in Bernoulli’s theorem for large sample sizes. And large sample sizes are what Bernoulli’s theorem is all about. There is, however, some subtext to De Moivre’s achievement: the integral on the R.H.S. does not have a closed form, and De Moivre approximated it using an infinite series.
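A simple modern stand-in for such a series-based evaluation (in the spirit of De Moivre’s approach, though not his actual series) is to integrate the Taylor expansion of exp(−z²/2) term by term:

```python
import math

def normal_area(k: float, terms: int = 40) -> float:
    """Area under the standard normal PDF from -k to +k, computed by
    integrating the Taylor series of exp(-z^2/2) term by term:
    integral from 0 to k of exp(-z^2/2) dz
        = sum over m of (-1)^m * k^(2m+1) / (m! * 2^m * (2m+1))."""
    s = sum(
        (-1) ** m * k ** (2 * m + 1) / (math.factorial(m) * 2**m * (2 * m + 1))
        for m in range(terms)
    )
    return 2.0 * s / math.sqrt(2.0 * math.pi)

print(normal_area(1.0))   # ~0.6827: the familiar 68% of probability within one sigma
print(normal_area(2.0))   # ~0.9545: about 95% within two sigmas
```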

An illustration of De Moivre’s Theorem

Suppose there are exactly three times as many black tickets as white tickets in the urn. So the true fraction of black tickets, p, is 3/4. Suppose also that you draw a random sample, with replacement, of 1000 tickets. Since p = 0.75, the expected number of black tickets is np = 750. Suppose the number of black tickets you observe in the sample is 789. What is the probability of drawing such a random sample?

Let’s set out the facts:

(Image by Author)

We wish to find out:

P(750 − 39 ≤ X_bar_n ≤ 750 + 39)

We’ll use De Moivre’s Theorem to find this probability. As we know, the theorem can be stated as follows:

De Moivre’s approximation for Bernoulli’s theorem (Image by Author)

We know that n=1000, p=0.75, X_bar_n=789, and δ=39. We can find k as follows:

(Image by Author)

Plugging in all the values:

Application of De Moivre’s theorem (Image by Author)

In approximately 99.56% of random samples of size 1000 tickets each, the number of black tickets will lie between 711 and 789.
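Here is a quick sketch to verify these numbers (assuming SciPy), comparing De Moivre’s normal approximation against the exact binomial probability, the same quantity we computed with rational arithmetic earlier in the article:

```python
from math import sqrt
from scipy.stats import binom, norm

n, p, delta = 1000, 0.75, 39
mu = n * p                       # 750.0
sigma = sqrt(n * p * (1 - p))    # ~13.693
k = delta / sigma                # ~2.848

approx = norm.cdf(k) - norm.cdf(-k)                                    # De Moivre's approximation
exact = binom.cdf(mu + delta, n, p) - binom.cdf(mu - delta - 1, n, p)  # exact P(711 <= X <= 789)
print(f"normal approximation: {approx:.4f}")   # ~0.9956
print(f"exact binomial:       {exact:.4f}")    # ~0.996, very close to the approximation
```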

The year 1733 marks the first appearance of the normal distribution curve in the statistical literature. De Moivre had essentially discovered the normal distribution without realizing the far-reaching use of his discovery. Indeed, he seems to have paid no attention to the curve in his writings beyond using it as a way to approximate binomial probabilities for large sample sizes.

Notice another thing about De Moivre’s theorem. De Moivre set the limits on X as follows:

x1 and x2 (Image by Author)

Since X ~ Binomial(n,p), the standard deviation of X is the square root of npq. Without naming it ‘standard deviation’, De Moivre effectively defined the bounds on the sample sum to lie within k standard deviations on either side of the expected value. Thus, De Moivre had proved the following important property:

If X ~ Binomial(n, p), then for large n, the probability of X lying within k standard deviations of the expected value np can be approximated by the area under the standard normal curve from -k to +k.

Abraham De Moivre’s achievements, and his failings

As per De Moivre’s theorem:

De Moivre’s theorem (Image by Author)

De Moivre came tantalizingly close to generalizing his result to a form that is now accepted as the Central Limit Theorem. But it appears he was completely focused on improving Jacob Bernoulli’s approximation of the binomial probabilities. If De Moivre sensed the gleam of the CLT’s machinery shining through his work, he certainly didn’t betray any awareness of it. In fact, many scientists don’t even credit De Moivre with the Central Limit Theorem.

The Doctrine of Chances. This edition was published in 1756 (Public domain)

The Doctrine of Chances, whose second (1738) edition folded in his 1733 result on approximating the binomial, is assuredly De Moivre’s pièce de résistance.

De Moivre had introduced the world to the Normal curve.

He showed how to approximate the successive terms of the Binomial expansion using the area under the Normal Curve.

And he showed how the standard deviation (without calling it by that name) can be used to specify the error between the observed and the true value of the mean.

In three giant steps, De Moivre brought us, and himself, incredibly close to calculating inverse probability and to formulating the Central Limit Theorem.

And here, De Moivre stopped. Just like Jacob Bernoulli four decades earlier, De Moivre’s research on this topic came to an inexplicable halt.

De Moivre showed how to approximate the binomial probability P(X|n,p) using the normal distribution, but he did not even attempt to invert the result to get P(p|n,X). He surely must have known that his approximation of P(X|n,p) was useful only in the narrow case where you know (or assume) the population value of the parameter (in his case, the true fraction p of black tickets in the urn). As you must have noticed in the illustration of De Moivre’s theorem, you could approximate P(x_1 ≤ X ≤ x_2) only because you could calculate the limits z_1 and z_2 on the integral. And you could calculate z_1 and z_2 only because you somehow magically knew (or in this case assumed) the true fraction p of black tickets in the urn to be 0.75. Which is to say, you cannot use De Moivre’s theorem in most practical inference settings.

Put that way, it’s tempting to take a withering view of De Moivre’s contributions. But you must resist this temptation, because this is precisely how science progresses: incrementally, one or a few discoveries at a time, always building upon prior work, until someone finally closes the gap to the goal and runs home with some grand, career-altering achievement. That someone was to be Pierre-Simon Laplace.

Legacy

De Moivre, despite his many achievements and his acquaintance with the who’s who of British science, never could land a job as a professor in Britain. For much of his long life, he seems to have continued tutoring students in private settings. His sustenance came primarily from tutoring and from the sales of his books, and by all accounts it was barely enough to eke out a living. One can almost picture a modern-day De Moivre muttering under his breath in pitiful self-disdain, “If I am so bright, why ain’t I rich?” Rich or not, De Moivre was unquestionably brilliant. And his work on probability did not go unnoticed.

Abraham De Moivre died in 1754 at the ripe old age of 87. But well before his death, a formidable legion of 18th-century mathematicians all across Europe had begun to be influenced by his work, either directly or indirectly. In Britain, it was Thomas Simpson (1710–1761), and the home-schooled Presbyterian minister Thomas Bayes (1702–1761) of Bayesian inference fame. In Mulhouse (now in France), it was Johann Heinrich Lambert (1727–1777). In France, it was the highly influential thinker Marquis de Condorcet (1743–1794), before he was imprisoned and then allegedly rubbed out while in prison during France’s Reign of Terror. In Switzerland, it was Daniel Bernoulli (1700–1782), a nephew of Jacob’s. In Italy, it was Joseph-Louis Lagrange (1736–1813). And in France, it was the great Pierre-Simon Laplace (1749–1827), who not only survived the French Revolution and the ensuing Reign of Terror but also every single war started by or waged against Napoleon Bonaparte.

All these scientists attacked the problem of inverse probability from various angles and with diverse motivations.

In a landmark piece of work published posthumously in 1764, and to the complete ignorance of scientists outside his native England, Thomas Bayes (who as a child may well have been home-tutored by De Moivre in London) came within a hair’s breadth of nailing the solution to inverse probability. A decade later in France, and seemingly without harboring a grain of awareness of Bayes’s breakthrough, Laplace independently broke through to the goal on inverse probability. But it would be yet another three decades before Laplace would introduce the world to the Central Limit Theorem.

In my next article, I’ll describe Laplace’s brilliant approach for calculating inverse probability and his discovery of the Central Limit Theorem. I’ll also show how to use the CLT to build a confidence interval around the unknown population mean.

References and Copyrights

Books and Papers

Bernoulli, J. (2005) [1713]. On the Law of Large Numbers, Part Four of Ars Conjectandi (English translation by Oscar Sheynin). Berlin: NG Verlag. ISBN 978-3-938417-14-0.

Seneta, E. (2013). A Tricentenary history of the Law of Large Numbers. Bernoulli 19(4), 1088–1121. https://doi.org/10.3150/12-BEJSP12

Fischer, H. (2010). A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer Science & Business Media.

Shafer, G. (1996). The significance of Jacob Bernoulli’s Ars Conjectandi for the philosophy of probability today. Journal of Econometrics 75(1), 15–32. https://doi.org/10.1016/0304-4076(95)01766-6

Polasek, W. (2000). The Bernoullis and the origin of probability theory: Looking back after 300 years. Resonance 5, 26–42. https://doi.org/10.1007/BF02837935

Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press.

Hald, A. (2007). A History of Parametric Statistical Inference from Bernoulli to Fisher, 1713–1935. Springer.

Barnard, G. A., and Bayes, T. (1958). Studies in the History of Probability and Statistics: IX. Thomas Bayes’s Essay Towards Solving a Problem in the Doctrine of Chances. Biometrika 45(3/4), 293–315. https://doi.org/10.2307/2333180

Images and Videos

All images and videos in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image or video.

Thanks for reading! If you liked this article, please follow me to receive tips, how-tos and programming advice on regression and time series analysis.
