The White Cliffs of Dover
The White Cliffs of Dover (CC BY-SA 3.0)

A Deep Dive into the Science of Statistical Expectation

How we come to expect something, what it means to expect anything, and the math that gives rise to the meaning.

Towards Data Science
Jun 17, 2023

It was the summer of 1988 when I stepped onto a boat for the first time in my life. It was a passenger ferry from Dover, England to Calais, France. I didn’t know it then, but I was catching the tail end of the golden era of Channel crossings by ferry. This was right before budget airlines and the Channel Tunnel almost kiboshed what I still think is the best way to make that journey.

I expected the ferry to look like one of the many boats I had seen in children’s books. Instead, what I came upon was an impossibly large, gleaming white skyscraper with small square windows. And the skyscraper appeared to be resting on its side for some baffling reason. From my viewing angle on the dock, I couldn’t see the ship’s hull and funnels. All I saw was its long, flat, windowed exterior. I was looking at a horizontal skyscraper.

Passenger ferry
Photo by Martin on Unsplash

Thinking back, it’s amusing to recast my experience in the language of statistics. My brain had computed the expected shape of a ferry from the data sample of boat pictures I had seen. But my sample was hopelessly unrepresentative of the population which made the sample mean equally unrepresentative of the population mean. I was trying to decode reality using a heavily biased sample mean.

Seasickness

This trip across the Channel was also the first time I got seasick. They say when you get seasick you should go out onto the deck, take in the fresh, cool, sea breeze and stare at the horizon. The only thing that really works for me is to sit down, close my eyes, and sip my favorite soda until my thoughts drift slowly away from the harrowing nausea roiling my stomach. By the way, I am not drifting slowly away from the topic of this article. I’ll get right into the statistics in a minute. In the meantime, let me explain my understanding of why you get sick on a boat so that you’ll see the connection to the topic at hand.

On most days of your life, you are not getting rocked about on a ship. On land, when you tilt your body to one side, your inner ears and every muscle in your body tell your brain that you are tilting to one side. Yes, your muscles talk to your brain too! Your eyes eagerly second all this feedback and you come out just fine. But on a ship, all hell breaks loose on this affable pact between eye and ear.

On a ship, when the sea makes the ship tilt, rock, sway, roll, drift, bob, or any of the other things, what your eyes tell your brain can be remarkably different than what your muscles and inner ear tell your brain. Your inner ear might say, “Watch out! You are tilting left. You should adjust your expectation of how your world will appear.” But your eyes are saying, “Nonsense! The table I am sitting at looks perfectly level to me, as does the plate of food resting upon it. The picture on the wall of that thing that is screaming also appears straight and level. Do not listen to the ear.”

The Scream
The Scream (public domain)

Your eyes could report something even more confusing to your brain, such as “Yeah, you are tilting alright. But the tilt is not as significant or rapid as your overzealous inner ears might lead you to believe.”

It’s as if your eyes and your inner ears are each asking your brain to create two different expectations of how your world is about to change. Your brain obviously cannot do that. It gets confused. And for reasons buried in evolution your stomach expresses a strong desire to empty its contents.

Let’s try to explain this wretched situation by using the framework of statistical reasoning. This time, we’ll use a little bit of math to aid our explanation.

Should you expect to get seasick? Getting into the statistics of seasickness

Let’s define a random variable X that takes two values: 0 and 1. X is 0 if the signals from your eyes don’t agree with the signals from your inner ears. X is 1 if they do agree:

The random variable X
The random variable X (Image by Author)

In theory, each value of X ought to carry a certain probability P(X=x). The probabilities P(X=0) and P(X=1) together constitute the Probability Mass Function of X. We state it as follows:

PMF of X
PMF of X (Image by Author)

The overwhelming majority of the time, the signals from your eyes will agree with the signals from your inner ears. So p is almost equal to 1, and (1 — p) is a really, really tiny number.

Let’s hazard a wild guess about the value of (1 — p). We’ll use the following line of reasoning to arrive at an estimate: According to the United Nations, the average life expectancy of humans at birth in 2023 is approximately 73 years. In seconds, that corresponds to 73 × 365 × 86,400 = 2,302,128,000 seconds (about 2.3 billion). Now suppose an average individual experiences seasickness for a total of roughly 28,000 seconds (a little under 8 hours) over their lifetime. Let’s not quibble about the 28,000 seconds. It is a wild guess, remember? So, 28,000 seconds gives us a working estimate of (1 — p) of 28000/2302128000 = 0.0000121626, and p = (1 — 0.0000121626) = 0.9999878374. So during any given second of the average person’s life, the unconditional probability of their experiencing seasickness is only 0.0000121626.

With these probabilities, we’ll run a simulation lasting 1 billion seconds in the lifetime of a certain John Doe (JD for short). That’s a bit less than half of JD’s simulated lifespan. JD prefers to spend most of this time on solid ground. He takes the occasional sea cruise, on which he often gets seasick. We’ll simulate whether JD experiences seasickness during each of the 1 billion seconds of the simulation. To do so, we’ll conduct 1 billion trials of a Bernoulli random variable having probabilities p and (1 — p). The outcome of each trial will be 1 if JD does not get seasick, or 0 if JD gets seasick. Upon conducting this experiment, we’ll get 1 billion outcomes. You too can run this simulation using the following Python code:

import numpy as np

# p is the probability that the eye and inner-ear signals agree,
# i.e. that JD is NOT seasick during any given second
p = 0.9999878374
num_trials = 1000000000

# 1 billion Bernoulli trials: 1 = not seasick, 0 = seasick
outcomes = np.random.choice([0, 1], size=num_trials, p=[1 - p, p])

Let’s count the number of outcomes with value 1 (= not seasick) and value 0 (= seasick):

num_outcomes_in_which_not_seasick = sum(outcomes)
num_outcomes_in_which_seasick = num_trials - num_outcomes_in_which_not_seasick

We’ll print these counts. When I printed them, I got the following values. You may get slightly differing results each time you run your simulation:

num_outcomes_in_which_not_seasick= 999987794
num_outcomes_in_which_seasick= 12206

We can now calculate if JD should expect to feel seasick during any one of those 1 billion seconds.

The expectation is calculated as the weighted average of the two possible outcomes: one and zero, the weights being the frequencies of the two outcomes. So let’s perform this calculation:

Expected value of the outcome
Expected value of the outcome (Image by Author)
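Continuing from the simulation above, here is that weighted average in code; expected_outcome is a name I’m introducing just for this sketch:

# Weighted average of the outcomes 1 (not seasick) and 0 (seasick),
# weighted by how often each outcome occurred in the simulation
expected_outcome = (1 * num_outcomes_in_which_not_seasick +
                    0 * num_outcomes_in_which_seasick) / num_trials
print(expected_outcome)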

The expected outcome is 0.999987794 which is practically 1.0. The math is telling us that during any randomly chosen second in the 1 billion seconds in JD’s simulated existence, JD should not expect to get seasick. The data seems to almost forbid it.

Now let’s play with the above formula a bit. We’ll start by rearranging it as follows:

Expected value of the outcome
Expected value of the outcome (Image by Author)

When rearranged in this manner, we see a delightful sub-structure emerging. The ratios in the two brackets are the probabilities associated with the two outcomes. Specifically, they are the sample probabilities, because we calculated them from our 1 billion strong data sample rather than from the population. Having said that, the values 0.999987794 and 0.000012206 should be pretty close to the population values of p and (1 — p) respectively.

By plugging in the probabilities, we can restate the formula for expectation as follows:

Expected value of X
Expected value of X (Image by Author)

Notice that we used the notation for expectation, which is E(). Since X is a Bernoulli(p) random variable, the above formula also shows us how to compute the expected value of a Bernoulli random variable. The expected value of X ~ Bernoulli(p) is simply p.
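Spelled out in plain text, the formula in the figure works out to:

E(X) = 1 · P(X=1) + 0 · P(X=0) = 1 · p + 0 · (1 — p) = p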

Of sample means, population means, and a word to make you sound cool

E(X) is also called the population mean, denoted by μ, because it uses the probabilities p and (1 — p) which are the population level values of probability. These are the ‘true’ probabilities that you will observe should you have access to the entire population of values, which is practically never. Statisticians use the word ‘asymptotic’ while referring to these and similar measures. They are referred to as asymptotic because their meaning is significant only when something, such as the sample size, approaches infinity or the size of the entire population. Now here’s the thing: I think people just like to say ‘asymptotic’. And I also think it’s a convenient cover for the troublesome truth that you can never measure the exact value of anything.

On the bright side, the impossibility of getting your hands on the population is ‘the great leveler’ in the field of statistical science. Whether you are a freshly minted graduate or a Nobel laureate in Economics, that door to the ‘population’ remains firmly closed for you. As a statistician, you are relegated to working with the sample, whose shortcomings you must suffer in silence. But it’s really not as bad a state of affairs as it sounds. Imagine what would happen if you knew the exact values of things. If you had access to the population. If you could calculate the mean, the median, and the variance with bullseye accuracy. If you could foretell the future with pinpoint precision. There would be little need to estimate anything. Great big branches of statistics would cease to exist. The world would need hundreds of thousands fewer statisticians, not to mention data scientists. Imagine the impact on unemployment, on the world economy, on world peace…

But I digress. My point is, if X is Bernoulli(p), then to calculate E(X), you can’t use the actual population values of p and (1 — p). Instead, you must make do with estimates of p and (1 — p). You will calculate these estimates not from the entire population (no chance of doing that) but, more often than not, from a modest sized data sample. And so, with much regret, I must inform you that the best you can do is get an estimate of the expected value of the random variable X. Following convention, we denote the estimate of p as p_hat (p with a little cap or hat on it), and we denote the estimated expected value as E_cap(X).

Estimated expectation of X
Estimated expectation of X (Image by Author)

Since E_cap(X) uses sample probabilities, it’s called the sample mean. It is denoted by x̄ or ‘x bar’. It’s an x with a bar placed on its head.

The population mean and the sample mean are the Batman and Robin of statistics.

A great deal of Statistics is devoted to calculating the sample mean and to using the sample mean as an estimate of the population mean.

And there you have it — the sweeping expanse of Statistics summed up in a single sentence. 😉

Diving into the deep end of expectation

Our thought experiment with the Bernoulli random variable has been instructive in that it has unraveled the nature of expectation to some extent. The Bernoulli variable is a binary variable, and it was simple to work with. However, the random variables we often work with can take on many different values. Fortunately, we can easily extend the concept and the formula for expectation to many-valued random variables. Let’s illustrate with another example.

The expected value of a multi-valued, discrete random variable

The following table shows a subset of a dataset of information about 205 automobiles. Specifically, the table displays the number of cylinders within the engine of each vehicle.

Cylinder counts of automobiles
Cylinder counts of automobiles (Data source: UCI machine learning dataset repository under (CC BY 4.0) license) (Image by Author)

Let Y be a random variable that contains the number of cylinders of a randomly chosen vehicle from this dataset. We happen to know that the dataset contains vehicles with cylinder counts of 2, 3, 4, 5, 6, 8, or 12. So the range of Y is the set E = {2, 3, 4, 5, 6, 8, 12}.

We’ll group the data rows by cylinder count. The table below shows the grouped counts. The last column indicates the corresponding sample probability of occurrence of each count. This probability is calculated by dividing the group size by 205:

Frequency distribution of cylinder counts

Using the sample probabilities, we can construct the Probability Mass Function P(Y) for Y. If we plot it against Y, it looks like this:

PMF of Y
PMF of Y (Image by Author)

If a randomly chosen vehicle rolls out in front of you, what will you expect its cylinder count to be? Just by looking at the PMF, the number you’ll want to guess is 4. However, there’s cold, hard math backing this guess. Similar to the Bernoulli X, you can calculate the expected value of Y as follows:

The expected value of Y
The expected value of Y (Image by Author)

If you calculate the sum, it amounts to 4.38049 which is pretty close to your guess of 4 cylinders.

Since the range of Y is the set E = {2, 3, 4, 5, 6, 8, 12}, we can express this sum as a summation over E as follows:

Formula for the expected value of the discrete random variable Y
Formula for the expected value of the discrete random variable Y (Image by Author)

You can use the above formula to calculate the expected value of any discrete random variable whose range is the set E.
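Here is a minimal sketch of that formula in Python. The DataFrame and its num_cylinders column are hypothetical stand-ins for illustration; the real UCI dataset stores cylinder counts in its own format:

import pandas as pd

# Hypothetical stand-in: one row per vehicle, with a numeric cylinder count
autos = pd.DataFrame({'num_cylinders': [4, 4, 6, 4, 5, 6, 4, 8, 4, 12]})

# Sample PMF of Y: the relative frequency of each cylinder count
pmf = autos['num_cylinders'].value_counts(normalize=True)

# E(Y) = sum over all y in E of y * P(Y = y)
expected_cylinders = sum(y * p for y, p in pmf.items())

print(expected_cylinders)
print(autos['num_cylinders'].mean())   # same value, computed as a plain average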

The expected value of a continuous random variable

If you are dealing with a continuous random variable, the situation changes a bit, as described below.

Let’s return to our dataset of vehicles. Specifically, let’s look at the lengths of vehicles:

Lengths of automobiles (Data source: UCI machine learning dataset repository under (CC BY 4.0) license) (Image by Author)

Suppose Z holds the length in inches of a randomly selected vehicle. The range of Z is no longer a discrete set of values. Instead, it’s a subset of the set of real numbers. Since lengths are always positive, it’s the set of all positive real numbers, denoted ℝ>0.

Since the set of all positive real numbers contains an (uncountably) infinite number of values, it’s meaningless to assign a positive probability to each individual value of Z. If you don’t believe me, consider a quick thought experiment: imagine assigning some positive probability to every possible value of Z. You’ll find that the probabilities sum to infinity, which is absurd. So the point probability P(Z=z) is zero for every single value of Z and is of no direct use to us. Instead, you must work with the Probability Density Function f(Z=z), which assigns a probability density to different values of Z.

We previously discussed how to calculate the expected value of a discrete random variable using the Probability Mass Function.

Formula for the expected value of the discrete random variable Y
Formula for the expected value of the discrete random variable Y (Image by Author)

Can we repurpose this formula for continuous random variables? The answer is yes. To know how, imagine yourself with an electron microscope.

Take that microscope and focus it on the range of Z, which is the set of all positive real numbers (ℝ>0). Now zoom in on an impossibly tiny interval (z, z+δz] within this range. At this microscopic scale, you might observe that, for all practical purposes (now, isn’t that a helpful term), the probability density f(Z=z) is constant across δz. Consequently, the product of f(Z=z) and δz can approximate the probability that a randomly selected vehicle’s length falls within the half-open interval (z, z+δz].

Armed with this approximate probability, you can approximate the expected value of Z as follows:

An approximate evaluation of E(Z) when Z is continuous
An approximate evaluation of E(Z) when Z is continuous (Image by Author)

Notice how we pole vaulted from the formula for E(Y) to this approximation. To get to E(Z) from E(Y), we did the following:

  • We replaced the discrete y_i with the real-valued z_i.
  • We replaced P(Y=y) which is the PMF of Y, with f(Z=z)δz which is the approximate probability of finding z in the microscopic interval (z, z+δz].
  • Instead of summing over the discrete, finite range of Y, which is E, we summed over the continuous, infinite range of Z, which is ℝ>0.
  • Finally, we replaced the equals sign with the approximation sign. And therein lies our guilt. We cheated. We sneaked in f(Z=z)δz, which is only an approximation of the probability that Z lies in the interval (z, z+δz]. We cheated because for a continuous Z there is no exact point probability P(Z=z) we could have used instead. We must make amends for this transgression, which is exactly what we’ll do next.

We now execute our master stroke, our pièce de résistance, and in doing so, we redeem ourselves.

Since ℝ>0 is the set of positive real numbers, there are an infinite number of microscopic intervals of size δz in ℝ>0. Therefore, the summation over ℝ>0 is a summation over an infinite number of terms. This fact presents us with the perfect opportunity to replace the approximate summation with an exact integral, as follows:

The expected value of Z
The expected value of Z (Image by Author)

In general, if Z’s range is the real valued interval [a, b], we set the limits of the definite integral to a and b instead of 0 and ∞.

If you know the PDF of Z and if the integral of z times f(Z=z) exists over [a, b], you will solve the above integral and get E(Z) for your troubles.

If Z is uniformly distributed over the range [a, b], its PDF is as follows:

PDF of Z ~ Uniform(a, b)
PDF of Z ~ Uniform(a, b) (Image by Author)

If you set a=1 and b=5,

f(Z=z) = 1/(5–1) = 0.25.

The probability density is a constant 0.25 from Z=1 to Z=5 and it is zero everywhere else. Here’s what the PDF of Z looks like:

PDF of Z ~ Uniform(1, 5)
PDF of Z ~ Uniform(1, 5) (Image by Author)

It’s basically a continuous flat, horizontal line from (1,0.25) to (5,0.25) and it’s zero everywhere else.

In general, if the probability density of Z is uniformly distributed over the interval [a, b], the PDF of Z is 1/(b-a) over [a, b], and zero elsewhere. You can calculate E(Z) using the following procedure:

Procedure for calculating the expected value of a continuous random variable that is uniformly distributed over the interval [a, b]
Procedure for calculating the expected value of a continuous random variable that is uniformly distributed over the interval [a, b] (Image by Author)
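Restating the procedure in the figure in plain text:

E(Z) = ∫ z · 1/(b − a) dz, evaluated from a to b
     = (b² − a²) / (2 · (b − a))
     = (a + b) / 2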

If a=1 and b=5, the mean of Z ~ Uniform(1, 5) is simply (1+5)/2 = 3. That agrees with our intuition. If each one of the infinitely many values between 1 and 5 is equally likely, we’d expect the mean to work out to the simple average of 1 and 5.
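If you want to check that result numerically rather than analytically, here is a small sketch using scipy; the function name uniform_pdf and the choice a=1, b=5 are mine, purely for illustration:

from scipy.integrate import quad

a, b = 1.0, 5.0

def uniform_pdf(z):
    # PDF of Z ~ Uniform(a, b): 1/(b - a) inside [a, b], zero elsewhere
    return 1.0 / (b - a) if a <= z <= b else 0.0

# E(Z) = definite integral over [a, b] of z * f(z) dz
expected_z, _ = quad(lambda z: z * uniform_pdf(z), a, b)
print(expected_z)   # 3.0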

Now I hate to deflate your spirits but in practice, you are more likely to spot double rainbows landing on your front lawn than come across continuous random variables for which you will use the integral method to calculate their expected value.

Double rainbow
A double rainbow (CC BY-SA 2.0)

You see, delightful looking PDFs that can be integrated to get the expected value of the corresponding variables have a habit of ensconcing themselves in end-of-the-chapter exercises of college textbooks. They are like house cats. They don’t ‘do outside’. But as a practicing statistician, ‘outside’ is where you live. Outside, you will find yourself staring at data samples of continuous values like lengths of vehicles. To model the PDF of such real-world random variables, you are likely to use one of the well-known continuous functions such as the Normal, the Log-Normal, the Chi-square, the Exponential, the Weibull and so on, or a mixture distribution, i.e., whatever seems to best fit your data.

Here are a couple of such distributions:

The PDFs and the means of continuous random variables that are Normally distributed and Chi-square distributed
The PDFs and the expected values of continuous random variables that are Normally distributed and Chi-square distributed (Image by Author)

For many commonly used PDFs, someone has already taken the trouble to derive the mean of the distribution by integrating ( x times f(x) ) just like we did with the Uniform distribution. Here are a couple of such distributions:

The PDFs and the means of continuous random variables that are Exponentially distributed and Gamma distributed
The PDFs and the expected values of continuous random variables that are Exponentially distributed and Gamma distributed

Finally, in some situations, actually in many situations, real life datasets exhibit patterns that are too complex to be modeled by any one of these distributions. It’s like when you come down with a virus that mobs you with a horde of symptoms. To help you overcome them, your doctor puts you on a drug cocktail, with each drug having a different strength, dosage, and mechanism of action. When you are mobbed with data that exhibits many complex patterns, you must deploy a small army of probability distributions to model it. Such a combination of different distributions is known as a mixture distribution. A commonly used mixture is the potent Gaussian Mixture, which is a weighted sum of the Probability Density Functions of several normally distributed random variables, each with a different combination of mean and variance.
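For the curious, here is a hedged sketch of fitting such a mixture with scikit-learn. The two-component choice and the toy array of lengths are assumptions made purely for illustration, not values taken from the autos dataset:

import numpy as np
from sklearn.mixture import GaussianMixture

# A hypothetical, made-up sample of vehicle lengths in inches (illustration only)
lengths = np.array([141.1, 155.9, 158.8, 165.3, 170.7, 171.2,
                    176.8, 177.3, 186.7, 189.0, 192.7, 197.0]).reshape(-1, 1)

# A Gaussian mixture is a weighted sum of normal PDFs, each with its own mean and variance
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(lengths)

print(gmm.weights_)       # the mixture weights
print(gmm.means_)         # the means of the two normal components
print(gmm.covariances_)   # the variances of the two normal components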

Given a sample of real valued data, you may find yourself doing something dreadfully simple: you will take the average of the continuous valued data column and anoint it as the sample mean. For example, if you calculate the average length of automobiles in the autos dataset, it comes to 174.04927 inches, and that’s it. All done. But that is not it, and all is not done. For there is one question you still have to answer.

How good is your sample mean? Getting a feel for its accuracy

How do you know how accurate an estimate of the population mean your sample mean is? While gathering the data, you may have been unlucky, or lazy, or ‘data-constrained’ (which is often an excellent euphemism for good-old laziness). Either way, you are staring at a sample that is not proportionately random. It does not proportionately represent the different characteristics of the population. Let’s take the example of the autos dataset: you may have collected data for a large number of medium-sized cars, and for too few large cars. And stretch-limos may be completely missing from your sample. As a result, the mean length you calculate will be excessively biased toward the mean length of only the medium-sized cars in the population. Like it or not, you are now working on the belief that practically everyone drives a medium-sized car.

To thine own self be true

If you’ve gathered a heavily biased sample and you don’t know it or you don’t care about it, then may heaven help you in your chosen career. But if you are willing to entertain the possibility of bias and you have some clues on what kind of data you may be missing (e.g. sports cars), then statistics will come to your rescue with powerful mechanisms to help you estimate this bias.

Unfortunately, no matter how hard you try you will never, ever, be able to gather a perfectly balanced sample. It will always contain biases because the exact proportions of various elements within the population remain forever inaccessible to you. Remember that door to the population? Remember how the sign on it always says ‘CLOSED’?

Your most effective course of action is to gather a sample that contains roughly the same fractions of all the things that exist in the population — the so-called well-balanced sample. The mean of this well-balanced sample is the best possible sample mean that you can set sail with.

But the laws of nature don’t always take the wind out of statisticians’ sailboats. There is a magnificent property of nature expressed in a theorem called the Central Limit Theorem (CLT). You can use the CLT to determine how well your sample mean estimates the population mean.

The CLT is not a silver bullet for dealing with badly biased samples. If your sample predominantly consists of mid-sized cars, you have effectively redefined your notion of the population. If you are intentionally studying only mid-sized cars, you are absolved. In this situation, feel free to use the CLT. It will help you estimate how close your sample mean is to the population mean of mid-sized cars.

On the other hand, if your existential purpose is to study the entire population of vehicles ever produced, but your sample contains mostly mid-sized cars, you have a problem. To the student of statistics, let me restate that in slightly different words. If your college thesis is on how often pets yawn but your recruits are 20 cats and your neighbor’s Poodle, then CLT or no CLT, no amount of statistical wizardry will help you assess the accuracy of your sample mean.

The essence of the CLT

A comprehensive understanding of the CLT is the stuff of another article, but the essence of what it states is the following:

If you draw a random sample of data points from the population and calculate the mean of the sample, and then repeat this exercise many times, you’ll end up with…many different sample means. Well, duh! But something astonishing happens next. If you plot a frequency distribution of all these sample means, you’ll see that they are approximately normally distributed, and the approximation gets better as the size of each sample grows. What’s more, the mean of this normal distribution is always the mean of the population you are studying. It is this eerily captivating facet of our universe’s personality that the Central Limit Theorem describes using (what else?) the language of math.

The sample mean length of 174.04927 inches marked off on a normally distributed Z that has a hypothetical population mean of 180 inches (Image by Author)
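Here is a small simulation sketch of that claim. The exponential population, its mean of 10, and the sample size of 50 are all arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(42)

population_mean = 10.0   # mean of a decidedly non-normal (exponential) population
sample_size = 50
num_samples = 10000

# Draw many samples from the population and record each sample's mean
sample_means = np.array([
    rng.exponential(scale=population_mean, size=sample_size).mean()
    for _ in range(num_samples)
])

# The sample means are approximately normally distributed
# and they cluster around the population mean
print(sample_means.mean())   # close to 10.0
print(sample_means.std())    # close to population_mean / sqrt(sample_size)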

Let’s go over how to use the CLT. We’ll begin as follows:

Using the sample mean Z_bar from just one sample, we’ll state that the probability of the population mean μ lying in the interval [μ_low, μ_high] is (1 — α):

The lower and upper confidence bounds for the population mean
The lower and upper confidence bounds for the population mean (Image by Author)

You may set α to any value from 0 to 1. For instance, if you set α to 0.05, you will get (1 — α) as 0.95, i.e. 95%.

And for this probability (1 — α) to hold true, the bounds μ_low and μ_high should be calculated as follows:

The lower and upper bounds for the population mean
The lower and upper bounds for the population mean (Image by Author)

In the above equations, we have already met Z_bar, α, μ_low, and μ_high. The remaining symbols deserve some explanation.

The variable s is the standard deviation of the data sample.

N is the sample size.

Now we come to z_α/2.

z_α/2 is a value you will read off the X-axis of the PDF of the standard normal distribution. The standard normal distribution is the PDF of a normally distributed continuous random variable that has a zero mean and a standard deviation of one. z_α/2 is the value on the X-axis of that distribution for which the area under the PDF lying to the left of that value is (1 — α/2). Here’s what this area looks like when you set α to 0.05:

Area under the PDF to the left of a certain value x on the X-axis. In this case, x=1.96
Area under the PDF to the left of a certain value x on the X-axis. In this case, x=1.96 (Image by Author)

The blue colored area is calculated as (1 — 0.05/2) = 0.975. Recall that the total area under any PDF curve is always 1.0.
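If you’d rather not read the value off a printed table, a one-line lookup with scipy gives the same number:

from scipy.stats import norm

alpha = 0.05
# z_(alpha/2): the point with 97.5% of the standard normal's area to its left
print(norm.ppf(1 - alpha / 2))   # ~1.96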

To summarize, once you have calculated the mean (Z_bar) from just one sample, you can build bounds around this mean such that the probability that the population mean lies within those bounds is a value of your choice.

Let’s reexamine the formulae for estimating these bounds:

The lower and upper bounds for the population mean
The lower and upper bounds for the population mean (Image by Author)

These formulae give us a couple of insights into the nature of the sample mean:

  1. As the standard deviation s of the sample increases, the value of the lower bound (μ_low) decreases, while that of the upper bound (μ_high) increases. This effectively moves μ_low and μ_high further apart from each other and away from the sample mean. Conversely, as the sample’s dispersion shrinks, μ_low moves closer to Z_bar from below, and μ_high moves closer to Z_bar from above. The interval bounds essentially converge on the sample mean from both sides. In effect, the width of the interval [μ_low, μ_high] is directly proportional to the sample standard deviation. If the sample is widely (or tightly) dispersed around its mean, the greater (or lesser) dispersion reduces (or increases) the reliability of the sample mean as an estimate of the population mean.
  2. Notice that the width of the interval is inversely proportional to the square root of the sample size (√N). Between two samples exhibiting similar variance, the larger sample will yield a tighter interval around its mean than the smaller one.

Let’s see how to calculate this interval for the automobiles dataset. We’ll calculate [μ_low, μ_high] such that there is a 95% chance that the population mean μ will lie within these bounds.

To get a 95% chance, we should set α to 0.05 so that (1 — α) = 0.95.

We know that Z_bar is 174.04927 inches.

N is 205 vehicles.

The sample standard deviation can be easily calculated. It is 12.33729 inches.

Next, we’ll work on z_α/2. Since α is 0.05, α/2 is 0.025. We want to find the value of z_α/2 i.e., z_0.025. This is the value on the X-axis of the PDF curve of the standard normal random variable, where the area under the curve is (1 — α/2) = (1 — 0.025) = 0.975. By referring to the table for the standard normal distribution, we find that this value corresponds to the area to the left of X=1.96.

Table containing the values from the CDF of the Standard Normal Distribution. Contains P(X ≤ x) for different values of X (Source: Wikipedia)

Plugging in all these values, we get the following bounds:

μ_low = Z_bar — ( z_α/2 · s/√N) = 174.04927 — (1.96 · 12.33729/√205) = 172.36039

μ_high = Z_bar + ( z_α/2 · s/√N) = 174.04927 + (1.96 · 12.33729/√205) = 175.73815

Thus, [μ_low, μ_high] = [172.36039 inches, 175.73815 inches]

There is a 95% chance that the population mean lies somewhere in this interval. Look at how tight this interval is: its width is only about 3.38 inches, and right in the middle of it lies the sample mean of 174.04927 inches. In spite of all the biases that may be present in the sample, our analysis suggests that the sample mean of 174.04927 inches is a remarkably good estimate of the unknown population mean.
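Here is the whole interval calculation as a short Python sketch. The numbers are the sample statistics quoted above; the variable names are mine:

import numpy as np
from scipy.stats import norm

z_bar = 174.04927    # sample mean length (inches)
s = 12.33729         # sample standard deviation (inches)
n = 205              # sample size
alpha = 0.05

z_crit = norm.ppf(1 - alpha / 2)      # ~1.96
margin = z_crit * s / np.sqrt(n)      # z_(alpha/2) * s / sqrt(N)

print(z_bar - margin, z_bar + margin)   # roughly 172.36 and 175.74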

Going beyond the first dimension: Expectation in a multi-dimensional sample space

So far, our discussion about expectation has been confined to a single dimension, but it needn’t be so. We can easily extend the concept of expectation to two, three, or higher dimensions. To calculate the expectation over a multi-dimensional space, all we need is a joint Probability Mass (or Density) Function that is defined over the N-dim space. A joint PMF or PDF takes multiple random variables as parameters and returns the probability of jointly observing those values.

Earlier in the article, we defined a random variable Y that represents the number of cylinders in a randomly chosen vehicle from the autos dataset. Y is your quintessential single dimensional discrete random variable and its expected value is given by the following equation:

Formula for the expected value of the discrete random variable Y
Expected value of a single dimensional discrete random variable (Image by Author)

Let’s introduce a new discrete random variable, X. The joint Probability Mass Function of X and Y is denoted by P(X=x_i, Y=y_j), or simply as P(X, Y). This joint PMF lifts us out of the cozy, one-dimensional space that Y inhabits, and deposits us into a more interesting 2-dimensional space. In this 2-D space, a single data point or outcome is represented by the tuple (x_i, y_j). If the range of X contains ‘p’ outcomes and the range of Y contains ‘q’ outcomes, the 2-D space will have (p x q) joint outcomes. We use the tuple (x_i, y_j) to denote each of these joint outcomes. To calculate E(Y) in this 2-D space, we must adapt the formula of E(Y) as follows:

The expected value of the discrete random variable Y over a 2-dimensional space
The expected value of the discrete random variable Y over a 2-dimensional space (Image by Author)

Notice that we are summing over all possible tuples (x_i, y_j) in the 2-D space. Let’s tease apart this sum into a nested summation as follows:

The expected value of the discrete random variable Y over a 2-dimensional space
The expected value of the discrete random variable Y over a 2-dimensional space (Image by Author)

In the nested sum, the inner summation adds up the products y_j · P(X=x_i, Y=y_j) over all values of y_j. Then, the outer sum repeats the inner sum for each value of x_i. Afterward, it collects all these individual sums and adds them up to compute E(Y).

We can extend the above formula to any number of dimensions by simply nesting the summations within each other. All you need is a joint PMF that is defined over the N-dimensional space. For instance, here’s how to extend the formula to 4-D space:

The formula for the expected value of the discrete random variable Y over a 4-dimensional space
The expected value of the discrete random variable Y over a 4-dimensional space (Image by Author)

Notice how we are always positioning the summation of Y at the deepest level. You may arrange the remaining summations in any order you want — you’ll get the same result for E(Y).

You may ask, why would you ever want to define a joint PMF and go bat-crazy working through all those nested summations? What does E(Y) mean when calculated over an N-dimensional space?

The best way to understand the meaning of expectation in a multi-dimensional space is to illustrate its use on real-world multi-dimensional data.

The data we’ll use comes from a certain boat which, unlike the one I took across the English Channel, tragically did not make it to the other side.

RMS Titanic departing Southampton on April 10, 1912
RMS Titanic departing Southampton on April 10, 1912 (Public domain)

The following figure shows some of the rows in a dataset of 887 passengers aboard the RMS Titanic:

The Titanic data set
The Titanic dataset (CC0)

The Pclass column represents the passenger’s cabin class with integer values of 1, 2, or 3. The Siblings/Spouses Aboard and Parents/Children Aboard columns record how many of the passenger’s siblings, spouses, parents, or children were aboard. In statistics, the 0/1 indicator variables we often derive from columns like these are commonly, and somewhat cruelly, referred to as dummy variables. There is nothing block-headed about them to deserve the disparaging moniker.

As you can see from the table, there are 8 variables that jointly identify each passenger in the dataset. Each of these 8 variables is a random variable. The task before us is three-fold:

  1. We’d want to define a joint Probability Mass Function over a subset of these random variables, and,
  2. Using this joint PMF, we’d want to illustrate how to compute the expected value of one of these variables over this multi-dimensional PMF, and,
  3. We’d like to understand how to interpret this expected value.

To simplify things, we’ll ‘bin’ the Age variable into bins of size 5 years and label the bins as 5, 10, 15, 20,…,80. For instance, a binned age of 20 will mean that the passenger’s actual age lies in the (15, 20] years interval. We’ll call the binned random variable Age_Range.

Once Age is binned, we’ll group the data by Pclass and Age_Range. Here are the grouped counts:

Frequency distribution of passengers by their Pclass and (binned) age
Frequency distribution of passengers by their cabin class and (binned) age (Image by Author)

The above table contains the number of passengers aboard the Titanic for each cohort (group) that is defined by the characteristics Pclass and Age_Range. Incidentally, cohort is yet another word (along with asymptotic) that statisticians downright worship. Here’s a tip: every time you want to say ‘group’, just say ‘cohort’. I promise you this, whatever it was that you were planning to blurt out will instantly sound ten times more significant. To illustrate: “Eight different cohorts of alcohol enthusiasts (excuse me, oenophiles) were given fake wine to drink and their reactions were recorded.” See what I mean?

To be honest, ‘cohort’ does carry a precise meaning that ‘group’ doesn’t. Still, it can be instructive to say ‘cohort’ once in a while and witness feelings of respect grow in your listeners’ faces.

At any rate, we’ll add another column to the table of frequencies. This new column will hold the probability of observing the particular combination of Pclass and Age_Range. This probability, P(Pclass, Age_Range), is the ratio of the frequency (i.e. the number in the Name column) to the total number of passengers in the dataset (i.e. 887).

Frequency distribution of passengers by their cabin class and (binned) age
Frequency distribution of passengers by their cabin class and (binned) age (Image by Author)

The probability P(Pclass, Age_Range) is the joint Probability Mass Function of the random variables Pclass and Age_Range. It gives us the probability of observing a passenger who is described by a particular combination of Pclass and Age_Range. For example, look at the row where Pclass is 3 and Age_Range is 25. The corresponding joint probability is 0.116122. That number tells us that roughly 12% of passengers in the 3rd class cabins of the Titanic were 20–25 years old.

As with the one-dimensional PMF, the joint PMF also sums up to a perfect 1.0 when evaluated over all combinations of values of its constituent random variables. If your joint PMF doesn’t sum up to 1.0, you should look closely at how you have defined it. There might be an error in its formula or worse, in the design of your experiment.

In the above dataset, the joint PMF does indeed sum up to 1.0. Feel free to take my word for it!
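Or, if you’d rather check it yourself, here is a hedged sketch of the binning, grouping, and sanity check in pandas. The DataFrame below is a tiny made-up stand-in; with the real dataset you’d load all 887 rows and use its actual passenger-class and age columns (the names 'Pclass' and 'Age' are assumptions):

import numpy as np
import pandas as pd

# Tiny made-up stand-in for the 887-row Titanic dataset (illustration only)
titanic = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2, 1, 3, 2, 3],
    'Age':    [22.0, 38.0, 26.0, 35.0, 35.0, 27.0, 54.0, 2.0, 14.0, 20.0],
})

# Bin Age into 5-year intervals (0, 5], (5, 10], ..., (75, 80], labeled by the upper edge
bin_edges = np.arange(0, 85, 5)
titanic['Age_Range'] = pd.cut(titanic['Age'], bins=bin_edges, labels=bin_edges[1:])

# Joint frequencies and the joint PMF of (Pclass, Age_Range)
joint_counts = titanic.groupby(['Pclass', 'Age_Range'], observed=True).size()
joint_pmf = joint_counts / len(titanic)

print(joint_pmf.sum())   # 1.0 (as long as no Age values are missing)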

To get a visual feel for what the joint PMF P(Pclass, Age_Range) looks like, you can plot it in 3 dimensions. In the 3-D plot, set the X and Y axes to Pclass and Age_Range respectively, and the Z axis to the probability P(Pclass, Age_Range). What you’ll see is a fascinating 3-D chart.

A 3-D plot of the joint PMF of Pclass and Age_Range
A 3-D plot of the joint PMF of Pclass and Age_Range (Image by Author)

If you look closely at the plot, you’ll notice that the joint PMF consists of three parallel plots, one for each cabin class on the Titanic. The 3-D plot brings out some of the demographics of the humanity aboard the ill-fated ocean liner. For instance, across all three cabin classes, it’s the 15 to 40 year old passengers that made up the bulk of the population.

Now let’s work on the calculation for E(Age_Range) over this 2-D space. E(Age_Range) is given by:

Expected value of Age_Range
Expected value of Age_Range (Image by Author)

We run the inner sum over all values of Age_Range: 5, 10, 15,…,80. We run the outer sum over all values of Pclass: [1, 2, 3]. For each combination of (Pclass, Age_Range), we pick the joint probability from the table. The expected value of Age_Range works out to 31.48252537 years, which corresponds to the binned value of 35. We can expect the ‘average’ passenger on the Titanic to be 30 to 35 years old.
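Continuing the pandas sketch from earlier, the nested summation could look like this (names assumed as before):

# E(Age_Range) = sum over Pclass of sum over Age_Range of age_range * P(Pclass, Age_Range)
expected_age_range = sum(
    float(age_range) * prob
    for (pclass, age_range), prob in joint_pmf.items()
)

# With the full 887-row dataset this works out to about 31.48 years
print(expected_age_range)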

If you take the mean of the Age_Range column in the Titanic dataset, you’ll arrive at exactly the same value: 31.48252537 years. So why not just take the average of the Age_Range column to get E(Age_Range)? Why build a Rube Goldberg machine of nested summations over an N-dimensional space only to arrive at the same value?

Rube Goldberg’s “self-operating napkin” machine (Public domain)

It’s because in some situations, all you’ll have is the joint PMF and the ranges of the random variables. In this instance, if you had only P(Pclass, Age_Range) and you knew the range of Pclass as [1,2,3], and that of Age_Range as [5,10,15,20,…,80], you can still use the nested summations technique to calculate E(Pclass) or E(Age_Range).

If the random variables are continuous, the expected value over a multi-dimensional space can be found using a multiple integral. For instance, if X, Y, and Z are continuous random variables and f(X,Y,Z) is the joint Probability Density Function defined over the 3-dimensional continuous space of tuples (x, y, z), the expected value of Y over this 3-D space is given in the following figure:

Expected value of the continuous random variable Y defined over a continuous 3-D space
Expected value of the continuous random variable Y defined over a continuous 3-D space (Image by Author)

Just as in the discrete case, you integrate first over the variable whose expected value you want to calculate, and then integrate over the rest of the variables.

A famous example demonstrating the application of the multiple-integral method for computing expected values exists at a scale that is too small for the human eye to perceive. I am referring to the wave function of quantum mechanics. The wave function is denoted as Ψ(x, y, z, t) in Cartesian coordinates or as Ψ(r, θ, ɸ, t) in polar coordinates. It’s used to describe the properties of seriously tiny things that enjoy living in really, really cramped spaces, like electrons in an atom. The wave function Ψ returns a complex number of the form A + jB, where A represents the real part and B represents the imaginary part. We can interpret the square of the absolute value, |Ψ|², as a probability density function over the three spatial coordinates (x, y, z), or (r, θ, ɸ), at each instant of time t. Specifically, for an electron in a Hydrogen atom, |Ψ|² multiplied by an infinitesimally tiny volume of space around (x, y, z), or around (r, θ, ɸ), gives the approximate probability of finding the electron in that tiny volume at time t. Knowing |Ψ|², we can integrate x·|Ψ|², y·|Ψ|², or z·|Ψ|² over all of space to calculate the expected location of the electron along the X, Y, or Z axis (or their polar equivalents) at time t.

Parting thoughts

I began this article with my experience with seasickness. And I wouldn’t blame you if you winced at the brash use of a Bernoulli random variable to model what is a remarkably complex and somewhat poorly understood human ordeal. My objective was to illustrate how expectation affects us, literally, at a biological level. One way to explain that ordeal was to use the cool and comforting language of random variables.

Starting with the deceptively simple Bernoulli variable, we swept our illustrative brush across the statistical canvas all the way to the magnificent, multi-dimensional complexity of the quantum wave function. Throughout, we sought to understand how expectation operates on discrete and continuous scales, in single and multiple dimensions, and at microscopic scales.

There is one more area in which expectation makes an immense impact. That area is conditional probability, in which one calculates the probability that a random variable X will take a value ‘x’ assuming that certain other random variables A, B, C, etc. have already taken values ‘a’, ‘b’, ‘c’. The probability of X conditioned upon A, B, and C is denoted as P(X=x|A=a,B=b,C=c) or simply as P(X|A,B,C). In all the formulae for expectation that we have seen, if you replace the probability (or probability density) with its conditional version, what you get are the corresponding formulae for conditional expectation. It is denoted as E(X|A=a,B=b,C=c) and it lies at the heart of the extensive fields of regression analysis and estimation. And that’s fodder for future articles!

Citations and Copyrights

Dataset

The automobile dataset is downloaded from the UC Irvine Machine Learning Repository under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The Titanic dataset is downloaded from Kaggle under the CC0 license.

Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.
