
The Aspiring Statistician’s Introduction to Random Variables

When it comes to yelling “Surprise!”, the universe never gets tired.

Towards Data Science
31 min read · May 12, 2023


In this article, we’ll take a contemplative walk across the land of chance. We’ll learn about random walks, about discrete and continuous random variables, and about their probability distributions. We’ll understand why it’s meaningless to assign a probability to a continuous random variable’s value. And in doing so, we’ll unearth the meaning of probability density. We’ll learn how random processes are formed, and their connection to time series models. We’ll end our itinerant wanderings with a short discussion on how random variables form the substrate of all regression models.

On experiments, outcomes and sample spaces

We experience change: unpredictable, random change devoid of any observable pattern or reason. It surprises us, it frustrates us, it derails our best-laid plans.

We sleep through our morning alarms, we get stuck in flash traffic, and we show up late for meetings. Only to discover that they are postponed because whoever called them was also stuck in traffic.

But randomness isn’t always a fiend. Remember when the distracted barista poured you a few extra millimeters of coffee? Did you protest? We’ve learnt to quietly enjoy the little windfalls of nature.

We seek predictability, certainty, constancy. But they are impossible to find. Our senses and our brains aren’t evolved to know everything all the time. We experience random variability because we aren’t omniscient. On the other hand, God may have little use for dice. But even that is debated.

We’ve learnt to exploit randomness. Games have been won or lost on a coin toss. A coin flip decided the owner of a prized horse. And it was a coin flip that gave Portland, Oregon its name.

On a grander scale, random numbers power the TLS protocol that encrypts most of the data flowing on the internet.

On an even grander scale, our Sun’s been flipping its magnetic field once every 11 years or so. The “or so” means the Sun doesn’t want to tell us exactly when it will do it next.

The time at which you drag yourself out of bed, the amount of coffee in your morning cup, the number of sips you take to gulp it down, the outcomes of all the coins you tossed in your lifetime, and the number of years it takes our Sun to flip its field — they are all random numbers. Random values that have come about from an activity such as tossing a coin. In statistics, this activity is called an experiment.

When you perform an experiment, you expect an outcome, typically one of several possible ones. When you toss a penny, you expect it to land heads or tails. But what if it lodges itself in the mud edge-wise or rolls off into the gutter? If you toss it while on top of Mount Everest, the wind could simply carry your penny away. And if you toss it inside the International Space Station…Well, you get the point. Should you ignore any of these outcomes? That depends on why you want to flip the penny. If your course of action depends on its landing Heads or Tails, you don’t care about any other outcome, and you’d want to design your experiment to ensure that you will get either Heads or Tails. The point to remember is that the design of the experiment decides the outcomes. As an experimenter, you must design the experiment to yield only outcomes that mean something to you.

We are interested in only Heads and Tails. So, let’s toss a coin on level ground inside an empty room. There is no sticky mud on the floor and there are no gutters or grates in sight. The possible outcomes of this controlled experiment are {Heads, Tails}. This is the sample space of the experiment. It’s the set of all possible outcomes that mean something to you, and it’s denoted by the letter S. If you observe any outcome other than Heads or Tails, you must discard it as if it never occurred and you must re-conduct the experiment. You could also redesign the experiment and incorporate the surprise outcome into the sample space. Or you could simply declare failure and go fishing.

If the experiment yields one of N possible outcomes, the sample space S is {s_1, s_2, s_3,…,s_N}. The size of S, denoted by |S|, is N.

When you conduct the experiment, every outcome in S is probable. When S={Heads, Tails}, the probability of occurrence of the two outcomes are denoted as P(Heads), and P(Tails). In the general case, the probability of occurrence of the kth outcome in S is denoted as P(s_k).

In our controlled experiment, the coin must land either Heads or Tails. There is no third outcome. The two outcomes are also mutually exclusive. If the coin has already come up Heads, it can’t possibly land Tails at the same time.

So the probability of the coin landing “Heads OR Tails” is:

P(Heads ∪ Tails) = P(Heads) + P(Tails) − P(Heads ∩ Tails) = P(Heads) + P(Tails) = 1.0

It’s like saying, when you toss a coin, there is a 100% chance that it will land either Heads or Tails. In the controlled environment of a carefully arranged experiment, such a statement is never false. Logicians call it a tautology. It may sound like a trivial statement, but it’s of great practical use to the experimenter. After you’ve designed an experiment that yields only mutually exclusive outcomes, you should verify that the probabilities of all outcomes in the sample space sum up to 1.0. If they don’t, there is something wrong with your experiment’s design. Simply put, you need to redesign the experiment.

The sum of probabilities of all outcomes in the sample space is a perfect 1.0 (i.e., 100%) (Image by Author)

Taking action: A random walk across an island

Earlier, we talked about taking some action based on the outcome of the experiment. Let’s explore that further. If you are a tourist in Manhattan and wish to take a self-guided walking tour of the island, a generally acceptable way to achieve this is to grab a tour book or download an app and follow its instructions.

But what if you let a coin direct you around Manhattan? Here’s how a hypothetical “coin-operated” exploration could work:

Suppose we start at the corner of W 34th St. & 8th Ave right outside Penn Station. Then we flip a coin (maybe not a real coin but a pseudo coin generated by a coin-flipping app). If it comes up Heads, we turn right, else we turn left. Either way, we walk up or down the street until we come to the next 3-way or 4-way intersection, at which point we flip the coin again and repeat the process. If we come to a dead end, we simply retrace our steps to the previous intersection. The path we trace out is called a random walk.

A random walk around Manhattan. Left turns are colored blue. Right turns are colored red. (Image by Author) (Map underlay copyright OpenStreetMap under OpenStreetMap license)
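The coin-operated walk is easy to simulate. Here’s a minimal Python sketch that plays it out on an idealized square grid (every intersection is 4-way, there are no dead ends) rather than the real Manhattan street network; the seeded random generator stands in for the coin-flipping app:

```python
import random

def random_walk(n_turns, seed=42):
    """Simulate n_turns coin-directed turns on an idealized square grid.

    Heads turns us right, Tails turns us left; either way we then walk
    one block to the next intersection. This is a toy grid, not the
    real Manhattan street network.
    """
    rng = random.Random(seed)
    # Unit steps for the four headings: North, East, South, West.
    headings = [(0, 1), (1, 0), (0, -1), (-1, 0)]
    h = 0          # start facing North
    x, y = 0, 0    # the corner of W 34th St & 8th Ave, say
    path = [(x, y)]
    for _ in range(n_turns):
        flip = rng.choice(["Heads", "Tails"])
        h = (h + 1) % 4 if flip == "Heads" else (h - 1) % 4  # right / left
        dx, dy = headings[h]
        x, y = x + dx, y + dy   # walk one block
        path.append((x, y))
    return path

print(random_walk(5))
```

Each run with the same seed retraces the same walk, which is handy when you want a reproducible “random” tour.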

While on a random walk, you would turn right or left based on whether the coin lands Heads or Tails. We’ll call the action of turning right or left, an event. Let’s create a new set E that will contain these events. E = {Turn Right, Turn Left}.

Notice that the action of turning right or left is completely random. We’ll represent this action by a random variable which we’ll denote by the bolded capital letter X.

In other words, we define the random variable X to contain the action that we’ll take in response to a coin flip. Thus X will assume exactly one of the two values in E={Turn Right, Turn Left} based on the outcome in S={Heads, Tails}. Notice how X maps values from S to E. Outcomes to events.

Heads → Turn Right
Tails → Turn Left

What do such mappings remind us of? They remind us of a function. X is a function! So let’s write it in function form as follows:

X(Heads) = Turn Right
X(Tails) = Turn Left

You could also write X in the following set notation, which has the added advantage of making you look really smart:
X : S → E

Either way, all it says is X is a function that maps values in set S to values in set E.

The random variable X maps outcomes in S to events in E (Image by Author)

Let’s recap: A random variable is a function that maps values from a set of random outcomes S to a set of events E that interest you. Since X is a function, it has a domain and a range. X’s domain is the set of outcomes in the sample space S. X’s range is the set of events in E.

There are three kinds of random variables: discrete, continuous, and mixed. X happens to be the first kind. So let’s look closely at the discrete category.

Discrete random variables and their properties

The random variable X is the engine that powers our random walk around Manhattan. We should notice the following properties of X:

  1. X assumes discrete values Turn Right and Turn Left (as against a continuous value such as temperature). Thus, X is a discrete random variable.
  2. You cannot say that Turn Right is any greater or lesser than Turn Left. You cannot impose any kind of order on the two values that X takes. Instead, you must treat them as equals. Random variables whose range contains values that cannot be ordered are called nominal random variables.
  3. Since there is a probability P(s_k) associated with each outcome s_k in S, there is a probability P(X=x_i) associated with each event x_i in E.
  4. P(X=x_i) is called the Probability Mass Function (PMF) of X. The PMF assigns a probability to every possible value of the random variable. i.e. each value in the set of events E.
  5. The probabilities of all x_i in E sum up to 1.0. No surprises here. Given the design of the random walk experiment, once your coin lands Heads or Tails, you will turn either Left or Right. There is no third action or event in E available to choose at will. The coin is directing your action totally and absolutely. You have no personal say in the matter, no free will. But what if you want free will? We’ll get to this interesting case soon. For now, no free will implies P(X=Turn Left) + P(X=Turn Right) = 1.0.

Random variables as ‘onto’ functions

In the random walk experiment, X happens to map each value in S to a unique value in E. Heads is mapped to Turn Right, Tails is mapped to Turn Left. Set theorists call such functions one-to-one and onto. A one-to-one function permits a value in E to stay completely un-mapped to anything in S. But if a value in E is mapped, it must be the image of exactly one value in S. That’s how one-to-one functions work.

An onto function requires every value in E to be mapped from at least one value in S. It doesn’t matter if multiple values in S map to the same value in E. All that ‘onto-ness’ requires is that no value in E stays un-mapped.

Combining the one-to-one and onto properties means that every value in S is mapped to exactly one value in E and vice versa. S and E end up with the same number of values. This is a feature of X but it doesn’t have to be a feature of all random variables.

Random variables don’t need to be one-to-one functions. But random variables do need to be onto functions. What if a random variable is not an onto function? In that case, its range E will contain events that aren’t linked to any outcome in its domain S. Imagine working with an X whose domain S is {Heads, Tails} and whose range E is {Turn Right, Turn Left, Walk into Oncoming Traffic}. If your actions are always guided by the outcome of a coin toss that tells you to turn either left or right at an intersection, when would you ever deliberately walk into oncoming traffic? Obviously, never.

Now in case you are thinking, “Why should my wanderings always be guided by what a stupid coin tells me? Don’t I have a right to exercise my free will?” Yes, you do. What you just asked for is another random variable. To help you exercise free will, we’ll define a new variable Y whose domain S is {Heads, Tails, Exercise Free Will}, and whose range E is {Turn Right, Turn Left, Do something (not stupid)}.

Y(Heads) = Turn Right
Y(Tails) = Turn Left
Y(Exercise Free Will) = Do something (not stupid)

Once again, everything in E is mapped to something in S making Y an onto function. When you design random variables, you really cannot get away from making them onto functions.

A more complex example: Counting right hand turns

As mentioned in the previous section, random variables don’t have to be one-to-one functions. In fact, many of them are many-to-one functions. Let’s look at an example.

During your coin-controlled wanderings around Manhattan, if you want to keep track of the number of times you turned right in any sequence of 4 turns, you may define a random variable W to hold this count. The range of W is E={0, 1, 2, 3, 4} and W’s domain is the sample space S containing all possible sequences of length 4 of Heads and Tails.

S = {“HHHH”, “HHHT”, “HHTH”, “HHTT”, “HTHH”, “HTHT”, “HTTH”, “HTTT”, “THHH”, “THHT”, “THTH”, “THTT”, “TTHH”, “TTHT”, “TTTH”, “TTTT”}

In S, we could just as well substitute H with Turn Right and T with Turn Left. The character of S does not change if we do that, but we’ll leave them as H and T to keep in mind that it is the coin tosses that are driving W.

As before, W maps values from S to values in E but this time the mapping is many-to-one:

The random variable W is a many-to-one onto function (Image by Author)

The random variable W shares all properties of X except for one. Let’s review all those properties:

  1. W is a discrete random variable.
  2. There is a probability P(w_k) associated with every w_k in E={0, 1, 2, 3, 4}.
  3. All probabilities P(W=w_k) in the PMF of W must sum up to 1.0. If they don’t, there is something wrong with how you have defined W. You should give its definition a second look.
  4. Now here’s the difference between X and W: Unlike X, the values E={0, 1, 2, 3, 4} in the range of W can be ordered. You turned right 0 times is smaller than you turned right 1 time which is smaller than 2 times, and so on. That makes W an ordinal random variable. Recollect that X was a nominal random variable.

The Probability Mass Function

Let’s talk about the probability associated with each value of a random variable. We’ll build what’s known as the Probability Mass Function of a random variable. Let’s begin with X. Recollect that the domain of X is the sample space S={Heads, Tails}. The range of X is the event space E={Turn Right, Turn Left}. Turn Right maps to the subset of outcomes {Heads} in {Heads, Tails}. Turn Left maps to the subset {Tails}. Thus, the probability of occurrence of {Turn Right} is the ratio of the size of the subset {Heads} to the size of the sample space S. The size of the set {Heads} is denoted as |{Heads}| and it is obviously 1, and the size of {Heads, Tails} is denoted as |{Heads, Tails}| = |S| and it’s 2. Thus, we have:

P(X=Turn Right) = |{Heads}| / |S| = 1 / 2 = 0.5

Similarly, P(X=Turn Left) = |{Tails}| / |S| = 1 / 2 = 0.5

The sum of all probabilities in the PMF of X equals 1.0.

The PMF of X (Image by Author)

The PMF of W is a bit more interesting. Recollect that W is the number of times you turn right in any sequence of 4 coin tosses. The range of W is the set E={0,1,2,3,4}. We’ll work out the probabilities P(W=w_k) by building a table. Our table will have 4 columns. The first column will contain the five values from E, i.e. {0,1,2,3,4}. The second column will contain the outcomes from S that map to each of these five values. For example, if W=1, you have made exactly one right turn in a sequence of four coin tosses. The outcomes in S that result in exactly one right turn are {HTTT, THTT, TTHT, TTTH}. We’ll call this set I_1. The third column will contain the size of I_k, and the fourth column will contain the probability P(W=w_k).

Here’s the table:

Table of probabilities associated with the values of W (Image by Author)

Let’s verify that the probabilities P(W=w_k) for k=0,1,2,3,4 sum up to 1.0:

1/16 + 4/16 + 6/16 + 4/16 + 1/16 = 16/16 = 1.0

Here’s what the plot of the PMF of W looks like:

PMF of W (Image by Author)

X and W have been simple to construct. Their sample spaces were small and so were their ranges. For computing their PMFs, we summed up the counts of outcomes in S that mapped to each value of the random variable. Then we divided each sum by the size of the sample space, as follows:

P(W=w_k) = |I_w_k|/|S|
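For a sample space as small as W’s, this counting recipe translates directly into code. The sketch below enumerates all 16 sequences in S with itertools, counts how many map to each value of W, and divides by |S| (Fraction keeps the probabilities exact):

```python
from itertools import product
from fractions import Fraction

# Enumerate the full sample space S: all 2^4 sequences of H and T.
S = ["".join(seq) for seq in product("HT", repeat=4)]
assert len(S) == 16

# W maps each sequence to its number of Heads (i.e. of right turns).
# The PMF is P(W = k) = |I_k| / |S|, where I_k is the set of
# sequences containing exactly k Heads.
pmf = {k: Fraction(sum(seq.count("H") == k for seq in S), len(S))
       for k in range(5)}

for k, p in pmf.items():
    print(f"P(W={k}) = {p}")

# The probabilities must sum to exactly 1:
assert sum(pmf.values()) == 1
```

This reproduces the table’s column of probabilities: 1/16, 4/16, 6/16, 4/16, 1/16.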

Let’s look at one of these probabilities. The probability of making 2 right turns in a sequence of 4 turns is 6/16 = 0.375 or 37.5%. What does this probability really mean? What is its real-life interpretation? Here’s one way to interpret it:

Millions of tourists visit Manhattan each year. Upon reaching Manhattan, suppose all of them feel an irresistible urge to go on a walking tour of the island. Each person begins their tour at the corner of a 3-way or 4-way intersection in the city. There, they take an unbiased coin out of their pocket and flip it. Depending on whether it lands Heads or Tails, they take a Right or Left turn at the intersection and proceed to the next intersection whereupon they flip the coin again. They repeat this process exactly four times. After all of them are done quenching their perambulatory cravings, an omniscient being counts all tours that contain exactly 2 Right turns and divides this number by the total number of tours (or people). The resulting fraction will be approximately 0.375.

The Cumulative Distribution Function

Once you know the PMF of W, you’ll also be able to answer questions like: in a sequence of 4 turns, what is the probability that a majority of the turns you make are Right turns? Or, in case you have an aversion to making Right turns, that you make no more than 1 Right turn? To answer questions like these (possibly pointless here, but useful in general), we need to build a new function called the Cumulative Distribution Function (CDF). The CDF of W will return the probability that the value of W is at most k, for k=0,1,2,3, or 4.

There is an easy way to build the CDF if you know the PMF:

The CDF at W=k is the sum of the probabilities P(W=j) for j=0,1,2,…,k. BAM! Done.

The CDF is denoted by a capital letter F.

The Cumulative Distribution Function of W (Image by Author)

Let’s calculate the CDF of W using its PMF:

F(W=0) = P(W <= 0) = P(W=0) = 1/16

F(W=1) = P(W <= 1) = P(W=0) + P(W=1) = 1/16 + 4/16 = 5/16

F(W=2) = P(W <= 2) = P(W=0) + P(W=1) + P(W=2) = 1/16 + 4/16 + 6/16 = 11/16

F(W=3) = P(W <= 3) = P(W=0) + P(W=1) + P(W=2) + P(W=3) = 1/16 + 4/16 + 6/16 + 4/16 = 15/16

F(W=4) = P(W <= 4) = P(W=0) + P(W=1) + P(W=2) + P(W=3) + P(W=4) = 1/16 + 4/16 + 6/16 + 4/16 + 1/16 = 16/16 = 1.0

F(W=5) = P(W <= 5) = P(W=0) + P(W=1) + P(W=2) + P(W=3) + P(W=4) + P(W=5) = 1/16 + 4/16 + 6/16 + 4/16 + 1/16 + 0/16 = 16/16 = 1.0

So here are the six values of the CDF of W:

F(W=0) = 1/16
F(W=1) = 5/16
F(W=2) = 11/16
F(W=3) = 15/16
F(W=4) = 16/16
F(W=5) = 16/16, and likewise F(W=k) = 16/16 for any k > 4

You may also want to know if F(.) is defined for W between 0 and 1, 1 and 2, etc., i.e. for fractional values of W. In a sense, calculating the CDF for such values is meaningless. The values 1.5, 2.6, 3.1415926, etc. do not belong to the range of W. But we can still calculate F for such fictitious, in-between values of W. That’s because the CDF F(.) is the probability of W being less than or equal to some value w. Therefore, by definition, the domain of F(.) can contain fractional (real) values, and the corresponding probabilities are meaningful. Let’s calculate F(W=1.5):

F(W=1.5) = P(W <= 1.5) = P(W=0) + P(W=1) + P(1 < W <= 1.5) = 1/16 + 4/16 + 0 = 5/16
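This reasoning turns into a tiny function: since F(w) = P(W <= w) just sums the PMF over every k <= w, it happily accepts fractional (and even out-of-range) arguments. A short Python sketch:

```python
from fractions import Fraction

# PMF of W from the table worked out earlier.
pmf = {0: Fraction(1, 16), 1: Fraction(4, 16), 2: Fraction(6, 16),
       3: Fraction(4, 16), 4: Fraction(1, 16)}

def F(w):
    """CDF of W: P(W <= w), defined for any real w, not just 0..4."""
    return sum(p for k, p in pmf.items() if k <= w)

print(F(1))     # 5/16
print(F(1.5))   # also 5/16: no probability mass lies in (1, 1.5]
print(F(7))     # 1: every value of W is <= 7
```

Evaluating F on a fine grid of real numbers is exactly what produces the stepped plot below.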

We can similarly calculate F(.) for each one of the infinite number of real numbers in the half-open interval [1, 2). For each of them, the CDF will return the probability 5/16. Thus, a plot of F(.) contains a horizontal line from k=1 to k=2, whereupon it steps up to F(W=2)=11/16.

If you plot F(W=k) for each k, you’ll find a stepped graph as follows:

CDF of W (Image by Author)

Using the CDF, you can calculate the probability of taking a majority of Right turns in a sequence of 4 turns as follows:

P(W > 2) = 1 − P(W <= 2) = 1 − F(W=2) = 1 − 11/16 = 5/16

The probability of taking at most 1 Right turn in a sequence of 4 turns is:
P(W <= 1) = F(W=1). Also 5/16.

See how easy it is to calculate such probabilities with the CDF?

For large sample spaces, the tabular procedure we used for computing the PMF (and then the CDF) is simply not practical. For instance, if W contains the number of right turns in a sequence of 10 coin-directed left and right turns, you must work with 2^10=1024 different outcomes in S. With a sequence of 20 coin-directed turns, you are working with over a million outcomes.

While working with large and complex sample spaces, you must draw upon the appropriate formulas in Math. For example, if the random variable Z holds the number of right turns you took in a sequence of N coin-directed left/right turns where N can be arbitrarily large, Z’s domain is a sample space containing 2^N unique N-length sequences of left and right turns. Z’s range is E={0,1,2,3,…,N}. The number of sequences in S that contain exactly k right turns is given by the following formula drawn from combinatorics:

The number of ways to choose k similar outcomes from a set of N outcomes (Image by Author)

The probability that Z will take a value k is the ratio of the size of I_k to the size of S:

The probability of taking k right turns in a sequence of N coin-directed turns (Image by Author)

For each value of k=0,1,2,…N, if you plot the corresponding probability P(Z=k), you’ll get a bell shaped curve that will peak at k=N/2 if N is even or at k=(N-1)/2 and k=[(N-1)/2 + 1] if N is odd. To understand why it peaks at these values we appeal to the coin’s unbiasedness as follows:

When your coin is unbiased, the expected number of times it’ll turn up Heads in a sequence of N flips is N/2. And so, this is the expected number of times you’ll turn Right. Thus, you’d expect the probability of making N/2 right turns to be the highest amongst the probabilities for all other numbers of Right turns. Which explains the single bump in the PMF at or around N/2. The plot below shows the PMF of Z when N=60.

PMF of Z (Image by Author)

If you run your eye up the Y-axis, you’ll notice how vanishingly small the probabilities are for most values of Z. That’s mostly because of the factorials in the denominator of the formula for P(Z=k). When k is small, k! is small but (N−k)! is huge, and together with the 2^N it blows up the denominator, squashing the probability down to nearly nothing for small values of k. When k is large, k! is enormous, and this time it dominates the denominator, which once again knocks the probability flat for large values of k. In a sequence of 60 turns, if you were looking forward to making less than, say, 20 Right turns or more than 40 Right turns, fuggedaboudit.

The CDF of Z is F(Z = k) = P(Z <= k) = P(Z=0) + P(Z=1) + … + P(Z=k).

We calculate this sum as follows:

CDF of Z (Image by Author)
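These combinatorial formulas are a one-liner each in Python, with `math.comb` supplying the binomial coefficient. A short sketch, sanity-checked against the N=4 table we built for W:

```python
from math import comb

def pmf_Z(N, k):
    """P(Z = k): probability of exactly k right turns in N fair coin flips.
    This is comb(N, k) / 2**N, the formula from the article."""
    return comb(N, k) / 2**N

def cdf_Z(N, k):
    """F(Z = k) = P(Z <= k): the PMF summed from 0 through k."""
    return sum(pmf_Z(N, j) for j in range(k + 1))

# Sanity check against the N=4 case (the random variable W):
print(pmf_Z(4, 2))   # 0.375, i.e. 6/16
print(cdf_Z(4, 2))   # 0.6875, i.e. 11/16

# For N=60, the PMF peaks at k = N/2 = 30, as described above:
assert max(range(61), key=lambda k: pmf_Z(60, k)) == 30
```

No enumeration of the 2^60-outcome sample space required, which is the whole point of reaching for the formula.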

Continuous random variables

Now let’s look at a different creature in random land: the continuous random variable. To understand it, we’ll look at the wheelbase lengths of 205 automobiles:

Wheelbases of 205 automobiles (Data source: UCI machine learning data set repository under (CC BY 4.0) license) (Image by Author)

This may look like a boring example but it perfectly illustrates how ubiquitous continuous random variables are.

Our sample space is a set of 205 vehicles. We’ll give each vehicle a unique id that goes from 1 through 205. Thus, S={1,2,3,…,205}. Don’t think of the identifiers as numbers. Each id is simply a proxy for a specific vehicle in the data set. We’ll define a random variable Y on this sample space that maps each vehicle identifier in S to its wheelbase in inches. The question is, what is the range of Y?

To answer this question, we must look at the nature of the quantity: “distance”. The wheelbase is the distance between the front and rear axles of the vehicle. The accuracy with which you’ll measure it is limited by the accuracy of your measurement device and any deliberate rounding that you apply. For a 2-door Honda hatchback, you may have measured the wheelbase as 86.57754 inches before rounding it to 86.6. If you instead round it to 3 decimals, it would be 86.578 inches, to four decimals, 86.5775 inches, and so on. If there are two vehicles with wheelbases of 86.57754 and 86.57755 inches, you could conceive of a third, a fourth, indeed any number of vehicles with a wheelbase that lies in between those two numbers. No matter how close the two numbers are, there will always be a number that lies in between them. The point is that the wheelbase is a real number with infinite theoretical precision. And so the range of Y is the set of positive real numbers.

Which brings us to the following way to characterize continuous random variables:

Random variables whose range is the set of real numbers (denoted by ℝ), or any continuous interval within are known as continuous random variables.

Thus, Y is a function that maps S to a subset of ℝ. What subset of ℝ is it? Since we are talking about distance, this subset is the set of all positive real numbers, denoted by ℝ>0. We’ll put on our “look-smart” hat and write Y in set notation as follows:

Y : S → ℝ>0

If Y had been a discrete random variable, its range would have been finite or countably infinite. A Probability Mass Function defined over this range would have assigned a probability to each value in the range. You would just have to ensure that all probabilities summed to 1.0. But Y is continuous, and its range is ℝ>0, which is an uncountably infinite set, (curiously) even bigger than a “merely” countably infinite set. Any heroic attempt to construct a PMF on this set is bound to fail: assign any positive probability to each of uncountably many values, and the probabilities will sum to infinity, not 1.0.

But surely, there is some probability measure associated with each wheelbase. Even in our sample of 205 vehicles, we see that some wheelbases occur more frequently than others. A frequency distribution of wheelbases quite readily illustrates this fact:

Frequency distribution of wheelbases (Image by Author)

Suppose you have to guess the wheelbase of a randomly chosen vehicle from this data set. Without even looking at the vehicle, you’d want to guess its wheelbase to lie in the range that contains the largest number of wheelbase measurements. This range is 93.8 to 97.4 inches and it contains 81 measurements, so you’d be right 81 / 205, i.e. about 40% of the time. If you want to make a more precise guess, you could halve the bin size from 3.6 to 1.8 inches and base your guess on the revised frequency distribution, which looks like this:

Frequency distribution of wheelbases (Image by Author)

By shrinking the bin size by 50%, we have split the interval (93.8, 97.4] into two intervals (93.8, 95.6] and (95.6, 97.4]. You may now want to “shift-up” your guess to (95.6, 97.4] and be correct 54/205 i.e. about 26% of the time. These calculations have reassured us that for a continuous random variable, probabilities have meaning at least over intervals of values. Let’s develop this further.
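The binning arithmetic above is just counting observations in half-open intervals. Here’s a Python sketch of the mechanics. The UCI wheelbase sample isn’t reproduced here, so we draw a synthetic stand-in sample of 205 values from a Normal distribution with the mean and standard deviation quoted later in the article; a Normal sample won’t reproduce the real data’s ~40% figure, the point is only the counting:

```python
import random

# Synthetic stand-in for the 205 wheelbases (hypothetical values).
rng = random.Random(0)
wheelbases = [rng.gauss(98.75659, 6.02176) for _ in range(205)]

def interval_probability(data, a, b):
    """Estimate P(a < Y <= b) as the fraction of observations in (a, b]."""
    return sum(a < y <= b for y in data) / len(data)

# Estimate for the bin (93.8, 97.4], then for its two halves:
print(interval_probability(wheelbases, 93.8, 97.4))
print(interval_probability(wheelbases, 93.8, 95.6))
print(interval_probability(wheelbases, 95.6, 97.4))
```

Note that the two half-bin estimates always add back up to the full-bin estimate, since the counts partition exactly.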

The Cumulative Distribution Function, revisited

Remember that our data set of 205 vehicles is just a small sample drawn from a theoretically (uncountably) infinite population of cars and their corresponding real-valued wheelbases. In any interval such as (93.8, 97.4] from this set, there lie a theoretically (uncountably) infinite number of wheelbase measurements. The continuous random variable Y can take any of these values, and it’s impossible to know the probability with which Y will take any particular value. But we may have found a sneaky workaround to this impediment.

As long as we stay away from individual values of Y and instead calculate the probability of Y lying in some range of values, we can get a pretty decent estimate of the probability. And that’s just the thing to remember about continuous random variables — estimates of probability are meaningful only for ranges of values.

A useful range to consider is (-∞, y]. The accompanying probability P(Y <= y) once again leads us to the Cumulative Distribution Function of Y. Recollect that the CDF of a discrete random variable is a step function. This step function is discontinuous at each value of the random variable. But Y being a continuous random variable, its CDF isn’t a step function. Instead, its CDF is a continuous function of Y. What might the shape of this CDF look like? One way to guess its shape is by pretending that Y is a discrete variable. That is, we calculate the probabilities P(Y <= y) for each value of y in the data set. To make it easy to calculate P(Y <= y) for the wheelbases data, we sort it so that it looks like this: 86.6, 86.6, 88.4, 88.6, 88.6, 89.5, …, 115.6, 115.6, 120.9. Next, we calculate the probabilities as follows:

P(Y <= 86.6) = 2/205 = 0.0097561
P(Y <= 88.4) = 3/205 = 0.01463415
P(Y <= 88.6) = 5/205 = 0.02439024



P(Y <= 120.9) = 205/205 = 1.0

In fact, we can do better than this. We’ll interpolate between adjacent values of Y. For instance, we’ll assume that in between 88.4 and 88.6 there lies a mystery data point x, and we’ll assign it the probability P(Y <= x) = 4/205 = 0.0195122. This (somewhat silly) smoothing technique does in fact smooth the CDF by adding extra steps into the curve. At any rate, here’s what the fruit of our labor looks like:

Empirical CDF of Y (Image by Author)
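The stepped curve is an empirical CDF, and it takes only a few lines to compute. A minimal Python sketch, demonstrated on a toy slice of the sorted wheelbase values quoted above (the full 205-value data set isn’t reproduced here):

```python
from bisect import bisect_right

def empirical_cdf(data):
    """Return a function F where F(y) = (# observations <= y) / n."""
    xs = sorted(data)
    n = len(xs)
    return lambda y: bisect_right(xs, y) / n

# A toy slice of the sorted wheelbases; the real sample has 205 values.
sample = [86.6, 86.6, 88.4, 88.6, 88.6, 89.5]
F = empirical_cdf(sample)
print(F(86.6))   # 2/6: two observations are <= 86.6
print(F(88.5))   # 3/6: the CDF is flat between 88.4 and 88.6
```

Evaluating F on a fine grid of y values and plotting the results produces exactly the kind of stepped curve shown in the figure.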

Again, let’s not forget that Y is a continuous random variable. The real CDF of Y is a continuous function of Y whose shape may look sort of like the stepped approximation we have built. Then again, the real CDF may not look anything like the stepped function. It all depends on a crucial assumption we’ve silently made. We’ve assumed that our sample of 205 vehicles will do a terrific job of representing the characteristics of the population. If our sample fails to do that job, all bets are off about the presumed true shape of the CDF.

Having said that, the empirical CDF of Y does look a bit like the CDF of a normally distributed variable. So why don’t we visually test this hunch? In our data sample, the mean wheelbase is 98.75659 inches and its standard deviation is 6.02176 inches. On the empirical CDF, if we superimpose the CDF of a continuous random variable that is normally distributed with a mean of 98.75659 and a standard deviation of 6.02176, it looks like this:

The CDF of a normally distributed random variable superimposed on the empirical CDF of Y (Image by Author)

To the eye, the empirical CDF of Y seems to fit the Normal variable’s CDF in an okay sort of way. It’s definitely not a good fit especially in the low and the middle regions of the wheelbase spectrum. If we estimate any quantities using the fitted Normal CDF, the crummy fit of this CDF to the data will cause systematic biases to creep into the probability estimates.

So what kinds of estimates are we looking to make? A useful one is the probability of the wheelbase lying in some interval (a, b]. This is simply the difference between the value returned by the CDF for the two ends of the interval.

P(a < Y <= b) = P(Y <= b) − P(Y <= a) = F(Y=b) − F(Y=a)

Let’s calculate this probability for the interval (93.8, 97.4]. Remember how we had calculated it to be 40% using the frequency distribution plot? Let’s look at what this probability comes to using both the empirical CDF and its approximation to the Normal CDF curve.

We’ll start by setting a and b:

a = 93.8, b=97.4

In the figure below, we’ve used a red colored brace to mark this interval (93.8, 97.4] on the X axis of the CDF curves. We’ll also mark off the following points on both plots, reading off the CDF values F(Y=y) from the Y-axis:

On the blue plot:
(a, F(Y=a)) = (93.8, F(Y=93.8)) = (93.8, 0.18315)
(b, F(Y=b)) = (97.4, F(Y=97.4)) = (97.4, 0.57982)

And on the orange plot:
(a, F(Y=a)) = (93.8, F(Y=93.8)) = (93.8, 0.20522)
(b, F(Y=b)) = (97.4, F(Y=97.4)) = (97.4, 0.41088)

(Image by Author)

On either plot, the probability of the wheelbase lying in the interval (93.8, 97.4] is:
P(93.8 < Y <= 97.4) = F(Y=97.4) - F(Y=93.8)

If we pick off the values of F(.) from the blue empirical CDF, we get the probability of the wheelbase lying in the interval (93.8, 97.4] as (0.57982 - 0.18315) = 0.39667 (or ~40%). This matches the 40% estimate we arrived at using the frequency histogram. There ought to be no surprise that these estimates agree: both were calculated from the same data sample using two different methods. But if we calculate P(93.8 < Y <= 97.4) using the idealized Normal CDF, we get (0.41088 - 0.20522) = 0.20566 (or ~21%). This value is almost half the empirical one. This downward bias in the estimate is expected if you look at how much the idealized CDF deviates from the empirical version in the middle portion of the curve.
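The Normal-CDF calculation above can be reproduced with Python’s built-in statistics.NormalDist, plugging in the sample mean and standard deviation quoted earlier:

```python
from statistics import NormalDist

# Normal distribution fitted to the wheelbase sample
Y = NormalDist(mu=98.75659, sigma=6.02176)
a, b = 93.8, 97.4

# P(a < Y <= b) = F(b) - F(a)
p = Y.cdf(b) - Y.cdf(a)
print(round(p, 5))  # ~0.20566, the ~21% figure above
```

The same two-line pattern works for any interval (a, b]: evaluate the CDF at both endpoints and subtract.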

The Probability Density Function

The size of the interval we worked with was 3.6 inches. You could squeeze it down to an arbitrarily tiny length that is still greater than zero, and the CDF would still give you a good working estimate of the probability that the wheelbase lies in that tiny interval.

For example, suppose you want to find the probability of the wheelbase lying in an interval of length 0.000001 inches starting at 93.8 inches. You set y = 93.8 and δy = 0.000001 and calculate F(Y=y+δy) - F(Y=y). Using the Normal CDF, this probability comes to:

P(93.8 < Y <= 93.800001) = 0.20522269383945058 - 0.20522264662597367 = 4.721347690583855e-08

This is a terribly tiny number, but still a finite positive probability. If you squeeze the interval down even more, the probability associated with it will shrink even further. But what if you want to explore the nature of probability at exactly 93.8? For that, you must switch to a different measure: the probability density, denoted by the function f(.). To calculate it, you divide the probability of occurrence over an interval by the length of the interval:

The density of probability over a finite interval (Image by Author)

But what if the probability is not distributed uniformly across the (y, y + δy] interval? Wouldn’t that reduce the accuracy of the density estimate at Y=y? We ran into exactly this situation when we shrunk the bin size of the histogram plot from 3.6 to 1.8 inches. When we did so, each interval along the X-axis of the histogram halved in size. The interval (93.8, 97.4] split into two intervals, (93.8, 95.6] and (95.6, 97.4]. There were 81 wheelbase measurements lying in the parent interval (93.8, 97.4]. But these 81 measurements did not split into the two child intervals in a roughly 1:1 ratio. Far from it: 27 of them went into (93.8, 95.6] while the remaining 54 went into (95.6, 97.4], bringing to light that the probability was not uniform across the parent interval (93.8, 97.4].

If for a minute we assume that the probability is uniformly distributed in the interval (93.8, 97.4], we can use the empirical CDF to estimate the probability density at exactly 93.8 inches. We set δy to 3.6 inches and calculate the density as follows:
f(Y=93.8) = [ F(Y=93.8+3.6) - F(Y=93.8) ] / 3.6 = (0.57982 - 0.18315) / 3.6 = 0.11019

But we know that the probability is not uniform across this interval. Therefore, if we shrink δy to 1.8 inches, we expect the density estimate at 93.8 to change. And it sure does:
f(Y=93.8) = [ F(Y=93.8+1.8) - F(Y=93.8) ] / 1.8 = (0.31587 - 0.18315) / 1.8 = 0.07373

And if we shrink δy further to 0.9 inches, we find that the estimate changes again:
f(Y=93.8) = [ F(Y=93.8+0.9) - F(Y=93.8) ] / 0.9 = (0.28948 - 0.18315) / 0.9 = 0.11814

If we continue shrinking δy, the estimate of probability density at Y=93.8 will keep changing. It may converge to a stable value, or it may bounce around without any visible pattern. Which of the two behaviors it exhibits depends on how “well-behaved” our sample data set is, that is, how representative it is of the population. At any rate, whether the density estimate bounces around or seems to converge, it won’t be the true value of the density at Y=93.8. So what is the true value? The true value, or as statisticians fashionably call it, the asymptotic value, is the one obtained when the interval δy shrinks to an infinitesimally small size. If this lights up the eyes of all you Calculus lovers, you won’t be disappointed. The true value of the probability density at Y=93.8 is given by the following limit:

Probability density at Y=93.8 inches (Image by Author)

From Calculus, we know that the above limit is the derivative of the CDF F(y) at y = 93.8.
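We can watch this limit take shape numerically. The sketch below uses the fitted Normal CDF (rather than the empirical one, which is not differentiable) and shrinks δy until the difference quotient settles on the derivative:

```python
from statistics import NormalDist

Y = NormalDist(mu=98.75659, sigma=6.02176)
y = 93.8

# Difference quotient [F(y + dy) - F(y)] / dy for shrinking dy
for dy in (3.6, 1.8, 0.9, 0.0001):
    print(dy, (Y.cdf(y + dy) - Y.cdf(y)) / dy)

# In the limit, the quotient equals the derivative of the CDF: the PDF
print(Y.pdf(y))  # ~0.04721 (compare with the 4.72e-08 probability over a 1e-06 inch interval)
```

Unlike the empirical estimates above, these quotients converge smoothly, because the Normal CDF is a genuinely differentiable function.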

In general, the probability density of a continuous random variable Y is the derivative of its CDF:

Probability density at Y=y (Image by Author)

To use the above formula for the PDF, you need to know the CDF of the continuous random variable. For Y, we showed how to approximate its empirical CDF with that of a normally distributed random variable having a mean of 98.75659 and a standard deviation of 6.02176. If you differentiate this Normal CDF, you get its PDF, which is the classic bell-shaped curve we are all familiar with:

PDF of a N(μ=98.75659, σ²=36.26159) distributed continuous random variable (Image by Author)

Conversely, if you know the PDF f(Y), you can get back the CDF by integrating f(Y):

The CDF of Y obtained by integrating the PDF of Y (Image by Author)
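To see the integral relationship in action, here is a rough numerical sketch: a midpoint Riemann sum of the fitted Normal PDF, started far out in the left tail, recovers the CDF value at 93.8 inches (the lower limit of 60.0 and the step count are arbitrary choices made for this illustration):

```python
from statistics import NormalDist

Y = NormalDist(mu=98.75659, sigma=6.02176)

def cdf_by_integration(y, lower=60.0, n=20000):
    # Midpoint Riemann sum of the PDF from deep in the left tail up to y.
    # 60.0 is ~6.4 standard deviations below the mean, so the truncated
    # tail mass is negligible.
    dy = (y - lower) / n
    return sum(Y.pdf(lower + (i + 0.5) * dy) for i in range(n)) * dy

print(cdf_by_integration(93.8))  # ~0.20522, matching Y.cdf(93.8)
print(Y.cdf(93.8))
```

In other words, summing tiny slices of density times interval length gives back the accumulated probability, which is exactly what the integral above says.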

By this time, if you are feeling a sense of mild unease, you are not alone. So let’s pause a bit. We’ve been talking about the derivative and the integral in a carefree manner while blithely switching between the empirical CDF of Y (which treats the sample as a set of discrete points) and a continuous approximation of it (which treats Y as a continuous variable). So let’s clear our glasses before they get too foggy and we trip over some unspoken assumption. The empirical CDF is just a working prop: an aid we are using to illustrate how the CDF is calculated and how the PDF can be obtained by “differencing” it. The empirical CDF is discontinuous at each step, and therefore it is not really differentiable. At some point, we must take the plunge and fit to the empirical data the best possible curve that is continuous and differentiable, and for which the derivative and the integral do make sense. Our first attempt at such a curve was the CDF of a normally distributed variable. We saw that it didn’t fit really well. At best it was a good first-order approximation, and we may be able to find other curves that fit better.

Where to go from here?

Beyond the basic concepts we looked at, there is plenty more to know about random variables. For instance, using functions, you can combine different random variables to create new random variables. The new variable’s properties (its mean, variance, PMF or PDF, and CDF) could be quite unlike those of the individual variables. One such combination, which forms the basis of the linear regression model, is a linear combination of random variables of the kind W = aX + bY + cZ. The linear combination has a few special properties. For instance, the mean of W is simply ‘a’ times the mean of X plus ‘b’ times the mean of Y plus ‘c’ times the mean of Z. Another interesting combination is obtained by combining a discrete and a continuous random variable. The new creature thus spawned is called a mixed random variable. Mixed random variables are used to form mixture distributions and mixture models.
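Here is an illustrative simulation of that mean property. The three distributions and the coefficients are arbitrary choices for this sketch, not anything taken from the wheelbase data:

```python
import random

random.seed(42)
N = 100_000
a, b, c = 2.0, -1.0, 0.5

# Three random variables with quite different distributions
X = [random.gauss(10, 2) for _ in range(N)]        # Normal, mean 10
Y = [random.uniform(0, 6) for _ in range(N)]       # Uniform, mean 3
Z = [random.expovariate(0.25) for _ in range(N)]   # Exponential, mean 4

mean = lambda v: sum(v) / len(v)
W = [a * x + b * y + c * z for x, y, z in zip(X, Y, Z)]

print(mean(W))                                  # sample mean of W
print(a * mean(X) + b * mean(Y) + c * mean(Z))  # a*E[X] + b*E[Y] + c*E[Z] = 2*10 - 3 + 2 = 19
```

Both printed values hover around 19, even though W mixes three very different distributions: the mean of a linear combination is the linear combination of the means.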

In fact, the study of regression models is, at its heart, the study of random variables. The dependent variable of a regression model (the y variable) is a random variable governed by some probability distribution. The regression variables (the X variables) are also random variables, each of them ebbing and flowing as dictated by its respective distribution. A frequently occurring example of X is a dummy regression variable, one that can be either 0 or 1. A dummy variable is a Bernoulli random variable, which has the following PMF:

The PMF of a Bernoulli distributed random variable (Image by Author)

or in general, the following PMF:

The PMF of a Bernoulli distributed random variable (Image by Author)

When you toss a (biased or unbiased) coin, whether it comes up Heads or Tails is a perfect example of a Bernoulli random variable.
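A minimal sketch of such a coin in code (the bias p = 0.6 is an arbitrary choice for illustration):

```python
import random

def bernoulli_pmf(k, p):
    # P(X = k) for a Bernoulli(p) variable: p when k = 1, (1 - p) when k = 0
    return p if k == 1 else 1.0 - p

p = 0.6  # a biased coin
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))  # 0.6 0.4

# Simulate tosses; the empirical frequency of Heads approaches p
random.seed(7)
tosses = [1 if random.random() < p else 0 for _ in range(100_000)]
print(sum(tosses) / len(tosses))
```

With p = 0.5 the same code models an unbiased coin.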

Random processes

Before we end, let’s say a quick hello to random processes. If you line up several random variables in a sequence, what you get is a random process. And that brings us full circle to our coin-operated wanderings around Manhattan.

A random walk around Manhattan
A random walk around Manhattan (Image by Author) (Map underlay by OpenStreetMap)

The random walk was produced by a sequence of coin flips. We can think of it as the outcomes of a sequence of N Bernoulli random variables, and that makes the random walk a Bernoulli process. An actual walk taken this way is a concrete realization of this process.

All time series data sets are realizations of random processes. If the data set contains N time-indexed observations y_1 through y_N, each observation y_t is a random variable. In the random walk, for example, each y_t is a Bernoulli random variable. And being part of a time series, the y_t are all lined up in a sequence, making the whole time series data set a single observed instance of a Bernoulli process.
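Here is a tiny sketch of that idea: lining up N Bernoulli(0.5) variables and drawing one realization of the resulting Bernoulli process (the seed and the Heads-means-left convention are arbitrary choices):

```python
import random

random.seed(11)
N = 10

# N Bernoulli(0.5) random variables lined up in a time-indexed sequence.
# One run of this loop yields a single observed instance of the process.
flips = [1 if random.random() < 0.5 else 0 for _ in range(N)]
turns = ['left' if f == 1 else 'right' for f in flips]
print(flips)
print(turns)
```

Rerun with a different seed and you get a different realization of the same underlying process, which is exactly the distinction between a random process and one observed walk.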

We’ll stop here in our introduction to random variables. Consider this article as your base camp. A place from which you can begin your expeditions into other topics in random variables, their properties, and their applications.

I’ll leave you with an image and an idea for a random variable. The image is of the beautiful M2 globular cluster. The idea is to define a random variable whose value is the number of stars lying inside any 10 mm x 10 mm region of the picture. What might the probability distribution of this variable look like?

Globular Cluster M2
Globular Cluster M2 (Credit: ESA/Hubble under CC BY 4.0)

Citations and Copyrights

Data set

The Automobile data set is downloaded from the UC Irvine Machine Learning Repository under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Images

All images in this article are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.

If you liked this article, please follow me at Sachin Date to receive tips, how-tos and programming advice on topics devoted to regression, time series analysis, and forecasting.
