# 8. Conditional expectation
Conditional expectation provides a way to compute the expected value of an RV given additional information about the occurrence of a specific event or the value of another variable. It refines the notion of expectation by incorporating this new information, enabling more precise predictions and insights into the behavior of RVs under specific conditions.
For example, while \(\mathbb{E}[\text{final exam grade}]\) would tell us, for a randomly chosen person in the class, what we should expect their final exam grade to be, \(\mathbb{E}[\text{final exam grade} \mid \text{didn't use ChatGPT on the homework}]\) would tell us what we could expect a student's final exam score to be based on whether they used ChatGPT on the homework 😉. Of course, we can anticipate that the former will be significantly lower than the latter.
## 8.1. Conditional expectation given an event
We begin with the definition of conditional expectation given an event.
**Conditional expectation given an event**
Discrete Random Variable: Let \( A \) be an event and \( Y \) a discrete random variable. The conditional expectation of \( Y \) given \( A \) is expressed as:

\[\begin{equation*} \mathbb{E}[Y \mid A] = \sum_y y \, \mathbb{P}[Y = y \mid A], \end{equation*}\]

where the summation is taken over all possible values \( y \) of \( Y \).
Continuous Random Variable: If \( A \) is an event and \( Y \) is a continuous random variable, then we replace the conditional PMF \(\mathbb{P}[Y = y \mid A]\) with the conditional PDF \(f_{Y \mid A}(y)\) and take an integral rather than a sum:

\[\begin{equation*} \mathbb{E}[Y \mid A] = \int_{-\infty}^{\infty} y \, f_{Y \mid A}(y) \, dy. \end{equation*}\]
Building on this definition, the Law of Total Expectation is a natural extension of the Law of Total Probability.
**Law of total expectation**
If \( A_1, A_2, \dots, A_n \) partition the set of all possible outcomes, then:

\[\begin{equation*} \mathbb{E}[Y] = \sum_{i=1}^{n} \mathbb{E}[Y \mid A_i] \, \mathbb{P}[A_i]. \end{equation*}\]
| | \( Y = 0 \) | \( Y = 1 \) | \( Y = 2 \) |
|---|---|---|---|
| Didn’t sleep well | \( 0.15 \) | \( 0.30 \) | \( 0.10 \) |
| Slept well | \( 0.05 \) | \( 0.20 \) | \( 0.20 \) |

*Table 8.1: Probabilities of winning 0, 1, or 2 sets under different sleep conditions.*
**Example: Tennis match**
Suppose a tennis player is in a best-of-three match. The random variable \( Y \) represents the number of sets won. The player’s performance is influenced by whether they slept well the night before. The probabilities of winning 0, 1, or 2 sets under different sleep conditions are summarized in Table 8.1.
Question: What is the expected number of sets won, \( \mathbb{E}[Y \mid \text{slept well}] \), given that the player slept well?
Let \( A \) represent the event of sleeping well. First, we compute \( \mathbb{P}[A] \), the probability of sleeping well:

\[\begin{equation*} \mathbb{P}[A] = 0.05 + 0.20 + 0.20 = 0.45. \end{equation*}\]
Next, we calculate the conditional probabilities of winning 0, 1, or 2 sets given \( A \):

\[\begin{equation*} \mathbb{P}[Y = 0 \mid A] = \frac{0.05}{0.45} = \frac{1}{9}, \quad \mathbb{P}[Y = 1 \mid A] = \frac{0.20}{0.45} = \frac{4}{9}, \quad \mathbb{P}[Y = 2 \mid A] = \frac{0.20}{0.45} = \frac{4}{9}. \end{equation*}\]
Finally, we compute the conditional expectation \( \mathbb{E}[Y \mid A] \) using these probabilities. Substituting the values:

\[\begin{equation*} \mathbb{E}[Y \mid A] = 0 \cdot \frac{1}{9} + 1 \cdot \frac{4}{9} + 2 \cdot \frac{4}{9} = \frac{12}{9} = \frac{4}{3}. \end{equation*}\]
Thus, the expected number of sets won, given that the player slept well, is approximately 1.33.
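As a quick numerical check, here is a minimal sketch (the array layout and variable names are our own) that recomputes \(\mathbb{P}[A]\) and \(\mathbb{E}[Y \mid A]\) from Table 8.1:

```python
import numpy as np

# Joint PMF from Table 8.1: row 0 = didn't sleep well, row 1 = slept well;
# columns correspond to Y = 0, 1, 2 sets won.
joint = np.array([[0.15, 0.30, 0.10],
                  [0.05, 0.20, 0.20]])
y_vals = np.array([0, 1, 2])

p_a = joint[1].sum()                   # P[A] = P[slept well] = 0.45
cond_pmf = joint[1] / p_a              # P[Y = y | A]
print(p_a, (y_vals * cond_pmf).sum())  # 0.45 and E[Y | A] = 1.333...
```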
**Example: Tossing a coin**
This is Example 9.1.9 from [BH19].
Suppose we toss a fair coin repeatedly. Define \( W_{HT} \) as the number of flips required to observe the pattern “HT” for the first time. For example, in the sequence \( \text{TTHHT} \), \( W_{HT} = 5 \).
Question: What is \( \mathbb{E}[W_{HT}] \)?
We break \( W_{HT} \) into two components:

- \( W_1 \): the number of flips until the first heads (H).
- \( W_2 \): the additional number of flips required to observe the first tails (T) after the first heads.

Since \( W_1, W_2 \sim \text{FS}(\frac{1}{2}) \), we compute:

\[\begin{equation*} \mathbb{E}[W_{HT}] = \mathbb{E}[W_1] + \mathbb{E}[W_2] = 2 + 2 = 4. \end{equation*}\]
Next, define \( W_{HH} \) as the number of flips required to observe the pattern “HH” for the first time. For example, in the sequence \( \text{TTTHTHH} \), \( W_{HH} = 7 \).
Question: What is \( \mathbb{E}[W_{HH}] \)?
Step 1: Define the goal. Using the law of total expectation, we will compute \( \mathbb{E}[W_{HH}] \) based on the outcomes of the first tosses:

\[\begin{equation*} \begin{aligned} \mathbb{E}[W_{HH}] ={}& \mathbb{E}[W_{HH} \mid \text{1st tosses are (H, H)}] \cdot \frac{1}{4} \\ &+ \mathbb{E}[W_{HH} \mid \text{1st tosses are (H, T)}] \cdot \frac{1}{4} \\ &+ \mathbb{E}[W_{HH} \mid \text{1st toss is T}] \cdot \frac{1}{2}. \end{aligned} \end{equation*}\]
Step 2: Compute the conditional expectations.
- If the first two tosses are (H, H), then \( \mathbb{E}[W_{HH} \mid \text{1st tosses are (H, H)}] = 2 \).
- If the first two tosses are (H, T), then \( \mathbb{E}[W_{HH} \mid \text{1st tosses are (H, T)}] = 2 + \mathbb{E}[W_{HH}] \) (due to the memoryless property of the coin tosses).
- If the first toss is T, then \( \mathbb{E}[W_{HH} \mid \text{1st toss is T}] = 1 + \mathbb{E}[W_{HH}] \).
Step 3: Solve for \( \mathbb{E}[W_{HH}] \). Substitute the values into the expectation formula:

\[\begin{equation*} \mathbb{E}[W_{HH}] = 2 \cdot \frac{1}{4} + \left(2 + \mathbb{E}[W_{HH}]\right) \cdot \frac{1}{4} + \left(1 + \mathbb{E}[W_{HH}]\right) \cdot \frac{1}{2}. \end{equation*}\]
Solving for \( \mathbb{E}[W_{HH}] \), we get that \(\mathbb{E}[W_{HH}] = 6.\)
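Both answers are easy to corroborate by Monte Carlo simulation. The following is a minimal sketch (function and variable names are our own):

```python
import random

def flips_until(pattern):
    """Number of fair-coin flips until `pattern` (e.g. 'HT') first appears."""
    seq = ""
    while not seq.endswith(pattern):
        seq += random.choice("HT")
    return len(seq)

random.seed(0)
n_trials = 100_000
print(sum(flips_until("HT") for _ in range(n_trials)) / n_trials)  # close to 4
print(sum(flips_until("HH") for _ in range(n_trials)) / n_trials)  # close to 6
```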
## 8.2. Conditional expectation given a random variable
Conditional expectation given a random variable is an important concept in probability. The notation \( \mathbb{E}[Y \mid X] \) represents the expectation of \( Y \) conditioned on \( X \). The goal in this section is to understand what this expression means and how it functions. As a first step, it is helpful to understand \( \mathbb{E}[Y \mid X = x] \). Here, \( X = x \) represents the event that \(X\) equals \(x\). The expectation \( \mathbb{E}[Y \mid X = x] \) is the conditional expectation of \( Y \) given this event and can be determined using the conditional probability distribution of \( Y \) given \( X = x \). For a discrete random variable \( Y \), the conditional expectation \( \mathbb{E}[Y \mid X = x] \) is, by definition,

\[\begin{equation*} \mathbb{E}[Y \mid X = x] = \sum_y y \, \mathbb{P}[Y = y \mid X = x]. \end{equation*}\]
For instance, if \( X \) is also discrete, this can be expressed using the joint and marginal probabilities:

\[\begin{equation*} \mathbb{E}[Y \mid X = x] = \sum_y y \, \frac{\mathbb{P}[X = x, Y = y]}{\mathbb{P}[X = x]}. \end{equation*}\]
Similarly, if \( X \) and \( Y \) are continuous random variables, the conditional expectation is given by:

\[\begin{equation*} \mathbb{E}[Y \mid X = x] = \int_{-\infty}^{\infty} y \, \frac{f_{X,Y}(x, y)}{f_X(x)} \, dy, \end{equation*}\]
where \( f_{X,Y}(x, y) \) is the joint probability density function of \( X \) and \( Y \), and \( f_X(x) \) is the marginal density of \( X \).
We will use the notation \( g(x) = \mathbb{E}[Y \mid X = x] \) to represent the conditional expectation of \( Y \) given \( X = x \). Intuitively, it provides the best prediction for the value of \( Y \), assuming that the value of \( X \) is known.
For example, let \( X \) represent a person’s age and \( Y \) represent their maximum heart rate. The general rule of thumb is that a person’s maximum heart rate is 220 minus their age. Of course, this is just a rough estimate; some people’s maximum heart rate will be higher or lower. In other words, 220 minus age is our best guess for a person’s maximum heart rate. We can express this rule of thumb in terms of conditional expectation: for a 21-year-old, for example,

\[\begin{equation*} \mathbb{E}[Y \mid X = 21] = 220 - 21 = 199. \end{equation*}\]
More generally,

\[\begin{equation*} g(x) = \mathbb{E}[Y \mid X = x] = 220 - x. \end{equation*}\]
We are now ready to define \(\mathbb{E}[Y \mid X]\).
**Conditional expectation given a random variable**
The conditional expectation of \( Y \) given \( X \), denoted \( \mathbb{E}[Y \mid X] \), is defined as the random variable \( g(X) \).
For example, if \( g(x) = x^2 \), then \( g(X) = X^2 \), meaning the conditional expectation is expressed as a function of \( X \). Importantly, \( \mathbb{E}[Y \mid X] \) is a function of the random variable \( X \), not of \( Y \).
To illustrate, we return to the case where \( X \) represents a person’s age and \( Y \) represents their maximum heart rate:

\[\begin{equation*} \mathbb{E}[Y \mid X] = g(X) = 220 - X. \end{equation*}\]
**Example: Breaking a stick**
This is Example 9.2.4 from [BH19].
Suppose a stick of length 1 is broken at a random point \( X \), where \( X \sim \text{Uniform}(0, 1) \). Then, another breakpoint \( Y \) is chosen randomly from a uniform distribution over the interval \( (0, X) \). The goal is to compute \( \mathbb{E}[Y \mid X] \), the conditional expectation of \( Y \) given \( X \).
Step 1: Determine \( g(x) \). Given \( X = x \), the random variable \( Y \) is uniformly distributed over \( (0, x) \). For a uniform distribution, the expectation is simply the midpoint of the interval. Thus:

\[\begin{equation*} g(x) = \mathbb{E}[Y \mid X = x] = \frac{x}{2}. \end{equation*}\]
Step 2: Express \( \mathbb{E}[Y \mid X] \) as a function of \( X \). By substituting \( X \) into the expression for \( g(x) \), we get:

\[\begin{equation*} \mathbb{E}[Y \mid X] = g(X) = \frac{X}{2}. \end{equation*}\]
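A simulation makes this concrete. Here is a minimal sketch (assuming NumPy; the names are our own) that checks that, near any fixed \( x \), the average of \( Y \) is close to \( g(x) = x/2 \):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0.0, 1.0, size=n)  # first breakpoint X ~ Unif(0, 1)
y = rng.uniform(0.0, x)            # given X = x, Y ~ Unif(0, x)

# Near X = 0.8, the average of Y should be close to g(0.8) = 0.4.
near = (x > 0.79) & (x < 0.81)
print(y[near].mean())              # close to 0.4

# Averaging g(X) = X/2 over X ~ Unif(0, 1) gives 1/4.
print(y.mean())                    # close to 0.25
```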
### 8.2.1. Properties of conditional expectation
- Independence: If \( X \) and \( Y \) are independent, the conditional expectation \( \mathbb{E}[Y \mid X] \) simplifies to the unconditional expectation \( \mathbb{E}[Y] \). In other words, knowing \( X \) provides no additional information about \( Y \), so \( \mathbb{E}[Y \mid X] = \mathbb{E}[Y] \).
- Linearity: Conditional expectation is linear, meaning that for two random variables \( Y_1 \) and \( Y_2 \), \(\mathbb{E}[Y_1 + Y_2 \mid X] = \mathbb{E}[Y_1 \mid X] + \mathbb{E}[Y_2 \mid X].\)
- Adam’s Law (Law of Iterated Expectations): The overall expectation of \( Y \) can be computed by taking the expectation of its conditional expectation: \(\mathbb{E}[\mathbb{E}[Y \mid X]] = \mathbb{E}[Y]\) (checked numerically in the sketch below).
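Adam's law is easy to sanity-check numerically. Below is a minimal sketch (variable names are our own) that verifies \(\mathbb{E}[\mathbb{E}[Y \mid X]] = \mathbb{E}[Y]\) on the joint distribution from Table 8.1, with \( X \) indicating sleep quality:

```python
import numpy as np

# Joint PMF from Table 8.1: rows are X (sleep quality), columns are Y = 0, 1, 2.
joint = np.array([[0.15, 0.30, 0.10],
                  [0.05, 0.20, 0.20]])
y_vals = np.array([0, 1, 2])

p_x = joint.sum(axis=1)                    # marginal P[X = x]
g = (joint * y_vals).sum(axis=1) / p_x     # g(x) = E[Y | X = x] for each row

print((g * p_x).sum())                     # E[E[Y | X]] = E[g(X)]
print((joint.sum(axis=0) * y_vals).sum())  # E[Y]; both equal 1.10
```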
**Example: Time until first goal in soccer**
Suppose a soccer team takes shots following a Poisson process with a rate of \( \lambda \) shots per minute. Each shot results in a goal with probability \( p \). Let \( Y \) represent the number of minutes until the first goal. The task is to determine \( \mathbb{E}[Y] \), the expected time until the first goal.
Hint: Let \( N \) be the number of shots taken until the first goal (including the shot that scores). Using Adam’s law (\( \mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid N]] \)), we compute \( \mathbb{E}[Y] \) step by step.
Step 1: Compute \( g(n) = \mathbb{E}[Y \mid N = n] \). Given \( N = n \), the time \( Y \) until the first goal follows a Gamma distribution with shape parameter \( n \) and rate parameter \( \lambda \). The expected value of a Gamma random variable is given by the formula \( \frac{n}{\lambda} \). Therefore:

\[\begin{equation*} g(n) = \mathbb{E}[Y \mid N = n] = \frac{n}{\lambda}. \end{equation*}\]
Step 2: Plug in \( N \). Replacing \( n \) with \( N \), the conditional expectation becomes:

\[\begin{equation*} g(N) = \mathbb{E}[Y \mid N] = \frac{N}{\lambda}. \end{equation*}\]
Step 3: Apply Adam’s law. Using Adam’s law (\( \mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid N]] \)), we substitute \( g(N) \):

\[\begin{equation*} \mathbb{E}[Y] = \mathbb{E}[g(N)] = \mathbb{E}\left[\frac{N}{\lambda}\right] = \frac{\mathbb{E}[N]}{\lambda}. \end{equation*}\]
Step 4: Compute \( \mathbb{E}[N] \). The random variable \( N \), the number of shots until the first goal, follows a first success distribution with success probability \( p \). Therefore, the expected number of trials until the first success is \( \mathbb{E}[N] = \frac{1}{p} \).
Substituting \( \mathbb{E}[N] \) into the expression for \( \mathbb{E}[Y] \), we find:

\[\begin{equation*} \mathbb{E}[Y] = \frac{1}{\lambda p}. \end{equation*}\]
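To corroborate the result, here is a minimal simulation sketch (the values of \( \lambda \) and \( p \) are our own illustrative choices). It uses the fact that, given \( N = n \), the waiting time is the sum of \( n \) exponential inter-shot times:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p = 2.0, 0.3          # shots per minute and P(goal per shot)
n_trials = 200_000

n_shots = rng.geometric(p, size=n_trials)          # N ~ FS(p): shots until first goal
times = rng.gamma(shape=n_shots, scale=1.0 / lam)  # Y | N = n ~ Gamma(n, lam)

print(times.mean(), 1.0 / (lam * p))               # both close to 1.667 minutes
```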
## 8.3. Conditional variance
Conditional variance helps us understand the variability of an RV \(Y\) within specific contexts or subgroups defined by another RV \(X\). This is valuable because it allows us to distinguish between variability due to inherent randomness within groups and variability due to differences across groups.
**Conditional variance**
The conditional variance of \( Y \) given \( X \), denoted as \(\text{Var}(Y \mid X)\), is defined as \(\text{Var}(Y \mid X) = \mathbb{E}[Y^2 \mid X] - \mathbb{E}[Y \mid X]^2.\)
Eve’s Law connects the total variance of \( Y \) to both the conditional variance of \( Y \) given \( X \) and the variance of the conditional expectation of \( Y \). It is stated as:

\[\begin{equation*} \text{Var}(Y) = \mathbb{E}[\text{Var}(Y \mid X)] + \text{Var}(\mathbb{E}[Y \mid X]). \end{equation*}\]
Here, the first term, \(\mathbb{E}[\text{Var}(Y \mid X)]\), captures the average variability within groups defined by \( X \), while the second term, \(\text{Var}(\mathbb{E}[Y \mid X])\), measures the variability of group means.
Eve’s law is easier to work with if we define the following functions:

- \( h(x) = \text{Var}(Y \mid X = x) \): the variance of \( Y \) for a specific value \( X = x \).
- \( h(X) = \text{Var}(Y \mid X) \): the conditional variance of \( Y \) given \( X \), which varies across values of \( X \).

Using \( h(X) \) and \( g(X) = \mathbb{E}[Y \mid X] \), Eve’s law can also be rewritten as \(\text{Var}(Y) = \mathbb{E}[h(X)] + \text{Var}(g(X)).\)
We illustrate the intuition behind Eve’s law with an example involving Olympic gymnastics teams, where:

- \( X \) represents the country.
- \( Y \) represents the height of gymnasts.
There are two sources of variation in the height \( Y \):

- Within-group variation: \(\mathbb{E}[\text{Var}(Y \mid X)] = \mathbb{E}[h(X)]\). Within each country, people have different heights. The average variation in height within each country is the within-group variation.
- Between-group variation: \(\text{Var}(\mathbb{E}[Y \mid X]) = \text{Var}(g(X))\). Across countries, the average height is different. This variation in the average heights across countries is the between-group variation.

In this context, Eve’s law explains how the total variance in gymnast heights combines the within-group and between-group components.
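The decomposition is easy to verify numerically. Below is a minimal sketch (the country means and spread are our own toy numbers, not data) comparing \(\text{Var}(Y)\) with \(\mathbb{E}[h(X)] + \text{Var}(g(X))\):

```python
import numpy as np

rng = np.random.default_rng(0)

country_means = np.array([165.0, 170.0, 175.0])  # mean height (cm) per country
within_sd = 6.0                                  # spread within each country
n = 300_000

country = rng.integers(0, 3, size=n)                                  # X
height = country_means[country] + rng.normal(0.0, within_sd, size=n)  # Y

# Equal group probabilities, so unweighted averages over groups suffice.
within = np.mean([height[country == c].var() for c in range(3)])   # E[h(X)]
between = np.var([height[country == c].mean() for c in range(3)])  # Var(g(X))
print(height.var(), within + between)  # approximately equal
```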
## 8.4. Adam and Eve examples
**Example: Revenue of a store**
This is Example 9.6.1 from [BH19].
Consider a store where the number of customers \( N \) visiting on a given day is random. The key details are:
- \( N \), the number of customers, has a mean of \( \mathbb{E}[N] = 1000 \) and a variance of \( \text{Var}(N) = 100 \).
- Each customer spends a random amount \( Y_j \), where:
  - \( Y_1, Y_2, \dots \) are independent and identically distributed.
  - The spending per customer has a mean of \( \mathbb{E}[Y_j] = 5 \) and a variance of \( \text{Var}(Y_j) = 3 \).
The total revenue \( Y \) for the day is the sum of the spending by all customers:
\[\begin{equation*} Y = \sum_{j=1}^N Y_j. \end{equation*}\]
Question: What is \(\mathbb{E}[Y]\)?
Step 1: Define the goal. The total revenue \( Y \) depends on the random number of customers \( N \). Using Adam’s law, we can express \( \mathbb{E}[Y] \) as \(\mathbb{E}[Y] = \mathbb{E}[\mathbb{E}[Y \mid N]] = \mathbb{E}[g(N)],\) where \( g(N) = \mathbb{E}[Y \mid N] \) is the expected total revenue given \( N \) customers.
Step 2: Compute \( g(n) \). When the number of customers is fixed at \( N = n \), the total revenue is:

\[\begin{equation*} Y = \sum_{j=1}^{n} Y_j. \end{equation*}\]
Since the \( Y_j \)’s are identically distributed, linearity of expectation gives:

\[\begin{equation*} g(n) = \mathbb{E}\left[\sum_{j=1}^{n} Y_j\right] = \sum_{j=1}^{n} \mathbb{E}[Y_j] = n \, \mathbb{E}[Y_j]. \end{equation*}\]
Substituting \( \mathbb{E}[Y_j] = 5 \), we have that \(g(n) = 5n.\)
Step 3: Plug in \(N\). Replacing \( n \) with the random variable \( N \), the conditional expectation becomes \(g(N) = \mathbb{E}[Y \mid N] = 5N.\)
Step 4: Apply Adam’s Law. Using Adam’s law, we compute \( \mathbb{E}[Y] \):

\[\begin{equation*} \mathbb{E}[Y] = \mathbb{E}[g(N)] = \mathbb{E}[5N]. \end{equation*}\]
Since \( 5 \) is a constant, we can factor it out: \(\mathbb{E}[Y] = 5 \cdot \mathbb{E}[N].\) Finally, substituting \( \mathbb{E}[N] = 1000 \), we have that \(\mathbb{E}[Y] = 5 \cdot 1000 = 5000.\)
Question: What is \(\text{Var}(Y)\)?
Step 1: Define the goal. Using Eve’s Law, we decompose \( \text{Var}(Y) \) into two components:

\[\begin{equation*} \text{Var}(Y) = \mathbb{E}[h(N)] + \text{Var}(g(N)), \end{equation*}\]
where \( h(N) = \text{Var}(Y \mid N) \) is the conditional variance of \( Y \) given \( N \), and \( g(N) = \mathbb{E}[Y \mid N] \) is the expected total revenue given \( N \).
Step 2: Compute \( h(n) = \text{Var}(Y \mid N = n) \). When the number of customers is fixed at \( N = n \), the total revenue is:

\[\begin{equation*} Y = \sum_{j=1}^{n} Y_j. \end{equation*}\]
Since the spending amounts \( Y_j \) are independent:

\[\begin{equation*} \text{Var}(Y \mid N = n) = \text{Var}\left(\sum_{j=1}^{n} Y_j\right) = \sum_{j=1}^{n} \text{Var}(Y_j) = n \, \text{Var}(Y_j). \end{equation*}\]
Substituting \( \text{Var}(Y_j) = 3 \), we have that \(\text{Var}(Y \mid N = n) = 3n.\)
Step 3: Plug in \( N \). Replacing \( n \) with the random variable \( N \), the conditional variance becomes:

\[\begin{equation*} h(N) = \text{Var}(Y \mid N) = 3N. \end{equation*}\]
Step 4: Apply Eve’s Law. Using Eve’s Law, we compute the total variance:

\[\begin{equation*} \text{Var}(Y) = \mathbb{E}[h(N)] + \text{Var}(g(N)). \end{equation*}\]
Substituting \( h(N) = 3N \) and \( g(N) = 5N \), we have:

\[\begin{equation*} \text{Var}(Y) = \mathbb{E}[3N] + \text{Var}(5N). \end{equation*}\]
Simplifying further:

\[\begin{equation*} \text{Var}(Y) = 3 \, \mathbb{E}[N] + 25 \, \text{Var}(N). \end{equation*}\]
Substitute the values \( \mathbb{E}[N] = 1000 \) and \( \text{Var}(N) = 100 \):

\[\begin{equation*} \text{Var}(Y) = 3 \cdot 1000 + 25 \cdot 100 = 3000 + 2500 = 5500. \end{equation*}\]
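We can check both answers by simulation. The example specifies only the means and variances of \( N \) and \( Y_j \), so the sketch below picks convenient distributions with those moments (an assumption on our part; only the moments matter for the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 20_000

# Convenient choices matching E[N] = 1000, Var(N) = 100, E[Y_j] = 5, Var(Y_j) = 3.
n_customers = np.rint(rng.normal(1000.0, 10.0, size=n_days)).astype(int)
revenue = np.array([rng.normal(5.0, np.sqrt(3.0), size=n).sum()
                    for n in n_customers])

print(revenue.mean(), revenue.var())  # close to 5000 and 5500
```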
**Example: Grand Slam career**
Alice is a tennis player who will participate in \( N \) Grand Slam tournaments over her career. The number of tournaments she plays, \( N \), follows a geometric distribution:

\[\begin{equation*} N \sim \text{Geom}\left(\frac{1}{350}\right). \end{equation*}\]
For each tournament, Alice has a probability of \( \frac{1}{15} \) of winning, independent of all other tournaments. Let \( T \) represent the total number of Grand Slams she wins during her career.
Question: What is the expected number of Grand Slam titles Alice will win, \( \mathbb{E}[T] \)?
Step 1: Define the goal. We can compute \( \mathbb{E}[T] \) using the law of total expectation:

\[\begin{equation*} \mathbb{E}[T] = \mathbb{E}[\mathbb{E}[T \mid N]] = \mathbb{E}[g(N)], \end{equation*}\]
where \( g(N) = \mathbb{E}[T \mid N] \) is the expected number of titles given that she plays \( N \) tournaments.
Step 2: Compute \( g(n) \). When Alice plays exactly \( N = n \) tournaments, the number of titles she wins, \( T \), follows a \(\text{Bin}\left(n, \frac{1}{15}\right)\) distribution. For a binomial random variable, the expected value is:

\[\begin{equation*} \mathbb{E}[T \mid N = n] = n \cdot \frac{1}{15}. \end{equation*}\]
Thus, \( g(n) = \frac{n}{15} \).
Step 3: Plug in \(N\). Substituting the random variable \( N \) into \( g(n) \), we find:

\[\begin{equation*} g(N) = \mathbb{E}[T \mid N] = \frac{N}{15}. \end{equation*}\]
Step 4: Apply Adam’s Law. Using Adam’s law:

\[\begin{equation*} \mathbb{E}[T] = \mathbb{E}[g(N)] = \mathbb{E}\left[\frac{N}{15}\right]. \end{equation*}\]
Factoring out \( \frac{1}{15} \):

\[\begin{equation*} \mathbb{E}[T] = \frac{\mathbb{E}[N]}{15}. \end{equation*}\]
For a geometric random variable \( N \) with success probability \( p = \frac{1}{350} \), the expected value is:

\[\begin{equation*} \mathbb{E}[N] = \frac{1 - p}{p} = \frac{1 - \frac{1}{350}}{\frac{1}{350}} = 349. \end{equation*}\]
Substituting this into the equation for \( \mathbb{E}[T] \):

\[\begin{equation*} \mathbb{E}[T] = \frac{349}{15} \approx 23.27. \end{equation*}\]
Question: What is \( \text{Var}(T) \)?
Step 1: Define the goal. Using Eve’s Law, we express the variance of \( T \) as:

\[\begin{equation*} \text{Var}(T) = \mathbb{E}[h(N)] + \text{Var}(g(N)), \end{equation*}\]
where \( h(N) = \text{Var}(T \mid N) \) is the conditional variance of \( T \) given \( N \), and \( g(N) = \mathbb{E}[T \mid N] \) is the expected number of wins given \( N \).
Step 2: Compute \( h(n) \). When the number of tournaments is fixed at \( N = n \), the number of titles won, \( T \), follows the \(\text{Bin}\left(n, \frac{1}{15}\right)\) distribution. The variance of a \(\text{Bin}(n,p)\) random variable is \(n \cdot p \cdot (1 - p).\) Substituting \( p = \frac{1}{15} \), we find:

\[\begin{equation*} h(n) = n \cdot \frac{1}{15} \cdot \frac{14}{15} = \frac{14n}{225}. \end{equation*}\]
Step 3: Plug in \(N\). Substituting \( N \) as the random variable, the conditional variance becomes:

\[\begin{equation*} h(N) = \text{Var}(T \mid N) = \frac{14N}{225}. \end{equation*}\]
Step 4: Apply Eve’s Law. Using Eve’s Law, we compute:

\[\begin{equation*} \text{Var}(T) = \mathbb{E}[h(N)] + \text{Var}(g(N)). \end{equation*}\]
Substituting \( h(N) = \frac{14N}{225} \) and \( g(N) = \frac{N}{15} \), we have:

\[\begin{equation*} \text{Var}(T) = \mathbb{E}\left[\frac{14N}{225}\right] + \text{Var}\left(\frac{N}{15}\right) = \frac{14 \, \mathbb{E}[N]}{225} + \frac{\text{Var}(N)}{225}. \end{equation*}\]
For \( N \sim \text{Geom}(\frac{1}{350}) \), we know \( \mathbb{E}[N] = \frac{1 - \frac{1}{350}}{\frac{1}{350}} = 349 \). Substituting:

\[\begin{equation*} \frac{14 \, \mathbb{E}[N]}{225} = \frac{14 \cdot 349}{225} = \frac{4886}{225} \approx 21.7. \end{equation*}\]
Moreover,

\[\begin{equation*} \text{Var}(N) = \frac{1 - p}{p^2} = \frac{1 - \frac{1}{350}}{\left(\frac{1}{350}\right)^2} = 349 \cdot 350 = 122150. \end{equation*}\]
Substituting:

\[\begin{equation*} \frac{\text{Var}(N)}{225} = \frac{122150}{225} \approx 542.9. \end{equation*}\]
Combining the two terms:

\[\begin{equation*} \text{Var}(T) \approx 21.7 + 542.9 \approx 564.6. \end{equation*}\]
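A simulation corroborates both moments. Here is a minimal sketch (names are our own); note that the \(\text{Geom}(p)\) convention here counts failures before the first success, while NumPy's geometric counts trials, hence the subtraction of 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims = 500_000

# N ~ Geom(1/350) counts failures before the first success (mean 349).
n_tournaments = rng.geometric(1 / 350, size=n_sims) - 1
titles = rng.binomial(n_tournaments, 1 / 15)  # T | N ~ Bin(N, 1/15)

print(titles.mean(), titles.var())            # close to 23.27 and 564.6
```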
**Example: Cluster sampling**
This is Example 9.6.2 from [BH19].
Cluster sampling is a method used in polling, such as for a presidential election, to gather insights about the population. In this process, a random state is first selected as the cluster for the poll. From the chosen state, a random sample of \( n \) individuals is selected, with replacement in this example. Each individual in the sample is asked about their political affiliation: say, Democrat or Republican.
Let \( X \) represent the number of Democrats in the sample. If the proportion of Democrats in the chosen state is \( q \), then \( X \) follows a binomial distribution, \( X \sim \text{Bin}(n, q) \). However, since the true proportion of Democrats is unknown, it is modeled as a random variable \( Q \), which in this example is assumed to follow a uniform distribution on \([0, 1]\), i.e., \( Q \sim \text{Unif}(0, 1) \).
Question: What is \(\mathbb{E}[X]\)?
Step 1: Define the goal. To compute \( \mathbb{E}[X] \), we use the law of total expectation:

\[\begin{equation*} \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Q]] = \mathbb{E}[g(Q)], \end{equation*}\]
where \( g(Q) = \mathbb{E}[X \mid Q] \) is the expected value of \( X \) given \( Q \).
Step 2: Compute \( g(q) \). When the fraction of Democrats in the chosen state is \( Q = q \), \( X \) follows a binomial distribution. The expected value of a binomial random variable is:

\[\begin{equation*} g(q) = \mathbb{E}[X \mid Q = q] = nq. \end{equation*}\]
Step 3: Plug in \(Q\). Substituting \( Q \) as the random variable, we find:

\[\begin{equation*} g(Q) = \mathbb{E}[X \mid Q] = nQ. \end{equation*}\]
Step 4: Apply Adam’s Law. Using Adam’s law:

\[\begin{equation*} \mathbb{E}[X] = \mathbb{E}[g(Q)] = \mathbb{E}[nQ]. \end{equation*}\]
Factoring out \( n \), we have that \(\mathbb{E}[X] = n \mathbb{E}[Q].\) Since \( Q \sim \text{Unif}(0, 1) \), the expected value of \( Q \) is \(\mathbb{E}[Q] = \frac{1}{2}.\) Thus:

\[\begin{equation*} \mathbb{E}[X] = \frac{n}{2}. \end{equation*}\]
Question: What is \(\text{Var}(X)\)?
Step 1: Define the goal. To compute \( \text{Var}(X) \), we use Eve’s Law:

\[\begin{equation*} \text{Var}(X) = \mathbb{E}[h(Q)] + \text{Var}(g(Q)), \end{equation*}\]
where \( h(Q) = \text{Var}(X \mid Q) \) is the conditional variance of \( X \) given \( Q \), and \( g(Q) = \mathbb{E}[X \mid Q] \).
Step 2: Compute \( h(q) \). When \( Q = q \), \( X \) is a binomial random variable, and the variance of a binomial random variable is:

\[\begin{equation*} h(q) = \text{Var}(X \mid Q = q) = nq(1 - q). \end{equation*}\]
Step 3: Plug in \(Q\). Substituting \( Q \) as the random variable, we have:

\[\begin{equation*} h(Q) = \text{Var}(X \mid Q) = nQ(1 - Q). \end{equation*}\]
Step 4: Apply Eve’s Law. Using Eve’s Law:

\[\begin{equation*} \text{Var}(X) = \mathbb{E}[h(Q)] + \text{Var}(g(Q)). \end{equation*}\]
Substituting \( h(Q) = nQ(1 - Q) \) and \( g(Q) = nQ \):

\[\begin{equation*} \text{Var}(X) = \mathbb{E}[nQ(1 - Q)] + \text{Var}(nQ). \end{equation*}\]
Factoring out constants:

\[\begin{equation*} \text{Var}(X) = n\left(\mathbb{E}[Q] - \mathbb{E}[Q^2]\right) + n^2 \, \text{Var}(Q). \end{equation*}\]
For \( Q \sim \text{Unif}(0, 1) \):

\[\begin{equation*} \mathbb{E}[Q] = \frac{1}{2}, \qquad \text{Var}(Q) = \frac{1}{12}. \end{equation*}\]
From the relationship \( \text{Var}(Q) = \mathbb{E}[Q^2] - \mathbb{E}[Q]^2 \):

\[\begin{equation*} \frac{1}{12} = \mathbb{E}[Q^2] - \frac{1}{4}. \end{equation*}\]
Solving for \( \mathbb{E}[Q^2] \):

\[\begin{equation*} \mathbb{E}[Q^2] = \frac{1}{12} + \frac{1}{4} = \frac{1}{3}. \end{equation*}\]
Adding the two components:

\[\begin{equation*} \text{Var}(X) = n\left(\frac{1}{2} - \frac{1}{3}\right) + \frac{n^2}{12} = \frac{n}{6} + \frac{n^2}{12}. \end{equation*}\]
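Finally, a minimal simulation sketch of the cluster-sampling model (the sample size \( n = 50 \) is our own illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 50, 500_000

q = rng.uniform(0.0, 1.0, size=n_sims)  # Q ~ Unif(0, 1), one state per poll
x = rng.binomial(n, q)                  # X | Q = q ~ Bin(n, q)

print(x.mean(), n / 2)                  # both close to 25
print(x.var(), n / 6 + n**2 / 12)       # both close to 216.7
```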