\[ \newcommand{\set}[1]{\{\, #1 \,\}} \newcommand{\card}[1]{|#1|} \newcommand{\pset}[1]{𝒫(#1)} \newcommand{\falling}[2]{(#1)_{#2}} \]
In this reading, we see how we can use probabilities to compute the expected value arising from an experiment. This simple statistic is the entry point into the wider world of statistical analysis where we look at common patterns of probability distributions and their properties. We won’t have time in this course to explore statistics in detail, but we will talk about some of the basics here, so that you are aware of them for future study.
Random Variables and Expectation
Recall how our fundamental probability definitions are set up:
- The outcome of an experiment is drawn from the sample space \(\Omega\).
- An event \(E \subseteq \Omega\) describes a particular collection of outcomes of interest.
- For each event, we assign a probability to that event through the probability mass function \(\Pr : \mathcal{P}(\Omega) → \mathbb{R}\), which obeys the three axioms of probability theory.
With our probability mass function, we can state the likelihood of events occurring. However, we can also use our probability mass function in conjunction with some other machinery to state the weighted average of the possible outcomes of an experiment. To do this, we first need to define a way to interpret the outcome of an experiment. We do so by way of a function (confusingly) called a random variable.
Definition (Random Variable): a random variable is a function \(X : \Omega → T\) for some output type \(T\). A random variable represents some interpretation of the outcomes of some random process.
Consider our example of rolling three six-sided dice, described by the set of outcomes:
\[ \Omega = \set{(x, y, z) \mid x, y, z ∈ [1, …, 6]}. \]
The sum of these dice forms a random variable, \(X\):
\[ X(x, y, z) = x + y + z. \]
The codomain of this random variable is the set of natural numbers in the range \([3, …, 18]\).
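To make this concrete, here is a minimal Python sketch (our own illustration, not part of the reading) that enumerates this sample space and implements the sum as a random variable:

```python
from itertools import product

# Sample space: all ordered triples (x, y, z) of die faces 1..6.
omega = list(product(range(1, 7), repeat=3))

# The random variable X interprets an outcome as the sum of the three dice.
def X(outcome):
    x, y, z = outcome
    return x + y + z

# The codomain of X ranges over [3, ..., 18].
print(min(X(w) for w in omega), max(X(w) for w in omega))  # 3 18
```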
Note that the codomain of a random variable need not be a number. For example, if the sample space \(\Omega\) is the set of valid rock-paper-scissors plays:
\[ \Omega = \set{(p_1, p_2) \mid p_1, p_2 \in P } \]
Where \(P = \set{\text{rock}, \text{paper}, \text{scissors}}\).
Then the random variable \(X_w\):
\[ X_w(p_1, p_2) = \begin{cases}\text{win} & \text{$p_1$ beats $p_2$} \\\\ \text{lose} & \text{otherwise}\end{cases} \]
has type \(X_w : P × P \rightarrow \set{\text{win}, \text{lose}}\).
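As a small illustration, here is a Python sketch of \(X_w\); the encoding of plays as strings and the `BEATS` relation are our own choices for this sketch:

```python
# A sketch of the rock-paper-scissors random variable X_w, whose
# codomain is not numeric.
P = {"rock", "paper", "scissors"}
BEATS = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}

def X_w(p1, p2):
    assert p1 in P and p2 in P
    return "win" if (p1, p2) in BEATS else "lose"

print(X_w("rock", "scissors"))  # win
print(X_w("rock", "paper"))     # lose
print(X_w("rock", "rock"))      # lose: ties fall into the "otherwise" case
```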
Exercise (Random Variables): Let \(\Omega\) consist of the outcomes of flipping three coins. Define a random variable \(X\) that counts the number of coins that turn up heads.
While the codomain of a random variable can be of any type, we most commonly work with real-valued random variables, i.e., those with codomain \(\mathbb{R}\). Let \(X : \Omega → \mathbb{R}\) be a random variable over a set of outcomes \(\Omega\), and suppose the existence of a probability function \(pr : \Omega → \mathbb{R}\) over these outcomes. Then the expected value of \(X\), written \(E[X]\), is defined to be the average of the outcomes, each weighted by its probability:
\[ E[X] = \sum_{t ∈ \Omega} X(t) \cdot pr(t). \]
Example: consider an experiment where we roll a weighted six-sided die with outcomes \(\Omega = \set{1, 2, 3, 4, 5, 6}\). The probabilities \(pr\) of each outcome are:
\[\begin{gather} pr(1) = \frac{1}{20} \qquad pr(2) = \frac{1}{6} \qquad pr(3) = \frac{1}{6} \qquad \\ pr(4) = \frac{1}{6} \qquad pr(5) = \frac{1}{5} \qquad pr(6) = \frac{1}{4}. \end{gather}\]
Let \(X : \Omega → \mathbb{R}\) be a random variable that represents the value of a single roll of the die. Then the expectation of \(X\) is the expected value of the weighted die:
\[ E[X] = 1 \cdot \frac{1}{20} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{5} + 6 \cdot \frac{1}{4} = 4.05. \]
In contrast, if all sides of the die were equally likely, then the expected value of the die would be:
\[ E[X] = \sum_{i=1}^{6} i \cdot \frac{1}{6} = 3.5. \]
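To check these numbers, here is a short Python sketch (our own, using exact fractions) that computes both expectations directly from the definition:

```python
from fractions import Fraction as F

# Probabilities of the weighted die from the example above.
pr = {1: F(1, 20), 2: F(1, 6), 3: F(1, 6), 4: F(1, 6), 5: F(1, 5), 6: F(1, 4)}
assert sum(pr.values()) == 1  # sanity check: the probabilities form a distribution

# E[X] = sum over outcomes t of X(t) * pr(t); here X is the identity on faces.
expected = sum(t * p for t, p in pr.items())
print(float(expected))  # 4.05

# A fair die for comparison: each face has probability 1/6.
fair = sum(t * F(1, 6) for t in range(1, 7))
print(float(fair))  # 3.5
```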
We can think of the expectation of a random variable as the weighted average of that variable, where the weights are the probabilities of the various outcomes.
One consequence of the definition of expectation is that we can treat \(E[-]\) as an operation on a (random) variable. With this perspective, we can see that several algebraic properties hold of expectations. The most important of these is the linearity of expectation:
Claim (Linearity of Expectation): Let \(X\) and \(Y\) be real-valued random variables. Then the following identities hold:
\[\begin{gather} E[X + Y] = E[X] + E[Y] \\ E[aX] = aE[X] \end{gather}\]
for any constant value \(a\).
The linearity of expectation says that addition and multiplication (of a constant) distribute in a natural sense through expectation. This fact allows us to manipulate and combine random variables as if they were plain old variables.
Example: let \(\Omega\) be the set of all pairs of outcomes of two six-sided dice. Let \(X\) be a random variable that indicates whether the first die is even:
\[ X(x, y) = \begin{cases}1 & \text{$x$ is even} \\\\ 0 & \text{otherwise}.\end{cases} \]
And let \(Y\) be a random variable that is defined to be the sum of the two dice values.
By the linearity of expectation, \(E[X + Y] = E[X] + E[Y] = \frac{1}{2} + \frac{7}{2} = 4\), the sum of the averages of the two random variables. Also by linearity, \(E[2Y] = 2E[Y] = 2 \cdot \frac{7}{2} = 7\), the average of \(Y\) scaled by a factor of two.
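We can verify both identities by brute-force enumeration over the 36 equally likely outcomes; the following Python sketch is our own illustration:

```python
from fractions import Fraction as F
from itertools import product

# All 36 equally likely outcomes of two six-sided dice.
omega = list(product(range(1, 7), repeat=2))
p = F(1, 36)

X = lambda o: 1 if o[0] % 2 == 0 else 0  # indicator that the first die is even
Y = lambda o: o[0] + o[1]                # sum of the two dice

# E[f] computed directly from the definition of expectation.
E = lambda f: sum(f(o) * p for o in omega)

print(E(X), E(Y))                # 1/2 7/2
print(E(lambda o: X(o) + Y(o)))  # 4, matching E[X] + E[Y]
print(E(lambda o: 2 * Y(o)))     # 7, matching 2 * E[Y]
```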
Probability Distributions
Many experiments share similar distributions of probabilities among their outcomes. The study of probability distributions and their properties is an important part of the mathematical subfield of statistics. Here, we explore the basic concepts of probability distributions in light of our fundamental definitions of probability theory.
Definition (Probability Distribution): let \(X : \Omega → T\) be a random variable over a sample space \(\Omega\) and interpretation \(T\). A probability distribution is a function \(\Pr : T → \mathbb{R}\) that describes the probabilities of the various interpretations of the elements of the sample space.
More informally, a probability distribution is a description of how a probability function distributes probabilities among the possible outcomes of an experiment. Many kinds of experiments fall into a handful of well known and understood probability distributions.
Bernoulli Distributions
Let \(X\) be a random variable with codomain \(\set{0, 1}\). Then a probability distribution \(\Pr : \set{0, 1} → \mathbb{R}\) over \(X\) forms a Bernoulli distribution with probability \(p\) where:
\[\begin{align} \Pr(1) &=\; p \\ \Pr(0) &=\; 1 - p \end{align}\]
We call \(p\) a parameter of the probability distribution. The Bernoulli distribution describes the outcome of a single experiment with a binary outcome: success or failure. The use of \(0\) and \(1\) to indicate boolean values is common in many areas of mathematics.
Example: here are some applications of the Bernoulli distribution.
- The probability of a single fair coin flip being heads forms a Bernoulli distribution with probability of success \(p = 0.5\).
- Suppose you play a game where you roll two six-sided dice, and you win if the sum of the dice is greater than 8. Then the probability of winning the game forms a Bernoulli distribution with probability of success \(p = \frac{10}{36}\). (Note that there are 10 ways out of \(6 × 6 = 36\) possibilities to roll a sum higher than 8 with two six-sided dice.)
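The count of 10 winning outcomes can be checked by enumeration; the following short Python sketch is our own illustration:

```python
from itertools import product

# Count the outcomes of two six-sided dice whose sum exceeds 8.
wins = [(x, y) for x, y in product(range(1, 7), repeat=2) if x + y > 8]
print(len(wins))  # 10, out of 36 outcomes, so p = 10/36
```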
Note how the Bernoulli distribution allows us to concisely describe the distribution of a set of probabilities. Different distributions exist in statistics that capture a wide variety of possible probabilities and situations.
Binomial Distributions
We can describe a particular probability distribution using a variety of statistics which summarize salient characteristics of that distribution. These include statistics you ought to be familiar with already, e.g.,
- The average (the expected value of a random variable),
- Median (the value that splits the probability distribution in half), and
- Mode (the most frequent value).
Let \(X : \Omega → \mathbb{N}\) be a random variable that records the number of successes after running \(n\) independent experiments. Then a probability distribution \(\Pr : \mathbb{N} → \mathbb{R}\) over \(X\) forms a binomial distribution, where the probability of obtaining exactly \(k\) successes is given by:
\[ \Pr(k) = {n \choose k} p^{k} (1-p)^{n-k} \]
where \(p\) is the probability of a single experiment generating a success. As shorthand, we write \(\mathcal{B}(n, p)\) for the binomial distribution consisting of \(n\) independent experiments with probability of success \(p\) for an individual experiment.
The probability function is derived combinatorially as follows:
- The probability that a particular set of \(k\) experiments all succeed is \(\underbrace{p \cdot \cdots \cdot p}_{k} = p^k\).
- The remaining experiments must be failures, and there are \(n-k\) of them, so the probability of this is \(\underbrace{(1-p)\cdot \cdots \cdot (1-p)}_{n-k} = (1-p)^{n-k}\).
- Finally, any \(k\)-subset of the \(n\) experiments may be the successes, so there are \({n \choose k}\) such combinations where we have \(k\) successes and \(n-k\) failures.
This final point is why we alternatively call the “choose” operator \({n \choose k}\) the binomial coefficient.
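Translating the formula directly into code, here is a minimal Python sketch of the binomial probability function (our own illustration, using the standard-library `math.comb` for the binomial coefficient):

```python
from math import comb

# Pr(k) = C(n, k) * p^k * (1-p)^(n-k): the probability of exactly k
# successes in n independent experiments, each succeeding with probability p.
def binomial_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 20, 0.75
# The probabilities over k = 0..n sum to 1, by the binomial theorem.
print(sum(binomial_pmf(n, p, k) for k in range(n + 1)))  # 1.0 (up to rounding)
print(binomial_pmf(n, p, 15))  # probability of exactly 15 successes, ≈ 0.202
```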
The binomial distribution is a generalization of the Bernoulli distribution where we conduct \(n\) such experiments rather than a single one. As such, it is potentially relevant whenever we are discussing the outcomes of running repeated trials with a binary result.
Example: suppose that we flip a biased coin with probability \(p = 0.75\) of heads and \(1 - p = 0.25\) of tails \(20\) times, with success defined as obtaining heads. This forms a binomial distribution \(\mathcal{B}(20, 0.75)\).
Once we identify a probability distribution as binomial, we can apply known formulae to quickly derive statistics for that distribution.
Let \(\mathcal{B}(n, p)\) be a binomial distribution. Then:
- The mean or expected value of the distribution is \(np\).
- The median of the distribution is either \(⌊np⌋\) or \(⌈np⌉\), where \(⌊-⌋\) is the unary floor function, which rounds its argument down, and \(⌈-⌉\) is the unary ceiling function, which rounds its argument up.
- The mode of the distribution is either \(⌊(n+1)p⌋\) or \(⌈(n+1)p⌉ - 1\).
Example: For our biased coin example above, the expected number of heads is \(20 \cdot 0.75 = 15\).
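These formulae are straightforward to evaluate; the following Python sketch (our own illustration) computes the candidate statistics for \(\mathcal{B}(20, 0.75)\):

```python
from math import floor, ceil

# Closed-form statistics for the binomial distribution B(n, p).
n, p = 20, 0.75
mean = n * p
median_candidates = {floor(n * p), ceil(n * p)}
mode_candidates = {floor((n + 1) * p), ceil((n + 1) * p) - 1}
print(mean, median_candidates, mode_candidates)  # 15.0 {15} {15}
```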
Exercise (More Gambling): let's expand on the simple gambling game from the previous reading's exercise.
Suppose that to play the game, you need to put in $1. Roll three six-sided dice in sequence. Say that a die wins if it is a five or a six.
- If the first die wins you get $1.
- If the first and second die win you get $2.
- If the first, second, and third die win you get $4.
- In all other cases, you get nothing.
- Write down the sample space \(\Omega\) of possible outcomes and a set of four disjoint events \(E_1, \ldots, E_4\) that describe the outcomes above.
- Write down the definition of a random variable \(X\) that describes the amount of money you may win from a single play of the game.
- Calculate the expected value \(E[X]\) of the random variable you defined above. Based on the computed value of \(E[X]\), is it worthwhile to play this game?