Recall Bayes’s theorem from our previous reading:
\[ \Pr(A \mid B) = \frac{\Pr(B \mid A)}{\Pr(B)} \cdot \Pr(A) \]
Observe that we can interpret Bayes’s theorem as a way of expressing an updated belief about the probability of \(A\) given that we learn some new information \(B\).
- \(\Pr(A)\) is the original probability, called the prior belief of \(A\).
- \(\Pr(A \mid B)\) is the updated probability, called the posterior belief of \(A\) given \(B\) occurs.
To obtain the posterior belief, we adjust the prior belief using two quantities (a short numeric example follows this list):
- \(\Pr(B \mid A)\), the likelihood, the probability of the new information occurring given that \(A\) occurs, which we multiply by.
- \(\Pr(B)\), the prior probability of the new information, which we divide by.
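For instance, suppose, purely for illustration, that \(\Pr(A) = 0.3\), \(\Pr(B \mid A) = 0.8\), and \(\Pr(B) = 0.6\). Then the posterior belief is
\[ \Pr(A \mid B) = \frac{0.8}{0.6} \cdot 0.3 = 0.4, \]
so learning \(B\) raises our belief in \(A\) from \(0.3\) to \(0.4\).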
We can apply this interpretation to the classification problem to obtain a simple algorithm, the Naive Bayes algorithm. For example, consider the following table of data that records whether we went to class under various circumstances:
Time of Day | Partied Yesterday? | Test Today? | Hungry? | Went to Class? |
---|---|---|---|---|
Early | Yes | No | Yes | No |
Early | No | No | No | No |
Early | No | No | No | Yes |
Late | Yes | Yes | Yes | Yes |
Late | No | No | Yes | No |
Early | Yes | Yes | No | Yes |
Early | No | No | Yes | Yes |
Late | No | No | No | No |
Late | Yes | No | Yes | Yes |
Late | No | Yes | Yes | Yes |
We’ll call an individual row of this table a piece of data. The first four columns are the features of the data:
- “Time of Day”, either Early or Late.
- “Partied Yesterday?”, either Yes or No.
- “Test Today?”, either Yes or No.
- “Hungry?”, either Yes or No.
To simplify our subsequent calculations, we’ll let all of our features be boolean, i.e., two-valued, in nature, but we could certainly work with other types of values as well. We can represent these features as a 4-tuple, e.g., \((\text{Early}, \text{Yes}, \text{No}, \text{Yes})\) for the first entry of the table.
The final column, “Went to Class?”, is the class of the data. For example, the class of the first entry of the table is “No,” i.e., when it was early, we partied, there was no test, and we were hungry, we recorded that we did not go to class.
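If it helps to see the data concretely, here is one way we might encode the table in code, say in Python; this is only a sketch, and the names `FEATURE_NAMES` and `TRAINING_DATA` are our own choices, not anything prescribed by the reading.
```python
# Each entry pairs a 4-tuple of features (Time of Day, Partied Yesterday?,
# Test Today?, Hungry?) with its class (Went to Class?).
FEATURE_NAMES = ("Time of Day", "Partied Yesterday?", "Test Today?", "Hungry?")

TRAINING_DATA = [
    (("Early", "Yes", "No",  "Yes"), "No"),
    (("Early", "No",  "No",  "No"),  "No"),
    (("Early", "No",  "No",  "No"),  "Yes"),
    (("Late",  "Yes", "Yes", "Yes"), "Yes"),
    (("Late",  "No",  "No",  "Yes"), "No"),
    (("Early", "Yes", "Yes", "No"),  "Yes"),
    (("Early", "No",  "No",  "Yes"), "Yes"),
    (("Late",  "No",  "No",  "No"),  "No"),
    (("Late",  "Yes", "No",  "Yes"), "Yes"),
    (("Late",  "No",  "Yes", "Yes"), "Yes"),
]
```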
The table above represents our training data. Given this training data, a classification algorithm is then able to compute the most likely class for a new piece of data. For example, if we see the data:
- \((\text{Early}, \text{No}, \text{No}, \text{Yes})\)
Is it more likely we’ll go to class or not in this scenario?
Problem 1: Working Out the Theory
First, let’s see how Bayes’s theorem applies to this situation. Ultimately, our algorithm is a classifier which determines the class of a new piece of data given the training data it is initialized with. We’ll let \(A\) be the result of the classifier, i.e., whether we go to class, and \(B\) be the data fed to the algorithm, i.e., the 4-tuple. If we let \(X\) be the 4-tuple and \(c\) be the class, then, according to Bayes’s theorem, we have:
\[ \Pr(c \mid X) = \frac{\Pr(X \mid c)}{\Pr(X)} \cdot \Pr(c) \]
Note that \(c\) can be either “Yes” or “No”, so we can calculate the probability of either class for our data \(X\) by instantiating \(c\) appropriately. Ultimately, our classifier computes each of these probabilities, \(\Pr(\text{Yes} \mid X)\) and \(\Pr(\text{No} \mid X)\), and then assigns \(X\) the class that generates the higher probability.
In terms of notation, the \(\mathop{\mathrm{arg\,max}}\) function captures this idea succinctly. If \(C = \{\, \text{"Yes"}, \text{"No"} \,\}\) is our set of classes, then this computation is:
\[ \mathop{\mathrm{arg\,max}}_{c \in C} \frac{\Pr(X \mid c)}{\Pr(X)} \cdot \Pr(c) \]
In other words, we choose the class \(c\) that maximizes the probability described by Bayes’s equation. Next, we must figure out how to compute each of these probabilities in terms of our training data. But before we do that, let’s simplify the formula so that we don’t compute more than we need to. One observation is that computing \(\Pr(X)\) is unnecessary, so we can instead compute:
\[ \mathop{\mathrm{arg\,max}}_{c \in C} \Pr(X \mid c) \Pr(c) \]
In a sentence or two, describe why we don’t need to compute \(\Pr(X)\) in our \(\mathop{\mathrm{arg\,max}}\) calculation.
(Hint: think about for each \(c\) what \(\Pr(X)\) represents and whether it depends on the particular \(c\) in question.)
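To make the shape of this computation concrete, here is a minimal Python sketch of the \(\mathop{\mathrm{arg\,max}}\) step. The functions `prior` and `likelihood` are hypothetical placeholders for probabilities estimated from the training data (Problem 2 works out how to estimate them); they are not part of the reading itself.
```python
def classify(x, classes, prior, likelihood):
    """Pick the class c that maximizes Pr(X | c) * Pr(c); Pr(X) is omitted.

    `prior(c)` and `likelihood(x, c)` are assumed to be supplied by the
    caller, e.g., estimated from the training data.
    """
    return max(classes, key=lambda c: likelihood(x, c) * prior(c))
```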
Problem 2: Independence and Computation
The above calculation now demands that we compute two quantities for each class \(c\):
- \(\Pr(c)\): the probability that the class occurs in the training data.
- \(\Pr(X \mid c)\): the likelihood of the data given the class.
We can employ the definition of conditional probability, which says that:
\[ \Pr(X \cap c) = \Pr(X \mid c) \cdot \Pr(c) \]
Used in this fashion, this identity is called the chain rule of probability. Therefore, we are really computing \(\Pr(X \cap c)\), the probability that we have data \(X\) and class \(c\) at the same time. Recall that \(X\) is really a tuple of sub-events, \(X = (x_1, x_2, x_3, x_4)\). With this in mind, we can repeatedly apply the chain rule, observing that the intersection of events is commutative:
\[ \Pr(X \mid c) \cdot \Pr(c) = \Pr(c) \cdot \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c, x_1) \cdot \Pr(x_3 \mid c, x_2, x_1) \cdot \Pr(x_4 \mid c, x_3, x_2, x_1). \]
While this, in theory, allows us to compute \(\Pr(X \mid c)\), it is onerous to do: we would need to know the conditional probability of each feature given the class and all of the features that precede it in the expansion.
To simplify the computation, let’s assume that the features \(x_1, \ldots, x_4\) are conditionally independent of each other given the class, i.e., once the class is known, the probability of one feature does not depend on the others. This “naive” assumption is what gives the algorithm its name, and it leads us to the following computation:
\[ \Pr(X \mid c) = \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c) \cdot \Pr(x_3 \mid c) \cdot \Pr(x_4 \mid c). \]
\(\Pr(x_1 \mid c)\), for example, is the probability in our training data that it was early (or not) given that we went to class (or not). These probabilities are much easier to compute!
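As a quick check of what such a term looks like, consider \(\Pr(\text{Time of Day} = \text{Early} \mid \text{Went to Class?} = \text{Yes})\): the table above has six rows whose class is “Yes”, and three of those rows have “Early” as their time of day, so this probability is \(3/6 = 1/2\).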
In summary, we now have the following computation to perform:
\[ \mathop{\mathrm{arg\,max}}_{c \in C} \Pr(c) \cdot \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c) \cdot \Pr(x_3 \mid c) \cdot \Pr(x_4 \mid c). \]
And then we choose the class \(c\) that maximizes this probability.
Go through the training data and hand-calculate the various probabilities you need to carry out this computation. It’ll be useful to organize these probabilities into various tables that you can look up easily. Once you have done this, determine what class your resulting classifier assigns the following data:
- \((\text{Early}, \text{No}, \text{No}, \text{Yes})\)
Create three other test data points and run the classifier on them as well, showing the computation you performed.
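Once you have worked these out by hand, a short program can serve as a cross-check. The sketch below estimates \(\Pr(c)\) and each \(\Pr(x_i \mid c)\) by counting rows of `TRAINING_DATA` (from the earlier sketch) and then applies the \(\mathop{\mathrm{arg\,max}}\) rule via the `classify` function sketched in Problem 1; it is one possible implementation under those assumptions, not the definitive one.
```python
from collections import Counter

def train(data):
    """Estimate Pr(c) and Pr(x_i = v | c) by counting rows of `data`."""
    class_counts = Counter(c for _, c in data)
    total = len(data)
    # feature_counts[(i, v, c)] counts rows of class c whose i-th feature is v.
    feature_counts = Counter()
    for features, c in data:
        for i, v in enumerate(features):
            feature_counts[(i, v, c)] += 1

    def prior(c):
        return class_counts[c] / total

    def likelihood(x, c):
        # Naive assumption: multiply per-feature conditional probabilities.
        # (No smoothing: a value never seen with class c zeroes the product.)
        result = 1.0
        for i, v in enumerate(x):
            result *= feature_counts[(i, v, c)] / class_counts[c]
        return result

    return prior, likelihood

# `TRAINING_DATA` and `classify` come from the earlier sketches in this reading.
prior, likelihood = train(TRAINING_DATA)
print(classify(("Early", "No", "No", "Yes"), ["Yes", "No"], prior, likelihood))
```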