Naive Bayes is a special class of Bayesian network inference problem which follows a specific structure used to solve classification problems. The Naive Bayes classifier is a Bayesian network of the shape where the class c is the parent of each observation o_i. (Why is this backwards(ish)? Though we typically think about models as a function M(obs) = cls, the real world works almost the opposite way: World(thing happening) = things we observe. Therefore, the observations are a RESULT of the class happening.) We consider, naively, that o_{1:n} are all conditionally independent given c. From this graph, we can therefore use the probability chain rule plus this conditional independence to write:

\begin{equation} P(c, o_{1:n}) = P( c) \prod_{i=1}^{n} P(o_{i} | c) \end{equation}
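For example, with three observations this factorization expands to:

\begin{equation} P(c, o_{1}, o_{2}, o_{3}) = P( c)\, P(o_{1} | c)\, P(o_{2} | c)\, P(o_{3} | c) \end{equation}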

So, to actually compute this, we don't want to multiply all of those small probabilities directly because of underflow; instead, we work in log space and write:

\begin{equation} \hat{y} = \arg\max_{y} \left[ \log \hat{P}(y) + \sum_{i=1}^{n} \log \hat{P}(x_{i} | y) \right] \end{equation}
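As a sketch of what this looks like in code (the class priors, per-feature probabilities, and feature vector below are made-up placeholders, not values from this page), summing logs avoids the underflow you would get from multiplying many small probabilities:

```python
import math

# Hypothetical estimated parameters for a 2-class, 3-feature binary model.
# prior[y] = P_hat(y); likelihood[y][i] = P_hat(x_i = 1 | y).
prior = {0: 0.6, 1: 0.4}
likelihood = {
    0: [0.10, 0.70, 0.30],
    1: [0.80, 0.20, 0.50],
}

def predict(x):
    """Return argmax_y of log P_hat(y) + sum_i log P_hat(x_i | y)."""
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, x_i in enumerate(x):
            p = likelihood[y][i]
            # P_hat(x_i | y) is p if x_i = 1, else 1 - p (Bernoulli feature).
            score += math.log(p if x_i == 1 else 1 - p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict([1, 0, 1]))  # -> 1 with these made-up numbers
```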

The brute-force Bayes classifier comes from Bayes' rule: \begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} \frac{P(x|y)P(y)}{P(x)} \end{equation} but because we are taking the argmax over y, we do not need to normalize by P(x):

\begin{equation} \arg\max_{y} P(y|x) = \arg\max_{y} P(x|y)P(y) \end{equation}
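A quick numeric sanity check (with made-up scores) that dividing by the constant P(x) never changes which class wins the argmax:

```python
# Unnormalized scores P(x | y) * P(y) for two classes (illustrative numbers).
scores = {0: 0.03, 1: 0.09}
p_x = sum(scores.values())            # P(x) by the law of total probability
posterior = {y: s / p_x for y, s in scores.items()}

# Both pick the same class, since P(x) scales every score equally.
assert max(scores, key=scores.get) == max(posterior, key=posterior.get)
```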

This only works if x is a single value (i.e. you have a one-feature classifier). This system has 6 parameters; they can be fit by MLE for a Bernoulli from data, but you can also use a Bayesian parameter learning method to perform estimation with MAP.

| P(x1 \| y) | y = 0 | y = 1 |
| --- | --- | --- |
| x1 = 0 | theta0 | theta2 |
| x1 = 1 | theta1 | theta3 |

| y | P(y) |
| --- | --- |
| y = 0 | theta4 |
| y = 1 | theta5 (= 1 - theta4) |

With plain MLE, we estimate by counting:

\begin{equation} p(X=1| Y=0) = \frac{\text{examples where X=1, Y=0}}{\text{examples where Y=0}} \end{equation}

or, with Laplace smoothing (adding a Laplace prior to the counts):

\begin{equation} p(X=1| Y=0) = \frac{(\text{examples where X=1, Y=0})+1}{(\text{examples where Y=0})+(\text{number of values X can take} = 2)} \end{equation}
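A minimal counting sketch of both estimates, using a handful of made-up (x1, y) training pairs:

```python
# Made-up binary training pairs (x1, y).
data = [(1, 0), (0, 0), (1, 0), (1, 1), (0, 1), (1, 1), (1, 1)]

n_y0 = sum(1 for x, y in data if y == 0)
n_x1_y0 = sum(1 for x, y in data if x == 1 and y == 0)

# Plain MLE: count of (X=1, Y=0) over count of Y=0.
p_mle = n_x1_y0 / n_y0

# Laplace smoothing: add 1 to the numerator and the number of
# values X can take (here 2) to the denominator.
p_laplace = (n_x1_y0 + 1) / (n_y0 + 2)

print(p_mle, p_laplace)  # 0.666..., 0.6
```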

We can keep going; for instance, if you have x_1, x_2, two different features:

\begin{equation} \arg\max_{y} P(y|x_1, x_2) = \arg\max_{y} P(x_1, x_2|y)P(y) \end{equation}

But this requires us to have 2^{2} and ultimately 2^{n} parameters, which is an exponential blowup. Hence, we need to treat the variables as (naively) independent so that we can simply multiply them. Hence the Naive Bayes assumption: we assume the input features are conditionally independent given the class. That is, we assume:

\begin{equation} P(x_1, \dots, x_{n}|y) = \prod_{i=1}^{n} P(x_{i}|y) \end{equation}
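To see the savings concretely: with n binary features and a binary class, the full conditional P(x_1, \dots, x_n | y) needs on the order of 2(2^{n}-1) parameters, while the factored form only needs 2n Bernoulli parameters (one per feature per class).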

Inference with Naive Bayes: recall the definition of inference for our case here: given observations o_{1:n}, we want to know the probability of c happening. That is, from the definition of conditional probability:

\begin{equation} P(c | o_{1:n}) = \frac{P(c, o_{1:n})}{P(o_{1:n})} \end{equation}

Now, from above we already have P(c, o_{1:n}). To get the denominator, we invoke the law of total probability and add up the joint probability of the observations with every possible class. That is:

\begin{equation} P(o_{1:n}) = \sum_{c \in C} P(c, o_{1:n}) \end{equation}

You will note that this value P(o_{1:n}) is constant with respect to c: once the observations are fixed, it does not depend on which class we are scoring. Therefore, we tend to write:

\begin{align} P(c | o_{1:n}) &= \frac{P(c, o_{1:n})}{P(o_{1:n})} \\ &= \kappa P(c, o_{1:n}) \end{align}

or, that:

\begin{equation} P(c|o_{1:n}) \propto P(c, o_{1:n}) \end{equation}
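Putting the inference step together, here is a small sketch (reusing the same made-up tables as the log-space example above) that computes the joint P(c, o_{1:n}) for every class and then normalizes with kappa = 1 / P(o_{1:n}):

```python
# Hypothetical tables: prior[c] = P(c), likelihood[c][i] = P(o_i = 1 | c).
prior = {0: 0.6, 1: 0.4}
likelihood = {0: [0.10, 0.70, 0.30], 1: [0.80, 0.20, 0.50]}

def joint(c, obs):
    """P(c, o_{1:n}) = P(c) * prod_i P(o_i | c)."""
    p = prior[c]
    for i, o in enumerate(obs):
        p_i = likelihood[c][i]
        p *= p_i if o == 1 else 1 - p_i
    return p

def posterior(obs):
    """P(c | o_{1:n}) = kappa * P(c, o_{1:n}), kappa = 1 / sum_c P(c, o_{1:n})."""
    joints = {c: joint(c, obs) for c in prior}
    kappa = 1 / sum(joints.values())   # law of total probability gives P(o_{1:n})
    return {c: kappa * p for c, p in joints.items()}

print(posterior([1, 0, 1]))  # posterior probabilities sum to 1
```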

In words: “the probability of a class occurring given the inputs is proportional to the probability of that class occurring along with the inputs.”

Multiple beliefs: with a single piece of evidence R_1, the same proportionality reads

\begin{equation} P(A=a | R_1) \propto P(R_1 | A=a) \cdot P(A=a) \end{equation}

Motivation for Bayes' rule: what if we don't want to use the law of total probability to add up P(FB^{1})? Since P(FB^{1}) does not depend on TSF, we can write:

\begin{equation} P(TSF^{1}|FB^{1}) \propto P(TSF^{1})P(FB^{1} | TSF^{1}) \end{equation}