https://insightimi.wordpress.com/2020/04/04/naive-bayes-classifier

The mathematics behind the Naive Bayes algorithm and its applications

Rana singh

--

Bayes’ Theorem states that the conditional probability of an event, based on the occurrence of another event, is equal to the likelihood of the second event given the first event multiplied by the probability of the first event.
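In symbols: P(A|B) = P(B|A) · P(A) / P(B).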

Naive Bayes is a probability-based classification algorithm.

Conditional probability:

https://www.youtube.com/watch?v=H02B3aMNKzE

Conditional probability is defined as the likelihood of an event or outcome occurring, given that another event or outcome has already occurred.

Examples: https://en.wikipedia.org/wiki/Conditional_probability

Independent vs Mutually exclusive events:

If P(A|B) = P(A), or equivalently P(B|A) = P(B), then A and B are known as independent events.

If P(A|B) = P(B|A) = 0, then A and B are known as mutually exclusive events.

https://www.youtube.com/watch?v=Px9byn_Ioyw

Example at: https://en.wikipedia.org/wiki/Bayes%27_theorem

Naive Bayes classifier algorithm:

Assumptions made by Naïve Bayes:

  • features are assumed to be conditionally independent of each other, given the class
  • each feature makes an equal contribution to the outcome

Probabilistic model:

Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x1, x2, …, xn) of n features (independent variables), it assigns to this instance the probabilities P(Ck | x1, x2, …, xn) for each of the K possible classes Ck.

The problem with the above formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. Using Bayes’ theorem, the conditional probability can be decomposed as:
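P(Ck | x1, …, xn) = P(Ck) · P(x1, …, xn | Ck) / P(x1, …, xn)

In plain English, using Bayesian probability terminology, this is posterior = prior × likelihood / evidence.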

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C and the values of the features xi are given, so that the denominator is effectively constant.

The numerator is equivalent to the joint probability model P(Ck, x1,x2,x3…xn) which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:
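P(Ck, x1, …, xn) = P(x1 | x2, …, xn, Ck) · P(x2 | x3, …, xn, Ck) · … · P(xn-1 | xn, Ck) · P(xn | Ck) · P(Ck)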

https://en.wikipedia.org/wiki/Naive_Bayes_classifier

Now the “naive” conditional independence assumptions come into play: assume that all features in x are mutually independent, conditional on the category Ck. Under this assumption,
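P(xi | xi+1, …, xn, Ck) = P(xi | Ck)

so the joint model can be expressed as

P(Ck, x1, …, xn) = P(Ck) · P(x1 | Ck) · P(x2 | Ck) · … · P(xn | Ck)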

This means that under the above independence assumptions, the conditional distribution over the class variable C is:
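P(Ck | x1, …, xn) = (1/Z) · P(Ck) · P(x1 | Ck) · P(x2 | Ck) · … · P(xn | Ck)

where the evidence Z = P(x1, …, xn) is a scaling factor that depends only on the feature values, which are known, and is therefore constant for a given instance.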

Constructing a classifier from the probability model:

The naive Bayes classifier combines this model with a decision rule. One common rule is to pick the hypothesis that is most probable, so as to minimize the probability of misclassification; this is known as the maximum a posteriori (MAP) decision rule.

The classification may also be multiclass, so we have to find the class variable y with the maximum posterior probability.
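In symbols, the resulting classifier is:

y = argmax over k in {1, …, K} of P(Ck) · P(x1 | Ck) · P(x2 | Ck) · … · P(xn | Ck)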

Refer to this blog for a worked example: http://shatterline.com/blog/

Time and space complexity:

Training

  • Time complexity: O(n·d), where n is the number of training examples and d the number of features
  • Space complexity: O(d·c), where d is the number of features and c the number of classes (only the probability tables are stored, so the model is much more memory-efficient at run time; see the sketch below)
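As a rough illustration of those costs, here is a minimal from-scratch sketch (not code from this post) of multinomial Naive Bayes training: one counting pass over an n × d count matrix, storing only a d × c table of smoothed log-likelihoods plus c log-priors.

```python
import numpy as np

def train_nb(X, y, alpha=1.0):
    """Train multinomial NB on an (n, d) count matrix X with class labels y."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    log_priors, log_likelihoods = {}, {}     # likelihoods: one length-d vector per class
    for c in classes:
        Xc = X[y == c]                       # rows belonging to class c
        counts = Xc.sum(axis=0)              # counting pass over n_c rows and d columns
        log_priors[c] = np.log(len(Xc) / len(X))
        # Laplace smoothing with parameter alpha (discussed below)
        log_likelihoods[c] = np.log((counts + alpha) / (counts.sum() + alpha * X.shape[1]))
    return log_priors, log_likelihoods
```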

Note:

  • Applications: spam filtering and positive/negative review classification, i.e. mainly text classification problems
  • Naive Bayes is considered a benchmark (baseline) model, especially for text classification

Laplace smoothing in the Naive Bayes algorithm (use alpha = 1):

What if a word in a review was not present in the training dataset?

Query review = w1 w2 w3 w’

In the likelihood table, we have P(w1|positive), P(w2|Positive), P(w3|Positive), and P(positive). Oh, wait, but where is P(w’|positive)?

In a bag-of-words model, we count the occurrences of words. The number of occurrences of the word w’ in the training data is 0. According to that,

P(w’|positive) = 0 and P(w’|negative) = 0, but this will make both P(positive|review) and P(negative|review) equal to 0, since we multiply all the likelihoods. This is the problem of zero probability. So, how do we deal with this problem?

Laplace smoothing:

Laplace smoothing is a smoothing technique that handles the problem of zero probability in Naïve Bayes. Using Laplace smoothing, we can represent P(w’|positive) as
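P(w’|positive) = (number of occurrences of w’ in positive reviews + alpha) / (N + alpha·K)

With the count of w’ equal to 0, this becomes alpha / (N + alpha·K), which is small but no longer zero.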

Here,
alpha represents the smoothing parameter,
K represents the number of dimensions (features) in the data, and
N represents the number of reviews with y=positive

If we choose a value of alpha!=0 (not equal to 0), the probability will no longer be zero even if a word is not present in the training dataset.

Laplace smoothing is a smoothing technique that helps tackle the problem of zero probability in the Naïve Bayes machine learning algorithm. Using higher alpha values will push the likelihood towards a value of 0.5, i.e., the probability of a word becomes 0.5 for both the positive and negative reviews. Since we are not getting much information from that, it is not preferable. Therefore, it is preferable to use alpha = 1. As alpha increases, the likelihood probabilities move towards a uniform distribution.

Note: smoothing should be applied to every likelihood term, not just to the unseen words.

Bias and Variance tradeoff:

alpha (the Laplace smoothing parameter) controls the trade-off between underfitting and overfitting.

Case 1: alpha = 0

A small change in the data causes a large change in the model, i.e. high variance, which means overfitting.

Case 2: alpha is very large

P(wi|y) ≈ 0.5 for every word, i.e. the model predicts a probability of roughly 0.5 irrespective of the class. This is underfitting, which means high bias.

So, to find an appropriate alpha (a hyperparameter), use simple or 10-fold cross-validation.
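For instance, a minimal sketch of tuning alpha with cross-validation using scikit-learn's MultinomialNB (the toy reviews below are placeholders, not data from this post):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

reviews = ["great product", "terrible quality", "loved it", "not good"]  # toy data
labels = [1, 0, 1, 0]

pipe = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])
grid = GridSearchCV(pipe,
                    param_grid={"nb__alpha": [0.01, 0.1, 0.5, 1.0, 10.0]},
                    cv=2)          # use cv=10 (10-fold) on a real dataset
grid.fit(reviews, labels)
print(grid.best_params_)           # best smoothing value found by cross-validation
```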

Impact of imbalanced data on NB:

Naive Bayes is impacted by imbalanced data,

because of the class priors P(y=1) and P(y=0): when the products of the likelihoods are similar, the majority (dominating) class has an advantage.

Solution:

  • upsampling or downsampling
  • dropping the prior probability terms P(y=1) and P(y=0) (see the sketch below)
  • using a different alpha for each class, because the same alpha impacts the minority class more than the majority class
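As an illustration of the second option, scikit-learn's MultinomialNB lets you override the learned priors (a hedged sketch, not code from this post):

```python
from sklearn.naive_bayes import MultinomialNB

# Ignore the learned class priors and use a uniform prior instead
nb_uniform_prior = MultinomialNB(alpha=1.0, fit_prior=False)

# Or fix the priors explicitly, e.g. equal weight for a two-class problem
nb_fixed_prior = MultinomialNB(alpha=1.0, class_prior=[0.5, 0.5])
```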

Handling outliers in NB:

  1. ignore (remove) outliers from the training/test data if their frequency is small (e.g. they occur fewer than 10 times)
  2. use the Laplace smoothing hyperparameter

Missing value treatment in NB:

  1. Case 1: text data: there is no case of missing data
  2. Case 2: categorical features: treat NaN itself as a separate missing category
  3. Case 3: numerical features: use feature imputation (for numerical features we generally use Gaussian NB)

High-dimensional data:

NB can handle high-dimensional data, but use log probabilities instead of multiplying the individual probabilities so that you do not run into numerical stability or underflow issues. All probability values lie between 0 and 1, so after multiplying many of them the product becomes insignificant; summing log probabilities avoids this.
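A quick illustration of the underflow problem and the log-probability fix (the likelihood values are made up):

```python
import math

likelihoods = [1e-5] * 100          # e.g. 100 words, each with P(w|C) = 1e-5

product = 1.0
for p in likelihoods:
    product *= p
print(product)                       # 0.0: the product underflows to zero

log_score = sum(math.log(p) for p in likelihoods)
print(log_score)                     # about -1151.3, still perfectly usable for argmax
```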

Types of Naive Bayes Classifier:

Multinomial Naive Bayes:

This is mostly used for document classification problems, i.e. whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.
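A minimal sketch with scikit-learn, assuming word-count features (the documents and categories are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the team won the match", "parliament passed the new bill", "a new phone was released"]
labels = ["sports", "politics", "technology"]

vec = CountVectorizer()                               # word-frequency features
model = MultinomialNB(alpha=1.0).fit(vec.fit_transform(docs), labels)
print(model.predict(vec.transform(["the team released a statement"])))
```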

Bernoulli Naive Bayes:

This is similar to multinomial naive Bayes, but the predictors are boolean variables. The parameters that we use to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
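A minimal sketch with scikit-learn, binarizing the bag of words so each feature records only word presence (toy data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["free prize inside", "meeting at noon", "win a free prize now"]
labels = ["spam", "ham", "spam"]

vec = CountVectorizer(binary=True)           # 1 if the word occurs in the document, else 0
model = BernoulliNB(alpha=1.0).fit(vec.fit_transform(docs), labels)
print(model.predict(vec.transform(["free lunch at noon"])))
```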

Gaussian Naive Bayes:

When the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a gaussian distribution.

Since the values present in the dataset are continuous rather than counts, the formula for the conditional probability changes to:
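P(xi | y) = 1 / sqrt(2·pi·sigma_y^2) · exp( -(xi - mu_y)^2 / (2·sigma_y^2) )

where mu_y and sigma_y^2 are the mean and variance of feature xi among the training examples of class y. A minimal scikit-learn sketch on made-up continuous features:

```python
from sklearn.naive_bayes import GaussianNB

X = [[5.1, 3.5], [4.9, 3.0], [6.7, 3.1], [6.3, 2.9]]   # continuous features
y = [0, 0, 1, 1]

model = GaussianNB().fit(X, y)
print(model.predict([[6.0, 3.0]]))
```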

Note: NB cannot be used with distance/similarity metrics the way KNN is; it has no distance-based component, it is a probability-based method.

Best and worst cases:

Best case:

  • high interpretability
  • low training-time and run-time complexity
  • low run-time memory footprint

Worst case:

  • it can easily overfit, so use Laplace smoothing to avoid this

Reference:

https://scikit-learn.org/stable/modules/naive_bayes.html
