Parameters  number of categories (integer) event probabilities 

Support  
pmf 
(1)

Mode 
In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution^{[1]}) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible elementary events (single outcomes of the discrete sample space, often referred to as categories), with the probability of each elementary event separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution, (e.g. 1 to K). The Kdimensional categorical distribution is the most general distribution over a Kway event; any other discrete distribution over a sizeK sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.
The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a dice. On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.
Occasionally, the categorical distribution is termed the "discrete distribution". However, this properly refers not to one particular family of distributions but to a general class of distributions.
Note that, in some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise.^{[2]} This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1ofK" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below).
However, conflating the categorical and multinomial distributions can lead to problems. For example, in a Dirichletmultinomial distribution, which arises commonly in natural language processing models (although not usually with this name) as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a hierarchical Bayesian model, it is very important to distinguish categorical from multinomial. The joint distribution of the same variables with the same Dirichletmultinomial distribution has two different forms depending on whether it is characterized as a distribution whose domain is over individual categorical nodes or over multinomialstyle counts of nodes in each particular category (similar to the distinction between a set of Bernoullidistributed nodes and a single binomialdistributed node). Both forms have very similarlooking probability mass functions (PMF's), which both make reference to multinomialstyle counts of nodes in a category. However, the multinomialstyle PMF has an extra factor, a multinomial coefficient, that is a constant equal to 1 in the categoricalstyle PMF. Confusing the two can easily lead to incorrect results in settings where this extra factor is not constant with respect to the distributions of interest. The factor is frequently constant in the complete conditionals used in Gibbs sampling and the optimal distributions in variational methods.
A categorical distribution is a discrete probability distribution whose sample space is the set of k individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable.
In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are unimportant; they might be {0, 1, ..., k  1} or {1, 2, ..., k} or any other arbitrary set of values. In the following descriptions, we use {1, 2, ..., k} for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses {0, 1}. In this case, the probability mass function f is:
where , represents the probability of seeing element i and .
Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket:^{[3]}
where evaluates to 1 if , 0 otherwise. There are various advantages of this formulation, e.g.:
Yet another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter n of the multinomial distribution (the number of sampled items) is fixed at 1. In this formulation, the sample space can be considered to be the set of 1ofK encoded^{[4]} random vectors x of dimension k having the property that exactly one element has the value 1 and the others have the value 0. The particular element having the value 1 indicates which category has been chosen. The probability mass function f in this formulation is:
where represents the probability of seeing element i and . This is the formulation adopted by Bishop.^{[4]}^{[nb 1]}
In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution (and also the multinomial distribution). This means that in a model consisting of a data point having a categorical distribution with unknown parameter vector p, and (in standard Bayesian style) we choose to treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we then can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.
Formally, this can be expressed as follows. Given a model
then the following holds:^{[2]}
This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector ? as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
Further intuition comes from the expected value of the posterior distribution (see the article on the Dirichlet distribution):
This says that the expected probability of seeing a category i among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: if, for example, there are three possible categories, and we saw category 1 in our observed data 40% of the time, we would expect on average to see category 1 40% of the time in the posterior distribution as well.
(Note that this intuition is ignoring the effect of the prior distribution. Furthermore, it's important to keep in mind that the posterior is a distribution over distributions. Remember that the posterior distribution in general tells us what we know about the parameter in question, and in this case the parameter itself is a discrete probability distribution, i.e. the actual categorical distribution that generated our data. For example, if we saw the 3 categories in the ratio 40:5:55 in our observed data, then ignoring the effect of the prior distribution, we would expect the true parameter  i.e. the true, underlying distribution that generated our observed data  to have the average value of (0.40,0.05,0.55), which is indeed what the posterior tells us. However, the true distribution might actually be (0.35,0.07,0.58) or (0.42,0.04,0.54) or various other nearby possibilities. The amount of uncertainty involved here is specified by the variance of the posterior, which is controlled by the total number of observations  the more data we observe, the less our uncertainty about the true parameter.)
(Technically, the prior parameter should actually be seen as representing prior observations of category . Then, the updated posterior parameter represents posterior observations. This reflects the fact that a Dirichlet distribution with has a completely flat shape  essentially, a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the term and simply think of the ? vector as directly representing a set of pseudocounts. Furthermore, doing this avoids the issue of interpreting values less than 1.)
The maximumaposteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e.,^{[2]}
In many practical applications, the only way to guarantee the condition that is to set for all i.
In the above model, the marginal likelihood of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichletmultinomial distribution:^{[2]}
This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.
The posterior predictive distribution of a new observation in the above model is the distribution that a new observation would take given the set of N categorical observations. As shown in the Dirichletmultinomial distribution article, it has a very simple form:^{[2]}
Note the various relationships among this formula and the previous ones:
The reason for the equivalence between posterior predictive probability and the expected value of the posterior distribution of p is evident once we reexamine the above formula. As explained in the posterior predictive distribution article, the formula for the posterior predictive probability has the form of an expected value taken with respect to the posterior distribution:
The crucial line above is the third. The second follows directly from the definition of expected value. The third line is particular to the categorical distribution, and follows from the fact that, in the categorical distribution specifically, the expected value of seeing a particular value i is directly specified by the associated parameter p_{i}. The fourth line is simply a rewriting of the third in a different notation, using the notation farther up for an expectation taken with respect to the posterior distribution of the parameters.
Note also what happens in a scenario in which we observe data points one by one and each time consider their predictive probability before observing the data point and updating the posterior. For any given data point, the probability of that point assuming a given category depends on the number of data points already in that category. If a category has a high frequency of occurrence, then new data points are more likely to join that category  further enriching the same category. This type of scenario is often termed a preferential attachment (or "rich get richer") model. This models many realworld processes, and in such cases the choices made by the first few data points have an outsize influence on the rest of the data points.
In Gibbs sampling, we typically need to draw from conditional distributions in multivariable Bayes networks where each variable is conditioned on all the others. In networks that include categorical variables with Dirichlet priors (e.g. mixture models and models including mixture components), the Dirichlet distributions are often "collapsed out" (marginalized out) of the network, which introduces dependencies among the various categorical nodes dependent on a given prior (specifically, their joint distribution is a Dirichletmultinomial distribution). One of the reasons for doing this is that in such a case, the distribution of one categorical node given the others is exactly the posterior predictive distribution of the remaining nodes.
That is, for a set of nodes , if we denote the node in question as and the remainder as , then
where is the number of nodes having category i among the nodes other than node n.
There are a number of methods, but the most common way to sample from a categorical distribution uses a type of inverse transform sampling:
Assume we are given a distribution expressed as "proportional to" some expression, with unknown normalizing constant. Then, before taking any samples, we prepare some values as follows:
Then, each time it is necessary to sample a value:
If it is necessary to draw many values from the same categorical distribution, the following approach is more efficient. It draws n samples in O(n) time (assuming an O(1) approximation is used to draw values from the binomial distribution^{[6]}).
function draw_categorical(n) // where n is the number of samples to draw from the categorical distribution r = 1 s = 0 for i from 1 to k // where k is the number of categories v = draw from a binomial(n, p[i] / r) distribution // where p[i] is the probability of category i for j from 1 to v z[s++] = i // where z is an array in which the results are stored n = n  v r = r  p[i] shuffle (randomly reorder) the elements in z return z
In machine learning it is typical to parametrize the categorical distribution, via an unconstrained representation in , whose components are given by:
where is any real constant. Given this representation, can be recovered using the softmax function, which can then be sampled using the techniques described above. There is however a more direct sampling method that uses samples from the Gumbel distribution.^{[7]} Let be k independent draws from the standard Gumbel distribution, then
will be a sample from the desired categorical distribution. (If is a sample from the standard uniform distribution, then is a sample from the standard Gumbel distribution.)
Manage research, learning and skills at defaultLogic. Create an account using LinkedIn or facebook to manage and organize your Digital Marketing and Technology knowledge. defaultLogic works like a shopping cart for information  helping you to save, discuss and share.