dirichlet-process | Nonparametric Bayes, Infinite Mixture Models | Data Visualization library
kandi X-RAY | dirichlet-process Summary
Imagine you're a budding chef. A data-curious one, of course, so you start by taking a set of foods (pizza, salad, spaghetti, etc.) and asking 10 friends how much of each they ate in the past day. Your goal: to find natural groups of foodies, so that you can better cater to each cluster's tastes. For example, your frat-boy friends might love wings and beer, your anime friends might love soba and sushi, your hipster friends probably dig tofu, and so on. So how can you use the data you've gathered to discover different kinds of groups?

One way is to use a standard clustering algorithm like k-means or Gaussian mixture modeling (see this previous post for a brief introduction). The problem is that both assume a fixed number of clusters, which they must be told to find. There are a couple of methods for selecting the number of clusters to learn (e.g., the gap and prediction strength statistics), but the problem is more fundamental: most real-world data simply doesn't have a fixed number of clusters. That is, suppose we've asked 10 of our friends what they ate in the past day, and we want to find groups of eating preferences. There's really an infinite number of foodie types (carnivore, vegan, snacker, Italian, healthy, fast food, heavy eaters, light eaters, and so on), but with only 10 friends, we simply don't have enough data to detect them all. (Indeed, we're limited to 10 clusters!)

So whereas k-means starts with the incorrect assumption that our points come from a fixed, finite number of clusters, no matter how much data we feed it, what we'd really like is a method positing an infinite number of hidden clusters that naturally arise as we ask more friends about their food habits. (For example, with only 2 data points, we might not be able to tell the difference between vegans and vegetarians, but with 200 data points, we probably could.) Luckily for us, this is precisely the purview of nonparametric Bayes.*

*Nonparametric Bayes refers to a class of techniques that allow some parameters to change with the data. In our case, for example, instead of fixing the number of clusters to be discovered, we allow it to grow as more data comes in.
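To make the "clusters grow with the data" idea concrete, here is a small, self-contained simulation (an illustrative sketch, not part of this library) of the Chinese restaurant process, which is the clustering behavior a Dirichlet process induces: each new customer sits at an existing table with probability proportional to that table's size, or starts a new table with probability proportional to a concentration parameter alpha.

import random

def crp(n, alpha, seed=0):
    """Return table sizes for n customers under a Chinese restaurant
    process with concentration parameter alpha."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for i in range(n):
        # start a brand-new table (cluster) with probability alpha / (i + alpha)
        if rng.random() < alpha / (i + alpha):
            tables.append(1)
        else:
            # otherwise join an existing table with probability proportional to its size
            r = rng.random() * i
            acc = 0
            for k, size in enumerate(tables):
                acc += size
                if r < acc:
                    tables[k] += 1
                    break
    return tables

for n in (10, 100, 1000):
    print(n, 'customers ->', len(crp(n, alpha=1.0)), 'clusters')

With alpha = 1, the expected number of occupied tables grows roughly like log n: more data, more discoverable groups, which is exactly the behavior a fixed-k method like k-means cannot provide.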
Community Discussions
Trending Discussions on dirichlet-process
QUESTION
I just finished the Bayesian Analysis in Python book by Osvaldo Martin (a great book for understanding Bayesian concepts and some fancy NumPy indexing).
I really want to extend my understanding to Bayesian mixture models for unsupervised clustering of samples. All of my Google searches have led me to Austin Rochford's tutorial, which is really informative. I understand what is happening, but I am unclear on how this can be adapted to clustering (especially using multiple attributes for the cluster assignments, but that is a different topic).
I understand how to assign the priors for the Dirichlet distribution, but I can't figure out how to get the clusters in PyMC3. It looks like the majority of the mus converge to the centroids (i.e., the means of the distributions I sampled from), but they are still separate components. I thought about making a cutoff for the weights (w in the model), but that doesn't seem to work the way I imagined, since multiple components have slightly different mean parameters mus that are converging.

How can I extract the clusters (centroids) from this PyMC3 model? I gave it a maximum of 15 components that I want to converge to 3. The mus seem to be at the right location, but the weights are messed up because they are being distributed between the other clusters, so I can't use a weight threshold (unless I merge them, but I don't think that's the way it is normally done).
ANSWER
Answered 2017-Jan-31 at 04:15

Using a couple of new-ish additions to pymc3 will help make this clear. I think I updated the Dirichlet process example after they were added, but it seems to have been reverted to the old version during a documentation cleanup; I will fix that soon.
One of the difficulties is that the data you have generated is much more dispersed than the priors on the component means can accommodate; if you standardize your data, the samples should mix much more quickly.
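For instance (a generic sketch with made-up data, not the answer's exact code), standardizing a one-dimensional NumPy array:

import numpy as np

x = np.random.randn(1000) * 50 + 100   # dispersed raw data (illustrative)
x_std = (x - x.mean()) / x.std()       # zero mean, unit standard deviation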
The second is that pymc3 now supports mixture distributions where the indicator variable component has been marginalized out. These marginal mixture distributions will help accelerate mixing and allow you to use NUTS (initialized with ADVI).
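To see what "marginalized out" means numerically, here is a tiny NumPy/SciPy illustration with made-up weights and component parameters (not code from the answer): the mixture likelihood becomes a weighted sum over components, so no discrete cluster label appears in it.

import numpy as np
from scipy import stats

# p(x) = sum_k w_k * Normal(x | mu_k, sd_k): the indicator is summed away.
w = np.array([0.5, 0.3, 0.2])
mu = np.array([-3.0, 0.0, 3.0])
sd = np.array([1.0, 0.5, 1.0])
x = 0.7
print(np.log(np.sum(w * stats.norm.pdf(x, mu, sd))))

Because the likelihood is now a smooth function of continuous parameters only, gradient-based samplers such as NUTS can be applied.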
Finally, with these truncated versions of infinite models, it is often useful to increase the number of potential components when you run into computational problems. I have found that K = 30 works better for this model than K = 15.
The following code implements these changes and shows how the "active" component means can be extracted.
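The answer's original snippet is not preserved on this page. The following is a reconstruction in the spirit of Austin Rochford's truncated Dirichlet-process mixture example for PyMC3; the synthetic data, variable names, and the 0.01 weight threshold are illustrative choices rather than the answer's exact code.

import numpy as np
import pymc3 as pm
from theano import tensor as tt

# Synthetic data from three well-separated normals, then standardized
# (this addresses the dispersion issue mentioned above).
np.random.seed(0)
x = np.concatenate([np.random.normal(m, 1.0, 100) for m in (-30.0, 0.0, 30.0)])
x_std = (x - x.mean()) / x.std()

K = 30  # truncation level, deliberately larger than the expected number of clusters

def stick_breaking(beta):
    # w_k = beta_k * prod_{j<k} (1 - beta_j)
    remaining = tt.concatenate([[1.0], tt.extra_ops.cumprod(1.0 - beta)[:-1]])
    return beta * remaining

with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1.0, 1.0)
    beta = pm.Beta('beta', 1.0, alpha, shape=K)
    w = pm.Deterministic('w', stick_breaking(beta))
    mu = pm.Normal('mu', 0.0, 10.0, shape=K)
    tau = pm.Gamma('tau', 1.0, 1.0, shape=K)
    # Marginal mixture likelihood: the component indicator is summed out.
    obs = pm.NormalMixture('obs', w, mu, tau=tau, observed=x_std)
    trace = pm.sample(1000, tune=1000, init='advi')

# "Active" components: those whose posterior mean weight is non-negligible.
w_post = trace['w'].mean(axis=0)
active = w_post > 0.01  # illustrative threshold
print('active weights:', w_post[active])
print('active means  :', trace['mu'].mean(axis=0)[active])

With K = 30 and three true clusters, most of the posterior weight should concentrate on roughly three components, while the remaining components' weights shrink toward zero; that is how the "extra" capacity of the truncated model stays inert and why a weight threshold becomes workable.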
Community Discussions and Code Snippets contain sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported