warplda | Cache-efficient implementation of Latent Dirichlet Allocation | Search Engine library
kandi X-RAY | warplda Summary
WarpLDA is a cache efficient implementation of Latent Dirichlet Allocation, which samples each token in O(1).
Community Discussions
Trending Discussions on warplda
QUESTION
I was wondering how the results of different packages, hence algorithms, differ, and whether parameters could be set in a way that produces similar topics. I had a look at the packages text2vec and topicmodels in particular.
I used the code below to compare 10 topics (see code section for terms) generated with these packages. I could not manage to generate sets of topics with similar meaning. E.g., topic 10 from text2vec has something to do with "police", while none of the topics produced by topicmodels refers to "police" or similar terms. Further, I could not identify a counterpart of topic 5 produced by topicmodels, which has something to do with "life-love-family-war", in the topics produced by text2vec.
I am a beginner with LDA, hence my understanding may sound naive to experienced programmers. However, intuitively, one would assume that it should be possible to produce sets of topics with similar meaning to prove the validity/robustness of the results. Of course, not necessarily the exact same set of terms, but term lists addressing similar topics.
Maybe the issue is simply that my human interpretation of these term lists is not good enough to capture similarities, but maybe there are parameters that might increase the similarity for human interpretation. Can someone guide me on how to set parameters to achieve this, or otherwise provide explanations or hints on suitable resources to improve my understanding of the matter?
Here are some issues that might be relevant:
- I know that text2vec does not use standard Gibbs sampling but WarpLDA, which already is a difference in the algorithm compared to topicmodels. If my understanding is correct, the priors alpha and delta used in topicmodels are set as doc_topic_prior and topic_word_prior in text2vec, respectively (see the sketch after this list).
- Furthermore, in post-processing, text2vec allows adapting lambda for sorting the terms of topics based on their frequency. I have not yet understood how terms are sorted in topicmodels. Is it comparable to setting lambda = 1? (I have tried different lambdas between 0 and 1 without getting similar topics.)
- Another issue is that it seems difficult to produce a fully reproducible example even when setting a seed (see, e.g., this question). This is not directly my question but might make it more difficult to respond.
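For illustration, a minimal sketch of how these priors can be matched across the two packages; the toy preparation of the movie_review data and the prior values 0.1/0.01 are assumptions chosen for the example, not the settings from my actual runs:

    library(text2vec)
    library(topicmodels)

    # Toy preparation of the movie_review data shipped with text2vec
    data("movie_review", package = "text2vec")
    it  <- itoken(movie_review$review, tolower, word_tokenizer,
                  ids = movie_review$id)
    voc <- prune_vocabulary(create_vocabulary(it), term_count_min = 5)
    dtm <- create_dtm(it, vocab_vectorizer(voc))

    # text2vec / WarpLDA: the priors are doc_topic_prior and topic_word_prior
    lda_t2v <- LDA$new(n_topics = 10,
                       doc_topic_prior = 0.1, topic_word_prior = 0.01)
    doc_topics <- lda_t2v$fit_transform(dtm, n_iter = 500)

    # topicmodels / collapsed Gibbs: the same priors are called alpha and delta
    lda_tm <- topicmodels::LDA(slam::as.simple_triplet_matrix(as.matrix(dtm)),
                               k = 10, method = "Gibbs",
                               control = list(alpha = 0.1, delta = 0.01,
                                              iter = 500, seed = 42))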
Sorry for the lengthy question and thanks in advance for any help or suggestions.
Update 2: I have moved the content of my first update into an answer that is based on a more complete analysis.
Update: Following the helpful comment of text2vec package creator Dmitriy Selivanov, I can confirm that setting lambda = 1 increases the similarity of topics between the term lists produced by the two packages.
Furthermore, I had a closer look at the differences between the term lists produced by both packages via a quick check of length(setdiff()) and length(intersect()) across topics (see the sketch below). This rough check shows that text2vec discards several terms per topic, probably by a probability threshold for the individual topics, whereas topicmodels keeps all terms for all topics. This explains part of the differences in meaning that can be derived (by a human) from the term lists.
As mentioned above, generating a reproducible example seems difficult, so I have not adapted all data examples in the code below. Since the run time is short, anybody can check on his/her own system.
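A rough sketch of this check, reusing the two models from the earlier sketch; note that topic i in one package does not necessarily correspond to topic i in the other, so the pairing here is naive:

    # Top 30 terms per topic from both models (lambda = 1, as per the update)
    top_t2v <- lda_t2v$get_top_words(n = 30, topic_number = 1:10, lambda = 1)
    top_tm  <- topicmodels::terms(lda_tm, 30)

    # Naive pairwise overlap; topics are not aligned across packages,
    # so treat these counts only as an indication
    for (i in 1:10) {
      cat(sprintf("topic %2d: shared %2d | only text2vec %2d | only topicmodels %2d\n",
                  i,
                  length(intersect(top_t2v[, i], top_tm[, i])),
                  length(setdiff(top_t2v[, i], top_tm[, i])),
                  length(setdiff(top_tm[, i], top_t2v[, i]))))
    }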
ANSWER
Answered 2017-Nov-30 at 10:38
After having updated my question with some comparison results, I was still interested in more detail. Therefore, I have run LDA models on the complete movie_review data set included in text2vec (5000 docs). To produce halfway realistic results, I have also introduced some gentle pre-processing and stopword removal. (Sorry for the long code example below.)
My conclusion is that some of the "good" topics (from a subjective standpoint) produced by the two packages are comparable to a certain extent (especially the last three topics in the example below are not really good and were difficult to compare). However, looking at similar topics between the two packages produced different (subjective) associations for each topic. Hence, standard Gibbs sampling and the WarpLDA algorithm seem to capture similar topical areas for the given data, but with different "moods" expressed in the topics.
I would see the main reason for the differences in the fact that the WarpLDA algorithm seems to discard terms and introduce NA values in the beta matrix (the term-topic distribution); see the example below. Hence, its faster convergence seems to be achieved by sacrificing completeness.
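A quick way to check this, again assuming the two models fitted in the earlier sketch (whether NA entries actually appear depends on the data and on vocabulary pruning):

    # text2vec / WarpLDA: topics x terms matrix; count missing entries
    beta_t2v <- lda_t2v$topic_word_distribution
    sum(is.na(beta_t2v))

    # topicmodels / Gibbs: posterior()$terms is a dense, complete matrix
    beta_tm <- topicmodels::posterior(lda_tm)$terms
    sum(is.na(beta_tm))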
I do not want to judge which topics are subjectively "better" and leave this to your own judgement.
One important limitation of this analysis is that I have not (yet) checked the results for an optimal number of topics; I only used k = 10. Hence, the comparability of the topics might increase for an optimal k; in any case, the quality will improve, and thereby maybe the "mood". (The optimal k might again differ between the algorithms depending on the measure used to find it.)
QUESTION
I have a C++ project using the libnuma library. Because I don't have permission to install libnuma system-wide, I have to install it in a user folder: /home/khangtg/opt. This folder contains two main folders:
- The include folder contains: numacompat1.h, numa.h, numaif.h
- The lib folder contains: libnuma.a, libnuma.la, libnuma.so, libnuma.so.1, libnuma.so.1.0.0
Now, I have a .cpp file that includes the libnuma library:
ANSWER
Answered 2017-Sep-13 at 16:20
You want to set link_directories to include the directory of the libraries; this tells the linker where to look for them. More can be found in the CMake docs.
It should probably look something like this:
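A minimal sketch of such a CMakeLists.txt; the project and target names are placeholders:

    cmake_minimum_required(VERSION 2.8.12)
    project(numa_example CXX)

    # Headers from the user-local install (numa.h, numaif.h, ...)
    include_directories(/home/khangtg/opt/include)
    # Tell the linker where to find libnuma.so / libnuma.a
    link_directories(/home/khangtg/opt/lib)

    add_executable(main main.cpp)
    target_link_libraries(main numa)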
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install warplda
GCC (>=4.8.5)
CMake (>=2.8.12)
git
libnuma (CentOS: yum install libnuma-devel; Ubuntu: apt-get install libnuma-dev)
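With the prerequisites in place, a typical out-of-source CMake build looks like the sketch below; the repository URL is an assumption, and the project's README remains the authoritative source for the exact steps:

    # Assumed upstream repository; check the project page for the exact URL
    git clone https://github.com/thu-ml/warplda.git
    cd warplda
    mkdir -p build && cd build
    cmake ..
    make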