Top2Vec | Top2Vec learns jointly embedded topic, document and word vectors | Topic Modeling library
kandi X-RAY | Top2Vec Summary
Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:
- Get the number of detected topics.
- Get topics.
- Get topic sizes.
- Get hierarchical topics.
- Search topics by keywords.
- Search documents by topic.
- Search documents by keywords.
- Find similar words.
- Find similar documents.
- Expose the model with [RESTful-Top2Vec].
See the [paper] for more details on how it works.
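For example, a trained model can be queried along these lines (a sketch based on the method names in the project README; the docs list is a placeholder you supply):

from top2vec import Top2Vec

# docs: any sufficiently large list of strings (placeholder).
model = Top2Vec(documents=docs)

num_topics = model.get_num_topics()
topic_sizes, topic_nums = model.get_topic_sizes()
topic_words, word_scores, topic_nums = model.get_topics(num_topics)

# Semantic search over topics, documents and words.
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["medicine"], num_topics=5)
documents, document_scores, document_ids = model.search_documents_by_topic(
    topic_num=0, num_docs=5)
words, word_scores = model.similar_words(
    keywords=["medicine"], keywords_neg=[], num_words=20)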
Top functions reviewed by kandi - BETA
- Deduplicate the topic vectors
- Normalize vectors
- Calculate the topic vectors for each cluster
Top2Vec Key Features
Top2Vec Examples and Code Snippets
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
}
# Load the doc2vec R package and its companions, then build a data frame
# of document ids and (Dutch) text from the bundled parliament dataset.
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text   = be_parliament_2020$text_nl)
!pip install top2vec[sentence_encoders]
!pip install tensorflow==2.5.0
!pip install numpy
!pip install top2vec
!pip install top2vec[sentence_encoders]
from top2vec import Top2Vec
docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
        'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.']
Community Discussions
Trending Discussions on Top2Vec
QUESTION
I am trying to understand how Top2Vec works. I have some questions about the code that I could not find answers to in the paper. In summary, the algorithm:
- embeds word and document vectors in the same semantic space and normalizes them; this space usually has more than 300 dimensions.
- projects them into a 5-dimensional space using UMAP with the cosine metric.
- creates topics as centroids of the clusters found by HDBSCAN, using the Euclidean metric on the projected data.
What troubles me is that they normalize the topic vectors. However, the output of UMAP is not normalized, and normalizing the topic vectors would probably move them out of their clusters. This also seems inconsistent with the paper's description of the topic vectors as the arithmetic mean of all document vectors that belong to the same topic.
This leads to two questions:
- How do they calculate the nearest words to find the keywords of each topic, given that normalization altered the topic vectors?
- After creating the topics as clusters, they deduplicate very similar topics using cosine similarity. This makes sense for normalized topic vectors, but at the same time it extends the inconsistency that normalizing them introduced. Am I missing something here?
...ANSWER
Answered 2022-Feb-16 at 16:13. I got the answers to my questions from the source code. I was going to delete the question, but I will leave the answer anyway.
This is the part I missed and got wrong in my question: topic vectors are the arithmetic mean of all document vectors that belong to the same topic, and they live in the same semantic space as the word and document vectors, not in the 5-dimensional UMAP projection.
That is why it makes sense to normalize them, since all word and document vectors are normalized, and to use the cosine metric when looking for duplicated topics in the original, higher-dimensional semantic space.
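Given that resolution, here is a minimal sketch of the pipeline in Python (illustrative only; parameter values such as n_neighbors and min_cluster_size are assumptions, not Top2Vec's exact settings):

# Minimal sketch of the pipeline discussed above -- not Top2Vec's actual code.
import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

def l2_normalize(vectors):
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Stand-in for jointly embedded, normalized document vectors (300+ dims).
doc_vectors = l2_normalize(np.random.rand(1000, 300))

# Project to 5 dimensions with UMAP under the cosine metric.
projected = umap.UMAP(n_neighbors=15, n_components=5,
                      metric='cosine').fit_transform(doc_vectors)

# Cluster the projected points with HDBSCAN (Euclidean; label -1 is noise).
labels = hdbscan.HDBSCAN(min_cluster_size=15,
                         metric='euclidean').fit_predict(projected)

# Topic vectors: arithmetic mean of the ORIGINAL document vectors in each
# cluster, then re-normalized -- they live in the word/document space, not
# the 5-d projection, so cosine-based deduplication stays consistent.
topic_vectors = l2_normalize(np.vstack([
    doc_vectors[labels == k].mean(axis=0)
    for k in np.unique(labels) if k != -1
]))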
QUESTION
I have an issue when importing Top2Vec (in a Colab notebook). To reproduce it:
...ANSWER
Answered 2021-May-25 at 16:36. I think this might be due to an incompatibility of the installed TensorFlow version when using the sentence encoders.
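A fix along those lines mirrors the install snippet earlier on this page (the exact version pin is taken from that snippet and may need adjusting):

!pip install top2vec[sentence_encoders]
!pip install tensorflow==2.5.0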
QUESTION
docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
        'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.',
        'The healthcare sector should be a key beneficiary of the coronavirus outbreak, on the back of increased demand for healthcare services and drugs.',
        'The technology sector should benefit from increased demand for cloud services and hardware demand as China continues to recover from the coronavirus outbreak.',
        'China consumer discretionary sector is preferred. In our assessment, the sector is likely to outperform the MSCI China Index in the coming 6-12 months.']
model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')
While running the above command I get an error, and the traceback does not make the root cause clear. What could be causing it?
Error:
2021-01-19 05:17:08,541 - top2vec - INFO - Pre-processing documents for training
2021-01-19 05:17:08,562 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-01-19 05:17:13,250 - top2vec - INFO - Creating joint document/word embedding
WARNING:tensorflow:5 out of the last 6 calls to <...> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
2021-01-19 05:17:13,548 - top2vec - INFO - Creating lower dimension embedding of documents
2021-01-19 05:17:15,809 - top2vec - INFO - Finding dense areas of documents
2021-01-19 05:17:15,823 - top2vec - INFO - Finding topics

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

2 frames
<__array_function__ internals> in vstack(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in vstack(tup)
    281     if not isinstance(arrs, list):
    282         arrs = [arrs]
--> 283     return _nx.concatenate(arrs, 0)
    284
    285

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate
...ANSWER
Answered 2021-Apr-20 at 04:07. You need to use more docs and unique words for it to find at least 2 topics. As an example, I just multiplied your list by 10 and it works:
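With the docs list from the question, that is (a reconstruction of the suggested one-liner):

model = Top2Vec(docs * 10, embedding_model='universal-sentence-encoder')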
QUESTION
When training the Top2Vec model in Python 3.9.2 I get the following error:
...ANSWER
Answered 2021-Mar-31 at 18:13. I'm unfamiliar with the Top2Vec class you're using. However, that error would be expected if code that was written to use certain properties/methods in gensim-3.8.3 hasn't been adapted for the recently-released gensim-4.0.0, which has removed and renamed some functions for consistency.
Specifically, the vectors_docs property has been removed. (Also the vectors_docs_norms property, mentioned a couple of lines above in an unexecuted branch.)
The small changes required in the calling code are covered in the "Migrating from Gensim 3.x to 4" wiki page, which I've just updated to ensure it mentions vectors_docs specifically.
If you don't feel comfortable applying this and any other changes to your Top2Vec code yourself, you may just want to report the issue to its author/maintainer and, as a temporary workaround, explicitly install the older Gensim for now. With the usual pip-based install, you could specify an older version with:
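For example, to pin the last Gensim 3.x release named above (the original command was cut off; this is the presumable form):

pip install gensim==3.8.3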
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install Top2Vec
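Top2Vec can be installed from PyPI, optionally with the sentence-encoder extras used in the snippets above:

pip install top2vec
pip install top2vec[sentence_encoders]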