Top2Vec | Top2Vec learns jointly embedded topic, document and word vectors | Topic Modeling library
kandi X-RAY | Top2Vec Summary
Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:
- Get the number of detected topics.
- Get topics.
- Get topic sizes.
- Get hierarchical topics.
- Search topics by keywords.
- Search documents by topic.
- Search documents by keywords.
- Find similar words.
- Find similar documents.
- Expose the model with [RESTful-Top2Vec].
See the [paper] for more details on how it works.
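For example, a trained model can be queried along these lines (a sketch based on the method names in the project README; the docs list is a placeholder you supply):

from top2vec import Top2Vec

# docs: any sufficiently large list of strings (placeholder).
model = Top2Vec(documents=docs)

num_topics = model.get_num_topics()
topic_sizes, topic_nums = model.get_topic_sizes()
topic_words, word_scores, topic_nums = model.get_topics(num_topics)

# Semantic search over topics, documents and words.
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["medicine"], num_topics=5)
documents, document_scores, document_ids = model.search_documents_by_topic(
    topic_num=0, num_docs=5)
words, word_scores = model.similar_words(
    keywords=["medicine"], keywords_neg=[], num_words=20)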
Top functions reviewed by kandi - BETA
- Deduplicate the topic vectors
- Normalize vectors
- Calculate the topic vectors for each cluster
Top2Vec Key Features
Top2Vec Examples and Code Snippets
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
}
# Load the doc2vec R package and its companions, then build a data frame
# of document ids and (Dutch) text from the bundled parliament dataset.
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text   = be_parliament_2020$text_nl)
!pip install top2vec[sentence_encoders]
!pip install tensorflow==2.5.0
!pip install numpy
!pip install top2vec
!pip install top2vec[sentence_encoders]
from top2vec import Top2Vec
docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
        'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.']
Community Discussions
Trending Discussions on Top2Vec
QUESTION
I am trying to understand how Top2Vec works. I have some questions about the code that I could not find answers to in the paper. In summary, the algorithm:
- embeds word and document vectors in the same semantic space and normalizes them; this space usually has more than 300 dimensions.
- projects them into a 5-dimensional space using UMAP with the cosine metric.
- creates topics as centroids of the clusters found by HDBSCAN, using the Euclidean metric on the projected data.
What troubles me is that they normalize the topic vectors. However, the output of UMAP is not normalized, and normalizing the topic vectors would probably move them out of their clusters. This also seems inconsistent with the paper's description of the topic vectors as the arithmetic mean of all document vectors that belong to the same topic.
This leads to two questions:
- How do they calculate the nearest words to find the keywords of each topic, given that normalization altered the topic vectors?
- After creating the topics as clusters, they deduplicate very similar topics using cosine similarity. This makes sense for normalized topic vectors, but at the same time it extends the inconsistency that normalizing them introduced. Am I missing something here?
...ANSWER
Answered 2022-Feb-16 at 16:13. I got the answers to my questions from the source code. I was going to delete the question, but I will leave the answer anyway.
This is the part I missed and got wrong in my question: topic vectors are the arithmetic mean of all document vectors that belong to the same topic, and they live in the same semantic space as the word and document vectors, not in the 5-dimensional UMAP projection.
That is why it makes sense to normalize them, since all word and document vectors are normalized, and to use the cosine metric when looking for duplicated topics in the original, higher-dimensional semantic space.
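Given that resolution, here is a minimal sketch of the pipeline in Python (illustrative only; parameter values such as n_neighbors and min_cluster_size are assumptions, not Top2Vec's exact settings):

# Minimal sketch of the pipeline discussed above -- not Top2Vec's actual code.
import numpy as np
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

def l2_normalize(vectors):
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Stand-in for jointly embedded, normalized document vectors (300+ dims).
doc_vectors = l2_normalize(np.random.rand(1000, 300))

# Project to 5 dimensions with UMAP under the cosine metric.
projected = umap.UMAP(n_neighbors=15, n_components=5,
                      metric='cosine').fit_transform(doc_vectors)

# Cluster the projected points with HDBSCAN (Euclidean; label -1 is noise).
labels = hdbscan.HDBSCAN(min_cluster_size=15,
                         metric='euclidean').fit_predict(projected)

# Topic vectors: arithmetic mean of the ORIGINAL document vectors in each
# cluster, then re-normalized -- they live in the word/document space, not
# the 5-d projection, so cosine-based deduplication stays consistent.
topic_vectors = l2_normalize(np.vstack([
    doc_vectors[labels == k].mean(axis=0)
    for k in np.unique(labels) if k != -1
]))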
QUESTION
I have an issue when importing Top2Vec (in a Colab notebook). To reproduce it:
...ANSWER
Answered 2021-May-25 at 16:36. I think this might be due to an incompatibility of the installed TensorFlow version when using the sentence encoders.
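A fix along those lines mirrors the install snippet earlier on this page (the exact version pin is taken from that snippet and may need adjusting):

!pip install top2vec[sentence_encoders]
!pip install tensorflow==2.5.0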
QUESTION
docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
        'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.',
        'The healthcare sector should be a key beneficiary of the coronavirus outbreak, on the back of increased demand for healthcare services and drugs.',
        'The technology sector should benefit from increased demand for cloud services and hardware demand as China continues to recover from the coronavirus outbreak.',
        'China consumer discretionary sector is preferred. In our assessment, the sector is likely to outperform the MSCI China Index in the coming 6-12 months.']
model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')
While running the above command I get an error, and the traceback does not make the root cause clear. What could be causing it?
Error:
2021-01-19 05:17:08,541 - top2vec - INFO - Pre-processing documents for training
2021-01-19 05:17:08,562 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-01-19 05:17:13,250 - top2vec - INFO - Creating joint document/word embedding
WARNING:tensorflow:5 out of the last 6 calls to <...> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
2021-01-19 05:17:13,548 - top2vec - INFO - Creating lower dimension embedding of documents
2021-01-19 05:17:15,809 - top2vec - INFO - Finding dense areas of documents
2021-01-19 05:17:15,823 - top2vec - INFO - Finding topics

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

2 frames
<__array_function__ internals> in vstack(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in vstack(tup)
    281     if not isinstance(arrs, list):
    282         arrs = [arrs]
--> 283     return _nx.concatenate(arrs, 0)
    284
    285

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate
...ANSWER
Answered 2021-Apr-20 at 04:07. You need to use more docs and unique words for it to find at least 2 topics. As an example, I just multiplied your list by 10 and it works:
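With the docs list from the question, that is (a reconstruction of the suggested one-liner):

model = Top2Vec(docs * 10, embedding_model='universal-sentence-encoder')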
QUESTION
When training the Top2Vec model in Python 3.9.2 I get the following error:
...ANSWER
Answered 2021-Mar-31 at 18:13. I'm unfamiliar with the Top2Vec class you're using. However, that error would be expected if code that was written to use certain properties/methods in gensim-3.8.3 hasn't been adapted for the recently-released gensim-4.0.0, which has removed and renamed some functions for consistency.
Specifically, the vectors_docs property has been removed. (Also the vectors_docs_norms property, mentioned a couple of lines above in an unexecuted branch.)
The small changes required in the calling code are covered in the "Migrating from Gensim 3.x to 4" wiki page, which I've just updated to ensure it mentions vectors_docs specifically.
If you don't feel comfortable applying this and any other changes to your Top2Vec code yourself, you may just want to report the issue to its author/maintainer and, as a temporary workaround, explicitly install the older Gensim for now. With the usual pip-based install, you could specify an older version with:
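For example, to pin the last Gensim 3.x release named above (the original command was cut off; this is the presumable form):

pip install gensim==3.8.3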
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install Top2Vec
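Top2Vec can be installed from PyPI, optionally with the sentence-encoder extras used in the snippets above:

pip install top2vec
pip install top2vec[sentence_encoders]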