Top2Vec | Top2Vec learns jointly embedded topic, document and word vectors | Topic Modeling library

 by ddangelov | Python Version: 1.0.34 | License: BSD-3-Clause

kandi X-RAY | Top2Vec Summary

Top2Vec is a Python library typically used in Artificial Intelligence, Topic Modeling, and BERT applications. Top2Vec has no reported bugs or vulnerabilities, has a build file available, carries a permissive license, and has medium support. You can install it with 'pip install top2vec' or download it from GitHub or PyPI.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train a Top2Vec model you can:

• Get the number of detected topics.
• Get topics.
• Get topic sizes.
• Get hierarchical topics.
• Search topics by keywords.
• Search documents by topic.
• Search documents by keywords.
• Find similar words.
• Find similar documents.
• Expose the model with [RESTful-Top2Vec].

See the [paper] for more details on how it works.
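The sketch below illustrates that workflow using the capabilities listed above. It assumes a list of raw document strings and default doc2vec embeddings; exact signatures may differ between Top2Vec versions.

from top2vec import Top2Vec

# Placeholder corpus: Top2Vec needs a reasonably large list of raw document strings.
documents = ["first document ...", "second document ...", "..."]

# Train the model (jointly embeds topic, document and word vectors).
model = Top2Vec(documents)

# Inspect the detected topics.
num_topics = model.get_num_topics()
topic_sizes, topic_nums = model.get_topic_sizes()
topic_words, word_scores, topic_nums = model.get_topics()

# Semantic search over topics, documents and words.
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["economy"], num_topics=3)
documents_found, document_scores, document_ids = model.search_documents_by_topic(topic_num=0, num_docs=5)
words, word_scores = model.similar_words(keywords=["economy"], keywords_neg=[], num_words=10)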

             Support

              Top2Vec has a medium active ecosystem.
              It has 2558 stars and 345 forks. There are 37 watchers for this library.
              There was 1 major release in the last 12 months.
              There are 37 open issues and 269 closed issues. On average, issues are closed in 142 days. There are 13 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of Top2Vec is 1.0.34.

             Quality

              Top2Vec has 0 bugs and 0 code smells.

             Security

              Top2Vec has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              Top2Vec code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

             License

              Top2Vec is licensed under the BSD-3-Clause License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

             Reuse

              Top2Vec releases are available to install and integrate.
              Deployable package is available in PyPI.
              Build file is available. You can build the component from source.
              Installation instructions, examples and code snippets are available.
              Top2Vec saves you 596 person hours of effort in developing the same functionality from scratch.
              It has 1502 lines of code, 93 functions and 5 files.
              It has high code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

             kandi has reviewed Top2Vec and discovered the functions below as its top functions. This is intended to give you an instant insight into the functionality Top2Vec implements, and to help you decide if it suits your requirements.
            • Duplicate the topic vectors
            • Normalize vectors
            • Calculate the topic vectors for each cluster
            Get all kandi verified functions for this library.

            Top2Vec Key Features

            No Key Features are available at this moment for Top2Vec.

            Top2Vec Examples and Code Snippets

            Publications
             Python · 67 lines of code · License: Permissive (Apache-2.0)
             @inproceedings{reimers-2019-sentence-bert,
                 title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
                 author = "Reimers, Nils and Gurevych, Iryna",
                 booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
                 year = "2019",
             }
             doc2vec: Example on top2vec
             R · 22 lines of code · License: Non-SPDX (NOASSERTION)
            library(doc2vec)
            library(word2vec)
            library(uwot)
            library(dbscan)
            data(be_parliament_2020, package = "doc2vec")
            x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                                  text   = be_parliament_2020$text_nl,
                                  stringsAsFactors = FALSE)
            !pip install top2vec[sentence_encoders]
            
            !pip install tensorflow==2.5.0
            !pip install numpy
            
            !pip install top2vec
            !pip install top2vec[sentence_encoders]
            
            ValueError: need at least one array to concatenate in Top2Vec Error
             Python · 19 lines of code · License: Strong Copyleft (CC BY-SA 4.0)
            from top2vec import Top2Vec
            
             docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
                     'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.']
            Top2Vec error - 'KeyedVectors' object has no attribute 'vectors_docs'
             Python · 2 lines of code · License: Strong Copyleft (CC BY-SA 4.0)
            pip install gensim==3.8.3
            

            Community Discussions

            QUESTION

            Normalizing Topic Vectors in Top2vec
            Asked 2022-Feb-16 at 16:13

             I am trying to understand how Top2Vec works. I have some questions about the code that I could not find an answer for in the paper. A summary of what the algorithm does is that it:

             • embeds words and documents in the same semantic space and normalizes them. These embeddings usually have more than 300 dimensions.
             • projects them into a 5-dimensional space using UMAP with the cosine metric.
             • creates topics as centroids of clusters found by HDBSCAN with the Euclidean metric on the projected data.

             What troubles me is that they normalize the topic vectors. However, the output from UMAP is not normalized, and normalizing the topic vectors will probably move them out of their clusters. This is inconsistent with what they described in their paper, where the topic vectors are the arithmetic mean of all document vectors that belong to the same topic.
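             The sketch below reproduces the pipeline the question describes, using the third-party umap-learn, hdbscan and scikit-learn packages; the synthetic data and all parameter values are illustrative assumptions, not Top2Vec's actual code.

             import numpy as np
             import umap
             import hdbscan
             from sklearn.datasets import make_blobs

             # Stand-in for jointly embedded, L2-normalized document vectors (300+ dimensions).
             doc_vectors, _ = make_blobs(n_samples=1000, n_features=300, centers=5, random_state=42)
             doc_vectors /= np.linalg.norm(doc_vectors, axis=1, keepdims=True)

             # Project to 5 dimensions with UMAP using the cosine metric.
             embedding_5d = umap.UMAP(n_neighbors=15, n_components=5, metric='cosine').fit_transform(doc_vectors)

             # Find dense clusters in the projected space with HDBSCAN (Euclidean metric); -1 marks noise.
             labels = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean').fit_predict(embedding_5d)

             # Topic vectors: arithmetic mean of the ORIGINAL document vectors in each cluster.
             topic_vectors = np.vstack([doc_vectors[labels == c].mean(axis=0)
                                        for c in sorted(set(labels)) if c != -1])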

             This leads to two questions:

             How are they going to calculate the nearest words to find the keywords of each topic, given that they altered the topic vectors by normalization?

             After creating the topics as clusters, they try to deduplicate very similar topics. To do so, they use cosine similarity. This makes sense with normalized topic vectors. At the same time, it extends the inconsistency that normalizing the topic vectors introduced. Am I missing something here?

            ...

            ANSWER

            Answered 2022-Feb-16 at 16:13

             I got the answer to my questions from the source code. I was going to delete the question, but I will leave the answer anyway.

             This is the part I missed and got wrong in my question: topic vectors are the arithmetic mean of all document vectors that belong to the same topic, and they live in the same semantic space as the word and document vectors.

             That is why it makes sense to normalize them, since all word and document vectors are normalized, and to use the cosine metric when looking for duplicated topics in the original, higher-dimensional semantic space.
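             A small numeric illustration of that point (a sketch, not Top2Vec's code): the topic vector is the mean of the original, normalized document vectors, it is re-normalized, and cosine similarity, which reduces to a dot product for unit vectors, is then a consistent way to compare topics.

             import numpy as np

             def l2_normalize(vectors):
                 # Scale each row to unit length so dot products equal cosine similarities.
                 return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

             # Toy document vectors in the original (not UMAP-projected) space, normalized as in the answer.
             doc_vectors = l2_normalize(np.array([[0.9, 0.1, 0.0],
                                                  [0.8, 0.2, 0.1],
                                                  [0.7, 0.3, 0.0]]))

             # Topic vector = arithmetic mean of the member document vectors, then re-normalized.
             topic_vector = l2_normalize(doc_vectors.mean(axis=0, keepdims=True))

             # Cosine similarity to each document (and, likewise, between two topic vectors
             # when checking for near-duplicate topics) is just a dot product here.
             print(doc_vectors @ topic_vector.T)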

            Source https://stackoverflow.com/questions/71143240

            QUESTION

            Google Colaboratory NotFoundError: /usr/local/lib/python3.7/dist-packages/tensorflow_text/python/metrics/_text_similarity_metric_ops.so
            Asked 2021-May-25 at 16:36

             I have an issue when importing Top2Vec in a Colab notebook. To reproduce it:

            ...

            ANSWER

            Answered 2021-May-25 at 16:36

             I think this might be due to a TensorFlow incompatibility when using
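             The answer is truncated here. The pinned installs in the snippet listing earlier on this page appear to correspond to this workaround, although that pairing is an assumption:

             !pip install tensorflow==2.5.0
             !pip install top2vec[sentence_encoders]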

            Source https://stackoverflow.com/questions/67648814

            QUESTION

            ValueError: need at least one array to concatenate in Top2Vec Error
            Asked 2021-Apr-20 at 04:07

             docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.',
                     'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.',
                     'The healthcare sector should be a key beneficiary of the coronavirus outbreak, on the back of increased demand for healthcare services and drugs.',
                     'The technology sector should benefit from increased demand for cloud services and hardware demand as China continues to recover from the coronavirus outbreak.',
                     'China consumer discretionary sector is preferred. In our assessment, the sector is likely to outperform the MSCI China Index in the coming 6-12 months.']

            model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

             While running the above command I get an error, and it is not clear from the output what the root cause is. What could be causing the error?

            Error:

             2021-01-19 05:17:08,541 - top2vec - INFO - Pre-processing documents for training
             INFO:top2vec:Pre-processing documents for training
             2021-01-19 05:17:08,562 - top2vec - INFO - Downloading universal-sentence-encoder model
             INFO:top2vec:Downloading universal-sentence-encoder model
             2021-01-19 05:17:13,250 - top2vec - INFO - Creating joint document/word embedding
             INFO:top2vec:Creating joint document/word embedding
             WARNING:tensorflow:5 out of the last 6 calls to triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
             WARNING:tensorflow:5 out of the last 6 calls to triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
             2021-01-19 05:17:13,548 - top2vec - INFO - Creating lower dimension embedding of documents
             INFO:top2vec:Creating lower dimension embedding of documents
             2021-01-19 05:17:15,809 - top2vec - INFO - Finding dense areas of documents
             INFO:top2vec:Finding dense areas of documents
             2021-01-19 05:17:15,823 - top2vec - INFO - Finding topics
             INFO:top2vec:Finding topics

             ValueError                                Traceback (most recent call last)
             <ipython-input> in <module>()
             ----> 1 model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

             2 frames
             <__array_function__ internals> in vstack(*args, **kwargs)

             /usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in vstack(tup)
                 281     if not isinstance(arrs, list):
                 282         arrs = [arrs]
             --> 283     return _nx.concatenate(arrs, 0)
                 284
                 285

             <__array_function__ internals> in concatenate(*args, **kwargs)

             ValueError: need at least one array to concatenate

            ...

            ANSWER

            Answered 2021-Apr-20 at 04:07

             You need to use more documents and unique words for it to find at least 2 topics. As an example, I just multiplied your list by 10 and it worked:
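             A minimal sketch of that workaround; the answer's original snippet is truncated here, so the exact code is an assumption grounded only in the answer's description:

             from top2vec import Top2Vec

             # Repeating the five example documents gives HDBSCAN enough points to form
             # at least two dense clusters, so topic creation no longer fails.
             model = Top2Vec(docs * 10, embedding_model='universal-sentence-encoder')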

            Source https://stackoverflow.com/questions/65785949

            QUESTION

            Top2Vec error - 'KeyedVectors' object has no attribute 'vectors_docs'
            Asked 2021-Mar-31 at 18:13

            When training the Top2Vec model in Python 3.9.2 I get the following error:

            ...

            ANSWER

            Answered 2021-Mar-31 at 18:13

            I'm unfamiliar with the Top2Vec class you're using.

            However, that error would be expected if code that was written to use certain properties/methods in gensim-3.8.3 hasn't been adapted for the recently-released gensim-4.0.0, which has removed and renamed some functions for consistency.

             Specifically, the vectors_docs property has been removed. (So has the vectors_docs_norms property mentioned a couple of lines above it, in a branch that is not executed.)

            The small changes required in the calling code are covered in the Migrating from Gensim 3.x to 4 wiki page, which I've just updated to ensure it mentions vectors_docs specifically.

             If you don't feel comfortable applying this and any other changes to your Top2Vec code yourself, you may just want to report the issue to its author/maintainer and, as a temporary workaround, explicitly install the older Gensim for now. With the usual pip-based install, you could specify an older version with:
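             The answer's snippet is cut off here; the pinned install shown in the snippet listing above is presumably the command it refers to:

             pip install gensim==3.8.3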

            Source https://stackoverflow.com/questions/66891439

             Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install Top2Vec

             The easy way to install Top2Vec is:

             pip install top2vec

            Support

             For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            Find more information at:

            Install
          • PyPI

            pip install top2vec

          • CLONE
          • HTTPS

            https://github.com/ddangelov/Top2Vec.git

          • CLI

            gh repo clone ddangelov/Top2Vec

          • sshUrl

            git@github.com:ddangelov/Top2Vec.git


            Consider Popular Topic Modeling Libraries

            gensim

            by RaRe-Technologies

            Familia

            by baidu

            BERTopic

            by MaartenGr

            Top2Vec

            by ddangelov

            lda

            by lda-project

            Try Top Libraries by ddangelov

            RESTful-Top2Vec

             by ddangelov | Python