
gensim | Topic Modelling for Humans | Topic Modeling library

by RaRe-Technologies | Python | Version: 4.1.2 | License: LGPL-2.1

kandi X-RAY | gensim Summary

gensim is a Python library typically used in Institutions, Learning, Education, Artificial Intelligence and Topic Modeling applications. gensim has no bugs and no reported vulnerabilities, has a build file available, carries a Weak Copyleft license and has high support. You can download it from GitHub.

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.

Support

  • gensim has a highly active ecosystem.
  • It has 13,112 stars, 4,201 forks and 431 watchers.
  • It had no major release in the last 12 months.
  • There are 346 open issues and 1,381 have been closed. On average, issues are closed in 195 days. There are 30 open pull requests and 0 closed requests.
  • It has a positive sentiment in the developer community.
  • The latest version of gensim is 4.1.2.

Quality

  • gensim has 0 bugs and 0 code smells.

Security

  • gensim has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • gensim code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • gensim is licensed under the LGPL-2.1 License. This license is Weak Copyleft.
  • Weak Copyleft licenses have some restrictions, but you can use them in commercial projects.

Reuse

  • gensim releases are available to install and integrate.
  • Build file is available. You can build the component from source.
  • Installation instructions, examples and code snippets are available.
  • It has 61,066 lines of code, 2,260 functions and 199 files.
  • It has high code complexity, which directly impacts the maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed gensim and identified its top functions, listed below. This gives you an instant insight into the functionality gensim implements and helps you decide whether it suits your requirements. A minimal usage sketch follows the list.

  • Update the model with a given corpus
  • Perform inference on a given document
  • Compute the phinorm
  • Evaluate the model
  • Update the LDA model with a given corpus
  • Evaluate a single step
  • Add metrics to the plot
  • Set the model
  • Fit the LDAPE algorithm
  • Merge two projections
  • Write a corpus to a file
  • Estimate the probability of a boolean sliding window
  • Extract articles and positions from a file
  • Load a model
  • Add new documents to the LsiModel
  • Return a unit vector
  • Update the model with the given corpus
  • Update the LDA
  • Add a model to the model
  • Train the model
  • Evaluate the word analogies in the model
  • Evaluate a list of words
  • Compute the difference between two topics
  • Construct a sparse term similarity matrix
  • Compute the inner product between two matrices
  • Compute the distance between two documents
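
Several of these functions belong to gensim's LdaModel. As a minimal sketch (on a tiny hand-built corpus, purely for illustration), "update the model with a given corpus" looks like this:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    # Tiny toy corpus, purely for illustration.
    docs = [["human", "interface", "computer"],
            ["graph", "trees", "system"],
            ["system", "human", "system", "graph"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

    # Online training: feed additional documents to the already-trained model.
    new_docs = [["trees", "graph", "system"]]
    lda.update([dictionary.doc2bow(doc) for doc in new_docs])

    print(lda.print_topics())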


gensim Key Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core); see the sketch after this list.
  • Intuitive interfaces:
    • easy to plug in your own input corpus/datastream (trivial streaming API)
    • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) and word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive [documentation and Jupyter Notebook tutorials].
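
A minimal sketch of the streaming pattern behind the memory-independence claim: any object whose __iter__ yields one bag-of-words document at a time can serve as a corpus, so the collection never has to fit in RAM. The file name and the one-document-per-line format below are assumptions for illustration.

    from gensim.corpora import Dictionary

    class StreamedCorpus:
        """Streams one document at a time from disk; nothing is held in RAM."""
        def __init__(self, path, dictionary):
            self.path = path
            self.dictionary = dictionary

        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:  # one document per line (assumption)
                    yield self.dictionary.doc2bow(line.lower().split())

    # Hypothetical file; the dictionary is built with the same streaming pattern.
    dictionary = Dictionary(line.lower().split()
                            for line in open("corpus.txt", encoding="utf-8"))
    corpus = StreamedCorpus("corpus.txt", dictionary)
    # `corpus` can now be passed wherever gensim expects a corpus, e.g. LdaModel(corpus=corpus, ...)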

Installation

    pip install --upgrade gensim
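
A quick way to confirm that the installed version is the one you expect:

    import gensim
    print(gensim.__version__)  # e.g. 4.1.2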

Documentation

[QuickStart]: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
[Tutorials]: https://radimrehurek.com/gensim/auto_examples/
[Official Documentation and Walkthrough]: http://radimrehurek.com/gensim/
[Official API Documentation]: http://radimrehurek.com/gensim/apiref.html

Support

Adopters
--------

| Company | Logo | Industry | Use of Gensim |
|---------|------|----------|---------------|
| [RARE Technologies](http://rare-technologies.com) | ![rare](docs/src/readme_images/rare.png) | ML & NLP consulting | Creators of Gensim – this is us! |
| [Amazon](http://www.amazon.com/) | ![amazon](docs/src/readme_images/amazon.png) | Retail | Document similarity. |
| [National Institutes of Health](https://github.com/NIHOPA/pipeline_word2vec) | ![nih](docs/src/readme_images/nih.png) | Health | Processing grants and publications with word2vec. |
| [Cisco Security](http://www.cisco.com/c/en/us/products/security/index.html) | ![cisco](docs/src/readme_images/cisco.png) | Security | Large-scale fraud detection. |
| [Mindseye](http://www.mindseyesolutions.com/) | ![mindseye](docs/src/readme_images/mindseye.png) | Legal | Similarities in legal documents. |
| [Channel 4](http://www.channel4.com/) | ![channel4](docs/src/readme_images/channel4.png) | Media | Recommendation engine. |
| [Talentpair](http://talentpair.com) | ![talent-pair](docs/src/readme_images/talent-pair.png) | HR | Candidate matching in high-touch recruiting. |
| [Juju](http://www.juju.com/) | ![juju](docs/src/readme_images/juju.png) | HR | Provide non-obvious related job suggestions. |
| [Tailwind](https://www.tailwindapp.com/) | ![tailwind](docs/src/readme_images/tailwind.png) | Media | Post interesting and relevant content to Pinterest. |
| [Issuu](https://issuu.com/) | ![issuu](docs/src/readme_images/issuu.png) | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about. |
| [Search Metrics](http://www.searchmetrics.com/) | ![search-metrics](docs/src/readme_images/search-metrics.png) | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation. |
| [12K Research](https://12k.co/) | ![12k](docs/src/readme_images/12k.png) | Media | Document similarity analysis on media articles. |
| [Stillwater Supercomputing](http://www.stillwater-sc.com/) | ![stillwater](docs/src/readme_images/stillwater.png) | Hardware | Document comprehension and association with word2vec. |
| [SiteGround](https://www.siteground.com/) | ![siteground](docs/src/readme_images/siteground.png) | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA. |
| [Capital One](https://www.capitalone.com/) | ![capitalone](docs/src/readme_images/capitalone.png) | Finance | Topic modeling for customer complaints exploration. |

-------

Citing gensim
-------------

When [citing gensim in academic papers and theses], please use this BibTeX entry:

    @inproceedings{rehurek_lrec,
          title = {{Software Framework for Topic Modelling with Large Corpora}},
          author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
          booktitle = {{Proceedings of the LREC 2010 Workshop on New
               Challenges for NLP Frameworks}},
          pages = {45--50},
          year = 2010,
          month = May,
          day = 22,
          publisher = {ELRA},
          address = {Valletta, Malta},
          note={\url{http://is.muni.cz/publication/884893/en}},
          language={English}
    }

  [citing gensim in academic papers and theses]: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9vG_kV0AAAAJ&citation_for_view=9vG_kV0AAAAJ:NaGl4SEjCO4C

  [design goals]: http://radimrehurek.com/gensim/about.html
  [RaRe Technologies]: http://rare-technologies.com/wp-content/uploads/2016/02/rare_image_only.png%20=10x20
  [rare\_tech]: //rare-technologies.com
  [Talentpair]: https://avatars3.githubusercontent.com/u/8418395?v=3&s=100
  [documentation and Jupyter Notebook tutorials]: https://github.com/RaRe-Technologies/gensim/#documentation
  [Vector Space Model]: http://en.wikipedia.org/wiki/Vector_space_model
  [unsupervised document analysis]: http://en.wikipedia.org/wiki/Latent_semantic_indexing
  [NumPy and Scipy]: http://www.scipy.org/Download
  [ATLAS]: http://math-atlas.sourceforge.net/
  [OpenBLAS]: http://xianyi.github.io/OpenBLAS/
  [source tar.gz]: http://pypi.python.org/pypi/gensim
  [documentation]: http://radimrehurek.com/gensim/install.html

How to get average pairwise cosine similarity per group in Pandas

    import numpy as np
    import pandas as pd
    from itertools import combinations
    from sklearn.metrics.pairwise import cosine_similarity

    # model_glove is assumed to be an already-loaded gensim KeyedVectors model.

    # get the average vector if more than 1 word is included in the "text" column
    def document_vector(items):
        # remove out-of-vocabulary words
        doc = [word for word in items.split() if word in model_glove]
        if doc:
            doc_vector = model_glove[doc]
            mean_vec = np.mean(doc_vector, axis=0)
        else:
            mean_vec = None
        return mean_vec

    # get the mean pairwise cosine similarity score per group
    def mean_cos_sim(grp):
        output = []
        for i, j in combinations(grp.tolist(), 2):
            vec_i, vec_j = document_vector(i), document_vector(j)
            if vec_i is not None and vec_j is not None:  # skip texts that are entirely out of vocabulary
                sim = cosine_similarity(vec_i.reshape(1, -1), vec_j.reshape(1, -1))
                output.append(sim)
        return np.mean(output, axis=0)

    df = pd.DataFrame(np.array(
        [['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
         ['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
    df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)

    print(df_grpd)
    > firm
      apple       [[0.53190523]]
      facebook    [[0.83989316]]
      Name: text, dtype: object
                                      
An alternative, more vectorized approach. Note that np.mean over the full similarity matrix includes each text's similarity with itself (the diagonal of ones), which is why these numbers differ from the ones above:

    def mean_embeddings(s):
        """Transform a list of words into their mean embedding (assumes every word is in the vocabulary)."""
        return np.mean([model_glove.get_vector(x) for x in s], axis=0)

    df["embeddings"] = df.text.str.split().apply(mean_embeddings)

    >>> df.embeddings
    0    [-0.2597, -0.153495, -0.5106895, -1.070115, 0....
    1    [0.0600965, 0.39806002, -0.45810497, -1.375365...
    2    [-0.43819, 0.66232, 0.04611, -0.91103, 0.32231...
    3    [0.1912625, 0.0066999793, -0.500785, -0.529915...
    4    [-0.82556, 0.24555385, 0.38557374, -0.78941, 0...
    Name: embeddings, dtype: object

    (
     df.groupby("firm").embeddings  # extract 'embeddings' for each group
     .apply(np.stack)               # turn each group's sequence of arrays into a proper matrix
     .apply(cosine_similarity)      # compute the pairwise similarity matrix
     .apply(np.mean)                # take the mean
    )

    firm
    apple       0.765953
    facebook    0.893262
    Name: embeddings, dtype: float32
                                      

Unpickle instance from Jupyter Notebook in Flask App

    ├── WebApp/
    │  └── app.py
    └── Untitled.ipynb

Pickle records the class's __module__, so the notebook must make the module name 'app' resolvable. Option 1: import the class and alias the real module under that name:

    from WebApp.app import GensimWord2VecVectorizer
    GensimWord2VecVectorizer.__module__ = 'app'

    import sys
    sys.modules['app'] = sys.modules['WebApp.app']

Option 2: create a synthetic 'app' module and attach the class to it:

    GensimWord2VecVectorizer.__module__ = 'app'

    import sys
    app = sys.modules['app'] = type(sys)('app')
    app.GensimWord2VecVectorizer = GensimWord2VecVectorizer
                                      

Word2Vec returning vectors for individual character and not words

Word2Vec expects a sequence of tokenized sentences, i.e. lists of word tokens. If you pass plain strings instead, it iterates over each string character by character, which is why you get vectors for individual characters. With properly tokenized input it returns per-word vectors:

    from gensim.models import Word2Vec

    # Each sentence is a list of tokens, not a raw string.
    words = [['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA'],
             ['unimodal','7','regarding','random','59','intimating'],
             ['COMPETITION','prospects','2K15','gather','Mega'],
             ['SENSOR','NCTT','NETWORKING','orgainsed','acts']]

    vec_model = Word2Vec(words, min_count=1, vector_size=30)  # 'vector_size' was called 'size' before gensim 4
    vec_model.wv['gather']  # gensim 4 looks vectors up via the .wv attribute

    array([ 0.01106581,  0.00968017, -0.00090574,  0.01115612, -0.00766465,
           -0.01648632, -0.01455364,  0.01107104,  0.00769841,  0.01037362,
            0.01551551, -0.01188449,  0.01262331,  0.01608987,  0.01484082,
            0.00528397,  0.01613582,  0.00437328,  0.00372362,  0.00480989,
           -0.00299072, -0.00261444,  0.00282137, -0.01168992, -0.01402746,
           -0.01165612,  0.00088562,  0.01581018, -0.00671618, -0.00698833],
          dtype=float32)
                                      

No such file or directory: 'GoogleNews-vectors-negative300.bin'

A relative path like the one below is resolved against the process's current working directory, so it only works if the Flask app happens to be launched from the right place:

    serv/GoogleNews-vectors-negative300.bin

Using the absolute path removes that dependency (a portable variant is sketched below):

    /Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/GoogleNews-vectors-negative300.bin
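
The portable variant anchors the path to the source file's own directory instead of the working directory; the loader call assumes the usual gensim KeyedVectors API and the file layout shown above:

    import os
    from gensim.models import KeyedVectors

    # Resolve the .bin file relative to this source file, not the current working directory.
    model_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                              "serv", "GoogleNews-vectors-negative300.bin")
    model = KeyedVectors.load_word2vec_format(model_path, binary=True)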
                                      

How to store the Phrase trigrams gensim model after training

Phrases models expose save and load directly (an end-to-end sketch follows):

    trigram_transformer.save(TRIPHRASER_PATH)

    reloaded_trigram_transformer = Phrases.load(TRIPHRASER_PATH)
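
A self-contained sketch under assumed inputs (a toy tokenized corpus and a hypothetical TRIPHRASER_PATH), showing where the save and load calls fit:

    from gensim.models.phrases import Phrases

    sentences = [["machine", "learning", "is", "fun"],
                 ["machine", "learning", "rocks"]]  # toy tokenized corpus

    # Stack two Phrases passes to get trigrams out of bigrams.
    bigram_transformer = Phrases(sentences, min_count=1, threshold=1)
    trigram_transformer = Phrases(bigram_transformer[sentences], min_count=1, threshold=1)

    TRIPHRASER_PATH = "trigram_phrases.pkl"  # hypothetical path
    trigram_transformer.save(TRIPHRASER_PATH)

    reloaded = Phrases.load(TRIPHRASER_PATH)
    print(reloaded[bigram_transformer[["machine", "learning", "is", "fun"]]])
    # e.g. ['machine_learning', 'is', 'fun'] (exact output depends on the thresholds)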
                                      

Plotly - Highlight data point and nearest three points on hover

    import gensim
    import numpy as np
    import pandas as pd
    from sklearn.manifold import TSNE
    import plotly.express as px
    import plotly.graph_objects as go

    import json

    import dash
    from dash import dcc, html, Input, Output

    external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
    app = dash.Dash(__name__, external_stylesheets=external_stylesheets)


    def get_2d_coordinates(model, words):
        arr = np.empty((0, 100), dtype='f')  # assumes 100-dimensional word vectors
        labels = []
        for wrd_score in words:
            try:
                wrd_vector = model.wv.get_vector(wrd_score)
                arr = np.append(arr, np.array([wrd_vector]), axis=0)
                labels.append(wrd_score)
            except KeyError:  # skip out-of-vocabulary words
                pass
        tsne = TSNE(n_components=2, random_state=0)
        np.set_printoptions(suppress=True)
        Y = tsne.fit_transform(arr)
        x_coords = Y[:, 0]
        y_coords = Y[:, 1]
        return x_coords, y_coords

    ic_model = gensim.models.Word2Vec.load("w2v_IceCream.model")
    ic = pd.read_csv('ic_prods.csv')

    icx, icy = get_2d_coordinates(ic_model, ic['ITEM_DESC'])
    ic_data = {'Category': ic['SUB_CATEGORY'],
               'Words': ic['ITEM_DESC'],
               'X': icx,
               'Y': icy}

    ic_df = pd.DataFrame(ic_data)
    ic_fig = px.scatter(ic_df, x=icx, y=icy, color=ic_df['Category'], hover_name=ic_df['Words'], title='IceCream Data')

    NUMBER_OF_TRACES = len(ic_df['Category'].unique())
    ic_fig.update_layout(clickmode='event+select')

    app.layout = html.Div([
        dcc.Graph(
            id='ic_figure',
            figure=ic_fig)
        ])

    ## we take the 4 closest points because the 1st closest point will be the point itself
    def get_n_closest_points(x0, y0, df=ic_df[['X','Y']].copy(), n=4):
        """We can save some computation time by comparing squared distances
        instead of distances: distance = sqrt[(x1-x0)^2 + (y1-y0)^2],
        so distance^2 = (x1-x0)^2 + (y1-y0)^2."""

        df["dist"] = (df["X"]-x0)**2 + (df["Y"]-y0)**2

        ## we don't return the point itself, which will always be closest to itself
        return df.sort_values(by="dist")[1:n][["X","Y"]].values

    @app.callback(
        Output('ic_figure', 'figure'),
        [Input('ic_figure', 'clickData'),
        Input('ic_figure', 'figure')]
        )
    def display_hover_data(clickData, figure):
        print(clickData)
        if clickData is None:
            # print("nothing was clicked")
            return figure
        else:
            hover_x, hover_y = clickData['points'][0]['x'], clickData['points'][0]['y']
            closest_points = get_n_closest_points(hover_x, hover_y)

            ## this means this function has ALREADY added another trace, so we reduce the number of traces down to the original number
            if len(figure['data']) > NUMBER_OF_TRACES:
                # print(f'reducing the number of traces to {NUMBER_OF_TRACES}')
                figure['data'] = figure['data'][:NUMBER_OF_TRACES]
                # print(figure['data'])

            new_traces = [{
                'marker': {'color': 'teal', 'symbol': 'circle'},
                'mode': 'markers',
                'orientation': 'v',
                'showlegend': False,
                'x': [x],
                'xaxis': 'x',
                'y': [y],
                'yaxis': 'y',
                'type': 'scatter',
                'selectedpoints': [0]
            } for x, y in closest_points]

            figure['data'].extend(new_traces)
            # print("after\n")
            # print(figure['data'])
            return figure

    if __name__ == '__main__':
        app.run_server(debug=True)
                                      

Error in pip install transformers: Building wheel for tokenizers (pyproject.toml): finished with status 'error'

The underlying failure in the build log is a missing Rust toolchain:

    error: can't find Rust compiler

In a Dockerfile, install Rust and put cargo on the PATH before running pip:

    RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
    ENV PATH="/root/.cargo/bin:${PATH}"
                                      

How to plot tsne on word2vec (created from gensim) for the most_similar 20 cases?

    from gensim.test.utils import common_texts
    from gensim.models import Word2Vec
    from sklearn.manifold import TSNE
    import matplotlib.pyplot as plt

    model = Word2Vec(sentences=common_texts, window=5, min_count=1)

    labels = list(model.wv.key_to_index)  # gensim 4 API; gensim 3 used model.wv.vocab.keys()
    tokens = model.wv[labels]

    # perplexity must be smaller than the number of points; the toy vocabulary is tiny
    tsne_model = TSNE(init='pca', learning_rate='auto', perplexity=5)
    new_values = tsne_model.fit_transform(tokens)
    x, y = new_values[:, 0], new_values[:, 1]  # the original snippet left x and y undefined

    plt.figure(figsize=(7, 5))
    for i in range(new_values.shape[0]):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')

    most_sim_words = [i[0] for i in model.wv.most_similar(positive='trees', topn=5)]
    most_sim_words
    ['human', 'graph', 'time', 'interface', 'system']

    # plot only the most similar words
    for word in most_sim_words:
        i = labels.index(word)
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
                                      

                                      How to get list of words for each topic for a specific relevance metric value (lambda) in pyLDAvis?

import pandas as pd

lambd = 0.6  # a specific relevance metric value (the lambda slider in the pyLDAvis UI)

all_topics = {}
num_topics = lda_model.num_topics
num_terms = 10

# pyLDAvis labels topics 'Topic1' .. 'TopicN', so iterate over 1..num_topics inclusive.
for i in range(1, num_topics + 1):
    topic = LDAvis_prepared.topic_info[LDAvis_prepared.topic_info.Category == 'Topic' + str(i)].copy()
    # relevance = lambda * logprob + (1 - lambda) * loglift
    topic['relevance'] = topic['loglift'] * (1 - lambd) + topic['logprob'] * lambd
    all_topics['Topic ' + str(i)] = topic.sort_values(by='relevance', ascending=False).Term[:num_terms].values

pd.DataFrame(all_topics).T
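The snippet assumes lda_model is a trained gensim LDA model and LDAvis_prepared is the output of pyLDAvis's prepare step; neither is shown in the original answer. A minimal sketch of producing both on a toy corpus (pyLDAvis 3.x; all names illustrative):

import pyLDAvis.gensim_models
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import common_texts

# Build a toy bag-of-words corpus and train a small LDA model on it.
dictionary = Dictionary(common_texts)
corpus = [dictionary.doc2bow(text) for text in common_texts]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, random_state=0)

# topic_info on the prepared data holds the Category/loglift/logprob columns used above.
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)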
                                      

                                      How to seek for bigram similarity in gensim word2vec model

# Rank neighbours of an explicitly averaged bigram vector ...
word2vec_model300.most_similar(positive=[average_vector])

# ... or pass both words and let most_similar combine their vectors itself.
word2vec_model300.most_similar(positive=['artificial', 'intelligence'])
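average_vector is not defined in the snippet; presumably it is the mean of the two word vectors, which makes the two calls roughly equivalent. A minimal sketch, assuming word2vec_model300 is a 300-dimensional model loaded from a word2vec-format file (path illustrative):

import numpy as np
from gensim.models import KeyedVectors

# Illustrative path; any word2vec-format vectors work.
word2vec_model300 = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

# Represent the bigram as the mean of its two word vectors.
average_vector = np.mean(
    [word2vec_model300['artificial'], word2vec_model300['intelligence']], axis=0)

# most_similar also accepts raw vectors in its positive list.
print(word2vec_model300.most_similar(positive=[average_vector], topn=5))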
                                      
                                      

                                      Community Discussions

                                      Trending Discussions on gensim
                                      • How to get average pairwise cosine similarity per group in Pandas
• 'KeyedVectors' object has no attribute 'wv' for gensim 4.1.2
                                      • Gensim phrases model vocabulary length does not correspond to amount of iteratively added documents
                                      • Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?
                                      • Unpickle instance from Jupyter Notebook in Flask App
                                      • Word2Vec returning vectors for individual character and not words
                                      • No such file or directory: 'GoogleNews-vectors-negative300.bin'
                                      • How to store the Phrase trigrams gensim model after training
                                      • Plotly - Highlight data point and nearest three points on hover
                                      • gensim w2k - additional file

                                      QUESTION

                                      How to get average pairwise cosine similarity per group in Pandas

                                      Asked 2022-Mar-29 at 20:51

                                      I have a sample dataframe as below

import numpy as np
import pandas as pd

df = pd.DataFrame(np.array(
    [['facebook', 'women tennis'], ['facebook', 'men basketball'], ['facebook', 'club'],
     ['apple', 'vice president'], ['apple', 'swimming contest']]),
    columns=['firm', 'text'])
                                      


Now I'd like to calculate the degree of text similarity within each firm using word embeddings. For example, the average cosine similarity for facebook would be the average of the pairwise cosine similarities among rows 0, 1, and 2. The final dataframe should have a column ['mean_cos_between_items'] next to each row for each firm. The value will be the same for each company, since it is a within-firm pairwise comparison.

I wrote the code below:

import numpy as np
from itertools import combinations

import gensim
from gensim import utils
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.metrics.pairwise import cosine_similarity
                                      
# map each word to vector space
def represent(sentence):
    vectors = []
    for word in sentence:
        try:
            vector = model.wv[word]
            vectors.append(vector)
        except KeyError:
            pass
    return np.array(vectors).mean(axis=0)

# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items if word in model_glove.vocab]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
                                          
                                      # get average pairwise cosine distance score 
                                      def mean_cos_sim(grp):
                                         output = []
                                         for i,j in combinations(grp.index.tolist(),2 ): 
                                             doc_vec=document_vector(grp.iloc[i]['text'])
                                             if doc_vec is not None and len(doc_vec) > 0:      
                                                 sim = cosine_similarity(document_vector(grp.iloc[i]['text']).reshape(1,-1),document_vector(grp.iloc[j]['text']).reshape(1,-1))
                                                 output.append([i, j, sim])
                                             return np.mean(np.array(output), axis=0)
                                      
                                      # save the result to a new column    
                                      df['mean_cos_between_items']=df.groupby(['firm']).apply(mean_cos_sim)
                                      

However, I get an error (shown in the original post only as a screenshot of the traceback).

                                      Could you kindly help? Thanks!

                                      ANSWER

                                      Answered 2022-Mar-29 at 18:47

Remove the .vocab in model_glove.vocab; it is not supported in the current version of gensim any more. (Edit: the code also needs split() here so that it iterates over words and not characters.)

                                      # get average if more than 1 word is included in the "text" column
                                      def document_vector(items):
                                          # remove out-of-vocabulary words
                                          doc = [word for word in items.split() if word in model_glove]
                                          if doc:
                                              doc_vector = model_glove[doc]
                                              mean_vec = np.mean(doc_vector, axis=0)
                                          else:
                                              mean_vec = None
                                          return mean_vec
                                      

Here you iterate over tuples of indices when you want to iterate over the values, so drop the .index. Also, output collects the indices i and j alongside each similarity, so averaging it directly would mix them in; since you don't actually need i and j, just collect the resulting sims in a list and take that list's average:

                                      # get pairwise cosine similarity score
                                      def mean_cos_sim(grp):
                                          output = []
                                          for i, j in combinations(grp.tolist(), 2):
                                              if document_vector(i) is not None and len(document_vector(i)) > 0:
                                                  sim = cosine_similarity(document_vector(i).reshape(1, -1), document_vector(j).reshape(1, -1))
                                                  output.append(sim)
                                          return np.mean(output, axis=0)
                                      

Here you try to add the results as a column, but the number of rows differs: the result only has one row per firm, while the original DataFrame has one row per text. So you have to create a new DataFrame, which you can optionally merge/join with the original DataFrame on the firm column (a sketch of that merge follows the output below):

                                      df = pd.DataFrame(np.array(
                                          [['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
                                           ['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
                                      df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)
                                      

                                      Which overall will give you (Edit: updated):

                                      print(df_grpd)
                                      > firm
                                        apple       [[0.53190523]]
                                        facebook    [[0.83989316]]
                                        Name: text, dtype: object
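If you do want the score back on every original row, as the mean_cos_between_items column in the question, one option is to unwrap the 1x1 similarity arrays and merge on firm. A minimal sketch, assuming df and df_grpd as computed above:

# cosine_similarity returns 1x1 arrays, so unwrap them to plain floats first.
scores = df_grpd.apply(lambda a: float(a[0][0])).rename('mean_cos_between_items')

# Back to one row per text, with the firm-level score repeated on each row.
df = df.merge(scores, left_on='firm', right_index=True)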
                                      
Edit:

I just noticed that the reason for the suspiciously high scores was the missing tokenization; see the changed part above. Without split() this just compares character similarities, which tend to be very high.

                                      Source https://stackoverflow.com/questions/71666450

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

                                      Vulnerabilities

                                      No vulnerabilities reported

                                      Install gensim

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim. It is also recommended that you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as MKL, ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On macOS, NumPy picks up its vecLib BLAS automatically, so you don't need to do anything special.
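With NumPy and SciPy in place, gensim itself installs from PyPI in the usual way:

pip install --upgrade gensim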

                                      Support

• QuickStart: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
• Tutorials: https://radimrehurek.com/gensim/auto_examples/
• Official Documentation and Walkthrough: http://radimrehurek.com/gensim/
• Official API Documentation: http://radimrehurek.com/gensim/apiref.html
