
gensim | Topic Modelling for Humans | Topic Modeling library


kandi X-RAY | gensim Summary

gensim is a Python library typically used in Institutions, Learning, Education, Artificial Intelligence, and Topic Modeling applications. kandi reports no bugs and no vulnerabilities for gensim. A build file is available, the library is released under a Weak Copyleft License, and it has high support. You can download it from GitHub.
Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Support

  • gensim has a highly active ecosystem.
  • It has 13112 star(s) with 4201 fork(s). There are 431 watchers for this library.
  • There were 3 major release(s) in the last 12 months.
  • There are 346 open issues and 1381 have been closed. On average issues are closed in 195 days. There are 30 open pull requests and 0 closed requests.
  • It has a positive sentiment in the developer community.
  • The latest version of gensim is 4.1.2

Quality

  • gensim has 0 bugs and 0 code smells.

Security

  • gensim has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • gensim code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • gensim is licensed under the LGPL-2.1 License. This license is Weak Copyleft.
  • Weak Copyleft licenses have some restrictions, but you can use them in commercial projects.

Reuse

  • gensim releases are available to install and integrate.
  • Build file is available. You can build the component from source.
  • Installation instructions, examples and code snippets are available.
  • It has 61066 lines of code, 2260 functions and 199 files.
  • It has high code complexity. Code complexity directly impacts maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed gensim and discovered the below as its top functions. This is intended to give you an instant insight into gensim implemented functionality, and help decide if they suit your requirements.

  • Updates the LdaModel with the given data.
  • Prepare the vocab.
  • Stochastic SVD.
  • Add a model to the model.
  • Evaluate the word analogies in the model.
  • Construct a sparse term similarity matrix.
  • Compute inner product between two matrices.
  • Evaluate an Eloga polynomials.
  • Save special attributes to a file.
  • Compute the topic clustering.

gensim Key Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
  • Intuitive interfaces:
      • easy to plug in your own input corpus/datastream (trivial streaming API)
      • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive documentation and Jupyter Notebook tutorials.
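
The streaming idea behind the first two features can be sketched without gensim itself. gensim represents a document as a sparse list of (token_id, count) pairs and accepts any iterable that yields such lists, so a corpus only needs to produce one document at a time. The sketch below uses plain Python only; StreamedCorpus, build_token2id and RAW_DOCS are hypothetical names for illustration, and in real use gensim's own corpora.Dictionary would normally handle the token-to-id mapping.

```python
from collections import Counter

# Hypothetical stand-in for a large file; in practice lines would be read lazily from disk.
RAW_DOCS = [
    "human machine interface",
    "graph of trees",
    "graph minors survey",
]


def build_token2id(lines):
    """Assign integer ids to tokens in order of first appearance (one pass over the data)."""
    token2id = {}
    for line in lines:
        for token in line.split():
            token2id.setdefault(token, len(token2id))
    return token2id


class StreamedCorpus:
    """Yield one bag-of-words vector at a time instead of holding all vectors in RAM."""

    def __init__(self, lines, token2id):
        self.lines = lines
        self.token2id = token2id

    def __iter__(self):
        for line in self.lines:
            counts = Counter(line.split())
            # Sparse vector: sorted (token_id, count) pairs for known tokens only.
            yield sorted(
                (self.token2id[tok], n) for tok, n in counts.items() if tok in self.token2id
            )


token2id = build_token2id(RAW_DOCS)
corpus = StreamedCorpus(RAW_DOCS, token2id)
vectors = list(corpus)  # materialised here only so the result can be inspected
```

Because the corpus object defines __iter__ and does not behave as a one-shot generator, it can be iterated several times, which multi-pass training algorithms need.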

gensim Examples and Code Snippets

  • Installation
  • Documentation
  • Support
  • How to get average pairwise cosine similarity per group in Pandas
  • Unpickle instance from Jupyter Notebook in Flask App
  • Word2Vec returning vectors for individual character and not words
  • No such file or directory: 'GoogleNews-vectors-negative300.bin'
  • How to store the Phrase trigrams gensim model after training
  • Plotly - Highlight data point and nearest three points on hover
  • Error in pip install transformers: Building wheel for tokenizers (pyproject.toml): finished with status 'error'
  • How to plot tsne on word2vec (created from gensim) for the most_similar 20 cases?
  • How to get list of words for each topic for a specific relevance metric value (lambda) in pyLDAvis?
  • How to seek for bigram similarity in gensim word2vec model

Installation

    pip install --upgrade gensim

Community Discussions

Trending Discussions on gensim
  • How to get average pairwise cosine similarity per group in Pandas
  • KeyedVectors' object has no attribute 'wv for gensim 4.1.2
  • Gensim phrases model vocabulary length does not correspond to amount of iteratively added documents
  • Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?
  • Unpickle instance from Jupyter Notebook in Flask App
  • Word2Vec returning vectors for individual character and not words
  • No such file or directory: 'GoogleNews-vectors-negative300.bin'
  • How to store the Phrase trigrams gensim model after training
  • Plotly - Highlight data point and nearest three points on hover
  • gensim w2k - additional file

QUESTION

How to get average pairwise cosine similarity per group in Pandas

Asked 2022-Mar-29 at 20:51

I have a sample dataframe as below

df=pd.DataFrame(np.array([['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],['apple', "vice president"], ['apple', 'swimming contest']]),columns=['firm','text'])


Now I'd like to calculate the degree of text similarity within each firm using word embedding. For example, the average cosine similarity for facebook would be the cosine similarity between row 0, 1, and 2. The final dataframe should have a column ['mean_cos_between_items'] next to each row for each firm. The value will be the same for each company, since it is a within-firm pairwise comparison.

I wrote below code:

import gensim
from gensim import utils
from gensim.models import Word2Vec
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from sklearn.metrics.pairwise import cosine_similarity

 # map each word to vector space
    def represent(sentence):
        vectors = []
        for word in sentence:
            try:
                vector = model.wv[word]
                vectors.append(vector)
            except KeyError:
                pass
        return np.array(vectors).mean(axis=0)
    
    # get average if more than 1 word is included in the "text" column
    def document_vector(items):
        # remove out-of-vocabulary words
        doc = [word for word in items if word in model_glove.vocab]
        if doc:
            doc_vector = model_glove[doc]
            mean_vec=np.mean(doc_vector, axis=0)
        else:
            mean_vec = None
        return mean_vec
    
# get average pairwise cosine distance score 
def mean_cos_sim(grp):
   output = []
   for i,j in combinations(grp.index.tolist(),2 ): 
       doc_vec=document_vector(grp.iloc[i]['text'])
       if doc_vec is not None and len(doc_vec) > 0:      
           sim = cosine_similarity(document_vector(grp.iloc[i]['text']).reshape(1,-1),document_vector(grp.iloc[j]['text']).reshape(1,-1))
           output.append([i, j, sim])
       return np.mean(np.array(output), axis=0)

# save the result to a new column    
df['mean_cos_between_items']=df.groupby(['firm']).apply(mean_cos_sim)

However, I got an error (shown in a screenshot that is not reproduced here).

Could you kindly help? Thanks!

ANSWER

Answered 2022-Mar-29 at 18:47

Remove the .vocab in model_glove.vocab; this attribute is no longer supported in the current version of gensim. Edit: the text also needs split() so that you iterate over words and not characters:

# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items.split() if word in model_glove]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec
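
To see why split() matters here, the following plain-Python sketch compares the two loops. The vocab set is a hypothetical stand-in for the keys of model_glove; it deliberately contains single letters, as pretrained vocabularies usually do, which is why the character-level version fails silently instead of raising an error.

```python
text = "women tennis"

# Hypothetical vocabulary standing in for the keys of model_glove.
vocab = {"women", "tennis", "w", "o", "m", "e", "n", "t", "i", "s"}

# Iterating a string directly yields single characters.
chars = [tok for tok in text if tok in vocab]

# Splitting on whitespace first yields whole words.
words = [tok for tok in text.split() if tok in vocab]

print(chars)
print(words)
```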

Here you iterate over tuples of indices when you want to iterate over the values, so drop the .index. You also put everything into output, including the indices i and j, so taking the mean would average over those as well. Since you do not seem to need i and j, you can put only the resulting similarities in a list and then take the list's average:

# get pairwise cosine similarity score
def mean_cos_sim(grp):
    output = []
    for i, j in combinations(grp.tolist(), 2):
        if document_vector(i) is not None and len(document_vector(i)) > 0:
            sim = cosine_similarity(document_vector(i).reshape(1, -1), document_vector(j).reshape(1, -1))
            output.append(sim)
    return np.mean(output, axis=0)

Here you try to add the results as a column but the number of rows is going to be different as the result DataFrame only has one row per firm while the original DataFrame has one per text. So you have to create a new DataFrame (which you can optionally then merge/join with the original DataFrame based on the firm column):

df = pd.DataFrame(np.array(
    [['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
     ['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)

Which overall will give you (Edit: updated):

print(df_grpd)
> firm
  apple       [[0.53190523]]
  facebook    [[0.83989316]]
  Name: text, dtype: object

Edit:

I just noticed that the reason for the very high scores was missing tokenization; see the split() call in document_vector above. Without split(), the code compares character-level similarities, which tend to be very high.
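
For reference, the whole approach (tokenize, average the word vectors, compare every pair within a firm, then attach the per-firm mean to each row) can be sketched in plain Python. This is only an illustration under stated assumptions: EMB is a hypothetical two-dimensional embedding table standing in for model_glove, so the numbers differ from the output above, and texts with no known words are skipped on both sides of a pair, which is slightly stricter than the code in the answer.

```python
from itertools import combinations
from math import sqrt

# Hypothetical 2-d embeddings standing in for model_glove.
EMB = {
    "women": (1.0, 0.0), "tennis": (0.0, 1.0),
    "men": (1.0, 0.0), "basketball": (0.0, 1.0),
    "club": (1.0, 0.0),
    "vice": (1.0, 0.0), "president": (1.0, 0.0),
    "swimming": (0.0, 1.0), "contest": (0.0, 1.0),
}


def document_vector(text):
    """Average the vectors of all in-vocabulary words; None if there are none."""
    vecs = [EMB[word] for word in text.split() if word in EMB]
    if not vecs:
        return None
    return tuple(sum(column) / len(vecs) for column in zip(*vecs))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def mean_pairwise_cosine(texts):
    """Mean cosine similarity over all unordered pairs of usable texts."""
    vecs = [vec for vec in map(document_vector, texts) if vec is not None]
    sims = [cosine(u, v) for u, v in combinations(vecs, 2)]
    return sum(sims) / len(sims) if sims else None


rows = [
    ("facebook", "women tennis"), ("facebook", "men basketball"), ("facebook", "club"),
    ("apple", "vice president"), ("apple", "swimming contest"),
]

# Group texts by firm, score each group once, then attach the score to every row.
groups = {}
for firm, text in rows:
    groups.setdefault(firm, []).append(text)
per_firm = {firm: mean_pairwise_cosine(texts) for firm, texts in groups.items()}
with_score = [(firm, text, per_firm[firm]) for firm, text in rows]
```

The last two statements play the role of the optional merge mentioned above: each firm is scored once, and the value is then repeated on every row of that firm.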

Source https://stackoverflow.com/questions/71666450

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install gensim

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim. It is also recommended that you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as MKL, ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OSX, NumPy picks up its vecLib BLAS automatically, so you don’t need to do anything special.

Support

  • QuickStart: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
  • Tutorials: https://radimrehurek.com/gensim/auto_examples/
  • Official Documentation and Walkthrough: http://radimrehurek.com/gensim/
  • Official API Documentation: http://radimrehurek.com/gensim/apiref.html
