gensim | Topic Modelling for Humans | Topic Modeling library

by RaRe-Technologies | Python Version: 4.3.1 | License: LGPL-2.1

kandi X-RAY | gensim Summary

gensim is a Python library typically used in institutional, learning, education, artificial intelligence, and topic modeling applications. It has no reported bugs or vulnerabilities, ships with a build file, carries a Weak Copyleft license, and has high community support. You can install it with 'pip install gensim' or download it from GitHub or PyPI.
Topic Modelling for Humans

Support

gensim has a highly active ecosystem.
It has 14,076 stars and 4,334 forks. There are 431 watchers for this library.
There were 2 major releases in the last 6 months.
There are 365 open issues and 1,431 closed issues. On average, issues are closed in 102 days. There are 31 open pull requests and 0 closed requests.
It has a positive sentiment in the developer community.
The latest version of gensim is 4.3.1.

Quality

gensim has 0 bugs and 0 code smells.

Security

gensim has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
gensim code analysis shows 0 unresolved vulnerabilities.
There are 0 security hotspots that need review.

License

gensim is licensed under the LGPL-2.1 License. This license is Weak Copyleft.
Weak Copyleft licenses have some restrictions, but you can use them in commercial projects.

Reuse

gensim releases are available to install and integrate.
A deployable package is available on PyPI.
A build file is available, so you can build the component from source.
Installation instructions, examples and code snippets are available.
It has 61,066 lines of code, 2,260 functions and 199 files.
It has high code complexity. Code complexity directly impacts the maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed gensim and discovered the below as its top functions. This is intended to give you an instant insight into gensim's implemented functionality, and help decide if they suit your requirements.

• Update the model with a given corpus
• Perform the inference on the given document
• Compute the phinorm
• Evaluate the model
• Update the LDA model with a given corpus
• Evaluate a single step
• Add metrics to the plot
• Set the model
• Fit LDAPE algorithm
• Merge two projections
• Write a corpus to a file
• Estimate the probability of a boolean sliding window
• Extract articles and positions from a file
• Load a model
• Add new documents to the LsiModel
• Return a unit vector
• Update the model with the given corpus
• Update the LDA
• Add a model to the model
• Train the model
• Evaluate the word analogies in the model
• Evaluate a list of words
• Compute the difference between two topics
• Construct a sparse term similarity matrix
• Compute the inner product between two matrices
• Compute the distance between two documents

Get all kandi verified functions for this library.

                                                                                  gensim Key Features

All algorithms are memory-independent w.r.t. the corpus size (they can process input larger than RAM, streamed, out-of-core).
Intuitive interfaces:
easy to plug in your own input corpus/datastream (trivial streaming API); see the sketch after this list
easy to extend with other Vector Space algorithms (trivial transformation API)
Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
Extensive documentation and Jupyter Notebook tutorials.
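
Below is a minimal sketch (not taken from the kandi page) of the streamed workflow these features describe: build a Dictionary from tokenized documents, convert them to bag-of-words vectors, and train an LDA model. The tiny corpus and parameter values are illustrative assumptions only.

from gensim import corpora, models

# toy tokenized documents; in practice these would be streamed from disk
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["graph", "trees", "minors", "survey"],
]

dictionary = corpora.Dictionary(texts)                # map tokens to integer ids
bow_corpus = [dictionary.doc2bow(t) for t in texts]   # sparse bag-of-words vectors

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)
print(lda.print_topics())                             # inspect the learned topics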

                                                                                  gensim Examples and Code Snippets

How to get average pairwise cosine similarity per group in Pandas
Python | Lines of Code: 31 | License: Strong Copyleft (CC BY-SA 4.0)
# Assumes `model_glove` is a pre-loaded gensim KeyedVectors model (e.g. GloVe
# vectors loaded through gensim.downloader or KeyedVectors.load_word2vec_format).
import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

# get average if more than 1 word is included in the "text" column
def document_vector(items):
    # remove out-of-vocabulary words
    doc = [word for word in items.split() if word in model_glove]
    if doc:
        doc_vector = model_glove[doc]
        mean_vec = np.mean(doc_vector, axis=0)
    else:
        mean_vec = None
    return mean_vec

# get pairwise cosine similarity score
def mean_cos_sim(grp):
    output = []
    for i, j in combinations(grp.tolist(), 2):
        if document_vector(i) is not None and len(document_vector(i)) > 0:
            sim = cosine_similarity(document_vector(i).reshape(1, -1), document_vector(j).reshape(1, -1))
            output.append(sim)
    return np.mean(output, axis=0)

df = pd.DataFrame(np.array(
    [['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
     ['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)

print(df_grpd)
> firm
  apple       [[0.53190523]]
  facebook    [[0.83989316]]
  Name: text, dtype: object
                                                                                  
Plotly - Highlight data point and nearest three points on hover
Python | Lines of Code: 107 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  import gensim
                                                                                  import numpy as np
                                                                                  import pandas as pd
                                                                                  from sklearn.manifold import TSNE
                                                                                  import plotly.express as px
                                                                                  import plotly.graph_objects as go
                                                                                  
                                                                                  import json
                                                                                  
                                                                                  import dash
                                                                                  from dash import dcc, html, Input, Output
                                                                                  
                                                                                  external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
                                                                                  app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
                                                                                  
                                                                                  
                                                                                  def get_2d_coordinates(model, words):
                                                                                      arr = np.empty((0,100), dtype='f')
                                                                                      labels = []
                                                                                      for wrd_score in words:
                                                                                          try:
                                                                                              wrd_vector = model.wv.get_vector(wrd_score)
                                                                                              arr = np.append(arr, np.array([wrd_vector]), axis=0)
                                                                                              labels.append(wrd_score)
                                                                                          except:
                                                                                              pass
                                                                                      tsne = TSNE(n_components=2, random_state=0)
                                                                                      np.set_printoptions(suppress=True)
                                                                                      Y = tsne.fit_transform(arr)
                                                                                      x_coords = Y[:, 0]
                                                                                      y_coords = Y[:, 1]
                                                                                      return x_coords, y_coords
                                                                                  
                                                                                  ic_model = gensim.models.Word2Vec.load("w2v_IceCream.model")
                                                                                  ic = pd.read_csv('ic_prods.csv')
                                                                                  
                                                                                  icx, icy = get_2d_coordinates(ic_model, ic['ITEM_DESC'])
                                                                                  ic_data = {'Category': ic['SUB_CATEGORY'],
                                                                                              'Words':ic['ITEM_DESC'],
                                                                                              'X':icx,
                                                                                              'Y':icy}
                                                                                  
                                                                                  ic_df = pd.DataFrame(ic_data)
                                                                                  ic_fig = px.scatter(ic_df, x=icx, y=icy, color=ic_df['Category'], hover_name=ic_df['Words'], title='IceCream Data')
                                                                                  
                                                                                  NUMBER_OF_TRACES = len(ic_df['Category'].unique())
                                                                                  ic_fig.update_layout(clickmode='event+select')
                                                                                  
                                                                                  app.layout = html.Div([
                                                                                      dcc.Graph(
                                                                                          id='ic_figure',
                                                                                          figure=ic_fig)
                                                                                      ])
                                                                                  
                                                                                  ## we take the 4 closest points because the 1st closest point will be the point itself
                                                                                  def get_n_closest_points(x0, y0, df=ic_df[['X','Y']].copy(), n=4):
                                                                                  
                                                                                      """we can save some computation time by looking for the smallest distance^2 instead of distance"""
                                                                                      """distance = sqrt[(x1-x0)^2 + (y1-y0)^2]"""
                                                                                      """distance^2 = [(x1-x0)^2 + (y1-y0)^2]"""
                                                                                      
                                                                                      df["dist"] = (df["X"]-x0)**2 + (df["Y"]-y0)**2
                                                                                  
                                                                                      ## we don't return the point itself which will always be closest to itself
                                                                                      return df.sort_values(by="dist")[1:n][["X","Y"]].values
                                                                                  
                                                                                  @app.callback(
                                                                                      Output('ic_figure', 'figure'),
                                                                                      [Input('ic_figure', 'clickData'),
                                                                                      Input('ic_figure', 'figure')]
                                                                                      )
                                                                                  def display_hover_data(clickData, figure):
                                                                                      print(clickData)
                                                                                      if clickData is None:
                                                                                          # print("nothing was clicked")
                                                                                          return figure
                                                                                      else:
                                                                                          hover_x, hover_y = clickData['points'][0]['x'], clickData['points'][0]['y']
                                                                                          closest_points = get_n_closest_points(hover_x, hover_y)
                                                                                  
                                                                                          ## this means that this function has ALREADY added another trace, so we reduce the number of traces down the original number
                                                                                          if len(figure['data']) > NUMBER_OF_TRACES:
                                                                                              # print(f'reducing the number of traces to {NUMBER_OF_TRACES}')
                                                                                              figure['data'] = figure['data'][:NUMBER_OF_TRACES]
                                                                                              # print(figure['data'])
                                                                                          
                                                                                          new_traces = [{
                                                                                              'marker': {'color': 'teal', 'symbol': 'circle'},
                                                                                              'mode': 'markers',
                                                                                              'orientation': 'v',
                                                                                              'showlegend': False,
                                                                                              'x': [x],
                                                                                              'xaxis': 'x',
                                                                                              'y': [y],
                                                                                              'yaxis': 'y',
                                                                                              'type': 'scatter',
                                                                                              'selectedpoints': [0]
                                                                                          } for x,y in closest_points]
                                                                                  
                                                                                          figure['data'].extend(new_traces)
                                                                                          # print("after\n")
                                                                                          # print(figure['data'])
                                                                                          return figure
                                                                                  
                                                                                  if __name__ == '__main__':
                                                                                      app.run_server(debug=True)
                                                                                  
How to get the dimensions of a word2vec object in python?
Python | Lines of Code: 13 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  import gensim
                                                                                  gensim.__version__
                                                                                  # 3.6.0
                                                                                  
                                                                                  from gensim.test.utils import common_texts
                                                                                  from gensim.models import Word2Vec
                                                                                  
                                                                                  model = Word2Vec(sentences=common_texts, window=5, min_count=1, workers=4) # do not specify size, leave the default 100
                                                                                  
                                                                                  wv = model.wv['computer']  # get numpy vector of a word in the corpus
                                                                                  wv.shape # verify the dimension of a single vector is 100
                                                                                  # (100,)
                                                                                  
'Doc2Vec' object has no attribute 'outputs', while saving doc2vec for tensorflow serving
Python | Lines of Code: 5 | License: Strong Copyleft (CC BY-SA 4.0)
from gensim.models.doc2vec import Doc2Vec  # assumes initial_model is an already-trained Doc2Vec

filename = 'my_doc2vec_model'
initial_model.save(filename)

reloaded_model = Doc2Vec.load(filename)
                                                                                  
No such file or directory: 'GoogleNews-vectors-negative300.bin'
Python | Lines of Code: 4 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  serv/GoogleNews-vectors-negative300.bin
                                                                                  
                                                                                  /Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/GoogleNews-vectors-negative300.bin
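
As a hedged follow-up to the paths above, one way to make the location unambiguous is to load the vectors with an absolute path, so the Flask app's working directory no longer matters. The path is the one from the question; adjust it to wherever the .bin file actually lives.

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format(
    "/Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/GoogleNews-vectors-negative300.bin",
    binary=True,  # the GoogleNews vectors ship in binary word2vec format
)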
                                                                                  
How to store the Phrase trigrams gensim model after training
Python | Lines of Code: 4 | License: Strong Copyleft (CC BY-SA 4.0)
from gensim.models.phrases import Phrases  # assumes trigram_transformer is a trained Phrases model

trigram_transformer.save(TRIPHRASER_PATH)

reloads_trigram_transformer = Phrases.load(TRIPHRASER_PATH)
                                                                                  
Problem with creating dictionary with gensim for LDA
Python | Lines of Code: 6 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  from gensim import corpora
                                                                                  corpus = [
                                                                                      ['door', 'cat', 'mom'],
                                                                                  ]
                                                                                  dictionary = corpora.Dictionary(corpus)
                                                                                  
gensim/ Training a LDA Model: 'int' object is not subscriptable
Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)
                                                                                  corpus2 = [dct.doc2bow(filtered_sentence),]
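
For context, here is a hedged, self-contained sketch around that one-line fix: LdaModel expects a corpus of documents (a list of bag-of-words lists), so a single doc2bow result must be wrapped in an outer list. The toy tokens and the names dct / filtered_sentence follow the snippet and are illustrative only.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

filtered_sentence = ["door", "cat", "mom"]   # toy token list
dct = Dictionary([filtered_sentence])        # build the token -> id mapping

corpus2 = [dct.doc2bow(filtered_sentence)]   # note the outer list: a corpus of one document
lda = LdaModel(corpus=corpus2, id2word=dct, num_topics=1)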
                                                                                  
Using gensim most_similar function on a subset of total vocab
Python | Lines of Code: 5 | License: Strong Copyleft (CC BY-SA 4.0)
# assumes wv_from_bin is a loaded KeyedVectors instance
finite_set = set(['word_d', 'word_e', 'word_f'])  # set for efficient 'in'
all_candidates = wv_from_bin.most_similar(positive=["word_a", "word_b"],
                                          topn=len(wv_from_bin))
filtered_results = [word_sim for word_sim in all_candidates if word_sim[0] in finite_set]
                                                                                  
                                                                                  error: can't find Rust compiler
                                                                                  
                                                                                  RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
                                                                                  ENV PATH="/root/.cargo/bin:${PATH}"
                                                                                  
                                                                                  Community Discussions

                                                                                  Trending Discussions on gensim

How to get average pairwise cosine similarity per group in Pandas
'KeyedVectors' object has no attribute 'wv' for gensim 4.1.2
Gensim phrases model vocabulary length does not correspond to amount of iteratively added documents
Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?
Unpickle instance from Jupyter Notebook in Flask App
Word2Vec returning vectors for individual character and not words
No such file or directory: 'GoogleNews-vectors-negative300.bin'
How to store the Phrase trigrams gensim model after training
Plotly - Highlight data point and nearest three points on hover
gensim w2k - additional file

                                                                                  QUESTION

                                                                                  How to get average pairwise cosine similarity per group in Pandas
                                                                                  Asked 2022-Mar-29 at 20:51

                                                                                  I have a sample dataframe as below

                                                                                  df=pd.DataFrame(np.array([['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],['apple', "vice president"], ['apple', 'swimming contest']]),columns=['firm','text'])
                                                                                  

                                                                                  Now I'd like to calculate the degree of text similarity within each firm using word embedding. For example, the average cosine similarity for facebook would be the cosine similarity between row 0, 1, and 2. The final dataframe should have a column ['mean_cos_between_items'] next to each row for each firm. The value will be the same for each company, since it is a within-firm pairwise comparison.

I wrote the code below:

                                                                                  import gensim
                                                                                  from gensim import utils
                                                                                  from gensim.models import Word2Vec
                                                                                  from gensim.models import KeyedVectors
                                                                                  from gensim.scripts.glove2word2vec import glove2word2vec
                                                                                  from sklearn.metrics.pairwise import cosine_similarity
                                                                                  
                                                                                   # map each word to vector space
                                                                                      def represent(sentence):
                                                                                          vectors = []
                                                                                          for word in sentence:
                                                                                              try:
                                                                                                  vector = model.wv[word]
                                                                                                  vectors.append(vector)
                                                                                              except KeyError:
                                                                                                  pass
                                                                                          return np.array(vectors).mean(axis=0)
                                                                                      
                                                                                      # get average if more than 1 word is included in the "text" column
                                                                                      def document_vector(items):
                                                                                          # remove out-of-vocabulary words
                                                                                          doc = [word for word in items if word in model_glove.vocab]
                                                                                          if doc:
                                                                                              doc_vector = model_glove[doc]
                                                                                              mean_vec=np.mean(doc_vector, axis=0)
                                                                                          else:
                                                                                              mean_vec = None
                                                                                          return mean_vec
                                                                                      
                                                                                  # get average pairwise cosine distance score 
                                                                                  def mean_cos_sim(grp):
                                                                                     output = []
                                                                                     for i,j in combinations(grp.index.tolist(),2 ): 
                                                                                         doc_vec=document_vector(grp.iloc[i]['text'])
                                                                                         if doc_vec is not None and len(doc_vec) > 0:      
                                                                                             sim = cosine_similarity(document_vector(grp.iloc[i]['text']).reshape(1,-1),document_vector(grp.iloc[j]['text']).reshape(1,-1))
                                                                                             output.append([i, j, sim])
                                                                                         return np.mean(np.array(output), axis=0)
                                                                                  
                                                                                  # save the result to a new column    
                                                                                  df['mean_cos_between_items']=df.groupby(['firm']).apply(mean_cos_sim)
                                                                                  

                                                                                  However, I got below error:

                                                                                  Could you kindly help? Thanks!

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-29 at 18:47

Remove the .vocab in model_glove.vocab; it is no longer supported in the current version of gensim. Edit: you also need split() here so that you iterate over words and not characters.

                                                                                  # get average if more than 1 word is included in the "text" column
                                                                                  def document_vector(items):
                                                                                      # remove out-of-vocabulary words
                                                                                      doc = [word for word in items.split() if word in model_glove]
                                                                                      if doc:
                                                                                          doc_vector = model_glove[doc]
                                                                                          mean_vec = np.mean(doc_vector, axis=0)
                                                                                      else:
                                                                                          mean_vec = None
                                                                                      return mean_vec
                                                                                  

Here you iterate over tuples of indices when you want to iterate over the values, so drop the .index. Also, you put all values into output, including the indices i and j, so if you wanted their average you would have to specify exactly what to average over. Since you do not seem to need i and j, you can just collect the resulting sims in a list and take that list's average:

                                                                                  # get pairwise cosine similarity score
                                                                                  def mean_cos_sim(grp):
                                                                                      output = []
                                                                                      for i, j in combinations(grp.tolist(), 2):
                                                                                          if document_vector(i) is not None and len(document_vector(i)) > 0:
                                                                                              sim = cosine_similarity(document_vector(i).reshape(1, -1), document_vector(j).reshape(1, -1))
                                                                                              output.append(sim)
                                                                                      return np.mean(output, axis=0)
                                                                                  

                                                                                  Here you try to add the results as a column but the number of rows is going to be different as the result DataFrame only has one row per firm while the original DataFrame has one per text. So you have to create a new DataFrame (which you can optionally then merge/join with the original DataFrame based on the firm column):

                                                                                  df = pd.DataFrame(np.array(
                                                                                      [['facebook', "women tennis"], ['facebook', "men basketball"], ['facebook', 'club'],
                                                                                       ['apple', "vice president"], ['apple', 'swimming contest']]), columns=['firm', 'text'])
                                                                                  df_grpd = df.groupby(['firm'])["text"].apply(mean_cos_sim)
                                                                                  

                                                                                  Which overall will give you (Edit: updated):

                                                                                  print(df_grpd)
                                                                                  > firm
                                                                                    apple       [[0.53190523]]
                                                                                    facebook    [[0.83989316]]
                                                                                    Name: text, dtype: object
                                                                                  
                                                                                  Edit:

I just noticed that the reason for the unusually high scores is that the original code was missing tokenization; see the changed part. Without the split(), this just compares character similarities, which tend to be very high.

                                                                                  Source https://stackoverflow.com/questions/71666450

                                                                                  QUESTION

'KeyedVectors' object has no attribute 'wv' for gensim 4.1.2
                                                                                  Asked 2022-Mar-20 at 19:43

I have migrated from gensim 3.8.3 to 4.1.2 and I am using this:

claim = [token for token in claim_text if token in w2v_model.wv.vocab]

reference = [token for token in ref_text if token in w2v_model.wv.vocab]

I am not sure how to replace w2v_model.wv.vocab with the newer attribute, and I am getting this error:

"'KeyedVectors' object has no attribute 'wv'". Can anyone please help?

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-20 at 19:43

                                                                                  You only use the .wv property to fetch the KeyedVectors object from another more complete algorithmic model, like a full Word2Vec model (which contains a KeyedVectors in its .wv attribute).

                                                                                  If you're already working with just-the-vectors, there's no need to request the word-vectors subcomponent. Whatever you were going to do, you just do to the KeyedVectors directly.

                                                                                  However, you're also using the .vocab attribute, which has been replaced. See the migration FAQ for more details:

                                                                                  https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#4-vocab-dict-became-key_to_index-for-looking-up-a-keys-integer-index-or-get_vecattr-and-set_vecattr-for-other-per-key-attributes

                                                                                  (Mainly: instead of doing an in w2v_model.wv.vocab, you may only need to do in kv_model or in kv_model.key_to_index.)
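
A small, hedged sketch of that change (not part of the original answer); the toy training corpus below stands in for the KeyedVectors the asker loads from disk:

from gensim.models import Word2Vec
from gensim.test.utils import common_texts

# a KeyedVectors instance, as in the question (there it is loaded directly from a file)
kv_model = Word2Vec(sentences=common_texts, vector_size=10, min_count=1).wv

claim_text = ["human", "computer", "notinvocab"]

# gensim 3.x style (no longer works): token in model.wv.vocab
# gensim 4.x style: test membership on the KeyedVectors itself, or on key_to_index
claim = [token for token in claim_text if token in kv_model]
reference = [token for token in claim_text if token in kv_model.key_to_index]
print(claim)      # ['human', 'computer']
print(reference)  # ['human', 'computer']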

                                                                                  Source https://stackoverflow.com/questions/71544767

                                                                                  QUESTION

                                                                                  Gensim phrases model vocabulary length does not correspond to amount of iteratively added documents
                                                                                  Asked 2022-Mar-14 at 19:50

                                                                                  I iteratively apply the...

                                                                                  bigram.add_vocab()
                                                                                  

                                                                                  method in order to update a...

                                                                                  bigram = gensim.models.phrases.Phrases(min_count=bigramMinFreq, threshold=10.0)
                                                                                  

                                                                                  Gensim phrases model. With each iteration up to ~10'000 documents are added. Therefore my intuition is that the Phrases model grows with each added document set. I check this intuition by checking the length of the bigram vocabulary with...

len(bigram.vocab)
                                                                                  

Furthermore, I also check the number of phrasegrams in the frozen Phrases model with...

                                                                                  bigram_freezed = bigram.freeze()
                                                                                  len(bigram_freezed.phrasegrams)
                                                                                  

                                                                                  A resulting output looks as follows:

                                                                                  Data of directory:  000  is loaded
                                                                                  Num of Docs: 97802
                                                                                  Updated Bigram Vocab is:  31819758
                                                                                  Amount of phrasegrams in freezed bigram model:  397554
                                                                                  -------------------------------------------------------
                                                                                  Data of directory:  001  
                                                                                  Num of Docs: 93368
                                                                                  Updated Bigram Vocab is:  17940420
                                                                                  Amount of phrasegrams in freezed bigram model:  429162
                                                                                  -------------------------------------------------------
                                                                                  Data of directory:  002  
                                                                                  Num of Docs: 87265
                                                                                  Updated Bigram Vocab is:  36120292
                                                                                  Amount of phrasegrams in freezed bigram model:  661023
                                                                                  -------------------------------------------------------
                                                                                  Data of directory:  003
                                                                                  Num of Docs: 55852
                                                                                  Updated Bigram Vocab is:  20330876
                                                                                  Amount of phrasegrams in freezed bigram model:  604504
                                                                                  -------------------------------------------------------
                                                                                  Data of directory:  004
                                                                                  Num of Docs: 49390
                                                                                  Updated Bigram Vocab is:  31101880
                                                                                  Amount of phrasegrams in freezed bigram model:  745827
                                                                                  -------------------------------------------------------
                                                                                  Data of directory:  005
                                                                                  Num of Docs: 56258
                                                                                  Updated Bigram Vocab is:  19236483
                                                                                  Amount of phrasegrams in freezed bigram model:  675705
                                                                                  -------------------------------------------------------
                                                                                  ...
                                                                                  

As can be seen, neither the bigram vocab count nor the phrasegram count of the frozen bigram model is continuously increasing. I expected both counts to increase as documents are added.

Do I not understand what phrase.vocab and phraser.phrasegrams are referring to? (If needed I can add the whole corresponding Jupyter Notebook cell.)

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-14 at 19:50

To avoid using an unbounded amount of RAM, the Gensim Phrases class uses a default parameter of max_vocab_size=40000000, per the source code & docs at:

                                                                                  https://radimrehurek.com/gensim/models/phrases.html#gensim.models.phrases.Phrases

Unfortunately, the mechanism behind this cap is very crude & non-intuitive. Whenever the tally of all known keys in the survey-dict (which includes both unigrams & bigrams) hits this threshold (default 40,000,000), a prune operation is performed that discards all token counts (unigrams & bigrams) at low frequencies until the total of unique keys is under the threshold. And, it sets the low-frequency floor for future prunes to be at least as high as was necessary for this prune.

For example, the 1st time this is hit, it might need to discard all the 1-count tokens. And due to the typical Zipfian distribution of word-frequencies, that step alone might not just get the total count of known tokens slightly under the threshold, but massively under it. And, any subsequent prune will start by eliminating at least everything with fewer than 2 occurrences.

                                                                                  This results in the sawtooth counts you're seeing. When the model can't fit in max_vocab_size, it overshrinks. It may do this many times in the course of processing a very-large corpus. As a result, final counts of lower-frequency words/bigrams can also be serious undercounts - depending somewhat arbitrarily on whether a key's counts survived the various prune-thresholds. (That's also influenced by where in the corpus a token appears. A token that only appears in the corpus after the last prune will still have a precise count, even if it only appears once! Although rare tokens that appeared any number of times could be severely undercounted, if they were always below the cutoff at each prior prune.)

                                                                                  The best solution would be to use a precise count that uses/correlates some spillover storage on-disk, to only prune (if at all) at the very end, ensuring only the truly-least-frequent keys are discarded. Unfortunately, Gensim's never implemented that option.

The next-best, for many cases, could be to use a memory-efficient approximate counting algorithm that roughly maintains the right magnitudes of counts for a much larger number of keys. There's been a little work in Gensim on this in the past, but it has not yet been integrated with the Phrases functionality.

                                                                                  That leaves you with the only practical workaround in the short term: change the max_vocab_size parameter to be larger.

                                                                                  You could try setting it to math.inf (might risk lower performance due to int-vs-float comparisons) or sys.maxsize – essentially turning off the pruning entirely, to see if your survey can complete without exhausting your RAM. But, you might run out of memory anyway.

You could also try a larger-but-not-essentially-infinite cap – whatever fits in your RAM – so that far less pruning is done. But you'll still see the non-intuitive decreases in total counts, sometimes, if in fact the threshold is ever enforced. Per the docs, a very rough (perhaps outdated) estimate is that the default max_vocab_size=40000000 consumes about 3.6GB at peak saturation. So if you've got a 64GB machine, you could possibly try a max_vocab_size that's 10-14x larger than the default, etc.
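A minimal sketch of that workaround (the toy sentences and min_count value are illustrative only, not from the question):

from sys import maxsize
from gensim.models.phrases import Phrases

# A sketch: raise max_vocab_size so pruning happens rarely or never.
# Watch RAM usage; an effectively unbounded cap can exhaust memory on large corpora.
sentences = [["new", "york", "mayor"], ["new", "york", "city"], ["machine", "learning"]]

bigram = Phrases(min_count=1, threshold=10.0, max_vocab_size=maxsize)
bigram.add_vocab(sentences)
print(len(bigram.vocab))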

                                                                                  Source https://stackoverflow.com/questions/71457117

                                                                                  QUESTION

                                                                                  Can I use a different corpus for fasttext build_vocab than train in Gensim Fasttext?
                                                                                  Asked 2022-Mar-07 at 22:50

I am curious to know if there are any implications of using a different source when calling the build_vocab and train methods of the Gensim FastText model. Will this impact the contextual representation of the word embedding?

My intention for doing this is that there is a specific set of words I am interested in getting vector representations for, and when calling model.wv.most_similar I only want words defined in this vocab list to be returned, rather than all possible words in the training corpus. I would use the result of this to decide whether I want to group those words as relevant to each other based on a similarity threshold.

                                                                                  Following is the code snippet that I am using, appreciate your thoughts if there are any concerns or implication with this approach.

                                                                                  • vocab.txt contains a list of unique words of interest
                                                                                  • corpus.txt contains full conversation text (i.e. chat messages) where each line represents a paragraph/sentence per chat

A follow-up question to this is: what values should I set for total_examples & total_words during training in this case?

                                                                                  from gensim.models.fasttext import FastText
                                                                                  
                                                                                  model = FastText(min_count=1, vector_size=300,)
                                                                                  
                                                                                  corpus_path = f'data/{client}-corpus.txt'
                                                                                  vocab_path = f'data/{client}-vocab.txt'
                                                                                  # Unsure if below counts should be based on the training corpus or vocab
                                                                                  corpus_count = get_lines_count(corpus_path)
                                                                                  total_words = get_words_count(corpus_path)
                                                                                  
                                                                                  # build the vocabulary
                                                                                  model.build_vocab(corpus_file=vocab_path)
                                                                                  
                                                                                  # train the model
model.train(corpus_file=corpus_path, epochs=100, 
                                                                                      total_examples=corpus_count, total_words=total_words,
                                                                                  )
                                                                                  
                                                                                  # save the model
                                                                                  model.save(f'models/gensim-fastext-model-{client}')
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-07 at 22:50

In case someone has a similar question, I'll paste the reply I got when asking this question in the Gensim Discussion Group, for reference:

                                                                                  You can try it, but I wouldn't expect it to work well for most purposes.

                                                                                  The build_vocab() call establishes the known vocabulary of the model, & caches some stats about the corpus.

                                                                                  If you then supply another corpus – & especially one with more words – then:

                                                                                  • You'll want your train() parameters to reflect the actual size of your training corpus. You'll want to provide a true total_examples and total_words count that are accurate for the training-corpus.
• Every word in the training corpus that's not in the known vocabulary is ignored completely, as if it wasn't even there. So you might as well filter your corpus down to just the words-of-interest first, then use that same filtered corpus for both steps. Will the example texts still make sense? Will that be enough data to train meaningful, generalizable word-vectors for just the words-of-interest, alongside other words-of-interest, without the full texts? (You could look at your pre-filtered corpus to get a sense of that.) I'm not sure - it could depend on how severely trimming to just the words-of-interest changed the corpus. In particular, to train high-dimensional dense vectors – as with vector_size=300 – you need a lot of varied data. Such pre-trimming might thin the corpus so much as to make the word-vectors for the words-of-interest far less useful.

                                                                                  You could certainly try it both ways – pre-filtered to just your words-of-interest, or with the full original corpus – and see which works better on downstream evaluations.

                                                                                  More generally, if the concern is training time with the full corpus, there are likely other ways to get an adequate model in an acceptable amount of time.

If using corpus_file mode, you can increase workers to equal the local CPU core count for a nearly-linear speedup from the number of cores. (In traditional corpus_iterable mode, max throughput is usually reached somewhere in the range of 6-12 worker threads, as long as you have that many cores.)

min_count=1 is usually a bad idea for these algorithms: they tend to train faster, in less memory, leaving better vectors for the remaining words when you discard the lowest-frequency words, as the default min_count=5 does. (It's possible FastText can eke a little bit of benefit out of lower-frequency words via their contribution to character-n-gram training, but I'd only ever lower the default min_count if I could confirm it was actually improving relevant results.)

If your corpus is so large that training time is a concern, often a more-aggressive (smaller) sample parameter value not only speeds training (by dropping many redundant high-frequency words), but often improves final word-vector quality for downstream purposes as well (by letting the rarer words have relatively more influence on the model in the absence of the downsampled words).

And again, if the corpus is so large that training time is a concern, then epochs=100 is likely overkill. I believe the GoogleNews vectors were trained using only 3 passes – over a gigantic corpus. A sufficiently large & varied corpus, with plenty of examples of all words throughout, could potentially train in 1 pass – because each word-vector can then get more total training-updates than with many epochs over a small corpus. (In general, larger epochs values are more often used when the corpus is thin, to eke out something – not on a corpus so large that you're considering non-standard shortcuts to speed up the steps.)

                                                                                  -- Gordon
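Putting that advice together, a minimal sketch of the recommended setup (the corpus path, workers and epochs values below are illustrative; the corpus_count / corpus_total_words attributes are the counts that build_vocab() caches on the model):

from gensim.models.fasttext import FastText

# A sketch: use the SAME corpus for build_vocab() and train(), and let the
# totals reflect that corpus. Path, workers and epochs are illustrative only.
corpus_path = 'data/corpus.txt'

model = FastText(vector_size=300, min_count=5, workers=4)

model.build_vocab(corpus_file=corpus_path)

model.train(
    corpus_file=corpus_path,
    epochs=5,
    total_examples=model.corpus_count,       # cached by build_vocab()
    total_words=model.corpus_total_words,    # cached by build_vocab()
)

model.save('models/gensim-fasttext-model')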

                                                                                  Source https://stackoverflow.com/questions/71289683

                                                                                  QUESTION

                                                                                  Unpickle instance from Jupyter Notebook in Flask App
                                                                                  Asked 2022-Feb-28 at 18:03

                                                                                  I have created a class for word2vec vectorisation which is working fine. But when I create a model pickle file and use that pickle file in a Flask App, I am getting an error like:

                                                                                  AttributeError: module '__main__' has no attribute 'GensimWord2VecVectorizer'

                                                                                  I am creating the model on Google Colab.

                                                                                  Code in Jupyter Notebook:

                                                                                  # Word2Vec Model
                                                                                  import numpy as np
                                                                                  from sklearn.base import BaseEstimator, TransformerMixin
                                                                                  from gensim.models import Word2Vec
                                                                                  
                                                                                  class GensimWord2VecVectorizer(BaseEstimator, TransformerMixin):
                                                                                  
                                                                                      def __init__(self, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None,
                                                                                                   sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5,
                                                                                                   ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                                                                                                   trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False,
                                                                                                   callbacks=(), max_final_vocab=None):
                                                                                          self.size = size
                                                                                          self.alpha = alpha
                                                                                          self.window = window
                                                                                          self.min_count = min_count
                                                                                          self.max_vocab_size = max_vocab_size
                                                                                          self.sample = sample
                                                                                          self.seed = seed
                                                                                          self.workers = workers
                                                                                          self.min_alpha = min_alpha
                                                                                          self.sg = sg
                                                                                          self.hs = hs
                                                                                          self.negative = negative
                                                                                          self.ns_exponent = ns_exponent
                                                                                          self.cbow_mean = cbow_mean
                                                                                          self.hashfxn = hashfxn
                                                                                          self.iter = iter
                                                                                          self.null_word = null_word
                                                                                          self.trim_rule = trim_rule
                                                                                          self.sorted_vocab = sorted_vocab
                                                                                          self.batch_words = batch_words
                                                                                          self.compute_loss = compute_loss
                                                                                          self.callbacks = callbacks
                                                                                          self.max_final_vocab = max_final_vocab
                                                                                  
                                                                                      def fit(self, X, y=None):
                                                                                          self.model_ = Word2Vec(
                                                                                              sentences=X, corpus_file=None,
                                                                                              size=self.size, alpha=self.alpha, window=self.window, min_count=self.min_count,
                                                                                              max_vocab_size=self.max_vocab_size, sample=self.sample, seed=self.seed,
                                                                                              workers=self.workers, min_alpha=self.min_alpha, sg=self.sg, hs=self.hs,
                                                                                              negative=self.negative, ns_exponent=self.ns_exponent, cbow_mean=self.cbow_mean,
                                                                                              hashfxn=self.hashfxn, iter=self.iter, null_word=self.null_word,
                                                                                              trim_rule=self.trim_rule, sorted_vocab=self.sorted_vocab, batch_words=self.batch_words,
                                                                                              compute_loss=self.compute_loss, callbacks=self.callbacks,
                                                                                              max_final_vocab=self.max_final_vocab)
                                                                                          return self
                                                                                  
                                                                                      def transform(self, X):
                                                                                          X_embeddings = np.array([self._get_embedding(words) for words in X])
                                                                                          return X_embeddings
                                                                                  
                                                                                      def _get_embedding(self, words):
                                                                                          valid_words = [word for word in words if word in self.model_.wv.vocab]
                                                                                          if valid_words:
                                                                                              embedding = np.zeros((len(valid_words), self.size), dtype=np.float32)
                                                                                              for idx, word in enumerate(valid_words):
                                                                                                  embedding[idx] = self.model_.wv[word]
                                                                                  
                                                                                              return np.mean(embedding, axis=0)
                                                                                          else:
                                                                                              return np.zeros(self.size)
                                                                                  
                                                                                  # column transformer
                                                                                  from sklearn.compose import ColumnTransformer
                                                                                  
                                                                                  ct = ColumnTransformer([
                                                                                      ('step1', GensimWord2VecVectorizer(), 'STATUS')
                                                                                  ], remainder='drop')
                                                                                  
                                                                                  # Create Model
                                                                                  from sklearn.svm import SVC
                                                                                  from sklearn.pipeline import Pipeline
                                                                                  from sklearn.model_selection import GridSearchCV
                                                                                  import pickle
                                                                                  import numpy as np
                                                                                  import dill
                                                                                  import torch
                                                                                  # ##########
                                                                                  # SVC - support vector classifier
                                                                                  # ##########
                                                                                  # defining parameter range
                                                                                  hyperparameters = {'C': [0.1, 1],
                                                                                                     'gamma': [1, 0.1],
                                                                                                     'kernel': ['rbf'],
                                                                                                     'probability': [True]}
                                                                                  model_sv = Pipeline([
                                                                                      ('column_transformers', ct),
                                                                                      ('model', GridSearchCV(SVC(), hyperparameters,
                                                                                                             refit=True, verbose=3)),
                                                                                  ])
                                                                                  model_sv_cEXT = model_sv.fit(X_train, y_train['cEXT'])
                                                                                  # Save the trained cEXT - SVM Model.
                                                                                  import joblib
                                                                                  joblib.dump(model_sv_cEXT, 'model_Word2Vec_sv_cEXT.pkl')
                                                                                  

                                                                                  Code in Flask App:

                                                                                  # Word2Vec
                                                                                  model_EXT_WV_SV = joblib.load('utility/model/MachineLearning/SVM/model_Word2Vec_sv_cEXT.pkl')
                                                                                  

                                                                                  I tried to copy the same class into my Flask file, but it is also not working.

                                                                                  import numpy as np
                                                                                  from sklearn.base import BaseEstimator, TransformerMixin
                                                                                  from gensim.models import Word2Vec
                                                                                  
                                                                                  class GensimWord2VecVectorizer(BaseEstimator, TransformerMixin):
                                                                                  
                                                                                      def __init__(self, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None,
                                                                                                   sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5,
                                                                                                   ns_exponent=0.75, cbow_mean=1, hashfxn=hash, iter=5, null_word=0,
                                                                                                   trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False,
                                                                                                   callbacks=(), max_final_vocab=None):
                                                                                          self.size = size
                                                                                          self.alpha = alpha
                                                                                          self.window = window
                                                                                          self.min_count = min_count
                                                                                          self.max_vocab_size = max_vocab_size
                                                                                          self.sample = sample
                                                                                          self.seed = seed
                                                                                          self.workers = workers
                                                                                          self.min_alpha = min_alpha
                                                                                          self.sg = sg
                                                                                          self.hs = hs
                                                                                          self.negative = negative
                                                                                          self.ns_exponent = ns_exponent
                                                                                          self.cbow_mean = cbow_mean
                                                                                          self.hashfxn = hashfxn
                                                                                          self.iter = iter
                                                                                          self.null_word = null_word
                                                                                          self.trim_rule = trim_rule
                                                                                          self.sorted_vocab = sorted_vocab
                                                                                          self.batch_words = batch_words
                                                                                          self.compute_loss = compute_loss
                                                                                          self.callbacks = callbacks
                                                                                          self.max_final_vocab = max_final_vocab
                                                                                  
                                                                                      def fit(self, X, y=None):
                                                                                          self.model_ = Word2Vec(
                                                                                              sentences=X, corpus_file=None,
                                                                                              size=self.size, alpha=self.alpha, window=self.window, min_count=self.min_count,
                                                                                              max_vocab_size=self.max_vocab_size, sample=self.sample, seed=self.seed,
                                                                                              workers=self.workers, min_alpha=self.min_alpha, sg=self.sg, hs=self.hs,
                                                                                              negative=self.negative, ns_exponent=self.ns_exponent, cbow_mean=self.cbow_mean,
                                                                                              hashfxn=self.hashfxn, iter=self.iter, null_word=self.null_word,
                                                                                              trim_rule=self.trim_rule, sorted_vocab=self.sorted_vocab, batch_words=self.batch_words,
                                                                                              compute_loss=self.compute_loss, callbacks=self.callbacks,
                                                                                              max_final_vocab=self.max_final_vocab)
                                                                                          return self
                                                                                  
                                                                                      def transform(self, X):
                                                                                          X_embeddings = np.array([self._get_embedding(words) for words in X])
                                                                                          return X_embeddings
                                                                                  
                                                                                      def _get_embedding(self, words):
                                                                                          valid_words = [word for word in words if word in self.model_.wv.vocab]
                                                                                          if valid_words:
                                                                                              embedding = np.zeros((len(valid_words), self.size), dtype=np.float32)
                                                                                              for idx, word in enumerate(valid_words):
                                                                                                  embedding[idx] = self.model_.wv[word]
                                                                                  
                                                                                              return np.mean(embedding, axis=0)
                                                                                          else:
                                                                                              return np.zeros(self.size)
                                                                                  
                                                                                  # Word2Vec
                                                                                  model_EXT_WV_SV = joblib.load('utility/model/MachineLearning/SVM/model_Word2Vec_sv_cEXT.pkl')
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-24 at 11:48

Import GensimWord2VecVectorizer in your Flask web app's Python file.
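One hedged sketch of how that import can be arranged (the module name vectorizer.py is illustrative; the key detail is that the pickle references the class under __main__ because it was defined at the notebook's top level):

import sys
import joblib

# Assumption: the exact GensimWord2VecVectorizer class definition now lives in
# a separate module, e.g. a hypothetical vectorizer.py importable by the Flask app.
from vectorizer import GensimWord2VecVectorizer

# The pickle was created with the class defined in the notebook's __main__,
# so expose the class there before unpickling.
sys.modules['__main__'].GensimWord2VecVectorizer = GensimWord2VecVectorizer

model_EXT_WV_SV = joblib.load('utility/model/MachineLearning/SVM/model_Word2Vec_sv_cEXT.pkl')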

                                                                                  Source https://stackoverflow.com/questions/71231611

                                                                                  QUESTION

                                                                                  Word2Vec returning vectors for individual character and not words
                                                                                  Asked 2022-Feb-12 at 13:11

                                                                                  For the following list:

                                                                                  words= ['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA','unimodal','7','regarding','random','59','intimating','COMPETITION','prospects','2K15','gather','Mega','SENSOR','NCTT','NETWORKING','orgainsed','acts']
                                                                                  

                                                                                  I try to:

                                                                                  from gensim.models import Word2Vec
                                                                                  vec_model= Word2Vec(words, min_count=1, size=30)
                                                                                  vec_model['gather']
                                                                                  

                                                                                  Which returns:

                                                                                  KeyError: "word 'gather' not in vocabulary"

                                                                                  But

                                                                                  vec_model['g']
                                                                                  

Does return a vector, so I believe I'm getting vectors for the individual characters found in the list instead of vectors for the words found in the list.

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-12 at 13:11

Word2Vec expects a list of lists as input, where the corpus (main list) is composed of individual documents. The individual documents are composed of individual words (tokens). Word2Vec iterates over all documents and all tokens. In your example you have passed a single flat list to Word2Vec, so Word2Vec interprets each word as an individual document and iterates over each character of that word, treating each character as a token. As a result you have built a vocabulary of characters, not words. To build a vocabulary of words you can pass a nested list to Word2Vec, as in the example below.

                                                                                  from gensim.models import Word2Vec
                                                                                  
                                                                                  words= [['S.B.MILS','Siblings','DEVASTHALY','KOTRESHA'],
                                                                                  ['unimodal','7','regarding','random','59','intimating'],
                                                                                  ['COMPETITION','prospects','2K15','gather','Mega'],
                                                                                  ['SENSOR','NCTT','NETWORKING','orgainsed','acts']]
                                                                                  
                                                                                  vec_model= Word2Vec(words, min_count=1, size=30)
                                                                                  vec_model['gather']
                                                                                  

                                                                                  Output:

                                                                                  array([ 0.01106581,  0.00968017, -0.00090574,  0.01115612, -0.00766465,
                                                                                         -0.01648632, -0.01455364,  0.01107104,  0.00769841,  0.01037362,
                                                                                          0.01551551, -0.01188449,  0.01262331,  0.01608987,  0.01484082,
                                                                                          0.00528397,  0.01613582,  0.00437328,  0.00372362,  0.00480989,
                                                                                         -0.00299072, -0.00261444,  0.00282137, -0.01168992, -0.01402746,
                                                                                         -0.01165612,  0.00088562,  0.01581018, -0.00671618, -0.00698833],
                                                                                        dtype=float32)
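If you are on gensim 4.x rather than 3.x, the same idea looks like this (a sketch: the parameter size was renamed to vector_size, and vectors are read via the .wv attribute):

from gensim.models import Word2Vec

words = [['S.B.MILS', 'Siblings', 'DEVASTHALY', 'KOTRESHA'],
         ['unimodal', '7', 'regarding', 'random', '59', 'intimating'],
         ['COMPETITION', 'prospects', '2K15', 'gather', 'Mega'],
         ['SENSOR', 'NCTT', 'NETWORKING', 'orgainsed', 'acts']]

# gensim 4.x: 'size' became 'vector_size'; direct indexing of the model was removed.
vec_model = Word2Vec(words, min_count=1, vector_size=30)
print(vec_model.wv['gather'])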
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71091209

                                                                                  QUESTION

                                                                                  No such file or directory: 'GoogleNews-vectors-negative300.bin'
                                                                                  Asked 2022-Feb-04 at 06:08

                                                                                  I have this code :

                                                                                  import gensim
                                                                                  filename = 'GoogleNews-vectors-negative300.bin'
                                                                                  model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)
                                                                                  

and this is my folder organization: [image of my folder tree, showing that the .bin file is in the same directory as the file calling it, the file being ai_functions]

                                                                                  But sadly I'm not sure why I'm having an error saying that it can't find it. Btw I checked, I am sure the file is not corrupted. Any thoughts?

                                                                                  Full traceback :

                                                                                    File "/Users/Ile-Maurice/Desktop/Flask/flaskapp/run.py", line 1, in 
                                                                                      from serv import app
                                                                                    File "/Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/__init__.py", line 13, in 
                                                                                      from serv import routes
                                                                                    File "/Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/routes.py", line 7, in 
                                                                                      from serv.ai_functions import checkplagiarism
                                                                                    File "/Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/ai_functions.py", line 31, in 
                                                                                      model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)
                                                                                    File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gensim/models/keyedvectors.py", line 1629, in load_word2vec_format
                                                                                      return _load_word2vec_format(
                                                                                    File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/gensim/models/keyedvectors.py", line 1955, in _load_word2vec_format
                                                                                      with utils.open(fname, 'rb') as fin:
                                                                                    File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/smart_open/smart_open_lib.py", line 188, in open
                                                                                      fobj = _shortcut_open(
                                                                                    File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/smart_open/smart_open_lib.py", line 361, in _shortcut_open
                                                                                      return _builtin_open(local_path, mode, buffering=buffering, **open_kwargs)
                                                                                  FileNotFoundError: [Errno 2] No such file or directory: 'GoogleNews-vectors-negative300.bin'
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-04 at 06:08

                                                                                  The 'current working directory' that the Python process will consider active, and thus will use as the expected location for your plain relative filename GoogleNews-vectors-negative300.bin, will depend on how you launched Flask.

                                                                                  You could print out the directory to be sure – see some ways at How do you properly determine the current script directory? – but I suspect it may just be the /Users/Ile-Maurice/Desktop/Flask/flaskapp/ directory.

                                                                                  If so, you could relatively-reference your file with the path relative to the above directory...

                                                                                  serv/GoogleNews-vectors-negative300.bin
                                                                                  

                                                                                  ...or you could use a full 'absolute' path...

                                                                                  /Users/Ile-Maurice/Desktop/Flask/flaskapp/serv/GoogleNews-vectors-negative300.bin
                                                                                  

...or you could move the file up to its parent directory, so that it is alongside your Flask run.py.
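A minimal sketch of the absolute-path option, building the path from the script's own location so it no longer depends on the working directory (assuming the .bin file sits next to ai_functions.py):

import os
import gensim

# Build a path relative to this file's directory rather than the process's
# current working directory.
here = os.path.dirname(os.path.abspath(__file__))
filename = os.path.join(here, 'GoogleNews-vectors-negative300.bin')
model = gensim.models.KeyedVectors.load_word2vec_format(filename, binary=True)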

                                                                                  Source https://stackoverflow.com/questions/70973660

                                                                                  QUESTION

                                                                                  How to store the Phrase trigrams gensim model after training
                                                                                  Asked 2022-Feb-04 at 01:01

I would like to know whether I can store the gensim Phrases model after training it on the sentences below:

from gensim.models.phrases import Phrases

documents = ["the mayor of new york was there",
             "human computer interaction and machine learning has now become a trending research area",
             "human computer interaction is interesting",
             "human computer interaction is a pretty interesting subject",
             "human computer interaction is a great and new subject",
             "machine learning can be useful sometimes",
             "new york mayor was present",
             "I love machine learning because it is a new subject area",
             "human computer interaction helps people to get user friendly applications"]
                                                                                  
                                                                                  sentences = [doc.split(" ") for doc in documents]
                                                                                  
                                                                                  bigram_transformer = Phrases(sentences)
                                                                                  bigram_sentences = bigram_transformer[sentences]
                                                                                  print("Bigrams - done")
                                                                                  # Here we use a phrase model that detects the collocation of 3 words (trigrams).
                                                                                  trigram_transformer = Phrases(bigram_sentences)
                                                                                  trigram_sentences = trigram_transformer[bigram_sentences]
                                                                                  print("Trigrams - done")
                                                                                  

How can I store trigram_transformer physically so that it can be reused later, perhaps using pickle?

                                                                                  Thank you in advance for your help.

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-03 at 18:40

Convert the list (or that particular format) into a NumPy array and save it as a .npy file; it is easy to save and easy to read, and using NumPy gives you the advantage of loading it on almost every platform, like Google Colab or Replit. Refer to numpy.save() for more details on saving a .npy file.

Using pickle is also a good option, but things can get a bit tricky when differences in encoding standards and similar problems arise.
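For what it's worth, gensim Phrases objects also inherit save() and load() methods from gensim's SaveLoad utilities, so a common way to persist the trained transformer is a sketch like the following (the filename is illustrative, and trigram_transformer is the object trained in the question's snippet above):

from gensim.models.phrases import Phrases

# Persist the trained trigram Phrases model to disk (illustrative filename).
trigram_transformer.save("trigram_transformer.pkl")

# Later, in another session:
reloaded_trigrams = Phrases.load("trigram_transformer.pkl")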

                                                                                  Source https://stackoverflow.com/questions/70976566

                                                                                  QUESTION

                                                                                  Plotly - Highlight data point and nearest three points on hover
                                                                                  Asked 2022-Feb-02 at 04:15

                                                                                  I have made a scatter plot of the word2vec model using plotly.
I want the functionality of highlighting a specific data point on hover, along with the top 3 nearest vectors to it. It would be of great help if anyone could guide me with this or suggest any other option.

(model and csv files linked in the original question)

                                                                                  Code:

                                                                                  import gensim
                                                                                  import numpy as np
                                                                                  import pandas as pd
                                                                                  from sklearn.manifold import TSNE
                                                                                  import plotly.express as px
                                                                                  
                                                                                  def get_2d_coordinates(model, words):
                                                                                      arr = np.empty((0,100), dtype='f')
                                                                                      labels = []
                                                                                      for wrd_score in words:
                                                                                          try:
                                                                                              wrd_vector = model.wv.get_vector(wrd_score)
                                                                                              arr = np.append(arr, np.array([wrd_vector]), axis=0)
                                                                                              labels.append(wrd_score)
                                                                                          except:
                                                                                              pass
                                                                                      tsne = TSNE(n_components=2, random_state=0)
                                                                                      np.set_printoptions(suppress=True)
                                                                                      Y = tsne.fit_transform(arr)
                                                                                      x_coords = Y[:, 0]
                                                                                      y_coords = Y[:, 1]
                                                                                      return x_coords, y_coords
                                                                                  
                                                                                  ic_model = gensim.models.Word2Vec.load("w2v_IceCream.model")
                                                                                  ic = pd.read_csv('ic_prods.csv')
                                                                                  
                                                                                  icx, icy = get_2d_coordinates(ic_model, ic['ITEM_DESC'])
                                                                                  ic_data = {'Category': ic['SUB_CATEGORY'],
                                                                                              'Words':ic['ITEM_DESC'],
                                                                                              'X':icx,
                                                                                              'Y':icy}
                                                                                  ic_df = pd.DataFrame(ic_data)
                                                                                  ic_df.head()
                                                                                  ic_fig = px.scatter(ic_df, x=icx, y=icy, color=ic_df['Category'], hover_name=ic_df['Words'], title='IceCream Data')
                                                                                  ic_fig.show()
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-02 at 04:15

In plotly-python, I don't think there's an easy way of retrieving the location of the cursor. You can attempt to use go.FigureWidget to highlight a trace as described in this answer, but I think you're going to be limited with plotly-python, and I'm not sure whether highlighting the closest n points will be possible.

However, I believe you can accomplish what you want in plotly-dash, since callbacks are supported - meaning you would be able to retrieve the location of your cursor, calculate the n closest data points to it, and highlight those points as needed.

Below is an example of such a solution. If you haven't seen it before, it looks complicated, but what is happening is that I take the point you clicked as an input. Plotly is plotly.js under the hood, so the click data comes in the form of a dictionary (and not some kind of plotly-python object). I then calculate the three closest data points to the clicked point by comparing the coordinates of every other point in the dataframe, add the information from those three closest points as traces colored teal (or any color of your choosing), send this modified figure back as the output, and the figure updates.

                                                                                  I am using click instead of hover because hover would cause the highlighted points to flicker too much as you drag your mouse through the points.

Also, the dash app doesn't work perfectly; I believe there is some issue when you double-click on points (you can see me click once in the gif below before getting it to start working), but this basic framework is hopefully close enough to what you want. Cheers!

                                                                                  import gensim
                                                                                  import numpy as np
                                                                                  import pandas as pd
                                                                                  from sklearn.manifold import TSNE
                                                                                  import plotly.express as px
                                                                                  import plotly.graph_objects as go
                                                                                  
                                                                                  import json
                                                                                  
                                                                                  import dash
                                                                                  from dash import dcc, html, Input, Output
                                                                                  
                                                                                  external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']
                                                                                  app = dash.Dash(__name__, external_stylesheets=external_stylesheets)
                                                                                  
                                                                                  
                                                                                  def get_2d_coordinates(model, words):
                                                                                      arr = np.empty((0,100), dtype='f')
                                                                                      labels = []
                                                                                      for wrd_score in words:
                                                                                          try:
                                                                                              wrd_vector = model.wv.get_vector(wrd_score)
                                                                                              arr = np.append(arr, np.array([wrd_vector]), axis=0)
                                                                                              labels.append(wrd_score)
                                                                                          except:
                                                                                              pass
                                                                                      tsne = TSNE(n_components=2, random_state=0)
                                                                                      np.set_printoptions(suppress=True)
                                                                                      Y = tsne.fit_transform(arr)
                                                                                      x_coords = Y[:, 0]
                                                                                      y_coords = Y[:, 1]
                                                                                      return x_coords, y_coords
                                                                                  
                                                                                  ic_model = gensim.models.Word2Vec.load("w2v_IceCream.model")
                                                                                  ic = pd.read_csv('ic_prods.csv')
                                                                                  
                                                                                  icx, icy = get_2d_coordinates(ic_model, ic['ITEM_DESC'])
                                                                                  ic_data = {'Category': ic['SUB_CATEGORY'],
                                                                                              'Words':ic['ITEM_DESC'],
                                                                                              'X':icx,
                                                                                              'Y':icy}
                                                                                  
                                                                                  ic_df = pd.DataFrame(ic_data)
                                                                                  ic_fig = px.scatter(ic_df, x=icx, y=icy, color=ic_df['Category'], hover_name=ic_df['Words'], title='IceCream Data')
                                                                                  
                                                                                  NUMBER_OF_TRACES = len(ic_df['Category'].unique())
                                                                                  ic_fig.update_layout(clickmode='event+select')
                                                                                  
                                                                                  app.layout = html.Div([
                                                                                      dcc.Graph(
                                                                                          id='ic_figure',
                                                                                          figure=ic_fig)
                                                                                      ])
                                                                                  
                                                                                  ## we take the 4 closest points because the 1st closest point will be the point itself
                                                                                  def get_n_closest_points(x0, y0, df=ic_df[['X','Y']].copy(), n=4):
                                                                                  
                                                                                      """we can save some computation time by looking for the smallest distance^2 instead of distance"""
                                                                                      """distance = sqrt[(x1-x0)^2 + (y1-y0)^2]"""
                                                                                      """distance^2 = [(x1-x0)^2 + (y1-y0)^2]"""
                                                                                      
                                                                                      df["dist"] = (df["X"]-x0)**2 + (df["Y"]-y0)**2
                                                                                  
                                                                                      ## we don't return the point itself which will always be closest to itself
                                                                                      return df.sort_values(by="dist")[1:n][["X","Y"]].values
                                                                                  
                                                                                  @app.callback(
                                                                                      Output('ic_figure', 'figure'),
                                                                                      [Input('ic_figure', 'clickData'),
                                                                                      Input('ic_figure', 'figure')]
                                                                                      )
                                                                                  def display_hover_data(clickData, figure):
                                                                                      print(clickData)
                                                                                      if clickData is None:
                                                                                          # print("nothing was clicked")
                                                                                          return figure
                                                                                      else:
                                                                                          hover_x, hover_y = clickData['points'][0]['x'], clickData['points'][0]['y']
                                                                                          closest_points = get_n_closest_points(hover_x, hover_y)
                                                                                  
                                                                                          ## this means that this function has ALREADY added another trace, so we reduce the number of traces down the original number
                                                                                          if len(figure['data']) > NUMBER_OF_TRACES:
                                                                                              # print(f'reducing the number of traces to {NUMBER_OF_TRACES}')
                                                                                              figure['data'] = figure['data'][:NUMBER_OF_TRACES]
                                                                                              # print(figure['data'])
                                                                                          
                                                                                          new_traces = [{
                                                                                              'marker': {'color': 'teal', 'symbol': 'circle'},
                                                                                              'mode': 'markers',
                                                                                              'orientation': 'v',
                                                                                              'showlegend': False,
                                                                                              'x': [x],
                                                                                              'xaxis': 'x',
                                                                                              'y': [y],
                                                                                              'yaxis': 'y',
                                                                                              'type': 'scatter',
                                                                                              'selectedpoints': [0]
                                                                                          } for x,y in closest_points]
                                                                                  
                                                                                          figure['data'].extend(new_traces)
                                                                                          # print("after\n")
                                                                                          # print(figure['data'])
                                                                                          return figure
                                                                                  
                                                                                  if __name__ == '__main__':
                                                                                      app.run_server(debug=True)
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70944316

                                                                                  QUESTION

gensim w2v - additional file
                                                                                  Asked 2022-Feb-01 at 14:52

I trained w2v on a rather big (> 200 million sentences) corpus and got, in addition to the file w2v_model.model, the files w2v_model.model.trainables.syn1neg.npy and w2v_model.model.wv.vectors.npy. The model file was successfully loaded and read all .npy files without any exceptions. The obtained model performed OK.

Now I have retrained the model on a much bigger corpus (> 1 billion sentences). The same three files were automatically saved, as expected.

                                                                                  When I try to load my new retrained model:

                                                                                  w2v_model = Word2Vec.load(path_filename)
                                                                                  

                                                                                  I get:

                                                                                  FileNotFoundError: [Errno 2] No such file or directory: '/Users/...../w2v_US.model.trainables.vectors_lockf.npy'
                                                                                  

But no .npy file with such a name was saved by gensim at the end of training (I save all output files in the same directory, as required).

What should I do to obtain such a file as part of the output .npy files (maybe there is some option in gensim's Word2Vec training)? Or are there other ways to overcome this issue?

                                                                                  ANSWER

                                                                                  Answered 2022-Jan-24 at 18:39

If a .save() is creating any files with the word trainables in their names, you're using an older version of Gensim. Any new training should definitely prefer using a current version. As of now (January 2022), that's gensim-4.1.2, released 2021-09.

If an attempt at .load() generated that particular error, then that file should have been created, alongside the others you mention, when the .save() was done. (In fact, the only way the main file you named with path_filename could know that other filename is if that other file was written successfully, allowing the main file to complete writing.)

Are you sure that file wasn't written and then somehow left behind, perhaps getting deleted or not being moved alongside the other files to some new filesystem path?

                                                                                  In general, I would suggest:

                                                                                  • using latest Gensim for any new training
• always enable Python logging at the INFO level, & watch the logging/console output of training/saving processes closely to see confirmation of expected activity/steps (a minimal setup is sketched after this list)
                                                                                  • keep all files from a .save() that begin with the same main filename (in your examples above, w2v_US.model) together - & keep in mind that for larger models it may be a larger roster of files than for a small test model
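As a reference for the logging point above, here is a minimal sketch of an INFO-level setup; the format string is just one common choice, not anything mandated by Gensim:

import logging

# After this call, gensim training and .save()/.load() operations log their progress,
# including which files are being written, to the console.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)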

                                                                                  You will probably have to re-train the model, but you might be able to re-generate a compatible lockf file via steps like the following:

                                                                                  • save aside all files of any potential use
                                                                                  • from the exact same configuration as your original .save() – including the same outdated Gensim version, exact same model parameters, & exact same training corpus – repeat all the model-building steps you did before up through the .build_vocab() step. (That is: no extra need to .train().) This will create an untrained dummy model that should exactly match the vocabulary 'shape' of your broken model.
• use .save() to save that dummy model again - watching the logs/output for errors. There should be, alongside the other files, a file with a name like dummy.model.trainables.vectors_lockf.npy. If so, you might be able to copy that away, rename it to be the file expected by the original model whose load failed, then leave it alongside that original model - and the .load() might then succeed, or fail in a different way. (A hedged code sketch of these steps follows below.)

                                                                                  (If there were other problems/corruption at the time of the original model creation, this might not work. In particular, I wonder if when you talk about retraining the model, you didn't start with a fresh Word2Vec instance, but somehow expanded the older one, which might've added other problems/complications. In that case, a full retraining, ideally in the latest Gensim, would be necessary, and also a better basis for going forward.)
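A minimal sketch of the dummy-model rebuild described above, assuming the same outdated Gensim 3.x install as the original training; the corpus and the parameters shown are placeholders standing in for whatever you originally used:

from gensim.models import Word2Vec

corpus = [["example", "sentence"], ["another", "one"]]  # placeholder; use your real iterable of tokenized sentences

# Same parameters as the original training run (Gensim 3.x uses `size`; 4.x renamed it `vector_size`).
dummy = Word2Vec(size=300, window=5, min_count=5, workers=4)
dummy.build_vocab(corpus)      # same corpus as before; no .train() call is needed
dummy.save('dummy.model')      # for a large vocabulary this should also write dummy.model.trainables.vectors_lockf.npy

# If that lockf file appears, copy it next to the broken model under the expected name
# (here w2v_US.model.trainables.vectors_lockf.npy) and retry Word2Vec.load(path_filename).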

                                                                                  Source https://stackoverflow.com/questions/70693372

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install gensim

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim. It is also recommended that you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as MKL, ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OSX, NumPy picks up its vecLib BLAS automatically, so you don't need to do anything special.
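As a quick, hedged check of which BLAS NumPy picked up (the exact output format varies between NumPy versions):

import numpy as np

np.show_config()   # lists the BLAS/LAPACK libraries (e.g. MKL, OpenBLAS) that NumPy was built against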

                                                                                  Support

• QuickStart: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html
• Tutorials: https://radimrehurek.com/gensim/auto_examples/
• Official Documentation and Walkthrough: http://radimrehurek.com/gensim/
• Official API Documentation: http://radimrehurek.com/gensim/apiref.html
                                                                                  Install
                                                                                • PyPI

                                                                                  pip install gensim

                                                                                • CLONE
                                                                                • HTTPS

                                                                                  https://github.com/RaRe-Technologies/gensim.git

                                                                                • CLI

                                                                                  gh repo clone RaRe-Technologies/gensim

                                                                                • sshUrl

                                                                                  git@github.com:RaRe-Technologies/gensim.git


Consider Popular Topic Modeling Libraries

• gensim by RaRe-Technologies
• Familia by baidu
• BERTopic by MaartenGr
• Top2Vec by ddangelov
• lda by lda-project

Try Top Libraries by RaRe-Technologies

• smart_open (Python)
• sqlitedict (Python)
• bounter (Python)
• gensim-data (Python)
• movie-plots-by-genre (Jupyter Notebook)

Compare Topic Modeling Libraries with Highest Support

• gensim by RaRe-Technologies
• BERTopic by MaartenGr
• Top2Vec by ddangelov
• lda by lda-project
• Familia by baidu
