NLP-progress | Natural Language Processing | Natural Language Processing library

by sebastianruder | Python | Version: v0.3 | License: MIT

kandi X-RAY | NLP-progress Summary

NLP-progress is a Python library typically used in Artificial Intelligence, Natural Language Processing, Deep Learning, and BERT applications. NLP-progress has no bugs and no vulnerabilities, it has a permissive license, and it has medium support. However, no build file is available for NLP-progress. You can download it from GitHub.
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

Support

• NLP-progress has a moderately active ecosystem.
• It has 21447 stars and 3558 forks. There are 1274 watchers for this library.
• It had no major release in the last 12 months.
• There are 34 open issues and 65 have been closed. On average, issues are closed in 29 days. There are 12 open pull requests and 0 closed pull requests.
• It has a neutral sentiment in the developer community.
• The latest version of NLP-progress is v0.3.

Quality

• NLP-progress has 0 bugs and 0 code smells.

Security

• NLP-progress has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
• NLP-progress code analysis shows 0 unresolved vulnerabilities.
• There are 0 security hotspots that need review.

License

• NLP-progress is licensed under the MIT License. This license is Permissive.
• Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

• NLP-progress releases are available to install and integrate.
• NLP-progress has no build file. You will need to create the build yourself to build the component from source.
• NLP-progress saves you 128 person-hours of effort in developing the same functionality from scratch.
• It has 323 lines of code, 13 functions and 3 files.
• It has low code complexity. Code complexity directly impacts the maintainability of the code.
Top functions reviewed by kandi - BETA
kandi has reviewed NLP-progress and identified the functions below as its top functions. This is intended to give you quick insight into the functionality NLP-progress implements and to help you decide whether it suits your requirements.
• Parse a Markdown directory
• Parse a Markdown file
• Extract the SOTA table from the SOTA table
• Parse multiple Sotaions
• Extract description and tables from md_lines
• Extract lines before tables
• Extract links from a dataset description
• Extract the links from a Markdown document
• Extract model name and author name
• Extract the title and link
• Return the line number for a section
• Sanitize a subdataset name
• Extract title and link
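As a rough illustration of the kind of Markdown table extraction these descriptions suggest, here is a minimal Python sketch. It is not the repository's code: `split_row`, `extract_first_table`, and the sample rows are hypothetical names and made-up data chosen only for this example, and it assumes simple pipe tables where every row starts with `|`.

```python
from typing import Dict, List


def split_row(line: str) -> List[str]:
    """Split one Markdown pipe-table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def extract_first_table(md_lines: List[str]) -> List[Dict[str, str]]:
    """Return the data rows of the first pipe table in md_lines as dicts.

    Hypothetical helper: keys come from the table's header row.
    """
    rows: List[Dict[str, str]] = []
    header: List[str] = []
    for line in md_lines:
        is_row = line.strip().startswith("|")
        if not is_row:
            if header:
                break  # the first table has ended
            continue  # still in the text before the table
        cells = split_row(line)
        if not header:
            header = cells
        elif all(set(cell) <= set("-: ") for cell in cells):
            continue  # skip the |---|---| separator row
        else:
            rows.append(dict(zip(header, cells)))
    return rows


# Made-up sample document, not real leaderboard data.
sample = [
    "# Example task",
    "",
    "Some description.",
    "",
    "| Model | Score |",
    "| ----- | :---: |",
    "| Model A | 91.2 |",
    "| Model B | 90.1 |",
    "",
    "Trailing text.",
]
table = extract_first_table(sample)
```

Each row comes back as a dictionary keyed by the header cells, which keeps the sketch independent of how many columns a given table has.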
                                                                                  Get all kandi verified functions for this library.

                                                                                  NLP-progress Key Features

                                                                                  Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

                                                                                  NLP-progress Examples and Code Snippets

                                                                                  No Code Snippets are available at this moment for NLP-progress.
                                                                                  Community Discussions

                                                                                  Trending Discussions on Natural Language Processing

• number of matches for keywords in specified categories
• Apple's Natural Language API returns unexpected results
• Tokenize text but keep compound hyphenated words together
• Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe
• ModuleNotFoundError: No module named 'milvus'
• Which model/technique to use for specific sentence extraction?
• Assigning True/False if a token is present in a data-frame
• How to calculate perplexity of a sentence using huggingface masked language models?
• Mapping values from a dictionary's list to a string in Python
• What are differences between AutoModelForSequenceClassification vs AutoModel

                                                                                  QUESTION

                                                                                  number of matches for keywords in specified categories
                                                                                  Asked 2022-Apr-14 at 13:32

                                                                                  For a large scale text analysis problem, I have a data frame containing words that fall into different categories, and a data frame containing a column with strings and (empty) counting columns for each category. I now want to take each individual string, check which of the defined words appear, and count them within the appropriate category.

                                                                                  As a simplified example, given the two data frames below, I want to count how many of each animal type appear in the text cell.

                                                                                  df_texts <- tibble(
                                                                                    text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
                                                                                    grasshopper"),
                                                                                    mammals=NA,
                                                                                    reptiles=NA,
                                                                                    birds=NA,
                                                                                    insects=NA
                                                                                  )
                                                                                  
                                                                                  df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
                                                                                             type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))
                                                                                  

                                                                                  So my desired result would be:

                                                                                  df_result <- tibble(
                                                                                    text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
                                                                                    grasshopper"),
                                                                                    mammals=c(2,1,0),
                                                                                    reptiles=c(0,1,0),
                                                                                    birds=c(0,0,1),
                                                                                    insects=c(0,0,1)
                                                                                  )
                                                                                  

                                                                                  Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?

                                                                                  Thanks in advance!

                                                                                  ANSWER

                                                                                  Answered 2022-Apr-14 at 13:32

                                                                                  Here's a way to do it in the tidyverse. First check whether the strings in df_texts$text contain each animal, then count the matches and sum them by text and type.

                                                                                  library(tidyverse)
                                                                                  
                                                                                  cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>% 
                                                                                    pivot_longer(-text, names_to = "animals") %>% 
                                                                                    left_join(df_animals) %>% 
                                                                                    group_by(text, type) %>% 
                                                                                    summarise(sum = sum(value)) %>% 
                                                                                    pivot_wider(id_cols = text, names_from = type, values_from = sum)
                                                                                  
                                                                                    text                                   bird insect mammal reptile
                                                                                                                            
                                                                                  1 "the ape and the fox"                     0      0      2       0
                                                                                  2 "the owl and the the \n  grasshopper"     1      0      0       0
                                                                                  3 "the tortoise and the hare"               0      0      1       1
                                                                                  

                                                                                  To account for the several occurrences per text:

                                                                                  cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = F))) %>% 
                                                                                    setNames(c("text", df_animals$animals)) %>% 
                                                                                    pivot_longer(-text, names_to = "animals") %>% 
                                                                                    left_join(df_animals) %>% 
                                                                                    group_by(text, type) %>% 
                                                                                    summarise(sum = sum(value)) %>% 
                                                                                    pivot_wider(id_cols = text, names_from = type, values_from = sum)
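For readers working in Python rather than R, the same keyword-to-category counting can be sketched with the standard library. This is an assumed alternative, not a translation of the answer above: it matches whole lowercase tokens instead of substrings (so "apes" would not count as "ape"), and `ANIMAL_TYPES` and `count_categories` are hypothetical names that mirror `df_animals` from the question.

```python
import re
from collections import Counter
from typing import Dict

# Keyword -> category mapping, mirroring df_animals from the question.
ANIMAL_TYPES: Dict[str, str] = {
    "ape": "mammal",
    "fox": "mammal",
    "tortoise": "reptile",
    "hare": "mammal",
    "owl": "bird",
    "grasshopper": "insect",
}


def count_categories(text: str, keyword_types: Dict[str, str]) -> Counter:
    """Count keyword occurrences in text, grouped by category.

    Whole-token matching: the text is lowercased and split into runs of
    letters, and each token is looked up in the keyword mapping.
    """
    counts: Counter = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        category = keyword_types.get(token)
        if category is not None:
            counts[category] += 1
    return counts


texts = [
    "the ape and the fox",
    "the tortoise and the hare",
    "the owl and the the \n  grasshopper",
]
results = [count_categories(t, ANIMAL_TYPES) for t in texts]
```

Because each token is resolved with a single dictionary lookup, the work grows with the length of the text rather than with the number of keywords, which may matter for the larger dataset the question mentions. Repeated keywords in one text are counted each time they occur.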
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71871613

                                                                                  QUESTION

                                                                                  Apple's Natural Language API returns unexpected results
                                                                                  Asked 2022-Apr-01 at 20:30

                                                                                  I'm trying to figure out why Apple's Natural Language API returns unexpected results.

                                                                                  What am I doing wrong? Is it a grammar issue?

                                                                                  I have the following four strings, and I want to extract each word's "stem form."

                                                                                      // text 1 has two "accredited" in a different order
                                                                                      let text1: String = "accredit accredited accrediting accredited accreditation accredits"
                                                                                      
                                                                                      // text 2 has three "accredited" in different order
                                                                                      let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
                                                                                      
                                                                                      // text 3 has "accreditation"
                                                                                      let text3: String = "accreditation"
                                                                                      
                                                                                      // text 4 has "accredited"
                                                                                      let text4: String = "accredited"
                                                                                  

                                                                                  The issue is with the words accreditation and accredited.

                                                                                  The word accreditation never returned the stem. And accredited returns different results based on the words' order in the string, as shown in Text 1 and Text 2 in the attached image.

                                                                                  I've used the code from Apple's documentation.

                                                                                  And here is the full code in SwiftUI:

                                                                                  import SwiftUI
                                                                                  import NaturalLanguage
                                                                                  
                                                                                  struct ContentView: View {
                                                                                      
                                                                                      // text 1 has two "accredited" in a different order
                                                                                      let text1: String = "accredit accredited accrediting accredited accreditation accredits"
                                                                                      
                                                                                      // text 2 has three "accredited" in a different order
                                                                                      let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
                                                                                      
                                                                                      // text 3 has "accreditation"
                                                                                      let text3: String = "accreditation"
                                                                                      
                                                                                      // text 4 has "accredited"
                                                                                      let text4: String = "accredited"
                                                                                      
                                                                                      var body: some View {
                                                                                          ScrollView {
                                                                                              VStack {
                                                                                                  
                                                                                                  Text("Text 1").bold()
                                                                                                  tagText(text: text1, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                                  Text("Text 2").bold()
                                                                                                  tagText(text: text2, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                                  Text("Text 3").bold()
                                                                                                  tagText(text: text3, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                                  Text("Text 4").bold()
                                                                                                  tagText(text: text4, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                              }
                                                                                          }
                                                                                      }
                                                                                      
                                                                                      // MARK: - tagText
                                                                                      func tagText(text: String, scheme: NLTagScheme) -> some View {
                                                                                          VStack {
                                                                                              ForEach(partsOfSpeechTagger(for: text, scheme: scheme)) { word in
                                                                                                  Text(word.description)
                                                                                              }
                                                                                          }
                                                                                      }
                                                                                      
                                                                                      // MARK: - partsOfSpeechTagger
                                                                                      func partsOfSpeechTagger(for text: String, scheme: NLTagScheme) -> [NLPTagResult] {
                                                                                          
                                                                                          var listOfTaggedWords: [NLPTagResult] = []
                                                                                          let tagger = NLTagger(tagSchemes: [scheme])
                                                                                          tagger.string = text
                                                                                          
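                                                                                          // NOTE: the span from this line to the `==` signature was lost in extraction.
                                                                                          // It is reconstructed from the surviving code and the usual NLTagger
                                                                                          // enumeration pattern; treat it as an assumption, not the asker's verbatim code.
                                                                                          let range = text.startIndex..<text.endIndex
                                                                                          let options: NLTagger.Options = [.omitPunctuation, .omitWhitespace]
                                                                                          
                                                                                          tagger.enumerateTags(in: range, unit: .word, scheme: scheme, options: options) { tag, tokenRange in
                                                                                              if let tag = tag {
                                                                                                  listOfTaggedWords.append(NLPTagResult(word: String(text[tokenRange]), tag: tag))
                                                                                              }
                                                                                              return true  // keep enumerating
                                                                                          }
                                                                                          return listOfTaggedWords
                                                                                      }
                                                                                      
                                                                                      // MARK: - NLPTagResult (reconstructed, see note above)
                                                                                      struct NLPTagResult: Identifiable, Equatable, Hashable, Comparable {
                                                                                          var id = UUID()
                                                                                          var word: String
                                                                                          var tag: NLTag?
                                                                                          var description: String { "\(word) -> \(tag?.rawValue ?? "nil")" }
                                                                                          
                                                                                          // MARK: - Equatable requirements
                                                                                          static func ==(lhs: NLPTagResult, rhs: NLPTagResult) -> Bool {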
                                                                                              lhs.id == rhs.id
                                                                                          }
                                                                                          
                                                                                          func hash(into hasher: inout Hasher) {
                                                                                              hasher.combine(id)
                                                                                          }
                                                                                          
                                                                                          // MARK: - Comparable requirements
                                                                                          static func <(lhs: NLPTagResult, rhs: NLPTagResult) -> Bool {
                                                                                              lhs.id.uuidString < rhs.id.uuidString
                                                                                          }
                                                                                      }
                                                                                      
                                                                                  }
                                                                                  
                                                                                  // MARK: - Previews
                                                                                  struct ContentView_Previews: PreviewProvider {
                                                                                      static var previews: some View {
                                                                                          ContentView()
                                                                                      }
                                                                                  }
                                                                                  

                                                                                  Thanks for your help!

                                                                                  ANSWER

                                                                                  Answered 2022-Apr-01 at 20:30

                                                                                  As for why the tagger doesn't find "accredit" from "accreditation", this is because the scheme .lemma finds the lemma of words, not actually the stems. See the difference between stem and lemma on Wikipedia.

                                                                                  The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as production and producing. In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed.

                                                                                  The documentation uses the word "stem", but I do think that the lemma is what is intended here, and getting "accreditation" is the expected behaviour. See the Usage section of the Wikipedia article for "Word stem" for more info. The lemma is the dictionary form of a word, and "accreditation" has a dictionary entry, whereas something like "accredited" doesn't. Whatever you call these things, the point is that there are two distinct concepts, and the tagger gets you one of them, but you are expecting the other one.

                                                                                  As for why the order of the words matters, this is because the tagger tries to analyse your words as "natural language", rather than each one individually. Naturally, word order matters. If you use .lexicalClass, you'll see that it thinks the third word in text2 is an adjective, which explains why it doesn't think its dictionary form is "accredit", because adjectives don't conjugate like that. Note that accredited is an adjective in the dictionary. So "is it a grammar issue?" Exactly.

                                                                                  Source https://stackoverflow.com/questions/71711847

                                                                                  QUESTION

                                                                                  Tokenize text but keep compound hyphenated words together
                                                                                  Asked 2022-Mar-29 at 09:16

                                                                                  I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).

                                                                                  def preprocess(text):
                                                                                    #remove punctuation
                                                                                    text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
                                                                                    text = re.sub('[^a-zA-Z]', ' ', text)
                                                                                    text = text.split()
                                                                                    text = " ".join(text)
                                                                                    return text
                                                                                  

                                                                                  For instance, the original text:

                                                                                  "Attended pre-tender meetings" 
                                                                                  

                                                                                  should be split into

                                                                                  ['attended', 'pre-tender', 'meeting'] 
                                                                                  

                                                                                  rather than

                                                                                  ['attended', 'pre', 'tender', 'meeting']
                                                                                  

                                                                                  Any help would be appreciated!

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-29 at 09:14

                                                                                  To remove all non-alpha characters but - between letters, you can use

                                                                                  [\W\d_](?<![^\W\d_]-(?=[^\W\d_]))

                                                                                  ASCII only equivalent:

                                                                                  [^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))

                                                                                  See the regex demo. Details:

                                                                                  • [\W\d_] - any non-letter
                                                                                  • (?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if there is a letter and a - immediately to the left, and right after -, there is any letter (checked with the (?=[^\W\d_]) positive lookahead).

                                                                                  See the Python demo:

                                                                                  import re
                                                                                  
                                                                                  def preprocess(text):
                                                                                    #remove all non-alpha characters but - between letters
                                                                                    text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', ' ', text)
                                                                                    return " ".join(text.split())
                                                                                  
                                                                                  print(preprocess("Attended pre-tender, etc, meetings."))
                                                                                  # => Attended pre-tender etc meetings
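If the goal is the token list shown in the question rather than a cleaned string, a complementary approach is to match the tokens directly instead of removing everything else. The following is a minimal sketch, not part of the original answer: `TOKEN_RE` and `tokenize_keep_hyphens` are hypothetical names, and the pattern is ASCII-only by assumption. It does not singularise words, so "meetings" stays "meetings".

```python
import re

# Hypothetical helper pattern: a run of ASCII letters, optionally followed by
# further letter runs that are each joined by a single hyphen.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def tokenize_keep_hyphens(text):
    """Return lowercase tokens, keeping hyphenated compounds in one piece."""
    return [token.lower() for token in TOKEN_RE.findall(text)]

print(tokenize_keep_hyphens("Attended pre-tender meetings"))
```

Digits, punctuation and free-standing hyphens never match the pattern, so they are dropped without a separate clean-up pass.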
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71659125

                                                                                  QUESTION

                                                                                  Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe
                                                                                  Asked 2022-Feb-16 at 20:47

                                                                                  Looping over a list of bigrams to search for, I need to create a boolean field for each bigram according to whether or not it is present in a tokenized pandas series. And I'd appreciate an upvote if you think this is a good question!

                                                                                  List of bigrams:

                                                                                  bigrams = ['data science', 'computer science', 'bachelors degree']
                                                                                  

                                                                                  Dataframe:

                                                                                  df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                                                                                                              ['computer', 'science', 'degree', 'masters'],
                                                                                                                              ['bachelors', 'degree', 'computer', 'vision'],
                                                                                                                              ['data', 'processing', 'science']]})
                                                                                  

                                                                                  Desired Output:

                                                                                                           job_description  data science computer science bachelors degree
                                                                                  0        [data, science, degree, expert]          True            False            False
                                                                                  1   [computer, science, degree, masters]         False             True            False
                                                                                  2  [bachelors, degree, computer, vision]         False            False             True
                                                                                  3            [data, processing, science]         False            False            False
                                                                                  

                                                                                  Criteria:

                                                                                  1. Only exact matches should be flagged (for example, flagging for 'data science' should return True for 'data science' but False for 'science data' or 'data bachelors science')
                                                                                  2. Each search term should get its own field and be concatenated to the original df

                                                                                  What I've tried:

                                                                                  Failed: df = [x for x in df['job_description'] if x in bigrams]

                                                                                  Failed: df[bigrams] = [[any(w==term for w in lst) for term in bigrams] for lst in df['job_description']]

                                                                                  Failed: Could not adapt the approach here -> Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python

                                                                                  Failed: Could not get this one to adapt, either -> Compare two bigrams lists and return the matching bigram

                                                                                  Failed: This method is very close, but couldn't adapt it to bigrams -> Create new boolean fields based on specific terms appearing in a tokenized pandas dataframe

                                                                                  Thanks for any help you can provide!

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-16 at 20:28

                                                                                  You could use a regex and extractall:

                                                                                  regex = '|'.join('(%s)' % b.replace(' ', r'\s+') for b in bigrams)
                                                                                  matches = (df['job_description'].apply(' '.join)
                                                                                             .str.extractall(regex).droplevel(1).notna()
                                                                                             .groupby(level=0).max()
                                                                                             )
                                                                                  matches.columns = bigrams
                                                                                  
                                                                                  out = df.join(matches).fillna(False)
                                                                                  

                                                                                  output:

                                                                                                           job_description  data science  computer science  bachelors degree
                                                                                  0        [data, science, degree, expert]          True             False             False
                                                                                  1   [computer, science, degree, masters]         False              True             False
                                                                                  2  [bachelors, degree, computer, vision]         False             False              True
                                                                                  3            [data, processing, science]         False             False             False
                                                                                  

                                                                                  generated regex:

                                                                                  '(data\\s+science)|(computer\\s+science)|(bachelors\\s+degree)'
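Because the column already holds token lists, the same flags can also be computed without joining the tokens back into a string. This is a rough sketch of that alternative, not the answerer's method: `contains_bigram` is a hypothetical helper, and it assumes every search term consists of exactly two space-separated words.

```python
def contains_bigram(tokens, bigram):
    """Return True if the two words of `bigram` occur as adjacent tokens, in order."""
    first, second = bigram.split()
    # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens.
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))

bigrams = ['data science', 'computer science', 'bachelors degree']
rows = [['data', 'science', 'degree', 'expert'],
        ['computer', 'science', 'degree', 'masters'],
        ['bachelors', 'degree', 'computer', 'vision'],
        ['data', 'processing', 'science']]

# One list of booleans per bigram, in row order.
flags = {b: [contains_bigram(row, b) for row in rows] for b in bigrams}
print(flags)
```

With the DataFrame from the question, each column could be filled inside a loop over `bigrams` with `df[b] = df['job_description'].apply(lambda toks: contains_bigram(toks, b))`.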
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71147799

                                                                                  QUESTION

                                                                                  ModuleNotFoundError: No module named 'milvus'
                                                                                  Asked 2022-Feb-15 at 19:23

                                                                                  Goal: to run this Auto Labelling Notebook on AWS SageMaker Jupyter Labs.

                                                                                  Kernels tried: conda_pytorch_p36, conda_python3, conda_amazonei_mxnet_p27.

                                                                                  ! pip install farm-haystack -q
                                                                                  # Install the latest master of Haystack
                                                                                  !pip install grpcio-tools==1.34.1 -q
                                                                                  !pip install git+https://github.com/deepset-ai/haystack.git -q
                                                                                  !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
                                                                                  !tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin
                                                                                  !pip install git+https://github.com/deepset-ai/haystack.git -q
                                                                                  
                                                                                  # Here are the imports we need
                                                                                  from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
                                                                                  from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
                                                                                  from haystack.schema import Document
                                                                                  from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers
                                                                                  

                                                                                  Traceback:

                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Loading faiss with AVX2 support.
                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Could not load library with AVX2 support due to:
                                                                                  ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'",)
                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Loading faiss.
                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Successfully loaded faiss.
                                                                                  02/02/2022 10:36:33 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
                                                                                  ---------------------------------------------------------------------------
                                                                                  ModuleNotFoundError                       Traceback (most recent call last)
                                                                                   in 
                                                                                        1 # Here are the imports we need
                                                                                  ----> 2 from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
                                                                                        3 from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
                                                                                        4 from haystack.schema import Document
                                                                                        5 from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/__init__.py in 
                                                                                        3 import pandas as pd
                                                                                        4 from haystack.schema import Document, Label, MultiLabel, BaseComponent
                                                                                  ----> 5 from haystack.finder import Finder
                                                                                        6 from haystack.pipeline import Pipeline
                                                                                        7 
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/finder.py in 
                                                                                        6 from collections import defaultdict
                                                                                        7 
                                                                                  ----> 8 from haystack.reader.base import BaseReader
                                                                                        9 from haystack.retriever.base import BaseRetriever
                                                                                       10 from haystack import MultiLabel
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/reader/__init__.py in 
                                                                                  ----> 1 from haystack.reader.farm import FARMReader
                                                                                        2 from haystack.reader.transformers import TransformersReader
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/reader/farm.py in 
                                                                                       22 
                                                                                       23 from haystack import Document
                                                                                  ---> 24 from haystack.document_store.base import BaseDocumentStore
                                                                                       25 from haystack.reader.base import BaseReader
                                                                                       26 
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/document_store/__init__.py in 
                                                                                        2 from haystack.document_store.faiss import FAISSDocumentStore
                                                                                        3 from haystack.document_store.memory import InMemoryDocumentStore
                                                                                  ----> 4 from haystack.document_store.milvus import MilvusDocumentStore
                                                                                        5 from haystack.document_store.sql import SQLDocumentStore
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/document_store/milvus.py in 
                                                                                        5 import numpy as np
                                                                                        6 
                                                                                  ----> 7 from milvus import IndexType, MetricType, Milvus, Status
                                                                                        8 from scipy.special import expit
                                                                                        9 from tqdm import tqdm
                                                                                  
                                                                                  ModuleNotFoundError: No module named 'milvus'
                                                                                  
                                                                                  pip install milvus
                                                                                  
                                                                                  import milvus
                                                                                  

                                                                                  Traceback:

                                                                                  ---------------------------------------------------------------------------
                                                                                  ModuleNotFoundError                       Traceback (most recent call last)
                                                                                   in 
                                                                                  ----> 1 import milvus
                                                                                  
                                                                                  ModuleNotFoundError: No module named 'milvus'
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-03 at 09:29

                                                                                  I would recommend downgrading your milvus version to one from before the 2.0 release, which came out just a week ago. Here is a discussion on that topic: https://github.com/deepset-ai/haystack/issues/2081
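Before and after changing versions, it can help to confirm what the notebook kernel actually sees, since a package installed into a different environment produces the same ModuleNotFoundError. The snippet below is only a diagnostic sketch using the standard library. `module_available` is a hypothetical helper, and checking the name `pymilvus` alongside `milvus` is an assumption about how the client package may expose itself, not something stated in the thread.

```python
import importlib.util
import sys

def module_available(name):
    """Return True if a top-level module called `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None

# The interpreter behind the running kernel; pip must install into this environment.
print(sys.executable)

# 'milvus' is the name from the traceback; 'pymilvus' is an assumed alternative name.
for name in ("milvus", "pymilvus"):
    print(name, module_available(name))
```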

                                                                                  Source https://stackoverflow.com/questions/70954157

                                                                                  QUESTION

                                                                                  Which model/technique to use for specific sentence extraction?
                                                                                  Asked 2022-Feb-08 at 18:35

                                                                                  I have a dataset of tens of thousands of dialogues / conversations between a customer and customer support. These dialogues, which could be forum posts or long-winded email conversations, have been hand-annotated to highlight the sentence containing the customer's problem. For example:

                                                                                  Dear agent, I am writing to you because I have a very annoying problem with my washing machine. I bought it three weeks ago and was very happy with it. However, this morning the door does not lock properly. Please help

                                                                                  Dear customer.... etc

                                                                                  The highlighted sentence would be:

                                                                                  However, this morning the door does not lock properly.

                                                                                  1. What approaches can I take to model this, so that in future I can automatically extract the customer's problem? The domain of the datasets is broad, but within the hardware space, so it could be appliances, gadgets, machinery etc.
                                                                                  2. What is this type of problem called? I thought this might be called "intent recognition", but most guides seem to refer to multiclass classification. The sentence either is or isn't the customer's problem. I considered analysing each sentence and performing binary classification, but I'd like to explore options that take into account the context of the rest of the conversation if possible.
                                                                                  3. What resources are available to research how to implement this in Python (using tensorflow or pytorch)?

                                                                                  I found a model on HuggingFace which has been pre-trained with customer dialogues, and have read the research paper, so I was considering fine-tuning this as a starting point, but I only have experience with text (multiclass/multilabel) classification when it comes to transformers.

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-07 at 10:21

                                                                                  This type of problem, where you want to extract the customer's problem from the original text, is called Extractive Summarization, and this type of task is solved by Sequence2Sequence models.

                                                                                  This type of model is called Sequence2Sequence because both its input and its output are text.

                                                                                  I recommend using a transformers model called Pegasus, which has been pre-trained to predict masked text, but whose main application is to be fine-tuned for text summarization (extractive or abstractive).

                                                                                  The Pegasus model is available in the Transformers library, which provides a simple but powerful way of fine-tuning transformers on custom datasets. I think this notebook will be extremely useful as guidance for understanding how to fine-tune the Pegasus model.
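For comparison, the per-sentence binary classification that the question mentions can also be given some conversational context by pairing each sentence with its neighbours. The sketch below only illustrates how such training examples might be assembled from the hand-annotated data. `split_sentences`, `build_examples` and the `window` parameter are hypothetical, the regex splitter is deliberately naive, and no model is trained here.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def build_examples(dialogue, highlighted, window=1):
    """Build one (sentence, context, label) example per sentence.

    `context` is the sentence plus `window` neighbours on each side;
    `label` is 1 only for the sentence that was highlighted by the annotator.
    """
    sentences = split_sentences(dialogue)
    examples = []
    for i, sentence in enumerate(sentences):
        context = " ".join(sentences[max(0, i - window): i + window + 1])
        examples.append({"sentence": sentence,
                         "context": context,
                         "label": int(sentence == highlighted)})
    return examples

dialogue = ("Dear agent, I am writing to you because I have a very annoying problem "
            "with my washing machine. I bought it three weeks ago and was very happy "
            "with it. However, this morning the door does not lock properly. Please help")
examples = build_examples(dialogue, "However, this morning the door does not lock properly.")
print([e["label"] for e in examples])
```

Each example could then be fed to a sentence-pair classifier as (sentence, context), which is one way to let the model see the surrounding conversation while still predicting a yes/no label.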

                                                                                  Source https://stackoverflow.com/questions/70990722

                                                                                  QUESTION

                                                                                  Assigning True/False if a token is present in a data-frame
                                                                                  Asked 2022-Jan-06 at 12:38

                                                                                  My current data-frame is:

                                                                                       |articleID | keywords                                               | 
                                                                                       |:-------- |:------------------------------------------------------:| 
                                                                                  0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      |     
                                                                                  1    |58b6393b  | ['Crossword Puzzles']                                  |          
                                                                                  2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']|            
                                                                                  3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements']         |
                                                                                  

                                                                                  I want a data-frame similar to the following, where a column is added based on whether the token 'Trump, Donald J' is mentioned in the keywords; if so, it is assigned True:

                                                                                       |articleID | keywords                                               | trumpMention |
                                                                                       |:-------- |:------------------------------------------------------:| ------------:|
                                                                                  0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      | False        |      
                                                                                  1    |58b6393b  | ['Crossword Puzzles']                                  | False        |          
                                                                                  2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']| True         |           
                                                                                  3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements']         | True         |
                                                                                  

                                                                                  I have tried multiple ways using df functions, but cannot achieve the results I want. Some of the ways I've tried are:

                                                                                  df['trumpMention'] = np.where(any(df['keywords']) == 'Trump, Donald J', True, False) 
                                                                                  

                                                                                  or

                                                                                  df['trumpMention'] = df['keywords'].apply(lambda x: any(token == 'Trump, Donald J') for token in x) 
                                                                                  

                                                                                  or

                                                                                  lst = ['Trump, Donald J']  
                                                                                  df['trumpMention'] = df['keywords'].apply(lambda x: ([ True for token in x if any(token in lst)]))   
                                                                                  

                                                                                  Raw input:

                                                                                  df = pd.DataFrame({'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
                                                                                                     'keywords': [['Second Avenue (Manhattan, NY)'],
                                                                                                                  ['Crossword Puzzles'],
                                                                                                                  ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                                                                                                  ['Trump, Donald J', 'Speeches and Statements']],
                                                                                                     'trumpMention': [False, False, True, True]})
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Jan-06 at 12:13

                                                                                  Use the in operator inside apply to test list membership for each row:

                                                                                  df["trumpMention"] = df["keywords"].apply(lambda x: "Trump, Donald J" in x)
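
To make the one-liner above more concrete, here is a minimal sketch of the per-row check in plain Python. The helper name has_keyword is hypothetical, and the handling of string-encoded lists is an assumption about how the data might have been loaded (for example after a CSV round trip), not something stated in the question.

```python
import ast

def has_keyword(keywords, target="Trump, Donald J"):
    """Return True if target is one of the keywords for a single row."""
    # Assumption: a cell may hold the list as text, e.g. "['a', 'b']",
    # which can happen after saving to and reloading from CSV.
    if isinstance(keywords, str):
        keywords = ast.literal_eval(keywords)
    # Membership test on the list: an exact match against whole keywords,
    # so the comma inside 'Trump, Donald J' is not a problem.
    return target in keywords

rows = [
    ["Second Avenue (Manhattan, NY)"],
    ["Crossword Puzzles"],
    ["Workplace Hazards and Violations", "Trump, Donald J"],
    "['Trump, Donald J', 'Speeches and Statements']",  # string-encoded list
]
flags = [has_keyword(r) for r in rows]
print(flags)  # [False, False, True, True]
```

With pandas, the same helper could be passed to apply, for example df["trumpMention"] = df["keywords"].apply(has_keyword).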
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70606847

                                                                                  QUESTION

                                                                                  How to calculate perplexity of a sentence using huggingface masked language models?
                                                                                  Asked 2021-Dec-25 at 21:51

                                                                                  I have several masked language models (mainly BERT, RoBERTa, ALBERT and ELECTRA) and a dataset of sentences. How can I get the perplexity of each sentence?

                                                                                  The Hugging Face documentation states that perplexity "is not well defined for masked language models like BERT", though I still see people calculate it somehow.

                                                                                  For example in this SO question they calculated it using the function

                                                                                  def score(model, tokenizer, sentence,  mask_token_id=103):
                                                                                    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
                                                                                    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
                                                                                    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
                                                                                    masked_input = repeat_input.masked_fill(mask == 1, 103)
                                                                                    labels = repeat_input.masked_fill( masked_input != 103, -100)
                                                                                    loss,_ = model(masked_input, masked_lm_labels=labels)
                                                                                    result = np.exp(loss.item())
                                                                                    return result
                                                                                  
                                                                                  score(model, tokenizer, '我爱你') # returns 45.63794545581973
                                                                                  

                                                                                  However, when I try to use the code I get TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'.

                                                                                  I tried it with a couple of my models:

                                                                                  from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
                                                                                  import torch
                                                                                  
                                                                                  # 1)
                                                                                  tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")
                                                                                  model = BertForMaskedLM.from_pretrained("bioformers/bioformer-cased-v1.0")
                                                                                  # 2)
                                                                                  tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
                                                                                  model = ElectraForMaskedLM.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
                                                                                  

                                                                                  This SO question also used the masked_lm_labels as an input and it seemed to work somehow.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-25 at 21:51

                                                                                  The paper "Masked Language Model Scoring" explores pseudo-perplexity from masked language models and shows that pseudo-perplexity, while not theoretically well justified, still performs well for comparing the "naturalness" of texts.

                                                                                  As for the code, your snippet is correct except for one detail: in recent versions of Hugging Face Transformers, the masked_lm_labels argument has been renamed to labels to make the interfaces of the various models more consistent. I have also replaced the hard-coded 103 with the generic tokenizer.mask_token_id. So the snippet below should work:

                                                                                  from transformers import AutoModelForMaskedLM, AutoTokenizer
                                                                                  import torch
                                                                                  import numpy as np
                                                                                  
                                                                                  model_name = 'cointegrated/rubert-tiny'
                                                                                  model = AutoModelForMaskedLM.from_pretrained(model_name)
                                                                                  tokenizer = AutoTokenizer.from_pretrained(model_name)
                                                                                  
                                                                                  def score(model, tokenizer, sentence):
                                                                                      tensor_input = tokenizer.encode(sentence, return_tensors='pt')
                                                                                      repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
                                                                                      mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
                                                                                      masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
                                                                                      labels = repeat_input.masked_fill( masked_input != tokenizer.mask_token_id, -100)
                                                                                      with torch.inference_mode():
                                                                                          loss = model(masked_input, labels=labels).loss
                                                                                      return np.exp(loss.item())
                                                                                  
                                                                                  print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer)) 
                                                                                  # 4.541251105675365
                                                                                  print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer)) 
                                                                                  # 6.162017238332462
                                                                                  

                                                                                  You can try this code in Google Colab by running this gist.
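
The value returned by score can be read as follows: the loss is the mean negative log-probability of the true token over the masked positions, and its exponential is the pseudo-perplexity. The sketch below shows only this arithmetic on made-up log-probabilities; the function name pseudo_perplexity and the input values are illustrative assumptions, not output from any model.

```python
import math

def pseudo_perplexity(token_log_probs):
    """Exponential of the mean negative log-probability over masked positions."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical values: every token predicted with probability 0.5 ...
uniform = [math.log(0.5)] * 4
# ... versus a sentence with one token the model finds very unlikely.
surprising = [math.log(0.5)] * 3 + [math.log(0.01)]

print(pseudo_perplexity(uniform))     # close to 2.0
print(pseudo_perplexity(surprising))  # larger, i.e. less "natural"
```

A single improbable token raises the score for the whole sentence, which is why the second London example above scores higher than the first.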

                                                                                  Source https://stackoverflow.com/questions/70464428

                                                                                  QUESTION

                                                                                  Mapping values from a dictionary's list to a string in Python
                                                                                  Asked 2021-Dec-21 at 16:45

                                                                                  I am working on some sentence formation like this:

                                                                                  sentence = "PERSON is ADJECTIVE"
                                                                                  dictionary = {"PERSON": ["Alice", "Bob", "Carol"], "ADJECTIVE": ["cute", "intelligent"]}
                                                                                  

                                                                                  I would now need all possible combinations to form this sentence from the dictionary, like:

                                                                                  Alice is cute
                                                                                  Alice is intelligent
                                                                                  Bob is cute
                                                                                  Bob is intelligent
                                                                                  Carol is cute
                                                                                  Carol is intelligent
                                                                                  

                                                                                  The above use case was relatively simple, and it was done with the following code

                                                                                  dictionary = {"PERSON": ["Alice", "Bob", "Carol"], "ADJECTIVE": ["cute", "intelligent"]}
                                                                                  
                                                                                  for i in dictionary["PERSON"]:
                                                                                      for j in dictionary["ADJECTIVE"]:
                                                                                          print(f"{i} is {j}")
                                                                                  

                                                                                  But can we also make this scale up for longer sentences?

                                                                                  Example:

                                                                                  sentence = "PERSON is ADJECTIVE and is from COUNTRY" 
                                                                                  dictionary = {"PERSON": ["Alice", "Bob", "Carol"], "ADJECTIVE": ["cute", "intelligent"], "COUNTRY": ["USA", "Japan", "China", "India"]}
                                                                                  

                                                                                  This should again provide all possible combinations like:

                                                                                  Alice is cute and is from USA
                                                                                  Alice is intelligent and is from USA
                                                                                  .
                                                                                  .
                                                                                  .
                                                                                  .
                                                                                  Carol is intelligent and is from India
                                                                                  

                                                                                  I tried using permutations (https://www.pythonpool.com/python-permutations/), but the words in the resulting sentences are all mixed up. How can we keep some words fixed, such as "and is from" in this example?

                                                                                  Essentially if any key in the dictionary is equal to the word in the string, then the word should be replaced by the list of values from the dictionary.

                                                                                  Any thoughts would be really helpful.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-12 at 17:53

                                                                                  You can first replace each dictionary key in the sentence with {} so that the sentence can be used as a format string in a loop. Then use itertools.product to build the Cartesian product of dictionary.values() and loop over it to create the desired sentences.

                                                                                  from itertools import product
                                                                                  sentence = ' '.join([('{}' if w in dictionary else w) for w in sentence.split()])
                                                                                  mapped_sentences_generator = (sentence.format(*tple) for tple in product(*dictionary.values()))
                                                                                  for s in mapped_sentences_generator:
                                                                                      print(s)
                                                                                  

                                                                                  Output:

                                                                                  Alice is cute and is from USA
                                                                                  Alice is cute and is from Japan
                                                                                  Alice is cute and is from China
                                                                                  Alice is cute and is from India
                                                                                  Alice is intelligent and is from USA
                                                                                  Alice is intelligent and is from Japan
                                                                                  Alice is intelligent and is from China
                                                                                  Alice is intelligent and is from India
                                                                                  Bob is cute and is from USA
                                                                                  Bob is cute and is from Japan
                                                                                  Bob is cute and is from China
                                                                                  Bob is cute and is from India
                                                                                  Bob is intelligent and is from USA
                                                                                  Bob is intelligent and is from Japan
                                                                                  Bob is intelligent and is from China
                                                                                  Bob is intelligent and is from India
                                                                                  Carol is cute and is from USA
                                                                                  Carol is cute and is from Japan
                                                                                  Carol is cute and is from China
                                                                                  Carol is cute and is from India
                                                                                  Carol is intelligent and is from USA
                                                                                  Carol is intelligent and is from Japan
                                                                                  Carol is intelligent and is from China
                                                                                  Carol is intelligent and is from India
                                                                                  

                                                                                  Note that this relies on dictionaries preserving insertion order, which is guaranteed from Python 3.7 onward (and is an implementation detail of CPython 3.6). On older versions, use collections.OrderedDict instead of dict.
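
The positional formatting above also depends on the keys appearing in the sentence in the same order as in the dictionary. A minimal sketch that avoids this assumption is to look up each word by name instead; the function name expand is hypothetical, and the sketch assumes placeholders are whole whitespace-separated words (a key with punctuation attached, such as "COUNTRY.", would not be matched).

```python
from itertools import product

def expand(sentence, dictionary):
    """Yield every sentence obtained by substituting dictionary values for key words."""
    words = sentence.split()
    # Keys that actually occur in the sentence, in order of first appearance.
    keys = list(dict.fromkeys(w for w in words if w in dictionary))
    # One combination per element of the Cartesian product of the value lists.
    for combo in product(*(dictionary[k] for k in keys)):
        mapping = dict(zip(keys, combo))
        # Replace key words, leave fixed words such as "and is from" untouched.
        yield " ".join(mapping.get(w, w) for w in words)

dictionary = {
    "PERSON": ["Alice", "Bob", "Carol"],
    "ADJECTIVE": ["cute", "intelligent"],
    "COUNTRY": ["USA", "Japan", "China", "India"],
}
# The key order in the sentence differs from the key order in the dictionary.
results = list(expand("PERSON is from COUNTRY and is ADJECTIVE", dictionary))
print(len(results))  # 24
print(results[0])    # Alice is from USA and is cute
```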

                                                                                  Source https://stackoverflow.com/questions/70325758

                                                                                  QUESTION

                                                                                  What are differences between AutoModelForSequenceClassification vs AutoModel
                                                                                  Asked 2021-Dec-05 at 09:07

                                                                                  We can create a model with the AutoModel (TFAutoModel) class:

                                                                                  from transformers import AutoModel 
                                                                                  model = AutoModel.from_pretrained('distilbert-base-uncased')
                                                                                  

                                                                                  On the other hand, a model can be created with AutoModelForSequenceClassification (TFAutoModelForSequenceClassification):

                                                                                  from transformers import AutoModelForSequenceClassification
                                                                                  model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')
                                                                                  

                                                                                  As far as I know, both load the distilbert-base-uncased checkpoint. Judging by its name, the second class (AutoModelForSequenceClassification) is meant for sequence classification.

                                                                                  But what are the actual differences between the two classes, and how do I use them correctly?

                                                                                  (I searched the Hugging Face documentation, but it is not clear.)

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-05 at 09:07

                                                                                  The difference between AutoModel and AutoModelForSequenceClassification is that the latter adds a classification head on top of the base model's outputs, and this head can be trained together with the base model. AutoModel returns only the hidden states.
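
One way to see the difference is to compare the outputs of the two classes. The sketch below builds tiny, randomly initialised DistilBERT models from a config so that nothing is downloaded; the config sizes, the number of labels and the token ids are arbitrary assumptions chosen for illustration, and it assumes torch and transformers are installed.

```python
import torch
from transformers import (
    AutoModel,
    AutoModelForSequenceClassification,
    DistilBertConfig,
)

# Deliberately tiny, untrained configuration (illustrative values only).
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64, num_labels=3
)

base = AutoModel.from_config(config)                                # encoder only
classifier = AutoModelForSequenceClassification.from_config(config) # encoder + head

input_ids = torch.tensor([[1, 2, 3, 4, 5]])  # one sequence of five arbitrary token ids

with torch.no_grad():
    hidden = base(input_ids).last_hidden_state  # one vector per token
    logits = classifier(input_ids).logits       # one score per label

print(hidden.shape)  # torch.Size([1, 5, 32])
print(logits.shape)  # torch.Size([1, 3])
```

The base model yields a hidden vector for each token, while the classification model reduces the sequence to one score per label, which is what a loss over class labels needs during fine-tuning.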

                                                                                  Source https://stackoverflow.com/questions/69907682

                                                                                  Community Discussions, Code Snippets contain sources that include Stack Exchange Network

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install NLP-progress

                                                                                  You can download it from GitHub.
                                                                                  You can use NLP-progress like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.

                                                                                  Support

                                                                                  Results: Results reported in published papers are preferred; an exception may be made for influential preprints.

                                                                                  Datasets: Datasets should have been used for evaluation in at least one published paper besides the one that introduced the dataset.

                                                                                  Code: We recommend adding a link to an implementation if available. You can add a Code column (see below) to the table if it does not exist. In the Code column, indicate an official implementation with Official. If an unofficial implementation is available, use Link (see below). If no implementation is available, you can leave the cell empty.

                                                                                  If you would like to add a new result, you can just click on the small edit button in the top-right corner of the file for the respective task (see below). This allows you to edit the file in Markdown. Simply add a row to the corresponding table in the same format. Make sure that the table stays sorted (with the best result on top). After you've made your change, make sure that the table still looks ok by clicking on the "Preview changes" tab at the top of the page. If everything looks good, go to the bottom of the page, where you see the below form. Add a name for your proposed change, an optional description, indicate that you would like to "Create a new branch for this commit and start a pull request", and click on "Propose file change".
                                                                                  Find more information at:
                                                                                  Find, review, and download reusable Libraries, Code Snippets, Cloud APIs from over 650 million Knowledge Items
                                                                                  Find more libraries
                                                                                  Explore Kits - Develop, implement, customize Projects, Custom Functions and Applications with kandi kits
                                                                                  Save this library and start creating your kit

                                                                                  Consider Popular Natural Language Processing Libraries

                                                                                  transformers by huggingface
                                                                                  funNLP by fighting41love
                                                                                  bert by google-research
                                                                                  jieba by fxsjy
                                                                                  Python by geekcomputers

                                                                                  Try Top Libraries by sebastianruder

                                                                                  learn-to-select-data by sebastianruder (Python)
                                                                                  sluice-networks by sebastianruder (Python)
                                                                                  sebastianruder by sebastianruder (HTML)
                                                                                  tensorflow-experiments by sebastianruder (C++)
                                                                                  BART by sebastianruder (Java)

                                                                                  Compare Natural Language Processing Libraries with Highest Support

                                                                                  transformers by huggingface
                                                                                  bert by google-research
                                                                                  allennlp by allenai
                                                                                  flair by flairNLP
                                                                                  spaCy by explosion
