corenlp-examples | Stanford Core NLP API usage examples | Natural Language Processing library

by drewfarris | Java | Version: Current | License: No License

kandi X-RAY | corenlp-examples Summary

corenlp-examples is a Java library typically used in Artificial Intelligence and Natural Language Processing applications. It has no reported bugs or vulnerabilities, ships with a build file, and has low support. You can download it from GitHub.
Stanford Core NLP API usage examples.

Support

• corenlp-examples has a low active ecosystem.
• It has 27 star(s) with 27 fork(s). There are 3 watchers for this library.
• It had no major release in the last 6 months.
• corenlp-examples has no issues reported. There are no pull requests.
• It has a neutral sentiment in the developer community.
• The latest version of corenlp-examples is current.

Quality

• corenlp-examples has 0 bugs and 0 code smells.

Security

• corenlp-examples has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
• corenlp-examples code analysis shows 0 unresolved vulnerabilities.
• There are 0 security hotspots that need review.

License

• corenlp-examples does not have a standard license declared.
• Check the repository for any license declaration and review the terms closely.
• Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

• corenlp-examples releases are not available. You will need to build from source code and install.
• A build file is available, so you can build the component from source.
• corenlp-examples saves you 65 person hours of effort in developing the same functionality from scratch.
• It has 176 lines of code, 2 functions and 3 files.
• It has low code complexity. Code complexity directly impacts maintainability of the code.
                                                                                  Top functions reviewed by kandi - BETA
kandi has reviewed corenlp-examples and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality corenlp-examples implements, and to help you decide whether it suits your requirements.
                                                                                  • Main method for testing
                                                                                  • Entry point for the example

                                                                                  corenlp-examples Key Features

                                                                                  Stanford Core NLP API usage examples

                                                                                  corenlp-examples Examples and Code Snippets

                                                                                  No Code Snippets are available at this moment for corenlp-examples.
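Since no snippets are listed, here is a minimal sketch of what typical Stanford CoreNLP API usage looks like, in the spirit of what this repository demonstrates. It uses the standard CoreNLP pipeline classes (edu.stanford.nlp.pipeline); the class name below is hypothetical and the actual examples in this repository may use an older API surface.

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.CoreSentence;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    import java.util.Properties;

    public class PipelineSketch {
        public static void main(String[] args) {
            // Build a small annotation pipeline: tokenize, split sentences,
            // and tag parts of speech.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            CoreDocument doc = new CoreDocument(
                "Stanford CoreNLP provides a set of natural language analysis tools. "
                + "It can tag tokens with parts of speech.");
            pipeline.annotate(doc);

            // Walk the sentences and print each token with its POS tag.
            for (CoreSentence sentence : doc.sentences()) {
                System.out.println(sentence.text());
                for (CoreLabel token : sentence.tokens()) {
                    System.out.println("  " + token.word() + " / " + token.tag());
                }
            }
        }
    }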
                                                                                  Community Discussions

                                                                                  Trending Discussions on Natural Language Processing

• number of matches for keywords in specified categories
• Apple's Natural Language API returns unexpected results
• Tokenize text but keep compound hyphenated words together
• Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe
• ModuleNotFoundError: No module named 'milvus'
• Which model/technique to use for specific sentence extraction?
• Assigning True/False if a token is present in a data-frame
• How to calculate perplexity of a sentence using huggingface masked language models?
• Mapping values from a dictionary's list to a string in Python
• What are differences between AutoModelForSequenceClassification vs AutoModel

                                                                                  QUESTION

                                                                                  number of matches for keywords in specified categories
                                                                                  Asked 2022-Apr-14 at 13:32

                                                                                  For a large scale text analysis problem, I have a data frame containing words that fall into different categories, and a data frame containing a column with strings and (empty) counting columns for each category. I now want to take each individual string, check which of the defined words appear, and count them within the appropriate category.

As a simplified example, given the two data frames below, I want to count how many of each animal type appear in the text cell.

                                                                                  df_texts <- tibble(
                                                                                    text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
                                                                                    grasshopper"),
                                                                                    mammals=NA,
                                                                                    reptiles=NA,
                                                                                    birds=NA,
                                                                                    insects=NA
                                                                                  )
                                                                                  
                                                                                  df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
                                                                                             type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))
                                                                                  

                                                                                  So my desired result would be:

                                                                                  df_result <- tibble(
                                                                                    text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
                                                                                    grasshopper"),
                                                                                    mammals=c(2,1,0),
                                                                                    reptiles=c(0,1,0),
                                                                                    birds=c(0,0,1),
                                                                                    insects=c(0,0,1)
                                                                                  )
                                                                                  

                                                                                  Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?

                                                                                  Thanks in advance!

                                                                                  ANSWER

                                                                                  Answered 2022-Apr-14 at 13:32

Here's a way to do it in the tidyverse. First look at whether strings in df_texts$text contain animals, then count them and sum by text and type.

                                                                                  library(tidyverse)
                                                                                  
                                                                                  cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>% 
                                                                                    pivot_longer(-text, names_to = "animals") %>% 
                                                                                    left_join(df_animals) %>% 
                                                                                    group_by(text, type) %>% 
                                                                                    summarise(sum = sum(value)) %>% 
                                                                                    pivot_wider(id_cols = text, names_from = type, values_from = sum)
                                                                                  
                                                                                    text                                   bird insect mammal reptile
                                                                                                                            
                                                                                  1 "the ape and the fox"                     0      0      2       0
                                                                                  2 "the owl and the the \n  grasshopper"     1      0      0       0
                                                                                  3 "the tortoise and the hare"               0      0      1       1
                                                                                  

                                                                                  To account for the several occurrences per text:

                                                                                  cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = F))) %>% 
                                                                                    setNames(c("text", df_animals$animals)) %>% 
                                                                                    pivot_longer(-text, names_to = "animals") %>% 
                                                                                    left_join(df_animals) %>% 
                                                                                    group_by(text, type) %>% 
                                                                                    summarise(sum = sum(value)) %>% 
                                                                                    pivot_wider(id_cols = text, names_from = type, values_from = sum)
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71871613

                                                                                  QUESTION

                                                                                  Apple's Natural Language API returns unexpected results
                                                                                  Asked 2022-Apr-01 at 20:30

                                                                                  I'm trying to figure out why Apple's Natural Language API returns unexpected results.

                                                                                  What am I doing wrong? Is it a grammar issue?

                                                                                  I have the following four strings, and I want to extract each word's "stem form."

                                                                                      // text 1 has two "accredited" in a different order
                                                                                      let text1: String = "accredit accredited accrediting accredited accreditation accredits"
                                                                                      
                                                                                      // text 2 has three "accredited" in different order
                                                                                      let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
                                                                                      
                                                                                      // text 3 has "accreditation"
                                                                                      let text3: String = "accreditation"
                                                                                      
                                                                                      // text 4 has "accredited"
                                                                                      let text4: String = "accredited"
                                                                                  

                                                                                  The issue is with the words accreditation and accredited.

                                                                                  The word accreditation never returned the stem. And accredited returns different results based on the words' order in the string, as shown in Text 1 and Text 2 in the attached image.

I've used the code from Apple's documentation.

                                                                                  And here is the full code in SwiftUI:

                                                                                  import SwiftUI
                                                                                  import NaturalLanguage
                                                                                  
                                                                                  struct ContentView: View {
                                                                                      
                                                                                      // text 1 has two "accredited" in a different order
                                                                                      let text1: String = "accredit accredited accrediting accredited accreditation accredits"
                                                                                      
                                                                                      // text 2 has three "accredited" in a different order
                                                                                      let text2: String = "accredit accredits accredited accrediting accredited accredited accreditation"
                                                                                      
                                                                                      // text 3 has "accreditation"
                                                                                      let text3: String = "accreditation"
                                                                                      
                                                                                      // text 4 has "accredited"
                                                                                      let text4: String = "accredited"
                                                                                      
                                                                                      var body: some View {
                                                                                          ScrollView {
                                                                                              VStack {
                                                                                                  
                                                                                                  Text("Text 1").bold()
                                                                                                  tagText(text: text1, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                                  Text("Text 2").bold()
                                                                                                  tagText(text: text2, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                                  Text("Text 3").bold()
                                                                                                  tagText(text: text3, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                                  Text("Text 4").bold()
                                                                                                  tagText(text: text4, scheme: .lemma).padding(.bottom)
                                                                                                  
                                                                                              }
                                                                                          }
                                                                                      }
                                                                                      
                                                                                      // MARK: - tagText
                                                                                      func tagText(text: String, scheme: NLTagScheme) -> some View {
                                                                                          VStack {
                                                                                              ForEach(partsOfSpeechTagger(for: text, scheme: scheme)) { word in
                                                                                                  Text(word.description)
                                                                                              }
                                                                                          }
                                                                                      }
                                                                                      
                                                                                      // MARK: - partsOfSpeechTagger
                                                                                      func partsOfSpeechTagger(for text: String, scheme: NLTagScheme) -> [NLPTagResult] {
                                                                                          
                                                                                          var listOfTaggedWords: [NLPTagResult] = []
                                                                                          let tagger = NLTagger(tagSchemes: [scheme])
                                                                                          tagger.string = text
                                                                                          
        let range = text.startIndex..<text.endIndex
        // … (the tag-enumeration loop and the start of the NLPTagResult struct
        //     were omitted in this excerpt; only the protocol conformances remain)

        // MARK: - Equatable requirements
        static func ==(lhs: NLPTagResult, rhs: NLPTagResult) -> Bool {
                                                                                              lhs.id == rhs.id
                                                                                          }
                                                                                          
                                                                                          func hash(into hasher: inout Hasher) {
                                                                                              hasher.combine(id)
                                                                                          }
                                                                                          
                                                                                          // MARK: - Comparable requirements
                                                                                          static func <(lhs: NLPTagResult, rhs: NLPTagResult) -> Bool {
                                                                                              lhs.id.uuidString < rhs.id.uuidString
                                                                                          }
                                                                                      }
                                                                                      
                                                                                  }
                                                                                  
                                                                                  // MARK: - Previews
                                                                                  struct ContentView_Previews: PreviewProvider {
                                                                                      static var previews: some View {
                                                                                          ContentView()
                                                                                      }
                                                                                  }
                                                                                  

                                                                                  Thanks for your help!

                                                                                  ANSWER

                                                                                  Answered 2022-Apr-01 at 20:30

                                                                                  As for why the tagger doesn't find "accredit" from "accreditation", this is because the scheme .lemma finds the lemma of words, not actually the stems. See the difference between stem and lemma on Wikipedia.

The stem is the part of the word that never changes even when morphologically inflected; a lemma is the base form of the word. For example, from "produced", the lemma is "produce", but the stem is "produc-". This is because there are words such as "production" and "producing". In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed.

                                                                                  The documentation uses the word "stem", but I do think that the lemma is what is intended here, and getting "accreditation" is the expected behaviour. See the Usage section of the Wikipedia article for "Word stem" for more info. The lemma is the dictionary form of a word, and "accreditation" has a dictionary entry, whereas something like "accredited" doesn't. Whatever you call these things, the point is that there are two distinct concepts, and the tagger gets you one of them, but you are expecting the other one.

                                                                                  As for why the order of the words matters, this is because the tagger tries to analyse your words as "natural language", rather than each one individually. Naturally, word order matters. If you use .lexicalClass, you'll see that it thinks the third word in text2 is an adjective, which explains why it doesn't think its dictionary form is "accredit", because adjectives don't conjugate like that. Note that accredited is an adjective in the dictionary. So "is it a grammar issue?" Exactly.

                                                                                  Source https://stackoverflow.com/questions/71711847
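As a side note tying this back to the library this page covers: Stanford CoreNLP's lemma annotator behaves the same way as described above — it produces lemmas, not stems, and the lemma it picks depends on the part-of-speech tag assigned in context. Below is a small, hedged Java sketch for comparing the two example strings with CoreNLP; the class name is mine, and the exact tags/lemmas you get may vary by model version (adjectival uses of "accredited" may keep "accredited" as their lemma, while verbal uses typically lemmatize to "accredit").

    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.CoreDocument;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;

    import java.util.Properties;

    public class LemmaOrderCheck {
        public static void main(String[] args) {
            // Pipeline with POS tagging and lemmatization.
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

            String[] texts = {
                "accredit accredited accrediting accredited accreditation accredits",
                "accredit accredits accredited accrediting accredited accredited accreditation"
            };
            for (String text : texts) {
                CoreDocument doc = new CoreDocument(text);
                pipeline.annotate(doc);
                // Print each token with the POS tag and lemma CoreNLP assigns in context.
                for (CoreLabel tok : doc.tokens()) {
                    System.out.println(tok.word() + "\t" + tok.tag() + "\t" + tok.lemma());
                }
                System.out.println();
            }
        }
    }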

                                                                                  QUESTION

Tokenize text but keep compound hyphenated words together
                                                                                  Asked 2022-Mar-29 at 09:16

                                                                                  I am trying to clean up text using a pre-processing function. I want to remove all non-alpha characters such as punctuation and digits, but I would like to retain compound words that use a dash without splitting them (e.g. pre-tender, pre-construction).

                                                                                  def preprocess(text):
                                                                                    #remove punctuation
                                                                                    text = re.sub('\b[A-Za-z]+(?:-+[A-Za-z]+)+\b', '-', text)
                                                                                    text = re.sub('[^a-zA-Z]', ' ', text)
                                                                                    text = text.split()
                                                                                    text = " ".join(text)
                                                                                    return text
                                                                                  

                                                                                  For instance, the original text:

                                                                                  "Attended pre-tender meetings" 
                                                                                  

                                                                                  should be split into

                                                                                  ['attended', 'pre-tender', 'meeting'] 
                                                                                  

                                                                                  rather than

                                                                                  ['attended', 'pre', 'tender', 'meeting']
                                                                                  

                                                                                  Any help would be appreciated!

                                                                                  ANSWER

                                                                                  Answered 2022-Mar-29 at 09:14

                                                                                  To remove all non-alpha characters but - between letters, you can use

[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))

                                                                                  ASCII only equivalent:

[^A-Za-z](?<![A-Za-z]-(?=[A-Za-z]))

                                                                                  See the regex demo. Details:

• [\W\d_] - any non-letter
• (?<![^\W\d_]-(?=[^\W\d_])) - a negative lookbehind that fails the match if there is a letter and a - immediately to the left, and right after the - there is any letter (checked with the (?=[^\W\d_]) positive lookahead).

                                                                                  See the Python demo:

import re

def preprocess(text):
  #remove all non-alpha characters but - between letters
  text = re.sub(r'[\W\d_](?<![^\W\d_]-(?=[^\W\d_]))', ' ', text)
  text = " ".join(text.split())
  return text

print(preprocess("Attended pre-tender, etc. meetings"))
# => Attended pre-tender etc meetings
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71659125

                                                                                  QUESTION

                                                                                  Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe
                                                                                  Asked 2022-Feb-16 at 20:47

                                                                                  Looping over a list of bigrams to search for, I need to create a boolean field for each bigram according to whether or not it is present in a tokenized pandas series. And I'd appreciate an upvote if you think this is a good question!

                                                                                  List of bigrams:

                                                                                  bigrams = ['data science', 'computer science', 'bachelors degree']
                                                                                  

                                                                                  Dataframe:

                                                                                  df = pd.DataFrame(data={'job_description': [['data', 'science', 'degree', 'expert'],
                                                                                                                              ['computer', 'science', 'degree', 'masters'],
                                                                                                                              ['bachelors', 'degree', 'computer', 'vision'],
                                                                                                                              ['data', 'processing', 'science']]})
                                                                                  

                                                                                  Desired Output:

                                                                                                           job_description  data science computer science bachelors degree
                                                                                  0        [data, science, degree, expert]          True            False            False
                                                                                  1   [computer, science, degree, masters]         False             True            False
                                                                                  2  [bachelors, degree, computer, vision]         False            False             True
                                                                                  3             [data, bachelors, science]         False            False            False
                                                                                  

                                                                                  Criteria:

                                                                                  1. Only exact matches should be replaced (for example, flagging for 'data science' should return True for 'data science' but False for 'science data' or 'data bachelors science')
2. Each search term should get its own field and be concatenated to the original df

                                                                                  What I've tried:

                                                                                  Failed: df = [x for x in df['job_description'] if x in bigrams]

                                                                                  Failed: df[bigrams] = [[any(w==term for w in lst) for term in bigrams] for lst in df['job_description']]

                                                                                  Failed: Could not adapt the approach here -> Match trigrams, bigrams, and unigrams to a text; if unigram or bigram a substring of already matched trigram, pass; python

                                                                                  Failed: Could not get this one to adapt, either -> Compare two bigrams lists and return the matching bigram

                                                                                  Failed: This method is very close, but couldn't adapt it to bigrams -> Create new boolean fields based on specific terms appearing in a tokenized pandas dataframe

                                                                                  Thanks for any help you can provide!

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-16 at 20:28

                                                                                  You could use a regex and extractall:

                                                                                  regex = '|'.join('(%s)' % b.replace(' ', r'\s+') for b in bigrams)
                                                                                  matches = (df['job_description'].apply(' '.join)
                                                                                             .str.extractall(regex).droplevel(1).notna()
                                                                                             .groupby(level=0).max()
                                                                                             )
                                                                                  matches.columns = bigrams
                                                                                  
                                                                                  out = df.join(matches).fillna(False)
                                                                                  

                                                                                  output:

                                                                                                           job_description  data science  computer science  bachelors degree
                                                                                  0        [data, science, degree, expert]          True             False             False
                                                                                  1   [computer, science, degree, masters]         False              True             False
                                                                                  2  [bachelors, degree, computer, vision]         False             False              True
                                                                                  3            [data, processing, science]         False             False             False
                                                                                  

                                                                                  generated regex:

                                                                                  '(data\\s+science)|(computer\\s+science)|(bachelors\\s+degree)'
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/71147799

                                                                                  QUESTION

                                                                                  ModuleNotFoundError: No module named 'milvus'
                                                                                  Asked 2022-Feb-15 at 19:23

                                                                                  Goal: to run this Auto Labelling Notebook on AWS SageMaker Jupyter Labs.

                                                                                  Kernels tried: conda_pytorch_p36, conda_python3, conda_amazonei_mxnet_p27.

                                                                                  ! pip install farm-haystack -q
                                                                                  # Install the latest master of Haystack
                                                                                  !pip install grpcio-tools==1.34.1 -q
                                                                                  !pip install git+https://github.com/deepset-ai/haystack.git -q
                                                                                  !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
                                                                                  !tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin
                                                                                  !pip install git+https://github.com/deepset-ai/haystack.git -q
                                                                                  
                                                                                  # Here are the imports we need
                                                                                  from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
                                                                                  from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
                                                                                  from haystack.schema import Document
                                                                                  from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers
                                                                                  

                                                                                  Traceback:

                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Loading faiss with AVX2 support.
                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Could not load library with AVX2 support due to:
                                                                                  ModuleNotFoundError("No module named 'faiss.swigfaiss_avx2'",)
                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Loading faiss.
                                                                                  02/02/2022 10:36:29 - INFO - faiss.loader -   Successfully loaded faiss.
                                                                                  02/02/2022 10:36:33 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
                                                                                  ---------------------------------------------------------------------------
                                                                                  ModuleNotFoundError                       Traceback (most recent call last)
                                                                                   in 
                                                                                        1 # Here are the imports we need
                                                                                  ----> 2 from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
                                                                                        3 from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
                                                                                        4 from haystack.schema import Document
                                                                                        5 from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/__init__.py in 
                                                                                        3 import pandas as pd
                                                                                        4 from haystack.schema import Document, Label, MultiLabel, BaseComponent
                                                                                  ----> 5 from haystack.finder import Finder
                                                                                        6 from haystack.pipeline import Pipeline
                                                                                        7 
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/finder.py in 
                                                                                        6 from collections import defaultdict
                                                                                        7 
                                                                                  ----> 8 from haystack.reader.base import BaseReader
                                                                                        9 from haystack.retriever.base import BaseRetriever
                                                                                       10 from haystack import MultiLabel
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/reader/__init__.py in 
                                                                                  ----> 1 from haystack.reader.farm import FARMReader
                                                                                        2 from haystack.reader.transformers import TransformersReader
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/reader/farm.py in 
                                                                                       22 
                                                                                       23 from haystack import Document
                                                                                  ---> 24 from haystack.document_store.base import BaseDocumentStore
                                                                                       25 from haystack.reader.base import BaseReader
                                                                                       26 
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/document_store/__init__.py in 
                                                                                        2 from haystack.document_store.faiss import FAISSDocumentStore
                                                                                        3 from haystack.document_store.memory import InMemoryDocumentStore
                                                                                  ----> 4 from haystack.document_store.milvus import MilvusDocumentStore
                                                                                        5 from haystack.document_store.sql import SQLDocumentStore
                                                                                  
                                                                                  ~/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/haystack/document_store/milvus.py in 
                                                                                        5 import numpy as np
                                                                                        6 
                                                                                  ----> 7 from milvus import IndexType, MetricType, Milvus, Status
                                                                                        8 from scipy.special import expit
                                                                                        9 from tqdm import tqdm
                                                                                  
                                                                                  ModuleNotFoundError: No module named 'milvus'
                                                                                  
                                                                                  pip install milvus
                                                                                  
                                                                                  import milvus
                                                                                  

                                                                                  Traceback:

                                                                                  ---------------------------------------------------------------------------
                                                                                  ModuleNotFoundError                       Traceback (most recent call last)
                                                                                   in 
                                                                                  ----> 1 import milvus
                                                                                  
                                                                                  ModuleNotFoundError: No module named 'milvus'
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-03 at 09:29

I would recommend downgrading your milvus version to one from before the 2.0 release just a week ago. Here is a discussion on that topic: https://github.com/deepset-ai/haystack/issues/2081

                                                                                  Source https://stackoverflow.com/questions/70954157

                                                                                  QUESTION

                                                                                  Which model/technique to use for specific sentence extraction?
                                                                                  Asked 2022-Feb-08 at 18:35

I have a dataset of tens of thousands of dialogues / conversations between a customer and customer support. These dialogues, which could be forum posts or long-winded email conversations, have been hand-annotated to highlight the sentence containing the customer's problem. For example:

                                                                                  Dear agent, I am writing to you because I have a very annoying problem with my washing machine. I bought it three weeks ago and was very happy with it. However, this morning the door does not lock properly. Please help

                                                                                  Dear customer.... etc

                                                                                  The highlighted sentence would be:

                                                                                  However, this morning the door does not lock properly.

1. What approaches can I take to model this, so that in future I can automatically extract the customer's problem? The domain of the datasets is broad, but within the hardware space, so it could be appliances, gadgets, machinery, etc.
2. What is this type of problem called? I thought this might be called "intent recognition", but most guides seem to refer to multiclass classification. The sentence either is or isn't the customer's problem. I considered analysing each sentence and performing binary classification, but I'd like to explore options that take into account the context of the rest of the conversation, if possible.
3. What resources are available to research how to implement this in Python (using TensorFlow or PyTorch)?

                                                                                  I found a model on HuggingFace which has been pre-trained with customer dialogues, and have read the research paper, so I was considering fine-tuning this as a starting point, but I only have experience with text (multiclass/multilabel) classification when it comes to transformers.

                                                                                  ANSWER

                                                                                  Answered 2022-Feb-07 at 10:21

This type of problem, where you want to extract the customer's problem from the original text, is called extractive summarization, and this type of task is solved by sequence-to-sequence (Seq2Seq) models.

The main reason this type of model is called sequence-to-sequence is that both the input and the output of the model are text.

I recommend using a transformer model called Pegasus, which has been pre-trained to predict masked text, but whose main application is being fine-tuned for text summarization (extractive or abstractive).

This Pegasus model is available in the Transformers library, which provides you with a simple but powerful way of fine-tuning transformers with custom datasets. I think this notebook will be extremely useful as guidance and for understanding how to fine-tune this Pegasus model.
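For orientation, here is a minimal sketch of running an off-the-shelf Pegasus checkpoint before any fine-tuning; the google/pegasus-xsum checkpoint and the dialogue text are illustrative choices, not something prescribed by the answer above.

from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"  # illustrative public summarization checkpoint
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

dialogue = ("Dear agent, I am writing to you because I have a very annoying problem "
            "with my washing machine. I bought it three weeks ago and was very happy "
            "with it. However, this morning the door does not lock properly. Please help")

# Tokenize the conversation and generate a short summary of it
batch = tokenizer(dialogue, truncation=True, padding="longest", return_tensors="pt")
summary_ids = model.generate(**batch, max_length=32)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])

Fine-tuning on the annotated dialogues (input: the full conversation, target: the highlighted sentence) follows the same tokenizer/model pattern described in the notebook mentioned above.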

                                                                                  Source https://stackoverflow.com/questions/70990722

                                                                                  QUESTION

                                                                                  Assigning True/False if a token is present in a data-frame
                                                                                  Asked 2022-Jan-06 at 12:38

                                                                                  My current data-frame is:

                                                                                       |articleID | keywords                                               | 
                                                                                       |:-------- |:------------------------------------------------------:| 
                                                                                  0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      |     
                                                                                  1    |58b6393b  | ['Crossword Puzzles']                                  |          
                                                                                  2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']|            
                                                                                  3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements'].        |  
                                                                                  

I want a data-frame similar to the following, where a column is added based on whether the token 'Trump, Donald J' is mentioned in the keywords; if it is, the row is assigned True:

                                                                                       |articleID | keywords                                               | trumpMention |
                                                                                       |:-------- |:------------------------------------------------------:| ------------:|
                                                                                  0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      | False        |      
                                                                                  1    |58b6393b  | ['Crossword Puzzles']                                  | False        |          
                                                                                  2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']| True         |           
                                                                                  3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements'].        | True         |       
                                                                                  

I have tried multiple ways using DataFrame functions, but cannot achieve the results I want. Some of the ways I've tried are:

                                                                                  df['trumpMention'] = np.where(any(df['keywords']) == 'Trump, Donald J', True, False) 
                                                                                  

                                                                                  or

                                                                                  df['trumpMention'] = df['keywords'].apply(lambda x: any(token == 'Trump, Donald J') for token in x) 
                                                                                  

                                                                                  or

                                                                                  lst = ['Trump, Donald J']  
                                                                                  df['trumpMention'] = df['keywords'].apply(lambda x: ([ True for token in x if any(token in lst)]))   
                                                                                  

                                                                                  Raw input:

                                                                                  df = pd.DataFrame({'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
                                                                                                     'keywords': [['Second Avenue (Manhattan, NY)'],
                                                                                                                  ['Crossword Puzzles'],
                                                                                                                  ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                                                                                                  ['Trump, Donald J', 'Speeches and Statements']],
                                                                                                     'trumpMention': [False, False, True, True]})
                                                                                  

                                                                                  ANSWER

                                                                                  Answered 2022-Jan-06 at 12:13

                                                                                  try

                                                                                  df["trumpMention"] = df["keywords"].apply(lambda x: "Trump, Donald J" in x)
                                                                                  

                                                                                  Source https://stackoverflow.com/questions/70606847

                                                                                  QUESTION

                                                                                  How to calculate perplexity of a sentence using huggingface masked language models?
                                                                                  Asked 2021-Dec-25 at 21:51

                                                                                  I have several masked language models (mainly Bert, Roberta, Albert, Electra). I also have a dataset of sentences. How can I get the perplexity of each sentence?

                                                                                  From the huggingface documentation here they mentioned that perplexity "is not well defined for masked language models like BERT", though I still see people somehow calculate it.

                                                                                  For example in this SO question they calculated it using the function

                                                                                  def score(model, tokenizer, sentence,  mask_token_id=103):
                                                                                    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
                                                                                    repeat_input = tensor_input.repeat(tensor_input.size(-1)-2, 1)
                                                                                    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
                                                                                    masked_input = repeat_input.masked_fill(mask == 1, 103)
                                                                                    labels = repeat_input.masked_fill( masked_input != 103, -100)
                                                                                    loss,_ = model(masked_input, masked_lm_labels=labels)
                                                                                    result = np.exp(loss.item())
                                                                                    return result
                                                                                  
                                                                                  score(model, tokenizer, '我爱你') # returns 45.63794545581973
                                                                                  

                                                                                  However, when I try to use the code I get TypeError: forward() got an unexpected keyword argument 'masked_lm_labels'.

                                                                                  I tried it with a couple of my models:

from transformers import pipeline, BertForMaskedLM, AutoTokenizer, RobertaForMaskedLM, AlbertForMaskedLM, ElectraForMaskedLM
import torch

# 1)
tokenizer = AutoTokenizer.from_pretrained("bioformers/bioformer-cased-v1.0")
model = BertForMaskedLM.from_pretrained("bioformers/bioformer-cased-v1.0")
# 2)
tokenizer = AutoTokenizer.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
model = ElectraForMaskedLM.from_pretrained("sultan/BioM-ELECTRA-Large-Generator")
                                                                                  

                                                                                  This SO question also used the masked_lm_labels as an input and it seemed to work somehow.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-25 at 21:51

                                                                                  There is a paper Masked Language Model Scoring that explores pseudo-perplexity from masked language models and shows that pseudo-perplexity, while not being theoretically well justified, still performs well for comparing "naturalness" of texts.

As for the code, your snippet is perfectly correct but for one detail: in recent implementations of Huggingface BERT, the masked_lm_labels argument has been renamed to simply labels, to make the interfaces of the various models more compatible. I have also replaced the hard-coded 103 with the generic tokenizer.mask_token_id. So the snippet below should work:

                                                                                  from transformers import AutoModelForMaskedLM, AutoTokenizer
                                                                                  import torch
                                                                                  import numpy as np
                                                                                  
                                                                                  model_name = 'cointegrated/rubert-tiny'
                                                                                  model = AutoModelForMaskedLM.from_pretrained(model_name)
                                                                                  tokenizer = AutoTokenizer.from_pretrained(model_name)
                                                                                  
def score(model, tokenizer, sentence):
    # Encode once, then build a batch with one copy of the sentence per maskable token
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    # Diagonal mask: row i masks the i-th non-special token of the sentence
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, tokenizer.mask_token_id)
    # Compute the loss only at the masked positions (-100 labels are ignored)
    labels = repeat_input.masked_fill(masked_input != tokenizer.mask_token_id, -100)
    with torch.inference_mode():
        loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())
                                                                                  
                                                                                  print(score(sentence='London is the capital of Great Britain.', model=model, tokenizer=tokenizer)) 
                                                                                  # 4.541251105675365
                                                                                  print(score(sentence='London is the capital of South America.', model=model, tokenizer=tokenizer)) 
                                                                                  # 6.162017238332462
                                                                                  

                                                                                  You can try this code in Google Colab by running this gist.

                                                                                  Source https://stackoverflow.com/questions/70464428

                                                                                  QUESTION

                                                                                  Mapping values from a dictionary's list to a string in Python
                                                                                  Asked 2021-Dec-21 at 16:45

                                                                                  I am working on some sentence formation like this:

                                                                                  sentence = "PERSON is ADJECTIVE"
                                                                                  dictionary = {"PERSON": ["Alice", "Bob", "Carol"], "ADJECTIVE": ["cute", "intelligent"]}
                                                                                  

                                                                                  I would now need all possible combinations to form this sentence from the dictionary, like:

                                                                                  Alice is cute
                                                                                  Alice is intelligent
                                                                                  Bob is cute
                                                                                  Bob is intelligent
                                                                                  Carol is cute
                                                                                  Carol is intelligent
                                                                                  

                                                                                  The above use case was relatively simple, and it was done with the following code

                                                                                  dictionary = {"PERSON": ["Alice", "Bob", "Carol"], "ADJECTIVE": ["cute", "intelligent"]}
                                                                                  
                                                                                  for i in dictionary["PERSON"]:
                                                                                      for j in dictionary["ADJECTIVE"]:
                                                                                          print(f"{i} is {j}")
                                                                                  

                                                                                  But can we also make this scale up for longer sentences?

                                                                                  Example:

                                                                                  sentence = "PERSON is ADJECTIVE and is from COUNTRY" 
                                                                                  dictionary = {"PERSON": ["Alice", "Bob", "Carol"], "ADJECTIVE": ["cute", "intelligent"], "COUNTRY": ["USA", "Japan", "China", "India"]}
                                                                                  

                                                                                  This should again provide all possible combinations like:

                                                                                  Alice is cute and is from USA
                                                                                  Alice is intelligent and is from USA
                                                                                  .
                                                                                  .
                                                                                  .
                                                                                  .
                                                                                  Carol is intelligent and is from India
                                                                                  

I tried to use https://www.pythonpool.com/python-permutations/, but the sentences all come out mixed up. How can we keep a few words fixed, like the words "and is from" in this example?

Essentially, if any key in the dictionary is equal to a word in the string, then that word should be replaced by the values from the dictionary's list.

                                                                                  Any thoughts would be really helpful.

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-12 at 17:53

You can first replace the dictionary keys in sentence with {} placeholders so that you can easily format the string in a loop. Then you can use itertools.product to create the Cartesian product of dictionary.values() and simply loop over it to create your desired sentences.

                                                                                  from itertools import product
                                                                                  sentence = ' '.join([('{}' if w in dictionary else w) for w in sentence.split()])
                                                                                  mapped_sentences_generator = (sentence.format(*tple) for tple in product(*dictionary.values()))
                                                                                  for s in mapped_sentences_generator:
                                                                                      print(s)
                                                                                  

                                                                                  Output:

                                                                                  Alice is cute and is from USA
                                                                                  Alice is cute and is from Japan
                                                                                  Alice is cute and is from China
                                                                                  Alice is cute and is from India
                                                                                  Alice is intelligent and is from USA
                                                                                  Alice is intelligent and is from Japan
                                                                                  Alice is intelligent and is from China
                                                                                  Alice is intelligent and is from India
                                                                                  Bob is cute and is from USA
                                                                                  Bob is cute and is from Japan
                                                                                  Bob is cute and is from China
                                                                                  Bob is cute and is from India
                                                                                  Bob is intelligent and is from USA
                                                                                  Bob is intelligent and is from Japan
                                                                                  Bob is intelligent and is from China
                                                                                  Bob is intelligent and is from India
                                                                                  Carol is cute and is from USA
                                                                                  Carol is cute and is from Japan
                                                                                  Carol is cute and is from China
                                                                                  Carol is cute and is from India
                                                                                  Carol is intelligent and is from USA
                                                                                  Carol is intelligent and is from Japan
                                                                                  Carol is intelligent and is from China
                                                                                  Carol is intelligent and is from India
                                                                                  

Note that this works for Python 3.7+ because it relies on dictionary insertion order being maintained. For older Python versions, use collections.OrderedDict rather than dict.

                                                                                  Source https://stackoverflow.com/questions/70325758

                                                                                  QUESTION

                                                                                  What are differences between AutoModelForSequenceClassification vs AutoModel
                                                                                  Asked 2021-Dec-05 at 09:07

We can create a model with the AutoModel (TFAutoModel) class:

                                                                                  from transformers import AutoModel 
model = AutoModel.from_pretrained('distilbert-base-uncased')
                                                                                  

On the other hand, a model can be created with AutoModelForSequenceClassification (TFAutoModelForSequenceClassification):

                                                                                  from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased')
                                                                                  

As far as I know, both models use the distilbert-base-uncased checkpoint. Judging by the names, the second class (AutoModelForSequenceClassification) is intended for sequence classification.

But what are the real differences between the two classes? And how do I use them correctly?

(I searched the Hugging Face documentation, but it is not clear.)

                                                                                  ANSWER

                                                                                  Answered 2021-Dec-05 at 09:07

The difference between AutoModel and AutoModelForSequenceClassification is that AutoModelForSequenceClassification adds a classification head on top of the base model's outputs, which can easily be trained together with the base model.
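
A minimal sketch of the practical difference; the checkpoint name and example sentence are illustrative, and the classification head of AutoModelForSequenceClassification is randomly initialised until you fine-tune it:

from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
import torch

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("The door does not lock properly.", return_tensors="pt")

base = AutoModel.from_pretrained(checkpoint)
clf = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

with torch.no_grad():
    hidden = base(**inputs).last_hidden_state  # raw per-token hidden states, shape (1, seq_len, 768)
    logits = clf(**inputs).logits              # output of the added classification head, shape (1, 2)

print(hidden.shape, logits.shape)

So AutoModel is the right choice when you want the encoder's representations (for example, to build your own head or compute embeddings), while AutoModelForSequenceClassification is the convenient starting point when you plan to fine-tune the checkpoint directly on a labelled classification dataset.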

                                                                                  Source https://stackoverflow.com/questions/69907682

                                                                                  Community Discussions, Code Snippets contain sources that include Stack Exchange Network

                                                                                  Vulnerabilities

                                                                                  No vulnerabilities reported

                                                                                  Install corenlp-examples

                                                                                  You can download it from GitHub.
You can use corenlp-examples like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the corenlp-examples component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.

                                                                                  Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask questions on the Stack Overflow community page.
CLONE
• HTTPS: https://github.com/drewfarris/corenlp-examples.git
• CLI: gh repo clone drewfarris/corenlp-examples
• SSH: git@github.com:drewfarris/corenlp-examples.git


                                                                                  Consider Popular Natural Language Processing Libraries

transformers by huggingface
funNLP by fighting41love
bert by google-research
jieba by fxsjy
Python by geekcomputers

                                                                                  Try Top Libraries by drewfarris

sample-cfssl-ca by drewfarris (Shell)
mahout-avro-testbed by drewfarris (Java)
datacube-experiments by drewfarris (Java)
drewfarris.github.io by drewfarris (JavaScript)
scratch-archetype by drewfarris (Java)

                                                                                  Compare Natural Language Processing Libraries with Highest Support

transformers by huggingface
bert by google-research
allennlp by allenai
flair by flairNLP
spaCy by explosion
