
jieba | Stuttering Chinese word segmentation

by fxsjy | Python | Version: Current | License: MIT


kandi X-RAY | jieba Summary

Stuttering Chinese word segmentation

jieba Key Features

Stuttering Chinese word segmentation
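The tagline names jieba's core feature: dictionary-based Chinese word segmentation. As an illustration of how this family of segmenters works (a toy sketch, not jieba's actual implementation), the idea is to build a DAG of all dictionary words starting at each position, then pick the segmentation that maximizes the total log word probability via dynamic programming. The mini-dictionary and frequencies below are made up for the example.

```python
import math

# Hypothetical mini-dictionary: word -> raw frequency count.
FREQ = {"我": 100, "来到": 50, "北京": 80, "清华": 30,
        "大学": 60, "清华大学": 90, "华大": 5}
TOTAL = sum(FREQ.values())

def get_dag(sentence):
    """For each start index, list the end indices of dictionary words."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1)
                if sentence[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown char: fall back to a 1-char word
    return dag

def cut(sentence):
    """Segment by maximizing the sum of log word probabilities."""
    dag = get_dag(sentence)
    n = len(sentence)
    route = {n: (0.0, 0)}  # route[i] = (best score from i, end of first word)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1)) - math.log(TOTAL)
             + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]        # follow the best path greedily from the left
        words.append(sentence[i:j])
        i = j
    return words

print(cut("我来到北京清华大学"))  # ['我', '来到', '北京', '清华大学']
```

Note how the dynamic program prefers the whole word "清华大学" over the pair "清华" + "大学": one frequent long word scores higher than two shorter ones once the log probabilities are summed.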

jieba Examples and Code Snippets


Community Discussions

Trending Discussions on jieba
  • Problem to extract NER subject + verb with spacy and Matcher
  • Docker AWS Elastic beanstalk no error in local machine docker build but spacy NLP hanging forever when put on server
  • Chinese segmentation selection in model loading in Spacy 2.4 release
  • How to implement parallel process on huge dataframe
  • Why python program runs with different result between Linux shell and Jenkins job
  • Calculate tangent for each point of the curve python in matplotlib
  • skmultiLearn classifiers predictions always return 0
  • Python 3 cannot find a module
  • ModuleNotFoundError: No module named 'jieba'
  • POS tagging and NER for Chinese Text with Spacy

QUESTION

Problem to extract NER subject + verb with spacy and Matcher

Asked 2021-Apr-26 at 17:44

I am working on an NLP project and I have to use spaCy and the spaCy Matcher to extract all named entities that are nsubj (subjects), along with the verb each one relates to: the governing verb of my nsubj named entity. Example:

Georges and his friends live in Mexico City
"Hello !", says Mary

I'll need to extract "Georges" and "live" from the first sentence and "Mary" and "says" from the second, but I don't know how many words will fall between my named entity and the verb it relates to. So I decided to explore the spaCy Matcher further.

I am struggling to write a Matcher pattern that extracts my two words. When the named-entity subject comes before the verb, I get good results, but I don't know how to write a pattern that matches a named-entity subject appearing after the verb it relates to. According to the guidelines, I could also do this task with "regular" spaCy, but I don't know how. The problem with the Matcher is that I can't control the type of dependency between the named entity and the verb, so I can't grab the right verb. I'm new to spaCy; I've always worked with NLTK or jieba (for Chinese). I don't even know how to split a text into sentences with spaCy, so I split the whole text into sentences with NLTK to avoid bad matches across sentence boundaries. Here is my code:

import spacy
from nltk import sent_tokenize
from spacy.matcher import Matcher

nlp = spacy.load('fr_core_news_md')

matcher = Matcher(nlp.vocab)

def get_entities_verbs():

    try:

        # subject before verb
        pattern_subj_verb = [{'ENT_TYPE': 'PER', 'DEP': 'nsubj'}, {"POS": {'NOT_IN':['VERB']}, "DEP": {'NOT_IN':['nsubj']}, 'OP':'*'}, {'POS':'VERB'}]
        # subject after verb
        # this pattern is not good

        matcher.add('ent-verb', [pattern_subj_verb])

        for sent in sent_tokenize(open('Le_Ventre_de_Paris-short.txt').read()):
            sent = nlp(sent)
            matches = matcher(sent)
            for match_id, start, end in matches:
                span = sent[start:end]
                print(span)

    except Exception as error:
        print(error)


def main():

    get_entities_verbs()

if __name__ == '__main__':
    main()

Even though the output is in French, I can assure you that the results are good:

Florent regardait
Lacaille reparut
Florent baissait
Claude regardait
Florent resta
Florent, soulagé
Claude s’était arrêté
Claude en riait
Saget est matinale, dit
Florent allait
Murillo peignait
Florent accablé
Claude entra
Claude l’appelait
Florent regardait
Florent but son verre de punch ; il le sentit
Alexandre, dit
Florent levait
Claude était ravi
Claude et Florent revinrent
Claude, les mains dans les poches, sifflant

I get some wrong results, but about 90% are good. I just need to grab the first and last word of each line to get my NE/verb pair. So my question is: how do I extract a named entity that is a subject, together with the verb it relates to, using the Matcher, or simply with spaCy itself (without the Matcher)? There are too many factors to take into account. Do you have a method that gets the best results possible, even if 100% is not achievable? I need a pattern that matches a governing VERB plus a NER subject that follows it, starting from this pattern:

pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

All credit to polm23 for this pattern

ANSWER

Answered 2021-Apr-26 at 05:05

This is a perfect use case for the DependencyMatcher. It also makes things easier if you merge entities into single tokens before running it. This code should do what you need:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")

# merge entities to simplify this
nlp.add_pipe("merge_entities")


pattern = [
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        }
        ]

matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

texts = [
        "John Smith and some other guy live there",
        '"Hello!", says Mary.',
        ]

for text in texts:
    doc = nlp(text)
    matches = matcher(doc)

    for match in matches:
        match_id, (start, end) = match
        # note order here is defined by the pattern, so the nsubj will be first
        print(doc[start], "::", doc[end])
    print()

Source https://stackoverflow.com/questions/67259823

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install jieba

No installation instructions are available at this moment for jieba. Refer to the component home page for details.

Support

For feature suggestions and bug reports, create an issue on GitHub.
If you have any questions, visit the community on GitHub or Stack Overflow.


Clone
  • git@github.com:fxsjy/jieba.git
