spaCy | 💫 Industrial-strength Natural Language Processing | Natural Language Processing library
kandi X-RAY | spaCy Summary
- Defines a factory
- Returns a fully qualified name for the given language
- Sets the factory meta
- Register a factory function
- Compute the PRF score for the given examples
- Calculate the tp score
- Embed a character embedding
- Construct a model of static vectors
- Lemmatize a token
- Get a table by name
- Command line interface for debugging
- Parse dependencies
- Process if node
- Create a model of static vectors
- Lemmatize a word
- Parse command line interface
- Lemmatize rule
- Setup package
- Forward layer computation
- Lemmatize a specific word
- Update the model with the given examples
- Builds a token embedding model
- Command line interface for pretraining
- Extract the words from the wiktionary
- Rehearse the language
- Process a for loop
- Lemmatize a rule
spaCy Key Features
spaCy Examples and Code Snippets
import spacy
import pandas as pd
import json
from itertools import groupby

# Download spaCy models:
models = {
    'en_core_web_sm': spacy.load("en_core_web_sm"),
    'en_core_web_lg': spacy.load("en_core_web_lg")
}

# This function converts spaCy docs to the list of named entity spans
# in Label Studio compatible JSON format:
def doc_to_spans(doc):
    tokens = [(tok.text, tok.idx, tok.ent_type_) for tok in doc]
    results = []
    entities = set()
    for entity, group in groupby(tokens, key=lambda t: t[-1]):
        if not entity:
            continue
        group = list(group)
        _, start, _ = group[0]
        word, last, _ = group[-1]
        text = ' '.join(item[0] for item in group)
        end = last + len(word)
        results.append({
            'from_name': 'label',
            'to_name': 'text',
            'type': 'labels',
            'value': {'start': start, 'end': end, 'text': text, 'labels': [entity]}
        })
        entities.add(entity)
    return results, entities

# Now load the dataset and include only lines containing "Easter ":
df = pd.read_csv('lines_clean.csv')
df = df[df['line_text'].str.contains("Easter ", na=False)]
print(df.head())
texts = df['line_text']

# Prepare Label Studio tasks in import JSON format with the model predictions:
entities = set()
tasks = []
for text in texts:
    predictions = []
    for model_name, nlp in models.items():
        doc = nlp(text)
        spans, ents = doc_to_spans(doc)
        entities |= ents
        predictions.append({'model_version': model_name, 'result': spans})
    tasks.append({
        'data': {'text': text},
        'predictions': predictions
    })

# Save Label Studio tasks.json
print(f'Save {len(tasks)} tasks to "tasks.json"')
with open('tasks.json', mode='w') as f:
    json.dump(tasks, f, indent=2)

# Save class labels as a txt file
print('Named entities are saved to "named_entities.txt"')
with open('named_entities.txt', mode='w') as f:
    f.write('\n'.join(sorted(entities)))
import json
from collections import defaultdict

tasks = json.load(open('annotations.json'))
model_hits = defaultdict(int)
for task in tasks:
    annotation_result = task['annotations'][0]['result']
    for r in annotation_result:
        r.pop('id')
    for prediction in task['predictions']:
        model_hits[prediction['model_version']] += int(prediction['result'] == annotation_result)
num_task = len(tasks)
for model_name, num_hits in model_hits.items():
    acc = num_hits / num_task
    print(f'Accuracy for {model_name}: {acc:.2f}%')
Accuracy for en_core_web_sm: 0.03%
Accuracy for en_core_web_lg: 0.41%
python -m pip install -U pip
pip install -U spacy
pip install pandas
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Add match ID "HelloWorld" with no callback and one pattern
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])
doc = nlp("Hello, world! Hello world!")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_lg')
text='The car comprises 4 brakes 4.1, 4.2, 4.3 and 4.4 in fig. 5, all include an ESP system. This is shown in Fig. 6. Fig. 5 shows how the motors 56 and 57 are blocked. Besides the doors (44, 45) are painted blue.'
# Add EntityRuler to pipeline
ruler = nlp.add_pipe("entity_ruler", before="ner", config={"validate": True})
patterns = [{"label": "2_DIGIT", "pattern": [{"IS_DIGIT": True}, {"IS_PUNCT": True}, {"IS_DIGIT": True}]}]
ruler.add_patterns(patterns)
# Process the text
doc = nlp(text)
# Print 2-Digit Ents
print([(ent.label_, text[ent.start_char:ent.end_char]) for ent in doc.ents if ent.label_ == "2_DIGIT"])
import spacy_stanza
nlp = spacy_stanza.load_pipeline("xx", lang="la")
COPY core ${LAMBDA_TASK_ROOT}
COPY core ${LAMBDA_TASK_ROOT}/core
import spacy
from spacy.scorer import Scorer
from spacy.tokens import Doc
from spacy.training.example import Example
examples = [
    ('Who is Talha Tayyab?',
     {'entities': [(7, 19, 'PERSON')]}),
    ('I like London and Berlin.',
     {'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}),
    ('Agra is famous for Tajmahal, The CEO of Facebook will visit India shortly to meet Murari Mahaseth and to visit Tajmahal.',
     {'entities': [(0, 4, 'LOC'), (40, 48, 'ORG'), (60, 65, 'GPE'), (82, 97, 'PERSON'), (111, 119, 'GPE')]})
]

def my_evaluate(ner_model, examples):
    scorer = Scorer()
    example = []
    for input_, annotations in examples:
        pred = ner_model(input_)
        print(pred, annotations)
        temp = Example.from_dict(pred, annotations)
        example.append(temp)
    scores = scorer.score(example)
    return scores

ner_model = spacy.load('en_core_web_sm')  # for spaCy's pretrained model use 'en_core_web_sm'
results = my_evaluate(ner_model, examples)
print(results)
FINANCE = ["Frontwave Credit Union",
"St. Mary's Bank",
"Center for Financial Services Innovation"]
SPORT = [
"Christiano Ronaldo",
"Lewis Hamilton",
]
FINANCE = '|'.join(FINANCE)
sent = pd.DataFrame({'sent': ["Dear members of Frontwave Credit Union, any credit demanded by Lewis Hamilton is invalid, said Ronaldo"]})
home = sent['sent'].str.extractall(f'({FINANCE})')
def labeler(row, group):
    l = len(row.split())
    return [f'I-{group}' if i != 0 else f'B-{group}' for i in range(l)]

home[0].apply(labeler, group='FINANCE').explode()
for batch in batches:
    nlp.update(batch, sgd=optimizer, losses=losses)
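For context, an update call like the one above normally sits inside a larger training loop. The following is a minimal, hedged sketch of such a loop for spaCy v3; the TRAIN_DATA sample and the PERSON label are illustrative assumptions, not part of the original snippet.

import random
import spacy
from spacy.training.example import Example
from spacy.util import minibatch

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PERSON")

# Hypothetical training data in spaCy's (text, annotations) format
TRAIN_DATA = [("Who is Talha Tayyab?", {"entities": [(7, 19, "PERSON")]})]

optimizer = nlp.initialize()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        # Build Example objects from the raw text and gold annotations
        examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in batch]
        nlp.update(examples, sgd=optimizer, losses=losses)
    print(epoch, losses)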
Trending Discussions on spaCy
QUESTION
I am facing the following attribute error when loading a GloVe model:
Code used to load model:
nlp = spacy.load('en_core_web_sm')
tokenizer = spacy.load('en_core_web_sm', disable=['tagger','parser', 'ner', 'textcat'])
nlp.vocab.vectors.from_glove('../models/GloVe')
I get the following attribute error when trying to load the GloVe model:
AttributeError: 'spacy.vectors.Vectors' object has no attribute 'from_glove'
Have tried to search on StackOverflow and elsewhere but can't seem to find the solution. Thanks!
From pip list:
- spacy version: 3.1.4
- spacy-legacy 3.0.8
- en-core-web-sm 3.1.0
ANSWER
Answered 2022-Mar-17 at 14:08

spaCy version 3.1.4 does not have the from_glove feature. I was able to use nlp.vocab.vectors.from_glove() in spaCy version 2.2.4.
If you want, you can change your spaCy version by using:
!pip install spacy==2.2.4
in your Jupyter cell.
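A hedged side note, not part of the original answer: if you prefer to stay on spaCy v3, plain-text vector files are typically converted with the init vectors CLI instead of from_glove; the file paths below are placeholders, not known file names.

python -m spacy init vectors en ../models/GloVe/vectors.txt ./glove_vectors_model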
QUESTION
I want to use SpaCy to analyze many small texts and I want to store the nlp results for further use to save processing time. I found code at Storing and Loading spaCy Documents Containing Word Vectors but I get an error and I cannot find how to fix it. I am fairly new to python.
In the following code, I store the nlp results to a file and try to read them again. I can write the first file, but I cannot find the second file (vocab). I also get two errors: that Doc and Vocab are not defined.
Any idea to fix this or another method to achieve the same result is more than welcomed.
Thanks!
import spacy
nlp = spacy.load('en_core_web_md')
doc = nlp("He eats a green apple")
for token in doc:
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
token.shape_, token.is_alpha, token.is_stop)
NLP_FName = "E:\\SaveTest.nlp"
doc.to_disk(NLP_FName)
Vocab_FName = "E:\\SaveTest.voc"
doc.vocab.to_disk(Vocab_FName)
#To read the data again:
idoc = Doc(Vocab()).from_disk(NLP_FName)
idoc.vocab.from_disk(Vocab_FName)
for token in idoc:
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
token.shape_, token.is_alpha, token.is_stop)
ANSWER
Answered 2022-Mar-10 at 18:06

I tried your code and hit a few minor issues, which I fixed in the code below.
Note that SaveTest.nlp is a binary file with your doc info, and SaveTest.voc is a folder with all the spaCy model vocab information (vectors, strings, among other things).
Changes I made:
- Import the Doc class from spacy.tokens
- Import the Vocab class from spacy.vocab
- Download the en_core_web_md model using the following command:
python -m spacy download en_core_web_md
Please note that spaCy has multiple models for each language, and usually you have to download them first (typically the sm, md and lg models). Read more about it here.
Code:
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab
nlp = spacy.load('en_core_web_md')
doc = nlp("He eats a green apple")
for token in doc:
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
token.shape_, token.is_alpha, token.is_stop)
NLP_FName = "E:\\SaveTest.nlp"
doc.to_disk(NLP_FName)
Vocab_FName = "E:\\SaveTest.voc"
doc.vocab.to_disk(Vocab_FName)
#To read the data again:
idoc = Doc(Vocab()).from_disk(NLP_FName)
idoc.vocab.from_disk(Vocab_FName)
for token in idoc:
print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
token.shape_, token.is_alpha, token.is_stop)
Let me know if this is helpful to you, and if not, please add your error message to your original question so I can help.
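A related aside, not from the original answer: for storing many small docs, spaCy also ships a DocBin class that serializes a whole collection of Docs into one compact file. A minimal sketch, assuming the same en_core_web_md pipeline:

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_md")

# Serialize several docs into a single binary file
doc_bin = DocBin(store_user_data=True)
for text in ["He eats a green apple", "She reads a book"]:
    doc_bin.add(nlp(text))
doc_bin.to_disk("docs.spacy")

# Reload later, reusing the pipeline's vocab
doc_bin = DocBin().from_disk("docs.spacy")
for doc in doc_bin.get_docs(nlp.vocab):
    print([(t.text, t.lemma_) for t in doc])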
QUESTION
I have a similar question as the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case compared to the linked post is that my pattern is defined by POS and dependency tags. As a consequence I don't think I could easily use regex to solve my problem (as is suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spaCy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.
ANSWER
Answered 2022-Mar-09 at 04:14

The solution / issue isn't fundamentally different from the question linked to; there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns to capture what you want.
patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
Alternately you could do something with the dependency matcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
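A hedged sketch of that dependency-matcher idea, assuming spaCy v3's DependencyMatcher API (the pattern name is illustrative, and the exact dependency arcs depend on the parser's output for a given sentence):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Anchor on a noun, then match any adjective anywhere in its subtree
# (e.g. attached via "amod" or chained "conj" arcs).
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "noun", "REL_OP": ">>", "RIGHT_ID": "adjective",
     "RIGHT_ATTRS": {"POS": "ADJ"}},
]
matcher.add("ADJ_OF_NOUN", [pattern])

doc = nlp("It was a big, brown, playful dog.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])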
As a separate note, you are not handling sentences with the serial comma.
QUESTION
I loaded the regular spaCy language model and tried the following code:
import spacy
nlp = spacy.load("en_core_web_md")
text = "xxasdfdsfsdzz is the first U.S. public company"
if 'xxasdfdsfsdzz' in nlp.vocab:
    print("in")
else:
    print("not")
if 'Apple' in nlp.vocab:
    print("in")
else:
    print("not")
# Process the text
doc = nlp(text)
if 'xxasdfdsfsdzz' in nlp.vocab:
    print("in")
else:
    print("not")
if 'Apple' in nlp.vocab:
    print("in")
else:
    print("not")
It seems like spaCy only loaded words into the vocab after they were analyzed with nlp(text).
Can someone explain the output? How can I avoid it? Why does "Apple" not exist in the vocab, and why does "xxasdfdsfsdzz" exist?
Output:
not
not
in
not
ANSWER
Answered 2022-Feb-28 at 04:26

The spaCy Vocab is mainly an internal implementation detail to interface with a memory-efficient method of storing strings. It is definitely not a list of "real words" or any other thing that you are likely to find useful.
The main thing a Vocab stores by default is strings that are used internally, such as POS and dependency labels. In pipelines with vectors, words in the vectors are also included. You can read more about the implementation details here.
All words an nlp object has seen need storage for their strings, and so will be present in the Vocab. That's what you're seeing with your nonsense string in the example above.
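A small, hedged illustration of that point, assuming a pipeline with vectors such as en_core_web_md is installed: checking vector membership is not the same as checking string membership.

import spacy

nlp = spacy.load("en_core_web_md")

# String storage: only strings the pipeline has actually seen or needed
print("xxasdfdsfsdzz" in nlp.vocab)           # depends on what has been processed so far
# Vector table: vocabulary words shipped with the md/lg vectors
print(nlp.vocab.has_vector("apple"))          # True for the md/lg English pipelines
print(nlp.vocab.has_vector("xxasdfdsfsdzz"))  # False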
QUESTION
I've been trying to solve a problem with the spacy Tokenizer for a while, without any success. Also, I'm not sure if it's a problem with the tokenizer or some other part of the pipeline.
Any help is welcome!
Description
I have an application that, for reasons beside the point, creates a spaCy Doc from the spaCy vocab and the list of tokens from a string (see code below). Note that while this is not the simplest and most common way to do this, according to the spaCy docs it can be done.
However, when I create a Doc for a text that contains compound words or dates with a hyphen as a separator, the behavior I am getting is not what I expected.
import spacy
from spacy.tokens import Doc
# My current way
doc = Doc(nlp.vocab, words=tokens)  # tokens is a well defined list of tokens for a certain string
# Standard way
doc = nlp("My text...")
For example, with the following text, if I create the Doc using the standard procedure, the spaCy Tokenizer recognizes the "-" as tokens, but the Doc text is the same as the input text; in addition, the spaCy NER model correctly recognizes the DATE entity.
import spacy
doc = nlp("What time will sunset be on 2022-12-24?")
print(doc.text)
tokens = [str(token) for token in doc]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
What time will sunset be on 2022-12-24?
['What', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022-12-24
On the other hand, if I create the Doc from the model's vocab and the previously calculated tokens, the result obtained is different. Note that for the sake of simplicity I am using the tokens from doc, so I'm sure there are no differences in tokens. Also note that I am manually running each pipeline model in the correct order with the doc, so at the end of this process I would theoretically get the same results.
However, as you can see in the output below, while the Doc's tokens are the same, the Doc's text is different: there are blank spaces between the digits and the date separators.
doc2 = Doc(nlp.vocab, words=tokens)
# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)
# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
Output:
what time will sunset be on 2022 - 12 - 24 ?
['what', 'time', 'will', 'sunset', 'be', 'on', '2022', '-', '12', '-', '24', '?']
DATE
2022 - 12 - 24
I know it must be something silly that I'm missing but I don't realize it.
Could someone please explain to me what I'm doing wrong and point me in the right direction?
Thanks a lot in advance!
EDIT
Following Talha Tayyab's suggestion, I had to create an array of booleans with the same length as my list of tokens to indicate, for each one, whether the token is followed by a space. Then pass this array into the Doc constructor as follows: doc = Doc(nlp.vocab, words=words, spaces=spaces).
To compute this list of boolean values based on my original text string and list of tokens, I implemented the following vanilla function:
from typing import List

def get_spaces(self, text: str, tokens: List[str]) -> List[bool]:
    # Spaces
    spaces = []
    # Copy text to operate on easily
    t = text.lower()
    # Iterate over tokens
    for token in tokens:
        if t.startswith(token.lower()):
            t = t[len(token):]  # Remove token
            # If after removing the token we have an empty space
            if len(t) > 0 and t[0] == " ":
                spaces.append(True)
                t = t[1:]  # Remove space
            else:
                spaces.append(False)
    return spaces
With these two improvements in my code, the result obtained is as expected. However, now I have the following question:
Is there a more spacy-like way to compute whitespace, instead of using my vanilla implementation?
ANSWER
Answered 2022-Feb-14 at 21:06

Please try this:
from spacy.tokens import Doc
doc2 = Doc(nlp.vocab, words=tokens, spaces=[1,1,1,1,1,1,0,0,0,0,0,0])
# Run each model in pipeline
for model_name in nlp.pipe_names:
    pipe = nlp.get_pipe(model_name)
    doc2 = pipe(doc2)
# Print text and tokens
print(doc2.text)
tokens = [str(token) for token in doc2]
print(tokens)
# Show entities
print(doc.ents[0].label_)
print(doc.ents[0].text)
# You can also replace 0 with False and 1 with True
This is the complete syntax:
doc = Doc(nlp.vocab, words=words, spaces=spaces)
spaces is a list of boolean values indicating whether each word has a subsequent space. It must have the same length as words, if specified, and defaults to a sequence of True.
So you can choose which tokens are followed by a space and which are not.
Reference: https://spacy.io/api/doc
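As a possible answer to the "more spaCy-like way" question above (a hedged sketch, not part of the original answer): when the tokens come from an existing Doc, the trailing-whitespace flags can be read back from token.whitespace_ instead of recomputing them by hand.

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
source = nlp("What time will sunset be on 2022-12-24?")

words = [t.text for t in source]
spaces = [bool(t.whitespace_) for t in source]  # True if the token is followed by a space

doc2 = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc2.text)  # matches the original text, including the hyphenated date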
QUESTION
I am getting the below error when I'm trying to run the following line of code to load en_core_web_sm in the Azure Machine Learning instance.
I debugged the issue and found out that installing scrubadub_spacy seems to be what causes the error.
spacy.load("en_core_web_sm")
OSError Traceback (most recent call last)
in
1 # Load English tokenizer, tagger, parser and NER
----> 2 nlp = spacy.load("en_core_web_sm")
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/spacy/__init__.py in load(name, vocab, disable, exclude, config)
50 """
51 return util.load_model(
---> 52 name, vocab=vocab, disable=disable, exclude=exclude, config=config
53 )
54
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/spacy/util.py in load_model(name, vocab, disable, exclude, config)
418 return get_lang_class(name.replace("blank:", ""))()
419 if is_package(name): # installed as package
--> 420 return load_model_from_package(name, **kwargs) # type: ignore[arg-type]
421 if Path(name).exists(): # path to model data directory
422 return load_model_from_path(Path(name), **kwargs) # type: ignore[arg-type]
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/spacy/util.py in load_model_from_package(name, vocab, disable, exclude, config)
451 """
452 cls = importlib.import_module(name)
--> 453 return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config) # type: ignore[attr-defined]
454
455
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/en_core_web_sm/__init__.py in load(**overrides)
10
11 def load(**overrides):
---> 12 return load_model_from_init_py(__file__, **overrides)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/spacy/util.py in load_model_from_init_py(init_file, vocab, disable, exclude, config)
619 disable=disable,
620 exclude=exclude,
--> 621 config=config,
622 )
623
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/spacy/util.py in load_model_from_path(model_path, meta, vocab, disable, exclude, config)
485 config_path = model_path / "config.cfg"
486 overrides = dict_to_dot(config)
--> 487 config = load_config(config_path, overrides=overrides)
488 nlp = load_model_from_config(config, vocab=vocab, disable=disable, exclude=exclude)
489 return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/spacy/util.py in load_config(path, overrides, interpolate)
644 else:
645 if not config_path or not config_path.exists() or not config_path.is_file():
--> 646 raise IOError(Errors.E053.format(path=config_path, name="config.cfg"))
647 return config.from_disk(
648 config_path, overrides=overrides, interpolate=interpolate
OSError: [E053] Could not read config.cfg from /anaconda/envs/azureml_py36/lib/python3.6/site-packages/en_core_web_sm/en_core_web_sm-2.3.1/config.cfg
I installed the packages using the three lines of code below, from the spaCy docs:
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
How should I fix this issue? Thanks in advance.
ANSWER
Answered 2022-Feb-06 at 04:46

Taking the path from your error message:
en_core_web_sm-2.3.1/config.cfg
You have a model for v2.3, but it's looking for a config.cfg, which is only a thing in v3 of spaCy. It looks like you upgraded spaCy without realizing it.
There are two ways to fix this. One is to reinstall the model with spacy download, which will get a version that matches your current spaCy version. If you are just starting out, that is probably the best idea. Based on the release date of scrubadub, it seems to be intended for use with spaCy v3.
However, note that v2 and v3 are pretty different - if you have a project with v2 of spaCy you might want to downgrade instead.
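Concretely, the two options might look like this (a hedged sketch; the version pin is illustrative):

# Option 1: stay on spaCy v3 and re-download a matching model
pip install -U spacy
python -m spacy download en_core_web_sm

# Option 2: go back to a spaCy v2 release that matches the v2.3 model
pip install "spacy~=2.3.0"
python -m spacy download en_core_web_sm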
QUESTION
I am doing some web scraping to export text info from an HTML page and using an NER model (spaCy) to identify information such as Assets Under Management, Addresses, and founding dates of companies. Once the information is extracted, I would like to place it in a dataframe.
I am working with the following script:
from bs4 import BeautifulSoup
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
import pandas as pd
import spacy
from spacy import displacy
import en_core_web_sm
import requests
import re
NER = spacy.load("en_core_web_sm")
url = "https://www.baincapital.com/"
driver = webdriver.Chrome("C:/Program Files/chromedriver.exe")
driver.get(url)
sleep(randint(5,15))
soup = BeautifulSoup(driver.page_source, 'html.parser')
body=soup.body.text
body
body= body.replace('\n', ' ')
body= body.replace('\t', ' ')
body= body.replace('\r', ' ')
body= body.replace('\xa0', ' ')
text3= NER(body)
displacy.render(text3,style="ent",jupyter=True)
The output is shown as:
[displaCy rendering of the recognized entities]
And I would like to place it in the following rudimentary table:

Entity    Identified
Money     $155 Billion
Date      1984
Org       Bain Capital
Org       Bain Capital Investor Portal Please
Cardinal  four
Cardinal  24
GPE       US

Essentially, take the highlighted info and place it in a dataframe with identifying features.
ANSWER
Answered 2022-Jan-25 at 21:27

After you obtain the body with plain text, you can parse the text into a document, get a list of all entities with their labels and texts, and then instantiate a Pandas dataframe with that data:
#... your code here ...
body=soup.body.text
# now, this is the modification:
body = ' '.join(body.split())
doc = NER(body)
entities = [(e.label_,e.text) for e in doc.ents]
df = pd.DataFrame(entities, columns=['Entity','Identified'])
Note that the body = ' '.join(body.split()) line is used to normalize all whitespace in a simpler and shorter way than you used.
QUESTION
I am using the spaCy NER model to extract from a text some named entities relevant to my problem, such as DATE, TIME and GPE, among others.
For example, I need to recognize the Time Zone in the following sentence:
"Australian Central Time"
With the spaCy model en_core_web_lg, I got the following result:
doc = nlp("Australian Central Time")
print([(ent.label_, ent.text) for ent in doc.ents])
>> [('NORP', 'Australian')]
My problem is that I don't have a clear idea of what exactly the entity NORP means, and more generally what each spaCy NER entity means (leaving aside the intuitive values, of course).
I found the following snippet to get the complete list of entity labels, but after that I'm stuck:
import spacy
nlp = spacy.load("en_core_web_lg")
nlp.get_pipe("ner").labels
I'm pretty new to using spaCy and didn't find what I'm looking for in the official documentation, so any help will be appreciated!
BTW, I'm using spaCy version 3.2.1.
ANSWER
Answered 2022-Jan-24 at 16:01

Most labels have definitions you can access using spacy.explain(label).
For NORP: "Nationalities or religious or political groups".
For more details you would need to look into the annotation guidelines for the resources listed in the model documentation under https://spacy.io/models/.
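For instance, a short hedged sketch that prints the explanation for every NER label in the pipeline (assuming en_core_web_lg is installed):

import spacy

nlp = spacy.load("en_core_web_lg")
# Print each NER label together with spaCy's built-in explanation
for label in nlp.get_pipe("ner").labels:
    print(label, "-", spacy.explain(label))

print(spacy.explain("NORP"))  # "Nationalities or religious or political groups"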
QUESTION
I am new to NER and spaCy. I'm trying to figure out what, if any, text cleaning needs to be done. Some examples I've found trim the leading and trailing whitespace and then muck with the start/stop indexes. I saw one example where the guy did a bunch of cleaning and his accuracy was really bad because all the indexes were messed up.
Just to clarify, the dataset was annotated with DataTurks, so you get json like this:
"Content":
"label": [
"Skills"
],
"points": [
{
"start": 1295,
"end": 1621,
"text": "\n• Programming language...
So by "mucking with the indexes", I mean, if you strip off the leading \n
, you need to update the start index, so it's still aligned properly.
So that's really the question, if I start removing characters from the beginning, end or middle, I need to apply the rule to the content attribute and adjust start/end indexes to match, no? I'm guessing an obvious "yes" :), so I was wondering how much cleaning needs to be done.
So you would remove the \n
s, bullets, leading / trailing whitespace, but leave standard punctuation like commas, periods, etc?
What about stuff like lowercasing, stop words, lemmatizing, etc?
One concern I'm seeing with a few samples I've looked at is that the start/stop indexes get thrown off by the cleaning they do, because you need to update EVERY annotation as you remove characters to keep them in sync.
I.e.
A 0 -> 100
B 101 -> 150
If I remove a char at position 50, then I need to adjust B to 100 -> 149.
ANSWER
Answered 2021-Dec-28 at 05:19

First, spaCy does no transformation of the input - it takes it literally as-is and preserves the format. So you don't lose any information when you provide text to spaCy.
That said, input to spaCy with the pretrained pipelines will work best if it is in natural sentences with no weird punctuation, like a newspaper article, because that's what spaCy's training data looks like.
To that end, you should remove meaningless white space (like newlines, leading and trailing spaces) or formatting characters (maybe a line of ----?), but that's about all the cleanup you have to do. The spaCy training data won't have bullets, so they might get some weird results, but I would leave them in to start. (Also, bullets are obviously printable characters - maybe you mean non-ASCII?)
I have no idea what you mean by "muck with the indexes", but for some older NLP methods it was common to do more extensive preprocessing, like removing stop words and lowercasing everything. Doing that will make things worse with spaCy because it uses the information you are removing for clues, just like a human reader would.
Note that you can train your own models, in which case they'll learn about the kind of text you show them. In that case you can get rid of preprocessing entirely, though for actually meaningless things like newlines / leading and following spaces you might as well remove them anyway.
To address your new info briefly...
Yes, character indexes for NER labels must be updated if you do preprocessing. If they aren't updated they aren't usable.
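A minimal hedged sketch of what "updating the indexes" means in practice (the helper name and sample data are illustrative, not from the original answer):

def strip_prefix_and_shift(text, annotations, prefix="\n"):
    """Remove a leading prefix and shift (start, end, label) character offsets accordingly."""
    if not text.startswith(prefix):
        return text, annotations
    shift = len(prefix)
    shifted = [(start - shift, end - shift, label) for start, end, label in annotations]
    return text[shift:], shifted

text, anns = strip_prefix_and_shift("\n• Programming languages: Python",
                                    [(3, 24, "Skills")])
print(text)
print(anns)  # offsets now start one character earlier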
It looks like you're trying to extract "skills" from a resume. That has many bullet point lists. The spaCy training data is newspaper articles, which don't contain any lists like that, so it's hard to say what the right thing to do is. I don't think the bullets matter much, but you can try removing or not removing them.
What about stuff like lowercasing, stop words, lemmatizing, etc?
I already addressed this, but do not do this. This was historically common practice for NLP models, but for modern neural models, including spaCy, it is actively unhelpful.
QUESTION
I am building an NLP app using Python. I heard that spaCy is well suited for NLP, so I installed it. How should I use the Japanese model from spaCy?
pip install -u spacy
or
python -m pip -u Spacy
What else should I install?
ANSWER
Answered 2021-Dec-15 at 21:39

You should download and install the Japanese language package:
python -m spacy download ja_core_news_lg
If you face an issue, please try this:
python -m spacy download ja_core_news_sm
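Once the package is downloaded, usage is the same as for other languages. A minimal sketch (the example sentence is illustrative):

import spacy

nlp = spacy.load("ja_core_news_sm")
doc = nlp("私は東京でラーメンを食べました。")
print([(token.text, token.pos_) for token in doc])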
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install spaCy
Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
Python version: Python 3.6+ (only 64 bit)
Package managers: pip · conda (via conda-forge)
Trained pipelines for spaCy can be installed as Python packages. This means that they're a component of your application, just like any other module. Models can be installed using spaCy's download command, or manually by pointing pip to a path or URL.
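As a brief illustration (a minimal sketch, assuming the small English pipeline has been installed first with python -m spacy download en_core_web_sm):

import spacy

# Load a trained pipeline that was installed as a Python package
nlp = spacy.load("en_core_web_sm")
doc = nlp("Trained pipelines for spaCy can be installed as Python packages.")
print([(token.text, token.pos_) for token in doc])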