How to Remove Tokens Like Symbols, Punctuation, and Numbers with spaCy in Python

by vigneshchennai74 Updated: Jan 31, 2023

Solution Kit

spaCy is an open-source library for advanced natural language processing in Python. It was created expressly for use in production environments and helps you build programs that process and "understand" large volumes of text. Fast, efficient tokenization is one of its main advantages. spaCy is frequently used in business, academic, and government research projects for tasks including information extraction, named entity recognition, and part-of-speech tagging, and as a building block in machine translation and text summarization systems.

Additionally, spaCy offers tools for standard tasks like text classification, working with word vectors and similarity, and more. Because every token carries attributes that describe what kind of token it is, you can filter unwanted tokens such as symbols, punctuation, and numerals out of a text in a few ways. Some examples include:

Eliminating common stop words: spaCy has a built-in list of stop words, like "and," "or," and "the," that you can remove from your text. The Token.is_stop attribute tells you whether a token is on that list.
Eliminating punctuation: Use the Token.is_punct attribute to check whether a token is punctuation, and drop it from the text if so.
Removing numbers: Use the Token.like_num attribute to check whether a token looks like a number (for example "10" or "ten"), and drop it from the text if so.
Removing symbols: Use the Token.is_alpha and Token.is_digit attributes to check whether a token consists of letters or digits; a token that is neither can be treated as a symbol and dropped.
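The attribute checks listed above can be combined into one small filter. The following is a minimal sketch, not the solution code from this kit: the function name filter_tokens and its keyword flags are made up for illustration, and it uses a blank English pipeline so that no model download is needed. It assumes spaCy 3.x is installed.

```python
import spacy

# A blank English pipeline provides the tokenizer and the lexical
# attributes (is_stop, is_punct, like_num, ...) without a trained model.
nlp = spacy.blank("en")


def filter_tokens(text, stops=True, punct=True, numbers=True, symbols=True):
    """Return the token texts that survive the enabled filters.

    filter_tokens and its flags are hypothetical names for this sketch.
    """
    kept = []
    for token in nlp(text):
        if token.is_space:
            continue  # whitespace-only tokens are never kept
        if stops and token.is_stop:
            continue
        if punct and token.is_punct:
            continue
        if numbers and token.like_num:
            continue
        # Assumption: anything that is neither alphabetic nor a digit
        # string counts as a "symbol" here.
        if symbols and not (token.is_alpha or token.is_digit):
            continue
        kept.append(token.text)
    return kept


print(filter_tokens("The 3 cats, and 2 dogs!"))
```

Note that the symbol rule in this sketch is deliberately crude: it also drops mixed tokens such as contractions or e-mail addresses, so you may want to loosen it for real text.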

Here is how you can remove tokens like symbols, punctuation, and numbers in SpaCy:


Code

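In this solution we use the token attributes from spaCy's spacy.attrs module, together with Doc.to_array and Doc.from_array, to copy the remaining tokens into a new document.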

How to filter tokens from spaCy document

Python | Lines of Code: 23 | License: Strong Copyleft (CC BY-SA 4.0)

Dependent Libraries :

import spacy
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc
import numpy

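def remove_tokens_on_match(doc):
    # Collect the positions of tokens tagged as punctuation, numbers, or symbols.
    indexes = []
    for index, token in enumerate(doc):
        if token.pos_ in ('PUNCT', 'NUM', 'SYM'):
            indexes.append(index)
    # Export the attributes to carry over, then drop the rows of the removed tokens.
    np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
    np_array = numpy.delete(np_array, indexes, axis=0)
    # Build a new Doc from the remaining words and restore their attributes.
    doc2 = Doc(doc.vocab, words=[t.text for i, t in enumerate(doc) if i not in indexes])
    doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
    return doc2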

# Load the small English model (spaCy 3.x; older versions used spacy.load('en'))
nlp = spacy.load('en_core_web_sm')
doc = nlp('This document is only an example. '
          'I would like to create a custom pipeline that will remove specific tokens from '
          'the final document.')
print(remove_tokens_on_match(doc))

Copy the code using the "Copy" button above and paste it into a Python file in your IDE.
Enter the text you want to process.
Run the file to remove symbols, numbers, and punctuation.

I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.

I found this code snippet by searching for "How to filter tokens from spaCy document" in kandi. You can try any such use case!
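The sample text in the code talks about a custom pipeline that removes specific tokens. One way this could look in spaCy 3.x is a pipeline component that returns a new document containing only the tokens to keep. This is a rough sketch under a few assumptions: the component name drop_punct_and_numbers is invented for illustration, it relies on the lexical attributes is_punct and like_num rather than part-of-speech tags (so it runs on a blank pipeline), and it does not carry over annotations such as entities.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc


# "drop_punct_and_numbers" is a hypothetical component name for this sketch.
@Language.component("drop_punct_and_numbers")
def drop_punct_and_numbers(doc):
    # Keep every token that is neither punctuation nor number-like.
    kept = [t for t in doc if not (t.is_punct or t.like_num)]
    words = [t.text for t in kept]
    # Preserve whether each kept token was followed by a space.
    spaces = [bool(t.whitespace_) for t in kept]
    # A Doc cannot shrink in place, so return a fresh one.
    return Doc(doc.vocab, words=words, spaces=spaces)


# Blank pipeline: tokenizer only, no model download required.
nlp = spacy.blank("en")
nlp.add_pipe("drop_punct_and_numbers")

doc = nlp("I have 2 cats, ok?")
print([token.text for token in doc])
```

With a trained model you would typically add the component last, so that the tagger and other components see the original, unfiltered text first.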

Note

In this snippet we are using a language model (en_core_web_sm).

Download the model by running the command python -m spacy download en_core_web_sm in your terminal.

Check your spaCy version by running pip show spacy in your terminal.

If the version is 3.0 or above, load the model using nlp = spacy.load("en_core_web_sm")
If the version is below 3.0, load the model using nlp = spacy.load("en")
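The version rule in this note can also be applied in code instead of by hand. The sketch below is only an illustration of that rule: the helper name model_name_for is made up, and it simply looks at the major version number of a version string.

```python
def model_name_for(version):
    """Pick the name to pass to spacy.load() for a given spaCy version string.

    model_name_for is a hypothetical helper that encodes the rule from the
    note: 3.0 and above use "en_core_web_sm", older versions use "en".
    """
    major = int(version.split(".")[0])
    return "en_core_web_sm" if major >= 3 else "en"


# Typical use (requires spaCy and the downloaded model):
#   import spacy
#   nlp = spacy.load(model_name_for(spacy.__version__))
print(model_name_for("3.4.3"))
```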

Environment Test

I tested this solution in the following versions. Be mindful of changes when working with other versions.

The solution was created in Python 3.7.15.
The solution was tested on spaCy 3.4.3.
The solution was tested on numpy 1.21.6.
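To see how your own environment compares with the versions listed above, a small check like the following may help. It is a sketch only: the TESTED dictionary repeats the versions from this section, while the helper names and the "same major.minor" comparison are assumptions made for illustration, not an official compatibility rule.

```python
import sys

# Versions this solution was tested with, as listed in this section.
TESTED = {"python": "3.7.15", "spacy": "3.4.3", "numpy": "1.21.6"}


def same_major_minor(installed, tested):
    """Return True when both version strings share major and minor numbers."""
    return installed.split(".")[:2] == tested.split(".")[:2]


def installed_versions():
    """Collect the running Python version and any installed library versions."""
    versions = {"python": "{}.{}.{}".format(*sys.version_info[:3])}
    for name in ("spacy", "numpy"):
        try:
            versions[name] = __import__(name).__version__
        except ImportError:
            versions[name] = None  # library is not installed
    return versions


for name, found in installed_versions().items():
    if found is None:
        print(name, "is not installed")
    elif same_major_minor(found, TESTED[name]):
        print(name, found, "matches the tested release line")
    else:
        print(name, found, "differs from the tested version", TESTED[name])
```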

Using this solution, we can remove symbols, punctuation, and numbers from text in Python with the help of the spaCy library. It also provides an easy-to-use, hassle-free way to build a hands-on working version of the code that removes these tokens.

Dependent Library

numpy by numpy

Python 23755 Version: v1.25.0rc1 License: Permissive (BSD-3-Clause)

The fundamental package for scientific computing with Python.


spaCy by explosion

Python 26383 Version: v3.2.6 License: Permissive (MIT)

💫 Industrial-strength Natural Language Processing (NLP) in Python


If you do not have spaCy and numpy, which are required to run this code, you can install them by following the links above and copying the pip install command from each library's page on kandi.

You can search for any dependent library on kandi, such as spaCy and numpy.

Support

For any support on kandi solution kits, please use the chat
For further learning resources, visit the Open Weaver Community learning page.
