Remove Stopwords and Get Lemmas in a Pandas DataFrame Using spaCy

by vigneshchennai74 Updated: Feb 20, 2023

Solution Kit

Lemmatization is a text preprocessing technique that can reduce inflected forms of words to their base or dictionary forms, known as lemmas. By reducing words to their base forms, lemmatization can help to normalize the text data and make it more amenable to analysis and modeling. For example, the lemma of the word "running" is "run", and the lemma of "rocks" is "rock".
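
As a minimal sketch of the idea, the snippet below maps inflected forms to lemmas with a small hand-written lookup table. The table `LEMMA_TABLE` and the helper `toy_lemmatize` are hypothetical names made up for this illustration; this is not how spaCy lemmatizes internally, only a way to see the input and output of the technique.

```python
# Toy lemma table, written by hand for illustration only (an assumption,
# not data taken from spaCy or any other library).
LEMMA_TABLE = {
    "running": "run",
    "ran": "run",
    "rocks": "rock",
    "pizzas": "pizza",
}


def toy_lemmatize(words):
    """Return the lemma of each word, falling back to the lowercased word."""
    return [LEMMA_TABLE.get(word.lower(), word.lower()) for word in words]


print(toy_lemmatize(["Running", "rocks", "song"]))  # ['run', 'rock', 'song']
```

Words missing from the table pass through unchanged, which is why a real lemmatizer needs rules or a trained model rather than a fixed list.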

spaCy is a natural language processing (NLP) library that provides tools for processing and analyzing textual data. The en_core_web_sm language model is provided by spaCy. It is a pre-trained statistical model that enables natural language processing for English text.

The en_core_web_sm model is a small model that includes a vocabulary, syntax, and named entity recognition, among other features. It can process text to identify linguistic features such as parts of speech, dependencies between words, and named entities.

The code performs basic preprocessing to make the text data more amenable to analysis and modeling. Removing stop words drops very common words, such as "the" or "was", that carry little meaning on their own, and lemmatizing removes inflectional affixes that may not be relevant to the meaning of the text. Together these steps reduce noise in the text data, which can improve the accuracy and interpretability of downstream analysis or modeling tasks.
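
One way to see the noise reduction is to count distinct tokens before and after preprocessing. The sketch below uses a tiny hand-picked stop list and lemma table (`STOP_WORDS`, `LEMMAS`, and `preprocess` are hypothetical names, and the sentences are made-up sample data), so it only illustrates the effect and does not reproduce spaCy's behavior.

```python
# Hand-picked toy resources for illustration only (assumptions, not spaCy data).
STOP_WORDS = {"the", "was", "were", "i"}
LEMMAS = {"pizzas": "pizza", "liked": "like"}

sentences = [
    ("the", "pizza", "was", "great"),
    ("the", "pizzas", "were", "great"),
    ("i", "liked", "the", "pizza"),
]


def preprocess(tokens):
    """Drop stop words, then replace each remaining token with its lemma."""
    return [LEMMAS.get(tok, tok) for tok in tokens if tok not in STOP_WORDS]


# Vocabulary = set of distinct tokens across all sentences.
raw_vocab = {tok for sent in sentences for tok in sent}
clean_vocab = {tok for sent in sentences for tok in preprocess(sent)}

print(len(raw_vocab), len(clean_vocab))  # 8 3
```

The vocabulary shrinks because stop words disappear and "pizza" and "pizzas" collapse into one entry.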

Code

In this solution, we use two small helper functions, remove_stops and lemmatize, which rely on spaCy's token.is_stop and token.lemma_ attributes.

How to remove stop words and get lemmas in a pandas data frame using spacy?

Language: Python, Lines of Code: 32, License: Strong Copyleft (CC BY-SA 4.0)

import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')

texts = [('the','cheeseburger','was','great'),
         ('i','never','did','like','the','pizzas','too','much'), 
         ('yellowed','submarines','was','only','an','ok','song')]

df = pd.DataFrame({'word_tokens': texts})

def to_doc(words: tuple) -> spacy.tokens.Doc:
    # Create a spaCy Doc by joining the words into a single string
    return nlp(' '.join(words))

def remove_stops(doc) -> list:
    # Filter out stop words by using the `token.is_stop` attribute
    return [token.text for token in doc if not token.is_stop]

def lemmatize(doc) -> list:
    # Take the `token.lemma_` of each non-stop word
    return [token.lemma_ for token in doc if not token.is_stop]

# create documents for all tuples of tokens
docs = list(map(to_doc, df.word_tokens))

# apply removing stop words to all
df['removed_stops'] = list(map(remove_stops, docs))

# apply lemmatization to all
df['lemmatized'] = list(map(lemmatize, docs))

Copy the code above and paste it into a Python file in your IDE.
Replace the sample tuples in texts with your own tokenized text.
Run the file to remove stopwords and lemmatize the remaining words.

I hope you found this useful. I have added links to the dependent libraries and version information in the following sections.

I found this code snippet by searching for "how to remove stopwords and get lemmas in a pandas data frame using spacy" in kandi. You can try any such use case.

Environment Tested

I tested this solution in the following versions. Be mindful of changes when working with other versions.

The solution was created in Python 3.7.15.
The solution was tested on spaCy 3.4.3.
The solution was tested on pandas 1.3.5.
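
To compare your own environment against the versions listed above, a small check like the following can help. It is a sketch, and `installed_versions` is a hypothetical helper name; it reports `None` for any package that cannot be imported instead of raising an error.

```python
import importlib
import sys


def installed_versions(packages=("spacy", "pandas")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```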

Using this solution, we can remove stopwords and lemmatize text with the help of spaCy's token attributes. The process also provides an easy-to-use, hassle-free way to create a hands-on working version of code that removes stopwords from text in Python.

Dependent Library

spaCy by explosion

Python

26383

Version:v3.2.6

License: Permissive (MIT)

💫 Industrial-strength Natural Language Processing (NLP) in Python

pandas by pandas-dev

Python

38689

Version:v2.0.2

License: Permissive (BSD-3-Clause)

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more

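If you do not have spaCy, which is required to run this code, you can install it with the pip install command from the spaCy page in kandi (pip install spacy). The en_core_web_sm model must also be downloaded once before spacy.load can find it, for example with python -m spacy download en_core_web_sm.

You can search for any dependent library, such as spaCy, on kandi.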

Support

For any support on kandi solution kits, please use the chat
For further learning resources, visit the Open Weaver Community learning page
