QUESTION
Pandas - Keyword count by Category
Asked 2022-Apr-04 at 13:41
I am trying to get a count of the most frequently occurring words in my df, grouped by another column's values.
I have a dataframe like so:
df=pd.DataFrame({'Category':['Red','Red','Blue','Yellow','Blue'],'Text':['this is very good ','good','dont like','stop','dont like']})
This is the way that I have counted the keywords in the Text column:
from collections import Counter

import nltk
import pandas as pd

top_N = 100
# Needs the NLTK stopwords corpus: run nltk.download('stopwords') once
stopwords = nltk.corpus.stopwords.words('english')
# RegEx that matches any stopword as a whole word
RE_stopwords = r'\b(?:{})\b'.format('|'.join(stopwords))
# replace '|' --> ' ', drop all stopwords, then join every row into one word list
words = (df.Text
         .str.lower()
         .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
         .str.cat(sep=' ')
         .split()
)
# generate DF out of Counter
df_top_words = pd.DataFrame(Counter(words).most_common(top_N),
                            columns=['Word', 'Frequency']).set_index('Word')
print(df_top_words)
Which produces this result:
However, this just counts every word across the whole data frame. What I am after is something along the lines of this:
ANSWER
Answered 2022-Apr-04 at 13:11
Your words statement finds the words that you care about (removing stopwords) in the text of the whole column. We can change that a bit to apply the replacement on each row instead:
df["Text"] = (
    df["Text"]
    .str.lower()
    .replace([r'\|', RE_stopwords], [' ', ''], regex=True)
    .str.strip()
    # .str.cat(sep=' ')
    .str.split()  # Previously .split()
)
Resulting in:
  Category          Text
0      Red        [good]
1      Red        [good]
2     Blue  [dont, like]
3   Yellow        [stop]
4     Blue  [dont, like]
Now we can use .explode to expand each list element into its own row, and then .groupby and .size to count how many times each word appears within each category:
df.explode("Text").groupby(["Category", "Text"]).size()
Resulting in:
Category  Text
Blue      dont    2
          like    2
Red       good    2
Yellow    stop    1
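For anyone who wants to run the two steps above end to end, here is a minimal self-contained sketch. It assumes a tiny hard-coded stopword list in place of nltk.corpus.stopwords.words('english') so that it runs without downloading the NLTK corpus. The names tokens and counts, and the re.escape call, are illustrative additions and not part of the original answer.

```python
import re

import pandas as pd

# Assumption: a small stand-in stopword list. The original answer uses
# nltk.corpus.stopwords.words('english'), which needs a corpus download.
stopwords = ["this", "is", "very"]
RE_stopwords = r"\b(?:{})\b".format("|".join(map(re.escape, stopwords)))

df = pd.DataFrame({
    "Category": ["Red", "Red", "Blue", "Yellow", "Blue"],
    "Text": ["this is very good ", "good", "dont like", "stop", "dont like"],
})

# Clean each row separately and split it into a list of words
tokens = (
    df["Text"]
    .str.lower()
    .replace([r"\|", RE_stopwords], [" ", ""], regex=True)
    .str.split()
)

# One row per word, then count (Category, word) pairs
counts = (
    df.assign(Text=tokens)
    .explode("Text")
    .groupby(["Category", "Text"])
    .size()
)
print(counts)
```

Using df.assign here leaves the original Text column untouched, which can be convenient if the raw text is needed later.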
Now, this does not match your output sample because in that sample you're not applying the .replace step from the original words statement (now used to calculate the new value of the "Text" column). If you want that result, just comment out the .replace line (but I guess that's the whole point of this question).
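Since the question also sets a top_N cutoff, one possible way to keep only the most frequent words in each category is sketched below. This is an assumption about what might be wanted, not something the original answer covers. It starts from the token lists shown in the per-row result above, and top_n, counts and top_words are hypothetical names.

```python
import pandas as pd

# Token lists as they look after the per-row cleaning step
df = pd.DataFrame({
    "Category": ["Red", "Red", "Blue", "Yellow", "Blue"],
    "Text": [["good"], ["good"], ["dont", "like"], ["stop"], ["dont", "like"]],
})

top_n = 1  # hypothetical cutoff: keep this many words per category

# Count (Category, word) pairs and turn the result into a flat table
counts = (
    df.explode("Text")
    .groupby(["Category", "Text"])
    .size()
    .reset_index(name="Frequency")
)

# Sort by frequency within each category (ties broken alphabetically),
# then keep the first top_n rows of every category
top_words = (
    counts.sort_values(
        ["Category", "Frequency", "Text"], ascending=[True, False, True]
    )
    .groupby("Category")
    .head(top_n)
    .reset_index(drop=True)
)
print(top_words)
```

The alphabetical tie-break is only there to make the result deterministic when two words have the same count, as dont and like do in the Blue category.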
Community Discussions, Code Snippets contain sources that include Stack Exchange Network