kandi has reviewed py-lingualytics and identified the functions below as its top functions. This is intended to give you a quick overview of the functionality py-lingualytics implements and help you decide whether it suits your requirements.
Preprocessing
- Remove stopwords
- Remove punctuation, with an option to add punctuation marks of your own language
- Remove words shorter than a character limit

Representation
- Find n-grams from given text

NLP
- Classification using PyTorch
- Train a classifier on your data to perform tasks like Sentiment Analysis
- Evaluate the classifier with metrics like accuracy, f1 score, precision and recall
- Use the trained tokenizer to tokenize text
pip install lingualytics
number of matches for keywords in specified categories
Asked 2022-Apr-14 at 13:32
For a large-scale text analysis problem, I have a data frame containing words that fall into different categories, and a data frame containing a column with strings and (empty) counting columns for each category. I now want to take each individual string, check which of the defined words appear, and count them within the appropriate category.
As a simplified example, given the two data frames below, I want to count how many of each animal type appear in the text cell.
df_texts <- tibble(
  text = c("the ape and the fox",
           "the tortoise and the hare",
           "the owl and the the grasshopper"),
  mammals = NA,
  reptiles = NA,
  birds = NA,
  insects = NA
)

df_animals <- tibble(
  animals = c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
  type = c("mammal", "mammal", "reptile", "mammal", "bird", "insect")
)
So my desired result would be:
df_result <- tibble(
  text = c("the ape and the fox",
           "the tortoise and the hare",
           "the owl and the the grasshopper"),
  mammals = c(2, 1, 0),
  reptiles = c(0, 1, 0),
  birds = c(0, 0, 1),
  insects = c(0, 0, 1)
)
Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?
Thanks in advance!
ANSWER
Answered 2022-Apr-14 at 13:32
Here's a way to do it in the tidyverse. First check whether each string in df_texts$text contains each animal, then sum the matches by text and type.
library(tidyverse)

# One logical column per animal: does the text contain it?
cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>%
  # Long format: one row per (text, animal) pair
  pivot_longer(-text, names_to = "animals") %>%
  # Attach the type of each animal
  left_join(df_animals) %>%
  group_by(text, type) %>%
  summarise(sum = sum(value)) %>%
  # Back to wide format: one column per type
  pivot_wider(id_cols = text, names_from = type, values_from = sum)

  text                                  bird insect mammal reptile
  <chr>                                <int>  <int>  <int>   <int>
1 "the ape and the fox"                    0      0      2       0
2 "the owl and the the \n grasshopper"     1      0      0       0
3 "the tortoise and the hare"              0      0      1       1
To account for several occurrences of the same animal within one text, count the matches with str_count instead of testing for presence with grepl:
# One count column per animal: how often does it occur in the text?
cbind(df_texts[, 1],
      t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = FALSE))) %>%
  setNames(c("text", df_animals$animals)) %>%
  pivot_longer(-text, names_to = "animals") %>%
  left_join(df_animals) %>%
  group_by(text, type) %>%
  summarise(sum = sum(value)) %>%
  pivot_wider(id_cols = text, names_from = type, values_from = sum)
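For readers working in Python, the same count-per-category idea can be sketched with the standard library alone. This is an illustration and not a translation of the answer above: the helper name `count_by_type` is hypothetical, and the sketch matches whole words only (via `\b`), which is stricter than the substring matching that grepl and str_count perform with these patterns.

```python
import re
from collections import Counter

texts = [
    "the ape and the fox",
    "the tortoise and the hare",
    "the owl and the the grasshopper",
]

# keyword -> category, mirroring df_animals from the question
animal_types = {
    "ape": "mammal", "fox": "mammal", "tortoise": "reptile",
    "hare": "mammal", "owl": "bird", "grasshopper": "insect",
}


def count_by_type(text, keyword_types):
    """Count keyword occurrences in text, summed per category."""
    # Start every category at zero so absent categories still appear.
    counts = Counter({category: 0 for category in keyword_types.values()})
    for keyword, category in keyword_types.items():
        # \b restricts matches to whole words, so "ape" does not match "grapes".
        pattern = r"\b" + re.escape(keyword) + r"\b"
        counts[category] += len(re.findall(pattern, text))
    return dict(counts)


for text in texts:
    print(text, count_by_type(text, animal_types))
```

Because `re.findall` returns every non-overlapping match, a keyword that appears twice in a text is counted twice. For a much larger keyword list, compiling one alternation pattern up front would probably be faster than looping over the keywords for every text.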
No vulnerabilities reported