py-lingualytics | A text analytics library with support for codemixed data | Natural Language Processing library

kandi X-RAY | py-lingualytics Summary

py-lingualytics is a Python library typically used in Artificial Intelligence, Natural Language Processing, Deep Learning, PyTorch, BERT, Neural Network, and Transformer applications. py-lingualytics has no reported bugs or vulnerabilities, has a build file available, carries a permissive license, and has low support. You can install it with 'pip install lingualytics' or download it from GitHub or PyPI.
Lingualytics is a Python library for dealing with Indic text. It is powered by libraries such as PyTorch, Transformers, Texthero, NLTK and scikit-learn.

Support

  • py-lingualytics has a low active ecosystem.
  • It has 32 stars, 3 forks, and 2 watchers.
  • It had no major release in the last 12 months.
  • There is 1 open issue, 0 closed issues, and no pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of py-lingualytics is current.

Quality

  • py-lingualytics has 0 bugs and 0 code smells.

Security

  • py-lingualytics has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • py-lingualytics code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • py-lingualytics is licensed under the MIT License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • py-lingualytics has no GitHub releases, so to get the latest code you will need to build from source.
  • A deployable package is available on PyPI.
  • Build file is available. You can build the component from source.
  • Installation instructions, examples and code snippets are available.
  • py-lingualytics saves you 136 person hours of effort in developing the same functionality from scratch.
  • It has 341 lines of code, 23 functions and 7 files.
  • It has high code complexity. Code complexity directly impacts the maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed py-lingualytics and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality py-lingualytics implements, and to help you decide if it suits your requirements.

  • Evaluate the model.
  • Train the model.
  • Get the n-grams from a string.
  • Remove words shorter than a character limit from a string.
  • Remove punctuation.
  • Remove stopwords from a string.
  • Remove leading links.
  • Get a tensor item.
  • Initialize the model.
  • Calculate n-grams from text.

py-lingualytics Key Features

Preprocessing
  • Remove stopwords
  • Remove punctuation, with an option to add punctuation marks of your own language
  • Remove words shorter than a character limit

Representation
  • Find n-grams from given text

NLP
  • Classification using PyTorch
  • Train a classifier on your data to perform tasks like sentiment analysis
  • Evaluate the classifier with metrics like accuracy, F1 score, precision and recall
  • Use the trained tokenizer to tokenize text
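The preprocessing features listed above can be illustrated with a plain-Python sketch. This is not the library's API — the names here (`preprocess`, `EN_STOPWORDS`, `min_length`) are hypothetical stand-ins that only show what the listed steps amount to:

```python
import string

# Hypothetical stand-in for the library's stopword lists; lingualytics
# ships real lists for English and several Indic languages.
EN_STOPWORDS = {"the", "is", "and", "a", "to"}

def preprocess(text, stopwords=EN_STOPWORDS, min_length=3, extra_punct=""):
    """Apply the three listed steps: strip punctuation (plus any
    language-specific marks), drop stopwords, and drop words
    shorter than min_length characters."""
    # Remove ASCII punctuation plus any caller-supplied marks
    table = str.maketrans("", "", string.punctuation + extra_punct)
    text = text.translate(table)
    words = text.lower().split()
    # Keep only non-stopwords that meet the character limit
    return [w for w in words if w not in stopwords and len(w) >= min_length]

print(preprocess("The owl and the grasshopper, obviously!"))
# → ['owl', 'grasshopper', 'obviously']
```

The `extra_punct` parameter mirrors the "add punctuations of your own language" option: pass extra marks (e.g. the Devanagari danda "।") and they are stripped along with ASCII punctuation.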

py-lingualytics Examples and Code Snippets

  • 💾 Installation
  • Preprocessing
  • Classification
  • Find topmost n-grams

💾 Installation

pip install lingualytics
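The "Find topmost n-grams" snippet itself is not reproduced on this page. As a rough illustration of what such a feature computes, here is a hypothetical plain-Python version (`top_ngrams` is not a lingualytics function):

```python
from collections import Counter

def top_ngrams(text, n=2, k=5):
    """Return the k most frequent n-grams in text as (word-tuple, count) pairs."""
    words = text.lower().split()
    # Slide a window of length n over the word list
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return Counter(grams).most_common(k)

sample = "the owl and the grasshopper and the owl"
print(top_ngrams(sample, n=2, k=2))
# → [(('the', 'owl'), 2), (('and', 'the'), 2)]
```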

Community Discussions

Trending Discussions on Natural Language Processing
  • number of matches for keywords in specified categories
  • Apple's Natural Language API returns unexpected results
  • Tokenize text but keep compound hyphenated words together
  • Create new boolean fields based on specific bigrams appearing in a tokenized pandas dataframe
  • ModuleNotFoundError: No module named 'milvus'
  • Which model/technique to use for specific sentence extraction?
  • Assigning True/False if a token is present in a data-frame
  • How to calculate perplexity of a sentence using huggingface masked language models?
  • Mapping values from a dictionary's list to a string in Python
  • What are differences between AutoModelForSequenceClassification vs AutoModel

QUESTION

number of matches for keywords in specified categories

Asked 2022-Apr-14 at 13:32

For a large scale text analysis problem, I have a data frame containing words that fall into different categories, and a data frame containing a column with strings and (empty) counting columns for each category. I now want to take each individual string, check which of the defined words appear, and count them within the appropriate category.

As a simplified example, given the two data frames below, I want to count how many animals of each type appear in the text cell.

df_texts <- tibble(
  text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
  grasshopper"),
  mammals=NA,
  reptiles=NA,
  birds=NA,
  insects=NA
)

df_animals <- tibble(animals=c("ape", "fox", "tortoise", "hare", "owl", "grasshopper"),
           type=c("mammal", "mammal", "reptile", "mammal", "bird", "insect"))

So my desired result would be:

df_result <- tibble(
  text=c("the ape and the fox", "the tortoise and the hare", "the owl and the the 
  grasshopper"),
  mammals=c(2,1,0),
  reptiles=c(0,1,0),
  birds=c(0,0,1),
  insects=c(0,0,1)
)

Is there a straightforward way to achieve this keyword-matching-and-counting that would be applicable to a much larger dataset?

Thanks in advance!

ANSWER

Answered 2022-Apr-14 at 13:32

Here's a way to do it in the tidyverse. First look at whether strings in df_texts$text contain the animals, then count them and sum by text and type.

library(tidyverse)

cbind(df_texts[, 1], sapply(df_animals$animals, grepl, df_texts$text)) %>% 
  pivot_longer(-text, names_to = "animals") %>% 
  left_join(df_animals) %>% 
  group_by(text, type) %>% 
  summarise(sum = sum(value)) %>% 
  pivot_wider(id_cols = text, names_from = type, values_from = sum)

  text                                   bird insect mammal reptile
  <chr>                                 <int>  <int>  <int>   <int>
1 "the ape and the fox"                     0      0      2       0
2 "the owl and the the \n  grasshopper"     1      0      0       0
3 "the tortoise and the hare"               0      0      1       1

To account for several occurrences of the same animal within a text:

cbind(df_texts[, 1], t(sapply(df_texts$text, str_count, df_animals$animals, USE.NAMES = F))) %>% 
  setNames(c("text", df_animals$animals)) %>% 
  pivot_longer(-text, names_to = "animals") %>% 
  left_join(df_animals) %>% 
  group_by(text, type) %>% 
  summarise(sum = sum(value)) %>% 
  pivot_wider(id_cols = text, names_from = type, values_from = sum)
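For readers working in Python rather than R, the same keyword-matching-and-counting can be sketched with the standard library. This is a hypothetical translation, not part of the original answer; like the second R variant, it counts repeated occurrences:

```python
import re

texts = ["the ape and the fox",
         "the tortoise and the hare",
         "the owl and the the grasshopper"]
# Keyword -> category mapping, equivalent to df_animals
animals = {"ape": "mammal", "fox": "mammal", "tortoise": "reptile",
           "hare": "mammal", "owl": "bird", "grasshopper": "insect"}

def count_by_type(text, keyword_types):
    """Count whole-word keyword matches in text, summed per category."""
    counts = {kind: 0 for kind in keyword_types.values()}
    for word, kind in keyword_types.items():
        # \b anchors ensure 'owl' does not match inside e.g. 'fowl'
        counts[kind] += len(re.findall(rf"\b{re.escape(word)}\b", text))
    return counts

results = [count_by_type(t, animals) for t in texts]
print(results[0])
# → {'mammal': 2, 'reptile': 0, 'bird': 0, 'insect': 0}
```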

Source https://stackoverflow.com/questions/71871613

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install py-lingualytics

Use the package manager pip to install lingualytics.

