textclean | Tools for cleaning and normalizing text data | Regex library

 by trinker | R | Version: 0.8.0 | License: No License

kandi X-RAY | textclean Summary

textclean is an R library typically used in Utilities and Regex applications. textclean has no bugs, no reported vulnerabilities, and low support. You can download it from GitHub.

textclean is a collection of tools to clean and normalize text. Many of these tools have been taken from the qdap package and revamped to be more intuitive, better named, and faster. The tools are geared at checking for substrings that are not optimal for analysis and replacing or removing them (normalizing) with more analysis-friendly substrings (see Sproat, Black, Chen, Kumar, Ostendorf, & Richards, 2001, doi:10.1006/csla.2001.0169), or extracting them into new variables. For example, emoticons are often used in text but not always easily handled by analysis algorithms. The replace_emoticon() function replaces emoticons with word equivalents. Other R packages provide some of the same functionality (e.g., english, gsubfn, mgsub, stringi, stringr, and others).
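As a quick illustration of the emoticon replacement mentioned above (a minimal sketch; the exact replacement words come from the package's internal lexicon):

    library(textclean)

    x <- c("This movie was great :-)", "Stuck in traffic again :(")
    replace_emoticon(x)
    # Emoticons are swapped for word equivalents such as "smiley" / "frown"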

            Support

              textclean has a low active ecosystem.
              It has 166 stars, 26 forks, and 11 watchers.
              It had no major release in the last 12 months.
              There are 3 open issues and 49 closed issues; on average, issues are closed in 97 days. There are 2 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of textclean is 0.8.0.

            Quality

              textclean has 0 bugs and 0 code smells.

            Security

              textclean has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              textclean code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              textclean does not have a standard license declared.
              Check the repository for any license declaration and review the terms closely.
              Without a license, all rights are reserved, and you cannot use the library in your applications.

            Reuse

              textclean releases are available to install and integrate.
              Installation instructions are not available. Examples and code snippets are available.

            Top functions reviewed by kandi - BETA

            kandi's functional review helps you automatically verify the functionalities of the libraries and avoid rework.
            Currently covering the most popular Java, JavaScript and Python libraries.

            textclean Key Features

            No Key Features are available at this moment for textclean.

            textclean Examples and Code Snippets

            No Code Snippets are available at this moment for textclean.

            Community Discussions

            QUESTION

            Incorrect day and year format after extracting "MM-DD-YYY" variable
            Asked 2022-Apr-10 at 17:56

            I am trying to generate day, month, and year variables based on the string values of a "date" variable, which is formatted as "27-02-2012" or "DD-MM-YYYY".

            ...

            ANSWER

            Answered 2022-Apr-10 at 17:56
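            The accepted answer is not reproduced above; below is a hedged base-R sketch of one common approach (the column name date is taken from the question, the sample values are placeholders):

            # Example data frame with dates stored as "DD-MM-YYYY" strings
            df <- data.frame(date = c("27-02-2012", "03-11-2015"))

            d <- as.Date(df$date, format = "%d-%m-%Y")
            df$day   <- as.integer(format(d, "%d"))
            df$month <- as.integer(format(d, "%m"))
            df$year  <- as.integer(format(d, "%Y"))
            df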

            QUESTION

            Excel date format not imported correctly into R
            Asked 2022-Apr-08 at 22:41

            I am importing an Excel file into R, where the date format in Excel is "27-02-2012". However, once I import the dataset into R with the code below:

            ...

            ANSWER

            Answered 2022-Apr-08 at 22:41

            Try using the col_types parameter
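            A hedged sketch of that suggestion, assuming the file is read with readxl (the file name and two-column layout are placeholders):

            library(readxl)

            # Tell read_excel to parse the first column as a date rather than guessing
            df <- read_excel("data.xlsx", col_types = c("date", "guess"))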

            Source https://stackoverflow.com/questions/71801578

            QUESTION

            Sklearn Pipeline and original model aren't the same answer "Fixed Output"
            Asked 2022-Jan-01 at 09:30

            I'm developing simple text classification for SMS and the full model will be 3 steps:

            1. TextCleaning() "Custom function"
            2. TfidfVectorizer() "Vectorizer"
            3. MultinomialNB() "Classification model"

            I wanted to merge the 3 steps into one model using sklearn.pipeline and save the model using joblib.dump. The problem is that when I load the saved model, the output is fixed every time: for any test or training data from the spam class, I get ham!

            This is the custom function before Pipeline :

            ...

            ANSWER

            Answered 2022-Jan-01 at 09:30

            In your notebook, you are doing:

            Source https://stackoverflow.com/questions/70509222

            QUESTION

            R - mgsub problem: substrings being replaced not whole strings
            Asked 2021-Nov-04 at 19:58

            I have downloaded the street abbreviations from USPS. Here is the data:

            ...

            ANSWER

            Answered 2021-Nov-03 at 10:26
            Update

            Here is the benchmarking of the existing answers to the OP's question (borrowing the test data from @Marek Fiołka, but with n <- 10000).
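            The benchmarked code is not reproduced above; below is a hedged sketch of the underlying fix, using base-R gsub() with word boundaries so that only whole abbreviations are replaced (the abbreviation table is illustrative, not the full USPS list):

            abbr <- c("ST", "AVE", "BLVD")
            full <- c("STREET", "AVENUE", "BOULEVARD")

            x <- "123 MAIN ST APT 4"
            for (i in seq_along(abbr)) {
              # \\b anchors the pattern to word boundaries, so "ST" inside another
              # token is left untouched
              x <- gsub(paste0("\\b", abbr[i], "\\b"), full[i], x)
            }
            x
            #> [1] "123 MAIN STREET APT 4"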

            Source https://stackoverflow.com/questions/69467651

            QUESTION

            How can I replace emojis with text and treat them as single words?
            Asked 2021-May-18 at 15:56

            I have to do topic modeling based on pieces of text containing emojis with R. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results.

            A red heart emoji is translated as "red heart ufef". These words are then treated separately during the analysis and compromise the results.

            Terms like "heart" can have a very different meaning, as can be seen with "red heart ufef" and "broken heart". The function replace_emoji_identifier() doesn't help either, as the identifiers make an analysis hard.

            Dummy data set, reproducible by using dput() (including the step to force to lowercase):

            ...

            ANSWER

            Answered 2021-May-18 at 15:56

            Answer

            Replace the default conversion table in replace_emoji with a version where the spaces/punctuation is removed:
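            The accepted answer's full code is not shown above; below is a hedged sketch of the idea, assuming replace_emoji() exposes its conversion table through the emoji_dt argument and defaults to lexicon::hash_emojis (argument and column names are assumptions, not verified here):

            library(textclean)
            library(data.table)

            # Copy the default emoji lookup table and strip spaces/punctuation from the
            # replacement text (column y) so each emoji maps to a single token
            hash <- data.table::copy(lexicon::hash_emojis)
            hash[, y := gsub("[[:space:][:punct:]]", "", y)]

            replace_emoji(x, emoji_dt = hash)  # x: the character vector of documents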

            Source https://stackoverflow.com/questions/67576269

            QUESTION

            R: tokenize n-grams but not strip punctuations
            Asked 2020-Dec-08 at 18:26

            I am trying to conduct tokenization of n-grams (between 1 (minimum) and 3 (maximum)) on my data. After applying this function, I can see that it strips some relevant words such as [sad] (words that I have converted from emojis).

            For example the input is:

            • I dislike lemons [sad]

            When I apply the n-gram tokenizer and assess their frequency (which are separated by "_"), the output for sad appears like this (bear in mind that I am only printing the top 100 n-grams; other words are included, but I want to assess this one specifically):

            • [_sad]
            • [_sad _]

            How do I make sure that "[" is not stripped during tokenization of the n-grams (i.e., so that the token remains [sad])?

            This is my code; I am using the quanteda package:

            ...

            ANSWER

            Answered 2020-Dec-06 at 22:26

            I played around with this a little bit -- and I am thinking you should convert your [ and ] characters to something unique but alphanumeric. It seems like {quanteda} wants to parse tokens that contain or are adjacent to special characters in this way -- and not consider them part of the "word" per se. Since your concept of "[sad]" is a single word, then to tokenize it, just do something that distinguishes it from regular "sad".

            I use gsub and search for the patterns "\\[" and "\\]" respectively. [ is a regular expression special character, so you need to escape it with two backslashes. I replace the first with the word "emoji" and the second with "" to form "emojisad".

            Note how the period at the end of the sentence is handled. You said you stripped out punctuation, but this behavior seems like a "feature" not a bug.
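            A hedged sketch of that pre-tokenization step (the full quanteda pipeline from the answer is not reproduced here):

            # Convert "[sad]" into the single alphanumeric token "emojisad" before tokenizing
            txt <- "I dislike lemons [sad]."
            txt <- gsub("\\[", "emoji", txt)  # "[" is a regex metacharacter, hence the escape
            txt <- gsub("\\]", "", txt)
            txt
            #> [1] "I dislike lemons emojisad."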

            Source https://stackoverflow.com/questions/65173224

            QUESTION

            Function to replace incorrectly spelled words with correctly spelled words in R?
            Asked 2020-Oct-09 at 10:08

            I've built a spell-check function for a sample of 1000 rows to ensure its efficiency, using the 'hunspell' package and the Australian English dictionary. The spell-checker ignores abbreviations. My actual data has close to 2 million lines, so I need to convert the 'for' loops into 'apply'-family functions.

            I'm almost there, but the last part isn't working. Below are the original for-loop functions:

            ...

            ANSWER

            Answered 2020-Oct-09 at 10:08

            This will identify and replace incorrectly spelt words with the correct spelling. Note that it will ignore abbreviations as desired, and it assumes all words are separated by a space.
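            The accepted answer's code is not shown above; below is a hedged sketch of the same idea built on the hunspell package (it assumes an Australian English dictionary is installed for hunspell and that words are separated by single spaces):

            library(hunspell)

            correct_text <- function(txt, dict = dictionary("en_AU")) {
              words <- strsplit(txt, " ", fixed = TRUE)[[1]]
              bad   <- !hunspell_check(words, dict = dict)
              # Treat all-caps tokens as abbreviations and leave them unchanged
              bad[grepl("^[[:upper:]]+$", words)] <- FALSE
              if (any(bad)) {
                suggestions <- hunspell_suggest(words[bad], dict = dict)
                # Use the first suggestion when one exists; otherwise keep the original word
                words[bad] <- mapply(function(w, s) if (length(s)) s[[1]] else w,
                                     words[bad], suggestions, USE.NAMES = FALSE)
              }
              paste(words, collapse = " ")
            }

            sapply(c("the repot was finallised", "ABS data are fine"), correct_text)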

            Source https://stackoverflow.com/questions/62755026

            QUESTION

            replace_emoticon function incorrectly replaces characters within a word - R
            Asked 2020-Jun-09 at 15:37

            I am working in R and using the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:

            ...

            ANSWER

            Answered 2020-Jun-09 at 15:37

            Wiktor is right: the word boundary check is causing the issue. I have adjusted it slightly in the function below. There is still one issue, namely when the emoticon is immediately followed by a word without a space between the emoticon and the word. The question is whether that last issue is important or not. See the examples below.
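            The answer's adjusted function is not reproduced above; below is a hedged sketch of one way to require whitespace around a match (it assumes the lexicon::hash_emoticons lookup table with columns x, the emoticon, and y, the word equivalent, and it shares the limitation noted above: an emoticon glued directly to the next word is missed):

            library(textclean)

            replace_emoticon_safe <- function(x, emoticon_dt = lexicon::hash_emoticons) {
              # Pad with spaces so the string edges behave like word gaps, then only
              # replace emoticons that stand alone between spaces
              out <- paste0(" ", x, " ")
              for (i in seq_len(nrow(emoticon_dt))) {
                out <- gsub(paste0(" ", emoticon_dt$x[i], " "),
                            paste0(" ", emoticon_dt$y[i], " "),
                            out, fixed = TRUE)
              }
              trimws(out)
            }

            # Stand-alone emoticons are replaced; look-alike substrings inside words are not
            replace_emoticon_safe("well done :P team")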

            Note: I added this info to the issue tracker with textclean.

            Source https://stackoverflow.com/questions/62270337

            QUESTION

            Replace Emojis in R with replace_emoji() function does not work due to different encoding - UTF8/Unicode?
            Asked 2020-Jun-08 at 09:47

            I am trying to clean my text data and replace Emojis with words so that I can perform a sentiment analysis later on.

            Therefore, I am using the replace_emoji function from the textclean package. This should replace all emojis with their corresponding words.

            The dataset I am working with is a text corpus, that is also the reason why I used the VCorpus function from the tm package in my sample code below:

            ...

            ANSWER

            Answered 2020-Jun-08 at 09:47

            I found an answer to your question. I will mark this one as a duplicate later today when you read my answer.

            Using my example:

            Source https://stackoverflow.com/questions/62248654

            QUESTION

            Extract emojis from tweets in R
            Asked 2020-Apr-24 at 10:55

            I'm doing feature extraction from labelled Twitter data to use for predicting fake tweets. I've been spending a lot of time on various GitHub methods, R libraries, and Stack Overflow posts, but somehow I couldn't find a "direct" method of extracting features related to emojis, e.g. the number of emojis, whether the tweet contains an emoji (1/0), or even the occurrence of specific emojis (which might occur more often in fake/real news). I'm not sure whether there is a point in showing reproducible code.

            The "ore" library, for example, offers functions that gather all tweets in an object and extract emojis, but the formats are problematic (at least, to me) when trying to create features out of the extractions, as mentioned above. The example below uses a WhatsApp text sample. I will add Twitter data from Kaggle to make it somewhat reproducible. Twitter dataset: https://github.com/sherylWM/Fake-News-Detection-using-Twitter/blob/master/FinalDataSet.csv

            ...

            ANSWER

            Answered 2020-Apr-24 at 10:55

            I wrote a function for this purpose in my package rwhatsapp.

            As your example is a WhatsApp dataset, you can test it directly using the package (install via remotes::install_github("JBGruber/rwhatsapp")).

            Source https://stackoverflow.com/questions/61216342

            Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install textclean

            You can download it from GitHub.
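            Installation instructions are not listed on this page; a typical way to install directly from the GitHub repository shown below is the following (this assumes the remotes package; a CRAN release may also be available via install.packages("textclean")):

            # install.packages("remotes")
            remotes::install_github("trinker/textclean")

            library(textclean)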

            Support

            You are welcome to submit suggestions and bug reports at https://github.com/trinker/textclean/issues.

            CLONE
          • HTTPS

            https://github.com/trinker/textclean.git

          • CLI

            gh repo clone trinker/textclean

          • SSH

            git@github.com:trinker/textclean.git



            Consider Popular Regex Libraries

            • z by rupa
            • JSVerbalExpressions by VerbalExpressions
            • regexr by gskinner
            • path-to-regexp by pillarjs

            Try Top Libraries by trinker

            • sentimentr by trinker (R)
            • pacman by trinker (HTML)
            • wakefield by trinker (R)
            • qdap by trinker (R)