Stemmer | Porter Stemming for Russian language | Natural Language Processing library
kandi X-RAY | Stemmer Summary
Porter Stemming for Russian language
Top functions reviewed by kandi - BETA
- Get the base of a word
- Find regions in a word
- Remove ending characters from a word
- Check if a character is a vowel
Community Discussions
Trending Discussions on Stemmer
QUESTION
I am trying to apply preprocessing steps to my data. I have six functions to preprocess data, and I call them from a preprocess function. It works when I try these functions one by one on an example sentence.
...ANSWER
Answered 2022-Mar-17 at 14:18
The first problem that can be identified is that your convert_lower_case returns something different than it accepts, which could be perfectly fine if treated properly. But you keep treating your data as a string, which it no longer is after data = convert_lower_case(data).
"But it looks like a string when I print it" - yeah, but it isn't a string. You can see that if you do this:
QUESTION
For some reason I am using the WEKA API...
I have generated tf-idf scores for a set of documents,
...ANSWER
Answered 2022-Mar-13 at 19:53
The StringToWordVector filter uses the weka.core.DictionaryBuilder class under the hood for the TF/IDF computation. As long as you create a weka.core.Instance object with the text that you want converted, you can do that using the builder's vectorizeInstance(Instance) method.
Edit 1:
Below is an example based on your code (but with Weka classes), which shows how to use either the filter or the DictionaryBuilder for the TF/IDF transformation. Both get serialized, deserialized, and re-used as well, to demonstrate that these classes are serializable:
QUESTION
I am currently working with Python NLTK to preprocess text data for the Kaggle SMS Spam Classification Dataset. I have completed the following steps during preprocessing:
- Removed any extra spaces
- Removed punctuation and special characters
- Converted the text to lower case
- Replaced abbreviations such as lol, brb, etc. with their meaning or full form
- Removed stop words
- Tokenized the data
Now I plan to perform lemmatization and stemming separately on the tokenized data, followed by TF-IDF applied separately to the lemmatized data and the stemmed data.
Questions are as follows:
- Is there a practical use case to perform lemmatization on the tokenized data and then stem that lemmatized data, or vice versa?
- Does the idea of stemming the lemmatized data (or vice versa) make any sense theoretically, or is it completely incorrect?
Context: I am relatively new to NLP, and hence I am trying to understand as much as I can about these concepts. The main idea behind this question is to understand whether lemmatization and stemming together make any sense theoretically/practically, or whether these should be done separately.
Questions Referenced:
- Should I perform both lemmatization and stemming?: The answer to this question was inconclusive and not accepted, and it never discussed why you should or should not do it in the first place.
- What is the difference between lemmatization vs stemming?: Provides the ideas behind stemming and lemmatization, but I was unable to answer my questions based on it.
- Stemmers vs Lemmatizers: Explains the pros and cons, as well as the contexts in which stemming and lemmatization might help.
- NLP Stemming and Lemmatization using Regular expression tokenization: The question discusses the different preprocessing steps and does stemming and lemmatization separately.
ANSWER
Answered 2022-Feb-25 at 10:39
Is there a practical use case to perform lemmatization on the tokenized data and then stem that lemmatized data or vice versa
Does the idea of stemming the lemmatized data or vice versa make any sense theoretically, or is it completely incorrect.
Regarding (1): Lemmatisation and stemming do essentially the same thing: they convert an inflected word form to a canonical form, on the assumption that features expressed through morphology (such as word endings) are not important for the use case. If you are not interested in tense, number, voice, etc, then lemmatising/stemming will reduce the number of distinct word forms you have to deal with (as different variations get folded into one canonical form). So without knowing what you want to do exactly, and whether morphological information is relevant to that task, it's hard to answer.
Lemmatisation is a linguistically motivated procedure. Its output is a valid word in the target language, but with endings etc. removed. It is not without information loss, but there are not that many problematic cases. Is does the third-person singular of the auxiliary verb do, or the plural of a female deer? Is building a noun, referring to a structure, or a continuous form of the verb to build? What about housing? A casing for an object (such as an engine) or the process of finding shelter for someone?
Stemming is a less resource-intensive procedure, but as a trade-off it works only with approximations. You will get less precise results, which might not matter much in an application such as information retrieval, but if you are at all interested in meaning, then it is probably too coarse a tool. Its output also will not be a word, but a 'stem': basically a character string roughly related to those you get when stemming similar words.
Re (2): no, it doesn't make any sense. Both procedures attempt the same task (normalising inflected words) in different ways, and once you have lemmatised, stemming is pointless. And if you stem first, you generally do not end up with valid words, so lemmatisation would not work anyway.
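As a side-by-side illustration (not part of the original answer), here is a minimal sketch using NLTK's WordNetLemmatizer and SnowballStemmer; it assumes the WordNet corpus has been downloaded:

```python
# Contrast lemmatization and stemming on a few of the words discussed
# above (run nltk.download('wordnet') once beforehand).
from nltk.stem import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

for word in ["building", "housing", "does", "failure"]:
    print(word, "| lemma:", lemmatizer.lemmatize(word),
          "| stem:", stemmer.stem(word))
# "failure" stems to the non-word "failur", while the lemmatizer keeps
# it intact; "does" lemmatized as a noun becomes "doe" (female deer).
```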
QUESTION
I am rather new to the process of NLP, and I am running into a situation where my training accuracy is around 70% but my test accuracy is 80%. I have roughly 6000 entries from 2020 to be used as training data and 300 entries from the first quarter of 2021 to be used as test data (due to unavailability of Q2, Q3, Q4 data). Each entry has at least 2-3 paragraphs.
I have set up cross-validation using RepeatedStratifiedKFold with 10 splits and 3 repeats, and GridSearchCV with C=0.1 and kernel='linear'. I set up stop words (customized somewhat, e.g. to include the top 100 common names, months, and some common words that don't mean much in my setting), lowercased everything, and used the Snowball stemmer. The resulting confusion matrix for the test set appears as follows:
...ANSWER
Answered 2022-Feb-23 at 19:55
I am not really familiar with the model you use and might be missing something here, but it might be that your test set is not representative of the data. Perhaps there is something in the 2021 data that makes it easier to predict.
You might want to try something like sklearn's train_test_split() with shuffle=True to ensure the test set is a representative random subset of the data, and see if you get more balanced performance between the sets this way.
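A minimal sketch of that suggestion; the toy texts and labels below are hypothetical stand-ins for the real 2020/2021 entries:

```python
# Shuffled, stratified split so the test set is a representative
# random subset of the data.
from sklearn.model_selection import train_test_split

texts = ["entry %d text ..." % i for i in range(8)]  # hypothetical documents
labels = [0, 1, 0, 1, 0, 1, 0, 1]                    # hypothetical classes

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels,
    test_size=0.25,
    shuffle=True,       # the default, stated explicitly
    stratify=labels,    # keep class proportions in both splits
    random_state=42,
)
```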
Depending on which task exactly you are doing, 300 entries is really not a lot for a test set in NLP, so that small test set size alone might distort the test results.
It is a bit difficult to give advice on how to improve the predictions in general without knowing what you are trying to do. I assume it has to do with some kind of two-class classification on stemmed tokens?
Can you clarify/give an example for an entry and the desired predictions?
QUESTION
I want to access items from a new dictionary called conversations by implementing a for loop.
ANSWER
Answered 2022-Feb-22 at 01:39
You should use the json module to load in JSON data, as opposed to reading in the file line by line. Whatever procedure you build yourself is likely to be fragile and less efficient.
Here is the looping structure that you're looking for:
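The accepted snippet was not captured here; below is a minimal sketch under the assumption that the file holds JSON with a top-level conversations key (file name and structure are hypothetical):

```python
import json

# Load the whole file as JSON instead of reading it line by line.
with open("conversations.json") as f:
    data = json.load(f)

# Loop over the items in the conversations dictionary.
for conversation in data["conversations"]:
    print(conversation)
```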
QUESTION
I was watching this tutorial. I copied over the stuff he wrote, but with some changes in the variables and other stuff. Then I got the error (The error is below).
Here's the code (main.py):
...ANSWER
Answered 2022-Feb-13 at 07:17
data is not defined; your JSON data is loaded into trainer_load.
I was watching this tutorial. I copied over the stuff he wrote, but with some changes in the variables and other stuff. Then I got the error (The error is below).
- The change in the variable name is the cause of the error. If you rename a variable, you must change it at every occurrence; you changed it only where the variable is defined.
- In the link you referenced, the code is:
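The referenced code was not captured here; the following is a hedged reconstruction of the mismatch being described (file and key names are hypothetical):

```python
import json

with open("intents.json") as f:
    trainer_load = json.load(f)   # the JSON is loaded into trainer_load

# for intent in data["intents"]:      # NameError: name 'data' is not defined
for intent in trainer_load["intents"]:  # fix: use the name actually defined
    print(intent)
```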
QUESTION
If I had the following dataframe:
...ANSWER
Answered 2022-Feb-11 at 15:13
df['col1'] = df['col1'].apply(porter.stem)
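For context, a fuller sketch around that one-liner; it assumes NLTK's PorterStemmer is bound to the name porter and that col1 holds single words:

```python
import pandas as pd
from nltk.stem import PorterStemmer

porter = PorterStemmer()
df = pd.DataFrame({"col1": ["running", "flies", "easily"]})
df["col1"] = df["col1"].apply(porter.stem)  # stem each value in col1
print(df)  # running -> run, flies -> fli, easily -> easili
```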
QUESTION
There are a lot of Q&As about part-of-speech conversion, and they pretty much all point to WordNet's derivationally_related_forms() (for example, Convert words between verb/noun/adjective forms).
However, I'm finding that the WordNet data on this has important gaps. For example, I can find no relation at all between 'succeed', 'success', and 'successful', which seem like they should be V/N/A variants of the same concept. Likewise, none of the lemmatizers I've tried seem to see these as related, although I can get the Snowball stemmer to turn 'failure' into 'failur', which isn't really much help.
So my questions are:
- Are there any other (programmatic, ideally Python) tools out there that do this POS conversion, which I should check out? (The WordNet hits are masking every attempt I've made to google alternatives.)
- Failing that, are there ways to submit additions to WordNet despite the "due to lack of funding" situation they're presently in? (Or, can we set up a crowdfunding campaign?)
- Failing that, are there straightforward ways to distribute supplementary corpus to users of nltk that augments the WordNet data where needed?
ANSWER
Answered 2022-Jan-15 at 09:38
(Asking for software/data recommendations is off-topic for StackOverflow, but I have tried to give a more general "approach" answer.)
- Another approach to finding related words would be one of the machine learning approaches. If you are dealing with words in isolation, look at word embeddings such as GloVe or Word2Vec. spaCy and gensim have libraries for working with them, though I'm also getting some search hits for tutorials on working with them in nltk. A sketch of this route follows below.
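A minimal sketch of the embedding route with gensim; the vector file path is hypothetical and assumes GloVe vectors already converted to word2vec text format:

```python
from gensim.models import KeyedVectors

# Load pretrained embeddings (path/file name are placeholders).
vectors = KeyedVectors.load_word2vec_format("glove.6B.100d.w2v.txt", binary=False)

# Nearest neighbours of "success" in embedding space; derivationally
# related forms such as "successful" typically rank high.
for word, score in vectors.most_similar("success", topn=5):
    print(word, round(score, 3))
```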
2/3. One of the (in my opinion) core reasons for the success of Princeton WordNet was the liberal license they used. That means you can branch the project, add your extra data, and redistribute.
You might also find something useful at http://globalwordnet.org/resources/global-wordnet-grid/. Obviously most of them are not for English, but there are a few multilingual ones in there that might be worth evaluating.
Another approach would be to create a wrapper function. It first searches a lookup list of fixes and additions you think should be in there; if nothing is found, it searches WordNet as normal. This allows you to add 'succeed', 'success', 'successful', and then other sets of words as end users point out something missing.
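A minimal sketch of such a wrapper (the override table and function name are hypothetical; the WordNet calls are standard nltk):

```python
from nltk.corpus import wordnet as wn

# Hand-maintained fixes checked before falling back to WordNet.
RELATED_FORMS = {
    "succeed": {"success", "successful"},
    "success": {"succeed", "successful"},
    "successful": {"succeed", "success"},
}

def related_forms(word):
    if word in RELATED_FORMS:
        return RELATED_FORMS[word]
    forms = set()
    for lemma in wn.lemmas(word):
        for related in lemma.derivationally_related_forms():
            forms.add(related.name())
    return forms

print(related_forms("succeed"))  # served from the override table
print(related_forms("employ"))   # served from WordNet itself
```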
QUESTION
I am struggling through some text analysis, and I'm not sure I'm doing the stemming correctly. Right now, my command for single-term stemming is
...ANSWER
Answered 2022-Jan-13 at 09:41
wordStem does not employ a dictionary but uses grammatical rules to do the stemming (which is a rather crude approximation to lemmatisation, by the way). Here is an example:
QUESTION
I am having trouble forming a query to fetch all values, with SQL GROUP BY kind of behavior. Below is my data structure:
...ANSWER
Answered 2021-Dec-31 at 05:49
This query first groups by name, then groups each name's values. By setting the sizes, you can control the number of facets you want and the number of items in each facet. I think it does what you need.
Note that if you have too many documents and performance matters, this query may perform badly.
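The query itself was not captured here; below is a hedged sketch of a nested terms aggregation of the kind described, via the official elasticsearch-py client (index and field names are hypothetical, and both fields are assumed to be keyword-mapped):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
body = {
    "size": 0,  # aggregations only, no hits
    "aggs": {
        "by_name": {
            "terms": {"field": "name", "size": 100},  # number of facets
            "aggs": {
                # items per facet
                "values": {"terms": {"field": "value", "size": 100}}
            },
        }
    },
}
resp = es.search(index="my-index", body=body)
for bucket in resp["aggregations"]["by_name"]["buckets"]:
    print(bucket["key"], [v["key"] for v in bucket["values"]["buckets"]])
```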
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install Stemmer
PHP requires the Visual C runtime (CRT). The Microsoft Visual C++ Redistributable for Visual Studio 2019 is suitable for all these PHP versions, see visualstudio.microsoft.com. You MUST download the x86 CRT for PHP x86 builds and the x64 CRT for PHP x64 builds. The CRT installer supports the /quiet and /norestart command-line switches, so you can also script it.