tfidf | #Natural Language Processing | generic TfIdf utility with example code

 by wpm | Java | Updated: 2 years ago | Current License: No License

kandi X-RAY | tfidf REVIEW AND RATINGS

This package provides utilities for calculating tf-idf for a set of documents. A document is a bag of terms, where the definition of term is left to the caller. The example program NgramTfIdf calculates tf-idf of n-gram frequencies. It takes a single file as an argument and treats each line of that file as a separate document, calculating tf-idf for n-gram terms.
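
The library itself is Java, but the computation NgramTfIdf performs can be sketched in a few lines of dependency-free Python (the function and variable names here are illustrative, not the library's API):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(documents, n=2):
    """documents: list of strings, one per 'document' (e.g. one per line).
    Returns one {ngram: tf-idf score} dict per document."""
    term_lists = [ngrams(doc.split(), n) for doc in documents]
    n_docs = len(documents)
    # document frequency: in how many documents each term appears
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))
    result = []
    for terms in term_lists:
        tf = Counter(terms)
        result.append({t: count * math.log(n_docs / df[t])
                       for t, count in tf.items()})
    return result
```

Each line of an input file would become one element of `documents`, mirroring how the example program treats its input; a term that occurs in every document gets an idf of zero and therefore a score of zero.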

Support

  • tfidf has a low active ecosystem.
  • It has 21 stars and 12 forks.
  • It had no major release in the last 12 months.
  • It has a neutral sentiment in the developer community.

Quality

  • tfidf has no issues reported.

Security

  • tfidf has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

  • tfidf does not have a standard license declared.
  • Check the repository for any license declaration and review the terms closely.
  • Without a license, all rights are reserved, and you cannot use the library in your applications.

Reuse

  • tfidf releases are not available. You will need to build from source code and install.
  • Build file is available. You can build the component from source.

Top functions reviewed by kandi - BETA

kandi has reviewed tfidf and surfaced the functions below as its top functions. This is intended to give you an instant insight into the functionality tfidf implements, and to help you decide if it suits your requirements.

  • Entry point for testing.
  • Splits the stats into a human-readable string.
  • Tokenizes the input string.
  • Returns the n-grams in a list of words.
  • Computes the IDF from a list of term sequences.
  • Gets n-gram document terms from a list of documents.
  • Returns a mapping between terms.
  • Searches through a collection of term nodes.
  • Converts a collection of terms into a map.

tfidf Key Features

A generic Tf-Idf utility with example code that works on n-grams extracted from a text document.

tfidf examples and code snippets

  • how to use trained model to test new sentence in python (sklearn)
  • How to properly build a SGDClassifier with both text and numerical data using FeatureUnion and Pipeline?
  • How to build parameter grid with FeatureUnion?
  • Predict unseen data by previously trained model
  • Counting in how many documents does a word appear
  • Weird behaviour in MapReduce, values get overwritten
  • Index of max value from list of floats | Python
  • Calling predict on an example from an already trained logistic regression model
  • How to use BERT and Elmo embedding with sklearn
  • tfidf.idf_ what is meaning of this in the code

how to use trained model to test new sentence in python (sklearn)

lr.fit(X_train, y_train)
y_pred1 = lr.predict(X_test)
print(f"Accuracy is : {accuracy_score(y_test, y_pred1)}")   #<--- here
print(lr.predict(['ماست کم چرب 900 گرمی رامک']))   #<--- here

How to properly build a SGDClassifier with both text and numerical data using FeatureUnion and Pipeline?

def get_numeric_data(x):
    return x.number.values.reshape(-1, 1)
combined_clf = Pipeline([
    ('transformer', ColumnTransformer([
        ('vectorizer', Pipeline([
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl)
        ]), 'text')
    ], remainder='passthrough')),
    ('clf', SGDClassifier(random_state=42, max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])

How to build parameter grid with FeatureUnion?

'features__text_features__tfidf__use_idf': [True, False]
'features__text_features__vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]

Predict unseen data by previously trained model

### Do this in the first program
# Dump model and tfidf
pickle.dump(model, open('model.pkl', 'wb'))
pickle.dump(tfidf, open('tfidf.pkl', 'wb'))

### Do this in the second program
model = pickle.load(open('model.pkl', 'rb'))
tfidf = pickle.load(open('tfidf.pkl', 'rb'))

# Use `transform` instead of `fit_transform`
X_unseen = tfidf.transform(unseen_df['text']).toarray()

# Predict on `X_unseen`
y_pred_unseen = model.predict(X_unseen)

Counting in how many documents does a word appear

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        # use a set so a word repeated within one document is counted once
        for word in set(line.split()):
            if word in doc_count:
                doc_count[word] += 1
            else:
                doc_count[word] = 1
    return doc_count

ans=docs(corpus)
print(ans)

Weird behaviour in MapReduce, values get overwritten

job.setCombinerClass (MR2Reducer.class) ;
job.setReducerClass (MR2Reducer.class) ;
balance|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt    1|661
suppress|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt   1|661
back|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt       4|661
after|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt      1|661
suspicious|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt 2|661
swang|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt      2|661
swinging|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt   1|661

Index of max value from list of floats | Python

max_val = max(tfidf_)
max_idx = tfidf_.index(max_val)
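
An equivalent single-pass form, using only the standard library and avoiding the separate max() call (the list here is made up):

```python
tfidf_ = [0.12, 0.87, 0.33, 0.87]

# max over the indices, keyed by the value at each index;
# like list.index(max(...)), ties resolve to the first occurrence
max_idx = max(range(len(tfidf_)), key=tfidf_.__getitem__)
```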

Calling predict on an example from an already trained logistic regression model

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

raw_input = [
    "first sentence looks like this",
    "second sentence looks like that",
    "it's going to demonstrate something",
]

vectorizer = TfidfVectorizer(stop_words="english", strip_accents="ascii")

X = vectorizer.fit_transform(raw_input)
y = np.array([0, 0, 1])

clf = LogisticRegression()
clf.fit(X, y)

d = {
    "short_description": ["[mitigated]  [ubl5] ssd slam station not working"],
    "details": ["ssd slam station not working, unable to  take slam from the station."],
}
df_test = pd.DataFrame(data=d)
X_test = vectorizer.fit_transform(df_test)
print(clf.predict(X_test))
Traceback (most recent call last):
  File "vectorizer_test.py", line 27, in <module>
    print(clf.predict(X_test))
  File "/home/hayesall/miniconda3/envs/stackoverflow/lib/python3.7/site-packages/sklearn/linear_model/_base.py", line 309, in predict
    scores = self.decision_function(X)
  File "/home/hayesall/miniconda3/envs/stackoverflow/lib/python3.7/site-packages/sklearn/linear_model/_base.py", line 289, in decision_function
    % (X.shape[1], n_features))
ValueError: X has 2 features per sample; expecting 6
# Fix: transform with the already-fitted vocabulary instead of refitting
X_test = vectorizer.transform(df_test)
print(clf.predict(X_test))
# [0 0]

How to use BERT and Elmo embedding with sklearn

import torch
import numpy as np
from flair.data import Sentence
from flair.embeddings import TransformerDocumentEmbeddings
from sklearn.base import BaseEstimator, TransformerMixin


class FlairTransformerEmbedding(TransformerMixin, BaseEstimator):

    def __init__(self, model_name='bert-base-uncased', batch_size=None, layers=None):
        # From https://lvngd.com/blog/spacy-word-vectors-as-features-in-scikit-learn/
        # For pickling reasons, you should not load models in __init__
        self.model_name = model_name
        self.model_kw_args = {'batch_size': batch_size, 'layers': layers}
        self.model_kw_args = {k: v for k, v in self.model_kw_args.items()
                              if v is not None}
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        model = TransformerDocumentEmbeddings(
                self.model_name, fine_tune=False,
                **self.model_kw_args)

        sentences = [Sentence(text) for text in X]
        embedded = model.embed(sentences)
        embedded = [e.get_embedding().reshape(1, -1) for e in embedded]
        return np.array(torch.cat(embedded).cpu())

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from transformers import AutoTokenizer, AutoModel
from more_itertools import chunked

class TransformerEmbedding(TransformerMixin, BaseEstimator):

    def __init__(self, model_name='bert-base-uncased', batch_size=1, layer=-1):
        # From https://lvngd.com/blog/spacy-word-vectors-as-features-in-scikit-learn/
        # For pickling reasons, you should not load models in __init__
        self.model_name = model_name
        self.layer = layer
        self.batch_size = batch_size
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        model = AutoModel.from_pretrained(self.model_name)

        res = []
        for batch in chunked(X, self.batch_size):
            encoded_input = tokenizer.batch_encode_plus(
                batch, return_tensors='pt', padding=True, truncation=True)
            output = model(**encoded_input)
            embed = output.last_hidden_state[:,self.layer].detach().numpy()
            res.append(embed)

        return np.concatenate(res)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

column_trans = ColumnTransformer([
    ('embedding', FlairTransformerEmbedding(), 'text'),
    ('number_scaler', MinMaxScaler(), ['number'])
])

tfidf.idf_ what is meaning of this in the code

idf_: array of shape (n_features,)
The inverse document frequency (IDF) vector; only defined if use_idf is True.

COMMUNITY DISCUSSIONS

Top Trending Discussions on tfidf
  • It is necessary to encode labels when using `TfidfVectorizer`, `CountVectorizer` etc?
  • how to use trained model to test new sentence in python (sklearn)
  • I applied W2V on ML Algorithms. It gives error of negative value for NB and gives 0.48 accuracy for all the other algorithms. How come?
  • How to properly build a SGDClassifier with both text and numerical data using FeatureUnion and Pipeline?
  • How to build parameter grid with FeatureUnion?
  • different score when using train_test_split before vs after SMOTETomek
  • Predict unseen data by previously trained model
  • Counting in how many documents does a word appear
  • Weird behaviour in MapReduce, values get overwritten
  • Text (cosine) similarity

QUESTION

It is necessary to encode labels when using `TfidfVectorizer`, `CountVectorizer` etc?

Asked 2021-Jun-08 at 08:55

When working with text data, I understand the need to encode text labels into some numeric representation (i.e., by using LabelEncoder, OneHotEncoder etc.)

However, my question is whether you need to perform this step explicitly when you're using some feature extraction class (i.e. TfidfVectorizer, CountVectorizer etc.) or whether these will encode the labels under the hood for you?

If you do need to encode the labels separately yourself, are you able to perform this step in a Pipeline (such as the one below)?

    pipeline = Pipeline(steps=[
        ('tfidf', TfidfVectorizer()),
        ('sgd', SGDClassifier())
    ])

Or do you need to encode the labels beforehand, since the pipeline expects to fit() and transform() the data (not the labels)?

ANSWER

Answered 2021-Jun-08 at 08:55

Have a look into the scikit-learn glossary for the term transform:

In a transformer, transforms the input, usually only X, into some transformed space (conventionally notated as Xt). Output is an array or sparse matrix of length n_samples and with the number of columns fixed after fitting.

In fact, almost all transformers only transform the features. This holds true for TfidfVectorizer and CountVectorizer as well. If ever in doubt, you can always check the return type of the transforming function (like the fit_transform method of CountVectorizer).

Same goes when you assemble several transformers in a pipeline. It is stated in its user guide:

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (i.e. log-transform y). In contrast, Pipelines only transform the observed data (X).

So in conclusion, you typically handle the labels separately and before you fit the estimator/pipeline.
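
As a hedged sketch of that workflow (made-up data; LabelEncoder is used explicitly here, though scikit-learn classifiers will also accept string labels for y directly):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

X = ["good movie", "bad movie", "great film", "terrible film"]
y_text = ["pos", "neg", "pos", "neg"]

# Labels are handled separately, before fitting; the pipeline only sees X.
le = LabelEncoder()
y = le.fit_transform(y_text)

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()),
    ('sgd', SGDClassifier(random_state=0)),
])
pipeline.fit(X, y)

# Map numeric predictions back to the original label strings.
pred = le.inverse_transform(pipeline.predict(["good film"]))
```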

Source https://stackoverflow.com/questions/67880537

QUESTION

how to use trained model to test new sentence in python (sklearn)

Asked 2021-Jun-06 at 12:17

I have code for training a multi-class text classification model, and the training works, but I can't use the resulting model. This is my code for training:

def training(df):
    X = df.Text
    y = df.Tags
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    lr = Pipeline([('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', LogisticRegression()),
                   ])

    lr.fit(X_train, y_train)
    y_pred1 = lr.predict(X_test)
    print(f"Accuracy is : {accuracy_score(y_pred1, y_test)}")
    print(lr.predict('ماست کم چرب 900 گرمی رامک'))

When I run the code I get the result Accuracy is : 0.9957983193277311 and this error:

Traceback (most recent call last):
  File "E:\Python\NLP Project\Beta_00\Level0\handleClassification.py", line 100, in <module>
    training(df)
  File "E:\Python\NLP Project\Beta_00\Level0\handleClassification.py", line 85, in training
    print(lr.predict('ماست کم چرب 900 گرمی رامک'))
  File "E:\Python\NLP Project\Beta_00\venv\lib\site-packages\sklearn\utils\metaestimators.py", line 120, in <lambda>
    out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
  File "E:\Python\NLP Project\Beta_00\venv\lib\site-packages\sklearn\pipeline.py", line 418, in predict
    Xt = transform.transform(Xt)
  File "E:\Python\NLP Project\Beta_00\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1248, in transform
    raise ValueError(
ValueError: Iterable over raw text documents expected, string object received.

ANSWER

Answered 2021-Jun-06 at 12:17

The lines below need correction:

lr.fit(X_train, y_train)
y_pred1 = lr.predict(X_test)
print(f"Accuracy is : {accuracy_score(y_test, y_pred1)}")   #<--- here
print(lr.predict(['ماست کم چرب 900 گرمی رامک']))   #<--- here

lr.predict(input) expects an iterable of documents (e.g. a list of strings), not a bare string, which is why the sentence is wrapped in a list.

Source https://stackoverflow.com/questions/67858355

QUESTION

I applied W2V on ML Algorithms. It gives error of negative value for NB and gives 0.48 accuracy for all the other algorithms. How come?

Asked 2021-Jun-02 at 16:43

from gensim.models import Word2Vec
import time
# Skip-gram model (sg = 1)
size = 1000
window = 3
min_count = 1
workers = 3
sg = 1

word2vec_model_file = 'word2vec_' + str(size) + '.model'
start_time = time.time()
stemmed_tokens = pd.Series(df['STEMMED_TOKENS']).values
# Train the Word2Vec Model
w2v_model = Word2Vec(stemmed_tokens, min_count = min_count, size = size, workers = workers, window = window, sg = sg)
print("Time taken to train word2vec model: " + str(time.time() - start_time))
w2v_model.save(word2vec_model_file)

This is the code I have written. I applied this model with all the ML algorithms for binary classification, but every algorithm gives the same result, 0.48. How is that possible? This result is also very poor compared to the BERT and TFIDF scores.

ANSWER

Answered 2021-Jun-02 at 16:43

A vector size of 1000 dimensions is very uncommon, and would require massive amounts of data to train. For example, the famous GoogleNews vectors were for 3 million words, trained on something like 100 billion corpus words - and still only 300 dimensions. Your STEMMED_TOKENS may not be enough data to justify 100-dimensional vectors, much less 300 or 1000.

A choice of min_count=1 is a bad idea. This algorithm can't learn anything valuable from words that only appear a few times. Typically people get better results by discarding rare words entirely, as the default min_count=5 will do. (If you have a lot of data, you're likely to increase this value to discard even more words.)

Are you examining the model's size or word-to-word results at all to ensure it's doing what you expect? Despite your column being named STEMMED_TOKENS, I don't see any actual splitting-into-tokens, and the Word2Vec class expects each text to be a list-of-strings, not a string.

Finally, without seeing all your other choices for feeding word-vector-enriched data to your other classification steps, it is possible (likely even) that there are other errors there.

Given that a binary-classification model can always get at least 50% accuracy by simply classifying every example with whichever class is more common, any accuracy result less than 50% should immediately cause suspicions of major problems in your process like:

  • misalignment of examples & labels
  • insufficient/unrepresentative training data
  • some steps not running at all due to data-prep or invocation errors

Source https://stackoverflow.com/questions/67801844

QUESTION

How to properly build a SGDClassifier with both text and numerical data using FeatureUnion and Pipeline?

Asked 2021-Jun-02 at 07:56

I have a features DF that looks like

text     number
text1    0
text2    1
...      ...

where the number column is binary and the text column contains texts with ~2k characters in each row. The targets DF contains three classes.

def get_numeric_data(x):
    return [x.number.values]
def get_text_data(x):
    return [record for record in x.text.values]
transfomer_numeric = FunctionTransformer(get_numeric_data)
transformer_text = FunctionTransformer(get_text_data)

and when trying to fit (code below), I get the error File "C:\fakepath\scipy\sparse\construct.py", line 588, in bmat raise ValueError(msg) ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 98, expected 1. I tried to build the functions get_text_data and get_numeric_data in different ways, but none helped.

combined_clf = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', transfomer_numeric)
        ])),
        ('text_features', Pipeline([
            ('selector', transformer_text),
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl),
        ]))
    ])),
    ('clf', SGDClassifier(random_state=42,
                          max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])
gs_clf = GridSearchCV(combined_clf, parameters, cv=5,n_jobs=-1)
gs_clf.fit(X_train, y_train)

ANSWER

Answered 2021-Jun-02 at 07:56

The main problem is the way you are returning the numeric values. x.number.values will return an array of shape (n_samples,) which the FeatureUnion object will try to combine with the result of the transformation of the text features later on. In your case, the dimension of the transformed text features is (n_samples, 98) which cannot be combined with the vector you get for the numeric features.

An easy fix would be to reshape the vector into a 2d array with dimensions (n_samples, 1) like the following:

def get_numeric_data(x):
    return x.number.values.reshape(-1, 1)

Note that I removed the brackets surrounding the expression, as they unnecessarily wrapped the result in a list.


While the above will make your code run, there are still a couple of things about your code that are not quite efficient and can be improved.

First is the expression [record for record in x.text.values] which is redundant, as x.text.values would already be enough. The only difference is that the former is a list object, whereas the latter is a numpy ndarray which is usually preferred.

Second is what Ben Reiniger already stated in his comment. FeatureUnion is meant to perform several transformations on the same data and combine the results into a single object. However, it appears that you simply want to transform the text features separately from your numeric ones. In this case, the ColumnTransformer offers a much simpler and canonical way:

combined_clf = Pipeline([
    ('transformer', ColumnTransformer([
        ('vectorizer', Pipeline([
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl)
        ]), 'text')
    ], remainder='passthrough')),
    ('clf', SGDClassifier(random_state=42, max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])

What happens above is that ColumnTransformer only selects the text column and passes it to the pipeline of transformations, and will eventually merge it with the numeric column that was just passed through. Note that it becomes obsolete to define your own selectors as ColumnTransformer will take care of that by specifying the columns to be transformed by each transformer. See the documentation for more information.
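
Spelled out with the transformers named in the question and a made-up two-column frame, the ColumnTransformer version might look like:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = pd.DataFrame({
    "text": ["cheap watches", "meeting at noon", "win a big prize", "lunch tomorrow"],
    "number": [1, 0, 1, 0],
})
y_train = [1, 0, 1, 0]

clf = Pipeline([
    ('transformer', ColumnTransformer([
        ('vectorizer', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
            ('scaler', StandardScaler(with_mean=False)),
        ]), 'text'),               # a bare column name yields the 1-D input text needs
    ], remainder='passthrough')),  # 'number' is appended untransformed
    ('clf', SGDClassifier(random_state=42)),
])
clf.fit(X_train, y_train)
```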

Source https://stackoverflow.com/questions/67795761

QUESTION

How to build parameter grid with FeatureUnion?

Asked 2021-Jun-01 at 19:18

I am trying to run this combined model of text and numeric features, and I am getting the error ValueError: Invalid parameter tfidf for estimator. Is the problem in the parameter syntax? Possibly helpful links: FeatureUnion usage, FeatureUnion documentation.

tknzr = tokenize.word_tokenize
vect = CountVectorizer(tokenizer=tknzr, stop_words={'english'}, max_df=0.9, min_df=2)
scl = StandardScaler(with_mean=False)
tfidf = TfidfTransformer(norm=None)
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)],
    'tfidf__use_idf': (True, False),
    'clf__alpha': tuple(10 ** (np.arange(-4, 4, dtype='float'))),
    'clf__loss': ('hinge', 'squared_hinge', 'log', 'modified_huber', 'perceptron'),
    'clf__penalty': ('l1', 'l2'),
    'clf__tol': (1e-7, 1e-6, 1e-5, 1e-4, 1e-3)
}

combined_clf = Pipeline([
    ('features', FeatureUnion([
        ('numeric_features', Pipeline([
            ('selector', transfomer_numeric)
        ])),
        ('text_features', Pipeline([
            ('selector', transformer_text),
            ('vect', vect),
            ('tfidf', tfidf),
            ('scaler', scl),
        ]))
    ])),
    ('clf', SGDClassifier(random_state=42,
                          max_iter=int(10 ** 6 / len(X_train)), shuffle=True))
])

ANSWER

Answered 2021-Jun-01 at 19:18

As stated here, nested parameters must be accessed by the __ (double underscore) syntax. Depending on the depth of the parameter you want to access, this applies recursively. The parameter use_idf is under:

features > text_features > tfidf > use_idf

So the resulting parameter in your grid needs to be:

'features__text_features__tfidf__use_idf': [True, False]

Similarly, the syntax for ngram_range should be:

'features__text_features__vect__ngram_range': [(1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
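
When in doubt, the full set of valid double-underscore parameter names can be listed from the composite estimator itself; a small stand-in for the pipeline in the question:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

pipe = Pipeline([
    ('features', FeatureUnion([
        ('text_features', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
    ])),
    ('clf', SGDClassifier()),
])

# get_params() enumerates every tunable parameter with its full nested path
params = pipe.get_params()
```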

Source https://stackoverflow.com/questions/67794357

QUESTION

different score when using train_test_split before vs after SMOTETomek

Asked 2021-May-29 at 19:35

I'm trying to classify text into 6 different classes. Since I have an imbalanced dataset, I'm also using the SMOTETomek method, which should synthetically balance the dataset with additional artificial samples.

I've noticed a huge score difference when applying it via a pipeline vs. step by step, where the only difference is (I believe) the place where I use train_test_split.

Here are my features and labels:

for curr_features, label in self.training_data:
    features.append(curr_features)
    labels.append(label)

algorithms = [
    linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None),
    naive_bayes.MultinomialNB(),
    naive_bayes.BernoulliNB(),
    tree.DecisionTreeClassifier(max_depth=1000),
    tree.ExtraTreeClassifier(),
    ensemble.ExtraTreesClassifier(),
    svm.LinearSVC(),
    neighbors.NearestCentroid(),
    ensemble.RandomForestClassifier(),
    linear_model.RidgeClassifier(),
]

Using Pipeline:

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=42)

# Provide Report for all algorithms
score_dict = {}
for algorithm in algorithms:
    model = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('smote', SMOTETomek()),
        ('classifier', algorithm)
    ])
    model.fit(X_train, y_train)

    # Score
    score = model.score(X_test, y_test)
    score_dict[model] = int(score * 100)

sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
for classifier, score in sorted_score_dict.items():
    print(f'{classifier.__class__.__name__}: score is {score}%')

Using Step by Step:

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
cv = vectorizer.fit_transform(features)
text_tf = transformer.fit_transform(cv).toarray()

smt = SMOTETomek()
X_smt, y_smt = smt.fit_resample(text_tf, labels)

X_train, X_test, y_train, y_test = train_test_split(X_smt, y_smt, test_size=0.2, random_state=0)
self.test_classifiers(X_train, X_test, y_train, y_test, algorithms)

def test_classifiers(self, X_train, X_test, y_train, y_test, classifiers_list):
    score_dict = {}
    for model in classifiers_list:
        model.fit(X_train, y_train)

        # Score
        score = model.score(X_test, y_test)
        score_dict[model] = int(score * 100)
       
    print()
    print("SCORE:")
    sorted_score_dict = {k: v for k, v in sorted(score_dict.items(), key=lambda item: item[1])}
    for model, score in sorted_score_dict.items():
        print(f'{model.__class__.__name__}: score is {score}%')

I'm getting around 65% (for the best classifier model) using the pipeline vs 90% using step by step. Not sure what I am missing.

ANSWER

Answered 2021-May-29 at 13:28

There is nothing wrong with your code by itself, but your step-by-step approach follows a bad practice in machine learning:

Do not resample your testing data

In your step-by-step approach, you resample all of the data first and then split it into train and test sets. This leads to an overestimation of model performance, because you have altered the original class distribution in your test set and it is no longer representative of the original problem.

What you should do instead is leave the test data in its original distribution, so that you get a valid estimate of how your model will perform on the original data, i.e. the situation in production. Your approach with the pipeline is therefore the way to go.

As a side note: you could think about shifting the whole data preparation (vectorization and resampling) out of your fitting and testing loop as you probably want to compare the model performance against the same data anyway. Then you would only have to run these steps once and your code executes faster.
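To make the split-before-resample order concrete, here is a minimal sketch. Naive minority-class duplication stands in for SMOTETomek so the snippet runs without imbalanced-learn; the data and names are illustrative only. In practice, `imblearn`'s Pipeline does this for you by applying samplers only during `fit`, which is why the pipeline score above is the honest one.

```python
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 8 samples of class 0, 2 of class 1 (illustrative).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

# Split FIRST: the test set keeps its natural class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Resample ONLY the training portion (naive duplication stands in for
# SMOTETomek here, just to show where resampling belongs).
minority = [(x, label) for x, label in zip(X_train, y_train) if label == 1]
X_res = list(X_train) + [x for x, _ in minority]
y_res = list(y_train) + [label for _, label in minority]

# y_test was never touched, so a score computed on it estimates
# real-world performance on the original distribution.
```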

Source https://stackoverflow.com/questions/67750208

QUESTION

Predict unseen data by previously trained model

Asked 2021-May-28 at 15:01

I am performing supervised machine learning using scikit-learn. I have two datasets. The first dataset contains X features and Y labels. The second dataset contains only X features and NO Y labels. I can successfully train a LinearSVC on a train/test split and get the Y labels for the test set.

Now, I want to use the model trained on the first dataset to predict labels for the second dataset. How do I apply the pre-trained model from the first dataset to the second (unlabeled) dataset in scikit-learn?

Code snippet from my attempts: UPDATED code from comments below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
import pickle


# ----------- Dataset 1: for training ----------- #
# Sample data ONLY
some_text = ['Books are amazing',
             'Harry potter book is awesome. It rocks',
             'Nutrition is very important',
             'Welcome to library, you can find as many book as you like',
             'Food like brocolli has many advantages']
y_variable = [1,1,0,1,0]

# books = 1 : y label
# food = 0 : y label

df = pd.DataFrame({'text':some_text,
                   'y_variable': y_variable
                          })

# ------------- TFIDF process -------------#
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df['text']).toarray()
labels = df.y_variable
features.shape


# ------------- Build Model -------------#
model = LinearSVC()
X_train, X_test, y_train, y_test= train_test_split(features,
                                                 labels,
                                                 train_size=0.5,
                                                 random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


# Export model
pickle.dump(model, open('model.pkl', 'wb'))
# Read the Model
model_pre_trained = pickle.load(open('model.pkl','rb'))


# ----------- Dataset 2: UNSEEN DATASET ----------- #

some_text2 = ['Harry potter books are amazing',
             'Gluten free diet is getting popular']

unseen_df = pd.DataFrame({'text':some_text2}) # Notice this doesn't have y_variable. This the is the data set I am trying to predict y_variable labels 1 or 0.


# This is where the ERROR occurs
X_unseen = tfidf.fit_transform(unseen_df['text']).toarray()
y_pred_unseen = model_pre_trained.predict(X_unseen) # error here: 
# ValueError: X has 11 features per sample; expecting 26


print(X_unseen.shape) # prints (2, 11)
print(X_train.shape) # prints (2, 26)


# Looking for an output like this for UNSEEN data
# Looking for results after predicting unseen and no label data. 
text                                   y_variable
Harry potter books are amazing         1
Gluten free diet is getting popular    0

It doesn't have to use pickle as I tried above. Does anyone have a suggestion, or is there a pre-built scikit-learn function that does this prediction?

ANSWER

Answered 2021-Apr-15 at 19:48

Imagine you trained an AI to recognize a plane using pictures of the motors, wheels, wings and the pilot's bowtie. Now you call this same AI and ask it to predict the model of a plane from the bowtie alone. That's what scikit-learn is telling you: there are far fewer features (= columns) in X_unseen than in X_train or X_test.
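The practical fix, sketched here using the question's own sample data: fit the vectorizer once on the training texts, then call transform() (not fit_transform()) on the unseen texts, so both matrices share the same vocabulary and column count.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ['Books are amazing',
               'Harry potter book is awesome. It rocks',
               'Nutrition is very important',
               'Food like brocolli has many advantages']
train_y = [1, 1, 0, 0]  # books = 1, food = 0

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)  # fit ONCE, on training data

model = LinearSVC()
model.fit(X_train, train_y)

# transform(), NOT fit_transform(): reuse the training vocabulary so the
# unseen matrix has the same number of columns the model expects.
unseen_texts = ['Harry potter books are amazing',
                'Gluten free diet is getting popular']
X_unseen = tfidf.transform(unseen_texts)
predictions = model.predict(X_unseen)  # no shape mismatch now
```

If you persist the model with pickle, persist the fitted vectorizer alongside it (or pickle a Pipeline containing both), since the vocabulary is part of the trained state.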

Source https://stackoverflow.com/questions/67114967

QUESTION

Counting in how many documents does a word appear

Asked 2021-May-28 at 07:13

I'm trying to implement a TF-IDF vectorizer without sklearn. I want to count the number of documents (each a string in a list) in which a word appears, for every word in the corpus. Example:

corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

Desired output: {this : 4, is : 4} and so on for every word

My code:

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            if word in line:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
        print(counts)

docs(corpus)

Error I'm facing:

KeyError                                  Traceback (most recent call last)
<ipython-input-70-6bf2b69708bc> in <module>
      9         print(counts)
     10 
---> 11 docs(corpus)

<ipython-input-70-6bf2b69708bc> in docs(corpus)
      4         for word in line.split():
      5             if word in line.split():
----> 6                 doc_count[word] +=1
      7             else:
      8                 doc_count[word] = 1

KeyError: 'this'

Please let me know where I'm lacking and if I'm not iterating properly. Thank you!

ANSWER

Answered 2021-May-28 at 07:13
corpus = [
     'this is the first document',
     'this document is the second document',
     'and this is the third one',
     'is this the first document',
]

def docs(corpus):
    doc_count = dict()
    for line in corpus:
        for word in line.split():
            # the mistake was here: check membership in doc_count, not in line
            if word in doc_count:
                doc_count[word] +=1
            else:
                doc_count[word] = 1
    return doc_count    

ans=docs(corpus)
print(ans)
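One caveat worth adding: the loop above counts every occurrence, so a word repeated inside a single document (e.g. 'document' in the second sentence) is counted more than once. For true document frequency, which is what idf needs, deduplicate each document's words with set() first:

```python
def doc_frequency(corpus):
    """Count the number of documents each word appears in."""
    df = {}
    for line in corpus:
        for word in set(line.split()):  # each word counted once per document
            df[word] = df.get(word, 0) + 1
    return df

corpus = [
    'this is the first document',
    'this document is the second document',
    'and this is the third one',
    'is this the first document',
]
print(doc_frequency(corpus)['document'])  # 3: appears in documents 1, 2 and 4
```

The accepted answer returns 4 for 'document' because it is counted twice in the second sentence; with set() the count per document is capped at one.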

Source https://stackoverflow.com/questions/67734007

QUESTION

Weird behaviour in MapReduce, values get overwritten

Asked 2021-May-20 at 12:08

I've been trying to implement the TfIdf algorithm using MapReduce in Hadoop. My TFIDF takes place in 4 steps (I call them MR1, MR2, MR3, MR4). Here are my input/outputs:

MR1: (offset, line) ==(Map)==> (word|file, 1) ==(Reduce)==> (word|file, n)

MR2: (word|file, n) ==(Map)==> (file, word|n) ==(Reduce)==> (word|file, n|N)

MR3: (word|file, n|N) ==(Map)==> (word, file|n|N|1) ==(Reduce)==> (word|file, n|N|M)

MR4: (word|file, n|N|M) ==(Map)==> (word|file, n/N log D/M)

Where n = the number of occurrences of a word in a file, N = the total number of words in each file, M = the number of documents in which each word appears, and D = the total number of documents.

As of the MR1 phase, I'm getting the correct output, for example: hello|hdfs://..... 2

For the MR2 phase, I expect: hello|hdfs://....... 2|192 but I'm getting 2|hello|hdfs://...... 192|192

I'm pretty sure my code is correct; every time I try to add a string to my "value" in the reduce phase to see what's going on, the same string gets "teleported" into the key part.

Example: gg|word|hdfs://.... gg|192

Here is my MR1 code:

public class MR1 {
    /* Mapper class:
    *  Input  : (offset, line)
    *  Output : (word|file, 1)
    *  Emits 1 for each word occurrence per line.
    */
    static class MR1Mapper extends Mapper <LongWritable, Text, Text, IntWritable > {
        public void map (LongWritable key, Text value, Context contexte)
                       throws IOException, InterruptedException {
            // Retrieve the name of the file associated with this split
            FileSplit split = (FileSplit) contexte.getInputSplit();
            String fileName = split.getPath().toString();
    
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line, "' \t:,;:!?./-_()[]{}\"&%<>");
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken().toLowerCase();
                contexte.write(new Text(word + "|" + fileName), new IntWritable(1));
            }
        }
    } 
    
    /* Reducer class: sums the total number of occurrences per word/file.
    * Input  : (word|file, x)
    * Output : (word|file, n)
    */
    public static class MR1Reducer extends Reducer <Text, IntWritable, Text, IntWritable > {
        public void reduce(Text key, Iterable < IntWritable > values, Context contexte) 
                    throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val:values) {
                sum += val.get();
            } 
            contexte.write(key, new IntWritable(sum));
        }
    } 
    
    public static void main(String args[]) throws Exception {
        if (args.length != 2) {
            System.err.println(args.length + "(" + args[0] + "," + args[1] + ")");
            System.err.println("Usage: MR1 <source> <destination>");
            System.exit(-1);
        }
    
        Job job = new Job();
        job.setJarByClass(MR1.class);
    
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
        job.setMapperClass(MR1Mapper.class);
        job.setCombinerClass (MR1Reducer.class) ;
        job.setReducerClass(MR1Reducer.class);
    
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
    
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Here is my MR2 code:

public class MR2 {
    /* Mapper: isolate the file name.
    * Input  : (word|file, n)
    * Output : (file, word|n)
    */
    static class MR2Mapper extends Mapper <Text, Text, Text, Text> {
        public void map (Text key, Text value, Context contexte) 
                    throws IOException, InterruptedException {
            String skey = key.toString () ;
            String word = skey.substring (0, skey.indexOf ("|")) ;
            String fileName = skey.substring (skey.indexOf ("|")+1) ;
            contexte.write (new Text (fileName), new Text (word + "|" + value)) ;
        }
    }
    
    /* Reducer: sum the number of occurrences of every word in the file.
    * Input  : (file, word|n)
    * Output : (word|file, n|N)
    */
    public static class MR2Reducer extends Reducer <Text, Text, Text, Text> {
        public void reduce (Text key, Iterable <Text> values, Context contexte) 
                    throws IOException, InterruptedException {
            int N = 0 ;
    
            // Iterators can only be consumed once, so the values are kept
            // in an ArrayList in order to traverse them a second time.
            ArrayList <String> altmp = new ArrayList <String> () ;
        
            // 1st pass: compute the total word count N of the file
            for (Text val : values) {
                String sval = val.toString () ;
                String sn = sval.substring (sval.indexOf ("|")+1) ;
                int n = Integer.parseInt (sn) ;
            
                altmp.add (val.toString ()) ;
                N += n ;
            }
    
            // 2nd pass: emit one (word|file, n|N) pair per stored word
            Iterator <String> it = altmp.iterator () ;
            while (it.hasNext ()) {
                String val = it.next () ;
                String sval = val.toString () ;
                String word = sval.substring (0, sval.indexOf ("|")) ;
                String sn = sval.substring (sval.indexOf ("|")+1) ;
                int n = Integer.parseInt (sn) ;
            
                // I tried to replace n with "gg" here, still same teleporting issue
                contexte.write (new Text (word + "|" + key.toString ()), new Text (n + "|" + N)) ; 
            }
        }
    }
    
    public static void main (String args []) throws Exception {
        if (args.length != 2) {
            System.err.println (args.length + "("+args [0] + "," +args [1] + ")") ;
            System.err.println ("Usage : MR2 <source> <destination>") ;
            System.exit (-1) ;
        }
    
        Job job = new Job () ;
        job.setJarByClass (MR2.class) ;
    
        // HDFS input file
        FileInputFormat.addInputPath (job, new Path (args [0])) ;
        FileOutputFormat.setOutputPath (job, new Path (args [1])) ;
    
        job.setInputFormatClass(KeyValueTextInputFormat.class);
    
        job.setMapperClass (MR2Mapper.class) ;
        job.setCombinerClass (MR2Reducer.class) ;
        job.setReducerClass (MR2Reducer.class) ;
    
        job.setMapOutputKeyClass (Text.class) ;
        job.setMapOutputValueClass (Text.class) ;
    
        job.setOutputKeyClass (Text.class) ;
        job.setOutputValueClass (Text.class) ;
    
        System.exit (job.waitForCompletion (true) ? 0 : 1) ;
    }
}

Any help would be appreciated.

ANSWER

Answered 2021-May-20 at 12:08

It's the Combiner's fault. You are specifying in the driver class that you want to use MR2Reducer both as a Combiner and a Reducer in the following commands:

job.setCombinerClass (MR2Reducer.class) ;
job.setReducerClass (MR2Reducer.class) ;

However, a Combiner is running within the range of a Map instance, while a Reducer is operating in series after the execution of all the Mappers. By using a Combiner, you are essentially putting MR2Reducer to execute right after the execution of each individual Mapper task, so it calculates N and splits the composite value of the given key-value input within each Mapper task range.

This basically means the Reduce phase kicks off with input in the (word|file, n|N) key-value schema (i.e. the output of an MR2Reducer task run before the Reduce phase) instead of the desired (file, word|n) schema. By unknowingly consuming the wrong schema, the reducer splits the composite value incorrectly, and the output key-value pairs come out wonky, wrong, and/or reversed.

To fix this, you can either:

  • create a custom Combiner with the same commands as your current MR2Reducer, and change your MR2Reducer class to receive key-value pairs in the (word|file, n|N) schema (not recommended, as it will probably negate the benefits in scalability and execution time, and will only make your MapReduce job more complicated than it needs to be), or
  • delete or comment out the job.setCombinerClass (MR2Reducer.class) ; line from your driver class to keep things simple and functional, so you can build from there in the future.
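The schema clash can be reproduced outside Hadoop. Below is a hypothetical pure-Python simulation of MR2Reducer's pipe-splitting logic (function and file names are illustrative, not from the original code). Running the reduce function twice, as the combiner+reducer combination effectively does, produces exactly the "teleported" keys from the question:

```python
def mr2_reduce(key, values):
    # key: a file name; values: strings of the form "word|n".
    # Emits (word|file, n|N) where N is the file's total word count.
    N = sum(int(v.split('|')[1]) for v in values)
    return [(v.split('|')[0] + '|' + key, v.split('|')[1] + '|' + str(N))
            for v in values]

mapper_out = [('doc.txt', ['hello|2', 'world|3'])]

# Correct: the reducer runs once on the mapper output schema (file, word|n).
ok = [kv for k, vs in mapper_out for kv in mr2_reduce(k, vs)]
# ok == [('hello|doc.txt', '2|5'), ('world|doc.txt', '3|5')]

# Broken: the combiner (same function) runs first, then the reducer re-parses
# pairs that are ALREADY in the output schema, mis-splitting them.
combined = [kv for k, vs in mapper_out for kv in mr2_reduce(k, vs)]
broken = [kv for k, v in combined for kv in mr2_reduce(k, [v])]
# broken == [('2|hello|doc.txt', '5|5'), ('3|world|doc.txt', '5|5')]
```

The broken output matches the question's observed `2|hello|hdfs://... 192|192` pattern: the count "teleports" into the key because the reducer splits a key that already contains a pipe.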

To showcase this, I used your MR1, MR2 classes locally on my machine, deleted the job.setCombinerClass (MR2Reducer.class) ; line and used this input stored in HDFS to verify that the output key-value pairs are as desired. Here is a snippet of the output after the execution:

balance|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt    1|661
suppress|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt   1|661
back|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt       4|661
after|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt      1|661
suspicious|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt 2|661
swang|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt      2|661
swinging|hdfs://localhost:9000/user/crsl/metamorphosis/05.txt   1|661

Source https://stackoverflow.com/questions/67593978

QUESTION

Text (cosine) similarity

Asked 2021-May-15 at 11:36

I have followed the explanation of Fred Foo in this stack overflow question: How to compute the similarity between two text documents?

I have run the following piece of code that he wrote:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())

And the result is:

[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]

But what I noticed is that when I set corpus to be:

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away"]

and run the same code again, I get the matrix:

[[1.         0.19431434]
 [0.19431434 1.        ]]

Thus their similarity changes (in the first matrix, their similarity is 0.17668795). Why is that the case? I am really confused. Thank you in advance!

ANSWER

Answered 2021-May-15 at 11:36

Wikipedia shows how tf-idf is calculated:


tf(t, d) = frequency of term t in document d
idf(t, D) = log( N / |{d in D : t in d}| )
tfidf(t, d, D) = tf(t, d) * idf(t, D)


N - number of documents in corpus.

So the idf weights, and therefore the similarity, depend on the total number of documents/sentences in the corpus.

If you have more documents/sentences, the results change.

If you add the same document/sentence a few times, that also changes the results.
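This dependence is easy to see from the idf formula scikit-learn uses by default (smooth_idf=True): idf(t) = ln((1 + N) / (1 + df(t))) + 1, where N is the number of documents. A small sketch with illustrative counts:

```python
import math

# scikit-learn's default idf (smooth_idf=True): ln((1 + N) / (1 + df)) + 1
def idf(n_docs, doc_freq):
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# "apple" appears in 3 of the 5 documents of the larger corpus...
print(idf(5, 3))   # ~1.405
# ...but in 2 of the 2 documents of the smaller corpus.
print(idf(2, 2))   # 1.0 -- same word, different weight
```

Since every entry of the tf-idf matrix carries these weights, the cosine similarity between the same two sentences changes whenever the corpus size or the document frequencies change.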

Source https://stackoverflow.com/questions/67531709

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

VULNERABILITIES

No vulnerabilities reported

INSTALL tfidf

You can use tfidf like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the tfidf component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.

SUPPORT

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .
