sentencepiece | Unsupervised text tokenizer for Neural Network-based text generation | Natural Language Processing library

 by google | C++ | Version: 0.2.0 | License: Apache-2.0

kandi X-RAY | sentencepiece Summary

sentencepiece is a C++ library typically used in Artificial Intelligence, Natural Language Processing, and Deep Learning applications. sentencepiece has no reported bugs or vulnerabilities, has a permissive license, and has medium support. You can download it from GitHub.

Unsupervised text tokenizer for Neural Network-based text generation.
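
In practice the library is usually driven from its Python wrapper: train a subword model on a raw corpus, then load it to tokenize text. A minimal sketch, assuming the pip package is installed (file names and vocab size are illustrative placeholders):

    import sentencepiece as spm

    # Train a subword model on a plain-text corpus (one sentence per line).
    # "corpus.txt", "m", and the vocab size are placeholders.
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="m", vocab_size=8000
    )

    # Load the trained model and segment text into subword pieces.
    sp = spm.SentencePieceProcessor(model_file="m.model")
    print(sp.encode("This is a test", out_type=str))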

            Support

              sentencepiece has a medium active ecosystem.
              It has 7616 stars and 981 forks. There are 119 watchers for this library.
              There was 1 major release in the last 6 months.
              There are 17 open issues and 610 have been closed. On average, issues are closed in 135 days. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of sentencepiece is 0.2.0.

            Quality

              sentencepiece has 0 bugs and 0 code smells.

            Security

              sentencepiece has no reported vulnerabilities, and neither do its dependent libraries.
              sentencepiece code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              sentencepiece is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              sentencepiece releases are available to install and integrate.
              Installation instructions, examples and code snippets are available.


            sentencepiece Key Features

            No Key Features are available at this moment for sentencepiece.

            sentencepiece Examples and Code Snippets

            Updated NER pipeline call using aggregation_strategy (from the first Community Discussion below)
            from transformers import pipeline

            ner = pipeline("ner", aggregation_strategy="simple", model="dbmdz/bert-large-cased-finetuned-conll03-english")  # Named Entity Recognition (NER)
            
            'NoneType' error when using PegasusTokenizer
            Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)
            pip install sentencepiece
            
            Using sentence transformers with limited access to internet
            Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)
            ['1_Pooling', 'config_sentence_transformers.json', 'tokenizer.json', 'tokenizer_config.json', 'modules.json', 'sentence_bert_config.json', 'pytorch_model.bin', 'special_tokens_map.json', 'config.json', 'train_script.py', 'data_config.json'
            TypeError: not a string | parameters in AutoTokenizer.from_pretrained()
            Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)
            from transformers import AlbertTokenizer

            tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2')
            
            python packages not being installed on the virtual environment using ubuntu
            Python | Lines of Code: 2 | License: Strong Copyleft (CC BY-SA 4.0)
            pip3 install package_name --user
            
            Heroku: Compiled Slug Size is too large Python
            Python | Lines of Code: 4 | License: Strong Copyleft (CC BY-SA 4.0)
            heroku run bash -a 
            
            du -ha --max-depth 1 /app | sort -hr
            
            Wrong tensor type when trying to do the HuggingFace tutorial (pytorch)
            Python | Lines of Code: 4 | License: Strong Copyleft (CC BY-SA 4.0)
            from transformers import AutoModelForSequenceClassification

            # Original call with a single label:
            model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)

            # Corrected call with two labels, matching the tutorial's binary task:
            model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
            
            Substring replacements based on replace and no-replace rules
            Python | Lines of Code: 27 | License: Strong Copyleft (CC BY-SA 4.0)
            import re

            replace_dict = ... # The code below assumes you already have this
            no_replace_dict = ...# The code below assumes you already have this
            text = ... # The text on input.
            
            def match_fun(match: re.Match):
                str_match: str = match.group()
            
                
            Substring replacements based on replace and no-replace rules
            Python | Lines of Code: 31 | License: Strong Copyleft (CC BY-SA 4.0)
            import re

            def replace_whole(sentence, replace_token, replace_with, dont_replace):
                # Match the token only when surrounded by quotes, punctuation, or spaces.
                rx = f"[\"'.,:; ]({replace_token})[\"'.,:; ]"
                matches = re.finditer(rx, sentence)
                out_sentence = ""
                found = []
                indices = []
                for m in matches:
               
            Error importing BERT: module 'tensorflow._api.v2.train' has no attribute 'Optimizer'
            Python | Lines of Code: 4 | License: Strong Copyleft (CC BY-SA 4.0)
            # Before (tf.train no longer exposes Optimizer in TF2):
            class AdamWeightDecayOptimizer(tf.train.Optimizer):

            # After (use the TF1 compatibility module):
            class AdamWeightDecayOptimizer(tf.compat.v1.train.Optimizer):
            

            Community Discussions

            QUESTION

            HuggingFace Pipeline: UserWarning: `grouped_entities` is deprecated and will be removed in version v5.0.0. How to improve about this warning?
            Asked 2022-Apr-17 at 08:20

            I am using the following code

            ...

            ANSWER

            Answered 2022-Apr-17 at 08:19

            According to [HuggingFace]: Pipelines - class transformers.TokenClassificationPipeline (emphasis is mine):

            • grouped_entities (bool, optional, defaults to False) - DEPRECATED, use aggregation_strategy instead. Whether or not to group the tokens corresponding to the same entity together in the predictions or not.

            So, your line of code could be:
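
            Mirroring the first snippet in the Examples section above, which uses the model named in this question:

                from transformers import pipeline

                # aggregation_strategy replaces the deprecated grouped_entities flag.
                ner = pipeline(
                    "ner",
                    aggregation_strategy="simple",
                    model="dbmdz/bert-large-cased-finetuned-conll03-english",
                )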

            Source https://stackoverflow.com/questions/71900161

            QUESTION

            Pip install results in this error: "'cl.exe' failed with exit code 2"
            Asked 2022-Mar-27 at 07:11

            I've read all of the other questions on this error and frustratingly enough, none give a solution that works.

            If I run pip install sentencepiece in the cmd line, it gives me the following output.

            ...

            ANSWER

            Answered 2022-Feb-24 at 15:50

            I haven't seen this problem on Windows, but on Linux I would normally reinstall Python after installing the dependencies (such as the MSVC toolchain here). In that case reinstalling is especially helpful because I'm often rebuilding (compiling and other related steps) Python/pip.

            It could also just be an error specific to the combination of module and Python version you're trying.

            From a discussion in the comments:

            I have the pyenv-win version manager, so I was able to create venvs and test this for you. With Python 3.10.2 it fails; with Python 3.8.10 it succeeds. So, yes, reinstalling does seem to be worth your time.

            Source https://stackoverflow.com/questions/71242919

            QUESTION

            How can I use custom tokenizer in opennmt transformer
            Asked 2022-Feb-11 at 09:07

            I'm trying to train a transformer for translation with opennmt-py,
            and I already have a tokenizer trained by sentencepiece (unigram),
            but I don't know how to use my custom tokenizer in the training config yaml.
            I'm referring to the opennmt-docs site (https://opennmt.net/OpenNMT-py/examples/Translation.html).
            Here are my code and the error.

            ...

            ANSWER

            Answered 2022-Feb-11 at 09:07

            I got the answers:

            1. We can use tools/spm_to_vocab in onmt to convert the sentencepiece vocabulary.
            2. The train_from argument is the one to use.
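
            As a side note, a quick way to sanity-check that a trained unigram model loads and tokenizes as expected (the model file name below is a hypothetical placeholder):

                import sentencepiece as spm

                # Load the trained unigram model (path is a placeholder).
                sp = spm.SentencePieceProcessor(model_file="unigram.model")

                # Segment a sample sentence into subword pieces.
                print(sp.encode("Hello world", out_type=str))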

            Source https://stackoverflow.com/questions/71062002

            QUESTION

            Colab: (0) UNIMPLEMENTED: DNN library is not found
            Asked 2022-Feb-08 at 19:27

            I have a pretrained model for object detection (Google Colab + TensorFlow) inside Google Colab, and I run it two or three times per week for new images. Everything was fine for the last year until this week. Now when I try to run the model, I get this message:

            ...

            ANSWER

            Answered 2022-Feb-07 at 09:19

            The same thing happened to me last Friday. I think it has something to do with the CUDA installation in Google Colab, but I don't know exactly the reason.

            Source https://stackoverflow.com/questions/71000120

            QUESTION

            Using sentence transformers with limited access to internet
            Asked 2022-Jan-19 at 13:27

            I have access to the latest packages, but I cannot access the internet from my Python environment.

            The package versions I have are as below.

            ...

            ANSWER

            Answered 2022-Jan-19 at 13:27

            Based on the things you mentioned, I checked the source code of sentence-transformers on Google Colab. After running the model and getting the files, I checked the directory and saw the pytorch_model.bin there.

            And according to the sentence-transformers code: Link

            the flax_model.msgpack, rust_model.ot, and tf_model.h5 are ignored when it tries to download.

            These are the files that it downloads; the file list appears in the snippets section above.
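
            Once those files are on disk, the model can typically be loaded from the local directory instead of the Hub. A minimal sketch, assuming the files were copied to a local folder (the path below is a placeholder):

                from sentence_transformers import SentenceTransformer

                # Point at the local directory instead of a Hub model name.
                model = SentenceTransformer("/path/to/local/model_dir")
                embeddings = model.encode(["An example sentence."])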

            Source https://stackoverflow.com/questions/70716702

            QUESTION

            HuggingFace AutoTokenizer | ValueError: Couldn't instantiate the backend tokenizer
            Asked 2022-Jan-14 at 14:10

            Goal: Amend this Notebook to work with albert-base-v2 model

            Error occurs in Section 1.3.

            Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory.

            There are 3 listed ways this error can be caused. I'm not sure which my case falls under.

            Section 1.3:

            ...

            ANSWER

            Answered 2022-Jan-14 at 14:09

            First, I had to pip install sentencepiece.

            However, in the same code line, I was getting an error with sentencepiece.

            Wrapping str() around both parameters yielded the same Traceback.
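
            A minimal sketch of the setup this points to, assuming sentencepiece has been installed first (the model name comes from the question's goal of using albert-base-v2):

                from transformers import AutoTokenizer

                # Building ALBERT's tokenizer requires the sentencepiece package.
                tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")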

            Source https://stackoverflow.com/questions/70698407

            QUESTION

            TypeError: not a string | parameters in AutoTokenizer.from_pretrained()
            Asked 2022-Jan-14 at 14:07

            Goal: Amend this Notebook to work with albert-base-v2 model.

            Kernel: conda_pytorch_p36. I did Restart & Run All, and refreshed the file view in the working directory.

            In order to evaluate and export this quantised model, I need to set up a tokenizer.

            Error occurs in Section 1.3.

            Both parameters in AutoTokenizer.from_pretrained() throw the same error.

            Section 1.3 Code:

            ...

            ANSWER

            Answered 2022-Jan-14 at 14:07

            Passing just the model name suffices.
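
            As in the snippet in the Examples section above:

                from transformers import AlbertTokenizer

                # Passing only the model name is enough; no extra path arguments are needed.
                tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")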

            Source https://stackoverflow.com/questions/70709572

            QUESTION

            Tensorflow Object Detection API taking forever to install in a Google Colab and failing
            Asked 2021-Nov-19 at 00:16

            I am trying to install the Tensorflow Object Detection API on a Google Colab and the part that installs the API, shown below, takes a very long time to execute (in excess of one hour) and eventually fails to install.

            ...

            ANSWER

            Answered 2021-Nov-19 at 00:16

            I have solved this problem with

            Source https://stackoverflow.com/questions/70012098

            QUESTION

            python packages not being installed on the virtual environment using ubuntu
            Asked 2021-Aug-18 at 18:11

            I have a requirements.txt file which lists all the Python packages I need for my Flask application. Here is what I did:

            1. python3 -m venv venv
            2. source venv/bin/activate
            3. sudo pip install -r requirements.txt

            When I checked whether the packages were installed in the virtual environment using pip list, I did not see them. Can someone tell me what went wrong?

            ...

            ANSWER

            Answered 2021-Aug-18 at 18:05

            If you want to use Python 3 to install the packages, try pip3 install package_name.

            And to solve the errno 13 permission error, add --user at the end.

            Note that step 3 above runs pip with sudo, which executes it as root outside the activated virtual environment; that is why pip list inside the venv shows nothing. Inside an activated venv, a plain pip install -r requirements.txt is sufficient.

            Source https://stackoverflow.com/questions/68837021

            QUESTION

            Heroku: Compiled Slug Size is too large Python
            Asked 2021-Jul-21 at 06:50

            I am trying to deploy my app to Heroku.

            I get the following deployment error:

            ...

            ANSWER

            Answered 2021-Jul-21 at 06:50

            The maximum allowed slug size is 500MB. Slugs are an important aspect of Heroku: when you git push to Heroku, your code is received by the slug compiler, which transforms your repository into a slug.

            First of all, let's determine which files are taking up a considerable amount of space in your slug. To do that, fire up the Heroku CLI and access your dyno with the commands shown in the snippet above.

            Source https://stackoverflow.com/questions/68464527

            Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install sentencepiece

            The following tools and libraries are required to build SentencePiece:
            cmake
            C++11 compiler
            gperftools library (optional; a 10-40% performance improvement can be obtained)
            You can also download and install sentencepiece using the vcpkg dependency manager. The sentencepiece port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

            Support

            For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.

            Install
          • PyPI

            pip install sentencepiece
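
            A quick check from Python that the package installed correctly:

                import sentencepiece as spm

                # Print the installed package version (0.2.0 at the time of this page).
                print(spm.__version__)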

          • Clone (HTTPS)

            https://github.com/google/sentencepiece.git

          • CLI

            gh repo clone google/sentencepiece

          • Clone (SSH)

            git@github.com:google/sentencepiece.git

