spark-nlp | State-of-the-Art Natural Language Processing library
kandi X-RAY | spark-nlp Summary
Spark NLP is a state-of-the-art Natural Language Processing library built on top of Apache Spark. It provides simple, performant, and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 4000+ pretrained pipelines and models in more than 200 languages. It offers tasks such as Tokenization, Word Segmentation, Part-of-Speech Tagging, Word and Sentence Embeddings, Named Entity Recognition, Dependency Parsing, Spell Checking, Text Classification, Sentiment Analysis, Token Classification, Machine Translation (180+ languages), Summarization and Question Answering, Text Generation, and many more NLP tasks. Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, and GPT2, not only to Python and R but also to the JVM ecosystem (Java, Scala, and Kotlin) at scale, by extending Apache Spark natively.
spark-nlp Key Features
spark-nlp Examples and Code Snippets
# Copies the *contents* of core directly into /var/task, so `import core` fails:
COPY core ${LAMBDA_TASK_ROOT}
# Keeps the package directory as /var/task/core, so `import core` resolves:
COPY core ${LAMBDA_TASK_ROOT}/core
from mlflow.tracking import MlflowClient
# Create an experiment with a name that is unique and case sensitive.
client = MlflowClient()
experiment_id = client.create_experiment("Social NLP Experiments")
client.set_experiment_tag(experiment_id, "nlp.framework", "Spark NLP")  # example tag key/value
from sparknlp.pretrained import PretrainedPipeline
df = spark.sql('select year, month, u_id, p_id, comment from MY_DF where rating_score = 1 and isnull(comment) = false')
df1 = df.withColumnRenamed('comment', 'text')
pipeline_dl = PretrainedPipeline('analyze_sentiment', lang='en')  # example pretrained pipeline name
result = pipeline_dl.transform(df1)
Community Discussions
Trending Discussions on spark-nlp
QUESTION
I am using the code below to read the Spark dataframe from HDFS:
...ANSWER
Answered 2022-Apr-03 at 08:47
The configure_spark_with_delta_pip is just a shortcut to set up the correct parameters of the SparkSession. If you look into its source code, you'll see that all it does is configure spark.jars.packages. But because you're setting that property separately for Spark NLP, you're overwriting Delta's value. To handle such situations, configure_spark_with_delta_pip has an additional parameter, extra_packages, for specifying additional packages to be configured. So in your case the code should look as follows:
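A minimal sketch of that fix, assuming delta-spark 1.x+ (which provides the extra_packages parameter); the spark-nlp coordinate and version below are examples and should be adjusted to your environment:

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (
    SparkSession.builder.appName("delta-plus-spark-nlp")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)

# extra_packages appends the Spark NLP coordinate to spark.jars.packages
# instead of letting a second .config() call overwrite the Delta coordinate.
spark = configure_spark_with_delta_pip(
    builder,
    extra_packages=["com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0"],  # example version
).getOrCreate()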
QUESTION
I am deploying a lambda function as a container image. Here's my project structure:
- core
- plugins
- lambda_handler.py
All three are at the same level - /var/task
Inside lambda_handler.py I am importing the core package, but when I test it locally it says:
...ANSWER
Answered 2022-Mar-14 at 08:57
If you just use COPY core ${LAMBDA_TASK_ROOT}, the contents of the core directory are copied directly into /var/task rather than into /var/task/core, so import core cannot be resolved. Copying with COPY core ${LAMBDA_TASK_ROOT}/core keeps the package directory intact and makes the import work.
QUESTION
I am trying to run the "onto_electra_base_uncased" model on some data stored in a Hive table. I ran count() on the DataFrame before saving the data into the Hive table and got this exception.
Spark Shell launch configurations:
...ANSWER
Answered 2021-Nov-18 at 09:23
The solution to this issue is to use Kryo serialization. The default spark-shell or spark-submit invocation uses Java serialization, but the Annotate class in spark-nlp is implemented to use Kryo serialization, so the same should be used for running any spark-nlp jobs.
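A hedged PySpark sketch of launching a session with Kryo serialization enabled; the buffer size and spark-nlp coordinate are example values, and the same settings can be passed to spark-shell or spark-submit as --conf flags:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-nlp-kryo")
    # Run Spark NLP jobs with Kryo rather than the default Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryoserializer.buffer.max", "2000M")  # example buffer size
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0")  # example coordinate
    .getOrCreate()
)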
QUESTION
I am trying to train a SparkNLP NerCrfApproach model with a dataset in CoNLL format that has custom labels for product entities (like I-Prod, B-Prod, etc.). However, when using the trained model to make predictions, I get only "O" as the assigned label for all tokens. When using the same model trained on the CoNLL data from the SparkNLP workshop example, the classification works fine.
(cf. https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training/english/crf-ner)
So, the question is: Does NerCrfApproach rely on the standard tag set for NER labels used by the CoNLL data? Or can I use it for any custom labels and, if yes, do I need to specify these somehow? My assumption was that the labels are inferred from the training data.
Cheers, Martin
Update: The issue might not be related to the labels after all. I tried to replace my custom labels with CoNLL standard labels and I am still not getting the expected classification results.
...ANSWER
Answered 2021-Oct-14 at 06:26
As it turns out, this issue was not caused by the labels but by the size of the dataset. I was using a rather small dataset for development purposes. Not only was this dataset quite small, it was also heavily imbalanced, with far more "O" labels than other labels. After fixing this by using a dataset 10x the original size (in terms of sentences), I am able to get meaningful results, even for my custom labels.
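For reference, a minimal training sketch consistent with the discussion above: the labels (standard or custom) come from the CoNLL label column rather than a fixed tag set. The file path, embeddings model, and epoch count here are example values, not anything from the original question:

import sparknlp
from sparknlp.training import CoNLL
from sparknlp.annotator import NerCrfApproach, WordEmbeddingsModel

spark = sparknlp.start()

# The CoNLL reader yields document/sentence/token/pos/label columns;
# custom labels such as B-Prod / I-Prod simply appear in the "label" column.
training_data = CoNLL().readDataset(spark, "path/to/custom_labels.conll")  # example path

embeddings = (
    WordEmbeddingsModel.pretrained("glove_100d")
    .setInputCols(["sentence", "token"])
    .setOutputCol("embeddings")
)

ner_tagger = (
    NerCrfApproach()
    .setInputCols(["sentence", "token", "pos", "embeddings"])
    .setLabelColumn("label")   # labels are inferred from this column
    .setOutputCol("ner")
    .setMaxEpochs(10)          # example value
)

model = ner_tagger.fit(embeddings.transform(training_data))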
QUESTION
I'm using AWS Glue to run some PySpark Python code. Sometimes it succeeds, but sometimes it fails with a dependency error: Resource Setup Error: Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: JohnSnowLabs#spark-nlp;2.5.4: not found]. Here are the error logs:
ANSWER
Answered 2021-May-07 at 20:28
spark-packages moved on May 1, 2021. In my Scala project I had to add a different resolver, like so. It should be similar in Java.
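A hedged PySpark sketch of the same idea: point Spark at the relocated spark-packages repository, or switch to the Maven Central coordinate instead; the coordinates and versions below are examples only:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("glue-spark-nlp")
    # Relocated spark-packages repository, needed for old-style coordinates.
    .config("spark.jars.repositories", "https://repos.spark-packages.org/")
    # Alternatively, use the Maven Central coordinate directly, e.g.
    # com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4 for Spark 2.x builds.
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.5.4")
    .getOrCreate()
)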
QUESTION
I saved a pretrained model from spark-nlp, and now I'm trying to run a Python script in PyCharm with an Anaconda env:
...ANSWER
Answered 2021-Mar-17 at 15:29
Some context first: the spark-nlp library depends on a jar file that needs to be present in the Spark classpath. There are three ways to provide this jar, according to how you start the context in PySpark:
a) When you start your Python app through the interpreter, you call sparknlp.start() and the jar is downloaded automatically.
b) You pass the jar to the pyspark command using the --jars switch. In this case you take the jar from the releases page and download it manually.
c) You start pyspark and pass --packages; here you need to pass a Maven coordinate, for example, com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4.
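A minimal sketch of option (a), which is usually the simplest from an IDE such as PyCharm; sparknlp.start() builds the SparkSession and fetches the matching jar:

import sparknlp

# Starts (or reuses) a SparkSession and downloads the matching spark-nlp jar.
spark = sparknlp.start()

print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)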
QUESTION
I am using JupyterLab to run spark-nlp text analysis. At the moment I am just running the sample code:
...ANSWER
Answered 2020-Dec-24 at 10:18
Remove Spark 3.0.1 and leave just PySpark 2.4.x, as Spark NLP still doesn't support Spark 3.x. Use Java 8 instead of Java 11, because Java 11 is not supported in Spark 2.4.
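A small sanity-check sketch for those version constraints, assuming the 2.x-era setup described in this answer:

import pyspark
import sparknlp

print("PySpark:", pyspark.__version__)   # expected to start with 2.4 for this setup
print("Spark NLP:", sparknlp.version())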
QUESTION
How can I install Spark NLP packages offline, without an internet connection?
I've downloaded the package (recognize_entities_dl) and uploaded it to the cluster.
I've installed Spark NLP using pip install spark-nlp==2.5.5.
I'm using PySpark, and from the cluster I'm unable to download the packages.
Already tried:
...ANSWER
Answered 2020-Aug-26 at 14:41
Looking at your error:
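For context, a pretrained pipeline that has already been downloaded and extracted onto cluster storage can typically be loaded straight from disk with Spark ML's PipelineModel instead of being fetched online; a minimal sketch, where the path is a placeholder for wherever the unzipped recognize_entities_dl folder lives:

import sparknlp
from pyspark.ml import PipelineModel

spark = sparknlp.start()

# Load the unzipped pretrained pipeline from local/HDFS storage instead of
# letting Spark NLP try to download it.
pipeline = PipelineModel.load("/models/recognize_entities_dl_en")  # placeholder path

result = pipeline.transform(
    spark.createDataFrame([("Google was founded in 1998.",)], ["text"])
)
result.select("ner.result").show(truncate=False)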
QUESTION
I am very new to Zeppelin/Spark and couldn't find an accurate description of the steps for configuring new dependencies such as NLP libraries. Found a similar issue here.
I was trying to use the Johnsnowlabs NLP library in a Zeppelin notebook (Spark version 2.2.1). The setup included:
- In Zeppelin's interpreter configuration for Spark, include the following artifact: com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4
- Then, in conf/zeppelin-env.sh, set up SPARK_SUBMIT_OPTIONS: export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.2.2". Then restarted Zeppelin.
But the program below gives this error:
...ANSWER
Answered 2020-Jul-28 at 12:40
You don't need to edit conf/zeppelin-env.sh (and anyway you're using it incorrectly, as you're specifying a completely different version); you can make all the changes via the Zeppelin UI. Go to the Spark interpreter configuration and put com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4 into the spark.jars.packages configuration property (or add it if it doesn't exist), and also into the Dependencies section at the end of the configuration (for some reason, it isn't automatically pulled into the driver classpath).
QUESTION
I'm getting an exception when executing spark2-submit on my Hadoop cluster while reading a directory of .json files in HDFS, and I have no idea how to resolve it.
I have found some questions about this on several boards, but none of them are popular or have an answer.
I tried explicitly importing org.apache.spark.sql.execution.datasources.json.JsonFileFormat, but it seems redundant next to importing SparkSession, so it isn't getting recognised.
I can, however, confirm that both of these classes are available.
...ANSWER
Answered 2020-Jul-05 at 18:31
It seems you have both Spark 2.x and 3.x jars in the classpath. According to the sbt file, Spark 2.x should be used; however, JsonFileFormat was only added in Spark 3.x with this issue.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install spark-nlp
FAT-JAR for CPU on Apache Spark 3.0.x and 3.1.x
FAT-JAR for GPU on Apache Spark 3.0.x and 3.1.x
FAT-JAR for CPU on Apache Spark 3.2.x
FAT-JAR for GPU on Apache Spark 3.2.x
FAT-JAR for CPU on Apache Spark 2.4.x
FAT-JAR for GPU on Apache Spark 2.4.x
FAT-JAR for CPU on Apache Spark 2.3.x
FAT-JAR for GPU on Apache Spark 2.3.x
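A hedged sketch of using one of these FAT-JARs from PySpark once it has been downloaded: pick the jar that matches your Spark line (CPU or GPU) and point spark.jars at its local path; the path below is a placeholder:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("spark-nlp-offline-jar")
    # Local path to the downloaded spark-nlp FAT-JAR (placeholder).
    .config("spark.jars", "/opt/jars/spark-nlp-assembly.jar")
    .getOrCreate()
)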