kandi X-RAY | spark-elastic Summary
This project combines Apache Spark and Elasticsearch to enable mining & prediction for Elasticsearch.
Trending Discussions on spark-elastic
QUESTION
Using the spark-elasticsearch connector, it is possible to load only the required columns directly from ES into Spark. However, there doesn't seem to be an equally straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark, where only the required columns are brought from ES to Spark:
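For illustration, a minimal PySpark sketch of that kind of column-restricted read via the elasticsearch-spark connector (the host, index, and column names are hypothetical, not the asker's original code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read").getOrCreate()

# Selecting columns up front lets the connector fetch only those
# fields from Elasticsearch instead of whole documents.
logs = (spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost:9200")  # hypothetical host
        .load("logs")                          # hypothetical index
        .select("timestamp", "status"))        # hypothetical columns
```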
...ANSWER
Answered 2020-Jun-18 at 20:17
Actually, the connector should do this itself, without the need to set anything explicitly. It's called "predicate pushdown", and the cassandra-connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
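A quick way to verify this is to select and filter a Cassandra-backed DataFrame and inspect the plan. A sketch (the keyspace, table, and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-pushdown").getOrCreate()

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="events")  # hypothetical keyspace/table
      .load()
      .select("user_id", "event_time")         # column pruning goes to Cassandra
      .filter("user_id = 42"))                 # valid predicates are pushed down

# The physical plan shows which filters and columns were pushed to the source.
df.explain()
```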
QUESTION
How to set up Spark for speed?
I'm running spark-elasticsearch to analyze log data.
It takes about 5 minutes to do an aggregate/join over 2 million rows (4 GB).
I'm running 1 master and 3 workers on 3 machines. I increased executor memory to 8 GB and increased ES nodes from 1 to 3.
I'm running a standalone cluster in client mode (https://becominghuman.ai/real-world-python-workloads-on-spark-standalone-clusters-2246346c7040). I'm not using spark-submit, just running Python code after launching the master/workers.
Spark seems to launch 3 executors in total (one from each of the 3 workers).
I'd like to tune Spark a little to get the most performance with minimal tuning.
Which approach should I take for optimization?
- consider another cluster manager (YARN, etc.; although I have no idea what they offer, it seems easier to change memory-related settings there)
- run more executors
- analyze the job plan with the explain API
- accept that it takes this much time because 4 GB of data has to be downloaded (does Spark have to grab all the data to run aggregates such as group by and sum?) and, if applicable, save the data to Parquet (?) for further analysis
Below are my performance-related settings:
...ANSWER
Answered 2020-Jan-04 at 13:02
It is not always a matter of memory or cluster configuration; I would suggest starting by trying to optimize the query/aggregation you're running before increasing memory.
You can find some hints for Spark performance tuning here. See also Tuning Spark. Make sure the query is optimal and avoid known performance killers such as UDFs.
For executor and memory configuration in your cluster, you have to take into consideration the available memory and cores on all machines to calculate adequate parameters. Here is an interesting post on best practices.
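As a starting point, here is a hedged sketch of two of the options listed in the question: inspecting the plan with explain() and persisting a one-time copy to Parquet. It assumes an existing SparkSession `spark` and a DataFrame `logs` already loaded from ES; the column names and path are hypothetical:

```python
from pyspark.sql import functions as F

# `logs` is assumed to be a DataFrame already loaded from Elasticsearch.
agg = logs.groupBy("status").agg(F.sum("bytes").alias("total_bytes"))

# Inspect the physical plan: look for full scans, wide shuffles, and
# whether filters/column pruning were pushed down to the source.
agg.explain()

# Save a columnar copy once, so later jobs don't re-read 4 GB from ES.
logs.write.mode("overwrite").parquet("/tmp/logs.parquet")
cached = spark.read.parquet("/tmp/logs.parquet")
```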
QUESTION
I'm looking for ways to process data from a large index in parallel. I thought about snapshotting the index (to HDFS) and then submitting Spark jobs to process the records.
Another way to solve it is to use Elasticsearch with Spark.
My questions:
- Can the snapshot API output text files instead of binary files?
- How can I use spark-elastic and perform sub-queries for a specific document? (Let's say I have an index of dogs and I want to find the bones of each dog.)
------EDIT------
My indexes changed a little. There is a dogs index and a dogs-relation index. Dogs index:
...ANSWER
Answered 2017-Jan-31 at 15:37
Pt 1.
I don't think so. AFAIK the closest option would be to use the scan/scroll API (depending on which ES version you are on): ES v5.1 scroll API. You can 'export' your indexes to text files that way.
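For example, a minimal sketch of that kind of export with the official elasticsearch-py client, whose helpers.scan wraps the scan/scroll API (index name, host, and output path are hypothetical):

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # hypothetical host

# helpers.scan drives the scroll API and yields every matching document.
with open("dogs.ndjson", "w") as out:
    for hit in helpers.scan(es, index="dogs",
                            query={"query": {"match_all": {}}}):
        out.write(json.dumps(hit["_source"]) + "\n")
```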
Pt 2.
The simplest way, code-wise, to do what you want (an Elasticsearch query per dog document) would be to load your dogsRDD using elasticsearch-hadoop, then, for the sub-query behaviour, do something like:
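The original snippet was not preserved; below is a hedged reconstruction of the pattern in PySpark, loading the RDD through elasticsearch-hadoop's EsInputFormat and issuing one relation query per document (the index names, fields, and hosts are hypothetical):

```python
from elasticsearch import Elasticsearch

def bones_for_partition(docs):
    # One ES client per partition rather than per document.
    es = Elasticsearch("http://localhost:9200")
    for doc_id, source in docs:
        resp = es.search(index="dogs-relation",
                         body={"query": {"term": {"dog_id": doc_id}}})
        yield (doc_id, [h["_source"] for h in resp["hits"]["hits"]])

# `sc` is assumed to be an existing SparkContext. Each RDD element is
# (document_id, {field: value, ...}).
dogs_rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={"es.resource": "dogs/dog", "es.nodes": "localhost:9200"})

bones = dogs_rdd.mapPartitions(bones_for_partition)
```

Batching the lookups (for example, one terms query per partition instead of one term query per dog) would cut the number of round trips to ES.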
Community Discussions and Code Snippets include sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported