kandi X-RAY | spark-elastic Summary
This project combines Apache Spark and Elasticsearch to enable mining & prediction for Elasticsearch.
Trending Discussions on spark-elastic
QUESTION
Using the spark-elasticsearch connector, it is possible to load only the required columns directly from ES into Spark. However, there doesn't seem to be an equally straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark, where only the required columns are brought from ES to Spark:
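For illustration, a minimal PySpark sketch of that kind of column-restricted read via the elasticsearch-spark connector (the host, index, and column names are hypothetical, not the asker's original code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read").getOrCreate()

# Selecting columns up front lets the connector fetch only those
# fields from Elasticsearch instead of whole documents.
logs = (spark.read
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", "localhost:9200")  # hypothetical host
        .load("logs")                          # hypothetical index
        .select("timestamp", "status"))        # hypothetical columns
```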
...ANSWER
Answered 2020-Jun-18 at 20:17
Actually, the connector should do this itself, without the need to set anything explicitly. It's called "predicate pushdown", and the cassandra-connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
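A quick way to verify this is to select and filter a Cassandra-backed DataFrame and inspect the plan. A sketch (the keyspace, table, and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-pushdown").getOrCreate()

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="ks", table="events")  # hypothetical keyspace/table
      .load()
      .select("user_id", "event_time")         # column pruning goes to Cassandra
      .filter("user_id = 42"))                 # valid predicates are pushed down

# The physical plan shows which filters and columns were pushed to the source.
df.explain()
```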
QUESTION
How to set up Spark for speed?
I'm running spark-elasticsearch to analyze log data.
It takes about 5 minutes to do an aggregate/join over 2 million rows (4 GB).
I'm running 1 master and 3 workers on 3 machines. I increased executor memory to 8 GB and increased ES nodes from 1 to 3.
I'm running a standalone cluster in client mode (https://becominghuman.ai/real-world-python-workloads-on-spark-standalone-clusters-2246346c7040). I'm not using spark-submit, just running Python code after launching the master/workers.
Spark seems to launch 3 executors in total (one from each of the 3 workers).
I'd like to tune Spark a little to get the most performance with minimal tuning.
Which approach should I take for optimization?
- consider another cluster manager (YARN, etc.; although I have no idea what they offer, it seems easier to change memory-related settings there)
- run more executors
- analyze the job plan with the explain API
- accept that it takes this much time because 4 GB of data has to be downloaded (does Spark have to grab all the data to run aggregates such as group by and sum?) and, if applicable, save the data to Parquet (?) for further analysis
Below are my performance-related settings:
...ANSWER
Answered 2020-Jan-04 at 13:02
It is not always a matter of memory or cluster configuration; I would suggest starting by trying to optimize the query/aggregation you're running before increasing memory.
You can find some hints for Spark performance tuning here. See also Tuning Spark. Make sure the query is optimal and avoid known performance killers such as UDFs.
For executor and memory configuration in your cluster, you have to take into consideration the available memory and cores on all machines to calculate adequate parameters. Here is an interesting post on best practices.
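As a starting point, here is a hedged sketch of two of the options listed in the question: inspecting the plan with explain() and persisting a one-time copy to Parquet. It assumes an existing SparkSession `spark` and a DataFrame `logs` already loaded from ES; the column names and path are hypothetical:

```python
from pyspark.sql import functions as F

# `logs` is assumed to be a DataFrame already loaded from Elasticsearch.
agg = logs.groupBy("status").agg(F.sum("bytes").alias("total_bytes"))

# Inspect the physical plan: look for full scans, wide shuffles, and
# whether filters/column pruning were pushed down to the source.
agg.explain()

# Save a columnar copy once, so later jobs don't re-read 4 GB from ES.
logs.write.mode("overwrite").parquet("/tmp/logs.parquet")
cached = spark.read.parquet("/tmp/logs.parquet")
```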
QUESTION
I'm looking for ways to process data from a large index in parallel. I thought about snapshotting the index (to HDFS) and then submitting Spark jobs to process the records.
Another way to solve it is to use Elasticsearch with Spark.
My questions:
- Can the snapshot API output text files instead of binary files?
- How can I use spark-elastic and perform sub-queries for a specific document? (Let's say I have an index of dogs and I want to find the bones of each dog.)
------EDIT------
My indexes changed a little. There is a dogs index and a dogs-relation index. Dogs index:
...ANSWER
Answered 2017-Jan-31 at 15:37
Pt 1.
I don't think so. AFAIK the closest option would be to use the scan/scroll API (depending on which ES version you are on): ES v5.1 scroll API. You can 'export' your indexes to text files that way.
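For example, a minimal sketch of that kind of export with the official elasticsearch-py client, whose helpers.scan wraps the scan/scroll API (index name, host, and output path are hypothetical):

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # hypothetical host

# helpers.scan drives the scroll API and yields every matching document.
with open("dogs.ndjson", "w") as out:
    for hit in helpers.scan(es, index="dogs",
                            query={"query": {"match_all": {}}}):
        out.write(json.dumps(hit["_source"]) + "\n")
```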
Pt 2.
The simplest way, code-wise, to do what you want (an Elasticsearch query per dog document) would be to load your dogsRDD using elasticsearch-hadoop, then, for the sub-query behaviour, do something like:
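The original snippet was not preserved; below is a hedged reconstruction of the pattern in PySpark, loading the RDD through elasticsearch-hadoop's EsInputFormat and issuing one relation query per document (the index names, fields, and hosts are hypothetical):

```python
from elasticsearch import Elasticsearch

def bones_for_partition(docs):
    # One ES client per partition rather than per document.
    es = Elasticsearch("http://localhost:9200")
    for doc_id, source in docs:
        resp = es.search(index="dogs-relation",
                         body={"query": {"term": {"dog_id": doc_id}}})
        yield (doc_id, [h["_source"] for h in resp["hits"]["hits"]])

# `sc` is assumed to be an existing SparkContext. Each RDD element is
# (document_id, {field: value, ...}).
dogs_rdd = sc.newAPIHadoopRDD(
    "org.elasticsearch.hadoop.mr.EsInputFormat",
    "org.apache.hadoop.io.NullWritable",
    "org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={"es.resource": "dogs/dog", "es.nodes": "localhost:9200"})

bones = dogs_rdd.mapPartitions(bones_for_partition)
```

Batching the lookups (for example, one terms query per partition instead of one term query per dog) would cut the number of round trips to ES.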
Community Discussions and Code Snippets include sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported