spark | Apache Spark - A unified analytics engine
kandi X-RAY | spark Summary
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs. Please refer to the build documentation at "Specifying the Hadoop Version and Enabling YARN" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions.
spark Key Features
spark Examples and Code Snippets
import pandas as pd

# sample frame (reconstructed from the shown output; the original construction line was truncated)
x = pd.DataFrame(
    [['Spark', 'cloud'], ['PySpark', 'cloud'], ['Python', None]],
    columns=['category_name_courses', 'category_name_area'])

# print every column whose name contains all of the filter tokens
filters = ['category', 'name']
for col in x.columns:
    if all(f in col for f in filters):
        print(col)
# category_name_courses
# category_name_area

# same idea with DataFrame.filter; lookaheads make the regex require every token
reg_str = ''.join(f'(?=.*{f})' for f in filters)
x.filter(regex=reg_str)
#   category_name_courses category_name_area
# 0                 Spark               cloud
# 1               PySpark               cloud
# 2                Python                None
sc = spark.sparkContext

rdd = sc.parallelize(["abcdefg", "hijklmno"])
rdd.collect()
# Out: ['abcdefg', 'hijklmno']

# split each string into two-character chunks joined by '-'
rdd.map(lambda x: '-'.join([x[i:i+2] for i in range(0, len(x), 2)])).collect()
# Out: ['ab-cd-ef-g', 'hi-jk-lm-no']

# separate example: element-wise sums of value tuples per key
# (input pair RDD reconstructed to match the shown output)
pairs = sc.parallelize([('word1', (1, 2)), ('word1', (2, 3)),
                        ('word2', (3, 4)), ('word2', (5, 6))])
pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
# [('word1', (3, 5)), ('word2', (8, 10))]
# comma-separated list of the columns that exist in new_df but not in df
init_cols = df.columns
new_cols = new_df.columns
result = ','.join([c for c in new_cols if c not in init_cols])
from itertools import chain
import pyspark.sql.functions as F

# build a literal lookup map from the Python dict `keys`
keys_map = F.create_map(*[F.lit(x) for x in chain(*keys.items())])

# remap geno.sampleId through the lookup map inside the geno struct
df = df.withColumn(
    "geno",
    F.struct(
        keys_map[F.col("geno.sampleId")].alias("sampleId"),
        # ...remaining fields of the struct were truncated in the source snippet
    )
)
# flag comments that share at least one word with the row's keywords
df['match'] = [set(c.split()).intersection(k.split(',')) > set()
               for c, k in zip(df['comments'], df['keywords'])]
#   name comments keywords match
# 0 paul account

# alternative: build a regex from the keywords and use str.contains
keys = '|'.join(f'({x})' for x in df['keywords'].iloc[0].split(','))
df['comments'].str.contains(keys)
# 0     True
# 1     True
# 2     True
# Name: comments, dtype: bool
rdd1 = sc.parallelize([('python', 36), ('c', 6), ('c#', 8)])
rdd2 = sc.parallelize([('python', 10), ('c', 1), ('c#', 1)])

# join on the key, then flatten (key, (v1, v2)) into (key, v1, v2)
rdd1.join(rdd2).map(lambda x: (x[0], *x[1])).toDF().show()
+------+---+---+
|    _1| _2| _3|
+------+---+---+
|python| 36| 10|
|     c|  6|  1|
|    c#|  8|  1|
+------+---+---+
df = spark.createDataFrame([
    (1, 'open', '01.01.22 10:05:04'),
    (1, 'In process', '01.01.22 10:07:02'),
], ['a', 'b', 'c'])
df.show()
+---+----------+-----------------+
|  a|         b|                c|
+---+----------+-----------------+
|  1|      open|01.01.22 10:05:04|
|  1|In process|01.01.22 10:07:02|
+---+----------+-----------------+
Community Discussions
Trending Discussions on spark
QUESTION
When I execute run-example SparkPi, for example, it works perfectly, but when I run spark-shell, it throws these exceptions:
ANSWER
Answered 2022-Jan-07 at 15:11
I faced the same problem; I think Spark 3.2 itself is the issue. After switching to Spark 3.1.2, it works fine.
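A quick way to confirm which Spark version a session is actually running after the downgrade (a PySpark sketch; from spark-shell the Scala equivalent is simply spark.version):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # should report 3.1.2 after switching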
QUESTION
Update: the root issue was a bug which was fixed in Spark 3.2.0.
The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the dataframes, which would return the desired result.
The question: what are the underlying Spark mechanics when creating df3? Spark reads df1.c1 == df2.c2 in the join's on clause, but it evidently does not pay attention to which dataframes are provided. What's under the hood there? How can such behaviour be anticipated?
First run (incorrect df3 result):
ANSWER
Answered 2021-Sep-24 at 16:19
For some reason Spark doesn't distinguish your c1 and c2 columns correctly. This is the fix for df3 to get your expected result:
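The fix itself was not captured in this excerpt; a minimal sketch of the aliasing approach the question mentions, reusing the df1.c1 and df2.c2 names from the question:

from pyspark.sql import functions as F

df1a = df1.alias("df1")
df2a = df2.alias("df2")

# qualify each column through its frame's alias so Spark cannot confuse the two
df3 = df1a.join(df2a, on=F.col("df1.c1") == F.col("df2.c2"), how="inner")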
QUESTION
I was using pyspark on AWS EMR (4 r5.xlarge instances as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on . Below is a snippet of the code that threw this error:
...
ANSWER
Answered 2021-Aug-26 at 14:53
I had the same error with pandas 1.3.2 on the server and 1.2 on my client. Downgrading pandas to 1.2 solved the problem.
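The error usually means the driver and the executors are running mismatched pandas versions. A small diagnostic sketch (it assumes an active SparkContext named sc) to compare the two:

import pandas

print("driver pandas:", pandas.__version__)
print("executor pandas:", sc.parallelize([0], 1).map(
    lambda _: __import__("pandas").__version__).collect())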
QUESTION
When switching from Glue 2.0 to 3.0, which also means switching from Spark 2.4 to 3.1.1, my jobs started to fail when processing timestamps prior to 1900, with this error:
...
ANSWER
Answered 2022-Feb-10 at 13:45
I made it work by setting --conf to spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInWrite=CORRECTED.
This is a workaround, though, and the Glue dev team is working on a fix, although there is no ETA.
It is also still very buggy. You cannot call .show() on a DynamicFrame, for example; you need to call it on a DataFrame. All my jobs also failed where I call data_frame.rdd.isEmpty(), don't ask me why.
Update 24.11.2021: I reached out to the Glue dev team and they told me that this is the intended way of fixing it. There is a workaround that can be done inside of the script though:
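The script-level workaround itself was not captured in this excerpt; a plausible sketch is to set the same four properties on the active SparkSession (assumed to be available as spark):

for key in [
    "spark.sql.legacy.parquet.int96RebaseModeInRead",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite",
]:
    spark.conf.set(key, "CORRECTED")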
QUESTION
I'm parsing an XML string to convert it to a JsonNode in Scala, using an XmlMapper from the Jackson library. I code in a Databricks notebook, so compilation is done on a cloud cluster. When compiling my code I got this error: java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()Lcom/fasterxml/jackson/databind/cfg/MutableCoercionConfig;
followed by a hundred lines of "at com.databricks. ..."
Maybe I forgot to import something, but to me the imports look fine (tell me if I'm wrong):
...
ANSWER
Answered 2021-Oct-07 at 12:08
Welcome to dependency hell and breaking changes in libraries.
This usually happens when various libraries bring in different versions of the same library. In this case it is Jackson.
java.lang.NoSuchMethodError: com.fasterxml.jackson.dataformat.xml.XmlMapper.coercionConfigDefaults()Lcom/fasterxml/jackson/databind/cfg/MutableCoercionConfig;
means: one library probably requires a Jackson version that has this method, but the version on the classpath does not yet have this function, or it was removed because it was deprecated or renamed.
In a case like this it is good to print the dependency tree and check which Jackson version each library requires, and, if possible, use newer versions of the required libraries.
Solution: use libraries that depend on compatible versions of Jackson. There is no other shortcut.
QUESTION
I am trying to install conda on EMR; below is my bootstrap script. It looks like conda is getting installed, but it is not being added to the PATH environment variable. When I manually update the $PATH variable on the EMR master node, it can identify conda. I want to use conda on Zeppelin.
I also tried adding the config below while launching my EMR instance, however I still get the error mentioned below.
...
ANSWER
Answered 2022-Feb-05 at 00:17
I got conda working by modifying the script as below; the EMR Python versions were colliding with the conda version:
QUESTION
I am trying to set my env_file configuration to be relative to each of the multiple docker-compose.yml file locations, instead of relative to the first docker-compose.yml.
The documentation (https://docs.docker.com/compose/compose-file/compose-file-v3/#env_file) suggests this should be possible:
If you have specified a Compose file with docker-compose -f FILE, paths in env_file are relative to the directory that file is in.
For example, when I issue
...
ANSWER
Answered 2021-Dec-20 at 18:51
It turns out that there's already an issue and discussion regarding this:
The thread points out that this is the expected behavior and is documented here: https://docs.docker.com/compose/extends/#understanding-multiple-compose-files
When you use multiple configuration files, you must make sure all paths in the files are relative to the base Compose file (the first Compose file specified with -f). This is required because override files need not be valid Compose files. Override files can contain small fragments of configuration. Tracking which fragment of a service is relative to which path is difficult and confusing, so to keep paths easier to understand, all paths must be defined relative to the base file.
There's a workaround within that discussion that works fairly well: https://github.com/docker/compose/issues/3874#issuecomment-470311052
The workaround is to use a ENV var that has a default:
- ${PROXY:-.}/haproxy/conf:/usr/local/etc/haproxy
Or in my case:
QUESTION
I have the following file paths that we read with partitions on S3:
...
ANSWER
Answered 2021-Dec-14 at 02:46
Yes, we can read all the JSON files without the partition columns. Use the parent folder path directly and it will load all partition data into the data frame.
After reading the data frame, you can use the withColumn() function to rename the date field.
Something like the following should work:
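The answer's snippet was not captured here; a minimal sketch of the approach it describes (the bucket path and the partition column name date are assumptions):

import pyspark.sql.functions as F

# reading the parent folder picks up every partition underneath it
df = spark.read.json("s3://my-bucket/parent-folder/")

# "rename" the partition field with withColumn and drop the original
df = df.withColumn("event_date", F.col("date")).drop("date")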
QUESTION
I have a set of .xml documents that I want to parse.
I have previously tried to parse them using methods that take the file contents and dump them into a single cell; however, I've noticed this doesn't work in practice, since I'm seeing slower and slower run times, often with one task taking tens of hours to run.
My first transform takes the .xml contents and puts them into a single cell, and a second transform takes this string and uses Python's xml library to parse the string into a document. From this document I'm able to extract properties and return a DataFrame.
I'm using a UDF to conduct the process of mapping the string contents to the fields I want.
How can I make this faster / work better with large .xml files?
ANSWER
Answered 2021-Dec-09 at 21:17
For this problem, we're going to combine a couple of different techniques to make this code both testable and highly scalable.
Theory
When parsing raw files, you have a couple of options you can consider:
- ❌ You can write your own parser to read bytes from files and convert them into data Spark can understand.
- This is discouraged whenever avoidable, due to the engineering time and unscalable architecture. It doesn't take advantage of distributed compute, because you must bring the entire raw file to your parsing method before you can use it. This is not an effective use of your resources.
- ⚠ You can use your own parser library not made for Spark, such as the XML Python library mentioned in the question.
- While this is less difficult to accomplish than writing your own parser, it still does not take advantage of distributed computation in Spark. It is easier to get something running, but it will eventually hit a performance limit because it does not use the low-level Spark functionality that is only exposed when writing a Spark library.
- ✅ You can use a Spark-native raw file parser.
- This is the preferred option in all cases, as it takes advantage of low-level Spark functionality and doesn't require you to write your own code. If a low-level Spark parser exists, you should use it.
In our case, we can use the Databricks parser to great effect.
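For illustration, reading XML through the Databricks spark-xml data source looks roughly like this (the rowTag value and the path are assumptions; the parser's .jar must be on the classpath, as set up below):

df = (
    spark.read.format("xml")
    .option("rowTag", "record")      # the XML element that maps to one row (assumed)
    .load("/path/to/xml/files/")
)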
In general, you should also avoid using the .udf method, as it is likely being used instead of functionality already available in the Spark API. UDFs are not as performant as native methods and should be used only when no other option is available.
A good example of UDFs covering up hidden problems is string manipulation of column contents; while you technically can use a UDF to do things like splitting and trimming strings, these operations already exist in the Spark API and will be orders of magnitude faster than your own code.
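For instance, a trim-and-split written with native column functions stays entirely inside the Spark engine (the column names here are assumptions):

import pyspark.sql.functions as F

df = df.withColumn("tokens", F.split(F.trim(F.col("raw_text")), r"\s+"))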
Design
Our design is going to use the following:
- Low-level, Spark-optimized file parsing done via the Databricks XML parser
- Test-driven raw file parsing as explained here
First, we need to add the .jar to our spark_session available inside Transforms. Thanks to recent improvements, this argument, when configured, will allow you to use the .jar both in Preview/Test and at full build time. Previously, this would have required a full build, but not any more.
We need to go to our transforms-python/build.gradle file and add 2 blocks of config:
- Enable the pytest plugin
- Enable the condaJars argument and declare the .jar dependency
My /transforms-python/build.gradle now looks like the following:
QUESTION
- dockerfile:
ANSWER
Answered 2021-Dec-07 at 08:54
It seems that you have problems with peer dependencies; if you set npm to use its legacy dependency-resolution logic to install your packages, the problem will be solved.
Just add this setting to your Dockerfile before running npm install:
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported