spark-xml | XML data source for Spark SQL and DataFrames
kandi X-RAY | spark-xml Summary
XML data source for Spark SQL and DataFrames
Community Discussions
Trending Discussions on spark-xml
QUESTION
I'm trying to parse a wide, nested XML file into a DataFrame using the spark-xml library.
Here is an abbreviated schema definition (XSD):
...
ANSWER
Answered 2021-May-19 at 05:57
The columns in the XSD are required (not null), but some of the columns in the XML file are null. To match the XSD and the XML file content, change the schema from nullable=false to nullable=true.
Try the following code.
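A minimal sketch of the idea (the field names, row tag, and path below are made up for illustration; the real schema comes from the asker's XSD):

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Declare every field as nullable=True so rows with missing elements parse
# as null instead of violating a non-null constraint derived from the XSD.
schema = StructType([
    StructField("id", LongType(), nullable=True),
    StructField("name", StringType(), nullable=True),
    StructField("address", StructType([
        StructField("city", StringType(), nullable=True),
        StructField("zip", StringType(), nullable=True),
    ]), nullable=True),
])

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")          # hypothetical row tag
      .schema(schema)
      .load("/path/to/file.xml"))          # hypothetical path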
QUESTION
I'm trying to parse a very simple XML string column using spark-xml, but I only manage to receive null
values, even when the XML is correctly populated.
The XSD that I'm using to parse the xml is:
...
ANSWER
Answered 2021-May-18 at 13:34
In the end, what opened my eyes was reading the part of the spark-xml documentation that mentions:
"Path to an XSD file that is used to validate the XML for each row individually"
This means that schema matching is done per row, not against the entire XML document, so in my case the schema needs to be something like the following:
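A sketch of how this is typically wired up with spark-xml's rowValidationXSDPath option (the row tag, XSD file name, and path are placeholders); the key point is that the XSD must describe a single row element, not the whole document:

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "item")                      # the element that makes up one row
      .option("rowValidationXSDPath", "item.xsd")    # XSD describing a single <item>
      .load("/path/to/data.xml"))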
QUESTION
How can the values below, coming from multiple XML files, be transformed into a Spark data frame:
- the attribute Id0 from Level_0
- Date and /Value from Level_4
Required output:
...
ANSWER
Answered 2021-Jan-01 at 15:51
You can use Level_0 as the rowTag and explode the relevant arrays/structs:
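A rough PySpark sketch of that approach; the nested field names below are guesses based on the question and assume the intermediate levels are structs with Level_4 as an array:

from pyspark.sql.functions import col, explode

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "Level_0")
      .load("/path/to/xml/files/*.xml"))            # hypothetical path

result = (df
          .withColumn("Level_4", explode(col("Level_1.Level_2.Level_3.Level_4")))
          .select(col("_Id0").alias("Id0"),          # XML attributes get the "_" prefix by default
                  col("Level_4.Date").alias("Date"),
                  col("Level_4.Value").alias("Value")))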
QUESTION
In Python, a bytes string can simply be saved to a single XML file:
...
ANSWER
Answered 2021-Jan-22 at 20:25
Don't be misled by the Databricks spark-xml docs, which lead you to use an uncompressed XML file as the input. That is very inefficient; it is much faster to load the XMLs directly into a Spark dataframe. The Databricks xml-pyspark version doesn't include this, but there is a workaround:
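One commonly shared workaround (it also appears in the library's PySpark notes for newer releases) is to call the Scala from_xml function through py4j; whether it works depends on your spark-xml version, and `spark` here is assumed to be an existing SparkSession:

from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.types import _parse_datatype_json_string

def ext_from_xml(xml_column, schema, options={}):
    # Parse an XML string column by delegating to spark-xml's Scala from_xml.
    java_column = _to_java_column(xml_column.cast('string'))
    java_schema = spark._jsparkSession.parseDataType(schema.json())
    scala_map = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    jc = spark._jvm.com.databricks.spark.xml.functions.from_xml(
        java_column, java_schema, scala_map)
    return Column(jc)

def ext_schema_of_xml_df(df, options={}):
    # Infer a Spark schema from a DataFrame that has a single XML string column.
    assert len(df.columns) == 1
    scala_options = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    java_xml_module = getattr(getattr(
        spark._jvm.com.databricks.spark.xml, "package$"), "MODULE$")
    java_schema = java_xml_module.schema_of_xml_df(df._jdf, scala_options)
    return _parse_datatype_json_string(java_schema.json())

# Usage sketch: 'payload' is a DataFrame with one string column "xml" of XML documents.
# schema = ext_schema_of_xml_df(payload)
# parsed = payload.withColumn("parsed", ext_from_xml(payload["xml"], schema))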
QUESTION
I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hub (in XML format) and create a new Spark DF from the nested XML.
I'm using the code example described at https://github.com/databricks/spark-xml, and it runs perfectly in batch mode but not in Structured Spark Streaming.
Code chunk from the spark-xml GitHub library:
...
ANSWER
Answered 2021-Jan-21 at 19:54
There is nothing wrong with your code if it works in batch mode. It is important not only to convert the source into a stream (by using readStream and load) but also to convert the sink part into a stream. The error message you are getting is just reminding you to also look at the sink part: your DataFrame final_df is actually a streaming DataFrame, which has to be started through start.
The Structured Streaming Guide gives a good overview of all available output sinks; the easiest is to print the result to the console.
To summarize, you need to add the following to your program:
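In PySpark, for example, the missing sink part could be as simple as a console sink (the output mode and format here are just the simplest choices):

query = (final_df.writeStream
         .format("console")       # print each micro-batch to the console
         .outputMode("append")
         .start())                # the start() call the error message asks for

query.awaitTermination()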
QUESTION
I am ingesting a large XML file and generating individual JSON files according to the XML elements, using spark-xml in Azure Databricks. The code to create the JSON files is:
...
ANSWER
Answered 2020-Oct-13 at 13:06
Unfortunately, it's not possible to control the file name using the standard Spark library, but you can use the Hadoop API for managing the file system: save the output in a temporary directory and then move the file to the requested path.
Spark uses the Hadoop file format, which requires data to be partitioned - that's why you have part-0000 files.
To change the file name, try adding something like this to your code (the original answer showed it in Scala):
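The Scala snippet itself isn't reproduced here; a rough PySpark sketch of the same temporary-directory-then-rename approach (all paths are hypothetical) would be:

# Access the Hadoop FileSystem API through the JVM gateway.
hadoop_conf = spark._jsc.hadoopConfiguration()
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)

tmp_dir = "/tmp/json_out"                 # temporary output directory (assumption)
final_path = "/output/result.json"        # desired file name (assumption)

df.coalesce(1).write.mode("overwrite").json(tmp_dir)

# Find the single part-0000* file Spark produced and move it to the final name.
part_file = [s.getPath() for s in fs.listStatus(Path(tmp_dir))
             if s.getPath().getName().startswith("part-")][0]
fs.rename(part_file, Path(final_path))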
QUESTION
I am using ADLS Gen2, and from a Databricks notebook I am trying to process a file using an 'abfss' path. I can read parquet files just fine, but when I try to load the XML files I get the error "Configuration property xxx.dfs.core.windows.net not found".
I haven't tried mounting the file, but I'm trying to understand whether this is a known limitation with XML files, since I can read the parquet files just fine.
Here is my XML library config: com.databricks:spark-xml_2.11:0.9.0
I tried a couple of things per the other articles but still get the same error:
- Added a new scope to see if it's a scope issue in the Databricks Workspace.
- Tried adding the configuration spark.conf.set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")
ANSWER
Answered 2020-Aug-16 at 12:37
I summarize the solution below.
The package com.databricks:spark-xml seems to use the RDD API to read the XML file. When we use the RDD API to access Azure Data Lake Storage Gen2, we cannot access Hadoop configuration options set using spark.conf.set(...). So we should update the code to spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, please refer to here.
Besides, you can also mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks.
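Putting the two pieces together, a sketch of the fix (the account name, key, container, and row tag are placeholders) looks like:

# Set the storage key on the Hadoop configuration so RDD-based readers can see it.
spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")

df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "record")   # hypothetical row tag
      .load("abfss://container@xxxxx.dfs.core.windows.net/path/to/file.xml"))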
QUESTION
I have a scenario where I am reading from my Hive table and creating a Spark dataframe. I want to generate an XML string from the output of the dataframe and save it in a new dataframe (as an XML string), rather than writing it to a file in HDFS to create the XML. Please tell me if this can be done using Databricks spark-xml.
...
ANSWER
Answered 2020-Aug-04 at 15:40
You cannot do this with the spark-xml lib directly, but you can reuse its write-out code to create your own solution for an XmlRdd: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala
Line 80 of that file does exactly this.
QUESTION
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'
conf = SparkConf().setAppName('Stackoverflow')
sc = SparkContext(master="local", appName="test")
sc.setLogLevel("Error")
spark = SparkSession.builder.getOrCreate()
df=spark.read.format("com.databricks.spark.xml").option("rowTag","Transaction").load("C:/Users/Rajaraman/Desktop/task/data/transactions.xml")
...
ANSWER
Answered 2020-Jun-20 at 09:32
You need to import the libraries referenced in the code. Add this line to import the referenced package:
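For the code above, the missing reference is SparkConf, so the line to add is:

from pyspark import SparkConf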
QUESTION
I am working with an XML structure something like the one below:
...
ANSWER
Answered 2020-Jun-10 at 19:13
You can explode the applicants in a first step and then select the required columns from each applicant in a second step:
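A sketch of those two steps in PySpark (the column names are invented; replace them with the fields from the actual XML, and note that spark-xml prefixes XML attributes with "_" by default):

from pyspark.sql.functions import col, explode

# Step 1: one row per applicant.
exploded = df.withColumn("applicant", explode(col("applicants.applicant")))

# Step 2: pull out the required columns from each applicant struct.
result = exploded.select(
    col("applicant._id").alias("applicant_id"),
    col("applicant.name").alias("name"),
    col("applicant.address.city").alias("city"),
)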
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported