spark-xml | XML data source for Spark SQL and DataFrames

by databricks | Scala | Version: v0.16.0 | License: Apache-2.0

kandi X-RAY | spark-xml Summary

spark-xml is a Scala library. It has no reported bugs or vulnerabilities, carries a permissive license, and has low support activity. You can download it from GitHub.

XML data source for Spark SQL and DataFrames

Support

spark-xml has a low-activity ecosystem.
It has 446 stars and 227 forks. There are 40 watchers for this library.
It had no major release in the last 12 months.
There are 8 open issues and 380 closed issues. On average, issues are closed in 160 days. There are no pull requests.
It has a neutral sentiment in the developer community.
The latest version of spark-xml is v0.16.0.

Quality

              spark-xml has no bugs reported.

Security

              spark-xml has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

License

              spark-xml is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

              spark-xml releases are available to install and integrate.
              Installation instructions are not available. Examples and code snippets are available.


            Community Discussions

            QUESTION

            Extracting row tag schema from StructType in Scala to parse nested XML
            Asked 2021-May-19 at 09:15

            I'm trying to parse a wide, nested XML file into a DataFrame using the spark-xml library.

            Here is an abbreviated schema definition (XSD):

            ...

            ANSWER

            Answered 2021-May-19 at 05:57

The columns in the XSD are marked as required (not null), but some of those columns are null in the XML file. To make the schema match the XML content, change its fields from nullable=false to nullable=true.

Try the following approach.
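The answer's code is not reproduced here, but a minimal sketch of the idea looks like this (the toNullable helper is hypothetical, not part of spark-xml; xsdSchema stands for the schema derived from the XSD in the question):

import org.apache.spark.sql.types._

// Hypothetical helper: recursively mark every field as nullable so that rows
// with missing optional elements parse as nulls instead of failing.
def toNullable(dt: DataType): DataType = dt match {
  case st: StructType =>
    StructType(st.fields.map(f =>
      f.copy(dataType = toNullable(f.dataType), nullable = true)))
  case ArrayType(elem, _) => ArrayType(toNullable(elem), containsNull = true)
  case other => other
}

// xsdSchema is the StructType built from the XSD in the question
val relaxedSchema = toNullable(xsdSchema).asInstanceOf[StructType]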

            Source https://stackoverflow.com/questions/67596525

            QUESTION

            (spark-xml) Receiving only null when parsing xml column using from_xml function
            Asked 2021-May-18 at 13:34

            I'm trying to parse a very simple XML string column using spark-xml, but I only manage to receive null values, even when the XML is correctly populated.

            The XSD that I'm using to parse the xml is:

            ...

            ANSWER

            Answered 2021-May-18 at 13:34

In the end, what opened my eyes was the part of the spark-xml documentation that says:

            Path to an XSD file that is used to validate the XML for each row individually

This means that schema matching is done row by row rather than against the entire XML document, so in this case the schema needs to describe a single row's XML, something like the following:
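The Scala API for this is shown in the spark-xml README; a minimal sketch, assuming the XML strings live in a DataFrame column named payload:

import com.databricks.spark.xml.functions.from_xml
import com.databricks.spark.xml.schema_of_xml
import spark.implicits._

// Infer the schema from the string column itself, so it describes a single
// row's XML document rather than the whole file, then parse with from_xml.
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))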

            Source https://stackoverflow.com/questions/67531343

            QUESTION

How to transform data from multiple nested XML files with attributes into a Spark DataFrame
            Asked 2021-Apr-13 at 20:43

How can I transform the values below from multiple XML files into a Spark DataFrame:

            • attribute Id0 from Level_0
            • Date/Value from Level_4

            Required output:

            ...

            ANSWER

            Answered 2021-Jan-01 at 15:51

            You can use Level_0 as the rowTag, and explode the relevant arrays/structs:
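A rough sketch of that approach (the nested path and the _Id0 attribute name are assumptions based on the question; spark-xml prefixes XML attributes with an underscore by default):

import org.apache.spark.sql.functions.{col, explode}

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Level_0")
  .load("level0-files*.xml")

// Explode down to the Level_4 entries, keeping the Level_0 Id0 attribute.
// If several intermediate levels are arrays, each one needs its own explode.
val result = df
  .select(col("_Id0"), explode(col("Level_1.Level_2.Level_3.Level_4")).as("l4"))
  .select(col("_Id0"), col("l4.Date"), col("l4.Value"))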

            Source https://stackoverflow.com/questions/65526383

            QUESTION

            How to write bytes string to hdfs hadoop in pyspark for spark-xml transformation?
            Asked 2021-Jan-22 at 20:25

In Python, a bytes string can simply be saved to a single XML file:

            ...

            ANSWER

            Answered 2021-Jan-22 at 20:25

Don't be misled by the Databricks spark-xml docs, which lead you to use an uncompressed XML file as input. That is very inefficient; it is much faster to load the XMLs directly into a Spark DataFrame. The Databricks spark-xml PySpark API doesn't include this, but there is a workaround:
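The workaround itself is not reproduced here. For the narrower question in the title, writing bytes to HDFS from the driver can go through the Hadoop FileSystem API; a Scala sketch (the target path and xmlBytes are placeholders):

import org.apache.hadoop.fs.{FileSystem, Path}

// Write the raw bytes straight to HDFS so spark-xml can read them from the cluster.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path("/tmp/record.xml"))   // placeholder path
try out.write(xmlBytes) finally out.close()        // xmlBytes: Array[Byte]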

            Source https://stackoverflow.com/questions/65728216

            QUESTION

            Reading schema of streaming Dataframe in Spark Structured Streaming
            Asked 2021-Jan-22 at 10:43

I'm new to Apache Spark Structured Streaming. I'm trying to read some events from Event Hub (in XML format) and create a new Spark DataFrame from the nested XML.

I'm using the code example described in https://github.com/databricks/spark-xml, and it runs perfectly in batch mode, but not in Spark Structured Streaming.

            Code chunk of spark-xml Github library

            ...

            ANSWER

            Answered 2021-Jan-21 at 19:54

            There is nothing wrong with your code if it works in batch mode.

It is important not only to convert the source into a stream (using readStream and load) but also to convert the sink into a stream.

The error message you are getting is just reminding you to look at the sink part as well: your DataFrame final_df is actually a streaming DataFrame, which has to be started with start.

The Structured Streaming Guide gives a good overview of all available output sinks; the easiest is to print the result to the console.

            To summarize, you need to add the following to your program:
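For example, a console sink (final_df being the streaming DataFrame from the question):

// Start the streaming query with a console sink; without start() nothing runs.
val query = final_df.writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()   // block until the stream is stopped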

            Source https://stackoverflow.com/questions/65832782

            QUESTION

            Read files And Modify filename from the azure storage containers in Azure Databricks
            Asked 2020-Oct-14 at 13:49

I am ingesting a large XML file and generating an individual JSON file per XML element, using spark-xml in Azure Databricks. The code to create the JSON files is:

            ...

            ANSWER

            Answered 2020-Oct-13 at 13:06

Unfortunately, it's not possible to control the file name using the standard Spark library, but you can use the Hadoop API to manage the file system: save the output to a temporary directory and then move the file to the requested path.

Spark uses the Hadoop file format, which requires data to be partitioned; that's why you get part-0000 files.

To change the file name, add something like this to your code.

In Scala it will look like:
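The original snippet is not reproduced here, but a sketch of the temporary-directory-then-rename pattern looks like this (all paths are placeholders):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val tmpDir = new Path("/tmp/json-staging")

// Spark writes part-0000x files into the temporary directory...
df.coalesce(1).write.mode("overwrite").json(tmpDir.toString)

// ...then the single part file is renamed to the name you actually want.
val partFile = fs.globStatus(new Path(tmpDir, "part-*"))(0).getPath
fs.rename(partFile, new Path("/data/output/record.json"))
fs.delete(tmpDir, true)   // clean up the staging directory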

            Source https://stackoverflow.com/questions/64329977

            QUESTION

            File read from ADLS Gen2 Error - Configuration property xxx.dfs.core.windows.net not found
            Asked 2020-Aug-16 at 12:37

I am using ADLS Gen2, and from a Databricks notebook I am trying to process a file using an 'abfss' path. I am able to read Parquet files just fine, but when I try to load an XML file I get the error that the configuration is not found: Configuration property xxx.dfs.core.windows.net not found.

I haven't tried mounting the storage, but I am trying to understand whether this is a known limitation with XML files, since I can read the Parquet files just fine.

Here is my XML library coordinate: com.databricks:spark-xml_2.11:0.9.0

I tried a couple of things suggested in other articles but am still getting the same error.

            • Added a new scope to see if it's a scope issue in the Databricks Workspace.
            • Tried adding configuration spark.conf.set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")
            ...

            ANSWER

            Answered 2020-Aug-16 at 12:37

I summarize the solution below.

The package com.databricks:spark-xml appears to use the RDD API to read the XML file. When we use the RDD API to access Azure Data Lake Storage Gen2, we cannot see Hadoop configuration options set with spark.conf.set(...). So we should update the code to spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, please refer to the linked documentation.

Besides this, you can also mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks.
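For reference, the Scala equivalent of that setting (account name and key remain placeholders):

// Set the key on the Hadoop configuration directly so the RDD-based reader sees it.
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")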

            Source https://stackoverflow.com/questions/63400161

            QUESTION

            How to create an XML string from dataframe using scala
            Asked 2020-Aug-04 at 15:40

I have a scenario where I am reading from my Hive table and creating a Spark DataFrame. I want to generate an XML string from the DataFrame's output and store it in a new DataFrame (as an XML string), rather than writing it to a file in HDFS to create the XML. Please tell me if this can be done using Databricks spark-xml.

            ...

            ANSWER

            Answered 2020-Aug-04 at 15:40

You cannot do this with the spark-xml lib, but you can reuse its write-out code to create your own solution for an XmlRdd: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/util/XmlFile.scala

Line 80 of that file is exactly the write-out logic you need.
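If reusing that code is too heavy, a plainly simpler alternative (not spark-xml itself) is to render each row with a UDF; the column names and XML shape below are invented for illustration:

import org.apache.spark.sql.functions.{col, udf}

// Build the XML string per row instead of asking spark-xml to write a file.
// Note: plain string interpolation does not escape XML special characters.
val toXml = udf { (name: String, age: Int) =>
  s"<person><name>$name</name><age>$age</age></person>"
}
val withXml = df.withColumn("xml", toXml(col("name"), col("age")))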

            Source https://stackoverflow.com/questions/63245745

            QUESTION

            How to resolve the error NameError: name 'SparkConf' is not defined in pycharm
            Asked 2020-Jun-20 at 09:32
            from pyspark import SparkContext
            from pyspark.sql import SparkSession
            from pyspark.sql.types import *
            import os
            os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'
            conf = SparkConf().setAppName('Stackoverflow')
            sc = SparkContext(master="local", appName="test")
            sc.setLogLevel("Error")
            spark = SparkSession.builder.getOrCreate()
            df=spark.read.format("com.databricks.spark.xml").option("rowTag","Transaction").load("C:/Users/Rajaraman/Desktop/task/data/transactions.xml")
            
            ...

            ANSWER

            Answered 2020-Jun-20 at 09:32

You need to import the libraries referenced in the code: SparkConf is used but never imported.

Add this line to import the referenced package: from pyspark import SparkConf

            Source https://stackoverflow.com/questions/62483215

            QUESTION

            Can't produce a flattened record from nested xml with 0-n child elements using Spark-xml
            Asked 2020-Jun-10 at 19:13

I am working with an XML structure something like the one below:

            ...

            ANSWER

            Answered 2020-Jun-10 at 19:13

            You can explode the applicants in a first step and then select the required columns from each applicant in a second step:
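Sketched with assumed column names (explode_outer rather than explode keeps records that have zero applicants, matching the 0-n requirement):

import org.apache.spark.sql.functions.{col, explode_outer}

// Step 1: one row per applicant; records without applicants survive as nulls.
val exploded = df.withColumn("applicant", explode_outer(col("applicants.applicant")))

// Step 2: pull the required fields out of each applicant struct.
val flat = exploded.select(
  col("record_id"),        // assumed parent-level column
  col("applicant.name"),
  col("applicant.age"))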

            Source https://stackoverflow.com/questions/62296618

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install spark-xml

            You can download it from GitHub.
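No installation instructions are listed, but given the Maven coordinates quoted elsewhere on this page (com.databricks:spark-xml_2.11:0.9.0) and the latest version v0.16.0, an sbt dependency would look roughly like this; the _2.12 Scala suffix for v0.16.0 is an assumption:

libraryDependencies += "com.databricks" % "spark-xml_2.12" % "0.16.0"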

            Support

            Per above, the XML for individual rows can be validated against an XSD using rowValidationXSDPath. The utility com.databricks.spark.xml.util.XSDToSchema can be used to extract a Spark DataFrame schema from some XSD files. It supports only simple, complex and sequence types, only basic XSD functionality, and is experimental.
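Putting those two together, per the README (file paths and rowTag are placeholders):

import com.databricks.spark.xml.util.XSDToSchema
import java.nio.file.Paths

// Derive a DataFrame schema from the XSD, and validate each row against the
// same XSD while reading; rows that fail validation are treated as malformed.
val schema = XSDToSchema.read(Paths.get("/path/to/your.xsd"))
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .option("rowValidationXSDPath", "/path/to/your.xsd")
  .schema(schema)
  .load("books.xml")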