HDFS | PowerShell module provides a wrapper for the Hadoop File System (WebHDFS) APIs
kandi X-RAY | HDFS Summary
A PowerShell native interface for the Hadoop WebHDFS APIs.
Community Discussions
Trending Discussions on HDFS
QUESTION
Facing the error below while starting spark-shell with YARN as the master. The shell works with the Spark local master.
...ANSWER
Answered 2022-Mar-23 at 09:29
Adding these properties in spark-env.sh fixed the issue for me.
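The specific properties from the answer are not included in this excerpt. As a point of reference, a common spark-env.sh addition needed for --master yarn is pointing Spark at the Hadoop and YARN configuration; a minimal sketch, with assumed paths:

# spark-env.sh - sketch only; the actual properties from the answer are not shown
# in this excerpt, and the paths below are assumptions for a typical installation.
export HADOOP_CONF_DIR=/etc/hadoop/conf   # directory containing core-site.xml, hdfs-site.xml
export YARN_CONF_DIR=/etc/hadoop/conf     # directory containing yarn-site.xml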
QUESTION
ANSWER
Answered 2022-Feb-14 at 15:06
The "used" in the monitoring graph is actually "DFS used"; it does not show "non-DFS used". If you open the HDFS UI in the component gateway web interfaces you should be able to see something like:
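The screenshot referenced above is not included in this excerpt. As an aside, the same DFS/non-DFS breakdown can also be read from the command line; the fields named in the comment below are the standard fields of the report:

hdfs dfsadmin -report
# Prints, per DataNode and in aggregate, fields such as Configured Capacity,
# DFS Used, Non DFS Used and DFS Remaining. "Non DFS Used" is disk space on the
# DataNodes consumed by data outside of HDFS (logs, local scratch files, other services).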
QUESTION
I'm working on exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (but the same issue occurs with SFTP, S3, HDFS, and other file based exports).
The datasets I'm exporting are relatively small, under 512 MB in size, which means they don't really need to be split across multiple parquet files, and putting all the data in one file is enough. I've done this by ending the previous transform with a .coalesce(1) to get all of the data in a single file.
The issues are:
- By default the file name is part-0000-.snappy.parquet, with a different rid on every build. This means that, whenever a new file is uploaded, it appears in the same folder as an additional file, and the only way to tell which is the newest version is by last-modified date.
- Every version of the data is stored in my external system; this takes up unnecessary storage unless I frequently go in and delete old files.
All of this is unnecessary complexity being added to my downstream system; I just want to be able to pull the latest version of the data in a single step.
...ANSWER
Answered 2022-Jan-13 at 15:27
This is possible by renaming the single parquet file in the dataset so that it always has the same file name; that way the export task will overwrite the previous file in the external system.
This can be done using raw file system access. The write_single_named_parquet_file function below validates its inputs, creates a file with a given name in the output dataset, then copies the file in the input dataset to it. The result is a schemaless output dataset that contains a single named parquet file.
Notes
- The build will fail if the input contains more than one parquet file; as pointed out in the question, calling .coalesce(1) (or .repartition(1)) is necessary in the upstream transform.
- If you require transaction history in your external store, or your dataset is much larger than 512 MB, this method is not appropriate, as only the latest version is kept and you likely want multiple parquet files for use in your downstream system. The createTransactionFolders (put each new export in a different folder) and flagFile (create a flag file once all files have been written) options can be useful in this case.
- The transform does not require any Spark executors, so it is possible to use @configure() to give it a driver-only profile. Giving the driver additional memory should fix out-of-memory errors when working with larger datasets.
- shutil.copyfileobj is used because the 'files' that are opened are actually just file objects.
Full code snippet
example_transform.py
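The full example_transform.py is not reproduced in this excerpt. The sketch below shows what such a transform could look like, assuming the Foundry transforms API (@transform, Input, Output, @configure and the filesystem() access described above); the dataset paths, output file name, and profile name are hypothetical, not the author's code.

import shutil

from transforms.api import transform, configure, Input, Output


@configure(profile=["DRIVER_MEMORY_MEDIUM"])  # hypothetical driver-only profile name
@transform(
    output=Output("/path/to/output_dataset"),  # placeholder dataset paths
    source=Input("/path/to/input_dataset"),
)
def write_single_named_parquet_file(ctx, source, output):
    # Validate the input: exactly one parquet file is expected.
    files = [f.path for f in source.filesystem().ls(glob="*.parquet")]
    if len(files) != 1:
        raise ValueError(
            "Expected exactly one parquet file in the input; "
            "call .coalesce(1) or .repartition(1) in the upstream transform."
        )

    # Copy the single input file into the output dataset under a fixed name,
    # so every export overwrites the previous file in the external system.
    with source.filesystem().open(files[0], "rb") as src, \
            output.filesystem().open("export.parquet", "wb") as dst:
        shutil.copyfileobj(src, dst)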
QUESTION
I have a pretrained model for object detection (Google Colab + TensorFlow) inside Google Colab, and I run it two or three times per week for new images I have. Everything was fine for the last year until this week. Now when I try to run the model I get this message:
...ANSWER
Answered 2022-Feb-07 at 09:19
The same thing happened to me last Friday. I think it has something to do with the CUDA installation in Google Colab, but I don't know the exact reason.
QUESTION
I am running a simple query like the one shown below (similar form):
...ANSWER
Answered 2022-Jan-11 at 15:56
It may happen because some partitions are bigger than others.
Try to trigger a reducer task by adding distribute by.
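A minimal sketch of the idea, using hypothetical table and column names since the original query is not shown in this excerpt; DISTRIBUTE BY forces a shuffle to reducers, which spreads the work of a skewed map-only write across reducer tasks:

-- Hypothetical table/column names; DISTRIBUTE BY adds a reduce stage
-- and hashes rows across reducers by the chosen column.
INSERT OVERWRITE TABLE target_table
SELECT id, payload
FROM source_table
DISTRIBUTE BY id;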
QUESTION
Context
I have a Parquet table stored in HDFS with two partitions, where each partition yields only one file.
ANSWER
Answered 2022-Jan-08 at 12:41
One of the issues is that "partition" is an overloaded term in the Spark world, and you're looking at two different kinds of partitions:
- your dataset is organized as a Hive-partitioned table, where each partition is a separate directory named with <column>=<value> that may contain many data files inside. This is only useful for dynamically pruning the set of input files to read and has no effect on the actual RDD processing
- when Spark loads your data and creates a DataFrame/RDD, this RDD is organized in splits that can be processed in parallel; these are also called partitions.
df.rdd.getNumPartitions() returns the number of splits in your data, and that is completely unrelated to your input table partitioning. It's determined by a number of config options but is mostly driven by 3 factors:
- computing parallelism: spark.default.parallelism in particular is the reason why you have 2 partitions in your RDD even though you don't have enough data to fill the first
- input size: Spark will try not to create partitions bigger than spark.sql.files.maxPartitionBytes and thus may split a single multi-gigabyte parquet file into many partitions
- shuffling: any operation that needs to reorganize data for correct behavior (for example join or groupBy) will repartition your RDD with a new strategy, and you will end up with many more partitions (governed by spark.sql.shuffle.partitions and AQE settings)
On the whole, you want to preserve this behavior since it's necessary for Spark to process your data in parallel and achieve good performance.
When you use df.coalesce(1) you will coalesce your data into a single RDD partition, but you will do your processing on a single core, in which case simply doing your work in Pandas and/or PyArrow would be much faster.
If what you want is to preserve the property that your output has a single parquet file per Hive-partition attribute, you can use the following construct:
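The construct itself is not included in this excerpt. A common way to achieve this, sketched below with a hypothetical partition column named part, is to repartition by the Hive-partition column before writing, so each distinct value lands in a single task and therefore a single file:

# Sketch only: "part" is a hypothetical partition column and the output path is a placeholder.
(df
 .repartition("part")        # one task per distinct value of the partition column
 .write
 .partitionBy("part")        # keep the Hive-style directory layout on disk
 .mode("overwrite")
 .parquet("hdfs:///path/to/output"))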
QUESTION
I'd like to connect to Delta using JDBC and would like to run the Spark Thrift Server (STS) in local mode to kick the tyres.
I start STS using the following command:
...ANSWER
Answered 2022-Jan-08 at 06:42
Once you copy the io.delta:delta-core_2.12:1.0.0 JAR file to $SPARK_HOME/lib and restart, this error goes away.
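As an alternative to copying the JAR by hand, the dependency can usually be pulled at start-up with --packages; a sketch, where the extension and catalog settings are the standard ones from Delta Lake's documentation and the version is taken from the answer:

$SPARK_HOME/sbin/start-thriftserver.sh \
  --packages io.delta:delta-core_2.12:1.0.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog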
QUESTION
Using Python on an Azure HDInsight cluster, we are saving Spark dataframes as Parquet files to an Azure Data Lake Storage Gen2, using the following code:
...ANSWER
Answered 2021-Dec-17 at 16:58
ABFS is a "real" file system, so the S3A zero-rename committers are not needed. Indeed, they won't work. And the client is entirely open source - look into the hadoop-azure module.
The ADLS Gen2 store does have scale problems, but unless you are trying to commit 10,000 files, or clean up massively deep directory trees, you won't hit these. If you do get error messages about failures to rename individual files and you are doing jobs at that scale, (a) talk to Microsoft about increasing your allocated capacity and (b) pick this up: https://github.com/apache/hadoop/pull/2971
This isn't it. I would guess that you actually have multiple jobs writing to the same output path, and one is cleaning up while the other is setting up. In particular, they both seem to have a job ID of "0". Because the same job ID is being used, not only are task setup and task cleanup getting mixed up, it is possible that when job 1 commits it includes the output from job 2 from all task attempts which have successfully been committed.
I believe this has been a known problem with Spark standalone deployments, though I can't find a relevant JIRA. SPARK-24552 is close, but should have been fixed in your version. SPARK-33402 (Jobs launched in the same second have duplicate MapReduce JobIDs) is about job IDs coming from the system current time, not 0. But you can try upgrading your Spark version to see if it goes away.
My suggestions
- make sure your jobs are not writing to the same table simultaneously. Things will get in a mess.
- grab the most recent version of Spark you are happy with
QUESTION
I have to frequently write Dataframes as Hive tables.
...ANSWER
Answered 2021-Nov-14 at 09:04
It's ... complicated. Very complicated. I wanted to write a short answer but I'd risk being misleading on several points. Instead I'll try to give a very short summary of the very long answer.
- Hive uses staging directories for a good reason: atomicity. You don't want users reading a table while it is being re-written, so instead you write in a staging directory and rename the directory when it's done, like this.
- Problem is: cloud storages are "object storages", not "distributed file systems" like HDFS, and some operations like folder renaming can be much slower because of that.
- Each cloud has its own storage implementation, with its own specificities and downsides, and with time they even propose new variants to overcome some of these downsides (e.g. Azure has 3 different storage variants: Blob Storage, Datalake Storage Gen 1 and Gen 2).
- Therefore, the best solution on one cloud isn't necessarily the best on another cloud.
- The FileSystem API implementation for various cloud storage is part of the Hadoop distribution, which Spark uses. So the solutions available to you will also depend on which version of Hadoop your Spark installation is using.
- Azure/GCS only: You could try setting this option (https://spark.apache.org/docs/3.1.1/cloud-integration.html#configuring): spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2. It is faster than v1, but it is also not recommended, as it is not atomic and therefore less safe in case of partial failures. v2 is currently the default in Hadoop, but Spark 3 set it back to v1 by default, and there is some discussion in the Hadoop community to deprecate it and make v1 the default again. (A sketch of how to set this option follows this list.)
- There is also some ongoing development to write better output committers for Azure and GCS, based on a similar output committer done for S3.
- Alternatively, you can try switching to cloud-first formats like Apache Iceberg, Apache Hudi or Delta Lake.
- I am not very familiar with these yet, but a quick look at Delta Lake's documentation convinced me that they had to deal with the same kind of issues (cloud storages not being real file systems), and depending on which cloud you're on, it may require extra configuration, especially on GCP where the feature is flagged as experimental.
- EDIT: Apache Iceberg does not have this issue as it uses metadata files to point to the real data files location. Thanks to this, changes to a table are committed via an atomic change on a single metadata file.
- I am not very familiar with Apache Hudi, and I couldn't find any mention of it dealing with this kind of issue. I'd have to dig further into its design architecture to know for sure.
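A minimal sketch of setting the committer option mentioned above from PySpark; the option name comes from the answer, everything else (app name, session setup) is illustrative:

from pyspark.sql import SparkSession

# Illustrative only: opt into the faster but non-atomic v2 file output committer.
spark = (SparkSession.builder
         .appName("committer-v2-example")
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())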
Now, for the long answer, maybe I should write a blog article... I'll post it here whenever it's done.
QUESTION
I have been working on the PageRank algorithm with the help of MapReduce jobs.
I need to create Mapper and Reducer classes, with the help of which I will be creating a jar file.
I am using the jar file to work with Hadoop clusters.
Currently my Java file is PageRank.java
...ANSWER
Answered 2021-Oct-27 at 12:33
Here, you have a "permission denied" error message;
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install HDFS
You may need to force the use of TLS 1.2 in a secured environment for all Invoke-WebRequest calls. Be aware that forcing this may affect other cmdlets or scripts in the same PowerShell session. After forcing TLS 1.2, you can then establish a session with a secured CDH platform using TLS and Kerberos.
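A minimal sketch of the TLS part; the cmdlets for establishing the Kerberos session itself are not shown in this excerpt:

# Force TLS 1.2 for all subsequent Invoke-WebRequest calls in this session.
# Note: this affects every web request made by the current PowerShell session.
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12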