adlfs | fsspec-compatible Azure Data Lake and Azure Blob Storage access | Azure library
kandi X-RAY | adlfs Summary
fsspec-compatible Azure Data Lake and Azure Blob Storage access
adlfs Examples and Code Snippets
The filesystem can be instantiated with a variety of credentials, including:
- account_name
- account_key
- sas_token
- connection_string
- Azure ServicePrincipal credentials (which require tenant_id, client_id, client_secret)
- anon
import dask.dataframe as dd

# Service-principal credentials (TENANT_ID, CLIENT_ID, CLIENT_SECRET hold the real values):
storage_options = {'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)
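Each of the credential styles listed above maps onto keyword arguments accepted by adlfs (or passed through fsspec/dask as storage_options). A minimal sketch follows; every account name and secret is a placeholder, and the actual filesystem instantiation is shown commented out since it needs a real account:

```python
# Each credential style from the list above, expressed as the keyword
# arguments adlfs/fsspec accept.  All values below are placeholders.
key_opts = {"account_name": "myaccount", "account_key": "PLACEHOLDER_KEY"}
sas_opts = {"account_name": "myaccount", "sas_token": "PLACEHOLDER_SAS"}
conn_opts = {"connection_string": "PLACEHOLDER_CONNECTION_STRING"}
sp_opts = {  # service principal: all three fields are required together
    "account_name": "myaccount",
    "tenant_id": "PLACEHOLDER_TENANT",
    "client_id": "PLACEHOLDER_CLIENT",
    "client_secret": "PLACEHOLDER_SECRET",
}
anon_opts = {"account_name": "myaccount", "anon": True}  # public containers only

# With adlfs installed and real credentials, any of these dictionaries
# instantiates the filesystem:
# import adlfs
# fs = adlfs.AzureBlobFileSystem(**key_opts)
```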
Community Discussions
Trending Discussions on adlfs
QUESTION
Hi, I am new to Dask and cannot find relevant examples on this topic. I would appreciate any documentation or help.
The example I am working on is pre-processing of an image dataset in the Azure environment with the dask_cloudprovider library; I would like to speed up processing by dividing the work across a cluster of machines.
From what I have read and tested, I can (1) load the data into memory on the client machine and push it to the workers, or
...ANSWER
Answered 2021-Feb-03 at 16:32

If you were to try version (1), you would first see warnings saying that sending large delayed objects is a bad pattern in Dask, making for large graphs and high memory use on the scheduler. You can send the data directly to workers using client.scatter, but it would still be an essentially serial process, bottlenecked on receiving and sending all of your data through the client process's network connection.
The best practice and canonical way to load data in Dask is for the workers to do it themselves. All the built-in loading functions work this way, and this is true even when running locally (because any download or open logic should be easily parallelisable).
This is also true for the outputs of your processing. You haven't said what you plan to do next, but grabbing all of those images back to the client (e.g., with .compute()) would hit the other side of exactly the same bottleneck. You want to reduce and/or write your images directly on the workers, and only handle small transfers from the client.
Note that there are examples out there of image processing with dask (e.g., https://examples.dask.org/applications/image-processing.html ) and of course a lot about arrays. Passing around whole image arrays might be fine for you, but this should be worth a read.
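The worker-side loading pattern the answer recommends can be sketched with dask.delayed. The load_image and preprocess functions and the abfs:// paths below are hypothetical stand-ins so the sketch runs locally; in a real pipeline, load_image would open each blob via adlfs/fsspec on the worker that executes it:

```python
import dask

@dask.delayed
def load_image(path):
    # Stand-in for opening the blob via adlfs/fsspec on the worker;
    # returns a dummy "image" so this sketch runs without Azure access.
    return {"path": path, "pixels": [0] * 4}

@dask.delayed
def preprocess(img):
    # Stand-in preprocessing step, also executed on the worker.
    img["pixels"] = [p + 1 for p in img["pixels"]]
    return img

paths = [f"abfs://container/img_{i}.png" for i in range(3)]
# The client only builds the task graph; loading and preprocessing
# both happen wherever the tasks run.
results = dask.compute(*[preprocess(load_image(p)) for p in paths])
```

The key point is that only small task descriptions (the paths) travel through the client; the image bytes never do.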
QUESTION
I created a parquet file in an Azure blob using dask.dataframe.to_parquet (Moving data from a database to Azure blob storage).
I would now like to read that file. I'm doing:
...ANSWER
Answered 2020-Apr-15 at 13:05

The text of the error suggests that the service was temporarily down. If it persists, you may want to file an issue on the adlfs repository; perhaps the fix could be as simple as more thorough retry logic on their end.
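Until such retry logic exists upstream, transient failures like this can be worked around with a small generic retry wrapper. This is a sketch of my own (the helper name, the retried exception types, and the backoff values are not part of adlfs):

```python
import time

def with_retry(fn, attempts=3, delay=0.0, retry_on=(OSError,)):
    """Call fn(), retrying on transient errors up to `attempts` times."""
    for i in range(attempts):
        try:
            return fn()
        except retry_on:
            if i == attempts - 1:
                raise  # give up after the final attempt
            time.sleep(delay)

# Applied to the read in question (placeholder names; needs adlfs + credentials):
# import dask.dataframe as dd
# df = with_retry(lambda: dd.read_parquet(
#     "abfs://mycontainer/mydata.parquet",
#     storage_options={"account_name": "myaccount", "account_key": "PLACEHOLDER"}))
```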
QUESTION
I'm able to use dask.dataframe.read_sql_table to read the data, e.g. df = dd.read_sql_table(table='TABLE', uri=uri, index_col='field', npartitions=N)
What would be the next (best) steps to saving it as a parquet file in Azure blob storage?
From my small research there are a couple of options:
- Save locally and use https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs?toc=/azure/storage/blobs/toc.json (not great for big data)
- I believe adlfs is for reading from blob storage
- use dask.dataframe.to_parquet and work out how to point to the blob container
- intake project (not sure where to start)
ANSWER
Answered 2020-Apr-16 at 21:28

$ pip install adlfs
QUESTION
I have a several-gigabyte CSV file residing in Azure Data Lake. Using Dask, I can read this file in under a minute as follows:
...ANSWER
Answered 2020-Mar-12 at 15:19

I do not know why fs.get doesn't work, but please try this for the final line:
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install adlfs

adlfs can be installed from PyPI with pip install adlfs.