dask | Parallel computing with task scheduling | Machine Learning library

by dask Python Version: 2024.4.2 License: BSD-3-Clause

X-Ray Key Features Code Snippets(10)Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | dask Summary

dask is a Python library typically used in Artificial Intelligence, Machine Learning, Numpy, Pandas applications. dask has no bugs, it has build file available, it has a Permissive License and it has high support. However dask has 1 vulnerabilities. You can install using 'pip install dask' or download it from GitHub, PyPI.

Parallel computing with task scheduling

Support

Quality

Security

License

Reuse

Support

dask has a highly active ecosystem.

It has 11106 star(s) with 1628 fork(s). There are 214 watchers for this library.

There were 10 major release(s) in the last 6 months.

There are 758 open issues and 3997 have been closed. On average issues are closed in 48 days. There are 164 open pull requests and 0 closed requests.

It has a negative sentiment in the developer community.

The latest version of dask is 2024.4.2

Quality

dask has 0 bugs and 0 code smells.

Security

dask has 1 vulnerability issues reported (1 critical, 0 high, 0 medium, 0 low).

dask code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

dask is licensed under the BSD-3-Clause License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

dask releases are not available. You will need to build from source code and install.

Deployable package is available in PyPI.

Build file is available. You can build the component from source.

It has 89482 lines of code, 5994 functions and 245 files.

It has high code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed dask and discovered the below as its top functions. This is intended to give you an instant insight into dask implemented functionality, and help decide if they suit your requirements.

Calculates the ordering of a dsk .
Converts an array into a block item index .
Tests a tensor .
Fires a low - level fringe .
Map an array of blocks to a map .
Apply a Gufunc to a function .
Generate histogramdd for a given sample .
Read Parquet from path .
Performs a blockwise operation .
Convert an array to a dask array .

Get all kandi verified functions for this library.

dask Key Features

No Key Features are available at this moment for dask.

dask Examples and Code Snippets

visualize the low level Dask graph using cytoscape

Python

Lines of Code : 3

License : Permissive (BSD-3-Clause)

Copy

release-procedure.md

Python

Lines of Code : 0

License : Permissive (BSD-3-Clause)

Copy

git log $(git describe --tags --abbrev=0)..HEAD --pretty=format:"- %s \`%an\`_"  > change.md && sed -i -e 's/(#/(:pr:`/g' change.md && sed -i -e 's/) `/`) `/g' change.md

git commit -a -m "bump version to YYYY.M.X"

git tag -a YYYY

Creating a Dask Object

Python

Lines of Code : 0

License : Permissive (BSD-3-Clause)

Copy

See :doc:`dataframe`.
>>> index = pd.date_range("2021-09-01", periods=2400, freq="1H")
... df = pd.DataFrame({"a": np.arange(2400), "b": list("abcaddbe" * 300)}, index=index)
... ddf = dd.from_pandas(df, npartitions=10)
... ddf
Dask DataFram

GroupBy /Map_partitions in Dask

Python

Lines of Code : 26

License : Strong Copyleft (CC BY-SA 4.0)

Copy

import pandas as pd

dtypes = {
    "y": "int",
    "z": "int",
    "a": "int",
    "b": "int",
    "c": "object",
    "total_x": "f64",
}
meta = pd.DataFrame(columns=dtypes.keys())

import dask
import pandas as pd

does dask compute store results?

Python

Lines of Code : 8

License : Strong Copyleft (CC BY-SA 4.0)

Copy

Dask Series Structure:
npartitions=2
0    int64
5      ...
9      ...
Name: data1x2, dtype: int64
Dask Name: getitem, 15 tasks

Faster methods to create geodataframe from a Dask or Pandas dataframe

Python

Lines of Code : 10

License : Strong Copyleft (CC BY-SA 4.0)

Copy

import dask
from dask import dataframe as dd
import dask_geopandas

BM = dd.read_csv(BM_path, skiprows=2,names=["X","Y","Z","Lith"])
BM["geometry"] = dask_geopandas.points_from_xy(BM,"X","Y","Z")
gdf = dask_geopandas.from_dask_dataframe(BM

Dask to_parquet throws exception "No such file or directory"

Python

Lines of Code : 9

License : Strong Copyleft (CC BY-SA 4.0)

Copy

ddf = dd.from_pandas(pdf, npartitions=3) 
ddf.to_parquet('C:\\temp\\OLD_FILE_NAME', engine='pyarrow', overwrite=True)
ddf2 = dd.read_parquet('C:\\temp\\OLD_FILE_NAME') 
ddf2['new_column'] = 1
ddf2.to_parquet('C:\\temp\\NEW_FILE_NAME', engi

ImportError: cannot import name '_unicodefun' from 'click'

Python

Lines of Code : 9

License : Strong Copyleft (CC BY-SA 4.0)

Copy

          python-version: 3.8
      - name: install black
        run: |
-          pip install black==20.8b1
+          pip install black==22.3.0
      - name: run black
        run: |
          black . --check --line-length 100

Creating a new column in dask (arrays ,list)

Python

Lines of Code : 22

License : Strong Copyleft (CC BY-SA 4.0)

Copy

df["x"] = df["y"].isin(a_list).map({False: "No", True: "Yes"})

import dask

df = dask.datasets.timeseries(seed=123)

df["x"] = df["name"].isin(["Bob", "Tim"]).map({False: "No", True: "Yes"})

print(df.head(10))
#

Dask DataFrame.to_parquet fails on read - repartition - write operation

Python

Lines of Code : 50

License : Strong Copyleft (CC BY-SA 4.0)

Copy

In [9]: tinydf = pd.DataFrame({"col1": [11, 21], "col2": [12, 22]})
   ...: for i in range(1000):
   ...:     tinydf.to_parquet(f"myfile_{i}.parquet")

In [10]: df = dask.dataframe.read_parquet([f"myfile_{i}.parquet

Community Discussions

Trending Discussions on dask

Faster methods to create geodataframe from a Dask or Pandas dataframe

ImportError: cannot import name '_unicodefun' from 'click'

Dask DataFrame.to_parquet fails on read - repartition - write operation

Dask : how the memory limit is calculated in "auto" mode?

Dask worker post-processing

Running dask map_partition functions in multiple workers

Submit worker functions in dask distributed without waiting for the functions to end

TypeError: load() missing 1 required positional argument: 'Loader' in Google Colab

limit number of CPUs used by dask compute

Dask read CSV files recursively from directories

QUESTION

Faster methods to create geodataframe from a Dask or Pandas dataframe

Asked 2022-Mar-31 at 20:54

Problem

I'm trying to clip a very large block model (5.8gb CSV file) containing centroid x, y, and z coordinates with an elevation raster. I'm trying to obtain only the blocks lying just above the raster layer.

I usually do this in ArcGIS by clipping my block model points to the outline of my raster and then extracting the raster values to the block model points. For large datasets this takes an ungodly amount of time (yes, that's a technical term) in ArcGIS.

How I want to solve it

I want to speed this up by importing the CSV to Python. Using Dask, this is quick and easy:

...

ANSWER

Answered 2022-Mar-31 at 20:54

The optimal way of linking dask and geopandas is the dask-geopandas package.

Source https://stackoverflow.com/questions/71685387

QUESTION

ImportError: cannot import name '_unicodefun' from 'click'

Asked 2022-Mar-30 at 08:58

if running our lint checks with the python black pkg. an error comes up

ImportError: cannot import name '_unicodefun' from 'click' (/Users/robot/.cache/pre-commit/repo3u71ccm2/py_env-python3.9/lib/python3.9/site-packages/click/__init__.py)

...

ANSWER

Answered 2022-Mar-30 at 08:58

This has been fixed by Black 22.3.0. Versions before that won't work with click 8.1.0.

https://github.com/psf/black/issues/2964

E.g.: black.yml

Source https://stackoverflow.com/questions/71673404

QUESTION

Dask DataFrame.to_parquet fails on read - repartition - write operation

Asked 2022-Mar-20 at 17:41

I have the following workflow.

...

ANSWER

Answered 2022-Mar-16 at 04:54

The new divisions are chosen so that the total memory of the files in each partition doesn't exceed 1000 MB.

If the main consideration for repartitioning is memory, it might be a good idea to use .repartition(partition_size='1000MB'). The script looks like:

Source https://stackoverflow.com/questions/71486742

QUESTION

Dask : how the memory limit is calculated in "auto" mode?

Asked 2022-Mar-16 at 14:05

The documentation shows the following formula in case of "auto" mode :

$ dask-worker .. --memory-limit=auto # TOTAL_MEMORY * min(1, nthreads / total_nthreads)

My CPU spec :

...

ANSWER

Answered 2022-Mar-16 at 14:05

I suspect nthreads refers to how many threads this particular worker has available to schedule tasks on while total_nthreads refers to the total number of threads available on your system.

The dask-worker CLI command has the same defaults as LocalCluster (see GitHub issue). Assuming the defaults for LocalCluster spin up n workers where n is the number of available cores on your system and assign m threads to each worker where m is the number of threads per core:

Source https://stackoverflow.com/questions/71494237

QUESTION

Dask worker post-processing

Asked 2022-Mar-12 at 00:15

I'm new to dask and am trying to implement some post-processing tasks when workers shutdown. I'm currently using an EC2Cluster with n_workers=5

The cluster is created each time I need to run my large task. The task outputs a bunch of files which I want to send to AWS S3.

How would I implement a "post-processing" function that would run on each worker to send any logs and outputs to my AWS S3?

Thanks in advance

...

ANSWER

Answered 2022-Mar-12 at 00:15

You can use Python’s standard logging module to log whatever you'd like as the workers are running and then use the worker plugin you wrote to save these logs to an S3 bucket on teardown (check out the docs on logging in Dask for more details). Here's an example:

Source https://stackoverflow.com/questions/71410169

QUESTION

Running dask map_partition functions in multiple workers

Asked 2022-Mar-11 at 19:11

I have a dask architecture implemented with five docker containers: a client, a scheduler, and three workers. I also have a large dask dataframe stored in parquet format in a docker volume. The dataframe was created with 3 partitions, so there are 3 files (one file per partition).

I need to run a function on the dataframe with map_partitions, where each worker will take one partition to process.

My attempt:

...

ANSWER

Answered 2022-Mar-11 at 13:27

The python snippet does not appear to use the dask API efficiently. It might be that your actual function is a bit more complex, so map_partitions cannot be avoided, but let's take a look at the simple case first:

Source https://stackoverflow.com/questions/71401760

QUESTION

Submit worker functions in dask distributed without waiting for the functions to end

Asked 2022-Mar-07 at 01:32

I have this python code that uses the apscheduler library to submit processes, it works fine:

...

ANSWER

Answered 2022-Mar-07 at 01:32

Dask distributed has a fire_and_forget method which is an alternative to e.g. client.compute or dask.distributed.wait if you want the scheduler to hang on to the tasks even if the futures have fallen out of scope on the python process which submitted them.

Source https://stackoverflow.com/questions/71341193

QUESTION

TypeError: load() missing 1 required positional argument: 'Loader' in Google Colab

Asked 2022-Mar-04 at 11:01

I am trying to do a regular import in Google Colab.
This import worked up until now.
If I try:

...

ANSWER

Answered 2021-Oct-15 at 21:11

Found the problem.
I was installing pandas_profiling, and this package updated pyyaml to version 6.0 which is not compatible with the current way Google Colab imports packages.
So just reverting back to pyyaml version 5.4.1 solved the problem.

For more information check versions of pyyaml here.
See this issue and formal answers in GitHub

##################################################################
For reverting back to pyyaml version 5.4.1 in your code, add the next line at the end of your packages installations:

Source https://stackoverflow.com/questions/69564817

QUESTION

limit number of CPUs used by dask compute

Asked 2022-Feb-28 at 17:07

Below code uses appx 1 sec to execute on an 8-CPU system. How to manually configure number of CPUs used by dask.compute eg to 4 CPUs so the below code will use appx 2 sec to execute even on an 8-CPU system?

...

ANSWER

Answered 2022-Feb-22 at 14:23

There are a few options:

specify number of workers at the time of cluster creation

Source https://stackoverflow.com/questions/69396200

QUESTION

Dask read CSV files recursively from directories

Asked 2022-Jan-30 at 11:35

For the following directory structure

...

ANSWER

Answered 2022-Jan-30 at 11:35

IIUC, you can use:

Source https://stackoverflow.com/questions/70913988

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install dask

You can install using 'pip install dask' or download it from GitHub, PyPI.
You can use dask like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: