vaex | Core hybrid Apache Arrow/NumPy DataFrame

 by vaexio | Python | Version: vaexpaper_v1 | License: MIT

kandi X-RAY | vaex Summary


vaex is a Python library typically used in Big Data, Jupyter, Pandas, and Spark applications. It has a build file available, a permissive license, and high community support. However, static analysis reports 140 bugs and 21 vulnerabilities. You can download it from GitHub.


            kandi-support Support

              vaex has a highly active ecosystem.
              It has 7,914 stars, 593 forks, and 141 watchers.
              It has had no major release in the last 12 months.
              There are 397 open issues and 770 closed issues; on average, issues are closed in 189 days. There are 98 open pull requests and 0 closed pull requests.
              It has a positive sentiment in the developer community.
              The latest version of vaex is vaexpaper_v1.

            kandi-Quality Quality

              vaex has 140 bugs (14 blocker, 0 critical, 62 major, 64 minor) and 2384 code smells.

            kandi-Security Security

              No vulnerabilities have been reported in vaex's dependent libraries.
              However, code analysis of vaex itself shows 21 unresolved vulnerabilities (0 blocker, 21 critical, 0 major, 0 minor).
              There are 15 security hotspots that need review.

            kandi-License License

              vaex is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              vaex releases are available to install and integrate.
              A build file is available, so you can build the component from source.
              Installation instructions are not available, but examples and code snippets are.
              vaex saves you an estimated 16,331 person-hours of effort over developing the same functionality from scratch.
              It has 32,487 lines of code, 2,700 functions, and 174 files.
              It has high code complexity, which directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed vaex and discovered the below as its top functions. This is intended to give you an instant insight into vaex implemented functionality, and help decide if they suit your requirements.
            • Plot a heatmap
            • Add a toolbar
            • Initialize the plugin
            • Set up the main widget
            • Draw the front side
            • Plot the grid
            • Run the main scheduler loop
            • Evaluate an expression
            • Read a column from a file
            • Plot with bqplot

            vaex Key Features

            No Key Features are available at this moment for vaex.

            vaex Examples and Code Snippets

            Vaex support
            Python · Lines of Code: 84 · License: Non-SPDX (NOASSERTION)
            tims2hdf5.py --help
            
            python tims2hdf5.py --help
            
            usage: tims2hdf5.py [-h] [--compression {gzip,lzf,szip,none}] [--compression_level COMPRESSION_LEVEL] [--chunksNo CHUNKSNO] [--shuffle]
                                [--columns {frame,scan,tof,intensity,mz,inv_i  
            Requirements
            Python · Lines of Code: 7 · License: Non-SPDX (NOASSERTION)
            sudo apt install python3.8-dev
            
            pip install timspy
            
            pip install git+https://github.com/michalsta/opentims
            
            pip uninstall numpy
            pip install numpy==1.19.3
            
            pip install timspy[vaex]
            
            pip install vaex-core vaex-hdf5 h5py
              
            Installation
            Python · Lines of Code: 7 · License: Non-SPDX (NOASSERTION)
            pip install timspy
            
            pip install -e "git+https://github.com/MatteoLacki/timspy@devel#egg=timspy"
            
            git clone https://github.com/MatteoLacki/timspy
            cd timspy
            pip install -e .
            
            pip install timspy[vaex]
            
            pip install vaex-core vaex-hdf5
              

            Community Discussions

            QUESTION

            Virtual column with calculation in Vaex
            Asked 2022-Jan-05 at 15:22

            I want to set a virtual column to a calculation using another column in Vaex. I need to use an if statement inside this calculation. In general I want to call

            ...

            ANSWER

            Answered 2021-Dec-30 at 06:41

            It might be useful to use a mask for subsetting the relevant rows:
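            The masking idea can be sketched with NumPy (the data and threshold here are made up; vaex expressions follow the same pattern, and vaex also exposes df.func.where for per-row if/else logic):

```python
import numpy as np

# Hypothetical column we want to transform conditionally.
x = np.array([-2.0, 1.0, 3.0, -4.0])

# Build a boolean mask selecting the relevant rows, then compute the
# derived value only where the condition holds ("if x > 0 then x*10 else 0").
mask = x > 0
result = np.where(mask, x * 10, 0.0)
```

In vaex, the same expression applied to dataframe columns defines a virtual column, so no copy of the data is materialized.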

            Source https://stackoverflow.com/questions/70513644

            QUESTION

            Splitting an array according to input offset values, while keeping duplicates in the same chunk
            Asked 2021-Dec-19 at 20:59

            Given a list of indexes (offset values) according to which a numpy array is split, I would like to adjust it so that a split never occurs on duplicate values. This means duplicate values will end up in one chunk only.

            I have worked out the following piece of code, which gives the result, but I am not super proud of it. I would like to stay in the numpy world and use vectorized numpy functions as much as possible. But to check the indexes (offset values) I use a for loop and store the result in a list.

            Do you have any idea how to vectorize the 2nd part?

            If this helps, ar is a sorted array. (I am not using this fact in the code below.)

            ...

            ANSWER

            Answered 2021-Dec-18 at 17:43

            You can use np.digitize to clamp the offsets into bins:
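            A minimal sketch of that idea (the data is made up; edges mark the positions where the sorted value changes, and each offset is snapped to the next boundary so a run of duplicates is never cut):

```python
import numpy as np

ar = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5])  # sorted array with duplicates
offsets = np.array([3, 7])                   # proposed split positions

# Boundaries where the value changes; splitting here never separates
# duplicates. Assumes every offset falls before the last boundary.
edges = np.flatnonzero(np.diff(ar)) + 1      # array([2, 5, 6, 8])

# np.digitize finds, for each offset, which bin between boundaries it
# lands in; indexing edges with that snaps it to the next safe boundary.
adjusted = edges[np.digitize(offsets, edges)]  # array([5, 8])

chunks = np.split(ar, adjusted)
```

Here the offset 3 (which would cut the run of 2s) moves to 5, and 7 (which would cut the run of 4s) moves to 8.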

            Source https://stackoverflow.com/questions/70405390

            QUESTION

            MainThread: Vaex: Error while Opening Azure Data Lake Parquet file
            Asked 2021-Nov-16 at 13:02

            I tried to open a Parquet file on Azure Data Lake Gen2 storage with vaex, using a generated SAS URL (with the datetime limit and token embedded in the URL), by doing:

            vaex.open(sas_url)

            and I got the error

            ERROR:MainThread:vaex:error opening 'the path which was also the sas_url(can't post it for security reasons)' ValueError: Do not know how to open (can't publicize the sas url) , no handler for https is known

            How do I get vaex to read the file or is there another azure storage that works better with vaex?

            ...

            ANSWER

            Answered 2021-Aug-24 at 08:40

            Vaex cannot read data over an https source; that is why you are getting the error "no handler for https is known".

            Also, as per the document, vaex supports data input from Amazon S3 buckets and Google cloud storage.

            Cloud support:

            Amazon Web Services S3

            Google Cloud Storage

            Other cloud storage options

            The docs mention that other cloud storage options are also supported, but there is no documentation or example anywhere of fetching data from an Azure storage account, let alone via a SAS URL.

            See also the API documentation for the vaex library for more info.

            Source https://stackoverflow.com/questions/68814291

            QUESTION

            vaex apply does not work when using dataframe columns
            Asked 2021-Nov-15 at 16:16

            I am trying to tokenize natural language for the first sentence in wikipedia in order to find 'is a' patterns. n-grams of the tokens and left over text would be the next step. "Wellington is a town in the UK." becomes "town is a attr_root in the country." Then find common patterns using n-grams.

            For this I need to replace string values in a string column using other string columns in the dataframe. In Pandas I can do this using

            ...

            ANSWER

            Answered 2021-Nov-15 at 16:16

            Turns out I had a bug: I needed to pass dfv, not df, in the call to apply.

            Also got this faster method from the nice people at vaex.

            Source https://stackoverflow.com/questions/69971992

            QUESTION

            Apply a function to each group when groups are split across multiple files, without concatenating all the files
            Asked 2021-Oct-07 at 08:29

            My data comes from BigQuery, exported to a GCS bucket as CSV files; if the file size is quite massive, BigQuery automatically splits the data into several chunks. With time series in mind, a given time series might be scattered across different files. I have a custom function that I want to apply to each TimeseriesID.

            Here's some constraint of the data:

            • The data is sorted by TimeseriesID and TimeID
            • The number of rows in each file may vary, but each file has at least 1 row (a single-row file is very unlikely)
            • The starting TimeID is not always 0
            • The length of each time series may vary, but at most it will be scattered across 2 files; no time series spans 3 different files

            Here's the initial setup to illustrate the problem:

            ...

            ANSWER

            Answered 2021-Oct-07 at 08:29

            The approach presented by the OP, using concat with millions of records, would be overkill for memory and other resources.

            I tested the OP's code using Google Colab notebooks, and it confirmed this was a bad approach.

            Source https://stackoverflow.com/questions/69430988

            QUESTION

            Loading vaex dataframe to dash datatable
            Asked 2021-Sep-21 at 15:47

            I am trying to load a vaex dataframe into a Dash DataTable and am getting the following error:

            Invalid argument data passed into DataTable with ID "table". Expected an array. Was supplied type object.

            I tried the following. Is it possible to load a vaex dataframe into a Dash DataTable like a pandas dataframe?

            ...

            ANSWER

            Answered 2021-Sep-21 at 15:47

            After looking at the vaex docs and testing, it looks like you need to use the code below to get the proper data orientation:
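            The likely fix is to hand DataTable row-oriented records rather than the column-oriented dict a vaex export typically produces. A sketch of the reshaping in plain Python (the column values are hypothetical; with vaex you would first obtain a dict of columns, e.g. via a to_dict-style export):

```python
# Column-oriented data, as a dataframe export typically provides it.
columns = {"a": [1, 2, 3], "b": ["x", "y", "z"]}

# Dash DataTable expects a list of per-row dicts ("records" orientation).
records = [dict(zip(columns.keys(), row)) for row in zip(*columns.values())]
```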

            Source https://stackoverflow.com/questions/69224055

            QUESTION

            Caching a large data structure in python across instances while maintaining types
            Asked 2021-Sep-13 at 15:20

            I'm looking to use a distributed cache in Python. I have a FastAPI application and want every instance to have access to the same data, as our load balancer may route incoming requests differently. The problem is that I'm storing and editing information about a relatively big data set from an Arrow feather file and processing it with Vaex. The feather file automatically loads the correct types for the data. The data structure I need to store uses a user id as the key, and the value is a large array of arrays of numbers. I've looked at memcached and redis as possible caching solutions, but both seem to store entries as strings / simple values. I'm looking to avoid parsing strings and extra processing on a large amount of data. Is there a distributed caching strategy that will let me persist types?

            One solution we came up with is to store the data in multiple feather files in a directory that is accessible to all instances of the app, but this seems messy, as you would need to clean up / delete the files after each session.

            ...

            ANSWER

            Answered 2021-Sep-13 at 15:20

            Redis 'strings' can actually store arbitrary binary data; they aren't limited to actual text. From https://redis.io/topics/data-types:

            Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object. A String value can be at max 512 Megabytes in length.

            Another option is to use Flatbuffers, which is a serialisation protocol specifically designed to allow reading/writing serialised objects without expensive deserialisation.

            Although I would suggest reconsidering storing large, complex data structures as cache values. The drawback is that any change will lead to having to rewrite the entire thing in cache which can get expensive, so consider breaking it up into smaller k/v pairs if possible. You could use Redis Hash data type to make this easier to implement.
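            To illustrate the binary-safe point: any Python structure can be serialized to bytes and stored under a Redis key as-is, with types surviving the round trip. A stdlib sketch using pickle (the redis client calls are shown as comments and are assumed, not executed here):

```python
import pickle

# A large array-of-arrays keyed by user id, as described in the question.
value = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

blob = pickle.dumps(value)   # arbitrary bytes are a valid Redis value
# r = redis.Redis()          # hypothetical client usage
# r.set("user:42", blob)
# blob = r.get("user:42")

restored = pickle.loads(blob)  # types survive the round trip
```

Flatbuffers or Arrow IPC would avoid the deserialization cost that pickle still pays on every read.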

            Source https://stackoverflow.com/questions/69165052

            QUESTION

            Fastest most efficient way to apply groupby that references multiple columns
            Asked 2021-Aug-14 at 08:09

            Suppose we have a dataset.

            ...

            ANSWER

            Answered 2021-Aug-14 at 05:26

            You can use agg on all wanted columns and add a prefix:
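            A minimal sketch of that pattern with pandas (the column names and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "x": [1, 2, 3],
    "y": [10, 20, 30],
})

# Aggregate all wanted columns in one pass, then prefix the result
# columns so their origin is clear.
out = df.groupby("key")[["x", "y"]].agg("mean").add_prefix("mean_")
```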

            Source https://stackoverflow.com/questions/68780560

            QUESTION

            python pandas - is there any faster way to do the explode operation according to the requirement
            Asked 2021-Aug-12 at 09:17

            The code is as follows; the input dataframe is

            ...

            ANSWER

            Answered 2021-Aug-12 at 08:43

            You could do it within Pandas (not sure why you need to combine the data this way):

            Groupby on class, convert everything to string, and aggregate with python's str.join:
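            A minimal sketch of that approach (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"class": ["a", "a", "b"], "val": [1, 2, 3]})

# Group on class, cast the values to string, and join each group.
out = df.groupby("class")["val"].agg(lambda s: ",".join(s.astype(str)))
```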

            Source https://stackoverflow.com/questions/68753947

            QUESTION

            Why does matplotlib.pyplot.savefig() mess up image outputs for very large pandas.plotting.scatter_matrix()?
            Asked 2021-Jul-30 at 15:08

            I was trying to compute the pandas.plotting.scatter_matrix() values for a very large pandas.DataFrame() (relatively speaking for this specific operation; most libraries either run OOM most of the time or implement a row-count check of 50,000, see vaex-scatter).

            The 'Time series' DataFrame shape I have is (10000000, 41). Every value is either a float or an integer.

            Q1: So the first thing I would already like to ask is how do I do that memory and space efficiently.

            What I tried for Q1
            • I tried to do it typically (like in the examples in the documentation) using matplotlib and modin.pandas.DataFrames, looping over each pair, so the indexing and operations/calculations I want to do are relatively fast, including the to_numpy() method. However, as you might have already seen from the image, one pair takes at least 18.1 seconds, and 41×41 pairs are too difficult to handle in my task; I feel there is a relatively faster way of doing things. :)

            • I tried using the pandas scatter plot function, which is also too slow and exhausts my memory. This was done using the native pandas package rather than modin.pandas, by first converting the modin.pandas.DataFrame to a pandas.DataFrame via the private modin.pandas.DataFrame._to_pandas() accessor. This approach is too slow as well; I stopped waiting after I ran out of memory an hour later.

            • I tried plotting with vaex. This was the fastest, but I ran into other errors which aren't related to the question.

            • Please do not suggest seaborn's pairplot. I tried it, and it takes around 5 minutes to generate a pairplot() for a pandas.DataFrame of shape (1000, 8); it is also centered around pandas.

            Current workaround for Q1 and new Q2
            • I am plotting a scatter matrix of all the features sampled 10,000 times, i.e. modin.DataFrame.sample(10000), since it is kind of okay for viewing the general trend, but I do not wish to do this if there is a better option.
            • Converting it to a pandas.DataFrame and using pandas.plotting.scatter_matrix, so that I don't have to wait for it to be rendered in the Jupyter notebook.
            ...

            ANSWER

            Answered 2021-Jul-30 at 15:08

            For future readers: the approach I opted for was to use Datashader (datashader.org), as @JodyKlymak suggested in his comment (thanks!), with pandas.DataFrame.

            Please bear in mind that this approach answers both questions.

            1. Convert your modin.pandas.DataFrame to a pandas.DataFrame with the private modin.pandas.DataFrame._to_pandas() accessor.
            2. Plot the graphs first to an xarray image, as in xarray's imshow.

            Source https://stackoverflow.com/questions/68578730

            Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install vaex

            You can download it from GitHub.
            You can use vaex like any standard Python library. You will need a development environment consisting of a Python distribution including header files, a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
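            Following those recommendations, a typical setup looks like the sketch below (this mirrors common pip practice, not the project's official instructions; `vaex` is the meta-package, and `vaex-core` alone suffices for minimal use):

```shell
# create and activate an isolated environment
python -m venv .venv
source .venv/bin/activate

# keep build tooling current, then install vaex
pip install --upgrade pip setuptools wheel
pip install vaex
```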

            Support

            For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check and ask on the community page at Stack Overflow.