vaex | Core hybrid Apache Arrow/NumPy DataFrame

 by vaexio | Python | Version: vaexpaper_v1 | License: MIT

kandi X-RAY | vaex Summary


vaex is a Python library typically used in Big Data, Jupyter, Pandas, and Spark applications. It has a build file available, a permissive license, and high community support. However, static analysis reports 140 bugs and 21 vulnerabilities. You can download it from GitHub.


            kandi-support Support

              vaex has a highly active ecosystem.
              It has 7,914 stars, 593 forks, and 141 watchers.
              It has had no major release in the last 12 months.
              There are 397 open issues and 770 closed issues; on average, issues are closed in 189 days. There are 98 open pull requests and 0 closed pull requests.
              It has a positive sentiment in the developer community.
              The latest version of vaex is vaexpaper_v1.

            kandi-Quality Quality

              vaex has 140 bugs (14 blocker, 0 critical, 62 major, 64 minor) and 2384 code smells.

            kandi-Security Security

              No vulnerabilities have been reported in vaex's dependent libraries.
              However, code analysis of vaex itself shows 21 unresolved vulnerabilities (0 blocker, 21 critical, 0 major, 0 minor).
              There are 15 security hotspots that need review.

            kandi-License License

              vaex is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              vaex releases are available to install and integrate.
              A build file is available, so you can build the component from source.
              Installation instructions are not available, but examples and code snippets are.
              vaex saves you an estimated 16,331 person-hours of effort over developing the same functionality from scratch.
              It has 32,487 lines of code, 2,700 functions, and 174 files.
              It has high code complexity, which directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed vaex and discovered the below as its top functions. This is intended to give you an instant insight into vaex implemented functionality, and help decide if they suit your requirements.
            • Plot a heatmap
            • Add a toolbar
            • Initialize the plugin
            • Set up the main widget
            • Draw the front side
            • Plot the grid
            • Run the main scheduler loop
            • Evaluate an expression
            • Read a column from a file
            • Plot with bqplot

            vaex Key Features

            No Key Features are available at this moment for vaex.

            vaex Examples and Code Snippets

            Vaex support
            Python · Lines of Code: 84 · License: Non-SPDX (NOASSERTION)
            tims2hdf5.py --help
            
            python tims2hdf5.py --help
            
            usage: tims2hdf5.py [-h] [--compression {gzip,lzf,szip,none}] [--compression_level COMPRESSION_LEVEL] [--chunksNo CHUNKSNO] [--shuffle]
                                [--columns {frame,scan,tof,intensity,mz,inv_i  
            Requirements
            Python · Lines of Code: 7 · License: Non-SPDX (NOASSERTION)
            sudo apt install python3.8-dev
            
            pip install timspy
            
            pip install git+https://github.com/michalsta/opentims
            
            pip uninstall numpy
            pip install numpy==1.19.3
            
            pip install timspy[vaex]
            
            pip install vaex-core vaex-hdf5 h5py
              
            Installation
            Python · Lines of Code: 7 · License: Non-SPDX (NOASSERTION)
            pip install timspy
            
            pip install -e "git+https://github.com/MatteoLacki/timspy@devel#egg=timspy"
            
            git clone https://github.com/MatteoLacki/timspy
            cd timspy
            pip install -e .
            
            pip install timspy[vaex]
            
            pip install vaex-core vaex-hdf5
              

            Community Discussions

            QUESTION

            Virtual column with calculation in Vaex
            Asked 2022-Jan-05 at 15:22

            I want to set a virtual column to a calculation using another column in Vaex. I need to use an if statement inside this calculation. In general I want to call

            ...

            ANSWER

            Answered 2021-Dec-30 at 06:41

            It might be useful to use a mask for subsetting the relevant rows:
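            The masking idea can be sketched with NumPy (the data and threshold here are made up; vaex expressions follow the same pattern, and vaex also exposes df.func.where for per-row if/else logic):

```python
import numpy as np

# Hypothetical column we want to transform conditionally.
x = np.array([-2.0, 1.0, 3.0, -4.0])

# Build a boolean mask selecting the relevant rows, then compute the
# derived value only where the condition holds ("if x > 0 then x*10 else 0").
mask = x > 0
result = np.where(mask, x * 10, 0.0)
```

In vaex, the same expression applied to dataframe columns defines a virtual column, so no copy of the data is materialized.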

            Source https://stackoverflow.com/questions/70513644

            QUESTION

            Splitting an array according to input offset values, while keeping duplicates in the same chunk
            Asked 2021-Dec-19 at 20:59

            Given a list of indexes (offset values) according to which a numpy array is split, I would like to adjust it so that a split never occurs on duplicate values. This means duplicate values will end up in one chunk only.

            I have worked out the following piece of code, which gives the result, but I am not super proud of it. I would like to stay in the numpy world and use vectorized numpy functions as much as possible. But to check the indexes (offset values) I use a for loop and store the result in a list.

            Do you have any idea how to vectorize the 2nd part?

            If this helps, ar is a sorted array. (I am not using this fact in the code below.)

            ...

            ANSWER

            Answered 2021-Dec-18 at 17:43

            You can use np.digitize to clamp the offsets into bins:
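            A minimal sketch of that idea (the data is made up; edges mark the positions where the sorted value changes, and each offset is snapped to the next boundary so a run of duplicates is never cut):

```python
import numpy as np

ar = np.array([1, 1, 2, 2, 2, 3, 4, 4, 5])  # sorted array with duplicates
offsets = np.array([3, 7])                   # proposed split positions

# Boundaries where the value changes; splitting here never separates
# duplicates. Assumes every offset falls before the last boundary.
edges = np.flatnonzero(np.diff(ar)) + 1      # array([2, 5, 6, 8])

# np.digitize finds, for each offset, which bin between boundaries it
# lands in; indexing edges with that snaps it to the next safe boundary.
adjusted = edges[np.digitize(offsets, edges)]  # array([5, 8])

chunks = np.split(ar, adjusted)
```

Here the offset 3 (which would cut the run of 2s) moves to 5, and 7 (which would cut the run of 4s) moves to 8.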

            Source https://stackoverflow.com/questions/70405390

            QUESTION

            MainThread: Vaex: Error while Opening Azure Data Lake Parquet file
            Asked 2021-Nov-16 at 13:02

            I tried to open a Parquet file on Azure Data Lake Gen2 storage with vaex, using a generated SAS URL (with the datetime limit and token embedded in the URL), by doing:

            vaex.open(sas_url)

            and I got the error

            ERROR:MainThread:vaex:error opening 'the path which was also the sas_url(can't post it for security reasons)' ValueError: Do not know how to open (can't publicize the sas url) , no handler for https is known

            How do I get vaex to read the file or is there another azure storage that works better with vaex?

            ...

            ANSWER

            Answered 2021-Aug-24 at 08:40

            Vaex cannot read data over an https source; that is why you are getting the error "no handler for https is known".

            Also, as per the document, vaex supports data input from Amazon S3 buckets and Google cloud storage.

            Cloud support:

            Amazon Web Services S3

            Google Cloud Storage

            Other cloud storage options

            The docs mention that other cloud storage options are also supported, but there is no documentation or example anywhere of fetching data from an Azure storage account, let alone via a SAS URL.

            See also the API documentation for the vaex library for more info.

            Source https://stackoverflow.com/questions/68814291

            QUESTION

            vaex apply does not work when using dataframe columns
            Asked 2021-Nov-15 at 16:16

            I am trying to tokenize natural language for the first sentence in wikipedia in order to find 'is a' patterns. n-grams of the tokens and left over text would be the next step. "Wellington is a town in the UK." becomes "town is a attr_root in the country." Then find common patterns using n-grams.

            For this I need to replace string values in a string column using other string columns in the dataframe. In Pandas I can do this using

            ...

            ANSWER

            Answered 2021-Nov-15 at 16:16

            Turns out I had a bug: I needed to pass dfv, not df, in the call to apply.

            Also got this faster method from the nice people at vaex.

            Source https://stackoverflow.com/questions/69971992

            QUESTION

            Apply a function to each group when groups are split across multiple files, without concatenating all the files
            Asked 2021-Oct-07 at 08:29

            My data comes from BigQuery, exported to a GCS bucket as CSV files; if the file size is quite massive, BigQuery automatically splits the data into several chunks. With time series in mind, a given time series might be scattered across different files. I have a custom function that I want to apply to each TimeseriesID.

            Here's some constraint of the data:

            • The data is sorted by TimeseriesID and TimeID
            • The number of rows in each file may vary, but each file has at least 1 row (a single-row file is very unlikely)
            • The starting TimeID is not always 0
            • The length of each time series may vary, but at most it will be scattered across 2 files; no time series spans 3 different files

            Here's the initial setup to illustrate the problem:

            ...

            ANSWER

            Answered 2021-Oct-07 at 08:29

            The approach presented by the OP, using concat with millions of records, would be overkill for memory and other resources.

            I tested the OP's code using Google Colab notebooks, and it confirmed this was a bad approach.

            Source https://stackoverflow.com/questions/69430988

            QUESTION

            Loading vaex dataframe to dash datatable
            Asked 2021-Sep-21 at 15:47

            I am trying to load a vaex dataframe into a Dash DataTable and am getting the following error:

            Invalid argument data passed into DataTable with ID "table". Expected an array. Was supplied type object.

            I tried the following. Is it possible to load a vaex dataframe into a Dash DataTable like a pandas dataframe?

            ...

            ANSWER

            Answered 2021-Sep-21 at 15:47

            After looking at the vaex docs and testing, it looks like you need to use the code below to get the proper data orientation:
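            The likely fix is to hand DataTable row-oriented records rather than the column-oriented dict a vaex export typically produces. A sketch of the reshaping in plain Python (the column values are hypothetical; with vaex you would first obtain a dict of columns, e.g. via a to_dict-style export):

```python
# Column-oriented data, as a dataframe export typically provides it.
columns = {"a": [1, 2, 3], "b": ["x", "y", "z"]}

# Dash DataTable expects a list of per-row dicts ("records" orientation).
records = [dict(zip(columns.keys(), row)) for row in zip(*columns.values())]
```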

            Source https://stackoverflow.com/questions/69224055

            QUESTION

            Caching a large data structure in python across instances while maintaining types
            Asked 2021-Sep-13 at 15:20

            I'm looking to use a distributed cache in Python. I have a FastAPI application and want every instance to have access to the same data, as our load balancer may route incoming requests differently. The problem is that I'm storing and editing information about a relatively big data set from an Arrow feather file and processing it with Vaex. The feather file automatically loads the correct types for the data. The data structure I need to store uses a user id as the key, and the value is a large array of arrays of numbers. I've looked at memcached and redis as possible caching solutions, but both seem to store entries as strings / simple values. I'm looking to avoid parsing strings and extra processing on a large amount of data. Is there a distributed caching strategy that will let me persist types?

            One solution we came up with is to store the data in multiple feather files in a directory that is accessible to all instances of the app, but this seems messy, as you would need to clean up / delete the files after each session.

            ...

            ANSWER

            Answered 2021-Sep-13 at 15:20

            Redis 'strings' can actually store arbitrary binary data; they aren't limited to actual text. From https://redis.io/topics/data-types:

            Redis Strings are binary safe, this means that a Redis string can contain any kind of data, for instance a JPEG image or a serialized Ruby object. A String value can be at max 512 Megabytes in length.

            Another option is to use Flatbuffers, which is a serialisation protocol specifically designed to allow reading/writing serialised objects without expensive deserialisation.

            Although I would suggest reconsidering storing large, complex data structures as cache values. The drawback is that any change will lead to having to rewrite the entire thing in cache which can get expensive, so consider breaking it up into smaller k/v pairs if possible. You could use Redis Hash data type to make this easier to implement.
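            To illustrate the binary-safe point: any Python structure can be serialized to bytes and stored under a Redis key as-is, with types surviving the round trip. A stdlib sketch using pickle (the redis client calls are shown as comments and are assumed, not executed here):

```python
import pickle

# A large array-of-arrays keyed by user id, as described in the question.
value = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

blob = pickle.dumps(value)   # arbitrary bytes are a valid Redis value
# r = redis.Redis()          # hypothetical client usage
# r.set("user:42", blob)
# blob = r.get("user:42")

restored = pickle.loads(blob)  # types survive the round trip
```

Flatbuffers or Arrow IPC would avoid the deserialization cost that pickle still pays on every read.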

            Source https://stackoverflow.com/questions/69165052

            QUESTION

            Fastest most efficient way to apply groupby that references multiple columns
            Asked 2021-Aug-14 at 08:09

            Suppose we have a dataset.

            ...

            ANSWER

            Answered 2021-Aug-14 at 05:26

            You can use agg on all wanted columns and add a prefix:
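            A minimal sketch of that pattern with pandas (the column names and data are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "x": [1, 2, 3],
    "y": [10, 20, 30],
})

# Aggregate all wanted columns in one pass, then prefix the result
# columns so their origin is clear.
out = df.groupby("key")[["x", "y"]].agg("mean").add_prefix("mean_")
```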

            Source https://stackoverflow.com/questions/68780560

            QUESTION

            python pandas - is there any faster way to do the explode operation according to the requirement
            Asked 2021-Aug-12 at 09:17

            The code is as follows; the input dataframe is

            ...

            ANSWER

            Answered 2021-Aug-12 at 08:43

            You could do it within Pandas (not sure why you need to combine the data this way):

            Groupby on class, convert everything to string, and aggregate with python's str.join:
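            A minimal sketch of that approach (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"class": ["a", "a", "b"], "val": [1, 2, 3]})

# Group on class, cast the values to string, and join each group.
out = df.groupby("class")["val"].agg(lambda s: ",".join(s.astype(str)))
```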

            Source https://stackoverflow.com/questions/68753947

            QUESTION

            Why does matplotlib.pyplot.savefig() mess up image outputs for very large pandas.plotting.scatter_matrix()?
            Asked 2021-Jul-30 at 15:08

            I was trying to compute the pandas.plotting.scatter_matrix() values for a very large pandas.DataFrame() (relatively speaking for this specific operation; most libraries either run OOM most of the time or implement a row-count check of 50,000, see vaex-scatter).

            The 'Time series' DataFrame shape I have is (10000000, 41). Every value is either a float or an integer.

            Q1: So the first thing I would already like to ask is how do I do that memory and space efficiently.

            What I tried for Q1
            • I tried to do it typically (like in the examples in the documentation) using matplotlib and modin.pandas.DataFrames, looping over each pair, so the indexing and operations/calculations I want to do are relatively fast, including the to_numpy() method. However, as you might have already seen from the image, one pair takes at least 18.1 seconds, and 41×41 pairs are too difficult to handle in my task; I feel there is a relatively faster way of doing things. :)

            • I tried using the pandas scatter plot function, which is also too slow and exhausts my memory. This was done using the native pandas package rather than modin.pandas, by first converting the modin.pandas.DataFrame to a pandas.DataFrame via the private modin.pandas.DataFrame._to_pandas() accessor. This approach is too slow as well; I stopped waiting after I ran out of memory an hour later.

            • I tried plotting with vaex. This was the fastest, but I ran into other errors which aren't related to the question.

            • Please do not suggest seaborn's pairplot. I tried it, and it takes around 5 minutes to generate a pairplot() for a pandas.DataFrame of shape (1000, 8); it is also centered around pandas.

            Current workaround for Q1 and new Q2
            • I am plotting a scatter matrix of all the features sampled 10,000 times, i.e. modin.DataFrame.sample(10000), since it is kind of okay for viewing the general trend, but I do not wish to do this if there is a better option.
            • Converting it to a pandas.DataFrame and using pandas.plotting.scatter_matrix, so that I don't have to wait for it to be rendered in the Jupyter notebook.
            ...

            ANSWER

            Answered 2021-Jul-30 at 15:08

            For future readers: the approach I opted for was to use Datashader (datashader.org), as @JodyKlymak suggested in his comment (thanks!), with pandas.DataFrame.

            Please bear in mind that this approach answers both questions.

            1. Convert your modin.pandas.DataFrame to a pandas.DataFrame with the private modin.pandas.DataFrame._to_pandas() accessor.
            2. Plot the graphs first to an xarray image, as in xarray's imshow.

            Source https://stackoverflow.com/questions/68578730

            Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install vaex

            You can download it from GitHub.
            You can use vaex like any standard Python library. You will need a development environment consisting of a Python distribution including header files, a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
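            Following those recommendations, a typical setup looks like the sketch below (this mirrors common pip practice, not the project's official instructions; `vaex` is the meta-package, and `vaex-core` alone suffices for minimal use):

```shell
# create and activate an isolated environment
python -m venv .venv
source .venv/bin/activate

# keep build tooling current, then install vaex
pip install --upgrade pip setuptools wheel
pip install vaex
```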

            Support

            For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check and ask on the community page at Stack Overflow.