petastorm | Machine Learning library

by Uber | Python Version: 0.12.2rc0 | License: Apache-2.0

kandi X-RAY | petastorm Summary

petastorm is a Python library typically used in Artificial Intelligence, Machine Learning, Deep Learning, PyTorch, TensorFlow, and Spark applications. petastorm has no bugs and no vulnerabilities, has a build file available, has a Permissive License, and has high support. You can install it with 'pip install petastorm' or download it from GitHub or PyPI.

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.

Support

petastorm has a highly active ecosystem.
It has 1584 star(s) with 267 fork(s). There are 43 watchers for this library.
It had no major release in the last 12 months.
There are 149 open issues and 148 have been closed. On average, issues are closed in 59 days. There are 18 open pull requests and 0 closed pull requests.
It has a negative sentiment in the developer community.
The latest version of petastorm is 0.12.2rc0.

Quality

              petastorm has 0 bugs and 0 code smells.

Security

              petastorm has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              petastorm code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

License

              petastorm is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

              petastorm releases are available to install and integrate.
              Deployable package is available in PyPI.
              Build file is available. You can build the component from source.
              It has 10755 lines of code, 1049 functions and 143 files.
It has high code complexity, which directly impacts the maintainability of the code.

            Top functions reviewed by kandi - BETA

kandi has reviewed petastorm and identified the functions below as its top functions. This is intended to give you instant insight into the functionality petastorm implements and to help you decide whether it suits your requirements.
• Create a reader for a dataset.
• Create a batch reader.
• Create a namedtuple of namedtuple fields.
• Start the worker.
• Context manager for creating a dataset.
• Train and test a dataset.
• Make a petastorm dataset.
• Convert a directory to a petastorm dataset.
• Load rows from a Parquet file.
• Generate metadata for petastorm.

            petastorm Key Features

            No Key Features are available at this moment for petastorm.

            petastorm Examples and Code Snippets

Lance: A Columnar Data Format for Deep Learning Dataset, Why
C++ | 10 lines of code | License: Permissive (Apache-2.0)
            graph LR
                A[Collection] --> B[Exploration];
                B --> C[Analytics];
                C --> D[Feature Engineer];
                D --> E[Training];
                E --> F[Evaluation];
                F --> C;
                E --> G[Deployment];
                G --> H[Monitoring];
                H -->   
Spectral Synthesis for Satellite-to-Satellite Translation, Dependencies
Python | 3 lines of code | License: Permissive (MIT)
            conda create --name geonex_torch1.5 python=3.7 pytorch=1.5 xarray numpy scipy pandas torchvision tensorboard opencv pyyaml jupyterlab matplotlib seaborn
            conda install -c conda-forge pyhdf
            pip install petastorm
              
Horovod on Spark - Installation - Horovod Spark Estimators
Python | 0 lines of code | License: Non-SPDX (NOASSERTION)
from tensorflow import keras
import tensorflow as tf
import horovod.spark.keras as hvd

# Sequential.add() returns None, so the calls cannot be chained;
# each layer is added in a separate statement.
model = keras.models.Sequential()
model.add(keras.layers.Dense(8, input_dim=2))
model.add(keras.layers.Activation('tanh'))
model.add(keras.layers.Dense(1))
model.add(k  
            horovod - keras spark3 rossmann
Python | 382 lines of code | License: Non-SPDX
            # Copyright 2017 onwards, fast.ai, Inc.
            # Modifications copyright (C) 2018 Uber Technologies, Inc.
            #
            # Licensed under the Apache License, Version 2.0 (the "License");
            # you may not use this file except in compliance with the License.
            # You may obtain  
            horovod - keras spark rossmann run
Python | 358 lines of code | License: Non-SPDX
            # Copyright 2017 onwards, fast.ai, Inc.
            # Modifications copyright (C) 2018 Uber Technologies, Inc.
            #
            # Licensed under the Apache License, Version 2.0 (the "License");
            # you may not use this file except in compliance with the License.
            # You may obtain  
            Python: Reading Parquet files stored on s3 using petastorm generates connection warnings
Python | 3 lines of code | License: Strong Copyleft (CC BY-SA 4.0)
fs = s3fs.S3FileSystem(config_kwargs={'max_pool_connections': 50})
            
            

            Community Discussions

            QUESTION

            How to print out data that goes to keras model.fit , specifically if using petastorm dataset
            Asked 2022-Jan-18 at 14:30

            Update

            While I appreciated AloneTogether's answer, I didn't like that I was using take() and it was separate from model.fit.

            I put another answer here if you want to look at it. It involves subclassing Model. It's not too bad.

            End of Update

I have a simple example: a Parquet file with 8 columns named feature_#, each populated with the values 1 to 100.

            ...

            ANSWER

            Answered 2022-Jan-18 at 11:59

I think it all depends on the size of your batch_size, because take(1) takes one batch; if batch_size < 100, you will not see all the values. If, for example, you have batch_size=100, then you will definitely see the values 1 to 100:
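To make the batching point concrete without TensorFlow, here is a plain-Python sketch; the batched helper is made up for illustration, but tf.data's batch/take behave analogously.

```python
# Plain-Python analogue of dataset.batch(batch_size).take(1):
# taking one element of a batched stream yields exactly one batch.
def batched(values, batch_size):
    for i in range(0, len(values), batch_size):
        yield values[i:i + batch_size]

values = list(range(1, 101))  # like a feature_# column holding 1..100

# With a small batch_size, "take(1)" only reveals the first few values.
first_small_batch = next(batched(values, 10))
print(first_small_batch)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# With batch_size=100, one batch covers the whole 1..100 range.
first_full_batch = next(batched(values, 100))
print(first_full_batch[0], first_full_batch[-1])  # 1 100
```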

            Source https://stackoverflow.com/questions/70753331

            QUESTION

            PySpark/DataBricks: How to read parquet files using 'file:///' and not 'dbfs'
            Asked 2020-Dec-04 at 09:15

I am trying to use petastorm in a different manner, which requires that I tell it where my parquet files are stored through one of the following:

hdfs://some_hdfs_cluster/user/yevgeni/parquet8, file:///tmp/mydataset, s3://bucket/mydataset, or gs://bucket/mydataset. Since I am on DataBricks, and given other constraints, my option is to use file:///.

            However, I am at a loss as to how specify the location of my parquet files. I continually get rejected saying that Path does not exist:

            Here is what I am doing: ...

            ANSWER

            Answered 2020-Nov-29 at 16:30

            You just need to specify the path as it is, no need for 'file:///':
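For reference, here is a small hypothetical helper (not part of petastorm or Databricks) sketching the three forms the same DBFS location takes; the /dbfs/ FUSE mount is what makes local-style access work.

```python
# Hypothetical helper illustrating equivalent path forms on Databricks.
# The same DBFS location is reachable as 'dbfs:/...' (Spark APIs),
# '/dbfs/...' (local FUSE mount), and 'file:///dbfs/...' (file URL).
def dbfs_path_forms(dbfs_path):
    prefix = "dbfs:/"
    if not dbfs_path.startswith(prefix):
        raise ValueError("expected a dbfs:/ path")
    rest = dbfs_path[len(prefix):].lstrip("/")
    return {
        "spark": "dbfs:/" + rest,
        "local": "/dbfs/" + rest,
        "file_url": "file:///dbfs/" + rest,
    }

forms = dbfs_path_forms("dbfs:/tmp/mydataset")
print(forms["file_url"])  # file:///dbfs/tmp/mydataset
```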

            Source https://stackoverflow.com/questions/65062318

            QUESTION

            Are Apache Spark 2.0 parquet files incompatible with Apache Arrow?
            Asked 2020-Apr-29 at 13:40

            The problem

I have written an Apache Spark DataFrame as a parquet file for a deep learning application in a Python environment; I am currently experiencing issues implementing basic examples of both the petastorm (following this notebook) and horovod frameworks, namely in reading the aforementioned file. The DataFrame has the following type: DataFrame[features: array, next: int, weight: int] (much like in DataBricks' notebook, I had features be a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError: Invalid parquet file. Corrupt footer. error.

What I found until now

I discovered in this question and in this PR that as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with this setting, but there is still no _common_metadata file. What does work is to explicitly pass a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), which is a problem with horovod, as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.

How could I, if at all possible, either make Spark output those files, or make Arrow infer the schema, just as Spark seems to do?

            Minimal example with horovod ...

            ANSWER

            Answered 2020-Apr-29 at 13:40

            The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723), be sure to install the updated version with pip (up until Databricks Runtime 6.5, the included version is 0.13).
            Thanks to @joris' comment for pointing this out.
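The version boundary in that answer can be sketched as a simple check; version_tuple and has_footer_fix are illustrative helpers, not pyarrow APIs.

```python
# Illustrative check for the pyarrow footer fix (ARROW-4723, fixed in 0.14).
def version_tuple(version):
    # Keep only the two leading numeric components, e.g. '0.14.1' -> (0, 14).
    parts = version.split(".")
    return tuple(int(p) for p in parts[:2])

def has_footer_fix(pyarrow_version):
    return version_tuple(pyarrow_version) >= (0, 14)

print(has_footer_fix("0.13.0"))  # False: the version Databricks Runtime 6.5 shipped
print(has_footer_fix("0.14.1"))  # True: an upgraded version from pip
```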

            Source https://stackoverflow.com/questions/61234955

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install petastorm

            You can install using 'pip install petastorm' or download it from GitHub, PyPI.
You can use petastorm like any standard Python library. You will need a development environment consisting of a Python distribution (including header files), a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system.
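A sketch of that setup (the environment name is arbitrary; check the petastorm README for optional framework extras):

```shell
# Create an isolated virtual environment and bring the packaging tools
# up to date before installing petastorm.
python3 -m venv petastorm-env
. petastorm-env/bin/activate
python -m pip install --upgrade pip setuptools wheel
pip install petastorm
```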

            Support

For any new features, suggestions, and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            Install
          • PyPI

            pip install petastorm

          • CLONE
          • HTTPS

            https://github.com/uber/petastorm.git

          • CLI

            gh repo clone uber/petastorm

          • sshUrl

            git@github.com:uber/petastorm.git
