petastorm | Machine Learning library
kandi X-RAY | petastorm Summary
The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark, and can be used from pure Python code.
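As a minimal sketch of that pure-Python path (file:///tmp/mydataset is a hypothetical, already-materialized Petastorm dataset):

from petastorm import make_reader

# Iterate over a Petastorm dataset without Spark, TensorFlow, or PyTorch.
with make_reader('file:///tmp/mydataset') as reader:
    for row in reader:  # each row arrives as a namedtuple of decoded fields
        print(row)
        break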
Top functions reviewed by kandi - BETA
- Create a reader for a dataset.
- Create a batch reader.
- Create a namedtuple from schema fields.
- Start the worker.
- Context manager for creating a dataset (see the sketch after this list).
- Train and test a dataset.
- Make a petastorm dataset.
- Convert a directory to a Petastorm dataset.
- Load rows from a Parquet file.
- Generate metadata for petastorm.
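Putting several of these together, here is a hedged sketch of materializing a small dataset with the context manager mentioned above. The schema, field names, and output path are illustrative; Unischema, materialize_dataset, and dict_to_spark_row are petastorm's documented API:

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, FloatType
from petastorm.codecs import ScalarCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# Hypothetical two-field schema: an int id and a float value.
MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('value', np.float32, (), ScalarCodec(FloatType()), False),
])

spark = SparkSession.builder.master('local[2]').getOrCreate()
output_url = 'file:///tmp/mydataset'  # hypothetical output location

# materialize_dataset writes petastorm metadata alongside the Parquet files.
with materialize_dataset(spark, output_url, MySchema, row_group_size_mb=256):
    rows = [dict_to_spark_row(MySchema, {'id': np.int32(i), 'value': np.float32(i)})
            for i in range(100)]
    spark.createDataFrame(rows, MySchema.as_spark_schema()) \
        .write.mode('overwrite').parquet(output_url)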
petastorm Key Features
petastorm Examples and Code Snippets
graph LR
A[Collection] --> B[Exploration];
B --> C[Analytics];
C --> D[Feature Engineer];
D --> E[Training];
E --> F[Evaluation];
F --> C;
E --> G[Deployment];
G --> H[Monitoring];
conda create --name geonex_torch1.5 python=3.7 pytorch=1.5 xarray numpy scipy pandas torchvision tensorboard opencv pyyaml jupyterlab matplotlib seaborn
conda install -c conda-forge pyhdf
pip install petastorm
from tensorflow import keras
import tensorflow as tf
import horovod.spark.keras as hvd

model = keras.models.Sequential()
model.add(keras.layers.Dense(8, input_dim=2))
model.add(keras.layers.Activation('tanh'))
model.add(keras.layers.Dense(1))
model.add(keras.layers.Activation('sigmoid'))
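This fragment matches Horovod's Spark Keras example; a hedged sketch of how it typically continues, where df is a hypothetical Spark DataFrame with features and y columns and the store path is illustrative:

from horovod.spark.common.store import Store

optimizer = keras.optimizers.SGD(learning_rate=0.1)
store = Store.create('/tmp/horovod_experiments')  # hypothetical artifact store

keras_estimator = hvd.KerasEstimator(
    num_proc=2,                      # number of parallel training processes
    store=store,
    model=model,
    optimizer=optimizer,
    loss='binary_crossentropy',
    feature_cols=['features'],
    label_cols=['y'],
    batch_size=32,
    epochs=10)

keras_model = keras_estimator.fit(df)  # df: hypothetical Spark DataFrame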
import s3fs

# Enlarge the botocore connection pool for many concurrent S3 reads.
fs = s3fs.S3FileSystem(config_kwargs={'max_pool_connections': 50})
Community Discussions
Trending Discussions on petastorm
QUESTION
Update
While I appreciated AloneTogether's answer, I didn't like that I was using take() and it was separate from model.fit.
I put another answer here if you want to look at it. It involves subclassing Model. It's not too bad.
End of Update
I have a simple example: a parquet file with 8 columns named feature_#, each populated with values 1 to 100
...ANSWER
Answered 2022-Jan-18 at 11:59: I think it all depends on the size of your batch_size, because take(1) takes one batch, and if the batch_size is < 100 you will not see all the values. If, for example, you have batch_size=100, then you will definitely see the values 1 to 100:
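In petastorm's Spark converter API, the pattern under discussion looks roughly like this (a hedged sketch; the cache directory and column name are hypothetical):

from pyspark.sql import SparkSession
from petastorm.spark import SparkDatasetConverter, make_spark_converter

spark = SparkSession.builder.master('local[2]').getOrCreate()
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               'file:///tmp/petastorm_cache')  # hypothetical cache location

df = spark.range(1, 101).withColumnRenamed('id', 'feature_1')
converter = make_spark_converter(df)

# With batch_size=100, take(1) yields one batch containing all 100 values.
with converter.make_tf_dataset(batch_size=100) as dataset:
    for batch in dataset.take(1):
        print(batch.feature_1)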
QUESTION
I am trying to use petastorm in a different manner, which requires that I tell it where my parquet files are stored through one of the following: hdfs://some_hdfs_cluster/user/yevgeni/parquet8, file:///tmp/mydataset, s3://bucket/mydataset, or gs://bucket/mydataset. Since I am on Databricks, and given other constraints, my option is to use the file:/// option.
However, I am at a loss as to how to specify the location of my parquet files. I continually get rejected with a message saying that the Path does not exist:
ANSWER
Answered 2020-Nov-29 at 16:30: You just need to specify the path as it is; no need for 'file:///':
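Following that answer, the reader construction would look roughly like this (a hedged sketch; the Databricks path is hypothetical and assumed visible to the driver):

from petastorm import make_batch_reader

# Pass the path as-is, without the 'file:///' prefix, per the answer above.
with make_batch_reader('/dbfs/tmp/parquet8') as reader:  # hypothetical path
    for batch in reader:
        print(batch)
        break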
QUESTION
I have written an Apache Spark DataFrame as a parquet file for a deep learning application in a Python environment; I am currently experiencing issues in implementing basic examples of both the petastorm (following this notebook) and horovod frameworks, namely in reading the aforementioned file. The DataFrame has the following type: DataFrame[features: array, next: int, weight: int] (much like in Databricks' notebook, I had features be a VectorUDT, which I converted to an array).
In both cases, Apache Arrow throws an ArrowIOError: Invalid parquet file. Corrupt footer. error.
I discovered in this question and in this PR that as of version 2.0, Spark doesn't write _metadata or _common_metadata files unless spark.hadoop.parquet.enable.summary-metadata is set to true in Spark's configuration; those files are indeed missing.
I thus tried rewriting my DataFrame with this environment; still no _common_metadata file. What also works is to explicitly pass a schema to petastorm when constructing a reader (passing schema_fields to make_batch_reader, for instance), which is a problem with horovod, as there is no such parameter in horovod.spark.keras.KerasEstimator's constructor.
How would I be able, if at all possible, to either make Spark output those files, or have Arrow infer the schema, just as Spark seems to do?
Minimal example with horovod ...ANSWER
Answered 2020-Apr-29 at 13:40: The problem is solved in pyarrow 0.14+ (issues.apache.org/jira/browse/ARROW-4723); be sure to install the updated version with pip (up until Databricks Runtime 6.5, the included version is 0.13).
Thanks to @joris' comment for pointing this out.
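For reference, a hedged sketch of the workaround discussed in the question, using the configuration key quoted above (the output path is hypothetical); on pyarrow 0.14+ the summary files are no longer needed for petastorm to read the dataset:

from pyspark.sql import SparkSession

# Ask Spark to emit _metadata/_common_metadata summary files alongside Parquet.
spark = (SparkSession.builder
         .config('spark.hadoop.parquet.enable.summary-metadata', 'true')
         .getOrCreate())

df = spark.createDataFrame([(i, i * 2) for i in range(10)], ['next', 'weight'])
df.write.mode('overwrite').parquet('file:///tmp/mydataset')  # hypothetical path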
Community Discussions and Code Snippets contain sources from the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install petastorm
You can use petastorm like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.