aws-data-wrangler | Pandas on AWS - Easy integration with Athena and Glue
kandi X-RAY | aws-data-wrangler Summary
Amazon SageMaker Data Wrangler is a newer SageMaker Studio feature with a similar name, but it serves a different purpose than the AWS Data Wrangler open source project.
Top functions reviewed by kandi - BETA
- Converts a dataset to a Pdf3 file.
- Parquet format.
- Serialize a JSON object to a JSON file.
- Builds and returns a map of options for the cluster.
- Creates a cluster.
- Read Parquet.
- Read a table from a table.
- Stores the Parquet metadata.
- Performs a copy of the Redshift database.
- Create a deep copy of the data table.
aws-data-wrangler Key Features
aws-data-wrangler Examples and Code Snippets
import os
os.environ['AWS_DEFAULT_REGION'] = 'ap-northeast-1'  # specify your AWS region
from typing import List, Union

def _yield_recently_updated_glue_tables(upload_path_list: List[str],
                                        db_name: str) -> Union[dict, None]:
    """Check which tables have been updated recently.
    Args:
        upload_path_list (
pip install pylint==2.9.3
pip install git+git://github.com/PyCQA/astroid.git@c37b6fd47b62486fd6cbe77b913b568b809f1a6d#egg=astroid
import awswrangler as wr
con = wr.redshift.connect("test_1")
with con.cursor() as cursor:
cursor.execute("SELECT 1;")
print(cursor.fetchall())
con.close()
import awswrangler as wr
import boto3
session = boto3.Session()  # the original snippet is truncated here; presumably a custom boto3 Session is created and passed to awswrangler calls via boto3_session=
from awswrangler import exceptions
try:
    ...
except exceptions.NoFilesFound:
    ...
import pandas as pd

df = pd.concat([df1, df2]).drop_duplicates('id', keep='last')  # df1 and df2 are existing DataFrames
profile_name = 'Dev-AWS'
REGION = 'us-east-1'
# This automatically retrieves credentials from your AWS credentials file after you run `aws configure` on the command line.
ACCESS_KEY_ID, SECRET_ACCESS_KEY, SESSION_TOKEN = get_profile_credentials(profile_name)
sh-4.2$ source activate python3
(python3) sh-4.2$ pip install awswrangler
Collecting awswrangler
Downloading https://files.pythonhosted.org/packages/e9/99/b3ba9811e1a5f346da484f2dff40924613ec481df5d463e30bc3fd71096e/awswrangler-0.3.2.ta
import awswrangler as wr
df = wr.pandas.read_parquet(path='s3://staging/tables/test_table/version=2020-03-26',
                            columns=['country', 'city', ...], filters=[("c5", "=", 0)])
# Typical Pandas, Numpy or Pyarrow transformations go here.
import awswrangler as wr
df = wr.pandas.read_sql_athena(
sql="select * from table",
database="database"
)
Community Discussions
Trending Discussions on aws-data-wrangler
QUESTION
I'm using awswrangler to write Parquet files to my S3 bucket, and I usually add tags to all my objects for access and cost control, but I didn't find a way to do that directly with awswrangler. I'm currently using the code below to test:
...ANSWER
Answered 2021-Sep-07 at 10:08
I just figured out that awswrangler has a parameter called s3_additional_kwargs that lets you pass additional arguments to the S3 requests awswrangler makes for you. You can send tags the same way boto3 expects them, e.g. 'Key1=value1&Key2=value2'.
Below is an example of how to add tags to your objects:
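A minimal sketch of such a call, assuming an existing DataFrame and a placeholder bucket path (tag keys and values are illustrative):

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})  # example data

# s3_additional_kwargs is forwarded to the underlying S3 requests;
# 'Tagging' uses the same query-string format boto3 expects.
wr.s3.to_parquet(
    df=df,
    path="s3://my-bucket/my-prefix/",  # placeholder bucket/prefix
    dataset=True,
    s3_additional_kwargs={"Tagging": "Key1=value1&Key2=value2"},
)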
QUESTION
I'm trying to read data from a specific folder in my S3 bucket. This data is in Parquet format. To do that I'm using awswrangler:
...ANSWER
Answered 2021-Jun-09 at 21:13
I didn't use awswrangler. Instead I used the following code, which I found in this GitHub repository:
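The code linked in the answer is not reproduced on this page; a minimal sketch of reading every Parquet file under an S3 prefix with boto3 and pandas (bucket and prefix names are placeholders) could look like this:

import boto3
import pandas as pd

s3 = boto3.client("s3")
bucket, prefix = "my-bucket", "path/to/folder/"  # placeholders

# List the objects under the prefix and read each Parquet file into a DataFrame.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
frames = [
    pd.read_parquet(f"s3://{bucket}/{obj['Key']}")  # reading s3:// paths needs s3fs/fsspec
    for obj in response.get("Contents", [])
    if obj["Key"].endswith(".parquet")
]
df = pd.concat(frames, ignore_index=True)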
QUESTION
The partition_filter argument in wr.s3.read_parquet() is failing to filter a partitioned Parquet dataset on S3. Here's a reproducible example (it might require a correctly configured boto3_session argument):
Dataset setup:
...ANSWER
Answered 2021-Apr-07 at 08:15
From the documentation: "Ignored if dataset=False." Adding dataset=True as an argument to your read_parquet call will do the trick.
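A minimal sketch of the corrected call; the path and partition column names are placeholders:

import awswrangler as wr

df = wr.s3.read_parquet(
    path="s3://my-bucket/my-dataset/",               # placeholder path
    dataset=True,                                    # without this, partition_filter is ignored
    partition_filter=lambda x: x["year"] == "2021",  # partition values arrive as strings
)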
QUESTION
I'm attempting to use the python package awswrangler
to access a non-AWS S3 service.
The AWS Data Wranger docs state that you need to create a boto3.Session()
object.
The problem is that the boto3.client()
supports setting the endpoint_url
, but boto3.Session()
does not (docs here).
In my previous uses of boto3
I've always used the client
for this reason.
Is there a way to create a boto3.Session()
with a custom endpoint_url
or otherwise configure awswrangler
to accept the custom endpoint?
ANSWER
Answered 2021-Mar-25 at 00:49
Once you create your session, you can use client as well. For example:
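A minimal sketch, assuming credentials are resolved through the usual credential chain and using a placeholder endpoint URL:

import boto3

session = boto3.Session()  # credentials come from the normal credential chain
s3_client = session.client(
    "s3",
    endpoint_url="https://s3.my-non-aws-service.example",  # placeholder endpoint
)
print(s3_client.list_buckets())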
QUESTION
This is probably an easy fix, but I cannot get this code to run. I have been using AWS Secrets Manager with no issues on PyCharm 2020.2.3. The problems with AWS Wrangler, however, are listed below:
Read in Dataframe
...ANSWER
Answered 2020-Nov-19 at 15:26
Data Wrangler uses Boto3 under the hood, and Boto3 will look for the AWS_DEFAULT_REGION environment variable. So you have two options:
Set this in your ~/.aws/config file, or set the environment variable directly in code, as sketched below:
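A minimal sketch of both options; the region value is a placeholder:

# Option 1: in ~/.aws/config
#   [default]
#   region = us-east-1
#
# Option 2: set the environment variable in code before awswrangler/Boto3 needs it
import os
os.environ["AWS_DEFAULT_REGION"] = "us-east-1"  # placeholder region

import awswrangler as wr  # imported after the region is configured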
QUESTION
To get past Numpy errors, I downloaded the zip awswrangler-layer-1.9.6-py3.8 from https://github.com/awslabs/aws-data-wrangler/releases.
I want to use Pandas to convert JSON to CSV, and it's working fine in my PyCharm development environment on Windows 2000.
I have a script that builds the zip for my "deploy package" for Lambda. I create a new clean directory, copy my code into it, then copy the code from awswrangler into it.
At that point, I stopped getting the errors about Numpy version, and started getting the error below.
Error:
...ANSWER
Answered 2020-Oct-27 at 15:41
I updated my deploy script to delete the __pycache__ directory, and that got me past this issue.
Got the idea from this video about using Pandas on AWS Lambda: https://www.youtube.com/watch?v=vf1m1ogKYrg
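A minimal sketch of that deploy-script step, using a placeholder build directory:

import pathlib
import shutil

package_dir = pathlib.Path("build/lambda_package")  # placeholder path

# Remove every __pycache__ directory before zipping the Lambda deploy package.
for cache_dir in package_dir.rglob("__pycache__"):
    shutil.rmtree(cache_dir)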
QUESTION
I use aws-data-wrangler (https://github.com/awslabs/aws-data-wrangler) to process pandas dataframes. Once they are processed, I export them to parquet files with:
...ANSWER
Answered 2020-Mar-17 at 21:36
Without knowing the details of what Pandas is doing under the hood, I suspect the issue is that it's creating a partitioned table (as suggested by the partition_cols=["date"] part of the command). A partitioned table doesn't just have one location, it has one location per partition.
This is probably what is going on: when you create the first table you end up with data on S3 looking something like s3://example/table1/date=20200317/file.parquet, and a partitioned table with a partition whose location is s3://example/table1/date=20200317/. The table may have a location too, probably s3://example/table1/, but this is mostly meaningless: it's not used for anything, it's just that Glue requires tables to have a location.
When you create the next table you get data in, say, s3://example/table2/date=20200318/file.parquet, and a table with a corresponding partition. What I assume you do next is copy the data from the first table to s3://example/table2/date=20200317/file.parquet (table1 -> table2 is the difference).
When you query the new table it will not look in this location, because it is not a location belonging to any of its partitions.
You can fix this in a number of ways:
- Perhaps you don't need the partitioning at all. What happens if you remove the partition_cols=["date"] part of the command? Do you still get a partitioned table? (Check in the Glue console, or by running SHOW CREATE TABLE tableX in Athena.) With an unpartitioned table you can move whatever data you want into the table's location and it will be found by Athena.
- Instead of moving the data, you can add the partition from the first table to the new table. Run something like this in Athena: ALTER TABLE table2 ADD PARTITION ("date" = '20200317') LOCATION 's3://example/table1/date=20200317/'.
- Instead, add the partition to the old table, or both. It doesn't really matter and just depends on which name you want to use when you run queries. You could also have a table that you've set up manually as your master table, and treat the tables created by Pandas as temporary. Once Pandas has created the data, you add it as a partition to the master table and drop the newly created table. That way you can have a nice name for your table and not have a datestamp in the name.
- You can copy the data if you want it all in one place, and then add the partition as above.
- Someone will probably suggest copying the data like above and then running MSCK REPAIR TABLE afterwards. That works, but it will get slower and slower as you get more partitions, so it's not a scalable solution.
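For context, a minimal sketch of the kind of export call being discussed, written against the current wr.s3.to_parquet API; every name below is a placeholder rather than the asker's actual code:

import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"date": ["20200317"], "value": [1]})  # example data

# Write a partitioned dataset and register it in the Glue Catalog.
wr.s3.to_parquet(
    df=df,
    path="s3://example/table1/",  # placeholder location
    dataset=True,
    partition_cols=["date"],
    database="my_database",       # placeholder Glue database
    table="table1",               # placeholder table name
)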
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install aws-data-wrangler
What is AWS Data Wrangler?
Install: PyPi (pip), Conda, AWS Lambda Layer, AWS Glue Python Shell Jobs, AWS Glue PySpark Jobs, Amazon SageMaker Notebook, Amazon SageMaker Notebook Lifecycle, EMR, From source
Tutorials:
- 001 - Introduction
- 002 - Sessions
- 003 - Amazon S3
- 004 - Parquet Datasets
- 005 - Glue Catalog
- 006 - Amazon Athena
- 007 - Databases (Redshift, MySQL, PostgreSQL and SQL Server)
- 008 - Redshift - Copy & Unload
- 009 - Redshift - Append, Overwrite and Upsert
- 010 - Parquet Crawler
- 011 - CSV Datasets
- 012 - CSV Crawler
- 013 - Merging Datasets on S3
- 014 - Schema Evolution
- 015 - EMR
- 016 - EMR & Docker
- 017 - Partition Projection
- 018 - QuickSight
- 019 - Athena Cache
- 020 - Spark Table Interoperability
- 021 - Global Configurations
- 022 - Writing Partitions Concurrently
- 023 - Flexible Partitions Filter
- 024 - Athena Query Metadata
- 025 - Redshift - Loading Parquet files with Spectrum
- 026 - Amazon Timestream
- 027 - Amazon Timestream 2
- 028 - Amazon DynamoDB
- 029 - S3 Select
- 030 - Data Api
- 031 - OpenSearch
- 032 - Lake Formation Governed Tables
API Reference: Amazon S3, AWS Glue Catalog, Amazon Athena, Amazon Redshift, PostgreSQL, MySQL, SQL Server, DynamoDB, Amazon Timestream, Amazon EMR, Amazon CloudWatch Logs, Amazon Chime, Amazon QuickSight, AWS STS, AWS Secrets Manager
License
Contributing
Legacy Docs (pre-1.0.0)
Support