glue | Glue strings to data in R. Small, fast, dependency-free
kandi X-RAY | glue Summary
Glue offers interpreted string literals that are small, fast, and dependency-free. Glue does this by embedding R expressions in curly braces which are then evaluated and inserted into the argument string.
Community Discussions
Trending Discussions on glue
QUESTION
I'm trying to get the first 2 names in the following example JSON, without having to refer to them by name
test.json
...ANSWER
Answered 2021-Jun-15 at 15:44
You can use the keys function, as in:
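The answer's own snippet is not preserved here. As a rough illustration of the same idea in Python, the sketch below reads the top-level key names from a JSON object without naming them in advance. The sample document, its field names, and the helper first_names are hypothetical stand-ins, since the contents of test.json are not shown.

```python
import json

# Hypothetical stand-in for test.json; the real file's contents are not shown.
raw = '{"alice": {"age": 30}, "bob": {"age": 25}, "carol": {"age": 41}}'
data = json.loads(raw)


def first_names(obj, count=2):
    """Return the first `count` top-level key names in document order."""
    # Iterating a dict yields its keys; json.loads keeps the file's key order.
    return list(obj)[:count]


first_two = first_names(data)
print(first_two)
```

Some tools return keys sorted alphabetically rather than in document order; if that is what you need, sorted(obj)[:count] would be the analogous expression.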
QUESTION
I have a few dataframes, a few thousand rows each, that look similar to this:
...ANSWER
Answered 2021-Jun-15 at 15:54
IIUC, if all unique ids can be sorted into contiguous blocks.
QUESTION
I have an Aurora Serverless instance which has data loaded across 3 tables (mixture of standard and jsonb data types). We currently use traditional views where some of the deeply nested elements are surfaced along with other columns for aggregations and such.
We have two materialized views that we'd like to send to Redshift. Both the Aurora Postgres and Redshift databases are in the Glue Catalog, and while I can see Postgres views as selectable tables, the crawler does not pick up the materialized views.
We are currently exploring two options to get the data to Redshift.
- Output to Parquet and use COPY to load
- Point the materialized view to a JDBC sink specifying Redshift.
I would welcome recommendations on the most efficient approach from anyone who has handled a similar use case.
Questions:
- In option 1, would I be able to handle incremental loads?
- Is bookmarking supported for JDBC (Aurora Postgres) to JDBC (Redshift) transactions even if through Glue?
- Is there a better way (other than the options I am considering) to move the data from Aurora Postgres Serverless (10.14) to Redshift?
Thanks in advance for any guidance provided.
...ANSWER
Answered 2021-Jun-15 at 13:51
Went with option 2. The Redshift copy/load process writes CSV with a manifest to S3 in any case, so duplicating that is pointless.
Regarding the Questions:
N/A
Job bookmarking does work. There are some gotchas, though: ensure that connections to both RDS and Redshift are attached to the Glue PySpark job, that the IAM self-referencing rules are in place, and that you identify a column that is unique to use as the bookmark (I chose the primary key of the underlying table as an additional column in my materialized view).
Using the primary key of the core table may buy efficiencies in pruning materialized views during maintenance cycles. Just retrieve the latest bookmark from the CLI using
aws glue get-job-bookmark --job-name yourjobname
and then use that in the WHERE clause of the materialized view, as in where id >= idinbookmark
# glueContext is the GlueContext already created in the job script
conn = glueContext.extract_jdbc_conf("yourGlueCatalogdBConnection")
connection_options_source = {
    "url": conn['url'] + "/yourdB",
    "dbtable": "table in dB",
    "user": conn['user'],
    "password": conn['password'],
    # column(s) the job bookmark tracks to decide which rows are new
    "jobBookmarkKeys": ["unique identifier from source table"],
    "jobBookmarkKeysSortOrder": "asc",
}
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options=connection_options_source,
    transformation_ctx="datasource0",
)
That's all, folks
QUESTION
I am new to AWS VPC and exploring everything about it. I understand that a VPC is mainly used to provide a secure and isolated environment. What are the different use cases for AWS VPC in the area of data analytics? I currently have a data lake pipeline, which is as follows:
- Extract data using APIs
- Store raw data in S3
- Create Lambda functions or Glue jobs to compute business metrics
- Store metric outputs in S3
- Create tables in Athena for all the data stored in S3
- Import tables into QuickSight to produce business insights from visuals
In this process, how can a VPC be used to make the pipeline more efficient or better?
...ANSWER
Answered 2021-Jun-15 at 07:40
The services you mention (mostly) live outside of VPCs.
VPCs are used for services that use virtual computers, such as Amazon EC2 computers and Amazon RDS databases.
By using services that don't involve specific 'computers' (such as Amazon S3, Athena, QuickSight) you can take advantage of much lower costs, paying only for what you use. These services do not mimic traditional servers and therefore don't need VPCs. All the networking complexity is hidden and you can concentrate on using the service instead of running a network.
Yes, VPCs add extra security, but that's only because resources on a VPC need securing due to potential security holes. The services you mention are all secured via IAM and do not expose themselves outside the published APIs.
QUESTION
I have a question about how best to manage concurrent job instances in AWS Glue.
I have a job defined like so:
...ANSWER
Answered 2021-Jun-14 at 20:29
The "Max concurrent job runs per account" limit is a soft limit (https://docs.aws.amazon.com/general/latest/gr/glue.html), so you could log a service request with AWS and ask for an increase. Second, I am not sure how you have implemented the sleep in your code. Instead of just sleeping, catch the exception each time you make the call. If there is an exception, sleep with an exponential backoff (in seconds), try again when the sleep finishes, and repeat until you get a positive response or reach your own retry limit. This way your processing does not stop until you give up; it just slows down when throttling kicks in.
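The retry loop described above could be sketched roughly as follows. This is a minimal illustration and not the asker's code: call_with_backoff, Throttled, and flaky_start are hypothetical names, and the delays are arbitrary defaults. In a real job the wrapped action would be the call that starts the Glue job run, and retry_on would name whatever throttling exception that call raises.

```python
import random
import time


def call_with_backoff(action, max_attempts=6, base_delay=1.0, max_delay=60.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call action(); on a retryable exception, wait and try again.

    The wait doubles after each failure (base_delay, 2x, 4x, ...), is capped
    at max_delay, and gets up to 10% random jitter so parallel callers do not
    all retry at the same instant.
    """
    for attempt in range(max_attempts):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # our own limit is reached: give up and surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))


class Throttled(Exception):
    """Hypothetical stand-in for the service's throttling error."""


calls = {"n": 0}


def flaky_start():
    """Fails twice, then succeeds, to imitate a temporarily exhausted limit."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise Throttled("concurrent runs exceeded")
    return "run-started"


waits = []  # record the waits instead of really sleeping in this demo
result = call_with_backoff(flaky_start, retry_on=(Throttled,), sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter keeps the demo instant; production code would leave the default time.sleep in place.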
QUESTION
The objective of my code is to scrape the information in the Characteristics tab of the following URL, preferably as a data frame
...ANSWER
Answered 2021-Jun-11 at 15:38
The data is dynamically retrieved from an API call. You can retrieve it directly from that URL and simplify the returned JSON to get a data frame:
QUESTION
I just noticed that read_csv() somehow uses random numbers, which is unexpected (at least to me). The corresponding base R function read.csv() does not do that. So, what does read_csv() use the random numbers for? I looked into the documentation but could not find a clear answer to that. Are the random numbers related to the guess_max argument?
ANSWER
Answered 2021-Jun-10 at 19:21
tl;dr somewhere deep in the guts of the cli package (called to generate the pretty-printed output about column types), the code is generating a random string to use as a label.
A major clue is that
QUESTION
I have an AWS Glue job that was slightly modified: only the read was changed. The job runs fine, but the data types of my columns have changed. Where I previously had BigInt, I now have Int. This causes an EMR job that depends on these files to fail with a schema mismatch. I'm not sure what would cause this, since the mapping did not change, so any insight would be great. Here is the old and new code:
...ANSWER
Answered 2021-Jun-10 at 14:28
Both Spark DataFrame and Glue DynamicFrame infer the schema when reading data from JSON, but evidently they do it differently: Spark treats all numerical values as bigint, while Glue tries to be clever and (I guess) looks at the actual range of values on the fly.
Some more info about DynamicFrame schema inference can be found here.
If you are going to write Parquet in the end anyway and want the schema to be stable and consistent, I'd say the easiest way around this is to revert your change and go back to the Spark DataFrame.
You can also use apply_mapping to change the types explicitly after reading the data, but that seems to defeat the purpose of having the dynamic frame in the first place.
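If you do go the explicit-mapping route, the mapping list can be generated rather than typed by hand. The sketch below is only an illustration under stated assumptions: widen_int_mappings and the column names are hypothetical, and it assumes the usual mapping tuple shape of (source name, source type, target name, target type) with "long" as the 64-bit integer type name.

```python
def widen_int_mappings(schema, widen_from="int", widen_to="long"):
    """Build mapping tuples (source, source_type, target, target_type).

    Every column of type `widen_from` is mapped to `widen_to`; all other
    columns pass through with their type unchanged.
    """
    return [
        (name, dtype, name, widen_to if dtype == widen_from else dtype)
        for name, dtype in schema
    ]


# Hypothetical columns; in a real job these would come from the frame's schema.
schema = [("order_id", "int"), ("customer", "string"), ("amount", "double")]
mappings = widen_int_mappings(schema)
print(mappings)

# Inside a Glue job the list would then be passed along these lines
# (not executed here, because it needs the Glue runtime):
#   from awsglue.transforms import ApplyMapping
#   fixed = ApplyMapping.apply(frame=dynamic_frame, mappings=mappings)
```

Pinning the types this way keeps the written schema stable even when the inferred type changes from one run to the next.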
QUESTION
What am I trying to do?
Glue-Athena-like process.
- Data in S3
- AWS Glue (create metadata tables)
- Tables can be queried using Athena via boto3 (Python library)
Problem I am facing in Azure Cloud
~Trying to replicate the above process using Azure Synapse Analytics~
- Data in linked Azure Storage container
- Azure Data Factory (create external tables)
- How to make T-SQL queries on the external tables using Python?
Is there any Python library to make T-SQL calls to the external tables created in an Azure Synapse workspace?
...ANSWER
Answered 2021-Jun-10 at 08:45
Yes. PyODBC works with Synapse. It's not perfect but I use it.
https://docs.microsoft.com/en-us/azure/azure-sql/database/connect-query-python
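A minimal sketch of what such a call might look like is shown below. It is not the answerer's code: the helper names, the workspace endpoint, database, credentials, and table name are all hypothetical placeholders, and it assumes the Microsoft "ODBC Driver 17 for SQL Server" is the installed driver. The pyodbc import sits inside run_query so that the connection-string helper can be tried without the driver present.

```python
def synapse_connection_string(server, database, user, password,
                              driver="ODBC Driver 17 for SQL Server"):
    """Assemble an ODBC connection string for a Synapse SQL endpoint."""
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER=tcp:{server},1433;"
        f"DATABASE={database};"
        f"UID={user};PWD={password};"
        "Encrypt=yes;TrustServerCertificate=no;"
    )


def run_query(conn_str, sql):
    """Run one T-SQL statement and return all rows.

    Requires the pyodbc package and the ODBC driver to be installed.
    """
    import pyodbc  # imported lazily; only needed when a query is really run

    conn = pyodbc.connect(conn_str)
    try:
        return conn.cursor().execute(sql).fetchall()
    finally:
        conn.close()


# All values below are placeholders, not real endpoints or credentials.
conn_str = synapse_connection_string(
    server="myworkspace-ondemand.sql.azuresynapse.net",
    database="mydb",
    user="sqluser",
    password="secret",
)
# rows = run_query(conn_str, "SELECT TOP 10 * FROM dbo.my_external_table")
```

Keeping the credentials out of source code (for example, reading them from environment variables) would be the sensible next step in anything beyond a quick test.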
Note that installing it can be a bit tricky. You need the Python package, but also the ODBC driver and the apt package unixodbc-dev.
Here is the part of my Dockerfile that does it on Ubuntu 18.04:
QUESTION
I have an AWS Glue crawler that is set up to crawl new folders only. I tried to see if deleting a partition would cause it to revisit the corresponding S3 folder, and it doesn't. Is there a way I can force a revisit of a folder, short of changing the crawler to crawl all folders?
...ANSWER
Answered 2021-Jun-09 at 08:44
If your partitions are "predictable", for example date-based, you could completely bypass the crawlers and use partition projection. See the docs:
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
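For a date-based layout, the table properties that switch on partition projection might be assembled as in the sketch below. The helper name, partition column, bucket path, and start date are hypothetical; the property keys follow the naming used in the linked documentation, and the resulting dictionary would still need to be applied to the table (for example as TBLPROPERTIES).

```python
def date_projection_properties(column, location_template, start, fmt="yyyy/MM/dd"):
    """Return table properties that enable date-based partition projection.

    `start` must be written in the same pattern as `fmt`; the range runs
    from that date up to the current time (NOW).
    """
    return {
        "projection.enabled": "true",
        f"projection.{column}.type": "date",
        f"projection.{column}.range": f"{start},NOW",
        f"projection.{column}.format": fmt,
        "storage.location.template": location_template,
    }


# Hypothetical partition column and S3 layout.
props = date_projection_properties(
    column="dt",
    location_template="s3://my-bucket/events/${dt}/",
    start="2021/01/01",
)
for key, value in props.items():
    print(f"{key} = {value}")
```

With these properties in place the partition locations are computed from the template at query time, so new date folders do not need to be registered by a crawler.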
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported