big-data | :wrench: Use dplyr to analyze Big Data :elephant: | Data Visualization library
kandi X-RAY | big-data Summary
:wrench: Use dplyr to analyze Big Data :elephant:
big-data Key Features
big-data Examples and Code Snippets
Community Discussions
Trending Discussions on big-data
QUESTION
I'm creating a Glue ETL job that transfers data from BigQuery to S3. Similar to this example, but with my own dataset.
n.b.: I use BigQuery Connector for AWS Glue v0.22.0-2 (link).
The data in BigQuery is already partitioned by date, and I would like each Glue job run to fetch a specific date only (WHERE date = ...) and group the results into one CSV output file. But I can't find any clue where to insert the custom WHERE query.
In BigQuery source node configuration options, the options are only these:
Also, the generated script uses create_dynamic_frame.from_options, which does not accommodate a custom query (per the documentation).
ANSWER
Answered 2022-Mar-24 at 06:45
Quoting this AWS sample project, we can use filter in Connection Options:
- filter – Passes the condition to select the rows to convert. If the table is partitioned, the selection is pushed down and only the rows in the specified partition are transferred to AWS Glue. In all other cases, all data is scanned and the filter is applied in AWS Glue Spark processing, but it still helps limit the amount of memory used in total.
Example if used in script:
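A minimal sketch of how the filter option might be used in the script, assuming the Marketplace connector setup from the referenced AWS sample project; the connection name, GCP project, table, "target_date" job parameter, and S3 path are placeholders rather than values from the original answer.

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# "target_date" is an assumed job parameter; connection, project, table and paths are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_date"])
glue_context = GlueContext(SparkContext.getOrCreate())

source = glue_context.create_dynamic_frame.from_options(
    connection_type="marketplace.spark",
    connection_options={
        "connectionName": "my-bigquery-connection",
        "parentProject": "my-gcp-project",
        "table": "my_dataset.my_table",
        # Pushed-down predicate: only the requested partition is transferred.
        "filter": "date = '{}'".format(args["target_date"]),
    },
    transformation_ctx="bigquery_source",
)

# Coalesce to a single partition so the run produces one CSV file.
single = DynamicFrame.fromDF(source.toDF().coalesce(1), glue_context, "single")
glue_context.write_dynamic_frame.from_options(
    frame=single,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="csv",
)
```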
QUESTION
I posted a while back about how to efficiently calculate sets of distances using big data. The answers there didn't quite answer my question, since the issue is more computational (e.g. how to find k-nearest neighbors without doing a huge merge in order to calculate the distance of every point from one another) rather than an issue with calculating the distances themselves.
We've come up with a solution using a non-equi join in data.table, but I'd really appreciate any feedback on whether this is the right way to go/ways to improve the speed, and so on.
A quick overview of the problem
(See the linked post above for more detail.) We have a (in reality very large) dataset with the location of stores, for example:
...
ANSWER
Answered 2022-Mar-22 at 16:31
If I understand correctly, the OP wants to know how many other stores are within some arbitrary radius of each store.
The code below elaborates on the OP's idea of a non-equi self-join, combined with grouping by each i. It appends a new column to ex which contains the requested number of other stores within a given radius.
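This isn't the data.table code from the answer; it's a small Python illustration of the same idea (counting the other stores within a radius of each store) using a k-d tree, assuming planar coordinates and made-up data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
stores = rng.uniform(0, 100, size=(1000, 2))   # hypothetical store coordinates
radius = 5.0                                   # hypothetical search radius

tree = cKDTree(stores)
# query_ball_point returns, for each store, the indices of all stores within `radius`,
# including the store itself, so subtract 1 to count only the *other* stores.
neighbours = tree.query_ball_point(stores, r=radius)
counts = np.array([len(ix) - 1 for ix in neighbours])
```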
QUESTION
I am running nodetool rebuild. There is a table with 400 SSTables on the node from which streaming is happening, but only one file is streamed at a time. Is there any way to parallelize this operation so that multiple SSTables are streamed in parallel rather than sequentially?
...
ANSWER
Answered 2022-Mar-05 at 08:01
It isn't possible to increase the number of streaming threads. In any case, there are several factors which affect the speed of streaming, not just network throughput. The type of disks as well as the data model have a significant impact on how quickly the JVM can serialise the data to stream and how quickly it can clean up the heap (GC).
I see that you've already tried to increase the streaming throughput. Note that you'll need to increase it on both the sending and receiving nodes (and really, all nodes); otherwise, the stream will only be as fast as the slowest node. Cheers!
QUESTION
I'm not seeing how an AWS Kinesis Firehose lambda can send update and delete requests to ElasticSearch (AWS OpenSearch service).
Elasticsearch document APIs provides for CRUD operations: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs.html
The examples I've found deal with the Create case, but don't show how to do delete or update requests.
https://aws.amazon.com/blogs/big-data/ingest-streaming-data-into-amazon-elasticsearch-service-within-the-privacy-of-your-vpc-with-amazon-kinesis-data-firehose/
https://github.com/amazon-archives/serverless-app-examples/blob/master/python/kinesis-firehose-process-record-python/lambda_function.py
The output format in the examples does not show a way to specify create, update or delete requests:
ANSWER
Answered 2022-Mar-03 at 04:20
Firehose uses a Lambda function to transform records before they are delivered to the destination, in your case OpenSearch (ES), so the function can only modify the structure of the data and can't be used to influence CRUD actions. Firehose can only insert records into a specific index. If you need a simple option to remove records from an ES index after a certain period of time, have a look at the "Index rotation" option when specifying the destination for your Firehose stream.
If you want to use CRUD actions with ES and keep using Firehose, I would suggest sending records to an S3 bucket in raw format and then triggering a Lambda function on the object-upload event that performs a CRUD action depending on fields in your payload.
A good example of performing CRUD actions against ES from Lambda: https://github.com/chankh/ddb-elasticsearch/blob/master/src/lambda_function.py
That particular example is built to send data from DynamoDB Streams into ES, but it should be a good starting point for you.
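Not the linked example itself: a hedged sketch of the S3-triggered Lambda approach described above, using the opensearch-py client. The domain endpoint, index name, and the "action"/"id" fields in each record are assumptions, and authentication is omitted.

```python
import json
import boto3
from opensearchpy import OpenSearch

s3 = boto3.client("s3")
# Placeholder endpoint; add authentication appropriate to your OpenSearch domain.
client = OpenSearch(hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
                    use_ssl=True)
INDEX = "my-index"  # assumed index name

def handler(event, context):
    # Triggered by an s3:ObjectCreated event; read the uploaded object.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Assumes newline-delimited JSON records, each carrying "action" and "id" fields.
    for line in body.decode("utf-8").splitlines():
        doc = json.loads(line)
        action = doc.pop("action", "create")
        doc_id = doc.pop("id")
        if action == "delete":
            client.delete(index=INDEX, id=doc_id)
        elif action == "update":
            client.update(index=INDEX, id=doc_id, body={"doc": doc})
        else:
            client.index(index=INDEX, id=doc_id, body=doc)
```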
QUESTION
Usually, to read a local .csv file I use this:
ANSWER
Answered 2022-Feb-24 at 12:33
It's not possible to access external data from the driver. There are some workarounds, like simply using pandas:
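A minimal sketch of that pandas workaround, assuming a PySpark session is available; the file path is a placeholder, and this is not the answerer's original snippet.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

pdf = pd.read_csv("/path/to/local/file.csv")   # pandas reads the local file on the driver
df = spark.createDataFrame(pdf)                # hand the result to Spark as a DataFrame
df.show()
```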
QUESTION
I copied a large number of gzip files from Google Cloud Storage to AWS's S3 using s3DistCp (as this AWS article describes). When I try to compare the files' checksums, they differ (md5/sha-1/sha-256 all have the same issue).
If I compare the sizes (bytes) or the decompressed contents of a few files (diff or another checksum), they match. (In this case, I'm comparing files pulled down directly from Google via gsutil vs pulling down my distcp'd files from S3.)
Using file, I do see a difference between the two:
ANSWER
Answered 2022-Feb-21 at 17:05
While typing out my question, I figured out the answer:
S3DistCp is apparently switching the "OS" value in the gzip header, which explains the "FAT filesystem" label I'm seeing with file. (Note: to rule out S3 directly causing the issue, I copied my "file1-gs-direct.gz" up to S3, and after pulling it down, the checksum remained the same.)
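A small sketch for checking that explanation: per RFC 1952, byte 9 of a gzip header is the OS field (3 = Unix, 0 = FAT filesystem). The second file name below is a hypothetical placeholder for the s3DistCp'd copy.

```python
# Map of common gzip header OS values (RFC 1952).
GZIP_OS = {0: "FAT filesystem", 3: "Unix", 11: "NTFS", 255: "unknown"}

def gzip_os_byte(path):
    with open(path, "rb") as f:
        header = f.read(10)
    assert header[:2] == b"\x1f\x8b", "not a gzip file"
    return GZIP_OS.get(header[9], header[9])   # byte 9 is the OS field

print(gzip_os_byte("file1-gs-direct.gz"))      # e.g. 'Unix'
print(gzip_os_byte("file1-via-s3distcp.gz"))   # hypothetical name; e.g. 'FAT filesystem'
```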
Here's the diff between the two files:
QUESTION
I have written the following NumPy code in Python:
...
ANSWER
Answered 2021-Dec-28 at 14:11
First of all, the algorithm can be improved to be much more efficient. Indeed, a polygon can be directly assigned to each point. This is like a classification of points by polygons. Once the classification is done, you can perform one/many reductions by key where the key is the polygon ID.
This new algorithm consists of:
- computing all the bounding boxes of the polygons;
- classifying the points by polygons;
- performing the reduction by key (where the key is the polygon ID).
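A minimal Numba sketch of those three steps, using the bounding-box test as a stand-in for an exact point-in-polygon check; the array shapes and function names are assumptions, not the answerer's implementation.

```python
import numpy as np
import numba as nb

# points: (n, 2) float64 coordinates; boxes: (m, 4) float64 rows of (xmin, ymin, xmax, ymax).
@nb.njit('int32[:](float64[:,:], float64[:,:])')
def classify_points(points, boxes):
    n, m = points.shape[0], boxes.shape[0]
    poly_id = np.full(n, -1, dtype=np.int32)      # -1 means "no polygon"
    for i in range(n):
        x, y = points[i, 0], points[i, 1]
        for j in range(m):
            if boxes[j, 0] <= x and x <= boxes[j, 2] and boxes[j, 1] <= y and y <= boxes[j, 3]:
                poly_id[i] = j
                break
    return poly_id

@nb.njit('float64[:](int32[:], float64[:], int64)')
def sum_by_polygon(poly_id, values, m):
    out = np.zeros(m, dtype=np.float64)           # one accumulator per polygon ID
    for i in range(poly_id.shape[0]):
        if poly_id[i] >= 0:
            out[poly_id[i]] += values[i]
    return out
```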
This approach is much more efficient than iterating over all the points for each polygon and filtering the attribute arrays (e.g. operate_ and contact_poss). Indeed, filtering is an expensive operation since it requires the target array (which may not fit in the CPU caches) to be fully read and then written back. Not to mention this operation requires a temporary array to be allocated/deleted if it is not performed in-place, and the operation cannot benefit from being implemented with SIMD instructions on most x86/x86-64 platforms (as it requires the new AVX-512 instruction set). It is also harder to parallelize since the filtering steps are too fast for threads to be useful but still need to be done sequentially.
Regarding the implementation of the algorithm, Numba can be used to speed up the overall computation a lot. The main benefit of using Numba is to drastically reduce the number of expensive temporary arrays created by NumPy in your current implementation. Note that you can specify the function types to Numba so it can compile the functions when they are defined. Assertions can be used to make the code more robust and to help the compiler know the size of a given dimension so as to generate significantly faster code (the JIT compiler of Numba can unroll the loops). Ternary operators can help the JIT compiler a bit to generate a faster branch-less program.
Note that the classification can easily be parallelized using multiple threads. However, one needs to be very careful about constant propagation, since some critical constants (like the shape of the working arrays and assertions) tend not to be propagated to the code executed by threads, while that propagation is critical to optimizing the hot loops (e.g. vectorization, unrolling). Note also that creating many threads can be expensive on machines with many cores (from 10 ms to 0.1 ms). Thus, it is often better to use a parallel implementation only on big input data.
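A hedged sketch of that thread-based parallelization, applying Numba's prange to the hypothetical classification loop from the previous sketch.

```python
import numpy as np
import numba as nb

@nb.njit(parallel=True)
def classify_points_parallel(points, boxes):
    n, m = points.shape[0], boxes.shape[0]
    poly_id = np.full(n, -1, dtype=np.int32)
    for i in nb.prange(n):                        # outer loop split across threads
        x, y = points[i, 0], points[i, 1]
        for j in range(m):
            if boxes[j, 0] <= x and x <= boxes[j, 2] and boxes[j, 1] <= y and y <= boxes[j, 3]:
                poly_id[i] = j
                break
    return poly_id
```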
Here is the resulting implementation (working with both Python2 and Python3):
QUESTION
I am working on Glue in AWS and trying to test and debug locally. I followed the instructions here https://aws.amazon.com/blogs/big-data/developing-aws-glue-etl-jobs-locally-using-a-container/ to develop a Glue job locally. In that post they use the Glue 1.0 image for testing, and it works as it should. However, when I try to develop against the Glue 3.0 image, I follow the same steps but can't open a Jupyter notebook on :8888 as the post describes, even though every step seems correct.
Here is my command to start a Jupyter notebook in the Glue 3.0 container:
...
ANSWER
Answered 2022-Jan-16 at 11:25
It seems that the Glue 3.0 image has some issues with SSL. A workaround for working locally is to disable SSL (you also have to change the script paths, as the documentation is not updated).
QUESTION
I am trying to train a seq2seq model for language translation, and I am copy-pasting code from this Kaggle Notebook on Google Colab. The code works fine with CPU and GPU, but it gives me errors while training on a TPU. This same question has already been asked here.
Here is my code:
...
ANSWER
Answered 2021-Nov-09 at 06:27
You need to downgrade to Keras 1.0.2. If that works, great; otherwise I will suggest another solution.
QUESTION
I'm running a Cassandra container (version 3.11.11) from the official Cassandra Docker image. When I run the command line to copy data from another Cassandra DB (data replication), the container gives the error: java.io.IOException: failed to connect to /127.0.0.1:7000 for streaming data.
However, the container's nodetool status is OK:
...
ANSWER
Answered 2021-Nov-03 at 12:17
I needed to change the IP 127.0.0.1 to 172.18.0.2 to access port 7000.
The IP Address 172.18.0.2 is a private IP address. Private IP addresses are used inside a local area network (LAN) and are not visible on the internet. Private IP addresses are defined in RFC 1918 (IPv4) and RFC 4193 (IPv6).
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install big-data
Support