lambda-refarch-mapreduce | repo presents a reference architecture

by awslabs | JavaScript | Version: Current | License: Non-SPDX

kandi X-RAY | lambda-refarch-mapreduce Summary

lambda-refarch-mapreduce is a JavaScript library typically used in Big Data and Spark applications. lambda-refarch-mapreduce has no bugs, no vulnerabilities, and low support. However, it has a Non-SPDX license. You can download it from GitHub.

This repo presents a reference architecture for running serverless MapReduce jobs. This has been implemented using AWS Lambda and Amazon S3.
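As a rough illustration of the pattern (the bucket layout, key names, and word-count logic below are assumptions for the sketch, not code taken from the repo), a mapper Lambda in this style of architecture reads one input split from S3, computes a partial result, and writes it back to S3 for a reducer to merge later:

import json
from collections import Counter

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical event shape: the driver passes the bucket and the key of one input split.
    bucket = event["bucket"]
    key = event["key"]

    # Read the split and compute a partial word count (a stand-in for any map logic).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    counts = Counter(body.split())

    # Write the partial result where a reducer/coordinator can find it.
    out_key = "job/intermediate/" + key.rsplit("/", 1)[-1] + ".json"
    s3.put_object(Bucket=bucket, Key=out_key, Body=json.dumps(counts))
    return {"output_key": out_key}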

Support

lambda-refarch-mapreduce has a low active ecosystem.
It has 388 stars and 77 forks. There are 91 watchers for this library.
It has had no major release in the last 6 months.
There are 6 open issues and 1 has been closed. There are 2 open pull requests and 0 closed pull requests.
It has a neutral sentiment in the developer community.
The latest version of lambda-refarch-mapreduce is current.

Quality

              lambda-refarch-mapreduce has 0 bugs and 65 code smells.

Security

              lambda-refarch-mapreduce has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              lambda-refarch-mapreduce code analysis shows 0 unresolved vulnerabilities.
              There are 2 security hotspots that need review.

License

              lambda-refarch-mapreduce has a Non-SPDX License.
A Non-SPDX license can be an open-source license that is not SPDX-compliant, or a non-open-source license; you need to review it closely before use.

Reuse

              lambda-refarch-mapreduce releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.
              lambda-refarch-mapreduce saves you 255 person hours of effort in developing the same functionality from scratch.
              It has 620 lines of code, 26 functions and 15 files.
              It has high code complexity. Code complexity directly impacts maintainability of the code.


            lambda-refarch-mapreduce Key Features

            No Key Features are available at this moment for lambda-refarch-mapreduce.

            lambda-refarch-mapreduce Examples and Code Snippets

            No Code Snippets are available at this moment for lambda-refarch-mapreduce.

            Community Discussions

            QUESTION

            How can I return the result of a mapreduce operation to an AWS API request
            Asked 2017-Aug-11 at 14:47

I have a program that performs several thousand Monte Carlo simulations to predict a result; I can't say what they really predict, so I'm going to use another example from "The Indisputable Existence of Santa Claus", since the content of those algorithms is not relevant to the question. I want to know how often each square on a Monopoly board is visited (to predict which properties are the best to buy). To do this, I simulate thousands of games and collate the results. My current implementation is a stand-alone C# application, but I want to move it to the cloud so that I can provide it as a service - each user can get personalised results by submitting the number of sides that each of their dice have.

The current implementation is also quite slow - it is very parallelisable since each simulation is entirely independent, but I only have 8 cores, so it takes upwards of 20 minutes to complete the full prediction with about 50000 individual simulations on my local machine.

The plan is to have AWS Lambda functions each run one (or several) simulations and then collate - basically mapreduce it. I looked into using AWS EMR (Elastic MapReduce), but that is way too large-scale for what I want; spinning up the instances to run the computations seems to take longer than the whole calculation itself (which would not matter for multi-hour offline analyses, but I want low latency to respond to a web request).

            The ideal as I see it would be:

Lambda 0 - Fires off many other lambda functions, each doing a small part of the calculation.
Lambda 1..N - Do many simulations in parallel (the number is not a constant).
Lambda N+1 - Collates all the results and returns the answer.
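A rough sketch of what "Lambda 0" could look like (the worker function name, payload shape, and batch size are assumptions for illustration): it fans out the work with asynchronous "Event" invocations and returns immediately, leaving the workers to write their results somewhere a final collation step can find them.

import json

import boto3

lam = boto3.client("lambda")

def handler(event, context):
    total = event.get("simulations", 50000)  # total Monte Carlo runs requested
    batch = 500                              # runs per worker invocation (tuning knob)
    job_id = context.aws_request_id          # tag all intermediate results with the job

    # Fan out: one asynchronous invocation per batch of simulations.
    for offset in range(0, total, batch):
        lam.invoke(
            FunctionName="simulate-batch",   # hypothetical worker Lambda
            InvocationType="Event",          # fire-and-forget; do not wait for the result
            Payload=json.dumps({"job_id": job_id, "offset": offset, "count": batch}),
        )

    return {"job_id": job_id, "batches": (total + batch - 1) // batch}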

            There is a lambda mapreduce framework here:

            https://github.com/awslabs/lambda-refarch-mapreduce

But it seems to have one major drawback: each time a map stage completes, it writes its results to S3 (I'm fine with using that as a temporary store) and then triggers a new lambda via an event. That triggered lambda checks whether all the results have been written to storage yet; if not, it ends, and if so, it does the reduction step. That seems like a fair solution, but I'm slightly concerned about a) race hazards when two results come in together - could two reducers both compute the results? - and b) that it seems to fire off a lot of lambdas that all just decide not to run (I know they're cheap to run, but doubling the number to two per simulation - calculate and maybe reduce - will obviously double the costs). Is there a way to fire off an S3 event after, say, 100 files are written to a folder instead of after every one?
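To make the concern concrete, here is a rough sketch of that kind of S3-triggered coordinator (the expected count and key prefix are assumptions, not the framework's actual code). Most invocations list the intermediate objects, see that the job is not finished, and exit; only an invocation that sees the full count runs the reduce. Note that without a lock or an idempotent final write, two late-arriving events can still both pass the check, which is exactly the race worried about above.

import json

import boto3

s3 = boto3.client("s3")

EXPECTED = 100                   # number of mapper outputs the job waits for (assumed known up front)
PREFIX = "job/intermediate/"     # hypothetical prefix where the mappers write partial results

def handler(event, context):
    # Triggered by an S3 put event for each intermediate result.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]

    # Count how many intermediate results have landed so far.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=PREFIX):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    if len(keys) < EXPECTED:
        return "not ready"       # the common case: do (cheap) nothing and exit

    # Reduce: merge all partial results into one. A real system should make this
    # final write idempotent or guard it, since two coordinators can race here.
    totals = {}
    for key in keys:
        part = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())
        for word, count in part.items():
            totals[word] = totals.get(word, 0) + count

    s3.put_object(Bucket=bucket, Key="job/result.json", Body=json.dumps(totals))
    return "reduced"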

            I looked at using step functions, but I'm not sure how to fire many lambdas in parallel in one step and have them all return before the state machine transitions. Step functions would however be useful for the final wrinkle - I want to hide all this behind an API.

From what I've read, APIs can fire off a lambda and return the result of that lambda, but I don't want the invoked lambda to be the one returning the result. That isn't the case when you instead invoke a step function from the API: the results of the last state are returned by the API call instead.

            In short, I want:

            API request -> Calculate results in parallel -> API response

It is that bit in the middle that I'm not clear how to do while still returning all the results as a response to the original request - either part on its own is easy.

            A few options I can see:

            Use a step function, which is natively supported by the AWS API gateway now, and invoke multiple lambdas in one state, waiting for them all to return before transitioning.

            Use AWS EMR, but somehow keep the provisioned instances always live to avoid the provisioning time overheads. This obviously negates the scalability of Lambda and is more expensive.

            Use the mapreduce framework, or something similar, and find a way to respond to an incoming request from a different lambda to the one that was initially invoked by the API request. Ideally also reduce the number of S3 events involved here, but that's not a priority.

            Respond instantly to the original API request from the first lambda, then push more data to the user later when the calculations finish (they should only take about 30 seconds with the parallelism, and the domain is such that that is an acceptable time to wait for a response, even an HTTP response).

I doubt it will make any difference to the solution, since it is just an expansion of the middle bit, not a fundamental change, but the real calculation is iterative, so it would be:

            Request -> Mapreduce -> Mapreduce -> ... -> Response

            As long as I know how to chain one set of lambda functions within a request, chaining more should be just more of the same (I hope).

            Thank you.

P.S. The tags aws-emr and aws-elastic-mapreduce don't exist yet, and I can't create them.


            ANSWER

            Answered 2017-Aug-07 at 19:44

            One idea would be to call a Lambda function (call it 'workflow director') via API GW, then write code in that function to call step functions (or whatever) directly and poll the state so you can eventually respond synchronously to the HTTP request.

            That's just a sync wrapper around the async workflow. Keep in mind that API GW has a hard timeout at 29 seconds, so if you expect that this workflow will take around 30 seconds, it might not be worth it to implement a sync version.
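As a rough sketch of that "workflow director" (the state machine ARN and the timings are placeholders): start the Step Functions execution, then poll its status, giving up with some headroom under API Gateway's 29-second limit.

import json
import time

import boto3

sfn = boto3.client("stepfunctions")

STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:monte-carlo"  # placeholder

def handler(event, context):
    # Kick off the asynchronous workflow.
    execution = sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps(event),
    )

    # Poll for completion, leaving headroom under the 29-second API Gateway timeout.
    deadline = time.time() + 25
    while time.time() < deadline:
        desc = sfn.describe_execution(executionArn=execution["executionArn"])
        if desc["status"] == "SUCCEEDED":
            return json.loads(desc["output"])
        if desc["status"] in ("FAILED", "TIMED_OUT", "ABORTED"):
            raise RuntimeError("workflow ended with status " + desc["status"])
        time.sleep(1)

    raise TimeoutError("workflow still running; fall back to an asynchronous response")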

            The async model (I guess in this case calling step function directly from API GW) would work in either case.

            Edit: sorry, may have misunderstood your comment about step functions. I thought there was no synchronous way to call the step functions workflow and await the final state, but from your comment it seems that there already is.

            Let me quickly answer a couple of your specific questions:

Is there a way to fire off an S3 event after, say, 100 files are written to a folder instead of after every one?

            I believe this is not possible.

            I'm not sure how to fire many lambdas in parallel in one step and have them all return before the state machine transitions

            Did you see this in the docs? http://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html
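For a concrete (if stripped-down) picture of what the linked Parallel state does: every branch runs concurrently, and the state machine only moves to the next state once all branches have returned, which is the fan-out/fan-in behaviour asked about. The function ARNs below are placeholders, and the definition is written as a Python dict only so it could be serialised and handed to Step Functions.

import json

# Minimal Amazon States Language definition using a Parallel state: both branches
# run at the same time, and "Collate" runs only after every branch has finished.
definition = {
    "StartAt": "RunSimulations",
    "States": {
        "RunSimulations": {
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "Batch1",
                    "States": {
                        "Batch1": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:simulate-batch",
                            "End": True,
                        }
                    },
                },
                {
                    "StartAt": "Batch2",
                    "States": {
                        "Batch2": {
                            "Type": "Task",
                            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:simulate-batch",
                            "End": True,
                        }
                    },
                },
            ],
            "Next": "Collate",
        },
        "Collate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:collate-results",
            "End": True,
        },
    },
}

print(json.dumps(definition, indent=2))  # e.g. the JSON to hand to stepfunctions.create_state_machine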

            Source https://stackoverflow.com/questions/45359497

Community Discussions and Code Snippets contain sources from the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install lambda-refarch-mapreduce

            You can download it from GitHub.

            Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, check and ask on Stack Overflow.

            CLONE
          • HTTPS

            https://github.com/awslabs/lambda-refarch-mapreduce.git

          • CLI

            gh repo clone awslabs/lambda-refarch-mapreduce

          • sshUrl

            git@github.com:awslabs/lambda-refarch-mapreduce.git
