flink-training | Apache Flink Training Exercises
kandi X-RAY | flink-training Summary
The initial set of exercises are all based on data streams of events about taxi rides and taxi fares. These streams are produced by source functions which read data from input files. Please read the instructions above to learn how to use them.
Top functions reviewed by kandi - BETA
- Determines the end time for this trip
- Calculates a duration in seconds (as a long)
- Returns the start time for this track
- Returns true if the given arguments are equal
- Checks whether the given object is equal to this one
- Checks whether this object is equal to another
- Gets the Euclidean distance between two road points
- Gets the Euclidean distance between two points
- The main method
- Creates the pipeline used to execute the long-trip job
- Generate a batch of START events
- Maps a direct path between two points
- Determines the total fare for this trip
- Determines the ID of this track
- Gets the startLon for this route
- Generates a random location within the city area
- Get the center of a grid cell
- Returns a new TaxiFareGenerator that runs for the specified duration
- The main entry point
- Main method
- Runs for a given timespan
- Compares this TaxiRide with another
- Returns the angle between the start and the destination coordinates
- Map a location to a grid cell
- Determines if the given exception is a MissingSolutionException
- Main entry point
Community Discussions
Trending Discussions on flink-training
QUESTION
I am a newbie to Flink. I am trying a POC in which I need to detect when no event has been received for x amount of time, where x is greater than the time specified in the within clause of the CEP pattern.
...ANSWER
Answered 2020-Oct-17 at 20:26: Your application is using event time, so you will need to arrange for a sufficiently large Watermark to be generated despite the lack of incoming events. You could use this example if you want to artificially advance the current watermark when the source is idle.
Given that your events don't have event-time timestamps, why don't you simply use processing time instead, and thereby avoid this problem? (Note, however, the limitation mentioned in https://stackoverflow.com/a/50357721/2000823).
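A minimal sketch of that idea, assuming the WatermarkGenerator interface introduced in Flink 1.11 (the event type MyEvent and the 10-second idle threshold are hypothetical, not from the original answer):

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// MyEvent and the idle threshold are placeholders for illustration only.
public class IdleAwareWatermarkGenerator implements WatermarkGenerator<MyEvent> {

    private static final long MAX_IDLE_MS = 10_000L; // assumed idle threshold
    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEventProcessingTime = System.currentTimeMillis();

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        lastEventProcessingTime = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long now = System.currentTimeMillis();
        if (now - lastEventProcessingTime > MAX_IDLE_MS) {
            // The source is idle: artificially advance the watermark so that
            // CEP timeouts and event-time timers can still fire.
            output.emitWatermark(new Watermark(now - MAX_IDLE_MS));
        } else if (maxTimestamp != Long.MIN_VALUE) {
            output.emitWatermark(new Watermark(maxTimestamp));
        }
    }
}
```

Note that advancing the watermark from the wall clock only makes sense if the event timestamps roughly track wall-clock time; otherwise the artificial watermark could jump far ahead of the data.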
QUESTION
I am getting started with Flink and having a look at one of the official tutorials.
To my understanding the goal of this exercise is to join the two streams on the time attribute.
Task:
The result of this exercise is a data stream of Tuple2<TaxiRide, TaxiFare> records, one for each distinct rideId. You should ignore the END events, and only join the event for the START of each ride with its corresponding fare data.
The resulting stream should be printed to standard out.
Question: How is the EnrichmentFunction able to join the two streams, i.e., how does it know which fare to join with which ride? I expected it to buffer multiple fares/rides until there is a matching partner for an incoming fare/ride.
In my understanding it just saves every ride/fare it sees and combines it with the next best ride/fare. Why is this a proper join?
Provided Solution:
...ANSWER
Answered 2020-May-22 at 08:32: In the context of this particular training exercise on stateful enrichment, there are three events for each value of rideId -- a TaxiRide start event, a TaxiRide end event, and a TaxiFare. The objective of this exercise is to connect each TaxiRide start event with the one TaxiFare event having the same rideId -- or in other words, to join the ride stream and fare stream on rideId, while knowing that there will be only one of each.
This exercise is demonstrating how keyed state works in Flink. Keyed state is effectively a sharded key-value store. When we have an item of ValueState, such as ValueState<TaxiRide> rideState, Flink will store a separate record in its state backend for each distinct value of the key (the rideId).

Each time flatMap1 and flatMap2 are called there is a key (a rideId) implicitly in context, and when we call rideState.update(ride) or rideState.value() we are not accessing a single variable, but rather setting and getting an entry in a key-value store, using the rideId as the key.

In this exercise, both streams are keyed by the rideId, so there is potentially one element of rideState and one element of fareState for each distinct rideId. Hence the solution that's been provided is buffering lots of rides and fares, but only one of each for each rideId (which is enough, given that the rides and fares are perfectly paired in this dataset).
So, you asked:

How is the EnrichmentFunction able to join the two streams, i.e., how does it know which fare to join with which ride?

And the answer is: it joins the fare having the same rideId.
This particular exercise you've asked about shows how to implement a simple enrichment join for the purpose of getting across the ideas of keyed state, and connected streams. But more complex joins are certainly possible with Flink. See the docs on joins using the DataStream API, joins with Flink's Table API, and joins with Flink SQL.
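For reference, the enrichment join described above looks roughly like this; a sketch along the lines of the reference solution (TaxiRide and TaxiFare are the training project's types), not a verbatim quote of it:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends RichCoFlatMapFunction<TaxiRide, TaxiFare, Tuple2<TaxiRide, TaxiFare>> {

    // One entry per distinct rideId, managed by Flink's state backend.
    private ValueState<TaxiRide> rideState;
    private ValueState<TaxiFare> fareState;

    @Override
    public void open(Configuration config) {
        rideState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("saved ride", TaxiRide.class));
        fareState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("saved fare", TaxiFare.class));
    }

    @Override
    public void flatMap1(TaxiRide ride, Collector<Tuple2<TaxiRide, TaxiFare>> out)
            throws Exception {
        TaxiFare fare = fareState.value(); // the fare with this ride's rideId, if seen
        if (fare != null) {
            fareState.clear();
            out.collect(Tuple2.of(ride, fare));
        } else {
            rideState.update(ride); // buffer until the matching fare arrives
        }
    }

    @Override
    public void flatMap2(TaxiFare fare, Collector<Tuple2<TaxiRide, TaxiFare>> out)
            throws Exception {
        TaxiRide ride = rideState.value();
        if (ride != null) {
            rideState.clear();
            out.collect(Tuple2.of(ride, fare));
        } else {
            fareState.update(fare); // buffer until the matching ride arrives
        }
    }
}
```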
QUESTION
I have a Flink streaming application that spends roughly 20% of its CPU time in Kryo.copy. I can evade that by turning on object reuse mode, but I have a slight problem: I'd like to modify input objects to my operators.
The general contract for object reuse mode seems to state: Do not modify input objects or remember input objects after returning from your map function. You may modify objects after output and re-emit them. (e.g.: Slide 6)

Now, my question is: If I immediately dispose of all references to objects after output-ing them from my operators, is it safe to modify input objects? Or is there some other combination of rules that can make it safe to modify input objects?
ANSWER
Answered 2020-Mar-04 at 09:30: Yes, it would be safe. But note that immediate disposal also means that you cannot use the objects as keys in maps, which also applies to heap state backends (you can use them for lookups, but you would need to create a copy on modification). So for simple map chains it should work well, but before using joins, windows, and grouping, I'd double-check it or create my own defensive copies at appropriate places.
Btw, if you want to improve performance, it's almost always recommended to ditch Kryo serialization. Kryo will slow down any network traffic you have. If so, try to use POJOs, a well-supported format like Avro, or write your own serializer. That would certainly improve performance more than object reuse. This paragraph does not apply if you don't have any network channels.
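As a concrete starting point, both settings live on the execution config; a sketch (the class name is arbitrary):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializationTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Turn on object reuse to avoid the Kryo.copy overhead described above.
        env.getConfig().enableObjectReuse();

        // Optional: fail fast whenever a type falls back to Kryo, so it can be
        // replaced with a POJO, Avro, or a custom serializer.
        env.getConfig().disableGenericTypes();
    }
}
```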
QUESTION
I am very new to Apache Flink. I am using v1.9.0. I want to try a multiple-streams join example, but I am getting the following exception while running it.
Exception:
...ANSWER
Answered 2020-Feb-03 at 11:00: If you add
QUESTION
I am trying to run the data Artisans examples available on GitHub. I read the tutorial, added the needed SDKs, and downloaded the files for NYC fares and rides. Whenever I run the RideCount.java example I get a Job Execution Failed. Here is the link to the git repo for the RideCount class file: GitHub repo RideCount.java
...ANSWER
Answered 2019-Mar-07 at 15:02: It appears that the nycTaxiRides.gz file has somehow been corrupted. The line that is shown in your screenshot should have these contents
QUESTION
I found an example for CEP at the following URL: https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/cep/LongRides.java
The "goal for this exercise is to emit START events for taxi rides that have not been matched by an END event during the first 2 hours of the ride." However, from the code below, it seems to define a pattern that finds rides that HAVE been completed within 2 hours, rather than rides that have NOT been completed within 2 hours.
It looks like the pattern first finds the Start event, then finds the End event (!ride.isStart), within 2 hours. So doesn't that read as a pattern that finds rides completed within 2 hours?
...ANSWER
Answered 2018-Jun-01 at 08:09: I've improved the comment in the sample solution to make this clearer.
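For context, the confusion resolves as follows: the pattern really does match rides that complete within 2 hours, and the exercise collects the partial matches that time out, which CEP routes to a side output. A sketch of that mechanism (TaxiRide is the training type; keyedRides stands for the ride stream keyed by rideId, and the selector logic here is a sketch of, not a quote from, the reference solution):

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternFlatSelectFunction;
import org.apache.flink.cep.PatternFlatTimeoutFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// A START followed by an END for the same rideId, within 2 hours.
Pattern<TaxiRide, TaxiRide> completedRides =
        Pattern.<TaxiRide>begin("start")
                .where(new SimpleCondition<TaxiRide>() {
                    @Override
                    public boolean filter(TaxiRide ride) {
                        return ride.isStart;
                    }
                })
                .next("end")
                .where(new SimpleCondition<TaxiRide>() {
                    @Override
                    public boolean filter(TaxiRide ride) {
                        return !ride.isStart;
                    }
                })
                .within(Time.hours(2));

OutputTag<TaxiRide> timedOutTag = new OutputTag<TaxiRide>("timedout") {};

SingleOutputStreamOperator<TaxiRide> completed = CEP
        .pattern(keyedRides, completedRides)
        .flatSelect(
                timedOutTag,
                new PatternFlatTimeoutFunction<TaxiRide, TaxiRide>() {
                    @Override
                    public void timeout(Map<String, List<TaxiRide>> match,
                                        long timeoutTimestamp,
                                        Collector<TaxiRide> out) {
                        // A START with no END within 2 hours: a long ride.
                        out.collect(match.get("start").get(0));
                    }
                },
                new PatternFlatSelectFunction<TaxiRide, TaxiRide>() {
                    @Override
                    public void flatSelect(Map<String, List<TaxiRide>> match,
                                           Collector<TaxiRide> out) {
                        // Completed rides are not interesting; emit nothing.
                    }
                });

DataStream<TaxiRide> longRides = completed.getSideOutput(timedOutTag);
```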
QUESTION
I have cloned the Flink Training repo and followed the instructions on building and deploying from here in order to get familiar with Apache Flink. However, there are errors in the projects after building and importing into the Eclipse IDE. In the Flink Training Exercises project I find errors in the pom: Plugin execution not covered by lifecycle configuration: net.alchim31.maven:scala-maven-plugin:3.1.4:testCompile. There are also errors in the flink-quickstart-java project. Some dependencies are not being resolved, e.g. ExecutionEnvironment cannot be resolved in the BatchJob class.
ANSWER
Answered 2018-May-31 at 12:20: I got this working in Eclipse by selecting the add-dependencies-for-IDEA Maven profile. I added this section to my pom file:
QUESTION
I'm currently working through this tutorial on stream processing in Apache Flink and am a little confused about how the TimeCharacteristic of a StreamEnvironment affects the order of the data values in the stream, and with respect to which time the onTimer function of a ProcessFunction is called.
In the tutorial, they set the characteristic to EventTime, since we want to compare the start and end events based on the time they store, and not the time they are received in the stream.

Now in the reference solution they set a timerService to fire 2 hours after an event's timestamp for each key.

What really confuses me is when this timer actually fires during runtime. A possible explanation I came up with:

Setting the TimeCharacteristic to EventTime makes the stream process the entries ordered by their event timestamp, and this way the timer can be fired for each rideId when an event arrives with a timestamp > rideId.timeStamp + 2 hours (the 2 hours coming from the exercise context).
But with this explanation a startEvent of a Taxi ride would always be processed before an endEvent (I'm assuming that a ride can't end before it started), and we wouldn't have to check if a matching EndEvent has already arrived like they do in the processElement function.
In the documentation of ProcessFunction they state that the timer is called "when a timer's particular time is reached", but since we have a (potentially infinite) stream of data, and we don't care when a data point arrives but only when it happened, how can we be sure that there will not arrive a matching data point for a startEvent somewhere in the future that would trigger the 2-hour criterion stated in the exercise?
If someone could link me an explanation of this or correct me where I'm wrong that would be highly appreciated.
...ANSWER
Answered 2018-Mar-03 at 18:29: An event-time timer fires when Flink is satisfied that all events with timestamps earlier than the time in the timer have already been processed. This is done by waiting for the current watermark to reach the time specified in the timer.
When working with event-time, events are usually processed out-of-order, and this is the case in the exercises you are working with. In general, watermarks are used to mark the passage of event-time -- a watermark is characterized by a timestamp t, and indicates that the stream is now complete up through time t (meaning that all earlier events have already been processed). In the training exercises, the TaxiRideSource is parameterized according to how much out-of-orderness you want to have, and the TaxiRideSource takes care to emit appropriately delayed watermarks.
You can read more about event time and watermarks in the Flink documentation.
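To make the timer semantics concrete, here is a sketch of how such a 2-hour event-time timer is registered in a KeyedProcessFunction, assuming the training project's TaxiRide with its getEventTime() accessor (the surrounding solution logic is omitted):

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by rideId (Long). The timer registered in processElement fires only
// once the current watermark passes ride.getEventTime() + 2 hours, i.e.,
// once Flink believes no earlier events are still in flight.
public class TwoHourTimer extends KeyedProcessFunction<Long, TaxiRide, TaxiRide> {

    private static final long TWO_HOURS = 2 * 60 * 60 * 1000L;

    @Override
    public void processElement(TaxiRide ride, Context ctx, Collector<TaxiRide> out) {
        // Register an event-time timer 2 hours after the event's timestamp.
        ctx.timerService().registerEventTimeTimer(ride.getEventTime() + TWO_HOURS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<TaxiRide> out) {
        // Called when the watermark reaches `timestamp`, regardless of how
        // long that takes in wall-clock time.
    }
}
```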
QUESTION
I'm going through Flink tutorial materials from dataArtisans and for some reason when I get to the sample file PopularPlacesFromKafka.scala I don't get any output sent to stdout.
...ANSWER
Answered 2017-Sep-19 at 22:03: Did you configure an appropriate speedup for the source? By default (without a speedup factor), the source emulates the original data, i.e., it emits records at the same rate as they were originally generated. That means it takes 1 minute to produce 1 minute of data.
The window operator aggregates every 5 minutes the last 15 minutes of data. Consequently, it will take 5 minutes until the window operator produces the first result.
If you set the speedup factor to 600, you'll get 10 minutes of data in 1 second.
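For example, with the training exercises' TaxiRideSource; a sketch assuming the constructor from the old flink-training-exercises project (the file path is a placeholder):

```java
// Serve events out of order by up to 60 seconds, replaying the data
// 600x faster than real time (10 minutes of data per second).
DataStream<TaxiRide> rides = env.addSource(
        new TaxiRideSource("/path/to/nycTaxiRides.gz", 60, 600));
```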
QUESTION
We are planning to use Flink to process a stream of data from a Kafka topic (logs in JSON format).
But for that processing, we need to use input files which change every day, and the information within can change completely (not the format, but the contents).
Each time one of those input files changes we will have to reload those files into the program and keep the stream processing going on.
Re-loading of the data could be done the same way as it is done now:
...ANSWER
Answered 2017-Oct-20 at 12:53: Flink can monitor a directory and ingest files when they are moved into that directory; maybe that's what you are looking for. See the PROCESS_CONTINUOUSLY option for readFile in the documentation.
However, if the data is in Kafka, it would be much more natural to use Flink's Kafka consumer to stream the data directly into Flink. There is also documentation about using the Kafka connector. And the Flink training includes an exercise on using Kafka with Flink.
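A sketch combining the two suggestions (directory path, topic name, and broker address are placeholders):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Re-scan the directory every 60 seconds and ingest new or changed files.
DataStream<String> referenceData = env.readFile(
        new TextInputFormat(new Path("/data/reference")),
        "/data/reference",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        60_000L);

// Stream the JSON logs directly from Kafka.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
DataStream<String> logs = env.addSource(
        new FlinkKafkaConsumer<>("logs", new SimpleStringSchema(), props));
```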
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install flink-training
This flink-training project contains exercises, tests, and reference solutions for the programming exercises. Clone the flink-training project from GitHub and build it.

:information_source: Repository Layout: This repository has several branches set up pointing to different Apache Flink versions, similarly to the apache/flink repository, with:

- a release branch for each minor version of Apache Flink, e.g. release-1.10, and
- a master branch that points to the current Flink release (not flink:master!)

If you want to work on a version other than the current Flink release, make sure to check out the appropriate branch. If you haven't done this before, at this point you'll end up downloading all of the dependencies for this Flink training project. This usually takes a few minutes, depending on the speed of your internet connection. If all of the tests pass and the build is successful, you are off to a good start.