flink-training | Apache Flink Training Exercises
kandi X-RAY | flink-training Summary
The initial set of exercises are all based on data streams of events about taxi rides and taxi fares. These streams are produced by source functions which read data from input files. Please read the instructions above to learn how to use them.
Top functions reviewed by kandi - BETA
- Determines the end time for this trip
- Calculates a duration in seconds (as a long)
- Returns the start time for this track
- Returns true if the given arguments are equal
- Checks whether the given object is equal to this one
- Checks whether this object is equal to another
- Gets the Euclidean distance between two road points
- Gets the Euclidean distance between two points
- The main method
- Creates the pipeline used to execute the long-trip job
- Generate a batch of START events
- Maps a direct path between two points
- Determines the total fare for this trip
- Determines the ID of this track
- Gets the startLon for this route
- Generates a random location within the city area
- Get the center of a grid cell
- Returns a new TaxiFareGenerator that runs for the specified duration
- The main entry point
- Main method
- Runs for a given timespan
- Compares this TaxiRide with another
- Returns the angle between the start and the destination coordinates
- Map a location to a grid cell
- Determines if the given exception is a MissingSolutionException
- Main entry point
Community Discussions
Trending Discussions on flink-training
QUESTION
I am a newbie to Flink. I am trying a POC in which I need to detect when no event has been received for x amount of time, where x is greater than the time specified in the within clause of the CEP pattern.
...ANSWER
Answered 2020-Oct-17 at 20:26: Your application is using event time, so you will need to arrange for a sufficiently large Watermark to be generated despite the lack of incoming events. You could use this example if you want to artificially advance the current watermark when the source is idle.
Given that your events don't have event-time timestamps, why don't you simply use processing time instead, and thereby avoid this problem? (Note, however, the limitation mentioned in https://stackoverflow.com/a/50357721/2000823).
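A minimal sketch of that idea, assuming the WatermarkGenerator interface introduced in Flink 1.11 (the event type MyEvent and the 10-second idle threshold are hypothetical, not from the original answer):

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// MyEvent and the idle threshold are placeholders for illustration only.
public class IdleAwareWatermarkGenerator implements WatermarkGenerator<MyEvent> {

    private static final long MAX_IDLE_MS = 10_000L; // assumed idle threshold
    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEventProcessingTime = System.currentTimeMillis();

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        lastEventProcessingTime = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long now = System.currentTimeMillis();
        if (now - lastEventProcessingTime > MAX_IDLE_MS) {
            // The source is idle: artificially advance the watermark so that
            // CEP timeouts and event-time timers can still fire.
            output.emitWatermark(new Watermark(now - MAX_IDLE_MS));
        } else if (maxTimestamp != Long.MIN_VALUE) {
            output.emitWatermark(new Watermark(maxTimestamp));
        }
    }
}
```

Note that advancing the watermark from the wall clock only makes sense if the event timestamps roughly track wall-clock time; otherwise the artificial watermark could jump far ahead of the data.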
QUESTION
I am getting started with Flink and having a look at one of the official tutorials.
To my understanding the goal of this exercise is to join the two streams on the time attribute.
Task:
The result of this exercise is a data stream of Tuple2<TaxiRide, TaxiFare> records, one for each distinct rideId. You should ignore the END events, and only join the event for the START of each ride with its corresponding fare data.
The resulting stream should be printed to standard out.
Question: How is the EnrichmentFunction able to join the two streams, i.e., how does it know which fare to join with which ride? I expected it to buffer multiple fares/rides until there is a matching partner for an incoming fare/ride.
In my understanding it just saves every ride/fare it sees and combines it with the next best ride/fare. Why is this a proper join?
Provided Solution:
...ANSWER
Answered 2020-May-22 at 08:32: In the context of this particular training exercise on stateful enrichment, there are three events for each value of rideId -- a TaxiRide start event, a TaxiRide end event, and a TaxiFare. The objective of this exercise is to connect each TaxiRide start event with the one TaxiFare event having the same rideId -- or in other words, to join the ride stream and fare stream on rideId, while knowing that there will be only one of each.
This exercise is demonstrating how keyed state works in Flink. Keyed state is effectively a sharded key-value store. When we have an item of ValueState, such as ValueState<TaxiRide> rideState, Flink will store a separate record in its state backend for each distinct value of the key (the rideId).

Each time flatMap1 and flatMap2 are called there is a key (a rideId) implicitly in context, and when we call rideState.update(ride) or rideState.value() we are not accessing a single variable, but rather setting and getting an entry in a key-value store, using the rideId as the key.

In this exercise, both streams are keyed by the rideId, so there is potentially one element of rideState and one element of fareState for each distinct rideId. Hence the solution that's been provided is buffering lots of rides and fares, but only one of each for each rideId (which is enough, given that the rides and fares are perfectly paired in this dataset).
So, you asked:

How is the EnrichmentFunction able to join the two streams, i.e., how does it know which fare to join with which ride?

And the answer is: it joins the fare having the same rideId.
This particular exercise you've asked about shows how to implement a simple enrichment join for the purpose of getting across the ideas of keyed state, and connected streams. But more complex joins are certainly possible with Flink. See the docs on joins using the DataStream API, joins with Flink's Table API, and joins with Flink SQL.
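For reference, the enrichment join described above looks roughly like this; a sketch along the lines of the reference solution (TaxiRide and TaxiFare are the training project's types), not a verbatim quote of it:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends RichCoFlatMapFunction<TaxiRide, TaxiFare, Tuple2<TaxiRide, TaxiFare>> {

    // One entry per distinct rideId, managed by Flink's state backend.
    private ValueState<TaxiRide> rideState;
    private ValueState<TaxiFare> fareState;

    @Override
    public void open(Configuration config) {
        rideState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("saved ride", TaxiRide.class));
        fareState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("saved fare", TaxiFare.class));
    }

    @Override
    public void flatMap1(TaxiRide ride, Collector<Tuple2<TaxiRide, TaxiFare>> out)
            throws Exception {
        TaxiFare fare = fareState.value(); // the fare with this ride's rideId, if seen
        if (fare != null) {
            fareState.clear();
            out.collect(Tuple2.of(ride, fare));
        } else {
            rideState.update(ride); // buffer until the matching fare arrives
        }
    }

    @Override
    public void flatMap2(TaxiFare fare, Collector<Tuple2<TaxiRide, TaxiFare>> out)
            throws Exception {
        TaxiRide ride = rideState.value();
        if (ride != null) {
            rideState.clear();
            out.collect(Tuple2.of(ride, fare));
        } else {
            fareState.update(fare); // buffer until the matching ride arrives
        }
    }
}
```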
QUESTION
I have a Flink streaming application that spends roughly 20% of its CPU time in Kryo.copy. I can evade that by turning on object reuse mode, but I have a slight problem: I'd like to modify input objects to my operators.
The general contract for object reuse mode seems to state: Do not modify input objects or remember input objects after returning from your map function. You may modify objects after output and re-emit them. (e.g.: Slide 6)

Now, my question is: If I immediately dispose of all references to objects after output-ing them from my operators, is it safe to modify input objects? Or is there some other combination of rules that can make it safe to modify input objects?
ANSWER
Answered 2020-Mar-04 at 09:30: Yes, it would be safe. But note that immediate disposal also means that you cannot use the objects as keys in maps, which also applies to heap state backends (you can use them for lookups, but you would need to create a copy on modification). So for simple map chains it should work well, but before using joins, windows, and grouping, I'd double-check it or create my own defensive copies at appropriate places.
Btw, if you want to improve performance, it's almost always recommended to ditch Kryo serialization. Kryo will slow down any network traffic you have. If so, try to use POJOs, a well-supported format like Avro, or write your own serializer. That would certainly improve performance more than object reuse. This paragraph does not apply if you don't have any network channels.
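As a concrete starting point, both settings live on the execution config; a sketch (the class name is arbitrary):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializationTuning {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Turn on object reuse to avoid the Kryo.copy overhead described above.
        env.getConfig().enableObjectReuse();

        // Optional: fail fast whenever a type falls back to Kryo, so it can be
        // replaced with a POJO, Avro, or a custom serializer.
        env.getConfig().disableGenericTypes();
    }
}
```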
QUESTION
I am very new to Apache Flink. I am using v1.9.0. I want to try a multiple-streams join example, but I am getting the following exception while running it.
Exception:
...ANSWER
Answered 2020-Feb-03 at 11:00: If you add
QUESTION
I am trying to run the data Artisans examples available on GitHub. I read the tutorial, added the needed SDKs, and downloaded the files for NYC fares and rides. Whenever I run the RideCount.java example I get a Job Execution Failed. Here is the link to the git repo for the RideCount class file: GitHub repo RideCount.java
...ANSWER
Answered 2019-Mar-07 at 15:02: It appears that the nycTaxiRides.gz file has somehow been corrupted. The line that is shown in your screenshot should have these contents
QUESTION
I found an example for CEP at the following URL: https://github.com/dataArtisans/flink-training-exercises/blob/master/src/main/java/com/dataartisans/flinktraining/exercises/datastream_java/cep/LongRides.java
The "goal for this exercise is to emit START events for taxi rides that have not been matched by an END event during the first 2 hours of the ride." However, from the code below, it seems to define a pattern that finds rides that HAVE been completed within 2 hours, rather than rides that have NOT been completed within 2 hours.
It looks like the pattern first finds the Start event, then finds the End event (!ride.isStart), within 2 hours. So doesn't that read as a pattern that finds rides completed within 2 hours?
...ANSWER
Answered 2018-Jun-01 at 08:09: I've improved the comment in the sample solution to make this clearer.
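For context, the confusion resolves as follows: the pattern really does match rides that complete within 2 hours, and the exercise collects the partial matches that time out, which CEP routes to a side output. A sketch of that mechanism (TaxiRide is the training type; keyedRides stands for the ride stream keyed by rideId, and the selector logic here is a sketch of, not a quote from, the reference solution):

```java
import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternFlatSelectFunction;
import org.apache.flink.cep.PatternFlatTimeoutFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// A START followed by an END for the same rideId, within 2 hours.
Pattern<TaxiRide, TaxiRide> completedRides =
        Pattern.<TaxiRide>begin("start")
                .where(new SimpleCondition<TaxiRide>() {
                    @Override
                    public boolean filter(TaxiRide ride) {
                        return ride.isStart;
                    }
                })
                .next("end")
                .where(new SimpleCondition<TaxiRide>() {
                    @Override
                    public boolean filter(TaxiRide ride) {
                        return !ride.isStart;
                    }
                })
                .within(Time.hours(2));

OutputTag<TaxiRide> timedOutTag = new OutputTag<TaxiRide>("timedout") {};

SingleOutputStreamOperator<TaxiRide> completed = CEP
        .pattern(keyedRides, completedRides)
        .flatSelect(
                timedOutTag,
                new PatternFlatTimeoutFunction<TaxiRide, TaxiRide>() {
                    @Override
                    public void timeout(Map<String, List<TaxiRide>> match,
                                        long timeoutTimestamp,
                                        Collector<TaxiRide> out) {
                        // A START with no END within 2 hours: a long ride.
                        out.collect(match.get("start").get(0));
                    }
                },
                new PatternFlatSelectFunction<TaxiRide, TaxiRide>() {
                    @Override
                    public void flatSelect(Map<String, List<TaxiRide>> match,
                                           Collector<TaxiRide> out) {
                        // Completed rides are not interesting; emit nothing.
                    }
                });

DataStream<TaxiRide> longRides = completed.getSideOutput(timedOutTag);
```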
QUESTION
I have cloned the Flink Training repo and followed the instructions on building and deploying from here in order to get familiar with Apache Flink. However, there are errors in the projects after building and importing into the Eclipse IDE. In the Flink Training Exercises project I find errors in the pom: Plugin execution not covered by lifecycle configuration: net.alchim31.maven:scala-maven-plugin:3.1.4:testCompile. There are also errors in the flink-quickstart-java project. Some dependencies are not being resolved, e.g. ExecutionEnvironment cannot be resolved in the BatchJob class.
ANSWER
Answered 2018-May-31 at 12:20: I got this working in Eclipse by selecting the add-dependencies-for-IDEA Maven profile. I added this section to my pom file:
QUESTION
I'm currently working through this tutorial on stream processing in Apache Flink and am a little confused about how the TimeCharacteristic of a StreamEnvironment affects the order of the data values in the stream, and with respect to which time the onTimer function of a ProcessFunction is called.
In the tutorial, they set the characteristic to EventTime, since we want to compare the start and end events based on the time they store, and not the time they are received in the stream.

Now in the reference solution they set a timerService to fire 2 hours after an event's timestamp for each key.

What really confuses me is when this timer actually fires during runtime. A possible explanation I came up with:

Setting the TimeCharacteristic to EventTime makes the stream process the entries ordered by their event timestamp, and this way the timer can be fired for each rideId when an event arrives with a timestamp > rideId.timeStamp + 2 hours (the 2 hours coming from the exercise context).
But with this explanation a startEvent of a Taxi ride would always be processed before an endEvent (I'm assuming that a ride can't end before it started), and we wouldn't have to check if a matching EndEvent has already arrived like they do in the processElement function.
In the documentation of ProcessFunction they state that the timer is called "when a timer's particular time is reached", but since we have a (potentially infinite) stream of data, and we don't care when a data point arrives but only when it happened, how can we be sure that there will not arrive a matching data point for a startEvent somewhere in the future that would trigger the 2-hour criterion stated in the exercise?
If someone could link me an explanation of this or correct me where I'm wrong that would be highly appreciated.
...ANSWER
Answered 2018-Mar-03 at 18:29: An event-time timer fires when Flink is satisfied that all events with timestamps earlier than the time in the timer have already been processed. This is done by waiting for the current watermark to reach the time specified in the timer.
When working with event-time, events are usually processed out-of-order, and this is the case in the exercises you are working with. In general, watermarks are used to mark the passage of event-time -- a watermark is characterized by a timestamp t, and indicates that the stream is now complete up through time t (meaning that all earlier events have already been processed). In the training exercises, the TaxiRideSource is parameterized according to how much out-of-orderness you want to have, and the TaxiRideSource takes care to emit appropriately delayed watermarks.
You can read more about event time and watermarks in the Flink documentation.
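To make the timer semantics concrete, here is a sketch of how such a 2-hour event-time timer is registered in a KeyedProcessFunction, assuming the training project's TaxiRide with its getEventTime() accessor (the surrounding solution logic is omitted):

```java
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by rideId (Long). The timer registered in processElement fires only
// once the current watermark passes ride.getEventTime() + 2 hours, i.e.,
// once Flink believes no earlier events are still in flight.
public class TwoHourTimer extends KeyedProcessFunction<Long, TaxiRide, TaxiRide> {

    private static final long TWO_HOURS = 2 * 60 * 60 * 1000L;

    @Override
    public void processElement(TaxiRide ride, Context ctx, Collector<TaxiRide> out) {
        // Register an event-time timer 2 hours after the event's timestamp.
        ctx.timerService().registerEventTimeTimer(ride.getEventTime() + TWO_HOURS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<TaxiRide> out) {
        // Called when the watermark reaches `timestamp`, regardless of how
        // long that takes in wall-clock time.
    }
}
```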
QUESTION
I'm going through Flink tutorial materials from dataArtisans and for some reason when I get to the sample file PopularPlacesFromKafka.scala I don't get any output sent to stdout.
...ANSWER
Answered 2017-Sep-19 at 22:03: Did you configure an appropriate speedup for the source? By default (without a speedup factor), the source emulates the original data, i.e., it emits records at the same rate as they were originally generated. That means it takes 1 minute to produce 1 minute of data.
The window operator aggregates every 5 minutes the last 15 minutes of data. Consequently, it will take 5 minutes until the window operator produces the first result.
If you set the speedup factor to 600, you'll get 10 minutes of data in 1 second.
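For example, with the training exercises' TaxiRideSource; a sketch assuming the constructor from the old flink-training-exercises project (the file path is a placeholder):

```java
// Serve events out of order by up to 60 seconds, replaying the data
// 600x faster than real time (10 minutes of data per second).
DataStream<TaxiRide> rides = env.addSource(
        new TaxiRideSource("/path/to/nycTaxiRides.gz", 60, 600));
```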
QUESTION
We are planning to use Flink to process a stream of data from a Kafka topic (logs in JSON format).
But for that processing, we need to use input files which change every day, and the information within can change completely (not the format, but the contents).
Each time one of those input files changes we will have to reload those files into the program and keep the stream processing going on.
Re-loading of the data could be done the same way as it is done now:
...ANSWER
Answered 2017-Oct-20 at 12:53: Flink can monitor a directory and ingest files when they are moved into that directory; maybe that's what you are looking for. See the PROCESS_CONTINUOUSLY option for readFile in the documentation.
However, if the data is in Kafka, it would be much more natural to use Flink's Kafka consumer to stream the data directly into Flink. There is also documentation about using the Kafka connector. And the Flink training includes an exercise on using Kafka with Flink.
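A sketch combining the two suggestions (directory path, topic name, and broker address are placeholders):

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Re-scan the directory every 60 seconds and ingest new or changed files.
DataStream<String> referenceData = env.readFile(
        new TextInputFormat(new Path("/data/reference")),
        "/data/reference",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        60_000L);

// Stream the JSON logs directly from Kafka.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
DataStream<String> logs = env.addSource(
        new FlinkKafkaConsumer<>("logs", new SimpleStringSchema(), props));
```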
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install flink-training
This flink-training project contains exercises, tests, and reference solutions for the programming exercises. Clone the flink-training project from GitHub and build it.

:information_source: Repository Layout: This repository has several branches set up pointing to different Apache Flink versions, similarly to the apache/flink repository, with:

- a release branch for each minor version of Apache Flink, e.g. release-1.10, and
- a master branch that points to the current Flink release (not flink:master!)

If you want to work on a version other than the current Flink release, make sure to check out the appropriate branch. If you haven't done this before, at this point you'll end up downloading all of the dependencies for this Flink training project. This usually takes a few minutes, depending on the speed of your internet connection. If all of the tests pass and the build is successful, you are off to a good start.