flink | Apache Flink is an open source project | SQL Database library
kandi X-RAY | flink Summary
Apache Flink is an open source project of The Apache Software Foundation (ASF). The Apache Flink project originated from the Stratosphere research project.
Top functions reviewed by kandi - BETA
- Initialize handlers.
- Starts the application.
- Generate a select logical plan.
- Registers an alias from the given node.
- Perform a phase 1.
- Performs the actual operation.
- Returns the converter for the given data type.
- Runs the loop.
- Load resource.
- Builds an execution graph.
flink Key Features
flink Examples and Code Snippets
git clone https://github.com/apache/flink.git
cd flink
./mvnw clean package -DskipTests # this will take up to 10 minutes
public static void capitalize() throws Exception {
    String inputTopic = "flink_input";
    String outputTopic = "flink_output";
    String consumerGroup = "baeldung";
    String address = "localhost:9092";
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // ... snippet truncated in the original source
}
@Async
public void consumeSSEFromFluxEndpoint() {
    ParameterizedTypeReference<ServerSentEvent<String>> type =
        new ParameterizedTypeReference<ServerSentEvent<String>>() {
    };
    Flux<ServerSentEvent<String>> eventStream = client.get()
        .uri("/stream-flux")
        .accept(MediaType.TEXT_EVENT_STREAM)
        // ... snippet truncated in the original source
Community Discussions
Trending Discussions on flink
QUESTION
We want to keep in a Flink operator's state the last n unique ids. When the (n+1)-th unique id arrives, we want to keep it and drop the oldest unique id in the state, in order to avoid an ever-growing state.
We already have a TTL (expiration time) mechanism in place. The size limit is another restriction we're looking to put in place.
Not every element holds a unique id.
Question: Does Flink provide an API that limits the number of elements in the state?
Things tried:
- Using MapState with a StateTtlConfig-generated TTL/expiration mechanism.
- Windows limited the number of processed elements, but not the number of elements in state.
ANSWER
Answered 2022-Apr-07 at 14:30
I don't think Flink has a state type that supports this out of the box. The closest thing I can think of is ListState. With ListState you can append elements as you would to a regular list.
For your use case, you would read the state by calling .get(), which gives you an iterable you can iterate over, remove the item you'd like to drop, and then push the state back.
From a performance perspective the iteration may not be ideal, but on the other hand it is insignificant compared to accessing state on disk (if you're using RocksDB as a state backend), which incurs a heavy cost due to serialization and deserialization.
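A minimal sketch of that approach, assuming a KeyedProcessFunction over a keyed stream of String ids and a hypothetical MAX_IDS cap (the names and the limit are illustrative, not from the original answer):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class BoundedIdsFunction extends KeyedProcessFunction<String, String, String> {

    private static final int MAX_IDS = 100; // illustrative limit ("n")
    private transient ListState<String> ids;

    @Override
    public void open(Configuration parameters) {
        ids = getRuntimeContext().getListState(
                new ListStateDescriptor<>("last-n-ids", String.class));
    }

    @Override
    public void processElement(String id, Context ctx, Collector<String> out) throws Exception {
        // Read the current list, append the new id, and drop the oldest entry if the cap is exceeded.
        List<String> buffer = new ArrayList<>();
        for (String existing : ids.get()) {
            buffer.add(existing);
        }
        buffer.add(id);
        if (buffer.size() > MAX_IDS) {
            buffer.remove(0); // the oldest id is at the head because we always append
        }
        ids.update(buffer); // push the trimmed list back into state
        out.collect(id);
    }
}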
QUESTION
I am a Kafka and Flink beginner.
I have implemented a FlinkKafkaConsumer to consume messages from a Kafka topic. The only custom setting other than "group" and "topic" is (ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"), to enable re-reading the same messages several times. It works out of the box for consuming and for my logic.
Now FlinkKafkaConsumer is deprecated, and I wanted to change to its successor, KafkaSource.
Initializing KafkaSource with the same parameters as the FlinkKafkaConsumer produces a read of the topic as expected; I can verify this by printing the stream. Deserialization and timestamps seem to work fine. However, the windows are never evaluated, and as such no results are produced.
I assume some default setting(s) in KafkaSource differ from those of FlinkKafkaConsumer, but I have no idea what they might be.
KafkaSource - Not working
...ANSWER
Answered 2021-Nov-24 at 18:39
Update: The answer is that KafkaSource behaves differently from FlinkKafkaConsumer when the number of Kafka partitions is smaller than the parallelism of Flink's Kafka source operator. See https://stackoverflow.com/a/70101290/2000823 for details.
Original answer:
The problem is almost certainly something related to the timestamps and watermarks.
To verify that timestamps and watermarks are the problem, you could do a quick experiment where you replace the 3-hour-long event time sliding windows with short processing time tumbling windows.
In general it is preferable (but not required) to have the KafkaSource do the watermarking. Using forMonotonousTimestamps in a watermark generator applied after the source, as you are doing now, is a risky move: it will only work correctly if the timestamps in all of the partitions consumed by each parallel instance of the source are processed in order. If more than one Kafka partition is assigned to any of the KafkaSource tasks, that isn't going to happen. On the other hand, if you supply the forMonotonousTimestamps watermarking strategy in the fromSource call (rather than noWatermarks), then all that is required is that the timestamps be in order on a per-partition basis, which I imagine is the case.
As troubling as that is, it's probably not enough to explain why the windows don't produce any results. Another possible root cause is that the test data set doesn't include any events with timestamps after the first window, so that window never closes.
Do you have a sink? If not, that would explain things.
You can use the Flink dashboard to help debug this. Look to see if the watermarks are advancing in the window tasks. Turn on checkpointing, and then look to see how much state the window task has -- it should have some non-zero amount of state.
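A minimal sketch of the recommended setup, supplying the watermark strategy in the fromSource call; the bootstrap servers, topic, and group id are illustrative assumptions, not taken from the question:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")           // illustrative address
                .setTopics("flink_input")                        // illustrative topic
                .setGroupId("my-group")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Supply the watermark strategy here, in fromSource, so that watermarking
        // happens per Kafka partition instead of after the (possibly multi-partition) source.
        DataStream<String> stream = env.fromSource(
                source,
                WatermarkStrategy.<String>forMonotonousTimestamps(),
                "Kafka Source");

        stream.print();
        env.execute("KafkaSource example");
    }
}

If some partitions can sit empty, chaining .withIdleness(...) onto the strategy keeps idle partitions from holding back the watermark.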
QUESTION
When I go to https://cloud.google.com/dataproc, I see this ...
"Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."
But gcloud dataproc jobs submit doesn't list all of them; it lists only 8 (hadoop, hive, pig, presto, pyspark, spark, spark-r, spark-sql). Any idea why?
ANSWER
Answered 2021-Oct-01 at 17:18
Some OSS components are offered as Dataproc Optional Components. Not all of them have a job submit API; some (e.g., Anaconda, Jupyter) don't need one, and others (e.g., Flink, Druid) might be added in the future.
Some other OSS components are offered as libraries, e.g., GCS connector, BigQuery connector, Apache Parquet.
QUESTION
I'm starting a new Flink application to allow my company to perform lots of reporting. We have an existing legacy system with most of the data we need held in SQL Server databases. We will need to consume data from these databases initially, before starting to consume more data from newly deployed Kafka streams.
I've spent a lot of time reading the Flink book and web pages but I have some simple questions and assumptions I hope you can help with so I can progress.
Firstly, I want to use the DataStream API so we can consume both historic data and real-time data. I do not think I want to use the DataSet API, but I also don't see the point in using the SQL/Table APIs, as I would prefer to write my functions in Java classes. I need to maintain my own state, and it seems DataStream keyed functions are the way to go.
Now that I am trying to actually write code against our production databases, I need to be able to read in "streams" of data with SQL queries. There does not appear to be a JDBC source connector, so I think I have to make the JDBC call myself and then possibly create a DataSource using env.fromElements(). Obviously this is a "bounded" data set, but how else am I meant to get historic data loaded in? In the future I want to include a Kafka stream as well, which will only have a few weeks' worth of data, so I imagine I will sometimes need to merge data from a SQL Server/Snowflake database with a live stream from Kafka. What is the best practice for this, as I don't see examples discussing it?
Regarding retrieving data from a JDBC source, I have also seen some examples using a StreamingTableEnvironment. Am I meant to use this instead, to query data from a JDBC connection into my DataStream functions, etc.? Again, I want to write my functions in Java, not Flink SQL. Is it best practice to use a StreamingTableEnvironment to query JDBC data if I'm only using the DataStream API?
...ANSWER
Answered 2021-Aug-11 at 07:39
The following approaches can be used to read from the database and create a DataStream:
- You can use RichParallelSourceFunction, where you run a custom query against your database and produce a DataStream from it. The SQL query via a JDBC driver is fired inside your extension of the RichParallelSourceFunction class (see the sketch after this answer).
- Using the Table/DataStream API: it is possible to query a database by creating a JDBC catalog and then transforming the table into a stream.
- An alternative, perhaps more expensive, solution: you can use the Flink CDC connectors, which provide source connectors for Apache Flink that ingest changes from different databases using change data capture (CDC).
Then you can add Kafka as a source and get a DataStream.
So, briefly, your pipeline could look as follows: you have both sources transformed into DataStreams; you can join these streams using, for example, a CoProcessFunction, which also gives you the possibility to maintain state and use it in your business logic. Finally, sink your final output to a database, to Kafka, or even to AWS S3 buckets using a sink function.
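A minimal sketch of the first approach, a RichParallelSourceFunction that fires a JDBC query and emits one record per row; the JDBC URL, query, and row format are illustrative assumptions, not from the answer:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcRowSource extends RichParallelSourceFunction<String> {

    private transient Connection connection;
    private volatile boolean running = true;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Illustrative connection string; replace with your SQL Server / Snowflake details.
        connection = DriverManager.getConnection("jdbc:sqlserver://localhost;databaseName=reporting");
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Note: every parallel subtask runs this query; partition the query by subtask index
        // or set the source parallelism to 1 to avoid emitting duplicate rows.
        try (PreparedStatement stmt = connection.prepareStatement("SELECT id, payload FROM events");
             ResultSet rs = stmt.executeQuery()) {
            while (running && rs.next()) {
                // Emit one element per row; in practice you would map to a POJO.
                ctx.collect(rs.getString("id") + "," + rs.getString("payload"));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}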
QUESTION
I am reading an example that uses MySQL as a lookup table in a temporal table join.
...ANSWER
Answered 2021-Aug-07 at 14:24
You'll find some relevant details in the docs for the Table / JDBC connector: https://ci.apache.org/projects/flink/flink-docs-stable/docs/connectors/table/jdbc/#features. See especially the section describing the Lookup Cache, which says:
The JDBC connector can be used in a temporal join as a lookup source (aka dimension table). Currently, only sync lookup mode is supported.
By default, the lookup cache is not enabled. You can enable it by setting both lookup.cache.max-rows and lookup.cache.ttl.
The lookup cache is used to improve the performance of temporal joins with the JDBC connector. By default, the lookup cache is not enabled, so all requests are sent to the external database. When the lookup cache is enabled, each process (i.e. TaskManager) holds a cache. Flink looks up the cache first and only sends requests to the external database on a cache miss, updating the cache with the rows returned. The oldest rows in the cache expire when the cache reaches the maximum number of cached rows (lookup.cache.max-rows) or when a row exceeds the maximum time to live (lookup.cache.ttl). The cached rows might not be the latest; users can tune lookup.cache.ttl to a smaller value to get fresher data, but this may increase the number of requests sent to the database. So this is a balance between throughput and correctness.
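A minimal sketch of enabling that lookup cache on a JDBC dimension table from Java, wrapping the DDL in a TableEnvironment call; the table name, URL, and cache values are illustrative assumptions:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class JdbcLookupCacheExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a JDBC-backed dimension table with the lookup cache enabled.
        tEnv.executeSql(
                "CREATE TABLE currency_rates (" +
                "  currency STRING," +
                "  rate DECIMAL(10, 4)," +
                "  PRIMARY KEY (currency) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://localhost:3306/rates_db'," +  // illustrative URL
                "  'table-name' = 'currency_rates'," +
                "  'lookup.cache.max-rows' = '5000'," +                // cache at most 5000 rows
                "  'lookup.cache.ttl' = '10min'" +                     // expire cached rows after 10 minutes
                ")");
    }
}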
QUESTION
In Flink 1.12, Flink introduced a new concept called a versioned table. It is very similar to a temporal table function, but I am kind of confused about the difference between these two concepts.
A temporal table function only supports append-only tables (please correct me if I am wrong), such as:
ANSWER
Answered 2021-Aug-02 at 04:34
You can view the temporal table function as a previous version of versioned tables, of sorts (this is also stated in the legacy features part of the Flink docs).
In your first example, you create an append-only table which will hold multiple rows with the same currency key (e.g. EURO), which makes it ineligible to be a primary key.
Can you call an append-only table a versioned table? The definition of versioned tables in the docs says:
Flink SQL can define versioned tables over any dynamic table with a PRIMARY KEY constraint and time attribute.
This implies that an append-only table cannot serve as a versioned table, since it can't, as we said, uphold the primary key constraint.
But since the append-only table in your example does hold the relevant information to become a versioned table, containing inserts as well as updates and deletes, we can turn it into one with a deduplication query, as you posted above.
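As a hedged illustration of that last step, a deduplication query of the kind the Flink docs describe for deriving a versioned table; the table and column names, and the datagen connector used for self-containment, are assumptions rather than part of the question:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class DeduplicationToVersionedTable {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Append-only source of rate updates; the datagen connector is used purely for illustration.
        tEnv.executeSql(
                "CREATE TABLE currency_rates (" +
                "  currency STRING," +
                "  rate DECIMAL(10, 4)," +
                "  update_time TIMESTAMP(3)," +
                "  WATERMARK FOR update_time AS update_time" +
                ") WITH ('connector' = 'datagen')");

        // Deduplicate on the currency key, keeping the latest row per currency.
        // The resulting view has a derived primary key and a time attribute,
        // so Flink can treat it as a versioned table.
        tEnv.executeSql(
                "CREATE VIEW versioned_rates AS " +
                "SELECT currency, rate, update_time " +
                "FROM (" +
                "  SELECT *, ROW_NUMBER() OVER (PARTITION BY currency ORDER BY update_time DESC) AS rownum" +
                "  FROM currency_rates" +
                ") WHERE rownum = 1");
    }
}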
QUESTION
I have a requirement to delay processing of some of the events.
e.g., I have three events (published on Kafka):
- A (id: 1, retryAt: now)
- B (id: 2, retryAt: 10 minutes later)
- C (id: 3, retryAt: now)
I need to process record A and C immediately while record B needs to be processed Ten minutes later. Is this something feasible to achieve in Apache Flink?
So far whatever I have researched, it seems, "Triggers" is something which might help to achieve it in Flink but have not been able to implement it correctly yet.
I looked through Kafka documentation too, but it doesn't look feasible there.
...ANSWER
Answered 2021-Jul-29 at 16:16
Triggers are for windows, but windowing doesn't seem appropriate for your use case.
A better solution would be to use timers with a KeyedProcessFunction. Depending on whether you want to wait for 10 minutes of processing time or 10 minutes of event time, you'll choose processing-time timers or event-time timers.
You'll also need to use Flink state to store the events that need to be processed later (a sketch follows below).
You'll find the documentation for process functions here. There are some additional examples in the Flink training, here and here.
FWIW, Flink's Stateful Functions API might be a better fit for what you're doing, in which case you would use delayed messages.
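A minimal sketch of the timer approach, assuming processing-time semantics, a keyed stream of events carrying a retryAt timestamp, and a hypothetical Event POJO (none of these names come from the original answer):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class DelayedProcessingFunction extends KeyedProcessFunction<String, DelayedProcessingFunction.Event, DelayedProcessingFunction.Event> {

    // Events buffered until their retryAt time has been reached.
    private transient ListState<Event> pending;

    @Override
    public void open(Configuration parameters) {
        pending = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending-events", Event.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        if (event.retryAt <= now) {
            out.collect(event); // process immediately (events A and C in the question)
        } else {
            pending.add(event); // buffer and wake up when retryAt is reached (event B)
            ctx.timerService().registerProcessingTimeTimer(event.retryAt);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        // Emit every buffered event whose retryAt time has passed, keep the rest.
        List<Event> stillPending = new ArrayList<>();
        for (Event event : pending.get()) {
            if (event.retryAt <= timestamp) {
                out.collect(event);
            } else {
                stillPending.add(event);
            }
        }
        pending.update(stillPending);
    }

    /** Hypothetical event type; a real job would use its own POJO. */
    public static class Event {
        public String id;
        public long retryAt; // epoch millis at which the event may be processed
    }
}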
QUESTION
I have a Flink 1.11 job that consumes messages from a Kafka topic, keys them, filters them (keyBy followed by a custom ProcessFunction), and saves them into the db via JDBC sink (as described here: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/connectors/jdbc.html)
The Kafka consumer is initialized with these options:
...ANSWER
Answered 2021-Jul-01 at 13:09
There are 3 options that I can see:
- Try out the JDBC 1.13 connector with your Flink version. There is a good chance it might just work.
- If that doesn't work immediately, check if you can backport it to 1.11. There shouldn't be too many changes.
- Write your own 2-phase-commit sink, either by extending TwoPhaseCommitSinkFunction or by implementing your own SinkFunction with CheckpointedFunction and CheckpointListener. Basically, you create a new transaction after a successful checkpoint and commit it in notifyCheckpointComplete (a skeleton follows below).
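A skeletal sketch of the TwoPhaseCommitSinkFunction route, with a hypothetical JdbcTransaction helper standing in for whatever transaction handle your database driver provides; everything here is an assumption about shape, not a drop-in implementation:

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class TwoPhaseJdbcSink extends TwoPhaseCommitSinkFunction<String, TwoPhaseJdbcSink.JdbcTransaction, Void> {

    public TwoPhaseJdbcSink() {
        // Serializers for the transaction handle and the (unused) context.
        super(new KryoSerializer<>(JdbcTransaction.class, new ExecutionConfig()),
              VoidSerializer.INSTANCE);
    }

    @Override
    protected JdbcTransaction beginTransaction() {
        // Open a new database transaction for the records of the next checkpoint interval.
        return new JdbcTransaction();
    }

    @Override
    protected void invoke(JdbcTransaction transaction, String value, Context context) {
        // Buffer or write the record inside the open transaction (not committed yet).
        transaction.add(value);
    }

    @Override
    protected void preCommit(JdbcTransaction transaction) {
        // Flush pending writes; after this point the transaction must be committable.
    }

    @Override
    protected void commit(JdbcTransaction transaction) {
        // Called once the checkpoint that covers this transaction has completed.
        transaction.commit();
    }

    @Override
    protected void abort(JdbcTransaction transaction) {
        // Roll back on failure or restore.
        transaction.rollback();
    }

    /** Hypothetical stand-in for a real JDBC transaction handle. */
    public static class JdbcTransaction {
        public void add(String value) { /* buffer the row */ }
        public void commit() { /* COMMIT */ }
        public void rollback() { /* ROLLBACK */ }
    }
}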
QUESTION
I am reading the source code of SingleOutputStreamOperator#returns; its Javadoc is:
ANSWER
Answered 2021-Jun-30 at 08:55
When the docs say NonInferrableReturnType, it means that we can use the type variable T, or any other letter that you prefer. So you can create a MapFunction that returns a T. But then you have to use .returns(TypeInformation.of(String.class)), for example, if your goal is to return a String.
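A small illustration of that point, using a generic MapFunction whose return type Flink cannot infer, so the type is supplied explicitly with returns(); the surrounding pipeline is an assumption for the sake of the example:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReturnsExample {

    // A generic mapper: the output type T is erased at runtime,
    // so Flink cannot infer it from the function alone.
    public static class Parser<T> implements MapFunction<String, T> {
        @SuppressWarnings("unchecked")
        @Override
        public T map(String value) {
            return (T) value; // illustrative only: treats the input as the target type
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> result = env.fromElements("a", "b", "c")
                .map(new Parser<String>())
                // Without this hint, Flink cannot determine the output type of the mapper.
                .returns(TypeInformation.of(String.class));

        result.print();
        env.execute("returns() example");
    }
}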
QUESTION
I am researching building a Flink pipeline without a data sink, i.e. my pipeline ends when it makes a successful API call to a datastore.
In that case, if we don't use a sink operator, how will checkpointing work?
Checkpointing is based on the concept of a pre-checkpoint epoch (all events that are persisted in state or emitted into sinks) and a post-checkpoint epoch. Is having a sink required for a Flink pipeline?
...ANSWER
Answered 2021-Jun-09 at 16:43
Yes, sinks are required as part of Flink's execution model:
DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for example write the data to files, or to standard output (for example the command line terminal).
One could argue that the call to your datastore is the actual sink implementation that you could use. You could define your own sink and execute the datastore call there (a sketch follows below).
I don't know the details of your datastore, but one could assume that you are serializing these events and sending them to the datastore in some way. In that case, you could flow all your elements to the sink operator and store each of them in some ListState, which you can continuously offload and send. This way, if your application needs to be upgraded, in-flight records will not be lost; they will be recovered and sent once the job has restored.
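A minimal sketch of pushing the datastore call into a custom sink, with a hypothetical DatastoreClient standing in for whatever API client the job actually uses (the client and its methods are assumptions, not a real library):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class DatastoreSink extends RichSinkFunction<String> {

    private transient DatastoreClient client;

    @Override
    public void open(Configuration parameters) {
        // Hypothetical client; replace with your datastore's API client.
        client = new DatastoreClient("https://datastore.example.com");
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        // The API call that previously ended the pipeline now lives inside the sink,
        // so it participates in Flink's execution model (and in checkpointing).
        client.send(value);
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close();
        }
    }

    /** Hypothetical stand-in for a real datastore API client. */
    public static class DatastoreClient {
        DatastoreClient(String endpoint) { }
        void send(String record) { /* perform the API call */ }
        void close() { }
    }
}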
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Install flink
You can use flink like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the flink component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org. For Gradle installation, please refer to gradle.org.