simple-backup | Hosted on Google AppEngine | Continuous Backup library
kandi X-RAY | simple-backup Summary
Backup and export for Simplenote. Hosted on Google AppEngine
Top functions reviewed by kandi - BETA
- Represent a Python object
- Represent a mapping
- Get all bases of cls
- Represent the given data
- List all notes
- Get token from server
- Get notes from an index
- Get notes
- Get a specific note
- Represent a string
- Represent a float value
- Emit events
- Parse a document end event
- Expect the first flow mapping key
- Expect a document end event
- Expect a flow sequence item
- Parse an indentless sequence entry
- Parse a flow mapping
- Parse a block mapping
- Parse a stream start event
- Expect the next flow sequence item
- Represent a complex number
- Expect stream start event
- Parse an implicit document start
- Parse a flow sequence entry
- Expect a mapping key
simple-backup Key Features
simple-backup Examples and Code Snippets
Community Discussions
Trending Discussions on simple-backup
QUESTION
What I'm doing: I'm building a system in which one Cloud Pub/Sub topic will be read by dozens of Apache Beam pipelines in streaming mode. Each time I deploy a new pipeline, it should first process several years of historic data (stored in BigQuery).
The problem: If I replay historic data into the topic whenever I deploy a new pipeline (as suggested here), it will also be delivered to every other pipeline currently reading the topic, which would be wasteful and very costly. I can't use Cloud Pub/Sub Seek (as suggested here), as it stores a maximum of 7 days of history (more details here).
The question: What is the recommended pattern to replay historic data into new Apache Beam streaming pipelines with minimal overhead (and without causing event time/watermark issues)?
Current ideas: I can currently think of three approaches to solving the problem; however, none of them seems very elegant, and I have not seen any of them mentioned in the documentation, common patterns (part 1 or part 2), or elsewhere. They are:
1. Ideally, I could use Flatten to merge the real-time ReadFromPubSub with a one-off BigQuerySource; however, I see three potential issues: a) I can't account for data that has already been published to Pub/Sub but hasn't yet made it into BigQuery, b) I am not sure whether the BigQuerySource might inadvertently be rerun if the pipeline is restarted, and c) I am unsure whether BigQuerySource works in streaming mode (per the table here).
2. I create a separate replay topic for each pipeline and then use Flatten to merge the ReadFromPubSubs for the main topic and the pipeline-specific replay topic. After deployment of the pipeline, I replay historic data to the pipeline-specific replay topic.
3. I create dedicated topics for each pipeline and deploy a separate pipeline that reads the main topic and broadcasts messages to the pipeline-specific topics. Whenever a replay is needed, I can replay data into the pipeline-specific topic.
ANSWER
Answered 2019-Mar-08 at 22:47
Out of your three ideas:
- The first one will not work because currently the Python SDK does not support unbounded reads from bounded sources (meaning that you can't add a ReadFromBigQuery to a streaming pipeline).
- The third one sounds overly complicated, and maybe costly.
I believe your best bet at the moment is as you said, to replay your table into an extra PubSub topic that you Flatten with your main topic, as you rightly pointed out.
I will check if there's a better solution, but for now, option #2 should do the trick.
Also, I'd refer you to an interesting talk from Lyft on doing this for their architecture (in Flink).
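To make option #2 concrete, here is a minimal Apache Beam (Python SDK) sketch of the recommended pattern: the shared main topic is flattened with a pipeline-specific replay topic into which historic data can be published after deployment. The project and topic names are placeholders, and downstream transforms are omitted.

```python
# Minimal sketch of option #2 (not from the original post): read the shared
# main topic and a pipeline-specific replay topic separately, then merge them
# with Flatten. Project and topic names below are placeholders.
import apache_beam as beam
from apache_beam.io import ReadFromPubSub
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    live = p | "ReadMainTopic" >> ReadFromPubSub(
        topic="projects/my-project/topics/main-topic")
    replay = p | "ReadReplayTopic" >> ReadFromPubSub(
        topic="projects/my-project/topics/replay-topic-for-this-pipeline")

    # Merge the real-time stream with the replayed historic data.
    merged = (live, replay) | "MergeSources" >> beam.Flatten()
    # Downstream transforms (windowing, aggregation, sinks) would follow here.
```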
QUESTION
The post sets Max Poll Records to 1 to guarantee that the events in one flow file come from the same partition. https://community.hortonworks.com/articles/223849/simple-backup-and-restore-of-kafka-messages-via-ni.html
Does that mean if using Message Demarcator, the events in the same FlowFile can be from different partitions?
From the source code, I think the above is true? https://github.com/apache/nifi/blob/ea9b0db2f620526c8dd0db595cf8b44c3ef835be/nifi-nar-bundles/nifi-kafka-bundle/nifi-kafka-0-9-processors/src/main/java/org/apache/nifi/processors/kafka/pubsub/ConsumerLease.java#L366
ANSWER
Answered 2019-Jan-18 at 16:08
When using a demarcator, the processor creates a bundle per topic/partition, so you will get flow files where all messages are from the same topic partition.
The reason that post set Max Poll Records to 1 was explained in the post: the key of the messages is only available when there is one message per flow file, and they needed the key in this case. In general, it is better not to do this and to have many messages per flow file.
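For illustration only (this is not NiFi code), the sketch below assumes a newline was configured as the Message Demarcator: the content of one flow file is then just the record values from a single topic partition joined by that demarcator, and the per-record Kafka keys are not carried along.

```python
# Hypothetical example of the content of a single demarcated flow file.
# With a newline ("\n") Message Demarcator, ConsumeKafka bundles records
# from one topic partition into one flow file, separated by the demarcator.
flowfile_content = b"record-1\nrecord-2\nrecord-3"

# Splitting on the demarcator recovers the individual record values;
# the Kafka message keys are not preserved in this bundled form.
for record in flowfile_content.split(b"\n"):
    print(record.decode("utf-8"))
```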
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install simple-backup
You can use simple-backup like any standard Python library. You will need a development environment consisting of a Python distribution including header files, a compiler, pip, and git. Make sure that your pip, setuptools, and wheel are up to date. When using pip, it is generally recommended to install packages in a virtual environment to avoid changes to the system Python.
Support