flume-interceptor-analytics | Real-time analytics in Apache Flume | Pub Sub library

by jrkinley Java Version: Current License: Apache-2.0

X-Ray Key Features Code Snippets Community Discussions(10)Vulnerabilities Install Support

kandi X-RAY | flume-interceptor-analytics Summary

flume-interceptor-analytics is a Java library typically used in Messaging, Pub Sub, Kafka applications. flume-interceptor-analytics has no bugs, it has no vulnerabilities, it has build file available, it has a Permissive License and it has low support. You can download it from GitHub.

Flume is a distributed service for efficiently collecting, aggregating, and moving large amounts of data to a centralised data store. It's architecture is based on streaming data flows and it uses a simple extensible data model that allows for online analytic application. It is robust and fault tolerant with tuneable reliability mechanisms and many failover and recovery mechanisms. A unit of data in Flume is called an event, and events flow through one or more Flume agents to reach their destination. An event has a byte payload and an optional set of string attributes. An agent is a Java process that hosts the components through which events flow. The components are a combination of sources, channels, and sinks. A Flume source consumes events delivered to it by an external source. When a source receives an event, it stores it into one or more Flume channels. A channel is a passive store that keeps the event until it's consumed by a Flume sink. The sink removes the event from the channel and puts it into an external repository (i.e. HDFS or HBase) or forwards it to the source of the next agent in the flow. The source and sink within a given agent run asynchronously, with the events staged in the channel. Flume agents can be chained together to form multi-hop flows. This allows flows to fan-out and fan-in, and for contextual routing and backup routes to be configured. For more information, see the Apache Flume User Guide.

Support

Quality

Security

License

Reuse

Support

flume-interceptor-analytics has a low active ecosystem.

It has 50 star(s) with 14 fork(s). There are 12 watchers for this library.

It had no major release in the last 6 months.

flume-interceptor-analytics has no issues reported. There are no pull requests.

It has a neutral sentiment in the developer community.

The latest version of flume-interceptor-analytics is current.

Quality

flume-interceptor-analytics has 0 bugs and 0 code smells.

Security

flume-interceptor-analytics has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

flume-interceptor-analytics code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

flume-interceptor-analytics is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

flume-interceptor-analytics releases are not available. You will need to build from source code and install.

Build file is available. You can build the component from source.

Installation instructions, examples and code snippets are available.

Top functions reviewed by kandi - BETA

kandi has reviewed flume-interceptor-analytics and discovered the below as its top functions. This is intended to give you an instant insight into flume-interceptor-analytics implemented functionality, and help decide if they suit your requirements.

Get a JSON representation of the top N events
Gets the item
Returns the number of messages
Returns the top N
Intercept the event
Adds a counter to the rankings list
Gets the current ranking index for a given counter
Intercepts the events
Adds a counter to the rankings list
Gets the current ranking index for a given counter
Gets the events
Gets all the objects and their totals
Gets the current objects and totals
Starts the reaper thread
Gets the tail bucket
Returns the total count for the given object
Gets the hashtags to count
Extract the status update from a Twitter event
Release the interceptor
Register topNMergeInterceptor
Register this HibernateTopN tag
Override this method to register Hashtags
Configure the emit frequency
Get interceptor instances for a given class
Starts the periodic handler
Gets the counters for the given event
Creates a list of counters from the event body
Clear the registry

Get all kandi verified functions for this library.

flume-interceptor-analytics Key Features

No Key Features are available at this moment for flume-interceptor-analytics.

flume-interceptor-analytics Examples and Code Snippets

No Code Snippets are available at this moment for flume-interceptor-analytics.

Community Discussions

Trending Discussions on Pub Sub

Build JSON content in R according Google Cloud Pub Sub message format

BigQuery Table a Pub Sub Topic not working in Apache Beam Python SDK? Static source to Streaming Sink

Pub Sub Lite topics with Peak Capacity Throughput option

How do I add permissions to a NATS User to allow the User to query & create Jestream keyvalue stores?

MSK vs SQS + SNS

Dataflow resource usage

Run code on Python Flask AppEngine startup in GCP

Is there a way to listen for updates on multiple Google Classroom Courses using Pub Sub?

Flow.take(ITEM_COUNT) returning all the elements rather then specified amount of elements

Wrapping Pub-Sub Java API in Akka Streams Custom Graph Stage

QUESTION

Build JSON content in R according Google Cloud Pub Sub message format

Asked 2022-Apr-16 at 09:59

In R, I want to build json content according this Google Cloud Pub Sub message format: https://cloud.google.com/pubsub/docs/reference/rest/v1/PubsubMessage

It have to respect :

...

ANSWER

Answered 2022-Apr-16 at 09:59

Not sure why, but replacing the dataframe by a list seems to work:

Source https://stackoverflow.com/questions/71892778

QUESTION

BigQuery Table a Pub Sub Topic not working in Apache Beam Python SDK? Static source to Streaming Sink

Asked 2022-Apr-02 at 19:41

My basic requirement was to create a pipeline to read from BigQuery Table and then convert it into JSON and pass it onto a PubSub topic.

At first I read from Big Query and tried to write it into Pub Sub Topic but got an exception error saying "Pub Sub" is not supported for batch pipelines. So I tried some workarounds and

I was able to work around this in python by

Reading from BigQuery-> ConvertTo JSON string-> Save as text file in cloud storage (Beam pipeline)

...

ANSWER

Answered 2021-Oct-14 at 20:27

Because your pipeline does not have any unbounded PCollections, it will be automatically run in batch mode. You can force a pipeline to run in streaming mode with the --streaming command line flag.

Source https://stackoverflow.com/questions/69549649

QUESTION

Pub Sub Lite topics with Peak Capacity Throughput option

Asked 2022-Feb-20 at 21:46

We are using Pub Sub lite instances along with reservations, we want to deploy it via Terraform, on UI while creating a Pub Sub Lite we get an option to specify Peak Publish Throughput (MiB/s) and Peak Subscribe Throughput (MiB/s) which is not available in the resource "google_pubsub_lite_topic" as per this doc https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/pubsub_lite_topic.

...

ANSWER

Answered 2022-Feb-20 at 21:46

If you check the bottom of your Google Cloud console screenshot, you can see it suggests to have 4 partitions with 4MiB/s publish and subscribe throughput.

Therefore your Terraform partition_config should match this. Count should be 4 for the 4 partitions, with capacity of 4MiB/s publish and 4MiB/s subscribe for each partition.

The "peak throughput" in web UI is just for convenience to help you choose some numbers here. The actual underlying PubSub Lite API doesn't actually have this field, which is why there is no Terraform setting either. You will notice the sample docs require a per-partiton setting just like Terraform.

eg. https://cloud.google.com/pubsub/lite/docs/samples/pubsublite-create-topic

I think the only other alternative would be to create a reservation attached to your topic with enough throughput units for desired capacity. And then completely omit capacity block in Terraform and let the reservation decide.

Source https://stackoverflow.com/questions/70210745

QUESTION

How do I add permissions to a NATS User to allow the User to query & create Jestream keyvalue stores?

Asked 2022-Feb-14 at 14:46

I have a User that needs to be able to query and create Jetstream keyvalue stores. I attempted to add pub/sub access to $JS.API.STREAM.INFO.* in order to give the User the ability to query and create keyvalue stores:

...

ANSWER

Answered 2022-Jan-31 at 16:16

Should be:

nsc edit user RequestCacheService --allow-pubsub '$JS.API.STREAM.INFO.*'

With single-quotes around the subject. I was under the impression that double & single quotes would escape the $ but apparently only single-quote will escape special characters in the subject.

Source https://stackoverflow.com/questions/70901601

QUESTION

MSK vs SQS + SNS

Asked 2022-Feb-09 at 17:58

I am deciding if I should use MSK (managed kafka from AWS) or a combination of SQS + SNS to achieve a pub sub model?

Background

Currently, we have a micro service architecture but we don't use any messaging service and only use REST apis (dont ask why - related to some 3rd party vendors who designed the architecture). Now, I want to revamp it and start using messaging for communication between micro-services.

Initially, the plan is to start publishing entity events for any other micro service to consume - these events will also be stored in data lake in S3 which will also serve as a base for starting data team.

Later, I want to move certain features from REST to async communication.

Anyway, the main question I have is - should I decide to go with MSK or should I use SQS + SNS for the same? ( I already understand the basic concepts but wanted to understand from fellow community if there are some other pros and cons)?

Thanks in advance

...

ANSWER

Answered 2022-Feb-09 at 17:58

MSK VS SQS+SNS is not really 1:1 comparison. The choice depends on various use cases. Please find out some of specific difference between two

Scalability -> MSK has better scalability option because of inherent design of partitions that allow parallelism and ordering of message. SNS has limitation of 300 publish/Second, to achieve same performance as MSK, there need to have higher number of SNS topic for same purpose.

Example : Topic: Order Service in MSK -> one topic+ 10 Partitions SNS -> 10 topics

if client/message producer use 10 SNS topic for same purpose, then client needs to have information of all 10 SNS topic and distribution of message. In MSK, it's pretty straightforward, key needs to send in message and kafka will allocate the partition based on Key value.

Administration/Operation -> SNS+SQS setup is much simpler compare to MSK. Operational challenge is much more with MSK( even this is managed service). MSK needs more in depth skills to use optimally.
SNS +SQS VS SQS -> I believe you have multiple subscription(fanout) for same message thats why you have refer SNS +SQS. If you have One Subscription for one message, then only SQS is also sufficient.
Replay of message -> MSK can be use for replaying the already processed message. It will be tricky for SQS, though can be achieve by having duplicate queue so that can be use for replay.

Source https://stackoverflow.com/questions/70648467

QUESTION

Dataflow resource usage

Asked 2022-Feb-03 at 21:43

After following the dataflow tutorial, I used the pub/sub topic to big query template to parse a JSON record into a table. The Job has been streaming for 21 days. During that time I have ingested about 5000 JSON records, containing 4 fields (around 250 bytes).

After the bill came this month I started to look into resource usage. I have used 2,017.52 vCPU hr, memory 7,565.825 GB hr, Total HDD 620,407.918 GB hr.

This seems absurdly high for the tiny amount of data I have been ingesting. Is there a minimum amount of data I should have before using dataflow? It seems over powered for small cases. Is there another preferred method for ingesting data from a pub sub topic? Is there a different configuration when setting up a Dataflow Job that uses less resources?

...

ANSWER

Answered 2022-Feb-03 at 21:43

It seems that the numbers you mentioned, correspond to not customizing the job resources. By default streaming jobs use a n1-standar-4 machine:

3 Streaming worker defaults: 4 vCPU, 15 GB memory, 400 GB Persistent Disk.
4 vCPU x 24 hrs x 21 days = 2,016
15 GB x 24 hrs x 21 days = 7,560

If you really need streaming in Dataflow, you will need to pay for resources allocated even if there is nothing to process.

Options:

Optimizing Dataflow

Considering that the number and size of the JSON string you need to process are really small, you can reduce the cost to aprox 1/4 of current charge. You just need to set the job to use a n1-standard-1 machine, which has 1vCPU and 3.75GB memory. Just be careful with max nodes, unless you are planning increase the load, one node may be enough.

Your own way

If you don't really need streaming (not likely), you can just create a function that pulls using Synchronous Pull, and add the part that writes to BigQuery. You can schedule according to your needs.

Cloud functions (my recommendation)

You can create a serverless Event-Driven Cloud Function with a Cloud Pub/Sub trigger. This way, considering your low volume, you can take advantage of the Free Tier and keep the real time processing:

"Cloud Functions provides a perpetual free tier for compute-time resources, which includes an allocation of both GB-seconds and GHz-seconds. In addition to the 2 million invocations, the free tier provides 400,000 GB-seconds, 200,000 GHz-seconds of compute time and 5GB of Internet egress traffic per month."[1]

[1] https://cloud.google.com/functions/pricing

Source https://stackoverflow.com/questions/70972652

QUESTION

Run code on Python Flask AppEngine startup in GCP

Asked 2021-Dec-27 at 16:07

I need to have a TCP client that listens to messages constantly (and publish pub sub events for each message)

Since there is no Kafka in GCP, I'm trying to do it using my flask service (which runs using AppEngine in GCP).

I'm planning on setting the app.yaml as:

...

ANSWER

Answered 2021-Dec-27 at 16:07

I eventually went for implementing a Kafka connector myself and using Kafka.

Source https://stackoverflow.com/questions/70365210

QUESTION

Is there a way to listen for updates on multiple Google Classroom Courses using Pub Sub?

Asked 2021-Dec-22 at 08:48

Goal

Trigger a function which updates Cloud Firestore when a student completes assignments or assignments are added for any course.

Problem

The official docs state that a feed for CourseWorkChangesInfo requires a courseId, and I would like to avoid having a registration and subscription for each course, each running on its own thread.

What I Have

I have managed to get a registration to one course working:

...

ANSWER

Answered 2021-Dec-22 at 08:48

Answer:

This is not possible.

You cannot have a single registration to track course work changes for multiple courses, as you can see here:

Types of feeds

The Classroom API currently offers three types of feed:

Each domain has a roster changes for domain feed, which exposes notifications when students and teachers join and leave courses in that domain.

Each course has a roster changes for course feed, which exposes notifications when students and teachers join and leave courses in that course.

Each course has a course work changes for course feed, which exposes notifications when any course work or student submission objects are created or modified in that course.

File a feature request:

If you think this feature could be useful, I'd suggest you to file a feature request in Issue Tracker using this template.

Reference:

Source https://stackoverflow.com/questions/70428658

QUESTION

Flow.take(ITEM_COUNT) returning all the elements rather then specified amount of elements

Asked 2021-Dec-17 at 21:00

I've a method X that's getting data from the server via pub sub. This method returns a flow. I've another method that subscribes to the flow by method X but only wants to take the first 3 values max from the flow if the data is distinct compared to previous data. I've written the following code

...

ANSWER

Answered 2021-Dec-17 at 19:13

You have a Flow> here, which means every element of this flow is itself a list.

The take operator is applied on the flow, so you will take the 3 first lists of the flow. Each individual list is not limited, unless you use take on the list itself.

So the name transformedListOf3Elements is incorrect, because the list is of an unknown number of elements, unless you filter it somehow in the map.

Source https://stackoverflow.com/questions/70397526

QUESTION

Wrapping Pub-Sub Java API in Akka Streams Custom Graph Stage

Asked 2021-Oct-26 at 13:31

I am working with a Java API from a data vendor providing real time streams. I would like to process this stream using Akka streams.

The Java API has a pub sub design and roughly works like this:

...

ANSWER

Answered 2021-Oct-26 at 13:31

To feed a Source, you don't necessarily need to use a custom graph stage. Source.queue will materialize as a buffered queue to which you can add elements which will then propagate through the stream.

There are a couple of tricky things to be aware of. The first is that there's some subtlety around materializing the Source.queue so you can set up the subscription. Something like this:

Source https://stackoverflow.com/questions/69368519

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install flume-interceptor-analytics

Edit the flume-env.sh file and uncomment the FLUME_CLASSPATH line. Enter the paths to interceptor-analytics-0.0.1-SNAPSHOT.jar and flume-sources-1.0-SNAPSHOT.jar separating them with a colon. Copy the example agent configuration from /src/main/resources/flume-topn-example.conf to /etc/flume-ng/conf/flume.conf.
Install Flume
Build flume-interceptor-analytics $ git clone https://github.com/jrkinley/flume-interceptor-analytics.git $ cd flume-interceptor-analytics $ mvn clean package $ ls target interceptor-analytics-0.0.1-SNAPSHOT.jar
Build or download Cloudera's custom Flume source $ git clone https://github.com/cloudera/cdh-twitter-example.git $ cd cdh-twitter-example/flume-sources $ mvn clean package $ ls target flume-sources-1.0-SNAPSHOT.jar or $ curl -O http://files.cloudera.com/samples/flume-sources-1.0-SNAPSHOT.jar
Add JARs to the Flume classpath $ sudo cp /etc/flume-ng/conf/flume-env.sh.template /etc/flume-ng/conf/flume-env.sh $ vi /etc/flume-ng/conf/flume-env.sh FLUME_CLASSPATH=/path/to/file/interceptor-analytics-0.0.1-SNAPSHOT.jar:/path/to/file/flume-sources-1.0-SNAPSHOT.jar Edit the flume-env.sh file and uncomment the FLUME_CLASSPATH line. Enter the paths to interceptor-analytics-0.0.1-SNAPSHOT.jar and flume-sources-1.0-SNAPSHOT.jar separating them with a colon.
Set the Flume agent name to AnalyticsAgent $ vi /etc/default/flume-ng-agent FLUME_AGENT_NAME=AnalyticsAgent
Set the Flume agent configuration Copy the example agent configuration from /src/main/resources/flume-topn-example.conf to /etc/flume-ng/conf/flume.conf. Add your authentication details for accessing the twitter streaming API: AnalyticsAgent.sources.Twitter.consumerKey = [required] AnalyticsAgent.sources.Twitter.consumerSecret = [required] AnalyticsAgent.sources.Twitter.accessToken = [required] AnalyticsAgent.sources.Twitter.accessTokenSecret = [required] Set where you would like to store the status updates in HDFS: AnalyticsAgent.sinks.TwitterHDFS.hdfs.path = hdfs://[required]:8020/user/flume/tweets/%Y/%m/%d/%H Set where you would like to store the topN results in HDFS: AnalyticsAgent.sinks.TopNHDFS.hdfs.path = hdfs://[required]:8020/user/flume/topn/%Y/%m/%d/%H
Create HDFS directories $ hadoop fs -mkdir /user/flume/tweets $ hadoop fs -mkdir /user/flume/topn
Start the Flume agent $ sudo /etc/init.d/flume-ng-agent start $ tail -100f /var/log/flume-ng/flume.log

Support

For any new features, suggestions and bugs create an issue on GitHub. If you have any questions check and ask questions on community page Stack Overflow .

Find more information at: