
flink-learning | flink learning blog

by zhisheng17 | Java | Version: Current | License: Apache-2.0


kandi X-RAY | flink-learning Summary

flink-learning is a Java library typically used in Big Data, Kafka, Spark, and RabbitMQ applications. flink-learning has no bugs and no vulnerabilities, a build file is available, it carries a Permissive License, and it has medium support. You can download it from GitHub.
flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, principles, hands-on practice, performance tuning, and source-code analysis. Includes learning examples for Flink Connectors, Metrics, Libraries, the DataStream API, and the Table API & SQL, plus case studies of large production Flink projects (PV/UV statistics, log storage, real-time deduplication over tens of billions of records, and monitoring/alerting). You are welcome to support my column "Big Data Real-Time Computing Engine Flink in Practice and Performance Optimization".

Support

  • flink-learning has a medium active ecosystem.
  • It has 12001 star(s) with 3345 fork(s). There are 497 watchers for this library.
  • It had no major release in the last 12 months.
  • flink-learning has no issues reported. There are 2 open pull requests and 0 closed requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of flink-learning is current.

Quality

  • flink-learning has 0 bugs and 0 code smells.

Security

  • flink-learning has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • flink-learning code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • flink-learning is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • flink-learning releases are not available. You will need to build from source code and install.
  • Build file is available. You can build the component from source (see the build sketch after this list).
  • Installation instructions, examples and code snippets are available.
  • flink-learning saves you 8274 person hours of effort in developing the same functionality from scratch.
  • It has 14745 lines of code, 626 functions and 388 files.
  • It has low code complexity. Code complexity directly impacts maintainability of the code.
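
As a rough illustration, building a Maven project like this one from source typically looks like the sketch below. The repository URL is inferred from the author's handle, and the commands are the usual Maven defaults, not taken from the project's own instructions.

git clone https://github.com/zhisheng17/flink-learning.git
cd flink-learning
# a standard Maven build; skipping tests is a common choice for a first build
mvn clean package -DskipTests
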
Top functions reviewed by kandi - BETA

kandi has reviewed flink-learning and identified the functions below as its top functions. This is intended to give you an instant insight into the functionality flink-learning implements, and to help you decide whether it suits your requirements.

  • Parses a duration from a string (a hypothetical sketch appears after this list).
  • Main method.
  • Flushes all incoming events.
  • Returns the string representation of the given array.
  • Sets the checkpointing environment.
  • Shuts down the given executor services.
  • Main method.
  • Flat-maps a log event.
  • Builds an OutageMetricEvent from a MetricEvent.
  • Gets rules.
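
For a sense of what the first of these might involve, here is a hypothetical Scala sketch of parsing a duration from a string. It is illustrative only and is not the repository's actual implementation.

import java.time.Duration

// Hypothetical helper: parse strings like "500 ms", "10 s", "5 min" into a Duration.
def parseDuration(text: String): Duration = {
  val pattern = """(\d+)\s*(ms|s|min|h|d)""".r
  text.trim.toLowerCase match {
    case pattern(value, unit) =>
      val v = value.toLong
      unit match {
        case "ms"  => Duration.ofMillis(v)
        case "s"   => Duration.ofSeconds(v)
        case "min" => Duration.ofMinutes(v)
        case "h"   => Duration.ofHours(v)
        case "d"   => Duration.ofDays(v)
      }
    case _ => throw new IllegalArgumentException(s"Cannot parse duration: $text")
  }
}
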


Community Discussions

Trending Discussions on Big Data
  • How to group unassociated content
  • Using Spark window with more than one partition when there is no obvious partitioning column
  • What is the best way to store +3 millions records in Firestore?
  • spark-shell throws java.lang.reflect.InvocationTargetException on running
  • For function over multiple rows (i+1)?
  • Filling up shuffle buffer (this may take a while)
  • Designing Twitter Search - How to sort large datasets?
  • Unnest Query optimisation for singular record
  • handling million of rows for lookup operation using python
  • split function does not return any observations with large dataset

QUESTION

How to group unassociated content

Asked 2022-Apr-15 at 12:43

I have a Hive table that records user behavior, like this:

userid  behavior  timestamp   url
1       view      1650022601  url1
1       click     1650022602  url2
1       click     1650022614  url3
1       view      1650022617  url4
1       click     1650022622  url5
1       view      1650022626  url7
2       view      1650022628  url8
2       view      1650022631  url9

About 400GB is added to the table every day.

I want to order by timestamp ascending, then group the rows so that each 'view' starts a new group that runs until the next 'view'. In the table above, the first 3 lines belong to the same group. Within each group I then want to subtract the timestamps, e.g. 1650022614 - 1650022601, as the view time.

How can I do this?

I tried the lag and lead functions, and Scala like this:

        // `partition` is a driver-side var; each executor gets its own copy,
        // so this counter does not advance globally across the dataset.
        var partition = 0
        val pairRDD: RDD[(Int, String)] = record.map { x =>
            // start a new group whenever the line begins with a date string
            if (StringUtil.isDateString(x.split("\\s+")(0))) {
                partition = partition + 1
            }
            (partition, x)
        }

or Java like this:

        LongAccumulator part = spark.sparkContext().longAccumulator("part");

        // Note: accumulator values read inside a transformation are unreliable
        // (each task sees only its own local copy), and coalesce(1) funnels all
        // rows through a single task.
        JavaPairRDD<Long, Row> pairRDD = spark.sql(sql).coalesce(1).javaRDD()
                .mapToPair((PairFunction<Row, Long, Row>) row -> {
                    if ("pageview".equals(row.getAs("event"))) {
                        part.add(1L);
                    }
                    return new Tuple2<>(part.value(), row);
                });

but when the dataset is very large, this code is just too slow.

Please help.

ANSWER

Answered 2022-Apr-15 at 12:43

If you use a dataframe, you can build the partition number using a window that sums a column whose value is 1 whenever a new partition starts and 0 otherwise.

You can transform an RDD into a dataframe with the sparkSession.createDataFrame() method, as explained in this answer.
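
For instance, a minimal sketch (assuming a SparkSession named spark, with sample rows shaped like the question's table):

// Build an RDD of tuples and convert it to a dataframe with named columns.
val rdd = spark.sparkContext.parallelize(Seq(
  (1, "view",  1650022601L, "url1"),
  (1, "click", 1650022602L, "url2")
))

val df = spark.createDataFrame(rdd).toDF("userid", "behavior", "timestamp", "url")
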

Back to your problem. In your case, a new partition starts every time the behavior column is equal to "view". So we can start with this condition:

import org.apache.spark.sql.functions.col

val df1 = df.withColumn("is_view", (col("behavior") === "view").cast("integer"))

You get the following dataframe:

+------+--------+----------+----+-------+
|userid|behavior|timestamp |url |is_view|
+------+--------+----------+----+-------+
|1     |view    |1650022601|url1|1      |
|1     |click   |1650022602|url2|0      |
|1     |click   |1650022614|url3|0      |
|1     |view    |1650022617|url4|1      |
|1     |click   |1650022622|url5|0      |
|1     |view    |1650022626|url7|1      |
|2     |view    |1650022628|url8|1      |
|2     |view    |1650022631|url9|1      |
+------+--------+----------+----+-------+

Then you use a window ordered by timestamp to sum over the is_view column:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df2 = df1.withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))

This gets you the following dataframe:

+------+--------+----------+----+-------+---------+
|userid|behavior|timestamp |url |is_view|partition|
+------+--------+----------+----+-------+---------+
|1     |view    |1650022601|url1|1      |1        |
|1     |click   |1650022602|url2|0      |1        |
|1     |click   |1650022614|url3|0      |1        |
|1     |view    |1650022617|url4|1      |2        |
|1     |click   |1650022622|url5|0      |2        |
|1     |view    |1650022626|url7|1      |3        |
|2     |view    |1650022628|url8|1      |1        |
|2     |view    |1650022631|url9|1      |2        |
+------+--------+----------+----+-------+---------+

Then, you just have to aggregate per userid and partition:

import org.apache.spark.sql.functions.{max, min}

val result = df2.groupBy("userid", "partition")
  .agg((max("timestamp") - min("timestamp")).as("duration"))

And you get the following results:

+------+---------+--------+
|userid|partition|duration|
+------+---------+--------+
|1     |1        |13      |
|1     |2        |5       |
|1     |3        |0       |
|2     |1        |0       |
|2     |2        |0       |
+------+---------+--------+

The complete Scala code:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, sum}

val result = df
  .withColumn("is_view", (col("behavior") === "view").cast("integer"))
  .withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))
  .groupBy("userid", "partition")
  .agg((max("timestamp") - min("timestamp")).as("duration"))

Source https://stackoverflow.com/questions/71883786

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported

Install flink-learning

If downloads from Maven Central are slow, you can add the Aliyun central mirror to the mirrors section of your Maven settings.xml, as in the sketch below.
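
A typical mirror entry looks like the following; the id and name are arbitrary labels, and the URL is Aliyun's public central mirror:

<settings>
  <mirrors>
    <mirror>
      <id>aliyun-central</id>
      <mirrorOf>central</mirrorOf>
      <name>Aliyun Central Mirror</name>
      <url>https://maven.aliyun.com/repository/central</url>
    </mirror>
  </mirrors>
</settings>
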

Support

For any new features, suggestions, or bugs, create an issue on GitHub. If you have questions, check for and ask them on Stack Overflow.

© 2022 Open Weaver Inc.