A visual analytics framework for Distributed Ripley's K functions based on Apache Spark, including Space K function, Space-Time K function, Local K function and Cross K function. This framework consists of three parts, i.e., front-end web visualization components, Web service API component and back-end Spark-based algorithm packages.
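As a point of reference only (this is not the repository's own code), the classical space K function that the back-end packages distribute can be written, for n points in a study region of area |A|, as K̂(r) = |A| / (n(n-1)) · Σ_{i≠j} 1(d_ij ≤ r). A minimal, non-distributed Scala sketch of that estimator, with illustrative names, looks like this:

import scala.math.hypot

// Illustrative, single-machine sketch of the unweighted space K estimator.
// The repository's Spark packages distribute this computation; names here are placeholders.
case class Point(x: Double, y: Double)

def ripleyK(points: Seq[Point], r: Double, areaSize: Double): Double = {
  val n = points.size
  var pairsWithinR = 0L
  // Count ordered pairs (i, j), i != j, whose distance is at most r.
  for (i <- points.indices; j <- points.indices if i != j) {
    val d = hypot(points(i).x - points(j).x, points(i).y - points(j).y)
    if (d <= r) pairsWithinR += 1
  }
  areaSize * pairsWithinR / (n.toDouble * (n - 1))
}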
QUESTION
How to group unassociated content
Asked 2022-Apr-15 at 12:43
I have a Hive table that records user behavior, like this:
| userid | behavior | timestamp | url |
|---|---|---|---|
| 1 | view | 1650022601 | url1 |
| 1 | click | 1650022602 | url2 |
| 1 | click | 1650022614 | url3 |
| 1 | view | 1650022617 | url4 |
| 1 | click | 1650022622 | url5 |
| 1 | view | 1650022626 | url7 |
| 2 | view | 1650022628 | url8 |
| 2 | view | 1650022631 | url9 |
About 400GB is added to the table every day.
I want to order by timestamp ascending, then group the rows so that each 'view' starts a new group that runs until the next 'view' (so the first 3 rows above form one group), and then subtract the timestamps within each group, e.g. 1650022614 - 1650022601, as the view time.
How can I do this?
I tried the lag and lead functions, and also Scala like this:
val pairRDD: RDD[(Int, String)] = record.map(x => {
  if (StringUtil.isDateString(x.split("\\s+")(0))) {
    partition = partition + 1
    (partition, x)
  } else {
    (partition, x)
  }
})
or java like this
LongAccumulator part = spark.sparkContext().longAccumulator("part");
JavaPairRDD<Long, Row> pairRDD = spark.sql(sql).coalesce(1).javaRDD()
    .mapToPair((PairFunction<Row, Long, Row>) row -> {
      if (row.getAs("event") == "pageview") {
        part.add(1L);
      }
      return new Tuple2<>(part.value(), row);
    });
but when the dataset is very large, this approach performs terribly.
Please help.
ANSWER
Answered 2022-Apr-15 at 12:43
If you use a DataFrame, you can build the partition by using a window that sums a column whose value is 1 when the partition changes and 0 when it does not.
You can transform an RDD into a DataFrame with the sparkSession.createDataFrame() method, as explained in this answer.
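Since your data starts out as an RDD in your attempts, a minimal sketch of that conversion could look like the following; the literal rows and the app name are just placeholders, not taken from your table:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("view-groups").getOrCreate()

// Placeholder rows with the same shape as the Hive table: (userid, behavior, timestamp, url).
val rdd = spark.sparkContext.parallelize(Seq(
  (1, "view",  1650022601L, "url1"),
  (1, "click", 1650022602L, "url2")
))

// createDataFrame accepts an RDD of tuples (or case classes); toDF names the columns.
val df = spark.createDataFrame(rdd).toDF("userid", "behavior", "timestamp", "url")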
Back to your problem. In your case, the partition changes every time the behavior column is equal to "view", so we can start with this condition:
import org.apache.spark.sql.functions.col
val df1 = df.withColumn("is_view", (col("behavior") === "view").cast("integer"))
You get the following DataFrame:
+------+--------+----------+----+-------+
|userid|behavior|timestamp |url |is_view|
+------+--------+----------+----+-------+
|1 |view |1650022601|url1|1 |
|1 |click |1650022602|url2|0 |
|1 |click |1650022614|url3|0 |
|1 |view |1650022617|url4|1 |
|1 |click |1650022622|url5|0 |
|1 |view |1650022626|url7|1 |
|2 |view |1650022628|url8|1 |
|2 |view |1650022631|url9|1 |
+------+--------+----------+----+-------+
Then you use a window partitioned by userid and ordered by timestamp to sum over the is_view column:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val df2 = df1.withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))
Which gets you the following DataFrame:
+------+--------+----------+----+-------+---------+
|userid|behavior|timestamp |url |is_view|partition|
+------+--------+----------+----+-------+---------+
|1 |view |1650022601|url1|1 |1 |
|1 |click |1650022602|url2|0 |1 |
|1 |click |1650022614|url3|0 |1 |
|1 |view |1650022617|url4|1 |2 |
|1 |click |1650022622|url5|0 |2 |
|1 |view |1650022626|url7|1 |3 |
|2 |view |1650022628|url8|1 |1 |
|2 |view |1650022631|url9|1 |2 |
+------+--------+----------+----+-------+---------+
Then, you just have to aggregate per userid and partition:
import org.apache.spark.sql.functions.{max, min}
val result = df2.groupBy("userid", "partition")
.agg((max("timestamp") - min("timestamp")).as("duration"))
And you get the following results:
+------+---------+--------+
|userid|partition|duration|
+------+---------+--------+
|1 |1 |13 |
|1 |2 |5 |
|1 |3 |0 |
|2 |1 |0 |
|2 |2 |0 |
+------+---------+--------+
The complete Scala code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, sum}
val result = df
.withColumn("is_view", (col("behavior") === "view").cast("integer"))
.withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))
.groupBy("userid", "partition")
.agg((max("timestamp") - min("timestamp")).as("duration"))
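Since the source is a Hive table, df can be read directly from it before applying the pipeline above; the table name below is a placeholder, not taken from the question:

// "user_behavior" is a placeholder; substitute your actual Hive table name.
val df = spark.table("user_behavior")

// After building `result` with the code above, inspect the output without truncation.
result.show(false)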