
hadoop-gpu | Koichi Shirahata optimized Hadoop Distribution

by koichishirahata | Java | Version: Current | License: Apache-2.0


kandi X-RAY | hadoop-gpu Summary

hadoop-gpu is a Java library typically used in Big Data, Docker, and Hadoop applications. hadoop-gpu has no bugs, no vulnerabilities, and a permissive license, and it has low support. However, its build file is not available. You can download it from GitHub.
Koichi Shirahata optimized Hadoop Distribution, focused on high-performance MapReduce with GPGPU. Here is our paper: Koichi Shirahata, Hitoshi Sato, and Satoshi Matsuoka. "Hybrid Map Task Scheduling for GPU-based Heterogeneous Clusters." In Proceedings of the 1st International Workshop on Theory and Practice of MapReduce (MAPRED'2010), pp. 466-471, Indianapolis, USA, November 2010.

This software modifies and includes Hadoop-0.20.1, The Apache Software Foundation. You can watch a demo of a k-means application running on both CPU and GPU at http://www.youtube.com/watch?v=4CFGR0TFcNA. The demo shows our customized web interface, in which blue bars show tasks running on CPU and green bars show tasks running on GPU. Please read CHANGES.txt for more detailed modifications.

Build system and apps:

    $ hadoop accel \
        -D hadoop.pipes.java.recordreader=true \
        -D hadoop.pipes.java.recordwriter=true \
        -output output \
        -cpubin bin/cpu-kmeans \
        -gpubin bin/gpu-kmeans \
        -input input/ik2_sample

Open Source License: All code offered by Koichi Shirahata is licensed under the Apache License, Version 2.0. Other code follows its original license announcement.

Support

  • hadoop-gpu has a low active ecosystem.
  • It has 24 stars, 13 forks, and 13 watchers.
  • It had no major release in the last 12 months.
  • There is 1 open issue and 0 closed issues; on average, issues are closed in 1903 days. There are no pull requests.
  • It has a neutral sentiment in the developer community.
  • The latest version of hadoop-gpu is current.

Quality

  • hadoop-gpu has 0 bugs and 0 code smells.

Security

  • hadoop-gpu has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
  • hadoop-gpu code analysis shows 0 unresolved vulnerabilities.
  • There are 0 security hotspots that need review.

License

  • hadoop-gpu is licensed under the Apache-2.0 License. This license is Permissive.
  • Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

  • hadoop-gpu releases are not available. You will need to build from source code and install.
  • hadoop-gpu has no build file. You will need to create the build yourself to build the component from source.
  • Installation instructions are not available. Examples and code snippets are available.
  • hadoop-gpu saves you 3,044,897 person hours of effort in developing the same functionality from scratch.
  • It has 1,169,853 lines of code, 18,688 functions, and 3,875 files.
  • It has medium code complexity. Code complexity directly impacts the maintainability of the code.
Top functions reviewed by kandi - BETA

kandi has reviewed hadoop-gpu and discovered the below as its top functions. This is intended to give you an instant insight into hadoop-gpu implemented functionality, and help decide if they suit your requirements.

  • Load the number of edits from an input stream.
  • Find the next map task for this job.
  • Dump the properties of a TFile.
  • Generate file chunks.
  • Filter the request attributes.
  • Prepare and move to the end of the stream.
  • Get multiple splits for a set of files.
  • Write a block.
  • Start the data node.
  • Upload local files to HDFS.

hadoop-gpu Key Features

  • Add a CPU and GPU hybrid executable feature on Hadoop Pipes (in hadoop-gpu-0.20.1/src/mapred/org/apache/hadoop/mapred).
  • Add a dynamic hybrid task scheduling feature on Hadoop (in hadoop-gpu-0.20.1/src/mapred/org/apache/hadoop/mapred).

Setup:

  • Set HADOOP_HOME to the hadoop-gpu-{version} directory.
  • Set JAVA_HOME in $HADOOP_HOME/conf/hadoop-env.sh.
  • Set the configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml, masters, slaves), and specify the number of CPU cores / GPU devices in mapred-site.xml.

Build hadoop-gpu:

    $ cd $HADOOP_HOME
    $ ant compile

Build apps (kmeans2D app as an example):

    $ cd $HADOOP_HOME/../apps/pipes/kmeans/cpu-kmeans2D
    $ make
    $ cd $HADOOP_HOME/../apps/pipes/kmeans/gpu-kmeans2D
    $ make

Start hadoop-gpu (same as standard Hadoop):

    $ cd $HADOOP_HOME
    $ bin/hadoop namenode -format
    $ bin/start-all.sh

Put binary and input files into HDFS:

    $ bin/hadoop dfs -mkdir bin
    $ bin/hadoop dfs -mkdir input
    $ bin/hadoop dfs -put $HADOOP_HOME/../apps/pipes/kmeans/cpu-kmeans2D/cpu-kmeans2D bin
    $ bin/hadoop dfs -put $HADOOP_HOME/../apps/pipes/kmeans/gpu-kmeans2D/gpu-kmeans2D bin
    $ bin/hadoop dfs -put $HADOOP_HOME/../data/kmeans/input2D/ik2_sample input

Run apps (kmeans2D app as an example):

    $ ./kmeans2D.sh input/ik2_sample

Copyright (C) 2013 - 2014 Koichi Shirahata. All Rights Reserved.

hadoop-gpu Examples and Code Snippets

Note: if you want to run with a single binary, set the same (CPU or GPU) binary for both cpubin and gpubin.

Community Discussions

Trending Discussions on Big Data:
  • How to group unassociated content
  • Using Spark window with more than one partition when there is no obvious partitioning column
  • What is the best way to store +3 millions records in Firestore?
  • spark-shell throws java.lang.reflect.InvocationTargetException on running
  • For function over multiple rows (i+1)?
  • Filling up shuffle buffer (this may take a while)
  • Designing Twitter Search - How to sort large datasets?
  • Unnest Query optimisation for singular record
  • handling million of rows for lookup operation using python
  • split function does not return any observations with large dataset

QUESTION

How to group unassociated content

Asked 2022-Apr-15 at 12:43

I have a Hive table that records user behavior, like this:

    userid  behavior  timestamp   url
    1       view      1650022601  url1
    1       click     1650022602  url2
    1       click     1650022614  url3
    1       view      1650022617  url4
    1       click     1650022622  url5
    1       view      1650022626  url7
    2       view      1650022628  url8
    2       view      1650022631  url9

About 400 GB is added to the table every day.

I want to order by timestamp ascending, then group the rows so that each 'view' starts a new group that runs until the next 'view'. In the table above, the first 3 lines belong to the same group. Then I subtract the timestamps, e.g. 1650022614 - 1650022601, as the view time.

How can I do this?

I tried the lag and lead functions, and Scala like this:

    val pairRDD: RDD[(Int, String)] = record.map(x => {
        if (StringUtil.isDateString(x.split("\\s+")(0))) {
            partition = partition + 1
            (partition, x)
        } else {
            (partition, x)
        }
    })
                      

or Java like this:

    LongAccumulator part = spark.sparkContext().longAccumulator("part");

    JavaPairRDD<Long, Row> pairRDD = spark.sql(sql).coalesce(1).javaRDD().mapToPair((PairFunction<Row, Long, Row>) row -> {
        if (row.getAs("event") == "pageview") {
            part.add(1L);
        }
        return new Tuple2<>(part.value(), row);
    });

But when the dataset is very large, this code performs terribly.

Save me, please!

ANSWER

Answered 2022-Apr-15 at 12:43

If you use DataFrames, you can build a partition column with a window that sums a flag column whose value is 1 when the partition changes and 0 when it doesn't.

You can transform an RDD to a DataFrame with the sparkSession.createDataFrame() method, as explained in this answer.
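For example, here is a minimal sketch of that conversion (assuming a spark-shell session where spark is predefined; the name rdd and its sample contents are illustrative):

    import org.apache.spark.rdd.RDD

    // Assume `rdd` holds the question's rows as tuples:
    // (userid, behavior, timestamp, url). Two rows shown for brevity.
    val rdd: RDD[(Int, String, Long, String)] = spark.sparkContext.parallelize(Seq(
      (1, "view", 1650022601L, "url1"),
      (1, "click", 1650022602L, "url2")
    ))

    // Import the session's implicits to get .toDF on RDDs of tuples.
    import spark.implicits._

    // Name the columns as in the question; spark.createDataFrame(rdd)
    // is equivalent, minus the column names.
    val df = rdd.toDF("userid", "behavior", "timestamp", "url")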

Back to your problem: in your case, the partition changes every time the behavior column is equal to "view". So we can start with this condition:

    import org.apache.spark.sql.functions.col

    val df1 = df.withColumn("is_view", (col("behavior") === "view").cast("integer"))

You get the following dataframe:

    +------+--------+----------+----+-------+
    |userid|behavior|timestamp |url |is_view|
    +------+--------+----------+----+-------+
    |1     |view    |1650022601|url1|1      |
    |1     |click   |1650022602|url2|0      |
    |1     |click   |1650022614|url3|0      |
    |1     |view    |1650022617|url4|1      |
    |1     |click   |1650022622|url5|0      |
    |1     |view    |1650022626|url7|1      |
    |2     |view    |1650022628|url8|1      |
    |2     |view    |1650022631|url9|1      |
    +------+--------+----------+----+-------+

Then you use a window ordered by timestamp to sum over the is_view column:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    val df2 = df1.withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))

That gets you the following dataframe:

    +------+--------+----------+----+-------+---------+
    |userid|behavior|timestamp |url |is_view|partition|
    +------+--------+----------+----+-------+---------+
    |1     |view    |1650022601|url1|1      |1        |
    |1     |click   |1650022602|url2|0      |1        |
    |1     |click   |1650022614|url3|0      |1        |
    |1     |view    |1650022617|url4|1      |2        |
    |1     |click   |1650022622|url5|0      |2        |
    |1     |view    |1650022626|url7|1      |3        |
    |2     |view    |1650022628|url8|1      |1        |
    |2     |view    |1650022631|url9|1      |2        |
    +------+--------+----------+----+-------+---------+

Then you just have to aggregate per userid and partition:

    import org.apache.spark.sql.functions.{max, min}

    val result = df2.groupBy("userid", "partition")
      .agg((max("timestamp") - min("timestamp")).as("duration"))

And you get the following results:

    +------+---------+--------+
    |userid|partition|duration|
    +------+---------+--------+
    |1     |1        |13      |
    |1     |2        |5       |
    |1     |3        |0       |
    |2     |1        |0       |
    |2     |2        |0       |
    +------+---------+--------+

The complete Scala code:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, max, min, sum}

    val result = df
      .withColumn("is_view", (col("behavior") === "view").cast("integer"))
      .withColumn("partition", sum("is_view").over(Window.partitionBy("userid").orderBy("timestamp")))
      .groupBy("userid", "partition")
      .agg((max("timestamp") - min("timestamp")).as("duration"))
                      
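To try the complete code end to end, here is a minimal sketch that rebuilds the question's sample data as the df it expects (again assuming a spark-shell session where spark is predefined). Run this first, then the snippet above:

    import spark.implicits._

    // The sample rows from the question, as (userid, behavior, timestamp, url).
    val df = Seq(
      (1, "view", 1650022601L, "url1"),
      (1, "click", 1650022602L, "url2"),
      (1, "click", 1650022614L, "url3"),
      (1, "view", 1650022617L, "url4"),
      (1, "click", 1650022622L, "url5"),
      (1, "view", 1650022626L, "url7"),
      (2, "view", 1650022628L, "url8"),
      (2, "view", 1650022631L, "url9")
    ).toDF("userid", "behavior", "timestamp", "url")

    // After running the complete code above, this should print
    // the duration table shown in the answer.
    result.show()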

Source: https://stackoverflow.com/questions/71883786

Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

Vulnerabilities

No vulnerabilities reported.

Install hadoop-gpu

You can download it from GitHub.
You can use hadoop-gpu like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the hadoop-gpu component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, refer to maven.apache.org; for Gradle installation, refer to gradle.org.

Support

For new features, suggestions, and bugs, create an issue on GitHub. If you have questions, check and ask on Stack Overflow.

© 2022 Open Weaver Inc.