kandi X-RAY | job-server Summary
job-server
Top functions reviewed by kandi - BETA
- Parse field from string
- Initialize the worker
- Parse command line arguments
- Decode a job
- Load job data
- Register the middleware
- Parse header array
- Get random port
- Yield a job
- Called when a job has finished
Community Discussions
Trending Discussions on job-server
QUESTION
I created a simple Go Apache Beam pipeline, and it works well with DirectRunner. I tried to deploy it on a Spark cluster using the following command:
./bin/spark-submit --master=spark://vm:7077 main.go --runner=SparkRunner --job_endpoint=localhost:8099 --artifact_endpoint=localhost:8098 --environment_type=LOOPBACK --output=/tmp/output
Before submitting the application, I started the job_endpoint using the following command:
./gradlew :runners:spark:job-server:runShadow -PsparkMasterUrl=spark://localhost:7077
The job fails on Spark with this error:
WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Exception in thread "main" org.apache.spark.SparkException: Failed to get main class in JAR with error 'null'. Please specify one with --class.
It seems that I need to specify the class argument, but I do not understand what the error means. Can I get some help?
...ANSWER
Answered 2021-Mar-10 at 02:46
spark-submit is a Spark utility that accepts either a Java JAR or a Python script. It doesn't know how to run Go programs.
I updated the Beam Go quickstart guide with instructions for the Spark runner. Let me know if that works for you.
QUESTION
Is it possible to configure the Beam portable runner with Spark configurations? More precisely, is it possible to configure spark.driver.host in the Portable Runner?
Currently, we have Airflow deployed in a Kubernetes cluster, and since we aim to use TensorFlow Extended, we need to use Apache Beam. For our use case Spark would be the appropriate runner, and as Airflow and TensorFlow are written in Python, we would need to use Apache Beam's Portable Runner (https://beam.apache.org/documentation/runners/spark/#portability).
The problem
The portable runner creates the Spark context inside its container and leaves no room for configuring the driver's DNS name, so the executors inside the worker pods cannot communicate with the driver (the job server).
Setup
Following the Beam documentation, the job server was deployed in the same pod as Airflow so that the two containers can use the local network. Job server config:
ANSWER
Answered 2021-Feb-23 at 22:28
I have three solutions to choose from depending on your deployment requirements. In order of difficulty:
- Use the Spark "uber jar" job server. This starts an embedded job server inside the Spark master, instead of using a standalone job server in a container. This would simplify your deployment a lot, since you would not need to start the beam_spark_job_server container at all.
QUESTION
I am trying to configure Spark job-server for Mesos cluster deployment mode. I have set spark.master = "mesos://mesos-master:5050" in the jobserver config.
When I try to create a context on job-server, it fails with the following exception:
...ANSWER
Answered 2017-May-29 at 08:58
I was setting the MESOS_NATIVE_JAVA_LIBRARY environment variable for my user, but I was running job-server with sudo privileges.
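The likely mechanism behind this answer is that sudo typically starts the command with a reset environment, so a variable exported in the user's shell never reaches the server process. The Python sketch below only simulates that effect: it starts a child process once with the variable present and once with it stripped out. The library path is a hypothetical placeholder, and the helper name child_sees is made up for this illustration.

```python
import os
import subprocess
import sys

VAR = "MESOS_NATIVE_JAVA_LIBRARY"


def child_sees(env):
    """Start a child Python process with `env` and return the value it sees for VAR."""
    probe = f"import os; print(os.environ.get({VAR!r}, '<unset>'))"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Environment of the user who exported the variable (path is a hypothetical example).
user_env = dict(os.environ)
user_env[VAR] = "/usr/local/lib/libmesos.so"

# Same environment with the variable removed, standing in for what a process
# started through an environment-resetting sudo would receive.
scrubbed_env = {key: value for key, value in user_env.items() if key != VAR}

if __name__ == "__main__":
    print("as user:      ", child_sees(user_env))
    print("scrubbed env: ", child_sees(scrubbed_env))
```

If this is indeed the cause, the variable has to be set in the environment that the server process actually receives, not only in the interactive user's shell.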
QUESTION
I'm trying to process the data streaming from Apache Kafka using the Python SDK for Apache Beam with the Flink runner. After running Kafka 2.4.0 and Flink 1.8.3, I follow these steps:
1) Compile and run Beam 2.16 with Flink 1.8 runner.
...ANSWER
Answered 2019-Dec-28 at 07:51
Disclaimer: this is my first encounter with the Apache Beam project.
It seems that Kafka consumer support is quite a fresh thing in Beam (at least in the Python interface) according to this JIRA. Apparently, there is still a problem with FlinkRunner combined with this new API. Even though your code is technically correct, it will not run correctly on Flink. There is a patch available, which looks more like a quick fix than a final solution to me. It requires recompilation and thus is not something I would propose using in production. If you are just getting started with the technology and don't want to be blocked, feel free to try it out.
QUESTION
I'm interested only in query performance and the architectural differences behind it. All the answers I've seen were outdated or didn't give me enough context on WHY Impala is better for ad hoc queries.
Of the 3 considerations below, only the 2nd explains why Impala is faster on bigger datasets. Could you please comment on the following statements?
1) Impala doesn't lose time on query pre-initialization, meaning the impalad daemons are always running and ready. On the other hand, Spark Job Server provides a persistent context for the same purpose.
2) Impala is in-memory and can spill data to disk, with a performance penalty, when the data doesn't fit in RAM. The same is true for Spark. The main difference is that Spark is written in Scala and has JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented in C++ and has high hardware requirements: 128-256+ GB of RAM recommended. This is very significant, but should benefit Impala only on datasets that require 32-64+ GB of RAM.
3) Impala is integrated with the Hadoop infrastructure. AFAIK the main reason to use Impala over other in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. This means Impala usually uses the same storage/data/partitioning/bucketing as Spark can use, and doesn't gain any extra benefit from the data structure compared to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD: Questions update:
I. Why does Impala recommend 128+ GB of RAM? What is the implementation language of each of Impala's components? The docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine.". If impalad is Java, then which parts are written in C++? Is there something between impalad and the columnar data? Are 256 GB of RAM required for impalad or for some other component?
II. Impala loses all in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanisms to boost JOIN performance compared to Spark?
III. Impala uses a Multi-Level Service Tree (something like the Dremel engine, see "Execution model" here) vs Spark's Directed Acyclic Graph. What does MLST vs DAG actually mean in terms of ad hoc query performance? Or is it a better fit for a multi-user environment?
...ANSWER
Answered 2019-Oct-31 at 12:08
First off, I don't think a comparison of a general-purpose distributed computing framework and a distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned: Impala query coordinators have everything (table metadata from Hive MetaStore + block locations from NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
The second biggie would probably be the shuffle implementation, with Spark writing temp files to disk at stage boundaries against Impala trying to keep everything in memory. This leads to a radical difference in resilience: while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is the work distribution mechanism: compiled whole-stage codegens sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As for specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization), they could be on par today or will be in the near future.
QUESTION
I want to run a Spark job on Spark Jobserver. During execution, I got an exception:
stack:
java.lang.RuntimeException: scala.ScalaReflectionException: class com.some.example.instrument.data.SQLMapping in JavaMirror with org.apache.spark.util.MutableURLClassLoader@55b699ef of type class org.apache.spark.util.MutableURLClassLoader with classpath [file:/app/spark-job-server.jar] and parent being sun.misc.Launcher$AppClassLoader@2e817b38 of type class sun.misc.Launcher$AppClassLoader with classpath [.../classpath jars/] not found.
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1$$typecreator15$1.apply(DataRetriever.scala:136)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1.apply(DataRetriever.scala:136)
at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1.apply(DataRetriever.scala:135)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
In DataRetriever I convert a simple case class to a Dataset.
Case class definition:
...ANSWER
Answered 2018-Mar-28 at 10:34
Calling toDS() inside a future causes the ScalaReflectionException.
I decided to construct the Dataset outside future.map.
You can verify that a Dataset can't be constructed in future.map with this example job.
QUESTION
I have added the job-server 0.9.0 dependencies to build.sbt by adding
...ANSWER
Answered 2019-Jun-18 at 16:14
I figured it out: sparkSessionJob is located in job-server-extras, so I just had to add
QUESTION
I observed strange behavior in my Doctrine object. In my Symfony project I'm using the Doctrine ORM to save my data in a MySQL database. This works normally in most situations. I'm also using Gearman in my project, a framework that allows applications to complete tasks in parallel. I have a Gearman job-server running on the same machine as my Apache, and I have registered a Gearman worker on the same machine in a separate 'screen' session using the screen window manager. This way, I always have access to the standard console output of the function registered for the Gearman worker.
In the Gearman worker function I'm invoking, I have access to the Doctrine object via $doctrine = $this->getContainer()->get('doctrine') and it works almost normally. But when I have changed some data in my database, Doctrine still uses the old data that was stored in the database before. I'm totally confused, because I expected that by calling:
$repo = $doctrine->getRepository("PackageManagerBundle:myRepo");
$dbElement = $repo->findOneById($Id);
I would always get the current data entries from my database. This looks like strange caching behavior, but I have no clue what I've done wrong.
I can solve this problem by registering the Gearman worker and function anew:
$worker = new \GearmanWorker();
$worker->addServer();
$worker->addFunction
After that I get the current state of my database back, until I change something else. I'm observing this behavior only in my Gearman worker function. In the rest of the application everything is synchronized with my database and normal.
...ANSWER
Answered 2018-May-21 at 11:10
This is what I think may be happening. Could be wrong though.
A Gearman worker is going to be a long-running process that picks up jobs to do. The first job it gets will cause Doctrine to load the entity from the database into its identity map. But for the second job the worker receives, Doctrine will not perform a database lookup; it will instead check its identity map, find that it already has the object loaded, and simply return the one from memory. If something else, external to the worker process, has altered the database record, then you'll end up with an object that is out of date.
You can tell doctrine to drop objects from its identity map, then it will perform a database lookup. To enforce loading objects from the database again instead of serving them from the identity map, you should use EntityManager#clear().
More info here: https://www.doctrine-project.org/projects/doctrine-orm/en/2.6/reference/working-with-objects.html#entities-and-the-identity-map
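The staleness described in this answer can be reproduced with a toy identity map. The Python sketch below is only a language-neutral illustration of the pattern, not Doctrine's implementation: the class IdentityMapRepository, its methods, and the dict standing in for the database table are all made up for the example.

```python
class IdentityMapRepository:
    """Toy repository that caches loaded rows by id, like an ORM identity map."""

    def __init__(self, database):
        self._database = database      # dict standing in for the real table
        self._identity_map = {}        # id -> already loaded object

    def find_one_by_id(self, row_id):
        # Only hit the "database" if the object was never loaded in this process.
        if row_id not in self._identity_map:
            self._identity_map[row_id] = dict(self._database[row_id])
        return self._identity_map[row_id]

    def clear(self):
        # Forget every loaded object so the next lookup reads the database again.
        self._identity_map.clear()


database = {1: {"name": "old"}}
repo = IdentityMapRepository(database)     # lives as long as the worker process

first = repo.find_one_by_id(1)["name"]     # job 1: loaded from the database
database[1]["name"] = "new"                # record changed outside the worker
stale = repo.find_one_by_id(1)["name"]     # job 2: served from the identity map
repo.clear()                               # analogous to clearing the entity manager
fresh = repo.find_one_by_id(1)["name"]     # job 3: loaded from the database again

if __name__ == "__main__":
    print(first, stale, fresh)
```

In a long-running worker, the equivalent of the clear() call would usually go at the start or end of each job, so that no job sees objects loaded by an earlier one.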
QUESTION
I'm new to Scala and SBT. I noticed an unfamiliar operator in the build.sbt of an open source project:
:=
Here are a couple examples of how it's used:
...ANSWER
Answered 2018-Feb-08 at 11:46
The := has essentially nothing to do with the ordinary assignment operator =. It's not a built-in Scala operator, but rather a family of methods/macros called :=. These methods (or macros) are members of classes such as SettingKey[T] (similarly for TaskKey[T] and InputKey[T]). They consume the right-hand side of the key := value expression, and return instances of type Def.Setting[T] (or similarly, Tasks), where T is the type of the value represented by the key. They are usually written in infix notation. Without syntactic sugar, the invocations of these methods/macros would look as follows:
QUESTION
I am pretty new to Spark. I have produced a file of around 420 MB with a Spark job. I have a Java application that only needs to query data from that file concurrently, based on certain conditions, and return the data in JSON format. So far I have found two RESTful APIs for Spark, but they are only for submitting Spark jobs remotely and managing Spark contexts,
...ANSWER
Answered 2017-Oct-11 at 18:46
You can actually use Livy to get results back as friendly JSON in a RESTful way!
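As a rough sketch of what that could look like, the Python snippet below builds the two HTTP requests used by Livy's interactive REST API: one to create a session and one to run a statement in it. The base URL (localhost on port 8998), the file path, and the query inside the statement are assumptions made for illustration, and the helper names are hypothetical. Nothing is sent unless send() is called against a running Livy server.

```python
import json
import urllib.request

# Assumption: a Livy server reachable locally on its usual port.
LIVY_URL = "http://localhost:8998"


def build_post(path, payload):
    """Build (but do not send) a JSON POST request against the Livy server."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        LIVY_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request():
    # POST /sessions starts an interactive Spark session.
    return build_post("/sessions", {"kind": "spark"})


def run_statement_request(session_id, code):
    # POST /sessions/{id}/statements runs a code snippet inside that session.
    return build_post(f"/sessions/{session_id}/statements", {"code": code})


def send(request):
    """Send a prepared request and decode the JSON response (needs a live server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical Scala statement: path, format, and filter are placeholders.
    code = 'spark.read.parquet("/path/to/output").where("id = 42").toJSON.collect.mkString(",")'
    request = run_statement_request(0, code)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

A client would then poll the statement resource until it has finished and read the result from the JSON response. Because the session stays alive between requests, several callers can query the same loaded data without starting a new Spark job each time.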
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install job-server
PHP requires the Visual C runtime (CRT). The Microsoft Visual C++ Redistributable for Visual Studio 2019 is suitable for all these PHP versions, see visualstudio.microsoft.com. You MUST download the x86 CRT for PHP x86 builds and the x64 CRT for PHP x64 builds. The CRT installer supports the /quiet and /norestart command-line switches, so you can also script it.