mongo-hadoop | MongoDB Connector for Hadoop
kandi X-RAY | mongo-hadoop Summary
MongoDB Connector for Hadoop
Top functions reviewed by kandi - BETA
- Calculates input splits
- Get input split key
- Creates a split from a range query
- Gets the splits for a single split
- Calculate input splits
- Builds a MongoClient URI
- Calculates the list of splitter splits based on the input uri configuration
- Converts a Map into a Configuration object
- Compress BSON files
- Calculates splits for each file in the given path
- Returns a list of splits for the given number of splits
- Appends the next schema to the stream
- Synchronized
- Pushes the required fields to the server
- Execute collection
- Reduces the values of a collection
- Returns the next key in the stream
- Returns a record reader for a single split
- Command-line
- Deserialize fields
- Returns the next element in the stream
- Writes the value to the output stream
- Puts a tuple into the record
- Runs the model
- Deserialize a Hive table row
- Returns a RecordReader for the given split
mongo-hadoop Key Features
mongo-hadoop Examples and Code Snippets
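No snippets survive on this page itself, so here is a minimal, hedged sketch of reading a collection from PySpark through the connector's MongoInputFormat, following the pattern in the project's Spark usage documentation. Host, database, and collection names are placeholders, and the connector and MongoDB Java driver jars must already be on the classpath (for example via pyspark --jars, as in the discussions below).

# Placeholder URI; assumes the mongo-hadoop jars are on the classpath.
config = {'mongo.input.uri': 'mongodb://localhost:27017/test.coll'}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass='com.mongodb.hadoop.MongoInputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf=config)

print(rdd.first())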
Community Discussions
Trending Discussions on mongo-hadoop
QUESTION
I am having trouble making a simple 'hello world' connection between PySpark and MongoDB (the example I am trying to emulate is https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python). Can someone please help me understand and fix this issue?
Details:
I can successfully launch the pyspark shell with the --jars, --conf, and --py-files arguments shown below, then import pymongo_spark, and finally connect to the DB; however, when I try to print 'hello world', Python fails while extracting files because of a permission-denied error on '/home/ .cache'. I don't think our environment settings are correct, and I am not sure how to fix this.
(see attached error file screenshot)
My analysis: it is not clear whether this is a Spark/HDFS, pymongo_spark, or PySpark issue. Spark or pymongo_spark seems to default to each node's /home .cache directory.
Here is my pyspark environment:
pyspark --jars mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar,mongo-java-driver-3.6.3.jar --driver-class-path mongo-java-driver-3.6.3.jar,mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar --master yarn-client --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred","spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" --py-files pymongo_spark.py
In [1]: import pymongo_spark
In [2]: pymongo_spark.activate()
In [3]: mongo_rdd = sc.mongoRDD('mongodb://xx.xxx.xxx.xx:27017/test.restaurantssss')
In [4]: print(mongo_rdd.first())
ANSWER
Answered 2018-Apr-09 at 18:08
We knew about the advice to change the egg cache to a different directory by setting the PYTHON_EGG_CACHE environment variable to point to an accessible directory, but we were unsure how to accomplish this.
We were trying to do this locally, but we needed to change the read and write permissions (as the Hadoop user, not the local user) on each node.
For the Hadoop user, set PYTHON_EGG_CACHE to a writable location such as /tmp.
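A hedged sketch of doing this from the driver rather than editing each node's shell profile: Spark can forward an environment variable to its YARN executors through setExecutorEnv (the /tmp location is an assumption; any directory the Hadoop user can write to works).

from pyspark import SparkConf, SparkContext

# Sketch only: have executors unpack Python eggs into a writable directory.
conf = SparkConf().setExecutorEnv('PYTHON_EGG_CACHE', '/tmp')
sc = SparkContext(conf=conf)

The equivalent command-line form is --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp on the pyspark invocation.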
Then at the Unix prompt:
export PYTHONPATH=/usr/anaconda/bin/python
export MONGO_SPARK_SRC=/home/arustagi/mongodb/mongo-hadoop/spark
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
Verify PYTHONPATH:
-bash-4.2$ echo $PYTHONPATH
/usr/anaconda/bin/python:/home/arustagi/mongodb/mongo-hadoop/spark/src/main/python
Command to invoke PySpark
pyspark --jars /home/arustagi/mongodb/mongo-hadoop-spark-1.5.2.jar,/home/arustagi/mongodb/mongodb-driver-3.6.3.jar,/home/arustagi/mongodb/mongo-java-driver-3.6.3.jar --driver-class-path /home/arustagi/mongodb/mongo-hadoop-spark-1.5.2.jar,/home/arustagi/mongodb/mongodb-driver-3.6.3.jar,/home/arustagi/mongodb/mongo-java-driver-3.6.3.jar --master yarn-client --py-files /usr/anaconda/lib/python2.7/site-packages/pymongo_spark-0.1.dev0-py2.7.egg,/home/arustagi/mongodb/pymongo_spark.py
On the pyspark console:
18/04/06 15:21:04 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
QUESTION
I'm using Spark 2.1 on a YARN cluster. I have an RDD containing data I would like to complete based on other RDDs (which correspond to different Mongo databases that I obtain through https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage, though I don't think that is important; I just mention it in case).
My problem is that the RDD I have to use to complete the data depends on the data itself, because each record contains the database to use. Here is a simplified example of what I have to do:
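The asker's snippet does not survive on this page; a hypothetical reconstruction of the pattern, with invented names and record structure, might look like this. The heart of the difficulty is that RDDs cannot be created inside another RDD's transformations, so the database names must reach the driver first.

# Hypothetical sketch: each record names the database that completes it.
to_develop = sc.parallelize([
    {'key': 1, 'database': 'db_a'},
    {'key': 2, 'database': 'db_b'},
])

# Naive approach: pull the database names back to the driver, then build
# one lookup RDD per database (mongoRDD as in the first discussion above).
db_names = to_develop.map(lambda r: r['database']).distinct().collect()
lookups = {name: sc.mongoRDD('mongodb://host:27017/%s.details' % name)
           for name in db_names}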
ANSWER
Answered 2017-May-20 at 19:30
I would suggest you convert your RDDs to DataFrames; then the joins, distinct, and other functions you want to apply to the data become very easy. DataFrames are distributed, and in addition to the DataFrame APIs, SQL queries can be used. More information can be found in the Spark SQL, DataFrames and Datasets Guide and in Introducing DataFrames in Apache Spark for Large Scale Data Science. Moreover, the foreach and collect functions, which make your code run slowly, won't be needed.
An example of converting RDDtoDevelop to a DataFrame is below.
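The answer's original snippet likewise does not survive here; what follows is a minimal hedged PySpark sketch of the suggested conversion, with an invented element structure of (database, key) pairs.

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)  # Spark 2.1-era entry point

# Invented structure for illustration; stands in for the asker's RDDtoDevelop.
rdd_to_develop = sc.parallelize([('db_a', 1), ('db_b', 2)])
df = rdd_to_develop.map(lambda x: Row(database=x[0], key=x[1])).toDF()

# Joins and distinct now run as distributed DataFrame operations, replacing
# the slow driver-side foreach/collect pattern.
other_df = sqlContext.createDataFrame([Row(database='db_a', detail='x')])
df.join(other_df, on='database', how='left').show()

With the data as DataFrames, the per-database lookup becomes an ordinary join instead of a driver-side loop.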
QUESTION
I am trying to run the EnronMail example of the Hadoop-MongoDB connector for Spark. I am using the Java code example from GitHub: https://github.com/mongodb/mongo-hadoop/blob/master/examples/enron/spark/src/main/java/com/mongodb/spark/examples/enron/Enron.java. I adjusted the server name and added a username and password according to my needs.
The error message I got is the following:
...
ANSWER
Answered 2017-Apr-04 at 09:15
The problem was solved by including mongo-hadoop-spark-2.0.2.jar in the call, and also by using the following pom:
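The pom itself is not preserved on this page, so that gap stays as-is. For the first half of the fix, including the jar in the call means passing it to spark-submit; a sketch with placeholder paths, using the example's main class from the GitHub link above:

spark-submit --class com.mongodb.spark.examples.enron.Enron --jars /path/to/mongo-hadoop-spark-2.0.2.jar /path/to/enron-example.jar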
QUESTION
I am trying to import a file from HDFS into MongoDB using MongoInsertStorage with Pig. The files are large, around 5 GB. The script runs fine when I run it in local mode with
...
ANSWER
Answered 2017-Apr-02 at 22:17
Found a solution.
For the error
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install mongo-hadoop
Support