mongo-hadoop | MongoDB Connector for Hadoop
kandi X-RAY | mongo-hadoop Summary
MongoDB Connector for Hadoop
Top functions reviewed by kandi - BETA
- Calculates input splits
- Get input split key
- Creates a split from a range query
- Gets the splits for a single split
- Calculate input splits
- Builds a MongoClient URI
- Calculates the list of splitter splits based on the input uri configuration
- Converts a Map into a Configuration object
- Compress BSON files
- Calculates splits for each file in the given path
- Returns a list of splits for the given number of splits
- Appends the next schema to the stream
- Synchronized
- Pushes the required fields to the server
- Execute collection
- Reduces the values of a collection
- Returns the next key in the stream
- Returns a record reader for a single split
- Command-line
- Deserialize fields
- Returns the next element in the stream
- Writes the value to the output stream
- Puts a tuple into the record
- Runs the model
- Deserialize a Hive table row
- Returns a RecordReader for the given split
mongo-hadoop Key Features
mongo-hadoop Examples and Code Snippets
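No snippets survive on this page itself, so here is a minimal, hedged sketch of reading a collection from PySpark through the connector's MongoInputFormat, following the pattern in the project's Spark usage documentation. Host, database, and collection names are placeholders, and the connector and MongoDB Java driver jars must already be on the classpath (for example via pyspark --jars, as in the discussions below).

# Placeholder URI; assumes the mongo-hadoop jars are on the classpath.
config = {'mongo.input.uri': 'mongodb://localhost:27017/test.coll'}

rdd = sc.newAPIHadoopRDD(
    inputFormatClass='com.mongodb.hadoop.MongoInputFormat',
    keyClass='org.apache.hadoop.io.Text',
    valueClass='org.apache.hadoop.io.MapWritable',
    conf=config)

print(rdd.first())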
Community Discussions
Trending Discussions on mongo-hadoop
QUESTION
I am having trouble making a simple 'hello world' connection between PySpark and MongoDB (the example I am trying to emulate is https://github.com/mongodb/mongo-hadoop/tree/master/spark/src/main/python). Can someone please help me understand and fix this issue?
Details:
I can successfully launch the pyspark shell with the --jars, --conf, and --py-files arguments shown below, then import pymongo_spark, and finally connect to the DB; however, when I try to print 'hello world', Python fails while extracting files because of a permission-denied error on '/home/ .cache'. I don't think our environment settings are correct, and I am not sure how to fix this.
(see attached error file screenshot)
My analysis: it is not clear whether this is a Spark/HDFS, pymongo_spark, or PySpark issue. Spark or pymongo_spark seems to default to each node's /home .cache directory.
Here is my pyspark environment:
pyspark --jars mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar,mongo-java-driver-3.6.3.jar --driver-class-path mongo-java-driver-3.6.3.jar,mongo-hadoop-spark-1.5.2.jar,mongodb-driver-3.6.3.jar --master yarn-client --conf "spark.mongodb.input.uri=mongodb://127.0.0.1/test.coll?readPreference=primaryPreferred","spark.mongodb.output.uri=mongodb://127.0.0.1/test.coll" --py-files pymongo_spark.py
In [1]: import pymongo_spark
In [2]: pymongo_spark.activate()
In [3]: mongo_rdd = sc.mongoRDD('mongodb://xx.xxx.xxx.xx:27017/test.restaurantssss')
In [4]: print(mongo_rdd.first())
ANSWER
Answered 2018-Apr-09 at 18:08
We knew about the advice to change the egg cache to a different directory by setting the PYTHON_EGG_CACHE environment variable to point to an accessible directory, but we were unsure how to accomplish this.
We were trying to do this locally, but we needed to change the read and write permissions (as the Hadoop user, not the local user) on each node.
For the Hadoop user, set PYTHON_EGG_CACHE to a writable location such as /tmp.
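A hedged sketch of doing this from the driver rather than editing each node's shell profile: Spark can forward an environment variable to its YARN executors through setExecutorEnv (the /tmp location is an assumption; any directory the Hadoop user can write to works).

from pyspark import SparkConf, SparkContext

# Sketch only: have executors unpack Python eggs into a writable directory.
conf = SparkConf().setExecutorEnv('PYTHON_EGG_CACHE', '/tmp')
sc = SparkContext(conf=conf)

The equivalent command-line form is --conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp on the pyspark invocation.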
Then at the Unix prompt:
export PYTHONPATH=/usr/anaconda/bin/python
export MONGO_SPARK_SRC=/home/arustagi/mongodb/mongo-hadoop/spark
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
Verify PYTHONPATH:
-bash-4.2$ echo $PYTHONPATH
/usr/anaconda/bin/python:/home/arustagi/mongodb/mongo-hadoop/spark/src/main/python
Command to invoke PySpark
pyspark --jars /home/arustagi/mongodb/mongo-hadoop-spark-1.5.2.jar,/home/arustagi/mongodb/mongodb-driver-3.6.3.jar,/home/arustagi/mongodb/mongo-java-driver-3.6.3.jar --driver-class-path /home/arustagi/mongodb/mongo-hadoop-spark-1.5.2.jar,/home/arustagi/mongodb/mongodb-driver-3.6.3.jar,/home/arustagi/mongodb/mongo-java-driver-3.6.3.jar --master yarn-client --py-files /usr/anaconda/lib/python2.7/site-packages/pymongo_spark-0.1.dev0-py2.7.egg,/home/arustagi/mongodb/pymongo_spark.py
On the pyspark console:
18/04/06 15:21:04 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
QUESTION
I'm using Spark 2.1 on a YARN cluster. I have an RDD containing data I would like to complete based on other RDDs (which correspond to different Mongo databases that I obtain through https://github.com/mongodb/mongo-hadoop/wiki/Spark-Usage, though I don't think that is important; I just mention it in case).
My problem is that the RDD I have to use to complete the data depends on the data itself, because each record contains the database to use. Here is a simplified example of what I have to do:
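The asker's snippet does not survive on this page; a hypothetical reconstruction of the pattern, with invented names and record structure, might look like this. The heart of the difficulty is that RDDs cannot be created inside another RDD's transformations, so the database names must reach the driver first.

# Hypothetical sketch: each record names the database that completes it.
to_develop = sc.parallelize([
    {'key': 1, 'database': 'db_a'},
    {'key': 2, 'database': 'db_b'},
])

# Naive approach: pull the database names back to the driver, then build
# one lookup RDD per database (mongoRDD as in the first discussion above).
db_names = to_develop.map(lambda r: r['database']).distinct().collect()
lookups = {name: sc.mongoRDD('mongodb://host:27017/%s.details' % name)
           for name in db_names}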
ANSWER
Answered 2017-May-20 at 19:30
I would suggest you convert your RDDs to DataFrames; then the joins, distinct, and other functions you want to apply to the data become very easy. DataFrames are distributed, and in addition to the DataFrame APIs, SQL queries can be used. More information can be found in the Spark SQL, DataFrames and Datasets Guide and in Introducing DataFrames in Apache Spark for Large Scale Data Science. Moreover, the foreach and collect functions, which make your code run slowly, won't be needed.
An example of converting RDDtoDevelop to a DataFrame is below.
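The answer's original snippet likewise does not survive here; what follows is a minimal hedged PySpark sketch of the suggested conversion, with an invented element structure of (database, key) pairs.

from pyspark.sql import Row, SQLContext

sqlContext = SQLContext(sc)  # Spark 2.1-era entry point

# Invented structure for illustration; stands in for the asker's RDDtoDevelop.
rdd_to_develop = sc.parallelize([('db_a', 1), ('db_b', 2)])
df = rdd_to_develop.map(lambda x: Row(database=x[0], key=x[1])).toDF()

# Joins and distinct now run as distributed DataFrame operations, replacing
# the slow driver-side foreach/collect pattern.
other_df = sqlContext.createDataFrame([Row(database='db_a', detail='x')])
df.join(other_df, on='database', how='left').show()

With the data as DataFrames, the per-database lookup becomes an ordinary join instead of a driver-side loop.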
QUESTION
I am trying to run the EnronMail example of the Hadoop-MongoDB connector for Spark. I am using the Java code example from GitHub: https://github.com/mongodb/mongo-hadoop/blob/master/examples/enron/spark/src/main/java/com/mongodb/spark/examples/enron/Enron.java. I adjusted the server name and added a username and password according to my needs.
The error message I got is the following:
...
ANSWER
Answered 2017-Apr-04 at 09:15
The problem was solved by including mongo-hadoop-spark-2.0.2.jar in the call, and also by using the following pom:
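The pom itself is not preserved on this page, so that gap stays as-is. For the first half of the fix, including the jar in the call means passing it to spark-submit; a sketch with placeholder paths, using the example's main class from the GitHub link above:

spark-submit --class com.mongodb.spark.examples.enron.Enron --jars /path/to/mongo-hadoop-spark-2.0.2.jar /path/to/enron-example.jar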
QUESTION
I am trying to import a file from HDFS into MongoDB using MongoInsertStorage with Pig. The files are large, around 5 GB. The script runs fine when I run it in local mode with
...
ANSWER
Answered 2017-Apr-02 at 22:17
Found a solution.
For the error
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install mongo-hadoop
Support