marklogic-data-hub | The MarkLogic Data Hub | Database library
kandi X-RAY | marklogic-data-hub Summary
Go from nothing to an Operational Data Hub in a matter of minutes. MarkLogic Data Hub is a data integration platform and toolset that helps you quickly and efficiently integrate data from many sources into a single MarkLogic database and then expose that data.
Top functions reviewed by kandi - BETA
- Finishes the run step.
- Converts flows and mappings.
- Initializes artifact directories.
- Builds aggregatable properties.
- Clears modules.
- Converts the entity models in the project directory to the storage directory.
- Creates a step file.
- Gets the step runner for a given flow.
- Imports one or more jobs.
- Updates the app config.
Community Discussions
Trending Discussions on marklogic-data-hub
QUESTION
I've been installing a brand-new (empty) Data Hub 4.1.1 so I can practice an upgrade (4.1.1 to 4.3.2, then up to 5.2.6).
I've been using the QuickStart instructions here https://marklogic.github.io/marklogic-data-hub/tutorial/4x/install/ to do the install, but wonder if I'm "cheating" and should instead be using the same Gradle install method as for 5.2.x.
To clarify: can you install 4.1.1 using the same method as for 5.2.x, described here https://docs.marklogic.com/datahub/5.2/projects/create-project-using-gradle.html, or do you need to follow the 4.x.x instructions only?
If we follow the 4.1.1 example with the sample data provided, it seems you initialize the project via QuickStart (which adds extra files etc. to the project directory on the local hard disk), then install the project as a discrete Data Hub into MarkLogic. Is it correct to say each project is its own Data Hub?
Thanks in advance.
ANSWER
Answered 2021-Jan-12 at 06:56
You can follow the 5.2 instructions. The 4.x instructions are not really different, as you can see here:
https://marklogic.github.io/marklogic-data-hub/project/gradle/
4.x also has scaffolding tasks. Run ./gradlew tasks and look for tasks starting with hub or ml.
HTH!
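The filtering step can be sketched as follows. In a real project you would pipe the output of ./gradlew tasks; here the task list is simulated so the filter itself can be demonstrated (hubInit and mlDeploy are real Data Hub / ml-gradle task names, while build and test stand in for unrelated Gradle tasks):

```shell
# In a real project:  ./gradlew tasks | grep -E '^(hub|ml)'
# Simulated task list so the grep filter can be shown on its own:
printf 'build\nhubInit\nmlDeploy\ntest\n' | grep -E '^(hub|ml)'
```

Only the lines beginning with hub or ml survive the filter, which is exactly the set of scaffolding tasks the answer refers to.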
QUESTION
As I understand it, both MLCP transformations and triggers can be used to modify ingested documents. The difference is that a content transformation operates on the in-memory document object during ingestion, whereas a trigger fires after a document is created.
So it seems to me there is no reason why I cannot use both of them together. My use case is that I need to update some nodes of the documents after they are ingested into the database. The reason I use a trigger is that running the same logic in an MLCP transformation using the in-mem-update module always caused ingestion failures, presumably due to the large file size and the large number of nodes I attempted to update.
2018-08-22 23:02:24 ERROR TransformWriter:546 - Exception:Error parsing HTTP headers: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So far, I have not been able to combine content transformations and triggers. When I enabled the transformation during MLCP ingestion, the trigger was not fired. When I disabled the transformation, the trigger worked without problems.
Is there any intrinsic reason why I cannot use both of them together, or is it an issue with my configuration? Thanks!
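For reference, an MLCP import that combines a server-side transform with a collection a trigger could be scoped to might look like this. This is a hypothetical sketch: the host, port, credentials, input path, transform module, and collection name are all placeholders, not values from the question.

```shell
# Hypothetical MLCP import: applies a server-side transform during ingest
# and tags documents with a collection that a trigger's collection scope
# can match. All host/path/module values below are placeholders.
mlcp.sh import \
  -host localhost -port 8010 \
  -username admin -password admin \
  -input_file_path /data/pdfs \
  -document_type binary \
  -transform_module /transforms/extract-text.xqy \
  -output_collections raw-pdfs
```

With this setup, a post-commit trigger scoped to the raw-pdfs collection would fire for each ingested document, after the transform has run.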
Edit:
I would like to provide some context for clarification and report results based on suggestions from @ElijahBernstein-Cooper, @MadsHansen and @grtjn (thanks!). I am using the MarkLogic Data Hub Framework to ingest PDF files (some quite large) as binaries and extract the text as XML. I essentially followed this example, except that I am using xdmp:pdf-convert instead of xdmp:document-filter: https://github.com/marklogic/marklogic-data-hub/blob/master/examples/load-binaries/plugins/entities/Guides/input/LoadAsXml/content/content.xqy
While xdmp:pdf-convert seems to preserve the PDF structure better than xdmp:document-filter, it also includes some styling nodes ( and
ANSWER
Answered 2018-Aug-24 at 07:24
MLCP transforms and triggers operate independently. There is nothing in those transforms that should stop triggers from working per se.
Triggers are fired by events. I typically use both a create and a modify trigger to cover the case where I import the same files a second time (for testing purposes, for instance).
Triggers also have a scope. They are configured to look at either a directory or a collection. Make sure your MLCP configuration matches the trigger scope, and that your transform does not change the URI in such a way that it no longer matches the directory scope, if that is used.
Looking more closely at the error message, however, I'd say it is caused by a timeout. Timeouts can occur both server-side (10 minutes by default) and client-side (this might depend on client-side settings, but could be much smaller). The message basically says that the server took too long to respond, so I'd say you are facing a client-side timeout.
Timeouts can be caused by too-small time limits. You could try to increase timeout settings both server-side (xdmp:set-request-time-limit()) and client-side (not sure how to do that in Java).
It is more common, though, that you are simply trying to do too much at the same time. MLCP opens transactions and tries to execute a number of batches within each transaction, aka the transaction_size. Each batch contains a number of documents up to batch_size. By default, MLCP tries to process 10 x 100 = 1000 documents per transaction.
It also runs with 10 threads by default, so it typically opens 10 transactions at the same time and tries to run 10 threads to process 1000 docs each in parallel. With simple inserts this is just fine. With heavier processing in transforms or pre-commit triggers, this can become a bottleneck, particularly when the threads start to compete for server resources like memory and CPU.
Functions like xdmp:pdf-convert can often be fairly slow. It depends on an external plugin for starters, but also imagine it has to process a 200-page PDF. Binaries can be large, so you'll want to pace down to process them. If using -transaction_size 1 -batch_size 1 -thread_count 1 makes your transforms work, you really were facing timeouts, and may have been flooding your server. From there you can look at increasing some numbers, but binary sizes can be unpredictable, so be conservative.
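Putting those pacing flags into a full command line, a maximally conservative import might look like this. Again a hypothetical sketch: everything except the three pacing flags (host, credentials, paths, transform module) is a placeholder.

```shell
# Hypothetical paced-down MLCP import: one document per batch, one batch
# per transaction, one thread, to avoid timeouts during heavy transforms
# such as xdmp:pdf-convert. Host/path/module values are placeholders.
mlcp.sh import \
  -host localhost -port 8010 \
  -username admin -password admin \
  -input_file_path /data/pdfs \
  -document_type binary \
  -transform_module /transforms/extract-text.xqy \
  -transaction_size 1 -batch_size 1 -thread_count 1
```

Once this succeeds, the three pacing values can be raised incrementally while watching for the timeout error to return.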
It might also be worth looking at doing heavy processing asynchronously, for instance using CPF, the Content Processing Framework. It is a very robust implementation for processing content, and is designed to survive server restarts.
HTH!
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported