dsbulk | DataStax Bulk Loader is an open-source | SQL Database library
kandi X-RAY | dsbulk Summary
The DataStax Bulk Loader tool (DSBulk) is a unified tool for loading into and unloading from Cassandra-compatible storage engines, such as OSS Apache Cassandra, DataStax Astra and DataStax Enterprise (DSE).
Top functions reviewed by kandi - BETA
- Test for a temporal round trip
- Checks that numTemporals are present in the given directory
- Checks that numTemporals are written
- Demonstrates how to store a temporal table
- Checks that there are enough temporal values
- Checks the values in a given directory
- Login table
- Checks the details of a file in a given directory
- Unloads the contents of a table
- Checks for a range checkpoint
- Demonstrates how to unload a CSV file
- Demonstrates how to store a dynamic composite type
- Initializes the schema manager
- Demonstrates how to fetch legacy settings
- Test a dataset
- Demonstrates how to create an empty table
- Test to see if table should be truncated
- Attempt to load complex types
- Read schema generation strategy
- Perform initialization
- Demonstrates how to fetch OpenSSL legacy settings
- Demonstrates how to execute the jdk settings
- Install jdk in jdk
- Demonstrates how to drop tables
- This method is used to test tests
- Test whether or not tables should be truncated
dsbulk Key Features
dsbulk Examples and Code Snippets
dsbulk {
  connector {
    name = "csv"
    csv {
      url = "C:\\Users\\My Folder"
      delimiter = "\t"
    }
  }
}
dsbulk.connector.name = "csv"
dsbulk.connector.csv.url = "C:\\Users\\My Folder"
dsbulk.connector.csv.delimiter = "\t"
dsbulk load -delim '\t'
dsbulk load -h '"host.com:9042"'
dsbulk load -url '"C:\\Users\\My Folder"'
dsbulk load -url 'C:\\Users\\My Folder'
dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'
dsbulk load --co
# Load data
dsbulk load
# Unload data
dsbulk unload
# Count rows
dsbulk count
Community Discussions
Trending Discussions on dsbulk
QUESTION
I have unloaded more than 100 CSV files into a folder. When I try to load those files into Cassandra using DSBulk load, specifying the folder location of all these files, I get the error below
...
ANSWER
Answered 2022-Jan-24 at 20:24
Here are a few things you can try:
- You can pass any JVM option or system property to the dsbulk executable using the DSBULK_JAVA_OPTS env var. See this page for more. Set the allocated memory to a higher value if possible.
- You can throttle dsbulk using the -maxConcurrentQueries option. Start with -maxConcurrentQueries 1, then raise the value to get the best throughput possible without hitting the OOM error. More on this here.
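A combined sketch of both suggestions (the heap size, paths, keyspace, and table below are illustrative assumptions, not from the original answer):
# Give the DSBulk JVM more memory and throttle concurrent queries
export DSBULK_JAVA_OPTS="-Xmx4g"
dsbulk load -url /path/to/csv/folder -k myks -t mytable -maxConcurrentQueries 1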
QUESTION
I am using a configuration file as below to load data into Cassandra using DSBulk
...
ANSWER
Answered 2022-Jan-12 at 09:48
You can use the -f command-line switch to specify the location of the configuration file (see doc). The location of driver.conf will be resolved relative to this file.
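For example, a minimal invocation pointing DSBulk at a custom configuration file (the path is illustrative):
dsbulk load -f /path/to/dsbulk.conf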
QUESTION
I searched the internet a lot and saw many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. But my question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest that?
ANSWER
Answered 2021-Sep-28 at 17:33
It's possible to use it in some cases, but it's not practical, mainly because (the list could be bigger):
- DSBulk puts additional load onto the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, with no additional load on the nodes.
- It's harder to implement incremental backups with DSBulk - you need to come up with a SELECT condition that finds only the data changed since the last backup, so you need a timestamp column, because you can't put a WHERE condition on the value of the writetime function. Plus it will require rescanning the whole data set anyway, and it's impossible to find out what data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup and back up only those.
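To make the incremental-backup limitation concrete, here is a hedged sketch of such an export (the keyspace, table, and updated_at column are illustrative assumptions, not from the original answer; note that this still scans the whole table):
# Export only rows modified after a given date, using a dedicated timestamp column
dsbulk unload \
  -query "SELECT * FROM myks.mytable WHERE updated_at > '2021-09-01' ALLOW FILTERING" \
  -url ./incremental-backup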
QUESTION
Trying to unload data from a huge table; below is the command used and its output.
$ /home/cassandra/dsbulk-1.8.0/bin/dsbulk unload --driver.auth.provider PlainTextAuthProvider --driver.auth.username xxxx --driver.auth.password xxxx --datastax-java-driver.basic.contact-points 123.123.123.123 -query "select count(*) from sometable with where on clustering column and partial pk -- allow filtering" --connector.name json --driver.protocol.compression LZ4 --connector.json.mode MULTI_DOCUMENT -maxConcurrentFiles 1 -maxRecords -1 -url dsbulk --executor.continuousPaging.enabled false --executor.maxpersecond 2500 --driver.socket.timeout 240000
...
ANSWER
Answered 2021-Apr-24 at 08:06
Expand select count(*) from sometable with where on clustering column and partial pk -- allow filtering with an additional condition on the token ranges, like this: and partial pk token(full_pk) > :start and token(full_pk) <= :end - in this case, DSBulk will perform many queries against specific token ranges that are sent to multiple nodes, and won't concentrate the load on a single node as in your case.
Look at the documentation for the -query option, and at the 4th blog post in this series of blog posts about DSBulk, which provides more information and examples: 1, 2, 3, 4, 5, 6
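A minimal sketch of a token-range-aware unload (the keyspace, table, and key names are illustrative; :start and :end are placeholders that DSBulk replaces with generated token range boundaries):
dsbulk unload \
  -query "SELECT * FROM myks.mytable WHERE token(pk) > :start AND token(pk) <= :end" \
  --connector.name json -url ./export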
QUESTION
I'm using dsbulk 1.6.0 to unload data from cassandra 3.11.3.
Each unload results in wildly different counts of rows. Here are results from 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to; data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of added rows would be in the hundreds (if there were any), not in the tens of thousands.
Run 1:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
│ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
│ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
│ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data, Run 2 may be closer to complete, and Run 3 is missing significant data.
I'm invoking unload as follows:
...
ANSWER
Answered 2020-Oct-26 at 19:06
Data could be missing from a host if the host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads by default with consistency level LOCAL_ONE, different hosts will provide different views (the host that you're providing is just a contact point - after that, the cluster topology will be discovered, and DSBulk will select replicas based on the load balancing policy).
You can force DSBulk to read the data with another consistency level by using the -cl command-line option (doc). You can compare results using LOCAL_QUORUM or ALL - in these modes Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load onto the nodes because of the repaired data writes.
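A hedged example of re-running the unload at a stronger consistency level (keyspace, table, and output path are illustrative):
dsbulk unload -k myks -t mytable -url ./export -cl LOCAL_QUORUM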
QUESTION
I want to run a dsbulk unload command, but my cassandra cluster has ~1tb of data in the table I want to export. Is there a way to run the dsbulk unload command and stream the data into s3 as opposed to writing to disk?
I'm running the following command in my dev environment, but obviously this just writes to disk on my machine
bin/dsbulk unload -k myKeySpace -t myTable -url ~/data --connector.csv.compression gzip
ANSWER
Answered 2020-Oct-21 at 15:13
It doesn't support it "natively" out of the box. Theoretically it could be implemented, as DSBulk is now open source, but somebody would have to do it.
Update: as pointed out by Adam, the workaround is to use aws s3 cp and pipe DSBulk's output to it.
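A minimal sketch of that pipe (assuming the keyspace and table from the question, the default of DSBulk writing the unload to stdout when no -url is given, and an illustrative bucket name):
dsbulk unload -k myKeySpace -t myTable | aws s3 cp - s3://my-bucket/myTable.csv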
QUESTION
I'm trying to install DataStax Bulk Loader on my Windows machine in order to import a json file into a Cassandra database. I just followed the installation instructions from the official website, which amount to unpacking the folder. Running dsbulk from any directory in cmd prints the following result: "dsbulk" is not recognized as an internal or external command, operable program or batch file. However, I added C:\DSBulk\dsbulk-1.7.0\bin to the PATH variable. Anyone who faced this problem, what did you do? Thanks :D
ANSWER
Answered 2020-Oct-09 at 11:27
Change into the bin/ directory where you unzipped the package and run the launcher from there.
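A minimal sketch of doing that, assuming the install path from the question:
cd C:\DSBulk\dsbulk-1.7.0\bin
dsbulk --version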
QUESTION
I have followed the instructions in the documentation: https://docs.datastax.com/en/dsbulk/doc/dsbulk/install/dsbulkInstall.html
However, after doing the following:
...
ANSWER
Answered 2020-Aug-23 at 19:53
Yes, DSBulk doesn't bundle Java, so you need to install Java yourself - via apt, or whatever package manager you use.
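For example, on a Debian/Ubuntu system (the package name is illustrative; DSBulk needs a Java 8 or later runtime):
sudo apt install openjdk-11-jre-headless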
QUESTION
I'm attempting to import data into Cassandra on EC2 using dsbulk loader. I have three nodes configured and communicating as follows:
...
ANSWER
Answered 2020-Jun-04 at 22:43
As mentioned in my edit, the problem was solved by increasing the volume on each of my node instances. DSBulk was failing and causing the nodes to crash because the EC2 instances were running out of storage, from a combination of imported data, logging, and snapshots. I ended up running my primary node instance, in which I was running the DSBulk command, on a t2.medium instance with a 30GB SSD, which solved the issue.
QUESTION
I am following this guide on setting up dsbulk: https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkSimpleLoad.html
I'm getting confused at this part:
...
ANSWER
Answered 2020-May-13 at 21:18
Please note that the first line is something specific to DataStax Astra. If you're loading into an Astra instance, you will find the secure connect bundle downloadable from the database dashboard in the Astra console.
If you are using DSBulk with Cassandra, DSE, or any other compatible API, you do not need to be concerned with the secure connect bundle. You should be able to pass every parameter you need on the command line, or in a config file.
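For the Astra case, a hedged sketch of a load using the secure connect bundle (the bundle path, credentials, keyspace, table, and input file are illustrative):
dsbulk load -b /path/to/secure-connect-mydb.zip -u myClientId -p myClientSecret \
  -k myks -t mytable -url data.csv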
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install dsbulk
You can use dsbulk like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the dsbulk component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.