dsbulk | DataStax Bulk Loader is an open-source | SQL Database library
kandi X-RAY | dsbulk Summary
The DataStax Bulk Loader tool (DSBulk) is a unified tool for loading into and unloading from Cassandra-compatible storage engines, such as OSS Apache Cassandra, DataStax Astra and DataStax Enterprise (DSE).
Top functions reviewed by kandi - BETA
- Test for a temporal round trip
- Checks that numTemporals are present in the given directory
- Checks that numTemporals are written
- Demonstrates how to store a temporal table
- Checks that there are enough temporal values
- Checks the values in a given directory
- Login table
- Checks the details of a file in a given directory
- Unloads the contents of a table
- Checks for a range checkpoint
- Demonstrates how to unload a CSV file
- Demonstrates how to store a dynamic composite type
- Initializes the schema manager
- Demonstrates how to fetch legacy settings
- Test a dataset
- Demonstrates how to create an empty table
- Test to see if table should be truncated
- Attempt to load complex types
- Read schema generation strategy
- Perform initialization
- Demonstrates how to fetch OpenSSL legacy settings
- Demonstrates how to execute the jdk settings
- Install jdk in jdk
- Demonstrates how to drop tables
- This method is used to test tests
- Test whether or not tables should be truncated
dsbulk Key Features
dsbulk Examples and Code Snippets
dsbulk {
  connector {
    name = "csv"
    csv {
      url = "C:\\Users\\My Folder"
      delimiter = "\t"
    }
  }
}
dsbulk.connector.name = "csv"
dsbulk.connector.csv.url = "C:\\Users\\My Folder"
dsbulk.connector.csv.delimiter = "\t"
dsbulk load -delim '\t'
dsbulk load -h '"host.com:9042"'
dsbulk load -url '"C:\\Users\\My Folder"'
dsbulk load -url 'C:\\Users\\My Folder'
dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'
dsbulk load --co
# Load data
dsbulk load
# Unload data
dsbulk unload
# Count rows
dsbulk count
Community Discussions
Trending Discussions on dsbulk
QUESTION
I have unloaded more than 100 CSV files into a folder. When I try to load those files into Cassandra using DSBulk load, specifying the folder location of all these files, I get the error below
...
ANSWER
Answered 2022-Jan-24 at 20:24
Here are a few things you can try:
- You can pass any JVM option or system property to the dsbulk executable using the DSBULK_JAVA_OPTS env var. See this page for more. Set the allocated memory to a higher value if possible.
- You can throttle dsbulk using the -maxConcurrentQueries option. Start with -maxConcurrentQueries 1, then raise the value to get the best throughput possible without hitting the OOM error. More on this here.
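A combined sketch of both suggestions (the heap size, paths, keyspace, and table below are illustrative assumptions, not from the original answer):
# Give the DSBulk JVM more memory and throttle concurrent queries
export DSBULK_JAVA_OPTS="-Xmx4g"
dsbulk load -url /path/to/csv/folder -k myks -t mytable -maxConcurrentQueries 1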
QUESTION
I am using a configuration file as below to load data into Cassandra using DSBulk
...
ANSWER
Answered 2022-Jan-12 at 09:48
You can use the -f command-line switch to specify the location of the configuration file (see doc). The location of driver.conf will be resolved relative to this file.
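For example, a minimal invocation pointing DSBulk at a custom configuration file (the path is illustrative):
dsbulk load -f /path/to/dsbulk.conf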
QUESTION
I searched the internet a lot and saw many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. But my question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest that?
ANSWER
Answered 2021-Sep-28 at 17:33
It's possible to use it in some cases, but it's not practical, mainly because (the list could be bigger):
- DSBulk puts additional load onto the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, with no additional load on the nodes.
- It's harder to implement incremental backups with DSBulk - you need to come up with a SELECT condition that finds only the data changed since the last backup, so you need a timestamp column, because you can't put a WHERE condition on the value of the writetime function. Plus it will require rescanning the whole data set anyway, and it's impossible to find out what data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup and back up only those.
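To make the incremental-backup limitation concrete, here is a hedged sketch of such an export (the keyspace, table, and updated_at column are illustrative assumptions, not from the original answer; note that this still scans the whole table):
# Export only rows modified after a given date, using a dedicated timestamp column
dsbulk unload \
  -query "SELECT * FROM myks.mytable WHERE updated_at > '2021-09-01' ALLOW FILTERING" \
  -url ./incremental-backup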
QUESTION
Trying to unload data from a huge table; below is the command used and its output.
$ /home/cassandra/dsbulk-1.8.0/bin/dsbulk unload --driver.auth.provider PlainTextAuthProvider --driver.auth.username xxxx --driver.auth.password xxxx --datastax-java-driver.basic.contact-points 123.123.123.123 -query "select count(*) from sometable with where on clustering column and partial pk -- allow filtering" --connector.name json --driver.protocol.compression LZ4 --connector.json.mode MULTI_DOCUMENT -maxConcurrentFiles 1 -maxRecords -1 -url dsbulk --executor.continuousPaging.enabled false --executor.maxpersecond 2500 --driver.socket.timeout 240000
...
ANSWER
Answered 2021-Apr-24 at 08:06
Expand select count(*) from sometable with where on clustering column and partial pk -- allow filtering with an additional condition on the token ranges, like this: and partial pk token(full_pk) > :start and token(full_pk) <= :end - in this case, DSBulk will perform many queries against specific token ranges that are sent to multiple nodes, and won't concentrate the load on a single node as in your case.
Look at the documentation for the -query option, and at the 4th blog post in this series of blog posts about DSBulk, which provides more information and examples: 1, 2, 3, 4, 5, 6
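A minimal sketch of a token-range-aware unload (the keyspace, table, and key names are illustrative; :start and :end are placeholders that DSBulk replaces with generated token range boundaries):
dsbulk unload \
  -query "SELECT * FROM myks.mytable WHERE token(pk) > :start AND token(pk) <= :end" \
  --connector.name json -url ./export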
QUESTION
I'm using dsbulk 1.6.0 to unload data from cassandra 3.11.3.
Each unload results in wildly different counts of rows. Here are results from 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to; data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of added rows would be in the hundreds (if there were any), not in the tens of thousands.
Run 1:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
│ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
│ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
│ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data, Run 2 may be closer to complete, and Run 3 is missing significant data.
I'm invoking unload as follows:
...
ANSWER
Answered 2020-Oct-26 at 19:06
Data could be missing from a host if the host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads by default with consistency level LOCAL_ONE, different hosts will provide different views (the host that you're providing is just a contact point - after that, the cluster topology will be discovered, and DSBulk will select replicas based on the load balancing policy).
You can force DSBulk to read the data with another consistency level by using the -cl command-line option (doc). You can compare results using LOCAL_QUORUM or ALL - in these modes Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load onto the nodes because of the repaired data writes.
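A hedged example of re-running the unload at a stronger consistency level (keyspace, table, and output path are illustrative):
dsbulk unload -k myks -t mytable -url ./export -cl LOCAL_QUORUM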
QUESTION
I want to run a dsbulk unload command, but my cassandra cluster has ~1tb of data in the table I want to export. Is there a way to run the dsbulk unload command and stream the data into s3 as opposed to writing to disk?
I'm running the following command in my dev environment, but obviously this just writes to disk on my machine
bin/dsbulk unload -k myKeySpace -t myTable -url ~/data --connector.csv.compression gzip
ANSWER
Answered 2020-Oct-21 at 15:13
It doesn't support it "natively" out of the box. Theoretically it could be implemented, as DSBulk is now open source, but somebody would have to do it.
Update: as pointed out by Adam, the workaround is to use aws s3 cp and pipe DSBulk's output to it.
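A minimal sketch of that pipe (assuming the keyspace and table from the question, the default of DSBulk writing the unload to stdout when no -url is given, and an illustrative bucket name):
dsbulk unload -k myKeySpace -t myTable | aws s3 cp - s3://my-bucket/myTable.csv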
QUESTION
I'm trying to install DataStax Bulk Loader on my Windows machine in order to import a json file into a Cassandra database. I just followed the installation instructions from the official website, which amount to unpacking the folder. Running dsbulk from any directory in cmd prints the following result: "dsbulk" is not recognized as an internal or external command, operable program or batch file. However, I added C:\DSBulk\dsbulk-1.7.0\bin to the PATH variable. Anyone who faced this problem, what did you do? Thanks :D
ANSWER
Answered 2020-Oct-09 at 11:27
Change into the bin/ directory where you unzipped the package and run the launcher from there.
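A minimal sketch of doing that, assuming the install path from the question:
cd C:\DSBulk\dsbulk-1.7.0\bin
dsbulk --version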
QUESTION
I have followed the instructions in the documentation: https://docs.datastax.com/en/dsbulk/doc/dsbulk/install/dsbulkInstall.html
However, after doing the following:
...
ANSWER
Answered 2020-Aug-23 at 19:53
Yes, DSBulk doesn't bundle Java, so you need to install Java yourself - via apt, or whatever package manager you use.
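For example, on a Debian/Ubuntu system (the package name is illustrative; DSBulk needs a Java 8 or later runtime):
sudo apt install openjdk-11-jre-headless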
QUESTION
I'm attempting to import data into Cassandra on EC2 using dsbulk loader. I have three nodes configured and communicating as follows:
...
ANSWER
Answered 2020-Jun-04 at 22:43
As mentioned in my edit, the problem was solved by increasing the volume on each of my node instances. DSBulk was failing and causing the nodes to crash because the EC2 instances were running out of storage, from a combination of imported data, logging, and snapshots. I ended up running my primary node instance, in which I was running the DSBulk command, on a t2.medium instance with a 30GB SSD, which solved the issue.
QUESTION
I am following this guide on setting up dsbulk: https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkSimpleLoad.html
I'm getting confused at this part:
...
ANSWER
Answered 2020-May-13 at 21:18
Please note that the first line is something specific to DataStax Astra. If you're loading into an Astra instance, you will find the secure connect bundle downloadable from the database dashboard in the Astra console.
If you are using DSBulk with Cassandra, DSE, or any other compatible API, you do not need to be concerned with the secure connect bundle. You should be able to pass every parameter you need on the command line, or in a config file.
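For the Astra case, a hedged sketch of a load using the secure connect bundle (the bundle path, credentials, keyspace, table, and input file are illustrative):
dsbulk load -b /path/to/secure-connect-mydb.zip -u myClientId -p myClientSecret \
  -k myks -t mytable -url data.csv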
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install dsbulk
You can use dsbulk like any standard Java library. Include the jar files in your classpath. You can also use any IDE to run and debug the dsbulk component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.