dsbulk | DataStax Bulk Loader, an open-source SQL Database library

 by datastax | Language: Java | Version: 1.8.0 | License: Apache-2.0

kandi X-RAY | dsbulk Summary

dsbulk is a Java library typically used in Database, SQL Database, and Oracle applications. dsbulk has a build file available, a permissive license, and low support. However, dsbulk has 14 bugs and 1 vulnerability. You can download it from GitHub or Maven.

The DataStax Bulk Loader tool (DSBulk) is a unified tool for loading into and unloading from Cassandra-compatible storage engines, such as OSS Apache Cassandra, DataStax Astra and DataStax Enterprise (DSE).

            kandi-Support Support

              dsbulk has a low active ecosystem.
              It has 38 stars, 18 forks, and 7 watchers.
              It has had no major release in the last 12 months.
              There are 10 open issues and 20 closed issues. On average, issues are closed in 43 days. There are 2 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of dsbulk is 1.8.0.

            kandi-Quality Quality

              dsbulk has 14 bugs (3 blocker, 0 critical, 3 major, 8 minor) and 2394 code smells.

            kandi-Security Security

              dsbulk's dependent libraries have no vulnerabilities reported, but code analysis of dsbulk itself shows 1 unresolved vulnerability (0 blocker, 1 critical, 0 major, 0 minor).
              There are 143 security hotspots that need review.

            kandi-License License

              dsbulk is licensed under the Apache-2.0 License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              dsbulk releases are available to install and integrate.
              Deployable package is available in Maven.
              Build file is available. You can build the component from source.
              Installation instructions are not available. Examples and code snippets are available.
              It has 72359 lines of code, 4198 functions and 656 files.
              It has medium code complexity. Code complexity directly impacts maintainability of the code.

            Top functions reviewed by kandi - BETA

            kandi has reviewed dsbulk and discovered the functions below as its top functions. This is intended to give you an instant insight into the functionality dsbulk implements, and to help you decide if it suits your requirements.
            • Test for a temporal round trip
            • Checks that numTemporals are present in the given directory
            • Checks that numTemporals are written
            • Demonstrates how to store a temporal table
            • Checks that there are enough temporal values
            • Checks the values in a given directory
            • Login table
            • Checks the details of a file in a given directory
            • Unloads the contents of a table
            • Checks for a range checkpoint
            • Demonstrates how to unload a CSV file
            • Demonstrates how to store a dynamic composite type
            • Initializes the schema manager
            • Demonstrates how to fetch legacy settings
            • Test a dataset
            • Demonstrates how to create an empty table
            • Test to see if table should be truncated
            • Attempt to load complex types
            • Read schema generation strategy
            • Perform initialization
            • Demonstrates how to fetch OpenSSL legacy settings
            • Demonstrates how to execute the jdk settings
            • Install jdk in jdk
            • Demonstrates how to drop tables
            • This method is used to test tests
            • Test whether or not tables should be truncated

            dsbulk Key Features

            No Key Features are available at this moment for dsbulk.

            dsbulk Examples and Code Snippets

            DataStax Bulk Loader Overview: Configuration Files vs Command Line Options (Java, 12 lines of code, License: Apache-2.0)
            dsbulk {
              connector {
                name = "csv"
                csv {
                  url = "C:\\Users\\My Folder"
                  delimiter = "\t"
                }
              }
            }
            
            dsbulk.connector.name = "csv"
            dsbulk.connector.csv.url = "C:\\Users\\My Folder"
            dsbulk.connector.csv.delimiter = "\t"
              
            DataStax Bulk Loader Overview: Escaping and Quoting Command Line Arguments (Java, 9 lines of code, License: Apache-2.0)
            dsbulk load -delim '\t'
            
            dsbulk load -h '"host.com:9042"'
            
            dsbulk load -url '"C:\\Users\\My Folder"'
            
            dsbulk load -url 'C:\\Users\\My Folder'
            
            dsbulk load --codec.nullStrings 'NIL, NULL'
            dsbulk load --codec.nullStrings '[NIL, NULL]'
            
            dsbulk load --co  
            DataStax Bulk Loader Overview: Basic Usage (Java, 8 lines of code, License: Apache-2.0)
            # Load data
            dsbulk load 
            
            # Unload data
            dsbulk unload 
            
            # Count rows
            dsbulk count 
              

            Community Discussions

            QUESTION

            I am getting a heap memory issue while running DSBULK load
            Asked 2022-Jan-24 at 20:24

            I have unloaded more than 100 CSV files into a folder. When I try to load those files into Cassandra using DSBULK load, specifying the folder location of all these files, I get the error below

            ...

            ANSWER

            Answered 2022-Jan-24 at 20:24

            Here are a few things you can try:

            1. You can pass any JVM option or system property to the dsbulk executable using the DSBULK_JAVA_OPTS environment variable. Set the allocated memory to a higher value if possible.
            2. You can throttle dsbulk using the -maxConcurrentQueries option. Start with -maxConcurrentQueries 1, then raise the value to get the best throughput possible without hitting the OOM error.
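The two knobs above can be sketched as a short shell snippet. The heap size, keyspace, table, and folder path are hypothetical placeholders, not values from the question:

```shell
# Raise the dsbulk JVM heap via DSBULK_JAVA_OPTS (4 GB here is an arbitrary example).
export DSBULK_JAVA_OPTS="-Xmx4g"

# Throttle the load; start at 1 and raise until throughput peaks without an OOM.
# Keyspace, table, and folder names are placeholders.
if command -v dsbulk >/dev/null 2>&1; then
  dsbulk load -url /path/to/csv_folder -k my_ks -t my_table -maxConcurrentQueries 1
fi
```

The guard on `command -v dsbulk` simply skips the invocation when dsbulk is not on the PATH.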

            Source https://stackoverflow.com/questions/70835602

            QUESTION

            Location of driver.conf used for DSBULK to load data into Cassandra
            Asked 2022-Jan-12 at 09:48

            I am using a configuration file as below to load data in Cassandra using DSBULK

            ...

            ANSWER

            Answered 2022-Jan-12 at 09:48

            You can use the -f command-line switch to specify the location of the configuration file (see the documentation). The location of driver.conf will be resolved relative to this file.
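As a sketch of that layout (all paths are hypothetical): keep dsbulk.conf and driver.conf side by side and point dsbulk at the former with -f; driver.conf is then resolved relative to it.

```shell
# Hypothetical layout:
#   /path/to/conf/dsbulk.conf   - dsbulk settings
#   /path/to/conf/driver.conf   - Java driver settings, resolved relative to the -f file
CONF=/path/to/conf/dsbulk.conf
if command -v dsbulk >/dev/null 2>&1; then
  dsbulk load -f "$CONF" -k my_ks -t my_table
fi
```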

            Source https://stackoverflow.com/questions/70678883

            QUESTION

            Is it possible to backup and restore Cassandra cluster using dsbulk?
            Asked 2021-Sep-28 at 17:33

            I searched the internet a lot and saw many ways to back up and restore a Cassandra cluster, such as nodetool snapshot and Medusa. But my question is: can I use dsbulk to back up a Cassandra cluster? What are its limitations? Why doesn't anyone suggest that?

            ...

            ANSWER

            Answered 2021-Sep-28 at 17:33

            It's possible to use it in some cases, but it's not practical, primarily because (the list could be bigger):

            • DSBulk puts additional load onto the cluster nodes because it goes through the standard read path. In contrast, nodetool snapshot just creates hard links to the data files, with no additional load on the nodes.
            • It's harder to implement incremental backups with DSBulk: you need to come up with a SELECT condition that finds only the data changed since the last backup, so you need a timestamp column, because you can't put a WHERE condition on the value of the writetime function. It will also require rescanning the whole data set anyway, and it's impossible to find which data was deleted. With nodetool snapshot, you just compare which files have changed since the last backup, and back up only those.

            Source https://stackoverflow.com/questions/69364605

            QUESTION

            dsbulk unload is failing on large table
            Asked 2021-Apr-24 at 14:12

            Trying to unload data from a huge table; below is the command used and its output.

            $ /home/cassandra/dsbulk-1.8.0/bin/dsbulk unload --driver.auth.provider PlainTextAuthProvider --driver.auth.username xxxx --driver.auth.password xxxx --datastax-java-driver.basic.contact-points 123.123.123.123 -query "select count(*) from sometable with where on clustering column and partial pk -- allow filtering" --connector.name json --driver.protocol.compression LZ4 --connector.json.mode MULTI_DOCUMENT -maxConcurrentFiles 1 -maxRecords -1 -url dsbulk --executor.continuousPaging.enabled false --executor.maxpersecond 2500 --driver.socket.timeout 240000

            ...

            ANSWER

            Answered 2021-Apr-24 at 08:06

            Expand select count(*) from sometable with where on clustering column and partial pk -- allow filtering with an additional condition on the token ranges, like this: and partial pk token(full_pk) > :start and token(full_pk) <= :end. In this case, DSBulk will perform many queries against specific token ranges that are sent to multiple nodes, and won't concentrate the load on a single node as in your case.

            Look into the documentation for the -query option, and at the 4th blog in the series of blog posts about DSBulk, which provides more information and examples.
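A hedged sketch of such a token-range-restricted unload; the keyspace, table, and partition key names are placeholders. DSBulk substitutes :start and :end with the bounds of each token range, so the scan is split into many small range queries that land on multiple nodes:

```shell
# Placeholder query: DSBulk replaces :start/:end with per-range token bounds,
# fanning the scan out across replicas instead of one coordinator.
QUERY="SELECT * FROM my_ks.my_table WHERE token(pk) > :start AND token(pk) <= :end"
if command -v dsbulk >/dev/null 2>&1; then
  dsbulk unload -query "$QUERY" --connector.name json -url ./out
fi
```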

            Source https://stackoverflow.com/questions/67240233

            QUESTION

            dsbulk unload missing data
            Asked 2020-Oct-26 at 19:06

            I'm using dsbulk 1.6.0 to unload data from cassandra 3.11.3.

            Each unload results in wildly different row counts. Here are the results from 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to; data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster with a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of added rows would be in the hundreds (if any), not in the tens of thousands.

            Run 1:

            │ total | failed | rows/s | p50ms | p99ms | p999ms
            │ 10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
            │ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.

            Run 2:

            │ total | failed | rows/s | p50ms | p99ms | p999ms
            │ 60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
            │ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.

            Run 3:

            │ total | failed | rows/s | p50ms | p99ms | p999ms
            │ 45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
            │ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.

            It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete and Run 3 is missing significant data.

            I'm invoking unload as follows:

            ...

            ANSWER

            Answered 2020-Oct-26 at 19:06

            Data could be missing from a host if the host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads by default with consistency level LOCAL_ONE, different hosts will provide different views (the host that you're providing is just a contact point; after that, the cluster topology is discovered, and DSBulk selects replicas based on the load balancing policy).

            You can force DSBulk to read the data with another consistency level by using the -cl command-line option. You can compare the results using LOCAL_QUORUM or ALL; in these modes Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the repaired data writes.
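For example (keyspace and table names are placeholders), forcing an unload at LOCAL_QUORUM might look like:

```shell
# Read at a stronger consistency level: slower, but surfaces (and read-repairs)
# rows that some replicas missed. The CL value is passed via the -cl option.
CL=LOCAL_QUORUM
if command -v dsbulk >/dev/null 2>&1; then
  dsbulk unload -k my_ks -t my_table -cl "$CL" -url ./out
fi
```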

            Source https://stackoverflow.com/questions/64542597

            QUESTION

            How do I run dsbulk unload and write directly to S3
            Asked 2020-Oct-21 at 15:13

            I want to run a dsbulk unload command, but my Cassandra cluster has ~1 TB of data in the table I want to export. Is there a way to run the dsbulk unload command and stream the data into S3, as opposed to writing it to disk?

            I'm running the following command in my dev environment, but obviously this is just writing to disk on my machine:

            bin/dsbulk unload -k myKeySpace -t myTable -url ~/data --connector.csv.compression gzip

            ...

            ANSWER

            Answered 2020-Oct-21 at 15:13

            It doesn't support this "natively" out of the box. Theoretically it could be implemented, as DSBulk is now open source, but somebody would have to do it.

            Update: a workaround, as pointed out by Adam, is to pipe DSBulk's output to aws s3 cp, like this:
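A sketch of that pipe, assuming dsbulk unload writes to stdout when no -url is given; the bucket name and destination key are placeholders:

```shell
# Stream the unload straight to S3 without touching local disk:
# dsbulk writes compressed CSV to stdout, and `aws s3 cp - <dest>` reads stdin.
DEST=s3://my-bucket/myTable.csv.gz
if command -v dsbulk >/dev/null 2>&1 && command -v aws >/dev/null 2>&1; then
  dsbulk unload -k myKeySpace -t myTable --connector.csv.compression gzip \
    | aws s3 cp - "$DEST"
fi
```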

            Source https://stackoverflow.com/questions/64455248

            QUESTION

            DataStax Bulk Loader for Apache Cassandra isn't installing on Windows
            Asked 2020-Oct-09 at 11:27

            I'm trying to install DataStax Bulk Loader on my Windows machine in order to import a JSON file into a Cassandra database. I just followed the installation instructions from the official website; it's just unpacking a folder. Running dsbulk from any directory in cmd prints the following result: "dsbulk" is not recognized as an internal or external command, operable program or batch file. However, I added C:\DSBulk\dsbulk-1.7.0\bin to my PATH variable. Anyone who faced this problem, what did you do? Thanks :D

            ...

            ANSWER

            Answered 2020-Oct-09 at 11:27

            Change into the bin/ directory where you unzipped the package. For example:

            Source https://stackoverflow.com/questions/64278534

            QUESTION

            Datastax Bulk Loader for Apache Cassandra not installing
            Asked 2020-Aug-23 at 19:53

            I have followed the instructions in the documentation: https://docs.datastax.com/en/dsbulk/doc/dsbulk/install/dsbulkInstall.html

            However, after doing the following:

            ...

            ANSWER

            Answered 2020-Aug-23 at 19:53

            Yes, DSBulk doesn't bundle Java, so you need to install Java yourself, via apt or whatever package manager you use.
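A minimal sketch for Debian/Ubuntu; the package name is an assumption and varies by distro (any recent OpenJDK should do):

```shell
# dsbulk is a launcher script around java; it fails if no JRE is on the PATH.
JAVA_PKG=openjdk-11-jre    # assumed package name; differs on other distros
if command -v java >/dev/null 2>&1; then
  java -version
else
  echo "java not found; on Debian/Ubuntu try: sudo apt-get install -y $JAVA_PKG"
fi
```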

            Source https://stackoverflow.com/questions/63551007

            QUESTION

            How to import data into Cassandra on EC2 using DSBulk Loader
            Asked 2020-Jun-04 at 22:43

            I'm attempting to import data into Cassandra on EC2 using dsbulk loader. I have three nodes configured and communicating as follows:

            ...

            ANSWER

            Answered 2020-Jun-04 at 22:43

            As mentioned in my edit, the problem was solved by increasing the volume on each of my node instances. The reason DSBulk was failing and causing the nodes to crash was that the EC2 instances were running out of storage, from a combination of imported data, logging, and snapshots. I ended up running my primary node instance, on which I ran the DSBulk command, on a t2.medium instance with a 30 GB SSD, which solved the issue.

            Source https://stackoverflow.com/questions/62166344

            QUESTION

            First steps on loading data into Cassandra with dsbulk
            Asked 2020-May-14 at 15:18

            I am following this guide on setting up dsbulk: https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkSimpleLoad.html

            I'm getting confused at this part:

            ...

            ANSWER

            Answered 2020-May-13 at 21:18

            Please note that the first line is something specific to DataStax Astra. If you're loading to an Astra instance, you would find the secure connect bundle downloadable from the database dashboard in Astra console.

            If you are using DS Bulk for Cassandra, DSE, or any other compatible API, you do not need to be concerned with the secure connect bundle. You should be able to pass every parameter you need on the command line, or written in a config file.

            Source https://stackoverflow.com/questions/61780877

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install dsbulk

            You can download it from GitHub, Maven.
            You can use dsbulk like any standard Java library. Please include the jar files in your classpath. You can also use any IDE to run and debug the dsbulk component as you would any other Java program. Best practice is to use a build tool that supports dependency management, such as Maven or Gradle. For Maven installation, please refer to maven.apache.org; for Gradle installation, please refer to gradle.org.

            Support

            The most up-to-date documentation is available online.

            CLONE
          • HTTPS

            https://github.com/datastax/dsbulk.git

          • CLI

            gh repo clone datastax/dsbulk

          • sshUrl

            git@github.com:datastax/dsbulk.git
