sstable | bigdata processing in golang
kandi X-RAY | sstable Summary
bigdata processing in golang.
Top functions reviewed by kandi - BETA
- sort2Disk takes a reader and a mapper and maps the input to memory.
- Reduce takes a number of chunks and calls the Reduce function.
- Add adds an entry to the set.
- findUnique finds all the files in r.
- newDataSetReader returns a new dataSetReader.
- newStreamReader returns a new streamReader.
- newDataSet creates a new dataSet.
- newStreamAggregator returns a new streamAggregator.
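The function names above suggest a sort-to-disk plus merge pipeline: map the input, spill sorted chunks to disk, then stream-merge them back. The sketch below illustrates that general pattern in Go; every name and signature in it is illustrative only and is not this package's actual API.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

// sortChunkToDisk sorts one in-memory chunk and writes it to a temporary
// file, one key per line, returning the file path (illustrative only).
func sortChunkToDisk(chunk []string) (string, error) {
	sort.Strings(chunk)
	f, err := os.CreateTemp("", "chunk-*.sst")
	if err != nil {
		return "", err
	}
	defer f.Close()
	w := bufio.NewWriter(f)
	for _, k := range chunk {
		fmt.Fprintln(w, k)
	}
	return f.Name(), w.Flush()
}

// mergeChunks streams the sorted chunk files back, repeatedly emitting the
// smallest head key: a simple k-way merge without a heap.
func mergeChunks(paths []string, emit func(string)) error {
	var scanners []*bufio.Scanner
	var heads []string
	for _, p := range paths {
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		s := bufio.NewScanner(f)
		if s.Scan() {
			scanners = append(scanners, s)
			heads = append(heads, s.Text())
		}
	}
	for len(scanners) > 0 {
		min := 0
		for i := range heads {
			if heads[i] < heads[min] {
				min = i
			}
		}
		emit(heads[min])
		if scanners[min].Scan() {
			heads[min] = scanners[min].Text()
		} else {
			scanners = append(scanners[:min], scanners[min+1:]...)
			heads = append(heads[:min], heads[min+1:]...)
		}
	}
	return nil
}

func main() {
	chunks := [][]string{{"banana", "apple"}, {"cherry", "apricot"}}
	var paths []string
	for _, c := range chunks {
		p, err := sortChunkToDisk(c)
		if err != nil {
			panic(err)
		}
		paths = append(paths, p)
	}
	// Prints apple, apricot, banana, cherry in sorted order.
	_ = mergeChunks(paths, func(k string) { fmt.Println(k) })
}
```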
Community Discussions
Trending Discussions on sstable
QUESTION
I am trying to understand SSTable overlaps in Cassandra, which are not suitable for TWCS. I found references like https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html, but I still don't understand what overlap means and how it is caused by read repairs. Can anyone please provide a simple example that would help me understand? Thanks
...ANSWER
Answered 2022-Mar-08 at 09:15
For TWCS, data is compacted into "time windows". If you've configured a time window of 1 hour, TWCS will compact (combine) all partitions written within a one-hour window into a single SSTable. Over a 24-hour period you will end up with 24 SSTables, one for each hour of the day.
Let's say you inspect the SSTable generated at 9am. The minimum and maximum [write] timestamps in that SSTable would be between 8am and 9am.
Now consider a scenario where a replica has missed a few mutations (writes) around 10am. All the writes between 10am and 11am will get compacted to one SSTable. If a repair runs at 3pm, the missed mutations from earlier in the day will get included in the 3pm to 4pm time window even though they really belong to the SSTable from the 10-11am time window.
In TWCS, SSTables from different time windows will not get compacted together. This means that the data from two different time windows is fragmented across two SSTables. Even if the 10-11am SSTable has expired, it cannot be dropped (deleted) from disk because there is data in the 3-4pm SSTable that overlaps with it. The 10-11am SSTable will not get dropped until all the data in the 3-4pm SSTable has also expired.
There's a simplified explanation of how TWCS works in How data is maintained in Cassandra. It includes a nice diagram which would hopefully make it easier for you to visualise how data could possibly overlap across SSTables. Cheers!
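For reference, this is roughly how a one-hour TWCS window like the one described above is configured. The sketch assumes the gocql driver, a reachable local node, and a keyspace named tsdata; the table, column names, and TTL value are placeholders.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Assumes a locally reachable cluster and an existing keyspace "tsdata".
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "tsdata"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Time-series table compacted in one-hour windows, matching the 1-hour
	// TWCS example above. The table-level TTL lets whole SSTables expire
	// together once their window has fully aged out.
	const ddl = `
	CREATE TABLE IF NOT EXISTS sensor_readings (
	    sensor_id  text,
	    reading_ts timestamp,
	    value      double,
	    PRIMARY KEY (sensor_id, reading_ts)
	) WITH compaction = {
	      'class': 'TimeWindowCompactionStrategy',
	      'compaction_window_unit': 'HOURS',
	      'compaction_window_size': 1
	  }
	  AND default_time_to_live = 86400`

	if err := session.Query(ddl).Exec(); err != nil {
		log.Fatal(err)
	}
}
```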
QUESTION
I am running nodetool rebuild, and there is a table with 400 SSTables on the one node from which streaming is happening. Only one file is being streamed at a time. Is there any way to parallelize this operation so that multiple SSTables can be streamed in parallel rather than files being streamed sequentially?
...ANSWER
Answered 2022-Mar-05 at 08:01
It isn't possible to increase the number of streaming threads. In any case, there are several factors which affect the speed of the streaming, not just network throughput. The type of disks as well as the data model have a significant impact on how quickly the JVM can serialise the data to stream, as well as how quickly it can clean up the heap (GC).
I see that you've already tried to increase the streaming throughput. Note that you'll need to increase it for both the sending and receiving nodes (and really, all nodes) otherwise, the stream will only be as fast as the slowest node. Cheers!
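Since the stream is only as fast as the slowest participant, the throughput setting has to be raised on every node. Below is a small sketch of applying that with nodetool setstreamthroughput from Go; the node addresses and the 200 Mbit/s value are placeholders, and it assumes nodetool is on the PATH with JMX reachable on the default port 7199.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Placeholder node addresses; replace with your own.
	nodes := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}

	// Streaming throughput is specified in megabits per second.
	const throughputMbits = "200"

	for _, n := range nodes {
		// Raise the cap on every node, not just the sender.
		cmd := exec.Command("nodetool", "-h", n, "-p", "7199",
			"setstreamthroughput", throughputMbits)
		out, err := cmd.CombinedOutput()
		fmt.Printf("%s: %s (err=%v)\n", n, out, err)
	}
}
```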
QUESTION
I am trying to understand the storage mechanism of Cassandra under the hood.
From reading the official doc it seems like:
- write requests are written to a mutable memtable
- when the memtable gets too large, it is written to an SSTable
So I have the following questions:
- is the memtable durable?
- if there is heavy update QPS, does it mean that there are going to be multiple versions of stale data in both the memtable and SSTables, such that read latency can increase? How does Cassandra get the latest data? And how are multiple versions of data stored?
- if there is heavy update QPS, does this mean there are a lot of tombstones?
ANSWER
Answered 2022-Feb-06 at 14:39
is the memtable durable?
The memtable is flushed to disk based on size and a few other settings, but at the point the write is accepted it is not durable in the memtable alone. There is also an entry placed in the commitlog, which by default will flush every 10 seconds (so on RF 3, you would expect a flush every 3.33 seconds). The flushing of the commitlog makes the write durable on that specific node. To entirely lose the write before this flush has occurred would require all replicas to have failed before any of them had performed a commitlog flush. As long as one of them flushed, it would be durable.
if there is heavy update QPS, does it mean that there are going to be multiple versions of stale data in both the memtable and SSTables, such that read latency can increase?
In terms of the memtable, no, there will not be stale data. In terms of the SSTables on disk, yes, there can be multiple versions of a record as it is updated over time, which can lead to an increase in read latencies. A good metric to look at is the SSTablesPerRead metric, which gives you a histogram of how many SSTables are being accessed per table for the queries you run. The p95 or higher is the main value to look at; those will be the scenarios causing slowness.
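The SSTables-per-read histogram mentioned above is exposed per table by nodetool tablehistograms. Here is a minimal sketch that shells out to it; the keyspace and table names are placeholders, and nodetool is assumed to be on the PATH.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// The output includes an "SSTables" column whose p95/p99 rows show how
	// many SSTables each read is touching for this table.
	out, err := exec.Command("nodetool", "tablehistograms",
		"my_keyspace", "my_table").CombinedOutput()
	if err != nil {
		log.Fatalf("nodetool failed: %v\n%s", err, out)
	}
	fmt.Println(string(out))
}
```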
How does Cassandra get the latest data? And how are multiple versions of data stored?
During a read, Cassandra uses the read path (bloom filters, partition summary, etc.), reads all versions of the row, and discards the parts which are not needed before returning the records to the calling application. The multiple versions of the row are a consequence of it existing in more than one SSTable.
Part of the role of compaction is to manage this scenario: it brings together the multiple copies, older and newer versions of a record, and writes out new SSTables which only retain the newer version (the SSTables it compacted together are removed).
if there is heavy update QPS, does this mean there are a lot of tombstones?
This depends on the type of update; for most normal updates, no, this does not generate tombstones. Updates on list collection types, though, can and will generate tombstones. If you are issuing deletions, then yes, it will generate tombstones.
If you are going to be running a scenario of heavy updates, then I would recommend considering LeveledCompactionStrategy instead of the default SizeTieredCompactionStrategy - it is likely to provide better read performance, but at a higher compaction IO cost.
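Switching the compaction strategy is a single ALTER TABLE. A hedged sketch using the gocql driver follows; the cluster address, keyspace, table, and sstable_size_in_mb value are assumptions, and expect a burst of compaction I/O after the change while existing SSTables are re-levelled.

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Placeholder cluster address and keyspace.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "my_keyspace"
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Switch an update-heavy table from the default STCS to LCS.
	const alter = `
	ALTER TABLE my_table WITH compaction = {
	    'class': 'LeveledCompactionStrategy',
	    'sstable_size_in_mb': 160
	}`

	if err := session.Query(alter).Exec(); err != nil {
		log.Fatal(err)
	}
}
```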
QUESTION
We deleted some old data within our 3-node Cassandra cluster (v3.11) some days ago, which shall now be restored from a snapshot. Is there a possibility to restore the data from the snapshot without losing updates made since the snapshot was taken?
There are two approaches which come to my mind
A)
- Create export via COPY keyspace.table TO xy.csv
- Truncate table
- Restore table from snapshot via sstableloader
- Reimport newer data via COPY keyspace.table FROM xy.csv
B)
- Just copy the SSTable files of the snapshot into the current table directory
Is A) a feasible option? What do we need to consider so that the COPY FROM/TO commands get synchronized over all nodes? For option B), I read that the deletion commands that happened may be executed again (tombstone rows). Can I ignore this warning if we make sure the deletion commands happened more than 10 days ago (gc_grace_seconds)?
...ANSWER
Answered 2022-Feb-01 at 18:35
For exporting/importing data from Apache Cassandra®, there is an efficient tool -- DataStax Bulk Loader (aka DSBulk). You could refer to more documentation and examples here. For getting consistent reads and writes, you could leverage --datastax-java-driver.basic.request.consistency LOCAL_QUORUM in your unload & load commands.
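A sketch of what those unload and load invocations could look like, wrapped in Go for scripting; the keyspace, table, and export path are placeholders, and it assumes dsbulk is on the PATH. The truncate/sstableloader step from option A would go between the two calls.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run shells out to dsbulk; all keyspace, table and path names below are
// placeholders.
func run(args ...string) {
	cmd := exec.Command("dsbulk", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

func main() {
	// Export the current rows before truncating / restoring the snapshot...
	run("unload",
		"-k", "my_keyspace", "-t", "my_table",
		"-url", "./export",
		"--datastax-java-driver.basic.request.consistency", "LOCAL_QUORUM")

	// ...and re-import them afterwards at the same consistency level.
	run("load",
		"-k", "my_keyspace", "-t", "my_table",
		"-url", "./export",
		"--datastax-java-driver.basic.request.consistency", "LOCAL_QUORUM")
}
```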
QUESTION
After my Mac upgraded to Monterey, I had to reinstall Cassandra, going from 3.x.x to 4.0.1.
I can't start Cassandra 4.0.1 using the 'cassandra -f' command. I see the following warnings/errors:
...ANSWER
Answered 2022-Jan-20 at 08:46
The error is here: Too many open files - you need to increase the limit on the number of open files. This can be done with the ulimit command and made permanent as described in this answer.
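For completeness, here is how a Go process can inspect and raise its own RLIMIT_NOFILE. This is only an illustration of the limit being adjusted per process; Cassandra itself needs the limit raised in the shell that starts it (ulimit -n, or launchctl on macOS) before the JVM launches.

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

func main() {
	// Current per-process open-file limits (soft and hard).
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raise the soft limit to a placeholder target, capped by the hard limit.
	const want = 10240
	if rl.Cur < want && uint64(want) <= rl.Max {
		rl.Cur = want
		if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
			log.Fatal(err)
		}
	}
}
```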
QUESTION
I have a Cassandra cluster (Cassandra v3.11.11) with 3 data centers and replication factor 3. Each node has an 800GB NVMe drive, but one of the data tables is taking up 600GB of data. This results in the below output from nodetool status:
ANSWER
Answered 2022-Jan-19 at 19:15
I personally would start with checking if the whole space is occupied by actual data, and not by snapshots - use nodetool listsnapshots to list them, and nodetool clearsnapshot to remove them. If you did take a snapshot for some reason, then after compaction the snapshots keep occupying space because the original files were removed.
The next step would be to try to clean up tombstones & deleted data from the small tables using nodetool garbagecollect, or nodetool compact with the -s option to split the table into files of different sizes. For the big table I would try to use nodetool compact with the --user-defined option on the individual files (assuming that there will be enough space for them). As soon as you free more than 200GB, you can use sstablesplit (the node should be down!) to split the big SSTable into small files (~1-2GB), so when the node starts again the data will be compacted.
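A sketch that strings the snapshot check and cleanup steps together by shelling out to nodetool; the keyspace and table names are placeholders, and it assumes nodetool is on the PATH of the node being cleaned.

```go
package main

import (
	"fmt"
	"os/exec"
)

// nodetoolCmd runs a local nodetool subcommand and returns its combined output.
func nodetoolCmd(args ...string) string {
	out, err := exec.Command("nodetool", args...).CombinedOutput()
	if err != nil {
		return fmt.Sprintf("error: %v\n%s", err, out)
	}
	return string(out)
}

func main() {
	// 1. See whether snapshots are holding on to space from compacted files.
	fmt.Println(nodetoolCmd("listsnapshots"))

	// 2. Drop all snapshots (or pass "-t <tag>" to remove a specific one).
	fmt.Println(nodetoolCmd("clearsnapshot", "--all"))

	// 3. Reclaim tombstoned / deleted data from one of the smaller tables.
	//    Keyspace and table names here are placeholders.
	fmt.Println(nodetoolCmd("garbagecollect", "my_keyspace", "small_table"))
}
```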
QUESTION
I am reading about LSM indexing in Designing Data-Intensive Applications by Martin Kleppmann.
The author states:
When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.
When the memtable gets bigger than some threshold—typically a few megabytes—write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.
In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.
From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
My question is: given that SSTables on disk are immutable, how is sorting guaranteed when new data comes in that can change the ordering of data in the SSTables (not the memtable, which is in memory)?
For example, suppose we have an SSTable on disk which has key-value pairs like [{1:a},{3:c},{4:d}]. The memtable in memory contains [{5:e},{6:f}] (which is sorted using an AVL/RB tree). Suppose we now get a new entry, [{2:b}], which should reside between [{1:a}] and [{3:c}]. How would this be handled, if SSTables on disk are immutable? In theory, we could create a new SSTable with [{2:b}] and compaction could later merge them, but wouldn't that break range-queries/reads that we perform before compaction takes place?
Thanks!
...ANSWER
Answered 2021-Dec-29 at 09:00
If new data comes in, it lands in new SSTables; existing ones are not modified. Each SSTable is read separately, and then data is consolidated from all SSTables and the memtable and put into the correct order in memory before sending. See this doc, for example, on how data is read.
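A toy model of that read path (not Cassandra's actual implementation): the "late" key 2 lands in the memtable (or a newer segment), and a range read merges the memtable with every immutable segment in key order, always preferring the newest value, so results stay sorted even before compaction runs.

```go
package main

import (
	"fmt"
	"sort"
)

// segment is a toy immutable key-value snapshot standing in for an SSTable;
// newer segments appear later in the store's slice.
type segment map[int]string

type store struct {
	memtable segment   // current mutable in-memory table
	segments []segment // flushed "SSTables", oldest first
}

// get returns the newest value for a key: memtable first, then segments
// from newest to oldest.
func (s *store) get(k int) (string, bool) {
	if v, ok := s.memtable[k]; ok {
		return v, true
	}
	for i := len(s.segments) - 1; i >= 0; i-- {
		if v, ok := s.segments[i][k]; ok {
			return v, true
		}
	}
	return "", false
}

// scan returns all key-value pairs in key order, consolidating every segment
// and the memtable with the newest value winning. This is why a range read
// stays correct even before compaction merges the files on disk.
func (s *store) scan() []string {
	merged := segment{}
	for _, seg := range s.segments { // oldest first...
		for k, v := range seg {
			merged[k] = v
		}
	}
	for k, v := range s.memtable { // ...memtable overrides last
		merged[k] = v
	}
	keys := make([]int, 0, len(merged))
	for k := range merged {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	out := make([]string, 0, len(keys))
	for _, k := range keys {
		out = append(out, fmt.Sprintf("{%d:%s}", k, merged[k]))
	}
	return out
}

func main() {
	s := &store{
		segments: []segment{{1: "a", 3: "c", 4: "d"}}, // on-disk SSTable
		memtable: segment{5: "e", 6: "f"},
	}
	s.memtable[2] = "b" // the "late" entry goes to the memtable / a new SSTable

	fmt.Println(s.scan()) // [{1:a} {2:b} {3:c} {4:d} {5:e} {6:f}]
	v, _ := s.get(2)
	fmt.Println(v) // b
}
```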
QUESTION
I want to run some of the programs in SSTable Tools, however the doc says: "Cassandra must be stopped before these tools are executed, or unexpected results will occur. Note: the scripts do not verify that Cassandra is stopped."
I installed and started Cassandra using Docker. So how do I run something like sstableutil?
...ANSWER
Answered 2022-Jan-08 at 18:30
Something like this, but you need to make sure that you have the data on the host system, or in a Docker volume (it's a good idea anyway):
- stop the container
- execute docker run -it ...volume_config... --rm cassandra sstable_command
- start the container
P.S. It really depends on the command - I remember that some commands were documented as requiring a stop, but didn't really require it.
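Making that concrete, here is one hedged way to script it from Go against the official image; the container name, volume name, keyspace, and table below are all placeholders.

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// docker runs a docker CLI command, streaming its output.
func docker(args ...string) {
	cmd := exec.Command("docker", args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}

func main() {
	// 1. Stop the running node so the offline tool sees quiescent SSTables.
	docker("stop", "my-cassandra")

	// 2. Run the tool in a throwaway container that mounts the same data
	//    volume the node uses (/var/lib/cassandra in the official image).
	docker("run", "--rm",
		"-v", "cassandra_data:/var/lib/cassandra",
		"cassandra:4.0",
		"sstableutil", "my_keyspace", "my_table")

	// 3. Start the node again.
	docker("start", "my-cassandra")
}
```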
QUESTION
I am getting these two errors, Validator.java:268 - Failed creating a merkle tree for and CassandraDaemon.java:228 - Exception in thread Thread, at the exact time 0t:00:03 each hour.
ANSWER
Answered 2021-Dec-10 at 13:18
The logs show that Cassandra failed in the validation phase of the anti-entropy repair process.
As the message "Cannot start multiple repair sessions over the same sstables" indicates, there are multiple repair sessions running on the same token range at the same time.
You need to make sure that you have no repair session currently running on your cluster, and no anti-compaction.
I suggest a rolling restart in order to stop all running repairs, then try to repair node by node.
One last suggestion is to try https://github.com/thelastpickle/cassandra-reaper, which is used to run automated repairs for Cassandra.
QUESTION
Cassandra repair is failing to run with the below error on node 1. I earlier started multiple repair sessions in parallel by mistake. I found that there is a bug, https://issues.apache.org/jira/browse/CASSANDRA-11824, which has been resolved for this same scenario, but I am already using Cassandra 3.9. Please confirm whether running nodetool scrub is the only workaround. Are there any considerations that we need to keep in mind before running scrub, as I need to run this directly in Prod?
...ANSWER
Answered 2021-Nov-05 at 04:25
Nodetool tpstats revealed that there were indeed active repair jobs, but they were not actually running, and compactionstats did not show any running jobs. So I restarted just the nodes on which the repair was stuck; this cleared up those stuck repair jobs and I was able to run a fresh repair after that.
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported