citus | Distributed PostgreSQL as an extension | Database library

by citusdata | C | Version: v11.3.0 | License: AGPL-3.0

kandi X-RAY | citus Summary

citus is a C library typically used in Database and PostgreSQL applications. citus has no bugs, it has no vulnerabilities, it has a Strong Copyleft License, and it has medium support. You can download it from GitHub.

Citus is a PostgreSQL extension that transforms Postgres into a distributed database—so you can achieve high performance at any scale.

            kandi-support Support

              citus has a medium active ecosystem.
              It has 8477 star(s) with 573 fork(s). There are 213 watchers for this library.
              There were 8 major release(s) in the last 12 months.
              There are 859 open issues and 2,079 closed issues. On average, issues are closed in 194 days. There are 60 open pull requests and 0 closed pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of citus is v11.3.0.

            kandi-Quality Quality

              citus has 0 bugs and 0 code smells.

            kandi-Security Security

              citus has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              citus code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            kandi-License License

              citus is licensed under the AGPL-3.0 License. This license is Strong Copyleft.
              Strong Copyleft licenses enforce sharing, and you can use them when creating open source projects.

            kandi-Reuse Reuse

              citus releases are available to install and integrate.
              Installation instructions, examples and code snippets are available.
              It has 1070 lines of code, 114 functions and 8 files.
              It has low code complexity. Code complexity directly impacts maintainability of the code.


            citus Key Features

            No Key Features are available at this moment for citus.

            citus Examples and Code Snippets

            No Code Snippets are available at this moment for citus.

            Community Discussions

            QUESTION

            Citus Sharding and PostgreSQL table partitioning on the same column
            Asked 2022-Feb-01 at 22:58

             We are starting a project with PostgreSQL and will need to use Citus in the near future for a multi-tenant application, so we are aiming to prepare our PostgreSQL database schema appropriately for an easy upgrade to Citus. I was reading the following page https://docs.microsoft.com/en-us/azure/postgresql/hyperscale/concepts-choose-distribution-column#best-practices and it states the following:

            "Partition distributed tables by a common tenant_id column. For instance, in a SaaS application where tenants are companies, the tenant_id is likely to be the company_id."

             The question is whether the term "Partition" in the statement above refers to PostgreSQL table partitioning (https://www.postgresql.org/docs/14/ddl-partitioning.html) or to Citus sharding by key. Does PostgreSQL table partitioning by tenant_id make any sense or provide any benefit when sharding a table in Citus with the same sharding key as the one used to partition the table in PostgreSQL (tenant_id)?

            ...

            ANSWER

            Answered 2022-Feb-01 at 22:58

            Disclaimer: Ex-Citus team member here (but no longer affiliated with Citus or Microsoft)

            I'm fairly certain that document is referencing partitions as in shards in a Citus cluster. Most Citus setups I have seen primarily use Citus sharding, and not Postgres table partitioning.

            You can use Postgres table partitioning in combination with Citus, for example if you have time-based partitions that you would want to drop after the retention time has expired. Each time-based partition could be a separate distributed table in the Citus system, so that you get benefits of both worlds (Citus to distribute across nodes, Postgres Partitioning for effective deletion and avoiding autovacuum issues). Note that you would not partition on the same column - that doesn't really make sense in my experience.
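             As a hedged sketch of combining the two, assuming a recent Citus version that supports distributing partitioned tables (all table and column names below are hypothetical):

                 -- Hypothetical example: distribute by tenant_id, partition by time.
                 CREATE TABLE events (
                     tenant_id  bigint NOT NULL,
                     event_time timestamptz NOT NULL,
                     payload    jsonb,
                     PRIMARY KEY (tenant_id, event_time)
                 ) PARTITION BY RANGE (event_time);

                 -- Shard the parent table across the cluster on tenant_id.
                 SELECT create_distributed_table('events', 'tenant_id');

                 -- Time-based partitions can simply be dropped once the retention period expires.
                 CREATE TABLE events_2022_01 PARTITION OF events
                     FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');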

            Source https://stackoverflow.com/questions/70916433

            QUESTION

            Primary Keys (UUID, Sequence) strategy on Citus
            Asked 2021-Dec-30 at 02:30

            What is the best approach for primary keys on Citus?

             UUID: no lock required, as opposed to Identity/Serial, but expensive in storage and eventually in queries, and it causes fragmentation.

             Sequence/Identity: causes a bottleneck while creating an entity, but is less expensive in storage, queries will be faster, and there is no fragmentation.

             If we want the project to be scale-ready, is it better to work with UUIDs?

            According to this post: https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/

             For shards, it eventually recommends working with UUIDs.

             How well does it perform on Citus?

            I'll give a schema example:

            ...

            ANSWER

            Answered 2021-Dec-30 at 02:30

            Disclaimer: I am the Software Engineer working on Citus engine that opened a PR for supporting UDFs in column defaults.

             In the post you shared in the question, the gen_random_uuid() UDF is used as a column default. This is not yet supported on Citus.

            I suggest you use bigserial.

            I also want to note that some of the statements in the blog are not correct for Citus. For example:

            So if you use sharding, where you distribute your data across several databases, you cannot use a sequence.

            In Citus, you can create distributed sequence objects where each worker node will use a disjoint subset of the allowed values of the sequence. The application sees a single sequence on the coordinator node, while the shards have their own sequences with different start values.

            (You could use sequences defined with an INCREMENT greater than 1 and different START values, but that might lead to problems when you add additional shards.)

             Citus shifts the start of a bigserial by node_group_id * 2^48, which means you are limited to a maximum of 2^18 shards, which is practically unlimited. You will run into bigger issues when you have petabytes of data in a single table, or a quarter of a million shards, and this limitation will not really affect your cluster.
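             A hedged sketch of the bigserial suggestion (table and column names are hypothetical):

                 -- Hypothetical example: a distributed table with a bigserial key.
                 -- Citus backs the column with a distributed sequence, so each worker
                 -- hands out IDs from a disjoint range.
                 CREATE TABLE accounts (
                     id        bigserial,
                     tenant_id bigint NOT NULL,
                     name      text,
                     PRIMARY KEY (tenant_id, id)
                 );

                 SELECT create_distributed_table('accounts', 'tenant_id');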

             PS: My work on the UDF column defaults is on hold now, as some recent code changes will make the solution cleaner and easier to maintain. We have not yet shipped the changes in a released version. https://github.com/citusdata/citus/pull/5149

            Source https://stackoverflow.com/questions/70505619

            QUESTION

            Best Approach for Developing a SaaS-based Event Analytics
            Asked 2021-Nov-22 at 07:46

            My company is developing a SaaS service to store events and provide analytics through dashboards.

             Since we won't have deletes or updates, the idea is to create a columnar-based, OLAP architecture to benefit from the compression and low latency it provides, and PostgreSQL Citus is one platform we intend to evaluate.

             The overall architecture is pretty standard: an API will receive the events and then store them on Kafka in JSON format. Then, those events will be sent to PostgreSQL. Some fields will use the "jsonb" data type.

             By reading the docs, the best practice is to distribute tables by tenant id.

             Just wanted to double-check a few things and would greatly appreciate someone's input:

            1. Does the architecture described above make sense? Is there anything we should change or pay attention to?
            2. Are there limitations in the number of nodes or shards that can be scaled out for this columnar approach?
            3. Is GIN index supported? (I believe it is, since it's not listed in 'Limitations')

            Thanks!

            ...

            ANSWER

            Answered 2021-Nov-22 at 07:46

            I've used citus for a multi-tenant service and distributing tables by the tenant id worked pretty well.

            The overall architecture that you described makes sense but I'm not sure if someone from outside or at least without some details can give you much more to work with. Sending events through Kafka and processing them to be stored somewhere is a pretty standard way of working with events these days so nothing crazy is happening there.

             There seem to be no limitations on scaling out in terms of the number of nodes, but you should keep in mind how you set your shard count from the start. Rebalancing will lock your tables and can take a while to finish, so you want to keep shards as small and easy to process as you can. Take a look here for more details: https://docs.citusdata.com/en/v10.2/faq/faq.html#how-do-i-choose-the-shard-count-when-i-hash-partition-my-data

             GIN indexes are supported; they are used in the official examples: https://docs.citusdata.com/en/v10.2/use_cases/multi_tenant.html?highlight=GIN#when-data-differs-across-tenants
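             As a hedged illustration, a sketch of a row-based distributed table with a GIN index on a jsonb column (all names are hypothetical):

                 -- Hypothetical sketch: jsonb events distributed by tenant_id.
                 CREATE TABLE events (
                     tenant_id  bigint NOT NULL,
                     event_id   bigserial,
                     properties jsonb,
                     PRIMARY KEY (tenant_id, event_id)
                 );

                 SELECT create_distributed_table('events', 'tenant_id');

                 -- Speeds up containment queries such as: properties @> '{"type": "click"}'
                 CREATE INDEX events_properties_gin ON events USING GIN (properties);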

             Also, note that you won't have failover support in the community version. You have to go with the enterprise version, which supports failover and also allows you to rebalance tables without locking them.

            I wrote a simple service to handle the failover which you can use: https://github.com/Navid2zp/citus-failover

            Source https://stackoverflow.com/questions/70039768

            QUESTION

            AWS DMS Task failed with error: Error executing source loop; Stream component failed at subtask 0
            Asked 2021-Oct-29 at 04:45

            I want to migrate my PostgresDB hosted in Citus cloud service to AWS RDS Aurora Postgres. I am using AWS DMS service. Have created task but getting following errors:

            Last failure message Last Error Stream Component Fatal error. Task error notification received from subtask 0, thread 0 [reptask/replicationtask.c:2860] [1020101] Error executing source loop; Stream component failed at subtask 0, component st_0_QOIS7XIGJDKNPY6RXMGYRLJQHY2P7IQBWIBA5NQ; Stream component 'st_0_QOIS7XIGJDKNPY6RXMGYRLJQHY2P7IQBWIBA5NQ' terminated [reptask/replicationtask.c:2868] [1020101] Stop Reason FATAL_ERROR Error Level FATAL

             Frankly speaking, I am not able to understand what is wrong here, so any help is appreciated.

            cloudwatch logs:

            ...

            ANSWER

            Answered 2021-Oct-29 at 04:45

             I changed the migration type to Full load and it worked, so ongoing replication is what fails; the Citus Cloud service doesn't support it.

            Source https://stackoverflow.com/questions/69579734

            QUESTION

            Citus installation failing on GCP
            Asked 2021-Oct-01 at 19:28

            After installing postgresql(13) on GCP, I tried installing citus using this command:

            ...

            ANSWER

            Answered 2021-Sep-30 at 10:41

             Looks like a temporary issue on the repository side. I just tried it and it works:

            Source https://stackoverflow.com/questions/69387939

            QUESTION

            measure tpc-ds benchmark on citus
            Asked 2021-Aug-13 at 17:21

             I'm trying to do some measurements on Citus (an extension of Postgres). For that, I'm running TPC-DS queries on Citus. The Citus setup I'm using is a set of containers (master, workers, and manager) taken from here: https://github.com/citusdata/docker. I can add workers by adding their containers. So far so good, but I'm having trouble doing the measurements and need some answers:

             1. To use all workers I need to run create_distributed_table/create_reference_table. Does that copy all the data to all workers (for example, does 1 TB of data become 16 TB for 16 workers)?
             2. If I don't use create_distributed_table but add workers, is there any benefit to that action?
             3. If I already ran create_distributed_table and later added a worker, does the data get redistributed, or do I need to run create_distributed_table again?
            ...

            ANSWER

            Answered 2021-Aug-13 at 17:21
             1. To use all workers I need to run create_distributed_table/create_reference_table. Does that copy all the data to all workers (for example, does 1 TB of data become 16 TB for 16 workers)?

            Reference tables are copied across the whole cluster and distributed tables are sharded across the worker nodes.
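             A hedged illustration of the distinction (table and column names are hypothetical TPC-DS-style examples):

                 -- Hypothetical sketch: a reference table is copied in full to every node,
                 -- while a distributed table is split into shards across the workers.
                 SELECT create_reference_table('date_dim');                      -- small dimension table, replicated everywhere
                 SELECT create_distributed_table('store_sales', 'ss_item_sk');   -- large fact table, sharded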

            If you ran the following queries on a Citus cluster with 16 workers for tables with 16 GB of data

            Source https://stackoverflow.com/questions/68716191

            QUESTION

            citus can I use join if I use citus?
            Asked 2021-Jul-01 at 18:53

             Can I use joins if I use Citus?

             And are they performant?


            ...

            ANSWER

            Answered 2021-Jul-01 at 18:53

             Can I use joins if I use Citus?

            Yes, you can use joins with Citus.
            https://docs.citusdata.com/en/v10.0/develop/reference_sql.html#joins.

             Most of the joins between different Citus table types are supported out of the box, but some of them require enabling repartition joins.
            https://docs.citusdata.com/en/v10.0/develop/api_guc.html?highlight=enable_repartition_joins#citus-enable-repartition-joins-boolean

            As of Citus 10, joins between local tables and distributed tables are also supported.
            https://docs.citusdata.com/en/v10.0/develop/api_guc.html?highlight=enable_repartition_joins#citus-local-table-join-policy-enum

             And are they performant?

            Citus joins tables very efficiently when tables are co-located.
            https://docs.citusdata.com/en/v10.0/develop/reference_sql.html#co-located-joins
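             A hedged sketch of a co-located join (table and column names are hypothetical):

                 -- Hypothetical sketch: two tables distributed by the same column are co-located,
                 -- so the join is pushed down and executed shard-by-shard on the workers.
                 CREATE TABLE orders   (tenant_id bigint, order_id bigint, total  numeric, PRIMARY KEY (tenant_id, order_id));
                 CREATE TABLE payments (tenant_id bigint, order_id bigint, amount numeric, PRIMARY KEY (tenant_id, order_id));

                 SELECT create_distributed_table('orders',   'tenant_id');
                 SELECT create_distributed_table('payments', 'tenant_id', colocate_with => 'orders');

                 SELECT o.order_id, o.total, p.amount
                 FROM orders o
                 JOIN payments p ON p.tenant_id = o.tenant_id AND p.order_id = o.order_id
                 WHERE o.tenant_id = 42;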

            Source https://stackoverflow.com/questions/68202956

            QUESTION

            Citus: How can I add self referencing table in distributed tables list
            Asked 2021-Apr-28 at 08:54

             I'm trying to run create_distributed_table for tables which I need to shard, and almost all of the tables have a self relation (parent-child). But when I run SELECT create_distributed_table('table-name','id'); it throws the error: cannot create foreign key constraint

            simple steps to reproduce

            ...

            ANSWER

            Answered 2021-Apr-28 at 08:54

             For the time being, it is not possible to shard a table on PostgreSQL without dropping the self-referencing foreign key constraints, or altering them to include a separate, new distribution column.

            Citus places records into shards based on the hash values of the distribution column values. It is most likely the case that the hashes of parent and child id values are different and hence the records should be stored in different shards, and possibly on different worker nodes. PostgreSQL does not have a mechanism to create foreign key constraints that reference records on different PostgreSQL clusters.

            Consider adding a new column tenant_id and adding this column to the primary key and foreign key constraints.
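             A hedged sketch of that suggestion (all names are hypothetical):

                 -- Hypothetical sketch: include tenant_id in both the primary key and the
                 -- self-referencing foreign key, then distribute by tenant_id so parent and
                 -- child rows always land in the same shard.
                 CREATE TABLE categories (
                     tenant_id bigint NOT NULL,
                     id        bigint NOT NULL,
                     parent_id bigint,
                     name      text,
                     PRIMARY KEY (tenant_id, id),
                     FOREIGN KEY (tenant_id, parent_id) REFERENCES categories (tenant_id, id)
                 );

                 SELECT create_distributed_table('categories', 'tenant_id');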

            Source https://stackoverflow.com/questions/67209861

            QUESTION

            Syncing data to Node that was offline while INSERT-ing (Citus)
            Asked 2020-Sep-29 at 07:24

             I have set my replication_factor in Citus to 2 so that every node has all the data, in case one node fails (goes offline) and I still need access to all the data.

             The problem I have is: I don't know how to get data onto a node that failed while I was using INSERT or UPDATE to add data to the DB. I have two nodes and one coordinator, so after Node1 went offline and I added data to the DB, Node2 got the data. Now I need to sync that data / those shards to Node1, because at this point Node1 is basically useless.

            ...

            ANSWER

            Answered 2020-Sep-29 at 07:24

            Found the answer: This is done through master_copy_shard_placement https://citus-doc.readthedocs.io/en/latest/develop/api_udf.html#master-copy-shard-placement Enterprise is not required.
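             A hedged sketch of repairing the stale placement (the shard id, host names, and ports are hypothetical; check the linked docs for the exact arguments in your Citus version):

                 -- Hypothetical sketch: copy the healthy placement of shard 102008 from
                 -- Node2 back to Node1 after Node1 comes online again.
                 SELECT master_copy_shard_placement(
                     102008,           -- shard id, see pg_dist_shard / pg_dist_shard_placement
                     'node2', 5432,    -- source node holding the up-to-date placement
                     'node1', 5432     -- target node that was offline
                 );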

            Source https://stackoverflow.com/questions/64029066

            QUESTION

            What are settings to lookout for with Citus PostgresQL
            Asked 2020-Sep-23 at 21:49

            We are looking into using CitusDB. After reading all the documentation we are not clear on some fundamentals. Hoping somebody can give some directions.

             In Citus you specify a shard_count and a shard_max_size; according to the docs, these settings are set on the coordinator (but, weirdly, they can also be set on a node).

            What happens when you specify 1000 shards and distribute 10 tables with 100 clients?

             1. Does it create a shard for every table (users_1, users_2, shops_1, etc.), effectively using all 1000 shards?

             2. If we then grow by another 100 clients, we have already hit the 1000 limit; how are these tables partitioned?

             3. The shard_max_size defaults to 1 GB. If a shard grows beyond 1 GB a new shard is created, but what happens when the shard_count is already hit?

             4. Lastly, is it advisable to go for 3000 shards? We read in the docs that 128 is advised for a SaaS. But this seems low if you have 100 clients * 10 tables. (I know it depends.. but..)

            ...

            ANSWER

            Answered 2020-Sep-23 at 21:49

            Former Citus/current Microsoft employee here, chiming in with some advice.

            Citus shards are based on integer hash ranges of the distribution key. When a row is inserted, the value of the distribution key is hashed, the planner looks up what shard was assigned the range of hash values that that key falls into, then looks up what worker the shard lives on, and then runs the insert on that worker. This means that the customers are divided up across shards in a roughly even fashion, and when you add a new customer it'll just go into an existing shard.

            It is critically important that all distributed tables that you wish to join to each other have the same number of shards and that their distribution columns have the same type. This lets us perform joins entirely on workers, which is awesome for performance.

            If you've got a super big customer (100x as much data as your average customer is a decent heuristic), I'd use the tenant isolation features in advance to give them their own shard. This will make moving them to dedicated hardware much easier if you decide to do so down the road.

            The shard_max_size setting has no effect on hash distributed tables. Shards will grow without limit as you keep inserting data, and hash-distributed tables will never increase their shard count under normal operations. This setting only applies to append distribution, which is pretty rarely used these days (I can think of one or two companies using it, but that's about it).

            I'd strongly advise against changing the citus.shard_count to 3000 for the use case you've described. 64 or 128 is probably correct, and I'd consider 256 if you're looking at >100TB of data. It's perfectly fine if you end up having thousands of distributed tables and each one has 128 shards, but it's better to keep the number of shards per table reasonable.
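             A hedged sketch of those recommendations (table names, the shard count, and the tenant id are hypothetical; the tenant isolation UDF may not be available in every Citus edition/version):

                 -- Hypothetical sketch: set the shard count before distributing tables.
                 -- citus.shard_count only affects tables distributed after it is set.
                 SET citus.shard_count = 128;

                 -- Same shard count and same distribution column type, so joins stay on the workers.
                 SELECT create_distributed_table('users', 'tenant_id');
                 SELECT create_distributed_table('shops', 'tenant_id');

                 -- Optional: give an outsized tenant its own shard (tenant isolation).
                 -- SELECT isolate_tenant_to_new_shard('users', 4711, 'CASCADE');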

            Source https://stackoverflow.com/questions/64029060

             Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.

            Vulnerabilities

            No vulnerabilities reported

            Install citus

            The quickest way to get started with Citus is to use the Hyperscale (Citus) deployment option in the Azure Database for PostgreSQL managed service—or set up Citus locally.
            If you already have a local PostgreSQL installation, the easiest way to install Citus is to use our packaging repo.
             If you want to set up a multi-node cluster, you can also set up additional PostgreSQL nodes with the Citus extension and add them to form a Citus cluster. For more details, see our documentation on how to set up a multi-node Citus cluster on various operating systems.
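             Once the packages are installed, the cluster setup itself happens in SQL; a hedged sketch (host names and ports are hypothetical, and the UDF names below assume a recent Citus version):

                 -- On every node (coordinator and workers):
                 CREATE EXTENSION citus;

                 -- On the coordinator, register the coordinator itself and the workers:
                 SELECT citus_set_coordinator_host('coordinator-host', 5432);
                 SELECT citus_add_node('worker-1', 5432);
                 SELECT citus_add_node('worker-2', 5432);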

            Support

             If you’re ready to get started with Citus or want to know more, we recommend reading the Citus open source documentation. Or, if you are using Citus on Azure, the Hyperscale (Citus) documentation is online and available as part of the Azure Database for PostgreSQL docs. Our Citus docs contain comprehensive use case guides on how to build a multi-tenant SaaS application, build a real-time analytics dashboard, or work with time series data.
            Find more information at:
