hyperloglog | Beta bias correction and TailCut space reduction
kandi X-RAY | hyperloglog Summary
An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 33-50% less space than typical HyperLogLog implementations. This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams" by Qingjun Xiao, You Zhou, and Shigang Chen.
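As a rough sketch of the underlying idea (this is not this library's API, and it omits the beta bias correction and TailCut register compression that the paper adds), a HyperLogLog update hashes each element, uses the first p bits of the hash to pick a register, and keeps the maximum number of leading zeros observed in the remaining bits:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Minimal sketch of the classic HyperLogLog register update.
// NOT this library's API; beta bias correction and TailCut compression
// are refinements layered on top of this basic scheme.
public class HllSketch {
    private final int p = 14;                          // 2^14 = 16384 registers
    private final byte[] registers = new byte[1 << p];

    public void add(String element) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(element.getBytes(StandardCharsets.UTF_8));
        long hash = 0;
        for (int i = 0; i < 8; i++) hash = (hash << 8) | (digest[i] & 0xffL);

        int bucket = (int) (hash >>> (64 - p));        // first p bits choose a register
        long rest = hash << p;                         // remaining 64 - p bits
        int rank = Math.min(64 - p + 1, Long.numberOfLeadingZeros(rest) + 1);
        if (rank > registers[bucket]) {
            registers[bucket] = (byte) rank;           // keep the maximum rank per register
        }
    }
}

The cardinality estimate is then derived, roughly, from the harmonic mean of 2^register across all registers; the referenced paper's contributions reduce both the bias of that estimate and the number of bits needed per register.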
hyperloglog Key Features
hyperloglog Examples and Code Snippets
import java.io.File;
import org.redisson.config.Config;

// 1. Create a Redisson config object
Config config = new Config();
config.useClusterServers()
      // use "rediss://" for an SSL connection
      .addNodeAddress("redis://127.0.0.1:7181");

// 2. ...or read the config from a YAML file
//    (filename is a placeholder; fromYAML throws IOException)
config = Config.fromYAML(new File("config-file.yaml"));
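A possible continuation, for illustration only (RedissonClient, getHyperLogLog, add and count are Redisson API calls, not part of this library, and the key name is made up), showing the config being turned into a client that uses Redis' built-in HyperLogLog type:

import org.redisson.Redisson;
import org.redisson.api.RHyperLogLog;
import org.redisson.api.RedissonClient;

// Build the client from the config above and use Redis' HyperLogLog type.
RedissonClient redisson = Redisson.create(config);

RHyperLogLog<String> visits = redisson.getHyperLogLog("daily:visits");
visits.add("user-42");
visits.add("user-7");
long estimatedUniques = visits.count();   // approximate distinct count

redisson.shutdown();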
Community Discussions
Trending Discussions on hyperloglog
QUESTION
I have written an AGGREGATE function that approximates a SELECT COUNT(DISTINCT ...) over a UUID column, a kind of poor man's HyperLogLog (with different perf characteristics). However, it is very slow, because I am using set_bit on a BIT and that has copy-on-write semantics.
So my question is:
- is there a way to update a BIT or bytea in place / mutably?
- failing that, are there any binary data structures that allow mutable/in-place set_bit edits?
A constraint is that I can't push C code or extensions to implement this, but I can use extensions that are available in AWS RDS Postgres. If it's not faster than HLL then I'll just be using HLL. Note that HLL is optimised for pre-aggregated counts; it isn't terribly fast at doing ad-hoc count estimates over millions of rows (although still faster than a raw COUNT DISTINCT).
Below is the code for context, probably buggy too:
...ANSWER
Answered 2021-Feb-09 at 11:58
Yeah, SQL isn't actually that fast for raw computation. I might try a UDF, perhaps pljava or plv8 (JavaScript), which compile just-in-time to native code and are available on most major hosting providers. For maximum performance (at maximum pain), use C, perhaps via LLVM. Plv8 should take minutes to prototype: just pass it an array constructed with array_agg(). Obviously keep the array size to millions of items, or find a way to roll up your sketches (bitwise-AND?). https://plv8.github.io/#function-calls https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-array_agg-function/
FYI HyperLogLog is available as an open source extension for PostgreSQL from Citus/Microsoft and is of course available on Azure. https://www.google.com/search?q=hyperloglog+postgres (You could crib from their code and just change the core algorithm, then test side by side.) Citus is pretty easy to install, so this isn't a bad option.
QUESTION
I'm using the HyperLogLog (hll) field to represent unique users, via the Django django-pg-hll package. What I'd like to do is get a cumulative total of unique users over a specific time period, but I'm having trouble doing this.
Given a model like:
...ANSWER
Answered 2020-May-26 at 07:55
This bug occurs because the django-pg-hll package uses the hll_cardinality function instead of the # operator for window functions. Moving to a raw SQL solution fixed the issue.
QUESTION
Context: we have telemetry system for our service and would like to track retention, how many users use various features, etc.
There are two options to deal with user identifiable information and be GDPR compliant:
- Support deleting user information based on request
- Keep data for less than 30 days
Option #1 is hard to implement (for a telemetry system). Option #2 doesn't allow answering questions such as "what is the 6-month retention for feature X?".
One idea for answering the above question is to calculate HyperLogLog blobs per feature every week/day and store them separately forever. Going forward, this would allow merging, dcounting, and calculating retention based on these blobs.
Assuming that any user-identifiable information is gone after 30 days (after the user account gets deleted), will HyperLogLog blobs still allow users to be tracked (i.e. answering whether a particular user used feature X two years ago)?
If they do, then this approach is not compliant (though if they don't, that alone doesn't make it compliant).
...ANSWER
Answered 2020-Apr-27 at 23:16
In general, HLLs are not GDPR compliant. This issue was somewhat addressed in a recent Google paper (see Section 8: 'Mitigation strategies').
The hash functions used in HLLs are usually not cryptographically secure (typically MurmurHash), so even with salting you might still be able to answer the question "is this user part of this HLL data structure or not?", and that's a no-no.
AFAIK you would only be in compliance keeping HLLs around for longer than 30 days if you apply a salted cryptographic hash prior to HLL aggregation (e.g. a salted SHA-2, BLAKE2b, or BLAKE3) and destroy the salt after each <30-day period. This would allow you to keep <30-day intervals. You would not be able to merge HLLs across several intervals, only within each 28-day chunk, but that can still be super valuable depending on your business needs.
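A small sketch of that mitigation (using the JDK's MessageDigest with SHA-256, since BLAKE2 is not in the standard library; the HLL insert itself is a placeholder method, not a real API):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Sketch: salt user IDs with a per-period secret before feeding them to the HLL.
// Destroying the salt at the end of the period makes the stored sketches
// unlinkable to individual users afterwards.
public class SaltedHllIngest {
    private byte[] periodSalt = newSalt();          // rotate (and destroy) every <30 days

    private static byte[] newSalt() {
        byte[] salt = new byte[32];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    public void recordUser(String userId) throws NoSuchAlgorithmException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(periodSalt);
        byte[] pseudonym = sha.digest(userId.getBytes(StandardCharsets.UTF_8));
        addToHll(pseudonym);                        // placeholder for the actual HLL insert
    }

    public void rotatePeriod() {
        periodSalt = newSalt();                     // old salt discarded; old sketches stay usable for counting
    }

    private void addToHll(byte[] hashedId) {
        // insert into whichever HyperLogLog implementation is in use
    }
}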
QUESTION
I am trying to use redis in my nodejs project. I see that to build redis you need the make command and gcc. I have installed cygwin on my Windows machine and then installed both make and gcc.
I downloaded redis from here https://redis.io/download and, as per the instructions -
ANSWER
Answered 2017-May-16 at 10:05
As @FluffyNights suggested, redis does not support Windows. I got it working using https://github.com/MSOpenTech/redis
Download the release from https://github.com/MSOpenTech/redis/releases/download/win-3.2.100/Redis-x64-3.2.100.zip and extract the compressed file.
You will find redis-server.exe there; just execute it to start the redis server. The package also includes a redis client, which you can use to execute commands such as save, get, etc.
QUESTION
To calculate quantiles, I use the approxQuantile method accessible from the stat() function on any Dataset or Dataframe in Spark. The way it approximates them is explained in this post.
ANSWER
Answered 2019-Sep-23 at 06:06
The approxQuantile function in Spark can be used to calculate exact quantiles. From the documentation we see that there are 3 parameters:
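For illustration, a small sketch with the Spark Java API (the DataFrame here is synthetic; stat().approxQuantile(column, probabilities, relativeError) is the call the answer refers to, and passing a relative error of 0.0 requests an exact computation):

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class QuantileExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("quantiles").master("local[*]").getOrCreate();

        // Toy DataFrame with a single numeric column named "value" (0..99).
        Dataset<Row> df = spark.range(100).toDF("value");

        // probabilities = which quantiles to compute;
        // relativeError = 0.0 requests an exact (but more expensive) computation.
        double[] quartiles = df.stat()
                .approxQuantile("value", new double[]{0.25, 0.5, 0.75}, 0.0);
        System.out.println(Arrays.toString(quartiles));   // roughly [25.0, 50.0, 75.0]

        spark.stop();
    }
}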
QUESTION
I want to implement URL filtering for a distributed crawling system on top of a Redis database (i.e. don't visit the same URL twice). I need some way to keep track of all visited URLs with a minimal memory footprint; there is no need to store full URLs, just to check whether a particular URL has been visited or not. Bloom filters sound right in this case, and I saw a native Redis module implementing Bloom filters. But Redis also has a built-in HyperLogLog data structure, so I'm wondering which one is the better choice in my scenario.
...ANSWER
Answered 2019-Feb-24 at 15:31
A Bloom filter is totally different from a HyperLogLog. A Bloom filter is used for checking whether an item has been seen before (duplicate detection), while a HyperLogLog is used for distinct counting. In your case, you should use a Bloom filter.
Also see this question for their differences.
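To make the distinction concrete, here is a hedged sketch using Guava's BloomFilter as a stand-in for the membership check (with Redis itself you would use the RedisBloom module's BF.ADD/BF.EXISTS instead; the expected-insertions and false-positive-rate numbers below are arbitrary):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class UrlSeenFilter {
    public static void main(String[] args) {
        // Membership question ("have I crawled this URL?") -> Bloom filter.
        BloomFilter<String> seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10_000_000,   // expected number of URLs (arbitrary)
                0.01);        // acceptable false-positive rate (arbitrary)

        String url = "https://example.com/page";
        if (!seen.mightContain(url)) {   // "definitely new" or "probably seen"
            seen.put(url);
            // ... schedule the URL for crawling ...
        }

        // Counting question ("how many distinct URLs did we crawl?") would be
        // the job of a HyperLogLog (e.g. Redis PFADD/PFCOUNT), not a Bloom filter.
    }
}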
QUESTION
I want to implement a special counter: all increment operations time out after a fixed period of time (say 30 days).
An example:
- Day 0: counter = 0. TTL = 30 days
- Day 1: increment counter (+1)
- Day 2: increment counter (+1)
- Day 3: value of counter == 2
- Day 31: value of counter == 1
- Day 32: value of counter == 0
A naïve implementation is to maintain a set of timestamps, where each timestamp equals the time of an increment. The value of the counter equals the size of the set after subtracting all timestamps that have timed out.
This naïve counter uses O(n) space (the size of the set), with O(n) lookups and O(1) inserts. The values are exact.
Better solution (for me): trade accuracy for speed and memory.
I want a counter with O(1) lookups and inserts and O(1) space; accuracy may be less than exact.
Alternatively, I would accept O(log n) space and lookup.
The counter representation should be suited for storage in a database field, i.e., I should be able to update and poll the counter rapidly without too much (de)serialization overhead.
I'm essentially looking for a counter that resembles a HyperLogLog counter, but for a different type of approximate count: decaying increments vs. number of distinct elements
How could I implement such a counter?
...ANSWER
Answered 2017-Feb-17 at 11:31
Since the increments expire in the same order as they happen, the timestamps form a simple queue.
The current value of the counter can be stored separately in O(1) additional memory. At the start of each operation (insert or query), while the front of the queue is expired, it's popped out of the queue, and the counter is decreased.
Note that each of the n timestamps is created and popped out once. Thus you have O(1) amortized time to access the current value, and O(n) memory to store the non-expired timestamps. The actual highest memory usage is also limited by the ratio of TTL / frequency of new timestamp insertions.
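A minimal Java sketch of that queue-based approach (class and method names are made up; a real version would also need persistence and concurrency handling):

import java.util.ArrayDeque;

// Counter whose increments expire after a fixed TTL.
// Expired timestamps are lazily evicted on every insert or query.
public class ExpiringCounter {
    private final long ttlMillis;
    private final ArrayDeque<Long> timestamps = new ArrayDeque<>();

    public ExpiringCounter(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public void increment() {
        evictExpired();
        timestamps.addLast(System.currentTimeMillis());
    }

    public int value() {
        evictExpired();
        return timestamps.size();   // O(1); eviction cost is O(1) amortized
    }

    private void evictExpired() {
        long cutoff = System.currentTimeMillis() - ttlMillis;
        while (!timestamps.isEmpty() && timestamps.peekFirst() < cutoff) {
            timestamps.pollFirst();
        }
    }
}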
QUESTION
I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.
The old way was to do this keeping a 28 day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days they would have 54 rows in the DB. This ends up being a large table and is time consuming even to calculate approx_distinct and not distinct. But the issue is that I also wish to calculate approx_percentiles.
So I started investigating the user of HyperLogLog https://prestodb.io/docs/current/functions/hyperloglog.html
This works great, its much more efficient storing the sketches daily rather than the entire list of unique users per day. As I am using approx_distinct the values are close enough and it works.
I then noticed a similar function for medians. Qdigest. https://prestodb.io/docs/current/functions/qdigest.html Unfortunately the documentation is not nearly as good on this page as it is on previous pages, so it took me a while to figure it out. This works great for calculating daily medians. But it does not work if I want to calculate the median actions per user over the longer time period. The examples in HyperLogLog demonstrate how to calculate approx_distinct users over a time period but the Qdigest docs do not give such an example.
When I try something similar to the HLL example for date ranges with qdigest, I get results similar to the 1-day results.
...ANSWER
Answered 2019-Jan-29 at 01:32
Because you need medians that are aggregated (summed) across multiple days on a per-user basis, you'll need to perform that aggregation prior to insertion into the qdigest in order for 7- and 28-day per-user counts to work. In other words, the units of the data need to be consistent: if daily values are being inserted into the qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
QUESTION
I got a List of integers and I want to count the number of times each integer appears in the list.
For example: [0,5,0,1,3,3,1,1,1] gives (0 -> 2), (1 -> 4), (3 -> 2), (5 -> 1). I only need the counts, not the values (the goal is to have a histogram of the counts).
A common approach would be to group by value and then count the cardinality of each group. In SQL: SELECT count(*) FROM myTable GROUP BY theColumnContainingIntegers.
Is there a faster way to do this? A heuristic or probabilistic approach is fine, since I am processing a large data set and sacrificing precision for speed is acceptable.
Something similar to HyperLogLog algorithm (used to count the number of distinct elements in a data set) would be great, but I did not find anything like this...
...ANSWER
Answered 2018-Dec-02 at 12:14
Let's take your set containing 9 elements, [0,5,0,1,3,3,1,1,1], and make it big, but with the same frequencies of elements:
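The rest of the answer is cut off above. As a baseline, the exact group-by-count the question describes is a single hash-map pass; the sampled variant below (a hedged sketch, not the answerer's code) trades precision for speed in the spirit of the question:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class FrequencyHistogram {

    // Exact: one pass, O(n) time, O(#distinct values) space.
    static Map<Integer, Integer> exactCounts(List<Integer> values) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        return counts;
    }

    // Approximate: count over a random sample and scale back up.
    // Precision depends on the sample rate.
    static Map<Integer, Long> sampledCounts(List<Integer> values, double sampleRate) {
        Random rnd = new Random();
        Map<Integer, Integer> sampled = new HashMap<>();
        for (int v : values) {
            if (rnd.nextDouble() < sampleRate) {
                sampled.merge(v, 1, Integer::sum);
            }
        }
        Map<Integer, Long> estimated = new HashMap<>();
        sampled.forEach((k, c) -> estimated.put(k, Math.round(c / sampleRate)));
        return estimated;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(0, 5, 0, 1, 3, 3, 1, 1, 1);
        System.out.println(exactCounts(data));          // {0=2, 1=4, 3=2, 5=1}
        System.out.println(sampledCounts(data, 0.5));   // rough estimate
    }
}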
QUESTION
I am learning HyperLogLog examples with redis-cli.
The redis-cli examples show how you can use HyperLogLog commands to record and count unique user visits to a website.
The PFADD command adds one or more strings to a HyperLogLog. PFADD returns 1 if the cardinality was changed and 0 if it remains the same.
Nonetheless, it reports an error when I follow the instructions:
...ANSWER
Answered 2018-Aug-15 at 17:42
The error, even if terse, is quite informative - you are trying to use an existing key (i.e. visits:2015-01-01) that is not an HLL.
The existing key is possibly a string of some kind, but if you DEL visits:2015-01-01, you should be able to PFADD to it.
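The same recovery expressed with a Java Redis client (Jedis is assumed here; the key name matches the one in the question, and the visitor IDs are made up):

import redis.clients.jedis.Jedis;

public class PfaddRecovery {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            String key = "visits:2015-01-01";

            // A WRONGTYPE error means the key already holds a non-HLL value,
            // so drop it first (only do this if the old value is disposable).
            jedis.del(key);

            // Record some visitors; PFADD returns 1 when the estimate changed.
            jedis.pfadd(key, "user:alice", "user:bob", "user:alice");

            // Approximate number of distinct visitors for that day.
            System.out.println(jedis.pfcount(key));   // ~2
        }
    }
}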
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported