hyperloglog | Beta bias correction and TailCut space reduction

by axiomhq | Go | Version: Current | License: MIT

kandi X-RAY | hyperloglog Summary

hyperloglog is a Go library. It has no reported bugs or vulnerabilities, a permissive license, and medium support. You can download it from GitHub.

An improved version of HyperLogLog for the count-distinct problem, approximating the number of distinct elements in a multiset using 33-50% less space than other usual HyperLogLog implementations. This work is based on "Better with fewer bits: Improving the performance of cardinality estimation of large data streams - Qingjun Xiao, You Zhou, Shigang Chen".
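
To make the count-distinct idea concrete, here is a minimal sketch of a plain HyperLogLog in Go. It is not the axiomhq implementation: it leaves out the bias correction and space reduction that the description above refers to, and the names `Sketch`, `hash64`, and `userKey`, as well as the choice of SHA-256 as the hash, are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
	"math/bits"
	"strconv"
)

// precision is the number of hash bits used to pick a register (2^14 registers).
const precision = 14

// Sketch is a plain HyperLogLog with one 8-bit register per bucket.
// Hypothetical type for illustration, not the library's Sketch.
type Sketch struct {
	regs [1 << precision]uint8
}

// hash64 uses the first 8 bytes of SHA-256 as a well-mixed 64-bit hash.
func hash64(b []byte) uint64 {
	sum := sha256.Sum256(b)
	return binary.BigEndian.Uint64(sum[:8])
}

// Insert records one element; inserting a duplicate changes nothing.
func (s *Sketch) Insert(b []byte) {
	h := hash64(b)
	idx := h >> (64 - precision) // top bits select the register
	// Remaining bits, plus a sentinel bit so the rank is capped.
	rest := h<<precision | 1<<(precision-1)
	rank := uint8(bits.LeadingZeros64(rest) + 1)
	if rank > s.regs[idx] {
		s.regs[idx] = rank
	}
}

// Merge folds other into s by taking the register-wise maximum.
func (s *Sketch) Merge(other *Sketch) {
	for i, r := range other.regs {
		if r > s.regs[i] {
			s.regs[i] = r
		}
	}
}

// Estimate returns the approximate number of distinct inserted elements.
func (s *Sketch) Estimate() uint64 {
	m := float64(len(s.regs))
	sum, zeros := 0.0, 0
	for _, r := range s.regs {
		sum += math.Ldexp(1, -int(r)) // 2^-r
		if r == 0 {
			zeros++
		}
	}
	alpha := 0.7213 / (1 + 1.079/m)
	est := alpha * m * m / sum
	if est <= 2.5*m && zeros > 0 {
		est = m * math.Log(m/float64(zeros)) // linear counting for small sets
	}
	return uint64(est + 0.5)
}

// userKey builds a sample element; purely illustrative.
func userKey(i int) []byte {
	return []byte("user-" + strconv.Itoa(i))
}

func main() {
	s := &Sketch{}
	for i := 0; i < 200000; i++ {
		s.Insert(userKey(i))
	}
	fmt.Println("estimated distinct elements:", s.Estimate())
}
```

With 2^14 registers the expected relative error of this plain variant is roughly 1%, and the sketch occupies 16 KiB regardless of how many elements are inserted.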

            kandi-support Support

              hyperloglog has a medium active ecosystem.
              It has 853 star(s) with 65 fork(s). There are 18 watchers for this library.
              It had no major release in the last 6 months.
              There are 3 open issues and 8 have been closed. On average, issues are closed in 188 days. There is 1 open pull request and 0 closed requests.
              It has a neutral sentiment in the developer community.
              The latest version of hyperloglog is current.

            kandi-Quality Quality

              hyperloglog has no bugs reported.

            kandi-Security Security

              hyperloglog has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              hyperloglog is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              hyperloglog releases are not available. You will need to build from source code and install.

            hyperloglog Key Features

            No Key Features are available at this moment for hyperloglog.

            hyperloglog Examples and Code Snippets

            Features
            Lines of Code: 34 | License: No License
            // 1. Create config object
            Config config = new Config();
            config.useClusterServers()
                   // use "rediss://" for SSL connection
                  .addNodeAddress("redis://127.0.0.1:7181");
            
            // or read config from file
            config = Config.fromYAML(new File("config-f  

            Community Discussions

            QUESTION

            Is there a postgres function to mutably update a binary data structure?
            Asked 2021-Feb-09 at 12:34

            I have written an AGGREGATE function that approximates a SELECT COUNT(DISTINCT ...) over a UUID column, a kind of poor man's HyperLogLog (and having different perf characteristics).

            However, it is very slow because I am using set_bit on a BIT and that has copy-on-write semantics.

            So my question is:

            1. is there a way to inplace / mutably update a BIT or bytea?
            2. failing that, are there any binary data structures that allow mutable/in-place set_bit edits?

            A constraint is that I can't push C code or extensions to implement this. But I can use extensions that are available in AWS RDS postgres. If it's not faster than HLL then I'll just be using HLL. Note that HLL is optimised for pre-aggregated counts, it isn't terribly fast at doing adhoc count estimates over millions of rows (although still faster than a raw COUNT DISTINCT).

            Below is the code for context, probably buggy too:

            ...

            ANSWER

            Answered 2021-Feb-09 at 11:58

            Yeah, SQL isn't actually that fast for raw computation. I might try a UDF, perhaps pljava or plv8 (JavaScript), which compile just-in-time to native code and are available on most major hosting providers. Of course, for maximum performance at maximum pain, use C (perhaps via LLVM). Plv8 should take minutes to prototype: just pass an array constructed from array_agg(). Obviously keep the array size to millions of items, or find a way to roll up your sketches (bitwise-AND?). https://plv8.github.io/#function-calls https://www.postgresqltutorial.com/postgresql-aggregate-functions/postgresql-array_agg-function/

            FYI, HyperLogLog is available as an open source extension for PostgreSQL from Citus/Microsoft and is of course available on Azure. https://www.google.com/search?q=hyperloglog+postgres (You could crib from their code and just change the core algorithm, then test side by side.) Citus is pretty easy to install, so this isn't a bad option.
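
The question's own code is elided, so the following is only a guess at the kind of structure involved: a bitmap-based distinct counter (linear counting) updated in place, written in Go rather than SQL. The names `Bitmap`, `Add`, `Union`, and `item` are hypothetical. It also illustrates the roll-up idea from the answer: for a bitmap in which a set bit means "seen", two sketches are combined with bitwise OR.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math"
	"math/bits"
	"strconv"
)

// numBits is the bitmap size; larger bitmaps handle larger sets.
const numBits = 1 << 13

// Bitmap is a linear-counting sketch: one bit per hash bucket.
type Bitmap struct {
	words [numBits / 64]uint64
}

// Add sets the bit chosen by the item's hash, mutating the bitmap in place.
func (b *Bitmap) Add(item string) {
	sum := sha256.Sum256([]byte(item))
	pos := binary.BigEndian.Uint64(sum[:8]) % numBits
	b.words[pos/64] |= 1 << (pos % 64)
}

// Union rolls another sketch into b with a word-wise bitwise OR.
func (b *Bitmap) Union(other *Bitmap) {
	for i := range b.words {
		b.words[i] |= other.words[i]
	}
}

// Estimate applies the linear-counting formula -m * ln(zeroBits / m).
func (b *Bitmap) Estimate() float64 {
	set := 0
	for _, w := range b.words {
		set += bits.OnesCount64(w)
	}
	zeros := numBits - set
	if zeros == 0 {
		return math.Inf(1) // bitmap saturated, no estimate possible
	}
	return -numBits * math.Log(float64(zeros)/numBits)
}

// item builds a sample value; purely illustrative.
func item(i int) string {
	return "uuid-" + strconv.Itoa(i)
}

func main() {
	var monday, tuesday Bitmap
	for i := 0; i < 600; i++ {
		monday.Add(item(i))
	}
	for i := 400; i < 1000; i++ {
		tuesday.Add(item(i))
	}
	monday.Union(&tuesday)
	fmt.Printf("estimated distinct values across both days: %.0f\n", monday.Estimate())
}
```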

            Source https://stackoverflow.com/questions/66118133

            QUESTION

            Django cumulative sum of HyperLogLog (HLL) Postgres field
            Asked 2020-May-26 at 07:55

            I'm using the HyperLogLog (hll) field to represent unique users, using the Django django-pg-hll package. What I'd like to do is get a cumulative total of unique users over a specific time period, but I'm having trouble doing this.

            Given a model like:

            ...

            ANSWER

            Answered 2020-May-26 at 07:55

            This bug occurs because the django-pg-hll package uses the hll_cardinality function instead of the # operator for window functions. Moving to a raw SQL solution fixed the issue.

            Source https://stackoverflow.com/questions/61937596

            QUESTION

            Is usage analysis based on HyperLogLog compliant with GDPR?
            Asked 2020-Apr-27 at 23:16

            Context: we have telemetry system for our service and would like to track retention, how many users use various features, etc.

            There are two options to deal with user identifiable information and be GDPR compliant:

            1. Support deleting user information based on request
            2. Keep data for less than 30 days

            Option #1 is hard to implement (for telemetry system). Option #2 doesn't allow answering questions such as "what is 6-month retention for feature X?".

            One idea for answering the above question is to calculate HyperLogLog blobs per feature every week/day and store them separately forever. This would allow us to merge/dcount/calculate retention based on these blobs going forward.

            Assuming that any user identifiable information is gone after 30 days (after user account gets deleted), will HyperLogLog blobs still allow to track users or not (i.e. to answer whether a particular user used feature X two years ago)?

            If it allows then it is not compliant (doesn't mean that it is compliant if it doesn't allow).

            ...

            ANSWER

            Answered 2020-Apr-27 at 23:16

            In general HLLs are not GDPR compliant. This issue was somewhat addressed in a recent Google paper (see Section 8: 'Mitigation strategies').

            The hash functions used in HLLs are usually not cryptographically secure (usually MurmurHash), hence even with salting you might still be able to answer the question "is a user part of a HLL data structure or not", and that's a no-no.

            Afaik you would be in compliance if you keep HLLs around for longer than 30 days iff you apply a salted crypto hash prior to HLL aggregation (i.e. a salted SHA-2 or BLAKE2b, BLAKE3) and you destroy the salt after each <30 day period. This would allow you to keep <30 day intervals. You would not be able to merge HLLs over several intervals but only over 28 day chunks, but that can still be super valuable depending on your business needs.
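
The salting step described above could look roughly like the Go sketch below. It uses HMAC-SHA-256 keyed with the per-period salt, which is one way to build a salted SHA-2 hash and is an assumption, as are the function names `newSalt`, `pseudonymize`, and `wipe`. The resulting digest, not the raw user ID, is what would be fed into the HLL. The sketch only shows the mechanism and makes no claim about compliance.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

// newSalt generates a fresh random salt for one retention period.
func newSalt() ([]byte, error) {
	salt := make([]byte, 32)
	_, err := rand.Read(salt)
	return salt, err
}

// pseudonymize returns HMAC-SHA-256(salt, userID). Within one period the
// same user always maps to the same digest, so distinct counts still work.
func pseudonymize(salt []byte, userID string) []byte {
	mac := hmac.New(sha256.New, salt)
	mac.Write([]byte(userID))
	return mac.Sum(nil)
}

// wipe overwrites the salt in memory. Any persisted copy of the salt
// would have to be deleted as well when the period ends.
func wipe(salt []byte) {
	for i := range salt {
		salt[i] = 0
	}
}

func main() {
	salt, err := newSalt()
	if err != nil {
		panic(err)
	}
	digest := pseudonymize(salt, "user-42")
	fmt.Printf("value to insert into the HLL: %x\n", digest)

	// At the end of the period the salt is destroyed, so digests from
	// different periods can no longer be linked to each other.
	wipe(salt)
}
```

Because each period uses a different salt, sketches from different periods count the same user as different elements, which is why merging across periods is not meaningful.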

            Source https://stackoverflow.com/questions/57000767

            QUESTION

            Unable to make redis on windows machine
            Asked 2020-Jan-09 at 10:11

            I am trying to use redis in my nodejs project. I see that to build redis you need the make command and gcc. I have installed cygwin on my windows machine and then installed both make and gcc.

            I downloaded redis from here https://redis.io/download and as per the instructions -

            ...

            ANSWER

            Answered 2017-May-16 at 10:05

            As @FluffyNights pointed out, redis does not support windows. I got it working using https://github.com/MSOpenTech/redis

            Download releases from here https://github.com/MSOpenTech/redis/releases/download/win-3.2.100/Redis-x64-3.2.100.zip then extract this compressed file.

            You will find redis-server.exe there; just execute it to start the redis server. The archive also includes a redis client, which you can use to execute commands to save data, get data, etc.

            Source https://stackoverflow.com/questions/43987090

            QUESTION

            Exact quantiles instead of approximate ones in Spark?
            Asked 2019-Sep-25 at 12:07

            To calculate quantiles, I use the approxQuantile method accessible from the stat() function in any Dataset or Dataframe of Spark. The way it approximates them is explained in this post.

            ...

            ANSWER

            Answered 2019-Sep-23 at 06:06

            The approxQuantile function in Spark can be used to calculate exact quantiles. From the documentation we see that there are 3 parameters:

            Source https://stackoverflow.com/questions/58055807

            QUESTION

            URL filtering on top of Redis: Bloom filters or HyperLogLog data structure
            Asked 2019-Feb-24 at 15:31

            I want to implement URL filtering for a distributed crawling system on top of a Redis database (e.g. don't visit the same URL twice, so I need to keep track of all of them with a minimal memory footprint; there is no need to store full URLs, just to check whether a particular URL has been visited or not). A Bloom filter sounds right in this case, and I saw a native module for Redis implementing Bloom filters. But Redis also has the built-in HyperLogLog data structure, so I'm wondering which one is the better choice in my scenario.

            ...

            ANSWER

            Answered 2019-Feb-24 at 15:31

            A Bloom filter is totally different from HyperLogLog. A Bloom filter is used for checking whether an item has already been seen (i.e. detecting duplicates), while HyperLogLog is used for distinct counting. In your case, you should use a Bloom filter.
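
To illustrate the membership check that a Bloom filter provides, here is a minimal in-memory sketch in Go. It is not the Redis module mentioned in the question; the type `Bloom`, its methods, and the sizing (2^20 bits, 7 probes) are assumptions chosen for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Bloom is a fixed-size Bloom filter with k probe positions per item.
type Bloom struct {
	bits []uint64
	k    int
}

// NewBloom allocates a filter with at least mBits bits and k probes.
func NewBloom(mBits, k int) *Bloom {
	return &Bloom{bits: make([]uint64, (mBits+63)/64), k: k}
}

// positions derives k bit positions from one SHA-256 digest using
// double hashing: pos_i = h1 + i*h2 (mod m).
func (b *Bloom) positions(url string) []uint64 {
	sum := sha256.Sum256([]byte(url))
	h1 := binary.BigEndian.Uint64(sum[0:8])
	h2 := binary.BigEndian.Uint64(sum[8:16]) | 1 // keep the step odd
	m := uint64(len(b.bits)) * 64
	pos := make([]uint64, b.k)
	for i := range pos {
		pos[i] = (h1 + uint64(i)*h2) % m
	}
	return pos
}

// Add marks a URL as visited.
func (b *Bloom) Add(url string) {
	for _, p := range b.positions(url) {
		b.bits[p/64] |= 1 << (p % 64)
	}
}

// MayContain reports false if the URL was definitely never added and
// true if it was probably added (false positives are possible).
func (b *Bloom) MayContain(url string) bool {
	for _, p := range b.positions(url) {
		if b.bits[p/64]&(1<<(p%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	visited := NewBloom(1<<20, 7)
	visited.Add("https://example.com/a")
	fmt.Println(visited.MayContain("https://example.com/a"))
	fmt.Println(visited.MayContain("https://example.com/b"))
}
```

A HyperLogLog cannot answer the per-URL question at all; it only estimates how many distinct URLs were added in total.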

            Also see this question for their differences.

            Source https://stackoverflow.com/questions/54825656

            QUESTION

            How to implement a counter that decays over time?
            Asked 2019-Feb-21 at 23:11

            Requirements of special counter

            I want to implement a special counter: all increment operations time out after a fixed period of time (say 30 days).

            An example:

            • Day 0: counter = 0. TTL = 30 days
            • Day 1: increment counter (+1)
            • Day 2: increment counter (+1)
            • Day 3: value of counter == 2
            • Day 31: value of counter == 1
            • Day 32: value of counter == 0

            Naive solution

            A naïve implementation is to maintain a set of timestamps, where each timestamp equals the time of an increment. The value of the counter equals the size of the set after subtracting all timestamps that have timed out.

            This naïve counter has O(n) space (size of the set), has O(n) lookup and O(1) inserts. The values are exact.

            Better solution (for me)

            Trade speed and memory for accuracy.

            I want a counter with O(1) lookup and insert, O(1) space. The accuracy < exact.

            Alternatively, I would accept O(log n) space and lookup.

            The counter representation should be suited for storage in a database field, i.e., I should be able to update and poll the counter rapidly without too much (de)serialization overhead.

            I'm essentially looking for a counter that resembles a HyperLogLog counter, but for a different type of approximate count: decaying increments vs. number of distinct elements

            How could I implement such a counter?

            ...

            ANSWER

            Answered 2017-Feb-17 at 11:31

            Since the increments expire in the same order as they happen, the timestamps form a simple queue.

            The current value of the counter can be stored separately in O(1) additional memory. At the start of each operation (insert or query), while the front of the queue is expired, it's popped out of the queue, and the counter is decreased.

            Note that each of the n timestamps is created and popped out once. Thus you have O(1) amortized time to access the current value, and O(n) memory to store the non-expired timestamps. The actual highest memory usage is also limited by the ratio of TTL / frequency of new timestamp insertions.
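
The queue-based approach from this answer can be sketched in Go as follows. The type `ExpiringCounter` and the helper `day` are hypothetical names; the current time is passed in explicitly so the example from the question can be replayed. Here the counter value is simply the queue length, which is available in O(1).

```go
package main

import (
	"fmt"
	"time"
)

// ExpiringCounter counts increments that are younger than ttl.
type ExpiringCounter struct {
	ttl    time.Duration
	stamps []time.Time // FIFO queue, oldest increment first
}

// NewExpiringCounter creates a counter whose increments expire after ttlDays.
func NewExpiringCounter(ttlDays int) *ExpiringCounter {
	return &ExpiringCounter{ttl: time.Duration(ttlDays) * 24 * time.Hour}
}

// evict pops expired timestamps from the front of the queue.
func (c *ExpiringCounter) evict(now time.Time) {
	i := 0
	for i < len(c.stamps) && now.Sub(c.stamps[i]) >= c.ttl {
		i++
	}
	c.stamps = c.stamps[i:]
}

// Increment records one increment at time now.
func (c *ExpiringCounter) Increment(now time.Time) {
	c.evict(now)
	c.stamps = append(c.stamps, now)
}

// Value returns the number of increments that have not yet expired.
func (c *ExpiringCounter) Value(now time.Time) int {
	c.evict(now)
	return len(c.stamps)
}

// day returns a timestamp n days after an arbitrary fixed start date.
func day(n int) time.Time {
	start := time.Date(2019, 1, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(n) * 24 * time.Hour)
}

func main() {
	c := NewExpiringCounter(30)
	c.Increment(day(1))
	c.Increment(day(2))
	// Replays the example from the question; prints "2 1 0".
	fmt.Println(c.Value(day(3)), c.Value(day(31)), c.Value(day(32)))
}
```

Each timestamp is appended once and evicted once, which gives the amortized O(1) time per operation mentioned above; memory stays O(n) in the number of unexpired increments.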

            Source https://stackoverflow.com/questions/42295046

            QUESTION

            Using QDigest over a date range
            Asked 2019-Jan-29 at 01:32

            I need to keep a 28 day history for some dashboard data. Essentially I have an event/action that is recorded through our BI system. I want to count the number of events and the distinct users who do that event for the past 1 day, 7 days and 28 days. I also use grouping sets (cube) to get the fully segmented data by country/browser/platform etc.

            The old way was to do this keeping a 28 day history per user, for all segments. So if a user accessed the site from mobile and desktop every day for all 28 days they would have 56 rows in the DB. This ends up being a large table and is time consuming even to calculate approx_distinct and not distinct. But the issue is that I also wish to calculate approx_percentiles.

            So I started investigating the use of HyperLogLog https://prestodb.io/docs/current/functions/hyperloglog.html
            This works great; it's much more efficient to store the sketches daily than the entire list of unique users per day. As I am using approx_distinct, the values are close enough and it works.

            I then noticed a similar function for medians. Qdigest. https://prestodb.io/docs/current/functions/qdigest.html Unfortunately the documentation is not nearly as good on this page as it is on previous pages, so it took me a while to figure it out. This works great for calculating daily medians. But it does not work if I want to calculate the median actions per user over the longer time period. The examples in HyperLogLog demonstrate how to calculate approx_distinct users over a time period but the Qdigest docs do not give such an example.

            When I try something similar to the HLL example for date ranges with Qdigest, I get results similar to the 1 day results.

            ...

            ANSWER

            Answered 2019-Jan-29 at 01:32

            Because you're in need of medians that are aggregated (summed) across multiple days on a per user basis, you'll need to perform that aggregation prior to insertion into the qdigest in order for this to work for 7- and 28-day per-user counts. In other words, the units of the data need to be consistent, and if daily values are being inserted into qdigest, you can't use that qdigest for 7- or 28-day per-user counts of the events.
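
The point about consistent units can be shown with a small Go sketch. It uses an exact median as a stand-in for the qdigest, and the `DailyCount` type and function names are hypothetical. Summing per user over the window first and then taking the median gives a per-user figure for the window, whereas taking the median of the daily values directly stays close to the 1-day result.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// DailyCount is one user's event count on one day.
type DailyCount struct {
	User   string
	Day    int
	Events int
}

// median returns the exact median of xs (NaN for an empty slice).
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return math.NaN()
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// medianOfUserTotals sums events per user over [from, to] first, then
// takes the median of those per-user totals.
func medianOfUserTotals(rows []DailyCount, from, to int) float64 {
	totals := map[string]float64{}
	for _, r := range rows {
		if r.Day >= from && r.Day <= to {
			totals[r.User] += float64(r.Events)
		}
	}
	vals := make([]float64, 0, len(totals))
	for _, v := range totals {
		vals = append(vals, v)
	}
	return median(vals)
}

// medianOfDailyValues takes the median over the raw daily rows, which is
// what merging daily digests effectively computes.
func medianOfDailyValues(rows []DailyCount, from, to int) float64 {
	var vals []float64
	for _, r := range rows {
		if r.Day >= from && r.Day <= to {
			vals = append(vals, float64(r.Events))
		}
	}
	return median(vals)
}

func main() {
	rows := []DailyCount{
		{"a", 1, 1}, {"a", 2, 1}, {"a", 3, 1},
		{"b", 1, 2},
		{"c", 1, 10}, {"c", 2, 10},
	}
	fmt.Println("median of per-user totals:", medianOfUserTotals(rows, 1, 7))
	fmt.Println("median of daily values:   ", medianOfDailyValues(rows, 1, 7))
}
```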

            Source https://stackoverflow.com/questions/54093246

            QUESTION

            Is there a heuristic algorithm for groupBy + count?
            Asked 2018-Dec-02 at 12:14

            I got a List of integers and I want to count the number of times each integer appears in the list.

            For example: [0,5,0,1,3,3,1,1,1] gives (0 -> 2), (1 -> 4), (3 -> 2), (5 -> 1). I only need the count, not the value (the goal is to have a histogram of the counts).

            A common approach would be to group by value then count the cardinality of each set. In SQL: SELECT count(*) FROM myTable GROUP BY theColumnContainingIntegers.
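
For reference, the exact group-and-count baseline described here can be sketched in Go with a map. The function names `countByValue` and `histogramOfCounts` are made up for this example; this is the exact approach, not a probabilistic one.

```go
package main

import "fmt"

// countByValue returns how many times each integer occurs in xs.
func countByValue(xs []int) map[int]int {
	counts := make(map[int]int)
	for _, x := range xs {
		counts[x]++
	}
	return counts
}

// histogramOfCounts maps each occurrence count to the number of distinct
// values that occur that many times.
func histogramOfCounts(counts map[int]int) map[int]int {
	hist := make(map[int]int)
	for _, c := range counts {
		hist[c]++
	}
	return hist
}

func main() {
	xs := []int{0, 5, 0, 1, 3, 3, 1, 1, 1}
	counts := countByValue(xs)
	fmt.Println("counts per value:   ", counts)
	fmt.Println("histogram of counts:", histogramOfCounts(counts))
}
```

This runs in a single pass over the data and needs memory proportional to the number of distinct values.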

            Is there a faster way to do this? A heuristic or a probabilistic approach is fine since I am processing a large data set and sacrificing precision for speed is fine.

            Something similar to HyperLogLog algorithm (used to count the number of distinct elements in a data set) would be great, but I did not find anything like this...

            ...

            ANSWER

            Answered 2018-Dec-02 at 12:14

            Let's take your set containing 9 elements [0,5,0,1,3,3,1,1,1] and make it big but with same frequencies of elements:

            Source https://stackoverflow.com/questions/53305059

            QUESTION

            (error) WRONGTYPE Key is not a valid HyperLogLog string value
            Asked 2018-Aug-15 at 17:42

            I am working through the HyperLogLog examples with redis-cli.

            The redis-cli examples show how you can use HyperLogLog commands to record and count unique user visits to a website.
            The command PFADD adds one or many strings to a HyperLogLog. PFADD returns 1 if the cardinality was changed and 0 if it remains the same:

            Nonetheless, it reports an error when I follow the instructions:

            ...

            ANSWER

            Answered 2018-Aug-15 at 17:42

            The error is, even if terse, quite informative - you are trying to use an existing key (i.e. visits:2015-01-01) that is not an HLL.

            The existing key is possibly a string of some kind, but if you DEL visits:2015-01-01, you should be able to PFADD to it.

            Source https://stackoverflow.com/questions/51857302

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install hyperloglog

            You can download it from GitHub.

            Support

            For any new features, suggestions, or bugs, create an issue on GitHub. If you have any questions, ask them on the community page on Stack Overflow.

            CLONE
          • HTTPS

            https://github.com/axiomhq/hyperloglog.git

          • CLI

            gh repo clone axiomhq/hyperloglog

          • sshUrl

            git@github.com:axiomhq/hyperloglog.git
