datasketch | LSH Forest, Weighted MinHash | Hashing library
kandi X-RAY | datasketch Summary
MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble
Top functions reviewed by kandi - BETA
- Evaluate results
- Compute the similarity between each query
- Compute the distance between the ground truth values
- Compute recall
- Bootstrap sets from a set of sets
- Updates the hash function
- Merges two min hashes together
- Benchmark LSHensemble
- Insert entries into the set
- Index a set of entries
- Return True if all indexes are empty
- Query the sum of keys in the pool
- Add a key to the pool
- Perform jaccard search
- Adds a key to the hash
- Compute the optimal probability for a given threshold
- Compute the nearest neighbour between two sets
- Query the hash values of all hashes
- Save results to database
- Create storages
- Generate minhashes for a set of permutations
- Perform LSH forest search
- Queries the LSH algorithm
- Reads a set of sets from a file
- Search an HNSW index
- Plot results with matplotlib
- Create a list of objects from b
- Calculate the minimum hash value for an input vector
datasketch Key Features
datasketch Examples and Code Snippets
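A minimal sketch of typical datasketch usage: estimating Jaccard similarity with MinHash and indexing the sketches in a MinHash LSH index. The document contents, keys, and threshold below are illustrative.

from datasketch import MinHash, MinHashLSH

data1 = ["minhash", "is", "a", "probabilistic", "data", "structure"]
data2 = ["minhash", "is", "a", "probability", "data", "structure"]

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in data1:
    m1.update(d.encode("utf8"))
for d in data2:
    m2.update(d.encode("utf8"))
print("Estimated Jaccard:", m1.jaccard(m2))

# Index the sketches and query for candidates above the similarity threshold.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("doc1", m1)
print(lsh.query(m2))  # e.g. ['doc1']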
Community Discussions
Trending Discussions on datasketch
QUESTION
I set up an Apache Druid installation that ingests data from a Kafka topic. It works very smoothly and efficiently.
I'm currently trying to implement some queries, and I'm stuck on counting the rows (grouped by some fields) for which a column value is an outlier. In the normal SQL world, I would essentially compute the first and third quartiles (q1 and q3) and then use something like (I'm interested only in "right" outliers):
SUM(IF(column_value > q3 + 1.5*(q3-q1), 1, 0))
This approach makes use of CTEs and joins: I compute the quartiles in a CTE with grouping and then join it with the original table.
I was able to easily compute the quartiles and the outlier threshold with the datasketch extension using a groupBy query, but I don't see how to write a postAggregation that performs the count.
In theory, I could run a second query using the thresholds obtained in the first. Unfortunately, there can be hundreds of thousands of distinct values, which makes this approach infeasible.
Do you have any suggestions on how to tackle this problem?
...ANSWER
Answered 2020-Apr-26 at 08:43
As of version 0.18.0, Apache Druid supports joins. This solves the problem.
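For reference, the right-outlier count the question describes corresponds to the following check, illustrated here with numpy rather than Druid; the sample values are made up.

import numpy as np

values = np.array([1.0, 2.0, 2.5, 3.0, 100.0])
q1, q3 = np.percentile(values, [25, 75])
# Count values above the upper fence q3 + 1.5 * IQR ("right" outliers only).
right_outliers = int(np.sum(values > q3 + 1.5 * (q3 - q1)))
print(right_outliers)  # 1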
QUESTION
I installed Druid following the link attached here.
The following configuration has been added to the common.runtime.properties file.
...ANSWER
Answered 2019-Dec-17 at 12:48
You use basic authentication. You should just be able to send your query to Druid with a URL like this:
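As an illustration only (the answer's original example URL is not shown here), a query against the Druid SQL endpoint with HTTP basic authentication might look like the sketch below; the host, port, credentials, and datasource are placeholders.

import requests

resp = requests.post(
    "http://druid-router:8888/druid/v2/sql",            # Druid SQL endpoint on the router
    json={"query": "SELECT COUNT(*) FROM my_datasource"},
    auth=("druid_user", "druid_password"),               # basic-auth credentials
)
print(resp.json())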
QUESTION
I have set up a single-server (micro-quickstart) deployment of Druid on an on-prem machine. I want to use HDFS as Druid's deep storage. I have used the Druid docs, the [druid-hdfs-storage] "fully qualified deep storage path throws exceptions" issue, and the imply-druid docs as references.
I have made the following changes in /apache-druid-0.16.0-incubating/conf/druid/single-server/micro-quickstart/_common/common.runtime.properties
...ANSWER
Answered 2019-Dec-12 at 11:09
I resolved the issue by manually changing the hdp.version in mapred-site.xml. I was getting the following exception in middleManager.log:
java.lang.IllegalArgumentException: Unable to parse '/hdp/apps/${hdp.version}/mapreduce/mapreduce.tar.gz#mr-framework' as a URI, check the setting for mapreduce.application.framework.path
But the segment metadata still shows "Request failed with status code 404".
QUESTION
I have a dataframe similar to:
...ANSWER
Answered 2019-May-23 at 14:29
RDDs to the rescue.
QUESTION
I want to use MinHash LSH to bin a large number of documents into buckets of similar documents (Jaccard similarity).
The question: Is it possible to compute the bucket of a MinHash without knowing about the MinHash of the other documents?
As far as I understand, LSH "just" computes a hash of the MinHashes, so it should be possible?
One implementation I find quite promising is datasketch. I can query the LSH for documents similar to a given one after knowing the MinHash of all documents. However, I see no way to get the bucket of a single document before knowing the other ones. https://ekzhu.github.io/datasketch/index.html
...ANSWER
Answered 2019-Jul-09 at 02:32
LSH doesn't bucket entire documents, nor does it bucket individual minhashes. Rather, it buckets 'bands' of minhashes.
LSH is a means of both reducing the number of hashes stored per document, and reducing the number of hits found when using these hashes to search for similar documents. It achieves this by combining multiple minhashes together into a single hash. So, for example, instead of storing 200 minhashes per document, you might combine them in bands of four to yield 50 locality sensitive hashes.
The hash for each band is calculated from its constituent minhashes using a cheap hash function such as FNV-1a. This loses some information, which is why LSH is said to reduce the dimensionality of the data. The resulting hash is the bucket.
So the bucket for each band of minhashes within a document is calculated without requiring knowledge of any other bands or any other documents.
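A minimal sketch of that banding and bucketing step. This is not datasketch's internal API; the band size, hash function, and bucket dictionary are illustrative.

import hashlib
from collections import defaultdict

def band_hashes(minhash_values, band_size=4):
    # Split the MinHash signature into bands and hash each band; the band
    # index i is mixed in so equal bands at different positions stay distinct.
    assert len(minhash_values) % band_size == 0
    hashes = []
    for i in range(0, len(minhash_values), band_size):
        band = tuple(minhash_values[i:i + band_size])
        hashes.append(hashlib.sha1(repr((i, band)).encode("utf8")).hexdigest())
    return hashes

buckets = defaultdict(set)  # LSH hash -> ids of documents whose band hashed to it

def index_document(doc_id, minhash_values):
    for h in band_hashes(minhash_values):
        buckets[h].add(doc_id)

def query_candidates(minhash_values):
    # Candidates are documents sharing at least one band hash with the query.
    return set().union(*(buckets.get(h, set()) for h in band_hashes(minhash_values)))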
Using LSH hashes to find similar documents is simple: let's say you want to find documents similar to document A. First generate the (e.g.) 50 LSH hashes for document A. Then look in your hash dictionary for all other documents that share one or more of these hashes. The more hashes they share, the higher their estimated Jaccard similarity (though this is not a linear relationship, as it is when using plain minhashes).
The fewer total hashes stored per document, the greater the error in the estimated Jaccard similarity, and the greater the chance of missing similar documents.
Here's a good explanation of LSH.
QUESTION
Following https://calcite.apache.org/docs/tutorial.html, I ran Apache Calcite using SqlLine. I tried activating tracing as instructed in https://calcite.apache.org/docs/howto.html#tracing. However, I don't get any logging. Here is the content of my session (hopefully containing all relevant information):
...ANSWER
Answered 2019-Jun-18 at 08:29
I have the impression that the problem lies in the underlying implementation of the logger.
I am not an expert on logging configurations, but I think specifying the properties file through -Djava.util.logging.config.file does not have any effect, since the logger that is used (according to the classpath you provided) is the Log4j implementation (slf4j-log4j12-1.7.25.jar) and not the JDK one (https://mvnrepository.com/artifact/org.slf4j/slf4j-jdk14/1.7.26).
I think that the right property to use for the Log4j implementation is the following:
-Dlog4j.configuration=file:C:\Users\user0\workspaces\apache-projects\apache-calcite\core\src\test\resources\log4j.properties
QUESTION
I am getting the error "Failed to submit supervisor: Request failed with status code 502" when I am trying to submit an ingestion spec to the druid UI (through the router). The ingestion spec works in a standalone druid server.
I have set up the cluster using 4 machines: 1 for the coordinator and overlord (master), 1 for historical and middle manager (data), 1 for broker (query), and 1 for router, with a separate instance for ZooKeeper. There is no error in the logs.
The ingestion spec is as follows:
...ANSWER
Answered 2019-May-22 at 11:36
It happened because the druid-kafka-indexing-service extension was missing from the extension list of common.runtime.properties.
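For reference, the extension is enabled through the extension load list in common.runtime.properties; the druid-datasketches entry below is only an example of another extension that may already be listed.

druid.extensions.loadList=["druid-datasketches", "druid-kafka-indexing-service"]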
QUESTION
I am working on a duplicate-document detection problem using the LSH algorithm. To handle large-scale data, we are using Spark.
I have around 300K documents with at least 100-200 words per document. On the Spark cluster, these are the steps we are performing on the data frame.
- Run Spark ML pipeline for converting text into tokens.
ANSWER
Answered 2019-Jan-16 at 08:39
I don't think it is possible to save Python objects in DataFrames, but you can circumvent this in a couple of ways:
- Store the result instead of the object (not sure how MinHash works, but if the value is numerical/string, it should be easy to extract it from the class object).
- If that is not feasible because you still need some properties of the object, you might want to serialize it using pickle, saving the serialized result as an encoded string. This forces you to de-serialize every time you want to use the object.
final_df_limit.rdd.map(lambda x: base64.encodebytes(pickle.dumps(CalculateMinHash(x)))).toDF()
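A sketch of the round trip this implies: the encoded string stored in the DataFrame has to be decoded and unpickled before the MinHash object can be reused. The sketch uses a datasketch MinHash directly; the token and num_perm value are illustrative.

from datasketch import MinHash
import base64
import pickle

m = MinHash(num_perm=128)
m.update(b"token")
encoded = base64.encodebytes(pickle.dumps(m)).decode("ascii")         # string to store in the DataFrame
restored = pickle.loads(base64.decodebytes(encoded.encode("ascii")))  # de-serialize when needed again
print(restored.jaccard(m))  # 1.0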
An alternative might be to use the Spark MinHash implementation instead, but that might not suit all your requirements.
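A minimal sketch of that Spark ML alternative, assuming a DataFrame df that already has a sparse-vector column named "features" (for example from CountVectorizer or HashingTF); the number of hash tables and distance threshold are illustrative.

from pyspark.ml.feature import MinHashLSH

mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5)
model = mh.fit(df)
# Self-join to find candidate pairs within Jaccard distance 0.6.
pairs = model.approxSimilarityJoin(df, df, 0.6, distCol="jaccard_distance")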
QUESTION
I'm running a Python script on a Sun Grid Engine supercompute cluster that reads in a list of file ids, sends each to a worker process for analysis, and writes one output per input file to disk.
The trouble is I'm getting IOError(110, 'Connection timed out') somewhere inside the worker function, and I'm not sure why. I've received this error in the past when making network requests that were severely delayed, but in this case the worker is only trying to read data from disk.
My question is: what would cause a Connection timed out error when reading from disk, and how can one resolve it? Any help others can offer would be greatly appreciated.
Full script (the IOError crops up in minhash_text()):
ANSWER
Answered 2018-May-22 at 02:00
It turned out I was hammering the filesystem too hard, making too many concurrent read requests for files on the same server. That server could only allow a fixed number of reads in a given period, so any requests over that limit received a Connection Timed Out response.
The solution was to wrap each file read request in a while loop. Inside that loop, try to read the appropriate file from disk; if the Connection timed out error occurs, sleep for a second and try again. The loop is only broken once the file has been read successfully.
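A minimal sketch of that retry loop; the function name and one-second delay are illustrative.

import errno
import time

def read_with_retry(path):
    while True:
        try:
            with open(path, "rb") as f:
                return f.read()
        except IOError as e:
            # Retry only on "Connection timed out" (errno 110 / ETIMEDOUT);
            # let every other IOError propagate.
            if e.errno != errno.ETIMEDOUT:
                raise
            time.sleep(1)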
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install datasketch
You can use datasketch like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.
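A typical installation from PyPI, inside a virtual environment as recommended above:

python -m venv venv
source venv/bin/activate
pip install datasketch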