dedupe | python library for accurate and scalable fuzzy matching | Database library

by dedupeio | Language: Python | Version: 2.0.23 | License: MIT

kandi X-RAY | dedupe Summary

dedupe is a Python library typically used in database applications. It has no reported vulnerabilities, a build file, a permissive license, and medium support; however, it has 1 bug. You can install it with 'pip install dedupe' or download it from GitHub or PyPI.

dedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even in very large databases.
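
To make the idea of "finding similar records" concrete, here is a minimal standard-library sketch of pairwise fuzzy matching. It is not dedupe's API or algorithm: the helper names (`similarity`, `find_similar_pairs`), the sample records, and the fixed 0.85 threshold are all assumptions chosen for illustration. A trained library would learn its comparison rules from labeled examples instead of using one hard-coded string ratio.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_similar_pairs(records: dict, threshold: float = 0.85):
    """Yield (id_a, id_b, score) for each pair scoring at or above threshold.

    Every pair is compared, so the cost grows quadratically with the
    number of records; this is only practical for small inputs.
    """
    for (id_a, text_a), (id_b, text_b) in combinations(records.items(), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            yield id_a, id_b, score


# Hypothetical sample data: r1 and r2 describe the same entity.
records = {
    "r1": "Acme Corporation, 12 Main Street",
    "r2": "ACME Corporation, 12 Main St.",
    "r3": "Globex Inc, 400 Elm Avenue",
}

pairs = [(a, b) for a, b, _ in find_similar_pairs(records)]
print(pairs)
```

Comparing all pairs does not scale to very large databases, which is the gap that learned matching rules are meant to close.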

Support

dedupe has a medium active ecosystem.

It has 3714 stars and 510 forks. There are 120 watchers for this library.

It had no major release in the last 12 months.

There are 63 open issues and 728 have been closed. On average, issues are closed in 11 days. There are 10 open pull requests and 0 closed pull requests.

It has a neutral sentiment in the developer community.

The latest version of dedupe is 2.0.23

Quality

dedupe has 1 bug (0 blocker, 0 critical, 0 major, 1 minor) and 137 code smells.

Security

dedupe has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

dedupe code analysis shows 0 unresolved vulnerabilities.

There are 11 security hotspots that need review.

License

dedupe is licensed under the MIT License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

dedupe releases are not available. You will need to build from source code and install.

Deployable package is available in PyPI.

Build file is available. You can build the component from source.

Installation instructions are not available. Examples and code snippets are available.

dedupe saves you 2881 person hours of effort in developing the same functionality from scratch.

It has 6225 lines of code, 374 functions and 52 files.

It has medium code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed dedupe and identified the functions below as its top functions. The list is intended to give you quick insight into the functionality dedupe implements and to help you decide whether it suits your requirements.

Computes the relationship between two pairs
Computes the score of duplicate pairs
Find all links that have a given threshold
Gather links between scores
Label the training data model
Removes an item from the candidates list
Return the set of uncertain pairs
Removes unlabeled examples from the list
Learn the model
Run the matching
Return the canonical representation of a record
Given an iterable of primary variables return a list of interaction types
Generator for pair - matching pairs
Convert data to markdown table
Parse asv output
Run training
Score records from record_pairs
Perform a dedupe of training data
Performs search
Link training data to two datasets
Score duplicate records
Prepare training
Prepare training data
Unindex documents
Cluster a given dupes
Index data

dedupe Key Features

No Key Features are available at this moment for dedupe.

dedupe Examples and Code Snippets

s3-ocr,Avoiding processing duplicates,s3-ocr dedupe --help

Python

Lines of Code : 11

License : Permissive (Apache-2.0)

Usage: s3-ocr dedupe [OPTIONS] BUCKET

  Scan every file in the bucket checking for duplicates - files that have not
  yet been OCRd but that have the same contents (based on ETag) as a file that
  HAS been OCRd.

      s3-ocr dedupe name-of-bucket
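
The help text above describes matching files by ETag. As a rough sketch of that idea, and not s3-ocr's actual implementation, the function below takes an in-memory listing and reports which unprocessed files share an ETag with an already processed one. The function name `find_unprocessed_duplicates` and the `(key, etag, has_ocr)` tuple layout are hypothetical; a real tool would build the listing from the bucket itself.

```python
def find_unprocessed_duplicates(objects):
    """Map each unprocessed key to a processed key with the same ETag.

    `objects` is a list of (key, etag, has_ocr) tuples. It is read
    twice, so pass a list and not a one-shot iterator.
    """
    # First pass: remember one processed key per ETag.
    processed_by_etag = {}
    for key, etag, has_ocr in objects:
        if has_ocr:
            processed_by_etag.setdefault(etag, key)

    # Second pass: unprocessed files whose ETag was already seen.
    return {
        key: processed_by_etag[etag]
        for key, etag, has_ocr in objects
        if not has_ocr and etag in processed_by_etag
    }


# Hypothetical listing: b.pdf has the same contents as the processed a.pdf.
listing = [
    ("a.pdf", "etag-1", True),
    ("b.pdf", "etag-1", False),
    ("c.pdf", "etag-2", False),
]

duplicates = find_unprocessed_duplicates(listing)
print(duplicates)
```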

Dedupe Geocoder,Setup

Python

Lines of Code : 6

License : Permissive (MIT)

mkvirtualenv dedupe-geocoder
git clone https://github.com/datamade/dedupe-geocoder.git
cd dedupe-geocoder
pip install -r requirements.txt
cp geocoder/app_config.py.example geocoder/app_config.py

workon dedupe-geocoder

Dedupe Geocoder,Setup your database

Python

Lines of Code : 6

License : Permissive (MIT)

createdb geocoder

python loadAddresses.py --download --load_data 

 --download     Download fresh address data.
 --load_data    Load downloaded address data into database.
 --train        Add more training data and save settings file.
 --block

Determine the filename to add to the asset list.

python

Lines of Code : 37

License : Non-SPDX (Apache License 2.0)

def get_asset_filename_to_add(asset_filepath, asset_filename_map):
  """Get a unique basename to add to the SavedModel if this file is unseen.

  Assets come from users as full paths, and we save them out to the
  SavedModel as basenames. In some cas

Deduplicates weights.

python

Lines of Code : 10

License : Non-SPDX (Apache License 2.0)

def _dedup_weights(self, weights):
    """Dedupe weights while maintaining order as much as possible."""
    output, seen_ids = [], set()
    for w in weights:
      if id(w) not in seen_ids:
        output.append(w)
        # Track the Variable's id
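
The excerpt above is cut off. A self-contained sketch of the same order-preserving, identity-based approach is shown below as a free function. The name `dedup_by_identity` and the sample values are hypothetical, and the body is an illustration of the technique, not a reconstruction of the truncated source.

```python
def dedup_by_identity(items):
    """Drop repeated objects while keeping first-seen order.

    Objects are compared by id(), not by ==, so two equal but distinct
    objects are both kept. This avoids calling __eq__ on types where
    equality is expensive or does not return a plain bool.
    """
    output, seen_ids = [], set()
    for item in items:
        if id(item) not in seen_ids:
            output.append(item)
            seen_ids.add(id(item))
    return output


a, b = [1], [1]  # equal contents, but two distinct objects
result = dedup_by_identity([a, b, a])
print(len(result))
```

Tracking ids in a set keeps each lookup constant-time, while appending to a list preserves the original ordering.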

Eliminating duplicates from a list without a second list/module

Python

Lines of Code : 63