damerau-levenshtein | Implementation of the Damerau-Levenshtein algorithm

 by mitallast | Language: C | Version: Current | License: MIT

kandi X-RAY | damerau-levenshtein Summary


damerau-levenshtein is a C library. It has no reported bugs or vulnerabilities, a permissive license, and low support. You can download it from GitHub.

damerau_levenshtein – calculate the Damerau-Levenshtein distance between two strings. The Damerau-Levenshtein distance is defined as the minimal number of characters you have to replace, insert, delete or transpose to transform str1 into str2. The complexity of the algorithm is O(m*n), where n and m are the lengths of str1 and str2 (rather good compared to similar_text(), which is O(max(n,m)**3), but still expensive). In its simplest form the function takes only the two strings as parameters and calculates just the number of insert, replace, delete and transpose operations needed to transform str1 into str2. A second variant takes four additional parameters that define the cost of insert, replace, delete and transpose operations. This is more general and adaptive than the first variant, but not as efficient.
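The recurrence behind the distance can be sketched in a few lines. The following Python version of the restricted (optimal string alignment) variant is illustrative only and does not show the library's actual C API:

```python
def damerau_levenshtein(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    n1, n2 = len(s1), len(s2)
    # d[i][j] = distance between the prefixes s1[:i] and s2[:j]
    d = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(n1 + 1):
        d[i][0] = i
    for j in range(n2 + 1):
        d[0][j] = j
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transpose
    return d[n1][n2]

print(damerau_levenshtein("FISH", "IFSH"))  # 1: a single transposition
```

The cost-parameterized variant mentioned above would simply replace the four `+ 1`/`+ cost` terms with the caller-supplied operation costs.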
Support
    Quality
      Security
        License
          Reuse

            kandi-support Support

              damerau-levenshtein has a low active ecosystem.
              It has 9 star(s) with 2 fork(s). There is 1 watcher for this library.
              It had no major release in the last 6 months.
              There are 0 open issues and 1 has been closed. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of damerau-levenshtein is current.

            kandi-Quality Quality

              damerau-levenshtein has no bugs reported.

            kandi-Security Security

              damerau-levenshtein has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

            kandi-License License

              damerau-levenshtein is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            kandi-Reuse Reuse

              damerau-levenshtein releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.


            damerau-levenshtein Key Features

            No Key Features are available at this moment for damerau-levenshtein.

            damerau-levenshtein Examples and Code Snippets

            No Code Snippets are available at this moment for damerau-levenshtein.

            Community Discussions

            QUESTION

            Best similarity distance metric for two strings
            Asked 2019-Nov-10 at 02:14

            I have a bunch of company names to match, for example, I want to match this string: A&A PRECISION

            with A&A PRECISION ENGINEERING

            However, almost every similarity measure I use, like Hamming distance, Levenshtein distance, restricted Damerau-Levenshtein distance, full Damerau-Levenshtein distance, longest common substring distance, q-gram distance, cosine distance, Jaccard distance, Jaro, and Jaro-Winkler distance,

            matches: B&B PRECISION instead.

            Any idea which metric would give more emphasis to the preciseness of the matched substrings and their sequence, and care less about the length of the strings? I think it is because of the string lengths that the metrics always choose wrongly.

            ...

            ANSWER

            Answered 2019-Nov-10 at 02:14

            If you really want to "...give more emphasis to the preciseness of the substrings and its sequence...", then this function could work, as it tests whether a string is a substring of another one:
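The actual function is behind the link below; a minimal sketch of the idea (substring containment first, with length difference as an assumed tie-breaker) could look like:

```python
def best_match(query, candidates):
    # Prefer candidates where one string contains the other as a substring;
    # among those, pick the one with the smallest length difference.
    q = query.upper()
    contained = [c for c in candidates if q in c.upper() or c.upper() in q]
    pool = contained or candidates  # fall back to all candidates
    return min(pool, key=lambda c: abs(len(c) - len(query)))

print(best_match("A&A PRECISION",
                 ["B&B PRECISION", "A&A PRECISION ENGINEERING"]))
# A&A PRECISION ENGINEERING
```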

            Source https://stackoverflow.com/questions/58781572

            QUESTION

            How compute lucene FuzzyQuery on top GraphDB lucene index?
            Asked 2019-Jun-22 at 06:07

            GraphDB supports the FTS Lucene plugin to build an RDF 'molecule' to index texts efficiently. However, when there is a typo (misspelling) in the word you are searching, Lucene will not retrieve a result. I wonder if it is possible to implement a FuzzyQuery based on the Damerau-Levenshtein algorithm on top of the Lucene index in GraphDB for FTS. That way, even if the word is not correctly spelled, you can get a list of 'closer' words based on edit-distance similarity.

            This is the index I have created for indexing labels of NounSynset in WordNet RDF.

            ...

            ANSWER

            Answered 2019-Jun-22 at 06:07

            If you use the ~ it should give you a fuzzy match.

            Source https://stackoverflow.com/questions/56199048

            QUESTION

            String similarity with Python + Sqlite (Levenshtein distance / edit distance)
            Asked 2018-Oct-23 at 19:15

            Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module?

            Example of use case:

            ...

            ANSWER

            Answered 2018-Oct-23 at 19:15

            Here is a ready-to-use example test.py:

            Source https://stackoverflow.com/questions/49779281
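The ready-to-use test.py itself is only linked above. An independent sketch of the usual approach, registering a Python edit-distance function as a SQLite user-defined function via sqlite3.create_function, might look like this (the table and data are made up for illustration):

```python
import sqlite3

def levenshtein(s1, s2):
    # plain Levenshtein distance, two-row dynamic programming
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (c1 != c2)))   # replace
        prev = cur
    return prev[-1]

conn = sqlite3.connect(":memory:")
conn.create_function("levenshtein", 2, levenshtein)  # expose as a SQL function
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)",
                 [("apple",), ("banana",), ("appel",)])
rows = conn.execute(
    "SELECT w FROM words ORDER BY levenshtein(w, 'aple') LIMIT 1").fetchall()
print(rows)  # [('apple',)]
```

Once registered, the function can be used anywhere in SQL, including WHERE clauses and ORDER BY, as shown.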

            QUESTION

            Modify Damerau-Levenshtein algorithm to track transformations (insertions, deletions, etc)
            Asked 2018-Jun-22 at 18:19

            I'm wondering how to modify the Damerau-Levenshtein algorithm to track the specific character transformations required to change a source string to a target string. This question has been answered for the Levenshtein distance, but I couldn't find any answers for DL distance.

            I looked at the py-Levenshtein module: it provides exactly what I need, but for Levenshtein distance:

            ...

            ANSWER

            Answered 2017-Jun-20 at 15:01
            import numpy as np
            
            def levenshtein_distance(string1, string2):
                n1 = len(string1)
                n2 = len(string2)
                return _levenshtein_distance_matrix(string1, string2)[n1, n2]
            
            def damerau_levenshtein_distance(string1, string2):
                n1 = len(string1)
                n2 = len(string2)
                return _levenshtein_distance_matrix(string1, string2, True)[n1, n2]
            
            def get_ops(string1, string2, is_damerau=False):
                # build the matrix locally (the original relied on a global dist_matrix)
                dist_matrix = _levenshtein_distance_matrix(string1, string2, is_damerau)
                i, j = dist_matrix.shape
                i -= 1
                j -= 1
                ops = list()
                while i != -1 and j != -1:
                    if is_damerau:
                        if i > 1 and j > 1 and string1[i-1] == string2[j-2] and string1[i-2] == string2[j-1]:
                            if dist_matrix[i-2, j-2] < dist_matrix[i, j]:
                                ops.insert(0, ('transpose', i - 1, i - 2))
                                i -= 2
                                j -= 2
                                continue
                    index = np.argmin([dist_matrix[i-1, j-1], dist_matrix[i, j-1], dist_matrix[i-1, j]])
                    if index == 0:
                        if dist_matrix[i, j] > dist_matrix[i-1, j-1]:
                            ops.insert(0, ('replace', i - 1, j - 1))
                        i -= 1
                        j -= 1
                    elif index == 1:
                        ops.insert(0, ('insert', i - 1, j - 1))
                        j -= 1
                    elif index == 2:
                        ops.insert(0, ('delete', i - 1, i - 1))
                        i -= 1
                return ops
            
            def execute_ops(ops, string1, string2):
                strings = [string1]
                string = list(string1)
                shift = 0
                for op in ops:
                    i, j = op[1], op[2]
                    if op[0] == 'delete':
                        del string[i + shift]
                        shift -= 1
                    elif op[0] == 'insert':
                        string.insert(i + shift + 1, string2[j])
                        shift += 1
                    elif op[0] == 'replace':
                        string[i + shift] = string2[j]
                    elif op[0] == 'transpose':
                        string[i + shift], string[j + shift] = string[j + shift], string[i + shift]
                    strings.append(''.join(string))
                return strings
            
            def _levenshtein_distance_matrix(string1, string2, is_damerau=False):
                n1 = len(string1)
                n2 = len(string2)
                d = np.zeros((n1 + 1, n2 + 1), dtype=int)
                for i in range(n1 + 1):
                    d[i, 0] = i
                for j in range(n2 + 1):
                    d[0, j] = j
                for i in range(n1):
                    for j in range(n2):
                        if string1[i] == string2[j]:
                            cost = 0
                        else:
                            cost = 1
                        d[i+1, j+1] = min(d[i, j+1] + 1, # insert
                                          d[i+1, j] + 1, # delete
                                          d[i, j] + cost) # replace
                        if is_damerau:
                            if i > 0 and j > 0 and string1[i] == string2[j-1] and string1[i-1] == string2[j]:
                                d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + cost) # transpose
                return d
            
            if __name__ == "__main__":
                # GIFTS PROFIT
                # FBBDE BCDASD
                # SPARTAN PART
                # PLASMA ALTRUISM
                # REPUBLICAN DEMOCRAT
                # PLASMA PLASMA
                # FISH IFSH
                # STAES STATES
                string1 = 'FISH'
                string2 = 'IFSH'
                for is_damerau in [True, False]:
                    if is_damerau:
                        print('=== damerau_levenshtein_distance ===')
                    else:
                        print('=== levenshtein_distance ===')
                    dist_matrix = _levenshtein_distance_matrix(string1, string2, is_damerau=is_damerau)
                    print(dist_matrix)
                    ops = get_ops(string1, string2, is_damerau=is_damerau)
                    print(ops)
                    res = execute_ops(ops, string1, string2)
                    print(res)
            

            Source https://stackoverflow.com/questions/44640570

            QUESTION

            Javascript Version of (MDLD) Modified Damerau-Levenshtein Distance Algorithm
            Asked 2018-Apr-18 at 04:24

            I was looking to test the performance of MDLD for some in-browser string comparisons to be integrated into a web app. The use case involves comparing strings like "300mm, Packed Wall" and "Packed Wall - 300mm", so I was looking for fuzzy string matching that has some tolerance for punctuation and typos, as well as allowing block character transpositions.

            I wasn't able to find an implementation online for Javascript. I found a version written for PL/SQL available at CSIRO's Taxamatch Wiki.

            This was my attempt at converting the code into JS; the results for the basic function seem fairly accurate, but the block transposition calculation doesn't give the expected results. E.g. "Hi There" vs "There Hi" returns "6", regardless of what the block limit is set to.

            If anyone knows of a working implementation, could you point me to it? Alternatively, what's the problem with my adaptation, or the source code itself? The only major change I made was to use "Math.ceil()" in two instances where the source appeared to use integer division, which would always take the floor; that was causing odd issues for inputs that would result in 1-character strings, but didn't seem to affect the behaviour of other cases I'd tested.

            ...

            ANSWER

            Answered 2018-Apr-18 at 04:24

            In the end, I couldn't figure out what the issue was with my adaptation of the code from CSIRO. Found a github repo that implemented the function in C with Ruby extensions, https://github.com/GlobalNamesArchitecture/damerau-levenshtein.

            Adapted that to get a functional implementation. Seems to work fine, but not great for my use case. MDLD can swap blocks of text, but only in circumstances where multiple consecutive swaps aren't needed to construct the source string. Going to look at N-Grams instead.

            For those who are interested, this was my final result. Performance-wise, with a block limit of 5, it compared about 1000, 20-40 character strings in about 5 seconds.

            Source https://stackoverflow.com/questions/49871367
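As a rough illustration of the n-gram direction mentioned at the end (not the poster's MDLD port, which is only linked above), a character-bigram Dice similarity is tolerant of reordered blocks like the strings in the question:

```python
def bigrams(s):
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_similarity(s1, s2):
    # Sorensen-Dice coefficient over sets of character bigrams
    b1, b2 = bigrams(s1), bigrams(s2)
    if not b1 or not b2:
        return 0.0
    return 2 * len(b1 & b2) / (len(b1) + len(b2))

print(dice_similarity("300mm, Packed Wall", "Packed Wall - 300mm"))  # 0.8
```

Because the bigram sets barely change when whole blocks move, the score stays high even though the edit distance between the two strings is large.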

            QUESTION

            Damerau-Levenshtein algorithm isn't working on short strings
            Asked 2018-Jan-09 at 21:31

            I have a for loop that takes a user's input and one of the keys in my dictionary and passes them to a Damerau-Levenshtein function and based on the distance, overwrites the user's input with the dictionary key (The for loop is to cycle through each dictionary key). This works fine enough for strings larger than three characters, but if the string is three or fewer characters the algorithm returns with the wrong key. Here's the for loop:

            ...

            ANSWER

            Answered 2018-Jan-09 at 21:31

            I figured it out. After much searching I found a post saying that a common edit-distance threshold is 2 (they didn't explain why 2 is common).

            I switched my if statement from 4 to 2, and now all of the problem terms are being corrected as they should be.

            Source https://stackoverflow.com/questions/48174624
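The correction loop described above might be sketched like this, with a threshold of 2; the vocabulary and the distance implementation here are illustrative, since the poster's code is not shown:

```python
def osa_distance(a, b):
    # restricted Damerau-Levenshtein (optimal string alignment) distance
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[-1][-1]

def correct(word, vocabulary, max_distance=2):
    # keep the user's word unless some dictionary key is within the threshold
    distance, best = min((osa_distance(word, v), v) for v in vocabulary)
    return best if distance <= max_distance else word

print(correct("teh", ["the", "quick", "brown"]))  # the
```

With a threshold of 4, even a three-letter word can be rewritten into an unrelated key, which matches the behaviour the poster was seeing on short strings.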

            QUESTION

            Record linkage using String similarity Techniques
            Asked 2017-Jul-20 at 06:27

            We are working on a record linkage project. We are observing strange behavior from all of the standard techniques like Jaro-Winkler, Levenshtein, N-Gram, Damerau-Levenshtein, Jaccard index, and Sorensen-Dice.

            Say, String 1= MINI GRINDER KIT
            String 2= Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders
            String 3= Milwaukee Video Borescope, Rotating Inspection Scope, Series: M-SPECTOR 360, 2.7 in 640 x 480 pixels High-Resolution LCD, Plastic, Black/Red

            In the above case, string 1 and string 2 are related; the scores from all the methods are shown below.
            Jaro Winkler -> 0.391666651
            Levenshtein -> 75
            N-Gram, -> 0.9375
            Damerau -> 75
            Jaccard index -> 0
            Sorensen-Dice -> 0
            Cosine -> 0

            But string 1 and string 3 are not at all related, yet the distance methods give very high scores.
            Jaro Winkler -> 0.435714275
            Levenshtein -> 133
            N-Gram, -> 0.953571439
            Damerau -> 133
            Jaccard index -> 1
            Sorensen-Dice -> 0
            Cosine -> 0

            Any thoughts .?

            ...

            ANSWER

            Answered 2017-Mar-07 at 11:35

            All distance calculations are case sensitive, so bring all strings to the same case first. Then the scores come out appropriately.

            Source https://stackoverflow.com/questions/41859924
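A quick illustration of why case normalization matters, using a token-level Jaccard index (the metric choice here is just for demonstration):

```python
def jaccard_tokens(s1, s2):
    # Jaccard index over whitespace-separated tokens
    t1, t2 = set(s1.split()), set(s2.split())
    return len(t1 & t2) / len(t1 | t2)

a = "MINI GRINDER KIT"
b = "Weiler 13001 Mini Grinder Accessory Kit"
print(jaccard_tokens(a, b))                  # 0.0: no token matches by case
print(jaccard_tokens(a.lower(), b.lower()))  # 0.5: three shared tokens
```

This also explains the Jaccard score of 0 reported in the question for the related pair: without lowercasing, "MINI" and "Mini" are different tokens.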

            QUESTION

            R: clustering with a similarity or dissimilarity matrix? And visualizing the results
            Asked 2017-Jul-13 at 12:08

            I have a similarity matrix that I created using Harry, a tool for string similarity, and I wanted to plot some dendrograms from it to see if I could find clusters/groups in the data. I'm using the following similarity measures:

            • Normalized compression distance (NCD)
            • Damerau-Levenshtein distance
            • Jaro-Winkler distance
            • Levenshtein distance
            • Optimal string alignment distance (OSA)

            ("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")

            At first, since it was my first time using R, I didn't pay too much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know, since my similarity matrix is normalized to [0,1], that I could just do dissimilarity = 1 - similarity and then use hclust.

            But the groups that I get using hclust with a similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.

            I tried the proxy package as well, and the same problem happens: the groups I get aren't what I expected.

            To get the dendrograms using the similarity function I do:

            1. plot(hclust(as.dist(similarityMATRIX), "average"))

            With the dissimilarity matrix I tried:

            2. plot(hclust(as.dist(dissimilarityMATRIX), "average"))

            and

            3. plot(hclust(as.sim(dissimilarityMATRIX), "average"))

            From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups that I can get out of it aren't as good as the ones I get from (1).

            I'm saying that the groups are bad/good because at the moment I have a somewhat little volume of data to analyse, and so I can check them very easily.

            Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?

            ...

            ANSWER

            Answered 2017-Jul-13 at 12:08

            You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package). You can check if a dendrogram fits by using the dendextend R package function cor_cophenetic (use the most recent version from github).

            Clustering which is based on distance can be done using hclust, but also using cluster::pam (k-medoids).

            Source https://stackoverflow.com/questions/45061440

            QUESTION

            Identify strings with same meaning in java
            Asked 2017-Apr-26 at 14:14

            I have the following problem. I want to identify strings in java that have a similar meaning. I tried to calculate similarities between strings with Stringmetrics. This works as expected but I need something more convenient.

            For example when I have the following 2 strings (1 word):

            ...

            ANSWER

            Answered 2017-Apr-26 at 14:14

            Levenshtein distance (edit distance) is like the auto-correct in your phone. Taking your example we have apple vs appel. The words are kinda close to each other if you consider adding/removing/replacing a single letter, all we need to do here is swap e and l (actually replace e with l and l with e). If you had other words like applr or appee - these are closer to the original word apple because all you need to do is replace a single letter.

            Cosine similarity is completely different: it counts the words, makes a vector of those counts and checks how similar the counts are. Here you have 2 completely different words, so it returns 0.
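A minimal sketch of that word-count cosine computation, showing why two distinct single words score 0:

```python
from collections import Counter
from math import sqrt

def cosine_word_similarity(s1, s2):
    # cosine similarity between word-count vectors
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(cosine_word_similarity("apple", "appel"))  # 0.0: no shared words
```

The word-level vectors for "apple" and "appel" have no common dimensions, so the dot product, and therefore the similarity, is 0 regardless of how close the spellings are.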

            What you want is: combo of those 2 techniques + computer having language knowledge + another dictionary for synonyms that are somehow taken into consideration before and after using those similarity algorithms. Imagine if you had a sentence and then you would replace every single word with synonym (who remembers Joey and Thesaurus?). Sentences could be completely different. Plus every word can have multiple synonyms, and some of those synonyms can be used only in a specific context. Your task is simply impossible as of now, maybe in the future.

            P.S. If your task was possible I think that translating software would be basically perfect, but I'm not really sure about that.

            Source https://stackoverflow.com/questions/43635719

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install damerau-levenshtein

            You can download it from GitHub.

            Support

            For any new features, suggestions and bugs, create an issue on GitHub. If you have any questions, check and ask on the Stack Overflow community page.
            CLONE
          • HTTPS

            https://github.com/mitallast/damerau-levenshtein.git

          • CLI

            gh repo clone mitallast/damerau-levenshtein

          • sshUrl

            git@github.com:mitallast/damerau-levenshtein.git
