damerau-levenshtein | Calculates edit distance using the Damerau-Levenshtein algorithm
kandi X-RAY | damerau-levenshtein Summary
The damerau-levenshtein gem finds the edit distance between two UTF-8 or ASCII encoded strings with O(N*M) efficiency. The gem implements the pure Levenshtein algorithm and the Damerau modification of it (where a transposition of 2 adjacent characters counts as 1 edit). It also includes the Boehmer & Rees 2008 modification of the Damerau algorithm, where transposition of blocks larger than 1 character is taken into account as well (Rees 2014). It also returns a diff between two strings according to the Levenshtein algorithm. The diff is expressed by the tags <ins>, <del>, and <subst>. Such tags make it possible to highlight the difference between strings in a flexible way.
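For a sense of what the Damerau modification changes, here is a minimal Python sketch (the restricted, optimal-string-alignment form; illustrative only, not the gem's Ruby API):

import numpy  # not required; this sketch is plain Python

def edit_distance(a, b, damerau=True):
    # Dynamic-programming table: d[i][j] = distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if damerau and i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[-1][-1]

print(edit_distance("ab", "ba", damerau=False))  # 2: pure Levenshtein needs two substitutions
print(edit_distance("ab", "ba"))                 # 1: a single transposition under Damerau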
Top functions reviewed by kandi - BETA
- Constructor for a formatter
- Returns the backtrace
- Shortcut method to display text
- Computes the diff between two strings
- Iterates through the matrix and returns the matrix array
- Searches the previous row
- Sets the format
- Returns a list of cells in the table
- Prints formatted output
- Removes the matrix by index
damerau-levenshtein Key Features
damerau-levenshtein Examples and Code Snippets
Community Discussions
Trending Discussions on damerau-levenshtein
QUESTION
I have a bunch of company names to match. For example, I want to match this string: A&A PRECISION
with A&A PRECISION ENGINEERING
However, almost every similarity measure I use (Hamming distance, Levenshtein distance, restricted Damerau-Levenshtein distance, full Damerau-Levenshtein distance, longest common substring distance, q-gram distance, cosine distance, Jaccard distance, Jaro, and Jaro-Winkler distance) matches
B&B PRECISION
instead.
Any idea which metric would put more weight on the preciseness and order of the matched substrings, and less on the overall length of the strings? I suspect it is the length difference that makes these metrics keep choosing the wrong candidate.
...ANSWER
Answered 2019-Nov-10 at 02:14
If you really want to "...give more emphasis to the preciseness of the substrings and its sequence...", then this function could work, as it tests whether a string is a substring of another one:
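The original answer's code is not shown here, but the idea translates to a short sketch like the following (Python for consistency with the other snippets on this page; the best_match helper is illustrative, not from the answer). Substring containment is checked first, with a generic similarity ratio only as a fallback:

import difflib

def best_match(query, candidates):
    q = query.lower()
    # Prefer candidates where one string wholly contains the other.
    containing = [c for c in candidates if q in c.lower() or c.lower() in q]
    if containing:
        # Among substring matches, prefer the candidate closest in length.
        return min(containing, key=lambda c: abs(len(c) - len(query)))
    # Fallback for non-substring cases: a generic similarity ratio.
    return max(candidates,
               key=lambda c: difflib.SequenceMatcher(None, q, c.lower()).ratio())

candidates = ["A&A PRECISION ENGINEERING", "B&B PRECISION"]
print(best_match("A&A PRECISION", candidates))  # A&A PRECISION ENGINEERING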
QUESTION
GraphDB supports the FTS Lucene plugin to build an RDF 'molecule' to index texts efficiently. However, when there is a typo (misspelling) in the word you are searching for, Lucene will not retrieve a result. I wonder if it is possible to implement a FuzzyQuery based on the Damerau-Levenshtein algorithm on top of the Lucene index in GraphDB for FTS. That way, even if the word is not spelled correctly, you can get a list of 'closer' words based on edit-distance similarity.
This is the index I have created for indexing labels of NounSynset in WordNet RDF.
...ANSWER
Answered 2019-Jun-22 at 06:07
If you append the ~ operator to the search term (Lucene's fuzzy-match syntax, e.g. grinder~ or grinder~1 for a maximum edit distance of 1), it should give you a fuzzy match.
QUESTION
Is there a string similarity measure available in Python+SQLite, for example via the sqlite3 module?
Example of use case:
...ANSWER
Answered 2018-Oct-23 at 19:15
Here is a ready-to-use example, test.py:
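The original test.py is not reproduced here, but the standard approach is to register a Python edit-distance function as a SQL UDF through sqlite3's create_function. A minimal sketch (the table and column names are made up for illustration):

import sqlite3

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

con = sqlite3.connect(":memory:")
con.create_function("editdist", 2, levenshtein)
con.execute("CREATE TABLE words (w TEXT)")
con.executemany("INSERT INTO words VALUES (?)", [("hello",), ("world",), ("help",)])
for row in con.execute("SELECT w FROM words ORDER BY editdist(w, 'helloo') LIMIT 1"):
    print(row)  # ('hello',) -- the closest word by edit distance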
QUESTION
I'm wondering how to modify the Damerau-Levenshtein algorithm to track the specific character transformations required to change a source string to a target string. This question has been answered for the Levenshtein distance, but I couldn't find any answers for DL distance.
I looked at the py-Levenshtein module: it provides exactly what I need, but for Levenshtein distance:
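The snippet the question refers to is not shown; for reference, python-Levenshtein exposes this through Levenshtein.editops, which reports only insert, delete, and replace operations and never a transposition, which is exactly the gap the question is about:

import Levenshtein  # pip install python-Levenshtein

# editops returns (op, source_pos, dest_pos) triples; the op is always
# one of 'insert', 'delete', or 'replace' -- there is no 'transpose'.
print(Levenshtein.editops('FISH', 'IFSH'))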
...ANSWER
Answered 2017-Jun-20 at 15:01

import numpy as np

def levenshtein_distance(string1, string2):
    n1 = len(string1)
    n2 = len(string2)
    return _levenshtein_distance_matrix(string1, string2)[n1, n2]

def damerau_levenshtein_distance(string1, string2):
    n1 = len(string1)
    n2 = len(string2)
    return _levenshtein_distance_matrix(string1, string2, True)[n1, n2]

def get_ops(string1, string2, is_damerau=False):
    # Build the matrix locally instead of relying on a global, then
    # backtrack from the bottom-right corner to recover the edit script.
    dist_matrix = _levenshtein_distance_matrix(string1, string2, is_damerau)
    i, j = dist_matrix.shape
    i -= 1
    j -= 1
    ops = list()
    while i != -1 and j != -1:
        if is_damerau:
            if i > 1 and j > 1 and string1[i-1] == string2[j-2] and string1[i-2] == string2[j-1]:
                if dist_matrix[i-2, j-2] < dist_matrix[i, j]:
                    ops.insert(0, ('transpose', i - 1, i - 2))
                    i -= 2
                    j -= 2
                    continue
        index = np.argmin([dist_matrix[i-1, j-1], dist_matrix[i, j-1], dist_matrix[i-1, j]])
        if index == 0:
            if dist_matrix[i, j] > dist_matrix[i-1, j-1]:
                ops.insert(0, ('replace', i - 1, j - 1))
            i -= 1
            j -= 1
        elif index == 1:
            ops.insert(0, ('insert', i - 1, j - 1))
            j -= 1
        elif index == 2:
            ops.insert(0, ('delete', i - 1, i - 1))
            i -= 1
    return ops

def execute_ops(ops, string1, string2):
    # Replay the edit script on string1, recording every intermediate string.
    strings = [string1]
    string = list(string1)
    shift = 0
    for op in ops:
        i, j = op[1], op[2]
        if op[0] == 'delete':
            del string[i + shift]
            shift -= 1
        elif op[0] == 'insert':
            string.insert(i + shift + 1, string2[j])
            shift += 1
        elif op[0] == 'replace':
            string[i + shift] = string2[j]
        elif op[0] == 'transpose':
            string[i + shift], string[j + shift] = string[j + shift], string[i + shift]
        strings.append(''.join(string))
    return strings

def _levenshtein_distance_matrix(string1, string2, is_damerau=False):
    n1 = len(string1)
    n2 = len(string2)
    d = np.zeros((n1 + 1, n2 + 1), dtype=int)
    for i in range(n1 + 1):
        d[i, 0] = i
    for j in range(n2 + 1):
        d[0, j] = j
    for i in range(n1):
        for j in range(n2):
            if string1[i] == string2[j]:
                cost = 0
            else:
                cost = 1
            d[i+1, j+1] = min(d[i, j+1] + 1,   # insert
                              d[i+1, j] + 1,   # delete
                              d[i, j] + cost)  # replace
            if is_damerau:
                if i > 0 and j > 0 and string1[i] == string2[j-1] and string1[i-1] == string2[j]:
                    d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + cost)  # transpose
    return d

if __name__ == "__main__":
    # Other pairs to try:
    # GIFTS PROFIT
    # FBBDE BCDASD
    # SPARTAN PART
    # PLASMA ALTRUISM
    # REPUBLICAN DEMOCRAT
    # PLASMA PLASMA
    # FISH IFSH
    # STAES STATES
    string1 = 'FISH'
    string2 = 'IFSH'
    for is_damerau in [True, False]:
        if is_damerau:
            print('=== damerau_levenshtein_distance ===')
        else:
            print('=== levenshtein_distance ===')
        dist_matrix = _levenshtein_distance_matrix(string1, string2, is_damerau=is_damerau)
        print(dist_matrix)
        ops = get_ops(string1, string2, is_damerau=is_damerau)
        print(ops)
        res = execute_ops(ops, string1, string2)
        print(res)
QUESTION
I was looking to test the performance of MDLD for some in-browser string comparisons to be integrated into a web app. The use case involves comparing strings like "300mm, Packed Wall" and "Packed Wall - 300mm", so I was looking for fuzzy string matching that has some tolerance for punctuation and typos, as well as allowing block character transpositions.
I wasn't able to find a JavaScript implementation online. I found a version written for PL/SQL available at CSIRO's Taxamatch wiki.
This was my attempt at converting the code into JS; the results of the basic function seem fairly accurate, but the block-transposition calculation doesn't give the expected results. E.g. "Hi There" vs "There Hi" returns 6, regardless of what the block limit is set to.
If anyone knows of a working implementation, could you point me to it? Alternatively, what's the problem with my adaptation, or with the source code itself? The only major change I made was to use Math.ceil() in two instances where the source appeared to use integer division, which always takes the floor; that was causing odd issues for inputs that produced one-character strings, but didn't seem to affect the behaviour of the other cases I tested.
...ANSWER
Answered 2018-Apr-18 at 04:24
In the end, I couldn't figure out what the issue was with my adaptation of the code from CSIRO. I found a GitHub repo that implements the function in C with Ruby extensions: https://github.com/GlobalNamesArchitecture/damerau-levenshtein.
I adapted that to get a functional implementation. It seems to work fine, but not great for my use case: MDLD can swap blocks of text, but only when multiple consecutive swaps aren't needed to reconstruct the source string. I'm going to look at n-grams instead.
For those who are interested, this was my final result. Performance-wise, with a block limit of 5, it compared about 1,000 strings of 20-40 characters in about 5 seconds.
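As a small sketch of the n-gram route mentioned above (illustrative Python, not the poster's final JS code): character-trigram Jaccard similarity is tolerant of reordered blocks like the pair from the question.

def trigrams(s):
    # Normalize: lowercase, drop punctuation, keep letters/digits/spaces.
    s = ''.join(ch.lower() for ch in s if ch.isalnum() or ch == ' ')
    return {s[i:i + 3] for i in range(len(s) - 2)}

def ngram_similarity(a, b):
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Relatively high despite the block reordering.
print(ngram_similarity("300mm, Packed Wall", "Packed Wall - 300mm"))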
QUESTION
I have a for loop that takes a user's input and one of the keys in my dictionary and passes them to a Damerau-Levenshtein function; based on the distance, it overwrites the user's input with the dictionary key (the for loop cycles through each dictionary key). This works well enough for strings longer than three characters, but if the string is three or fewer characters, the algorithm returns the wrong key. Here's the for loop:
...ANSWER
Answered 2018-Jan-09 at 21:31
I figured it out. After much searching I found a post saying that an edit-distance threshold of 2 is common (it didn't explain why 2 is common).
I switched my if statement's threshold from 4 to 2, and now all of the problem terms are corrected as they should be.
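A hedged reconstruction of the resulting loop (all names are placeholders, not the asker's code; a compact restricted Damerau-Levenshtein stands in for whatever distance function was being called):

def dl(a, b):
    # Compact restricted Damerau-Levenshtein (optimal string alignment).
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)
    return d[-1][-1]

def autocorrect(user_input, dictionary):
    # Overwrite the input with the closest key only when the distance is
    # at most 2, the threshold the answer settled on; 4 over-corrected
    # inputs of three or fewer characters.
    best = min(dictionary, key=lambda key: dl(user_input, key))
    return best if dl(user_input, best) <= 2 else user_input

print(autocorrect("cpoper", ["copper", "iron", "zinc"]))  # 'copper' (distance 1)
print(autocorrect("cat", ["copper", "iron", "zinc"]))     # 'cat' (nothing close enough)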
QUESTION
We are working on a record-linkage project, and we are observing strange behavior from all of the standard techniques: Jaro-Winkler, Levenshtein, N-Gram, Damerau-Levenshtein, Jaccard index, and Sorensen-Dice.
Say,
String 1= MINI GRINDER KIT
String 2= Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders
String 3= Milwaukee Video Borescope, Rotating Inspection Scope, Series: M-SPECTOR 360, 2.7 in 640 x 480 pixels High-Resolution LCD, Plastic, Black/Red
In the above case, string 1 and string 2 are related; the scores from all the methods are shown below.
Jaro Winkler -> 0.391666651
Levenshtein -> 75
N-Gram -> 0.9375
Damerau -> 75
Jaccard index -> 0
Sorensen-Dice -> 0
Cosine -> 0
But string 1 and string 3 are not related at all, yet the distance methods give a very high score.
Jaro Winkler -> 0.435714275
Levenshtein -> 133
N-Gram -> 0.953571439
Damerau -> 133
Jaccard index -> 1
Sorensen-Dice -> 0
Cosine -> 0
Any thoughts?
...ANSWER
Answered 2017-Mar-07 at 11:35
All of these distance calculations are case sensitive, so convert all strings to the same case first. Then the scores will come out as expected.
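A minimal illustration of the fix (difflib's ratio stands in here for whichever similarity measure is being used):

import difflib

s1 = "MINI GRINDER KIT"
s2 = "Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders"

def ratio(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

print(round(ratio(s1, s2), 3))                  # penalized by the case mismatch
print(round(ratio(s1.lower(), s2.lower()), 3))  # higher once both are casefolded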
QUESTION
I have a similarity matrix that I created using Harry, a tool for string similarity, and I wanted to plot some dendrograms out of it to see if I could find some clusters/groups in the data. I'm using the following similarity measures:
- Normalized compression distance (NCD)
- Damerau-Levenshtein distance
- Jaro-Winkler distance
- Levenshtein distance
- Optimal string alignment distance (OSA)
("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")
At first (it was basically my first time using R) I didn't pay too much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know, since my similarity matrix is normalized to [0,1], that I could just compute dissimilarity = 1 - similarity and then use hclust.
But the groups that I get using hclust with the similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.
I tried the proxy package as well, and the same problem occurs: the groups I get aren't what I expected.
To get the dendrogram from the similarity matrix I do:
(1) plot(hclust(as.dist(similarityMATRIX), "average"))
With the dissimilarity matrix I tried:
(2) plot(hclust(as.dist(dissimilarityMATRIX), "average"))
and
(3) plot(hclust(as.sim(dissimilarityMATRIX), "average"))
From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups I can get out of it aren't as good as the ones from (1).
I'm judging the groups as good or bad because at the moment I have a fairly small volume of data to analyse, so I can check them very easily.
Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?
...ANSWER
Answered 2017-Jul-13 at 12:08
You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package).
You can check how well a dendrogram fits using the dendextend R package's cor_cophenetic function (use the most recent version from GitHub).
Clustering based on distance can be done using hclust, but also using cluster::pam (k-medoids).
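For readers working in Python rather than R, the same convert-then-cluster step looks like this (a sketch with made-up data; scipy's average linkage is the counterpart of hclust(..., "average")):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# A toy normalized similarity matrix (symmetric, ones on the diagonal).
similarity = np.array([[1.0, 0.9, 0.2],
                       [0.9, 1.0, 0.3],
                       [0.2, 0.3, 1.0]])
dissimilarity = 1.0 - similarity                  # hierarchical clustering needs distances
condensed = squareform(dissimilarity)             # condensed form expected by linkage
Z = linkage(condensed, method='average')          # counterpart of hclust(..., "average")
print(Z)  # scipy.cluster.hierarchy.dendrogram(Z) would draw it, given matplotlib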
QUESTION
I have the following problem: I want to identify strings in Java that have a similar meaning. I tried to calculate similarities between strings with Stringmetrics. This works as expected, but I need something more convenient.
For example, when I have the following two strings (one word each):
...ANSWER
Answered 2017-Apr-26 at 14:14
Levenshtein distance (edit distance) is like the auto-correct on your phone. Taking your example, we have apple vs appel. The words are pretty close to each other if you consider adding/removing/replacing a single letter: all we need to do here is swap e and l (actually, replace e with l and l with e). If you had other words like applr or appee, these would be even closer to the original word apple, because all you need to do is replace a single letter.
Cosine similarity is completely different: it counts the words, makes a vector of those counts, and checks how similar the count vectors are. Here you have two completely different words, so it returns 0.
What you want is a combination of those two techniques, plus a computer with language knowledge, plus a dictionary of synonyms that is somehow consulted before and after running those similarity algorithms. Imagine if you had a sentence and then replaced every single word with a synonym (who remembers Joey and the thesaurus?). The sentences could be completely different. Plus, every word can have multiple synonyms, and some of those synonyms can be used only in a specific context. Your task is simply impossible as of now; maybe in the future.
P.S. If your task were possible, I think translation software would be basically perfect, but I'm not really sure about that.
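A concrete, stdlib-only version of that comparison (illustrative, not from the answer): edit distance sees apple/appel as near-identical, while cosine similarity over word counts sees two entirely different tokens.

from collections import Counter
import math

def levenshtein(a, b):
    # Row-by-row dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine(a, b):
    # Cosine similarity over whitespace-token counts.
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(levenshtein("apple", "appel"))  # 2: two substitutions under pure Levenshtein
print(cosine("apple", "appel"))       # 0.0: no shared words at all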
Community Discussions and Code Snippets contain sources that include the Stack Exchange Network.
Vulnerabilities
No vulnerabilities reported
Install damerau-levenshtein
On a UNIX-like operating system, using your system's package manager is easiest, although the packaged Ruby version may not be the newest one. There is also an installer for Windows. Managers help you switch between multiple Ruby versions on your system; installers can be used to install a specific Ruby version or multiple versions. Please refer to ruby-lang.org for more information.