damerau-levenshtein | Calculates edit distance using Damerau-Levenshtein algorithm

 by GlobalNamesArchitecture | Ruby | Version: Current | License: MIT

kandi X-RAY | damerau-levenshtein Summary

damerau-levenshtein is a Ruby library. damerau-levenshtein has no bugs, it has no vulnerabilities, it has a Permissive License and it has low support. You can download it from GitHub.

The damerau-levenshtein gem finds the edit distance between two UTF-8 or ASCII encoded strings with O(N*M) efficiency. The gem implements the pure Levenshtein algorithm and the Damerau modification of it (where a transposition of two adjacent characters counts as one edit). It also includes the Boehmer & Rees 2008 modification of the Damerau algorithm, in which transpositions of blocks larger than one character are taken into account as well (Rees 2014). In addition, it returns a diff between two strings according to the Levenshtein algorithm; the diff is expressed with markup tags that make it possible to highlight the difference between strings in a flexible way.
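
As a quick, hedged illustration of that behaviour (the method name, arguments, and return values below are assumptions based on this description, not confirmed by this page):

# Hedged sketch: DamerauLevenshtein.distance and its optional block-size
# argument are assumptions based on the description above.
require "damerau-levenshtein"

# A two-character transposition counts as a single edit in the Damerau variant.
DamerauLevenshtein.distance("Something", "Smoething")    #=> 1 (assumed)

# A block size of 0 is assumed to fall back to pure Levenshtein,
# where the same transposition costs two edits.
DamerauLevenshtein.distance("Something", "Smoething", 0) #=> 2 (assumed)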

            Support

              damerau-levenshtein has a low active ecosystem.
              It has 116 stars, 14 forks, and 7 watchers.
              It had no major release in the last 6 months.
              There are 0 open issues and 10 closed issues. On average, issues are closed in 95 days. There are no pull requests.
              It has a neutral sentiment in the developer community.
              The latest version of damerau-levenshtein is current.

            Quality

              damerau-levenshtein has 0 bugs and 0 code smells.

            Security

              damerau-levenshtein has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.
              damerau-levenshtein code analysis shows 0 unresolved vulnerabilities.
              There are 0 security hotspots that need review.

            License

              damerau-levenshtein is licensed under the MIT License. This license is Permissive.
              Permissive licenses have the least restrictions, and you can use them in most projects.

            Reuse

              damerau-levenshtein releases are not available. You will need to build from source code and install.
              Installation instructions are not available. Examples and code snippets are available.

            Top functions reviewed by kandi - BETA

            kandi has reviewed damerau-levenshtein and discovered the below as its top functions. This is intended to give you an instant insight into damerau-levenshtein implemented functionality, and help decide if they suit your requirements.
            • Constructor for a formatter
            • Returns the backtrace
            • Shortcut method to display text
            • Diff between two strings
            • Iterate through the matrix and return the matrix array
            • Searches the previous row
            • Sets the format
            • Returns a list of cells in the table
            • Prints the formatted output
            • Removes the matrix by index

            damerau-levenshtein Key Features

            No Key Features are available at this moment for damerau-levenshtein.

            damerau-levenshtein Examples and Code Snippets

            No Code Snippets are available at this moment for damerau-levenshtein.

            Community Discussions

            QUESTION

            Best similarity distance metric for two strings
            Asked 2019-Nov-10 at 02:14

            I have a bunch of company names to match, for example, I want to match this string: A&A PRECISION

            with A&A PRECISION ENGINEERING

            However, almost every similarity measure I use (Hamming distance, Levenshtein distance, Restricted Damerau-Levenshtein distance, Full Damerau-Levenshtein distance, Longest Common Substring distance, Q-gram distance, cosine distance, Jaccard distance, Jaro distance, and Jaro-Winkler distance)

            matches B&B PRECISION instead.

            Any idea which metric would give more emphasis to how precisely the substrings and their sequence are matched, and care less about the length of the string? I think it is because of the string length that the metrics keep choosing wrongly.

            ...

            ANSWER

            Answered 2019-Nov-10 at 02:14

            If you really want to "...give more emphasis to the preciseness of the substrings and its sequence...", then this function could work, as it tests whether a string is a substring of another one:
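
            (The function from the original answer is not reproduced here. As a rough Ruby sketch of the same idea, here is a hypothetical helper that prefers candidates containing the query as a substring, ignoring case:)

            # Rough sketch of a substring-containment check (hypothetical helper,
            # not the code from the linked answer).
            def substring_match?(query, candidate)
              candidate.downcase.include?(query.downcase)
            end

            candidates = ["A&A PRECISION ENGINEERING", "B&B PRECISION"]
            candidates.select { |c| substring_match?("A&A PRECISION", c) }
            #=> ["A&A PRECISION ENGINEERING"]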

            Source https://stackoverflow.com/questions/58781572

            QUESTION

            How to compute a Lucene FuzzyQuery on top of the GraphDB Lucene index?
            Asked 2019-Jun-22 at 06:07

            GraphDB supports the FTS Lucene plugin to build an RDF 'molecule' for indexing texts efficiently. However, when there is a typo (misspelling) in the word you are searching for, Lucene does not retrieve a result. I wonder if it is possible to implement a FuzzyQuery based on the Damerau-Levenshtein algorithm on top of the Lucene index in GraphDB for FTS. That way, even if the word is not spelled correctly, you can get a list of 'closer' words based on edit-distance similarity.

            This is the index I have created for indexing labels of NounSynset in WordNet RDF.

            ...

            ANSWER

            Answered 2019-Jun-22 at 06:07

            If you append the ~ operator to the query term (for example, term~2 for at most two edits), Lucene should give you a fuzzy match.

            Source https://stackoverflow.com/questions/56199048

            QUESTION

            String similarity with Python + Sqlite (Levenshtein distance / edit distance)
            Asked 2018-Oct-23 at 19:15

            Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module?

            Example of use case:

            ...

            ANSWER

            Answered 2018-Oct-23 at 19:15

            Here is a ready-to-use example test.py:

            Source https://stackoverflow.com/questions/49779281
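
            (The test.py from the linked answer is not reproduced here. As a rough sketch of the same idea, registering an edit-distance function with SQLite so it can be called from SQL, here is a hedged Ruby analogue using the sqlite3 gem; the original answer targets Python's sqlite3 module, and the function name editdist is hypothetical.)

            # Hedged Ruby analogue: register a user-defined edit-distance function
            # with SQLite and use it to rank rows by similarity.
            require "sqlite3"
            require "damerau-levenshtein"

            db = SQLite3::Database.new(":memory:")

            # create_function(name, arity) with a block that sets fn.result is the
            # sqlite3 gem's UDF mechanism (assumed here).
            db.create_function("editdist", 2) do |fn, a, b|
              fn.result = DamerauLevenshtein.distance(a.to_s, b.to_s)
            end

            db.execute("CREATE TABLE names (name TEXT)")
            db.execute("INSERT INTO names VALUES ('apple'), ('appel'), ('orange')")
            p db.execute("SELECT name FROM names ORDER BY editdist(name, 'apple') LIMIT 2")
            # expected to return the two closest names, 'apple' and 'appel'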

            QUESTION

            Modify Damerau-Levenshtein algorithm to track transformations (insertions, deletions, etc)
            Asked 2018-Jun-22 at 18:19

            I'm wondering how to modify the Damerau-Levenshtein algorithm to track the specific character transformations required to change a source string to a target string. This question has been answered for the Levenshtein distance, but I couldn't find any answers for DL distance.

            I looked at the py-Levenshtein module: it provides exactly what I need, but for Levenshtein distance:

            ...

            ANSWER

            Answered 2017-Jun-20 at 15:01
            import numpy as np
            
            def levenshtein_distance(string1, string2):
                n1 = len(string1)
                n2 = len(string2)
                return _levenshtein_distance_matrix(string1, string2)[n1, n2]
            
            def damerau_levenshtein_distance(string1, string2):
                n1 = len(string1)
                n2 = len(string2)
                return _levenshtein_distance_matrix(string1, string2, True)[n1, n2]
            
            def get_ops(string1, string2, is_damerau=False):
                # build the matrix here instead of relying on a module-level dist_matrix
                dist_matrix = _levenshtein_distance_matrix(string1, string2, is_damerau)
                i, j = dist_matrix.shape
                i -= 1
                j -= 1
                ops = list()
                while i != -1 and j != -1:
                    if is_damerau:
                        if i > 1 and j > 1 and string1[i-1] == string2[j-2] and string1[i-2] == string2[j-1]:
                            if dist_matrix[i-2, j-2] < dist_matrix[i, j]:
                                ops.insert(0, ('transpose', i - 1, i - 2))
                                i -= 2
                                j -= 2
                                continue
                    index = np.argmin([dist_matrix[i-1, j-1], dist_matrix[i, j-1], dist_matrix[i-1, j]])
                    if index == 0:
                        if dist_matrix[i, j] > dist_matrix[i-1, j-1]:
                            ops.insert(0, ('replace', i - 1, j - 1))
                        i -= 1
                        j -= 1
                    elif index == 1:
                        ops.insert(0, ('insert', i - 1, j - 1))
                        j -= 1
                    elif index == 2:
                        ops.insert(0, ('delete', i - 1, i - 1))
                        i -= 1
                return ops
            
            def execute_ops(ops, string1, string2):
                strings = [string1]
                string = list(string1)
                shift = 0
                for op in ops:
                    i, j = op[1], op[2]
                    if op[0] == 'delete':
                        del string[i + shift]
                        shift -= 1
                    elif op[0] == 'insert':
                        string.insert(i + shift + 1, string2[j])
                        shift += 1
                    elif op[0] == 'replace':
                        string[i + shift] = string2[j]
                    elif op[0] == 'transpose':
                        string[i + shift], string[j + shift] = string[j + shift], string[i + shift]
                    strings.append(''.join(string))
                return strings
            
            def _levenshtein_distance_matrix(string1, string2, is_damerau=False):
                n1 = len(string1)
                n2 = len(string2)
                d = np.zeros((n1 + 1, n2 + 1), dtype=int)
                for i in range(n1 + 1):
                    d[i, 0] = i
                for j in range(n2 + 1):
                    d[0, j] = j
                for i in range(n1):
                    for j in range(n2):
                        if string1[i] == string2[j]:
                            cost = 0
                        else:
                            cost = 1
                        d[i+1, j+1] = min(d[i, j+1] + 1, # delete from string1
                                          d[i+1, j] + 1, # insert from string2
                                          d[i, j] + cost) # replace
                        if is_damerau:
                            if i > 0 and j > 0 and string1[i] == string2[j-1] and string1[i-1] == string2[j]:
                                d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + cost) # transpose
                return d
            
            if __name__ == "__main__":
                # GIFTS PROFIT
                # FBBDE BCDASD
                # SPARTAN PART
                # PLASMA ALTRUISM
                # REPUBLICAN DEMOCRAT
                # PLASMA PLASMA
                # FISH IFSH
                # STAES STATES
                string1 = 'FISH'
                string2 = 'IFSH'
                for is_damerau in [True, False]:
                    if is_damerau:
                        print('=== damerau_levenshtein_distance ===')
                    else:
                        print('=== levenshtein_distance ===')
                    dist_matrix = _levenshtein_distance_matrix(string1, string2, is_damerau=is_damerau)
                    print(dist_matrix)
                    ops = get_ops(string1, string2, is_damerau=is_damerau)
                    print(ops)
                    res = execute_ops(ops, string1, string2)
                    print(res)
            

            Source https://stackoverflow.com/questions/44640570

            QUESTION

            Javascript Version of (MDLD) Modified Damerau-Levenshtein Distance Algorithm
            Asked 2018-Apr-18 at 04:24

            I was looking to test the performance of MDLD for some in-browser string comparisons to be integrated into a web app. The use case involves comparing strings like "300mm, Packed Wall" and "Packed Wall - 300mm", so I was looking for fuzzy string matching that has some tolerance for punctuation and typos, as well as allowing block character transpositions.

            I wasn't able to find an implementation online for Javascript. I found a version written for PL/SQL available at CSIRO's Taxamatch Wiki.

            This was my attempt at converting the code into JS; the results for the basic function seem fairly accurate, however, the block transposition calculation doesn't give the expected results. E.g. "Hi There" vs "There Hi" returns 6, regardless of what the block limit is set to.

            If anyone knows of a working implementation, could you point me to it? Alternatively, what's the problem with my adaptation, or with the source code itself? The only major change I made was to use "Math.ceil()" in two instances where the source appeared to use integer division, which always takes the floor; that was causing odd issues for inputs that would result in one-character strings, but it didn't seem to affect the behaviour of other cases I'd tested.

            ...

            ANSWER

            Answered 2018-Apr-18 at 04:24

            In the end, I couldn't figure out what the issue was with my adaptation of the code from CSIRO. Found a github repo that implemented the function in C with Ruby extensions, https://github.com/GlobalNamesArchitecture/damerau-levenshtein.

            Adapted that to get a functional implementation. Seems to work fine, but not great for my use case. MDLD can swap blocks of text, but only in circumstances where multiple consecutive swaps aren't needed to construct the source string. Going to look at N-Grams instead.

            For those who are interested, this was my final result. Performance-wise, with a block limit of 5, it compared about 1,000 strings of 20-40 characters in about 5 seconds.
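
            (The poster's final JavaScript adaptation is not reproduced here. For comparison, the Ruby gem referenced above exposes the block-transposition limit directly; a hedged sketch, assuming the distance method takes the block size as an optional third argument:)

            # Hedged sketch: the third argument is assumed to be the maximum block size
            # used for transpositions (the Boehmer & Rees variant described above).
            require "damerau-levenshtein"

            a = "Hi There"
            b = "There Hi"

            # A larger block size is assumed to let whole multi-character blocks
            # transpose at a lower cost than editing each character individually.
            puts DamerauLevenshtein.distance(a, b, 1)
            puts DamerauLevenshtein.distance(a, b, 5)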

            Source https://stackoverflow.com/questions/49871367

            QUESTION

            Damerau-Levenshtein algorithm isn't working on short strings
            Asked 2018-Jan-09 at 21:31

            I have a for loop that takes the user's input and one of the keys in my dictionary and passes them to a Damerau-Levenshtein function; based on the distance, it overwrites the user's input with the dictionary key (the for loop cycles through each dictionary key). This works well enough for strings longer than three characters, but if the string is three or fewer characters the algorithm returns the wrong key. Here's the for loop:

            ...

            ANSWER

            Answered 2018-Jan-09 at 21:31

            I figured it out. After much searching I found a post saying that a common edit-distance threshold is 2 (they didn't explain why 2 in particular).

            I switched the threshold in my if statement from 4 to 2, and now all of the problem terms are being corrected as they should be.
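
            (A minimal Ruby sketch of the correction loop described in the question, with the answer's threshold of 2; the helper name and structure are hypothetical, and DamerauLevenshtein.distance is assumed as elsewhere on this page.)

            # Minimal sketch of a dictionary-based correction loop (names are hypothetical).
            require "damerau-levenshtein"

            THRESHOLD = 2  # the edit-distance cutoff the answer settled on

            def autocorrect(input, dictionary)
              best = dictionary.min_by { |key| DamerauLevenshtein.distance(input, key) }
              DamerauLevenshtein.distance(input, best) <= THRESHOLD ? best : input
            end

            dictionary = ["cat", "dog", "fish"]
            puts autocorrect("cta", dictionary)  # expected to correct to "cat"
            puts autocorrect("zzz", dictionary)  # expected to stay "zzz" (too far from any key)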

            Source https://stackoverflow.com/questions/48174624

            QUESTION

            Record linkage using String similarity Techniques
            Asked 2017-Jul-20 at 06:27

            We are working on a record linkage project. We are observing strange behavior from all of the standard techniques, such as Jaro-Winkler, Levenshtein, N-Gram, Damerau-Levenshtein, Jaccard index, and Sorensen-Dice.

            Say, String 1= MINI GRINDER KIT
            String 2= Weiler 13001 Mini Grinder Accessory Kit, For Use With Small Right Angle Grinders
            String 3= Milwaukee Video Borescope, Rotating Inspection Scope, Series: M-SPECTOR 360, 2.7 in 640 x 480 pixels High-Resolution LCD, Plastic, Black/Red

            In the above case, string 1 and string 2 are related; the scores from all the methods are shown below.
            Jaro Winkler -> 0.391666651
            Levenshtein -> 75
            N-Gram, -> 0.9375
            Damerau -> 75
            Jaccard index -> 0
            Sorensen-Dice -> 0
            Cosine -> 0

            But string 1 and string 3 are not related at all, yet the distance methods give very high scores.
            Jaro Winkler -> 0.435714275
            Levenshtein -> 133
            N-Gram, -> 0.953571439
            Damerau -> 133
            Jaccard index -> 1
            Sorensen-Dice -> 0
            Cosine -> 0

            Any thoughts?

            ...

            ANSWER

            Answered 2017-Mar-07 at 11:35

            All of these distance calculations are case sensitive, so bring all the strings to the same case first. Then the scores will come out as expected.
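
            (In Ruby terms, a small hedged sketch of the fix, assuming the gem's distance method as above: normalize case before scoring so that case differences do not count as edits.)

            # Hedged sketch: downcase both strings before computing the distance.
            require "damerau-levenshtein"

            DamerauLevenshtein.distance("KIT", "Kit")                    #=> 2 (assumed: case differences count as edits)
            DamerauLevenshtein.distance("KIT".downcase, "Kit".downcase)  #=> 0 (assumed)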

            Source https://stackoverflow.com/questions/41859924

            QUESTION

            R: clustering with a similarity or dissimilarity matrix? And visualizing the results
            Asked 2017-Jul-13 at 12:08

            I have a similarity matrix that I created using Harry—a tool for string similarity, and I wanted to plot some dendrograms out of it to see if I could find some clusters / groups in the data. I'm using the following similarity measures:

            • Normalized compression distance (NCD)
            • Damerau-Levenshtein distance
            • Jaro-Winkler distance
            • Levenshtein distance
            • Optimal string alignment distance (OSA)

            ("For comparison Harry loads a set of strings from input, computes the specified similarity measure and writes a matrix of similarity values to output")

            At first, since it was my first time using R, I didn't pay too much attention to the documentation of hclust, so I used it with a similarity matrix. I know I should have used a dissimilarity matrix, and I know, since my similarity matrix is normalized to [0,1], that I could just do dissimilarity = 1 - similarity and then use hclust.

            But the groups that I get using hclust with a similarity matrix are much better than the ones I get using hclust with the corresponding dissimilarity matrix.

            I tried the proxy package as well and had the same problem: the groups I get aren't what I expected.

            To get the dendrograms using the similarity function I do:

            1. plot(hclust(as.dist(similarityMATRIX), "average"))

            With the dissimilarity matrix I tried:

            2. plot(hclust(as.dist(dissimilarityMATRIX), "average"))

            and

            3. plot(hclust(as.sim(dissimilarityMATRIX), "average"))

            From (1) I get what I believe to be a very good dendrogram, and so I can get very good groups out of it. From (2) and (3) I get the same dendrogram, and the groups that I can get out of it aren't as good as the ones I get from (1).

            I'm saying that the groups are bad/good because at the moment I have a rather small volume of data to analyse, so I can check them very easily.

            Does what I'm getting make any sense? Is there something that justifies it? Any suggestions on how to cluster with a similarity matrix? Is there a better way to visualize a similarity matrix than a dendrogram?

            ...

            ANSWER

            Answered 2017-Jul-13 at 12:08

            You can visualize a similarity matrix using a heatmap (for example, using the heatmaply R package). You can check if a dendrogram fits by using the dendextend R package function cor_cophenetic (use the most recent version from github).

            Clustering which is based on distance can be done using hclust, but also using cluster::pam (k-medoids).

            Source https://stackoverflow.com/questions/45061440

            QUESTION

            Identify strings with same meaning in java
            Asked 2017-Apr-26 at 14:14

            I have the following problem: I want to identify strings in Java that have a similar meaning. I tried to calculate similarities between strings with Stringmetrics. This works as expected, but I need something more convenient.

            For example when I have the following 2 strings (1 word):

            ...

            ANSWER

            Answered 2017-Apr-26 at 14:14

            Levenshtein distance (edit distance) is like the auto-correct in your phone. Taking your example we have apple vs appel. The words are kinda close to each other if you consider adding/removing/replacing a single letter, all we need to do here is swap e and l (actually replace e with l and l with e). If you had other words like applr or appee - these are closer to the original word apple because all you need to do is replace a single letter.
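
            (As a hedged Ruby illustration of that point, using this page's gem with an assumed optional block-size argument: the Damerau variant treats the e/l swap as one transposition, while pure Levenshtein needs two replacements.)

            # Hedged sketch: the values and the block-size argument are assumptions.
            require "damerau-levenshtein"

            DamerauLevenshtein.distance("apple", "appel")     #=> 1 (assumed: one transposition)
            DamerauLevenshtein.distance("apple", "appel", 0)  #=> 2 (assumed: pure Levenshtein)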

            Cosine similarity is completely different: it counts the words, makes a vector of those counts, and checks how similar the counts are. Here you have two completely different words, so it returns 0.

            What you want is a combination of those two techniques, plus a computer with language knowledge, plus another dictionary for synonyms that is somehow taken into consideration before and after using those similarity algorithms. Imagine if you had a sentence and then replaced every single word with a synonym (who remembers Joey and the thesaurus?). The sentences could be completely different. Plus, every word can have multiple synonyms, and some of those synonyms can be used only in a specific context. Your task is simply impossible as of now; maybe in the future.

            P.S. If your task was possible I think that translating software would be basically perfect, but I'm not really sure about that.

            Source https://stackoverflow.com/questions/43635719

            Community Discussions, Code Snippets contain sources that include Stack Exchange Network

            Vulnerabilities

            No vulnerabilities reported

            Install damerau-levenshtein

            You can download it from GitHub.
            On a UNIX-like operating system, using your system's package manager is easiest. However, the packaged Ruby version may not be the newest one. There is also an installer for Windows. Managers help you switch between multiple Ruby versions on your system, while installers can be used to install a specific Ruby version or multiple versions. Please refer to ruby-lang.org for more information.

            Support

            • Check out the latest master to make sure the feature hasn't been implemented or the bug hasn't been fixed yet.
            • Check out the issue tracker to make sure someone hasn't already requested it and/or contributed it.
            • Fork the project.
            • Start a feature/bugfix branch.
            • Commit and push until you are happy with your contribution.
            • Make sure to add tests for it. This is important so I don't break it in a future version unintentionally.
            • Please try not to mess with the Rakefile, version, or history. If you want to have your own version, or it is otherwise necessary, that is fine, but please isolate it to its own commit so I can cherry-pick around it.
            CLONE
          • HTTPS

            https://github.com/GlobalNamesArchitecture/damerau-levenshtein.git

          • CLI

            gh repo clone GlobalNamesArchitecture/damerau-levenshtein

          • sshUrl

            git@github.com:GlobalNamesArchitecture/damerau-levenshtein.git


            Consider Popular Ruby Libraries

            rails

            by rails

            jekyll

            by jekyll

            discourse

            by discourse

            fastlane

            by fastlane

            huginn

            by huginn

            Try Top Libraries by GlobalNamesArchitecture

            biodiversity

            by GlobalNamesArchitecture (C)

            gnparser

            by GlobalNamesArchitecture (Scala)

            gnrd

            by GlobalNamesArchitecture (HTML)

            taxamatch_rb

            by GlobalNamesArchitecture (Ruby)

            dwca_hunter

            by GlobalNamesArchitecture (Ruby)