edit-distance | Python library for computing edit distance | Autocomplete library

by belambert | Python | Version: v1.0.6 | License: Apache-2.0

kandi X-RAY | edit-distance Summary

edit-distance is a Python library typically used in user interface and autocomplete applications. It has no bugs and no vulnerabilities, a build file is available, it has a permissive license, and it has high support. You can download it from GitHub.

Python module for computing edit distances and alignments between sequences. I needed a way to compute edit distances between sequences in Python and wasn’t able to find any appropriate libraries, so I wrote my own. There appear to be numerous libraries for computing edit distances between two strings, but not between two arbitrary sequences. This is written entirely in Python. The implementation could likely be optimized to be faster within Python, and could probably be much faster if implemented in C.

The library API is modeled after difflib.SequenceMatcher. It is very similar to difflib, except that this module computes edit distance (Levenshtein distance) rather than using the Ratcliff/Obershelp method that Python’s difflib uses. difflib "does not yield minimal edit sequences, but does tend to yield matches that look right to people." If you find this library useful or have any suggestions, please send me a message.
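To make the difference between "strings" and "sequences" concrete, here is a minimal sketch of Levenshtein distance over arbitrary sequences (lists of words in this case), shown next to the standard library's difflib.SequenceMatcher. The levenshtein function and the ref/hyp data are illustrative assumptions written for this page; they are not taken from the edit-distance library's source or API.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Works on any pair of indexable sequences whose items support ==,
    e.g. strings, lists of words, or tuples of token ids.
    """
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # match (cost 0) or substitute (cost 1)
            ))
        previous = current
    return previous[-1]


# Hypothetical word-level example: one substitution plus one insertion.
ref = ["the", "quick", "brown", "fox"]
hyp = ["the", "quick", "red", "fox", "jumps"]

print(levenshtein(ref, hyp))  # 2

# difflib reports matching blocks found by its own heuristic, which is
# not guaranteed to correspond to a minimal edit sequence.
print(SequenceMatcher(a=ref, b=hyp).get_opcodes())
```

Because the function only compares items with ==, the same code handles characters, words, or any other hashable or non-hashable items.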

Support

edit-distance has a highly active ecosystem.

It has 96 star(s) with 16 fork(s). There are 2 watchers for this library.

It had no major release in the last 12 months.

There are 0 open issues and 5 have been closed. On average issues are closed in 316 days. There are no pull requests.

It has a positive sentiment in the developer community.

The latest version of edit-distance is v1.0.6

Quality

edit-distance has 0 bugs and 0 code smells.

Security

edit-distance has no vulnerabilities reported, and its dependent libraries have no vulnerabilities reported.

edit-distance code analysis shows 0 unresolved vulnerabilities.

There are 0 security hotspots that need review.

License

edit-distance is licensed under the Apache-2.0 License. This license is Permissive.

Permissive licenses have the least restrictions, and you can use them in most projects.

Reuse

edit-distance releases are available to install and integrate.

Build file is available. You can build the component from source.

Installation instructions are not available. Examples and code snippets are available.

edit-distance saves you 161 person hours of effort in developing the same functionality from scratch.

It has 439 lines of code, 34 functions and 8 files.

It has high code complexity. Code complexity directly impacts maintainability of the code.

Top functions reviewed by kandi - BETA

kandi has reviewed edit-distance and discovered the below as its top functions. This is intended to give you an instant insight into edit-distance implemented functionality, and help decide if they suit your requirements.

Compute the edit distance between two sequences
Return a list of opcodes from the given BP table
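The two functions listed above suggest a dynamic-programming table with backpointers (a "BP table") that is then walked to produce opcodes. The following is a sketch of that general technique under stated assumptions: the function name edit_opcodes, the table layout, and the difflib-style opcode tuples (tag, i1, i2, j1, j2) are choices made for illustration and are not the library's actual implementation.

```python
def edit_opcodes(a, b):
    """Return (distance, opcodes) for turning sequence a into sequence b."""
    m, n = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    # back[i][j] = operation used to reach cell (i, j) on one optimal path
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0], back[i][0] = i, "delete"
    for j in range(1, n + 1):
        dist[0][j], back[0][j] = j, "insert"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            candidates = [
                (dist[i - 1][j - 1] + (0 if same else 1),
                 "equal" if same else "replace"),
                (dist[i - 1][j] + 1, "delete"),
                (dist[i][j - 1] + 1, "insert"),
            ]
            dist[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Walk the backpointers from the bottom-right corner to the origin.
    opcodes = []
    i, j = m, n
    while i > 0 or j > 0:
        op = back[i][j]
        if op in ("equal", "replace"):
            opcodes.append((op, i - 1, i, j - 1, j))
            i, j = i - 1, j - 1
        elif op == "delete":
            opcodes.append((op, i - 1, i, j, j))
            i -= 1
        else:  # "insert"
            opcodes.append((op, i, i, j - 1, j))
            j -= 1
    opcodes.reverse()
    return dist[m][n], opcodes


distance, ops = edit_opcodes(["a", "b"], ["a", "c"])
print(distance, ops)
```

Every non-"equal" opcode on the recovered path costs exactly one edit, so the number of such opcodes equals the returned distance.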

edit-distance Key Features

No Key Features are available at this moment for edit-distance.

edit-distance Examples and Code Snippets

Calculate edit distance between hypothesis and truth matrix.

python

Lines of Code : 101

License : Non-SPDX (Apache License 2.0)

def edit_distance(hypothesis, truth, normalize=True, name="edit_distance"):
  """Computes the Levenshtein distance between sequences.

  This operation takes variable-length sequences (`hypothesis` and `truth`),
  each provided as a `SparseTensor`, a

Gets the min edit distance path from a node to u.

java

Lines of Code : 30

License : No License

public static Path getMinEditDistancePath(int u, int index) {

        if (index == tp.length) return new Path(0, new LinkedList());
        if (dp[u][index].path != null) return dp[u][index];

        String tcity = tp[index];
        String pcity =

Returns edit distance between two strings.

java

Lines of Code : 29

License : Permissive (MIT License)

public static int editDistance(String s1, String s2, int[][] storage) {
        int m = s1.length();
        int n = s2.length();
        if (storage[m][n] > 0) {
            return storage[m][n];

        }
        if (m == 0) {
            stora

Community Discussions

Trending Discussions on edit-distance

Coq Program Fixpoint vs equations as far as best way to get reduction lemmas?

How to normalize Levenshtein distance between 0 to 1

ghc error: hidden package, but it's actually exposed

Levenshtein distance with substitution, deletion and insertion count

Why does indexing a string inside of a recursive call yield a different result?

An algorithm for computing the edit-distance between two words

stack build on macOS

Speeding up Levenshtein distance calculation in python with global variables

QUESTION

Coq Program Fixpoint vs equations as far as best way to get reduction lemmas?

Asked 2022-Mar-24 at 21:42

I am trying to prove that particular implementations of how to calculate the edit distance between two strings are correct and yield identical results. I went with the most natural way to define edit distance recursively as a single function (see below). This caused Coq to complain that it couldn't determine the decreasing argument. After some searching, it seems that using the Program Fixpoint mechanism and providing a measure function is one way around this problem. However, this led to the next problem: the tactic simpl no longer works as expected. I found this question, which has a similar problem, but I am getting stuck because I don't understand the role the Fix_sub function plays in the code Coq generates for my edit distance function, which looks more complicated than the simple example in the previous question.

Questions:

For a function like edit distance, could the Equations package be easier to use than Program Fixpoint (get reduction lemmas automatically)? The previous question on this front is from 2016, so I am curious if the best practices on this front have evolved since then.
I came across this Coq program involving edit_distance that uses an inductively defined Prop instead of a function. Maybe this is me still trying to wrap my head around the Curry-Howard correspondence, but why is Coq willing to accept the inductive proposition definition for edit_distance without termination/measure complaints, but not the function-driven approach? Does this mean there is an angle using a creatively defined inductive type, one that wraps both strings as a pair together with a number, that could be passed to edit_distance and that Coq would more easily accept as structural recursion?

Is there an easier way using Program Fixpoint to get reductions?

...

ANSWER

Answered 2022-Mar-24 at 21:12

There is a common trick to this kind of recursion over two arguments, which is to write two nested functions, each recursing over one of the two arguments.

This can also be understood from the perspective of dynamic programming, where the edit distance is computed by traversing a matrix. More generally, the edit distance function edit xs ys can be viewed as a matrix of nat with rows indexed by xs and columns indexed by ys. The outer recursion iterates over rows xs, and for each of those rows, when xs = x :: xs', the inner recursion iterates over its columns ys to generate the entries of that row from another row with a smaller index xs'.
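The row-by-row structure described here can be sketched outside Coq as well. The Python below is only an illustration of the shape of the computation (it is not the answer's Coq code, and the names next_row and edit are hypothetical): the outer pass consumes one element of xs at a time, and the inner pass builds the row for x :: xs' purely from the row for xs'.

```python
def next_row(x, ys, prev_row):
    """Inner pass: build the row for x :: xs' from the row for xs'.

    prev_row[k] is the edit distance between xs' and the suffix ys[k:].
    The returned row[k] is the edit distance between x :: xs' and ys[k:].
    """
    n = len(ys)
    row = [0] * (n + 1)
    row[n] = prev_row[n] + 1  # ys exhausted: the only option is to delete x
    for k in range(n - 1, -1, -1):
        row[k] = min(
            prev_row[k] + 1,                 # delete x
            row[k + 1] + 1,                  # insert ys[k]
            prev_row[k + 1] + (x != ys[k]),  # match or substitute
        )
    return row


def edit(xs, ys):
    """Outer pass: one row per element of xs, starting from the empty list."""
    # Row for xs = []: the distance to ys[k:] is simply its length.
    row = list(range(len(ys), -1, -1))
    for x in reversed(xs):
        row = next_row(x, ys, row)
    return row[0]


print(edit("kitten", "sitting"))
```

Each pass recurses (here, loops) over exactly one argument, which is the property that lets a proof assistant see the recursion as structural.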

Source https://stackoverflow.com/questions/71608107

QUESTION

How to normalize Levenshtein distance between 0 to 1

Asked 2020-Sep-29 at 20:54

I have to normalize the Levenshtein distance to a value between 0 and 1. I see different variations floating around on SO.

I am thinking to adopt the following approach:

if two strings, s1 and s2
len = max(s1.length(), s2.length());
normalized_distance = float(len - levenshteinDistance(s1, s2)) / float(len);

Then the highest score 1.0 means an exact match and 0.0 means no match.

But I see variations here: two whole texts similarity using levenshtein distance where 1- distance(a,b)/max(a.length, b.length)

Difference in normalization of Levenshtein (edit) distance?

Explanation of normalized edit distance formula
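The two formulas quoted above are algebraically the same quantity, since (len - d) / len = 1 - d / len. A minimal Python sketch of that normalization follows; the function names are illustrative, and the handling of two empty strings (returning 1.0) is an assumption, because the formula itself is undefined when the longer length is zero.

```python
def levenshtein(s1, s2):
    """Plain Levenshtein distance with unit costs."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # match or substitution
            ))
        previous = current
    return previous[-1]


def normalized_similarity(s1, s2):
    """1.0 for identical strings, 0.0 when every position must change."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as an exact match
    return 1.0 - levenshtein(s1, s2) / longest


print(normalized_similarity("kitten", "sitting"))
```

The result always lies in [0, 1] because the Levenshtein distance can never exceed the length of the longer string.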

I am wondering is there a canonical code implementation in Java? I know org.apache.commons.text only implements LevenshteinDistance and not normalized LevenshteinDistance.

https://commons.apache.org/proper/commons-text/apidocs/org/apache/commons/text/similarity/LevenshteinDistance.html

...

ANSWER

Answered 2020-Sep-29 at 06:11

Your first answer begins with "The effects of both variants should be nearly the same". The reason normalized LevenshteinDistance doesn't exist is that you (or somebody else) haven't seen fit to implement it. Besides, it seems rather trivial once you have the Levenshtein distance:

Source https://stackoverflow.com/questions/64113621

QUESTION

ghc error: hidden package, but it's actually exposed

Asked 2020-Aug-05 at 18:57

I'm trying to use the package Parsec. When I run ghc Main.hs I get the error message:

...

ANSWER

Answered 2020-Aug-05 at 18:57

This looks like an issue with global vs local installs. Oh, and there it is in your ghc-pkg list output. You've got a multiuser ghc install and a single-user list of packages you've installed. Things work when you run ghc as a superuser because they won't see your local (per-user) installs.

This is going to cause problems unless you use a tool to manage your environment for you. Both cabal and stack can handle this fine. I prefer cabal because it doesn't need coaxing to work with your preinstalled GHC, but this is a matter that has caused religious wars in the past. I won't argue against stack if you have a good resource for using it instead.

Source https://stackoverflow.com/questions/63267436

QUESTION

Levenshtein distance with substitution, deletion and insertion count

Asked 2020-May-13 at 21:24

There's a great blog post here https://davedelong.com/blog/2015/12/01/edit-distance-and-edit-steps/ on Levenshtein distance. I'm trying to implement this to also include counts of subs, dels and ins when returning the Levenshtein distance. Just running a smell check on my algorithm.

...

ANSWER

Answered 2020-May-13 at 21:23

The problem was that Python passes object references, so I should have been copying the lists into the variables rather than assigning a direct reference.
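One way to sidestep the aliasing problem entirely is to store the counts in immutable tuples, so that no two cells can ever share mutable state. The sketch below is an assumption about how such a counter could look; it is not the asker's code, and the function name levenshtein_counts is hypothetical. When several optimal paths exist, the split between substitutions, deletions and insertions depends on the tie-breaking order, but the three counts always add up to the distance.

```python
def levenshtein_counts(source, target):
    """Return (distance, substitutions, deletions, insertions)."""
    # Each cell is an immutable tuple, so assigning or copying a cell
    # can never create a shared reference that is mutated later.
    previous = [(j, 0, 0, j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        current = [(i, 0, i, 0)]
        for j, t in enumerate(target, start=1):
            d, subs, dels, ins = previous[j - 1]
            if s == t:
                diagonal = (d, subs, dels, ins)          # match, no cost
            else:
                diagonal = (d + 1, subs + 1, dels, ins)  # substitution
            d, subs, dels, ins = previous[j]
            deletion = (d + 1, subs, dels + 1, ins)
            d, subs, dels, ins = current[j - 1]
            insertion = (d + 1, subs, dels, ins + 1)
            # Compare on total cost only; ties keep the earliest candidate.
            current.append(min(diagonal, deletion, insertion,
                               key=lambda cell: cell[0]))
        previous = current
    return previous[-1]


print(levenshtein_counts("kitten", "sitting"))
```

With mutable lists in the cells, a line such as row = previous would make both names point at the same list; list(previous) or copy.deepcopy would be needed to get an independent copy.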

Source https://stackoverflow.com/questions/61784300

QUESTION

Why does indexing a string inside of a recursive call yield a different result?

Asked 2020-May-10 at 10:30

In my naive implementation of edit-distance finder, I have to check whether the last characters of two strings match:

...

ANSWER

Answered 2020-May-10 at 10:30

The operators have different precedence from what you expect. In const auto delt = a[$ - 1] == b[$ - 1] ? 0 : 1; there is no ambiguity, but in editDistance(a[0 .. $ - 1], b[0 .. $ - 1]) + a[$ - 1] == b[$ - 1] ? 0 : 1, there is (seemingly).

Simplifying:

Source https://stackoverflow.com/questions/61707979

QUESTION

An algorithm for computing the edit-distance between two words

Asked 2020-Apr-24 at 02:10

I am trying to write Python code that takes a word as an input (e.g. book), and outputs the most similar word with similarity score.

I have tried different off-the-shelf edit-distance algorithms like cosine, Levenshtein and others, but these cannot tell the degree of difference. For example, (book, bouk) and (book, bo0k). I am looking for an algorithm that can give different scores for these two examples. I am thinking about using fastText or BPE; however, they use cosine distance.

Is there any algorithm that can solve this?

...

ANSWER

Answered 2020-Apr-22 at 12:25

The problem is that both "bo0k" and "bouk" are one character different from "book", and no other metric will give you a way to distinguish between them.

What you will need to do is change the scoring: Instead of counting a different character as an edit distance of 1, you could give it a higher score if it's a different character class (ie a digit instead of a letter). That way you will get a different score for your examples.

You might have to adapt the other scores as well, though, so that replacement / insertion / deletion are still consistent.
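The scoring change suggested above can be sketched as a Levenshtein variant with a pluggable substitution cost. Everything specific here is an assumption made for illustration: the three character classes, the weight 1.0 for a same-class substitution, and the weight 1.5 for a cross-class one. The cross-class weight is kept below 2.0 so that a substitution never costs more than a deletion plus an insertion, which is one way to keep the scores consistent.

```python
def char_class(c):
    """Coarse character class used to weight substitutions (illustrative)."""
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "other"


def substitution_cost(a, b):
    """0 for identical characters, 1.0 within a class, 1.5 across classes."""
    if a == b:
        return 0.0
    return 1.0 if char_class(a) == char_class(b) else 1.5  # hypothetical weights


def weighted_levenshtein(s1, s2, indel_cost=1.0):
    """Levenshtein distance with class-aware substitution costs."""
    previous = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + indel_cost,                     # deletion
                current[j - 1] + indel_cost,                  # insertion
                previous[j - 1] + substitution_cost(c1, c2),  # match or substitution
            ))
        previous = current
    return previous[-1]


print(weighted_levenshtein("book", "bouk"))  # 1.0
print(weighted_levenshtein("book", "bo0k"))  # 1.5
```

With these weights, replacing a letter with another letter scores lower than replacing it with a digit, so the two examples from the question no longer tie.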

Source https://stackoverflow.com/questions/61364975

QUESTION

stack build on macOS

Asked 2020-Apr-04 at 17:27

I am new to haskell. I have the simplest of simple programs.

...

ANSWER

Answered 2020-Apr-04 at 17:27

I just found this known bug: https://github.com/commercialhaskell/stack/issues/4373

That is exactly what I'm seeing.

The workaround required is to update a settings file that is buried deep under a newly generated ~/.stack directory https://github.com/commercialhaskell/stack/issues/4373#issuecomment-432726112

Those instructions are incomplete so I added a comment to that bug to clarify. That settings location: ~/.stack/programs/x86_64-osx/ghc-8.8.3/lib/ghc-8.8.3/settings

And this works (note that stack test is a combination of stack build and stack test):

Source https://stackoverflow.com/questions/61023053

QUESTION

Speeding up Levenshtein distance calculation in python with global variables

Asked 2020-Jan-05 at 09:19

Hi I'm using python for a project in bioinformatics.

I have a function that uses the Needleman-Wunsch algorithm to calculate the edit-distance between a query and a read from our Next-generation-Sequencing platform (both strings over the alphabet 'ACGT'). My script works fine, but takes a long time to run, because the function is called more than 100 million times in total. In the function I use a 2-dimensional list with size MxN, where M is the length of the query and N is the length of the read. Every time the function is called, this 2D list has to be recreated in memory before it can be filled with the calculation. I was wondering if I could speed up the process by creating a 2D list as a global variable, and then passing the handle to this list as an argument to the function. This way the memory would only have to be allocated once by the operating system. I hope I made my question clear. How much time does requesting the memory for a list from the operating system take? Is it significant?

Edit: some sample code as requested:

The function goes through the 2D-Array and fills it with numbers:

...

ANSWER

Answered 2020-Jan-04 at 22:46

This is my take on the performance impact by having the list enclosed locally in the function rather than "globally".

Edit: as pointed out by @DanD in the comments, I wrote (and deleted) before the more traditional way of stacks and heaps. This is not entirely true for Python. The Python Virtual Machine (PVM) only uses a private heap to allocate its objects. But the PVM itself has been implemented as a stack. Then Python uses reference counters (among other things) to keep track of the objects, whether they should be discarded or not. When you use your first example, the list object gets pushed onto the stack again and again and again. The previous list object gets its reference counter decreased, and then gets removed when the reference counter reaches 0. This is a good amount of overhead. Your second example creates the list object once, keeps the reference counter satisfied, and then the PVM can use that object each time you make your call.

So: instead of recreating the list object for each call and generating new references, the performance is gained by having only 1 list object created with the same references.
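As a rough sketch of the idea under discussion, the function below accepts an optional preallocated table and only builds a new one when none is supplied. This is not the code from the question or the answer: the names edit_distance, SHARED_TABLE and MAX_LEN are hypothetical, the costs are plain unit costs, and the buffer is assumed to be at least (len(query) + 1) by (len(read) + 1). Whether reuse is actually faster should be measured (for example with timeit) rather than assumed.

```python
def edit_distance(query, read, table=None):
    """Unit-cost edit distance, optionally reusing a preallocated 2D list."""
    m, n = len(query), len(read)
    if table is None:
        # Fresh allocation on every call (the behaviour being questioned).
        table = [[0] * (n + 1) for _ in range(m + 1)]
    # Every cell that is read below is overwritten first, so stale values
    # left in a reused table from an earlier call do not matter.
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        row, above = table[i], table[i - 1]
        for j in range(1, n + 1):
            row[j] = min(
                above[j] + 1,                               # deletion
                row[j - 1] + 1,                             # insertion
                above[j - 1] + (query[i - 1] != read[j - 1]),  # match or substitution
            )
    return table[m][n]


MAX_LEN = 64  # hypothetical upper bound on query and read length
SHARED_TABLE = [[0] * (MAX_LEN + 1) for _ in range(MAX_LEN + 1)]

print(edit_distance("ACGT", "AGT", SHARED_TABLE))
print(edit_distance("ACGTACGT", "TTTT", SHARED_TABLE))
```

Passing the table as an argument keeps the function free of global state while still allocating the list only once.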

Here is a small example, which is your first and second example in a nutshell:

Source https://stackoverflow.com/questions/59593094

Community Discussions, Code Snippets contain sources that include Stack Exchange Network

Vulnerabilities

No vulnerabilities reported

Install edit-distance

You can download it from GitHub.
You can use edit-distance like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.

Support

For contributions, it’s best to use GitHub issues and pull requests. Proper testing and documentation are required. Code of conduct is expected to be reasonable, especially as specified by the [Contributor Covenant](http://contributor-covenant.org/version/1/4/).

Find more information at: