edit-distance | Python library for computing edit distance | Autocomplete library
kandi X-RAY | edit-distance Summary
Python module for computing edit distances and alignments between sequences. I needed a way to compute edit distances between sequences in Python and wasn’t able to find an appropriate library, so I wrote my own. There appear to be numerous libraries for computing edit distances between two strings, but not between two arbitrary sequences. The module is written entirely in Python. The implementation could likely be optimized to run faster within Python, and could probably be much faster if implemented in C. The library API is modeled after difflib.SequenceMatcher. It is very similar to difflib, except that this module computes edit distance (Levenshtein distance) rather than using the Ratcliff/Obershelp method that Python’s difflib uses. difflib "does not yield minimal edit sequences, but does tend to yield matches that look right to people." If you find this library useful or have any suggestions, please send me a message.
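Since the API is described as modeled after difflib.SequenceMatcher, a short look at that standard-library class on token sequences may help show the shape of the interface. This sketch uses only difflib; it does not call edit-distance itself, and the word lists are made up for illustration.

```python
from difflib import SequenceMatcher

# Two sequences of tokens (not plain strings); the values are illustrative only.
ref = ["the", "quick", "brown", "fox"]
hyp = ["the", "quick", "red", "fox"]

# autojunk=False turns off difflib's popularity heuristic.
matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)

# Each opcode is (tag, i1, i2, j1, j2): turn ref[i1:i2] into hyp[j1:j2].
opcodes = matcher.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, ref[i1:i2], hyp[j1:j2])

# ratio() is difflib's similarity score in [0, 1]; it is not an edit distance.
similarity = matcher.ratio()
print(similarity)
```

The opcode tuples are the part a Levenshtein-based matcher would be expected to mirror; the alignment it reports is where the two approaches can differ.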
Top functions reviewed by kandi - BETA
- Compute the edit distance between two sequences
- Return a list of opcodes from the given BP table
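The two functions listed above correspond to a common two-step pattern: fill a dynamic-programming table while recording a backpointer per cell, then walk the backpointers to recover opcodes. The following is a minimal sketch of that pattern, not the library's actual code; the function names and the string tags stored in the table are assumptions.

```python
def edit_distance_with_backpointers(a, b):
    """Return (distance, bp); bp[i][j] names the step used to reach cell (i, j)."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    bp = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0], bp[i][0] = i, "delete"
    for j in range(1, n + 1):
        dist[0][j], bp[0][j] = j, "insert"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                candidates = [(dist[i - 1][j - 1], "equal")]
            else:
                candidates = [(dist[i - 1][j - 1] + 1, "replace")]
            candidates.append((dist[i - 1][j] + 1, "delete"))
            candidates.append((dist[i][j - 1] + 1, "insert"))
            # On ties, min() keeps the first candidate (equal/replace).
            dist[i][j], bp[i][j] = min(candidates, key=lambda c: c[0])
    return dist[m][n], bp


def opcodes_from_backpointers(bp):
    """Walk bp from the bottom-right corner and return difflib-style opcodes."""
    i, j = len(bp) - 1, len(bp[0]) - 1
    ops = []
    while i > 0 or j > 0:
        tag = bp[i][j]
        if tag in ("equal", "replace"):
            ops.append((tag, i - 1, i, j - 1, j))
            i, j = i - 1, j - 1
        elif tag == "delete":
            ops.append((tag, i - 1, i, j, j))
            i -= 1
        else:  # "insert"
            ops.append((tag, i, i, j - 1, j))
            j -= 1
    ops.reverse()
    return ops


distance, table = edit_distance_with_backpointers(["a", "b", "c"], ["a", "x", "c", "d"])
ops = opcodes_from_backpointers(table)
print(distance, ops)
```

Each opcode here covers a single element; a fuller implementation might merge adjacent opcodes with the same tag.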
edit-distance Examples and Code Snippets
def edit_distance(hypothesis, truth, normalize=True, name="edit_distance"):
    """Computes the Levenshtein distance between sequences.

    This operation takes variable-length sequences (`hypothesis` and `truth`),
    each provided as a `SparseTensor`, a

public static Path getMinEditDistancePath(int u, int index) {
    if (index == tp.length) return new Path(0, new LinkedList());
    if (dp[u][index].path != null) return dp[u][index];
    String tcity = tp[index];
    String pcity =

public static int editDistance(String s1, String s2, int[][] storage) {
    int m = s1.length();
    int n = s2.length();
    if (storage[m][n] > 0) {
        return storage[m][n];
    }
    if (m == 0) {
        stora
Community Discussions
Trending Discussions on edit-distance
QUESTION
I am trying to prove that particular implementations of edit distance between two strings are correct and yield identical results. I went with the most natural way to define edit distance recursively as a single function (see below). This caused Coq to complain that it couldn't determine the decreasing argument. After some searching, it seems that using the Program Fixpoint mechanism and providing a measure function is one way around this problem. However, this led to the next problem: the tactic simpl no longer works as expected. I found this question, which has a similar problem, but I am getting stuck because I don't understand the role the Fix_sub function plays in the code Coq generates for my edit distance function, which looks more complicated than the simple example in the previous question.
Questions:
- For a function like edit distance, could the Equations package be easier to use than Program Fixpoint (get reduction lemmas automatically)? The previous question on this front is from 2016, so I am curious if the best practices on this front have evolved since then.
- I came across this Coq program involving edit_distance that uses an inductively defined Prop instead of a function. Maybe this is me still trying to wrap my head around the Curry-Howard correspondence, but why is Coq willing to accept the inductive proposition definition for edit_distance without termination/measure complaints, but not the function-driven approach? Does this mean there is an angle using a creatively defined inductive type, one that wraps both strings as a pair together with a number, that could be passed to edit_distance and that Coq would more easily accept as structural recursion?
Is there an easier way using Program Fixpoint to get reductions?
...ANSWER
Answered 2022-Mar-24 at 21:12
There is a common trick to this kind of recursion over two arguments, which is to write two nested functions, each recursing over one of the two arguments.
This can also be understood from the perspective of dynamic programming, where the edit distance is computed by traversing a matrix. More generally, the edit distance function edit xs ys can be viewed as a matrix of nat with rows indexed by xs and columns indexed by ys. The outer recursion iterates over rows xs, and for each of those rows, when xs = x :: xs', the inner recursion iterates over its columns ys to generate the entries of that row from another row with a smaller index xs'.
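The row-by-row view can be sketched outside Coq as well. The Python below is an illustration of the same structure, not the answer's Coq code: edit_row plays the role of the inner recursion over ys, and edit plays the role of the outer recursion over xs. Both names are hypothetical. Columns are indexed by suffixes of ys so that the row for x :: xs' is built only from the row for xs'.

```python
def edit_row(x, prev_row, ys):
    """Inner pass: build the row for x :: xs' from prev_row, the row for xs'.

    row[j] holds the edit distance between x :: xs' and the suffix ys[j:].
    """
    n = len(ys)
    row = [0] * (n + 1)
    row[n] = prev_row[n] + 1  # ys exhausted: delete x
    for j in range(n - 1, -1, -1):
        if x == ys[j]:
            row[j] = prev_row[j + 1]  # heads match: no cost
        else:
            row[j] = 1 + min(
                prev_row[j + 1],  # substitute x with ys[j]
                prev_row[j],      # delete x
                row[j + 1],       # insert ys[j]
            )
    return row


def edit(xs, ys):
    """Outer pass: start from the row for xs = [] and add one element at a time."""
    row = list(range(len(ys), -1, -1))  # edit [] ys[j:] = len(ys) - j
    for x in reversed(xs):
        row = edit_row(x, row, ys)
    return row[0]


print(edit("kitten", "sitting"))
```

Each pass shrinks exactly one argument, which is the property that lets a termination checker accept the nested formulation as structural recursion.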
QUESTION
I have to normalize the Levenshtein distance between 0 to 1. I see different variations floating in SO.
I am thinking to adopt the following approach:
- if two strings, s1 and s2
- len = max(s1.length(), s2.length());
- normalized_distance = float(len - levenshteinDistance(s1, s2)) / float(len);
Then the highest score 1.0 means an exact match and 0.0 means no match.
But I see variations here: two whole texts similarity using levenshtein distance where 1- distance(a,b)/max(a.length, b.length)
Difference in normalization of Levenshtein (edit) distance?
Explanation of normalized edit distance formula
I am wondering whether there is a canonical code implementation in Java. I know org.apache.commons.text only implements LevenshteinDistance and not a normalized LevenshteinDistance.
ANSWER
Answered 2020-Sep-29 at 06:11
Your first answer begins with "The effects of both variants should be nearly the same". The reason a normalized LevenshteinDistance doesn't exist is that you (or somebody else) haven't seen fit to implement it. Besides, it seems rather trivial once you have the Levenshtein distance:
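As an illustration of how little is needed on top of the raw distance, here is a minimal Python sketch (the question asks about Java; the function names are hypothetical). Note that (len - d) / len and 1 - d / len are algebraically the same quantity, so the two variants in the question give identical scores. Returning 1.0 for two empty strings is an assumption made here to avoid dividing by zero.

```python
def levenshtein(s1, s2):
    """Unit-cost Levenshtein distance, keeping only two rows in memory."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def normalized_similarity(s1, s2):
    """Return 1.0 for identical strings and 0.0 when nothing can be kept."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as an exact match
    return 1.0 - levenshtein(s1, s2) / longest


print(normalized_similarity("kitten", "sitting"))
```

Dividing by the longer length keeps the score in [0, 1], because the distance can never exceed max(len(s1), len(s2)).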
QUESTION
I'm trying to use the package Parsec. When I run ghc Main.hs I get the error message:
ANSWER
Answered 2020-Aug-05 at 18:57
This looks like an issue with global vs local installs. Oh, and there it is in your ghc-pkg list output. You've got a multiuser ghc install and a single-user list of packages you've installed. Things work when you run ghc as a superuser because they won't see your local (per-user) installs.
This is going to cause problems unless you use a tool to manage your environment for you. Both cabal and stack can handle this fine. I prefer cabal because it doesn't need coaxing to work with your preinstalled GHC, but this is a matter that has caused religious wars in the past. I won't argue against stack if you have a good resource for using it instead.
QUESTION
There's a great blog post here https://davedelong.com/blog/2015/12/01/edit-distance-and-edit-steps/ on Levenshtein distance. I'm trying to implement this to also include counts of subs, dels and ins when returning the Levenshtein distance. Just running a smell check on my algorithm.
...ANSWER
Answered 2020-May-13 at 21:23
The problem was that Python passes objects by reference, so assigning a list to another variable does not copy it. I should have been cloning the lists into the variables rather than taking a direct reference.
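The asker's code is not shown, so the following is only a generic sketch of the pitfall with made-up variable names: plain assignment creates a second name for the same list, while list(...) or copy.deepcopy(...) creates an independent object.

```python
import copy

row = [0, 0, 0]
alias = row        # same list object under a second name
clone = list(row)  # new list with the same elements

alias[0] = 99      # also changes row, but not clone
print(row, clone)

# For nested lists (e.g. a DP table), a shallow copy still shares the inner rows.
grid = [[0, 0], [0, 0]]
shallow = list(grid)
deep = copy.deepcopy(grid)

grid[0][0] = 7     # visible through shallow, not through deep
print(shallow, deep)
```

For per-cell counters of substitutions, deletions and insertions, each cell needs its own list, otherwise updates to one cell leak into its neighbours.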
QUESTION
In my naive implementation of edit-distance finder, I have to check whether the last characters of two strings match:
...ANSWER
Answered 2020-May-10 at 10:30
The operators have different precedence from what you expect. In const auto delt = a[$ - 1] == b[$ - 1] ? 0 : 1; there is no ambiguity, but in editDistance(a[0 .. $ - 1], b[0 .. $ - 1]) + a[$ - 1] == b[$ - 1] ? 0 : 1, there is (seemingly).
Simplifying:
QUESTION
I am trying to write Python code that takes a word as an input (e.g. book), and outputs the most similar word with similarity score.
I have tried different off-the-shelf similarity measures such as cosine and Levenshtein, but these cannot tell the degree of difference. For example, consider (book, bouk) and (book, bo0k). I am looking for an algorithm that can give different scores for these two examples. I am thinking about using fastText or BPE; however, they use cosine distance.
Is there any algorithm that can solve this?
...ANSWER
Answered 2020-Apr-22 at 12:25
The problem is that both "bo0k" and "bouk" are one character different from "book", and no other metric will give you a way to distinguish between them.
What you will need to do is change the scoring: Instead of counting a different character as an edit distance of 1, you could give it a higher score if it's a different character class (ie a digit instead of a letter). That way you will get a different score for your examples.
You might have to adapt the other scores as well, though, so that replacement / insertion / deletion are still consistent.
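A minimal sketch of such a scoring change is shown below. The character classes and the cost values (1 within a class, 2 across classes) are assumptions chosen for illustration, and the function names are hypothetical.

```python
def char_class(ch):
    """Coarse character class used to price substitutions (illustrative only)."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    return "other"


def substitution_cost(c1, c2):
    """0 for a match, 1 within a class, 2 across classes (assumed weights)."""
    if c1 == c2:
        return 0.0
    return 1.0 if char_class(c1) == char_class(c2) else 2.0


def weighted_edit_distance(s1, s2, indel_cost=1.0):
    """Levenshtein-style DP with a class-aware substitution cost."""
    prev = [j * indel_cost for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        curr = [i * indel_cost]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + indel_cost,                     # delete c1
                curr[j - 1] + indel_cost,                 # insert c2
                prev[j - 1] + substitution_cost(c1, c2),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("book", "bouk"))
print(weighted_edit_distance("book", "bo0k"))
```

The consistency point matters here: a substitution priced above twice the insertion/deletion cost would never be chosen, because deleting one character and inserting the other would always be cheaper.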
QUESTION
I am new to Haskell. I have the simplest of simple programs.
ANSWER
Answered 2020-Apr-04 at 17:27
I just found this known bug: https://github.com/commercialhaskell/stack/issues/4373
That is exactly what I'm seeing.
The workaround required is to update a settings file that is buried deep under a newly generated ~/.stack directory: https://github.com/commercialhaskell/stack/issues/4373#issuecomment-432726112
Those instructions are incomplete so I added a comment to that bug to clarify. That settings location: ~/.stack/programs/x86_64-osx/ghc-8.8.3/lib/ghc-8.8.3/settings
And this works (note that stack test is a combination of stack build and stack test):
QUESTION
Hi I'm using python for a project in bioinformatics.
I have a function that uses the Needleman-Wunsch algorithm to calculate the edit distance between a query and a read from our next-generation sequencing platform (both are strings over the alphabet 'ACGT'). My script works fine but takes a long time to run, because the function is called more than 100 million times in total. In the function I use a 2-dimensional list of size MxN, where M is the length of the query and N is the length of the read. Every time the function is called, this 2D list has to be recreated in memory before it can be filled with the calculation. I was wondering if I could speed up the process by creating the 2D list as a global variable and then passing the handle to this list as an argument to the function. This way the memory would only have to be allocated once by the operating system. I hope I made my question clear. How much time does requesting the memory for a list from the operating system take? Is it significant?
Edit: some sample code as requested:
The function goes through the 2D-Array and fills it with numbers:
...ANSWER
Answered 2020-Jan-04 at 22:46
This is my take on the performance impact of having the list enclosed locally in the function rather than "globally".
Edit: as pointed out by @DanD in the comments, I wrote (and deleted) before the more traditional way of stacks and heaps. This is not entirely true for Python. The Python Virtual Machine (PVM) only uses a private heap to allocate its objects. But the PVM itself has been implemented as a stack. Then Python uses reference counters (among other things) to keep track of the objects, whether they should be discarded or not. When you use your first example, the list object gets pushed onto the stack again and again and again. The previous list object gets its reference counter decreased, and then gets removed when the reference counter reaches 0. This is a good amount of overhead. Your second example creates the list object once, keeps the reference counter satisfied, and then the PVM can use that object each time you make your call.
So: instead of recreating the list object for each call and generating new references, the performance is gained by having only 1 list object created with the same references.
Here is a small example that shows your first and second examples in a nutshell:
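A minimal sketch of the two variants might look like this. It assumes a plain unit-cost edit distance rather than the asker's Needleman-Wunsch scoring, and all function names are hypothetical. The first variant allocates a fresh table on every call; the second fills a table that the caller allocated once.

```python
def _fill(table, query, read):
    """Fill the top-left (len(query)+1) x (len(read)+1) corner of table."""
    m, n = len(query), len(read)
    for i in range(m + 1):
        table[i][0] = i
    for j in range(n + 1):
        table[0][j] = j
    for i in range(1, m + 1):
        row, above = table[i], table[i - 1]
        for j in range(1, n + 1):
            cost = 0 if query[i - 1] == read[j - 1] else 1
            row[j] = min(above[j] + 1, row[j - 1] + 1, above[j - 1] + cost)
    return table[m][n]


def edit_distance_fresh(query, read):
    """Variant 1: allocate a new 2D list on every call."""
    table = [[0] * (len(read) + 1) for _ in range(len(query) + 1)]
    return _fill(table, query, read)


def edit_distance_reuse(query, read, table):
    """Variant 2: reuse a caller-owned table that is at least large enough."""
    return _fill(table, query, read)


# Allocate once for the longest query and read expected (sizes are assumptions).
MAX_QUERY, MAX_READ = 50, 200
shared = [[0] * (MAX_READ + 1) for _ in range(MAX_QUERY + 1)]

print(edit_distance_fresh("ACGT", "AGGTT"))
print(edit_distance_reuse("ACGT", "AGGTT", shared))
```

Stale values left in the shared table by earlier calls are harmless here because every cell that is read has been overwritten first. Whether the saved allocations are significant is best checked by timing both variants on real data, for example with the timeit module.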
Community Discussions, Code Snippets contain sources that include Stack Exchange Network
Vulnerabilities
No vulnerabilities reported
Install edit-distance
You can use edit-distance like any standard Python library. You will need to make sure that you have a development environment consisting of a Python distribution including header files, a compiler, pip, and git installed. Make sure that your pip, setuptools, and wheel are up to date. When using pip it is generally recommended to install packages in a virtual environment to avoid changes to the system.