Calculating Edit Distance Between Sequences

Introduction

This document describes a (simpler) algorithm closely related to the Viterbi decoding method for HMMs. The task is to calculate the "edit distance" between two sequences. This is defined as the cheapest way of using the elementary operations "insert","delete" and "substitute" to turn one sequence into the other. The algorithm below does this efficiently by filling up a square array of distances, starting from the bottom left and working to the top right. (the efficiency in question is time efficiency, but space is not too bad either, since typically not every entry in the array is relevant to the optimimum path, so there are various opportunities to save space if one wants to take the trouble).

We work with a rectangular array of cells with dimensions one bigger than the strings to be compared.

    def sdist(string1,string2)
	delCost = 1.0
	insCost = 1.0
	substCost = 1.0
        m = len(string1)
        n = len(string2)
        d(0,0) = 0.0

We set up some variables, then seed the recursion with the left hand bottom element. It would be easy to write another version in which the costs were different. If the substitution cost is greater than the sum of the other two costs it will always be better to do an insertion then a deletion, and no substitutions will occur in the minumum cost path.

	
	for i below m
	    d(i+1,0) = d(i,0) + delCost
        endfor
We fill in the first row, adding entries with indices (1,0) through (m,0)
	for j below n
	    d(0,j+1) = d(0,j) + insCost
        endfor

We fill in the first column, adding entries with indices (0,1) through (0,n)


	for i below m
	    for j below n    
		if string1(i) == string2(j)
		    subst = 0
		else
		    subst = substCost
                endif
		d(i+1,j+1) = min(d(i,j) +  subst,
				 d(i+1,j)+ insCost,
				 d(i,j+1)+ delCost)
           endfor
       endfor

Once we have the edges of the array, we can work outwards, filling in the values which depend on these. We look at all three possible predecessors of each cell, taking the one which gives the lowest cost. This version says that there is no charge for matching a letter against itself, but that it costs one penalty point to match against anything else. It would be easy to vary this if we thought, for example, that it was less bad to confuse some letter pairs than to confuse others.

	return d(m,n)
enddef sdist

At the end, the total distance is in the cell at (m,n).

This scheme can be adapted to compute the sequence of edit operations rather than just the number. This is done by changing the element type of

d
from
integer
to
traceback = integer *move* traceback option
. where
move = INITIAL|DEL|INS|MATCH
(Here we are adopting terminology from the ML programming language. This means that there is a type of thing called a traceback which consists of an integer packaged with a field that points back to a further traceback). The idea is that the traceback information accessible from
d(m,n)
is sufficient to recover the full description of a minimal cost path all the way back to
d(0,0)
.

We again start by setting up variables and seeding the recursion at

d(0,0)
.
def edit_path(string1,string2)
	delCost = 1.0
	insCost = 1.0
	substCost = 1.0
        m = len(string1)
        n = len(string2)
        d(0,0) = (0,INITIAL,None)

This notation is intended to convey the fact that d(0,0) has cost zero, and that no further traceback information is necessary.

Once again we fill in the edges:

	for i below m
            (sofar,tb) = d(i,0) (*)
	    d(i+1,0) = (sofar + delCost,DEL,d(i,0)) (**)
        endfor
	for j below n
            (sofar,tb) = d(0,j) (*)
	    d(0,j+1) = (sofar + insCost, INS,d(0,j)) (**)
        endfor

Here we are using the lines marked (*) to get the cost out of the predecessor, and the lines marked (**) to put appropriate information into the new cell. The traceback information in each new cell points back to its predecessor. For example

d(0,2)
has cost
2*insCost
and its traceback field points to
d(0,1)
. This in turn has cost
insCost
and points back to
d(0,0)
which has cost 0 and no further traceback information.

Finally, we do the main loop as before, except that more bookkeeping is needed to make sure that best paths are preserved. So

min
is replaced by
best
, which is provided below.
	for i below m
	    for j below n    
		if string1(i) == string2(j)
		    subst = 0
		else
		    subst = substCost
                endif
		d(i+1,j+1) = best ((subst,MATCH,d(i,j))
				   (insCost,INS,d(i+1,j))
				   (delCost,DEL,d(i,j+1)))
           endfor
       endfor

       return d(m,n)
enddef edit_path
Now we just need
best
, which constructs the tuple that belongs in
d(i+1,j+1)
. This is the one with the lowest cost. It's difficult to make this code clear enough. My best effort uses a little auxilliary function to pull the cost out of the traceback tuple
def cost(tb)
   c,m,rest = tb
   return c
enddef cost


def best(sub_move,ins_move,del_move)
    increment, move1, tb1 = sub_move
    cost_with_sub =  increment + cost(tb1)
    increment,move2,  tb2 = ins_move
    cost_with_ins =  increment + cost(tb2)
    increment,move3,  tb3 = del_move
    cost_with_del =  increment + cost(tb3)

    best_cost = cost_with_sub
    move = sub_move
    tb = tb1
    if cost_with_ins < best_cost then
     move = ins_move
     tb = tb2
    endif
    if cost_with_del < best_cost then
     move = del_move
     tb = tb3
    endif
    return cost,move,tb
enddef best

Martin Jansche adds:: there is also the connection between string edit distance and weighted automata. The basic insight is that you view both strings as a special kind of automaton that has precisely one path from the start state to the final state, and you also need a weighted transducer that specifies what operations are possible and what weight is associated with each, then all you do is compose the three machines and find the least-cost path, whose cost is then the string edit distance. I've heard this being mentioned several times before, but I finally found it on some slides by Mohri, Pereira & Riley that came with the AT&T FSM library and toolkit as (or, in lieu of) some kind of documentation.

Alec Seward found some bugs in an earlier version of this page. I'm grateful. Also to Alex Martelli who found another typo and provided a Python implementation of the algorithm. If there are any more, please let me know.


Chris Brew
Last modified: Fri Mar 30 12:14:03 EST 2001