Approximate string-matching algorithms, part 2

Part 1, Part 2

Similarity metric

Other procedures discussed here use different types of textual similarity metric. They can be used in a fault-tolerant search.

Most of those procedures calculate the similarity of two strings as number between 0 and 1. Value 0 means the strings are completely different. Value 1 means perfect match of the strings. Intermediate values correspond to a partial match.

The similarity metric algorithms are useful for spell checking, speech recognition, DNA analysis, plagiarism detection, to find difference between files and for remote screen update problem.

Hamming distance

For two strings of the same length, the Hamming distance is the number of positions in which the two strings have different characters.

Examples
  1. Hamming distance from ALEXANDRE to ALEKSANDR is 6 (ALE coincide, the resting 6 characters are different).

Edit distance

The edit distance of two strings is defined as the minimum number of insertions, deletions and substitutions required to transform the first string into the second. The distance 0 means the strings are identical.

The algorithm was devised in 1965 by the sovietic scientist Vladimir Levenshtein. The edit distance is called also Levenshtein distance.

There are modern versions of Edit distance algorithm assigning different weights to insertions, deletions and substitutions.

Examples
  1. Edit distance from ALEXANDRE to ALEKSANDER is 4 (substitute X by K, insert S after K, insert E after D, delete E at the end).

Trigram

The trigram similarity between two strings is determined by the number of matching letter triples in both strings. The algorithm can be generalized to n-grams, where n=1, 2, ...

The strings string1 and string2 are padded to left and to right by one space symbol. Then the strings are divided into letter triplets (trigrams). Finally, the similarity is calculated as:

s = 2*m / (a + b).

Here:

m is number of matching trigrams
a is number of trigrams in string1
b is number of trigrams in string2

Examples
  1. The similarity of ALEXANDRE and ALEKSANDER is 2*3 / (9+10) = 0.32 (matching _AL, ALE, AND).

Ratcliff/Obershelp pattern recognition

The Ratcliff/Obershelp algorithm computes the similarity of two strings as the doubled number of matching characters divided by the total number of characters in the two strings. Matching characters are those in the longest common subsequence plus, recursively, matching characters in the unmatched region on either side of the longest common subsequence.

Examples
  1. The similarity of ALEXANDRE and ALEKSANDER is 2 * (3+3+1+1) / (9+10) = 0.84 (matching ALE, AND, E, R).

Jaro-Winkler

The Jaro-Winkler similarity was developed at the U.S. Census and used in post-enumeration analysis.

Given strings string1 and string2, their similarity is:

s = m/3a + m/3b + (m-t)/3m.

Here:

m is number of matching characters
a is length of string1
b is length of string2
t is number of transpositions

Two characters are considered matched only if they are no further apart than (max(a,b)/2 - 1). The first matched character on string1 is compared to the first matched character on string2; the second matched character on string1 is compared to the second matched character on string2, etc. The number of mismatched characters is divided by two to yield the number of transpositions.

The improved Jaro-Winkler method uses weights different from 1/3. It also penalizes less some types of error: visual scanning and keypunch errors, errors at the end of a string.

Examples
  1. The similarity of ALEXANDRE and ALEKSANDER is (8/9 + 8/10 + (8-1)/8) / 3 = 0.85 (matching A, L, E, A, N, D, R, E; 1 transposition).

Typewriting errors

There are some algorithms taking into account errors resulting from the pressing of erroneous buttons on the keyboard: V instead of B, 6 instead of Y etc.

Comparison of algorithms

The choice of one from the following string-matching algorithms mainly depends on the nature of error influencing the text data.

Algorithm Type When to use
Soundex phonetic misspelled English words
Daitch-Mokotoff phonetic differently spelled European surnames
NYSIIS phonetic misspelled English and some foreign words
Metaphone, Double metaphone phonetic misspelled English words
Caverphone 2.0 phonetic misspelled English words
Hamming, Edit distance similarity local spelling errors
Trigram, n-gram similarity spelling errors, edited text
Ratcliff/Obershelp similarity spelling errors, edited text
Jaro-Winkler similarity spelling and typewriting errors

Resources

Handbooks, lectures

  1. The dictionary of algorithms and data structures.
    http://www.nist.gov/dads/
  2. Handbook of algorithms and data structures by Gaston H. Gonnet and Ricardo Baeza-Yates.
    http://www.dcc.uchile.cl/~rbaeza/handbook/
  3. Exact string matching algorithms: lectures of Laboratoire d'Informatique de Rouen, France.
    http://www-igm.univ-mlv.fr/~lecroq/string/index.html

Papers

  1. Soundex reference.
    http://www.archives.gov/
  2. Usage of Soundex and NYSIIS in NameSearch system.
    http://www.name-searching.com/
  3. Article on Soundex and NYSIIS by Andrew Coates.
    http://www.civilsolutions.com.au/
  4. Introduction to Daitch-Mokotoff soundex system by Gary Mokotoff.
    http://www.avotaynu.com/soundex.html
  5. Description of Caverphone 2.0.
    http://caversham.otago.ac.nz/
  6. Metaphone reference.
    http://aspell.sourceforge.net/
  7. Article by Reinhard Rapp on Trigram method.
    http://heise.de/ct/english/97/04/386/
  8. Article by Bruce L. Lambert on usage of N-gram and Edit distance in drug naming.
    http://www.hc-sc.gc.ca/
  9. Paper by William E. Winkler and Yves Thibaudeau on Jaro-Winkler string comparator.
    http://www.census.gov/srd/papers/pdf/rr91-9.pdf
  10. Articles on Fuzzy string searching, Phonetic algorithm, Soundex, NYSIIS, Metaphone, Daitch-Mokotoff Soundex, Levenshtein distance and Trigram search in Wikipedia, the free encyclopedia.
    http://en.wikipedia.org/
  11. Article on SQL Server 2005 Fuzzy Lookup and Grouping technology on MSDN Magazine.
    http://msdn.microsoft.com/

Implementations

  1. Implementation of Soundex in C.
    http://physics.nist.gov/cuu/Reference/soundex.html
  2. Implementation of Metaphone and Double metaphone in Basic, C, Perl, and C++.
    http://aspell.sourceforge.net/metaphone/
  3. Implementation of Edit distance in Java, C++ and Visual Basic.
    http://www.merriampark.com/ld.htm
  4. Dynamic programming algorithm for Edit distance.
    http://www.csse.monash.edu.au/
  5. Fuzzy string search in TCL.
    http://mini.net/tcl/3841
  6. Description of PHP string functions levenshtein, soundex, similar_text and metaphone.
    http://www.php.net/manual/en/ref.strings.php
  7. Fuzzy Matching & Deduplication SourceGorge project.
    http://sourceforge.net/projects/dedupe/

Approximate string-matching algorithms
Part 1, Part 2