Approximate string-matching algorithms, part 2
Part 1, Part 2
Other procedures discussed here use different types of textual similarity metric.
They can be used in a fault-tolerant search.
Most of those procedures calculate the similarity of two strings
as number between 0 and 1.
Value 0 means the strings are completely different.
Value 1 means perfect match of the strings.
Intermediate values correspond to a partial match.
The similarity metric algorithms are useful for spell checking,
speech recognition, DNA analysis, plagiarism detection,
to find difference between files and for remote screen update problem.
For two strings of the same length, the Hamming distance
is the number of positions in which the two strings have different characters.
- Hamming distance from ALEXANDRE to ALEKSANDR is 6
(ALE coincide, the resting 6 characters are different).
The edit distance of two strings is defined as the minimum number
of insertions, deletions and substitutions
required to transform the first string into the second.
The distance 0 means the strings are identical.
The algorithm was devised in 1965 by the sovietic scientist Vladimir Levenshtein.
The edit distance is called also Levenshtein distance.
There are modern versions of Edit distance algorithm
assigning different weights to insertions, deletions and substitutions.
- Edit distance from ALEXANDRE to ALEKSANDER is 4
(substitute X by K, insert S after K, insert E after D, delete E at the end).
The trigram similarity between two strings is determined
by the number of matching letter triples in both strings.
The algorithm can be generalized to n-grams, where n=1, 2, ...
The strings string1 and string2 are padded to left and to right by one space symbol.
Then the strings are divided into letter triplets (trigrams).
Finally, the similarity is calculated as:
s = 2*m / (a + b).
m is number of matching trigrams
a is number of trigrams in string1
b is number of trigrams in string2
- The similarity of ALEXANDRE and ALEKSANDER is 2*3 / (9+10) = 0.32
(matching _AL, ALE, AND).
Ratcliff/Obershelp pattern recognition
The Ratcliff/Obershelp algorithm computes the similarity of two strings
as the doubled number of matching characters divided by the total number
of characters in the two strings.
Matching characters are those in the longest common subsequence plus,
recursively, matching characters in the unmatched region
on either side of the longest common subsequence.
- The similarity of ALEXANDRE and ALEKSANDER is 2 * (3+3+1+1) / (9+10) = 0.84
(matching ALE, AND, E, R).
The Jaro-Winkler similarity was developed at the U.S. Census
and used in post-enumeration analysis.
Given strings string1 and string2, their similarity is:
s = m/3a + m/3b + (m-t)/3m.
m is number of matching characters
a is length of string1
b is length of string2
t is number of transpositions
Two characters are considered matched only
if they are no further apart than (max(a,b)/2 - 1).
The first matched character on string1 is compared to the first matched character on string2;
the second matched character on string1 is compared to the second matched character on string2, etc.
The number of mismatched characters is divided by two to yield the number of transpositions.
The improved Jaro-Winkler method uses weights different from 1/3.
It also penalizes less some types of error:
visual scanning and keypunch errors, errors at the end of a string.
- The similarity of ALEXANDRE and ALEKSANDER is (8/9 + 8/10 + (8-1)/8) / 3 = 0.85
(matching A, L, E, A, N, D, R, E; 1 transposition).
There are some algorithms taking into account errors
resulting from the pressing of erroneous buttons on the keyboard:
V instead of B, 6 instead of Y etc.
Comparison of algorithms
The choice of one from the following string-matching algorithms
mainly depends on the nature of error influencing the text data.
- The dictionary of algorithms and data structures.
- Handbook of algorithms and data structures
by Gaston H. Gonnet and Ricardo Baeza-Yates.
- Exact string matching algorithms:
lectures of Laboratoire d'Informatique de Rouen, France.
- Soundex reference.
- Usage of Soundex and NYSIIS in NameSearch system.
- Article on Soundex and NYSIIS by Andrew Coates.
- Introduction to Daitch-Mokotoff soundex system by Gary Mokotoff.
- Description of Caverphone 2.0.
- Metaphone reference.
- Double metaphone reference.
- Article by Reinhard Rapp on Trigram method.
- Article by Bruce L. Lambert on usage of N-gram
and Edit distance in drug naming.
- Paper by William E. Winkler and Yves Thibaudeau
on Jaro-Winkler string comparator.
- Articles on Fuzzy string searching, Phonetic algorithm, Soundex,
NYSIIS, Metaphone, Daitch-Mokotoff Soundex, Levenshtein distance and Trigram search
in Wikipedia, the free encyclopedia.
- Article on SQL Server 2005 Fuzzy Lookup and Grouping technology on MSDN Magazine.
- Implementation of Soundex in C.
- Implementation of Metaphone and Double metaphone in Basic, C, Perl, and C++.
- Implementation of Edit distance in Java, C++ and Visual Basic.
- Dynamic programming algorithm for Edit distance.
- Fuzzy string search in TCL.
- Description of PHP string functions levenshtein, soundex, similar_text and metaphone.
- Fuzzy Matching & Deduplication SourceGorge project.
Approximate string-matching algorithms
Part 1, Part 2