Approximate string-matching algorithms, part 1
Part 1, Part 2
Table of contents
Nowadays computers keep huge amount of data.
However the retrieval of textual information is difficult
when is misspelled or not exactly known.
This article is devoted to approximate string-matching algorithms.
It gives a simplified but clear description of algorithms with examples.
First, we are going to explain the procedures constructing the
for the searched text that sound the same, but spelled differently.
Then we explain procedures that give different types of
textual similarity metric
used in a fault-tolerant search.
In conclusion, the algorithms are compared
so that the reader can choose the right one.
See also resources later in this article.
Comments and questions are welcome: please reach me at
The procedures discussed here are used to simplify searches in databases
when one knows how the text is pronounced but not how it is spelled.
For this purpose the phonetic codes are constructed for the searched text,
while the database is preventively indexed using those codes.
Surnames that sound the same, but are spelled differently,
like SMITH and SMYTH, have the same code and can be filed together.
The use of phonetic codes reduces matching problems from wrong or different spelling.
Phonetic codes are very useful in order to match lists
of either personal names or enterprise names.
They are also used for speech recognition and in search engines of databases
like the anti-terrorist ones.
Soundex builds a phonetic code for people's name.
The resulting code is made of one letter and three digits,
each digit being one of six consonant sounds.
The algorithm was first applied in the U.S. census in 1880.
- Take the first letter.
- Translate remaining characters:
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6
- Drop adjacent letters having the same code.
- Drop non-initial A, E, I, O, U, Y, W and H.
- Take first four characters padding with zeros.
- ALEXANDRE → A4E2A536E → A4E2A536E → A42536 → A425.
- ALEKSANDER → A4E22A53E6 → A4E2A53E6 → A42536 → A425.
In 1985 the new Daitch-Mokotoff Soundex algorithm was applied
for phonetic coding of Slavic and Yiddish surnames
with similar pronunciation and different spelling.
The most important improvements are the followings:
the code is composed of six characters;
there are two different codes when a name has two possible pronunciations;
the code is composed of ten figures from 0 to 9.
The New York State Identification and Intelligence System
was developed in 1970 through rigorous empirical analysis.
In this algorithm a phonetic code of up to 6 letters is constructed.
- Translate first characters of name:
MAC → MCC
PH → FF
KN → NN
PF → FF
K → C
SCH → SSS
- Translate last characters of name:
EE → Y
IE → Y
DT, RT, RD, NT, ND → D
- Translate remaining characters by following rules,
incrementing by one character each time:
EV → AF; else E, I, O, U → A
Q → G
Z → S
M → N
KN → N; else K → C
SCH → SSS
PH → FF
H → previous character, if previous or next are nonvowel
W → previous character, if previous is vowel
- Check last character:
if last character is S, remove it
if last characters are AY, replace with Y
if last character is A, remove it
- Drop second letter of doubled letters.
- Take first six characters.
- ALEXANDRE → ALAXANDRA → ALAXANDR → ALAXANDR → ALAXAN
- ALEKSANDER → ALACSANDAR → ALACSANDAR → ALACSANDAR → ALACSA
The algorithm phonetically codes words by reducing them
to 16 consonant sounds: B, X, S, K, J, T, F, H, L, M, N, P, R, 0, W, Y.
Zero represents the "th" sound; X stands for "sh".
- Drop second letter of doubled letters, except C.
- If the word begins with KN, GN, PN, AE, WR, drop first letter.
- Drop B at the end of a word after M.
- C → X in CIA or CH; C → S in CI, CE or CY; C → K otherwise.
- D → J in DGE, DGY or DGI; D → T otherwise.
- Drop G in GH and if not at end or before a vowel in GN or GNED;
G → J before I or E or Y if not double GG; G → K otherwise.
- Drop H after vowel and if no vowel follows.
- Drop K after C.
- P → F in PH.
- Q → K.
- S → X in SH or SIO or SIA.
- T → X in TIA or TIO; T → 0 in TH; T is dropped in TCH.
- V → F.
- If the word begins with WH, drop H; drop W if not followed by a vowel.
- If the word begins with X, then X → S; X → KS otherwise.
- Drop Y if not followed by a vowel.
- Z → S.
- Vowels are kept only when they are the first letter.
- In all the other cases the letters do non change.
- ALEXANDRE → ALEKSANTRE → ALKSNTR
- ALEKSANDER → ALEKSANTER → ALKSNTR
Double metaphone is the improved version of Metaphone.
This algorithm phonetically codes words by reducing them to 12 consonant sounds.
It returns two keys if a word has two possible pronunciations.
The Caverphone phonetic matching algorithm was created
in the Caversham Project at the University of Otago in New Zealand in 2002.
The algorithm allows accents present in the study area
(southern part of the city of Dunedin, New Zealand).
A new version, Caverphone 2.0, has been built as a more general phonetic match.
Approximate string-matching algorithms
Part 1, Part 2