Spell checker libraries based on dictionaries (like aspell) have two major drawbacks:
- They easily get outdated
- Corrections are based on edit distances.
What this actually means is, that even if we manage to keep the dictionary up to date and fill it with domain specific words, corrections will be based on how close the edit distance of a misspelled word is to a list of look-alike words, and not on the statistical probability that misspelled word A is actually B.
To give you an example: nikia is a common typo of the word nokia. While dictionary based spell checkers will suggest a whole list of corrections for nikia, it is only obvious that the user wanted to type nokia but a small finger slip lead to the adjacent i instead of o.
Google's approach was to use the vast number of search statistics they possess to track user generated corrections. Every time a user performed a search they would record if the user actually clicked on any search result or refined his search probably fixing a typo or a misspelled word. If more than 90% (or some safe threshold) of the users corrected their typo nikia to nokia, a record was created reflecting the successful correction. When a user is presented with a "Did you mean that?" alternative, clicking on it acts as a positive feedback making the statistical probability even higher. This technique can sometimes have side effects but it works ok 99,99% of the time.
At Skroutz while we have a small number of searches compared to google combined with the fact that our keywords have a much smaller topic span they are more than enough to generate an accurate dictionary of statistical based corrections.
Technically all we have to do is loop through our access logs and look for searches that led to product clicks and searches that were refined. Then using Levenshtein's algorithm we discover similar words between consequent searches and mark them as possible corrections with a calculated error rate (eg 78% of users who typed nolia changed it to nokia, but only 1% of users who typed nokia refined their search thus marking nokia as safe).
Still we need a correction occurrence threshold to determine if a correction is valid. For example one correction of cannon to canon is not enough to mark it as valid even though the error rate is 100%. We need a rating mechanism that will give a higher rank to 90% error rate in 100 searches than a 97% error rate in 20 searches. We can use the Lower bound of Wilson score confidence interval for a Bernoulli parameter to create a self regulating error rate.
Google actually goes way beyond the simple correction of words by doing context sensitive spelling based on their gigantic corpus.
Interesting links on the subject: