Auto-correction/spell checking is probably one of the most used functionalities over the web.
Has anyone thought how it is working?? How blogs, websites, that run clearly over the web, are able to serve the list nearly instantly??
I was really curious the other day and I wanted to spend some minutes crafting a small algori thm to study the functionality.
I’ve googled a bit arround for dictionaries and I’ve found (this one). It’s not perfect but enough to get me started.
Let start…my plan is to load all the words of the file to an array. Slice the array based on the first letter of the word and match somes letter based on their index.
{% highlight php %}
{% endhighlight %}
I prefer fopen because its quite faster that file_get_contents for example.
I will get the words from the command line…
{% highlight php %}
$value) { if (strlen($input) == strlen($value)) { $reducedWords[] = $value; } } foreach ($reducedWords as $key => $word) { $l = strlen($input); $index = 0; for ($i=0; $i < $l; $i++) { if (isset($word[$i]) && $input[$i] == $word[$i]) { $index++; } } if ($index >= ($l-1)) { echo "$word\n"; } } } ?>{% endhighlight %}
This algorithm is not perfect, I didn’t want to spend a lot of time on this. For example I’m checking words with only the same length like the given one which obviously is not correct for a production algorithm…anyway lets see the speed with one word.
{% highlight bash %} Argiriss-MacBook-Pro:CmdSpellCorrection argi$ php correct.php applo — Possible corrections for applo — apolo apple apply applz Finished in 0.89617490768433 seconds {% endhighlight %}
Hmm…pretty slow I cannot say that I’m happy…let’s try with 10 words
{% highlight bash %} php correct.php ezer tost applo airplone feor shep boet garagi squirral buildang palintrame Finished in 1.7699809074402 seconds {% endhighlight %}
Interesting…I have a feeling that the I/O is a small bottleneck on the implementation…let me try save the arrays per starting letter into a key/value…now the full algorithm becomes…
{% highlight php %} <?php $memcache = new Memcache; $memcache->connect(’localhost’, 11211) or die (“Could not connect”);
$time_start = microtime(true);
$words = array();
if (!is_array($memcache->get('a'))) {
$handle = @fopen("dictionary.txt", "r");
if ($handle) {
while (($buffer = fgets($handle, 4096)) !== false) {
$words[$buffer[0]][] = trim($buffer);
}
if (!feof($handle)) {
echo "Error: unexpected fgets() fail\n";
}
fclose($handle);
}
foreach ($words as $key => $eachWordArray) {
$memcache->set($key, $eachWordArray) or die ("Failed to save data at the server");
}
}
// removing the first element since it's the name of the script
unset($argv[0]);
$reducedWords = array();
foreach ($argv as $input) {
echo "--- Possible corrections for $input ---\n";
foreach ($memcache->get($input[0]) as $key => $value) {
if (strlen($input) == strlen($value)) {
$reducedWords[] = $value;
}
}
foreach ($reducedWords as $key => $word) {
$l = strlen($input);
$index = 0;
for ($i=0; $i < $l; $i++) {
if (isset($word[$i]) && $input[$i] == $word[$i]) {
$index++;
}
}
if ($index >= ($l-1)) {
echo "$word\n";
}
}
}
$time_end = microtime(true);
$time = $time_end - $time_start;
echo "Finished in $time seconds\n";
{% endhighlight %}
I expect the first run to be slow because I still have to read the file and populate memcached…but the second execution…
{% highlight bash %} Argiriss-MacBook-Pro:CmdSpellCorrection argi$ php correct.php applo — Possible corrections for applo — apolo apple apply applz Finished in 0.076194047927856 seconds {% endhighlight %}
And with 10 words..which is affected by the echo ofc…
{% highlight bash %} php correct.php ezer tost applo airplone feor shep boet garagi squirral buildang palintrame Finished in 1.0620329380035 seconds {% endhighlight %}
Pretty impressive difference!! Cache ftw!
Like I said before this is attempt is just for the concept. The numbers can be much better with a faster machine and the actual code needs improvements, but I didn’t want to spend more than 20 minutes with this.
Hope you enjoyed it, I did :) … I think I would like to implement this in Scala…maybe for the next post. The code is here CmdSpellCorrection
Cheers