Auto-Correct Using NLP

JAYESH KAMANE
5 min readDec 15, 2022

--

Autocorrect is very important in day-to-day life. Conversation through messaging has increased and autocorrect is become a saving grace for us all.

Autocorrect is software that helps correct the “typos” automatically while we write something.

Although everyone uses autocorrect, this essay will explain how it functions. However, in these remarks, we will just discuss grammatical faults and leave out contextual ones. For instance, the misspelled word “deah” appears in the phrase “Happy Birthday my deah Friend!” Autocorrect should catch this and replace it to “dear” (given our model is accurate). However, since “deer” is a real term (an animal), which is not suitable in this context unless your buddy is a deer, it would not be seen as a mistake to say, “Happy Birthday my deer Friend!”

How does Autocorrect work?

It is done on a string to change it into another string, and n is just the number of edit operations (i.e., an edit distance of 1, 2, 3, etc.). The n edit distance, therefore, indicates how many operations separate one string from another. The many edits are listed below:
1) Insert
2) Delete
3) Switch
4) Replace

#1: Spotting the typos — as we have already seen, how could we determine that the word “deah” is misspelled? When a word is spelled correctly, it can be looked up in a dictionary; when it cannot, it is likely a misspelled word. Therefore, we would mark a word for correction when it wasn’t listed in a dictionary.

#2: Find Strings n edit distance away — A string edit is an operation that transforms a string into another string, where n simply represents the total number of edit actions (i.e., an edit distance of 1, 2, 3, etc.). In this way, the n edit distance describes how many operations separate two strings. The following is a list of all the edits:

1) insert

# insert_letter: adds additional characters
def insert_letter(word):
split_l = []
insert_list = []
for i in range(len(word) + 1):
split_l.append((word[0:i], word[i:]))
letters = 'abcdefghijklmnopqrstuvwxyz'
insert_list = [a + l + b for a, b in split_l for l in letters]
# print(split_l)
return insert_list

2) Delete

# DeleteLetter:removes a letter from a given word
def DeleteLetter(word):
delete_list = []
split_list = []
for i in range(len(word)):
split_list.append((word[0:i], word[i:]))
for a, b in split_list:
delete_list.append(a + b[1:])
return delete_list

3) Switch

# SwitchLetter:swap two adjacent letters
def SwitchLetter(word):
split_l = []
switch_l = []
for i in range(len(word)):
split_l.append((word[0:i], word[i:]))
switch_l = [a + b[1] + b[0] + b[2:] for a, b in split_l if len(b) >= 2]
return switch_l

4) Replace

# replace_letter: changes one letter to another
def replace_letter(word):
split_l = []
replace_list = []
for i in range(len(word)):
split_l.append((word[0:i], word[i:]))
alphabets = 'abcdefghijklmnopqrstuvwxyz'
replace_list = [a + l + (b[1:] if len(b) > 1 else '') for a, b in split_l if b for l in alphabets]
return replace_lis

With these four changes, any string may be changed. The combination of these alterations allows us to find a list of all possible strings that are n modifications away.

#3: Filter Candidates — To ensure that only properly spelled genuine words from our candidate list are taken into consideration, we compare the words to a known dictionary (like in #1) and exclude any terms from our candidate list that do not present in the known dictionary.

#4: Determine the Most Likely Word Using our list of real words, we can determine the Most Likely Word among our Candidates. In order to do this, we must be aware of word frequencies and the corpus’ overall word count.

Figure 1: Formula to calculate word probabilities

We might imagine a corpus built of the line “I am happy because I am learning” using the method in Figure 1. First, we take the word counts.

Figure 2: Word Counts

Then, using our approach (we’ll take the letter “I” as an example), we compute the word probability:-

P(I) = C(I)/7 = 2/7

Figure 3: Computing probability of the word “I”

We must choose the word candidate with the highest likelihood as the replacement while doing autocorrect.

Minimum Distance of Edit

The usage of edit distance for autocorrect was seen above, but the minimum edit distance may also be used to compare the similarity of several words or whole manuscripts. In actuality, there are a variety of circumstances in which minimum edit distance may be used, such as:

Spelling Checker
Machine Translation
DNA Sequencing
Document Similarity

and more.

The smallest amount of operations necessary to change one string into another is known as the minimum edit distance.
We employ the three distinct edit computations we’ve already covered to get the minimum edit distance. These three edits are insert, remove, and replace; a sample of the minimal edit distance is shown in Figure 4.

Figure 4: Example using edits to replace source word to target word (Image by Author)

Given that we assume all edit operations cost the same amount, Figure 4 only contains 2 edits overall. The revised edit cost is shown in Figure 5 and may be calculated by adding the costs of all the edits, which is what we are aiming to do in order to reduce the edit cost.

Figure 5: How much an edit costs.
For our example, the updated edit cost in Figure 4 would be 4. It would have been simple to solve this problem graphically. It wouldn’t be as straightforward and there would be a lot more operations if we had much longer strings or huge volumes of text. We may use brute force to try to overcome this by adding one more distance at a time and counting all possible outcomes until one string is changed into another. However, it’s likely that this will take an excessively long period

Conclusion

In this blog, we explained how to autocorrect and autocomplete functionalities work. Autocorrect functionality needs edits in strings to be manipulated like INSERT, DELETE, SWAP, and REPLACE. Later on, we decide the most likely word to be used in place of the original string.

--

--