Validating Common Email Domain Misspellings using ColdFusion and jQuery
On the project I am working on, many registrants misspell their email address, which makes some of the information we collect far less valuable since these users can't easily be contacted. Even though we ask users to confirm their email in a second form element, one of the most common mistakes is a simple typo in the domain name and/or extension (e.g. yhoo.com or hotmal.com). While we can't fully prevent this from happening, we can try to help the user avoid the error and cut down on the garbage emails that make it into our system. To accomplish this, I have worked up a proof of concept using ColdFusion and jQuery that offers autocomplete suggestions for common email domains as you type, using Levenshtein distance to try to catch common misspellings.
Getting Started
The first trick to solving this problem was that we needed a ColdFusion implementation of the Levenshtein distance formula. In simple terms, the Levenshtein distance is a numeric representation of the difference between two strings: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other. Thankfully, Nicholas Zographos has written one, and it is available on CFLib.org.
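For readers who have not seen the algorithm before, here is a minimal JavaScript sketch of the distance calculation. It is only an illustration of the technique, not the CFLib function used in this post; the levDistance name is borrowed for readability.

```javascript
// Levenshtein distance: the minimum number of single-character
// insertions, deletions, or substitutions needed to turn a into b.
function levDistance(a, b) {
  // prev[j] = distance between the first (i - 1) chars of a and the first j chars of b
  var prev = [];
  for (var j = 0; j <= b.length; j++) {
    prev.push(j); // turning "" into the first j chars of b takes j insertions
  }

  for (var i = 1; i <= a.length; i++) {
    var curr = [i]; // turning the first i chars of a into "" takes i deletions
    for (var k = 1; k <= b.length; k++) {
      var cost = a.charAt(i - 1) === b.charAt(k - 1) ? 0 : 1;
      curr.push(Math.min(
        prev[k] + 1,       // delete a character from a
        curr[k - 1] + 1,   // insert a character into a
        prev[k - 1] + cost // substitute (or keep a matching character)
      ));
    }
    prev = curr;
  }
  return prev[b.length];
}

console.log(levDistance("yhoo.com", "yahoo.com"));     // 1 (one missing letter)
console.log(levDistance("hotmal.com", "hotmail.com")); // 1
```

Both typos from the introduction are a single edit away from the real domain, which is what makes the distance a useful signal here.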
The next step was to determine a good method for implementing this validation. I thought it could be done server-side but since we may get false positives, adding this extra step to a registration form could add to frustration for the user and possibly contribute to attrition. Instead I decided that I would add this as an autosuggest box that would show your email username with the suggested domain extensions that most closely matched your entry, if any. ColdFusion has an autocomplete feature built into its Ajax functionality, but I decided to use the jQuery autocomplete plugin. In the end, this plugin was so simple to configure that it may actually be easier than using CFFORM's autocomplete.
Creating the Form
The first thing you need to do is download jQuery as well as the autocomplete plugin and add those scripts to your project (note that for my simple example, everything is in the root of the project folder). The only method within this plugin we are going to use is the autocomplete() method, which simply takes either a URL or a JavaScript array. If you use a URL, you will need to pass back the values as a line-delimited list, with each potential suggestion on its own line (we'll see how I built that in the next section).
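To make the line-delimited format concrete, here is a minimal JavaScript sketch of how such a response body could be split into individual suggestions. The parseSuggestions name is hypothetical and this is not the plugin's own parsing code; it only illustrates the shape of the data the URL should return.

```javascript
// Hypothetical helper: turn a line-delimited response body into an
// array of suggestions, one per non-empty line.
function parseSuggestions(responseText) {
  return responseText
    .split(/\r?\n|\r/) // accept LF, CRLF, or a stray CR as a line break
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line.length > 0; });
}

// Example response body for a user who typed "bob@yhoo"
var body = "bob@yahoo.com\nbob@hotmail.com\n";
console.log(parseSuggestions(body)); // [ 'bob@yahoo.com', 'bob@hotmail.com' ]
```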
My example form is pretty basic. It has a form with an email field. When the page loads, we call my init() function which simply has a variable for the location of my URL call (i.e. the URL to call the remote method of my CFC - you could have implemented this as a CFM as well) and adds the autocomplete functionality to the email form field. Honestly, the entire JavaScript could have been accomplished in a single line, though I have separated it out for easier reading. I told you that using CFForm's autocomplete wasn't any easier.
<html>
<head>
    <title>jQuery Autocomplete Email Suggest Sample</title>
    <link rel="stylesheet" href="jquery.autocomplete.css" type="text/css" />
    <script type="text/javascript" src="jquery-1.3.2.min.js"></script>
    <script type="text/javascript" src="jquery.autocomplete.min.js"></script>
    <script type="text/javascript">
        function init() {
            // URL of the remote CFC method that returns the suggestions
            var methodURL = "emailValidation.cfc?method=domainSpellCheck&returnFormat=json&_cf_nodebug=true";
            // attach the autocomplete behavior to the email field
            $("#email").autocomplete(methodURL);
        }
        $(document).ready(init);
    </script>
</head>
<body>
<cfoutput>
    <form action="" method="post" onsubmit="return false;">
        <label for="email">Email</label><br />
        <input type="text" name="email" id="email" size="50" /><br />
        <input type="submit" name="submitted" />
    </form>
</cfoutput>
</body>
</html>
Now let's look at how to write the component that returns the suggestions.
Creating the Component
As I mentioned earlier, when using the URL option for the autocomplete plugin, it seems to expect the results to come back as a line-delimited list. Therefore, my method will actually output the list for use by the plugin. You also need to paste the levDistance method from CFLib into this component for it to work. We will use that method to calculate a score for each possible match and return only matches that fall under a certain threshold.
When called via URL, the autocomplete plugin passes two arguments via the query string: q, which represents the value in the form field, and limit, which represents the maximum number of responses. Since we won't have that many responses, my method ignores the limit argument for the time being.
The first thing we do is check to see if you have started typing the domain portion of your email (i.e. the portion after the @ symbol). If not, we just pass back nothing. If you have a domain on there, we parse that out. Next I loop through our list of common domain names (which, obviously, you can add to or remove from as you see fit). After some testing, I found that in order to get better and earlier relevant matches, it helped if I appended the domain extension onto whatever you had typed (and removed the existing one if you had started to type it).
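The parsing steps above can be sketched in JavaScript as follows. This mirrors the logic of the ColdFusion component rather than replacing it, and the splitEmail and buildTestDomain names are made up for this illustration.

```javascript
// Hypothetical helper: split the typed value into username and domain.
// Returns null until there is text on both sides of the "@", which
// mirrors the listLen(q, "@") eq 2 check (ColdFusion lists skip empty items).
function splitEmail(q) {
  var parts = q.split("@").filter(function (p) { return p.length > 0; });
  if (parts.length !== 2) {
    return null;
  }
  return { username: parts[0], domain: parts[1] };
}

// Hypothetical helper: keep what the user typed before the first dot and
// append the extension of the candidate domain we are comparing against.
function buildTestDomain(typedDomain, candidate) {
  var typedName = typedDomain.split(".").filter(Boolean)[0] || "";
  var candidateExt = candidate.split(".").pop();
  return typedName + "." + candidateExt;
}

console.log(splitEmail("bob"));                       // null (no domain yet)
console.log(splitEmail("bob@yhoo"));                  // { username: 'bob', domain: 'yhoo' }
console.log(buildTestDomain("yhoo", "yahoo.com"));    // yhoo.com
console.log(buildTestDomain("yhoo.co", "yahoo.com")); // yhoo.com (partial extension replaced)
console.log(buildTestDomain("yhoo", "comcast.net"));  // yhoo.net
```

Swapping in the candidate's extension means a half-typed or missing extension does not count against an otherwise close match, which is why suggestions can appear before the user reaches the dot.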
I take this string (i.e. the domain with the extension replaced) and run the calculation for the score against the current item in the common domains list. The maximum numeric value of the Levenshtein distance, as I understand it, is equivalent to the length of the longer of the two strings being compared. My score then takes the Levenshtein distance divided by that maximum length to determine a percentage difference between the strings. I do this so that I can compare the scores on an equivalent scale regardless of the length of the strings. For example, yhoo.com versus yahoo.com has a distance of 1 and a maximum length of 9, which gives a score of 11. If the score is at or below the cutoff (variables.minscore in the code, which despite its name acts as an upper limit), I append the match to my query object. I used a query object here to make it simple for me to sort the items by their score, with the most relevant being first, once I am done finding matches.
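As a rough JavaScript sketch of the scoring, filtering, and sorting just described: the function names are hypothetical, and the Levenshtein distances are passed in as precomputed numbers so the example stays focused on the score itself.

```javascript
// Percentage difference: distance divided by the longer string's length.
function percentScore(distance, a, b) {
  var maxLength = Math.max(a.length, b.length);
  if (maxLength === 0) {
    return 0; // two empty strings are identical
  }
  return Math.round((distance / maxLength) * 100);
}

// Keep candidates at or below the cutoff, most relevant (lowest score) first.
function rankMatches(candidates, cutoff) {
  return candidates
    .map(function (c) {
      return { email: c.email, score: percentScore(c.distance, c.testDomain, c.domain) };
    })
    .filter(function (m) { return m.score <= cutoff; })
    .sort(function (x, y) { return x.score - y.score; });
}

// The user typed "bob@ymail"; distances are precomputed for this example.
var candidates = [
  { email: "bob@hotmail.com", testDomain: "ymail.com", domain: "hotmail.com", distance: 3 }, // score 27
  { email: "bob@yahoo.com",   testDomain: "ymail.com", domain: "yahoo.com",   distance: 4 }, // score 44
  { email: "bob@gmail.com",   testDomain: "ymail.com", domain: "gmail.com",   distance: 1 }  // score 11
];

console.log(rankMatches(candidates, 25)); // only bob@gmail.com
console.log(rankMatches(candidates, 30)); // bob@gmail.com, then bob@hotmail.com
```

Raising the cutoff from 25 to 30 lets hotmail.com through for this input, which shows how sensitive the number of suggestions is to that one setting.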
Finally, we simply output all the matches separated by line breaks.
<cfcomponent output="false">
    <!--- to get more results use a higher number, for fewer results lower the number --->
    <cfset variables.minscore = 25 />
    <!--- add any domains you want to check. keep in mind that it loops through all of these for each check --->
    <cfset variables.commonEmailDomains = "yahoo.com,gmail.com,hotmail.com,comcast.net" />

    <cffunction name="domainSpellCheck" output="true" access="remote" returntype="void">
        <cfargument name="q" required="true" type="string" />
        <cfargument name="limit" required="true" type="numeric" />
        <cfset var domain = "" />
        <cfset var testDomain = "" />
        <cfset var thisDomain = "" />
        <!--- var-scope length and score so they do not leak into the component's variables scope --->
        <cfset var length = 0 />
        <cfset var score = 0 />
        <cfset var username = listFirst(arguments.q,"@") />
        <!--- type the score column as an integer so the ORDER BY below sorts numerically --->
        <cfset var qryMatches = queryNew("email,score","varchar,integer") />
        <cfset var qrySortedMatches = "" />

        <!--- only look for matches once the user has started typing the domain --->
        <cfif listLen(arguments.q,"@") eq 2>
            <cfset domain = listLast(arguments.q,"@") />
            <cfloop list="#variables.commonEmailDomains#" index="thisDomain">
                <!--- removing the domain extension improves our chances of a usable match --->
                <cfset testDomain = listFirst(domain,".") & "." & listLast(thisDomain,".") />
                <!--- the maximum score is the length of the longer of the two strings --->
                <cfset length = max(len(thisDomain),len(testDomain)) />
                <cfset score = round((levDistance(testDomain,thisDomain)/length)*100) />
                <cfif score lte variables.minscore>
                    <cfset queryAddRow(qryMatches) />
                    <cfset querySetCell(qryMatches,"email",username & "@" & thisDomain) />
                    <cfset querySetCell(qryMatches,"score",score) />
                </cfif>
            </cfloop>
        </cfif>

        <!--- sort so the closest match (lowest score) comes first --->
        <cfquery name="qrySortedMatches" dbtype="query">
            SELECT * FROM qryMatches ORDER BY score
        </cfquery>
        <!--- one suggestion per line (carriage return followed by line feed) --->
        <cfoutput query="qrySortedMatches">#qrySortedMatches.email##chr(13)##chr(10)#</cfoutput>
    </cffunction>

    <!--- paste the levDistance method here --->
</cfcomponent>
Improving Upon the Result
Obviously, this is just a proof of concept and can be improved upon. The first thing you could do is expand upon my list of common domains to come up with more relevant results (taking into account that this entire list is looped over every time the autosuggest calls the method). In addition, you can tweak the minimum score cutoff to offer more or fewer results as suits your needs. Other than that, I am pretty happy with the result. Only time and testing will tell if it improves the user's input in the manner I am hoping.
A full list of domain names can be extracted through your user database using string functions to get all chars after @ and then use DISTINCT for a unique list. You'd need to manually verify the spellings though and perhaps remove domains you've never heard of (like small companies and personal domains).
Erm, do you happen to have an online demo? :-)
My only slight concern is that people might possibly be freaked out when you return their email address before they've finished typing. If they're signing up for the first time they might wonder how the hell you managed to be in possession of their address already.
Also, it's a shame that the autocomplete plugin (which I use a lot) doesn't seem to have an option to only trigger the ajax call when the input matches a certain pattern, rather than firing requests on every keystroke, although there is the minChars option, which will reduce that a little bit.
Lastly, in case anyone wants code to pull out a list of existing domains, this works on MySQL 5:
SELECT DISTINCT CONVERT( SUBSTRING( email,LOCATE( '@',email ) +1 ),CHAR( 255 ) )
AS domain
FROM
userTable
WHERE
email LIKE '%@%'
