Name Deduplication Techniques

October 4, 2016

The bane of any database administrator is maintaining duplicate records. They take up unnecessary space and generally add no value to contact records. A more challenging task is identifying and merging records that might be duplicates, particularly records with duplicate names.

Identifying Duplicate Records

A given name may have variants that are not easily identified in a query but are invariably linked. A common example is Joe Smith vs. Joseph Smith. Both could refer to the same person, depending on how the user entered their name.

Name Variants, Finding the Common Name

A particularly useful feature of the Name Validation 2 service is the Related Names output field. This field provides a comma-separated list of first name variants for a provided name. For example, for the given name Joe, the related names returned include Joel, Joeseph, Joey, Josef, Joseph, and José.
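As a rough illustration of how such a comma-separated list might be used, the Python sketch below checks whether a candidate first name is a variant of a given name. The function names and the hard-coded sample string are assumptions made for this example; the string is copied from the example above rather than taken from a live service response.

```python
def parse_name_list(csv_value):
    """Split a comma-separated name list into a set of case-folded names."""
    return {part.strip().casefold() for part in csv_value.split(",") if part.strip()}


def is_variant(candidate, given_name, related_names_csv):
    """Return True if candidate matches given_name or one of its related names."""
    variants = parse_name_list(related_names_csv)
    variants.add(given_name.strip().casefold())
    return candidate.strip().casefold() in variants


# Hypothetical sample value, copied from the example in the text.
related_for_joe = "Joel, Joeseph, Joey, Josef, Joseph, José"

print(is_variant("Joseph", "Joe", related_for_joe))  # True
print(is_variant("Robert", "Joe", related_for_joe))  # False
```

Comparing case-folded, trimmed values keeps the check tolerant of differences in capitalization and stray whitespace in stored names.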

With this information, it becomes easier to identify names that are related but appear in a different form. There may be cases, however, where names cannot be identified as related but can still be linked by similarity. Examples include misspelled names and alternate names that are similar but not related. These names can be identified through the Similar Names output fields of the Name Validation 2 service.

Similar Sounding Names

DOTS Name Validation 2 employs sophisticated similar-name matching algorithms, drawing on a database of international names with up to 1.4 million first names and 2.75 million last names. Similar first and last names are each returned as a comma-separated list, which can be compared against names that already exist in the database.

For example, the name Robert Smith would return the similar first names Rhobert, Róbert, Robertt, Roebert, Roibert, Rubert, and Robbert, and the similar last names Smyth, Smithe, Smiith, and Smiyth. The similar names that are found are returned in order from most common to least common.
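A minimal sketch of comparing these lists against existing records might look like the following Python. The record layout (dictionaries with "id", "first", and "last" keys), the function names, and the sample data are all assumptions made for illustration; the similar-name strings are copied from the example above, not from a live response.

```python
def to_name_set(name, csv_value):
    """Build a case-folded set containing the name and its comma-separated variants."""
    names = {part.strip().casefold() for part in csv_value.split(",") if part.strip()}
    names.add(name.strip().casefold())
    return names


def find_candidates(first, last, similar_first_csv, similar_last_csv, records):
    """Return records whose first AND last names both fall within the similar-name sets."""
    firsts = to_name_set(first, similar_first_csv)
    lasts = to_name_set(last, similar_last_csv)
    return [
        record
        for record in records
        if record["first"].strip().casefold() in firsts
        and record["last"].strip().casefold() in lasts
    ]


# Hypothetical sample values, copied from the example in the text.
similar_first = "Rhobert, Róbert, Robertt, Roebert, Roibert, Rubert, Robbert"
similar_last = "Smyth, Smithe, Smiith, Smiyth"

# Hypothetical existing contact records.
records = [
    {"id": 1, "first": "Robert", "last": "Smith"},
    {"id": 2, "first": "Robbert", "last": "Smyth"},
    {"id": 3, "first": "Rupert", "last": "Smith"},
    {"id": 4, "first": "Robert", "last": "Jones"},
]

candidates = find_candidates("Robert", "Smith", similar_first, similar_last, records)
print([record["id"] for record in candidates])  # [1, 2]
```

Requiring both the first and last name to match is a deliberately conservative choice here; a real process might also weigh other contact fields before treating two records as duplicates.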

Merge and Promote the Winning Record

Using these results, a query can link similar or related names and identify records that are likely duplicates. Once duplicate records are identified, the question becomes which one should be promoted as the winning record. This decision depends on business logic. For example, the winning record might be the one that contains other vital contact points, such as an address or phone number, or the one chosen by entry date. Once a winning record is chosen, a merge process copies contact fields from the identified duplicates into it to build a complete record.
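One possible way to express such a rule is sketched below in Python. The scoring rule (most populated contact fields wins, with the earliest entry date as a tie-breaker), the field names, and the sample records are assumptions chosen for illustration; actual business logic may differ.

```python
from datetime import date

# Hypothetical contact fields considered when scoring and merging.
CONTACT_FIELDS = ("address", "phone", "email")


def pick_winner(duplicates):
    """Pick the record with the most filled contact fields; earliest entry date breaks ties."""
    def sort_key(record):
        filled = sum(1 for field in CONTACT_FIELDS if record.get(field))
        return (-filled, record["entered"])

    return min(duplicates, key=sort_key)


def merge_records(duplicates):
    """Copy the winner, then fill its empty contact fields from the other duplicates."""
    merged = dict(pick_winner(duplicates))
    for record in duplicates:
        for field in CONTACT_FIELDS:
            if not merged.get(field) and record.get(field):
                merged[field] = record[field]
    return merged


# Hypothetical records already identified as duplicates of one another.
duplicates = [
    {"id": 1, "first": "Joe", "last": "Smith", "address": "123 Main St",
     "phone": None, "email": None, "entered": date(2015, 3, 1)},
    {"id": 2, "first": "Joseph", "last": "Smith", "address": "123 Main St",
     "phone": "555-0100", "email": None, "entered": date(2016, 1, 15)},
    {"id": 3, "first": "Joey", "last": "Smith", "address": None,
     "phone": None, "email": "joe@example.com", "entered": date(2014, 7, 9)},
]

merged = merge_records(duplicates)
print(merged["id"], merged["phone"], merged["email"])  # 2 555-0100 joe@example.com
```

The winner's own values are never overwritten; only its empty fields are filled from the other duplicates, so the promoted record stays authoritative.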

Conclusion

Ridding your database of duplicate contact records can be an arduous task, but with the help of Name Validation 2, it doesn’t have to be. The vast quantity of names that Name Validation 2 draws upon makes it a high-quality solution for identifying duplicates through related and similar names.

For more information about Name Validation 2 service, or to receive a free trial key, click here.

For developers, our Name Validation 2 documentation can be found here.