Author Archive

How to Use DOTS Email Validation 3

The DOTS Email Validation 3 (EV3) service has been designed to be robust enough to accommodate the particular needs of a detail-oriented programmer and simple enough to be used by a marketing assistant who needs to run an email campaign. The service can meet various needs that essentially narrow down to two use cases: form validation and post-processing jobs such as batches and database hygiene. Before we discuss those two cases, we will first go over the recommended service operation and review some of the important result fields.

Which Operation Should I Use?

The recommended service operation for EV3 is the ValidateEmailAddress method. This operation performs real-time server-to-server email verification. It lets the user specify a timeout value, in milliseconds, for how long it can take to perform real-time server checks. A minimum value of 200 milliseconds is required; however, results are dependent on the network speed of an email’s host, which may require several seconds to verify. Average mail server response times are between 2 and 3 seconds, but some slower mail servers may take 15 seconds or more to verify.

Please note that the above information is also available in the service developer guide.

Understanding the Results

The service returns many results that can be used to meet a programmer’s particular email validation needs, but the easiest way to determine if an email should be accepted or rejected is by looking at either the IsDeliverable value or the Score value.

Score:

For most cases it is recommended to use the Score along with other output values to cater to your particular needs. Here are the possible score values.

Score | Description | Notes
0 | Email is Good | Indicates with high confidence that the email address is deliverable and good. The email address was verified with the host mail server and no malicious warnings were found.
1 | Email is Probably Good | Indicates that the email is deliverable but one or more lesser warnings were found. For example the email may be a potential alias or a role, which are sometimes used as disposable addresses.
2 | Unknown | Indicates that not enough information was available to determine deliverability and integrity. Unknowns most commonly occur for slow mail servers that do not respond to the web service in time. They also occur for catch-all mail servers and greylists.
3 | Email is Probably Bad | Indicates that one or more warnings were found, such as a potential vulgarity or a string of garbage-like characters.
4 | Email is Bad | Indicates with high confidence that the email address is bad and/or undeliverable. Occurs for email addresses that fail critical checks such as syntax validation and DNS verification. Most commonly occurs for email addresses where the actual host mail server verified that the email does not exist. Also occurs for deliverable email addresses that are known spam traps or bots.

IsDeliverable:

The simplest way to use the service is to look at the IsDeliverable field. This field will return true, false or unknown. If your primary concern is to be able to send out email with the lowest possible chance of a hard bounceback then this field alone will suffice. However, this field does not take spamtraps, vulgarities, bots or other factors into consideration. It simply indicates if the service was able to verify the deliverability of an email address with the host mail server. It does not measure the overall integrity of the email address.

If you choose to only look at one result value then it is our recommendation that you use the Score value instead of the IsDeliverable value. The Score evaluates the overall integrity of the email address and not just its deliverability. Either one of these fields can be used in conjunction with other result values to more intelligently evaluate an email address if the need arises. For example, if an email comes back as unknown in either the Score or in IsDeliverable, then we can refer to the following outputs to help us decide if we should accept, reject or retry the email address.

IsSMTPServerGood:

Returns true, false or unknown to indicate if the email’s host mail server was responsive at the time of the check. This is one of the service’s critical checks. If this value comes back false then it will be reflected in the IsDeliverable value and in the Score. Refer to this value if the email is unknown. If the value for this field is also unknown then the service most likely did not have enough time to finish verifying the email address with its host mail server. In these cases the service will continue to try to verify the email in a background process even though the request has finished. Chances are high that if you wait one or more hours and check the email again, the service will have finished verifying the email address with the host mail server.

IsCatchAllDomain:

Returns true, false or unknown to indicate if the email’s host mail server is a catch-all. A catch-all mail server will say that an email address is deliverable even if it is not. This is because catch-all mail servers do not reject email addresses during the initial SMTP session. A catch-all mail server therefore cannot be trusted to verify the deliverability of an email address, because it may not reject the address until after an email message is sent. If an email address is unknown and this value is false then chances are good that if the email is checked again at a later time the service will have verified its deliverability. If IsCatchAllDomain is true and there are no warnings, then we know that the mail server is good and that the email does not appear to be bad. In general this scenario leads to a 55% chance that the email is deliverable and won’t result in a hard bounce.

IsSMTPMailBoxGood:

Returns true, false or unknown to indicate if the service was able to verify the email address with its host mail server. This value can be treated similarly to the IsDeliverable value. A true value indicates that the email address is deliverable. If the value comes back false then the mail server verified that the email is undeliverable. A false will be accompanied by the warning flag ‘Email is Bad – Subsequent checks halted.’ Some common reasons why this value will return unknown: the mail server is a catch-all, the service ran out of time when communicating with the host mail server, or the host mail server used a defensive tactic such as a greylist.

A complete list of the output fields and values is available in the service developer guide.
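As a rough illustration of how the fields above can be combined when an email comes back unknown, here is a minimal Python sketch. The function name and the returned labels are our own assumptions for illustration, not part of the service, and the values are assumed to arrive as the strings true, false or unknown.

```python
def triage_unknown(is_smtp_server_good: str, is_catch_all_domain: str) -> str:
    """Suggest a next step for an email whose Score or IsDeliverable is unknown.

    Hypothetical helper: the labels "review" and "retry_later" are illustrative.
    """
    server = is_smtp_server_good.strip().lower()
    catch_all = is_catch_all_domain.strip().lower()

    if catch_all == "true":
        # A catch-all server does not confirm individual mailboxes during the
        # SMTP session, so checking again is unlikely to change the answer.
        return "review"
    if server == "unknown" or catch_all == "false":
        # The service probably ran out of time; a later check may succeed.
        return "retry_later"
    return "review"
```

For example, triage_unknown("unknown", "unknown") suggests retrying later, while a confirmed catch-all is set aside for review.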

The result fields given above are useful when it comes to sorting, grouping and filtering all of your validated email addresses. This is useful when working on a post-processing email job, which we will discuss later. Next, we will look at some of the descriptive flags that the service will return. These flags can be used programmatically or at a glance to determine the status of an email address.

Warning Codes & Descriptions:

There are many warning flags that the service may return but we will look at some of the more common and critical ones.

DisposableEmail, SpamTrap, KnownSpammer and Bot

An email address may be deliverable but if one or more of these warning flags is returned then it is highly recommended to reject it.

Alias, Bogus and Vulgar

If one of these warning flags is returned then you may want to either reject the email or set it aside for later review, depending on how strict you want to be.

InvalidSyntax, InvalidDomainSpecificSyntax and InvalidDNS

These are warnings for critical checks that failed. If one of these flags appears then it will be immediately followed by the warning flag ‘Email is Bad – Subsequent checks halted.’

Email is Bad – Subsequent checks halted

This warning indicates that the email failed a critical check and is undeliverable. If the flag is not preceded by one of the critical warning flags then it simply means that the email’s host mail server verified that the email address is undeliverable.

A complete list of warning codes and their descriptors is available in the dev guide.
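The warning groups above lend themselves to a simple lookup. The sketch below assumes the warning flags have already been extracted from the response into a list of strings; how the service actually delivers them is not shown here, and the function name and return labels are hypothetical.

```python
# Flag names are taken from the groups described above.
REJECT_WARNINGS = {"DisposableEmail", "SpamTrap", "KnownSpammer", "Bot"}
REVIEW_WARNINGS = {"Alias", "Bogus", "Vulgar"}
CRITICAL_WARNINGS = {"InvalidSyntax", "InvalidDomainSpecificSyntax", "InvalidDNS"}


def classify_warnings(warnings):
    """Return "reject", "review" or "no_action" for a list of warning flags."""
    found = set(warnings)
    if found & (CRITICAL_WARNINGS | REJECT_WARNINGS):
        # Failed critical checks and high-risk flags are rejected outright.
        return "reject"
    if found & REVIEW_WARNINGS:
        # Lesser warnings: reject or hold for review, depending on strictness.
        return "review"
    return "no_action"
```

A stricter policy could simply treat the review group as a rejection as well.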

Note Codes & Descriptions:

The note flags will return descriptive information about the email, not all of which will affect the score, but we will focus on the ones that will explain why some email addresses came back as unknown.

GreyListed

The service is good at detecting greylist behavior from mail servers and has procedures in place to avoid them, but not all greylists are avoidable. If the service encounters a greylist then it is temporarily unable to verify the email address with its host mail server. If you encounter a greylist then chances are good that if you try to validate the email again a couple of hours later that you will get a better response.

MailServerTemporarilyUnavailable

This flag indicates that the service was able to connect to the email’s host mail server, but that the server was temporarily busy or unavailable and it was unable to verify the email for us. If you encounter this flag then try to validate the email again a few hours later to see if the server has become more responsive.

ServerConnectTimeout

This flag indicates that the service was unable to establish a connection with a host mail server. Possible reasons for the connection failure are that the mail server is completely offline or that it is responding too slowly to answer in time. Some mail servers are configured to respond slowly, taking as long as 60 seconds to respond to a connection. This behavior is uncommon, but it does occur. If an email returns this flag then try entering a longer timeout to allow the service the time it needs to verify the email.

MailBoxTimeout

This flag indicates that the service was unable to finish verifying the email address with the host mail server in the time allowed. The mail server could be responding very slowly or the timeout time given to the service was too short. If an email returns this flag then try and enter a longer timeout time to allow the service the time it needs to verify the email.

A complete list of note codes and their descriptors is available in the developer guide.
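The four note flags above suggest two different reactions: wait and try again, or try again with more time. Here is a minimal sketch of that idea. The function name, the two-hour wait, and the doubling of the timeout are our own assumptions; the 65-second ceiling mirrors the batch recommendation given later in this article.

```python
RETRY_LATER_NOTES = {"GreyListed", "MailServerTemporarilyUnavailable"}
LONGER_TIMEOUT_NOTES = {"ServerConnectTimeout", "MailBoxTimeout"}


def plan_retry(notes, current_timeout_ms, max_timeout_ms=65000):
    """Build a hypothetical retry plan from a list of note flags."""
    found = set(notes)
    plan = {"retry": False, "wait_hours": 0, "timeout_ms": current_timeout_ms}

    if found & LONGER_TIMEOUT_NOTES:
        # The check ran out of time: retry with a longer timeout (assumed to
        # double, capped at max_timeout_ms, and never shortened).
        plan["retry"] = True
        plan["timeout_ms"] = max(
            current_timeout_ms, min(current_timeout_ms * 2, max_timeout_ms)
        )
    if found & RETRY_LATER_NOTES:
        # Greylists and busy servers tend to clear up after a couple of hours.
        plan["retry"] = True
        plan["wait_hours"] = 2
    return plan
```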

Use Case 1 – Using ValidateEmailAddress for Form Validation

The ValidateEmailAddress method has four input fields that are all required.

Input Field Name | Description | Notes
EmailAddress | The email address you wish to validate. |
AllowCorrections | Accepts true or false. | The service will attempt to correct an email address if set to true. Otherwise the email address will be left unaltered if set to false. The majority of the email corrections are performed on the domain. The local part of the email address, the portion before the @ symbol, is generally left untouched.
Timeout | Accepts an integer as a string. | Timeout time is in milliseconds. Do not include any commas or non-numeric values. This value specifies how long the service is allowed to wait for all real-time network level checks to finish. Real-time checks consist primarily of DNS and SMTP level verification. A minimum value of 200ms is required. When it comes to form validation it is recommended to use a timeout time that is short enough to not keep your user impatiently waiting, but long enough to allow the server-to-server communication time to finish. A relatively short timeout of 2 to 4 seconds is generally recommended.
LicenseKey | Your license key to use the service. |
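To make the four inputs concrete, here is a minimal Python sketch that assembles them into a URL query string and enforces the 200ms minimum. The helper name is hypothetical, the parameter spelling AllowCorrections is assumed, and the endpoint is deliberately left out; consult the developer guide for the exact parameter names and URL.

```python
from urllib.parse import urlencode

MIN_TIMEOUT_MS = 200  # minimum timeout required by the service


def build_query(email_address, allow_corrections, timeout_ms, license_key):
    """Assemble the four required inputs into a URL-encoded query string."""
    if isinstance(timeout_ms, bool) or not isinstance(timeout_ms, int):
        raise TypeError("timeout_ms must be an integer number of milliseconds")
    if timeout_ms < MIN_TIMEOUT_MS:
        raise ValueError("timeout_ms must be at least 200")
    params = {
        "EmailAddress": email_address,
        "AllowCorrections": "true" if allow_corrections else "false",
        # Sent as a plain string of digits: no commas or units.
        "Timeout": str(timeout_ms),
        "LicenseKey": license_key,
    }
    return urlencode(params)
```

For a form, a call such as build_query("user@example.com", False, 3000, "YOUR-KEY") keeps the timeout inside the suggested 2 to 4 second window.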

Accept, Reject or Review & Retry

ACCEPT

Emails with a score of 0, 1 or 2. In general it is recommended to not be too strict when accepting emails in a form because you do not want to potentially lose an end user.  Also, when performing form validation an end user may become agitated if they have to wait more than 5 seconds for the validation process to complete, but some slow mail servers may not be able to respond in that short amount of time.

REJECT

Emails with a score of 3 or 4. If you do not want to be too strict then you can accept 3 for review, but you should always reject an email that receives a score of 4.

REVIEW & RETRY

Depending on how strict/cautious you want to be you can choose to not initially accept emails with a score of 2 and instead put them aside to have them reviewed. If the IsCatchAllDomain field is not true then you can try and validate the email again later. Email addresses that return a score of 3 can also be set aside for review if you do not want to initially reject all of them. An email will commonly be given a score of 3 if a potential vulgarity or string of garbage characters is found.

In form validation the programmer is sometimes allowed some luxuries while others are taken away. For example, a programmer can be given the opportunity to communicate a result back to the end user but is usually restricted to a shorter timeout time so that the end user is not kept waiting too long. If you have the ability to communicate back to the end user then ask the user to check for a typo and try again or try a different email address. If you don’t want to accept a role or alias type email address, because they are commonly not accepted by mass email marketers, then you can check for that and tell the user to try again with a different email address.
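The accept, reject and review rules for forms can be captured in a few lines. This is only a sketch under our own naming: the two switches model the "how strict do you want to be" choices described above, and the Score is assumed to have been parsed into an integer.

```python
def form_decision(score, hold_unknowns=False, review_probably_bad=False):
    """Map a Score of 0-4 to "accept", "reject" or "review" for a web form.

    hold_unknowns: set aside a score of 2 for review instead of accepting it.
    review_probably_bad: set aside a score of 3 instead of rejecting it.
    """
    if score not in (0, 1, 2, 3, 4):
        raise ValueError("score must be an integer from 0 to 4")
    if score == 4:
        return "reject"  # a score of 4 is always rejected
    if score == 3:
        return "review" if review_probably_bad else "reject"
    if score == 2:
        return "review" if hold_unknowns else "accept"
    return "accept"
```

With the defaults, form_decision(2) accepts an unknown so that a slow mail server does not cost you a sign-up.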

Use Case 2 – Using ValidateEmailAddress for Batches, Email Campaigns and Data Hygiene

The ValidateEmailAddress method has four input fields that are all required.

Input Field Name | Description | Notes
EmailAddress | The email address you wish to validate. |
AllowCorrections | Accepts true or false. | The service will attempt to correct an email address if set to true. Otherwise the email address will be left unaltered if set to false. The majority of the email corrections are performed on the domain. The local part of the email address, the portion before the @ symbol, is generally left untouched. Since you are unable to ask a user to re-enter and try again if they make a mistake, you can set this value to true and allow the service to make corrections.
Timeout | Accepts an integer as a string. | Timeout time is in milliseconds. Do not include any commas or non-numeric values. This value specifies how long the service is allowed to wait for all real-time network level checks to finish. Real-time checks consist primarily of DNS and SMTP level verification. A minimum value of 200ms is required. For non-form validation it is recommended to give the service plenty of time to verify an email address with its host mail server. Most mail servers will only take about 2 seconds on average to verify an email address, but for the occasional slow mail server that requires more time it is recommended to set the timeout time to 65 seconds. The number of mail servers that require this much time is generally minimal, so the long timeout should not make a big impact on the overall batch job.
LicenseKey | Your license key to use the service. |

Accept, Reject or Review & Retry

ACCEPT

Emails with a score of 0 or 1.

REJECT

Emails with a score of 3 or 4. If you do not want to be too strict then you can accept 3 for review, but you should always reject an email that receives a score of 4.

REVIEW & RETRY

Emails with a score of 2, unless the IsCatchAllDomain field value is true. An email that gets an unknown score due to a greylist, timeout or temporarily busy server should be checked again a couple of hours later.
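For a batch job the same rules can be applied to a whole list at once. The sketch below assumes each result has already been parsed into a dictionary keyed by the field names used in this article; the bucket names, the function name and the choice to send catch-all unknowns to review rather than retry are our own assumptions.

```python
def partition_batch(results, review_probably_bad=False):
    """Sort validated results into accept / reject / review / retry_later."""
    buckets = {"accept": [], "reject": [], "review": [], "retry_later": []}
    for result in results:
        score = int(result["Score"])
        catch_all = str(result.get("IsCatchAllDomain", "unknown")).strip().lower()
        if score in (0, 1):
            bucket = "accept"
        elif score == 2:
            # Unknowns are retried later unless the domain is a catch-all,
            # in which case a retry is unlikely to produce a better answer.
            bucket = "review" if catch_all == "true" else "retry_later"
        elif score == 3:
            bucket = "review" if review_probably_bad else "reject"
        elif score == 4:
            bucket = "reject"
        else:
            raise ValueError("unexpected Score: %r" % (result["Score"],))
        buckets[bucket].append(result["EmailAddress"])
    return buckets
```

The retry_later bucket is the one to run through the service again a couple of hours later.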

If you would like to discuss your particular use case for recommendations and best practices contact us!

Making an (email) list and checking it twice: Best practices for email validation

For most organizations, one of the most critical assets of their marketing operations is their email contact database. Email is still the lingua franca of business: according to the Radicati Group, over a quarter of a trillion email messages are sent every business day, and the number of email users is expected to top 4 billion by 2021 – roughly half of the world’s population. This article will explore current best practices for protecting the ROI and integrity of this asset, by validating its data quality.

The title of this article is not just a cute play on words – and it has nothing to do with Santa. Rather, it describes an important principle for your game plan for email data quality. By implementing a strong two-step email validation process, as we describe here, you will dramatically reduce deliverability problems, fraud and blacklisting from your email marketing and communications efforts.

The main reason we recommend checking emails in two stages revolves around the time these checks take: many checks can be performed live using a real-time API, particularly as email addresses are entered by users, but server validation in particular may require a longer processing time and interfere with user experience. Here are 3 of the most important checks that are part of the email validation process:

• Syntax (FAST): This check determines if an email address has the correct syntax and physical properties of an email address.

• DNS (FAST): We can quickly check the DNS record to ensure the validity of the email domain (MX record) for the email address. (There are some exceptions to this – for example, where the DNS record is with a shoddy or poor registry and the results take longer to come back.)

• Email Server (VARIABLE, and not within the email validation tool’s control): Although this check can take from milliseconds to minutes, it is one of the most important checks you can make – it ensures that you have a deliverable address. This response time is dependent on the email server provider (ESP) and can vary widely: large ESPs like Gmail or MSN normally respond quickly, while corporate or other domains may take longer.

There are many more checks in Service Objects’ Email Validation tool, including areas such as malicious activity, data integrity, and much more – over 50 verification tests in all! We auto-correct addresses for common spelling and syntax errors, flag bogus or vulgar address entries, and calculate an overall quality score you can use to accept or reject the email address. (For a deeper dive, take a look at this article to see many of the features of an advanced EV tool.)

Here are the two stages we recommend for your email validation process:

Stage 1: At point of entry. Here, you validate emails in real-time, as they are captured. This provides the opportunity for the user to correct mistakes in the moment such as typos or data entry errors. Here you can use our EV software to check for issues like syntax, DNS and the email server – however we recommend setting the API configuration settings to no more than a wait of a couple of seconds, for the sake of customer experience. At this stage either the user or validation software has a chance to update bad addresses.

Stage 2: Before sending a campaign. Validate the emails in your database – using the API – after the email has been captured and the user is no longer available in real-time to make corrections. In this stage, you have more flexibility to wait for responses from the ESPs, providing more confidence in your list.

It is estimated that 10-15% of emails entered are not usable, for reasons ranging from data entry errors to fraud, and 30% of email addresses change each year. Together these two steps ensure that you are using clean and up-to-date email data every time – and the benefit to you will be fewer rejected addresses, a better sender reputation, and a greater overall ROI from your email contact data.
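The two stages can be sketched as two small functions that differ mainly in how long they are willing to wait. Everything here is illustrative: the validate callable stands in for a real call to the email validation API, the 2-second entry timeout reflects the "couple of seconds" guidance above, and the 65-second campaign timeout is an assumed value for the more patient second pass.

```python
ENTRY_TIMEOUT_MS = 2000      # stage 1: keep the user waiting as little as possible
CAMPAIGN_TIMEOUT_MS = 65000  # stage 2: assumed generous timeout for slow servers


def check_at_entry(email, validate):
    """Stage 1: validate a single address in real time as it is captured."""
    return validate(email, ENTRY_TIMEOUT_MS)


def recheck_before_campaign(stored_scores, validate):
    """Stage 2: re-validate every stored address with a longer timeout."""
    return {email: validate(email, CAMPAIGN_TIMEOUT_MS) for email in stored_scores}


def fake_validate(email, timeout_ms):
    """Stand-in validator: pretends 'slow@' addresses need more than 5 seconds."""
    if email.startswith("slow@") and timeout_ms < 5000:
        return 2  # unknown: the mail server did not answer in time
    return 0      # good
```

With the stand-in validator, an address on a slow server comes back unknown at the point of entry and is only resolved by the second pass.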

Maintaining a Good Email Sender Reputation

What are Honeypot Email Addresses?

A honeypot is a type of spamtrap. It is an email address that is created with the intention of identifying potential spammers. The email address is often hidden from human eyes and is generally only detectable to web crawlers. The address is never used to send out email and it is for the most part hidden, thus it should never receive any legitimate email. This means that any email it receives is unsolicited and is considered to be spam. Consequently, any user who continues to submit email to a honeypot will likely have their email, IP address and domain flagged as spam. It is highly recommended to never send email to a honeypot, otherwise you risk ruining your email sender reputation and you may end up on a blacklist.

Spamtraps typically show up in lists where the email addresses were gathered from web crawlers. In general, these types of lists cannot be trusted and should be avoided as they are often of low quality.

Service Objects participates in and uses several “White Hat” communities and services, some of which are focused on identifying spamtraps. We use these resources to help identify known and active spamtraps. It is common practice for a spamtrap to be hidden from human eyes and only be visible in the page source where a bot would be able to scrape it, but it is important to note that not all emails from a page scrape are honeypot spamtraps. A false-positive could unfortunately lead to an unwarranted email rejection. Many legitimate emails are unfortunately exposed on business sites, job profiles, Twitter, business listings and other random pages. So it is not uncommon to see a legitimate email get marked as a potential spamtrap by a competitor.

 

Not all Spamtraps are Honeypots

While the honeypot may be the most commonly known type of spamtrap, it is not the only type around. Some of you may not be old enough to remember, but there was a time when businesses would configure their mail servers to accept any email address, even if the mailbox did not exist, for fear that a message would be lost due to a typo or misspelling. Messages to non-existent email addresses would be delivered to a catch-all box as long as the domain was correctly spelled. However, it did not take long for these mailboxes to become flooded with spam. As a result, some mail server administrators started to use catch-alls as a way to identify potential spammers. A mail server admin could treat the sender of any mail that ended up in this folder as a spammer and block them, the reasoning being that only spammers and no legitimate senders would end up in the catch-all box. This made catch-alls one of the first spamtraps. The reasoning is flawed but still in practice today. Nowadays it is more common for admins to use firewalls that act as catch-alls to try to catch and prevent spammers.

Some spamtraps can be created and hidden in the source code of a website so that only a crawler would pick them up; others can be created from recycled email addresses or created specifically with the intention of planting them in mailing lists. Regardless of how a spamtrap is created, it is clear that if you have one in your mailing list and you continue to send mail to it, you risk ruining your sender reputation.

Keeping Senders Honest

The reality is that not all honeypot spamtraps can be 100% identified. Doing so would highly diminish their value in keeping legitimate email senders honest.

It is very important that a sender or marketer follows their regional laws and best practices, such as tracking which emails are received, opened or bounced back. For example, some legitimate emails can still result in a hard or permanent bounce back. This may happen when an email is an alias or role that is connected to a group of users. In these cases, the email itself is not rejected but one of the emails within the group is. This brings up another point: role-based email addresses are often not eligible for solicitation, since they are commonly tied to positions and not any one particular person who would have opted-in. That is why the DOTS Email Validation service also has a flag for identifying potential role-based addresses.

Overall, it is up to the sender or marketer to ensure that they keep track of their mailing lists and that they always follow best practices. They should never purchase unqualified lists and they should only be soliciting to users who have opted-in. If an email address is bouncing back with a permanent rejection then they should remove it from the mailing list. If the email address that is being bounced back is not in your mailing list then it is likely connected to a role or group based email that should also be removed.

To stay on top of potential spamtraps marketers should also be keeping track of subscriber engagement. If a subscriber has never been engaged or is no longer engaged but email messages are not bouncing back, then it is possible that the email may be a spamtrap. If an email address was bouncing back before and not anymore, then it may have been recycled as a spamtrap.
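One way to act on these signals is a small rule that looks at bounce and engagement history per subscriber. The sketch below is illustrative only: the field names, the labels, and the threshold of five sends before judging engagement are all assumptions, and real engagement data will be messier.

```python
from dataclasses import dataclass


@dataclass
class SubscriberStats:
    sends: int                   # messages sent to this address
    recent_opens_or_clicks: int  # recent engagement events
    bounced_before: bool         # address hard-bounced at some point in the past
    bouncing_now: bool           # address is currently hard-bouncing


def spamtrap_suspicion(stats, min_sends=5):
    """Return a hypothetical label describing the risk for one subscriber."""
    if stats.bouncing_now:
        return "remove"  # permanent rejections should come off the list
    if stats.bounced_before:
        # Used to bounce and no longer does: may have been recycled as a trap.
        return "possible_recycled_spamtrap"
    if stats.sends >= min_sends and stats.recent_opens_or_clicks == 0:
        # Accepts mail but never engages: worth a closer look.
        return "possible_spamtrap"
    return "ok"
```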

Remember that by following the laws and best practices of your region you greatly reduce the risk of ruining your sender reputation, which will help ensure that your marketing campaigns reach as many subscribers as possible.

Thinking Alternatively About Place Names

Here at Service Objects we come across a lot of names, particularly the names of places. We also work with a lot of personal names, but for now I would like to focus on just place names. Whether the name is for a city, town, village, hamlet, district, region, state, prefecture, mining area, national park, theme park or what have you, chances are that the place has one or more alternate spellings and alternate names associated with it.

For a human fluent in English, “North Carolina” and “N. Carolina” will be considered equal, but for a computer they are not. With the use of fuzzy-matching and/or standardization we can work around seemingly trivial issues like this. Now let us suppose that you are working with a set of Japanese data and come across the same name but written in Katakana “ノースカロライナ” or Ukrainian data written in Cyrillic “Північна Кароліна” or even Thai “รัฐนอร์ทแคโรไลนา”. Well, fuzzy-matching and standardization are still our friends; we just have more fuzzy-matching and standardization rules to consider. However, we first need to ensure that we even have the data available to associate a name in a different language.

We’ve been creating a list of place names to help us tackle problems like the ones mentioned above. We currently have a list of over five million unique place names generated from a pool of approximately 11 million names. We are aggregating name data to come up with a more comprehensive list that consists of known alternates, variations in spellings, different languages and the transliterated versions for the different languages.

Here’s a quick look at what we have accomplished, so far:

  • Current list of approximately eight million place names and growing
  • Transliteration and phonetic mappings for various languages
  • Case, accent and kana sensitivity handling
  • Queryable using fuzzy-matching algorithms
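To give a feel for what case, accent and kana handling plus fuzzy matching can look like, here is a small sketch using only the Python standard library. It is not how our library is implemented; the function names, the 0.7 similarity cutoff and the choice to strip accents only from Latin letters are assumptions made for illustration.

```python
import difflib
import unicodedata


def strip_latin_accents(text):
    """Remove accents from Latin letters only, leaving other scripts intact."""
    out = []
    for ch in text:
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        if len(decomposed) > 1 and "LATIN" in unicodedata.name(base, ""):
            out.append(base)
        else:
            out.append(ch)
    return "".join(out)


def normalize_place_name(name):
    """Fold width variants (NFKC), case, Latin accents and extra whitespace."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return " ".join(strip_latin_accents(folded).split())


def best_match(query, candidates, cutoff=0.7):
    """Return the candidate closest to the query after normalization, or None."""
    by_normalized = {normalize_place_name(c): c for c in candidates}
    hits = difflib.get_close_matches(
        normalize_place_name(query), list(by_normalized), n=1, cutoff=cutoff
    )
    return by_normalized[hits[0]] if hits else None
```

A query such as "N. Carolina" can then be compared against a list containing "North Carolina" and "South Carolina". Matching names across languages still requires the alternate-name data itself, which no amount of string similarity can supply.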

We have taken some of what we have learned from our DOTS Address Validation – International service and built upon it in order to improve data beyond the realm of just address validation. When working with Phone, Email, IP, Demographic and Geo-coordinate related data we too often find that location names do not match up. Naturally this is to be expected, since different data vendors will have different standardizations and practices when it comes to naming conventions. Utilizing a comprehensive place name library will allow us to quickly perform various actions, such as cross checking multiple data sources against each other with increased flexibility and match rates.

It may not be immediately apparent how useful a place name library like this is and what kind of avenues it can open up, but expect to see new and exciting developments from us in the coming months!

Can Google Maps be Used to Validate Addresses?

In November of 2016, Google started rolling out updates to more clearly distinguish their Geocoding and Places APIs, both of which are a part of the Google Maps API suite. The Places API was introduced in March 2015 as a way for users to search for places in general and not just addresses. Until recently the Geocoding API functioned similarly to Places in that it also accepted incomplete and ambiguous queries to explore locations, but now it is focusing more on returning better geocoding matches for complete and unambiguous postal addresses. Do these changes mean that Google Maps and its Geocoding API can finally be used as an address validation service?

No, it cannot. Now before I explain why, let’s first acknowledge why someone would think Google Maps can be used to validate addresses in the first place. The idea starts with the simple argument that if an address can be found in Google Maps then it must exist. If it exists then it must be valid and therefore deliverable. However, this logic is flawed.

Addressing a Common Problem

One of the biggest problems many users overlook with Google Maps and the Geocoding API is that incomplete and/or ambiguous address queries lead to inaccurate and/or ambiguous results. It is common for users to believe that the address entered was correct and valid simply because Google returns a possible match. These users often ignore that the formatted address in the output may have changed significantly from what they had originally entered. The people over at Google Maps must have realized this too, as the Geocoding API is now more prone to return ‘ZERO_RESULTS’ instead of a potentially inaccurate result. However, not all users are pleased with the recent changes. Some have noted that addresses that once returned matches in the Geocoding API no longer do so.

Has the Geocoding API become stricter? Yes. Does Google Maps finally make use of address data from the actual postal authorities? Not likely.

Geocoding vs Deliverability

Google Maps does not verify if an address is deliverable. The primary purpose of the Geocoding API is to return coordinate information. At its best it can locate an individual residential home or a commercial building. Other times it is an address estimator. However, not all addresses are for single building locations.

Apartment and unit numbers, suites, floors and PO boxes are typical examples of the types of address that the Google Maps Geocoding API was not intended to handle. They now recommend that those types of addresses be passed to the Places API instead, but not because the Places API can validate or verify those types of addresses. Again, none of the APIs in the Google Maps suite will verify addresses. No, it is because information like a unit number is currently superfluous when it comes to their roof-top level geo-coordinates. Google Maps does not need to know if an address is a multi-unit and/or multi-floored building in order to return a set of coordinates.

Take the Service Objects address for example,

27 E Cota St Ste 500
Santa Barbara, CA 93101-7602

The Google Maps Geocoding API returns the following address and coordinates,

"formatted_address" : "27 E Cota St, Santa Barbara, CA 93101, USA"

"location" : { "lat" : 34.41864020000001, "lng" : -119.696178 }

Notice that the formatted address output value has dropped the suite number even though the address is valid. Let’s change the suite number from 500 to a suite number that does not exist, such as 900.

"formatted_address" : "27 E Cota St, Santa Barbara, CA 93101, USA"

"location" : { "lat" : 34.41864020000001, "lng" : -119.696178 }

We get back the exact same response, because they are both the same in the eyes of Google Maps.
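If you do rely on a geocoder, it is worth at least comparing what you sent with what came back. The sketch below is a rough, hypothetical helper that lists the parts of a submitted address that are missing from the returned formatted address. It does not validate anything; it only makes silent changes visible.

```python
import re


def tokens(address):
    """Split an address into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", address.lower())


def dropped_tokens(submitted, formatted):
    """List tokens from the submitted address that the formatted one lacks."""
    returned = set(tokens(formatted))
    return [t for t in tokens(submitted) if t not in returned]
```

For the example above, comparing "27 E Cota St Ste 500, Santa Barbara, CA 93101-7602" with the returned formatted address reports the suite designator, the suite number and the ZIP+4 extension as dropped, and it reports the same kind of loss for the non-existent Suite 900.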

A similar thing happens if we try the same using the Google Maps web site.

This is the result for when Suite 500 is passed in:

This is the result for when Suite 900 is passed:

Notice that 900 remains in the address.

An unsuspecting user could easily mistake the Suite 900 address for being valid if they were simply relying on the Google Maps website, and it's mistakes like these that often lead people to believe that an address exists when it does not.

The Right Tool for the Job

When selecting a dedicated address validation service here are a few critical and rich features you will want to look for:

Even with the recent updates, Google Maps is still no alternative to a dedicated address validation service, and choosing not to use one could prove to be an expensive mistake.

Looking Beyond Simple Blacklists to Identify Malicious IP Addresses

Using a blacklist to block malicious users and bots that would cause you aggravation and harm is one of the most common and oldest methods around (according to Wikipedia the first DNS based blacklist was introduced in 1997).

There are various types of blacklists available. Blacklists exist for IP addresses, domains, email addresses and user names. The majority of the time these lists will concentrate on identifying known spammers. Other lists serve a more specific purpose, such as IP lists that help identify known proxies, Tor nodes and VPNs, email lists of known honeypots, or lists of disposable domains.

There are many different types of malicious activity that occur on the internet and there are various types of lists out there to help identify and prevent it; however, there are also various problems with lists.

The problem with Lists:

In order to identify a malicious activity with a list, the activity must first occur and then be reported and propagated. It is not uncommon for the malicious activity to stop by the time it has been reported and propagated. Not all malicious activities are reported. If you encounter the malicious activity before it is reported, then you won’t be able to preemptively act on it.

IPs, Domains, Email Addresses and Usernames are dynamic and disposable. If a malicious user/bot gets blocked then they can easily switch to a different IP, domain etc.

Some lists offer warnings that blocking an IP address could affect thousands of users who depend on it in order to obtain crucial information that they would otherwise not have access to. So block responsibly.

Aggregating data to more effectively identify malicious activity:

Instead of looking at one list to perform a simple straightforward lookup, we can take advantage of multiple datasets to uncover patterns and relationships between seemingly disparate values. A simple example would be relating user names to email addresses, email addresses to domains and domains to IP addresses, which allows us to view the activity of one value and compare it to the behavior of other values. Using complex algorithms with machine learning to process large samples of data, we can intelligently discern if a value is directly or indirectly related to a malicious activity.
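The idea of relating values to one another can be illustrated with a small graph. The sketch below is a simplified stand-in and not the algorithm used by any Service Objects service: it uses a plain breadth-first search rather than machine learning, and the function names, the "type:value" labels and the sample data are all hypothetical.

```python
from collections import defaultdict, deque


def build_graph(relations):
    """Build an undirected graph from (value_a, value_b) relationship pairs."""
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def hops_to_malicious(graph, start, known_bad, max_hops=3):
    """Return the number of hops from start to the nearest known-bad value,
    or None if none is reachable within max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in known_bad:
            return dist
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical sample data: user name -> email -> domain -> IP address.
relations = [
    ("user:jdoe", "email:jdoe@example.com"),
    ("email:jdoe@example.com", "domain:example.com"),
    ("domain:example.com", "ip:203.0.113.7"),
]
graph = build_graph(relations)
print(hops_to_malicious(graph, "user:jdoe", {"ip:203.0.113.7"}))
```

A value with zero hops is itself known to be bad, while a value a few hops away is only indirectly related, which is the kind of distinction a single flat list cannot express.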

How Service Objects keeps it simple for the user:

The DOTS IP Address Validation service currently has two flags to help its users deal with malicious IPs, ‘MaliciousIP’ and ‘PotentiallyMaliciousIP’. The ‘MaliciousIP’ flag indicates that the IP address recently displayed malicious activity and should be treated as such. The ‘PotentiallyMaliciousIP’ flag indicates that the IP address recently displayed one or more strong relationships to a malicious activity and that it has a high likelihood of being malicious. Both flags should be treated as warnings, with the ‘MaliciousIP’ flag being scrutinized more severely.
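How an application reacts to these two flags is up to the application. As a minimal Python sketch, assuming the flags have already been extracted from the service response into a collection of strings (the function name triage_ip and the three scrutiny levels are hypothetical, and the response format itself is not shown here):

```python
def triage_ip(flags):
    """Map the two warning flags to a scrutiny level chosen by the application."""
    if "MaliciousIP" in flags:
        # Recently displayed malicious activity: scrutinize most severely.
        return "high_scrutiny"
    if "PotentiallyMaliciousIP" in flags:
        # Strongly related to malicious activity: still only a warning.
        return "elevated_scrutiny"
    return "normal"


print(triage_ip({"PotentiallyMaliciousIP"}))
```

Returning a scrutiny level rather than an outright block leaves room for steps such as additional verification, which fits the earlier advice to block responsibly.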

The warning signs of online fraud are out there, but you need a means of discovering them. Our IP Validation service encompasses many of the identification strategies necessary to make split-second decisions on would-be attackers before any harm is done.

The Challenge of Storing International Addresses

Working with international address data can be difficult and confusing. Even when you have an application available to validate an address, and it tells you that it’s deliverable, you still have to deal with the chore of storing the resulting data. So when someone asks, “what’s the best way to store international addresses?”, what they are really asking is, “what’s the easiest way to store international addresses?”

The short answer to the “what’s the best?” question, as it often is, is that you’re asking the wrong question. Many of you who have worked with varying data sets before already know that you first need to ask yourself, “what do I intend to do with the data once it is stored?” What the data is used for should have the largest impact on how it is stored. Depending on your specific requirements, the way you store address data can vary greatly. For some, how you store your data may not be entirely up to you, as you may not have any control over the storage design and are instead forced to work with the fields that are made available to you. Many users work with US-centric Customer Relationship Management (CRM) solutions that are designed with US address fields in mind, which can make storing international addresses all the more confusing and can also potentially lead to some data loss.

For those looking to simply print an address label for mail delivery, a single text field containing the complete formatted address will suffice. After all, why bother with breaking an address down to a mess of individual fragments if you’re not going to use them? Worse yet, what do you do when it comes time to put the pieces back together and you find that you don’t know how?

For some, correctly putting an address back together from its individual fragments might not be of great concern. The primary use of the data may be for some form of query analysis and/or organization, in which case you might be more concerned about which specific data type your individual fields should be or how to properly map these fragments. If you are implementing your own design, then keep in mind that not all international addresses are necessarily parsed the same way, and you will need to consider if your design should be flexible enough to handle all international addresses or if you would prefer to go a country-specific route.

Mapping Address Fields

Consider this example of an address in England:

9 Gorse View
School Road
Knodishall, Saxmundham
IP17 1TS
UNITED KINGDOM

If we include the country name, then the above address has five address lines; six if we split the third line. Now, let’s go ahead and attempt to store this address in our CRM. Most CRMs will contain the following address fields for a contact:

Address1
Address2
Address3
City
State
ZIP
Country

Depending on the CRM, we may have somewhere between five and seven address-related fields on average to work with. In the above example we have seven, so that should make things easy, right? We have more than enough fields, so there should not be any loss of data, but right away we see State and ZIP fields. These should be red flags that the storage was not designed for international addresses, but unfortunately, it is what we have to work with. Let’s go ahead and look at the parsed fields that we are likely to get back from an address validation solution:

Premise Number: 9
Dependent Street Name: Gorse View
Street Name: School Road
Dependent Locality: Knodishall
Locality: Saxmundham
Postal Code: IP17 1TS
Country: UNITED KINGDOM

In most cases, users will find that they can typically match Locality to City, Administrative Area to State, and Postal Code to ZIP. If you are unfamiliar with the address terms “Locality” and “Administrative Area” then please check out our previous blog, Five Commonly Used Terms and Definitions in International Address Validation Systems.

In the above example, you’ll notice that an Administrative Area equivalent was never provided. You’ll quickly find that this is quite common for many countries and that the locality is usually preferred. You’ll also notice that we have a dependent locality, which is a sub-region of the locality, and a dependent street name. It is important not to omit or lose these pieces of data if they are provided, as they offer additional detail/instruction on the whereabouts of an otherwise ambiguous address. So where to map them?

Luckily, our database design offers enough fields to accommodate these values, but keep in mind that this may not always be the case. In our example, we can map the premise number and dependent street name to Address1, the street name to Address2, the dependent locality to Address3, locality to City, postal code to ZIP, country to Country, and leave State empty. However, even though we were able to successfully map every value to our CRM, it is still very tedious and risky to try to handle all of the various address formats. Also, what course of action do we take when an address also includes a double-dependent locality or a sub-region?
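The mapping described above can be sketched in a few lines of Python. This is only an illustration for this one address layout: the input key names (PremiseNumber, DependentStreetName and so on) are hypothetical stand-ins for whatever your validation solution actually returns, and the output keys are the seven CRM fields listed earlier.

```python
def map_to_crm(parsed):
    """Map parsed international address fields onto seven US-centric CRM fields."""

    def join(*keys):
        # Combine the named fields with spaces, skipping any that are missing or empty.
        return " ".join(parsed[k] for k in keys if parsed.get(k))

    return {
        "Address1": join("PremiseNumber", "DependentStreetName"),
        "Address2": parsed.get("StreetName", ""),
        "Address3": parsed.get("DependentLocality", ""),
        "City": parsed.get("Locality", ""),
        "State": parsed.get("AdministrativeArea", ""),  # often empty outside the US
        "ZIP": parsed.get("PostalCode", ""),
        "Country": parsed.get("Country", ""),
    }


parsed = {
    "PremiseNumber": "9",
    "DependentStreetName": "Gorse View",
    "StreetName": "School Road",
    "DependentLocality": "Knodishall",
    "Locality": "Saxmundham",
    "PostalCode": "IP17 1TS",
    "Country": "UNITED KINGDOM",
}
print(map_to_crm(parsed))
```

Note that this mapping has nowhere to put a double-dependent locality, which is exactly the kind of case that makes a fixed field layout risky.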

Missing State or Administrative Area Equivalent

Let’s look at two more example addresses:

3-10-13 Ryoke
Urawa-Ku
Saitama-Shi 330-0072
JAPAN

and

5 Rue Sainte-Catherine
12000 Rodez
FRANCE

The first example is a Japanese address. Looking at it with American eyes, one might think that the first line is a premise number and a street name, the second line the city, and the third line the state and postal code, or their equivalents. However, things work very differently in Japan. Streets are not commonly named or used for addresses. Instead of street names, they primarily use regions that can normally be thought of as districts. In the above example, Ryoke is a second-level sub-locality, Urawa is a first-level sub-locality and Saitama is the locality. No administrative area equivalent value is given. In Japanese addresses, administrative areas are omitted about as often as they are included.

In the second example, we have a premise number and street name in the first address line, and a postal code and locality in the second. Once again, no administrative area value is given. The address is in France, but many European addresses will follow this general format, and it is common for them to omit a first level administrative area. Therefore, it is highly recommended that you do not make an administrative area a required field. Doing so would mean rejecting valid addresses for entire countries.

Facing the Challenge

As I mentioned earlier, when breaking an address apart we also run the risk of putting it back together incorrectly. So while no individual address fragment might be lost, we still risk losing the correct address order and format. Addresses and their various fragments and formats can vary greatly not just from country to country, but also within the same country. So what’s the point of it all? Is there no hope when it comes to international addresses?

If you are forced to use a set storage design and are unable to alter it then your best course of action may be to simply store the complete formatted address in a single field, if it can fit. If the complete address cannot fit in a single field, then split it into multiple fields when necessary. In general, storing the complete address should be your primary objective as it should contain all of the necessary information that you need. The complete address can always be parsed out later as needed. Storing the country and postal code should be next on your priority list, although not all countries use postal codes. Postal codes are very important and useful, so be sure to store them when they are available. Finally, look towards storing the locality and admin area if they are available.
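These priorities can be expressed as a small record type. The following Python sketch is one possible shape under stated assumptions, not a recommended schema: the class name StoredAddress, its field names and the check_address helper are all hypothetical. The point it illustrates is that only the complete formatted address is required, while the administrative area and the other parsed values stay optional.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StoredAddress:
    formatted_address: str                      # complete address, line breaks preserved
    country: Optional[str] = None               # next priority
    postal_code: Optional[str] = None           # not every country uses postal codes
    locality: Optional[str] = None
    administrative_area: Optional[str] = None   # never required


def check_address(addr: StoredAddress) -> List[str]:
    """Return a list of problems; an empty list means the record can be stored."""
    issues = []
    if not addr.formatted_address.strip():
        issues.append("formatted_address is required")
    if not addr.country:
        issues.append("country is missing")
    return issues


rodez = StoredAddress(
    formatted_address="5 Rue Sainte-Catherine\n12000 Rodez\nFRANCE",
    country="FRANCE",
    postal_code="12000",
    locality="Rodez",
)
print(check_address(rodez))
```

Because administrative_area defaults to None, the French and Japanese examples above are accepted without a state-like value.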

For those who will be implementing their own design, look to the output specifications of your validation solution. Most validation solutions will have a large list of address fields that cover the majority of the most widely used international addresses out there. You may consider it cumbersome, but if you include all of the output fields from your validation solution in your own design then you minimize the risk of losing data during the mapping process. You might not consider it the best way to handle storing international addresses, but unless you want to become an expert on the subject, it is definitely easier to use an existing design.

5 Commonly Used Terms and Definitions in International Address Validation Systems

When dividing the countries of the world into regions and sub-regions for the purpose of Address Validation, it is important to find a common ground and to use a set of widely adopted terms and definitions.

In the United States of America (US), we commonly use the terms city, state and zip code when referring to addresses. While that may mostly work for a country like Mexico (MX), it is not appropriate for other countries like Japan (JP), where the country is divided into prefectures instead of states. Not all countries call their sub-region divisions the same thing, and many countries have several levels of sub-divisions. To further complicate the matter, not all sub-division levels are necessarily interchangeable from one country to another. For example, a first-level sub-region in the US is a state, such as California (US-CA), but a first-level sub-region for the United Kingdom of Great Britain and Northern Ireland (GB) is a country, such as England (GB-ENG).

Every country can have its own particular set of terms and definitions; to try to go over them all would be too complicated and inefficient. Instead, let’s go over some commonly used terms that are helpful when talking about international addresses.

Country Code

An alphabetic or numeric code used to represent a country. Various types of country codes exist for different particular uses, but the most commonly used codes come from the ISO 3166 standard. Part one of this standard, ISO 3166-1, consists of the following code formats:

  • ISO 3166-1 alpha-2 – a two-letter country code.
  • ISO 3166-1 alpha-3 – a three-letter country code.
  • ISO 3166-1 numeric – a three-digit country code.
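As a small illustration of the three formats, the Python sketch below holds the ISO 3166-1 codes for the four countries mentioned in this post and normalizes any of the three formats to alpha-2. The table is deliberately tiny and the function name to_alpha2 is hypothetical; a real system would load the full standard. Numeric codes are kept as strings so that leading zeros are preserved.

```python
# alpha-2 -> (alpha-3, numeric). Only a handful of entries, for illustration.
ISO_3166_1 = {
    "US": ("USA", "840"),
    "GB": ("GBR", "826"),
    "JP": ("JPN", "392"),
    "MX": ("MEX", "484"),
}


def to_alpha2(code):
    """Normalize an alpha-2, alpha-3 or numeric code to alpha-2, or return None."""
    code = code.strip().upper()
    for alpha2, (alpha3, numeric) in ISO_3166_1.items():
        if code in (alpha2, alpha3, numeric):
            return alpha2
    return None


print(to_alpha2("gbr"))
```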

Postal Code

An alphabetic, numeric or alphanumeric code, sometimes including spaces or punctuation, that is commonly used for the purpose of sorting mail. It is commonly referred to as the postcode. Some country-specific terms include ZIP code (US), PLZ (DE, AT, and CH), PIN code (IN) and CAP (IT).

Administrative Areas

The regions into which a country is divided. Each region typically has a defined boundary with an administration that performs some level of government functions. These areas are commonly expected to manage themselves with a certain level of autonomy. Various administrative levels exist that can range from “first-level” administrative to “fifth-level” administrative. The higher the level number, the lower its rank in the administrative hierarchy. For example, the US is made up of states (first-level), which are divided into counties (second-level) that consist of municipalities (third-level). For comparison, the United Kingdom (GB) is composed of the four countries England, Scotland, Wales and Northern Ireland (first-level). These countries are made up of counties, districts and shires (second-level), which in turn are made up of cities and towns (third-level) and small villages and parishes (fourth-level). Other common terms for an administrative area are administrative division, administrative region, administrative unit, administrative entity and subdivision.

Locality

In general, a locality is a particular place or location. More specifically, a locality should be defined as a distinct population cluster. Localities are commonly recognized as cities, towns, and villages; but they may also include other areas such as fishing hamlets, mining camps, ranches, farms and market towns. Localities are often lower-level administrative areas and they may consist of sub-localities, which are segments of a single locality. Sub-localities should not be confused for being the lowest level administrative area of a country, nor should they be confused as being separate localities.

Thoroughfare

In general, a thoroughfare is a transportation route between one location and another. On land, it is more commonly referred to as a type of road or route that is typically used by motorized vehicles, such as a street, avenue or highway.

Geocoding Resolution – Ensuring Accuracy and Precision

When geocoding addresses, coordinate precision is not as important as coordinate accuracy. It is a common misconception to confuse high precision decimal degree coordinates with high accuracy. Precision is important, but having a long decimal coordinate for the wrong area could be damaging. It is more important to ensure that the coordinates point to the correct location for the given area. Accurately geocoding an address is very complex. If the address is at all ambiguous or not properly formatted then a geocoding system may incorrectly return a coordinate for a location on the wrong side of town or for a similar looking address in an entirely different state or region.

Some address geocoding systems will return decimal coordinates to the sixth decimal place or more; however, depending on your particular needs, that level of precision may actually prove unnecessary. The degree of precision for most consumer-level GPS devices only goes up to the 5th decimal place anyway, which equates to “roughly” one meter of precision. This is “roughly” one meter because the ground distance covered by a fraction of a degree of longitude varies depending on how close you are to either the equator or the poles. The distance is at its greatest at the equator and gradually gets smaller as you move toward either pole; however, when dealing with coordinates at this level of precision and above, the difference is mostly negligible for address geocoding.
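The “roughly one meter” figure can be checked with a little arithmetic. The Python sketch below assumes a spherical approximation in which one degree of latitude is about 111,320 meters everywhere, and one degree of longitude is that value multiplied by the cosine of the latitude; the function name and the constant are illustrative choices, not part of any geocoding API.

```python
import math

# Approximate length of one degree of latitude, and of one degree of
# longitude at the equator, in meters (spherical approximation).
METERS_PER_DEGREE = 111_320


def decimal_place_resolution_m(decimal_places, latitude_deg=0.0):
    """Return (north_south, east_west) ground distance in meters represented
    by one unit in the last decimal place of a decimal degree coordinate."""
    step = 10 ** -decimal_places
    north_south = METERS_PER_DEGREE * step
    # Lines of longitude converge toward the poles, so the east-west
    # distance shrinks with the cosine of the latitude.
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


# Fifth decimal place at the latitude of the Santa Barbara example above.
ns, ew = decimal_place_resolution_m(5, latitude_deg=34.4)
print(f"north-south: {ns:.2f} m, east-west: {ew:.2f} m")
```

At the fifth decimal place the north-south figure comes out at about 1.11 meters, and each additional decimal place divides it by ten.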

In the Decimal Degrees wiki page (link below), there is a table that covers the levels of precision for each decimal place in a decimal degree. Below is a similar looking table:

 

Looking at the table above we see that a decimal coordinate with a level of precision past the 6th decimal place would be entirely unnecessary for locating an address. That level of precision would only be necessary under very special circumstances and would require very specialized equipment to use. If a decimal coordinate goes past the 7th or 8th decimal place then the coordinate was most likely calculated and the true level of precision would be unknown. So don’t let a decimal degree coordinate with a high level of precision fool you into thinking that it is more accurate. It is important to always thoroughly test any geocoding system to ensure that it meets your particular needs.

Reference: (https://en.wikipedia.org/wiki/Decimal_degrees)
