Posts Tagged ‘transliteration’

The Trouble with Numeric and Fake-looking Chinese Email Addresses

If you were to encounter an email address made up of just numbers, what would be your first reaction? You might suspect that it was a fake or disposable email address. But in some countries, such as China, this isn’t necessarily the case. In this blog post, we will take a deeper dive into when to be cautious about email addresses from China.

Obviously fake email addresses… right?

For example, let’s randomly type in some numbers.

  • 6843619
  • 1684154646514
  • 735416442
  • 94633252361

If we were to use these numbers as the mailbox of an email address, whether at a company domain or at a free email provider, you would most likely dismiss the result as garbage, fake or simply bad. However, what if we instead paired them with a domain that is itself all numbers?


Now you might be thinking, “That’s even worse! Even the domains are all numbers now. Those are obviously fake email addresses. I’m absolutely positive.”

“Positive I tell you!”

OK, fine. I would agree. It looks fake to me too.

Now, what if we instead applied those numbers to the domain qq.com? Would you still think it was an ‘obviously fake email address’?

Maybe not so ‘obviously fake email address’

In China, all-numeric email addresses are very common. If you made your way to this blog article, chances are you have encountered one or more numeric email addresses that turned out to be genuine when you did not expect them to be. For example, the all-numeric domains noted above are not fake. They are real domains with valid Mail Exchange (MX) records that point to real mail servers handling real email.

You might be more familiar with the qq.com domain, particularly if you work in international business and/or marketing.

QQ, which is owned by the Chinese tech giant Tencent, is a messaging application similar to Skype. In China and parts of Asia, QQ is what the big free email and messaging providers are to the US. In fact, in 2014, QQ was recognized by Guinness World Records for having the most simultaneous users on an instant messaging platform, with more than 200 million simultaneous users and over 800 million Monthly Active Users (MAU).

All of these QQ users have a qq.com email address, and every QQ account has a numeric email address.

But why numbers for an email address?

Numbers aren’t that hard to memorize. Most people have several phone numbers memorized, maybe a bank account or two, or perhaps the combination for a lock at their local gym. However, there is something impersonal and dissociative about numbers. A random number, like 845796833, doesn’t tell you much, unlike, say, Support@, ILuvKittens@ or ImBatman@, or just a plain old name used as an email address. So, what’s so different about China that makes numbered email addresses so popular?

Well, there is an interesting article from The New Republic that tries to shed some light on the subject. It suggests that numbers, when used as homonyms for Chinese words, offer a quicker and easier way to spell those words out. One example from the article: the numbers 5 and 1, spoken in Chinese, sound like the words for “I” and “want”, which helps explain why a job-hunting web site would choose 51job.com for its domain. In Chinese, 5-1-Job would mean “I want Job”. Cute.

The meaning behind numbered emails can go beyond simple homonyms, however. The article calls it a “numbered-based slang,” and here is one example that I think helps explain the idea. Quoting the article:

“The Internet company NetEase uses the web address—a throwback to the days of dial-up when Chinese Internet users had to enter 163 to get online.”

They go on to state that 163 is not a homonym for anything, but is instead a throwback reference. A similar example would be the search engine website 411.com, a throwback to when people in the US would dial 4-1-1 for information (as opposed to now, when most people simply ‘google’ to search for information).

More Than Just Numbers

Slang in any language can be very complicated, and staying informed enough to understand what it means is not easy. Technical slang takes this complexity to a whole new level. Take for example this surprisingly common password, “ji32k7au4a83”. One would think that this seemingly complicated password would be quite rare if not unique; however, it turns out it’s not. As an article on the subject points out, the password “ji32k7au4a83” can be translated to mean “my password” in English.

This is how it breaks down:

ji3 -> 我 -> M

2k7 -> 的 -> Y

au4 -> 密 -> PASS

a83 -> 碼 -> WORD

The article details how a major Chinese transliteration system, Zhuyin (also known as Bopomofo), maps onto a standard keyboard, so that typing a Chinese phrase with that layout produces a string of Latin letters and digits like the one above. The same process can be used to come up with some very complicated-looking email addresses and not just passwords.
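To make the breakdown above concrete, here is a minimal Python sketch of the idea. It assumes the password was typed on the standard Zhuyin (Bopomofo) keyboard layout, and it only includes the handful of keys needed for this one example. The names ZHUYIN_KEYS, TONE_KEYS and keys_to_zhuyin are made up for illustration and do not come from the article.

```python
# Partial, illustrative map of the standard Zhuyin (Bopomofo) keyboard layout.
# Only the keys needed for "ji32k7au4a83" are listed; a full layout has ~40 keys.
ZHUYIN_KEYS = {
    "j": "ㄨ", "i": "ㄛ",   # ㄨㄛ = "wo"
    "2": "ㄉ", "k": "ㄜ",   # ㄉㄜ = "de"
    "a": "ㄇ", "u": "ㄧ",   # ㄇㄧ = "mi"
    "8": "ㄚ",              # ㄇㄚ = "ma"
    "3": "ˇ", "4": "ˋ", "6": "ˊ", "7": "˙",  # tone marks
}

# On this layout a tone key ends the syllable.
# (The first tone is typed with the space bar and is not handled in this sketch.)
TONE_KEYS = {"3", "4", "6", "7"}


def keys_to_zhuyin(keys):
    """Turn a run of keystrokes into a list of Zhuyin syllables.

    Raises KeyError for any key that is missing from the partial map above.
    """
    syllables = []
    current = ""
    for ch in keys.lower():
        current += ZHUYIN_KEYS[ch]
        if ch in TONE_KEYS:
            syllables.append(current)
            current = ""
    if current:
        # Trailing keys without a tone mark are kept as an unfinished syllable.
        syllables.append(current)
    return syllables


if __name__ == "__main__":
    print(keys_to_zhuyin("ji32k7au4a83"))
```

Running this on “ji32k7au4a83” gives four syllables, one for each of the four characters 我, 的, 密 and 碼 in the breakdown. Going from the Zhuyin syllables to the actual characters needs a dictionary, which is what an input method editor supplies.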

It would not be a stretch to say that the process bears some resemblance to 1337 Speak (Leet Speak). Take the previously mentioned “ImBatman” email example. One leet interpretation of it would be “1mb47m4n”. The result appears similarly nonsensical and complicated, wouldn’t you say? However, the problem with verifying Chinese email addresses goes beyond superficial, fake-looking mailboxes and domains.

Disposable email addresses are easier to create

Let’s circle back to the widely popular QQ application, and the all-numeric email addresses. When a user registers for a QQ account they are given a QQ ID number, and this number is also their QQ email address. This ID number can be bound to another email address, so instead of giving someone your actual email, you just give them your QQ number. It’s a nice feature. Unfortunately, it is easy for users to create disposable accounts with QQ and bind them to their real email address. These disposable accounts are commonly used by bots, often created for or by Chinese vendors trying to push their products via spam.

This can lead to some false negatives when validating email addresses. It is not uncommon to receive a business email address with a qq.com domain and for it to end up going bad. The qq.com domain and some of its IP addresses tend to accumulate bad sender reputations due to the large amounts of spam abuse mentioned above. Spam and abuse are not just a problem for QQ, unfortunately. Malicious internet activity is very common in China, and Chinese service providers struggle with the problem.

Countries with malicious networks or spam saturation: Use Caution

If you were to search for the countries with the worst spam or malicious networks, you would likely find the following result.

Countries with the worst spam/malicious networks

  1. United States
  2. China
  3. Russia

SPAMHAUS lists the worst spam enabling countries and Country IP Blocks (CIPB) lists countries with the most malicious networks, and both lists come back with the same top three countries in the same order. On both lists, the US is the worst offending country of all. Surprised?

CIPB also re-orders their top ten list by the number of malicious networks as a percentage of the total number of networks for the given country. Here is their re-organized list.

Countries with the most infected networks*

  1. Brazil 89%
  2. Turkey 54%
  3. Romania 39%
  4. China 32%
  5. Russia 11%
  6. United Kingdom 11%
  7. Japan 10%
  8. Ukraine 9%
  9. Germany 6%
  10. United States 6%

*Results are based on CIPB’s current top 10 countries with the most malicious networks.

Another CIPB top ten list places China as the current world leader in malicious internet activity. Brazil and Russia take second and third place respectively. The US is not on the list.

SPAMHAUS’ list of the 10 Worst Botnet Countries

  1. India
  2. China
  3. Vietnam
  4. Iran
  5. Thailand
  6. Brazil
  7. Indonesia
  8. Pakistan
  9. Algeria
  10. Russia

Overall, the real issue with trying to verify email addresses from China is not that they look complicated and fake, but that the country is a hotbed for malicious activity. Just because an email address is deliverable doesn’t mean that it is good or safe. In some cases, it would not be surprising to see one out of three email addresses from China turn out to be a bot and/or disposable.

How Email Validation can help

So how can you differentiate between, say, a legitimate alphanumeric email address that looks suspicious versus a spambot? Our DOTS Email Validation product can help you navigate some of the challenges and complexities of email data quality, particularly for contact or marketing with international addresses.

Our Email Validation service tests emails at multiple distinct levels.

  • First, of course, we check for basic syntax errors and common domain typos, and perform a DNS or domain name check to make sure the domain exists and has a valid MX record.
  • We also perform a comprehensive SMTP check by communicating directly with the target mail server to determine three key pieces of information: is the server working, will it accept any address and will it accept a specific address.
  • Finally, we perform multiple integrity checks to see if the email address is associated with problematic addresses and services such as spam traps, known disposable address providers and blacklisted servers.

Together, these checks ultimately determine whether the email address is a real, functioning email address.
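As a rough illustration of the first and last levels only, here is a small Python sketch. It is not how the DOTS Email Validation service is implemented. The regular expression is deliberately simplistic, the disposable-domain list is a made-up placeholder, and the names Report and precheck are hypothetical. DNS, MX and SMTP checks are left out because they need a live network connection. The point it tries to make is that an all-numeric mailbox is recorded as information, not treated as a reason to reject.

```python
import re
from dataclasses import dataclass, field

# Deliberately simple pattern. Real address syntax (RFC 5322) is more permissive.
SIMPLE_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@([A-Za-z0-9-]+\.)+[A-Za-z]{2,}$")

# Hypothetical placeholder. A real list would come from a maintained data source.
DISPOSABLE_DOMAINS = {"disposable.example"}


@dataclass
class Report:
    address: str
    syntax_ok: bool
    numeric_local_part: bool = False
    disposable_domain: bool = False
    notes: list = field(default_factory=list)


def precheck(address):
    """Run the offline checks only: basic syntax, numeric mailbox, disposable list."""
    address = address.strip()
    if not SIMPLE_EMAIL.match(address):
        return Report(address, syntax_ok=False, notes=["failed basic syntax check"])

    local, _, domain = address.rpartition("@")
    report = Report(address, syntax_ok=True)
    report.numeric_local_part = local.isdigit()
    report.disposable_domain = domain.lower() in DISPOSABLE_DOMAINS

    if report.numeric_local_part:
        # Common for QQ users, so this is a note and not a rejection.
        report.notes.append("all-numeric mailbox: informational only")
    if report.disposable_domain:
        report.notes.append("domain is on the disposable list")
    return report


if __name__ == "__main__":
    for candidate in ("845796833@qq.com", "someone@disposable.example", "not an email"):
        print(precheck(candidate))
```

A real pipeline would feed addresses that pass these offline checks into the DNS, SMTP and reputation checks described in the list above before assigning any score.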

Circling back to the Chinese email addresses we discussed earlier: our Email Validation service can validate these with no problem, but clients often get confused when these emails get a low score. We verify that they are deliverable, but give them a low score when they show problems such as bot activity or malicious use. It is then up to you to decide whether you want to take the risk of using these email addresses or not. So in closing, understanding that numerical or nonsensical-looking emails from other countries are often OK is a good first step, but automated validation can help you make an informed decision on whether to use them.

Thinking Alternatively About Place Names

Here at Service Objects we come across a lot of names, particularly the names of places. We also work with a lot of personal names, but for now I would like to focus on just place names. Whether the name is for a city, town, village, hamlet, district, region, state, prefecture, mining area, national park, theme park or what have you, chances are that the place has one or more alternate spellings and alternate names associated with it.

For a human fluent in English, “North Carolina” and “N. Carolina” will be considered equal, but for a computer they are not. With the use of fuzzy-matching and/or standardization we can work around seemingly trivial issues like this. Now let us suppose that you are working with a set of Japanese data and come across the same name but written in Katakana “ノースカロライナ” or Ukrainian data written in Cyrillic “Північна Кароліна” or even Thai “รัฐนอร์ทแคโรไลนา”. Well, fuzzy-matching and standardization are still our friends; we just have more fuzzy-matching and standardization rules to consider. However, we first need to ensure that we even have the data available to associate a name in a different language.
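Here is a minimal sketch of what standardization followed by fuzzy-matching can look like for the “North Carolina” case, using Python’s standard difflib module. The abbreviation table and the function names standardize and similarity are illustrative assumptions, not a description of our production rules.

```python
from difflib import SequenceMatcher

# Tiny, illustrative abbreviation table. A real rule set would be far larger
# and would depend on the language and country of the data.
ABBREVIATIONS = {
    "n.": "north", "n": "north",
    "s.": "south", "s": "south",
}


def standardize(name):
    """Lowercase the name and expand known abbreviations token by token."""
    tokens = name.casefold().split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def similarity(a, b):
    """Return a 0.0 to 1.0 similarity score between two standardized names."""
    return SequenceMatcher(None, standardize(a), standardize(b)).ratio()


if __name__ == "__main__":
    # Standardization makes the abbreviated form identical to the full form.
    print(similarity("North Carolina", "N. Carolina"))
    # Fuzzy-matching tolerates a small typo.
    print(similarity("North Carolina", "North Carolna"))
    # Unrelated names score noticeably lower.
    print(similarity("North Carolina", "South Dakota"))
```

Standardization handles the differences we can predict, such as abbreviations, while the similarity score absorbs the ones we cannot, such as typos. Matching across scripts needs one more ingredient, which is a table linking “ノースカロライナ” to “North Carolina” in the first place.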

We’ve been creating a list of place names to help us tackle problems like the ones mentioned above. We currently have a list of over five million unique place names generated from a pool of approximately 11 million names. We are aggregating name data to come up with a more comprehensive list that consists of known alternates, variations in spellings, different languages and the transliterated versions for the different languages.

Here’s a quick look at what we have accomplished, so far:

  • Current list of approximately eight million place names and growing
  • Transliteration and phonetic mappings for various languages
  • Case, accent and kana sensitivity handling
  • Queryable using fuzzy-matching algorithms
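To give a feel for what case, accent and kana handling involves, here is a small sketch built on Python’s standard unicodedata module. It shows one common way to build a comparison key and is an assumption-laden illustration, not our actual implementation. The function names fold_accents, katakana_to_hiragana and match_key are made up for this example.

```python
import unicodedata


def fold_accents(text):
    """Drop Latin-style combining marks (U+0300 to U+036F), e.g. 'ã' becomes 'a'.

    Only that block is removed so that Japanese voicing marks survive.
    Note this is crude: it also folds Cyrillic 'й' to 'и'.
    """
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not ("\u0300" <= ch <= "\u036f"))
    return unicodedata.normalize("NFC", kept)


def katakana_to_hiragana(text):
    """Map Katakana U+30A1..U+30F6 onto Hiragana, which sits 0x60 code points lower."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def match_key(name):
    """Build a key that ignores case, accents and the katakana/hiragana distinction."""
    text = unicodedata.normalize("NFKC", name)  # also widens half-width katakana
    text = fold_accents(text)
    text = katakana_to_hiragana(text)
    return text.casefold()


if __name__ == "__main__":
    print(match_key("São Paulo") == match_key("SAO PAULO"))
    print(match_key("ノースカロライナ") == match_key("のーすかろらいな"))
```

Whether two names should compare as equal is a per-query decision, which is why these are sensitivity options and not a single fixed normalization. An accent-insensitive match is handy for Latin-script names but can be too aggressive for other scripts.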

We have taken some of what we have learned from our DOTS Address Validation – International service and built upon it in order to improve data beyond the realm of just address validation. When working with Phone, Email, IP, Demographic and Geo-coordinate related data we too often find that location names do not match up. Naturally this is to be expected, since different data vendors will have different standardizations and practices when it comes to naming conventions. Utilizing a comprehensive place name library will allow us to quickly perform various actions, such as cross checking multiple data sources against each other with increased flexibility and match rates.

It may not be immediately apparent how useful a place name library like this is and what kind of avenues it can open up, but expect to see new and exciting developments from us in the coming months!