Index of /dev/null

Starting 10 December 2009, companies and private persons based in the European Union will be able to register .eu Internationalised Domain Names (IDNs). The list of supported characters is divided into several parts, called IDN scripts, such as "Latin-1 supplement", "Greek extended", "Cyrillic" and the like. Indeed, I might consider getting

http://www.β-ιστός-κούτσουρο.eu

Unfortunately, scripts cannot be mixed within a name, so β-blog.eu won’t be valid: ASCII letters belong to the Latin script while β is Greek. (Well, so I think I’ll give up that idea ;))
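To make the no-mixing rule concrete, here is a minimal sketch in Python. It is an assumption-laden approximation, not the registry's actual check: it takes the first word of each letter's Unicode character name ("LATIN", "GREEK", "CYRILLIC", ...) as a stand-in for the script, and the function names are made up for illustration.

```python
import unicodedata

def scripts_used(label):
    """Return a rough set of script names for the letters in `label`.

    Assumption: the first word of the Unicode character name is used as
    a proxy for the script. Digits and the hyphen are not letters and
    are therefore skipped.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }

def mixes_scripts(label):
    """True if the letters of `label` come from more than one script."""
    return len(scripts_used(label)) > 1

# U+03B2 (Greek beta) followed by ASCII "-blog": Greek and Latin mixed
print(mixes_scripts("\u03b2-blog"))
# A Greek-only label with a hyphen
print(mixes_scripts("\u03b2-\u03b9\u03c3\u03c4\u03cc\u03c2"))
```

A real implementation would rather compare each character against the registry's published per-script character lists.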

To get serious: as a EURid registrar, it’s time for us to look into several issues that may come up with IDN requests built from all those strange letters Europeans may use.

For instance, note that

a1.eu

and

а1.eu

are completely different domain names, although they look identical. This is just an optical trick: the first one starts with an ordinary ASCII "a", while the second starts with U+0430, which is the Unicode code point of the Cyrillic small letter "a". Indeed, when you type the second one into your browser, it will first compute the corresponding ACE string xn--1-7sb.eu using the Punycode algorithm and then issue the DNS request with it.
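The difference can be checked with Python's built-in "punycode" codec. This is only a sketch: the helper name to_ace is made up here, and it applies plain Punycode to a single label without any further preparation.

```python
def to_ace(label):
    """Return the ACE form of one label using plain Punycode.

    ASCII-only labels are returned unchanged; everything else gets the
    "xn--" prefix followed by the Punycode encoding.
    """
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")

ascii_label = "a1"            # ordinary ASCII "a" followed by "1"
cyrillic_label = "\u04301"    # U+0430 (Cyrillic "a") followed by "1"

print(ascii_label == cyrillic_label)   # False
print(to_ace(ascii_label))             # a1
print(to_ace(cyrillic_label))          # xn--1-7sb
```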

Things get more complicated when you notice that

aŀt.eu

on the one hand, and

al·t.eu

on the other hand are in fact the same domain name. Applying the Punycode algorithm to both of them directly, you will get

xn--at-rqa.eu

for the first one and

xn--alt-mga.eu

for the second one, because the two code point sequences differ. Now, something goes wrong here: when you ask a nameserver for the IP address of the domain, you have to decide which one to ask for. A nameserver is unlikely to answer to both of them.

Well, according to the IDNA standard as defined in RFC 3490, applications not only have to apply Punycode to IDNs, but also have to apply the Nameprep algorithm first, which in turn consists of several normalization mappings such as lower-case conversion and, more interestingly, the Unicode Normalization Form KC (see http://unicode.org/reports/tr15/). The latter decomposes characters by Unicode compatibility equivalence and then recomposes them by canonical equivalence. Thus, the character U+0140, the Latin small letter "l" with middle dot, decomposes into two Unicode characters:

U+0140 => U+006C + U+00B7

That is, the middle dot is separated from the letter "l". Therefore, xn--at-rqa.eu is an application of Punycode, but not a conversion in the sense of the IDNA standard. Indeed, you will get different results for that domain from different so-called IDN converter libraries, depending on whether they are just doing Punycode or applying a proper Nameprep first. A reliable reference is, for example, the Verisign conversion tool (http://mct.verisign-grs.com/index.shtml) and the corresponding SDK (although I wasn’t able to get the Win32 version working).
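The two behaviours can be reproduced with the Python standard library. The sketch below is not a full Nameprep: the helper nameprep_then_punycode only does case folding plus NFKC as a simplified stand-in, and both helper names are made up for illustration. The last lines use Python's built-in "idna" codec, which implements RFC 3490 including Nameprep.

```python
import unicodedata

def raw_punycode(label):
    """Punycode only, without normalization (not an IDNA conversion)."""
    return "xn--" + label.encode("punycode").decode("ascii")

def nameprep_then_punycode(label):
    """Simplified stand-in for Nameprep (case folding + NFKC), then Punycode."""
    prepared = unicodedata.normalize("NFKC", label.casefold())
    return "xn--" + prepared.encode("punycode").decode("ascii")

with_l_middle_dot = "a\u0140t"    # a, U+0140, t
with_separate_dot = "al\u00b7t"   # a, l, U+00B7, t

# Plain Punycode yields two different ACE strings
print(raw_punycode(with_l_middle_dot))   # xn--at-rqa
print(raw_punycode(with_separate_dot))   # xn--alt-mga

# After normalization both spellings map to the same ACE string
print(nameprep_then_punycode(with_l_middle_dot))   # xn--alt-mga
print(nameprep_then_punycode(with_separate_dot))   # xn--alt-mga

# The built-in codec works on whole domain names
print("a\u0140t.eu".encode("idna").decode("ascii"))    # xn--alt-mga.eu
print("al\u00b7t.eu".encode("idna").decode("ascii"))   # xn--alt-mga.eu
```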

After all, the challenge for the registrar is to maintain the request database properly, accepting IDN requests in both the normalized and any equivalent form. Moreover, one has to check a requested name against the given character list, which contains non-normalized letters, even if the requested name is already normalized. Since normalization isn’t a reversible mapping, this may be complicated in general, but it should be solvable in this case.
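One possible way to approach both tasks is sketched below. Everything in it is an assumption made for illustration: the ALLOWED set is a tiny hypothetical character list (not the registry's real one), the function names are invented, and NFKC with lower-casing stands in for the full Nameprep. The idea is to normalize every allowed character once and then test whether a normalized name can be split into those normalized pieces.

```python
import unicodedata

def nfkc(text):
    return unicodedata.normalize("NFKC", text)

# Hypothetical character list: ASCII letters, digits, hyphen and U+0140.
# Note that the bare middle dot U+00B7 is NOT on this list.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u0140"}

# Normalized form of every allowed character, e.g. U+0140 -> "l" + U+00B7
ALLOWED_NORMALIZED = {nfkc(ch) for ch in ALLOWED}
MAX_PIECE = max(len(piece) for piece in ALLOWED_NORMALIZED)

def request_key(name):
    """Database key under which all equivalent spellings collide."""
    return nfkc(name.lower())

def is_acceptable(name):
    """True if the normalized name splits into normalized allowed characters."""
    target = request_key(name)
    # reachable[i] is True if target[:i] can be built from allowed pieces
    reachable = [True] + [False] * len(target)
    for end in range(1, len(target) + 1):
        for size in range(1, min(MAX_PIECE, end) + 1):
            if reachable[end - size] and target[end - size:end] in ALLOWED_NORMALIZED:
                reachable[end] = True
                break
    return reachable[-1]

print(request_key("a\u0140t") == request_key("al\u00b7t"))  # True
print(is_acceptable("al\u00b7t"))   # True: "l" + U+00B7 comes from U+0140
print(is_acceptable("a\u00b7t"))    # False: a bare middle dot is not allowed
```

With this approach the already normalized spelling is accepted only because its "l" plus middle dot pair is the normalized image of a listed character, while a middle dot in any other position is rejected.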
