Index of /dev/null

Tagged: unicode

This is not an update of Libidn to IDNA2008. The goal is to be able, with simple means, to suppress on demand the IDNA2003 mapping of code points in the PVALID category (RFC 5892), in particular "ß" (U+00DF; LATIN SMALL LETTER SHARP S). The starting point is the need to process domain names containing "ß" within the DE zone at short notice.

read article

As always when something keeps growing larger and more complicated, a trend toward localization emerges. The Internet is a topological space that has become so high-dimensional that it can only be explained as a covering of some incomprehensible something by local charts. The powerhouse of globalization now longs for semantics and seeks social contacts and organic structures. It wants to be close to people, to know their neighborhood and to speak their dialect.

read article

Indeed, it is! Accidentally, while testing some random stuff against my IDN validation function, I found out that the word


is the result of the Punycode Algorithm applied to the Unicode sequence

U+37F0 U+37E6 U+37F3 U+37EE U+37EC U+37E0.

These are six Chinese characters from the CJK Unified Ideographs Extension A block. I had never before seen Punycode produce a meaningful word, and I think this is an extremely rare case. So I couldn't help but register both and immediately.

I've no idea what to do with it yet - we'll see :)

The discovery was announced on first.

Starting 10 December 2009, companies and private persons based in the European Union will be able to register .eu Internationalised Domain Names. The list of supported characters is divided into several parts, called IDN scripts, such as "Latin-1 supplement", "Greek extended", "Cyrillic" and the like. Indeed, I might consider getting


Unfortunately, scripts cannot be mixed within one name, so β won’t be a valid name, since ASCII letters belong to the Latin script while β is Greek. (Well, I think I’ll give up that idea ;))

To get serious: as a EURid registrar, it’s time for us to check several issues that may arise with IDN requests built from all those strange letters Europeans may use.

For instance, note that



are completely different domain names. But this is just an optical trick, since the first one starts with an ordinary ASCII "a" while the second starts with U+0430, which is the Unicode notation for the Cyrillic small letter "a". Indeed, when you enter the second one in your browser, it will first compute the corresponding ACE string xn-- using the Punycode algorithm and then build the DNS request with it.
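The look-alike effect can be sketched in Python, whose built-in `idna` codec implements the IDNA 2003 conversion. The two labels below are hypothetical stand-ins chosen for illustration, not the names from this post.

```python
import unicodedata

# Hypothetical labels for illustration; they are not the names from the post.
ascii_label = "apple"          # starts with U+0061 LATIN SMALL LETTER A
cyrillic_label = "\u0430pple"  # starts with U+0430 CYRILLIC SMALL LETTER A


def to_ace(label: str) -> str:
    """Convert one label to its ACE form using Python's IDNA 2003 codec."""
    return label.encode("idna").decode("ascii")


for label in (ascii_label, cyrillic_label):
    first = label[0]
    # Show the code point behind the first glyph and the label sent to DNS.
    print(f"U+{ord(first):04X} {unicodedata.name(first)} -> {to_ace(label)}")
```

The pure ASCII label passes through unchanged, while the label with the Cyrillic letter becomes an `xn--` string, so the two names that look alike on screen are unrelated in DNS.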

Things are getting more complicated when you notice that


on the one hand, and


on the other hand indeed are the same domain name. Applying the punycode algorithm to both of them, you will get

for the first one and

for the second one, because the two byte streams differ. Now, something goes wrong here: when you plan to ask a nameserver for the IP address to access the domain, you will have to decide which one to ask for. A nameserver is unlikely to answer both of them.

Well, according to the IDNA standard as defined in RFC 3490, applications must not only run Punycode on IDNs but must first apply the nameprep algorithm, which in turn consists of several normalization mappings such as lowercase conversion and, more interestingly, Unicode Normalization Form KC. The latter decomposes characters by Unicode compatibility equivalence and then recomposes them canonically. Thus the character U+0140, the Latin small letter "l" with middle dot, decomposes into two Unicode characters:

U+0140 => U+006C + U+00B7

That is, the middle dot is separated from the letter "l". Therefore, xn-- is an application of Punycode, but not a conversion in the sense of the IDNA standard. Indeed, you will get different results for that domain from different so-called IDN converter libraries, depending on whether they just do Punycode or apply a proper nameprep first. A reliable reference is, for example, the Verisign conversion tool and the accompanying SDK (although I wasn’t able to get the Win32 version working).

After all, the challenge for the registrar is to maintain the request database properly, accepting IDN requests in both the normalized and any equivalent form. Moreover, one has to check a requested name against the given character list, which contains non-normalized letters, even if the requested name is already normalized. Since normalization isn’t a reversible mapping, this may be complicated in general, but it should be solvable in this case.
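One possible way to approach that check, sketched here under loud assumptions: instead of trying to invert the normalization, push the character list forward through nameprep and compare normalized names against that image. `ALLOWED` is a made-up excerpt, not the real .eu character list, and `is_acceptable` is a hypothetical helper name.

```python
from encodings.idna import nameprep

# Hypothetical excerpt of a registry character list (non-normalized letters).
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-") | {"\u0140", "\u00e9"}

# Image of the list under nameprep: every character a listed letter can turn into.
ALLOWED_PREPARED = {ch for letter in ALLOWED for ch in nameprep(letter)}


def is_acceptable(label: str) -> bool:
    """Accept a label if its nameprep form uses only characters from the image."""
    try:
        prepared = nameprep(label)
    except UnicodeError:  # prohibited characters or bidi violations
        return False
    return all(ch in ALLOWED_PREPARED for ch in prepared)


print(is_acceptable("col\u0140leccio"))  # as typed, with U+0140
print(is_acceptable("col\u00b7leccio"))  # already normalized form
print(is_acceptable("\u0430pple"))       # Cyrillic letter not in the list
```

This check is deliberately coarse. It works per character, so it would also accept a middle dot that does not follow an "l"; a stricter version would have to match whole decomposition sequences.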

read article