Index of /dev/null

understanding unicode surrogates / or: how to deal with Linear B strings in .NET

Remember that a String object in .NET is a sequence of Char objects, where a Char in turn is documented as a Unicode character, represented by a 16-bit unsigned integer (a UTF-16 code unit).

Thus, more precisely speaking, a single Char object is able to encode any code point within the Basic Multilingual Plane (BMP), i.e. between U+0000 and U+FFFF. So, where does the rest of the story go? Unicode, as a universal character set, is of course designed to support many more than 65536 characters.

Now, the trick is to encode code points from 2^16 upwards by so-called surrogates, that is, by pairs of 16-bit integers. To see how this works, remember the well-known division algorithm for integers. That is, if you fix an exponent M and an integer constant C (0 < C < M), then for any integer N in the range 0 ≤ N < 2^M there exists a unique pair of integers H, L such that

N = 2^C * H + L, where 0 ≤ L < 2^C and 0 ≤ H < 2^(M-C).

That way you have simply encoded these 2^M numbers N by 2^C * 2^(M-C) pairs of numbers H, L. Hence 2^M large numbers are addressed using a set of only 2^C + 2^(M-C) small numbers; that's the trick.
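
Since the divisor is a power of two, the division above amounts to a shift and a mask. Here is a minimal sketch, written as C# 9+ top-level statements; the values C = 10 and M = 20 are sample parameters picked for illustration, and Split and Join are hypothetical helper names, not part of .NET:

```csharp
using System;

// Sample parameters, chosen for illustration only.
const int C = 10;
const int M = 20;

// Split N (0 <= N < 2^M) into the pair (H, L) with N = 2^C * H + L.
(int H, int L) Split(int n)
{
    if (n < 0 || n >= (1 << M))
        throw new ArgumentOutOfRangeException(nameof(n));
    int quotient = n >> C;                // 0 <= H < 2^(M-C)
    int remainder = n & ((1 << C) - 1);   // 0 <= L < 2^C
    return (quotient, remainder);
}

// Reassemble N from the pair (H, L).
int Join(int quotient, int remainder) => (quotient << C) + remainder;

var (hi, lo) = Split(0x12345);
// prints: H = 0x48, L = 0x345, N = 0x12345
Console.WriteLine($"H = 0x{hi:X}, L = 0x{lo:X}, N = 0x{Join(hi, lo):X}");
```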

As we are interested in encoding integers from 2^16 upwards by pairs of 16-bit integers, we should start from the assumption

2^16 ≤ N' < 2^16 + 2^M,

and then work with N = N' - 2^16. In order to decide whether a given 16-bit number belongs to a surrogate pair, and whether it plays the role of H or of L, finally fix a suitable constant T and set

H' = H + T, L' = L + T + 2^(M-C),

thus tagging all 16-bit integers I with T ≤ I < T + 2^(M-C) + 2^C as surrogates, where the high surrogates of type H' are those below T + 2^(M-C) and the ones at or above it are the low surrogates of type L'.

Now, the Unicode setting is this: C = 10, M = 20, T = 0xD800 (so M - C = 10 as well). So, by reserving 2048 small integers (0xD800 to 0xDFFF) as surrogates, more than a million additional code points, up to U+10FFFF, become accessible. The resulting formulas may be found here: http://www.unicode.org/book/ch03.pdf.
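
With these constants plugged in, the whole scheme fits into a few lines. The following is only a sketch to make the arithmetic concrete (C# 9+ top-level statements; ToSurrogates and FromSurrogates are hypothetical helper names), cross-checked against the framework's own Char.ConvertFromUtf32():

```csharp
using System;

const int C = 10;       // number of bits carried by the low surrogate
const int T = 0xD800;   // first surrogate value

// Encode a supplementary code point (U+10000..U+10FFFF) as a surrogate pair.
(char High, char Low) ToSurrogates(int codePoint)
{
    if (codePoint < 0x10000 || codePoint > 0x10FFFF)
        throw new ArgumentOutOfRangeException(nameof(codePoint));
    int n = codePoint - 0x10000;                                // N = N' - 2^16
    char high = (char)(T + (n >> C));                           // H' = H + T
    char low = (char)(T + (1 << C) + (n & ((1 << C) - 1)));     // L' = L + T + 2^10
    return (high, low);
}

// Decode a surrogate pair back into its code point.
int FromSurrogates(char high, char low) =>
    0x10000 + ((high - T) << C) + (low - T - (1 << C));

var (hs, ls) = ToSurrogates(0x10016);
// prints: U+10016 = D800 DC16
Console.WriteLine($"U+{FromSurrogates(hs, ls):X4} = {(int)hs:X4} {(int)ls:X4}");

// Cross-check against the framework's conversion; prints: True
string viaFramework = Char.ConvertFromUtf32(0x10016);
Console.WriteLine(viaFramework[0] == hs && viaFramework[1] == ls);
```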

Thankfully, .NET Unicode programmers don't need to deal with this hex arithmetic at all, because it comes ready-made. For instance, consider the name of Amnissos written in Linear B:

U+10000 U+10016 U+1001B U+10030

In C# it looks like this:

// alternatively the Char.ConvertFromUtf32() method may be used
string amnisos = "\U00010000" + "\U00010016" + "\U0001001B" + "\U00010030";
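
The alternative mentioned in the comment could look roughly like this. It is a sketch in C# 9+ top-level statements; the variable names are made up for illustration, and it builds the string from numeric code points, then compares it with the escaped literal:

```csharp
using System;
using System.Globalization;

// The four Linear B code points from the text, one per syllable.
int[] codePoints = { 0x10000, 0x10016, 0x1001B, 0x10030 };

// Convert each code point to its (two-Char) string and concatenate.
string built = string.Concat(
    Array.ConvertAll(codePoints, cp => Char.ConvertFromUtf32(cp)));
string literal = "\U00010000" + "\U00010016" + "\U0001001B" + "\U00010030";

Console.WriteLine(built == literal);                             // True
Console.WriteLine(built.Length);                                 // 8 (UTF-16 code units)
Console.WriteLine(new StringInfo(built).LengthInTextElements);   // 4 (text elements)
```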

Note that the Length property of the resulting string is indeed 8, although it contains only 4 Unicode characters. So the appropriate way of getting at the actual code points of an arbitrary string is to use System.Globalization.TextElementEnumerator rather than naively accessing the Char objects one by one. It goes like this:

// using System.Globalization;
TextElementEnumerator en = StringInfo.GetTextElementEnumerator(amnisos);
while (en.MoveNext())
{
  string current = en.GetTextElement();
  if (Char.IsSurrogate(current, 0))
  {
    // a surrogate pair encoding one character, so current.Length >= 2
    // (longer only if combining characters follow the pair)
    int codepoint = Char.ConvertToUtf32(current, 0);
    Console.WriteLine("U+{0:X4}", codepoint);
  }
  else
  {
    // characters within BMP:
    // current.Length > 1 may be true in case of combining characters
    // cf. StringInfo.ParseCombiningCharacters()
    foreach (char c in current)
    {
      int codepoint = c; // use AscW() in VB.NET
      Console.WriteLine("U+{0:X4}", codepoint);
    }
  }
}
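
Keep in mind that text elements are not quite the same thing as code points: a base character followed by combining marks forms a single text element. If the individual code points are all you are after, a plain index loop using Char.ConvertToUtf32(String, Int32) and Char.IsHighSurrogate() is a possible alternative. A minimal sketch (C# 9+ top-level statements; CodePoints is a hypothetical helper name and the sample string is made up for illustration):

```csharp
using System;
using System.Collections.Generic;

// Yield every code point of s, consuming two Chars per surrogate pair.
IEnumerable<int> CodePoints(string s)
{
    for (int i = 0; i < s.Length; i++)
    {
        // ConvertToUtf32 throws ArgumentException on an unpaired surrogate.
        int codePoint = Char.ConvertToUtf32(s, i);
        if (Char.IsHighSurrogate(s, i))
            i++;   // skip the low surrogate that was just consumed
        yield return codePoint;
    }
}

// "e" + COMBINING ACUTE ACCENT (one text element, two code points),
// followed by the Linear B syllable at U+10000.
string sample = "e\u0301\U00010000";
foreach (int value in CodePoints(sample))
    Console.WriteLine("U+{0:X4}", value);   // U+0065, U+0301, U+10000
```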

Now, when will we be able to register Linear B domain names at last? ;)
