### understanding unicode surrogates / or: how to deal with Linear B strings in .NET

Remember that a String object in .NET is a sequence of Char objects, where a Char object in turn is documented as a Unicode character but is really a 16-bit unsigned integer, i.e. a single UTF-16 code unit.

Thus, more precisely speaking, a single Char object is able to encode any code point within the Basic Multilingual Plane (BMP), i.e. between U+0000 and U+FFFF. So, where does the rest of the story go? Unicode, as a universal character set, is of course designed to support many more than 65536 characters.

Now, the trick is to encode code points from 2^{16} upward by so-called surrogates, that is, by pairs of 16-bit integers. To see how this works, remember the well-known division algorithm for integers: fix an exponent M and an integer constant C with 0 < C < M. Then for any integer N in the range 0 ≤ N < 2^{M} there exists a unique pair of integers H, L such that

N = 2^{C} * H + L, where 0 ≤ L < 2^{C} and 0 ≤ H < 2^{M - C}.

That way the 2^{M} numbers N are encoded by the 2^{M-C} * 2^{C} pairs of numbers (H, L). Hence 2^{M} large numbers are addressed using a set of only 2^{C} + 2^{M-C} small numbers; that's the trick.
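To make the division step concrete, here is a small worked instance. The values M = 4 and C = 2 are toy parameters chosen purely for illustration; they are not the ones Unicode uses.

```latex
% toy parameters (illustration only): M = 4, C = 2, so 0 <= N < 16
\begin{align*}
N &= 13 \\
H &= \lfloor 13 / 2^{2} \rfloor = 3, \qquad 0 \le H < 2^{M-C} = 4 \\
L &= 13 \bmod 2^{2} = 1, \qquad 0 \le L < 2^{C} = 4 \\
N &= 2^{2} \cdot 3 + 1 = 13
\end{align*}
```

With these toy values, all 16 numbers N are addressed by pairs drawn from only 4 + 4 = 8 small numbers.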

Since we want to encode integers from 2^{16} upward by pairs of 16-bit integers, we assume

2^{16} ≤ N' < 2^{16} + 2^{M},

and work with N = N' - 2^{16}. Finally, in order to decide whether a given 16-bit number belongs to a surrogate pair, and whether it plays the role of H or L, fix a suitable constant T and set

H' = H + T, L' = L + T + 2^{M-C}.

This tags all 16-bit integers I with T ≤ I < T + 2^{M-C} + 2^{C} as surrogates: the high surrogates of type H' are those below T + 2^{M-C}, and the ones from T + 2^{M-C} upward are the low surrogates of type L'.

Now, the setting of Unicode is this: C = 10, M = 20, T = 0xD800. So the high surrogates lie in 0xD800..0xDBFF and the low surrogates in 0xDC00..0xDFFF. By reserving these 2048 small integers as surrogates, more than a million additional code points up to U+10FFFF become accessible. The resulting formulas may be found here: http://www.unicode.org/book/ch03.pdf.
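The arithmetic can be sketched in a few lines of C#. This is only a minimal illustration of the formulas with C = 10, M = 20, T = 0xD800; the class and method names (`SurrogateMath`, `Encode`, `Decode`) are made up for this post and are not part of the framework.

```csharp
using System;

// Hypothetical helper, for illustration only; the framework already
// ships Char.ConvertFromUtf32() and Char.ConvertToUtf32() for real use.
public static class SurrogateMath
{
    private const int C = 10;              // bits carried by the low surrogate
    private const int T = 0xD800;          // first high surrogate
    private const int LowStart = 0xDC00;   // T + 2^(M-C): first low surrogate

    // Splits a code point in U+10000..U+10FFFF into its surrogate pair.
    public static (char High, char Low) Encode(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int n = codePoint - 0x10000;       // N = N' - 2^16
        int h = n >> C;                    // H = N div 2^C
        int l = n & ((1 << C) - 1);        // L = N mod 2^C
        return ((char)(T + h), (char)(LowStart + l));
    }

    // Reassembles the code point: N' = 2^16 + 2^C * H + L.
    public static int Decode(char high, char low)
    {
        if (!Char.IsHighSurrogate(high) || !Char.IsLowSurrogate(low))
            throw new ArgumentException("not a surrogate pair");

        return 0x10000 + ((high - T) << C) + (low - LowStart);
    }
}
```

For instance, for U+10016 we get N = 0x16, hence H = 0 and L = 0x16, so `SurrogateMath.Encode(0x10016)` yields the pair 0xD800, 0xDC16, which is the same pair that `Char.ConvertFromUtf32(0x10016)` produces.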

Thankfully, .NET unicoders don't need to do this arithmetic by hand, because it comes ready-made. For instance, consider the name of Amnissos written in Linear B.

In C# it looks like this:

```
// alternatively the Char.ConvertFromUtf32() method may be used
string amnisos = "\U00010000" + "\U00010016" + "\U0001001B" + "\U00010030";
```
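The comment in the snippet above mentions Char.ConvertFromUtf32(). A minimal sketch of that alternative might look as follows; the wrapper class `LinearB` and its method `FromCodePoints` are hypothetical names used only here.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class LinearB
{
    // Builds a string from UTF-32 code points; Char.ConvertFromUtf32()
    // returns a two-Char string for every code point above U+FFFF.
    public static string FromCodePoints(params int[] codePoints)
    {
        var sb = new StringBuilder();
        foreach (int cp in codePoints)
        {
            sb.Append(Char.ConvertFromUtf32(cp));
        }
        return sb.ToString();
    }
}
```

Calling `LinearB.FromCodePoints(0x10000, 0x10016, 0x1001B, 0x10030)` should give a string equal to the literal built from `\U` escapes.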

Note that the Length property of the resulting string is indeed 8, although it contains only 4 Unicode characters. So the appropriate way of accessing the actual characters of an arbitrary string is to use System.Globalization.TextElementEnumerator rather than naively indexing Char objects. It goes like this:

```
// using System;
// using System.Globalization;
TextElementEnumerator en = StringInfo.GetTextElementEnumerator(amnisos);
while (en.MoveNext())
{
    string current = en.GetTextElement();
    if (Char.IsSurrogate(current, 0))
    {
        // a surrogate pair encoding one character, i.e. current.Length == 2
        int codepoint = Char.ConvertToUtf32(current[0], current[1]);
        Console.WriteLine("U+{0:X6}", codepoint);
    }
    else
    {
        // characters within BMP:
        // current.Length > 1 may be true in case of combining characters
        // cf. StringInfo.ParseCombiningCharacters()
        foreach (char c in current)
        {
            int codepoint = (int)c; // use AscW() in VB.NET
            Console.WriteLine("U+{0:X4}", codepoint);
        }
    }
}
```
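If you only want the code points and don't care about text elements, walking the string by index is an alternative. The following is a rough sketch under that assumption; `CodePoints.Enumerate` is a made-up name, while Char.IsSurrogatePair(String, Int32) and Char.ConvertToUtf32(String, Int32) are the framework methods doing the actual work.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class CodePoints
{
    // Yields one integer per code point; combining characters are
    // reported as separate code points, unlike with text elements.
    public static IEnumerable<int> Enumerate(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            if (Char.IsSurrogatePair(s, i))
            {
                yield return Char.ConvertToUtf32(s, i);
                i++; // skip the low surrogate just consumed
            }
            else
            {
                // BMP character (or an unpaired surrogate, passed through as-is)
                yield return s[i];
            }
        }
    }
}
```

Unlike the enumerator above, this reports a base letter and a following combining accent as two separate code points, so which of the two approaches fits depends on whether you want code points or what the user perceives as characters.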

Now, when will we be able to register Linear B domain names at last?
