Sunday, August 29, 2010

Brain Teasing With Strings

And in today's post... a riddle.
Actually, it comes down to a short C# program that deals with string comparisons, and demonstrates a series of "unexpected" behaviors. How "unexpected" are the results? Well, it depends on the reader's degree of experience at string comparisons.
Lately this sort of "brain-teasers" became quite popular on different forums/newsgroups, but it seems like no matter where those teasers are posted, it never comes with a full in-depth explanation for all of the "dark-magic" going on behind the scenes, that could potentially explain the demonstrated "weird" and "unexpected" behaviors. So basically, I'll attempt to make things a little bit clearer than it is now.
So without anymore introductions, please tell me, what would be the output for the following program?

static void Main()
{
    char[] a = new char[0];
    string b = new string(a);
    string c = new string(new char[0]);
    string d = new string(new char[0]);

    Console.WriteLine(object.ReferenceEquals(a, b));
    Console.WriteLine(object.ReferenceEquals(b, c));
    Console.WriteLine(object.ReferenceEquals(c, d));
}

When this question is presented to a group of experienced developers, it's most likely to result in a variety of different answers, where each of the answers is backed up with some explanation that usually "makes sense". The problem is that they usually just "makes sense", and doesn't refer to some hard proofs that can uncover the mystery completely.
So, are you ready with your answers? Let's check the output:

False
True
True
"Hmm... that was interesting".
Let's see what do we got here. On the one hand, a isn't equal to b. This makes sense since after all, they are difference instances of different types. But on the other hand, we can also see that b is equal to c and d as well. This could appear to be a little confusing since as we know, strings are supposed to be immutable, so according to the code, we are "supposed" to create three different string instances (mind you that we aren't dealing with user-hardcoded, interned strings here).
It appears that even though we've create three different arrays, we've only allocated a single string (that one that b, c and d references).

Usually, when a char[] (or, some other kind of string-representing structure) is passed to a string constructor, a new string is allocated to represent the array's characters. However, as our brief program demonstrates, this isn't always the case.
To understand why this happens, we'll have to dig into the implementation of the string class, and more specifically, to it's internal implementation in SString (the following code samples are taken from SSCLI's implementation).
Looking at SString's internal constructor, we could notice that the char[] argument that was received is checked whether it's set to NULL or it's length is equal to zero. If it does, the implementation completely "forgets" about that argument, and instead uses the new SString instance to reference an interned version of an empty string (so basically, the redundant allocation of the new, empty string, is optimized away).
We could see this behavior in SString's Set and Clear functions (the Set function is invoked from the constructor).

void SString::Set(const WCHAR *string)
{
    if (string == NULL || *string == 0)
        Clear();
    else
    {
        Resize((COUNT_T) wcslen(string), REPRESENTATION_UNICODE);
        wcscpy_s(GetRawUnicode(), GetBufferSizeInCharIncludeNullChar(), string);
    }

    RETURN;
}

void SString::Clear()
{
    SetRepresentation(REPRESENTATION_EMPTY);

    if (IsImmutable())
    {
        // Use shared empty string rather than allocating a new buffer
        SBuffer::SetImmutable(s_EmptyBuffer, sizeof(s_EmptyBuffer));
    }
    else
    {
        SBuffer::TweakSize(sizeof(WCHAR));
        GetRawUnicode()[0] = 0;
    }

    RETURN;
}

5 comments:

  1. I guess this technique is called as "String Interning". http://en.wikipedia.org/wiki/String_interning.

    ReplyDelete
  2. @Magesh
    Correct, the CLR obviously rather to use a pre-allocated interned string instead of creatinga new, identical one.

    ReplyDelete
  3. Actually, instead of digging out the implementation, you could have just looked in the documentation of the String(char[]) ctor: http://msdn.microsoft.com/en-us/library/ttyxaek9.aspx

    It says right there that »If value is null or contains no element, an Empty instance is initialized«. So it is *not* an implementation detail but rather a specified behavior.

    ReplyDelete
  4. @Joey,
    Indeed, though "digging out the implementation" is so much more fun.

    ReplyDelete
  5. Great article!. And +1 to comment - "@Joey,
    Indeed, though "digging out the implementation" is so much more fun."

    ReplyDelete