.NET String.Replace() performance uncovered

In a previous entry I ran into some unexpected results with the performance of String.Replace() in the .NET Framework 1.1. Upon further digging, it turns out that Replace is only fast when the input string does not contain the string (or character) slated for replacement. When the string does contain it, the performance of the original CleanString method drops and, as expected, the character-array approach wins. So here's an updated bit of code that behaves no worse than the original when there is nothing to replace, and better than the original when there is cleaning to do:

        public static string CleanString(string source)
        {
            // the result buffer is only allocated if it is needed;
            // otherwise, the method only does a simple scan
            char[] result = null;

            int resultIndex = 0;
            foreach (char c in source)
            {
                if (IsBadCharacter(c))
                {
                    if (null == result)
                    {
                        result = new string(' ', source.Length).ToCharArray();
                        // copy over the good characters scanned so far,
                        // or the prefix would be lost to the Trim below
                        source.CopyTo(0, result, 0, resultIndex);
                    }
                }
                else
                {
                    if (null != result)
                    {
                        result[resultIndex] = c;
                    }
                    resultIndex++;
                }
            }

            return (null == result) ? source : new string(result).Trim();
        }

        private static bool IsBadCharacter(char c)
        {
            return char.IsDigit(c);
        }
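
If you want to see the two paths for yourself, here's a quick sketch of the kind of check I mean (the method name, the inputs, and the 100,000-run loop are arbitrary; Environment.TickCount is coarse, but it's what we have on .NET 1.1):

    public static void CompareReplacePaths()
    {
        string noDigits  = new string('a', 100); // Replace finds nothing
        string allDigits = new string('1', 100); // Replace hits every char

        int start = Environment.TickCount;
        for (int i = 0; i < 100000; i++)
        {
            noDigits.Replace("1", "");
        }
        Console.WriteLine("No match:  {0}ms", Environment.TickCount - start);

        start = Environment.TickCount;
        for (int i = 0; i < 100000; i++)
        {
            allDigits.Replace("1", "");
        }
        Console.WriteLine("All match: {0}ms", Environment.TickCount - start);
    }

The first loop is the fast path, since Replace bails out early when there is nothing to find; the second is where the allocation cost shows up.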

Unexpected string performance in .NET 1.1

A project I started recently centered on improving the performance of some existing code. My first inclination was to scan the code for obvious low-hanging fruit, and one of the first methods I hit upon was this:

    public static string CleanString(string Text)
    {
        // cleans a string for fuzzy matching:
        // trims white space and removes numbers

        Text = Text.Trim();
        Text = Text.Replace("0", "");
        Text = Text.Replace("1", "");
        Text = Text.Replace("2", "");
        Text = Text.Replace("3", "");
        Text = Text.Replace("4", "");
        Text = Text.Replace("5", "");
        Text = Text.Replace("6", "");
        Text = Text.Replace("7", "");
        Text = Text.Replace("8", "");
        Text = Text.Replace("9", "");
        //Text = Text.Replace(".", "");

        return Text;
    }

This was pretty obviously bad. Strings are immutable, so the author, in effect, allocates up to 11 new strings per call (one per Trim and Replace). This function is called quite often, so it seemed like a worthwhile place to optimize. My first thought was to throw a Regex at it, as follows:

    using System.Text.RegularExpressions;

    /// <summary>
    /// Summary description for Test.
    /// </summary>
    public sealed class Test
    {
        // note: @"\d*" also matches the empty string at every position,
        // forcing extra zero-length replacements; @"\d+" would be tighter
        private static Regex _scrubExpression = new Regex(@"\d*");

        /// <summary>
        /// Cleans a string: trims white space, and removes numbers.
        /// </summary>
        /// <param name="source"></param>
        /// <returns></returns>
        public static string CleanStringOptimized1(string source)
        {
            return _scrubExpression.Replace(source.Trim(), "");
        }
    }

You'd expect that to just work. Think again. Over 100,000 runs, the Regex code turned out to be nearly thirty times slower. OK, so maybe the Regex engine's overhead just kills this. Fine. Let's try something simpler, like the good old StringBuilder:

    public static string CleanStringOptimized3(string source)
    {
        StringBuilder output = new StringBuilder(source.Trim());

        output.Replace("0", "")
              .Replace("1", "")
              .Replace("2", "")
              .Replace("3", "")
              .Replace("4", "")
              .Replace("5", "")
              .Replace("6", "")
              .Replace("7", "")
              .Replace("8", "")
              .Replace("9", "");

        return output.ToString();
    }

Yup. This was slower too. So maybe the answer was to go simple: do one up-front allocation for the result and work with the characters directly. Let's try this:

    public static string CleanStringOptimized2(string source)
    {
        char[] result = new string(' ', source.Length).ToCharArray();
        char[] sourceChars = source.ToCharArray();

        int resultIndex = 0;
        for (int count = 0; count < source.Length; count++)
        {
            char c = sourceChars[count];
            if (!char.IsDigit(c))
            {
                result[resultIndex] = c;
                resultIndex++;
            }
        }

        return new string(result).Trim();
    }

Yes, you guessed it. This, too, runs slower than the original. In fact, the results look like this:

Original comparison took 359.375ms over 100000 runs.
Optimized 1 comparison took 10203.125ms over 100000 runs.
Optimized 2 comparison took 421.875ms over 100000 runs.
Optimized 3 comparison took 718.75ms over 100000 runs.
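
For reference, a minimal sketch of the sort of loop-and-time harness that produces output like this (the delegate shape and the sample input are placeholders, not the exact test code; DateTime.Now is coarse, around 15ms, but fine at this scale):

    private delegate string CleanStringDelegate(string source);

    private static void Compare(string label, CleanStringDelegate clean)
    {
        const int runs = 100000;
        string input = "  some sample text 12345  "; // placeholder test data

        DateTime start = DateTime.Now;
        for (int i = 0; i < runs; i++)
        {
            clean(input);
        }
        double ms = (DateTime.Now - start).TotalMilliseconds;
        Console.WriteLine("{0} comparison took {1}ms over {2} runs.", label, ms, runs);
    }

Each variant then gets a call like Compare("Original", new CleanStringDelegate(CleanString)).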

If anyone has any insight into why my optimizations aren't optimizing, I would love to hear it! Incidentally, dropping down to IL revealed nothing out of the ordinary.

Missing VSS menus in VS.NET 2003

I opened up VS.NET 2003 today and realized that the Source Control menus were nowhere in sight. If that happens to you, here's a quick tip: re-register SSSCC.dll and everything will be OK again.

First, locate SSSCC.dll on your machine; it typically lives in the win32 folder of your Visual SourceSafe installation. Then open a command prompt in that folder and run "regsvr32 ssscc.dll". Restart VS.NET and all should be well.

Good luck!