Tuning the Performance of the Ruby Builder Library

In a previous article, I have shown how to convert HanDeDict (a free chinese-german dictionary) for using it in Apple's Dictionary. The main part was the generation of a big XML file using the Builder Library.

Generating the XML file now takes more than 4 minutes, almost all of the time spent in the XML generation. In this article, I show how to improve the performance of the Builder Library. Generating the XML file got up to 36% faster. If you would like to get the code, you may clone it from GitHub.

Finding the Bottleneck

To test for performance, we use five scenarios. The first one is the XML generation for HanDeDict. In the second one, we generate an in-memory xml string with one million tags, each containing a short text:

require 'builder/xmlmarkup'
require 'benchmark'

result = Benchmark.measure do
  xm = Builder::XmlMarkup.new
  xm.instruct!
  xm.links do
    1000000.times do |i|
      xm.link "Item #{i.to_s}"
    end
  end
  x = xm.target!
end
puts result

The third example is test/performance.rb from the Builder library itself. It renders XML with very few tags, but a lot of text:

#!/usr/bin/env ruby

require 'builder/xmlmarkup'
require 'benchmark'

text = "This is a test of the new xml markup. Iñtërnâtiônàlizætiøn" * 10000

include Benchmark          # we need the CAPTION and FMTSTR constants
include Builder
n = 50
Benchmark.benchmark do |bm|
  tf = bm.report("base")   {
    n.times do
      x = XmlMarkup.new
      x.text(text)
      x.target!
    end
  }
  def XmlMarkup._escape(text)
    text.to_xs
  end
  tf = bm.report("to_xs")   {
    n.times do
      x = XmlMarkup.new
      x.text(text)
      x.target!
    end
  }
end

It should be noted here that performance.rb actually tests a special case: The term Iñtërnâtiônàlizætiøn is in CP1252 (Windows) encoding, for which Builder provides a workaround. So as fourth scenario, we use the same file saved as UTF8. In the fifth and last scenario, we replace Iñtërnâtiônàlizætiøn with Internationalization to test ASCII performance.

The initial results are:

Example 1:      251.330000   1.280000 252.610000 (256.420148)
Example 2:       48.910000   0.240000  49.150000 ( 49.680870)
Example 3:  base 97.400000   0.760000  98.160000 ( 98.609859)
           to_xs 98.230000   0.730000  98.960000 ( 99.363800)
Example 4:  base 98.250000   0.790000  99.040000 ( 99.441164)
           to_xs 99.100000   0.760000  99.860000 (100.257412)
Example 5:  base 88.380000   0.540000  88.920000 ( 89.046175)
           to_xs 88.240000   0.630000  88.870000 ( 89.225532)

To get some hints where the time is spent, we install the ruby-perf gem and change a copy of the second scenario to print out a report.

require 'builder'
require 'rubygems'
require 'ruby-prof'

result = RubyProf.profile do
  xm = Builder::XmlMarkup.new
  xm.instruct!
  xm.links do
    100000.times do |i|
      xm.link i.to_s
    end
  end
  x = xm.target!
end
RubyProf::FlatPrinter.new(result).print(STDOUT, 0)

Here are the first lines of the report. It mainly consists of a table listing how much time was spent in each method, and how much in child (read: called) methods. Note that due to the profiling, the scenario runs a lot slower, so we only write 100,000 nodes.

Thread ID: 136060
Total: 62.546996

 %self     total     self     wait    child    calls  name
at top level in 24.01 37.94 15.02 0.00 22.92 488898 Fixnum#xchr (/Users/tammofreese/Documents/builder/trunk/test/../lib/builder/xchar.rb at line 95
at top level in 9.73 9.18 6.08 0.00 3.10 1466694 Kernel#=== (ruby_runtime at line 0
at top level in 8.07 61.54 5.05 0.00 56.50 100000 Builder::XmlBase#method_missing-1 (/Users/tammofreese/Documents/builder/trunk/test/../lib/builder/xmlbase.rb at line 40
at top level in 6.60 6.24 4.13 0.00 2.11 977803 Hash#[] (ruby_runtime at line 0
at top level in 5.65 3.53 3.53 0.00 0.00 1666700 Fixnum#== (ruby_runtime at line 0
at top level in 5.06 5.23 3.16 0.00 2.06 488898 Range#=== (ruby_runtime at line 0
at top level in 4.13 40.52 2.58 0.00 37.94 100002 Array#map (ruby_runtime at line 0
at top level in 3.48 3.94 2.18 0.00 1.76 100001 Builder::XmlMarkup#_start_tag (/Users/tammofreese/Documents/builder/trunk/test/../lib/builder/xmlmarkup.rb at line 290
at top level in 3.38 2.11 2.11 0.00 0.00 977800 Hash#default (ruby_runtime at line 0
at top level in 3.30 2.06 2.06 0.00 0.00 977796 Fixnum#<=> (ruby_runtime at line 0

As we see, most of the time is spent in the Fixnum#xchr method, which is called almost half a million times. So we target this method for performance improvement. What's more, we see a huge number of calls to Kernel#===, Hash#[], Fixnum#==, and Range#===.

Improving the Performance at the Bottleneck

Let's have a look at the Fixnum#xchr method:
  def xchr(escape=true)
    n = XChar::CP1252[self] || self
    case n when *XChar::VALID
      XChar::PREDEFINED[n] or (n<128 ? n.chr : (escape ? "&##{n};" : [n].pack('U*')))
    else
      '*'
    end
  end

It contains a fix for CP1252-encoded files from the windows world, then it decides whether n is a valid unicode code point. Invalid numbers are mapped to a string containing an asterisk. Valid numbers are mapped to strings containing the respective UTF-8 characters. Characters outside the ASCII range are optionally escaped.

The case n when *XChar::VALID may be the source of all the case equality operator and Fixnum#== calls. Let's look at the definition of VALID:

  VALID = [
    0x9, 0xA, 0xD,
    (0x20..0xD7FF), 
    (0xE000..0xFFFD),
    (0x10000..0x10FFFF)
  ]

The first three elements of the array are the unicode codepoints of tab, linefeed and carriage return. Then, three ranges are used that contain a lot of unicode code points. A typical text will contain many characters in the first two ranges, and not many tabs, linefeeds, carriage returns, or characters from the third range (the unicode codepoints above 0xFFFF are not that often used). We change the array so that the common cases are checked first:

  VALID = [
    (0x20..0xD7FF), 
    (0xE000..0xFFFD),
    0x9, 0xA, 0xD,
    (0x10000..0x10FFFF)
  ]

After this change, the performance improves by about 3-16%:

Example 1:      244.080000   1.250000 245.330000 (249.456791)
Example 2:       43.220000   0.150000  43.370000 ( 43.529551)
Example 3:  base 81.560000   0.560000  82.120000 ( 82.182873)
           to_xs 82.370000   0.510000  82.880000 ( 82.938198)
Example 4:  base 81.750000   0.670000  82.420000 ( 82.601055)
           to_xs 82.400000   0.620000  83.020000 ( 83.276765)
Example 5:  base 74.350000   0.620000  74.970000 ( 75.106275)
           to_xs 74.410000   0.530000  74.940000 ( 75.092212)

The next change is to cache all three constants CP1252, VALID and PREDEFINED directly in Fixnum:

VALID = Builder::XChar::VALID if ! defined?(VALID)
PREDEFINED = Builder::XChar::PREDEFINED if ! defined?(PREDEFINED)
CP1252 = Builder::XChar::CP1252 if ! defined?(CP1252)

def xchr(escape=true)
  n = CP1252[self] || self
  case n when *VALID
    PREDEFINED[n] or (n<128 ? n.chr : (escape ? "&##{n};" : [n].pack('U*')))
  else
    '*'
  end
end

Now the scenarios run faster by about 5-23%:

Example 1:      237.950000   1.200000 239.150000 (243.143989)
Example 2:       41.710000   0.260000  41.970000 ( 46.755385)
Example 3:  base 76.360000   0.670000  77.030000 ( 77.289001)
           to_xs 77.210000   0.600000  77.810000 ( 78.047175)
Example 4:  base 75.260000   0.760000  76.020000 ( 76.714416)
           to_xs 76.090000   0.670000  76.760000 ( 77.237976)
Example 5:  base 67.790000   0.690000  68.480000 ( 68.834157)
           to_xs 67.740000   0.510000  68.250000 ( 68.404581)

Then, we eliminate the local variable n:

def xchr(escape=true)
  return CP1252[self].xchr(escape) if CP1252.key?(self)
  case self when *VALID
    PREDEFINED[self] or (self<128 ? self.chr : (escape ? "&##{self};" : [self].pack('U*')))
  else
    '*'
  end
end

Again, the scenarios run faster. We have now improvements of about 6-29% compared to the original implementation:

Example 1:      234.610000   1.350000 235.960000 (240.837129)
Example 2:       40.040000   0.180000  40.220000 ( 48.501536)
Example 3:  base 71.440000   0.670000  72.110000 ( 72.325842)
           to_xs 72.300000   0.600000  72.900000 ( 73.041512)
Example 4:  base 70.460000   0.700000  71.160000 ( 71.500979)
           to_xs 71.080000   0.570000  71.650000 ( 71.827479)
Example 5:  base 62.980000   0.630000  63.610000 ( 63.944886)
           to_xs 62.910000   0.530000  63.440000 ( 63.588532)

Intermezzo: Fixing two Bugs along the Way

In Fixnum#xchr, the CP1252 fix is always applied. It maps CP1252 codes to their unicode counterparts:

CP1252 = {        # :nodoc:
  128 => 8364,    # euro sign
  130 => 8218,    # single low-9 quotation mark
  131 =>  402,    # latin small letter f with hook
  132 => 8222,    # double low-9 quotation mark
  133 => 8230,    # horizontal ellipsis
  134 => 8224,    # dagger
  135 => 8225,    # double dagger
  136 =>  710,    # modifier letter circumflex accent
  137 => 8240,    # per mille sign
  138 =>  352,    # latin capital letter s with caron
  139 => 8249,    # single left-pointing angle quotation mark
  140 =>  338,    # latin capital ligature oe
  142 =>  381,    # latin capital letter z with caron
  145 => 8216,    # left single quotation mark
  146 => 8217,    # right single quotation mark
  147 => 8220,    # left double quotation mark
  148 => 8221,    # right double quotation mark
  149 => 8226,    # bullet
  150 => 8211,    # en dash
  151 => 8212,    # em dash
  152 =>  732,    # small tilde
  153 => 8482,    # trade mark sign
  154 =>  353,    # latin small letter s with caron
  155 => 8250,    # single right-pointing angle quotation mark
  156 =>  339,    # latin small ligature oe
  158 =>  382,    # latin small letter z with caron
  159 =>  376,    # latin capital letter y with diaeresis
}

But this does not always make sense. As an example, let us look at the number 134. A string containing only this number in one byte should be regarded as CP1252-encoded as it is invalid UTF-8. Therefore it should be mapped to the corresponding unicode code point 8224. However, a string that contains the UTF-8 bytes for codepoint 134 must not be changed. We add a test to check whether this is a problem in the current implementation:

def test_do_not_fix_utf8_as_win_1252
  assert_equal '&#8224;', "\x86".to_xs         # CP1252 dagger
  assert_equal '&#134;',  "\xC2\x86".to_xs     # UTF-8 reverse line feed
end

Turns out it is a problem: the test fails. The method String#to_xs which calls Fixnum#xchr looks like this:

def to_xs(escape=true)
  unpack('U*').map {|n| n.xchr(escape)}.join # ASCII, UTF-8
rescue
  unpack('C*').map {|n| n.xchr}.join # ISO-8859-1, WIN-1252
end

CP1252 encoding does only make sense if the UTF-8 unpack fails. So we change the code to

CP1252 = Builder::XChar::CP1252 if ! defined?(CP1252)

def to_xs(escape=true)
  unpack('U*').map {|n| n.xchr(escape)}.join # ASCII, UTF-8
rescue
  unpack('C*').map {|n| (CP1252[n] || n).xchr}.join # ISO-8859-1, WIN-1252
end

Then, we delete the CP1252 constant and the first line of Fixnum#xchar:

class Fixnum
  VALID = Builder::XChar::VALID if ! defined?(VALID)
  PREDEFINED = Builder::XChar::PREDEFINED if ! defined?(PREDEFINED)

  def xchr(escape=true)
    case self when *VALID
      PREDEFINED[self] or (self<128 ? self.chr : (escape ? "&##{self};" : [self].pack('U*')))
    else
      '*'
    end
  end
end

Now the test passes. As the CP1252 fix is not applied for UTF-8 anymore, we get quite a performance boost. Our version now needs 10-37% less time than the original implementation:

Example 1:      225.130000   1.230000 226.360000 (229.601577)
Example 2:       36.010000   0.110000  36.120000 ( 36.246593)
Example 3:  base 76.140000   0.680000  76.820000 ( 77.060833)
           to_xs 77.080000   0.630000  77.710000 ( 77.986746)
Example 4:  base 62.260000   0.580000  62.840000 ( 63.280986)
           to_xs 62.710000   0.560000  63.270000 ( 63.408446)
Example 5:  base 55.040000   0.620000  55.660000 ( 55.865656)
           to_xs 54.780000   0.510000  55.290000 ( 55.367186)

Another bug we have spotted in String#to_xs is that escaping is ignored if the CP1252 fix is applied. We add a test to confirm that:

def test_win_1252_escaping
  assert_equal '&#8364;', "\x80".to_xs(true)   # euro with escaping
  assert_equal '€', "\x80".to_xs(false)        # euro without escaping
end

The test fails as expected. Now we pass the escape parameter to Fixnum#to_xs:

def to_xs(escape=true)
  unpack('U*').map {|n| n.xchr(escape)}.join # ASCII, UTF-8
rescue
  unpack('C*').map {|n| (CP1252[n] || n).xchr(escape)}.join # ISO-8859-1, WIN-1252
end

Avoiding the Bottleneck

Here is what is left of Fixnum#xchr:

def xchr(escape=true)
  case self when *VALID
    PREDEFINED[self] or (self<128 ? self.chr : (escape ? "&##{self};" : [self].pack('U*')))
  else
    '*'
  end
end

So far, we tried to make Fixnum#xchr faster. But maybe it is possible to do move the whole Fixnum#xchr functionality directly inside the String#to_xs method? We would not need a method call per character anymore, which should give us a huge performance boost. So what has to be done in String#to_xs?

  1. Apply the CP1252 fix if needed.
  2. Replace invalid characters with '*'.
  3. Replace the characters for the code points in PREDEFINED with the mapped strings.
  4. Escape if needed.

Here is the new implementation:

def to_xs(escape=true)
  result = begin
    unpack('U*')
    dup
  rescue
    unpack('C*').map! {|n| CP1252[n] || n }.pack("U*")
  end
  result.gsub!(INVALID_UTF8_MATCHER, "*")
  result.gsub!(PREDEFINED_UTF_MATCHER) { |match| PREDEFINED[match] }
  result.gsub!(NON_ASCII_UTF8_MATCHER) { |match| "&##{match.unpack('U').first};"} if escape
  result
end

What we still need to do is to define the constants. First, we define a helper lambda in String that converts a given Integer or Range to a character class entry. As an example, 0x41 will be converted to "A" and (0x41..0x5A) to "A-Z":

to_character_class_entry = lambda do |entry|
  next "#{[entry].pack('U')}" if entry.is_a? Integer
  next "#{[entry.first].pack('U')}-#{[entry.last].pack('U')}" if entry.is_a? Range
end 

Using this helper, we define INVALID_UTF8_MATCHER:

INVALID_UTF8_MATCHER = /[^#{Builder::XChar::VALID.map(&to_character_class_entry).join}]/u

For NON_ASCII_UTF8_MATCHER, we need all the values from VALID which are greater than 0x7F. To get those, we roll back VALID and split the first range:

  VALID = [
    0x9, 0xA, 0xD,
    (0x20..0x7F), 
    (0x80..0xD7FF), 
    (0xE000..0xFFFD),
    (0x10000..0x10FFFF)
  ]

Now we can define NON_ASCII_UTF8_MATCHER:

NON_ASCII_UTF8_MATCHER = /[#{Builder::XChar::VALID[4..-1].map(&to_character_class_entry).join}]/u

The PREDEFINED that we need in String#to_xs is like the one in Builder::XChar, except that it needs Strings instead of Fixnums as keys:

PREDEFINED = Builder::XChar::PREDEFINED.inject({}) { |sum, (key, val)| sum[[key].pack('U')] = val; sum }

The PREDEFINED_UTF8_MATCHER constant references a regular expression that matches all the characters with unicode codepoints contained in Builder::XChar::PREDEFINED.keys:

PREDEFINED_UTF_MATCHER = /[#{Builder::XChar::PREDEFINED.keys.map(&to_character_class_entry).join}]/u

Results

Here are the performance results for the bootleneck-avoiding version of String#to_xs:

Example 1:      159.390000   0.970000 160.360000 (163.721333)
Example 2:       17.000000   0.080000  17.080000 ( 17.265984)
Example 3:  base 47.410000   0.470000  47.880000 ( 48.071752)
           to_xs 47.390000   0.470000  47.860000 ( 47.995763)
Example 4:  base 13.910000   0.350000  14.260000 ( 14.326577)
           to_xs 14.510000   0.360000  14.870000 ( 14.921875)
Example 5:  base  1.550000   0.230000   1.780000 (  1.788842)
           to_xs  1.470000   0.230000   1.700000 (  1.703043)

The final version needs 36-98% less time than the original version (ruby1.8.6p114). The last number is somewhat misleading, as the corresonding test tests the to_xs performance for huge ASCII texts, which may not be the typical case. However, even for the second example, which builds a structure with small nodes, the execution is almost three times as fast as before.

Patches for the bug fixes and the performance fixes were sent to the maintainers of the builder library, unfortunately not much of it has shown up in their repository. So if you would like to use the improved version, grab a copy of the source from GitHub.