In a previous article, I have shown how to convert HanDeDict (a free chinese-german dictionary) for using it in Apple's Dictionary. The main part was the generation of a big XML file using the Builder Library.
Generating the XML file now takes more than 4 minutes, almost all of the time spent in the XML generation. In this article, I show how to improve the performance of the Builder Library. Generating the XML file got up to 36% faster. If you would like to get the code, you may clone it from GitHub.
To test for performance, we use five scenarios. The first one is the XML generation for HanDeDict. In the second one, we generate an in-memory xml string with one million tags, each containing a short text:
require 'builder/xmlmarkup'
require 'benchmark'
result = Benchmark.measure do
xm = Builder::XmlMarkup.new
xm.instruct!
xm.links do
1000000.times do |i|
xm.link "Item #{i.to_s}"
end
end
x = xm.target!
end
puts result
The third example is test/performance.rb from the Builder library
itself. It renders XML with very few tags, but a lot of text:
#!/usr/bin/env ruby
require 'builder/xmlmarkup'
require 'benchmark'
text = "This is a test of the new xml markup. Iñtërnâtiônàlizætiøn" * 10000
include Benchmark # we need the CAPTION and FMTSTR constants
include Builder
n = 50
Benchmark.benchmark do |bm|
tf = bm.report("base") {
n.times do
x = XmlMarkup.new
x.text(text)
x.target!
end
}
def XmlMarkup._escape(text)
text.to_xs
end
tf = bm.report("to_xs") {
n.times do
x = XmlMarkup.new
x.text(text)
x.target!
end
}
end
It should be noted here that performance.rb actually tests a special case: The term Iñtërnâtiônàlizætiøn is in CP1252 (Windows) encoding, for which Builder provides a workaround. So as fourth scenario, we use the same file saved as UTF8. In the fifth and last scenario, we replace Iñtërnâtiônàlizætiøn with Internationalization to test ASCII performance.
The initial results are:
Example 1: 251.330000 1.280000 252.610000 (256.420148)
Example 2: 48.910000 0.240000 49.150000 ( 49.680870)
Example 3: base 97.400000 0.760000 98.160000 ( 98.609859)
to_xs 98.230000 0.730000 98.960000 ( 99.363800)
Example 4: base 98.250000 0.790000 99.040000 ( 99.441164)
to_xs 99.100000 0.760000 99.860000 (100.257412)
Example 5: base 88.380000 0.540000 88.920000 ( 89.046175)
to_xs 88.240000 0.630000 88.870000 ( 89.225532)
To get some hints where the time is spent, we install the ruby-perf
gem and change a copy of the second scenario to print out a report.
require 'builder'
require 'rubygems'
require 'ruby-prof'
result = RubyProf.profile do
xm = Builder::XmlMarkup.new
xm.instruct!
xm.links do
100000.times do |i|
xm.link i.to_s
end
end
x = xm.target!
end
RubyProf::FlatPrinter.new(result).print(STDOUT, 0)
Here are the first lines of the report. It mainly consists of a table listing how much time was spent in each method, and how much in child (read: called) methods. Note that due to the profiling, the scenario runs a lot slower, so we only write 100,000 nodes.
Thread ID: 136060 Total: 62.546996 %self total self wait child calls name at top level in 24.01 37.94 15.02 0.00 22.92 488898 Fixnum#xchr (/Users/tammofreese/Documents/builder/trunk/test/../lib/builder/xchar.rb at line 95 at top level in 9.73 9.18 6.08 0.00 3.10 1466694 Kernel#=== (ruby_runtime at line 0 at top level in 8.07 61.54 5.05 0.00 56.50 100000 Builder::XmlBase#method_missing-1 (/Users/tammofreese/Documents/builder/trunk/test/../lib/builder/xmlbase.rb at line 40 at top level in 6.60 6.24 4.13 0.00 2.11 977803 Hash#[] (ruby_runtime at line 0 at top level in 5.65 3.53 3.53 0.00 0.00 1666700 Fixnum#== (ruby_runtime at line 0 at top level in 5.06 5.23 3.16 0.00 2.06 488898 Range#=== (ruby_runtime at line 0 at top level in 4.13 40.52 2.58 0.00 37.94 100002 Array#map (ruby_runtime at line 0 at top level in 3.48 3.94 2.18 0.00 1.76 100001 Builder::XmlMarkup#_start_tag (/Users/tammofreese/Documents/builder/trunk/test/../lib/builder/xmlmarkup.rb at line 290 at top level in 3.38 2.11 2.11 0.00 0.00 977800 Hash#default (ruby_runtime at line 0 at top level in 3.30 2.06 2.06 0.00 0.00 977796 Fixnum#<=> (ruby_runtime at line 0
As we see, most of the time is spent in the Fixnum#xchr method, which is called almost
half a million times. So we target this method for performance improvement.
What's more, we see a huge number of calls to Kernel#===, Hash#[], Fixnum#==,
and Range#===.
Fixnum#xchr method:
def xchr(escape=true)
n = XChar::CP1252[self] || self
case n when *XChar::VALID
XChar::PREDEFINED[n] or (n<128 ? n.chr : (escape ? "&##{n};" : [n].pack('U*')))
else
'*'
end
end
It contains a fix for CP1252-encoded files from the windows world, then it decides whether
n is a valid unicode code point. Invalid numbers are mapped to a string containing an asterisk.
Valid numbers are mapped to strings containing the respective UTF-8 characters. Characters outside the ASCII range are optionally escaped.
The case n when *XChar::VALID may be the source of all the case equality operator and Fixnum#== calls.
Let's look at the definition of VALID:
VALID = [
0x9, 0xA, 0xD,
(0x20..0xD7FF),
(0xE000..0xFFFD),
(0x10000..0x10FFFF)
]
The first three elements of the array are the unicode codepoints of tab, linefeed and carriage return. Then, three ranges are used that contain a lot of unicode code points. A typical text will contain many characters in the first two ranges, and not many tabs, linefeeds, carriage returns, or characters from the third range (the unicode codepoints above 0xFFFF are not that often used). We change the array so that the common cases are checked first:
VALID = [
(0x20..0xD7FF),
(0xE000..0xFFFD),
0x9, 0xA, 0xD,
(0x10000..0x10FFFF)
]
After this change, the performance improves by about 3-16%:
Example 1: 244.080000 1.250000 245.330000 (249.456791)
Example 2: 43.220000 0.150000 43.370000 ( 43.529551)
Example 3: base 81.560000 0.560000 82.120000 ( 82.182873)
to_xs 82.370000 0.510000 82.880000 ( 82.938198)
Example 4: base 81.750000 0.670000 82.420000 ( 82.601055)
to_xs 82.400000 0.620000 83.020000 ( 83.276765)
Example 5: base 74.350000 0.620000 74.970000 ( 75.106275)
to_xs 74.410000 0.530000 74.940000 ( 75.092212)
The next change is to cache all three constants CP1252, VALID
and PREDEFINED directly in Fixnum:
VALID = Builder::XChar::VALID if ! defined?(VALID)
PREDEFINED = Builder::XChar::PREDEFINED if ! defined?(PREDEFINED)
CP1252 = Builder::XChar::CP1252 if ! defined?(CP1252)
def xchr(escape=true)
n = CP1252[self] || self
case n when *VALID
PREDEFINED[n] or (n<128 ? n.chr : (escape ? "&##{n};" : [n].pack('U*')))
else
'*'
end
end
Now the scenarios run faster by about 5-23%:
Example 1: 237.950000 1.200000 239.150000 (243.143989)
Example 2: 41.710000 0.260000 41.970000 ( 46.755385)
Example 3: base 76.360000 0.670000 77.030000 ( 77.289001)
to_xs 77.210000 0.600000 77.810000 ( 78.047175)
Example 4: base 75.260000 0.760000 76.020000 ( 76.714416)
to_xs 76.090000 0.670000 76.760000 ( 77.237976)
Example 5: base 67.790000 0.690000 68.480000 ( 68.834157)
to_xs 67.740000 0.510000 68.250000 ( 68.404581)
Then, we eliminate the local variable n:
def xchr(escape=true)
return CP1252[self].xchr(escape) if CP1252.key?(self)
case self when *VALID
PREDEFINED[self] or (self<128 ? self.chr : (escape ? "&##{self};" : [self].pack('U*')))
else
'*'
end
end
Again, the scenarios run faster. We have now improvements of about 6-29% compared to the original implementation:
Example 1: 234.610000 1.350000 235.960000 (240.837129)
Example 2: 40.040000 0.180000 40.220000 ( 48.501536)
Example 3: base 71.440000 0.670000 72.110000 ( 72.325842)
to_xs 72.300000 0.600000 72.900000 ( 73.041512)
Example 4: base 70.460000 0.700000 71.160000 ( 71.500979)
to_xs 71.080000 0.570000 71.650000 ( 71.827479)
Example 5: base 62.980000 0.630000 63.610000 ( 63.944886)
to_xs 62.910000 0.530000 63.440000 ( 63.588532)
In Fixnum#xchr, the CP1252 fix is always applied. It maps CP1252 codes to their unicode counterparts:
CP1252 = { # :nodoc:
128 => 8364, # euro sign
130 => 8218, # single low-9 quotation mark
131 => 402, # latin small letter f with hook
132 => 8222, # double low-9 quotation mark
133 => 8230, # horizontal ellipsis
134 => 8224, # dagger
135 => 8225, # double dagger
136 => 710, # modifier letter circumflex accent
137 => 8240, # per mille sign
138 => 352, # latin capital letter s with caron
139 => 8249, # single left-pointing angle quotation mark
140 => 338, # latin capital ligature oe
142 => 381, # latin capital letter z with caron
145 => 8216, # left single quotation mark
146 => 8217, # right single quotation mark
147 => 8220, # left double quotation mark
148 => 8221, # right double quotation mark
149 => 8226, # bullet
150 => 8211, # en dash
151 => 8212, # em dash
152 => 732, # small tilde
153 => 8482, # trade mark sign
154 => 353, # latin small letter s with caron
155 => 8250, # single right-pointing angle quotation mark
156 => 339, # latin small ligature oe
158 => 382, # latin small letter z with caron
159 => 376, # latin capital letter y with diaeresis
}
But this does not always make sense. As an example, let us look at the number 134. A string containing only this number in one byte should be regarded as CP1252-encoded as it is invalid UTF-8. Therefore it should be mapped to the corresponding unicode code point 8224. However, a string that contains the UTF-8 bytes for codepoint 134 must not be changed. We add a test to check whether this is a problem in the current implementation:
def test_do_not_fix_utf8_as_win_1252 assert_equal '†', "\x86".to_xs # CP1252 dagger assert_equal '†', "\xC2\x86".to_xs # UTF-8 reverse line feed end
Turns out it is a problem: the test fails. The method String#to_xs which calls Fixnum#xchr looks like this:
def to_xs(escape=true)
unpack('U*').map {|n| n.xchr(escape)}.join # ASCII, UTF-8
rescue
unpack('C*').map {|n| n.xchr}.join # ISO-8859-1, WIN-1252
end
CP1252 encoding does only make sense if the UTF-8 unpack fails. So we change the code to
CP1252 = Builder::XChar::CP1252 if ! defined?(CP1252)
def to_xs(escape=true)
unpack('U*').map {|n| n.xchr(escape)}.join # ASCII, UTF-8
rescue
unpack('C*').map {|n| (CP1252[n] || n).xchr}.join # ISO-8859-1, WIN-1252
end
Then, we delete the CP1252 constant and the first line of Fixnum#xchar:
class Fixnum
VALID = Builder::XChar::VALID if ! defined?(VALID)
PREDEFINED = Builder::XChar::PREDEFINED if ! defined?(PREDEFINED)
def xchr(escape=true)
case self when *VALID
PREDEFINED[self] or (self<128 ? self.chr : (escape ? "&##{self};" : [self].pack('U*')))
else
'*'
end
end
end
Now the test passes. As the CP1252 fix is not applied for UTF-8 anymore, we get quite a performance boost. Our version now needs 10-37% less time than the original implementation:
Example 1: 225.130000 1.230000 226.360000 (229.601577)
Example 2: 36.010000 0.110000 36.120000 ( 36.246593)
Example 3: base 76.140000 0.680000 76.820000 ( 77.060833)
to_xs 77.080000 0.630000 77.710000 ( 77.986746)
Example 4: base 62.260000 0.580000 62.840000 ( 63.280986)
to_xs 62.710000 0.560000 63.270000 ( 63.408446)
Example 5: base 55.040000 0.620000 55.660000 ( 55.865656)
to_xs 54.780000 0.510000 55.290000 ( 55.367186)
Another bug we have spotted in String#to_xs is that escaping is ignored if the CP1252 fix is applied.
We add a test to confirm that:
def test_win_1252_escaping assert_equal '€', "\x80".to_xs(true) # euro with escaping assert_equal '€', "\x80".to_xs(false) # euro without escaping end
The test fails as expected. Now we pass the escape parameter to Fixnum#to_xs:
def to_xs(escape=true)
unpack('U*').map {|n| n.xchr(escape)}.join # ASCII, UTF-8
rescue
unpack('C*').map {|n| (CP1252[n] || n).xchr(escape)}.join # ISO-8859-1, WIN-1252
end
Here is what is left of Fixnum#xchr:
def xchr(escape=true)
case self when *VALID
PREDEFINED[self] or (self<128 ? self.chr : (escape ? "&##{self};" : [self].pack('U*')))
else
'*'
end
end
So far, we tried to make Fixnum#xchr faster. But maybe it is possible to do move the whole Fixnum#xchr functionality directly inside the String#to_xs method? We would not need a method call per character anymore, which should give us a huge performance boost. So what has to be done in String#to_xs?
PREDEFINED with the mapped strings.Here is the new implementation:
def to_xs(escape=true)
result = begin
unpack('U*')
dup
rescue
unpack('C*').map! {|n| CP1252[n] || n }.pack("U*")
end
result.gsub!(INVALID_UTF8_MATCHER, "*")
result.gsub!(PREDEFINED_UTF_MATCHER) { |match| PREDEFINED[match] }
result.gsub!(NON_ASCII_UTF8_MATCHER) { |match| "&##{match.unpack('U').first};"} if escape
result
end
What we still need to do is to define the constants. First, we define a helper lambda in String
that converts a given Integer or Range to a character class entry. As an example, 0x41 will be converted to "A"
and (0x41..0x5A) to "A-Z":
to_character_class_entry = lambda do |entry|
next "#{[entry].pack('U')}" if entry.is_a? Integer
next "#{[entry.first].pack('U')}-#{[entry.last].pack('U')}" if entry.is_a? Range
end
Using this helper, we define INVALID_UTF8_MATCHER:
INVALID_UTF8_MATCHER = /[^#{Builder::XChar::VALID.map(&to_character_class_entry).join}]/u
For NON_ASCII_UTF8_MATCHER, we need all the values from VALID
which are greater than 0x7F. To get those, we roll back VALID and split the first range:
VALID = [
0x9, 0xA, 0xD,
(0x20..0x7F),
(0x80..0xD7FF),
(0xE000..0xFFFD),
(0x10000..0x10FFFF)
]
Now we can define NON_ASCII_UTF8_MATCHER:
NON_ASCII_UTF8_MATCHER = /[#{Builder::XChar::VALID[4..-1].map(&to_character_class_entry).join}]/u
The PREDEFINED that we need in String#to_xs is like the one in Builder::XChar,
except that it needs Strings instead of Fixnums as keys:
PREDEFINED = Builder::XChar::PREDEFINED.inject({}) { |sum, (key, val)| sum[[key].pack('U')] = val; sum }
The PREDEFINED_UTF8_MATCHER constant references a regular expression that matches all the characters
with unicode codepoints contained in Builder::XChar::PREDEFINED.keys:
PREDEFINED_UTF_MATCHER = /[#{Builder::XChar::PREDEFINED.keys.map(&to_character_class_entry).join}]/u
Here are the performance results for the bootleneck-avoiding version of String#to_xs:
Example 1: 159.390000 0.970000 160.360000 (163.721333)
Example 2: 17.000000 0.080000 17.080000 ( 17.265984)
Example 3: base 47.410000 0.470000 47.880000 ( 48.071752)
to_xs 47.390000 0.470000 47.860000 ( 47.995763)
Example 4: base 13.910000 0.350000 14.260000 ( 14.326577)
to_xs 14.510000 0.360000 14.870000 ( 14.921875)
Example 5: base 1.550000 0.230000 1.780000 ( 1.788842)
to_xs 1.470000 0.230000 1.700000 ( 1.703043)
The final version needs 36-98% less time than the original version (ruby1.8.6p114). The last number is somewhat misleading, as the corresonding test tests the to_xs performance for huge ASCII texts, which may not be the typical case. However, even for the second example, which builds a structure with small nodes, the execution is almost three times as fast as before.
Patches for the bug fixes and the performance fixes were sent to the maintainers of the builder library, unfortunately not much of it has shown up in their repository. So if you would like to use the improved version, grab a copy of the source from GitHub.