Date: Sun, 8 Sep 2002 15:35:43 -0400 (EDT)
From: Aaron Straup Cope
To: Karl Dubost
Cc: Steph
Subject: Glossaries: XPath, SAX and benchmarks

So, I sat down and did some tests this morning per our conversation about glossaries and XBEL and XPath. The results are a bit depressing, given the nature of the XPath query you need to pull stuff out of an XBEL document:

  "/xbel//bookmark[title=\"$keyword\"]/\@href"

Since the <bookmark> element can be either directly under the root <xbel> element or contained in an arbitrary number of nested <folder> elements, there isn't much to do except sniff around every node until you find what you're looking for. Which takes a long time. Longer than you'd normally want, anyway...

On the other hand, if you just use a plain old SAX widget to find the keyword, a lookup takes roughly 1/4 to 1/5 of the time.

Below are benchmarks for 100 iterations of a subroutine that does 5 keyword lookups against an XBEL file. Note that the XPath query doesn't even instantiate a new object; the same object is shared across all 500 calls to 'find'. The SAX query, on the other hand, instantiates a new filter and a new parser for each lookup.

Obviously, some clever caching of lookups would speed things up as well.

****

101 ->./debug.xbel
Benchmark: timing 100 iterations of xpathquery...
bquery: 765 wallclock secs (645.73 usr + 13.66 sys = 659.38 CPU) @ 0.15/s (n=100)

101 ->./debug.xbel
Benchmark: timing 100 iterations of saxquery_pureperl...
saxquery_pureperl: 171 wallclock secs (148.23 usr + 0.62 sys = 148.86 CPU) @ 0.67/s (n=100)

102 ->./debug.xbel
Benchmark: timing 100 iterations of saxquery_expat...
saxquery_expat: 171 wallclock secs (148.17 usr + 0.20 sys = 148.38 CPU) @ 0.67/s (n=100)

****

package Foo;
use base qw (XML::SAX::Base);

# Set the bookmark title to search for.
sub keyword {
    my $self = shift;
    $self->{'__keyword'} = $_[0];
}

# The href of the most recently seen bookmark; once a match has been
# found this is the href of the matching bookmark.
sub link {
    my $self = shift;
    return $self->{'__link'};
}

sub start_element {
    my $self = shift;
    my $data = shift;

    # Nothing left to do once the keyword has been found.
    return if ($self->{'__match'});

    if ((! $self->{'__bookmark'}) && ($data->{Name} eq "bookmark")) {
        $self->{'__bookmark'} = 1;
    }

    # Ignore everything outside a <bookmark>, e.g. folder titles.
    return if (! $self->{'__bookmark'});

    if ($data->{Name} eq "bookmark") {
        $self->{'__link'} = $data->{Attributes}->{'{}href'}->{Value};
    }

    $self->{'__title'} = 1 if ($data->{Name} eq "title");
}

sub end_element {
    my $self = shift;
    my $data = shift;

    return if ($self->{'__match'});

    if ($data->{Name} eq "title") {
        $self->{'__title'} = 0;
    }

    if ($data->{Name} eq "bookmark") {
        $self->{'__bookmark'} = 0;
    }
}

sub characters {
    my $self = shift;
    my $data = shift;

    return if ($self->{'__match'});
    return if (! $self->{'__bookmark'});
    return if (! $self->{'__title'});

    # Caveat: a parser may deliver one title in several characters()
    # calls; this compares a single chunk against the keyword.
    if ($data->{Data} eq $self->{'__keyword'}) {
        $self->{'__match'} = 1;
    }
}

package main;

my $file = "/usr/home/asc/aaronland.net/asc/webdev.xbel";

use XML::SAX::ParserFactory;
$XML::SAX::ParserPackage = "XML::SAX::Expat";

use Benchmark;

my $count = 100;

my @keywords = (
    'FilterProxy Home Page',
    "REX XML Shallow Parsing with Regular Expressions",
    "aaronland",
    "Schematron - XML Validation Language",
    ">RE ActivePerl mod_perl ppd available",
);

timethese($count, {
    saxquery_expat => sub {
        foreach my $kw (@keywords) {
            # A fresh filter and a fresh parser for every lookup.
            my $filter = Foo->new();
            $filter->keyword($kw);

            my $parser = XML::SAX::ParserFactory->parser(Handler=>$filter);
            $parser->parse_uri($file);
        }
    },
});

****

use XML::XPath;
use Benchmark;

my $file = "/usr/home/asc/aaronland.net/asc/webdev.xbel";
my $count = 100;

# One XML::XPath object, shared by every lookup.
my $xbel = XML::XPath->new(filename=>$file);

my @keywords = (
    'FilterProxy Home Page',
    "REX XML Shallow Parsing with Regular Expressions",
    "aaronland",
    "Schematron - XML Validation Language",
    ">RE ActivePerl mod_perl ppd available",
);

timethese($count, {
    xpathquery => sub {
        foreach my $title (@keywords) {
            my $query = "/xbel//bookmark[title=\"$title\"]/\@href";
            my $r = $xbel->find($query);
        }
    },
});
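
****

On the "clever caching" remark above: one possible reading, offered only as a rough sketch and not something that was benchmarked, is to walk the XBEL file once, remember every bookmark's title and href in a hash, and answer later lookups from that hash. The package name BookmarkIndex, its lookup method and the example.com URL are made up for illustration. The handler uses the same event hashes and the same '{}href' attribute key as the SAX filter above, and it appends text chunks so a title split across several characters() calls still matches. To keep the sketch runnable without any XML modules, the events are fed in by hand; a real run would presumably pass the object as Handler to XML::SAX::ParserFactory->parser and call parse_uri, as in the script above.

```perl
use strict;
use warnings;

# Hypothetical one-pass index: title => href for every <bookmark>.
package BookmarkIndex;

sub new {
    my $class = shift;
    return bless { index => {}, href => undef, buf => undef }, $class;
}

sub start_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'bookmark') {
        $self->{href} = $el->{Attributes}{'{}href'}{Value};
    }
    elsif ($el->{Name} eq 'title' && defined $self->{href}) {
        $self->{buf} = '';    # inside a bookmark title: start collecting
    }
}

sub characters {
    my ($self, $chars) = @_;
    # Append, since one title may arrive in several chunks.
    $self->{buf} .= $chars->{Data} if defined $self->{buf};
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'title' && defined $self->{buf}) {
        # Keep the first href seen for a given title.
        $self->{index}{ $self->{buf} } //= $self->{href};
        $self->{buf} = undef;
    }
    elsif ($el->{Name} eq 'bookmark') {
        $self->{href} = undef;
    }
}

# After the single pass, a lookup is a plain hash access.
sub lookup {
    my ($self, $keyword) = @_;
    return $self->{index}{$keyword};
}

package main;

my $index = BookmarkIndex->new;

# Hand-fed stand-in for parsing:
#   <xbel><folder><title>Perl</title>
#     <bookmark href="http://example.com/rex"><title>REX Shallow Parsing</title></bookmark>
#   </folder></xbel>
my @events = (
    [ start_element => { Name => 'xbel' } ],
    [ start_element => { Name => 'folder' } ],
    [ start_element => { Name => 'title' } ],
    [ characters    => { Data => 'Perl' } ],
    [ end_element   => { Name => 'title' } ],
    [ start_element => { Name => 'bookmark',
        Attributes => { '{}href' => { Value => 'http://example.com/rex' } } } ],
    [ start_element => { Name => 'title' } ],
    [ characters    => { Data => 'REX ' } ],              # split on purpose
    [ characters    => { Data => 'Shallow Parsing' } ],
    [ end_element   => { Name => 'title' } ],
    [ end_element   => { Name => 'bookmark' } ],
    [ end_element   => { Name => 'folder' } ],
    [ end_element   => { Name => 'xbel' } ],
);

for my $event (@events) {
    my ($method, $data) = @$event;
    $index->$method($data);
}

printf "%s\n", $index->lookup('REX Shallow Parsing') // '(no match)';  # http://example.com/rex
printf "%s\n", $index->lookup('Perl') // '(no match)';                 # folder title: (no match)
```

The trade-off is memory for time: the whole title-to-href map stays in memory, and it has to be rebuilt whenever the XBEL file changes. Where two bookmarks share a title, this sketch keeps the first one, which is also the one the early-exit filter above would stop at.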