Wednesday, 25 June, 2008

Escape from XML Parsing Hell

I just spent 3 days on the most futile effort of my life, or so it seems at the moment. I'm posting this in the hope that it might help someone who Googles "slow XML java parser" in the future. I've been parsing a lot of XHTML files using Java for a project. However, it was taking about 1s per file, and the time was the same whether the file was 15k or 150k. Clearly something was wrong. It was also the same with the Xerces SAX parser, the DOM parser, the dom4j parser, the Piccolo SAX parser, ... nothing was fast enough. All the benchmarks I could find said it should be on the order of 20ms per file.

Anyway, I brought home my laptop tonight because I was still so frustrated with this stupid slow XML parsing problem and I couldn't put it away. I started running my test program when not connected to the internet, and it generated an exception saying it couldn't connect to www.w3.org. So, I was thinking, why the heck is it doing that? I'd already set the parser to be non-validating.

It turns out the parsers all fetch any external DTD that is referenced, even if the parser is non-validating! So for every file, the DOCTYPE declaration at the top was referencing the XHTML DTD, and the parser was downloading it from an external site. So, if you have an abnormally slow XML parser, maybe this is why!

Setting this Xerces feature to false disables that, in case you are ever parsing XHTML in the future and don't care about validation:

xmlReader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

Now it parses 100 trials in 250ms total (down from about 55 seconds). And, I guess my program has stopped retrieving the same XHTML DTD from www.w3.org thousands of times an hour (sorry w3.org, it wasn't a DoS attack, I swear).
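For anyone who wants to see that one line in context, here is a minimal sketch of how it might fit into a SAX setup. It assumes the parser behind SAXParserFactory is Xerces or Xerces-derived (as the JDK's built-in one is), so that the feature is recognized. The class name, the countElements helper, and the sample document are made up for illustration and are not from my actual project.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; not part of the original project.
public class NoExternalDtdExample {

    // Xerces-specific feature; other parsers may throw SAXNotRecognizedException.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // Builds a non-validating reader that should not fetch the external DTD.
    static XMLReader newReader() throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false); // on its own, this does NOT stop the DTD download
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setFeature(LOAD_EXTERNAL_DTD, false);
        return reader;
    }

    // Parses the document and returns how many start tags were seen.
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        XMLReader reader = newReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body><p>hi</p></body></html>";
        System.out.println("elements: " + countElements(xhtml));
    }
}
```

A quick way to check that it works is the same way I found the problem: unplug the network and see whether the parse still succeeds.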

Phew! What a waste of time. I would never have thought to check for unexpected network connections. Glad my internet wasn't working at home.

3 days of work, help from colleagues, testing various parsers, running on various machines, profiling my code, ... for a one-line fix that I found by accident. Sometimes I hate computers.

Now, time to move on with my super-cool-top-secret project. :)

2 comments:

Shahan Khatchadourian said...

I use Woodstox (http://woodstox.codehaus.org/), which is a great XML parser. It's supposed to be one of the fastest due to its StAX-based processing (as opposed to SAX). I never ran into the same problem you did, but it's good to know that feature may sometimes be enabled by default.

Retreat Searcer said...

Not every SAX parser supports the http://apache.org/xml/features/nonvalidating/load-external-dtd feature. Piccolo, for one, does not.

However, you can do the same thing by setting an EntityResolver which doesn't do anything. :-)
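A rough sketch of that idea, in case it helps: the resolver hands the parser an empty document instead of letting it go to the network. Note that returning null from resolveEntity tells the parser to fall back to its default behaviour (fetching the URL), so the resolver has to return an empty InputSource. The class name, the requested list, and the countElements helper are invented for this example, and it has only been thought through against the JDK's built-in SAX parser, not Piccolo.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only.
public class EmptyEntityResolverExample {

    // System IDs the parser asked for, recorded so we can see what was intercepted.
    static final List<String> requested = new ArrayList<>();

    // Answers every external entity request with an empty stream.
    // Returning null here would let the parser fetch the URL itself.
    static final EntityResolver EMPTY_RESOLVER = (publicId, systemId) -> {
        requested.add(systemId);
        return new InputSource(new StringReader(""));
    };

    // Parses the document with the empty resolver and counts start tags.
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setEntityResolver(EMPTY_RESOLVER);
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xhtml =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>t</title></head><body><p>hi</p></body></html>";
        System.out.println("elements: " + countElements(xhtml));
        System.out.println("intercepted: " + requested);
    }
}
```

Since EntityResolver is part of the standard SAX API, this approach should carry over to parsers that don't recognize the Xerces feature.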

Specifics on what to do available here