Daniel Gazineu

27 Apr 2010

Design Pattern for processing a huge XML file: The Solution

This post is the sequel to my last one, which described the problem I recently faced when I needed to parse and process a big, complex XML file.

After playing around with the conventional solutions, I was not convinced that I should give up the legibility of XPath/DOM code in exchange for lower memory consumption.

To understand my solution, it’s important to analyze the data I’m working with. Although the schema is complex and the file contains lots of data, the root tag represents a list of entities (table records) and there are no dependencies between nodes. They can (and most likely will) be processed in parallel.

My solution uses a hybrid producer-consumer implementation, in which a reader class loads the XML contents into memory and dispatches small segments to a parser. The parser processes each segment as if it were a complete file, but without memory consumption concerns.

The entire parsing process has three main steps:

Loading

First, I created a class named Reader, whose only purpose is to load the contents of a given XML file into memory and dispatch them for processing. This class has a buffer whose size is based on the number of loaded entities. In other words, if the XML file contains a root tag named Cars with a list of Car nodes, the buffer counts occurrences of </Car>. When a given number of entities has been loaded, the data is dispatched for processing and the buffer is reset.
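The counting idea can be sketched roughly as follows. This is not the original Reader class: the name SegmentingReader, its constructor parameters, and the use of a Consumer<String> as the dispatch target are all assumptions made for illustration. It also assumes that each segment is wrapped in the root tag again so it can be parsed as a complete document, and that tags are written without extra whitespace (so </Car > would not be counted).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: splits <Root><Entity/>...</Root> into small, well-formed documents. */
class SegmentingReader {
    private final String rootTag;
    private final String entityClose;
    private final int entitiesPerSegment;
    private final Consumer<String> sink; // stand-in for the dispatcher

    SegmentingReader(String rootTag, String entityTag, int entitiesPerSegment, Consumer<String> sink) {
        this.rootTag = rootTag;
        this.entityClose = "</" + entityTag + ">";
        this.entitiesPerSegment = entitiesPerSegment;
        this.sink = sink;
    }

    void read(Reader in) {
        StringBuilder buffer = new StringBuilder();
        boolean inBody = false; // becomes true once the root start tag has been consumed
        int entities = 0;       // closing entity tags seen since the last dispatch
        try {
            int c;
            while ((c = in.read()) != -1) {
                buffer.append((char) c);
                if (c != '>') {
                    continue;
                }
                if (!inBody) {
                    // Drop the prolog and the root start tag; segments get their own root.
                    if (buffer.indexOf("<" + rootTag) >= 0) {
                        inBody = true;
                        buffer.setLength(0);
                    }
                } else if (endsWith(buffer, entityClose) && ++entities == entitiesPerSegment) {
                    flush(buffer);
                    entities = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (entities > 0) {
            // Leftover entities: cut the root end tag, then dispatch the final, smaller segment.
            int end = buffer.lastIndexOf("</" + rootTag + ">");
            if (end >= 0) {
                buffer.setLength(end);
            }
            flush(buffer);
        }
    }

    private void flush(StringBuilder buffer) {
        sink.accept("<" + rootTag + ">" + buffer + "</" + rootTag + ">");
        buffer.setLength(0);
    }

    private static boolean endsWith(StringBuilder sb, String suffix) {
        int start = sb.length() - suffix.length();
        return start >= 0 && sb.indexOf(suffix, start) == start;
    }
}

class SegmentingReaderDemo {
    public static void main(String[] args) {
        List<String> segments = new ArrayList<>();
        String xml = "<?xml version=\"1.0\"?><Cars><Car>a</Car><Car>b</Car><Car>c</Car></Cars>";
        new SegmentingReader("Cars", "Car", 2, segments::add).read(new StringReader(xml));
        segments.forEach(System.out::println);
    }
}
```

Because only one segment is buffered at a time, memory use depends on the chosen number of entities per segment rather than on the size of the file.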

Dispatching

Instead of dispatching data directly to the Parser, the Reader object holds a ParserDispatcher implementation and uses it for this job. The idea is to abstract the execution environment from the rest of the code. While a valid ParserDispatcher implementation for a server-side environment would post the data to a JMS Queue, my command-line desktop application uses an ExecutorService for the same purpose.
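A minimal sketch of what such an abstraction might look like, assuming segments are passed around as strings. The dispatch method signature, the ExecutorDispatcher name and the shutdownAndWait helper are hypothetical, and a Consumer<String> stands in for the Parser:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical contract: hands one XML segment over for parsing. */
interface ParserDispatcher {
    void dispatch(String xmlSegment);
}

/** Desktop flavor: each segment is parsed on a worker thread of an ExecutorService. */
class ExecutorDispatcher implements ParserDispatcher {
    private final ExecutorService pool;
    private final Consumer<String> parser; // stand-in for the Parser

    ExecutorDispatcher(int threads, Consumer<String> parser) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.parser = parser;
    }

    @Override
    public void dispatch(String xmlSegment) {
        // The caller returns immediately; parsing happens in the background.
        pool.execute(() -> parser.accept(xmlSegment));
    }

    /** Stops accepting segments and waits for queued ones; returns false on timeout. */
    boolean shutdownAndWait(long timeoutSeconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class ExecutorDispatcherDemo {
    public static void main(String[] args) {
        List<String> parsed = new CopyOnWriteArrayList<>(); // thread-safe: filled by worker threads
        ExecutorDispatcher dispatcher = new ExecutorDispatcher(2, parsed::add);
        dispatcher.dispatch("<Cars><Car/></Cars>");
        dispatcher.dispatch("<Cars><Car/><Car/></Cars>");
        dispatcher.shutdownAndWait(5);
        System.out.println(parsed.size() + " segments processed");
    }
}
```

A server-side variant would implement the same interface and post the segment to a queue instead, leaving the reading code unchanged.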

Parsing

The parsing process itself has no novelty besides the fact that the huge XML file, once broken into small blocks, can be parsed with XPath/DOM without compromising memory consumption or performance. The Parser class is a common XML parser, unaware of the prior stages the data went through. It can parse any given InputStream, as long as it points to XML content that complies with its schema and is small enough to be completely represented by a DOM structure in memory. After each entity is parsed, a list of listeners is notified. These listeners can persist, log, count, create reports, etc.
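As an illustration of this step, here is a small sketch built on the standard JAXP DOM and XPath APIs. The SegmentParser name, the XPath expression passed to its constructor, and the use of Consumer<Element> as the listener type are assumptions; the original Parser and its listener interface may look quite different.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: parses one small segment with DOM/XPath and notifies listeners. */
class SegmentParser {
    private final List<Consumer<Element>> listeners = new ArrayList<>();
    private final String entityXPath; // e.g. "/Cars/Car" (assumed layout)

    SegmentParser(String entityXPath) {
        this.entityXPath = entityXPath;
    }

    void addListener(Consumer<Element> listener) {
        listeners.add(listener);
    }

    void parse(InputStream segment) {
        try {
            // The segment is small, so building a full DOM tree is affordable.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(segment);
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList entities = (NodeList) xpath.evaluate(entityXPath, doc, XPathConstants.NODESET);
            for (int i = 0; i < entities.getLength(); i++) {
                Element entity = (Element) entities.item(i);
                // Notify every listener once per parsed entity.
                for (Consumer<Element> listener : listeners) {
                    listener.accept(entity);
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse segment", e);
        }
    }
}

class SegmentParserDemo {
    public static void main(String[] args) {
        SegmentParser parser = new SegmentParser("/Cars/Car");
        parser.addListener(car ->
                System.out.println(car.getElementsByTagName("Model").item(0).getTextContent()));
        String xml = "<Cars><Car><Model>A</Model></Car><Car><Model>B</Model></Car></Cars>";
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }
}
```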

A client application would run by calling the following lines:


Parser parser = new Parser();
parser.addListener(new DebugListener());
Reader xmlReader = new Reader(new NewThreadDispatcher(parser));
xmlReader.read(new FileInputStream("file.xml"));

My friend Paulo Jeronimo commented on my last post, suggesting that I use StAX. Being a pull parser over a stream, StAX tries to bring the best of both worlds, but in my opinion code using StAX is not as legible as it would be using XPath/DOM. That’s why I decided to create my own design.
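For comparison, this is roughly what pulling a single value out of each entity looks like with the standard StAX cursor API. The StaxModelReader class and the Model element are made up for this example. With an XPath expression the same lookup is declarative, whereas here the code walks the event stream itself, which is the legibility trade-off mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: collects the text of every Model element using StAX. */
class StaxModelReader {
    static List<String> readModels(String xml) {
        List<String> models = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Pull the next event and react only to the start tags we care about.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "Model".equals(reader.getLocalName())) {
                    models.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read XML", e);
        }
        return models;
    }
}

class StaxModelReaderDemo {
    public static void main(String[] args) {
        String xml = "<Cars><Car><Model>A</Model></Car><Car><Model>B</Model></Car></Cars>";
        StaxModelReader.readModels(xml).forEach(System.out::println);
    }
}
```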

Although this is neither the most performant nor the simplest solution, I believe it strikes a good balance between performance, memory consumption and code maintainability. Moreover, this pattern can be extended to other file types.

Comments (1) Trackbacks (2)
  1. You may want to give vtd-xml a look; it is well suited to huge XML files (with XPath support)
    http://vtd-xml.sf.net

