Design Pattern for processing a huge XML file: The Problem
A few days ago I started working on a project that requires parsing and storing information contained in huge files of different formats. These files are sent by our client’s partners and represent data held in their databases. Sometimes this data is consistent and useful for our system; other times it’s just crap. As we do not have access to their databases, we have to parse the files, store the data in a database of our own, and then query it in order to understand how consistent and complete it is.
Last week my manager asked me to parse the content of an XML file of more than 500 MB. The result of this activity would tell us about the quality of the data that partner could provide, and then we would be able to decide whether the process of parsing and storing that schema would be permanently added to the system or just thrown away.
Although the system runs in a Java EE container, for a one-off process like this I find it much easier to create a Java SE application that receives a filename as a parameter, then parses the file and stores its content. On the other hand, if the result shows that the partner’s database is consistent and useful enough, this is no longer a one-off process and the code must be added to the project. Given that, it was strongly recommended to implement the code in such a way that it could be easily refactored from a desktop to a server environment.
As I said before, the file was larger than 500 MB, and it was just a test file; the next ones (if there are any) might be bigger than one gigabyte. Loading that much content onto the heap wouldn’t be possible in production. Since DOM was not an option, XPath, which normally evaluates against an in-memory tree, was also discarded, and SAX became the only option. The problem now was that the schema is very complex, and the code needed to parse it with SAX would easily become too messy to maintain.
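To make the trade-off concrete, here is a minimal sketch of what streaming a file through SAX looks like. It is only an illustration, not the solution: the element names `record` and `name` are hypothetical (the partner's real schema is far more complex), and the class and method names are my own. Parsing takes an `InputStream`, so the same method could be called from a command-line `main` or from server-side code.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxSketch {

    // Hypothetical handler: counts <record> elements and collects the text of every <name> element.
    public static class RecordHandler extends DefaultHandler {
        public int records = 0;
        public final List<String> names = new ArrayList<>();

        // State that has to be tracked by hand, because SAX only delivers isolated events.
        private boolean inName = false;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("record".equals(qName)) {
                records++;
            } else if ("name".equals(qName)) {
                inName = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for a single text node, so accumulate.
            if (inName) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("name".equals(qName)) {
                names.add(text.toString().trim());
                inName = false;
            }
        }
    }

    // Streams the document through the handler; the whole file is never held in memory.
    public static RecordHandler parse(InputStream in) throws Exception {
        RecordHandler handler = new RecordHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }

    // Command-line entry point: the filename is the first argument.
    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            RecordHandler result = parse(in);
            System.out.println(result.records + " records, " + result.names.size() + " names");
        }
    }
}
```

Even with two element names, the handler already needs a flag and a text buffer. Every additional element or level of nesting adds more of this hand-written state, which is exactly how SAX code for a complex schema gets messy.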
Well, that's enough for today! Now I'll let you think about this problem and in a few days I'll describe my solution here.
UPDATE: The post with my solution to this problem can be found here.
See you soon!
April 15th, 2010 - 07:05
I would also consider and test a solution to this problem using StAX. Try it and tell me whether performance improves or not.
April 15th, 2010 - 07:37
Perhaps the performance does not improve with the use of StAX, but its API seems to be easier to use.