Last Updated: January 23, 2019 · goranhalusa

Parsing Large XML Files Using PHP

I ran into a situation where I needed to parse a large (1 GB) XML file in order to extract the data into a MySQL table. As usual, I did my initial round of research. First, I decided to use the DOMDocument PHP class.

First Mistake

For my testing, I used a small subset of the data… weighing in at a measly 24 records.

Initially, all of my tests ran quite nicely. Then I decided to throw the complete (1 GB) XML file at it. Epic fail… I mean, it ran well for a while, but eventually ran out of memory. (And, yes… I did increase memory_limit* to 1.5 GB and max_execution_time* to 5 hours.) I had feared this might happen.

The problem with using DOMDocument on large XML files is that it loads the entire document into memory as a tree of nodes. While parsing, that tree keeps growing until the whole file is represented. Not good when you’re dealing with massive XML files.
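For contrast, here is a minimal sketch of the whole-document approach. The sample data and the element names `record` and `name` are hypothetical, since the real file's structure isn't shown in this post. The point is only that nothing can be processed until the full tree has been built.

```php
<?php
// Hypothetical sample data standing in for the real (1 GB) file.
$xml = '<records>'
     . '<record><name>Alpha</name></record>'
     . '<record><name>Beta</name></record>'
     . '</records>';

$doc = new DOMDocument();
// loadXML() (like load() for a file path) parses the ENTIRE document
// before returning, so the whole tree is resident in memory from here on.
$doc->loadXML($xml);

$names = [];
foreach ($doc->getElementsByTagName('record') as $record) {
    // Read the text of the first <name> child of each record.
    $names[] = $record->getElementsByTagName('name')->item(0)->textContent;
}

print_r($names);
```

With 24 records this is perfectly comfortable; the memory cost only shows up when the input grows.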

With this fail under my belt, I went back to the drawing board. Knowledge is power… knowledge is power… knowledge is power.

My Next Move

XMLReader. From the PHP website: “The XMLReader extension is an XML Pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.” OK, sounds considerably more promising.

And Survey Says, Ding!

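$file = "PATH_TO_FILE";
$reader = new XMLReader();
// open() returns false if the file cannot be opened
if (!$reader->open($file)) {
  die("Unable to open $file");
}
// read() advances the cursor one node at a time
while ($reader->read()) {
  // Execute processing here
}
$reader->close();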

After that, it was gravy. Well, aside from the additional logic that had to go into it. That’s easily a topic all its own.
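As a rough idea of what that additional logic can look like, here is a sketch (not the code actually used for this import) that walks the file with XMLReader and hands back one record at a time as a SimpleXMLElement. The function name `streamRecords`, the element name `record`, and the `id`/`name` fields are all hypothetical; substitute whatever your file uses.

```php
<?php
/**
 * Yield each <$element> subtree of an XML file as a SimpleXMLElement.
 * Only the record currently being handled is expanded in memory.
 */
function streamRecords(string $path, string $element): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Unable to open $path");
    }
    try {
        $more = $reader->read();
        while ($more) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $element) {
                // readOuterXml() returns just this element and its children.
                yield new SimpleXMLElement($reader->readOuterXml());
                // next() skips the subtree that was just handled.
                $more = $reader->next();
            } else {
                $more = $reader->read();
            }
        }
    } finally {
        $reader->close();
    }
}

// Small demonstration on a temporary file with made-up data.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $path,
    '<records>'
    . '<record id="1"><name>Alpha</name></record>'
    . '<record id="2"><name>Beta</name></record>'
    . '</records>'
);
foreach (streamRecords($path, 'record') as $record) {
    echo $record['id'], ': ', $record->name, "\n";
}
unlink($path);
```

Inside the loop, each record could be bound to a prepared INSERT statement instead of echoed. Because the generator only ever holds one record, memory use should stay roughly flat regardless of file size.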

* How to modify PHP’s “memory_limit” and “max_execution_time” on a per-script basis

// Tweak some PHP configurations
ini_set('memory_limit','1536M'); // 1.5 GB
ini_set('max_execution_time', 18000); // 5 hours
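These calls can fail quietly if the server configuration locks a setting, so it may be worth checking the result. A small sketch, assuming the same limits as above:

```php
<?php
// ini_set() returns the old value on success and false on failure
// (for example, when the setting is locked by the server configuration).
$oldLimit = ini_set('memory_limit', '1536M');
if ($oldLimit === false) {
    error_log('memory_limit could not be changed; still ' . ini_get('memory_limit'));
}

// set_time_limit() is the dedicated call for max_execution_time; it returns a bool.
if (!set_time_limit(18000)) {
    error_log('max_execution_time could not be changed');
}

echo 'memory_limit is now ', ini_get('memory_limit'), "\n";
```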

4 Responses

Also, if you don't want to increase the time limit for every script on the server, you can set it on a per-script basis using set_time_limit(0), which allows unlimited execution time. I believe you can also use ini_set().

over 1 year ago ·

True. However, note one of the comments on the set_time_limit() page - "You can do set_time_limit(0); so that the script will run forever - however this is not recommended and your web server might catch you out with an imposed HTTP timeout (usually around 5 minutes)."

over 1 year ago ·

For something like this you should be running from the CLI/shell anyway and not from an HTTP request, in which case the PHP execution time limit defaults to zero (i.e., unlimited time).

over 1 year ago ·

If XMLReader is just reading chunks and forwarding them somewhere else, will it still need ini_set('memory_limit','1536M'); // 1.5 GB? Even if it is reading very small chunks at a time and forwarding them to the client side or saving them in a DB, etc.?

over 1 year ago ·