New Article - Enumerating Large XML Files

EDN Admin

Reading a Large XML File

If you’re new to working with XML, there’s something important that you need to know: when you load an XML file into an in-memory document, the entire file gets loaded into memory. When working with small XML files, which is often the case, this is no big deal. In fact, it’s rather convenient. However, if you are working with an extremely large XML file, this is a big problem. I recently wrote some code to read through a bunch of XML files, not realizing that one of them was over half a gigabyte! My code loaded the entire file into memory using the XDocument.Load() method (well, it tried to, at least). Needless to say, when I hit the huge file, my app did not perform well.

How do you read an enormous XML file, then? You use the XmlReader class, which has been around since the first release of the .NET Framework. It reads forward through an XML file, keeping only a pointer to the current XML element or attribute rather than holding the whole file. As you read through the file with the XmlReader object, you can examine the current XML, decide if you are interested in it, process it, discard it, and move on to the next part of the file. The important thing is that you can minimize how much memory is in use at any one time in your app.
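As a rough illustration of that forward-only model, here is a minimal sketch that counts elements by name without ever building a document. The CountElements function and the inline sample string are made up for this example; a real app would more likely pass a file path or stream to XmlReader.Create.

```vb
Imports System
Imports System.IO
Imports System.Xml

Module XmlReaderSketch
    ' Counts start tags with the given name (hypothetical helper).
    ' Only the current node is held by the reader at any time.
    Function CountElements(input As TextReader, elementName As String) As Integer
        Dim count = 0
        Using reader = XmlReader.Create(input)
            ' Read() advances the pointer one node at a time.
            While reader.Read()
                If reader.NodeType = XmlNodeType.Element AndAlso
                   reader.Name = elementName Then
                    count += 1
                End If
            End While
        End Using
        Return count
    End Function

    Sub Main()
        ' Hypothetical inline sample standing in for a large file.
        Dim sample = "<apis><api id=""a""/><api id=""b""/><other/></apis>"
        Console.WriteLine(CountElements(New StringReader(sample), "api")) ' prints 2
    End Sub
End Module
```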

Take heed: you still need to be aware of how you are reading through the file. If you open an XmlReader, read to the root element, and then load that entire element into memory, you haven’t solved anything.

Now you may be saying that if you use an XmlReader object, you don’t get all of the cool functionality of XML Axis Properties. That’s true, and that’s why there’s the XNode.ReadFrom method, which reads the XML at the XmlReader’s current position into an XNode that you can then cast to an XElement object and make use of all of the VB XML juicy goodness. Using an XmlReader and the ReadFrom method together ensures that you only use as much memory as the largest XML element that you load.

Let’s look at an example. The app that I was working on was reading through XML files that contained reflection information from .NET assemblies. For each member of a particular class, there was an api element. Within that element there was a bunch of information about that member, and my app needed to grab some of the info for use in summary counts. Here’s an abbreviated XML sample of what the data looked like.

Here’s some code to read through each element, one at a time, with an XmlReader object. Once I have loaded the element into an XElement object, I can use XML Axis properties to get values from the XML contained in the element. The most memory that I use is determined by the largest element rather than the entire file.

Code:
Dim reader = Xml.XmlReader.Create("..\..\reflectionData.xml")
reader.MoveToContent()

While reader.ReadToFollowing("api")
    Dim api = TryCast(XElement.ReadFrom(reader), XElement)
    If api Is Nothing Then Continue While

    ' Get information from the element.
    Dim ns = api…@api
    Dim containingType = api…@api
End While

reader.Close()

Now this code simply reads to the first api element and then reads all of the following api elements. If one of the api elements has a child api element, that child element gets loaded in the call to the ReadFrom method and the XmlReader object’s pointer moves past it. This works fine for my app because none of the api elements have child api elements. You may have different requirements and need to adjust your code.
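One adjustment worth knowing about: XNode.ReadFrom leaves the reader on the node that follows the element it consumed, and ReadToFollowing advances before it tests for a match. In a file with no whitespace between sibling elements, that combination can step over every other element. The sketch below is one possible variant, not the code my app used, that checks the current node before advancing. The ReadIds function and the id attribute in the sample are hypothetical.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Linq

Module SiblingStreamingSketch
    ' Collects the id attribute of each matching element (hypothetical helper).
    Function ReadIds(input As TextReader, elementName As String) As List(Of String)
        Dim ids As New List(Of String)
        Using reader = XmlReader.Create(input)
            reader.MoveToContent()
            reader.Read() ' Step past the root start tag.
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso
                   reader.Name = elementName Then
                    ' ReadFrom consumes the element and moves the reader to the
                    ' node after it, so there is no extra Read() on this branch.
                    Dim element = DirectCast(XNode.ReadFrom(reader), XElement)
                    ids.Add(element.@id)
                Else
                    reader.Read()
                End If
            End While
        End Using
        Return ids
    End Function

    Sub Main()
        ' Hypothetical sample with no whitespace between siblings.
        Dim sample = "<apis><api id=""1""/><api id=""2""/><api id=""3""/></apis>"
        Dim ids = ReadIds(New StringReader(sample), "api")
        Console.WriteLine(String.Join(",", ids)) ' prints 1,2,3
    End Sub
End Module
```

Like the original loop, this treats a nested element of the same name as part of its parent, because ReadFrom loads the whole subtree.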

I ran this code on a 5 MB file with a little less than 30,000 elements. Loading the entire file into memory consumed over 120 MB. Using the XmlReader, the code consumed less than 1 MB. I gathered memory stats using the GC.GetTotalMemory method.
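If you want to take a similar measurement yourself, a rough sketch along these lines may be enough. MeasureBytes is a hypothetical helper, not the code behind the numbers above, and the result is only an estimate of managed heap growth that will vary from run to run.

```vb
Imports System
Imports System.Xml.Linq

Module MemoryMeasurementSketch
    ' Returns the approximate change in managed heap size caused by running work.
    Function MeasureBytes(work As Action) As Long
        ' Passing True asks the GC to collect first, for a steadier baseline.
        Dim before = GC.GetTotalMemory(True)
        work()
        ' Passing False reads the current estimate without forcing a collection.
        Return GC.GetTotalMemory(False) - before
    End Function

    Sub Main()
        ' Hypothetical small sample; substitute your own file to compare approaches.
        Dim sample = "<apis><api id=""a""/><api id=""b""/></apis>"
        Dim doc As XDocument = Nothing
        Dim bytes = MeasureBytes(Sub() doc = XDocument.Parse(sample))
        Console.WriteLine("Approximate bytes in use after load: " & bytes.ToString())
        ' Keep the document reachable until after the measurement.
        GC.KeepAlive(doc)
    End Sub
End Module
```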

One last thing to note in this section is that you can also run into memory issues when writing to a file. If you create a large XDocument in memory, and then write it to a file, you end up consuming the memory required to create the document, which is likely unnecessary. You have a couple of choices to minimize your memory footprint while writing an XML file. Similar to using the XmlReader class and the ReadFrom method, you can use the XmlWriter class and the WriteTo method. As another option, you can use the XStreamingElement class to write a single element at a time from an enumerable source, such as a LINQ query.
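To make the XStreamingElement option a little more concrete, here is a minimal sketch that assumes the elements can be produced lazily from a query. The WriteApis function, the element names, and the generated id values are all invented for illustration.

```vb
Imports System
Imports System.IO
Imports System.Linq
Imports System.Xml.Linq

Module StreamingWriteSketch
    ' Writes count api elements to output (hypothetical helper).
    Sub WriteApis(output As TextWriter, count As Integer)
        ' The query is lazy: nothing is created until it is enumerated.
        Dim apis = From i In Enumerable.Range(1, count)
                   Select New XElement("api", New XAttribute("id", "item" & i.ToString()))

        ' XStreamingElement enumerates the query while saving,
        ' so the full tree is never held in memory at once.
        Dim root As New XStreamingElement("apis", apis)
        root.Save(output)
    End Sub

    Sub Main()
        Using writer As New StringWriter()
            WriteApis(writer, 3)
            Console.WriteLine(writer.ToString())
        End Using
    End Sub
End Module
```

In a real app you would pass a file path or a StreamWriter rather than a StringWriter, which is used here only to keep the sketch self-contained.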

What about LINQ Queries?

In addition to having a small memory footprint, I also wanted to be able to use LINQ to query a large XML file. This can be achieved by creating a class that implements the IEnumerable interface. By fitting the code that I would have used to loop through the XML file into a class that implements IEnumerable(Of XElement), I can use an instance of that class as the source of any number of LINQ queries.

What I’ve created for this step is almost exactly the same as the class created by this walkthrough: Walkthrough: Implementing IEnumerable(Of T) in Visual Basic. The walkthrough shows you how to implement IEnumerable(Of String) to expose the contents of a text file one line at a time. We’ll do the same with an XML file.

When you implement IEnumerable, you actually need to implement both IEnumerable and IEnumerator. The bulk of your code goes into the IEnumerator implementation. You could create one class that implements both, but I like to split them into two classes.

I’ve called the class that implements IEnumerable(Of XElement) XmlReaderEnumerable. Following that naming convention I’ve called the class that implements IEnumerator(Of XElement) XmlReaderEnumerator. The behavior is the same as the earlier XmlReader example. The XmlReaderEnumerator class finds the first instance of a particular element, and then finds all of its sibling elements of the same name. As a result, I’ve added a constructor that takes both the path to the XML file, and the name of the XML element to search for. Note that the name is case sensitive as XML is case sensitive.

The XmlReaderEnumerable class doesn’t do much. All it does is return a reference to an instance of the XmlReaderEnumerator class. Here’s the code.

Code:
Public Class XmlReaderEnumerable
    Implements IEnumerable(Of XElement)

    Private _filePath As String
    Private _elementName As String

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName
    End Sub

    Public Function GetEnumerator() As IEnumerator(Of XElement) _
        Implements IEnumerable(Of XElement).GetEnumerator
        Return New XmlReaderEnumerator(_filePath, _elementName)
    End Function

    Private Function GetEnumerator1() As IEnumerator _
        Implements IEnumerable.GetEnumerator
        Return Me.GetEnumerator()
    End Function
End Class

The XmlReaderEnumerator class is where the code resides to read through the XML file. In the constructor, it opens the file and moves to the start of the XML content. In the MoveNext method, it reads to the element of the supplied name (for example, “api”). In the Dispose method, it closes the reader. That’s it. It looks like a lot of code, but it really isn’t.

Code:
Public Class XmlReaderEnumerator
    Implements IEnumerator(Of XElement)

    Private _xmlReader As Xml.XmlReader
    Private _elementName As String
    Private _filePath As String
    Private _current As XElement

    Public Sub New(ByVal filePath As String, ByVal elementName As String)
        _filePath = filePath
        _elementName = elementName
        _xmlReader = Xml.XmlReader.Create(_filePath)
        _xmlReader.MoveToContent()
    End Sub

    Public ReadOnly Property Current() As XElement _
        Implements IEnumerator(Of XElement).Current
        Get
            If _xmlReader Is Nothing OrElse _current Is Nothing Then
                Throw New InvalidOperationException()
            End If
            Return _current
        End Get
    End Property

    Private ReadOnly Property Current1() As Object _
        Implements IEnumerator.Current
        Get
            Return Me.Current
        End Get
    End Property

    ' Advances to the next element with the supplied name and loads it.
    Public Function MoveNext() As Boolean _
        Implements System.Collections.IEnumerator.MoveNext
        _current = If(_xmlReader.ReadToFollowing(_elementName),
                      TryCast(XElement.ReadFrom(_xmlReader), XElement),
                      Nothing)
        Return _current IsNot Nothing
    End Function

    ' Reopens the file so that enumeration starts over from the beginning.
    Public Sub Reset() _
        Implements System.Collections.IEnumerator.Reset
        _xmlReader.Close()
        _current = Nothing
        _xmlReader = Xml.XmlReader.Create(_filePath)
        _xmlReader.MoveToContent()
    End Sub

    Private disposedValue As Boolean = False

    Protected Overridable Sub Dispose(ByVal disposing As Boolean)
        If Not Me.disposedValue Then
            If disposing Then
                ' Dispose of managed resources. The reader is a managed
                ' object, so it is closed only on an explicit Dispose call.
                If _xmlReader IsNot Nothing Then _xmlReader.Close()
            End If
            _current = Nothing
        End If
        Me.disposedValue = True
    End Sub

    Public Sub Dispose() Implements IDisposable.Dispose
        Dispose(True)
        GC.SuppressFinalize(Me)
    End Sub

    Protected Overrides Sub Finalize()
        Dispose(False)
    End Sub
End Class

Now you can read through a large XML file using LINQ queries like the following example.
Code:
Dim numAPIs = _
    Aggregate api In New XmlReaderEnumerable(filePath, "api") Into Count()

Dim vbAPIs = From api In New XmlReaderEnumerable(filePath, "api") _
             Where api…@id = "N:Microsoft.VisualBasic"
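
If you are using Visual Basic 11 (Visual Studio 2012) or later, an Iterator function may be a more compact alternative to the two hand-written classes. This is a different technique from the one described above, shown only as a sketch. StreamElements, the temporary sample file, and the id values in it are hypothetical.

```vb
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Xml
Imports System.Xml.Linq

Module IteratorSketch
    ' Yields each matching element in turn. The body runs from the start
    ' every time the result is enumerated, so the file is reopened per query.
    Iterator Function StreamElements(filePath As String,
                                     elementName As String) As IEnumerable(Of XElement)
        Using reader = XmlReader.Create(filePath)
            reader.MoveToContent()
            ' Same loop as the article's XmlReader example.
            While reader.ReadToFollowing(elementName)
                Dim element = TryCast(XNode.ReadFrom(reader), XElement)
                If element IsNot Nothing Then Yield element
            End While
        End Using
    End Function

    Sub Main()
        ' Hypothetical sample file; the line breaks between elements matter
        ' to ReadToFollowing, as in the original loop.
        Dim tempFile = Path.GetTempFileName()
        File.WriteAllText(tempFile,
            "<apis>" & Environment.NewLine &
            "  <api id=""N:A""/>" & Environment.NewLine &
            "  <api id=""N:B""/>" & Environment.NewLine &
            "</apis>")
        Try
            Dim total = Aggregate api In StreamElements(tempFile, "api") Into Count()
            Dim matches = From api In StreamElements(tempFile, "api")
                          Where api.@id = "N:A"
            Console.WriteLine(total)           ' prints 2
            Console.WriteLine(matches.Count()) ' prints 1
        Finally
            File.Delete(tempFile)
        End Try
    End Sub
End Module
```

The compiler generates the IEnumerable and IEnumerator plumbing, and the Using block closes the reader when enumeration finishes or is abandoned.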
 