Parsing XML in Ruby

(March 26 2006)

Using the Ruby Electric XML parser, REXML.

REXML comes from http://www.germane-software.com/software/rexml/ and fully supports XPath 1.0. It ships with Ruby's standard library from 1.8 onwards, so you probably already have it; otherwise you'll have to download and install it by hand.

Take one sample XML document – here's mine, the Atom feed for this blog. Drop it into a text file for easy access, or choose your own – all Atom files should have basically the same layout, which is :-

  • feed
    • title, link, modified, author
    • entry – repeated
      • title, author, link, id
      • issued, modified, created
      • dc:subject, content

Now load the file and ask for the first element :-

  require 'rexml/document'
  include REXML
  atom = Document.new(File.new("atom.xml"))
  p XPath.first(atom, "//*")

This loads the REXML library and includes the REXML module, so we don't have to say things like REXML::Document.new all the time. Then it builds a new REXML document from the contents of “atom.xml”. Finally, we ask XPath for the first element in the document :-

<feed version='0.3' xmlns:dc='http://purl.org/dc/elements/1.1/'
      xmlns='http://purl.org/atom/ns#'> ... </>
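
As an aside, Document.new will also take a plain String, which is handy for trying things out without a file. A minimal sketch, assuming a cut-down stand-in for the feed (TINY_FEED and root_name are names made up for this example, not part of REXML):

```ruby
require 'rexml/document'

# Hypothetical stand-in for atom.xml, trimmed to a single entry.
TINY_FEED = <<~XML
  <feed version='0.3' xmlns='http://purl.org/atom/ns#'>
    <title mode='escaped'>Example blog</title>
    <entry><title mode='escaped'>First post</title></entry>
  </feed>
XML

# Document.new accepts a String as well as an IO object;
# root returns the single top-level element.
def root_name(xml)
  REXML::Document.new(xml).root.name
end

puts root_name(TINY_FEED)
```

The root accessor hands back the document's one top-level element, so this prints feed.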

Well, that's correct - there's only one element in the document at the top level, and that's “feed”. But what's inside it?

  # match returns every matching element as an array
  p XPath.match(atom, "//feed/*")

[<title mode='escaped'> ... </>, <link href='http://nb.inode.co.nz'
rel='alternate' type='text/html'/>, <modified> ... </>, <author> ... </>,
<entry> ... </>, <entry> ... </>, <entry> ... </>, <entry> ... </>,
<entry> ... </>, <entry> ... </>, <entry> ... </>, <entry> ... </>,
<entry> ... </>, <entry> ... </>]
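
If the inspect output is too noisy, the same query can be boiled down to element names. A rough sketch, again on a made-up stand-in feed rather than the real one (NAMES_FEED and feed_child_names are hypothetical names for illustration):

```ruby
require 'rexml/document'

# Hypothetical sample data shaped like the feed described above.
NAMES_FEED = <<~XML
  <feed version='0.3' xmlns='http://purl.org/atom/ns#'>
    <title mode='escaped'>Example blog</title>
    <link href='http://example.org/' rel='alternate' type='text/html'/>
    <modified>2006-03-26T00:00:00Z</modified>
    <author><name>Someone</name></author>
    <entry><title mode='escaped'>First post</title></entry>
    <entry><title mode='escaped'>Second post</title></entry>
  </feed>
XML

# Collect the names of the elements directly under feed, in document order.
def feed_child_names(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//feed/*").map { |element| element.name }
end

p feed_child_names(NAMES_FEED)
```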

That's better - within the “feed”, there's a title, link, modified, author, and a series of entries. Let's look inside the first entry …

  p XPath.match(XPath.first(atom, "//feed/entry"), "*")

 [<title mode='escaped'> ... </>, <author> ... </>, <link
 href='http://nb.inode.co.nz/archives/2006-03-20T11_18_09.html'
 rel='alternate' type='text/html'/>, <id> ... </>, <issued> ... </>,
 <modified> ... </>, <created> ... </>, <dc:subject> ... </>, <content
 mode='escaped' type='application/xhtml+xml' xml:space='preserve'
 xml:lang='en'> ... </>]
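
Attributes can be read off an element in much the same way. A small sketch, assuming a stand-in feed with one entry (LINK_FEED and first_entry_link are invented for this example, and the href is a placeholder URL):

```ruby
require 'rexml/document'

# Hypothetical one-entry feed; the URL is a placeholder.
LINK_FEED = <<~XML
  <feed version='0.3' xmlns='http://purl.org/atom/ns#'>
    <entry>
      <title mode='escaped'>First post</title>
      <link href='http://example.org/first.html' rel='alternate' type='text/html'/>
    </entry>
  </feed>
XML

# Find the first entry, then its link child, then read the href attribute.
def first_entry_link(xml)
  doc = REXML::Document.new(xml)
  entry = REXML::XPath.first(doc, "//feed/entry")
  link = REXML::XPath.first(entry, "link")
  link.attributes["href"]
end

puts first_entry_link(LINK_FEED)
```

Passing an element instead of the document as the first argument makes the path relative to that element, which is why "link" on its own is enough here.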

So now we can see the structure within an entry at last. Let's try listing the titles of all the entries in the document :-

  XPath.each(atom, "//feed/entry/title") { |e| puts e.text }

OSX Tiger fails poll()
Xerox DocuColor watermarking
Comments and Trackback
Textile?
The syntax battle ...
Remembering Categories
IRC and antivirus
Markdown and better CSS
Markdown added
Latine
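
Putting the pieces together, each entry can be walked in turn, pairing its title with its link. A sketch under the same assumptions as before (ENTRY_FEED and entry_summaries are hypothetical, and the data is made up rather than taken from the real feed):

```ruby
require 'rexml/document'

# Hypothetical two-entry feed with placeholder URLs.
ENTRY_FEED = <<~XML
  <feed version='0.3' xmlns='http://purl.org/atom/ns#'>
    <title mode='escaped'>Example blog</title>
    <entry>
      <title mode='escaped'>First post</title>
      <link href='http://example.org/1.html' rel='alternate' type='text/html'/>
    </entry>
    <entry>
      <title mode='escaped'>Second post</title>
      <link href='http://example.org/2.html' rel='alternate' type='text/html'/>
    </entry>
  </feed>
XML

# Build one "title <href>" string per entry, using paths relative to each entry.
def entry_summaries(xml)
  doc = REXML::Document.new(xml)
  summaries = []
  REXML::XPath.each(doc, "//feed/entry") do |entry|
    title = REXML::XPath.first(entry, "title").text
    href = REXML::XPath.first(entry, "link").attributes["href"]
    summaries << "#{title} <#{href}>"
  end
  summaries
end

puts entry_summaries(ENTRY_FEED)
```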