Saturday, February 05, 2011

NedNews: XML

Ok, now we have our XML data (in the string variable httpData).

Let's see if it looks good (in my wish session):

% llength $httpData
list element in braces followed by "&lt;/div&gt;</summar" instead of space
% string length $httpData
101373

I am so used to typing llength rather than string length that I mistyped it the first time. No harm done: Tcl tried to interpret the string as a list and complained that it is malformed. The second time around I got the length in characters (Tcl handles multibyte character strings automatically).
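To make the difference concrete, here is a small sketch using a made-up string rather than the real feed data. The variable name sample is my own invention for illustration. llength has to parse its argument as a list and raises an error when a closing brace is followed directly by other text, while string length just counts characters.

```tcl
# A made-up string (not the real feed): a closing brace followed
# directly by more text is not a well-formed Tcl list.
set sample "{<div>}</summary>"

# llength must parse the value as a list; catch traps the error.
if {[catch {llength $sample} msg]} {
    puts "llength failed: $msg"
}

# string length counts characters, whatever the content.
puts "characters: [string length $sample]"
```

Wrapping the call in catch is handy whenever a string of unknown origin might end up being treated as a list.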

Let's parse this into a tree (the dom command comes from the tDOM package, loaded with package require tdom):

set doc [dom parse $httpData]

Ok, let's look through the tree (the dom interface returns a list of child nodes), and see what we got (llength is "list length" and lindex is "list index", where 0 is the first in the list):

% llength [$doc childNodes]
2
% [lindex [$doc childNodes] 0] asText
type="text/css" href="http://semipublic.comp-arch.net/wiki/skins/common/feed.css?270"
% [lindex [$doc childNodes] 1] asText
(pages and pages)

Ok. Pesky. The top of the tree has just two nodes. The first gives us the stylesheet. The second has all the real content...
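The same poking around can be tried on a tiny stand-in document. This is only a sketch: sampleXml and sampleDoc are hypothetical names, the feed content is invented, and it assumes the tDOM package is installed. It prints the type and name of each top-level node, then grabs the root element directly with documentElement instead of indexing into the child list.

```tcl
package require tdom

# A tiny stand-in for the real feed (invented content).
set sampleXml {<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="feed.css"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <updated>2011-02-05T00:00:00Z</updated>
  <entry><title>First post</title></entry>
</feed>}

set sampleDoc [dom parse $sampleXml]

# Print the type and name of each top-level node.
foreach node [$sampleDoc childNodes] {
    puts "[$node nodeType]: [$node nodeName]"
}

# The root element is also available without indexing the list.
set root [$sampleDoc documentElement]
puts "root element: [$root nodeName]"
puts "root has [llength [$root childNodes]] children"
```

Using documentElement avoids depending on how many processing instructions or comments happen to precede the root element.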

% set docTop [lindex [$doc childNodes] 1]
domNode00BFCF48
% llength [$docTop childNodes]
38

In my final code, I will try to use XPath to extract the vital goodies. But we can poke around by hand here.

I can see that there are some initial fields ("id" - the URL I used to get this, "title", "link" - which also seems to be the URL, "updated" - with some time info, "subtitle", and "generator"). Then there are "entry" fields with the posts. So the "updated" field will come in handy when I want to look for updates, and "title" might be useful for populating a description. Otherwise, I just want the entries:

set entries [$doc selectNodes -namespaces {atom http://www.w3.org/2005/Atom} /atom:feed/atom:entry]

There is a slight complication here. The Atom elements belong to an XML namespace, which is identified by a URI, so I need to bind a prefix ("atom") to that URI for XPath to match them...

I can see each entry has 6 fields: "id" - which is the link to the full post, "title", "link" - which also links to the full post, "updated", "summary" - the body, and "author". Perfect.
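Here is a rough sketch of how the per-entry fields might be pulled out with XPath. It runs against an invented two-entry feed, not the real data. The names sampleXml, sampleDoc, ns and titles are hypothetical, and tDOM is assumed to be installed. The same prefix mapping is passed to every selectNodes call, and the XPath string() function returns the text of a field directly.

```tcl
package require tdom

# Invented two-entry Atom feed for illustration only.
set sampleXml {<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry>
    <id>http://example.org/post/1</id>
    <title>First post</title>
    <updated>2011-02-05T00:00:00Z</updated>
  </entry>
  <entry>
    <id>http://example.org/post/2</id>
    <title>Second post</title>
    <updated>2011-02-06T00:00:00Z</updated>
  </entry>
</feed>}

set sampleDoc [dom parse $sampleXml]

# Prefix-to-URI mapping reused for every XPath query.
set ns {atom http://www.w3.org/2005/Atom}

set titles {}
foreach entry [$sampleDoc selectNodes -namespaces $ns /atom:feed/atom:entry] {
    # string(...) yields the text content of the first matching child.
    set title   [$entry selectNodes -namespaces $ns string(atom:title)]
    set updated [$entry selectNodes -namespaces $ns string(atom:updated)]
    lappend titles $title
    puts "$updated  $title"
}
```

Keeping the mapping in one variable means the prefix can be changed in a single place; the prefix itself is arbitrary, only the URI has to match the feed.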

Of course, I would pick the one feed whose entries are full of HTML. I will post a bit on that next...
