httpData
). Let's see if it looks good (in my wish session):
% llength $httpData
list element in braces followed by "</div></summar" instead of space
% string length $httpData
101373
I am so used to typing llength that I used it instead of string length the first time. No harm done: Tcl tries to interpret the string as a list, finds that it is malformed, and reports an error. The second time around I get the length in characters, not bytes (Tcl handles multibyte characters automatically).
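To make the difference concrete, here is a small sketch of the same slip on a made-up string rather than the real feed. The sample variable and its contents are my own invention for illustration; like the feed, it has a braced word followed directly by more text, which is what trips up the list parser.

```tcl
# Made-up stand-in for the feed text: a braced word followed directly by
# more characters, plus one non-ASCII character.
set sample "{Atom}</div> caf\u00e9"

# llength has to parse the string as a list first; catch turns the parse
# error into a return code instead of aborting the script.
set isList [expr {![catch {llength $sample} msg]}]
if {!$isList} {
    puts "not a valid list: $msg"
}

# string length counts characters, not bytes.
set chars [string length $sample]

# Converting to UTF-8 gives the byte count, which is larger here because
# the accented character needs two bytes.
set bytes [string length [encoding convertto utf-8 $sample]]
puts "$chars characters, $bytes bytes in UTF-8"
```

Wrapping llength in catch is only worth it when the string might not be a list at all; for data that is known to be plain text, string length is the right tool.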
Let's parse this into a tree.
set doc [dom parse $httpData]
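Since the data comes off the network, it may be worth guarding the parse. The following is a minimal sketch, assuming the tdom package is installed; sampleFeed is a made-up miniature feed standing in for $httpData, not the real thing.

```tcl
package require tdom

# Made-up miniature feed standing in for $httpData.
set sampleFeed {<?xml-stylesheet type="text/css" href="feed.css"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <updated>2011-01-01T00:00:00Z</updated>
</feed>}

# Well-formed input: catch returns 0 and doc holds the document command.
set failed [catch {dom parse $sampleFeed} doc]
if {!$failed} {
    set rootName [[$doc documentElement] nodeName]
    puts "parsed, root element is <$rootName>"
    $doc delete   ;# free the tree when done with it
}

# Malformed input (mismatched tags): catch returns 1 and err holds the
# parser's message.
set broken [catch {dom parse {<feed><title>oops</feed>}} err]
puts "broken input rejected: $broken"
```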
Ok, let's look through the tree and see what we got. The dom interface returns child nodes as a Tcl list; llength is "list length" and lindex is "list index", where 0 is the first element in the list:
% llength [$doc childNodes]
2
% [lindex [$doc childNodes] 0] asText
type="text/css" href="http://semipublic.comp-arch.net/wiki/skins/common/feed.css?270"
% [lindex [$doc childNodes] 1] asText
(pages and pages)
Ok. Pesky. The top of the tree has just two nodes. The first gives us the stylesheet. The second has all the real content...
% set docTop [lindex [$doc childNodes] 1]
domNode00BFCF48
% llength [$docTop childNodes]
38
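Picking the second top-level node by index works for this feed, but it depends on exactly one node coming before the root element. Here is a sketch of a sturdier way to get the same node, again on a made-up miniature feed (sampleFeed is my own stand-in, and it has far fewer children than the real one):

```tcl
package require tdom

# Made-up miniature feed: one processing instruction, then the root element.
set sampleFeed {<?xml-stylesheet type="text/css" href="feed.css"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/feed</id>
  <title>Example feed</title>
  <entry><title>First post</title></entry>
</feed>}
set doc [dom parse $sampleFeed]

# Ask each top-level node what kind of node it is.
set types {}
foreach node [$doc childNodes] {
    lappend types [$node nodeType]
}
puts $types

# documentElement returns the root element directly, however many
# processing instructions or comments precede it.
set docTop [$doc documentElement]
set topName [$docTop nodeName]
set childCount [llength [$docTop childNodes]]
puts "$topName has $childCount children"
```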
In my final code, I will try to use XPath to extract the vital goodies. But we can poke around by hand here.
I can see that there are some initial fields ("id" - the URL I used to get this, "title", "link" - which also seems to be the URL, "updated" - with some time info, "subtitle", and "generator"). Then there are "entry" fields with the posts. So, the "updated" field will come in handy when I want to look for updates, "title" might be useful for populating a description. Otherwise, I just want the "entries".
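Poking around by hand could look something like the sketch below: walk the children of the top node, collect the feed-level fields into a dict, and count the entries. The sample feed, the meta dict, and the lastSeen timestamp are all made up for illustration, and I am assuming the "updated" value uses the usual Atom form like 2011-01-02T00:00:00Z and that "link" carries its URL in an href attribute.

```tcl
package require tdom

# Made-up miniature feed with a few of the feed-level fields and two entries.
set sampleFeed {<feed xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/feed</id>
  <title>Example feed</title>
  <link href="http://example.org/feed"/>
  <updated>2011-01-02T00:00:00Z</updated>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>}
set doc [dom parse $sampleFeed]
set docTop [$doc documentElement]

# Collect feed-level fields by name; just count the entries for now.
set meta [dict create]
set entryCount 0
foreach node [$docTop childNodes] {
    if {[$node nodeType] ne "ELEMENT_NODE"} continue
    set name [$node nodeName]
    if {$name eq "entry"} {
        incr entryCount
    } elseif {$name eq "link"} {
        # Assumption: the URL lives in the href attribute.
        dict set meta link [$node getAttribute href ""]
    } else {
        dict set meta $name [$node text]
    }
}

# Compare "updated" with a hypothetical time of the last check.
set fmt {%Y-%m-%dT%H:%M:%SZ}
set feedTime [clock scan [dict get $meta updated] -format $fmt -gmt 1]
set lastSeen [clock scan 2011-01-01T00:00:00Z -format $fmt -gmt 1]
set hasNews [expr {$feedTime > $lastSeen}]
puts "[dict get $meta title]: $entryCount entries, new since last check: $hasNews"
```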
set entries [$doc selectNodes -namespaces {atom http://www.w3.org/2005/Atom} /atom:feed/atom:entry]
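There is a slight complication here. The Atom elements live in an XML namespace identified by a URI (http://www.w3.org/2005/Atom), so I need to give XPath a prefix mapped to that URI and use the prefix on every element name in the query...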
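A quick sketch of why the prefix matters, on a made-up miniature feed (sampleFeed is my own stand-in). The prefix name "atom" is arbitrary; only the URI it maps to has to match the one declared in the feed.

```tcl
package require tdom

# Made-up miniature feed in the Atom namespace.
set sampleFeed {<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>}
set doc [dom parse $sampleFeed]

# Unprefixed names in XPath only match elements in no namespace,
# so this finds nothing in a namespaced feed.
set plain [$doc selectNodes /feed/entry]

# Mapping a prefix to the Atom namespace URI makes the query match.
set ns {atom http://www.w3.org/2005/Atom}
set entries [$doc selectNodes -namespaces $ns /atom:feed/atom:entry]
puts "[llength $plain] without prefix, [llength $entries] with prefix"
```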
I can see each entry has 6 fields: "id" (the link to the full post), "title", "link" (which also links to the full post), "updated", "summary" (the body), and "author". Perfect.
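Pulling those six fields out of each entry might look like the sketch below. The sample entry is made up, and I am assuming the usual Atom layout, where "link" keeps its URL in an href attribute and "author" wraps a "name" element; the posts list and the dict keys are just names I picked.

```tcl
package require tdom

# Made-up feed with a single entry carrying the six fields.
set sampleFeed {<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://example.org/post/1</id>
    <title>First post</title>
    <link href="http://example.org/post/1"/>
    <updated>2011-01-01T00:00:00Z</updated>
    <summary type="html">&lt;p&gt;Hello&lt;/p&gt;</summary>
    <author><name>Someone</name></author>
  </entry>
</feed>}
set doc [dom parse $sampleFeed]
set ns {atom http://www.w3.org/2005/Atom}

# Build one dict per entry; each XPath is evaluated relative to the entry
# and wrapped in string() so selectNodes returns plain text.
set posts {}
foreach entry [$doc selectNodes -namespaces $ns /atom:feed/atom:entry] {
    set post [dict create]
    foreach {key xpath} {
        id      string(atom:id)
        title   string(atom:title)
        link    string(atom:link/@href)
        updated string(atom:updated)
        summary string(atom:summary)
        author  string(atom:author/atom:name)
    } {
        dict set post $key [$entry selectNodes -namespaces $ns $xpath]
    }
    lappend posts $post
}
puts [dict get [lindex $posts 0] title]
```

Note that the summary comes back with its entities decoded, so what lands in the dict is raw HTML markup.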
Of course, I would pick the one feed whose entries are full of HTML. I will post a bit on that next...