ns_xml
command to the embedded Tcl interpreter. You can download the source or get it directly from the CVS repository by running:

    cvs -d:pserver:[email protected]:/cvsroot/aolserver login
    cvs -z3 -d:pserver:[email protected]:/cvsroot/aolserver co nsxml

You need to press Enter after the first command since CVS is waiting for a password (which is empty).
As of Dec. 2000, Linux distributions usually come with version 1.x of the libxml library, so chances are that you'll need to install 2.x yourself (this will change in the future since everyone is migrating to 2.x). To install the nsxml module, go into the nsxml directory and, optionally, edit the path in the Makefile to point to the AOLserver source directory. Then run make. You should get an nsxml.so module that should be placed in the AOLserver bin directory (the same one that holds the main nsd executable). Add the following to your nsd.tcl config file:

    ns_section "ns/server/${servername}/modules"
    ns_param nsxml ${bindir}/ns_xml.so

and restart AOLserver. You can verify that the module gets loaded by watching server.log; I usually use a shell window with:

    tail -f $AOLSERVERDIR/log/server.log

This is also a great way to debug Tcl scripts, since AOLserver will dump detailed debug information every time there is an error in the script.
set doc_id [ns_xml parse ?-persist? $string] |
Parse the XML document in $string and return a document id (a handle to the in-memory parsed tree). If you don't provide the -persist flag, the memory will be freed automatically when the script exits. Otherwise you'll have to free the memory by calling ns_xml doc free. You need to use the -persist flag if you want to share parsed XML docs between scripts. |
set doc_stats [ns_xml doc stats $doc_id] |
Return document's statistics. |
ns_xml doc free $doc_id |
Free a document. Should only be called on a document if the -persist flag has been passed to either ns_xml parse or ns_xml doc create |
set node_id [ns_xml doc root $doc_id] |
Return the node id of the document root (you start traversal of the document tree from here.) |
set children_list [ns_xml node children $node_id] |
Return a list of the child nodes of a given node. |
set node_name [ns_xml node name $node_id] |
Return the name of a node. |
set node_type [ns_xml node type $node_id] |
Return the type of a node. Possible types: element, attribute, text, cdata_section, entity_ref, entity, pi, comment, document, document_type, document_frag, notation, html_document |
set content [ns_xml node getcontent $node_id] |
Get the content (text) of a given node. |
set attr [ns_xml node getattr $node_id $attr_name] |
Return the value of an attribute of a given node. |
set doc_id [ns_xml doc create ?-persist? $doc_version] |
Create a new document in memory. If the -persist flag is given you'll have to explicitly free the memory taken by the document with ns_xml doc free; otherwise it'll be freed automatically after execution of the script. $doc_version is the version of the XML doc; if not specified it'll be "1.0". |
set xml_string [ns_xml doc render $doc_id] |
Generate XML from the in-memory representation of the document. |
set node_id [ns_xml doc new_root $doc_id $node_name $node_content] |
Create a root node for a document. |
set node_id [ns_xml node new_sibling $node_id $name $content] |
Create a new sibling of a given node. |
set node_id [ns_xml node new_child $node_id $name $content] |
Create a child of a given node. |
ns_xml node setcontent $node_id $content |
Set the content of a given node. |
ns_xml node setattr $node_id $attr_name $value |
Set the value of an attribute in a given node. |
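
The document-building commands in the table above can be combined roughly as follows. This is only a sketch, assuming the commands behave exactly as the table describes; the proc name headlines_to_xml and the element names "headlines", "story" and "title" are made up for illustration, and passing an empty string as node content is an assumption that has not been checked against the module.

```tcl
# Hypothetical helper: build an XML document from a list of {url title}
# pairs and return it as a string. It uses only the ns_xml commands
# listed in the table above; the element names are illustrative.
proc headlines_to_xml { stories } {
    # -persist means we own the memory and must free it ourselves
    set doc_id [ns_xml doc create -persist "1.0"]
    # make sure the document is freed even if something fails halfway
    if { [catch {
        set root_id [ns_xml doc new_root $doc_id "headlines" ""]
        foreach story $stories {
            set story_id [ns_xml node new_child $root_id "story" ""]
            ns_xml node setattr $story_id "url" [lindex $story 0]
            ns_xml node new_child $story_id "title" [lindex $story 1]
        }
        set xml [ns_xml doc render $doc_id]
    } err] } {
        ns_xml doc free $doc_id
        error $err
    }
    ns_xml doc free $doc_id
    return $xml
}
```

If the document is created without -persist, the two ns_xml doc free calls should be dropped, since the table says the memory is then released automatically when the script finishes.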
- ns_xml parse $xml_doc to parse the XML document in the string $xml_doc and get its document id
- ns_xml doc root $doc_id to get the id of the root node
- ns_xml node children $node_id to traverse the document tree, and the ns_xml node ... commands to get node content and attributes
- if you passed the -persist flag to ns_xml parse, you'll have to explicitly call ns_xml doc free $doc_id to free the memory associated with the document; otherwise it will be freed automatically after the script has executed.
In code it could look like this:

    # write out a single node as an HTML list item
    proc dump_node {node_id} {
        set name [ns_xml node name $node_id]
        set type [ns_xml node type $node_id]
        set content [ns_xml node getcontent $node_id]
        ns_write "<li>"
        ns_write "node id=$node_id name=$name type=$type"
        if { [string compare $type "attribute"] != 0 } {
            ns_write " content=$content\n"
        }
    }

    # write out the given nodes and, recursively, all of their children
    proc dump_tree_rec {children} {
        ns_write "<ul>\n"
        foreach child_id $children {
            dump_node $child_id
            set new_children [ns_xml node children $child_id]
            if { [llength $new_children] > 0 } {
                dump_tree_rec $new_children
            }
        }
        ns_write "</ul>\n"
    }

    proc dump_tree {node_id} {
        dump_tree_rec [list $node_id]
    }

    proc dump_doc {doc_id} {
        ns_write "doc id=$doc_id<br>\n"
        set root_id [ns_xml doc root $doc_id]
        dump_tree $root_id
    }

    # quotes inside the string have to be escaped
    set xml_doc "<test version=\"1.0\">this is a <blind>test</blind> of xml</test>"
    set doc_id [ns_xml parse $xml_doc]
    dump_doc $doc_id

The ns_xml parse command will throw an error if the XML document is not valid (e.g., not well formed), so in production code we should catch it and display a meaningful error message, e.g.:

    if { [catch {set doc_id [ns_xml parse $xml_doc]} err] } {
        ns_write "There was an error parsing the following XML document: "
        ns_write [ns_quotehtml $xml_doc]
        ns_write "Error message is:"
        ns_write [ns_quotehtml $err]
        ns_write "\n"
        return
    }

Code like this takes more time to write, but some day it may save a lot of debugging time (and a day like this always comes).
See how the code works in practice [external site running AOLserver] and get the full source [included in Linux Gazette]. It's a bit more complex than the above snippet. You can see the structure of an arbitrary XML document by typing it in the provided text area. The script also shows how to parse form data and has more robust error handling.
In the past it could've been done in a rather distasteful way, by grabbing the whole HTML page and trying to extract the relevant information. That would be hard to program and fragile (a change in the way the HTML page is generated would most likely break such parsing).
Today a site that wants to provide headlines for others can publish this data in an easy-to-parse XML format under some URL. In our case the data are provided at http://www.linuxtoday.com/backend/linuxtoday.xml. See the format of this file (using the previously developed script).
As you can see, the XML document represents the headlines on the LinuxToday site. It is a set of stories, each story having a title, url, author etc. We know that after parsing the XML document we would like to have a way to easily extract the information. Let's use the "wishful-thinking" (in other words, top-down) method of writing code advocated in Structure and Interpretation of Computer Programs (a truly great CS book). Let's assume that we've converted the XML representation into an object. To build an HTML table showing the data we need the following procedures:
headlines_get_stories_count $headlines
headlines_get_story $headlines $story_no
story_get_url $story
story_get_title $story
Having those procedures we can generate the simplest (but rather ugly) table:

    proc story_to_html_table_row { story } {
        set url [story_get_url $story]
        set title [story_get_title $story]
        return "- <a href=\"$url\"><font color=#000000>$title</font></a><br>\n"
    }

    # given headlines generate HTML code of the table with this data
    proc headlines_to_html_table { headlines } {
        set to_return "<table border=0 cellspacing=1 cellpadding=3>"
        append to_return "<tr><td><small>"
        set stories_count [headlines_get_stories_count $headlines]
        for {set i 0} {$i < $stories_count} {incr i} {
            set story [headlines_get_story $headlines $i]
            append to_return [story_to_html_table_row $story]
        }
        # close the <small> tag opened above before closing the cell
        append to_return "</small></td></tr></table>\n"
        return $to_return
    }

Tcl doesn't give us much choice for representing this object; we'll use lists.

    proc headlines_get_stories_count { headlines } {
        return [llength $headlines]
    }

    proc headlines_get_story { headlines story_no } {
        return [lindex $headlines $story_no]
    }

    proc story_get_url { story } {
        return [lindex $story 0]
    }

    proc story_get_title { story } {
        return [lindex $story 1]
    }

Note that if we forget about purity (just for a while) we can rewrite the following part of headlines_to_html_table:

    set stories_count [headlines_get_stories_count $headlines]
    for {set i 0} {$i < $stories_count} {incr i} {
        set story [headlines_get_story $headlines $i]
        append to_return [story_to_html_table_row $story]
    }

in a somewhat terser way:

    foreach story $headlines {
        append to_return [story_to_html_table_row $story]
    }

Now the most important part: converting the XML doc into the representation we've chosen.

    # are two strings equal? (helper used by the predicates below)
    proc string_equal_p { s1 s2 } {
        return [expr {[string compare $s1 $s2] == 0}]
    }

    # does the name of the node identified by $node_id equal $name?
    proc is_node_name_p { node_id name } {
        set node_name [ns_xml node name $node_id]
        return [string_equal_p $name $node_name]
    }

    # does the type of the node identified by $node_id equal $type?
    proc is_node_type_p { node_id type } {
        set node_type [ns_xml node type $node_id]
        return [string_equal_p $type $node_type]
    }

    # is this a node of type "attribute"?
    proc is_attribute_node_p { node_id } {
        return [is_node_type_p $node_id "attribute"]
    }

    # raise an error if node name is different than $name
    proc error_if_node_name_not {node_id name} {
        if { ![is_node_name_p $node_id $name] } {
            set node_name [ns_xml node name $node_id]
            error "node name should be $name and not $node_name"
        }
    }

    # raise an error if node type is different than $type
    proc error_if_node_type_not {node_id type} {
        if { ![is_node_type_p $node_id $type] } {
            set node_type [ns_xml node type $node_id]
            error "node type should be $type and not $node_type"
        }
    }

    # given url and title construct a story object with
    # those attributes
    proc define_story { url title } {
        return [list $url $title]
    }

    # convert a node of name "story" into an object
    # that represents story
    proc story_node_to_story {node_id} {
        set url ""
        set title ""
        # go through all children and extract content of url and title nodes
        set children [ns_xml node children $node_id]
        foreach child_id $children {
            # we're only interested in nodes whose name is "url" or "title"
            if { [is_attribute_node_p $child_id] } {
                if { [is_node_name_p $child_id "url"] || [is_node_name_p $child_id "title"] } {
                    set node_children [ns_xml node children $child_id]
                    # those should only have one child node with
                    # the name "text" and type "cdata_section"
                    if { [llength $node_children] != 1 } {
                        set name [ns_xml node name $child_id]
                        error "$name node should only have 1 child"
                    }
                    set one_node_id [lindex $node_children 0]
                    error_if_node_type_not $one_node_id "cdata_section"
                    error_if_node_name_not $one_node_id "text"
                    set txt [ns_xml node getcontent $one_node_id]
                    if { [is_node_name_p $child_id "url"] } {
                        set url $txt
                    }
                    if { [is_node_name_p $child_id "title"] } {
                        set title $txt
                    }
                }
            }
        }
        return [define_story $url $title]
    }

    # convert XML doc to headlines object
    proc xml_to_headlines { doc_id } {
        set headlines [list]
        set root_id [ns_xml doc root $doc_id]
        # root node should be named "linuxtoday" and of type "attribute"
        error_if_node_name_not $root_id "linuxtoday"
        error_if_node_type_not $root_id "attribute"
        set children [ns_xml node children $root_id]
        foreach node_id $children {
            # only interested in attribute type nodes whose name is "story"
            if { [is_node_name_p $node_id "story"] && [is_attribute_node_p $node_id] } {
                set story [story_node_to_story $node_id]
                lappend headlines $story
            }
        }
        return $headlines
    }

The code is rather straightforward. We use our knowledge of the structure of the XML file: we know that the root node is named linuxtoday and has children named story, and that each story node has children named url, title etc. The earlier script that dumps the general structure of the tree helped me a lot in writing this function. Note the use of the error command to abort the script if the XML doesn't look good to us.
Having an intermediate representation of the data might look excessive, given that it costs us more code and some performance, but there are very good reasons to have it. We could have written a proc xml_to_html_table that creates the HTML table directly from the XML document, but such code would be more complex, more bug-prone and harder to modify. The separation we've made provides an abstraction that reduces complexity, which is always good. It also gives us more flexibility: we can easily imagine writing another headlines_to_html_table procedure that gives us a slightly different table.
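
To make the flexibility argument concrete, here is a minimal sketch of such an alternative renderer. The name headlines_to_html_list is hypothetical; the two accessors are repeated from the listing above only so that the snippet runs on its own. Because the renderer goes through the accessors, it needs no knowledge of XML or of how a story is stored.

```tcl
# Accessors repeated from the article so the sketch is self-contained.
proc story_get_url { story } { return [lindex $story 0] }
proc story_get_title { story } { return [lindex $story 1] }

# Hypothetical alternative renderer: an HTML bulleted list instead of a table.
proc headlines_to_html_list { headlines } {
    set to_return "<ul>\n"
    foreach story $headlines {
        set url [story_get_url $story]
        set title [story_get_title $story]
        append to_return "<li><a href=\"$url\">$title</a></li>\n"
    }
    append to_return "</ul>\n"
    return $to_return
}
```

Swapping one renderer for the other touches nothing in xml_to_headlines, which is the point of keeping the intermediate representation.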
See how it works in practice [external site running AOLserver] and get the source [included in Linux Gazette]. It should produce something like this:
linuxtoday
- Kernel Cousin Debian Hurd #73 By Paul Emsley And Zack Brown
- Zope 2.2.5 b1 released
- O'Reilly Network: Insecurities in a Nutshell: SAMBA, pine, ircd, and More
- ZDNet: Linux Laptop SuperGuide
- ComputerWorld: Think tank warns that Microsoft hack could pose national security risk
One thing missing in this code is caching. As it is, it will grab the XML file from other people's server every time it is invoked. This is not nice. It would be fairly easy to add logic to cache the XML file (or its in-memory representation) and only fetch a new version if, say, 1 hour has passed since it was last retrieved.
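
The time-based caching described above could be sketched as follows. Everything here is an assumption made for illustration: the proc name cached_fetch, the arrays cache_data and cache_time, and the idea of passing the fetch step in as a command. The sketch keeps its state in plain global arrays to stay self-contained; inside AOLserver the cache would probably have to live in storage shared between connections (for example the nsv commands) for it to survive from one request to the next.

```tcl
# Hypothetical cache: remembers the result of fetch_cmd under $key together
# with the time it was obtained, and reuses it while it is younger than
# max_age_secs seconds.
proc cached_fetch { key max_age_secs fetch_cmd } {
    global cache_data cache_time
    set now [clock seconds]
    if { [info exists cache_time($key)]
         && ($now - $cache_time($key)) < $max_age_secs } {
        # still fresh: return the cached copy
        return $cache_data($key)
    }
    # missing or too old: fetch again and remember when we did it
    set data [uplevel #0 $fetch_cmd]
    set cache_data($key) $data
    set cache_time($key) $now
    return $data
}
```

A call might look like cached_fetch linuxtoday 3600 [list ns_httpget http://www.linuxtoday.com/backend/linuxtoday.xml], assuming ns_httpget is available in your AOLserver installation. Caching the fetched text rather than a parsed document avoids having to manage -persist documents across requests.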
The ns_xml module provides the basics of XML processing. Although you can do quite a bit with it, one could wish for more. Things that are obviously missing: