4Suite Core: Open-source Library for XML Processing
Users' Manual
1 Introduction
4Suite allows users to take advantage of standard XML technologies
rapidly and to develop and integrate Web-based applications. It also puts
practical technologies for knowledge management projects in the hands of
developers. It is implemented in Python with C extensions.
At the core of 4Suite is a library of integrated tools (including
convenient command-line tools) for XML processing, implementing open
technologies such as DOM, SAX, XSLT, XInclude, XPointer, XLink, XPath,
XUpdate, RELAX NG, and XML/SGML Catalogs.
With 4Suite, you can:
And much more. These tasks are covered in this manual.
2 Installation
Please see the UNIX or Windows install
documents. Remember that if you are using Cygwin on Windows, you should follow the UNIX instructions.
3 DOM-like XML processing
Domlette is 4Suite's lightweight DOM implementation. It is optimized
for XPath operations, speed, and relatively low memory overhead. The
Domlette API is accessible through Ft.Xml.Domlette. This section describes how to
parse, manipulate, and then serialize XML documents using this API.
Below, we briefly summarize the various elements of the API that form
the basic life span of Domlette objects.
- Parsing XML documents
-
The Ft.Xml module
contains the function Parse that gets the
job done quickly. See “Quick access to the Domlette reader API” for
details. For somewhat more advanced parsing, you will need a
combination of the reader instances in the
Ft.Xml.Domlette module and
Ft.Xml.CreateInputSource for constructing
InputSource instances. In rare cases you
might need lower-level APIs in the
Ft.Xml.InputSource module.
Read “The full Domlette reader API” if
Ft.Xml.Parse isn't enough.
- Modifying and interacting with XML documents
-
The Domlette API for interacting with XML documents—accessible
as methods of the various Domlette objects—is similar to the DOM Level 2
specification. See “Domlette API summary” for more
information.
- Serializing XML documents
-
The Ft.Xml.Domlette
module provides two functions, Print and
PrettyPrint, for writing your XML documents.
The Print function writes the XML document
precisely as given in the model. On the other hand, the
PrettyPrint function adds whitespace nodes to
your document to try to indent the resulting output nicely. See “Serializing Domlette nodes” for details.
3.1 Parsing XML documents
We begin our discussion of the Domlette API by describing how to
obtain a model of your XML documents to manipulate further. Because XML
documents offer such rich functionality and exist in such varied
environments, there can be a surprising amount of work that you must do to
simply load your XML documents. We begin by providing a short-cut for easy
access. We will then dive into the full suite of document loading
utilities.
3.1.1 Quick access to the Domlette reader API
For basic document manipulations or to get started quickly, the
Ft.Xml module offers a quick
way to parse XML documents and directly obtain access to the Domlette
interface to those documents. Within this module the function of
interest is Parse.
Warning
This function will get you started quickly because it
specifically chooses some default values for some of the more advanced
parsing features. If you are passing in a string or stream, and the
material in “The importance of base URIs”
applies to your parsing situation, then you will want to use the
full-featured API. In brief, if your XML document references external
resources, you should not use this convenience function. See “The full Domlette reader API” instead.
This function returns a Domlette
Document node representing the root of the document
parsed from its argument.
Parse(source)
-
The Parse function takes a single
argument, which is a byte string (not unicode object), file-like
object (stream), file path or URI.
XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""
from Ft.Xml import Parse
doc = Parse(XML)
# If the above XML document were located in the file
# "target.xml", we could have used `Parse("target.xml")`.
print doc.xpath('string(ham//em[1])')
3.1.2 The full Domlette reader API
You create Domlette instances by parsing XML documents with the
reader system. For general use, the Ft.Xml.Domlette package contains instances
of the different reader classes that can be used directly after you
import them. These instances include
NonvalidatingReader and
ValidatingReader, which provide non-validating
parsing and validating parsing services, respectively. The validation in
this case refers to DTD validation. For RELAX NG validation, see “Validation using RELAX NG”. All the reader classes (and, hence, their bundled
instances) are described in later sections. After you have obtained one
of these reader instances, you feed your XML document entity's byte
stream to the reader. We summarize the available reader methods
below.
parseUri(uri)
-
The parseUri method takes a single
argument; this uri argument is the absolute
URI of the document entity to parse. The URI will be dereferenced
by the default resolver.
parseString(st, uri)
-
The parseString method takes two
arguments; st is the XML document entity in
the form of an encoded Python string (not a
Unicode string). See the next section for details on
the uri argument.
parseStream(stream, uri)
-
The parseStream method takes two
arguments; stream is a Python file-like
object that can supply the document entity's bytes via
read() calls. See the next section for
details on the uri argument.
parse(inputSource)
-
The parse method takes a single
argument; inputSource is an
Ft.Xml.InputSource.InputSource object,
described in “InputSource objects”.
The next two sections cover some of the issues that you should
understand before using these functions. Then we start seeing some
examples in “NonvalidatingReader”.
3.1.3 The importance of base URIs
In the first 3 methods listed in the previous section, the
uri argument is the URI of the document entity
that you are feeding to the parser. It is a very important—but often
overlooked—concept in document processing.
The URI gives the document entity a unique identifier that can be
used to refer to the document as a whole. Also, each Domlette node
derived from a particular entity inherits that entity's URI as the
node's baseURI property, unless an alternative base
URI was indicated, such as with xml:base, or if part of the document was
loaded as an external entity or XInclude.
The document's URI is also used as the "base URI" for resolving
any relative URI references that may appear within the document itself.
Relative URI references may occur in a document in places like:
-
<!DOCTYPE> or
<!ENTITY>, immediately following the keyword
SYSTEM
-
<xsl:import> and
<xsl:include>, in the value
of the href attribute
-
<xi:include>, in the
value of the href
attribute
-
<exsl:document>, in
the value of the href
attribute
-
the arguments to XSLT's document()
function
It is a common misconception that relative URI references in a
document's content are considered to be relative to the processor's
current working directory. They are actually resolved relative to the
URI of the document that contains the relative URI reference (more
specifically, relative to the URI of the entity in which the reference occurs, keeping in
mind that a document may be composed of multiple entities, i.e.,
separate files).
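The resolution rule described above can be tried out without 4Suite. The following minimal sketch uses urljoin from the Python 3 standard library as a stand-in for 4Suite's own resolver; the URIs are hypothetical.

```python
from urllib.parse import urljoin

# Base URI of the entity that contains the relative reference (hypothetical).
base_uri = "http://foo/test/spam.xml"

# A relative reference, e.g. a SYSTEM identifier or an href value.
resolved = urljoin(base_uri, "eggs.xml")
print(resolved)  # http://foo/test/eggs.xml

# The current working directory plays no part; only the base URI matters.
parent = urljoin(base_uri, "../common/eggs.xml")
print(parent)  # http://foo/common/eggs.xml
```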
In all cases, the document URI that you supply in the reader API
must be "absolute", which means that it has a scheme, e.g.
"http://spam/eggs.xml", not just
"/spam/eggs.xml" or
"eggs.xml".
If you know there are not going to be any relative URI references
to resolve during initial parsing or during processing of the Domlette
by other tools, then you can safely omit the argument, or, preferably,
supply a dummy URI like "urn:dummy" or
"http://spam/eggs.xml". If you choose to omit URI arguments
from APIs that need them, you may get a Python warning, and a random
URI—which is probably not what you want—will be assigned.
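As a rough illustration of what "absolute" means here, the sketch below checks for a scheme component using the Python 3 standard library. The helper name is_absolute is hypothetical and is not part of 4Suite. Note that this simple check would also treat a Windows drive letter as a scheme.

```python
from urllib.parse import urlsplit

def is_absolute(uri):
    """Return True if the URI reference carries a scheme (hypothetical helper)."""
    return bool(urlsplit(uri).scheme)

candidates = ["http://spam/eggs.xml", "urn:dummy", "/spam/eggs.xml", "eggs.xml"]
checks = dict((uri, is_absolute(uri)) for uri in candidates)
print(checks)
```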
If you've understood all this and yet you want to just go ahead
and not specify a base URI, you may have to turn off the likely
warnings. You can do so with code such as in the following example.
import Ft.Xml.Domlette
import warnings
def disable_warnings(*args): pass
warnings.filterwarnings("ignore", category=Warning)
warnings.showwarning = disable_warnings
XML = "<spam/>"
doc = Ft.Xml.Domlette.NonvalidatingReader.parseString(XML)
Ft.Xml.Domlette.Print(doc)
In such a case, you can also use the convenience function
Ft.Xml.Parse (see above).
3.1.4 Parsing XML that's already a Unicode string
Because 4Suite is trying to provide as thin a wrapper as possible
to the underlying parser, and due to complexities in the APIs of these
parsers, there is no API in 4Suite for parsing Python's Unicode
strings.
If your XML is in the form of a Unicode string, you must encode
the string as bytes so that the underlying parser can read it. Once you
have an encoded string, you can pass it to the reader's
parseString(), or wrap it in an
InputSource using
Ft.Xml.CreateInputSource, or the
fromString() method of an
InputSourceFactory. If the string is not UTF-16 or
UTF-8 encoded, then you must tell the reader what encoding it actually
uses. You can do this either by writing or replacing the XML declaration
in the string itself, or (much easier) setting the optional encoding
keyword argument in the reader's parseString()
method or the InputSourceFactory's
fromString() method. For an example, see the
Akara article on external encoding declarations.
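The encode-then-parse step looks roughly like this. The sketch uses xml.dom.minidom from the Python 3 standard library rather than a 4Suite reader, so it only illustrates the idea: the parser receives bytes, and any encoding other than UTF-8 or UTF-16 is declared (here, in the XML declaration).

```python
from xml.dom import minidom

text = u"<spam>caf\u00e9</spam>"

# UTF-8 bytes need no encoding declaration.
utf8_bytes = text.encode("utf-8")
doc = minidom.parseString(utf8_bytes)

# For another encoding, state it in the XML declaration before encoding.
declared = u'<?xml version="1.0" encoding="iso-8859-1"?>' + text
latin1_bytes = declared.encode("iso-8859-1")
doc2 = minidom.parseString(latin1_bytes)

content = doc.documentElement.firstChild.data
content2 = doc2.documentElement.firstChild.data
print(content == content2)  # True
```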
3.1.5 NonvalidatingReader
Use NonvalidatingReader for basic parsing.
NonvalidatingReader performs its parsing without
validating against a DTD.
The following example will parse an XML source taken from the
supplied URI, which is treated as a URL by the default resolver.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri(
"http://www.w3.org/2000/08/w3c-synd/home.rss")
The following example also parses an XML source taken from the
supplied URI, which is treated as a URL. In this case, the default
resolver tries to read the XML source from the filesystem.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("file:///tmp/spam.xml")
The following example parses XML from the filesystem. When given a
relative file path in the local OS's format, we must first convert that
path to a URI that our reader objects can use.
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Lib import Uri
file_uri = Uri.OsPathToUri('spam.xml')
doc = NonvalidatingReader.parseUri(file_uri)
The following example parses XML from a string. Note that it does
not provide a document/base URI.
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs</spam>")
In the following example, we are parsing XML from a string in a
case where the document does need a base URI to be specified.
from Ft.Xml.Domlette import NonvalidatingReader
s = """<!DOCTYPE spam [ <!ENTITY eggs "eggs.xml"> ]>
<spam>&eggs;</spam>"""
doc = NonvalidatingReader.parseString(s, 'http://foo/test/spam.xml')
# during parsing, the replacement text for &eggs;
# will be obtained from http://foo/test/eggs.xml
In all of the above examples, doc is now a Domlette node object.
4Suite currently offers one Domlette implementation, written in C,
called cDomlette.
3.1.6 EntityReader Examples
Sometimes you need to parse a fragment of XML rather than the full
document. If operating in non-validating mode is sufficient, Domlette
has a reader that can handle this case. When parsing such a fragment,
EntityReader returns a Domlette document fragment
rather than a document object.
from Ft.Xml.Domlette import EntityReader
s = """
<spam1>eggs</spam1>
<spam2>more eggs</spam2>
"""
docfrag = EntityReader.parseString(s, 'http://foo/test/spam.xml')
Note
The content parsed by EntityReader must
be an XML External Parsed Entity. This means that it can't be just any
XML document. The main limitation is that it must not have a
document type declaration.
3.1.7 ValidatingReader
If you want to validate a document with a DTD as you parse it, use
the ValidatingReader object instead. If
ValidatingReader discovers that the document that
it is currently parsing is invalid, then it throws a
Ft.Xml.ReaderException and does not finish
parsing the document. The following example illustrates these
concepts.
# ValidatingReader is a global instance
from Ft.Xml.Domlette import ValidatingReader
XML = """<!DOCTYPE a [
<!ELEMENT a (b, b)>
<!ELEMENT b EMPTY>
]>
<a><b/><b/></a>"""
doc = ValidatingReader.parseString(XML, "urn:x-example:valid-a")
# And of course, as with other readers, you can use `parse`, `parseUri`, and
# `parseStream` as well.
# The following document, however, is invalid because an `a` element can only
# have two `b` children according to its DTD.
XML = """<!DOCTYPE a [
<!ELEMENT a (b, b)>
<!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>"""
# This throws a `Ft.Xml.ReaderException` when it encounters invalid structure,
# and does not finish parsing the document into `doc`.
doc = ValidatingReader.parseString(XML, "urn:x-example:invalid-a")
3.1.8 NoExtDtdReader
When using NonvalidatingReader to parse a
document, that document's DTD is still opened and read to obtain
information such as entity declarations and default attribute values.
You cannot suppress reading of the internal DTD subset, but you can
prevent the external subset from being accessed by using
NoExtDtdReader. This won't affect the processing
of external parameter entities defined in the internal DTD subset. Use
this object as you would use
NonvalidatingReader.
3.1.9 Creating your own reader instance
In some cases you might not want to use the global reader
instances. For instance in multithreaded use, you might want a reader
per thread. Or you might want to change some of the parameters on the
readers. If so, you can create your own reader instance:
from Ft.Xml.Domlette import NonvalidatingReaderBase
reader = NonvalidatingReaderBase()
doc = reader.parseUri("http://xmlhack.com/read.php?item=1560")
Instead of NonvalidatingReaderBase, you
could use NoExtDtdReaderBase or
ValidatingReaderBase, depending on your needs.
Each of these 3 readers takes an optional
inputSourceFactory constructor argument, which
you can use to supply a custom URI resolver.
3.1.11 Converting from other DOM libraries
You can convert another Python DOM object (e.g. 4DOM or minidom)
to a Domlette object using the function
ConvertDocument:
from Ft.Xml.Domlette import ConvertDocument
converted_document = ConvertDocument(oldDocument, documentURI=u'http://www.example.org/')
The documentURI parameter provides a base
URI for the converted nodes. If it is not specified, the documentURI
and then baseURI attributes of the source DOM are checked, as defined in DOM Level 3. If no
URI is found in this way, a warning is issued and a UUID URI is
generated for the new Domlette object.
3.2 Domlette API summary
Interacting with Domlette documents
You will use a large part of the Domlette API to interact with the
model of your XML documents. The implementation of this part of the API is
found in the Ft.Xml.cDomlette
module. This part of the API allows you to navigate around a document and
modify the content of that document. It is very similar to the DOM Level 2
specification and follows some of the DOM Level 3
specification; feel free to refer to those specifications and the
4Suite API documentation for details about the intended behavior of this
API. You can find brief descriptions of the methods and attributes
provided by this API listed below. This API is also nearly the same as the
API for xml.dom, which is bundled
with Python. The node type constants are inherited directly from
xml.dom.Node.
Many objects that you will work with in the Domlette API are
descendants of the Domlette Node class.
Documents, document fragments (of class
DocumentFragment), Elements,
attributes (class Attr), text (class
Text), processing instructions (class
ProcessingInstruction), and comments (class
Comment) are all nodes; any node operations are
defined on objects of these types, as well. Some operations do not make
sense on some objects, however. For example, it does not make sense to add
children to an attribute node.
In the DOM model of XML documents, there is a
Document node which represents the starting point
for the other pieces of the document. This node is not the root element of the document; rather, the
Document node contains the root element as its only element
child. The Document node may have other children,
though, such as processing instructions and comments.
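The distinction between the Document node and the root element can be seen in a short sketch. It uses xml.dom.minidom from the Python 3 standard library, which shares this part of the DOM model, rather than Domlette itself.

```python
from xml.dom import minidom, Node

doc = minidom.parseString("<!--Hi!--><foo a='1'/>")

# The Document node is the starting point, not the root element.
root = doc.documentElement
child_types = [child.nodeType for child in doc.childNodes]

print(doc.nodeType == Node.DOCUMENT_NODE)  # True
print(root.tagName)                        # foo
print(root.parentNode is doc)              # True
print(doc.parentNode)                      # None
```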
You can easily access properties of a node directly. The following
properties are available on any node. These properties generally store
information about the structure of the document in the near "vicinity" of
the target node.
Properties available on every Node
object
- attributes
-
This is a Python dictionary containing the attributes defined
on the target node. The key for the dictionary is a tuple containing
the namespace and local name of the attribute. The value associated
with this attribute name tuple is the attribute (of class
Attr) itself.
node = Parse("<foo a='1'/>")
print node.childNodes[0].attributes
{(None, u'a'): <Attr at 0x40870ecc: name u'a', value u'1'>}
- baseURI
-
This is the base URI in scope for the target node as a Python
unicode string.
- childNodes
-
This is the Python list of all the node children of the target
node. Note that in DOM terminology, the attributes of a node are
not children of that node.
node = Parse("<foo a='1'/>")
print node.childNodes
[<Element at 0x4086052c: name u'foo', 1 attributes, 0 children>]
- firstChild
-
This is the first child node of the target node. This is
equivalent to childNodes[0], and is a useful property
for quickly walking the document tree.
node = Parse("<foo a='1'/>")
print node.firstChild
<Element at 0x40860a6c: name u'foo', 1 attributes, 0 children>
- lastChild
-
This is the last child node of the target node. This is
equivalent to childNodes[-1].
node = Parse("<foo a='1'/><!--Hi!-->")
print node.lastChild
<Comment at 0x4087caf4: u'Hi!'>
- localName
-
This is the local name of the target node as a Python unicode
string.
- namespaceURI
-
This is the namespace URI of the target node as a Python
unicode string.
- nextSibling
-
This is the node immediately following the target node, or
None if the target node is the last child of its parent
(or if the target node is an attribute, as attributes are
unordered).
- nodeValue
-
This is the value of the target node as a Python unicode
string, if the target node has a string value. If not, this is
None. To illustrate some of the possibilities,
attributes and text nodes have values, while elements and documents
do not.
- ownerDocument
-
This is the Document node in which the
target node is contained.
- parentNode
-
This is the parent of the target node. If the target node is a
Document node, then this will be
None; Document nodes do not have
parents.
- prefix
-
This is the namespace prefix of the current node, or
None if the current node does not (or cannot) have a
namespace prefix.
- previousSibling
-
This is the node immediately preceding the target node, or
None if the target node is the first child of its
parent (or if the target node is an attribute, as attributes are
unordered).
- rootNode
-
This is a synonym for
ownerDocument.
- xmlBase
-
This is a synonym for baseURI.
In addition to accessing the structure relative to a node, there is
also a set of operations that we can perform on these structures,
including a variety of operations for modifying the document. Some of
these methods allow you to add new nodes in various places; note that in
the DOM, only Document nodes can create new nodes. See “Methods available to Document
objects” for details. The following methods are
available on any node.
Methods available to every Node
object
appendChild(node)
-
This method adds node as the last child
of the current instance. This is useful for manually building a
document in breadth-first document order.
insertBefore(newChild, refChild)
-
This method adds the node newChild to
the current instance immediately before child node
refChild.
replaceChild(newChild, oldChild)
-
This method replaces the child node
oldChild with the
newChild node.
removeChild(oldChild)
-
This method removes the oldChild node
as a child of the instance node.
cloneNode(deep)
-
This method returns a new copy of the current instance. If
(and only if) deep is true, then we copy
deeply: the node's attributes and children are also copied
deeply.
isSameNode(otherNode)
-
This method determines whether the instance node and
otherNode are the same node based upon object
identity.
normalize()
-
This method merges any adjacent text nodes in the attributes
or descendants of the current instance.
hasChildNodes()
-
This method returns true if and only if the instance node has
any child nodes.
xpath(expr, explicitNss)
-
This method evaluates the XPath expression
expr with the current instance as the
expression context and returns an appropriately-valued result. The
explicitNss parameter is optional; it is a
Python dictionary mapping namespace prefixes to namespaces for use
in the expression. See “XPath queries” for
details.
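The tree-editing methods listed above can be exercised as follows. This is a sketch against xml.dom.minidom from the Python 3 standard library, on the assumption that these standard DOM methods behave the same way on Domlette nodes; the element names are arbitrary.

```python
from xml.dom import minidom

doc = minidom.parseString("<ham><eggs/></ham>")
ham = doc.documentElement
eggs = ham.firstChild

spam = doc.createElementNS(None, "spam")
ham.appendChild(spam)                  # <ham><eggs/><spam/></ham>

toast = doc.createElementNS(None, "toast")
ham.insertBefore(toast, eggs)          # <ham><toast/><eggs/><spam/></ham>

bacon = doc.createElementNS(None, "bacon")
ham.replaceChild(bacon, eggs)          # <ham><toast/><bacon/><spam/></ham>

ham.removeChild(toast)                 # <ham><bacon/><spam/></ham>

copy = ham.cloneNode(True)             # deep copy, children included
shallow = ham.cloneNode(False)         # the element only, no children

# Two adjacent text nodes are merged into one by normalize().
spam.appendChild(doc.createTextNode("more "))
spam.appendChild(doc.createTextNode("eggs"))
before = len(spam.childNodes)
ham.normalize()
after = len(spam.childNodes)

result = ham.toxml()
print(result)  # <ham><bacon/><spam>more eggs</spam></ham>
```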
In addition to their behavior as nodes,
Document nodes are uniquely responsible for a
number of tasks. For example, only Document nodes
can create other nodes. The following methods are available only to
Document nodes.
Methods available to Document
objects
createElementNS(namespaceURI, qualifiedName)
-
This method creates and returns a new
Element with the given namespace URI and
qualified name.
createAttributeNS(namespaceURI, qualifiedName)
-
This method creates and returns a new attribute
(Attr object) with the given namespace URI
and qualified name.
createTextNode(data)
-
This method creates and returns a new
Text node with the string value of
data.
createProcessingInstruction(target, data)
-
This method creates and returns a new processing instruction
(ProcessingInstruction object) with the given
target name and contents taken from
data.
createComment(data)
-
This method creates and returns a new
Comment with the string value of
data.
createDocumentFragment()
-
This method creates and returns a new, empty document fragment
(DocumentFragment object).
importNode(importedNode, deep)
-
Nodes can only belong to one document at a time. This method
creates a copy of the node importedNode that
belongs to the instance (but which does not yet have a parent). If
(and only if) deep is true, then we copy
deeply: the node's attributes and children are also copied deeply
and imported.
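To illustrate importNode, the following sketch copies an element from one document into another. It uses xml.dom.minidom from the Python 3 standard library as a stand-in for Domlette.

```python
from xml.dom import minidom

source = minidom.parseString("<a><b>text</b></a>")
target = minidom.parseString("<root/>")

b = source.documentElement.firstChild

# Deep-copy b into the target document; the copy has no parent yet.
imported = target.importNode(b, True)
orphan = imported.parentNode is None

target.documentElement.appendChild(imported)

print(target.documentElement.toxml())  # <root><b>text</b></root>
print(source.documentElement.toxml())  # <a><b>text</b></a>
```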
Document nodes also have a number of properties that are not found
on other nodes. These properties are summarized in the following
list.
Properties available on Document
objects
- doctype
-
This is a DocumentType object that
encapsulates info about the document's "type", as described in its
DOCTYPE tag. In Domlette, which doesn't use such objects, the value
of the doctype property will always be
None.
- documentElement
-
This is the root element of the document.
- documentURI
-
This is the URI that identifies the document.
- implementation
-
This is the DOMImplementation that
created the document.
- publicId
-
This Domlette-specific property is the public ID of the DTD of
this document.
- rootNode
-
This refers to the current instance.
- systemId
-
This Domlette-specific property is the system ID of the DTD of
this document.
- unparsedEntities
-
This is the list of unparsed entities in the current
document.
Attributes (Attr objects) do not have any
special methods, but they do have a few additional properties. These
properties are summarized in the following list.
Properties available on Attr
objects
- name
-
This is the qualified name of the current instance.
- nodeName
-
This is a synonym for the name
property.
- ownerElement
-
This is a synonym for the parentNode
property.
- specified
-
You will probably never need this property. It is always
1. DOM says it should be 0 if
the attribute is present through defaulting, rather than explicitly specified
in the document. Telling the two apart is only possible if the DOM implementation
preserves certain details from DTD processing, which 4Suite never
does. Therefore the value is always 1.
- value
-
This is a synonym for the nodeValue
property.
Since attributes can only be attached to elements,
Element objects have a set of special methods for
managing which attributes are attached to them. We describe these methods
below.
Methods available to Element
objects
hasAttributeNS(namespaceURI, localName)
-
This method returns true if the current instance has an
attribute with the given namespace URI and local name, and false
otherwise.
getAttributeNS(namespaceURI, localName)
-
This method returns the attribute value of the attribute with the given
namespace URI and local name, if one exists. If not, this returns
None.
getAttributeNodeNS(namespaceURI, localName)
-
This method returns the Attr object of
the attribute with the given namespace URI and local name, if one
exists. If not, this returns None.
removeAttributeNS(namespaceURI, localName)
-
This method removes the attribute with the given namespace URI
and local name from the current instance element.
removeAttributeNode(node)
-
This method removes the attribute node
from the current instance element.
setAttributeNS(namespaceURI, qualifiedName, value)
-
This method adds an attribute or replaces an attribute with
the specified namespace URI and qualified name and sets the content
of that attribute to value.
setAttributeNodeNS(node)
-
This method adds or replaces an attribute using the
Attr object
node.
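A brief sketch of the namespace-aware attribute methods follows. It uses xml.dom.minidom from the Python 3 standard library; the namespace URI is hypothetical. One difference to keep in mind: for a missing attribute, minidom's getAttributeNS returns an empty string, whereas Domlette is described above as returning None.

```python
from xml.dom import minidom

NS = "http://example.org/ns"  # hypothetical namespace URI

doc = minidom.parseString("<foo a='1'/>")
foo = doc.documentElement

has_a = foo.hasAttributeNS(None, "a")      # attribute in no namespace
a_value = foo.getAttributeNS(None, "a")

# setAttributeNS takes a qualified name; lookups use the local name.
foo.setAttributeNS(NS, "x:b", "2")
b_node = foo.getAttributeNodeNS(NS, "b")
b_value = b_node.value

foo.removeAttributeNS(None, "a")
has_a_after = foo.hasAttributeNS(None, "a")

print(has_a, a_value, b_value, has_a_after)  # True 1 2 False
```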
Elements also have several properties above
and beyond what they get from being Nodes. See the
list below for details.
Properties available on Element
objects
- nodeName
-
This is the qualified name of the current instance.
- tagName
-
This is a synonym for nodeName.
Both Text and Comment
nodes are also more general CharacterData nodes in
the DOM. CharacterData nodes have several
additional properties and methods for managing the string data that they
contain. The individual Text and
Comment nodes, however, do not add any
functionality to their general CharacterData parent
class. You can find descriptions of the properties and methods offered by
CharacterData objects below.
Properties available on CharacterData
objects
- data
-
This is the string content of the current instance.
- length
-
This is the length of the string content of the current
instance.
- nodeValue
-
This is a synonym for data.
Methods available to CharacterData
objects
insertData(offset, data)
-
This method inserts the string data
into the content of the current instance at the index specified by
offset.
appendData(data)
-
This method appends the string data to
the end of the value of the current instance.
replaceData(offset, count, data)
-
This method replaces count number of
characters found at index offset in the
current instance with the string data.
substringData(offset, count)
-
This method retrieves and returns the part of the string value
of the current instance that begins at index
offset and extends
count characters.
deleteData(offset, count)
-
This method deletes the part of the string value of the
current instance that begins at index offset
and extends count characters.
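The CharacterData methods can be tried on a text node as sketched below, again with xml.dom.minidom from the Python 3 standard library standing in for Domlette. Offsets count from zero.

```python
from xml.dom import minidom

doc = minidom.parseString("<spam>eggs</spam>")
text = doc.documentElement.firstChild

text.appendData(" and ham")        # eggs and ham
text.insertData(0, "green ")       # green eggs and ham
text.replaceData(0, 5, "fried")    # fried eggs and ham
piece = text.substringData(6, 4)   # eggs
text.deleteData(10, 8)             # fried eggs

print(text.data)    # fried eggs
print(text.length)  # 10
```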
A few DOM actions are not "owned" by any individual document. In
effect, they are general-purpose operations. They can be found in
DOMImplementation objects. One such precreated
instance can be conveniently found at and used from
Ft.Xml.Domlette.implementation. The general methods
that such a DOMImplementation object offers are
listed below.
DOMImplementation methods:
createDocument(namespaceURI, qualifiedName, doctype)
-
This standard DOM method creates and returns a
Document object associated with the given
DocumentType object, and having a single
element child with the given QName and namespace. Since Domlette
does not use DocumentType objects, the
doctype argument must be given as None.
createRootNode(documentURI)
-
This Domlette-specific method creates a
Document object with the specified document
(base) URI. No document element is created. This method is generally
preferred over createDocument(); see
“Building a DOM from scratch”.
hasFeature(feature, version)
-
This method tests whether the DOM implementation implements a
specific feature.
3.2.1 What about getElementsByTagName()?
The getElementsByTagName() method isn't
supported, because there are better options. In particular, you can just
use XPath:
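In 4Suite the query would go through the xpath method of a node, described in “XPath query”. As a runnable stand-in that does not require 4Suite, the following sketch uses the limited XPath subset of xml.etree.ElementTree from the Python 3 standard library to find every a element at any depth.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<spam>eggs<a/><b><a/></b></spam>")

# './/a' selects all 'a' descendants, much as getElementsByTagName('a') would.
matches = root.findall(".//a")
count = len(matches)
print(count)  # 2
```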
For more possibilities, see getElementsByTagName
Alternatives.
3.3 Serializing Domlette nodes
Domlette comes with a couple of very fast printer functions which
also go to great pains to correctly handle character encoding issues:
Print and PrettyPrint.
Here are some serialization examples using the Domlette printers, given a
node 'node' (it doesn't have to be a document
node).
from Ft.Xml.Domlette import Print, PrettyPrint
# basic serialization to sys.stdout
Print(node)
# ... with extra whitespace (indenting)
PrettyPrint(node)
# ... using a single tab, rather than 2 spaces, to indent at each level
PrettyPrint(node, indent='\t')
# serializing to a utf-8 encoded file (binary mode, since the output is encoded bytes)
f = open('output.xml','wb')
Print(node, stream=f)
f.close()
# ... to an iso-8859-1 encoded file
f = open('output.xml','wb')
Print(node, stream=f, encoding='iso-8859-1')
f.close()
# ... to an ascii encoded string
import cStringIO
buf = cStringIO.StringIO()
Print(node, stream=buf, encoding='us-ascii')
s = buf.getvalue()  # read the value before closing; a closed buffer cannot be read
buf.close()
# Normally, output syntax (XML or HTML) is chosen based on the DOM type,
# which is automatically detected. A Domlette or XML DOM can be output in
# HTML syntax if the asHtml=1 argument is given.
PrettyPrint(node, asHtml=1)
See also: Serializing
XML from DOM or Domlette documents
3.4 Building a DOM from scratch
As an alternative to parsing a preexisting XML document, you can
also build a document model, with certain limitations, from the ground up.
W3C and Python DOM facilities for doing this are intended mainly for creating
a temporary document whose nodes will be imported into an existing document,
and while Domlette does offer a more convenient document creation method,
it has many of the same limitations. However, for most documents, its
capabilities should be sufficient.
The Ft.Xml.Domlette module
contains a DOMImplementation instance named
implementation which provides a set of methods for
initializing new Documents. The
implementation.createRootNode method takes a base URI
argument and provides a natural approach for creating an XPath model root node.
This is similar to the DOM idea of a document node and even closer to a DOM
document fragment (multiple element children are allowed). The
implementation.createDocument method, on the
other hand, is designed to come close to the DOM interface, although its
doctype argument must be None.
from Ft.Xml.Domlette import implementation
doc = implementation.createRootNode('file:///article.xml')
is the equivalent of
from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createDocument(EMPTY_NAMESPACE, None, None)
with the added advantage of doc.baseURI being set to
'file:///article.xml', which is not possible to set via standard DOM interfaces
(the baseURI attribute is read-only).
Similarly,
from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createRootNode('file:///article.xml')
docelement = doc.createElementNS(EMPTY_NAMESPACE, 'article')
doc.appendChild(docelement)
is the equivalent of
from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createDocument(EMPTY_NAMESPACE, 'article', None)
plus doc.baseURI being set to 'file:///article.xml'.
If you want as much fidelity to the DOM API as Domlette offers, use
implementation.createDocument. If you just want to
create a document or other such root-level node, and never mind the
strange parameters, use
implementation.createRootNode.
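If you do not have 4Suite at hand, you can try the DOM-style construction path with the standard library's xml.dom.minidom. The following is only a sketch, assuming Python 3. minidom has no createRootNode and no settable base URI, so it stands in for implementation.createDocument and not for Domlette itself.

```python
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()

# DOM style: the document element is created together with the document.
# The doctype argument is simply None here.
doc = impl.createDocument(None, "article", None)

# Two-step style: start from an empty document, then append the
# document element yourself.
doc2 = impl.createDocument(None, None, None)
doc2.appendChild(doc2.createElementNS(None, "article"))

print(doc.documentElement.tagName)
print(doc2.documentElement.tagName)
```

Both documents end up with a single article document element, which mirrors the equivalence described above.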
3.5 XPath query
You can easily perform XPath queries by using the
xpath method of cDomlette nodes, as
follows:
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
print doc.xpath(u'//a')
print doc.xpath(u'string(/spam)')
Notice: this is nothing like W3C DOM's XPath query module. The
emphasis, as usual with Domlette, is on speed, simplicity and
pythonic-ness.
The API, in brief:
node.xpath(expr[, explicitNss])
-
node - will be used as core of the context for evaluating the
XPath
-
expr - XPath expression in string or compiled form
-
explicitNss - (optional) any additional or overriding namespace
mappings in the form of a dictionary that maps prefixes to namespace
URIs. The base namespace mappings are taken from in-scope declarations
on the given node. This explicit dictionary is superimposed on the
base mappings.
For additional details, see “XPath queries”.
3.6 More on base URIs
For some users, always specifying a base URI feels like an
inconvenience. Perhaps they always generate XML sources from text or
streams without naturally associated URIs, and they have to figure out
schemes to come up with base URIs for the parse. But there is good reason
for this pickiness. Just ask one of the users who
got bitten by carelessness with base URIs in practice. It's better
to always put some amount of thought into base URIs when processing XML,
and 4Suite encourages this.
Note that 4Suite only enforces the requirement for base URIs in
cases where they are needed to make sense of a requested operation. Your
document must have a valid base URI if you use external entities,
XInclude, xsl:import, xsl:include, the XSLT document() function, the EXSLT
exsl:document element, or any other operations that require access to an
external resource. If your main use for URI resolution is XSLT import and
includes, you can avoid having to give valid base URIs by using XSLT
include paths.
A valid base URI starts with a scheme, such as
http:. A simple name, such as "spam", is a valid
relative URI reference, but not a valid base URI. Without a base URI, a
relative reference is no more useful than an apartment number given
without the address of the entire apartment building. Merging a base URI
with a relative reference is a string operation that is undertaken in a
standard manner, and is generally only useful when the base URI is
hierarchical; that is, it is a URL using one of the common schemes that
have slashes as path separators (e.g., http:, ftp:, gopher:, and most
file: URLs). The built-in 4Suite URI resolver
Ft.Lib.Uri.BASIC_RESOLVER knows
how to perform such resolution.
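The merging rules can be tried without 4Suite. The sketch below uses urllib.parse from the Python 3 standard library as a stand-in for Ft.Lib.Uri.BASIC_RESOLVER. The helper is_valid_base_uri and the example.org URIs are hypothetical names introduced only for illustration.

```python
from urllib.parse import urljoin, urlsplit

def is_valid_base_uri(uri):
    """A usable base URI must at least start with a scheme such as http:."""
    return bool(urlsplit(uri).scheme)

base = "http://example.org/docs/article.xml"

# "spam" is a fine relative reference, but it cannot serve as a base URI.
print(is_valid_base_uri("spam"))
print(is_valid_base_uri(base))

# Merging is a string operation on the hierarchical path of the base URI.
sibling = urljoin(base, "spam")
parent_level = urljoin(base, "../img/logo.png")
print(sibling)        # http://example.org/docs/spam
print(parent_level)   # http://example.org/img/logo.png
```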
3.7 Why does Domlette diverge from the DOM specification?
Domlette is not a complete or fully conformant DOM implementation,
but it does provide an interface very close to W3C DOM Level 2 and the
corresponding Python mapping as laid out in the
xml.dom API docs.
The areas of divergence are inconsequential for most users,
and generally reflect decisions made in the interest of eliminating
redundancy, inefficiency, and, to some degree, un-Pythonic design.
Also, one of the important design principles for Domlette is that
where DOM and XPath disagree, XPath wins; aside from making things
more efficient to implement, this behavior is generally what people
want in an XML document model.
It is also worth noting that in the interest of usability,
all DOM implementations exhibit some degree of variation from the
specs. Coding a completely implementation-agnostic DOM application
is difficult and usually unnecessary.
4 SAX
Saxlette is a fast SAX implementation, all written in C. Its API is
similar to those of Python's
built-in SAX.
from xml import sax
from Ft.Xml import CreateInputSource
class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1
parser = sax.make_parser(['Ft.Xml.Sax'])
handler = element_counter()
parser.setContentHandler(handler)
#'file:ot.xml' or file('ot.xml') or file('ot.xml').read() would work just as well, of course
parser.parse(CreateInputSource('ot.xml'))
print "Elements counted:", handler.ecount
If you don't care about PySax compatibility, you can use the more
specialized API, which involves the following lines in place of the
equivalents above:
from Ft.Xml import Sax
...
class element_counter:
....
parser = Sax.CreateParser()
The biggest API difference between Saxlette and PySax is that
Saxlette only supports SAX 2. For example,
feature_namespaces is hard-wired to
True and feature_namespace_prefixes to
False (which is exactly what SAX2 says is required).
Saxlette also combines all adjacent text events, which eliminates one of the
pain points of PySax.
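To see what the namespace-aware event style looks like, you can switch the standard-library parser into the same mode. This is a sketch assuming Python 3 and the built-in expat driver, not Saxlette. The class name ElementCounter and the namespace URI are made up for the example.

```python
import io
from xml import sax
from xml.sax.handler import feature_namespaces

class ElementCounter(sax.ContentHandler):
    def startDocument(self):
        self.names = []

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, name is a (namespace URI, local name) tuple.
        self.names.append(name)

parser = sax.make_parser()
# The standard-library parser has namespace processing off by default,
# so turn it on to get the event style that Saxlette always uses.
parser.setFeature(feature_namespaces, True)
handler = ElementCounter()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(b'<a xmlns="urn:x-example:ns"><b/><b/></a>'))

print(len(handler.names))
print(handler.names[0])
```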
The argument to the parse method is a URI, a SAX
input source, or a 4Suite input source. The example above used an input source
built with CreateInputSource. The following example shows similar code using
the factory in 4Suite's Ft.Xml.InputSource module.
from Ft.Xml import InputSource, Sax
factory = InputSource.DefaultFactory
isrc = factory.fromUri("file:ot.xml")
class element_counter:
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1
parser = Sax.CreateParser()
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(isrc)
print "Elements counted:", handler.ecount
4.1 Validating a document while parsing it using SAX
To enable validation of your documents while otherwise parsing them
normally with SAX, set the
xml.sax.handler.feature_validation feature to
True on your parser using a line similar to
parser.setFeature(xml.sax.handler.feature_validation, True).
The parser will then throw an
xml.sax._exceptions.SAXParseException exception if
it determines that the document is invalid, and it will stop parsing the
document. Handlers for document components that have been parsed will be
called, however. The following example illustrates these concepts.
from Ft.Xml import InputSource, Sax
factory = InputSource.DefaultFactory
XML = """<!DOCTYPE a [
<!ELEMENT a (b, b)>
<!ELEMENT b EMPTY>
]>
<a><b/><b/></a>"""
isrc = factory.fromString(XML, 'urn:x-example:valid-a')
class element_counter:
    def startDocument(self):
        self.scount = 0
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.scount += 1
    def endElementNS(self, name, qname):
        self.ecount += 1
parser = Sax.CreateParser()
handler = element_counter()
parser.setContentHandler(handler)
# And now, to enable validation...
import xml.sax.handler
parser.setFeature(xml.sax.handler.feature_validation, True)
parser.parse(isrc)
print "Saw", handler.scount, "start tags"
print "Saw", handler.ecount, "end tags"
# And now we show what happens on an invalid document:
XML = """<!DOCTYPE a [
<!ELEMENT a (b, b)>
<!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>"""
isrc = factory.fromString(XML, 'urn:x-example:invalid-a')
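from xml.sax import SAXParseException
try:
    parser.parse(isrc)
except SAXParseException:
    print "Document is invalid; parsing stopped"
print "Saw", handler.scount, "start tags"
print "Saw", handler.ecount, "end tags"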
# The above document is invalid; it has one more `b` element than is
# allowed by the DTD. The handlers have still been called for those
# parts of the document that have been parsed.
4.2 Walking a DOM to fire SAX events
Saxlette has the ability to walk a Domlette tree, firing off events
to a handler as if from a source document parse. This ability used to be
too well hidden, so an API addition makes it more
readily available:
Ft.Xml.Domlette.SaxWalker. The following example
shows how easy it is to use:
from Ft.Xml.Domlette import SaxWalker
from Ft.Xml import Parse
XML = "<a><b/><b/></a>"
class element_counter:
    def startDocument(self):
        self.ecount = 0
    def startElementNS(self, name, qname, attribs):
        self.ecount += 1
#First get a Domlette document node
doc = Parse(XML)
#Then SAX "parse" it
parser = SaxWalker(doc)
handler = element_counter()
parser.setContentHandler(handler)
#You can set any properties or features, or do whatever
#you would to a regular SAX2 parser instance here
parser.parse() #called without any argument
print "Elements counted:", handler.ecount
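The idea of replaying a tree as SAX events can be sketched with the Python 3 standard library alone. The walk function below is a hypothetical helper, not part of 4Suite or of xml.sax. It handles only document, element, and text nodes, and it fires the plain startElement events, not the namespace-aware ones that SaxWalker produces.

```python
from xml.dom import minidom
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

def walk(node, handler):
    """Fire SAX events on handler for a minidom node and its descendants."""
    if node.nodeType == node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            walk(child, handler)
        handler.endDocument()
    elif node.nodeType == node.ELEMENT_NODE:
        attrs = AttributesImpl(dict(node.attributes.items()))
        handler.startElement(node.tagName, attrs)
        for child in node.childNodes:
            walk(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        handler.characters(node.data)

class ElementCounter(ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElement(self, name, attrs):
        self.ecount += 1

doc = minidom.parseString("<a><b/><b/></a>")
handler = ElementCounter()
walk(doc, handler)
print("Elements counted:", handler.ecount)
```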
4.3 Building a Domlette from SAX events
Saxlette includes a convenience ContentHandler
(Ft.Xml.Sax.DomBuilder) which listens for SAX
events and constructs Domlette Documents.
4.4 Feeding a generator from SAX events
Python's generators are special functions that can produce a series
of partial results within the course of running. The calling program can
start up a generator, which is suspended when a partial result is yielded,
and resumed explicitly by the program when the next result is required.
This capability is mirrored in the Expat parser that is the basis of
Saxlette. Saxlette has a feature, FEATURE_GENERATOR,
which you can set on a parser object to enable generator semantics. If
this feature is set, the parse() method returns an
iterator. This iterator yields results set by the SAX handlers. The
handlers specify the partial results by setting the property
PROPERTY_YIELD_RESULT with the value to be yielded. As
an example, the following code reports the names of all attributes used in
the document.
from Ft.Xml import Sax, CreateInputSource
class report_attributes:
    def __init__(self, parser):
        self.parser = parser
        return
    def startElementNS(self, name, qname, attribs):
        #Hand the attributes to the generator as the next partial result
        self.parser.setProperty(Sax.PROPERTY_YIELD_RESULT, attribs)
        return
parser = Sax.CreateParser()
parser.setFeature(Sax.FEATURE_GENERATOR, True)
handler = report_attributes(parser)
parser.setContentHandler(handler)
attribs_iterator = parser.parse(CreateInputSource('test.xhtml'))
for attribs in attribs_iterator:
    for name in attribs.keys(): print name
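A comparable pull-style flow is available in the Python 3 standard library through xml.etree.ElementTree.iterparse. The sketch below is not Saxlette's FEATURE_GENERATOR mechanism, only an analogous pattern in which the caller drives the parse and receives partial results one at a time. The function name iter_attribute_names and the sample document are illustrative.

```python
import io
import xml.etree.ElementTree as ET

def iter_attribute_names(source):
    """Yield each attribute name as soon as its start tag has been parsed."""
    for _event, elem in ET.iterparse(source, events=("start",)):
        for name in elem.attrib:
            yield name

SAMPLE = b'<doc id="1"><p class="x" lang="en"/></doc>'
names = list(iter_attribute_names(io.BytesIO(SAMPLE)))
print(names)
```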
4.5 SAX filters
In SAX processing, the parser passes to the application a stream of events that represents the XML content. An important aspect of SAX is the user's ability to create SAX filters, which accept a stream of SAX events and pass on a modified stream. For example, you might use a SAX filter to look for DocBook sect1, sect2, etc. elements and rename them to section elements before passing them on for further processing (presumably by a SAX handler that only understands the latter form). You can also chain SAX filters. The idea behind SAX filters is usually reuse across a broad array of applications, with each filter focused on a single task that can be cleanly separated from upstream and downstream processing. SAX filters can thus be useful building blocks for XML pipelines.
from xml import sax
from xml.sax.saxutils import XMLFilterBase
from Ft.Xml import CreateInputSource, XML_NAMESPACE as XMLNS
from Ft.Xml.Sax import SaxPrinter
XML = """<?xml version="1.0" encoding="utf-8"?>
<menu>
<item id="A" xml:lang="en">Orange juice</item>
<item id="A" xml:lang="es">Jugo de naranja</item>
<item id="B" xml:lang="en">Toast</item>
<item id="B" xml:lang="es">Pan tostada
<note xml:lang="en">Wheat bread only, please</note>
</item>
</menu>
"""
#Define constants for the two states we care about
ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2
class english_only_filter(XMLFilterBase):
    def __init__(self, downstream):
        XMLFilterBase.__init__(self, downstream)
        return
    def startDocument(self):
        #Set the initial state, and set up the stack of states
        self._state_stack = [ALLOW_CONTENT]
        XMLFilterBase.startDocument(self)
        return
    def startElementNS(self, name, qname, attrs):
        #Check if there is any language attribute
        lang = attrs.get((XMLNS, 'lang'))
        if lang:
            #Set the state as appropriate
            if lang[:2] == 'en':
                new_state = ALLOW_CONTENT
            else:
                new_state = SUPPRESS_CONTENT
        else:
            #No xml:lang here, so the enclosing state still applies
            new_state = self._state_stack[-1]
        #Always update the stack with the current state
        #Even if it has not changed
        self._state_stack.append(new_state)
        #Only forward the event if the state warrants it
        if new_state == ALLOW_CONTENT:
            XMLFilterBase.startElementNS(self, name, qname, attrs)
        return
    def endElementNS(self, name, qname):
        #Pop the state that was pushed for this element
        state = self._state_stack.pop()
        #Only forward the event if the state warrants it
        if state == ALLOW_CONTENT:
            XMLFilterBase.endElementNS(self, name, qname)
        return
    def characters(self, content):
        #Only forward the event if the state warrants it
        if self._state_stack[-1] == ALLOW_CONTENT:
            XMLFilterBase.characters(self, content)
        return
if __name__ == "__main__":
    parser = sax.make_parser(['Ft.Xml.Sax'])
    #SaxPrinter is a special SAX handler that merely writes
    #SAX events back into an XML document
    filtered_parser = english_only_filter(parser)
    handler = SaxPrinter()
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))
Most SAX handlers operate as state machines, meaning they manage some variables based on the stream of events that come in, and change behavior based on these variables. english_only_filter is set up to be in one of two states: one in which content is passed on to the downstream handler, and one in which content is suppressed. This state is marked in the self._state_stack. The state is initially set to ALLOW_CONTENT, and changed to SUPPRESS_CONTENT if the filter encounters an xml:lang attribute that represents a language other than English (which can be done by checking the first two characters of the value, according to the rules of standard language codes). It has to be a stack because XML language specifications are scoped, so that in the example XML at the top of the listing the string "Pan tostada" is within the scope of the element with the attribute xml:lang="es", and so it is marked as being in Spanish. The entire note element, however, is marked as being in English by an overriding xml:lang="en" attribute.
The SAX handler is set to Ft.Xml.Sax.SaxPrinter, which channels the final SAX events to a 4Suite printer that creates a serialized XML document. It's quite easy to chain filters. If you wanted the parser to send events to a filter of class some_other_filter, which then passed on events to english_only_filter, the relevant line would look as follows:
filtered_parser = english_only_filter(some_other_filter(parser))
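Filter chaining can be tried with the Python 3 standard library alone. The sketch below implements the DocBook renaming filter mentioned at the start of this section and chains it with a second, deliberately trivial filter. It assumes the stock expat parser and XMLGenerator in place of Saxlette and SaxPrinter, and it uses the non-namespace startElement events. The class names SectionRenamer and UppercaseText and the function transform are made up for the example.

```python
import io
import re
from xml.sax import make_parser
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

SECT = re.compile(r"sect[1-5]$")

class SectionRenamer(XMLFilterBase):
    """Rename DocBook sect1..sect5 elements to section."""
    def startElement(self, name, attrs):
        if SECT.match(name):
            name = "section"
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        if SECT.match(name):
            name = "section"
        XMLFilterBase.endElement(self, name)

class UppercaseText(XMLFilterBase):
    """Uppercase all character data; it does one small job only."""
    def characters(self, content):
        XMLFilterBase.characters(self, content.upper())

def transform(xml_bytes):
    out = io.StringIO()
    # Events flow: parser -> SectionRenamer -> UppercaseText -> XMLGenerator
    chain = UppercaseText(SectionRenamer(make_parser()))
    chain.setContentHandler(XMLGenerator(out, "utf-8"))
    chain.parse(io.BytesIO(xml_bytes))
    return out.getvalue()

result = transform(b"<article><sect1><title>Intro</title>"
                   b"<sect2><para>text</para></sect2></sect1></article>")
print(result)
```

Each filter only overrides the events it cares about. Everything else is forwarded unchanged by XMLFilterBase, which is what makes the filters easy to reorder or reuse.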
4.6 Streaming canonicalization
The combination of streaming parsing using Saxlette and streaming serialization using Ft.Xml.Lib.XmlPrinter.CanonicalXmlPrinter allows for
very efficient XML canonicalization (c14n).
import sys
from xml import sax
from Ft.Xml import CreateInputSource
from Ft.Xml.Sax import SaxPrinter
from Ft.Xml.Lib.XmlPrinter import CanonicalXmlPrinter
parser = sax.make_parser(['Ft.Xml.Sax'])
handler = SaxPrinter(CanonicalXmlPrinter(sys.stdout))
parser.setContentHandler(handler)
parser.parse(CreateInputSource(' <a><b b="1" a="2"/></a> '))
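To see what canonical output looks like without 4Suite, you can use xml.etree.ElementTree.canonicalize from the standard library. This is only a comparison sketch, assuming Python 3.8 or later. It implements C14N 2.0 and is not the streaming Saxlette pipeline shown above, so details of the output may differ from what CanonicalXmlPrinter writes.

```python
import xml.etree.ElementTree as ET

# Same sample document as above, without the surrounding whitespace.
canonical = ET.canonicalize('<a><b b="1" a="2"/></a>')

# Attributes are sorted and the empty element is written as a
# start tag and end tag pair.
print(canonical)
```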
5 XPath queries
4Suite provides an XPath processing engine, compliant with the W3C XPath 1.0 specification.
This query engine is accessible through Ft.Xml.XPath.
5.1 The quickest option
If you are using Domlette, as described above, the quickest and
easiest way to use the XPath facility in 4Suite is the
xpath() method, which any Domlette
Node supports:
from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
doc2 = NonvalidatingReader.parseString("<spam>eggs<eggs n='1'> and ham</eggs></spam>")
print doc.xpath(u'(//a)[1]')
print doc.xpath(u'string(/spam)')
print doc2.xpath(u'string(//eggs/@n)')
The line
print doc.xpath(u'(//a)[1]')
is actually a shortcut for the following more involved construct,
which is described in detail in the next section:
from Ft.Xml.XPath import Evaluate
print Evaluate(u'(//a)[1]', contextNode=doc)
This example prints three lines. The first line shows a string
representation of a list containing a single element. As we see from this
line, an XPath selection of nodes returns a Python list. In this case, it
is a list containing a single element—the first element with a local name
of a, which has no attributes and no
children. The second line shows the correct string value of the selected
spam element, and the third line shows
the correct string value of the n
attribute.
[<Element at 0xb7d10bb4: name u'a', 0 attributes, 0 children>]
eggs
1
5.2 Type mappings
4Suite XPath functions return results with Python types that depend
on the XPath data model type of the query result. The following list shows
how the five XPath result types (string, number, boolean, node-set and
object) are mapped to Python types:
-
XPath string: Python unicode type
-
XPath number: Python float type (int or long also accepted), or
instance of Ft.Lib.number.nan (for NaN) or Ft.Lib.number.inf (for
Infinity)
-
XPath boolean: Ft.Lib.boolean instance
-
XPath node-set: Python list of Domlette nodes, in document
order, with no duplicates
-
XPath foreign object: any other Python object (you will very
rarely encounter this case)
5.3 Advanced use
XPath expressions can refer to both variables and qualified names
(QNames) that must be defined by the environment that is executing the
XPath expression. This section describes how to use these advanced
features of XPath using the 4Suite interface.
4Suite's XPath implementation uses a Domlette node as the context
node for XPath operations, so the document must be parsed
before XPath can be used to access it. The following example parses an
XML document and explicitly sets up an XPath context to run an XPath
query that extracts content from it.
XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""
from Ft.Xml import Parse
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Evaluate
doc = Parse(XML)
ctx = Context(doc)
nodes = Evaluate(u'//em', ctx)
# The return value, a node set, comes back as a Python list of nodes
# which may be accessed using an iterator
for n in nodes:
    # print dir(n)
    print n.tagName
    print n.firstChild.nodeValue
XPath always requires a context for execution; a common XPath
context is the root of the target document, such as we did in the above
example. Think about an XPath query being executed from some location in
an XML document. This location in the document is a necessary component of
using XPath.
There is more to an XPath context than just the context node, but if
your needs are as straightforward as those of the above example, there is
an abbreviated form of the Evaluate function for
this purpose. For example, the following fragment is equivalent to the two
lines creating a context and evaluating the expression in the above
example.
# No need to create a context object
Evaluate(u'//em', contextNode=doc)
If your source document uses XML Namespaces you will likely need to
use QNames in your XPath expressions. For this to work, you'll need to
introduce namespace mappings into your XPath context. For example, if the
elements of our XML document above are in an XML namespace, then we must
set up our context slightly differently.
XML = """<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold'>emphasized Namespaced Text</em> text
</ham>"""
from Ft.Xml import Parse
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Evaluate
NSS = {u'ex': u'http://example.com/ns#'}
doc = Parse(XML)
ctx = Context(doc, processorNss=NSS)
nodes = Evaluate(u'//ex:em', ctx)
for n in nodes:
    # print dir(n)
    print n.tagName
    print n.firstChild.nodeValue
You define XPath namespace prefixes through a Python dictionary
(NSS in the above example) which maps these prefixes,
such as 'ex' in the above example, to the appropriate
namespace URI, such as 'http://example.com/ns#' in the
above example. This prefix mapping is added to your XPath context using
the processorNss parameter of the
Context constructor.
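The same prefix-to-URI dictionary idea appears in the Python 3 standard library, which makes it easy to experiment with. The sketch below uses xml.etree.ElementTree, whose findall supports only a subset of XPath, so it illustrates the namespace mapping and is not a replacement for Ft.Xml.XPath.

```python
import xml.etree.ElementTree as ET

XML = """<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold'>emphasized Namespaced Text</em> text
</ham>"""

# The prefix is local to the query; only the namespace URI has to match.
NSS = {"ex": "http://example.com/ns#"}

root = ET.fromstring(XML)
matches = root.findall(".//ex:em", NSS)
for elem in matches:
    print(elem.tag)
    print(elem.text)

# Without a prefix, the query looks for em in no namespace and finds nothing.
unprefixed = root.findall(".//em")
print(len(unprefixed))
```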
In a similar way, you can also pass in variable bindings which may
be used as values later in your XPath expressions. In this case, however,
variable names are given as Python tuples containing the namespace URI and local name of
the variable.
ctx = Context(node, varBindings=
{(EMPTY_NAMESPACE, u'date'): u'2003-06-20'})
Evaluate('event[@date = $date]', context=ctx)
This creates a variable named 'date' in no namespace, with
a value of '2003-06-20'; this is then used for
comparison with the date attribute in the XPath expression.
XPath variable names are QNames, so you pass in variable names as
namespace/local name tuples. The values can be numbers, unicode objects or
boolean objects:
from Ft.Xml.XPath import boolean
ctx = Context(node, varBindings={(EMPTY_NAMESPACE, u'test'): boolean.true})
This sets the variable 'test' to the boolean value true (remember
that this is for the XPath environment, not the Python one), and again
this may be used as in any XSLT stylesheet.
If you only want a value once, you may of course still use string
constants, as in
nodes=Evaluate(u'//testPrefix:em[@type="bold"]',ctx)
Note the quotes used? These must be balanced, hence the literal
value uses double quotes.
5.4 Reusing parsed XPath queries
Sometimes you want to re-use an XPath expression and namespace
mapping multiple times, for efficiency and convenience. The following
example shows an example of this:
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader
DOCS = ["<spam xmlns='http://spam.com'>eggs</spam>",
        "<spam xmlns='http://spam.com'>grail</spam>",
        "<spam xmlns='http://spam.com'>nicht</spam>",
       ]
# Pre-compile for efficiency and convenience
expr = Compile(u"/a:spam[contains(., 'i')]")
ctx = Context(None, processorNss={u"a": u"http://spam.com"})
i = 1
for doc in DOCS:
    doc = NonvalidatingReader.parseString(doc.encode('UTF-8'),
                                          "http://spam.com/base")
    retval = Evaluate(expr, doc, ctx)
    if len(retval):
        print "Document", i, "meets our criteria"
    i += 1
Which should display:
Document 2 meets our criteria
Document 3 meets our criteria
5.5 Migration from PyXML's XPath
There is a usable XPath module in PyXML (warning: PyXML's XSLT
implementation is not usable: use 4Suite if you need XSLT), but there are
a lot of updates and improvements in the XPath library version in
4Suite.
If you are familiar with PyXML, you may have used a different form
of imports to load in XPath and XSLT features. The imports are different
under 4Suite.
Usage example:
-
PyXML usage (do not use with 4Suite):
import xml.xslt
import xml.xpath
-
4Suite usage (use these imports):
import Ft.Xml.XPath
import Ft.Xml.Xslt
6 XSLT processing
6.1 The super-simple XSLT API
For basic XSLT transform needs, or to get started quickly, the
Ft.Xml.Xslt module offers a quick
way to apply transforms to XML documents and get back the simple string
result. Within this module, the function of interest is
Transform.
Transform(fname_or_uri, string_stream_fname_uri_isrc, [param], [output])
-
The Transform function takes two required
arguments and two optional ones. The first is the source XML for the transform. The
second is the XSLT document. Both are given as a string, an object like an
open file, a local file path on your computer, an absolute URI, or
an InputSource object. The optional params argument is a dictionary of stylesheet parameters, the keys of
which may be given as unicode objects if they have no namespace,
or as (uri, localname) tuples if they do. The values are the overridden parameter values. If you do not supply the optional output argument, the return value is a string with the result
of the transform. If you do supply it, it must be a file-like object to which the output will be written, and the return value is then None.
XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""
from Ft.Xml.Xslt import Transform
# URL for the identity transform: reproduces the input XML in the result
ID_TRANSFORM = 'http://cvs.4suite.org/viewcvs/*checkout*/4Suite/Ft/Data/identity.xslt'
result = Transform(XML, ID_TRANSFORM)
print result
# If the above XML document were located in the file
# "target.xml", we could have used `Transform("target.xml", ID_TRANSFORM)`.
#It's more efficient to redirect the processor output to an output stream. The following does so:
import sys
result = Transform(XML, ID_TRANSFORM, output=sys.stdout)
print result  # prints None: the output was written to sys.stdout
6.2 Full XSLT processing API
Here is the general procedure for using the Python API for XSLT
processing:
-
Create an Ft.Xml.Xslt.Processor.Processor
instance.
-
Prepare Ft.Xml.InputSource instances (via
their factory) for the source XML and stylesheet.
-
Call the Processor's appendStylesheet
method, passing it the stylesheet's
InputSource.
-
Call the Processor's run method,
passing it the source document's
InputSource.
For input to our transform, we will use the namespaced example as in
the last section.
$ cat testNS.xml
<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with
<em type='bold' f='2'>emphasized Namespaced Text</em>
text
</ham>
For our stylesheet, we will again use one of the simplest useful
examples, the identity stylesheet.
$ cat identity.xsl
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The code below follows the processing outline, having converted the
input file and stylesheet to the URI format.
from Ft.Xml.Xslt import Processor
# We use the InputSource architecture
from Ft.Xml import InputSource
from Ft.Lib.Uri import OsPathToUri # path to URI conversions
processor = Processor.Processor()
# Prepare an InputSource for the source document
# Convert from local file to uri
srcAsUri = OsPathToUri('testNS.xml')
source = InputSource.DefaultFactory.fromUri(srcAsUri)
# Prepare an InputSource for the stylesheet
# Convert from local file to uri
ssAsUri = OsPathToUri('identity.xsl')
transform = InputSource.DefaultFactory.fromUri(ssAsUri)
processor.appendStylesheet(transform)
result = processor.run(source)
# result is a string with the serialized transform result
print result
You can call run multiple times on
different InputSources. When you're done, the
processor's reset method can be used to restore a
clean slate (at which point you would have to append stylesheets to the
processor again).
The following example uses our processor from the
previous example to transform a new XML document, this one constructed
manually.
XML = """<foo><bar/></foo>"""
source = InputSource.DefaultFactory.fromString(XML, 'http://example.org/foo')
result = processor.run(source)
# result is a string with the serialized transform result
print result
This code continues from the previous example to process the second
document, using the same processor and stylesheet. This
is a useful form when there is a requirement for server side processing of
multiple input documents with a common stylesheet.
6.3 Example
In the example below, strings are used as the source of the
transform (stylesheet) and source documents, and we are careful to pass in
a URI to identify each of them. In the source document, the URI is needed
for resolving external entity references and XIncludes. In the stylesheet,
the URI is needed for resolving document function
calls, xsl:includes and xsl:imports.
If you do not provide a URI and you attempt to use any of these
features, you may get an exception.
# The identity transform: duplicates the input to output
TRANSFORM = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>"""
SOURCE = """<spam id="eggs">I don't like spam</spam>"""
# The processor class is the core of the XSLT API
from Ft.Xml.Xslt import Processor
processor = Processor.Processor()
# We use the InputSource architecture
from Ft.Xml import InputSource
# Prepare an InputSource for the transform
transform = InputSource.DefaultFactory.fromString(TRANSFORM,
"http://spam.com/identity.xslt")
# Prepare an InputSource for the source document
source = InputSource.DefaultFactory.fromString(SOURCE,
"http://spam.com/doc.xml")
processor.appendStylesheet(transform)
result = processor.run(source)
# result is a string with the serialized transform result
print result
6.4 Using Domlette objects instead of InputSources
If your documents are already in the form of Domlette documents, you
don't need to create InputSources for them; you can
just use the Processor's
appendStylesheetNode and
runNode methods instead of
appendStylesheet and
run, respectively.
Note
It is usually slower to read the stylesheet from a Domlette object
than to parse a serialized document.
Note
The Domlette documents used in the following example are obtained
by parsing existing XML, but this approach can just as easily be used on
Domlette documents that are built programmatically (i.e. using the DOM
API).
# The identity transform: duplicates the input to output
TRANSFORM = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>"""
SOURCE = """<spam id="eggs">I don't like spam</spam>"""
from Ft.Xml.Xslt import Processor
processor = Processor.Processor()
from Ft.Xml.Domlette import NonvalidatingReader
# Create a DOM for the transform
transform = NonvalidatingReader.parseString(TRANSFORM,
"http://spam.com/identity.xslt")
# Create a DOM for the source document
source = NonvalidatingReader.parseString(SOURCE, "http://spam.com/doc.xml")
processor.appendStylesheetNode(transform, "http://spam.com/identity.xslt")
result = processor.runNode(source, "http://spam.com/doc.xml")
print result
If you have objects from another DOM library, you can first convert
them to Domlette objects as shown in “Converting from other DOM libraries”.
6.5 Top-level parameters
Passing parameters to a stylesheet
You can pass in stylesheet parameters as a Python dictionary. Use
the parameter names for keys. Values use the 4Suite XPath library's
standard type mappings, which are described in “Type mappings”.
Parameter and variable names in XPath/XSLT are actually
expanded-names, which we represent as (namespaceURI, localName) tuples. If
your parameter name is in a namespace, you have to use a tuple as the
mapping key. Otherwise, you may simply use a unicode string that
represents the local-name part only
(Ft.Xml.EMPTY_NAMESPACE is the default
namespace).
Here is an example, which passes in the computed "date" parameter to
the stylesheet from the program:
SRC = """<?xml version="1.0"?><dummy/>"""
STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:param name="date" select="'unknown'"/>
<xsl:output method="xml" indent="yes" encoding="us-ascii"/>
<xsl:template match="/">
<result>
<xsl:value-of select="$date"/>
</result>
</xsl:template>
</xsl:stylesheet>"""
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
import time
src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')
proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
params = {u'date': unicode(time.asctime())}
result = proc.run(src_isrc, topLevelParams=params)
print result
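The two key styles for parameter names can be pictured as one uniform mapping keyed by expanded-names. The sketch below is plain Python 3 and does not call 4Suite. The helper expand_param_names is hypothetical, how 4Suite normalizes keys internally is not described in this manual, and the value None for EMPTY_NAMESPACE is an assumption made for the example.

```python
EMPTY_NAMESPACE = None  # assumption: stands in for Ft.Xml.EMPTY_NAMESPACE

def expand_param_names(params):
    """Return a copy of params keyed by (namespaceURI, localName) tuples."""
    expanded = {}
    for key, value in params.items():
        if isinstance(key, tuple):
            # Already an expanded-name: keep it as it is.
            expanded[key] = value
        else:
            # A bare string names a parameter that is in no namespace.
            expanded[(EMPTY_NAMESPACE, key)] = value
    return expanded

params = {
    "date": "2003-06-20",
    ("http://example.com/ns#", "lang"): "en",
}
expanded = expand_param_names(params)
print(expanded)
```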
6.6 Using xml-stylesheet processing instructions
4Suite honors the Associating Stylesheets with
XML Documents W3C Recommendation and RFC 3023: XML Media
Types. Instead of (or in addition to) using the processor's
explicit APIs to establish the stylesheet to be used for the
transformation, the source document may contain an xml-stylesheet
processing instruction (PI) that refers to a stylesheet via a URI
reference.
The xml-stylesheet PI must meet the following criteria:
-
It must appear in the document prolog.
-
It must contain a "type" pseudo-attribute having one of the
following values:
-
application/xslt+xml
-
application/xslt
-
text/xml
-
application/xml
-
It must contain an "href" pseudo-attribute that is a URI
reference for the stylesheet. It will be resolved relative to the base
URI of the source document that contains the xml-stylesheet PI.
This example shows a PI being used to refer to the identity
stylesheet mentioned earlier:
<?xml-stylesheet type="application/xslt" href="identity.xsl"?>
If you need to add to the supported media types, e.g., to add the
nonstandard "text/xsl", then follow the example given in this
mailing list message.
If the PI contains "alternate" and "media" pseudo-attributes, the
package will do its best to handle them. See this
message for details and examples.
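To make the criteria above concrete, here is a rough sketch that finds qualifying xml-stylesheet PIs using only the Python standard library. It is not how 4Suite implements the feature: the function name stylesheet_uris is hypothetical, only double-quoted pseudo-attributes are recognized, and the "alternate" and "media" pseudo-attributes are ignored.

```python
import re
from xml.dom import minidom
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

ACCEPTED_TYPES = ('application/xslt+xml', 'application/xslt',
                  'text/xml', 'application/xml')

# Matches name="value" pairs; single-quoted values are not handled here
PSEUDO_ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def stylesheet_uris(xml_text, base_uri):
    """Return absolute stylesheet URIs named by PIs in the document prolog."""
    doc = minidom.parseString(xml_text)
    uris = []
    for node in doc.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            break  # the prolog ends where the document element starts
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == 'xml-stylesheet'):
            attrs = dict(PSEUDO_ATTR.findall(node.data))
            if attrs.get('type') in ACCEPTED_TYPES and 'href' in attrs:
                # href is resolved relative to the source document's base URI
                uris.append(urljoin(base_uri, attrs['href']))
    return uris

SRC = '<?xml-stylesheet type="application/xslt" href="identity.xsl"?><dummy/>'
found = stylesheet_uris(SRC, 'http://foo/dummy.xml')
```

With the source document at http://foo/dummy.xml, the relative reference identity.xsl resolves to http://foo/identity.xsl.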
6.7 Alternative output destinations
Normally, the processor buffers all output, then returns it as a
byte string. If you want to write directly to some other stream (any
Python file-like object that has a write method),
you can supply the stream as the optional
outputStream argument to the Processor's
run method. When you supply your own output
stream, the run method will return
None. Here is an example that writes directly to
stdout:
SRC = """<?xml version="1.0"?><dummy/>"""
STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="us-ascii"/>
<xsl:template match="/">
<result>hello world</result>
</xsl:template>
</xsl:stylesheet>"""
import sys
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')
proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
result = proc.run(src_isrc, outputStream=sys.stdout)
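Because the only requirement on the stream is a write method, you can also supply an object of your own. The sketch below shows such an object in plain Python. ChunkCollector and emit_result are hypothetical names, and emit_result is only a stand-in for the processor writing its serialized output; it is not a 4Suite call.

```python
class ChunkCollector(object):
    """Minimal file-like object: the only method it offers is write()."""
    def __init__(self):
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)

def emit_result(stream):
    # Hypothetical stand-in for a processor writing serialized output
    stream.write('<?xml version="1.0" encoding="us-ascii"?>\n')
    stream.write('<result>hello world</result>')
    return None  # mirrors run() returning None when given a stream

collector = ChunkCollector()
returned = emit_result(collector)
output = ''.join(collector.chunks)
```

An object like collector is the kind of value the outputStream argument accepts.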
You also have the option of other kinds of output. Just set the
writer argument of the processor's
run method to an instance of an XSLT output
writer, which is a handler of SAX-like events coming from the processor as
it generates the result tree. 4Suite provides several writer classes for
alternative output:
-
If you want the XSLT output as SAX events, use an instance of
Ft.Xml.Xslt.SaxWriter.SaxWriter. Give its
constructor a saxHandler keyword argument that
is your own PyXML SAX2 event handler.
-
If you want the XSLT output as a Domlette document, use an
instance of Ft.Xml.Xslt.RtfWriter.RtfWriter.
Give its constructor a second argument: the base URI of the document
to create. Obtain the document by calling the writer's
getResult method after XSLT processing is
finished.
-
If you want the XSLT output as any other kind of Python DOM
document, use an instance of
Ft.Xml.Xslt.DomWriter.DomWriter. Give its
constructor an implementation keyword argument
that is your desired DOM implementation. Also try to set the
ownerDoc to an existing Document node (from the
same implementation) from which a base URI for the new document can be
obtained.
-
If you want the XSLT output in a regular file, open a file for
writing, then pass the file object to
proc.run as the
outputStream argument, in the same way
as the example above, which used the sys.stdout
file object. An example is shown below.
-
If you want to make a custom output writer, just make your class
extend Ft.Xml.Xslt.NullWriter.NullWriter. If it
needs access to the XSLT output parameters, then the constructor
should take an instance of
Ft.Xml.Xslt.OutputParameters.OutputParameters,
which will have the data attributes method, version, encoding,
omitXmlDeclaration, standalone, doctypeSystem, doctypePublic,
mediaType, cdataSectionElements, and indent, which your writer can act
upon, if appropriate. See the NullWriter API
documentation for further info.
Here is an example of writing to a regular Domlette document:
SRC = """<?xml version="1.0"?><dummy/>"""
STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="us-ascii"/>
<xsl:template match="/">
<result>hello world</result>
</xsl:template>
</xsl:stylesheet>"""
import sys
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
from Ft.Xml.Xslt.DomWriter import DomWriter
from Ft.Xml.Domlette import PrettyPrint
src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')
from Ft.Xml.Domlette import implementation as impl
domlette_writer = DomWriter(implementation=impl)
proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
proc.run(src_isrc, writer=domlette_writer)
result_doc = domlette_writer.getResult()
PrettyPrint(result_doc)
This example writes the transform output to a file. This is a
variant of the earlier one. Output is
written to tmp.xml.
SRC = """<?xml version="1.0"?><dummy/>"""
STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="yes" encoding="us-ascii"/>
<xsl:template match="/">
<result>hello world</result>
</xsl:template>
</xsl:stylesheet>"""
import sys
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')
proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
f = open('tmp.xml', mode='w')
result = proc.run(src_isrc, outputStream=f)
f.close()
There are many more options available for customizing XSLT
development; see the Processor module documentation
for details:
>>> from Ft.Xml.Xslt import Processor
>>> help(Processor)
6.8 Transform chaining
4Suite provides some hooks for scenarios where the output from one
transform becomes the source document for another. This is called
transform chaining. The user still has to write the sequence of transform
invocations in the Python API (the 4xslt command can perform chaining for
the user). This section shows how.
In the following example the next transform in the chain is set from
within XSLT.
# The first transform: just reproduces all para elements within a wrapper
TRANSFORM = """
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:f="http://xmlns.4suite.org/ext"
extension-element-prefixes="f"
>
<!-- Top level param so that user can pass in the next transform in the
chain. By default, use the identity transform -->
<xsl:param name="next-xslt"/>
<!-- grab just the para elements for the output -->
<xsl:template match="/">
<parawrapper>
<xsl:apply-templates select="//para"/>
</parawrapper>
<!-- Set the next transform in the chain. You can also set to a
hard-coded string -->
<!-- notice that this is within a template, for instantiation -->
<f:chain-to href="{$next-xslt}"/>
</xsl:template>
<xsl:template match="para">
<xsl:copy-of select="."/>
</xsl:template>
</xsl:stylesheet>"""
DOC = """<doc>a<para>1</para>b<para>2</para>c</doc>"""
from Ft.Xml.Xslt import Processor
from Ft.Xml import InputSource
transform = InputSource.DefaultFactory.fromString(TRANSFORM, "urn:x-bogus:main.xslt")
IDT = u'http://cvs.4suite.org/viewcvs/*checkout*/4Suite/Ft/Data/identity.xslt'
processor = Processor.Processor()
processor.appendStylesheet(transform)
source = InputSource.DefaultFactory.fromString(DOC, "urn:x-bogus:doc.xml")
result = processor.run(source, topLevelParams={(None, 'next-xslt'): IDT})
print result
# processor.chainTo is the fully-resolved absolute URI of the next transform,
# or None if there was no f:chain-to element instantiated in the transform that
# the processor last processed.
next = processor.chainTo
processor = Processor.Processor()
processor.appendStylesheet(InputSource.DefaultFactory.fromUri(next))
# The output of the first transform is the source for the next one
source = InputSource.DefaultFactory.fromString(result, "urn:x-bogus:result.xml")
result = processor.run(source)
print result
next = processor.chainTo # Should now be None
print "chainTo:", processor.chainTo
Note: There is not yet an API for automating the transform chain
loop above. Ideas were discussed and an experiment was conducted here.
If you have ideas for a good API, please submit them to the mailing
list.
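As one possible shape for such a loop, here is a sketch in plain Python. It is not part of 4Suite. The names run_chain and run_one are hypothetical: in real use, run_one would create a Processor, append the stylesheet found at the given URI, run it on the source text, and return the result together with processor.chainTo. A fake run_one stands in for that here so the loop can be run on its own.

```python
def run_chain(source_text, first_uri, run_one, max_steps=10):
    """Feed each transform's output to the next until no next URI is given.

    run_one(text, uri) must return (result_text, next_uri_or_None).
    """
    text, uri, steps = source_text, first_uri, 0
    while uri is not None:
        if steps >= max_steps:
            # Guard against chains that loop back on themselves
            raise RuntimeError('transform chain exceeded %d steps' % max_steps)
        text, uri = run_one(text, uri)
        steps += 1
    return text

# Stand-ins for real transforms: each maps a URI to (function, next URI)
FAKE_CHAIN = {
    'urn:x-bogus:first.xslt':
        (lambda text: text.upper(), 'urn:x-bogus:second.xslt'),
    'urn:x-bogus:second.xslt':
        (lambda text: '<wrapped>' + text + '</wrapped>', None),
}

def fake_run_one(text, uri):
    transform, next_uri = FAKE_CHAIN[uri]
    return transform(text), next_uri

final = run_chain('<doc/>', 'urn:x-bogus:first.xslt', fake_run_one)
```

The max_steps guard is a design choice of this sketch: a stylesheet that chains to itself would otherwise loop forever.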
6.9 XSLT patterns
XSLT defines a pattern language based on XPath which is used to
declare rules for matching patterns in the XML source against which to
fire XSLT templates. The pattern implementation that 4Suite's XSLT library
uses is also exposed as a library of its own. XSLT patterns are useful
when your task is not so much to compute arbitrary information from a
given node but, rather, to choose quickly from a collection of nodes the
ones that meet some basic rules. This might seem a subtle difference. The
following example might help illustrate it.
-
XPath task: extract the class attribute from all the child
elements of the context node
-
XSLT pattern task: given a list of nodes, sort them into piles
of those that have a class attribute and those that have a title
child
The main API for pattern processing in 4Suite is
Ft.Xml.Xslt.PatternList. The following is a code
snippet that takes a node and returns a list of patterns it
matches.
from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader
# first pattern matches elements with a class attribute
# the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]
# Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})
DOC = """
<spam>
<e1 class="1"/>
<e2><title>A</title></e2>
<e3 class="2"><title>B</title></e3>
</spam>"""
doc = NonvalidatingReader.parseString(DOC, "file:foo.xml")
for node in doc.documentElement.childNodes:
    # Don't forget that the white space text nodes before and after
    # e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)
The PatternList initializer takes a list of
strings, which it conveniently converts to a list of compiled pattern
objects. Such objects have a match method that
returns a boolean value, but this example does not use that method
directly. The PatternList initializer also takes a
dictionary that makes up the namespace mapping. In this example, we use no
namespaces, so the dictionary is empty. The
lookup method is applied to a selection of the
children of the spam element (all the
nodes whose name starts with "e", which happens to be all the element
nodes). The output of the example follows:
[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]
The output is a list of the representations of the pattern objects
that matched each node. Notice how the axis abbreviations have been
expanded in the pattern object representation.
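To show the "sort nodes into piles" idea without any 4Suite dependency, the following sketch repeats the classification with the standard library's ElementTree. Plain Python predicates stand in for compiled patterns here, so this is only an analogy for what PatternList does; RULES and lookup are hypothetical names.

```python
import xml.etree.ElementTree as ET

DOC = """
<spam>
<e1 class="1"/>
<e2><title>A</title></e2>
<e3 class="2"><title>B</title></e3>
</spam>"""

# Each rule pairs a label (the pattern it imitates) with a predicate
RULES = [
    ('*[@class]', lambda el: 'class' in el.attrib),
    ('*[title]', lambda el: el.find('title') is not None),
]

def lookup(el):
    """Return the labels of all rules that the element satisfies."""
    return [label for label, test in RULES if test(el)]

root = ET.fromstring(DOC)
# Iterating over an element yields its child elements only (no text nodes)
matches = dict((el.tag, lookup(el)) for el in root)
```

As in the PatternList output above, e1 matches only the first rule, e2 only the second, and e3 both.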
7 XPath and XSLT extensions
Sometimes the built-in facilities of XPath and XSLT aren't quite
enough to meet your processing needs. Luckily it's easy to extend the
function of these libraries using user extension functions and elements,
which are written in Python.
7.1 Extension functions (XPath and XSLT)
To define your own extension functions for XPath and XSLT, you write a
corresponding Python function in a module, and provide a mapping from the
desired XPath function names to Python function objects (or any callables). Let's start with a simple example. The following is a complete module that defines a single XPath function, ord(s), which takes a string and returns the Unicode code point number for the first character in that string.
# ord.py
from Ft.Xml.XPath import Conversions
def Ord(context, s):
    '''
    Available in XPath as ord(), as defined by the ExtFunctions mapping below
    Takes an object, which is coerced to string
    Returns the Unicode code point number for the first character in that
    string, or -1 if it's an empty string
    '''
    s = Conversions.StringValue(s)  # Coerce the passed object to string
    if s:
        return ord(s[0])
    else:
        return -1
ExtFunctions = {
    (u'urn:x-4suite:x', u'ord'): Ord,
}
As this simple example illustrates, the
basic way to map XPath function names to Python function objects is a
dictionary named "ExtFunctions", global to the module in which the
extension function is defined. The XPath/XSLT extension names are
expressed as a Python tuple of two Unicode objects. If you're familiar
with XPath, this is just a Python representation of an expanded name.
The first item in the expanded name tuple is
the namespace URI for the function, and the second is the local name.
The namespace URI cannot be an empty string.
You have to tell the processor explicitly to load your extension modules. There are several ways to do so.
-
From Python code, you can register them in a context object used for XPath processing
by using the optional
extModuleList argument to pass in a list of module
objects.
-
You can also register particular functions, rather than a
complete module, in an XPath context object using the
optional extFunctionMap argument. It takes
a mapping dictionary similar to the ExtFunctions dictionary shown in the sample module above.
-
If you are using the XSLT processor, you can register extension modules on a processor object using
the registerExtensionModules() method.
-
When using the XSLT processor, you can also register individual extension functions on a processor object using the
registerExtensionFunction() method. It takes
the namespace and
localName for the extension function, and the callable object that implements it.
-
In some cases the user can list extension modules using
the environment variable "EXTMODULES". "EXTMODULES" is a
colon-separated list of Python module names. This works for the 4xslt
command line and for Ft.Xml.XPath.Evaluate. For
other APIs, use one of the other methods, which can easily be
extended to read the "EXTMODULES" variable. In general, the other methods for registering extensions are preferable.
Note that extension modules will automatically be
searched for XSLT extension elements as well as functions.
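Reading the "EXTMODULES" variable yourself takes only a few lines. The helper below is a sketch with a hypothetical name, extension_module_names. It relies only on the colon-separated format described above and accepts an optional mapping so it can be tried without touching the real environment.

```python
import os

def extension_module_names(environ=None):
    """Split the colon-separated EXTMODULES value into module names."""
    if environ is None:
        environ = os.environ
    raw = environ.get('EXTMODULES', '')
    # Skip empty entries such as those produced by a trailing colon
    return [name for name in raw.split(':') if name]

names = extension_module_names({'EXTMODULES': 'demo:ord'})
```

The resulting list of names could then be handed to a processor's registerExtensionModules() method.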
The following is a longer example, a module that implements two functions. One returns
the current time and the other creates a hash of the context node name:
# demo.py
import time
from Ft.Xml.XPath import Conversions
def GetCurrentTime(context):
    '''available in XPath as get-current-time()'''
    return time.asctime(time.localtime())
def HashContextName(context, maxkey):
    '''
    available in XPath as hash-context-name(maxkey),
    where maxkey is an object converted to number
    '''
    # It is a good idea to use the appropriate core function to coerce
    # arguments to the expected type
    maxkey = Conversions.NumberValue(maxkey)
    # Sum the code points of the characters in the context node's name
    key = sum([ord(c) for c in context.node.nodeName])
    return key % maxkey
ExtFunctions = {
    ('urn:x-4suite:x', 'get-current-time'): GetCurrentTime,
    ('urn:x-4suite:x', 'hash-context-name'): HashContextName
}
You can use this in plain XPath as follows:
import demo
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader
DOC = "<spam xmlns='http://spam.com'>eggs</spam>"
# Map a prefix to the namespace used in demo.ExtFunctions so that the
# expression can call the function by its qualified name
ctx = Context(None, extFunctionMap=demo.ExtFunctions,
              processorNss={"a": "http://spam.com", "x": "urn:x-4suite:x"})
expr = Compile("x:get-current-time()")
doc = NonvalidatingReader.parseString(DOC, "http://spam.com/base")
print Evaluate(expr, doc, ctx)
Notice that you might choose to use None for the extension function
namespaces. If so, you don't need to specify the processorNss context
attribute, but you might want to watch out for clashes with other
extension function names, including the built-in library. Again, if you
plan to use an extension function from within XSLT, its namespace URI must
not be None.
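As a rough model of why the prefix mapping matters, the sketch below resolves a function name as written in an expression to an expanded name and looks it up in a function mapping. This is plain Python for illustration, not the library's actual lookup code; resolve_function and the sample mappings are hypothetical.

```python
def resolve_function(qname, prefix_map, function_map):
    """Look up a function by the name used in an XPath expression."""
    if ':' in qname:
        prefix, local = qname.split(':', 1)
        namespace = prefix_map[prefix]  # KeyError if the prefix is undeclared
    else:
        namespace, local = None, qname  # unprefixed: no namespace
    return function_map[(namespace, local)]

def get_current_time(context):
    return 'now'

FUNCTIONS = {
    (u'urn:x-4suite:x', u'get-current-time'): get_current_time,
    (None, u'spam'): lambda context: 'spam',
}
PREFIXES = {u'x': u'urn:x-4suite:x'}

found = resolve_function(u'x:get-current-time', PREFIXES, FUNCTIONS)
```

In this model, a function registered under a namespace is only reachable through a declared prefix, while one registered with None is reachable by its bare name.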
You can use this in XSLT just as easily:
# useextfunc.py
TRANSFORM = """<?xml version="1.0"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:s="urn:x-4suite:x"
version="1.0">
<xsl:template match="/">
<xsl:value-of select="s:get-current-time()"/>
</xsl:template>
</xsl:stylesheet>
"""
SOURCE = """<dummy/>"""
from Ft.Xml.Xslt import Processor
processor = Processor.Processor()
# Register the extension module with registerExtensionModules (the third method listed above)
processor.registerExtensionModules(['demo'])
from Ft.Xml import InputSource
transform = InputSource.DefaultFactory.fromString(TRANSFORM, "http://foo.com")
source = InputSource.DefaultFactory.fromString(SOURCE, "http://foo.com")
processor.appendStylesheet(transform)
result = processor.run(source)
print result
For good examples of modules with extension functions, see the source code for the modules
Ft.Xml.XPath.BuiltInExtFunctions,
Ft.Xml.Xslt.BuiltInExtFunctions and the modules in
Ft.Xml.Xslt.Exslt. The latter are
especially good examples given their diversity and detailed specifications
at exslt.org.
7.2 Extension elements (XSLT)
To define your own extension elements, define a class derived from
Ft.Xml.Xslt.XsltElement. The module in which it is
defined should have a global dictionary named "ExtElements" mapping element
expanded names to element class objects.
Finally, modules containing any extension elements used must be
indicated as such to the processor in one of several ways.
-
You can register all extension functions and elements in a module by using a processor object's
registerExtensionModules() method.
-
You can also register individual extension elements on a processor object using the
registerExtensionElement() method. It takes
the namespace and
localName for the extension element, and the class that implements it.
-
In some cases the user can list extension modules using
the environment variable "EXTMODULES". "EXTMODULES" is a
colon-separated list of Python module names. This works for the 4xslt
command line and for Ft.Xml.XPath.Evaluate. For
other APIs, use one of the other two methods, which can easily be
extended to read the "EXTMODULES" variable. In general, the other methods for registering extensions are preferable.
Note that extension modules will automatically be
searched for XPath extension functions as well as extension
elements.
7.3 Extension element API
There are several aspects of the extension element API worth
discussing in more detail.
The class-level "content" variable specifies a content model to be
enforced by the XSLT processor. If the element is used in a way that
doesn't meet the specified content model, the user will get an error
message. The content model is a structure that uses certain special
classes, including:
-
ContentInfo.Empty - matches no content at all (empty
element)
-
ContentInfo.Text - matches plain text content
-
ContentInfo.Seq - matches the given sequence of
sub-patterns
-
ContentInfo.Alt - matches one of the given choice of
sub-patterns
-
ContentInfo.Rep - matches 0 or more repeated instances of the
given sub-pattern
-
ContentInfo.Rep1 - matches 1 or more repeated instances of the
given sub-pattern
-
ContentInfo.Opt - matches zero or one of the given
sub-pattern
-
ContentInfo.ResultElements - matches elements not in the XSL
namespace
-
ContentInfo.Instructions - matches any sequence of XSLT elements
categorized as instructions in the spec
-
ContentInfo.Template - matches an XSLT template body according
to the spec
-
ContentInfo.TopLevelElements - matches any sequence of XSLT
elements categorized as top level in the spec
-
ContentInfo.QName - matches a particular element by giving its
namespace and node name (the prefix in the node name is only used for
documentation and error messages)
So, for instance, the xsl:choose element would be described
as
content = ContentInfo.Seq(
ContentInfo.Rep1(ContentInfo.QName(XSL_NAMESPACE, 'xsl:when')),
ContentInfo.Opt(ContentInfo.QName(XSL_NAMESPACE, 'xsl:otherwise')),
)
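To get a feel for what these combinators mean, the following sketch models them in plain Python by translating each one to a regular expression over the sequence of child element names. It mirrors the semantics only and is not how 4Suite implements ContentInfo; the lowercase helper names are hypothetical.

```python
import re

def qname(name):
    # Each child element name is followed by one space in the encoded text
    return '(?:%s )' % re.escape(name)

def seq(*parts):
    return '(?:%s)' % ''.join(parts)

def alt(*parts):
    return '(?:%s)' % '|'.join(parts)

def rep(part):
    return '(?:%s)*' % part   # zero or more

def rep1(part):
    return '(?:%s)+' % part   # one or more

def opt(part):
    return '(?:%s)?' % part   # zero or one

def matches(model, child_names):
    """True if the list of child names satisfies the content model."""
    text = ''.join(name + ' ' for name in child_names)
    return re.match(model + r'\Z', text) is not None

# The xsl:choose model: one or more xsl:when, then an optional xsl:otherwise
CHOOSE = seq(rep1(qname('xsl:when')), opt(qname('xsl:otherwise')))
```

For example, matches(CHOOSE, ['xsl:when', 'xsl:otherwise']) holds, while a lone xsl:otherwise or an empty element does not satisfy the model.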
The class-level "legalAttrs" variable specifies the attributes
allowed or required on the element. It is a Python dictionary mapping
attribute name to its specification. The specification is an instance of a class
chosen according to the type of attribute.
The following are the supported attribute classes. The parameters
specified are for the initializer. Note that most general patterns have a
plain variant and an attribute value template (AVT) variant:
-
AttributeInfo.String - any XPath string
-
AttributeInfo.StringAvt - an AVT yielding any string
-
AttributeInfo.Char - any XPath string of length 1
-
AttributeInfo.CharAvt - AVT version of Char
-
AttributeInfo.Choice - a string which must be one of a number of
given values. The values are given by a list of strings, which is the
first parameter
-
AttributeInfo.ChoiceAvt - AVT version of Choice
-
AttributeInfo.YesNo - Abbreviation for AttributeInfo.Choice(['yes', 'no'])
-
AttributeInfo.YesNoAvt - AVT version of YesNo
-
AttributeInfo.Number - any XPath number
-
AttributeInfo.NumberAvt - AVT version of Number
-
AttributeInfo.UriReference - XPath string that is syntactically
a URI reference
-
AttributeInfo.UriReferenceAvt - AVT version of
UriReference
-
AttributeInfo.Id - XPath string that is syntactically an XML
ID
-
AttributeInfo.IdAvt - AVT version of Id
-
AttributeInfo.QName - XPath string that is syntactically an XML
namespaces qualified name
-
AttributeInfo.QNameAvt - AVT version of QName
-
AttributeInfo.NCName - XPath string that is syntactically an XML
namespaces "no colon" name
-
AttributeInfo.NCNameAvt - AVT version of NCName
-
AttributeInfo.Prefix - Same as NCName
-
AttributeInfo.PrefixAvt - Same as NCNameAvt
-
AttributeInfo.NMToken - XPath string that is syntactically an
XML Name token
-
AttributeInfo.NMTokenAvt - AVT version of NMToken
-
AttributeInfo.QNameButNotNCName - A QName that contains a
colon
-
AttributeInfo.QNameButNotNCNameAvt - AVT version of
QNameButNotNCName
-
AttributeInfo.Token - XPath string that is syntactically an
XPath name test (i.e. "foo", "ns:foo", "ns:*" or
"*")
-
AttributeInfo.TokenAvt - AVT version of Token
-
AttributeInfo.Expression - XPath string that is syntactically an
XPath expression
-
AttributeInfo.ExpressionAvt - AVT version of Expression
-
AttributeInfo.StringExpression - XPath string that is
syntactically an XPath expression, which would be expected to return a
string value
-
AttributeInfo.StringExpressionAvt - AVT version of
StringExpression
-
AttributeInfo.NodeSetExpression - XPath string that is
syntactically an XPath expression, which would be expected to return a
node set value
-
AttributeInfo.NodeSetExpressionAvt - AVT version of
NodeSetExpression
-
AttributeInfo.NumberExpression - XPath string that is
syntactically an XPath expression, which would be expected to return a
number value
-
AttributeInfo.NumberExpressionAvt - AVT version of
NumberExpression
-
AttributeInfo.BooleanExpression - XPath string that is
syntactically an XPath expression, which would be expected to return a
boolean value
-
AttributeInfo.BooleanExpressionAvt - AVT version of
BooleanExpression
-
AttributeInfo.Pattern - XPath string that is syntactically an
XSLT pattern
-
AttributeInfo.PatternAvt - AVT version of Pattern
-
AttributeInfo.Tokens - XPath string that is syntactically a
space-delimited series of tokens
-
AttributeInfo.TokensAvt - AVT version of Tokens
-
AttributeInfo.QNames - XPath string that is syntactically a
space-delimited series of QNames
-
AttributeInfo.QNamesAvt - AVT version of QNames
-
AttributeInfo.Prefixes - XPath string that is syntactically a
space-delimited series of NCNames
-
AttributeInfo.PrefixesAvt - AVT version of Prefixes
All of these classes take the following optional keyword
parameters:
Some examples from the XSLT spec:
xsl:output
content = ContentInfo.Empty
legalAttrs = {
'method' : AttributeInfo.QName(),
'version' : AttributeInfo.NMToken(),
'encoding' : AttributeInfo.String(),
'omit-xml-declaration' : AttributeInfo.YesNo(),
'standalone' : AttributeInfo.YesNo(),
'doctype-public' : AttributeInfo.String(),
'doctype-system' : AttributeInfo.String(),
'cdata-section-elements' : AttributeInfo.QNames(),
'indent' : AttributeInfo.YesNo(),
'media-type' : AttributeInfo.String(),
}
xsl:sort
content = ContentInfo.Empty
legalAttrs = {
'select' : AttributeInfo.StringExpression(default='.'),
'lang' : AttributeInfo.NMTokenAvt(),
# We don't support any additional data-types, hence no
# AttributeInfo.QNameButNotNCName()
'data-type' : AttributeInfo.ChoiceAvt(['text', 'number'],
default='text'),
'order' : AttributeInfo.ChoiceAvt(['ascending', 'descending'],
default='ascending'),
'case-order' : AttributeInfo.ChoiceAvt(['upper-first', 'lower-first']),
}
xsl:number
content = ContentInfo.Empty
legalAttrs = {
'level' : AttributeInfo.Choice(['single', 'multiple', 'any'],
default='single'),
'count' : AttributeInfo.Pattern(),
'from' : AttributeInfo.Pattern(),
'value' : AttributeInfo.Expression(),
'format' : AttributeInfo.StringAvt(default='1'),
'lang' : AttributeInfo.NMToken(),
'letter-value' : AttributeInfo.ChoiceAvt(['alphabetic', 'traditional']),
'grouping-separator' : AttributeInfo.CharAvt(),
'grouping-size' : AttributeInfo.NumberAvt(default=0),
}
Of course, it's always a good idea to use descriptions, which the
above do not.
For good examples of modules with extension elements, see the source code for the modules
Ft.Xml.Xslt.BuiltInExtElements and Ft.Xml.Xslt.Exslt.Common. The various
modules in Ft.Xml.Xslt.Exslt are quite diverse and make good
examples, especially given their detailed specifications at exslt.org.
7.3.1 Controlling output from XSLT extensions
The most common special need for XSLT extensions is to generate
XSLT output. For extension elements this is easy enough to do using the
API on the processor instance that is passed to the instantiate() method
of extension element classes. For example:
from Ft.Xml.Xslt import XsltElement
class SpamElement(XsltElement):
    legalAttrs = {}
    def instantiate(self, context, processor):
        # Write a title element with text content to the current output
        processor.output().startElement('title')
        processor.output().text('Life of Brian')
        processor.output().endElement('title')
        return (context,)
Extension functions are not passed a processor instance directly,
but context objects hold a reference to the processor in effect, so the
following example works:
def Spam(context):
    # The context holds a reference to the processor in effect
    context.processor.output().startElement('title')
    context.processor.output().text('Life of Brian')
    context.processor.output().endElement('title')
    return
However, it is probably better design to reserve such side effects
as output for extension elements rather than functions.
In the above examples, the elements and text that are output just use the
current output parameters. In order to change output parameters or
change the output stream, you can stack a new output handler:
import cStringIO
from Ft.Xml import EMPTY_NAMESPACE
stream = cStringIO.StringIO()
# Clone the current output parameters
op = processor.writers[-1]._outputParams.clone()
# Force the XML output method and omit the XML declaration
# Output method is a qualified name, so must flag null ns
# to use standard xml method
op.method = (EMPTY_NAMESPACE, 'xml')
op.omitXmlDeclaration = "yes"
# Push the new handler to the top of the writer stack
processor.addHandler(op, stream)
processor.output().startElement('title')
processor.output().text('Life of Brian')
processor.output().endElement('title')
# Pop back to the previous handler; stream.getvalue()
# now contains the new output
processor.removeHandler()
7.3.2 Creating result tree fragments
Another common need is to treat the body of an extension element
as a template so that something can be done with the RTF that results
from it. The following example demonstrates this:
try:
    # Set the output to an RTF writer, which will create an RTF for us
    processor.pushResultTree(self.baseUri)
    # The template is manifested as children of the extension element
    # node. Instantiate each in turn
    for child in self.children:
        child.instantiate(context, processor)
# You want to be sure you re-balance the stack even in case of error
finally:
    # Retrieve the resulting RTF
    result_rtf = processor.popResult()
7.3.3 Communicating with the external code that invokes XSLT
You can set and communicate state information with external code
by using the processor.extensionParams attribute. For example, the
following sets a time stamp of precisely when the extension was
instantiated, which can later be retrieved from the processor after the
XSLT process, or even by later extensions. In a similar way, state can
be set up by calling functions and retrieved by extensions.
# Extension parameters have fully qualified names, so you must come up
# with a namespace to set them
processor.extensionParams[(SPAM_NAMESPACE, 'tstamp')] = time.time()
8 Streaming XML output
MarkupWriter is a streaming
API for generating XML. The
Ft.Xml.MarkupWriter class is specialized for creating
XML documents from scratch. Documents written with
MarkupWriter are written to the output (standard
output or another file-like object) as you build them, so if you need to
process the document in memory, you may need another tool such as a DOM-like
tool (e.g. Domlette, Amara, etc).
4Suite partitions XML serializers into two
categories: writers and printers.
-
A writer is a module that exposes a broad public
API for building output incrementally.
-
A printer is a module that simply takes a DOM
and creates output from it as a whole, within one
API invocation.
MarkupWriter is the primary example
of this writer category of XML serializers.
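The distinction is not specific to 4Suite. As an analogy only, the sketch below uses the Python 3 standard library to produce the same element both ways: XMLGenerator plays the role of a writer and ElementTree's tostring plays the role of a printer. Neither class is part of 4Suite, and MarkupWriter's own API is shown in the sections that follow.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator

# Writer style: output is emitted step by step as each call is made
stream = io.StringIO()
writer = XMLGenerator(stream, encoding='utf-8')
writer.startDocument()
writer.startElement('title', {})
writer.characters('Life of Brian')
writer.endElement('title')
writer.endDocument()
streamed = stream.getvalue()

# Printer style: build the whole tree first, then serialize it in one call
root = ET.Element('title')
root.text = 'Life of Brian'
printed = ET.tostring(root).decode('ascii')
```

The writer never holds the whole document in memory, which is why a streaming API suits large output, while the printer needs a complete tree before it can produce anything.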
The following example uses this class for generating a simple
XML Software Autoupdate (XSA) file. XSA is an
XML data format for listing and describing software
packages.
from Ft.Xml import MarkupWriter
# Set the output doc type details (required by XSA)
SYSID = u"http://www.garshol.priv.no/download/xsa/xsa.dtd"
PUBID = u"-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
writer = MarkupWriter(indent=u"yes", doctypeSystem=SYSID,
doctypePublic=PUBID)
writer.startDocument()
writer.startElement(u'xsa')
writer.startElement(u'vendor')
# Element with simple text (#PCDATA) content
writer.simpleElement(u'name', content=u'Centigrade systems')
writer.simpleElement(u'email', content=u"info@centigrade.bogus")
writer.endElement(u'vendor')
# Element with an attribute
writer.startElement(u'product', attributes={u'id': u"100\u00B0"})
writer.simpleElement(u'name', content=u"100\u00B0 Server")
writer.simpleElement(u'version', content=u"1.0")
writer.startElement(u'last-release')
writer.text(u"20030401")
writer.endElement(u'last-release')
# Empty element
writer.simpleElement(u'changes')
writer.endElement(u'product')
writer.endElement(u'xsa')
writer.endDocument()
This is the output we get from the code above:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsa PUBLIC "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML" "http://www.garshol.priv.no/download/xsa/xsa.dtd">
<xsa>
<vendor>
<name>Centigrade systems</name>
<email>info@centigrade.bogus</email>
</vendor>
<product id="100°">
<name>100° Server</name>
<version>1.0</version>
<last-release>20030401</last-release>
<changes/>
</product>
</xsa>
The above example illustrates some of the basics of using the
MarkupWriter class. The following sections describe
both the essential and the advanced features of this class. In many cases,
there is more than one way to output a given document
section.
8.1 Starting with MarkupWriter
After importing the MarkupWriter class, you
have to create a MarkupWriter object instance and
then start the new Document. (See below for output options of
MarkupWriter.) Remember that you are working with a
streaming API. You must decide what features you want
your output to have before you start to write that output.
>>> from Ft.Xml import MarkupWriter
>>> writer = MarkupWriter()
>>> writer.startDocument()
You are now ready to add data to the new document.
Important
Make sure that all of your data (element names, attributes,
content, etc) are Python unicode objects.
8.2 How to insert elements
There are two ways to add new elements as children of other document
or element nodes.
-
When you want to add a new element that will itself have child
elements, you can use the
startElement/endElement
method combination to signal the beginning and the ending of an
element, respectively.
writer.startElement(u'xsa')
# other document content can be output here
writer.endElement(u'xsa')
-
Alternatively, you can use the
simpleElement method, which is a shortcut for
the
startElement/endElement
combination and produces an element with no content or with text
content (if you specify the content parameter).
writer.simpleElement(u'xsa')
8.3 How to insert attributes
There are two ways to add attributes to elements:
-
First, you can use the attributes
parameter of the startElement method. This
parameter is a dictionary which maps each attribute name to the value
of that attribute. If an attribute's name is in a namespace, then you
must specify the name as a Python tuple, with the attribute's QName as
the first member of the tuple, and the namespace URI as the second
member of the tuple. For an example of this advanced syntax, see “Writing XHTML with MarkupWriter”.
writer.startElement(u'product', attributes={u'id': u"100\u00B0"})
-
Alternatively, you can use a distinct
attribute method with two parameters: the
attribute's name and the attribute's value. As with the dictionary
approach above, if the attribute's name is in a namespace, then the
whole name should be a Python tuple.
writer.startElement(u'product')
writer.attribute(u'id', u"100\u00B0")
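The tuple form of a namespaced attribute name is easy to get wrong by hand, so it can help to build the dictionary in one place. The following is a minimal sketch; namespaced_attributes is a hypothetical helper, not a 4Suite function, and it only prepares the mapping described above without calling the writer.

```python
def namespaced_attributes(attrs, namespace_uri):
    """Map each QName to a (QName, namespace URI) tuple key.

    The tuple order follows the description above: the attribute's
    QName first, the namespace URI second.
    """
    return dict(((qname, namespace_uri), value)
                for qname, value in attrs.items())


RDFNS = u"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
attributes = namespaced_attributes(
    {u"rdf:about": u"http://media.example.com/audio/guide.ra"}, RDFNS)

# With 4Suite installed, the mapping would be passed along like this:
# writer.startElement(u"rdf:Description", RDFNS, attributes=attributes)
```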
8.4 How to insert text nodes
Similarly, there are two ways to add text nodes to elements.
-
First, the simpleElement method takes a
content parameter, which can be used to create
a single text node child of the node with the specified
name.
writer.simpleElement(u'name', content=u'Centigrade systems')
-
Alternatively, instances of the
MarkupWriter class, such as
writer, have a text method
that inserts a single text node as the next child
of the element which was last started with the
startElement method and which has not yet
been closed with the endElement
method.
writer.startElement(u'product')
writer.text(u'Centigrade systems')
writer.endElement(u'product')
8.5 How to insert a complete chunk
MarkupWriter also allows you to insert
well-formed XML entities as complete chunks in the
output. This is a very convenient way to emit boilerplate
XML without breaking it down into all the separate
element/attribute/content bits. For example, the lines:
writer.simpleElement(u'name', content=u"100\u00B0 Server")
writer.simpleElement(u'version', content=u"1.0")
writer.simpleElement(u'last-release', content=u"20030401")
could instead be written:
writer.xmlFragment("""
<name>100° Server</name>
<version>1.0</version>
<last-release>20030401</last-release>""")
Important
The parameter of xmlFragment is a string,
not a unicode object.
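Because the fragment is written as a complete chunk, it may be worth checking that it is well-formed before handing it to xmlFragment. The sketch below uses only the standard library's expat bindings; is_well_formed_fragment is a hypothetical helper, and wrapping the fragment in a dummy root element is a rough check that assumes the fragment carries no XML declaration or DOCTYPE of its own.

```python
import xml.parsers.expat


def is_well_formed_fragment(fragment):
    """Return True if the byte string parses as element content.

    The fragment is wrapped in a dummy root element so that several
    sibling elements are accepted.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(b"<wrapper>" + fragment + b"</wrapper>", True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Encode the text to a byte string first, since xmlFragment expects a
# string rather than a unicode object.
fragment = u"""
<name>100\u00B0 Server</name>
<version>1.0</version>
<last-release>20030401</last-release>""".encode("utf-8")

fragment_ok = is_well_formed_fragment(fragment)
```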
8.6 How to insert processing instructions and comments
The API provides the comment and
processingInstruction methods for inserting
comments and processing instructions, respectively. The comment
method takes a unicode string, which is the intended value of the comment.
The processingInstruction method takes two
unicode strings. The first is the name of the processing instruction, and
the second is the value of the processing instruction. For example, the
following code:
writer.comment(u"This is a processing instruction")
writer.processingInstruction(u'xml-stylesheet', u'type="text/xsl" href="akara.xsl"')
produces
the following output:
<!--This is a processing instruction-->
<?xml-stylesheet type="text/xsl" href="akara.xsl"?>
8.7 Using namespaces
When you create a new element or an attribute, you can place it in a
namespace. The following program puts two elements and an attribute in the RDF namespace:
from Ft.Xml import MarkupWriter
writer = MarkupWriter(indent=u'yes')
writer.startDocument()
RDFNS = u"http://www.w3.org/1999/02/22-rdf-syntax-ns#"
writer.startElement(u"rdf:RDF", RDFNS)
writer.startElement(u"rdf:Description", RDFNS,
attributes={(u'rdf:about', RDFNS): u'http://media.example.com/audio/guide.ra'})
writer.endElement(u'rdf:Description', RDFNS)
writer.endElement(u'rdf:RDF', RDFNS)
And this is the output:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://media.example.com/audio/guide.ra"/>
</rdf:RDF>
8.8 Setting up the output
In the above example, you can see how parameters that control the
output are passed into the MarkupWriter
initializer, including document type info and whether to indent (pretty
print).
You can pass any of the usual controls for XSLT output into the
initializer this way.
- stream
-
By default MarkupWriter sends its
output to sys.stdout, but you can substitute any
file-like object by passing in an initializer parameter. This stream
parameter should be the first argument to the
MarkupWriter constructor. For example:
output_file = file('output.xml', 'w')
writer = MarkupWriter(output_file, indent=u"yes")
- indent
-
The indent named parameter controls whether or not the output
will have whitespace inserted to indent tags in the output. The
default is "no".
- doctypeSystem,
doctypePublic
-
These two named parameters control the system and public
identifiers that will be included in the output.
- omitXmlDeclaration=u"yes"
-
This named parameter can be used to suppress output of the
XML declaration. The default is "no".
- encoding
-
This named parameter controls the character encoding to use.
(The default is UTF-8.) The writer will automatically use character
entities where necessary.
- standalone
-
Set this named parameter to "yes" to set standalone in the
XML declaration.
- mediaType
-
This parameter sets the media type of the output. You will
probably never need this.
- cdataSectionElements
-
This named parameter is a list of element names whose output
will be wrapped in a CDATA section. This can provide for friendlier
output in some cases.
The XSLT spec also defines a method parameter to
choose between XML, HTML or plain
text output rules, but for MarkupWriter at the
moment you should stick to XML. The result of changing
the method is undefined. We'll probably relax this restriction in later
releases.
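Note that the on/off controls above take the XSLT-style strings "yes" and "no" rather than Python booleans. If your program tracks these settings as booleans, a small adapter can build the keyword arguments. This is a sketch under assumptions: yes_no and writer_options are hypothetical helpers, only three of the parameters listed above are covered, and passing the encoding as a unicode object is assumed to be acceptable.

```python
def yes_no(flag):
    """Translate a Python boolean into the "yes"/"no" strings used above."""
    return u"yes" if flag else u"no"


def writer_options(indent=False, omit_xml_declaration=False,
                   encoding=u"UTF-8"):
    """Build keyword arguments for the MarkupWriter initializer."""
    return {
        "indent": yes_no(indent),
        "omitXmlDeclaration": yes_no(omit_xml_declaration),
        "encoding": encoding,
    }


options = writer_options(indent=True, encoding=u"ISO-8859-1")

# With 4Suite installed:
# writer = MarkupWriter(output_file, **options)
```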
8.9 More examples
8.9.1 Writing XHTML with MarkupWriter
Uche Ogbuji provides this
example, which writes a simple XHTML file, in his blog:
from Ft.Xml.MarkupWriter import MarkupWriter
from xml.dom import XHTML_NAMESPACE, XML_NAMESPACE
XHTML_NS = unicode(XHTML_NAMESPACE)
XML_NS = unicode(XML_NAMESPACE)
XHTML11_SYSID = u"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
XHTML11_PUBID = u"-//W3C//DTD XHTML 1.1//EN"
writer = MarkupWriter(indent=u"yes", doctypeSystem=XHTML11_SYSID,
doctypePublic=XHTML11_PUBID)
writer.startDocument()
writer.startElement(u'html', XHTML_NS, attributes={(u'xml:lang', XML_NS): u'en'})
writer.startElement(u'head', XHTML_NS)
writer.simpleElement(u'title', XHTML_NS, content=u'Virtual Library')
writer.endElement(u'head', XHTML_NS)
writer.startElement(u'body', XHTML_NS)
writer.startElement(u'p', XHTML_NS)
writer.text(u'Moved to ')
writer.simpleElement(u'a', XHTML_NS,
attributes={u'href': u'http://vlib.org/'},
content=u'vlib.org')
writer.text(u'.')
writer.endElement(u'p', XHTML_NS)
writer.endElement(u'body', XHTML_NS)
writer.endElement(u'html', XHTML_NS)
writer.endDocument()
This example results in the following XHTML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
</body>
</html>
8.9.2 Writing a directory listing as an XML document
This recursive example builds an XML document
with the information of a directory listing. The example has two
functions. The first initializes the writer. The second walks through
the filesystem and outputs information about the filesystem as
XML. The complete dirlist.py
program can be found on Uche Ogbuji's blog.
import os
from Ft.Xml import MarkupWriter

def genXML(dir, out):
    print "Processing %s" % dir
    writer = MarkupWriter(out, indent=u"yes")
    writer.startDocument()
    # Start the recursion at depth 0
    recurse_dir(dir, writer, 0)
    writer.endDocument()

def recurse_dir(path, writer, d):
    # d tracks the current depth in the directory tree
    d = d + 1
    for cdir, subdirs, files in os.walk(path):
        writer.startElement(u'directory', attributes={u'name': unicode(cdir)})
        for f in files:
            writer.simpleElement(u'file', attributes={u'name': unicode(f)})
        for subdir in subdirs:
            recurse_dir(os.path.join(cdir, subdir), writer, d)
        writer.endElement(u'directory')
        # Use only the first result of each walk; the recursion handles subdirectories
        break
8.9.3 Building a bot
As a more complex example, the Emeka
IRC bot uses
MarkupWriter to build an RDF document that uses
namespaces. The following excerpt writes Dublin Core subject elements:
DCE_NS = u'http://purl.org/dc/elements/1.1/'
for nada, category in item['categories']:
    if len(category.split(' ')) > 0:
        for category in category.split(' '):
            writer.startElement(u"dc:subject", DCE_NS)
            writer.text(category)
            writer.endElement(u"dc:subject", DCE_NS)
    else:
        writer.startElement(u"dc:subject", DCE_NS)
        writer.text(category)
        writer.endElement(u"dc:subject", DCE_NS)
9 Validation using RELAX NG
4Suite has RELAX NG support based on a bundling of Eric van der
Vlist's XVIF
implementation.
First of all, you can use the 4xml command line for RELAX NG
validation with the --rng flag. For instance, take the following RELAX NG
schema (rng-tut3.rng):
<element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
<zeroOrMore>
<element name="card">
<element name="name">
<text/>
</element>
<element name="email">
<text/>
</element>
</element>
</zeroOrMore>
</element>
The following document (rng-tut1.xml) is valid against the
schema:
<addressBook>
<card>
<name>John Smith</name>
<email>js@example.com</email>
</card>
<card>
<name>Fred Bloggs</name>
<email>fb@example.net</email>
</card>
</addressBook>
You can check this as follows:
$ 4xml --rng=rng-tut3.rng rng-tut1.xml
<?xml version="1.0" encoding="utf-8"?>
<addressBook>
<card>
<name>John Smith</name>
<email>js@example.com</email>
</card>
<card>
<name>Fred Bloggs</name>
<email>fb@example.net</email>
</card>
</addressBook>
Since it passes the schema, 4xml continues normal operation,
re-serializing the XML back to stdout.
The following document (rng-tut7.xml) is not valid against the
schema:
<addressBook>
<card>
<givenName>John</givenName>
<familyName>Smith</familyName>
<email>js@example.com</email>
</card>
<card>
<name>Fred Bloggs</name>
<email>fb@example.net</email>
</card>
</addressBook>
You can check this as follows:
$ 4xml --rng=rng-tut3.rng rng-tut7.xml
Traceback (most recent call last):
File "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xml", line 5, in ?
XmlCommandLineApp().run()
File "/home/uogbuji/lib/python2.2/site-packages/Ft/Lib/CommandLine/CommandLineApp.py", line 90, in run
cmd.run_command(self.authenticationFunction)
File "/home/uogbuji/lib/python2.2/site-packages/Ft/Lib/CommandLine/Command.py", line 83, in run_command
self.function(self.clOptions, self.clArguments)
File "/home/uogbuji/lib/python2.2/site-packages/Ft/Xml/_4xml.py", line 89, in Run
raise RngInvalid(result)
Ft.Xml.Xvif.RngInvalid: _Pattern Empty, no content expected,
node <cElement at 0x838d7f4: name u'card', 0 attributes, 7 children>
The RngInvalid exception identifies the pattern that failed to match: a
card element contains content that the schema does not expect.
You can also access validation through the Python API using the new
Ft.Xml.Xvif.RelaxNgValidator class. For example:
from Ft.Xml.Xvif import RelaxNgValidator
from Ft.Xml import InputSource
from Ft.Lib import Uri
factory = InputSource.DefaultFactory
rng_uri = Uri.OsPathToUri("rng-tut3.rng", attemptAbsolute=1)
src_uri = Uri.OsPathToUri("rng-tut1.xml", attemptAbsolute=1)
rng_isrc = factory.fromUri(rng_uri)
src_isrc = factory.fromUri(src_uri)
validator = RelaxNgValidator(rng_isrc)
result = validator.isValid(src_isrc)
if result:
    print "Valid"
else:
    print "Invalid"
The isValid() method returns a 1 or 0 for validity. To get the actual
structure returned by the validator, use the validate() method instead. This
structure can easily be turned into an exception object. The following
variation prints "Valid" if valid, and raises an exception if not:
from Ft.Xml.Xvif import RelaxNgValidator, RngInvalid
from Ft.Xml import InputSource
from Ft.Lib import Uri
factory = InputSource.DefaultFactory
rng_uri = Uri.OsPathToUri("rng-tut3.rng", attemptAbsolute=1)
src_uri = Uri.OsPathToUri("rng-tut1.xml", attemptAbsolute=1)
rng_isrc = factory.fromUri(rng_uri)
src_isrc = factory.fromUri(src_uri)
validator = RelaxNgValidator(rng_isrc)
result = validator.validate(src_isrc)
if result.nullable():
    print "Valid"
else:
    raise RngInvalid(result)
If you want to use the validation error message without raising an
exception:
# Set-up as above
result = validator.validate(src_isrc)
if result.nullable():
    print "Valid"
else:
    print result.msg
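The two patterns above can be folded into one small function that returns the error message instead of printing it. The sketch below relies only on the validate(), nullable(), and msg members shown above. FakeResult and FakeValidator are hypothetical stand-ins that let the example run without 4Suite installed; in real use you would pass a RelaxNgValidator and an InputSource instead.

```python
def validation_error(validator, source):
    """Return None when source is valid, else the validator's message."""
    result = validator.validate(source)
    if result.nullable():
        return None
    return result.msg


# Stand-ins that mimic only the behavior described above.
class FakeResult(object):
    def __init__(self, msg=None):
        self.msg = msg

    def nullable(self):
        # A result with no message represents a valid document here.
        return self.msg is None


class FakeValidator(object):
    def __init__(self, msg=None):
        self._msg = msg

    def validate(self, source):
        return FakeResult(self._msg)


ok = validation_error(FakeValidator(), "rng-tut1.xml")
failed = validation_error(FakeValidator("no content expected"),
                          "rng-tut7.xml")
```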
Xvif does not report the location of validation errors, and stops after the first error. It does not support RELAX
NG compact syntax (RNC) or nameClasses (name, anyName, nsName, and
except elements in the schema). In addition, its support of XML Schema datatypes is incomplete, but has been
extended by 4Suite to accommodate a number of types, including the following (asterisk indicates support is exclusive to
4Suite):
-
xs:string
-
xs:normalizedString
-
xs:token
-
xs:ID *
-
xs:IDREF *
-
xs:integer
-
xs:nonPositiveInteger
-
xs:nonNegativeInteger
-
xs:positiveInteger
-
xs:negativeInteger
-
xs:unsignedLong
-
xs:unsignedInt
-
xs:long
-
xs:int
-
xs:short
-
xs:unsignedShort
-
xs:byte
-
xs:unsignedByte
-
xs:decimal
-
xs:date *
-
xs:boolean *
-
xs:time *
-
xs:dateTime *
-
xs:anyURI *
The numeric types all support the totalDigits, minInclusive,
maxInclusive, minExclusive, and maxExclusive facets.
xs:decimal also supports the fractionDigits facet.
The xs:string, xs:normalizedString, and xs:token types
support the length facet. In 4Suite only, xs:string and
xs:normalizedString support minLength, maxLength, and
pattern facets.
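In a RELAX NG schema, facets are written as param children of a data element. The following sketch holds such a schema in a Python string and lists its facets with the standard library, just to show the shape of the markup. The person, name, and age element names are invented for illustration, and the schema has not been run through Xvif here, so treat it as an assumption about what the validator accepts.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"

SCHEMA = """\
<element name="person" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="name">
    <data type="string">
      <param name="maxLength">40</param>
    </data>
  </element>
  <element name="age">
    <data type="integer">
      <param name="minInclusive">0</param>
      <param name="maxInclusive">150</param>
    </data>
  </element>
</element>"""

schema = ET.fromstring(SCHEMA)

# Collect (datatype, facet name, facet value) triples in document order.
facets = [
    (data.get("type"), param.get("name"), param.text)
    for data in schema.iter(RNG + "data")
    for param in data.findall(RNG + "param")
]
```

Note that maxLength on xs:string is one of the facets listed above as supported in 4Suite only.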
10 XUpdate processing
XUpdate is
a community specification for using an XML vocabulary to express
modifications to XML documents. It is essentially an XPath-based XML
transformation language, like XSLT. An XUpdate document is an XML document
that specifies what changes should be made to another XML document. XUpdate
is supported by many XML processing tools, especially in the open source
category, but it is neither a W3C Recommendation nor an ISO or IETF
standard. It is a project of the XML:DB Initiative's XUpdate Working
Group, and it never advanced beyond a Working Draft published in September
2000. It is not very well specified, but it is very convenient and enables a
basic level of functionality, so it has enjoyed popularity in a number of
implementations.
4Suite's XUpdate implementation, 4XUpdate, consists of a Python API
(via the Ft.Xml.XUpdate module) and a command-line script (4xupdate). The
APIs involve taking a source document (the XML to be updated) and an XUpdate
document (the changes to apply), and either producing a new document or
updating the source document in-place. The command line tool can be used,
for example, as a patching utility for XML. All of XUpdate (such as it's
specified) is currently implemented.
The Python API can be invoked directly on Domlette objects or on
InputSources. Here is an example of using the ApplyXUpdate convenience
function, which takes InputSources:
from Ft.Xml.Domlette import PrettyPrint
from Ft.Xml.InputSource import DefaultFactory
try:
    from Ft.Xml.XUpdate import ApplyXUpdate
except ImportError:
    # the function name changed between 1.0a3 and 1.0b1
    from Ft.Xml.XUpdate import ApplyXupdate as ApplyXUpdate
SOURCE='''<?xml version = "1.0"?>
<ADDRBOOK xmlns="http://bogus/">
<ENTRY ID="fr">
<NAME>fred</NAME>
</ENTRY>
</ADDRBOOK>'''
XU='''<?xml version="1.0"?>
<xu:modifications version="1.0" xmlns:xu="http://www.xmldb.org/xupdate"
xmlns:myns="http://bogus/">
<xu:append select="/myns:ADDRBOOK" child="last()">
<ENTRY ID="vz">
<NAME>Vasia Zhugenev</NAME>
</ENTRY>
</xu:append>
</xu:modifications>'''
src_isrc = DefaultFactory.fromString(SOURCE, "http://test1/")
xup_isrc = DefaultFactory.fromString(XU, "http://test2/")
result_dom = ApplyXUpdate(src_isrc, xup_isrc)
PrettyPrint(result_dom)
#expected:
#<?xml version="1.0" encoding="UTF-8"?>
#<ADDRBOOK xmlns="http://bogus/">
# <ENTRY ID="fr">
# <NAME>fred</NAME>
# </ENTRY>
#<ENTRY ID="vz">
# <NAME>Vasia Zhugenev</NAME>
# </ENTRY>
#</ADDRBOOK>
If you have both the source document and XUpdate document as Domlette
nodes already, you can use the XUpdate processor directly:
# add to the above script...
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.XUpdate import Processor
src_isrc = DefaultFactory.fromString(SOURCE, "http://test1/")
xup_isrc = DefaultFactory.fromString(XU, "http://test2/")
src_dom = NonvalidatingReader.parse(src_isrc)
xup_dom = NonvalidatingReader.parse(xup_isrc)
proc = Processor()
proc.execute(src_dom, xup_dom)
# src_dom has been modified in-place
PrettyPrint(src_dom)
Using the processor directly allows you to set XPath variables, if
needed:
from Ft.Xml import EMPTY_NAMESPACE
# execute with $x='foo'
proc.execute(src_dom, xup_dom, {(EMPTY_NAMESPACE, u'x'): u'foo'})
The command-line script works on local files or even URIs, if
resolvable, and normally sends the result XML to standard output, although
it can also be made to write to a file. See "4xupdate -h" for usage
instructions.
10.1 XUpdate and namespaces
In order to show how to use XUpdate to make namespace-aware
modifications, the following tasks will be demonstrated:
-
Add a new element in the products namespace, but using no
prefix.
-
Add a new element with a prefix and in the products
namespace.
-
Add a new element that is not in any namespace.
-
Add a new global attribute in the XHTML namespace.
-
Add a new global attribute in the special XML namespace.
-
Add a new attribute in no namespace.
-
Remove only the code element in the XHTML
namespace.
-
Remove a global attribute.
-
Remove an attribute that is not in any namespace.
Modification in place can always be simulated with an addition and
then a removal. The following code shows how these tasks can be performed
in XUpdate.
<xup:modifications version="1.0"
xmlns:xup="http://www.xmldb.org/xupdate"
xmlns:p="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink"
>
<!-- Task 1 -->
<xup:append select="/products/p:product[1]">
<xup:element
name="launch-date"
namespace="http://example.com/product-info"/>
</xup:append>
<!-- Task 2 -->
<xup:append select="/products/p:product[1]">
<xup:element
name="p:launch-date"
namespace="http://example.com/product-info"/>
</xup:append>
<!-- Can also be accomplished using literal result elements:
<xup:append select="/products/p:product[1]">
<p:launch-date/>
</xup:append>
-->
<!-- Task 3 -->
<xup:append select="/products/p:product[1]">
<xup:element name="island"/>
</xup:append>
<!-- Can also be accomplished using literal result elements:
<xup:append select="/products/p:product[1]">
<island/>
</xup:append>
-->
<!-- Task 4 -->
<xup:append select="/products/p:product/p:description/html:div">
<xup:attribute name="global"
namespace="http://www.w3.org/1999/xhtml">spam</xup:attribute>
</xup:append>
<!-- Task 5 -->
<xup:append select="/products/p:product/p:description/html:div">
<xup:attribute name="xml:lang">en</xup:attribute>
</xup:append>
<!-- Task 6 -->
<xup:append select="/products/p:product/p:description/html:div">
<xup:attribute name="class">eggs</xup:attribute>
</xup:append>
<!-- Task 7 -->
<xup:remove select="//html:code"/>
<!-- Task 8 -->
<xup:remove select="/products/p:product/p:description/html:div/ref/@xl:href"/>
<!-- Task 9 -->
<xup:remove select="/products/p:product[1]/@id"/>
</xup:modifications>
If you're familiar with XSLT, then you'll see XUpdate's resemblance
to it at first glance. The envelope element for modifications expressed
in XUpdate is xup:modifications, similar to
xsl:transform or xsl:stylesheet. The
namespace declarations on this element assign prefixes for use in the
XUpdate script and have no connection
to the prefixes used in the document being modified (the source document), even though they happen to be
the same. If you want to access elements in a namespace declared as the
default in the source document, then just as in XSLT you must declare and
use a prefix for the namespace in the XUpdate script.
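The same rule applies to any XPath-style query, which makes it easy to try out without 4Suite. The sketch below uses the standard library's ElementTree rather than 4XUpdate, and the small source document is made up for illustration. The product element is in a namespace declared as the default, so it can only be selected through a prefix that the query itself binds to that namespace URI.

```python
import xml.etree.ElementTree as ET

SOURCE = """\
<products>
  <product xmlns="http://example.com/product-info" id="1165">
    <name>Python Perfect IDE</name>
  </product>
</products>"""

root = ET.fromstring(SOURCE)

# The prefix "p" exists only in the query; the source document never uses it.
prefixes = {"p": "http://example.com/product-info"}

unprefixed = root.findall("product")           # looks in no namespace
prefixed = root.findall("p:product", prefixes)  # looks in the product namespace
names = [n.text for n in root.findall("p:product/p:name", prefixes)]
```

The unprefixed query finds nothing, because an unprefixed name in the path refers to an element in no namespace.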
Each modification request is expressed as an XUpdate instruction.
This example demonstrates xup:append and
xup:remove. There are other instructions providing
types of modification such as xup:insert-before and
xup:update, and there are also control constructs such
as xup:if, which is similar to
xsl:if. Instructions usually have a
select attribute containing an XPath expression that
specifies the node to be used as a reference for modification. In the case
of xup:append, select specifies a
node after which some new XML will be appended. In the case of
xup:remove, select identifies nodes
to be removed. When an instruction needs to specify a chunk of XML to be
used in the modification it is expressed as the content of the
instructions in a similar fashion to XSLT templates. In the case of
xup:append this template expresses the chunk of XML to
be inserted into the document. In order to generate elements and
attributes XUpdate provides output instructions such as
xup:element and xup:attribute, which
are very similar to their XSLT equivalents. In another idea borrowed from
XSLT, XUpdate allows you to create elements by placing literal result
elements in the templates. If you'd like to get a closer look at XUpdate,
the best way is by browsing the very clear examples in the XUpdate Use
Cases compiled by Kimbro Staken. The following listing is Python
code that can be used to apply an XUpdate script. It's a simplified
version of the code for the 4xupdate command line.
import sys
from Ft.Xml import XUpdate
from Ft.Xml import Domlette, InputSource
from Ft.Lib import Uri
# Set up reader objects for parsing the XML files
reader = Domlette.NonvalidatingReader
xureader = XUpdate.Reader()
# Parse the source file
source_uri = Uri.OsPathToUri(sys.argv[1], attemptAbsolute=1)
source = reader.parseUri(source_uri)
# Parse the XUpdate file
xupdate_uri = Uri.OsPathToUri(sys.argv[2], attemptAbsolute=1)
isrc = InputSource.DefaultFactory.fromUri(xupdate_uri)
xupdate = xureader.fromSrc(isrc)
# Set up the XUpdate processor and run against the source file
# The Domlette for the source is modified in place
processor = XUpdate.Processor()
processor.execute(source, xupdate)
# Print the updated DOM node to standard output
Domlette.Print(source)
Notice the use of Uri.OsPathToUri to convert file
system paths to proper URIs for use in 4Suite. I strongly recommend this
convention as one way to help minimize confusion between file
specifications and URIs -- the basis of many frequently asked questions.
The XUpdate.Processor class defines the environment for
running XUpdate commands and execute() is the method
that actually kicks off the processing. It operates on a Domlette
instance, modifying it in place (so be careful when using XUpdate in
this way). The updated document object is printed to standard output using
Domlette.Print.
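To make the difference between a file system path and a URI concrete, here is a rough standard-library analogue of the conversion. It is not Uri.OsPathToUri itself: os_path_to_uri is a hypothetical name, the code targets Python 3's pathlib, and the details of the result may differ from what 4Suite produces.

```python
from pathlib import Path


def os_path_to_uri(path):
    """Turn a relative or absolute file system path into a file: URI."""
    # resolve() makes the path absolute, which as_uri() requires.
    return Path(path).resolve().as_uri()


source_uri = os_path_to_uri("products.xml")
spaced_uri = os_path_to_uri("my products.xml")
```

The second call shows why the distinction matters: a space is legal in a file name but must be percent-encoded in a URI.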
The following snippet illustrates how to run the test script, and
the output result.
$ python listing4.py products.xml listing3.xup
<?xml version="1.0" encoding="UTF-8"?>
<products xmlns:p="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink"
>
<product xmlns="http://example.com/product-info">
<name xml:lang="en">Python Perfect IDE</name>
<description>
Uses mind-reading technology to anticipate and accommodate
all user needs in Python development. Implements all
features though
the year 3000. Works well with <code>1166</code>.
</description>
<launch-date/><p:launch-date/><island/></product>
<p:product id="1166">
<p:name>XSLT Perfect IDE</p:name>
<p:description>
<p:code>red</p:code>
<html:code>blue</html:code>
<html:div global="spam" class="eggs" xml:lang="en">
<ref xl:type="simple">A link</ref>
</html:div>
</p:description>
</p:product>
</products>
11 XInclude processing
11.1 About XInclude
XML Inclusions
(XInclude) is a W3C Recommendation that provides XML document
authors with a robust way of supporting document modularity via the use
of transclusions
(inclusions by reference). Such modularity would otherwise require using
references to external entities declared in a DTD, a system which has
various limitations inherited from SGML.
Unlike XML's built-in entity-reference system, the processing of
XIncludes is, fundamentally, an XML Infoset transformation, not strictly
an operation performed on the serialized (textual) form of a document.
Therefore, there is no requirement for when and where XInclude
processing should occur; it could happen at parse time if the parser
supports it, or could occur on an already-parsed document.
XInclude references consist of two special elements that are
placed in the XML document into which external content is to be
included: <include> and
<fallback>, both in the namespace
http://www.w3.org/2001/XInclude. When processed,
these elements are replaced with the content they reference, which can
be XML or any other text.
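As an illustration of inclusion performed on an already-parsed document, the sketch below uses the standard library's xml.etree.ElementInclude rather than 4Suite. The loader function and the in-memory documents are assumptions made for the example. Unlike the processing described in this manual, this module does not evaluate xi:fallback, so the example includes only a resource that can be loaded.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"

ARTICLE = """\
<article xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>My important article</title>
  <xi:include href="section1.xml"/>
</article>"""

SECTION = "<section><title>Section 1</title></section>"


def loader(href, parse, encoding=None):
    """Serve the referenced resource from memory instead of from disk."""
    if href == "section1.xml" and parse == "xml":
        return ET.fromstring(SECTION)
    return None


# Parse first, then transform the parsed tree in place.
article = ET.fromstring(ARTICLE)
ElementInclude.include(article, loader=loader)

included_title = article.find("section/title").text
```

After the call, the xi:include element has been replaced by the section element it referenced.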
11.2 XInclude support in 4Suite
4Suite supports XInclude processing only at parse time, as an
optional feature of the Domlette readers. It is turned on by default, so
if you want to suppress it, you must use the full parsing API — not
the Ft.Xml.Parse and
Ft.Xml.CreateInputSource convenience functions
— and set the parameter processIncludes to
False either when creating an
InputSource or when calling the
parseString, parseUri,
or parseStream method of the Domlette
reader.
11.3 Examples
The following example includes one section stub into a larger
article but has to use the fallback for the second section stub, where
resolution fails. “Document using XInclude” lists the
contents of the file article.xml, which references
two sections using XInclude and provides a fallback for each in case
they fail to load. “Section to be included”
lists the contents of section1.xml, but this
example purposefully does not provide a
section2.xml in order to illustrate the fallback
behaviour. “Loading the document” lists the Python
code used to parse and print this document; note that XInclude
processing is done automatically by default.
<article>
<title>My important article</title>
<xi:include href="section1.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:fallback><!-- Section 1 failed to load! --></xi:fallback>
</xi:include>
<xi:include href="section2.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:fallback><!-- Section 2 failed to load! --></xi:fallback>
</xi:include>
</article>
Figure 1: Document using XInclude
<section>
<title>Section 1</title>
<!-- Write me! -->
</section>
Figure 2: Section to be included
from Ft.Xml import Parse
from Ft.Xml.Domlette import PrettyPrint
doc = Parse("article.xml")
PrettyPrint(doc)
Figure 3: Loading the document
“Self-contained example” is very similar to
the above example, only this version is self-contained; the resources
are stored in Python strings and resolved using a custom
resolver.
article = """<article><title>My important article</title>
<xi:include href="ex:section" xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:fallback><!-- Section 1 failed to load! --></xi:fallback>
</xi:include>
<xi:include href="ex:section2" xmlns:xi="http://www.w3.org/2001/XInclude">
<xi:fallback><!-- Section 2 failed to load! --></xi:fallback>
</xi:include>
</article>"""
section = "<section><title>Section 1</title><!-- Write me! --></section>"
from Ft.Lib.Uri import FtUriResolver, Absolutize
from Ft.Lib import UriException
from cStringIO import StringIO
class MyResolver (FtUriResolver):
    def normalize(self, uriRef, baseUri):
        return Absolutize(uriRef, baseUri)
    def resolve(self, uri):
        if uri == "ex:article":
            return StringIO(article)
        elif uri == "ex:section":
            return StringIO(section)
        else:
            raise UriException(UriException.RESOURCE_ERROR,
                               loc=uri, msg="not found, sorry")
myResolver = MyResolver()
from Ft.Xml.InputSource import InputSourceFactory
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
factory = InputSourceFactory(resolver=myResolver)
isrc = factory.fromUri("ex:article")
doc = NonvalidatingReader.parse(isrc)
PrettyPrint(doc)
Figure 4: Self-contained example
To turn off XInclude behavior in “Self-contained example”, replace the last three lines
with these three lines:
isrc = factory.fromUri("ex:article", processIncludes=False)
doc = NonvalidatingReader.parse(isrc)
PrettyPrint(doc)
“Loading the document” uses the "super simple"
parsing API; we need to use the full parsing API in order to disable
XInclude expansion (which, paradoxically, takes one less line):
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
doc = NonvalidatingReader.parseStream(file("article.xml"), processIncludes=False)
PrettyPrint(doc)
12 XPointer processing
12.1 About XPointer
XPointer
is a set of W3C specifications (one part of which is, as of 2006, still
a Working Draft) that provide a means of identifying and referring to a
portion of an XML document. The portion being referenced need not be
contiguous, and need not constitute a well-formed general entity.
XPointers were originally intended to be used in the fragment component
of a URI or IRI (the fragment being the part after
"#"), but the specifications actually place no
restrictions on where they can be used.
An example of an XPointer embedded in a URI would be
http://example.com/inventory.xml#xpointer(//part%5Bstarts-with(sku,%20'999')%5D)
The XPointer in that example is actually
xpointer(//part[starts-with(sku, '999')])
but the URI syntax requires further encoding of some data. The
result of evaluating this XPointer would be the same as evaluating the
XPath expression //part[starts-with(sku, '999')] against
the document identified by the URI
http://example.com/inventory.xml.
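The encoding step can be reproduced with Python 3's urllib.parse, which is independent of 4Suite. In this sketch, the set of characters left unescaped when re-encoding is an assumption chosen to match the example URI above; other URI producers may escape more.

```python
from urllib.parse import quote, unquote, urlsplit

uri = ("http://example.com/inventory.xml"
       "#xpointer(//part%5Bstarts-with(sku,%20'999')%5D)")

# The fragment is everything after "#", still percent-encoded.
fragment = urlsplit(uri).fragment

# Decoding the fragment recovers the XPointer itself.
pointer = unquote(fragment)

# Encoding it again escapes the brackets and the space.
re_encoded = quote(pointer, safe="()/,'")
```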
XPointer syntax is simple: a shorthand pointer
is just a name, and refers to the element with that
ID (as determined by a DTD or other schema, typically), much like the
XPath 1.0 expression id(somename), but with a little more
flexibility, since id() is limited to DTD-based data
typing.
A scheme-based pointer consists of a
series of one or more pointer parts,
separated by optional whitespace, with each part looking like a function
call. What appear to be function names are actually syntactic and
semantic schemes, of which the most common is the
ID-oriented element scheme, and of which the most
versatile is the XPath-oriented xpointer
scheme.
If a scheme-based XPointer contains more than one part, then the
parts are evaluated from left to right, skipping any
unsupported/unrecognized schemes, until one is found that identifies
something that exists in the document. Some schemes, like the
namespace/prefix-binding xmlns, identify nothing (by
design), and instead just influence the interpretation of subsequent
parts. It's possible for an XPointer to produce different results with
different processors, if the author doesn't take care to ensure each
part identifies the same thing.
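To make the notion of parts concrete, the following sketch splits a scheme-based pointer into (scheme, data) pairs, in the left-to-right order in which a processor would consider them. split_pointer_parts is a hypothetical helper and is not how 4XPointer parses pointers; it balances parentheses but, as a simplification, ignores the circumflex escaping that the XPointer Framework defines.

```python
def split_pointer_parts(pointer):
    """Split a scheme-based XPointer into (scheme, data) pairs."""
    parts = []
    scheme = ""
    data = ""
    depth = 0  # parenthesis nesting; 0 means we are between parts
    for char in pointer:
        if depth == 0:
            if char == "(":
                depth = 1
            elif not char.isspace():
                scheme += char
            continue
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                # The part is complete; start collecting the next one.
                parts.append((scheme, data))
                scheme, data = "", ""
                continue
        data += char
    if depth != 0 or scheme:
        raise ValueError("not a scheme-based pointer: %r" % pointer)
    return parts


parts = split_pointer_parts(
    "xmlns(xhtml=http://www.w3.org/1999/xhtml)xpointer(//xhtml:a[@href])")
```

A shorthand pointer such as somename has no parts in this sense, so the function raises ValueError for it.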
Here are some more examples:
The XPath 1.0 expression id(somename) means the same
thing as the XPointer xpointer(id(somename)), and nearly
the same thing as the XPointers element(somename) and
somename, which just have more flexibility in where the ID
can be drawn from.
The XPointer element(somename/3/1) means nearly the
same thing as the XPath expression
id(somename)/*[3]/*[1].
The XPointer
xmlns(xhtml=http://www.w3.org/1999/xhtml)xpointer(//xhtml:a[@href])
could be used to refer to all of the links in an XHTML 1.0
document.
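The correspondence between the element scheme and XPath is mechanical enough to sketch in a few lines. element_pointer_to_xpath is a hypothetical helper, not part of 4Suite. It writes the ID as a quoted string literal, which is how the argument to id() would normally be spelled in a real XPath expression, and, as noted above, the result is only nearly equivalent because id() depends on DTD-based typing.

```python
def element_pointer_to_xpath(pointer):
    """Rewrite an element() scheme pointer as a similar XPath expression."""
    if not (pointer.startswith("element(") and pointer.endswith(")")):
        raise ValueError("not an element() pointer: %r" % pointer)
    data = pointer[len("element("):-1]
    # An optional name comes first, followed by a child sequence.
    name, _, child_sequence = data.partition("/")
    steps = []
    if child_sequence:
        steps = ["*[%d]" % int(step) for step in child_sequence.split("/")]
    if name:
        return "/".join(["id('%s')" % name] + steps)
    if not steps:
        raise ValueError("empty element() pointer")
    # Without a name, the child sequence starts at the document root.
    return "/" + "/".join(steps)


with_name = element_pointer_to_xpath("element(somename/3/1)")
root_based = element_pointer_to_xpath("element(/1/2)")
```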
12.2 XPointer support in 4Suite
4Suite's XPointer implementation, sometimes called 4XPointer, has
no command-line interface, but can be used within Python applications.
It supports XPointers to different degrees, depending on the
circumstances:
-
When an XML document is being parsed into a Domlette with
XInclude processing enabled, any XPointer encountered in an
xi:include element is automatically evaluated when the
included document is parsed. In this mode the XPointer must use an
XPath LocationPath that only uses steps along the child axis.
Furthermore, any predicates must be literal numbers, or must be of
the specific form [@attname='attvalue']. For example,
/foo[3] and /foo[@bar='baz'] will work,
but ../foo and foo/[.='bar'] will not.
Function calls are not allowed.
-
If you have not yet parsed an XML document, but have a URI for
it, then you can use
Ft.Xml.XPointer.SelectUri() to parse the
document and evaluate an XPointer embedded in the URI's fragment
component. The parsing is performed with Domlette's default
NonvalidatingReader instance. There are some
implementation gaps to note when using the
xpointer scheme: the only additional function
fully supported is here(), and the
following functions always return empty location-sets:
string-range(),
range-to(),
start-point(),
end-point(), and
origin(). The origin() function
is not permitted outside of extended XLinks in any case.
-
If you have already parsed the document into a Domlette, then
you can evaluate an arbitrary XPointer against it by using
Ft.Xml.XPointer.SelectNode(). The same
implementation gaps as noted in the description of
Ft.Xml.XPointer.SelectUri() apply.
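The restrictions on XPointers inside xi:include elements (first item above) can be approximated with a regular expression. The sketch below is one reading of those rules, not 4Suite's actual grammar, which may accept or reject more; it uses only the Python 3 standard library, and the name fits_xinclude_subset is hypothetical. It checks the XPath that appears inside xpointer(...).

```python
import re

# Simplified XML name (ASCII start character, no colon).
_NAME = r"[A-Za-z_][\w.-]*"
_QNAME = r"(?:{n}:)?{n}".format(n=_NAME)
# Name test: name, prefix:name, * or prefix:*
_NAME_TEST = r"(?:(?:{n}:)?(?:{n}|\*))".format(n=_NAME)
_LITERAL = r"""(?:'[^']*'|"[^"]*")"""
# Predicate: a literal number, or exactly [@attname='attvalue'].
_PREDICATE = r"\[(?:\d+|@{q}\s*=\s*{lit})\]".format(q=_QNAME, lit=_LITERAL)
_STEP = r"{t}(?:{p})*".format(t=_NAME_TEST, p=_PREDICATE)
# Child-axis steps only, optionally starting at the root.
_PATH = re.compile(r"/?{s}(?:/{s})*\Z".format(s=_STEP))


def fits_xinclude_subset(xpath):
    """Return True if xpath uses only child steps and simple predicates."""
    return _PATH.match(xpath) is not None


for expr in ("/foo[3]", "/foo[@bar='baz']", "../foo", "foo[.='bar']"):
    print(expr, fits_xinclude_subset(expr))
```

A check like this could be run over the XPointers in a document before relying on XInclude, falling back to Ft.Xml.XPointer.SelectNode() on an already parsed document for anything it rejects.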
Ranges are not supported because Domlette does not support DOM
Level 2 Ranges. Uche Ogbuji posted some
thoughts about this topic a while back. Also note that although
the element scheme is streamable, it is not yet
supported in XIncludes due to ID-related limitations in Domlette. Since
element and shorthand pointer support are
requirements for full XInclude conformance, they will probably be
implemented in the future.
In 4Suite 1.0b1 and earlier, the implementation was based on older
versions of the specs, and several additional restrictions were in
effect: the element scheme was not even an option,
XPointers in XIncludes had to be given via URIs (not attributes) and
couldn't contain NameTests involving "*", and all other
XPointers were only allowed to identify a single node.
12.3 Examples
The following example uses XInclude with XPointer references to
include various sections from one document into another document. “article.xml: Document using XInclude with
XPointer references” lists the contents of the file
article.xml, which references one section using a
shorthand pointer and then references any sections that have their
condition attribute set to unfinished. “article2.xml: Document with content
referenced from article.xml” lists the contents of the
file article2.xml, which is referenced from
article.xml. “Loading the document” lists the Python code used to parse
and print this document; note that XPointer processing is driven from
XInclude processing, which is done automatically by default.
<article>
<title>My important article</title>
<xi:include href="article2.xml"
xpointer="woo"
xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="article2.xml"
xpointer="xpointer(article/section[@condition='unfinished'])"
xmlns:xi="http://www.w3.org/2001/XInclude"/>
</article>
Figure 5 - article.xml: Document using XInclude with XPointer references
<article>
<section condition="unfinished">
<title>Section 1</title>
<!-- Write me! -->
</section>
<section xml:id="woo">
<title>Section 2</title>
<para>Yeah, content.</para>
</section>
<section condition="unfinished">
<title>Section 3</title>
<!-- Write me, too! -->
</section>
</article>
Figure 6 - article2.xml: Document with content referenced from article.xml
from Ft.Xml import Parse
from Ft.Xml.Domlette import PrettyPrint
doc = Parse("article.xml")
PrettyPrint(doc)
Figure 7 - Loading the document
As mentioned earlier, XPointer is most commonly used along with
XInclude, but 4Suite provides an API for using XPointer directly from
Python. Using article2.xml as listed above in “article2.xml: Document with content
referenced from article.xml”, “Using XPointer directly from Python” loads two of the nodes loaded
previously with XInclude. Note that when using the standalone interface,
the code is able to take advantage of more of the XPointer
syntax.
from Ft.Xml import Parse
from Ft.Xml.Domlette import PrettyPrint
from Ft.Xml.XPointer import SelectNode
article2 = Parse("article2.xml")
# Shorthand XPointer syntax
node = SelectNode(article2, "woo")[0]
PrettyPrint(node)
# Scheme-based XPointer syntax
node = SelectNode(article2,
                  "xpointer(//section[@condition='unfinished'][2])")[0]
PrettyPrint(node)
Figure 8 - Using XPointer directly from Python
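For readers without 4Suite at hand, the two selections made above can be mimicked with the Python 3 standard library to see which nodes they identify. This is only an illustration of the semantics: xml.etree.ElementTree has no XPointer support, so the shorthand pointer is emulated by searching for the xml:id attribute, and the position predicate is emulated by indexing a list. The indexing shortcut relies on all sections in this document sharing one parent.

```python
import xml.etree.ElementTree as ET

ARTICLE2 = """<article>
  <section condition="unfinished"><title>Section 1</title></section>
  <section xml:id="woo"><title>Section 2</title><para>Yeah, content.</para></section>
  <section condition="unfinished"><title>Section 3</title></section>
</article>"""

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

root = ET.fromstring(ARTICLE2)

# Shorthand pointer "woo": the element whose xml:id is "woo".
by_id = [el for el in root.iter() if el.get(XML_ID) == "woo"]

# xpointer(//section[@condition='unfinished'][2]): among the unfinished
# sections of a parent, the second one.
unfinished = root.findall("section[@condition='unfinished']")
second_unfinished = unfinished[1]

print(by_id[0].findtext("title"))           # Section 2
print(second_unfinished.findtext("title"))  # Section 3
```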
“Self-contained example” is very similar to
the examples above, only this version is self-contained; the resources
are stored in Python strings and resolved using a custom
resolver.
article = """<article><title>My important article</title>
<xi:include href="ex:article2"
xpointer="woo"
xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="ex:article2"
xpointer="xpointer(article/section[@condition='unfinished'])"
xmlns:xi="http://www.w3.org/2001/XInclude"/>
</article>"""
article2 = """<article>
<section condition="unfinished"><title>Section 1</title><!-- Write me! --></section>
<section xml:id="woo"><title>Section 2</title><para>Yeah, content.</para></section>
<section condition="unfinished"><title>Section 3</title><!-- Write me, too! --></section>
</article>"""
from Ft.Lib.Uri import FtUriResolver, Absolutize
from Ft.Lib import UriException
from cStringIO import StringIO
class MyResolver(FtUriResolver):
    # Resolve relative references against the base URI as usual.
    def normalize(self, uriRef, baseUri):
        return Absolutize(uriRef, baseUri)
    # Serve the two in-memory documents; reject everything else.
    def resolve(self, uri):
        if uri == "ex:article":
            return StringIO(article)
        elif uri == "ex:article2":
            return StringIO(article2)
        else:
            raise UriException(UriException.RESOURCE_ERROR,
                               loc=uri, msg="not found, sorry")
myResolver = MyResolver()
from Ft.Xml.InputSource import InputSourceFactory
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
factory = InputSourceFactory(resolver=myResolver)
isrc = factory.fromUri("ex:article")
doc = NonvalidatingReader.parse(isrc)
PrettyPrint(doc)
from Ft.Xml.XPointer import SelectNode
isrc = factory.fromUri("ex:article2")
article2 = NonvalidatingReader.parse(isrc)
node = SelectNode(article2, "woo")[0]
PrettyPrint(node)
node = SelectNode(article2,
                  "xpointer(//section[@condition='unfinished'][2])")[0]
PrettyPrint(node)
Figure 9 - Self-contained example
13 Comprehensive examples
This section contains examples that cut across the boundaries
of individual topics. They combine several techniques
and address use cases commonly found "in the wild".
13.1 Transforming DocBook using the DocBook XSL stylesheets
A common XML use case is converting DocBook (a widely used XML application)
to various output formats for publishing, using the DocBook
XSL set of XSLT scripts. If you have the DocBook XSL distribution
installed (or if you have an Internet connection), you can transform
DocBook files entirely within the 4Suite XML API. The following example
shows how this can be done, and along the way touches
on a wide variety of 4Suite XML techniques:
-
Building a Domlette XML model manually
-
Parsing XML into a Domlette XML model
-
Using XSLT in 4Suite XML
-
Using InputSources with automatic XML
Catalog resolution
-
Managing URIs
-
Writing XML from a Domlette XML model
-
And a bonus feature unrelated to 4Suite: i18n with the DocBook
XSL scripts!
from Ft.Xml.Domlette import implementation, PrettyPrint, NonvalidatingReader
from Ft.Xml.Xslt import Processor
from Ft.Xml import Catalog, InputSource, EMPTY_NAMESPACE
from Ft.Lib import Uri, UriException
# New processor
processor = Processor.Processor()
# If the DocBook XSL scripts are installed on your system, they are likely
# registered in the system catalog, which is often at `/etc/xml/catalog` on
# Unix-like systems. If the catalog that resolves the DocBook XSL URIs is stored
# in a different file, change the filename below. If no catalog can be loaded,
# this example will access the DocBook XSL scripts directly (i.e. over the network).
catalog_filename = '/etc/xml/catalog'
# Turn the catalog filename into the corresponding `file` URI.
catalog_URI = Uri.OsPathToUri(catalog_filename)
# Try to load the catalog, moving right along if it doesn't exist.
theCatalog = None
try:
    theCatalog = Catalog.Catalog(catalog_URI)
except UriException:
    # No catalog at that location; the stylesheets will be fetched by URI.
    pass
# Create a new `InputSourceFactory` object to use our catalog.
inputSourceFactory = InputSource.InputSourceFactory(catalog = theCatalog)
docbook_xsl_URI = 'http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl'
# Set up an `InputSource` for the DocBook XSL stylesheets.
docbook_xsl_source = inputSourceFactory.fromUri(docbook_xsl_URI)
# Build a DOM of our stylesheet, then load the stylesheet into the XSLT processor.
transform = NonvalidatingReader.parse(docbook_xsl_source)
processor.appendStylesheetNode(transform, docbook_xsl_URI)
# Now we build our DocBook DOM, with a document root of myDoc.
myDoc = implementation.createRootNode('file:///article.xml')
article = myDoc.createElementNS(EMPTY_NAMESPACE, 'article')
myDoc.appendChild(article)
# lang="es" asks the DocBook XSL scripts for Spanish generated text (the i18n bonus).
article.setAttributeNS(EMPTY_NAMESPACE, 'lang', "es")
myDoc.publicId="-//OASIS//DTD DocBook XML V4.2//EN"
myDoc.systemId="http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"
element = myDoc.createElementNS(EMPTY_NAMESPACE, 'title')
element.appendChild(myDoc.createTextNode('Title of article'))
article.appendChild(element)
section = myDoc.createElementNS(EMPTY_NAMESPACE, 'section')
article.appendChild(section)
element = myDoc.createElementNS(EMPTY_NAMESPACE, 'title')
element.appendChild(myDoc.createTextNode('Title of section'))
section.appendChild(element)
element = myDoc.createElementNS(EMPTY_NAMESPACE, 'para')
element.appendChild(myDoc.createTextNode('paragraph of section'))
section.appendChild(element)
print '************************ xml *******************************'
# Serialize the source document as XML.
PrettyPrint(myDoc)
print '************************ html *******************************'
# Print the result of transforming the document.
result = processor.runNode(myDoc)
print result
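The example above turns the catalog filename into a file URI with Uri.OsPathToUri before loading it. As a rough illustration of what that conversion amounts to for an absolute POSIX path, here is a minimal sketch using only the Python 3 standard library. The helper posix_path_to_file_uri is hypothetical; 4Suite's own function presumably covers more cases (such as Windows paths), which this sketch does not attempt.

```python
from urllib.parse import quote


def posix_path_to_file_uri(path):
    """Return a file: URI for an absolute POSIX path."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute POSIX path, got %r" % path)
    # Percent-encode characters not allowed in a URI path; "/" is kept as-is.
    return "file://" + quote(path)


print(posix_path_to_file_uri("/etc/xml/catalog"))  # file:///etc/xml/catalog
```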