XML fun with Python and lxml’s objectify: Part 1: Parsing

I assume you’ve already read about lxml.objectify, so I won’t repeat the basics, but I am head over heels in love with it compared to lxml’s plain etree interface. This post is meant as a supplement to lxml’s documentation, with real-world examples from my ETL experience using it. I also assume you have some familiarity with Python and its data types. With that quick blurb out of the way, let’s get to some fun!

For this work, I will be using a sample XML file (test.xml) consisting of:







The objectify.parse() function I will be using requires a file path or a file-like object, but you can use objectify.fromstring() instead if you are working with XML, XHTML, HTML, etc. that arrives as a string from other sources. Either way, a parser needs to be created first, then objectify parses the data with that parser. In my example, I call objectify.parse() with a filename as the first argument and the parser instance as the second.

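from lxml import etree
from lxml import objectify

def main(testfile):
    # Build an objectify-aware parser, parse the file, and grab the root element.
    parser = objectify.makeparser(remove_blank_text=True, recover=False, ns_clean=True)
    xmlData = objectify.parse(testfile, parser).getroot()

    # Strip py:pytype, xsi:type and xsi:nil annotations plus unused namespace declarations.
    objectify.deannotate(xmlData, pytype=True, xsi=True, xsi_nil=True, cleanup_namespaces=True)

    # tostring() returns bytes when an encoding is given, so decode before printing.
    print(etree.tostring(xmlData, pretty_print=True, method="xml", xml_declaration=True, encoding="utf-8").decode("utf-8"))

if __name__ == "__main__":
    f = "test.xml"
    main(f)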
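If your XML arrives as a string rather than a file, a minimal sketch of the same setup with objectify.fromstring() might look like the following. SAMPLE_XML and the parse_string helper are stand-ins for illustration: the nesting root > a > b > c is only inferred from the tag listings later in this post, not copied from the real test.xml.

```python
from lxml import etree
from lxml import objectify

# Assumed stand-in for test.xml: the nesting is inferred from the tag
# listings in this post, not copied from the real file.
SAMPLE_XML = b"<root><a><b><c/></b></a></root>"

def parse_string(xml_bytes):
    """Hypothetical helper: parse in-memory XML with the same parser options."""
    parser = objectify.makeparser(remove_blank_text=True, recover=False, ns_clean=True)
    return objectify.fromstring(xml_bytes, parser)

root = parse_string(SAMPLE_XML)
print(root.tag)

# With recover=False, malformed input raises instead of being silently repaired.
try:
    parse_string(b"<root><a></root>")
    malformed_ok = True
except etree.XMLSyntaxError:
    malformed_ok = False
print(malformed_ok)
```

Keeping recover=False is a deliberate choice for ETL-style work: a broken feed fails loudly at parse time rather than producing a partial tree.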

I use objectify.deannotate to remove the extra annotation attributes from lxml’s tree so I can get simple and straightforward XML:








So now that you have your XML parsed, now what? From here, you can iterate over the entire tree, iterate over the tree looking for a specific tag, list children, list descendants, execute xpath queries, whatever your heart desires.

Here is a simple example of getting children:

for child in xmlData.getchildren():
    print(child.tag)

Which returns:

a

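You can also count how many children there are without having to use a list comprehension or other iterators by calling countchildren(). Note that len() on an objectify element counts the siblings that share that element’s tag (itself included) rather than its children, so it is not a reliable child count:

print(xmlData.countchildren())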

Which returns:

1
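To see how counting behaves once tags repeat, here is a small sketch. The three-child document is a made-up example, not the test.xml from this post.

```python
from lxml import objectify

# Hypothetical document with repeated tags, not the test.xml from this post.
root = objectify.fromstring("<root><a/><a/><b/></root>")

child_count = root.countchildren()   # direct children of <root>
same_tag_count = len(root.a)         # <a> elements under the same parent
root_len = len(root)                 # <root> has no same-tag siblings besides itself
child_tags = [child.tag for child in root.iterchildren()]

print(child_count, same_tag_count, root_len, child_tags)
```

The takeaway is that countchildren() and iterchildren() talk about children, while len() and plain iteration over an objectify element talk about same-tag siblings.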

You can also access child elements with dictionary-style lookups:

print(xmlData["a"].tag)

Which returns:

a

Or:

print(xmlData["a"]["b"]["c"].tag)

Which returns:

c
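The same lookups also work with attribute-style access, and an index selects among repeated siblings. A minimal sketch, assuming a made-up document with two “b” elements (not the test.xml from this post):

```python
from lxml import objectify

# Hypothetical document with two <b> siblings, not the test.xml from this post.
root = objectify.fromstring(
    "<root><a><b><c>first</c></b><b><c>second</c></b></a></root>"
)

by_key = root["a"]["b"]["c"]   # dictionary-style lookup, first <b> by default
by_attr = root.a.b.c           # attribute-style lookup of the same element
second = root.a.b[1].c         # index 1 picks the second <b> sibling

print(by_key.text, by_attr.text, second.text)

# Asking for a child that does not exist raises AttributeError.
try:
    root.a.missing
    found_missing = True
except AttributeError:
    found_missing = False
print(found_missing)
```

The dictionary style is handy when the tag name is held in a variable or is not a valid Python identifier; otherwise the attribute style reads a little cleaner.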

To iterate over the entire tree, you can simply use the iter() method:

for elem in xmlData.iter():
    print(elem.tag)

Which returns:

root
a
b
c

You can also restrict the iteration to a specific tag by passing the “tag” keyword to iter():

for elem in xmlData.iter(tag="b"):
    print(elem.tag)

Which returns:

b

NOTE: A caveat to this is that if you have multiple elements with the same name but in different parts of the tree, this will not distinguish between them.

I have frequently used .getparent().tag to verify the parent of the element I’m iterating over, such as:

for elem in xmlData.iter(tag="c"):
    if elem.getparent() is not None:
        if elem.getparent().tag == "b":
            print(elem.tag)

Which returns:

c

NOTE: The root element has no parent, so getparent() returns None, and None has no tag attribute. That is why I check that getparent() does not return None before reading the tag attribute.
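To make the same-name caveat concrete, here is a sketch with a “c” element under two different parents. The document is a made-up example, and the XPath variant at the end is just an alternative I would consider, not something the loop above requires.

```python
from lxml import objectify

# Hypothetical document: <c> appears under both <b> and <d>.
root = objectify.fromstring(
    "<root><a><b><c>under b</c></b><d><c>under d</c></d></a></root>"
)

# iter(tag="c") finds every <c>, wherever it sits in the tree.
all_c = [elem.text for elem in root.iter(tag="c")]

# Filtering on the parent's tag keeps only the <c> elements under <b>.
under_b = [
    elem.text
    for elem in root.iter(tag="c")
    if elem.getparent() is not None and elem.getparent().tag == "b"
]

# An XPath expression can state the same parent/child condition directly.
via_xpath = [elem.text for elem in root.xpath(".//b/c")]

print(all_c, under_b, via_xpath)
```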

NOTE: Each element has a few common attributes that you can use for information: .tag for the element’s tag name; .attrib for the element’s attributes (a dictionary-like object where the key is the attribute name and the value is the attribute value); .text, which returns the contents of the element as a string (or None if the element is empty); and, on elements that hold a value, .pyval, which returns that value with a guessed object type (instead of a string, it can return int, long, float, etc.).
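A short sketch of those attributes on a made-up document (the qty, price and label tags are illustrative names, not part of test.xml):

```python
from lxml import objectify

# Hypothetical document used only to show .tag, .attrib, .text and .pyval.
root = objectify.fromstring(
    '<root><qty unit="each">42</qty><price>9.5</price><label>lxml</label></root>'
)

qty = root.qty
print(qty.tag)             # tag name
print(dict(qty.attrib))    # attributes as a plain dictionary
print(repr(qty.text))      # raw contents, always a string
print(repr(qty.pyval))     # contents converted to a guessed Python type

price_value = root.price.pyval
label_value = root.label.pyval

# .pyval lives on value-holding elements; a container like <root> has none,
# so objectify treats the name as a child lookup and raises AttributeError.
root_has_pyval = hasattr(root, "pyval")
print(price_value, label_value, root_has_pyval)
```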

What if you want to quickly select element text without having to iterate through the entire tree? What if you want to use a simple XPath expression to evaluate conditions? For this, you can call the xpath() method on the root of the XML tree:

print(xmlData.xpath(".//c[text()]"))

Which returns:

[]

What happened? When an XPath expression is evaluated, it returns a list of results, and that list is empty because the “c” element in our XML tree has no text. Let’s add some text by assigning to the tree like a dictionary, then run the XPath query again:

xmlData["a"]["b"]["c"] = "hello world"

print(xmlData.xpath(".//c[text()]"))

Which returns:

['hello world']
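Two details of this last step are easy to miss. The list holds the matching element (its repr just happens to show the text), and assigning a Python value makes objectify annotate the element with a py:pytype attribute, which is exactly the kind of thing deannotate strips. A sketch, assuming the same root > a > b > c nesting inferred from the listings in this post:

```python
from lxml import etree
from lxml import objectify

# Assumed stand-in for test.xml, with an empty <c> element.
root = objectify.fromstring("<root><a><b><c/></b></a></root>")
root["a"]["b"]["c"] = "hello world"

elements = root.xpath(".//c[text()]")   # matching elements
strings = root.xpath(".//c/text()")     # the text nodes themselves, as strings
print(elements[0].tag, strings)

# The assignment added a py:pytype annotation; deannotate removes it again.
annotated = etree.tostring(root).decode("utf-8")
objectify.deannotate(root, cleanup_namespaces=True)
clean = etree.tostring(root).decode("utf-8")
print(annotated)
print(clean)
```

If you need plain strings for further processing, selecting text() in the expression saves a second pass over the results.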

From these examples, you can see how simple iterating over an XML tree can be.
