XML fun with Python and lxml’s objectify: Part 1.5: Advanced Parsing (XML, HTML, XHTML, oh my!)

There are dozens of scraping and parsing tools out there, but sometimes they are too bloated or simply don’t do what you want them to do. Some may call this the Rube Goldberg approach, but it keeps you in absolute control and really isn’t as hard as it seems. This post illustrates some of the features of lxml’s objectify, which can be used to parse anything from simple XML to HTML/XHTML and their broken variations.

For this example, I will be using the source code from google.com. You could use urllib or urllib2 to fetch the source and store it in a StringIO object, which is what I’ve done for this demonstration.
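For reference, here is a minimal sketch of that fetch step (assuming Python 2, since urllib2 is mentioned; the variable name data is what the snippets below expect):

import urllib2
from StringIO import StringIO
from lxml import etree, objectify

# Fetch the raw source of google.com and wrap it in a file-like
# object so it can be handed straight to the parser.
response = urllib2.urlopen("https://www.google.com/")
data = StringIO(response.read())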

NOTE: Looking at the google.com source kind of makes my eyes want to bleed. What happened to the good old days of the internet before all of this mind-blowing JavaScript?

The first thing that needs to be done is to parse the XML structure, which means setting up a parser and then feeding it the data. Using the same parser as in the previous introduction, we pass our StringIO object directly into objectify’s parse() function, because a StringIO object mimics a file object (remove_blank_text drops ignorable whitespace, recover controls whether the parser tries to press on past errors, and ns_clean strips redundant namespace declarations):

xmlData = objectify.parse(data, objectify.makeparser(remove_blank_text=True, recover=False, ns_clean=True)).getroot()

But when we run it, we get an error like:

Traceback (most recent call last):
File "test-demo.xmlparsing.py", line 23, in
main(f)
File "test-demo.xmlparsing.py", line 13, in main
xmlData = objectify.parse(data,objectify.makeparser(remove_blank_text=True, recover=False, ns_clean=True)).getroot()
File "lxml.objectify.pyx", line 1836, in lxml.objectify.parse (src\lxml\lxml.objectify.c:23571)
File "lxml.etree.pyx", line 3310, in lxml.etree.parse (src\lxml\lxml.etree.c:72517)
File "parser.pxi", line 1808, in lxml.etree._parseDocument (src\lxml\lxml.etree.c:106174)
File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:106403)
File "parser.pxi", line 1716, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:105194)
File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src\lxml\lxml.etree.c:99876)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:94350)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:95786)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:94853)
XMLSyntaxError: StartTag: invalid element name, line 2, column 2

Don’t worry, that just means the structure of the HTML doesn’t conform to XML. For HTML that doesn’t follow XML rules, or even XHTML rules, we can simply parse it with etree’s HTML parser, then serialize the result and feed it into an objectify parser. Now we can set up the parser again without error:

data.seek(0)  # rewind the StringIO in case the failed parse above consumed it
html = objectify.parse(data, etree.HTMLParser())
xmlData = objectify.fromstring(etree.tostring(html, method="xml", xml_declaration=True, encoding="utf-8"), objectify.makeparser(remove_blank_text=True, recover=True, ns_clean=True))
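As a quick sanity check that we now have a usable root element (it should be the html root, as the child listing further down confirms):

print(xmlData.tag)  # prints: html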

So if we wanted to get all of the links on the page using XPath:

print(xmlData.xpath(".//a"))

We would get a list of results:

['Screen reader users, click here to turn off Google Instant.', 'Gmail', 'Images', u'', 'Sign in', <Element a at 0x...>, 'Learn more', 'Get Google Chrome', 'Learn more', 'Privacy', 'Terms', 'Settings', 'Search settings', 'Advanced search', ' History ', 'Search Help', ' Send feedback ', 'Advertising', 'Business', 'About', <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, 'More', <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, 'Even more from Google']

We can also use the iter() function:

print(xmlData.iter(tag="a"))

Which returns something like:

<lxml.etree.ElementDepthFirstIterator object at 0x...>

Whoa, that’s neat! iter() gives us an iterator rather than a list. That is much more memory-efficient than the list returned by xpath() above (or by a list comprehension), because elements are yielded one at a time instead of all being held in memory at once. To get the same results as the xpath query, I used a list comprehension:

print([elem for elem in xmlData.iter(tag="a")])

Which gives us:

['Screen reader users, click here to turn off Google Instant.', 'Gmail', 'Images', u'', 'Sign in', <Element a at 0x...>, 'Learn more', 'Get Google Chrome', 'Learn more', 'Privacy', 'Terms', 'Settings', 'Search settings', 'Advanced search', ' History ', 'Search Help', ' Send feedback ', 'Advertising', 'Business', 'About', <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, 'More', <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, <Element a at 0x...>, 'Even more from Google']

So now that you can get to the specific elements you want, let’s take a closer look at each one. Every element has a few basic attributes: .tag holds the element name, .text holds the element’s text content, and .attrib is a Python dictionary of attribute names and values. In this example, I go over each link element, get its text and force it to a string (so that NoneType shows up as “None”), and print the “href” attribute value only IF “href” is present in the attrib dictionary:

for elem in xmlData.iter(tag="a"):
    print("TAG: " + str(elem.text).strip())
    if "href" in elem.attrib:
        print("HREF: " + elem.attrib["href"])
    print("*" * 10)

The results were snipped for the sake of size:

TAG: Screen reader users, click here to turn off Google Instant.
HREF: /setprefs?suggon=2&prev=https://www.google.com/?gws_rd%3Dssl&sig=0_n5%3D
**********
TAG: Gmail
HREF: https://mail.google.com/mail/?tab=wm
**********
TAG: Images
HREF: https://www.google.com/imghp?hl=en&tab=wi&ei=neGvV
**********
TAG: None
HREF: http://www.google.com/intl/en/options/
**********
TAG: Sign in
HREF: https://accounts.google.com/ServiceLogin?hl=en&continue=https://www.google.com/
**********
TAG: None
HREF: https://www.google.com/webhp?hl=en

This was rather simple, wasn’t it? From here you can grab the links you want, take their URLs, feed them back into urllib/urllib2, then rinse and repeat.
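To make that rinse-and-repeat loop concrete, here is a rough sketch. This is an illustration only: parse_page() is a hypothetical helper wrapping the HTMLParser/objectify steps shown above, and a real crawler would also need politeness delays, error handling, and URL normalization:

import urllib2
from StringIO import StringIO

def crawl(url, depth=2, seen=None):
    # Recursively fetch pages and follow their absolute links.
    if seen is None:
        seen = set()
    if depth == 0 or url in seen:
        return
    seen.add(url)
    data = StringIO(urllib2.urlopen(url).read())
    root = parse_page(data)  # hypothetical wrapper around the objectify steps above
    for elem in root.iter(tag="a"):
        href = elem.attrib.get("href", "")
        if href.startswith("http"):
            crawl(href, depth - 1, seen)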

Of course the fun doesn’t have to stop there! You can still treat this like any other objectified XML object!

for child in xmlData.getchildren():
    print(child.tag)

head
body

Each “child” would then be its own element object which you can use, abuse, or mess with. This example demonstrates that by getting a list of child elements and then getting the children of those elements:

for elem in xmlData.getchildren():
    for child in elem.getchildren():
        print(child.tag)

Which returns:

meta
meta
meta
meta
title
script
style
style
style
script
link
div

Looking at this from another perspective, you can use the child element’s tag name as a key into the parent’s dictionary-like interface:

for elem in xmlData.getchildren():
    for child in xmlData[str(elem.tag)].getchildren():
        print(child.tag)

Which returns:

meta
meta
meta
meta
title
script
style
style
style
script
link
div
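And since this is objectify, you can also reach children directly as attributes instead of iterating. For example, using the title element we saw in the child listing above:

print(xmlData.head.title.text)  # prints the page title, e.g. "Google"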

You can also use iter(), iterdescendants(), iterancestors(), and many others to crawl your shiny parsed document.
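For example, here is a quick sketch of two of those iterators, using the same xmlData root as above:

# Walk upward from the first link back toward the document root.
first_link = next(xmlData.iter(tag="a"))
for ancestor in first_link.iterancestors():
    print(ancestor.tag)

# Walk downward from <body> over every descendant <script> element.
for elem in xmlData.body.iterdescendants(tag="script"):
    print(elem.attrib.get("src", "inline"))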
