Category XML

XML fun with Python and lxml’s objectify: Part 1.5: Advanced Parsing (XML, HTML, XHTML, oh my!)

There are dozens of scraping and parsing tools out there, but sometimes they are too bloated or simply don’t do what you want them to do. Some may think this is the Rube Goldberg approach, but it keeps you in absolute control and really isn’t as hard as it seems. This post illustrates some of the features of lxml’s objectify, which can parse everything from simple XML to HTML/XHTML and their broken variations.

For this example, I will be using the source code from google.com. You could use urllib or urllib2 to fetch the source and store it in a StringIO object; for this demonstration, I’ve already loaded the google.com source into one.
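As a rough illustration of the fetch-and-store step, here is a minimal sketch. It assumes Python 3, where `urllib.request` takes the place of urllib2 and `StringIO` lives in the `io` module. The helper names `fetch_source` and `wrap_source` are made up for this example, and the tiny HTML string is a stand-in, not the real google.com source.

```python
from io import StringIO
from urllib.request import urlopen


def fetch_source(url, encoding="utf-8"):
    """Download a page and return its markup wrapped in a StringIO object.

    Hypothetical helper: the encoding is assumed to be UTF-8, and bytes
    that fail to decode are replaced rather than raising an error.
    """
    with urlopen(url) as response:
        raw = response.read()
    return StringIO(raw.decode(encoding, errors="replace"))


def wrap_source(markup):
    """Wrap markup that is already in memory in a StringIO object."""
    return StringIO(markup)


# Stand-in markup so the sketch runs without touching the network.
source = wrap_source("<html><body><p>Hello</p></body></html>")
```

Either helper hands back a file-like object, so anything that expects an open file can read from it. Calling `fetch_source("http://www.google.com")` would do the real download.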

NOTE: Looking at the google.com source kind of makes my eyes want to bleed...


XML fun with Python and lxml’s objectify: Part 1: Parsing

I assume you’ve already read about lxml.objectify, so I won’t bother being redundant, but I am head-over-heels in love with it. The goal of this post is to supplement lxml’s documentation with real-world examples from my ETL experience using it. It also assumes that you have some familiarity with Python and Python datatypes. With that quick blurb out of the way, let’s get to some fun!

For this work, I will be using a sample XML file (test.xml) consisting of:







The parser I will be using requires a file path or file object, but you can use a string parser if you are working with XML, XHTML, HTML, etc. from other sources. First a parser needs to be created; then objectify parses the data with that parser...
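The two steps described above (create a parser, then let objectify parse with it) might look something like the sketch below. The XML here is a made-up stand-in, not the post's test.xml, and the element names (`catalog`, `book`, `title`, `price`) are assumptions chosen only for illustration. `BytesIO` plays the role of the file object.

```python
from io import BytesIO

from lxml import objectify

# Hypothetical sample data; NOT the test.xml used in the post.
SAMPLE_XML = b"""<?xml version="1.0"?>
<catalog>
  <book id="1">
    <title>First</title>
    <price>9.99</price>
  </book>
  <book id="2">
    <title>Second</title>
    <price>12.50</price>
  </book>
</catalog>
"""

# Step 1: create an objectify-aware parser.
parser = objectify.makeparser(remove_blank_text=True)

# Step 2a: parse from a file path or file-like object.
tree = objectify.parse(BytesIO(SAMPLE_XML), parser)
root = tree.getroot()

# Step 2b: or parse straight from a string with the same parser.
root_from_string = objectify.fromstring(SAMPLE_XML, parser)

# Children are reachable as attributes; iterating over root.book walks
# every sibling element that shares the <book> tag.
first_title = str(root.book.title)
prices = [float(book.price) for book in root.book]
book_ids = [book.get("id") for book in root_from_string.book]
```

With a real file on disk, passing its path (for example `objectify.parse("test.xml", parser)`) should work the same way as the file-like object used here.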
