Web Scraping Part 2 (Deux)

Like I said in my previous post, sometimes you find yourself in a situation where you need to extract information from a website that has no API and/or an HTML structure that is completely bonkers.

While you can recursively dig through a website with Beautiful Soup and find the information you are looking for, it is much easier to do with the Python package lxml. This package can transform a website into an XML tree.

XML

XML stands for eXtensible Markup Language, and it is related to HTML (HyperText Markup Language). HTML is used to mark up web pages, while XML is used to mark up data, which makes it easier to exchange data between different devices. Many library databases use XML, and Android apps use XML to describe layouts and data. Instead of using pre-defined tags and attributes such as p, div, class, and id, XML lets you define your own tags to mark up your own data. A common example of this is at http://www.w3schools.com/xml/
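For instance, here is a minimal sketch, with tag names I made up for this post, showing how Python can read data marked up with custom tags:

from lxml import etree

# A tiny XML document; none of these tag names are pre-defined anywhere
data = b"""<recipeList>
  <recipe name="sundae">
    <ingredient>Ice cream</ingredient>
    <ingredient>Whipped cream</ingredient>
  </recipe>
</recipeList>"""

root = etree.fromstring(data)
for ingredient in root.iter("ingredient"):
    print(ingredient.text)   # prints each ingredient's text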

XML has companion tools that help you manipulate your XML documents so they can be easily read by other machines or by humans. XSLT lets you transform your XML into different document formats, such as PDF, HTML, and many more, by using XPath. XPath is used to traverse the XML document, and it is the focus of this post.
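As a quick illustration of XSLT, here is a sketch that uses lxml to apply a toy stylesheet I wrote for this post, turning a made-up XML menu into an HTML list:

from lxml import etree

doc = etree.XML('<menu><item>Sundae</item><item>Float</item></menu>')

# A toy stylesheet: each <item> becomes an HTML <li>
xslt = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/menu">
    <ul><xsl:apply-templates select="item"/></ul>
  </xsl:template>
  <xsl:template match="item">
    <li><xsl:value-of select="text()"/></li>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(xslt)
print(str(transform(doc)))   # an HTML list built from the XML data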

XPath

XPath is a tool used to traverse an XML document. XPath uses expressions to select nodes from the document tree. The most common expressions are listed below, and a short sketch after the table tries each one.

Path Expression   Description
/                 Select the root element of the document
/bobsTag          Select the root element, but only if it's named "bobsTag"
//tagName         Find all "tagName" elements anywhere in the document
text()            Select the text content of the current node
@name             Select the "name" attribute of the current node
..                Select the parent of the current node
[1]               A predicate, appended to the end of an XPath expression, that selects a particular node by position; positions start at 1, and any number can be used
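Here is a minimal sketch that tries each of these expressions with lxml (the document itself is invented for this post):

from lxml import etree

doc = etree.fromstring(
    '<bobsTag><child name="first">one</child>'
    '<child name="second">two</child></bobsTag>'
)

print(doc.xpath('/bobsTag'))            # the root element, since it is named bobsTag
print(doc.xpath('//child'))             # every <child> element in the document
print(doc.xpath('//child/text()'))      # ['one', 'two']
print(doc.xpath('//child/@name'))       # ['first', 'second']
print(doc.xpath('//child[1]/text()'))   # ['one'] because predicates count from 1
print(doc.xpath('//child[1]/..'))       # the parent of the first <child>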

For example, suppose we have an HTML list like this (an illustrative snippet; only the second item matters here):

<ul>
  <li>Ice cream</li>
  <li>Whipped cream</li>
</ul>

If we want to select the list item that says "Whipped cream", the XPath expression is

//li[2]

If we had written this with Beautiful Soup, the code would look something like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<ul><li>Ice cream</li><li>Whipped cream</li></ul>', "html.parser")
itemLinks = soup.find_all("li")
print(itemLinks[1])   # find_all returns a 0-indexed list, so [1] is the second <li>
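For comparison, here is a runnable sketch of the lxml version, using the same illustrative snippet:

from lxml import html

tree = html.fromstring('<ul><li>Ice cream</li><li>Whipped cream</li></ul>')
print(tree.xpath('//li[2]/text()'))   # ['Whipped cream']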

A real-life example is when I had to scrape the Hal Leonard website. I needed to catalog some sheet music books based on instrumentation. Sometimes I would receive ten items to put in a catalog, other times it would be fifty. I wanted to Automate the Boring Stuff. I inspected the website with Google Chrome's dev tools and I found this:

[Screenshot: hal3.PNG]

The website's layout was in a table. Not only one table, but multiple tables nested within each other. Try as I might with Beautiful Soup jiu-jitsu, I could not figure out a way to extract the data that I needed. I searched through Stack Overflow, which led me to The Hitchhiker's Guide to Python web scraping tutorial. It was my first introduction to Python's lxml library. I emulated what the tutorial did, but it left pretty wide gaps, such as what XML is and what XPath is.

That's where Python's lxml package and my previous tutorial on web queries come in.

[Screenshot: hal1]

The URL for the Hamilton sheet music is

http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1

Notice that the itemid number is 155921. If I want to see the instrumentation, the URL becomes

http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1&&viewtype=instruments

This pattern is the same for any piece of sheet music.
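Since the pattern is predictable, a small helper can build the instrumentation URL for any item ID. This is just a sketch, and buildInstrumentUrl is my own name for it, not anything from the site:

# Hypothetical helper: build the instrumentation-view URL for an item ID
def buildInstrumentUrl(itemid):
    return ('http://www.halleonard.com/product/viewproduct.action'
            '?itemid={}&subsiteid=1&&viewtype=instruments'.format(itemid))

print(buildInstrumentUrl(155921))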

 

[Screenshot: hal4]

Using the Chrome dev tools, I inspect the text that I want to extract. Then I use Python's lxml library to craft an XPath expression:

from lxml import html
import requests

# Fetch the instrumentation view and parse the HTML into an element tree
page = requests.get('http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1&&viewtype=instruments')
tree = html.fromstring(page.content)

# Pull the text out of every <td class="productContent"> cell
instruments = tree.xpath('//td[@class="productContent"]/text()')
print('Instruments:', instruments)

 

Here is the same script using the Beautiful Soup library:

import requests
from bs4 import BeautifulSoup

def getInstrumentation(halLeonardUrl):
    r = requests.get(halLeonardUrl)
    soup = BeautifulSoup(r.text, "html.parser")
    instruments = ""
    # The instrumentation lives in <ul> lists inside <td class="productContent"> cells
    tdTags = soup.find_all("td", {"class": "productContent"})
    for tag in tdTags:
        for ulTag in tag.find_all("ul"):
            instruments = ulTag.text.strip().replace("\n", ",")
    return instruments
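Called with the instrumentation URL from earlier, a quick sketch of its usage:

url = 'http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1&&viewtype=instruments'
print(getInstrumentation(url))   # e.g. a comma-separated instrumentation string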

 

