Scraping data from the American Kennel Club

Update 12/14/2015

Webeducator created a video update of this tutorial. I encourage you guys to check it out!

So I was bored one night and wanted to practice the data science skills I learned in a summer class. I haven't gotten to practice my Python skills for two months now and I feel a little rusty. For this tutorial I will be using the pandas, Beautiful Soup, and requests Python libraries. I will provide additional resources for these libraries at the end of this tutorial.

Task: To create a visualization of dog breed trends in America

Problem: The American Kennel Club does not have a way to export their dog statistics data to a csv file.

Solution: Scrape the information from the web page using Beautiful Soup

Step 1: Inspect the webpage using your browser's developer tools. The Firefox Web Developer extension is my favorite tool for this.

As shown in the picture below there are two tables on the AKC dog registration statistics page.

[Image: akc_dataframe]

Tables in HTML are built from four tags. Tables are defined with the <table> tag. Tables are divided into table rows with the <tr> tag. Table rows are divided into table data with the <td> tag. A table row can also be divided into table headings with the <th> tag.
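
To make the four tags concrete, here is a minimal sketch that parses a tiny hand-written table. The HTML and the breed names in it are made up for illustration and are not taken from the AKC page. It shows that a heading row only has <th> cells, while the data rows only have <td> cells.

```python
from bs4 import BeautifulSoup

# A tiny hand-written table, used only for illustration (not AKC data).
html = """
<table>
  <tr><th>Breed</th><th>Rank</th></tr>
  <tr><td>Example Breed A</td><td>1</td></tr>
  <tr><td>Example Breed B</td><td>2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")          # the <table> tag
table_rows = table.find_all("tr")   # every <tr> row inside it

for tr in table_rows:
    headings = [th.get_text() for th in tr.find_all("th")]  # <th> cells
    cells = [td.get_text() for td in tr.find_all("td")]     # <td> cells
    print("headings:", headings, "| data:", cells)
```

The first row prints its headings with an empty data list, and the other two rows print the opposite.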

Knowing this, we can use Beautiful Soup to scrape the data from the webpage.

The first step in the code is to import all of your libraries into Python.

The BeautifulSoup library extracts data from the webpage.

The Requests library allows you to send HTTP requests.

The pandas library allows you to create data frames, which make it easy to change and manipulate sets of data.

from bs4 import BeautifulSoup
import requests
import pandas as pd

After we have imported our libraries, we need to create our BeautifulSoup object.


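url = "https://www.akc.org/reg/dogreg_stats.cfm"
r = requests.get(url)  # download the page
data = r.text  # the page's HTML as a string
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly to avoid a warning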
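
Web pages move and servers fail, so it can help to check that the download worked before parsing it. Here is a minimal sketch of that idea. The function name fetch_soup and the 10 second timeout are my own choices for illustration and are not part of the original code.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    fetch_soup is a hypothetical helper name; the timeout value is an
    arbitrary choice for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError if the server answered with a 4xx or 5xx code.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Usage (makes a real network request, so it is left commented out):
# soup = fetch_soup("https://www.akc.org/reg/dogreg_stats.cfm")
```

If the page has been moved or removed, this fails right away with an HTTP error instead of handing an error page to the parser.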

table = soup.find_all('table')[1]

Since we want information from the second table only, we need to specify which table we want by using index [1]. If we just use table = soup.find_all('table'), it gives us a list of both tables.
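
Here is a small sketch of how the index works, using a made-up page with two tables. The id values "first" and "second" are only there so we can see which table we got. They are assumptions for this example and say nothing about the markup of the real AKC page.

```python
from bs4 import BeautifulSoup

# Made-up page with two tables; the ids are for illustration only.
html = """
<table id="first"><tr><td>layout</td></tr></table>
<table id="second"><tr><td>stats</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # a list-like result holding both tables
print(len(tables))               # 2

second = tables[1]               # index 1 is the second table (counting starts at 0)
print(second["id"])
```

If the page ever changes so that it has fewer than two tables, tables[1] raises an IndexError, so checking len(tables) first is a cheap safeguard.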

rows = table.find_all('tr')

This command finds all the <tr> elements, which are all the rows of table 2.

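dogRows = []

We create this list outside the for loop so that the dog information is still available after the loop finishes.

Next we loop through all the rows in table 2, the dog statistics table, and collect the text of each column. Rows with fewer than five <td> cells (such as a heading row built from <th> tags) are skipped.

for tr in rows:
    cols = tr.find_all('td')
    if len(cols) < 5:
        continue  # skip rows without five <td> cells, such as heading rows
    dogName = cols[0].get_text(strip=True)
    ranking2013 = cols[1].get_text(strip=True)
    ranking2012 = cols[2].get_text(strip=True)
    ranking2008 = cols[3].get_text(strip=True)
    ranking2003 = cols[4].get_text(strip=True)
    dogRows.append([dogName, ranking2013, ranking2012, ranking2008, ranking2003])  # keep this row

Finally we can create a data frame from the data so that we can manipulate the information later on as needed.

dogData = pd.DataFrame(dogRows, columns=['dogName', 'ranking2013', 'ranking2012', 'ranking2008', 'ranking2003'])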
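
As an example of the kind of manipulation a data frame makes easy, here is a rough sketch that converts scraped rankings to numbers and works out how far each breed moved between 2003 and 2013. The two breeds and their rankings are invented for illustration, and the column name change_2003_to_2013 is my own.

```python
import pandas as pd

# Invented sample rows (not AKC data). Scraped values arrive as strings.
sample_rows = [
    ["Example Breed A", "1", "1", "2", "3"],
    ["Example Breed B", "2", "3", "1", "1"],
]
columns = ["dogName", "ranking2013", "ranking2012", "ranking2008", "ranking2003"]
df = pd.DataFrame(sample_rows, columns=columns)

# Convert the ranking columns from text to numbers.
# errors="coerce" turns anything that is not a number into NaN.
ranking_cols = columns[1:]
df[ranking_cols] = df[ranking_cols].apply(pd.to_numeric, errors="coerce")

# A positive value means the breed moved up the rankings since 2003.
df["change_2003_to_2013"] = df["ranking2003"] - df["ranking2013"]
print(df)
```

With these made-up numbers, Example Breed A moved up two places and Example Breed B dropped one.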

We can also create a csv file in case you want to export the data.

dogData.to_csv("AKC Dog Registrations Stats.csv", index=False)  # writes the DataFrame to a csv file
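
To check what ends up in the file, here is a short sketch that writes a small data frame to a temporary csv file and reads it back. The file name example_stats.csv and the sample values are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

# Invented sample data (not AKC data).
df = pd.DataFrame({
    "dogName": ["Example Breed A", "Example Breed B"],
    "ranking2013": [1, 2],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_stats.csv")  # placeholder file name

    # When given a path, to_csv writes the file and returns None,
    # so there is nothing useful to assign to a variable.
    # index=False leaves out the row numbers pandas would otherwise write.
    result = df.to_csv(path, index=False)

    reloaded = pd.read_csv(path)  # read the file back to check its contents

print(result)  # None
print(reloaded)
```

Without index=False the file would get an extra unnamed first column holding the row numbers.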

So the entire code is

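from bs4 import BeautifulSoup
import requests
import pandas as pd

# Download the page and parse it
url = "https://www.akc.org/reg/dogreg_stats.cfm"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")

# The second table on the page holds the dog statistics
table = soup.find_all('table')[1]
rows = table.find_all('tr')

dogRows = []  # collects one list of values per breed
for tr in rows:
    cols = tr.find_all('td')
    if len(cols) < 5:
        continue  # skip rows without five <td> cells, such as heading rows
    dogName = cols[0].get_text(strip=True)
    ranking2013 = cols[1].get_text(strip=True)
    ranking2012 = cols[2].get_text(strip=True)
    ranking2008 = cols[3].get_text(strip=True)
    ranking2003 = cols[4].get_text(strip=True)
    dogRows.append([dogName, ranking2013, ranking2012, ranking2008, ranking2003])

dogData = pd.DataFrame(dogRows, columns=['dogName', 'ranking2013', 'ranking2012', 'ranking2008', 'ranking2003'])
dogData.to_csv("AKC Dog Registrations Stats.csv", index=False)  # writes the DataFrame to a csv file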

There you have it! A data set of dog registration stats. Data visualization coming soon.

For basic tutorials on Beautiful Soup:
http://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup/

For basic tutorials on Pandas:
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/


