At my current gig I am doing a lot of data munging. One common task in data munging is web scraping. I have an older article on my blog about web scraping with Python’s Beautiful Soup library or Microsoft Excel. In this post I am not going to talk about the HTML of a web page but about the URL, also known as the web address. There will be times when you need to scrape records from a web app with no API, and understanding URL query strings can help out in the long run.
What is a Query String?
A query string is the part of a web address that contains data parameters. When the URL is invoked, the web application uses those parameters to perform an action, such as a search through the website.
Parts of the URL
Breakdown of the URL:
First Part, the Protocol:
Every URL starts with a protocol, which is a set of rules for how a computer should talk to this web address. Two protocols that I know of are FTP and HTTP. FTP stands for File Transfer Protocol and HTTP stands for Hypertext Transfer Protocol. HTTP is what a browser uses to request and display an HTML page. FTP transfers computer files to and from a server.
Second Part, the Domain Name:
This part of the URL commonly identifies which company, agency or organization may be directly responsible for the information. The ending of the domain name hints at the kind of site:
* .com which identifies company or commercial sites
* .org for non-profit organization sites
* .edu for educational sites
* .gov for government sites
* .net for Internet service providers or other types of networks
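The first two parts can be pulled out of a URL with Python's standard library. This is only a minimal sketch: the URL below is a made-up example and is not taken from any site discussed in this post.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
url = "http://www.example.org/catalog/index.html"

parts = urlsplit(url)
protocol = parts.scheme              # the text before "://"
domain = parts.hostname              # the host name, without any port
suffix = domain.rsplit(".", 1)[-1]   # the last label of the domain name

print(protocol)
print(domain)
print(suffix)
```

Taking the last label with `rsplit` is a simplification; it does not handle multi-part endings such as `.co.uk`.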
Third Part, the Query String:
The query string usually starts after a question mark in the URL. It holds the parameters needed to invoke the web application to perform a particular task; in this case, streaming a video. Looking at this on the fly, I would assume v stands for video and 63rt_-aLPr0 is the unique identification number for this particular YouTube video.
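To see the same idea in code, here is a small sketch that splits a URL at the question mark and reads the v parameter. The host and the ID value are placeholders I made up for illustration, not a real video.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical video URL; the host and the ID value are placeholders.
url = "https://www.example.com/watch?v=abc123XYZ"

query = urlsplit(url).query   # everything after the "?"
params = parse_qs(query)      # dict mapping each name to a list of values
video_id = params["v"][0]     # first (and only) value given for "v"

print(query)
print(video_id)
```

`parse_qs` returns a list for every parameter because the same name may appear more than once in a query string.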
Another example is the Hal Leonard website. Note that I do not work at Hal Leonard, and I am making a purely educated guess about their site’s API.
In this first example the highlighted part is the query string. The variable itemid is a specific music item in Hal Leonard’s database. If you click on the link you can view the product. Notice the ampersand symbol as well; it allows you to add other parameters to your query string to narrow down what you are searching for.
The second URL shows you the names of the songs that are in this particular piece of sheet music.
- itemid is the unique identifier of this particular music item.
- viewtype is the type of information that you want to view. In this case it is the list of songs.
This example is similar to the previous one, but notice the value of the viewtype parameter: it is set to instruments. This URL shows the instrumentation for this piece of sheet music.
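Once you have guessed the parameter names, you can build these URLs in code instead of editing them by hand. The sketch below assumes a made-up base URL and item ID; only the parameter names itemid and viewtype and the value instruments come from the examples above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the real site's path is not assumed here.
BASE_URL = "https://www.example.com/viewitem"


def build_item_url(itemid, viewtype=None):
    """Return BASE_URL with itemid and, optionally, viewtype in the query string."""
    params = {"itemid": itemid}
    if viewtype is not None:
        params["viewtype"] = viewtype
    # urlencode joins the name=value pairs with "&" and escapes special characters.
    return BASE_URL + "?" + urlencode(params)


# 12345 is a placeholder item ID, not a real catalog number.
product_url = build_item_url(12345)
instruments_url = build_item_url(12345, "instruments")

print(product_url)
print(instruments_url)
```

Keeping the parameters in a dictionary makes it easy to try other guesses for viewtype without touching the rest of the URL.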
Sometimes reverse engineering is about making guesses and breaking stuff within your code.
More about query strings:
I would suggest looking at Greg Reda’s Web Scraping 201 post, where he explains in depth how to find APIs.