Learning About URL Query Strings

At my current gig I am doing a lot of data munging. One common task in data munging is web scraping. I have an older article on my blog about web scraping with Python’s Beautiful Soup library or Microsoft Excel. In this post I am not going to talk about the HTML of a web page but the URL, also known as the web address. Sometimes you will need to scrape records from a web app with no API, and understanding URL query strings can help out in the long run.

What is a Query String?

A query string is the part of a web address that contains data parameters. When the URL is requested, the website uses those parameters to perform a task, such as a search.

Parts of the URL

https://www.youtube.com/watch?v=63rt_-aLPr0

Breakdown of the URL:

First Part: the Protocol

https://

Every URL starts with a protocol (also called the scheme), which is a set of rules for how a computer should talk to that web address. The two protocols I run into most are HTTP and FTP. HTTP stands for Hypertext Transfer Protocol and is what browsers use to request web pages; HTTPS, used in this example, is HTTP over an encrypted connection. FTP stands for File Transfer Protocol and is used to transfer files between computers.

Second Part: the Domain Name

youtube.com

This part of the URL commonly identifies which company, agency, or organization may be responsible for the information on the site. The ending of the domain name hints at the kind of organization:

* .com for company or commercial sites
* .org for non-profit organization sites
* .edu for educational sites
* .gov for government sites
* .net for Internet service providers or other types of networks

Third Part: the Query String

watch?v=63rt_-aLPr0

The query string usually starts after a question mark in the URL. Strictly speaking, watch is the path, and the query string is everything after the question mark, here v=63rt_-aLPr0. These are the parameters the web application needs to perform a particular task, in this case streaming a video. Looking at this on the fly, I would assume v stands for video and 63rt_-aLPr0 is the unique identifier for this particular YouTube video.
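
To check a breakdown like this without eyeballing it, here is a minimal sketch using Python's standard urllib.parse module. It splits the YouTube URL into its parts and reads the parameters out of the query string. The variable names parts and params are my own, and the meaning of v is still just a guess.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://www.youtube.com/watch?v=63rt_-aLPr0"

# urlsplit breaks the URL into scheme, network location, path, and query
parts = urlsplit(url)
print(parts.scheme)   # the protocol
print(parts.netloc)   # the host / domain name
print(parts.path)     # the path on the server
print(parts.query)    # everything after the question mark

# parse_qs turns the query string into a dict mapping each name to a list of values
params = parse_qs(parts.query)
print(params)
```

Each value comes back as a list because the same parameter name can legally appear more than once in a query string.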

Another example is the Hal Leonard website. Note that I do not work at Hal Leonard, and I am purely making educated guesses about how their site's API works.

http://www.halleonard.com/product/viewproduct.action?itemid=193869&subsiteid=1

In this first example, the query string is everything after the question mark: itemid=193869&subsiteid=1. The variable itemid identifies a specific music item in Hal Leonard’s database. If you click on the link, you view the product. Notice the ampersand symbol as well; it lets you add other parameters to your query string to narrow down what you are searching for.
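
Here is a small sketch of how the ampersand separates parameters, again using Python's standard urllib.parse module. It only parses the URL text and does not contact the site. The variable names are mine, and what each parameter means to Hal Leonard's server remains an educated guess.

```python
from urllib.parse import urlsplit, parse_qsl

url = "http://www.halleonard.com/product/viewproduct.action?itemid=193869&subsiteid=1"

# Take only the part after the question mark
query = urlsplit(url).query

# parse_qsl splits on "&" and returns (name, value) pairs in their original order
pairs = parse_qsl(query)
for name, value in pairs:
    print(f"{name} = {value}")
```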

http://www.halleonard.com/product/viewproduct.action?itemid=193869&lid=193869&subsiteid=1&&viewtype=songlist

This second URL shows you the names of the songs that are in this particular piece of sheet music.

  • itemid is the unique identifier of this particular music item.
  • viewtype is the type of information that you want to view. In this case it is the list of songs.

http://www.halleonard.com/product/viewproduct.action?itemid=193869&subsiteid=1&&viewtype=instruments

This example is similar to the previous one, but notice the value set for viewtype. The viewtype parameter is set to instruments. This URL shows the instrumentation for this piece of sheet music.
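
Once you have guessed what the parameters do, you can generate URLs instead of typing them by hand. The sketch below uses urlencode from Python's standard library. The helper build_product_url and its argument names are hypothetical, and the parameter names come only from the example URLs above, not from any documented Hal Leonard API. It builds the strings and does not request the pages.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.halleonard.com/product/viewproduct.action"

def build_product_url(item_id, view_type=None, subsite_id=1):
    """Build a product URL from guessed parameters (hypothetical helper)."""
    params = {"itemid": item_id, "subsiteid": subsite_id}
    if view_type is not None:
        # viewtype values seen in the examples: "songlist", "instruments"
        params["viewtype"] = view_type
    # urlencode joins the pairs with "&" and escapes any special characters
    return f"{BASE_URL}?{urlencode(params)}"

songlist_url = build_product_url(193869, "songlist")
instruments_url = build_product_url(193869, "instruments")
print(songlist_url)
print(instruments_url)
```

Swapping the item_id in a loop is the usual next step when scraping, but be gentle with the site and check its terms first.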

Sometimes reverse engineering is about making guesses and breaking stuff within your code.

More about query strings:

I would suggest looking at Greg Reda’s Web Scraping 201 post; he explains in depth how to find APIs.

https://en.wikipedia.org/wiki/Query_string

https://support.google.com/webmasters/answer/6080548?hl=en

https://perishablepress.com/how-to-write-valid-url-query-string-parameters/
