21 Days of D3.js Follow-Up

This is a follow-up to the post I did last year, 21 Days of D3.js.

Last year I wanted to learn D3.js. For those of you who do not know, D3.js stands for Data-Driven Documents. The D3.js library is used on many journalism sites, one of them being the New York Times.

Last year my goal was to create 21 different data visualizations within 21 days. My inspiration for this goal came from Jen Dewalt’s 100 sites in 100 days. I’ve tried many times to learn how to use this library and just gave up. If you look back at my first blog post from three years ago, that was my first attempt at learning D3.js. In retrospect, I was not successful the first time around because I lacked a lot of background programming knowledge. When you are trying to learn something new, you have to take baby steps and celebrate your victories.

The learning curve for D3.js is steep.

Before attempting to learn D3, I would suggest spending a week or two learning basic JavaScript. I learned how to program in JavaScript on Free Code Camp. Like the name suggests, it’s an online coding bootcamp and it’s totally free. The bootcamp provides tutorials, algorithm problems, and portfolio projects, and it has a massive online community, so when you feel stuck you can ask questions and they will be answered quickly. If online MOOCs are not your thing, I would suggest reading Eloquent JavaScript, which is also free to read online. I personally like Jon Duckett’s JavaScript and jQuery: Interactive Front-End Web Development.

I am currently on my seventh day trying to learn D3, and it’s going a lot better than the first couple of tries. I borrowed Scott Murray’s Interactive Data Visualization for the Web from the library; the book is also completely free on the O’Reilly website. There is also a free book on Leanpub called D3 Tips and Tricks by Malcolm Maclean. The next couple of posts will be ramblings of me trying to learn D3.js. If anyone has any good suggestions, don’t hesitate to leave a comment.

 

My Django Story

On December 3, 2016, Django Girls hosted an Introduction to Django workshop in my city. Django Girls is a non-profit organization and a community that empowers and helps women organize free, one-day programming workshops by providing tools, resources, and support. In this workshop, we created our own content management system (CMS) similar to WordPress. Now let’s not get Django confused with the movie.


Django is a web framework written entirely in Python, so check out the Django tutorial on their website. It’s really cool, I promise. The Django Girls event was amazing. I saw a lot of familiar faces of people I’ve met through different Meetup groups around the city, but there were many more faces of people I had never encountered before. For the workshop, people were grouped in teams of about six, including a team leader. I was in a small group of three, which included a woman who is a QA tester and my team leader, who is a software developer. Going to the workshop inspired me to share my “Django Story.”

How did your story with code start?

My coding story started in my senior year of high school. I took A.P. Computer Science and I hated it. I was convinced that it was not for me. Fast forward to college and there was no escaping it. I was getting my bachelor’s in statistics, and we did most of our analysis scripting in R or SAS. My minor, Informatics, also required me to take a basic programming course. What made programming less of a chore was when I helped a family member of mine set up a WordPress site for her company. I had no idea about web programming, so I went to Codecademy, plowed through the HTML/CSS courses in two days, and started customizing her WordPress site. I was thinking to myself, “Wow.” If I could give a small business owner a voice on the web and increase the number of her clients in a matter of weeks, what else could I do with programming?

What did you do before becoming a programmer?

I was a professional student. Now I’m just a full-blown student of life.

What do you love the most about coding?

What I love about coding is that it has the same power as reading, writing, drawing, or any other craft: the power of creating something and sharing your ideas with people. When you read you are creating ideas, when you write you are sharing your ideas and your voice, and when you draw you are saying this is my personal style. When you code it’s the same magic.

Why Django?

I love working with Python. It’s an easy language to pick up and you can do anything with it. I’m excited about the fact that I can create a web application around the ETL scripts I’ve written for work.

What cool projects are you working on at the moment/planning on working on near future?

Right now I’m just working on some fun small side projects. I’m planning on creating a WordPress plugin similar to Hello, Dolly but with DJ Khaled quotes. I also want to do a data visualization of gentrification across different Chicago neighborhoods.

What are you most proud of?

My resilience. I’ve been rejected so many times by different scholarships, jobs, and programs. You tend to second-guess yourself. But if I learn something new, I consider that a win.

What are you curious about?

My curiosity about things changes as quickly as the wind blows. Right now I am just focusing on getting a solid foundation. I want to learn more about object-oriented programming.

What do you like doing in your free time? What’s your hobby?

If I am not at the library checking out comic books or hula hooping, I am going to different Meetup groups across the city to talk to other nerds like me.

Do you have any advice/tips for programming beginners?

Programming is for everyone. It does not matter how old or young you are. Work on projects that meet your interests.

 

 

Things I wish I knew before I went to graduate school

This post is dedicated to my bestie who is thinking about continuing her education.

I went to graduate school a year after I received my bachelor’s degree. I romanticized the program before arriving. Well, I am going to try to save you from that mistake by keeping it 100. Depending on the program you are going to, graduate school can be an intense social environment. Grad programs tend to be smaller than undergraduate programs, which means there might not be much diversity. I’m not only talking about cultural diversity but also personality-wise. Your personality might not mesh with 90% of your program’s student body, and that’s okay. I am a native Chicago Southsider and sometimes my life seems like a mashup between Dave Chappelle’s When Keeping It Real Goes Wrong skits and Daria.


But for the next one to four years, these people are going through some of the same classes you are going through, so buckle up. As long as you treat people with the same respect you wish to be treated with, things usually go smoothly. If it’s not reciprocated, then just ignore your haters and keep working on your future.

Basic advice for people who are thinking about going to graduate school.

Have a clear vision of what you want for your future (or at least somewhat of a vision).

There will be times in your program when you just feel like giving up. You wonder what the point of being there is. You might not have a mentor to guide you when you are at your breaking point. That is why it is important to have a vision of what you want in the future. If you do not know what you want to do right now, that’s fine. I would suggest hanging out in a non-academic space before you go back to school.

You are responsible for your own education.


The teachers in graduate school might be the most knowledgeable people in the subject you are studying, but they might be horrible teachers. A lot of teachers in higher ed don’t care about their students unless you are involved with their research, so it’s your job to get noticed. That means you have to do most of your learning outside of class. That could be networking with people who are already in the industry via Meetups, conferences, or online communities.

Get involved in things that you are passionate about or bring you joy.

Being a part of organizations or clubs that bring you joy not only boosts your morale, but also makes you more likely to go above and beyond. This could translate into letters of recommendation from other club members and a good story to tell in upcoming job interviews.

Do not completely disengage.


I have a nasty habit: when I am not being challenged academically, I just go on autopilot. I’m going to give you the advice that my 7th-grade teacher Ms. Jackson gave me: “Fake it until you make it!” Smile, be polite, and engage, because it might be the difference between an A and an A-, or a potential letter of recommendation from that teacher.

Web Scraping Part 2 (Deux)

Like I said in my previous post, sometimes you find yourself in a situation where you need to extract information from a website that has no API and/or whose HTML structure is completely bonkers.

While you can recursively dig through a website with Beautiful Soup and find the information you are looking for, it is much easier to do with the Python package lxml. This package can transform a web page into an XML tree.

XML

XML stands for eXtensible Markup Language and it is related to HTML (Hypertext Markup Language). HTML is used to mark up web pages, while XML is used to mark up data. XML makes it easier to send data between different systems and devices. Many library databases use XML, and I also know that Android apps use XML to parse and display data. Instead of using pre-defined tags such as p, div, class, id, etc., XML lets you define your own tags to mark up your own data. A common example of this is at http://www.w3schools.com/xml/
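To make “define your own tags” concrete, here is a minimal sketch using Python’s lxml. The document and its tag names (library, book, title, author) are made up purely for illustration.

from lxml import etree

# A tiny made-up XML document: the tags are ours, not pre-defined by XML
xml_doc = b"""
<library>
  <book>
    <title>Interactive Data Visualization for the Web</title>
    <author>Scott Murray</author>
  </book>
  <book>
    <title>D3 Tips and Tricks</title>
    <author>Malcolm Maclean</author>
  </book>
</library>
"""

tree = etree.fromstring(xml_doc)
for book in tree.findall("book"):
    # findtext grabs the text content of a child tag
    print(book.findtext("title"), "by", book.findtext("author"))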

XML has companion tools that help you manipulate your XML documents so they can be easily read by other machines or humans. XSLT lets you transform your XML into different document formats such as PDF, HTML, and many more, and it relies on XPath to do so. XPath is used to traverse an XML document, and it is the focus of this post.

XPath

XPath is a language used to traverse an XML document. XPath uses path expressions to select nodes (the branches and leaves) from the document tree.

Path Expression: Description
/ : Selects the root element of the document
/bobsTag : Selects the root element, but only if it is named “bobsTag”
//tagName : Finds all “tagName” elements anywhere in the document
text() : Selects the text content of the current node
@name : Selects the “name” attribute of the current node
.. : Selects the parent of the current node
[1] : A predicate placed at the end of an XPath expression to pick out a particular node by position; the predicate can be any number

For example, suppose we want to select the list item that says “Whipped cream” from an HTML list. The XPath expression would be:


//li[2]

If we had written this with Beautiful Soup, the code would look something like this:

itemLinks = soup.find_all("li")
print(itemLinks[1])
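Here is a minimal, self-contained sketch putting the two approaches side by side. The HTML list itself is just an assumed example, but it shows the XPath expression and the Beautiful Soup index landing on the same element.

from lxml import html
from bs4 import BeautifulSoup

# Assumed example HTML: the second <li> is the one we want
snippet = """
<ul>
  <li>Hot chocolate</li>
  <li>Whipped cream</li>
  <li>Cinnamon</li>
</ul>
"""

tree = html.fromstring(snippet)
print(tree.xpath('//li[2]/text()'))    # ['Whipped cream']

soup = BeautifulSoup(snippet, "html.parser")
print(soup.find_all("li")[1].text)     # Whipped cream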

A real-life example is when I had to scrape the Hal Leonard website. I needed to catalog some sheet music books based on instrumentation. Sometimes I would receive ten items to put in the catalog; other times it would be fifty. I wanted to Automate the Boring Stuff. I inspected the website with Google Chrome’s dev tools and I found this:

(Screenshot: the page source in Chrome’s dev tools, showing the nested tables.)

The website’s layout was in a table. Not just one table, but multiple tables nested within each other. Try as I might with Beautiful Soup jiu-jitsu, I could not figure out a way to extract the data I needed. I searched through Stack Overflow, which led me to The Hitchhiker’s Guide to Python web scraping tutorial. It was my first introduction to Python’s lxml library. I emulated the tutorial, but it left pretty wide gaps, such as what XML is and what XPath is.

That’s where Python’s lxml package and my previous tutorial on URL query strings come in.

(Screenshot: the Hamilton sheet music product page on the Hal Leonard site.)

The URL for the Hamilton sheet music is

http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1

Notice that the itemid number is 155921. If I want to see the instrumentation, the URL will be

http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1&&viewtype=instruments

This is the same for any piece of sheet music.
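Since only the itemid changes, you can build the instrumentation URL for any item with a little string formatting. A quick sketch using the two item IDs that appear in these posts:

base = 'http://www.halleonard.com/product/viewproduct.action?itemid={}&subsiteid=1&&viewtype=instruments'
for item_id in (155921, 193869):
    print(base.format(item_id))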

 

(Screenshot: inspecting the instrumentation text with Chrome’s dev tools.)

Using the Chrome dev tools, I inspect the text that I want to extract. Then I use Python’s lxml library to evaluate an XPath expression:

from lxml import html
import requests

page = requests.get('http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1&&viewtype=instruments')
tree = html.fromstring(page.content)
instruments = tree.xpath('//td[@class="productContent"]/text()')
print('Instruments:', instruments)

 

The same scrape written with the Beautiful Soup library:

from bs4 import BeautifulSoup
import requests

def getInstrumentation(halLeonardUrl):
    r = requests.get(halLeonardUrl)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    instruments = ""
    tdTag = soup.find_all("td", {"class": "productContent"})
    for tag in tdTag:
        ulTags = tag.find_all("ul")
        for tags in ulTags:
            instruments = tags.text.strip()
            instruments = instruments.replace("\n", ",")
    return instruments
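For example, pointing the function at the Hamilton instrumentation URL from earlier should print its instrument list:

print(getInstrumentation('http://www.halleonard.com/product/viewproduct.action?itemid=155921&subsiteid=1&&viewtype=instruments'))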
            

 


Learning About URL Query Strings

At my current gig I am doing a lot of data munging. One common task of data munging is web scraping. I have an older article on my blog about web scraping with Python’s Beautiful Soup library or Microsoft Excel. In this post I am not going to talk about the HTML of a web page but about the URL, also known as the web address. There will be times when you need to scrape records from a web app with no API, and understanding URL query strings can help out in the long run.

What is a Query String?

A query string is the part of the web address that carries data parameters; when the URL is requested, those parameters tell the web application what to search for or display.

Parts of the URL

https://www.youtube.com/watch?v=63rt_-aLPr0

Breakdown of the URL:

First Part: The Protocol

https://

Every URL starts with a protocol, which is a set of rules for how a computer should talk to the server at that web address. The two kinds of protocols that I know of are FTP and HTTP. FTP stands for File Transfer Protocol and HTTP stands for Hypertext Transfer Protocol. HTTP is used to deliver HTML pages, while FTP is used to transfer files. The https:// in this URL is simply HTTP over an encrypted connection.

Second Part: The Domain Name

youtube.com

This part of the URL commonly identifies which company, agency, or organization is responsible for the information. The top-level domain gives a hint about the type of organization:

* .com which identifies company or commercial sites
* .org for non-profit organization sites
* .edu for educational sites
* .gov for government sites
* .net for Internet service providers or other types of networks

Third Part: The Query String

watch?v=63rt_-aLPr0

The query string usually starts after a question mark in the URL. It contains the parameters needed to tell the web application to perform a particular task, in this case streaming a video. Just looking at this, I would assume v stands for video and 63rt_-aLPr0 is the unique identifier for this particular YouTube video.
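If you want to break a URL into these parts programmatically, Python’s standard library can do it for you. A minimal sketch:

from urllib.parse import urlparse, parse_qs

url = 'https://www.youtube.com/watch?v=63rt_-aLPr0'
parts = urlparse(url)

print(parts.scheme)             # https (the protocol)
print(parts.netloc)             # www.youtube.com (the domain name)
print(parts.path)               # /watch
print(parse_qs(parts.query))    # {'v': ['63rt_-aLPr0']} (the query string parameters)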

Another example is the Hal Leonard website. Note that I do not work at Hal Leonard and I am taking a purely educated guess about their site’s API.

http://www.halleonard.com/product/viewproduct.action?itemid=193869&subsiteid=1

In this first example, everything after the question mark (itemid=193869&subsiteid=1) is the query string. The variable itemid points to a specific music item in Hal Leonard’s database. If you click on the link, you view the product. Notice the ampersand symbol as well; it allows you to add other parameters to your query string to narrow down what you are searching for.

http://www.halleonard.com/product/viewproduct.action?itemid=193869&lid=193869&subsiteid=1&&viewtype=songlist

This second URL shows you the names of the songs that are in this particular piece of sheet music.

  • itemid is the unique identifier of this particular music item.
  • viewtype is the type of information that you want to view; in this case it is the list of songs.

http://www.halleonard.com/product/viewproduct.action?itemid=193869&subsiteid=1&&viewtype=instruments

This example is similar to the previous one, but notice the value of the viewtype parameter. It is set to instruments, so this URL shows the instrumentation for this piece of sheet music.
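Once you understand the query string, you can also let the requests library build it for you from a dictionary, which makes it easy to loop over many itemids. A rough sketch using only the parameter names we saw in the URLs above:

import requests

base_url = 'http://www.halleonard.com/product/viewproduct.action'
params = {'itemid': 193869, 'subsiteid': 1, 'viewtype': 'instruments'}

# requests encodes the dictionary as ?itemid=193869&subsiteid=1&viewtype=instruments
response = requests.get(base_url, params=params)
print(response.url)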

Sometimes reverse engineering is about making guesses and breaking stuff within your code.

More about query strings:

I would suggest looking at Greg Reda’s Web Scraping 201 post, where he explains in depth how to find APIs.

https://en.wikipedia.org/wiki/Query_string

https://support.google.com/webmasters/answer/6080548?hl=en

https://perishablepress.com/how-to-write-valid-url-query-string-parameters/

What I did this summer

Well, summer came and then it went. Labor Day was a week ago and school has officially started for everyone. This blog has been dead for a while. Like so many side bloggers before me, I have neglected my main source of self-advertisement. But there was an important reason why my blog has been dead for so long.

I’ve done nothing this summer!!!! Since Pokemon Go came out in July, I’ve just been trying to hunt for Squirtles.

 


 

Well, this is not entirely true. I spent two weeks back in the big city in June doing a bootcamp prep course with Full Stack Academy. The prep course is supposed to prepare you for the interview coding questions that many coding bootcamps conduct before selecting their candidates. I registered for it because it was affordable and I wanted to know what the big deal was about coding bootcamps. I was always skeptical about them because you can take an online MOOC for FREE.99 and learn the core concepts of web development.

The bootcamp prep went over basic algorithms in JavaScript that you see in coding interviews, such as recursion, proof by contradiction, and data structures. Overall I enjoyed the bootcamp. The instructors were patient and they explained things in a simple manner. It was nice being around other coders. We got to help each other out and solidify the concepts.

Full Stack Academy has a lot of great reviews online and people who go through the program have received amazing programming jobs. But it seems like they are mostly after people who have an engineering background and need to transition into programming. So if you are an absolute beginner to programming I would not suggest applying to this program.

After the bootcamp, the motivation to study those concepts wore off and I mostly focused on my job. I applied and did not get into Full Stack Academy. I must admit my ego was hurt, but it was for the best.


Like I said before, I am a bit of a cheapskate and I cannot justify spending $10,000 on a three-month bootcamp. I don’t believe my master’s degree even cost that much. Going into my second year of not being a student, I’ve realized I enjoy working with data. I would like to learn more about GIS and data visualization in general.


 

 

R Programming Assignment 1

I’m currently going through the Johns Hopkins Data Science specialization. So far the courses are okay. They are pretty tough, so if you are a complete beginner you can complement them with DataCamp courses if you need more practice. The only annoying part about this class is that the lectures do not mention some of the functions you will need to complete the assignments. Luckily there are TA hints to supplement this lack of information.

I am a researcher. When I run into a problem, I go online and try to ask the right question to help me solve what is in front of me. Or I talk to different people who have expert knowledge of the subject matter, or I just find the answer in a book at my local library. But being a programmer or data scientist involves breaking problems down yourself. The toughest part for me is getting comfortable with that kind of problem solving.

Background about the data

The data set is air pollution data from the course’s specdata folder: readings of particulate matter from 332 monitors, with one CSV file per monitor.

The first thing I do whenever I get data is to explore a file in Excel in order to get a better understanding of its structure. Each file in the specdata folder contains data for one monitor, with three columns:

  • Date: the date of the observation
  • sulfate: the level of sulfate particulate matter in the air on that date
  • nitrate: the level of nitrate particulate matter in the air on that date

For the pollutantmean function, the prompt states:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

At this point I don’t really know what I’m doing. It’s just me staring at the computer for an hour…


With every coding problem I try to think about things on a higher level. If I have an understanding of the problem then I can muddle my way through the syntax of the code.

What are they asking for here?

They just want the mean of the pollutant for the monitors specified by the IDs.

So the higher-level understanding is done; next we break the problem into steps:

  1. read in all the data files
  2. merge the data files into one data frame
  3. ignore the NAs
  4. subset the data frame by pollutant
  5. calculate the mean

Let’s find out what the mean function requires.

?mean

Usage

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
Arguments

x 
An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.

trim 
the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

na.rm 
a logical value indicating whether NA values should be stripped before the computation proceeds.

According to the documentation, mean takes a single R object. But there is a problem.

(Screenshot: the specdata folder, with one CSV file per monitor.)

 

The information for each monitor is in an individual CSV file, and the monitor ID matches the CSV file’s name. In order to find the mean of a pollutant across monitors with different IDs, we will have to put each monitor’s information into one data frame.

 

At this point I did not know how to walk through a directory of files. I know that in Python there is os.walk; in R there is the list.files function.

In order to perform the same actions on multiple files, I looped through the list:

pollutantmean<-function(directory,pollutant,id=1:332){
  #create a list of files
  filesD<-list.files(directory,full.names = TRUE)
  #create an empty data frame
  dat <- data.frame()
  
  #loop over the requested monitor ids
  for(i in id){
    #read in the file
    temp<- read.csv(filesD[i],header=TRUE)
    #add the file's rows to the main data frame
    dat<-rbind(dat,temp)
  }
  #find the mean of the pollutant column; dat[[pollutant]] extracts the column
  #as a vector (which is what mean() expects) and na.rm = TRUE removes NA values
  return(mean(dat[[pollutant]],na.rm = TRUE))
}

 

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

This question was easier than both Part 1 and Part 3. The question is simple to understand: they just want a report that displays the number of complete cases in each data file. This question reminds me of the NOBS function in SAS, another statistical package, which is used in many businesses, unlike R, which seems to be primarily used in academia and research.

Steps

  1. read in the files
  2. remove the NAs from the set
  3. count the number of rows
  4. create a new data frame with two columns that contain the monitor's ID number and the number of complete observations

complete <- function(directory,id=1:332){

#create a list of files
  filesD<-list.files(directory,full.names = TRUE)
  #create an empty data frame
  dat <- data.frame()
  
  for(i in id){
  #read in the file
    temp<- read.csv(filesD[i],header=TRUE)
    #delete rows that do not have complete cases
    temp<-na.omit(temp)
    
    #count all of the rows with complete cases
    tNobs<-nrow(temp)
    
    #enumerate the complete cases by index
    dat<-rbind(dat,data.frame(i,tNobs))
   
  }
    return(dat)
}

 

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows

Part 3 was a doozy. It’s similar to Part 1 in that we are basically aggregating data, but this time, instead of a data frame, we are aggregating the data inside a vector.

  1. read in the files
  2. remove the NAs from the set
  3. check to see if the number of complete cases is greater than the threshold
  4. find the correlation between the two pollutants

In Part 1 we used rbind (combine rows) to combine different data frames; cbind (combine columns) does the same thing column-wise. Since we are now building up a plain vector rather than a data frame, the code below uses c() (concatenate) to append each correlation instead.

corr<-function(directory,threshold=0){
#create list of file names
  filesD<-list.files(directory,full.names = TRUE)
  
  #create empty vector
  dat <- vector(mode = "numeric", length = 0)
  
  for(i in 1:length(filesD)){
  #read in file
    temp<- read.csv(filesD[i],header=TRUE)
    #delete NAs
    temp<-temp[complete.cases(temp),]
    #count the number of observations
    csum<-nrow(temp)
    #if the number of rows is greater than the threshold
    if(csum>threshold){
   #for that file you find the correlation between nitrate and sulfate
   #combine each correlation for each file in vector format using the concatenate function 
   #since this is not a data frame we cannot use rbind or cbind
      dat<-c(dat,cor(temp$nitrate,temp$sulfate))
    }
    
  }

  return(dat)
}

In retrospect this assignment is very useful. In most of the data science courses throughout this specialization you are going to be selecting, aggregating, and doing basic statistics with data.