Creating a civic tech app with the Chicago data portal

A lot happened in April. Chi Hack Night turned five! And the City of Chicago Data Portal got a major facelift.


This is a mini tutorial on how to start creating a civic tech app using the City of Chicago Data Portal.

How to get an API Key

Go to the City of Chicago Data Portal.

Go to the upper left-hand corner and click on the Sign In button.
Sign in with your username and password. If you do not have an account, just click Sign Up to create one.
In the upper right-hand corner, click on Edit Account Settings.
On the left-hand side you will see a menu. Go to App Tokens.
Fill out the form and click the big red Create button.


There you have it, your app token.


How to find API documentation on any dataset


Go to the dataset you want to create an app for. In this case, the Alternative Fuel Locations – Chicago dataset.

Click on the Export button; it's baby blue.
Click on the drop-down accordion that says SODA API.
Then click on API Docs and it will show you how to extract the data programmatically in jQuery, Python Pandas, SAS, Ruby, and other languages.
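Once you have your app token, a SODA request is just a URL with a few query parameters. As a rough sketch (the dataset ID abcd-1234 and the token MY_TOKEN below are made-up placeholders; the real dataset ID is shown on the dataset's API Docs page), Python can assemble the request URL like this:

```python
from urllib.parse import urlencode

def build_soda_url(domain, dataset_id, app_token, limit=10):
    """Build a SODA API request URL.

    $limit caps the number of rows returned, and $$app_token passes
    your app token; both are standard SODA query parameters.
    """
    params = urlencode({"$limit": limit, "$$app_token": app_token})
    return "https://{}/resource/{}.json?{}".format(domain, dataset_id, params)

# abcd-1234 and MY_TOKEN are placeholders, not a real dataset or token
url = build_soda_url("data.cityofchicago.org", "abcd-1234", "MY_TOKEN", limit=5)
print(url)
```

Requesting that URL (for example with the requests library) returns the rows as JSON.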

21 Days of D3.js follow up

This is a follow-up to the post I did last year, 21 Days of D3.js.

Last year I wanted to learn D3.js. For those of you who do not know, D3.js stands for Data-Driven Documents. The D3.js library is used on many journalism sites, one of them being The New York Times.

Last year my goal was to create 21 different data visualizations within 21 days. My inspiration for this goal came from Jen Dewalt's 100 sites in 100 days. I had tried many times to learn how to use this library and just gave up. If you check my first blog post from three years ago, that was my first attempt at learning D3.js. In retrospect, I was not successful the first time around because I lacked a lot of background programming knowledge. When you are trying to learn something new, you have to take baby steps and celebrate your victories.

The learning curve for D3.js is steep.

Before attempting to learn D3 I would suggest spending a week or two learning basic JavaScript. I learned how to program in JavaScript on freeCodeCamp. As the name suggests, it's an online coding bootcamp and it's totally free. The bootcamp provides tutorials, algorithm problems, and portfolio projects, and it has a massive online community, so when you feel stuck you can ask questions and they will be answered quickly. If online MOOCs are not your thing, I would suggest reading Eloquent JavaScript, which is also free to read online. I personally like Jon Duckett's JavaScript and jQuery: Interactive Front-End Web Development.

I am currently on my seventh day trying to learn D3 and it's going a lot better than the first couple of tries. I borrowed Scott Murray's Interactive Data Visualization for the Web from the library; the book is also completely free on the O'Reilly website. There is also a free book on Leanpub called D3 Tips and Tricks by Malcolm Maclean. The next couple of posts will be ramblings of me trying to learn D3.js. If anyone has any good suggestions, don't hesitate to leave a comment.


My Django Story

On December 3, 2016, Django Girls hosted an Introduction to Django workshop in my city. Django Girls is a non-profit organization and a community that empowers and helps women organize free, one-day programming workshops by providing tools, resources, and support. In this workshop, we created our own content management system (CMS) similar to WordPress. Now let's not get Django confused with the movie.


Django is a web framework written entirely in Python. Check out the Django tutorial on their website; it's really cool, I promise. The Django Girls event was amazing. I saw a lot of familiar faces of people I've met through different Meetup groups around the city, but there were many more faces of people I had never encountered before. For the workshop, people were grouped into teams of about six, including a team leader. I was in a small group of three, which included a woman who is a QA tester and my team leader, who is a software developer. Going to the workshop inspired me to share my "Django Story."

How did your story with code start?

My coding story started in my senior year of high school. I took A.P. Computer Science and I hated it. I was convinced that it was not for me. Fast forward to college and there was no escaping it. I was getting my bachelor's in statistics and we did most of our analysis scripting in R or SAS. My minor, Informatics, also required me to take a basic programming course. What made programming less of a chore was when I helped a family member of mine set up a WordPress site for her company. I had no idea about web programming, so I went to Codecademy, plowed through the HTML/CSS courses in two days, and started customizing her WordPress site. I was thinking to myself, "Wow." If I could give a small business owner a voice on the web and increase the number of her clients in a matter of weeks, what else could I do with programming?

What did you do before becoming a programmer?

I was a professional student. Now I’m just a full blown student of life.

What do you love the most about coding?

What I love about coding is that it has the same power as reading, writing, drawing, or any other craft: the power of creating something and sharing your ideas with people. When you read you are creating ideas, when you write you are sharing your ideas and your voice, and when you draw you are saying, this is my personal style. When you code it's the same magic.

Why Django?

I love working with Python. It's an easy language to pick up and you can do anything with it. I'm excited about the fact that I can create a web application around the ETL scripts I've written for work.

What cool projects are you working on at the moment/planning on working on near future?

Right now I’m just working on some fun small side projects. I planning on creating a WordPress Plugin similar to Hello,Dolly but with Dj Khaled quotes. I also what to do a data visualization of gentrification throughout different Chicago neighborhoods.

What are you the most proud of?

My resilience. I've been rejected so many times by different scholarships, jobs, and programs. You tend to second-guess yourself. But if I learn something new, I consider that a win.

What are you curious about?

My curiosity about things changes as quickly as the wind blows. Right now I am just focusing on getting a solid foundation. I want to learn more about object-oriented programming.

What do you like doing in your free time? What’s your hobby?

If I am not at the library checking out comic books or hula hooping, I am at different Meetup groups across the city talking to other nerds like me.

Do you have any advice/tips for programming beginners?

Programming is for everyone. It does not matter how old or young you are. Work on projects that meet your interests.



Web Scraping Part 2 (Deux)

Like I said in my previous post, sometimes you find yourself in a situation where you need to extract information from a website that has no API and/or an HTML structure that is completely bonkers.

While you can recursively dig through a website with Beautiful Soup to find the information you are looking for, it is much easier to do with the Python package lxml, which can transform a web page into an XML tree.


XML stands for eXtensible Markup Language, and it is related to HTML (HyperText Markup Language). HTML is used to mark up web pages, while XML is used to mark up data. XML makes it easier to send data between different devices. Many library databases use XML, and I also know that Android devices use XML to parse and display data. Instead of using pre-defined tags such as p and div, XML lets you define your own tags to mark up your own data.
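For example, here is a sketch of a catalog entry marked up with tags I just made up (none of these tag names are standard; with XML you choose them yourself):

```xml
<catalog>
  <book id="bk101">
    <title>Interactive Data Visualization for the Web</title>
    <author>Scott Murray</author>
    <format>paperback</format>
  </book>
</catalog>
```

Any program that understands XML can read this structure, even though the tags are mine.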

XML has companion tools that help you manipulate XML documents so they can be easily read by other machines or humans. XSLT lets you transform your XML into different document formats, such as PDF and HTML, by using XPath. XPath is used to traverse an XML document, and it is the focus of this post.


XPath is a tool used to traverse an XML document. XPath uses path expressions to select nodes from the document tree.

Path Expression   Description
/                 Select the root node of the document
/bobsTag          Select the root node, but only if it is named "bobsTag"
//tagName         Find all "tagName" elements anywhere in the document
text()            Select the text content of the current node
@name             Select the "name" attribute of the current node
..                Select the parent of the current node
[1]               A predicate placed at the end of an XPath expression to pick out one particular node; the number can be any position

For example, suppose we want to select the list item that says "Whipped cream", the second li element in the document. The XPath expression would be //li[2].



If we had written this in Beautiful Soup, the code would look something like this:

itemLinks = soup.find_all("li")
print(itemLinks[1])
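The XPath expressions from the table can be tried out without leaving the standard library: xml.etree.ElementTree supports a useful subset of XPath (lxml supports the full language). The list below is a stand-in I made up for the missing example markup:

```python
import xml.etree.ElementTree as ET

# A hypothetical stand-in for the example list
doc = """
<body>
  <ul>
    <li>Hot cocoa</li>
    <li>Whipped cream</li>
  </ul>
</body>
"""
tree = ET.fromstring(doc)

# //li -> every <li> element anywhere in the document
items = tree.findall(".//li")

# //li[2] -> the second <li>; XPath positions start at 1, not 0
second = tree.find(".//li[2]")
print(second.text)  # Whipped cream
```

Note the off-by-one difference from the Beautiful Soup version: XPath predicates count from 1, while Python lists index from 0.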

A real-life example is when I had to scrape the Hal Leonard website. I needed to catalog some sheet music books based on instrumentation. Sometimes I would receive ten items to put in a catalog, other times it would be fifty items. I wanted to Automate the Boring Stuff. I inspected the website with Google Chrome's dev tools and I found this:


The website's layout was in a table. Not only one table, but multiple tables nested within each other. Try as I might with Beautiful Soup jiu-jitsu, I could not figure out a way to extract the data that I needed. I searched through Stack Overflow, which led me to The Hitchhiker's Guide to Python web scraping tutorial. It was my first introduction to Python's lxml library. I emulated what the tutorial did, but it left pretty wide gaps, such as what XML is and what XPath is.

That's when Python's lxml package and my previous tutorial on URL queries come in.


The url for Hamilton sheet music is

Notice that the itemid number is 155921. If I want to see the instrumentation, the URL will be

This is the same for any sheet of music.



Using the Chrome dev tools, I inspect the text that I want to extract. Then I use the Python lxml library to create an XPath expression:

from lxml import html
import requests

# the product page URL goes here
page = requests.get('')
tree = html.fromstring(page.content)
instruments = tree.xpath('//td[@class="productContent"]/text()')
print('Instruments:', instruments)


The same script using the Beautiful Soup library:

import requests
from bs4 import BeautifulSoup

def getInstrumentation(halLeonardUrl):
    r = requests.get(halLeonardUrl)
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    instruments = ""
    # the instrumentation lives in table cells with class "productContent"
    tdTags = soup.find_all("td", {"class": "productContent"})
    for tag in tdTags:
        ulTags = tag.find_all("ul")
        for ul in ulTags:
            instruments = ul.text.strip()
            instruments = instruments.replace("\n", ",")
    return instruments









Learning About URL String queries

At my current gig I am doing a lot of data munging. One common task in data munging is web scraping. I have an older article on my blog about web scraping with Python's Beautiful Soup library or Microsoft Excel. In this post I am not going to talk about the HTML of a web page, but about the URL, also known as the web address. There will be times when you need to scrape records from a web app with no API, and understanding URL query strings can help out in the long run.

What is a Query String?

A query string is the part of the web address that contains data parameters; when the URL is requested, the web application uses those parameters to perform a search or look up a record.

Breakdown of the URL:

First Part the Protocol:


Every URL starts with a protocol, which is a set of rules for how a computer should talk to this web address. There are two kinds of protocols that I run into: FTP and HTTP. FTP stands for File Transfer Protocol and HTTP stands for Hypertext Transfer Protocol. HTTP displays HTML pages, while FTP transfers computer files.

Second Part the Domain name:

This part of the URL commonly identifies which company, agency, or organization is directly responsible for the information:

* .com which identifies company or commercial sites
* .org for non-profit organization sites
* .edu for educational sites
* .gov for government sites
* .net for Internet service providers or other types of networks

Third Part is the Query String:


The query string usually starts after a question mark in the URL. These are the parameters needed to invoke the web application to perform a particular task, in this case streaming videos. Just looking at this, I would assume v stands for video and 63rt_-aLPr0 is the unique identification number for this particular YouTube video.
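Python's standard urllib.parse module can take a URL like this apart for you. As a sketch (the full URL below is my reconstruction of a YouTube watch link around the video ID mentioned above):

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.youtube.com/watch?v=63rt_-aLPr0"
parts = urlparse(url)
print(parts.scheme)  # the protocol: https
print(parts.netloc)  # the domain name: www.youtube.com
print(parts.query)   # the query string: v=63rt_-aLPr0

# parse_qs turns the query string into a dict of parameter -> list of values
params = parse_qs(parts.query)
print(params["v"])   # ['63rt_-aLPr0']
```

Extra parameters joined with ampersands simply show up as more keys in the dict.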

Another example is the Hal Leonard website. Note: I do not work at Hal Leonard, and I am taking a purely educated guess at their site's API.

In this first example, the highlighted part is the query string. The variable itemid identifies a specific music item in Hal Leonard's database. If you click on the link, you view the product. Notice the ampersand symbol as well; it allows you to add other parameters to your query string to narrow down what you are searching for.

This second URL shows you the names of the songs that are in this particular piece of sheet music.

  • itemid is the unique identifier of this particular music item.
  • viewtype is the type of information that you want to view. In this case it is the list of songs.

This example is similar to the previous one, but notice that the viewtype parameter is set to instruments. This URL shows the instrumentation for this piece of sheet music.

Sometimes reverse engineering is about making guesses and breaking stuff within your code.

More about query strings:

I would suggest looking at Greg Reda's Web Scraping 201 post; he explains in depth how to find APIs.

What I did this summer

Well, summer came and then it went. Labor Day was a week ago and school has officially started for everyone. This blog has been dead for a while. Like so many side bloggers before me, I have neglected my main source of self-advertisement. But there was an important reason why my blog has been dead for so long.

I’ve done nothing this summer!!!! Since Pokemon Go came out in July I’ve just been trying to hunt for Squirtles.




Well, this is not entirely true. I spent two weeks back in the big city doing a bootcamp prep course with Full Stack Academy in June. The course is supposed to prepare you for the interview coding questions that many coding bootcamps conduct before selecting their candidates. I registered because it was affordable and I wanted to know what the big deal was about coding bootcamps. I was always skeptical about them, because you can take an online MOOC for FREE.99 and learn the core concepts of web development.

The bootcamp prep went over the basic JavaScript concepts that you see in coding interviews, such as recursion, proof by contradiction, and data structures. Overall I enjoyed the bootcamp. The instructors were patient and they explained things in a simple manner. It was nice being around other coders. We got to help each other out and solidify the concepts.

Full Stack Academy has a lot of great reviews online, and people who go through the program have landed amazing programming jobs. But it seems like they are mostly after people who have an engineering background and need to transition into programming. So if you are an absolute beginner to programming, I would not suggest applying to this program.

After the bootcamp, the motivation to study those concepts wore off and I mostly focused on my job. I applied and did not get into Full Stack Academy. I must admit my ego was hurt, but it was for the best.


Like I said before, I am a bit of a cheapskate and I cannot justify spending $10,000 on a three-month bootcamp. I don't believe my master's degree even cost that much. Going into my second year of not being a student, I've realized I enjoy working with data. I would like to learn more about GIS and data visualization in general.




R Programming Assignment 1

I'm currently going through the Johns Hopkins Data Science specialization on Coursera. So far the courses are okay. They are pretty tough, so if you are a complete beginner you can complement them with DataCamp courses if you need more practice. The only annoying part about this class is that the lectures do not mention some of the functions you will need to complete the assignments. Luckily there are TA hints to supplement this lack of information.

I am a researcher. When I run into problems, I go online and try to ask the right question to solve the problem in front of me, or I talk to people who have expert knowledge of the subject matter, or I just find the answer in a book at my local library. But being a programmer or data scientist involves breaking down problems. The toughest part for me is being comfortable with problem solving.

Background about the data

The data comes from air pollution monitors; each monitor records particulate matter readings over time.

The first thing I do whenever I have data is to explore the file in Excel in order to get a better understanding of its structure. Each file in the specdata folder contains data for one monitor:

  • Date: the date of the observation
  • sulfate: the level of sulfate particle matter in the air on that date
  • nitrate: the level of nitrate particle matter in the air on that date

For the pollutantmean function, the prompt states:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

At this point I don’t really know what I’m doing. It’s just me staring at the computer for an hour…


With every coding problem I try to think about things on a higher level. If I have an understanding of the problem then I can muddle my way through the syntax of the code.

What are they asking for here?

They just want the mean of the pollutant for the given monitor IDs.

So the higher-level understanding is done; next we break the problem into steps:

  1. read in all the data files
  2. merge the data files into one data frame
  3. ignore the NAs
  4. subset the data frame by pollutant
  5. calculate the mean

Let's find out what the mean function requires.



mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)

x: an R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.

trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds.

According to the documentation the mean can only take one R object. But there is a problem.



The information for each monitor is in an individual CSV file, and the monitor ID is the same as the CSV file's name. In order to find the mean of pollutants across monitors with different IDs, we will have to put each monitor's information into one data frame.


At this point I do not know how to walk through a directory of files. I know that in Python there is os.walk; in R there is the list.files function.

In order to perform the same actions on multiple files, I looped through the list:

pollutantmean <- function(directory, pollutant, id = 1:332){
  #create a list of files
  filesD <- list.files(directory, full.names = TRUE)
  #create an empty data frame
  dat <- data.frame()
  #loop through the list of files for each requested id
  for(i in id){
    #read in the file
    temp <- read.csv(filesD[i], header = TRUE)
    #add the file's rows to the main data frame
    dat <- rbind(dat, temp)
  }
  #find the mean of the pollutant, making sure to remove NA values
  return(mean(dat[, pollutant], na.rm = TRUE))
}


Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

This question was easier than parts 1 and 3. The question is simple to understand: they just want a report that displays the number of complete cases in each data file. It reminds me of the nobs function in SAS, another statistical package that is used in many businesses, unlike R, which seems to be used primarily in academia and research.


  1. read in the files
  2. remove the NAs from the set
  3. count the number of rows
  4. create a new data frame with two columns that contain the monitor's ID number and the number of observations
complete <- function(directory, id = 1:332){
  #create a list of files
  filesD <- list.files(directory, full.names = TRUE)
  #create an empty data frame
  dat <- data.frame()
  for(i in id){
    #read in the file
    temp <- read.csv(filesD[i], header = TRUE)
    #keep only the rows that have complete cases
    cc <- temp[complete.cases(temp), ]
    #count the complete rows and record them alongside the monitor id
    dat <- rbind(dat, data.frame(id = i, nobs = nrow(cc)))
  }
  return(dat)
}


Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows

Part 3 was a doozy. It's similar to part 1 in that we are basically aggregating data, but this time, instead of a data frame, we are aggregating data inside a vector.

  1. read in the files
  2. remove the NAs from the set
  3. check to see if the number of complete cases is greater than the threshold
  4. find the correlation between sulfate and nitrate

In order to combine different data sets we used rbind (bind rows); to combine columns we use cbind (bind columns). To grow a plain vector we use the concatenate function c().

corr <- function(directory, threshold = 0){
  #create list of file names
  filesD <- list.files(directory, full.names = TRUE)
  #create an empty numeric vector
  dat <- vector(mode = "numeric", length = 0)
  for(i in 1:length(filesD)){
    #read in file
    temp <- read.csv(filesD[i], header = TRUE)
    #delete NAs
    cc <- temp[complete.cases(temp), ]
    #if the number of complete rows is greater than the threshold,
    #find the correlation between nitrate and sulfate for that file
    #and append it with the concatenate function c(); since this is
    #not a data frame we cannot use rbind or cbind
    if(nrow(cc) > threshold){
      dat <- c(dat, cor(cc$sulfate, cc$nitrate))
    }
  }
  return(dat)
}


In retrospect, this assignment is very useful. In most of the data science courses throughout this specialization, you are going to be selecting, aggregating, and doing basic statistics with data.

Subsetting Cookbook in R

So, I've just finished the R Programming course that is part of Coursera's Johns Hopkins Data Science specialization. And I must say that it did some judo Mortal Kombat moves on my mind. This course is not beginner friendly, but I've learned a lot, and I think it's safe to say that I am becoming a master at subsetting and filtering data in R. In retrospect, if you are planning to take this specialization, you should do the Getting and Cleaning Data course before you start the R Programming course.

A collection of notes on how to select different rows and columns within R.

R has four main data structures to store and manipulate data: vectors, matrices, data frames, and lists. So far, in my on-again, off-again relationship with R, I have mostly worked with data frames. I will be using the airquality dataset that comes preinstalled in R.



names(airquality)
 [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"

head(airquality)
 Ozone Solar.R Wind Temp Month Day
 1 41 190 7.4 67 5 1
 2 36 118 8.0 72 5 2
 3 12 149 12.6 74 5 3
 4 18 313 11.5 62 5 4
 5 NA NA 14.3 56 5 5
 6 28 NA 14.9 66 5 6

Subsetting data by index

While subsetting, the placement of the comma is important.

Extracting a specific observation (row): make sure that you include a comma to the right of the row index. DON'T FORGET THE COMMA.


airquality[1,]
 Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1

# First two rows and all columns
airquality[1:2,]
 Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2


Extracting a specific variable (column): make sure that you include a comma to the left of the column index.

# First column and all rows
airquality[,1]

 [1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6 30 11 1 11 4 32
 [25] NA NA NA 23 45 115 37 NA NA NA NA NA NA 29 NA 71 39 NA NA 23 NA NA 21 37
 [49] 20 12 13 NA NA NA NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
 [73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50 64 59 39 9 16 78
 [97] 35 66 122 89 110 NA NA 44 28 65 NA 22 59 23 31 44 21 9 NA 45 168 73 NA 76
[121] 118 84 85 96 78 73 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20

# First two columns and all rows

 Ozone Solar.R
1 41 190
2 36 118
3 12 149
4 18 313
6 28 NA
7 23 299

You can also select a column from a data frame by using the variable name with a dollar sign:

airquality$Ozone
 [1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6 30 11 1 11 4 32
 [25] NA NA NA 23 45 115 37 NA NA NA NA NA NA 29 NA 71 39 NA NA 23 NA NA 21 37
 [49] 20 12 13 NA NA NA NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
 [73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50 64 59 39 9 16 78
 [97] 35 66 122 89 110 NA NA 44 28 65 NA 22 59 23 31 44 21 9 NA 45 168 73 NA 76
[121] 118 84 85 96 78 73 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20

These are all the observations from the Ozone column

Extracting multiple columns from a data frame:

df[,c("A","B","E")] (source: Stack Overflow)

airquality[,c("Ozone","Temp")]

 Ozone Temp
1 41 67
2 36 72
3 12 74
4 18 62
5 NA 56
6 28 66

You can also filter data when you are making a selection by using basic logic statements

Basic logic statements

Operator  Description
==        equal
!=        not equal
>         greater than
<         less than
>=        greater than or equal
<=        less than or equal
&         and
|         or
%in%      returns a vector of the positions of (first) matches of its first argument in its second
Logical Function  Description
which.min         index of the minimum value
which.max         index of the maximum value

Extract the subset of rows of the data frame where Ozone values are above 31 and Temp values are above 90.

myset <- airquality[airquality$Ozone>31 & airquality$Temp>90,]

Extract the subset of rows of the data frame where Month values equal 5, 7, or 8:

airquality[airquality$Month %in% c(5,7,8),]

What is the mean of "Temp" when "Month" is equal to 6?

june <- airquality[airquality$Month==6,]
mean(june$Temp, na.rm=TRUE)

What was the maximum ozone value in the month of May (i.e. Month is equal to 5)?

may <- airquality[airquality$Month==5,]
max(may$Ozone, na.rm=TRUE)

More Resources

Practice with subsetting with R

Subsetting by string

Was Library School Worth It?

So I've been out of school for about 9 months now and it kind of sucks. Just a little bit. Truth be told, I have a love/hate relationship with school in general. I miss the safety that you get with consistency, but school can only do so much to prepare you for the world. I remember the day I told my parents I was going to graduate school for Library and Information Science. And it went a little like this….



When I go to job interviews or explain to someone that I want to be in the field of data analytics, I say that Library and Information Science is a good fit. It goes like this…

So I am going to tell you three reasons why going to library school nurtured my data science skills:

  1. Become a better researcher
  2. Public Service Skills
  3. Communication Skills

Become a better researcher

Day one of library school you learn that Google ain't all that. Google is a powerful search engine, but if you do not refine your search into a narrow question, then your results come up flat. When I was working at the university library, I had to help a few undergrads with their research by using what we call a reference interview: helping students ask the right questions. I tend to use it on myself a lot these days to help define projects for my data science portfolio.

Public Service Skills

As a librarian you are a public servant. You serve the people, which includes everyone, no matter what creed, race, or gender a person happens to be. I took an Intro to Web Development class during my graduate career. It was completely different from the web development courses from Codecademy, Team Treehouse, and FreeCodeCamp in that it was about accessibility. Never once when I taught myself web development had I thought about how a blind or deaf person uses the web.

Library school also made me more aware of what's going on in the world in general. Most of the people who attend GSLIS come from history and the social sciences, and it got me out of my comfort zone. Most of the time I would look at tech blogs like CSS-Tricks, Stack Overflow, A List Apart, and Hacker News. My friends influenced me to listen to NPR, CNN, and The Read, and to just think about the world around me. How can I use technology to make a positive impact on society?

Communication Skills

In undergrad it was all about homework…



In library school it was mostly about final projects instead of bombarding you with homework assignments. The only person policing your education is you. Every class involved a presentation. To be able to not only convey your ideas to an entire room of people but also keep them interested is an art. I also had to write argumentative essays. For example, in one of my classes I had to develop a strategic plan to present to stakeholders for building an inclusive digital community. Or I had to write documentation on technologies, for example, how to do diagnostics on Chromebooks that cannot connect to the wifi network.

I know that my alma mater is going forth to focus more on the hard science side of information science. But the soft skills that I learned from my library courses have given me an edge as well.


Coursera Exploring Data Assignment 1

Lately in my data science journey I have been going over DataCamp R exercises. But the thing about practice is that you need to do real-world projects that you are interested in. People learn in different ways: some people are auditory learners, some are visual, and some learn by doing. I tend to learn more by doing.

Your efforts on self-education should be focused on trying to get to the point where you can actually be involved and do something as early as possible.I feel that the best way to learn something is to jump right in and start doing, before you even know what you’re doing. If you can gain enough knowledge about a subject to start playing around, you can tap into the powerful creative and curious nature of your own mind. We tend to absorb more information and develop more meaningful questions about a thing when we’re actively playing.

John Sonmez, author of Soft Skills: The Software Developer's Life Manual

Inspired by DJ Patil's article Hack Your Summer, I am currently going through the Coursera Johns Hopkins data science curriculum to sharpen my skills and get some sort of portfolio going. One of the courses in the curriculum is called Exploratory Data Analysis (EDA). EDA is an approach to analyzing data sets to summarize their main characteristics, usually through data visualization.

For the first course assignment we had to reproduce some graphs using an energy data set provided by the instructor. In this post I am going to go over the first graph.



So I had a lot of trouble reading and cleaning the data. The data set is kind of big, so the instructor suggested estimating how much memory the dataset will require before reading it into R, to make sure your computer has enough memory. I used a formula posted in an article on the Simply Statistics blog, R Workshop: Reading in Large Data Frames:

# rows * # columns * 8 bytes / 2^20

The dataset has 2,075,259 rows and 9 columns

(2,075,259 *9*8)/2^20

142.4967 Megabytes

Since there are about 1,000 megabytes in a gigabyte and my computer has about 6 gigabytes of memory, it has more than enough to load the dataset.

Whenever I attempt to do any type of coding, I just start with pseudocode and then stumble my way through the syntax. *cough* Stack Overflow *cough* When you work with different languages it's hard to remember syntax, but the core concepts are still there.

So in order to create any type of graph in the Base Plotting System, you do the following steps:

  1. Read in the data
  2. Make sure your variables are in the correct type
  3. Clean the data
  4. Use a plotting function with a lot of parameters to spit out the plot


Problems I had

The graphs, in retrospect, were simple to make. The trouble I had was working with dates. I received a bachelor's in statistics, and I never got much of an opportunity to work with real data.

R has three main functions to read files: read.table(), read.csv(), and read.delim(). I am going to use read.table() since it has the most parameters to specify how you want your file to be read in.

#read data

cls <- c(Voltage="numeric", Global_active_power="numeric", Global_reactive_power="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")

data <- read.table("household_power_consumption.txt", header=TRUE, sep=";",dec=".", stringsAsFactors=FALSE, na.strings = "?",colClasses=cls)

energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]

#make sure data is interpreted correctly: convert the Date column
energyData$Date <- as.Date(energyData$Date, format="%d/%m/%Y")

#deleted all the rows that had NA values
energyData <-na.omit(energyData)

#plot data
hist(energyData$Global_active_power, col="red",xlab="Global Active Power (kilowatts)",ylab="Frequency",main="Global Active Power")


When I first used the read.table function, it interpreted all of my columns as characters. This was troublesome since most of the data in the set should have numeric classes. In order to fix this problem I had to use the colClasses parameter. colClasses converts specified variables into atomic vector classes (logical, integer, numeric, complex, character, raw). Take note: dates cannot be converted using the colClasses parameter. You can specify any number of variables that you want.

cls <- c(Voltage="numeric", Global_active_power="numeric", Global_reactive_power="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")

Next we need to subset the data. The %in% operator returns the positions of the dates specified in our vector:

energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]


The last step in the clean-up process: the dates need to be converted from character to Date class. We do this by using the as.Date() function.

Finally, we make the histogram by using the hist() function.

hist(energyData$Global_active_power, col="red",xlab="Global Active Power (kilowatts)",ylab="Frequency",main="Global Active Power")