Subsetting Cookbook in R

So, I’ve just finished up the R Programming course that is a part of Coursera’s Johns Hopkins Data Science specialization. And I must say that it did some judo mortal kombat moves on my mind. This course is not beginner friendly, but I’ve learned a lot and I think it’s safe to say that I am becoming a master at subsetting and filtering data in R. In retrospect, if you are planning to take this specialization you should do the Getting and Cleaning Data course before you start the R Programming course.

A collection of notes on how to select different rows and columns within R.

R has four main data structures to store and manipulate data: vectors, matrices, data frames, and lists. So far in my on-again, off-again relationship with R I have mostly worked with data frames. I will be using the airquality dataset that comes preinstalled in R.

Selecting

data("airquality")

names(airquality)
 [1] "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

 Subsetting data by index.

When subsetting, the placement of the comma is important.

To extract a specific observation (row), put the row index before the comma, so that the comma sits to the right of the index. DON'T FORGET THE COMMA.

 airquality[1,]

 Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1

# First two rows and all columns

airquality[1:2,]
 Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2

 

To extract a specific variable (column), put the column index after the comma, so that the comma sits to the left of the index.

# First column and all rows

airquality[,1]
 [1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6 30 11 1 11 4 32
 [25] NA NA NA 23 45 115 37 NA NA NA NA NA NA 29 NA 71 39 NA NA 23 NA NA 21 37
 [49] 20 12 13 NA NA NA NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
 [73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50 64 59 39 9 16 78
 [97] 35 66 122 89 110 NA NA 44 28 65 NA 22 59 23 31 44 21 9 NA 45 168 73 NA 76
[121] 118 84 85 96 78 73 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
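
One thing worth knowing about the single-column case: as the output above shows, airquality[,1] comes back as a plain vector rather than a data frame. A minimal sketch, assuming you want to keep the data frame shape, is to pass drop = FALSE (the names col_vec and col_df are made up for this example):

```r
data("airquality")

# Default behavior: a single column is simplified to a vector
col_vec <- airquality[, 1]

# drop = FALSE keeps the result as a one-column data frame
col_df <- airquality[, 1, drop = FALSE]

is.data.frame(col_vec)
is.data.frame(col_df)
dim(col_df)
```

This matters mostly when you pass the result to a function that expects a data frame.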

#First two columns and all rows

airquality[,1:2]
 Ozone Solar.R
1 41 190
2 36 118
3 12 149
4 18 313
5 NA NA
6 28 NA
7 23 299

You can also select a column from a data frame by using the variable name with a dollar sign.

 airquality$Ozone

 [1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6 30 11 1 11 4 32
 [25] NA NA NA 23 45 115 37 NA NA NA NA NA NA 29 NA 71 39 NA NA 23 NA NA 21 37
 [49] 20 12 13 NA NA NA NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
 [73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50 64 59 39 9 16 78
 [97] 35 66 122 89 110 NA NA 44 28 65 NA 22 59 23 31 44 21 9 NA 45 168 73 NA 76
[121] 118 84 85 96 78 73 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20

These are all the observations from the Ozone column

Extracting multiple columns from a data frame

df[,c("A","B","E")] (source: Stack Overflow)

head(airquality[,c("Ozone","Temp")])
 Ozone Temp
1 41 67
2 36 72
3 12 74
4 18 62
5 NA 56
6 28 66

You can also filter data when you are making a selection by using basic logic statements

Basic Logic Statements

Operator  Description
==        equal
!=        not equal
>         greater than
<         less than
>=        greater than or equal
<=        less than or equal
!         NOT
&         AND
|         OR
%in%      TRUE for each element of its first argument that has a match in its second

Logical Function  Description
which.min         index of the minimum value
which.max         index of the maximum value

Extract the subset of rows of the data frame where Ozone values are above 31 and Temp values are above 90.

myset <- airquality[airquality$Ozone > 31 & airquality$Temp > 90, ]
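
A caveat with logical filters on airquality: the Ozone column has missing values, and a comparison against NA can give NA, which shows up in the result as rows filled with NA. A minimal sketch of one way around it is to wrap the condition in which(), which keeps only the positions that are TRUE (the names cond, with_na and clean are made up for this example):

```r
data("airquality")

# Logical condition; it is NA wherever Ozone is missing and Temp > 90
cond <- airquality$Ozone > 31 & airquality$Temp > 90

# Indexing with the raw condition keeps an all-NA row for every NA in cond
with_na <- airquality[cond, ]

# which() returns only the positions where cond is TRUE, so NA rows are dropped
clean <- airquality[which(cond), ]

nrow(with_na)
nrow(clean)
```

The subset() function is another option, since it treats NA conditions as FALSE.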

Extract the subset of rows of the data frame where Month values equal 5, 7, or 8.

airquality[airquality$Month %in% c(5,7,8),]

What is the mean of "Temp" when "Month" is equal to 6?

june <- airquality[airquality$Month==6,]
mean(june$Temp, na.rm=TRUE)

What was the maximum ozone value in the month of May (i.e. Month is equal to 5)?

may <- airquality[airquality$Month==5,]
may[which.max(may$Ozone),]
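
The same two questions can also be answered without creating intermediate data frames, by filtering the column directly. This is just a sketch of an alternative; the names june_mean and may_max are made up for this example:

```r
data("airquality")

# Mean of Temp for the rows where Month is 6
june_mean <- mean(airquality$Temp[airquality$Month == 6], na.rm = TRUE)

# Largest Ozone value for the rows where Month is 5, ignoring missing values
may_max <- max(airquality$Ozone[airquality$Month == 5], na.rm = TRUE)

june_mean  # 79.1
may_max    # 115
```

The which.max() version above returns the whole row, which is handy when you also want to know the day the maximum occurred.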


More Resources

Practice with subsetting with R

Subsetting by string


Was Library School Worth It?

So I’ve been out of school for about 9 months now and it kind of sucks. Just a little bit. Truth be told, I have a love/hate relationship with school in general. I miss the safety that comes with consistency, but school can only do so much to prepare you for the world. I remember the day I told my parents I was going to graduate school for Library and Information Science. It went a little like this…

[GIF]

 

When I go to job interviews or explain to someone that I want to be in the field of data analytics and that Library and Information Science is a good fit, it goes like this…

[GIF]

So I am going to tell you three reasons why going to library school nurtured my data science skills:

  1. Become a better researcher
  2. Public Service Skills
  3. Communication Skills

Become a better researcher

On day one of library school you learn that Google ain’t all that. Google is a powerful search engine, but if you do not narrow your search down to a well-defined question, your search results fall flat. When I was working at the university library I helped a few undergrads with their research by using what we call a reference interview: helping students ask the right questions. I tend to use it on myself a lot these days to help define projects for my data science portfolio.

Public Service Skills

As a librarian you are a public servant. You serve the people, which includes everyone, no matter what creed, race, or gender the person happens to be. I took an Intro to Web Development class during my graduate career. It was completely different from the web development courses from Codecademy, Team Treehouse and FreeCodeCamp in that it was about accessibility. Never once while teaching myself web development had I thought about how a blind or deaf person uses the web.

Library school just made me more aware of what’s going on in the world in general. Most of the people who attend GSLIS come from history and social science. It got me out of my comfort zone. Most of the time I would look at tech blogs like CSS-Tricks, Stack Overflow, A List Apart and Hacker News. My friends influenced me to listen to NPR, CNN, The Read and just think about the world around me. How can I use technology to make a positive impact on society?

Communication Skills

In undergrad it was all about homework…

[GIF]

 

In library school it was mostly about final projects instead of bombarding you with homework assignments. The only person policing your education is you. Every class involved presentations. To be able to not only convey your ideas to an entire room of people but to keep them interested as well is an art. I also had to write argumentative essays. For example, in one of my classes I had to develop a strategic plan to present to stakeholders for building an inclusive digital community. I also had to write documentation on technologies, for example how to go about doing diagnostics on Chromebooks that cannot connect to the wifi network.

I know that my alma mater is going to focus more on the hard science of Information Science going forward. But the soft skills that I learned from my library courses have given me an edge as well.

 

Coursera Exploring Data Assignment 1

Lately in my data science journey I have been going over DataCamp R exercises. But the thing about practice is that you need to do real-world projects that you are interested in. People learn in different ways. Some people are auditory learners, some are visual, and some learn by doing. I tend to learn more by doing.

Your efforts on self-education should be focused on trying to get to the point where you can actually be involved and do something as early as possible. I feel that the best way to learn something is to jump right in and start doing, before you even know what you’re doing. If you can gain enough knowledge about a subject to start playing around, you can tap into the powerful creative and curious nature of your own mind. We tend to absorb more information and develop more meaningful questions about a thing when we’re actively playing.

John Sonmez, author of Soft Skills: The Software Developer’s Life Manual

Inspired by DJ Patil’s article Hack Your Summer, I am currently going through the Coursera Johns Hopkins data science curriculum in R to sharpen my skills and get some sort of a portfolio going. One of the courses in the curriculum is called Exploratory Data Analysis (EDA). EDA is an approach to analyzing data sets to summarize their main characteristics, usually through data visualization.

For the first course assignment we had to reproduce some graphs using an energy data set provided by the instructor. In this post I am going to go over the first graph.

plot1

 

So I had a lot of trouble reading and cleaning the data. The data set is kind of big. The instructor suggested estimating how much memory the dataset will require before reading it into R, to make sure your computer has enough memory. I used a formula from an article on the Simply Statistics blog, R Workshop: Reading in Large Data Frames.

# rows * # columns * 8 bytes / 2^20

The dataset has 2,075,259 rows and 9 columns

(2075259 * 9 * 8) / 2^20

142.4967 megabytes

Since there are roughly 1,000 megabytes in a gigabyte and my computer has about 6 gigabytes of memory, it has more than enough memory to load the dataset.
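
The arithmetic above can be wrapped in a small helper so it is easy to reuse for other files. This is only a rough sketch: the function name estimate_mb is made up, and it assumes every value takes 8 bytes (true for numeric columns, only an approximation for character columns).

```r
# Rough size estimate in megabytes: rows * columns * bytes per value / 2^20
# (assumes 8 bytes per value, i.e. every column is stored as numeric)
estimate_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 2^20
}

size_mb <- estimate_mb(2075259, 9)
round(size_mb, 4)  # 142.4967
```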

Whenever I attempt to do any type of coding I just start with pseudo code and then stumble my way through the syntax. *cough* Stack Overflow *cough* When you work with different languages it’s hard to remember syntax but the core concepts are still there.

So in order to create any type of graph in the Base Plotting System you do the following steps.

  1. Read in the data
  2. Make sure your variables are in the correct type
  3. Clean the data
  4. Use a plotting function with a lot of parameters to spit out the plot

 

Describe the problems you had.

In retrospect, the graphs were simple to make. The trouble I had was working with dates. I received a bachelor’s degree in statistics, but I never got much of an opportunity to work with real data.

R has three main functions to read files: read.table(), read.csv() and read.delim(). I am going to use read.table() since it has the most parameters to specify how you want your file to be read in.


#read data

#column classes for the numeric variables; Date and Time are read as character
cls <- c(Global_active_power="numeric", Global_reactive_power="numeric", Voltage="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")

data <- read.table("household_power_consumption.txt", header=TRUE, sep=";", dec=".", stringsAsFactors=FALSE, na.strings="?", colClasses=cls)

#keep only the two days of interest
energyData <- data[data$Date %in% c("1/2/2007","2/2/2007"), ]

#make sure data is interpreted correctly (dates are stored as day/month/year)
energyData$Date <- as.Date(energyData$Date, format="%d/%m/%Y")

#delete all the rows that have NA values
energyData <- na.omit(energyData)

#plot data to a PNG file (the file name plot1.png is an assumption)
png("plot1.png")
hist(energyData$Global_active_power, col="red", xlab="Global Active Power (kilowatts)", ylab="Frequency", main="Global Active Power")

#close the graphics device so the file is written
dev.off()

 

When I first used the read.table function it interpreted all of my columns as characters. This was troublesome since most of the data in the set should be numeric. To fix this problem I had to use the colClasses parameter. colClasses tells read.table which class each specified column should be read as, for example one of the atomic vector classes (logical, integer, numeric, complex, character, raw). Take note that I could not convert the dates in this file using the colClasses parameter. You can specify as many variables as you want.

cls <- c(Global_active_power="numeric", Global_reactive_power="numeric", Voltage="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")
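
To see what a named colClasses vector does without loading the full file, here is a small sketch that reads a few lines of text instead. The sample rows and values are made up for illustration; only the column names, the ";" separator and the "?" missing-value marker follow the real file.

```r
# Made-up sample in the same layout as the real file (values are invented)
sample_text <- "Date;Time;Global_active_power;Voltage
1/2/2007;00:00:00;0.5;240.1
1/2/2007;00:01:00;?;239.8"

# Only the named columns are forced to numeric; the rest are guessed by R
cls_small <- c(Global_active_power = "numeric", Voltage = "numeric")

sample_df <- read.table(text = sample_text, header = TRUE, sep = ";",
                        na.strings = "?", stringsAsFactors = FALSE,
                        colClasses = cls_small)

# Check the class of each column
sapply(sample_df, class)
```

Date and Time stay character here, and the "?" entry becomes NA because of na.strings.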

Next we need to subset the data. The %in% operator returns a logical vector that is TRUE for every row whose Date matches one of the values in the specified vector.

energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]
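
A tiny sketch of what %in% returns, using a made-up vector of date strings (the names dates, wanted and keep are only for illustration). match() is shown next to it because it is the function that returns positions.

```r
# Made-up date strings in the same day/month/year style as the file
dates <- c("31/1/2007", "1/2/2007", "2/2/2007", "3/2/2007")
wanted <- c("1/2/2007", "2/2/2007")

# %in% gives one TRUE/FALSE per element of dates
keep <- dates %in% wanted
keep                  # FALSE TRUE TRUE FALSE

# match() gives the position in wanted, or NA when there is no match
match(dates, wanted)  # NA 1 2 NA

# Using the logical vector as an index keeps only the matching elements
dates[keep]
```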

 

The last step in the cleanup process is converting the dates from character to Date class. We do this with the as.Date() function, passing a format string because the dates in this file are stored as day/month/year.
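
A short sketch of the date conversion on the two date strings used in the subset above. The format string "%d/%m/%Y" reflects that this file stores dates as day/month/year; the variable names are made up for this example.

```r
raw_dates <- c("1/2/2007", "2/2/2007")

# Tell as.Date() the layout explicitly: day/month/four-digit year
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

class(parsed)   # "Date"
format(parsed)  # "2007-02-01" "2007-02-02"
```

Note that as.Date() returns a new vector, so the result has to be assigned back to the column for the data frame to change.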

Finally, we make a histogram by using the hist() function.

hist(energyData$Global_active_power, col="red",xlab="Global Active Power (kilowatts)",ylab="Frequency",main="Global Active Power")

 

 

Resources

Plotting Graphs with ggvis

Grammar of Graphics

In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language. (https://en.wikipedia.org/wiki/Grammar)

The Grammar of Graphics uses basically the same concept, but instead of building sentences that are the foundation of paragraphs, which lead on to works of literature, we are building graphs.

One grammar of graphics tool is ggvis, a data visualization package for R.

The grammar for ggvis is

graph =  data + coordinate system + properties + mark

[pre]

<data>  %>% 
  ggvis(~<x property>,~<y property>, 
        fill = ~<fill property>, size=~<size property>) %>% 
  layer_<marks>()

[/pre]

3 common charts are going to be shown in this tutorial

  • Bar Charts
  • Line Charts
  • Scatter Charts

Bar Charts

The bar chart is used when comparing the mean or percentages of 8 or more different groups.

[pre]

mtcars%>% ggvis(~ wt, ~mpg) %>% layer_bars()

[/pre]

mtcars_bar

Line Charts

Line charts are used to illustrate trends over time.

[pre]

mtcars%>% ggvis(~ wt, ~mpg) %>% layer_lines()


[/pre]

mtcars_lines.png

Scatter Plots

Scatter plots are used to depict how different objects settle around a mean based on 2 to 3 different dimensions. This allows for quick and easy comparisons between competing variables. Scatter plots show how much one variable is affected by another.

[pre]

mtcars%>% ggvis(~ wt, ~mpg) %>% layer_points()


[/pre]
mpg_points.png

First I exported data from the Basketball-Reference site. For this example I am going to use
Jimmy Butler's statistics from 2015-2015. I am just going to plot Butler's game score for each game.
This statistic was invented by John Hollinger to provide a rough measure of a player's 
performance in a given game.  The scale upon which the player's game score is based is 
the same as points scored.  If a player has a game score of 40, that is amazing, 
while a game score of 10 is average.(http://www.sportingcharts.com/dictionary/nba/game-score-statistic.aspx)

Install the ggvis package and load the library in order to use it.
[pre]
install.packages("ggvis")
library(ggvis)
[/pre]
Import the data using the read.csv function. Make sure you set the optional stringsAsFactors parameter to FALSE.

[pre]
butler<- read.csv("jimmy_butler.csv", stringsAsFactors=FALSE)
[/pre]

Explore the data. In this instance I am looking at the column that holds Jimmy Butler's game score.

[pre]
butler$GmSc 
[/pre]

Subset the observations
[pre]
butler2<- butler[1:65,]
[/pre]

Attach the data frame to the search path.
The attach() function in R can be used to make the columns of a data frame accessible with fewer keystrokes.
When I was coercing the Game Score data to numeric, I got an "invalid subscript type integer" error from ggvis.
After searching through the pages of Stack Overflow, I learned that the dplyr package doesn't like the usage of '$'. Try instead using '[' or the bare column name, e.g.:
[pre]
attach(butler2)
butler2 %>% ggvis(~G,~as.numeric(GmSc)) %>% layer_points()
# plot: butler_points
butler2 %>% ggvis(~G,~as.numeric(GmSc)) %>% layer_bars()
# plot: butler_bar
butler2 %>% ggvis(~G,~as.numeric(GmSc)) %>% layer_lines()
# plot: butler_line
[/pre]

Data Science Resources

I wanted to create a quick blog post which will be a redux of a blog post I did two years ago. This is for people who are like me, people who want to practice their data science skills but are too broke to shell out $16,000 for a data science bootcamp. Luckily some of these bootcamps post all of their resources on GitHub.

The Data Science Summer School (DS3) is a bootcamp hosted by Microsoft and geared towards undergraduates in the New York area. Microsoft also has dataset resources for practicing your machine learning algorithms.

CS 109, Harvard's Intro to Data Science: even though this is not a bootcamp, this class has the most comprehensive materials on this list. It has lecture notes, videos, and assignments. A Harvard education without the price or the accreditation.

Kevin Markham, founder of dataschool.io, teaches a General Assembly Data Science bootcamp, and each session is posted on his GitHub page.

Kevin Markham also has his own YouTube channel where he teaches scikit-learn.

Other resources that are not in bootcamp/ structured class format

The most important resource in this list is Hadley Wickham’s tidy data tutorial. As a statistics major in undergrad I was blessed/cursed with never having to deal with messy data until I went to grad school. It’s something that we all need to learn.

A very brief tutorial on plotting graphs with R's ggplot2 library or with Python's seaborn library

Finally to keep up to date with data science news check out this curated list of data science blogs

WordPress RoadMap for Noobs

Doing WordPress modifications can be a complete pain. This blog post is a quick roadmap to lay a foundation for WordPress development. For those of you who don’t know WordPress, it’s a content management system, similar to Drupal or Joomla.

First thing, learn HTML and CSS. If you are looking to do any kind of web development you are going to have to start with HTML/CSS. Good places to start are Free Code Camp, w3schools, and Khan Academy.

Second, learn the basics of PHP. I’m not saying you need to be an expert, but you need to know enough to build a simple website. Trust me, looking at WordPress code will be a lot easier if you have a solid foundation in the basics.

http://www.w3schools.com/php/

http://adambrown.info/b/widgets/easy-php-tutorial-for-wordpress-users/

The next thing you should know about is the WordPress Template Hierarchy. Why is it important to learn? The hierarchy is a diagram that shows the order in which WordPress looks for template files when rendering an individual page. If you want to customize an existing WordPress theme, it will help you decide which template file needs to be edited. Take a day or two to marinate on it. It looks complicated at first, but things will make sense.

http://wphierarchy.com/

https://developer.wordpress.org/themes/basics/template-hierarchy/

You should also download a plugin called “Show Current Template”. It shows you the name of the PHP file the theme is using for that page.

All roads lead to the Codex. The Codex is your best friend. The WordPress Codex is the online manual for WordPress. So if you want to write a function, plugin, or page template, you can find all the answers you seek in the WordPress Codex.

https://codex.wordpress.org/

If you need any additional help just ask someone.

http://stackoverflow.com/

http://wordpress.stackexchange.com/

https://wordpress.org/support/

 

 

 

Power of Habit

So it’s been three months into the New Year and I feel like I should have listened to my more practical friends and not made that resolution list. I am in a two-month blogging/coding slump along with a three-week exercise slump. During the time I was exercising I came across a lot of Instagram pics and blog posts that talked about consistency and the power of habit. My favorite fitness blog is Nerd Fitness, run by Steve. If you are into superheroes and you are not yet at the point of your exercise journey where you can completely rebuke McDonald’s, this is a good place to start.

We live in a world filled with instant gratification. But like my dad always told me, “Nothing good in life ever comes easy.” Old people tend to be right. Anyone can accomplish their goals, whether you want to be a rockstar programmer, or a data scientist, or just want to have a hot bod. In order to change, create habits that are not complicated and start off small.

For example:

Main Goal: I want to be a data scientist.

How do you reach goal: Learn some algorithms

Mini Goal : What’s the simplest algorithm you can start with -> Linear Regression

Only change ONE thing in your routine and set aside only 10-30 minutes of your day to accomplish your goals. You are more likely to do something that only takes 10 minutes than something that takes an hour to accomplish. Plus you will clock in more hours doing things for a small amount of time every day instead of only doing things for 3 to 4 hours over the weekend. When you start off small you don’t feel so overwhelmed, and these little accomplishments will keep you going. And when you are in a rut like me, it’s okay. You just acknowledge it and go back to where you started.

Links:

Power of Habits

Habit Change for Newbies

The Beginner’s Guide to Handstands

http://ashotofadrenaline.net/jump-rope-challenge/

Publish in 10 Minutes Per Day

Rebooting

Fallen Off the Wagon? Today is National Respawn Day.