Lately in my data science journey I have been going over DataCamp R exercises. But the thing about practice is that you need to do real-world projects that you are interested in. People learn in different ways. Some people are auditory learners, some are visual, and some learn by doing. I tend to learn more by doing.
Your efforts on self-education should be focused on trying to get to the point where you can actually be involved and do something as early as possible. I feel that the best way to learn something is to jump right in and start doing, before you even know what you’re doing. If you can gain enough knowledge about a subject to start playing around, you can tap into the powerful creative and curious nature of your own mind. We tend to absorb more information and develop more meaningful questions about a thing when we’re actively playing.
– John Sonmez, author of Soft Skills: The Software Developer's Life Manual
Inspired by DJ Patil’s article Hack your Summer, I am going through the Coursera Johns Hopkins R data science curriculum to sharpen my skills and get some sort of portfolio going. One of the courses in the curriculum is called Exploratory Data Analysis (EDA). EDA is an approach to analyzing data sets to summarize their main characteristics, usually through data visualization.
For the first course assignment we had to reproduce some graphs using an energy data set provided by the instructor. In this post I am going to go over the first graph.
I had a lot of trouble reading and cleaning the data. The data set is fairly big, so the instructor suggested estimating how much memory it will require before reading it into R, to make sure your computer has enough. I used a formula from an article on their Simply Statistics blog, "R Workshop: Reading in Large Data Frames":
# rows * # columns * 8 bytes / 2^20
The dataset has 2,075,259 rows and 9 columns
(2,075,259 *9*8)/2^20
142.4967 Megabytes
Since there are roughly 1,000 megabytes in a gigabyte and my computer has about 6 gigabytes of RAM, it has more than enough memory to load the dataset.
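The same estimate can be computed directly in R. This is a minimal sketch of the formula above. The variable names are my own, and it assumes every column is stored as an 8-byte numeric, so the real footprint may differ for character columns.

```r
# Rough memory estimate, assuming every value is an 8-byte numeric
n_rows <- 2075259
n_cols <- 9
bytes_per_value <- 8

size_mb <- n_rows * n_cols * bytes_per_value / 2^20  # bytes -> megabytes (2^20 bytes each)
size_gb <- size_mb / 1024                            # megabytes -> gigabytes

round(size_mb, 4)  # 142.4967
size_gb < 6        # TRUE: fits comfortably in about 6 gigabytes of RAM
```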
Whenever I attempt to do any type of coding I just start with pseudocode and then stumble my way through the syntax. *cough* Stack Overflow *cough* When you work with different languages, it’s hard to remember syntax, but the core concepts are still there.
So in order to create any type of graph in the Base Plotting System you do the following steps.
- Read in the data
- Make sure your variables are in the correct type
- Clean the data
- Use a plotting function with a lot of parameters to spit out the plot
The problems I had
In retrospect, the graphs were simple to make. The trouble I had was working with dates. I received a bachelor's degree in statistics, but I never got much of an opportunity to work with real data.
R has three main functions to read files: read.table(), read.csv() and read.delim(). I am going to use read.table() since it has the most parameters to specify how you want your file to be read in.
# Read the data, forcing the measurement columns to numeric
cls <- c(Voltage="numeric", Global_active_power="numeric", Global_reactive_power="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")
data <- read.table("household_power_consumption.txt", header=TRUE, sep=";", dec=".", stringsAsFactors=FALSE, na.strings="?", colClasses=cls)

# Keep only 1 and 2 February 2007 (dates are stored as day/month/year text)
energyData <- data[data$Date %in% c("1/2/2007", "2/2/2007"), ]

# Make sure the dates are interpreted correctly: convert from character to Date
energyData$Date <- as.Date(energyData$Date, format="%d/%m/%Y")

# Delete all the rows that have NA values
energyData <- na.omit(energyData)

# Plot the data
hist(energyData$Global_active_power, col="red", xlab="Global Active Power (kilowatts)", ylab="Frequency", main="Global Active Power")
dev.off()  # close the graphics device
When I first used the read.table() function it interpreted all of my columns as characters. This was troublesome since most of the data in the set should have numeric classes. In order to fix this problem I had to use the colClasses parameter. colClasses converts the specified variables into atomic vector classes (logical, integer, numeric, complex, character, raw). Take note that I could not convert the dates using colClasses: it only applies the default date format, which does not match the day/month/year strings in this file. You can specify as many or as few variables as you want.
cls <- c(Voltage="numeric", Global_active_power="numeric", Global_reactive_power="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")
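To see what a named colClasses vector does without loading the full file, here is a small sketch. The two-row text below is a made-up stand-in that only imitates the layout of the real file (";" separators, "?" for missing values), and sample_data is a name I chose for illustration.

```r
# Tiny stand-in for the real file: ";" separated, "?" marks a missing value
raw <- "Date;Time;Global_active_power;Voltage
1/2/2007;00:00:00;0.326;243.150
1/2/2007;00:01:00;?;242.980"

# Only the named columns are forced to numeric; the others are left for
# read.table to work out on its own
cls <- c(Global_active_power = "numeric", Voltage = "numeric")

sample_data <- read.table(text = raw, header = TRUE, sep = ";",
                          na.strings = "?", stringsAsFactors = FALSE,
                          colClasses = cls)

sapply(sample_data, class)       # class of each column
sample_data$Global_active_power  # the "?" has become NA
```

Because "?" is listed in na.strings, it is read as NA instead of breaking the numeric conversion.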
Next we need to subset the data. The %in% operator returns a logical vector that is TRUE for every row whose date appears in the specified vector, and we use it to keep only those rows.
energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]
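A quick way to see what %in% is doing is to try it on a short vector. This is only an illustration. The dates vector here is made up, not taken from the data set.

```r
# Made-up date strings in the same day/month/year text form as the file
dates <- c("31/1/2007", "1/2/2007", "2/2/2007", "3/2/2007")

# TRUE wherever the element on the left appears in the vector on the right
keep <- dates %in% c("1/2/2007", "2/2/2007")

keep         # FALSE  TRUE  TRUE FALSE
dates[keep]  # "1/2/2007" "2/2/2007"
which(keep)  # 2 3, the positions, if you want those instead
```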
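As the last step in the clean up process, the dates need to be converted from character to the Date class. We do this with the as.Date() function, telling it that the strings are in day/month/year order and assigning the result back to the Date column.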
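Here is a small sketch of that conversion on two date strings in the form used by the file. The format string "%d/%m/%Y" is my reading of the data (day, then month, then four-digit year), and the variable names are just for illustration.

```r
# Dates as they appear in the file: day/month/year, stored as text
date_strings <- c("1/2/2007", "2/2/2007")

# Tell as.Date the layout explicitly; its default formats expect the year first
parsed <- as.Date(date_strings, format = "%d/%m/%Y")

class(parsed)   # "Date"
format(parsed)  # "2007-02-01" "2007-02-02"
```

Remember that as.Date() returns a new vector, so the result has to be assigned back to the column for the change to stick.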
Finally we make the histogram by using the hist() function.
hist(energyData$Global_active_power, col="red",xlab="Global Active Power (kilowatts)",ylab="Frequency",main="Global Active Power")
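If you want the plot saved to a file rather than shown on screen, one option is to open a PNG device before plotting and close it afterwards. This is a sketch under assumptions: the file name plot1.png and the 480 by 480 size are my own choices, and the power vector is simulated so the example runs without the data set.

```r
set.seed(1)
power <- rexp(500, rate = 1)  # made-up values, not the real measurements

# Hypothetical output file, written to a temporary directory
out_file <- file.path(tempdir(), "plot1.png")

png(out_file, width = 480, height = 480)  # open the file device
hist(power, col = "red",
     xlab = "Global Active Power (kilowatts)",
     ylab = "Frequency",
     main = "Global Active Power")
dev.off()                                 # close the device so the file is written

file.exists(out_file)
```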
Resources