R Programming Assignment 1

I’m currently going through the Johns Hopkins Data Science specialization. So far the courses are okay. They are pretty tough, so if you are a complete beginner you can complement them with a DataCamp course if you need more practice. The only annoying part about this class is that the lectures do not mention some of the functions you will need to complete the assignments. Luckily there are TA hints to supplement this lack of information.

I am a researcher. When I run into a problem, I go online and try to ask the right question to help me solve the problem in front of me, or I talk to people who have expert knowledge on the subject matter, or I find the answer in a book at my local library. But being a programmer or data scientist involves breaking problems down. The toughest part for me is getting comfortable with that kind of problem solving.

Background about the data

Air pollution: the assignment uses readings of sulfate and nitrate particulate matter collected by a set of monitors.

The first thing I do whenever I have data is to explore the file in Excel in order to get a better understanding of its structure. Each file in the specdata folder contains data for one monitor, with three columns:

  • Date: the date of the observation
  • sulfate: the level of sulfate particulate matter in the air on that date
  • nitrate: the level of nitrate particulate matter in the air on that date
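To confirm this structure from R itself, here is a minimal sketch, assuming the specdata folder sits in the working directory and the files are named like 001.csv:

one <- read.csv(file.path("specdata", "001.csv"), header = TRUE)
str(one)    # expect three columns: Date, sulfate, nitrate
head(one)   # first few observations for monitor 1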

For the pollutantmean function, the prompt states:

Write a function named ‘pollutantmean’ that calculates the mean of a pollutant (sulfate or nitrate) across a specified list of monitors. The function ‘pollutantmean’ takes three arguments: ‘directory’, ‘pollutant’, and ‘id’. Given a vector monitor ID numbers, ‘pollutantmean’ reads that monitors’ particulate matter data from the directory specified in the ‘directory’ argument and returns the mean of the pollutant across all of the monitors, ignoring any missing values coded as NA.

At this point I don’t really know what I’m doing. It’s just me staring at the computer for an hour…


With every coding problem I try to think about things on a higher level. If I have an understanding of the problem then I can muddle my way through the syntax of the code.

What are they asking for here?

They just want the mean of a pollutant across the monitors given by the ids.

With the higher-level understanding done, we next break the problem into steps:

  1. read in all the data files
  2. merge the data files into one data frame
  3. ignore the NAs
  4. subset the data frame by pollutant
  5. calculate the mean.

Let’s find out what the mean function requires.

?mean

Usage

mean(x, ...)

## Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
Arguments

x: An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.

trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.

na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds.
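The na.rm argument is exactly what step 3 needs. A quick check of how it behaves on a toy vector:

x <- c(1, 2, NA, 4)
mean(x)                # NA, because one value is missing
mean(x, na.rm = TRUE)  # 2.333..., the NA is stripped first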

According to the documentation, mean() operates on a single R object, such as a numeric vector. But there is a problem.

[screenshot: the specdata folder, one CSV file per monitor]

The information for each monitor is in an individual CSV file, and the monitor id matches the CSV file’s name. In order to find the mean of a pollutant across monitors with different ids, we have to put each monitor’s information into one data frame.

At this point I do not know how to walk through a directory of files. I know in Python there is os.walk. In R there is the list.files function.
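As a quick sanity check, this is roughly what list.files() returns (a sketch, again assuming specdata is in the working directory):

filesD <- list.files("specdata", full.names = TRUE)
head(filesD, 3)  # e.g. "specdata/001.csv" "specdata/002.csv" "specdata/003.csv"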

In order to perform the same actions on multiple files, I looped through the list.

pollutantmean <- function(directory, pollutant, id = 1:332) {
  # create a vector of full file paths (001.csv ... 332.csv sort in id order)
  filesD <- list.files(directory, full.names = TRUE)
  # create an empty data frame
  dat <- data.frame()

  # loop through the requested monitor ids
  for (i in id) {
    # read in the file for monitor i
    temp <- read.csv(filesD[i], header = TRUE)
    # append it to the main data frame
    dat <- rbind(dat, temp)
  }
  # find the mean of the pollutant column; [[ ]] extracts it as a numeric
  # vector (single brackets would hand mean() a data frame), and na.rm
  # removes the missing values
  mean(dat[[pollutant]], na.rm = TRUE)
}
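A couple of example calls (the id ranges here are arbitrary; I am not quoting exact outputs). Note the double brackets in the last line of the function: dat[pollutant] would return a one-column data frame, which mean() rejects with a warning, while dat[[pollutant]] extracts the numeric vector mean() expects.

pollutantmean("specdata", "sulfate", 1:10)   # mean sulfate across monitors 1-10
pollutantmean("specdata", "nitrate", 70:72)  # mean nitrate across monitors 70-72

# why [[ ]] matters:
df <- data.frame(sulfate = c(1, 2, NA))
mean(df["sulfate"], na.rm = TRUE)    # warning; returns NA (data frame input)
mean(df[["sulfate"]], na.rm = TRUE)  # 1.5 (numeric vector input)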

 

Part 2

Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.

This question was easier than parts 1 and 3. The question is simple to understand: they just want a report that displays the number of complete cases in each data file. It reminds me of the NOBS function in SAS, a statistical package used in many businesses, whereas R seems to be used primarily in academia and research.

Steps

  1. read in the files
  2. remove the NAs from the set
  3. count the number of rows
  4. create a new data frame with two columns: the monitor’s id number and its number of complete observations

complete <- function(directory, id = 1:332) {
  # create a vector of full file paths
  filesD <- list.files(directory, full.names = TRUE)
  # create an empty data frame
  dat <- data.frame()

  for (i in id) {
    # read in the file for monitor i
    temp <- read.csv(filesD[i], header = TRUE)
    # delete rows that do not have complete cases
    temp <- na.omit(temp)

    # count the rows with complete cases
    tNobs <- nrow(temp)

    # record the monitor id alongside its number of complete cases
    dat <- rbind(dat, data.frame(id = i, nobs = tNobs))
  }
  return(dat)
}
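A usage sketch under the same assumptions (the id vector here is arbitrary):

cc <- complete("specdata", c(2, 4, 8, 10, 12))
cc        # a data frame with columns id and nobs
cc$nobs   # just the counts of complete cases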

 

Part 3

Write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate for monitor locations where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no monitors meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows

Part 3 was a doozy. It’s similar to part 1 in that we are basically aggregating data, but this time we are aggregating the data inside a vector instead of a data frame.

  1. read in the files
  2. remove the NAs from the set
  3. check whether the number of complete cases is greater than the threshold
  4. find the correlation between the two pollutants.

In order to combine data frames we used rbind() (bind rows); cbind() binds columns instead. To grow a plain vector, though, we concatenate with c(), as the sketch below shows.
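A tiny illustration of the difference (the objects here are throwaway examples):

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 3, y = 4)
rbind(a, b)   # stacks rows: a 2x2 data frame
cbind(a, b)   # glues columns side by side: a 1x4 data frame

v <- c(0.5, 0.7)
c(v, 0.9)     # for plain vectors, c() simply appends: 0.5 0.7 0.9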

corr <- function(directory, threshold = 0) {
  # create a vector of full file paths
  filesD <- list.files(directory, full.names = TRUE)

  # create an empty numeric vector
  dat <- vector(mode = "numeric", length = 0)

  for (i in seq_along(filesD)) {
    # read in the file
    temp <- read.csv(filesD[i], header = TRUE)
    # keep only the rows that are complete cases
    temp <- temp[complete.cases(temp), ]
    # count the number of complete observations
    csum <- nrow(temp)
    # if the count clears the threshold, compute the correlation between
    # nitrate and sulfate for this file and append it with c(); this is a
    # plain vector, so rbind()/cbind() do not apply
    if (csum > threshold) {
      dat <- c(dat, cor(temp$nitrate, temp$sulfate))
    }
  }

  return(dat)
}
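And a final usage sketch, again assuming the specdata layout (the threshold value is arbitrary):

cr <- corr("specdata", threshold = 400)
length(cr)   # how many monitors cleared the threshold
summary(cr)  # distribution of the correlations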

In retrospect, this assignment is very useful. In most of the data science courses throughout this specialization you will be selecting, aggregating, and doing basic statistics on data.
