Coursera Exploring Data Assignment 1

Lately in my data science journey I have been going over DataCamp R exercises. But the thing about practice is that you need to do real-world projects that you are interested in. People learn in different ways: some are auditory learners, some are visual, and some learn by doing. I tend to learn more by doing.

Your efforts on self-education should be focused on trying to get to the point where you can actually be involved and do something as early as possible. I feel that the best way to learn something is to jump right in and start doing, before you even know what you’re doing. If you can gain enough knowledge about a subject to start playing around, you can tap into the powerful creative and curious nature of your own mind. We tend to absorb more information and develop more meaningful questions about a thing when we’re actively playing.

John Sonmez, author of Soft Skills: The Software Developer’s Life Manual

Inspired by DJ Patil’s article Hack your Summer, I am going through the Coursera Johns Hopkins data science curriculum in R to sharpen my skills and start building a portfolio. One of the courses in the curriculum is called Exploratory Data Analysis (EDA). EDA is an approach to analyzing data sets that summarizes their main characteristics, usually through data visualization.

For the first course assignment we had to reproduce some graphs using an energy data set provided by the instructor. In this post I am going to go over the first graph.

plot1

 

I had a lot of trouble reading and cleaning the data. The data set is fairly big, so the instructor suggested estimating how much memory it will require before reading it into R, to make sure your computer has enough. I used a formula from an article on the Simply Statistics blog, R Workshop: Reading in Large Data Frames:

# rows * # columns * 8 bytes / 2^20

The dataset has 2,075,259 rows and 9 columns

(2,075,259 *9*8)/2^20

142.4967 Megabytes

Since there are 1,024 megabytes in a gigabyte (counting a megabyte as 2^20 bytes, as the formula does) and my computer has about 6 gigabytes of memory, it has more than enough room to load the dataset.
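The same estimate can be worked out in R. This is only a rough sketch: it assumes every value is stored as an 8-byte number, which is what the formula above assumes, and the variable names are made up for illustration.

```r
# Row and column counts quoted in the post
n_rows <- 2075259
n_cols <- 9

# Assumption: each value takes 8 bytes (a double), as in the formula above
bytes_per_value <- 8

# Convert bytes to megabytes by dividing by 2^20
size_mb <- n_rows * n_cols * bytes_per_value / 2^20

# Convert megabytes to gigabytes (1024 MB per GB under the same convention)
size_gb <- size_mb / 1024

round(size_mb, 4)   # 142.4967
round(size_gb, 2)
```

The real footprint while reading is likely to be higher than this, so it is best treated as a lower bound rather than an exact figure.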

Whenever I attempt any type of coding, I start with pseudocode and then stumble my way through the syntax. *cough* Stack Overflow *cough* When you work with different languages it’s hard to remember syntax, but the core concepts are still there.

So in order to create any type of graph in the Base Plotting System, you do the following steps.

  1. Read in the data
  2. Make sure your variables are in the correct type
  3. Clean the data
  4. Use a plotting function with a lot of parameters to spit out the plot

 

Describe the problems you had.

In retrospect, the graphs were simple to make. The trouble I had was working with dates. I received a bachelor’s degree in statistics, but I never got much of an opportunity to work with real data.

R has three main functions for reading files: read.table(), read.csv() and read.delim(). I am going to use read.table() since it has the most parameters for specifying how you want your file to be read in.


# Read the data. Column classes are given by name; "?" marks missing values.

cls <- c(Voltage="numeric", Global_active_power="numeric", Global_reactive_power="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")

data <- read.table("household_power_consumption.txt", header=TRUE, sep=";", dec=".", stringsAsFactors=FALSE, na.strings="?", colClasses=cls)

# Keep only the two days of interest (dates are stored as day/month/year text)
energyData <- data[data$Date %in% c("1/2/2007","2/2/2007"), ]

# Convert Date from character to Date class and store the result
# (the day/month/year format string is assumed from how the dates are written)
energyData$Date <- as.Date(energyData$Date, format="%d/%m/%Y")

# Delete all the rows that have NA values
energyData <- na.omit(energyData)

# Plot the data
hist(energyData$Global_active_power, col="red", xlab="Global Active Power (kilowatts)", ylab="Frequency", main="Global Active Power")

# Close the graphics device that hist() drew on
dev.off()

 

When I first used the read.table function it interpreted all of my columns as characters. This was troublesome since most of the data in the set should be numeric. To fix this problem I had to use the colClasses parameter. colClasses converts specified variables into atomic vector classes (logical, integer, numeric, complex, character, raw). Take note that I could not convert the dates using the colClasses parameter. You can specify any number of variables that you want.

cls <- c(Voltage="numeric", Global_active_power="numeric", Global_reactive_power="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")
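To see the effect on a small scale, here is a sketch that reads a made-up three-row sample instead of the real file. The sample text and the names sample_text, raw, typed and cls_demo are invented for illustration; only the column names and the "?" marker come from the data set.

```r
# Hypothetical sample in the same semicolon-separated layout as the real file;
# "?" stands for a missing value.
sample_text <- "Date;Voltage;Global_active_power
1/2/2007;240.1;0.326
1/2/2007;?;0.324
2/2/2007;239.8;?"

# Without na.strings or colClasses, the "?" entries make both
# measurement columns come in as character.
raw <- read.table(text = sample_text, header = TRUE, sep = ";",
                  stringsAsFactors = FALSE)

# With na.strings and a named colClasses vector, the named columns are
# numeric and "?" becomes NA. Columns not named are left for R to guess.
cls_demo <- c(Voltage = "numeric", Global_active_power = "numeric")
typed <- read.table(text = sample_text, header = TRUE, sep = ";",
                    stringsAsFactors = FALSE, na.strings = "?",
                    colClasses = cls_demo)

sapply(raw, class)
sapply(typed, class)
```

In this sketch the Date column stays character in both cases, which is why the dates still need a separate conversion step later.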

Next we need to subset the data. The %in% operator returns a logical vector marking which dates match any value in the specified vector, and that vector selects the rows to keep.

energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]
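A small sketch of what %in% does, using a made-up four-row data frame (the names dates, keep and demo, and the voltage values, are invented for illustration):

```r
# Hypothetical dates stored as day/month/year text, as in the data set
dates <- c("31/1/2007", "1/2/2007", "2/2/2007", "3/2/2007")

# TRUE where the date is one of the two days we want, FALSE otherwise
keep <- dates %in% c("1/2/2007", "2/2/2007")
keep   # FALSE  TRUE  TRUE FALSE

# Using the logical vector as a row index keeps only the matching rows
demo <- data.frame(Date = dates,
                   Voltage = c(241.2, 240.1, 239.8, 238.5),
                   stringsAsFactors = FALSE)
subset_demo <- demo[keep, ]
subset_demo
```

Because the comparison is on text, the strings have to match exactly: "01/02/2007" would not match "1/2/2007".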

 

The last step in the cleanup process is converting the dates from character to Date class. We do this by using the as.Date() function, with a format string that matches how the dates are written.
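As a minimal sketch of that conversion, assuming the dates are written as day/month/year (which is how the two days used in the subset appear), the names date_strings and parsed being made up for illustration:

```r
# Two dates as they appear in the subset: 1 and 2 February 2007
date_strings <- c("1/2/2007", "2/2/2007")

# Assumption: day/month/four-digit-year, so the format is "%d/%m/%Y"
parsed <- as.Date(date_strings, format = "%d/%m/%Y")

class(parsed)                 # "Date"
format(parsed, "%Y-%m-%d")    # "2007-02-01" "2007-02-02"
```

The result has to be assigned back to the column (for example energyData$Date <- ...), otherwise the converted values are printed and then thrown away.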

Finally we make a histogram by using the hist() function.

hist(energyData$Global_active_power, col="red",xlab="Global Active Power (kilowatts)",ylab="Frequency",main="Global Active Power")

 

 


Visualizing the ByePhylicia hashtag

So let’s start from the beginning. Unless you have been living under a rock these past few months, you already know Bill Cosby, America’s favorite Dad, has been accused of drugging and raping 30 women. Yesterday, actress Phylicia Rashad, who played Bill Cosby’s wife on The Cosby Show, came to Bill’s defense. Phylicia Rashad’s stance illustrates why so many women who are raped don’t report their assaults. It’s a perfect example of rape culture. Rape culture is a term used to describe an environment in which sexual violence is considered the norm.

Twitter, a micro-blogging platform that allows users to write up to 140 characters, has proven to be a powerful tool uniting people from all over the world. Previous events in national news sparked the hashtags #Icantbreathe, #Ferguson, and #blacklivesmatter. Current events that dominate the media tend to produce new hashtags, such as #JeSuisCharlie.

So it’s no surprise that activists took to Twitter and voiced their opinions about rape culture. Taking the popular phrase Bye Felicia, from the movie Friday, they adapted it into #ByePhylicia.

Here is a visualization of #ByePhylicia I made around 9 am this morning using NodeXL, an open source plugin for Microsoft Excel. This visual shows the network of people who were the first to use the #ByePhylicia hashtag. Since the Twitter API limits how many queries developers can run in each 15-minute window, I only searched for 100 users who used the hashtag.

Each color and shape represents a different group within the network. Some users are self-looping, meaning that no one has responded to their tweets. The size of each node represents how many users responded to that person’s tweets. @danyellecarter has the most responses to her tweets.

byephlyica

Top Hashtags in Tweet in Entire Graph
byephylicia
tweetof2015
billcosby
wendywilliams
girlbye
jesuischarlie
wakeupcall
blackcelebrities
sideeyeingyou
byefelicia

Top Tweeters in Entire Graph     Entire Graph Count
insanityreport                   266452
felonious_munk                   236243
perezhilton                      234606
callmedollar                     220458
kat_lynd                         163113
curlyheadred                     156973
shugah                           153241
ohnotheydidnt                    145700
gurrdygirl                       126567
shimmy4yaheart                   117372

Top Mentioned in Entire Graph    Entire Graph Count
ozchrisrock                      13
prestonmitchum                   11
hillarycrosley                   9
candicebenbow                    4
wendywilliams                    3
kwestsavali                      3
candycornball                    3
tyrese                           2
thesuperficial                   2
theroot                          2

Measures of the Graph:

Graph Metric                                 Value
Graph Type                                   Directed
Vertices                                     111
Unique Edges                                 105
Edges With Duplicates                        0
Total Edges                                  105
Self-Loops                                   32
Reciprocated Vertex Pair Ratio               0.013888889
Reciprocated Edge Ratio                      0.02739726
Connected Components                         43
Single-Vertex Connected Components           24
Maximum Vertices in a Connected Component    25
Maximum Edges in a Connected Component       30
Maximum Geodesic Distance (Diameter)         5
Average Geodesic Distance                    2.17574
Graph Density                                0.005978706
Modularity                                   0.759501
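
As a rough check on the Graph Density figure, here is a sketch in R. It assumes that density for a directed graph is the number of edges that are not self-loops divided by the number of possible directed edges, n * (n - 1). That definition is my assumption rather than something stated in the NodeXL output, but it does reproduce the reported value from the counts above.

```r
# Counts taken from the metrics above
vertices    <- 111
total_edges <- 105
self_loops  <- 32

# Assumption: self-loops are excluded, and a directed graph on n vertices
# has n * (n - 1) possible edges
possible_edges <- vertices * (vertices - 1)
density <- (total_edges - self_loops) / possible_edges

density
```

A density this close to zero fits the picture of many small, mostly disconnected groups.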

Resources:

https://mashe.hawksey.info/2011/09/twitter-network-analysis-and-visualisation-ii-nodexl/

http://econometricsense.blogspot.com/2012/04/introduction-to-social-network-analysis.html

http://www.vox.com/2015/1/8/7513119/phylicia-rashad-cosby-rape

http://everydayfeminism.com/2014/03/examples-of-rape-culture/

Analysis of Drug Users

Graphs created using UCINET. The data set I analyzed is called drugnet. It represents network data of needle sharing among drug users from the Hartford study. For more information about this study go to http://www.incommunityresearch.org/programs/projectrap.htm

drug_noisolates_ethnicitycolor

Legend:

Grey – Caucasian

Black – African-American

Orange – Puerto Rican

Green – Other

drug_noisolates_gendercolor

Legend:

Blue – Male

Purple – Female

Pink – Other