Coursera Exploring Data Assignment 1

Lately in my data science journey I have been going over DataCamp R exercises. But the thing about practice is that you need to do real-world projects that you are interested in. People learn in different ways. Some people are auditory learners, some are visual, and some learn by doing. I tend to learn more by doing.

Your efforts on self-education should be focused on trying to get to the point where you can actually be involved and do something as early as possible. I feel that the best way to learn something is to jump right in and start doing, before you even know what you’re doing. If you can gain enough knowledge about a subject to start playing around, you can tap into the powerful creative and curious nature of your own mind. We tend to absorb more information and develop more meaningful questions about a thing when we’re actively playing.

– John Sonmez, author of Soft Skills: The Software Developer's Life Manual

Inspired by DJ Patil’s article Hack your Summer, I am going through the Coursera Johns Hopkins R data science curriculum to sharpen my skills and get some sort of a portfolio going. One of the courses in the curriculum is called Exploratory Data Analysis (EDA). EDA is an approach to analyzing data sets to summarize their main characteristics, usually through data visualization.

For the first course assignment we had to reproduce some graphs using an energy data set provided by the instructor. In this post I am going to go over the first graph.

plot1

So I had a lot of trouble reading and cleaning the data. The data set is kind of big, so the instructor suggested estimating how much memory it would require before reading it into R, to make sure the computer has enough. I used a formula from an article on the Simply Statistics blog, R Workshop: Reading in Large Data Frames:

# rows * # columns * 8 bytes / 2^20

The dataset has 2,075,259 rows and 9 columns

(2,075,259 *9*8)/2^20

142.4967 Megabytes

Since there are about 1,000 megabytes in a gigabyte and my computer has about 6 gigabytes of memory, it has more than enough to load the dataset.
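The same estimate can be sketched in R. The variable names are my own, and the 8 bytes per value assumes every column ends up stored as a numeric, so treat the result as a rough figure rather than an exact one.

```r
# Rough memory estimate for a data frame of numeric columns
rows <- 2075259          # number of rows in the data set
cols <- 9                # number of columns
bytes_per_value <- 8     # assumption: each value is stored as an 8-byte numeric

# Divide by 2^20 to convert bytes to megabytes
est_mb <- rows * cols * bytes_per_value / 2^20
round(est_mb, 4)
```

Character columns such as the date and time can take more or less space than this, which is one more reason to leave some headroom.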

Whenever I attempt to do any type of coding I just start with pseudo code and then stumble my way through the syntax. *cough* Stack Overflow *cough* When you work with different languages it’s hard to remember syntax but the core concepts are still there.

So in order to create any type of graph in the Base Plotting System, you do the following steps.

Read in the data
Make sure your variables are in the correct type
Clean the data
Use a plotting function with a lot of parameters to spit out the plot

Describe the problems you had.

In retrospect the graphs were simple to make. The trouble I had was working with dates. I received a bachelor's degree in statistics, but I never got much of an opportunity to work with real data.

R has three main functions for reading files: read.table(), read.csv(), and read.delim(). I am going to use read.table() since it has the most parameters for specifying how you want your file to be read in.


#read data

# classes for the numeric columns; Date and Time are left as character
cls <- c(Global_active_power="numeric", Global_reactive_power="numeric", Voltage="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")

data <- read.table("household_power_consumption.txt", header=TRUE, sep=";",dec=".", stringsAsFactors=FALSE, na.strings = "?",colClasses=cls)

energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]

#make sure the dates are interpreted correctly (the file stores them as day/month/year)
energyData$Date <- as.Date(energyData$Date, format="%d/%m/%Y")

#deleted all the rows that had NA values
energyData <-na.omit(energyData)

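#plot data
#write the histogram to a PNG file (the file name plot1.png is an assumption)
png("plot1.png")
hist(energyData$Global_active_power, col="red", xlab="Global Active Power (kilowatts)", ylab="Frequency", main="Global Active Power")

#close the graphics device so the file is written
dev.off()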

When I first used the read.table function, it interpreted all of my columns as characters. This was troublesome since most of the data in the set should have numeric classes. To fix this problem I had to use the colClasses parameter. colClasses converts the specified variables into atomic vector classes (logical, integer, numeric, complex, character, raw). Take note that I could not convert the dates using the colClasses parameter. You can specify any number of variables that you want.

cls <- c(Global_active_power="numeric", Global_reactive_power="numeric", Voltage="numeric", Global_intensity="numeric", Sub_metering_1="numeric", Sub_metering_2="numeric", Sub_metering_3="numeric")
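To see what a named colClasses vector does, here is a small sketch that reads a few made-up lines through the text argument of read.table() instead of the real file. The sample rows and the names sample_text, cls_demo, and demo are illustrative assumptions, not part of the assignment.

```r
# Three made-up lines in the same semicolon-separated layout, with "?" for a missing value
sample_text <- "Date;Time;Global_active_power;Voltage
1/2/2007;00:00:00;0.326;243.150
1/2/2007;00:01:00;?;241.740"

# Only the named columns are forced to numeric; the rest are left for R to guess
cls_demo <- c(Global_active_power = "numeric", Voltage = "numeric")

demo <- read.table(text = sample_text, header = TRUE, sep = ";",
                   na.strings = "?", colClasses = cls_demo,
                   stringsAsFactors = FALSE)

# Show the class of every column
sapply(demo, class)
```

The "?" entry comes back as NA in a numeric column, while Date and Time stay character because they were not named in cls_demo.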

Next we need to subset the data. The %in% operator returns a logical vector that marks which dates match any of the values in the given vector.

energyData <- data[data$Date %in% c("1/2/2007","2/2/2007") ,]
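A tiny sketch of how %in% behaves on its own, using a made-up vector of date strings (the names dates and keep are only for illustration):

```r
# Made-up date strings in the same day/month/year text form as the file
dates <- c("31/1/2007", "1/2/2007", "2/2/2007", "3/2/2007")

# TRUE where the element matches any of the wanted dates, FALSE otherwise
keep <- dates %in% c("1/2/2007", "2/2/2007")
keep

# Using the logical vector as an index keeps only the matching elements
dates[keep]
```

Used as a row index on a data frame, the same logical vector keeps only the rows for those two days.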

As the last step in the cleanup process, the dates need to be converted from character to the Date class. We do this by using the as.Date() function.
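One thing that tripped me up is worth a sketch: the dates in this file are written as day/month/year, so as.Date() needs a format string, and the result has to be assigned back to be kept. The vector names below are illustrative.

```r
# Date strings as they appear in the file: day/month/year
raw_dates <- c("1/2/2007", "2/2/2007")

# Tell as.Date() how to read them; without a format these strings are misread or rejected
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

class(parsed)
format(parsed, "%Y-%m-%d")
```

Both values land in February 2007, which is the two-day window the assignment asks for.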

Finally, we make a histogram by using the hist() function.

hist(energyData$Global_active_power, col="red",xlab="Global Active Power (kilowatts)",ylab="Frequency",main="Global Active Power")
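If you want to check what hist() is counting before drawing anything, you can ask it for the bins without a plot. This is only a sketch: the values and the break points are made up for illustration and are not taken from the energy data.

```r
# Made-up power readings in kilowatts
values <- c(0.3, 0.4, 1.2, 1.4, 2.6)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(values, breaks = seq(0, 3, by = 0.5), plot = FALSE)

h$breaks   # bin edges
h$counts   # number of values falling in each bin
```

The counts are the bar heights that end up on the Frequency axis of the plot.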

Visualizing the ByePhylicia hashtag

So let’s start from the beginning. Unless you have been living under a rock these past few months, you already know that Bill Cosby, America’s favorite dad, has been accused of drugging and raping 30 women. Yesterday, actress Phylicia Rashad, who played Bill Cosby’s wife on The Cosby Show, came to Bill’s defense. Phylicia Rashad’s stance illustrates why so many women who are raped don’t report their assaults. It’s a perfect example of rape culture. Rape culture is a term used to describe an environment in which sexual violence is considered the norm.

Twitter, a micro-blogging platform that allows users to write up to 140 characters, has proven to be a powerful tool for uniting people from all over the world. Previous events in national news sparked the hashtags #Icantbreathe, #Ferguson, and #blacklivesmatter. Current events that dominate the media tend to produce new hashtags, such as #JeSuisCharlie.

So it’s no surprise that activists took to Twitter and voiced their opinions about rape culture. They took the popular phrase Bye Felicia, from the movie Friday, and turned it into #ByePhylicia.

Here is a visualization of #ByePhylicia I made around 9 am this morning using NodeXL, an open source plugin for Microsoft Excel. This visual shows the network of people who were the first to use the #ByePhylicia hashtag. Since the Twitter API only allows developers 15 minutes to run queries, I only searched for 100 users who used the hashtag.

Each color and shape represents a different group within the network. Some users are self-looping, meaning that no one has responded to their tweets. The size of each node represents how many users responded to that person’s tweets. @danyellecarter has the most responses to her tweets.

Top Hashtags in Tweet in Entire Graph

byephylicia

tweetof2015

billcosby

wendywilliams

girlbye

jesuischarlie

wakeupcall

blackcelebrities

sideeyeingyou

byefelicia

Top Tweeters in Entire Graph	Entire Graph Count
insanityreport	266452
felonious_munk	236243
perezhilton	234606
callmedollar	220458
kat_lynd	163113
curlyheadred	156973
shugah	153241
ohnotheydidnt	145700
gurrdygirl	126567
shimmy4yaheart	117372

Top Mentioned in Entire Graph	Entire Graph Count
ozchrisrock	13
prestonmitchum	11
hillarycrosley	9
candicebenbow	4
wendywilliams	3
kwestsavali	3
candycornball	3
tyrese	2
thesuperficial	2
theroot	2

Measures of the Graph:

Graph Metric	Value
Graph Type	Directed

Vertices	111

Unique Edges	105
Edges With Duplicates	0
Total Edges	105

Self-Loops	32

Reciprocated Vertex Pair Ratio	0.013888889
Reciprocated Edge Ratio	0.02739726

Connected Components	43
Single-Vertex Connected Components	24
Maximum Vertices in a Connected Component	25
Maximum Edges in a Connected Component	30

Maximum Geodesic Distance (Diameter)	5
Average Geodesic Distance	2.17574

Graph Density	0.005978706
Modularity	0.759501
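The graph density in the table can be reproduced from the other metrics. The sketch below assumes the density is computed for a directed graph with self-loops left out, that is, (edges - self-loops) / (n * (n - 1)). That formula is my assumption about how the figure was computed, but it does match the reported value.

```r
# Metrics copied from the table above
vertices    <- 111
total_edges <- 105
self_loops  <- 32

# Assumed formula: edges between distinct vertices divided by
# the number of possible directed edges between distinct vertices
density <- (total_edges - self_loops) / (vertices * (vertices - 1))
density
```

A density this close to zero fits the picture of many small, mostly disconnected groups.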

Resources:

https://mashe.hawksey.info/2011/09/twitter-network-analysis-and-visualisation-ii-nodexl/

http://econometricsense.blogspot.com/2012/04/introduction-to-social-network-analysis.html

http://www.vox.com/2015/1/8/7513119/phylicia-rashad-cosby-rape

http://everydayfeminism.com/2014/03/examples-of-rape-culture/

Interactive map of uncovered guns in the United States

This article is a month shy of being a year old, but the New York Times interactive map, built using data from the Chicago Crime Lab, highlights Chicago’s gun problems.

http://www.nytimes.com/interactive/2013/01/29/us/where-50000-guns-in-chicago-came-from.html?_r=0

Analysis of Drug Users

Graph created using UCINET. The data set I analyzed is called drugnet. It represents network data of needle sharing among drug users from the Hartford study. For more information about this study go to http://www.incommunityresearch.org/programs/projectrap.htm

Legend

Grey – Caucasian

Black – African-American

Orange – Puerto Rican

Green – Other

Legend:

Blue – Male

Purple – Female

Pink – Other

Zaynaib Giwa

Inspiring data scientist who also dabbles in web development

Tag: data visualization
