This was a class assignment. I had to find, download, and combine at least two datasets to gain hands-on experience with identifying, downloading, and preprocessing data collected by someone else. I used Excel VBA and R to preprocess my data, and the R libraries ggplot2 and reshape2 for my analysis. See the official ggplot2 and reshape2 documentation for details on each library. Both libraries were created by Hadley Wickham.
Step 1 – Identify Data
My original plan was to see if there was any correlation among the cities with the least gun violence. I was inspired by an article in The Guardian about the problem, but I quickly decided against it because the issue of gun violence is far too complex. I tried looking at crime statistics on the FBI website, but the data was too general: it only showed nationwide crime and not crime at the county level. So for my project I now want to know whether there is any correlation between race and marijuana arrests. Prior research has reported a racial bias in marijuana arrests; check out the ACLU marijuana case study.
I gathered arrest data from the Illinois Analysis Center (IAC) and demographic data from the American Community Survey (ACS). The data posted on the IAC website only had raw counts of the number of people arrested. I called the IAC to see if they could give me more data, and to my surprise they could. I requested five years' worth of data, 2007-2012. The data curator informed me that they cannot provide a raw count for counties with fewer than 10 marijuana arrests. Also, the demographic group Hispanic is not used in police arrest reports. Other groups, such as Asians, could not be shown in the data because too few of them were arrested. I also asked for the arrests broken down by age group, but the curator told me that would be too hard to do since most of the people arrested are between the ages of 17 and 35.
Step 2 – Get to Know Your Data
After I downloaded the data, I preprocessed it. The Illinois Analysis Center data did not need much preprocessing because it was already fairly clean. I ran a VBA script that split the Excel workbook into individual worksheets. I also deleted some descriptive text from the data set, such as where the data originated and the proportions of the arrests. Finally, I had to delete some counties because the ACS data was not as detailed as the Illinois Analysis Center's drug data.
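The county clean-up described above amounts to keeping only the counties that appear in both sources. Below is a minimal sketch of that idea in plain Python, not the VBA/R workflow used for the project. The function name `keep_shared_counties`, the county labels, and all numbers are hypothetical and chosen only for illustration.

```python
def keep_shared_counties(arrests, acs):
    """Restrict both tables to the counties present in each of them.

    `arrests` and `acs` are dicts keyed by county name.
    Returns (arrests_kept, acs_kept, dropped_county_names).
    """
    shared = sorted(set(arrests) & set(acs))
    # Counties that appear in only one of the two sources get dropped.
    dropped = sorted(set(arrests) ^ set(acs))
    arrests_kept = {county: arrests[county] for county in shared}
    acs_kept = {county: acs[county] for county in shared}
    return arrests_kept, acs_kept, dropped


# Made-up sample rows (not real IAC or ACS figures).
arrests = {
    "County A": {"WArrest": 120, "BArrest": 80},
    "County B": {"WArrest": 40, "BArrest": 15},
    "County C": {"WArrest": 12, "BArrest": 10},
}
acs = {
    "County A": {"Total": 50000},
    "County B": {"Total": 20000},
}

arrests_kept, acs_kept, dropped = keep_shared_counties(arrests, acs)
print(dropped)  # ['County C']
```

Returning the dropped names makes it easy to record which counties were excluded from the analysis.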
The data that gave me the most trouble was the American Community Survey data, which came as a massive file. I preprocessed five years' worth of 1-year ACS estimates, covering 2007-2011, but for the sake of time I only used 2011 for a quick analysis. I did the preprocessing manually because it was the fastest way. The ACS variables that I kept were the total population count, the number of males, the number of females, the number of blacks, and the number of whites in each Illinois county. From these I built the following variables:
- PMale: percentage of males in each county
- PFemale: percentage of females in each county
- PWhite: percentage of whites in each county
- PBlack: percentage of blacks in each county
- Total: total population
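As an illustration of how the percentage variables above follow from the raw ACS counts, here is a small Python sketch. The original preprocessing was done manually, so this is not the method actually used; the function name `acs_percentages` and the sample counts are hypothetical.

```python
def acs_percentages(total, males, females, whites, blacks):
    """Turn raw ACS counts for one county into percentage variables."""
    if total <= 0:
        raise ValueError("total population must be positive")

    def pct(count):
        # Share of the county's total population, as a percentage.
        return 100.0 * count / total

    return {
        "PMale": pct(males),
        "PFemale": pct(females),
        "PWhite": pct(whites),
        "PBlack": pct(blacks),
        "Total": total,
    }


# Made-up counts for a single county (not real ACS figures).
row = acs_percentages(total=20000, males=9800, females=10200,
                      whites=15000, blacks=3000)
print(row["PMale"], row["PBlack"])  # 49.0 15.0
```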
Using the raw arrest counts from the IAC data, I created eight variables.
- WArrest: total number of white arrests
- BArrest: total number of black arrests
- 2011 overall total marijuana arrests
- wap: (total number of white arrests for each county) / white population
- bap: (total number of black arrests for each county) / black population
- taw: (total number of white arrests) / total number of arrests
- tab: (total number of black arrests) / total number of arrests
- tap: (total number of arrests) / total population
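The ratio variables in the list above can be written out directly from their definitions. The Python sketch below is only an illustration under assumptions: the function names `safe_ratio` and `arrest_rates` are hypothetical, the numbers are made up, and treating a withheld count as `None` is an assumed way of handling the counts the curator could not release.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning None when a value is missing or the denominator is 0."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


def arrest_rates(w_arrest, b_arrest, total_arrests,
                 white_pop, black_pop, total_pop):
    """Compute the per-county ratio variables from raw counts."""
    return {
        "wap": safe_ratio(w_arrest, white_pop),       # white arrests / white population
        "bap": safe_ratio(b_arrest, black_pop),       # black arrests / black population
        "taw": safe_ratio(w_arrest, total_arrests),   # white share of all arrests
        "tab": safe_ratio(b_arrest, total_arrests),   # black share of all arrests
        "tap": safe_ratio(total_arrests, total_pop),  # all arrests / total population
    }


# Made-up counts for one county (not real IAC or ACS figures).
rates = arrest_rates(w_arrest=60, b_arrest=30, total_arrests=100,
                     white_pop=15000, black_pop=3000, total_pop=20000)
for name, value in rates.items():
    print(name, value)
```

Because wap and bap divide by each group's own population, they can be compared across counties with very different demographics, which taw and tab alone cannot.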
Step 3 – Analysis Plan
Null: Race does not have an impact on marijuana arrests.
Alternative: Race does have an impact on marijuana arrests.
Create a density graph to see the probability distribution of the data and better figure out which test I should use. Use R's correlation (cor) and covariance (cov) functions to see if there is a correlation between the races. Create a linear regression with more control variables, such as income and age.
Explanatory variables: race, gender
Response variable: marijuana arrests
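To make the correlation and covariance step of the plan concrete, here is a sketch of the two statistics written out in plain Python. It is meant to mirror what R's `cov` (sample covariance, dividing by n - 1) and `cor` (Pearson correlation by default) compute, not to replace them. The function names and the per-county rates are hypothetical.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    sd_x = math.sqrt(covariance(xs, xs))
    sd_y = math.sqrt(covariance(ys, ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance(xs, ys) / (sd_x * sd_y)


# Made-up per-county arrest rates (not real results).
wap = [0.002, 0.004, 0.003, 0.006]
bap = [0.006, 0.011, 0.010, 0.015]

print("cov:", covariance(wap, bap))
print("cor:", correlation(wap, bap))
```

A correlation near 1 or -1 would indicate that the two rates move together across counties, but it says nothing about causation, which is why the plan also calls for a regression with control variables.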
Step 4 – Overall Recommendation
| Source | Description | Decision | Reason |
|---|---|---|---|
| http://www.fbi.gov/about-us/cjis/ucr/ucr | FBI database of crime in the United States | Rejected | The data was too general. It only had data for states and not counties. |
| http://www.isp.state.il.us/crime/ucrhome.cfm | Illinois State Police | Rejected | Even though it had more detailed data than the IAC data sets for each county in Illinois, it did not have any race demographic data. |
| http://www.bjs.gov/ | Bureau of Justice Statistics | Rejected | The Bureau of Justice Statistics does not have any data tools for drug arrests, only for violent crimes. |
I need to get more demographic data because this issue is far too complex and interrelated. I cannot make a good analysis with just one variable (race); I will also need to look at variables such as income and education. With the data that I have now, I can only do a simple analysis, and I cannot make any groundbreaking claims.