UofT R session went well. Thanks RStudio Server!

Apart from going longer than I had anticipated, very little of any significance went wrong during my R session at UofT on friday!  It took a while at the beginning for everyone to get set up.  Everyone was connecting to my home RStudio server via UofT’s wireless network.  This meant that if any students weren’t set up to use wireless in the first place (they get a username and password from the library, a UTORid) then they wouldn’t be able to connect period.  For those students who were able to connect, I assigned each of them one of 30 usernames that I had laboriously set up on my machine the night before.

After connecting to my server, I then got them to click on the ‘data’ directory that I had set up in each of their home folders on my computer to load up the data that I prepared for them (see last post).  I forgot that they needed to set the data directory as their working directory… woops, that wasted some time!  After I realized that mistake, things went more smoothly.

We went over data import, data indexing (although I forgot about conditional indexing, which I use very often at work… d’oh!), merging, mathematical operations, some simple graphing (a histogram, scatterplot, and scatterplot matrix), summary stats, median splits, grouped summary stats using the awesome dplyr, and then nicer graphing using the qplot function from ggplot2.

I was really worried about being boring, but I found myself getting more and more energized as the session went on, and I think the students were interested as well!  I’m so glad that the RStudio Server I set up on my computer was able to handle all of those connections at once and that my TekSavvy internet connection didn’t crap out either 🙂  This is definitely an experience that I would like to have again.  Hurray!

Here’s a script of the analysis I went through:

# ****Introduction****
# Data analysis is like an interview. In any interview, the interviewer hopes to use a series of
# questions in order to discover a story. The questions the interviewer asks, of course, are
# subjectively chosen. As such, the story that one interviewer gets out of an interviewee might
# be fairly different from the story that another interviewer gets out of the same person. In the
# same way, the commands (and thus the analysis) below are not the only way of analyzing the data.
# When you understand what the commands are doing, you might decide to take a different approach
# to analyzing the data. Please do so, and be sure to share what you find!
# ****Dataset Background****
# The datasets that we will be working with all relate to council areas in scotland (roughly equivalent
# to provinces). The one which I have labeled 'main' has numbers representing the number of drug
# related deaths by council area, with most of its columns containing counts that relate to specific
# drugs. It also contains geographical coordinates of the council areas, in latitude and longitude.
# The one which I have labeled 'pop' contains population numbers.
# The rest of the datasets contain numbers relating to problems with crime, education, employment,
# health, and income. The datasets contain proportions in them, such that values closer to 1 indicate
# that the council area is more troubled, while values closer to 0 indicate that the council area is
# less troubled in that particular way.
# P.S. If you haven't figured out already, any time a hash symbol begins a line, it means that I'm
# writing a comment to you, rather than writing out code.
# Loading all the datasets
main = read.csv("2012-drugs-related-cx.csv")
pop = read.csv("scotland pop by ca.csv")
crime = read.csv("most_deprived_datazones_by_council_(crime)_2012.csv")
edu = read.csv("most_deprived_datazones_by_council_(education)_2012.csv")
emp = read.csv("most_deprived_datazones_by_council_(employment)_2012.csv")
health = read.csv("most_deprived_datazones_by_council_(health)_2012.csv")
income = read.csv("most_deprived_datazones_by_council_(income)_2012.csv")
# Indexing the data
# Merging other relevant data with the main dataset
main = merge(main, pop[,c(2,3)], by.x="Council.area", by.y="Council.area", all.x=TRUE)
main = merge(main, crime[,c(1,4)], by.x="Council.area", by.y="label", all.x=TRUE)
main = merge(main, edu[,c(1,4)], by.x="Council.area", by.y="label", all.x=TRUE)
main = merge(main, emp[,c(1,4)], by.x="Council.area", by.y="label", all.x=TRUE)
main = merge(main, health[,c(1,4)], by.x="Council.area", by.y="label", all.x=TRUE)
main = merge(main, income[,c(1,4)], by.x="Council.area", by.y="label", all.x=TRUE)
# Weighting the number of drug related deaths by the population of each council area
main$All.drug.related.deaths_perTenK = (main$All.drug.related.deaths / (main$Population/10000))
# A histogram of the number of drug related deaths per 10,000 people
hist(main$All.drug.related.deaths_perTenK, col="royal blue")
# A Simple scatterplot
plot(All.drug.related.deaths_perTenK ~ prop_income, data=main)
# A Scatterplot matrix
pairs(~All.drug.related.deaths_perTenK + Latitude + Longitude + prop_crime + prop_education + prop_employment + prop_income + prop_health, data=main)
# Summary stats of all the variables in the dataset
# Simple summary stats of one variable at a time
# Here we do a median split of the longitudes of the council areas, resulting in an 'east' and 'west' group
main$LongSplit = cut(main$Longitude, breaks=quantile(main$Longitude, c(0,.5,1)), include.lowest=TRUE, right=FALSE, ordered_result=TRUE, labels=c("East", "West"))
# Let's examine the number of records that result in each group:
# Now we do a median split of the latitudes of the council areas, resulting in a 'north' and 'south' group
main$LatSplit = cut(main$Latitude, breaks=quantile(main$Latitude, c(0,.5,1)), include.lowest=TRUE, right=FALSE, ordered_result=TRUE, labels=c("South", "North"))
# Now let's summarise some of the statistics according to our north-south east-west dimensions:
data_source = collect(main)
grouping_factors = group_by(source_df, LongSplit, LatSplit)
deaths_by_area = summarise(grouping_factors, median.deathsptk = median(All.drug.related.deaths_perTenK),
median.crime = median(prop_crime), median.emp = median(prop_employment),
median.edu = median(prop_education), num.council.areas = length(All.drug.related.deaths_perTenK))
# Examine the summary table just created
# Now we'll make some fun plots of the summary table
qplot(LongSplit, median.deathsptk, data=deaths_by_area, facets=~LatSplit, geom="bar", stat="identity", fill="dark red",main="Median Deaths/10,000 by Area in Scotland") + theme(legend.position="none")
qplot(LongSplit, median.crime, data=deaths_by_area, facets=~LatSplit, geom="bar", stat="identity", fill="dark red",main="Median Crime Score by Area in Scotland") + theme(legend.position="none")
qplot(LongSplit, median.emp, data=deaths_by_area, facets=~LatSplit, geom="bar", stat="identity", fill="dark red",main="Median Unemployment Score by Area in Scotland") + theme(legend.position="none")
qplot(LongSplit, median.edu, data=deaths_by_area, facets=~LatSplit, geom="bar", stat="identity", fill="dark red",main="Median Education Problems Score by Area in Scotland") + theme(legend.position="none")
# ****Some Online R Resources****
# http://www.r-bloggers.com
# This is a website that aggregates the posts of people who blog about R (myself included, from time to time). The site has been up for several years now, and boasts a total blog count of over 5,000 R Bloggers! If something about R has been said anywhere, it's been said on this site!
# http://r.789695.n4.nabble.com/R-help-f789696.html
# The R-help listserv contains a lot of emails people have sent asking just about everything about R! Look through and see if your question is answered there.
# http://www.introductoryr.co.uk/R_Resources_for_Beginners.html
# This page contains a lot of online books about R that will more than help get you started!
# http://stackoverflow.com/questions/tagged/r
# Stackoverflow is a great website to go to when you want to know which answers people like the best to pressing questions about R, amongst other things (the best get 'up'voted by more people, the worst…..well…)

view raw


hosted with ❤ by GitHub

Here’s the data:


Teaching a Class of Undergrads, RStudio Server, and My Ubuntu Machine

I was chatting about public speaking with my brother, who is a Lecturer in the Faculty of Pharmacy at UofT, when he offered me the opportunity to come to his class and teach about R.  Always eager to spread the analytical goodness, I said yes!  The class is this Friday, and I am excited.

For this class I’ll be making use of RStudio Server, rather than having to get R onto some 30 individual machines.  Furthermore, I’ll be using an installation of RStudio Server on my own home machine.  It gives me more control and the convenience of configuring things late at night when I have the time to.

While playing around with the server on my computer (connecting via my own browser) I noticed that for each user you create, a new package library gets built.  That’s too bad as it relates to this class, because it would be neat for everyone to be able to make use of additional packages like ggplot2, dplyr and such, but this is an extremely beginner class anyway.

I’ve signed up for a dynamic dns host name from no-ip.com, and have set the port forwarding on my router accordingly, so that seems to be working just fine.  I just hope that nothing goes wrong.  I need to remember to create enough accounts on my ubuntu machine to accommodate all the students, which will be a small pain in the you-know-what, but oh well.

As for the data side of things, I’ve compiled some mildly interesting data on drug-related deaths by council area in scotland, geographical coordinates, and levels of crime, employment, education, income and health.  I only have an hour, so we’ll see how much I can cover!  Wish me luck.  If you have any advice, I’d be happy to hear it.  I’ve already been told to start with graphics 🙂