Assorted links – Data Science with R


last updated: 2015-08-29

References & Most helpful commands

Tutorials & Handy packages

Hands-on dplyr tutorial for faster data manipulation in R
Interactive Visualizations From R Using rCharts
rMaps – Interactive Maps from R (GitHub repo) (requires “devtools” from CRAN)
Using R for Psychological Research – Personality Project, William Revelle
DataCamp courses
Try R by Code School (on codeschool)
Introduction to R, Leada

Visualization Packages

see Assorted links – Data Visualization (to be published later)

Papers

Tidy Data, Hadley Wickham [PDF]

Journals

Big Data & Society – Open-access journal

Hacks for better productivity

Sublime and R

Using Sublime Text 2 for R
Using R in Sublime Text 3

Books

Video (training) courses

Introduction to Data Science with R, Garrett Grolemund, O’Reilly Media

Lists of Resources by others

Data Mining

Scraping Twitter and Web Data Using R – Pablo Barbera

Numerical Analysis
Interoperability
Data Sources

see Assorted links – Data sources (To be published later)

If you’d like to contribute to this list, please leave your suggestions in the comments below.


R language Development 1997-2015


The open-source world keeps surprising me. It is remarkable how people spread across the world meet and collaborate on open-source projects and build products that surpass commercially available ones. One such example is the development of the R statistical programming language.

Watch the video below and observe how its development since 1997 is similar to the work of ants and bees constructing colonies and hives.

Rscript to customize the R environment


A while ago I published a post on how to install some basic packages in R. This post goes further by sharing with you an Rscript (as part of another Ubuntu customization script) to install many popular R packages.

I’ve written the Rscript to be run after a fresh installation of Ubuntu. The Rscript is called by the Ubuntu customization script (yet to be published) and should install some basic and popular R packages.

Below is a Gist. For the repo click here.
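
The Gist itself is not reproduced on this page. As a rough sketch of the idea only, a script of this kind might check which packages are missing and install just those. The package list, the helper names and the CRAN mirror URL below are illustrative assumptions, not the contents of the actual script:

```r
# Hypothetical package list; the real script's list may differ.
wanted <- c("ggplot2", "plyr", "dplyr")

# Return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

# Install only what is missing (the mirror URL is an assumption).
install_missing <- function(pkgs, repos = "https://cloud.r-project.org") {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install, repos = repos)
  }
  invisible(to_install)
}

# Usage (uncomment to actually install):
# install_missing(wanted)
```

Checking first keeps the script cheap to re-run, since packages that are already present are skipped.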

R – Labels inside ggplots using directlabels


The other day I generated the following figure with ggplot2

[Figure: the plot with a legend]

using the following code:

ggplot(dat, aes(x = Year, y = log10(sum), group = id, colour = id)) +
 geom_point() +
 geom_smooth() +
 labs(x = "", y = "")

Note that I mapped the “group” aesthetic to id so that each id gets its own curve on the same figure. Similarly, I mapped the “colour” aesthetic to id to give each curve a different colour.
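
The data frame dat is not shown in this post. To try the code, a stand-in with the same column names (Year, sum, id) can be simulated. The values, the two ids "A" and "B" and the year range below are made up purely for illustration:

```r
set.seed(1)                       # make the simulated noise reproducible
years <- 1990:2010                # assumed year range
n <- length(years)

# Two series ("A" and "B") that grow roughly exponentially, with noise,
# so that log10(sum) gives two roughly linear curves.
dat <- data.frame(
  Year = rep(years, times = 2),
  id   = rep(c("A", "B"), each = n),
  sum  = c(10^(1 + 0.05 * seq_len(n)), 10^(2 + 0.03 * seq_len(n))) *
         exp(rnorm(2 * n, sd = 0.1)),
  stringsAsFactors = FALSE
)

str(dat)
```

With this dat in the workspace, the ggplot() call above should draw one smoothed curve per id.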

But instead of a legend I wanted to have labels on or near the curves. To do that I resorted to the “directlabels” package.

First I installed its dependency “quadprog”, then “directlabels” itself, and loaded ggplot2 and directlabels:

install.packages("quadprog") # dependency for directlabels
install.packages("directlabels", repos = "http://r-forge.r-project.org")

library(ggplot2)
library(directlabels) # load "directlabels"

To plot the figure, I built the plot as before, but this time I assigned it to “p” and passed that to direct.label().

p <- ggplot(dat, aes(x = Year, y = log10(sum), group = id, colour = id)) +
 geom_point() +
 geom_smooth() +
 labs(x = "", y = "")
direct.label(p)

As you can see the direct.label() function took care of the legend and replaced it with labels on the curves:

[Figure: the same plot with labels on the curves instead of a legend]

This is a really useful package.

If you found this post helpful please give it a like or share it somewhere in the digital universe.

R – Merging Two Data Frames Without Messing with the Rows’ Order


The merge() function joins two data sets on a common variable (column).

It takes several arguments that make it flexible, but it lacks one important feature: preserving the order of the rows. Even with sort = FALSE, the rows of the result come back in an unspecified order.

There are two popular workarounds that produce a merged data set with the same row order as one of the original data sets.

The first is an ad-hoc solution.

This solution depends on an extra id column that is used to re-order the merged data set.
Consider two data frames df_1 and df_2 with a common column “label”. The process goes as follows:

# create a new variable (column) & assign each element an "id"
# from 1 to the number of rows of df_1
df_1$id <- 1:nrow(df_1) 

# merge the two data frames using the label column without sorting
merged <- merge(x = df_1, y = df_2, by = "label", sort = FALSE)

# order the merged data set using "id" & assign it to "ordered"
ordered <- merged[order(merged$id), ]

The resulting data set is the two data sets merged, with the rows in the same order as in the original data set df_1.
Of course, you can add the “id” column to whichever data frame's order you want to keep.
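
As a small worked example of the steps above, here are two toy data frames. The data are made up for illustration, and dropping the helper column at the end is an optional extra step:

```r
# Toy data: df_1 is deliberately not in alphabetical order of "label".
df_1 <- data.frame(label = c("c", "a", "b"), x = c(3, 1, 2),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(label = c("a", "b", "c"), y = c(10, 20, 30),
                   stringsAsFactors = FALSE)

# Remember the original row positions of df_1.
df_1$id <- seq_len(nrow(df_1))

# Merge on "label", then restore df_1's order using the helper column.
merged  <- merge(x = df_1, y = df_2, by = "label", sort = FALSE)
ordered <- merged[order(merged$id), ]

# Optional clean-up: drop the helper column and reset the row names.
ordered$id <- NULL
rownames(ordered) <- NULL

print(ordered)   # rows follow df_1's order: c, a, b
```

Note that merge() performs an inner join by default, so rows of df_1 with no match in df_2 are dropped unless all.x = TRUE is passed.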

The second option is the join() function from the plyr package.

This function keeps the row order of its first argument, which merge() does not, but it is not as feature-rich as merge(). It is just as simple to use. Do not forget to load the plyr library:

library(plyr)
merged_and_ordered <- join(df_1, df_2)
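
For a concrete illustration, the sketch below runs join() on two made-up data frames and spells out the by and type arguments. When they are omitted, join() uses all common columns and a left join:

```r
library(plyr)

# Toy data (made up): df_1 is not in alphabetical order of "label".
df_1 <- data.frame(label = c("c", "a", "b"), x = c(3, 1, 2),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(label = c("a", "b", "c"), y = c(10, 20, 30),
                   stringsAsFactors = FALSE)

# Left join on "label"; the result keeps the row order of df_1.
merged_and_ordered <- join(df_1, df_2, by = "label", type = "left")

print(merged_and_ordered)   # rows follow df_1's order: c, a, b
```

Naming the key with by = "label" avoids accidentally joining on other columns the two data frames happen to share.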

Reference: R – Merging two data frames while keeping the original row order, StackOverflow

Installing Some Basic R Packages in Ubuntu


The following is how I configured my R workspace (and RStudio). I first shared it on the forums of Coursera’s “Getting and Cleaning Data” course.

First make sure that R is version 3 or later. If not, update it according to this Stack Overflow question.
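
One way to check the version from within R itself is sketched below. The variable name is illustrative:

```r
# Print the full version string of the running R.
print(R.version.string)

# TRUE when the running R is version 3.0.0 or newer.
is_r3_or_newer <- getRversion() >= "3.0.0"
print(is_r3_or_newer)
```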

Java for rJava

Install Java (needed for rJava) first from a terminal:

sudo apt-get install openjdk-6-jre

which installs the OpenJDK 6 runtime (openjdk-6-jre).
If this doesn’t work, install all of the OpenJDK 6 packages:

sudo apt-get install openjdk-6-*

Or, if you prefer OpenJDK 7, install its packages instead:

sudo apt-get install openjdk-7-*

You should find that it is installed using this command:

Continue reading