RMySQL Examples

Definitions – Databases, Queries, Relational Databases, SQL, MySQL

A database is a structured collection of records or data that is stored in a computer system. A database relies upon software to organize the storage of data.

A query is a line of text that a user sends to a database telling it what you want it to do. It is usually written in SQL (Structured query language).

Example of SQL query:

select * from books where price < 20.00 order by title;

The term “relational database” comes from “Relation”, a mathematical term loosely meaning “table”, and thus “relational” roughly means “based on tables”.

MySQL is a popular database management system within which you can set up and query databases. Many web applications use databases. Youtube, flickr, and wikipedia are all built on MySQL databases. A lot of biological databases (e.g. GO, ensembl) are already available in mysql database tables, or are easily input into a database for easy querying.

For example, lets say I have a database table with yeast genes and some information about them (created by parsing a text file from SGD):

 

Here's the first 5 rows of the table, as an example.

mysql> select id,chrom,start,end,strand,sys_id,gene_name from yeast_genes_07 limit 5;
+----+-------+-------+------+--------+-----------+-----------+
| id | chrom | start | end  | strand | sys_id    | gene_name |
+----+-------+-------+------+--------+-----------+-----------+
|  1 | 1     |   335 |  649 | W      | YAL069W   | NULL      |
|  2 | 1     |   538 |  792 | W      | YAL068W-A | NULL      |
|  3 | 1     |  1807 | 2169 | C      | YAL068C   | NULL      |
|  4 | 1     |  2480 | 2707 | W      | YAL067W-A | NULL      |
|  5 | 1     |  7236 | 9017 | C      | YAL067C   | SEO1      |
+----+-------+-------+------+--------+-----------+-----------+
5 rows in set (0.00 sec)

Using a database, I can slice and dice the information in this table using almost-english queries like this:

 

select * from yeast_genes_07 where strand = 'W' and start < 5000 and chrom = 5;

which means – show me all the information about genes whose strand is W, start is less than 5000, and are on chromosome 5.

We have a bunch of databases here at stowers. Additionally, MySQL is installed by default on pretty much any linux system I’ve ever run into.

 

RMySQL

RMySQL is a little tricky to install, depending on your version of R. Here’s some places to look for more info if you have trouble:http://www.phaget4.org/R/R_notes.html
http://tolstoy.newcastle.edu.au/R/e2/help/07/10/28442.html

I currently have it installed in R 2.4.0, but couldn’t get it to work in 2.5.1.

Download from:
CRAN (non-windows only)
Brian Ripley, windows versions

RMySQL allows you to access MySQL databases from within R.

Why would you want to do this?

Because writing SQL queries is intuitive and powerful. And a database is a good place for information that you want to access in different, flexible ways.

 

RMySQL Examples

This is actually how I most often use RMySQL, just to slurp in a whole database table at once. I do this because I like storing information in databases so it is easy to access from R, perl, php, or within mysql itself – Once a data set is in the database, I don’t have to worry about parsing a file anytime I want to look at it, I just grab it from the database without worry about delimiters or column order.

 

library(DBI)

library(RMySQL)

m<-dbDriver("MySQL");
con<-dbConnect(m,user='mcm',password='welcome323',host='omega',dbname='yeast');
res<-dbSendQuery(con, "select * from yeast_genes_07")
genes<- fetch(res, n = -1)

You can do whatever mysql query you want, though. You can grab only the parts of the table you’re interested in based on different column values.

res<-dbSendQuery(con, "select * from yeast_genes_07 where description like \"%phosphatase%\"")
phosphatase_genes<- fetch(res, n = -1)

res<-dbSendQuery(con, "select chrom,start,end,sys_id,strand from yeast_genes_07 where strand = \"C\" and chrom = 1")

c1_genes<- fetch(res, n = -1)

 

RODBC

You can use the RODBC package to connect to an excel file or access database. Download from: http://cran.r-project.org/web/packages/RODBC/index.html

 

RODBC-Excel example

Warning – the excel file has to have a pretty regular format or lots of glitches will pop up.

# efg, 6 August 2007
# modified by mcm, 4/14/2008

library(RODBC)

setwd("U:/mcm/presentations/RDisc/db/");
filename <-  "yeast_genes_07.xls"
channel <- odbcConnectExcel(filename)
sqlTables(channel)
d <- sqlFetch(channel,"yeast_genes_07", as.is=TRUE) #this is the sheet name
odbcClose(channel)
names(d)

# Any SQL query can be used.  (Note "$" on sheet name is needed here)
channel <- odbcConnectExcel(filename)

#pick some columns from the table

s <- sqlQuery(channel, "select chrom,start,end,sys_id from [yeast_genes_07$]" )
s

#pick only chromosome 3
s <- sqlQuery(channel, "select chrom,start,end,sys_id from [yeast_genes_07$] where chrom = 3" )
s

#chromosome 3 and strand W
s <- sqlQuery(channel, "select chrom,start,end,sys_id from [yeast_genes_07$] where chrom = 3 and strand = \'W\' " )
s

#usually, if you want to limit the number of rows you retrieve, you use limit. This should work, but doesn't.
s <- sqlQuery(channel, "select chrom,start,end,sys_id from [yeast_genes_07$] where chrom = 3 and strand = \'W\' limit 5")

s

#if you need to limit the number of rows you retrieve with RODBC, use the max argument.
s <- sqlQuery(channel, "select chrom,start,end,sys_id from [yeast_genes_07$] where chrom = 3 and strand = \'W\'" ,max=5)
s

odbcClose(channel)

 

RODBC-Access example

 

library(RODBC)
filename <- "S:/Bioinformatics/Software/Affy/Data/Mouse430_2/Mouse430_2.mdb"
connection <- odbcConnectAccess(filename)

sqlTables(connection)

# fetch the whole table (fairly slow)
codes  <- sqlFetch(connection,"Annotation")

# fetch only ProbeIDs, GeneID and Gene title (much faster) 
query <- "select [Probe Set ID], [Gene Title], [Gene Symbol] from Annotation" 
codes <- sqlQuery(connection, query)
odbcClose(connection)

filename <- "S:/Bioinformatics/Software/Affy/Data/MG_U74Av2/MG_U74Av2.mdb"
connection <- odbcConnectAccess(filename)

sqlTables(connection)

query <- "SELECT [Probe Set ID], [Gene Title], [Gene Symbol] FROM Annotation" 
d2 <- sqlQuery(connection, query)
dim(d2)

query <- "SELECT [Probe Set ID], [Gene Title], [Gene Symbol] FROM Annotation WHERE [Gene Symbol] = 'Tfrc'" 
d3 <- sqlQuery(connection, query, na.strings="NA", as.is=TRUE)
d3

# Use "%" as a wildcard, not the "*" allowed in Access
query <- "SELECT [Probe Set ID], [Gene Symbol], [RefSeq Transcript ID]

         FROM   Annotation
         WHERE  ([Gene Symbol] LIKE 'Tf%') AND ([RefSeq Transcript ID] LIKE 'NM_%')
         ORDER BY [Gene Symbol]"
d4 <- sqlQuery(connection, query, na.strings="NA", as.is=TRUE)
d4

odbcClose(connection)

 

Earl’s RODBC notes

 

DO NOT USE RODBC with Excel files unless the data are quite regular - no holes.  

Any character fields mixed with numeric fileds are likely to be changed to NAs.    

First row of Excel spreadsheet MUST have column headings.

[http://www.omegahat.org/RDCOMClient/ RDCOMClient] is a package you can alternatively use to connect to an excel file, but it's not SQL-based. You can also save an excel file as text, tab delimited and then read it into R with the read.table command.

 

R-database resources

R data import and export manual R database mailing list (or google with term r-sig-db).

 

Other MySQL resources

Pronunciation: “MySQL” is officially pronounced My Ess Q ell, not “My sequel”. Just FYI.
mysql.com
manuals
MySQL query browser (easy mysql interface)
mysql cheat sheet
another cheat sheet
and some more mysql queries