Reading in data from an external file

The data sets: test.txt, cars.txt, test_missing.txt, test_missing_comma.txt, test_fixed.txt, scan.txt

1. Reading in data from the console using the scan function

For very small data vectors it is sometimes handy to read in data directly from the prompt. This can be accomplished by calling the scan function with no file argument. The scan function reads fields of the type specified by the what option, with the default being numeric. If the what option is specified as what=character() or what=" ", then all the fields will be read as strings. If the data are a mix of numeric, string or complex data, then a list can be used in the what option. The default separator for the scan function is any white space (single space, tab, or new line). Because white space delimits the fields, you can enter data on separate lines. When all the data have been entered, press the enter key on an empty line (that is, hit enter twice) to terminate the scanning.

# Reading in numeric data
x <- scan()

1: 3 5 6 
4: 3 5 78 29
8: 34 5 1 78
12: 
Read 11 items

x

[1]  3  5  6  3  5 78 29 34  5  1 78

mode(x)

[1] "numeric"

# Reading in string data
# a quoted string in the what option indicates character input
y <- scan(what=" ")

1: red blue
3: green red 
5: blue yellow
7: 
Read 6 items

y

[1] "red"    "blue"   "green"  "red"    "blue"   "yellow"

mode(y)

[1] "character"
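
The same calls can be tried without typing at the prompt. The following is a minimal sketch that assumes the text argument of scan, which reads from a character string instead of the console; the input strings are made up for illustration.

```r
# Numeric input (the default for the what option); spaces and
# new lines both separate the fields.
x <- scan(text = "3 5 6\n3 5 78 29", quiet = TRUE)

# Character input, here with a comma as the separator instead of white space.
y <- scan(text = "red,blue,green", what = character(), sep = ",", quiet = TRUE)

mode(x)
mode(y)
```

The quiet=TRUE option only suppresses the "Read n items" message.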

2. Importing data files using the scan function

The scan function is an extremely flexible tool for importing data.  Unlike the read.table function, however, which returns a data frame, the scan function returns a list or a vector.  This makes the scan function less useful for inputting “rectangular” data such as the car data set that will be seen in later examples.  In the previous example we input first numeric data and then string data directly from the console; in the following example, we input the text file, scan.txt.  For the what option, we use list and then list the variables, and after each variable, we tell R what type of variable (e.g., numeric, string) it is.  In the first example, the first variable is age, and we tell R that age is a numeric variable by setting it equal to 0.  The second variable is called name, and it is denoted as a string variable by the empty quote marks.  In the second example, we list NULL first, indicating that we do not want the first variable to be read.  After using the scan function, we use the sapply function to find the length of each element of x and keep only the elements that are not empty, which drops the NULL placeholder left by the skipped variable.  The result is still a list; is.vector returns TRUE because a list is one kind of vector in R.

# inputting a text file and outputting a list
x <- scan("c:/scan.txt", what=list(age=0, name=""))

Read 4 records

x

$age
[1] 12 24 35 20

$name
[1] "bobby"   "kate"    "david"   "michael"

# using the same text file and keeping only the names (in a one-element list)
x <- scan("c:/scan.txt", what=list(NULL, name=character()))

Read 4 records

x <- x[sapply(x, length) > 0] 

x

$name
[1] "bobby"   "kate"    "david"   "michael"

is.vector(x)

[1] TRUE
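
The steps above can be sketched in a self-contained way. This example assumes that scan.txt holds one age and one name per line (reconstructed from the output shown above, so the exact file layout is a guess) and writes a stand-in copy to a temporary file. The object names rec, people and names_only are made up for illustration.

```r
# Stand-in for scan.txt: one record per line, age followed by name (assumed layout).
tmp <- tempfile(fileext = ".txt")
writeLines(c("12 bobby", "24 kate", "35 david", "20 michael"), tmp)

# A list in the what option gives one list element per field.
rec <- scan(tmp, what = list(age = 0, name = ""), quiet = TRUE)

# Because every element has the same length, the list converts to a data frame.
people <- as.data.frame(rec, stringsAsFactors = FALSE)

# NULL skips the first field; $name extracts a plain character vector.
names_only <- scan(tmp, what = list(NULL, name = character()), quiet = TRUE)$name

unlink(tmp)
```

Extracting the element with $name is an alternative to the sapply step when a character vector, rather than a one-element list, is wanted.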

3. Reading in free formatted data from an ASCII file using the read.table function

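The read.table function will let you read in any type of delimited ASCII file. It can read in both numeric and character values. By default it determines the type of each column from the data: columns that contain only numbers are read in as numeric, and columns that contain character data are read in as character vectors (or as factors in versions of R before 4.0.0, or when stringsAsFactors=TRUE). If a column comes in with the wrong type, it can be converted once the data have been read in, or the colClasses argument can be used to set the types when reading. For delimited files this is by far the easiest and most reliable method of entering data into R.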

# complete data, space delimited, variable names in first row
test <- read.table("c:/test.txt", header=TRUE)

test

   prgtype gender  id ses schtyp level 
1  general      0  70   4      1     1
2   vocati      1 121   4      2     1
3  general      0  86   4      3     1
4   vocati      0 141   4      3     1
5 academic      0 172   4      2     1
6 academic      0 113   4      2     1
7  general      0  50   3      2     1
8 academic      0  11   1      2     1
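
To see which type read.table assigned to each column, the classes can be inspected after reading. The following is a minimal sketch, assuming the text argument of read.table (which reads from a string rather than a file) and a three-line excerpt modeled on the data above.

```r
txt <- "prgtype gender id ses
general 0 70 4
vocati 1 121 4"

# Let read.table infer the column types.
d <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
sapply(d, class)

# Set the types explicitly instead, e.g. to keep gender as a string.
d2 <- read.table(text = txt, header = TRUE,
                 colClasses = c("character", "character", "integer", "integer"))
sapply(d2, class)
```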

The default delimiter in read.table is white space (one or more spaces or tabs), but this can create problems if there are missing data. With the default settings the function will not work unless every data line has the same number of values. If a missing value is simply left blank in a space-delimited file, that line will have fewer values than the others, and you will receive an error. The easiest way to fix this problem is to change the type of delimiter so that an empty field can still be recognized. In the read.table function the sep argument is used to specify the delimiter.

# showing the file with missing values, space delimited (test_missing.txt data file)
prgtype  gender  id ses schtyp  level
  general    0  70    4   1      1  
   vocati    1 121    4          1  
  general    0  86               1  
   vocati    0 141    4   3      1  
 academic    0 172    4   2      1  
 academic    0 113    4   2      1  
  general    0  50    3   2      1  
 academic    0  11    1   2      1

test.missing <- read.table("c:/test_missing.txt", header = TRUE)
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : 
line 2 did not have 6 elements
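
When this error appears, the count.fields function can help locate the lines that have the wrong number of values. The following is a sketch under the assumption of a small stand-in file written to a temporary location; its contents are made up to mimic the problem.

```r
# Stand-in for a space-delimited file with blank (missing) values.
tmp <- tempfile(fileext = ".txt")
writeLines(c("prgtype gender id ses schtyp level",
             "general 0 70 4 1 1",
             "vocati 1 121 4 1",
             "general 0 86 1"), tmp)

# Number of white-space-delimited fields on each line of the file.
n_fields <- count.fields(tmp)

# File lines whose field count differs from the header line.
bad_lines <- which(n_fields != n_fields[1])

unlink(tmp)
```

Here bad_lines counts lines of the file including the header, whereas the error message from read.table counts data lines.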

# showing the file with missing data, comma delimited (test_missing_comma.txt data file)
prgtype,  gender,  id, ses, schtyp,  level
  general,    0,  70,    4,   1,      1  
   vocati,    1, 121,    4,    ,      1  
  general,    0,  86,     ,    ,      1  
   vocati,    0, 141,    4,   3,      1  
 academic,    0, 172,    4,   2,      1  
 academic,    0, 113,    4,   2,      1  
  general,    0,  50,    3,   2,      1  
 academic,    0,  11,    1,   2,      1  

test.missing <- read.table("c:/test_missing_comma.txt", header = TRUE, sep = ",")

test.missing

    prgtype gender  id ses schtyp level 
1   general      0  70   4      1     1
2    vocati      1 121   4     NA     1
3   general      0  86  NA     NA     1
4    vocati      0 141   4      3     1
5  academic      0 172   4      2     1
6  academic      0 113   4      2     1
7   general      0  50   3      2     1
8  academic      0  11   1      2     1
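
After reading a file with missing values it is useful to check where the NA values ended up. A minimal sketch, assuming the text argument of read.table and a shortened, unpadded version of the comma delimited data:

```r
txt <- "prgtype,gender,id,ses,schtyp,level
general,0,70,4,1,1
vocati,1,121,4,,1
general,0,86,,,1"

d <- read.table(text = txt, header = TRUE, sep = ",")

# Empty fields in numeric columns are read as NA.
is.na(d$schtyp)

# Number of missing values in each column.
colSums(is.na(d))
```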

The read.table function is very useful when reading in ASCII files that contain rectangular data.  As mentioned above, the default delimiter is white space; other delimiters must be specified by using the sep option and setting it equal to the delimiter in quotes (e.g., sep=";" for a semicolon delimited data file).  Another very common type of file is the comma delimited file. The file test.csv has been saved out of Excel as a comma delimited file. This file can be read in by the read.table function by using the sep option, but it can also be read in by the read.csv function, which was written specifically for comma delimited files.  We use the print function to display the contents of the object test.csv just to show its use.

test.csv <- read.csv("c:/test.csv", header=TRUE)

print(test.csv)

   make   model mpg weight price 
1   AMC Concord  22   2930  4099
2   AMC   Pacer  17   3350  4749
3   AMC  Spirit  22   2640  3799
4 Buick Century  20   3250  4816
5 Buick Electra  15   4080  7827

test.csv1 <- read.table("c:/test.csv", header=TRUE, sep=",")

print(test.csv1)

   make   model mpg weight price 
1   AMC Concord  22   2930  4099
2   AMC   Pacer  17   3350  4749
3   AMC  Spirit  22   2640  3799
4 Buick Century  20   3250  4816
5 Buick Electra  15   4080  7827

It is, of course, also possible to use the read.table function for reading in files with other delimiters. The data file called testsemicolon.txt has semicolon delimiters, and the data file called testz.txt uses the letter z as a delimiter; both are acceptable delimiters in R.

test.semi <- read.table("c:/testsemicolon.txt", header=TRUE, sep=";")

print(test.semi)

   make   model mpg weight price 
1   AMC Concord  22   2930  4099
2   AMC   Pacer  17   3350  4749
3   AMC  Spirit  22   2640  3799
4 Buick Century  20   3250  4816
5 Buick Electra  15   4080  7827

test.z <- read.table("c:/testz.txt", header=TRUE, sep="z")

print(test.z)

   make   model mpg weight price 
1   AMC Concord  22   2930  4099
2   AMC   Pacer  17   3350  4749
3   AMC  Spirit  22   2640  3799
4 Buick Century  20   3250  4816
5 Buick Electra  15   4080  7827
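
The equivalence of these calls can be checked directly. The following is a minimal sketch, assuming the text argument of read.table (also accepted by read.csv) and a made-up three-line excerpt of the car data; the object names a, b and s are hypothetical.

```r
csv_txt  <- "make,model,mpg\nAMC,Concord,22\nAMC,Pacer,17"
# The same data with semicolons as the delimiter.
semi_txt <- gsub(",", ";", csv_txt, fixed = TRUE)

a <- read.csv(text = csv_txt)                              # header=TRUE and sep="," are the defaults
b <- read.table(text = csv_txt, header = TRUE, sep = ",")  # same file, explicit options
s <- read.table(text = semi_txt, header = TRUE, sep = ";") # semicolon delimited version

all.equal(a, b)
all.equal(a, s)
```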

4. Reading in fixed formatted files

We use the read.fwf function to read in data with fixed formats, and we use the widths argument to indicate the width (number of columns) of each variable. (The argument may be abbreviated to width, because R matches argument names partially.) In a fixed format file we do not have the names of the variables on the first line, and therefore they must be added after we have read in the data. We add the variable names using the dimnames function and the bracket notation to indicate that we are attaching names to the variables (columns) of the data file.  Please note that there are several different ways to accomplish this task; this is just one of them.

test.fixed <- read.fwf('c:/test_fixed.txt', widths=c(8, 1, 3, 1, 1, 1))

dimnames(test.fixed)[[2]] <- c("prgtyp", "gender", "id", "ses", "schtyp", "level")

test.fixed

    prgtyp gender  id ses schtyp level
1 general       0  70   4      1     1
2 vocati        1 121   4      2     1
3 general       0  86   4      3     1
4 vocati        0 141   4      3     1
5 academic      0 172   4      2     1
6 academic      0 113   4      2     1
7 general       0  50   3      2     1
8 academic      0  11   1      2     1
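
Two of the other ways to attach the variable names can be sketched as follows. The example assumes a made-up two-line fixed format file with five variables, written to a temporary file; strip.white=TRUE is passed through to read.table to drop the padding around the character values.

```r
# Stand-in fixed format file: columns 1-8, 9, 10-12, 13, 14.
tmp <- tempfile(fileext = ".txt")
writeLines(c("general 0 7041",
             "academic017242"), tmp)

w    <- c(8, 1, 3, 1, 1)
vars <- c("prgtype", "gender", "id", "ses", "schtyp")

# Option 1: read first, then assign the names with names().
d1 <- read.fwf(tmp, widths = w, strip.white = TRUE)
names(d1) <- vars

# Option 2: supply the names while reading, via col.names.
d2 <- read.fwf(tmp, widths = w, col.names = vars, strip.white = TRUE)

unlink(tmp)
```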

For fixed format files the variable names are often in a separate file from the data. In this example the variable names are in a file called names.txt and the data are in a file called testfixed.txt.  This is especially convenient when the fixed format file is very large and has many variables, because it then becomes rather impractical to type in all the variable names.  In this situation the widths option is used to specify the width of each variable and the col.names option supplies the variable names.  So, first we read in the file of names using the scan function.  We specify that the file contains character values by setting the what option equal to character().  Then, by using the col.names option in the read.fwf function, the object names supplies the variable names.

names <- scan("c:/names.txt", what=character() )

print(names)

[1] "model"  "make"   "mph"    "weight" "price" 

test.fixed <- read.fwf("c:/testfixed.txt", col.names=names, widths = c(5, 7, 2, 4, 4))

print(test.fixed)

  model    make mph weight price 
1   AMC Concord  22   2930  4099
2   AMC   Pacer  17   3350  4749
3   AMC  Spirit  22   2640  3799
4 Buick Century  20   3250  4816
5 Buick Electra  15   4080  7827
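
The two-file approach can be tried end to end with temporary files. This is a sketch only: the contents of both stand-in files are made up, and the object is called var_names here (a hypothetical name) so that it does not share a name with the built-in names function.

```r
names_file <- tempfile(fileext = ".txt")
data_file  <- tempfile(fileext = ".txt")

# Stand-in files: one line of variable names, two lines of fixed format data.
writeLines("make model mpg weight price", names_file)
writeLines(c("  AMCConcord2229304099",
             "  AMC  Pacer1733504749"), data_file)

# Read the variable names as a character vector.
var_names <- scan(names_file, what = character(), quiet = TRUE)

# Field widths 5, 7, 2, 4, 4; the names come from the vector read above.
cars <- read.fwf(data_file, widths = c(5, 7, 2, 4, 4),
                 col.names = var_names, strip.white = TRUE)

unlink(c(names_file, data_file))
```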

5. Exporting files using the write.table function

The write.table function outputs data files. The first argument specifies which data frame in R is to be exported. The next argument specifies the file to be created. The default separator is a blank space, but any separator can be specified in the sep option. The default value for both the row.names and col.names options is TRUE. In the example we specify that we do not wish to include row names. The default setting for the quote option is to include quotes around all the character values, i.e., around values in string variables and around the column names. As the example shows, it is very common not to want the quotes when creating a text file.

# using the test.csv data frame to write a text file with no row names 
# and without quotes around the character values (both column names and string variables)
write.table(test.csv, "c:/test1.txt", row.names=FALSE, quote=FALSE)
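
To check what write.table produced, the file can be read back in. A minimal sketch, assuming a small made-up data frame and a tab separator chosen for illustration:

```r
cars <- data.frame(make  = c("AMC", "Buick"),
                   model = c("Concord", "Century"),
                   mpg   = c(22, 20),
                   stringsAsFactors = FALSE)

out <- tempfile(fileext = ".txt")

# No row names, no quotes, tab as the separator.
write.table(cars, out, row.names = FALSE, quote = FALSE, sep = "\t")

# Raw lines of the file: a header line followed by one line per row.
lines <- readLines(out)

# Read the file back with matching options.
back <- read.table(out, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

unlink(out)
```

The sep and header options used for reading must match those used for writing; otherwise the columns will not line up.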

6. Exporting files in Stata 6/7 format using the write.dta function

The write.dta function is part of the foreign package and writes an R data frame to a Stata data file in either Stata 6 or 7 format. Although these are older versions of Stata, Stata has no difficulty reading files written in older versions.  (To download the foreign package, click on Packages in the menu bar at the top, click on Install package(s) from CRAN, and then scroll down in the menu until you find foreign.)  The function takes at least two arguments: the first is the data frame and the second is the name of the output Stata data file.  If you look at the help file for write.dta, you will see that the function writes out a Stata 6 data file, but there are comments and options for those using later versions of Stata.  In the example below, we use the anscombe data set that comes with R. The anscombe data set is already a data frame, which we confirm with the is.data.frame function.

library(foreign)
data(anscombe)

is.data.frame(anscombe)

[1] TRUE

anscombe

   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89

write.dta(anscombe, file="d:/data/anscombe.dta")

Now let’s see an example where the data are not yet a data frame. We can use the as.data.frame function to convert the data into a data frame.  Again, these data come with R.

data(WorldPhones)

is.data.frame(WorldPhones)

[1] FALSE

WorldPhones

     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076

phones_d <- as.data.frame(WorldPhones)

phones_d

     N.Amer Europe Asia S.Amer Oceania Africa Mid.Amer
1951  45939  21574 2876   1815    1646     89      555
1956  60423  29990 4708   2568    2366   1411      733
1957  64721  32510 5230   2695    2526   1546      773
1958  68484  35218 6662   2845    2691   1663      836
1959  71799  37598 6856   3000    2868   1769      911
1960  76036  40341 8220   3145    3054   1905     1008
1961  79831  43173 9053   3338    3224   2005     1076

is.data.frame(phones_d)

[1] TRUE

write.dta(phones_d, file="d:/data_stata8/phones.dta")
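
One way to confirm that an exported file is readable is to read it back with read.dta, also from the foreign package. The following is a sketch that assumes the foreign package is installed and writes to a temporary file rather than to the paths used above.

```r
library(foreign)

out <- tempfile(fileext = ".dta")

# Export the anscombe data frame that comes with R.
write.dta(anscombe, file = out)

# Read the Stata file back into R and compare with the original.
back <- read.dta(out)
dim(back)
all.equal(back$y1, anscombe$y1)

unlink(out)
```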

To give you an idea of what types of data can be read into R using the foreign package, part of the help file is shown below.

data.restore   Read an S3 Binary File
lookup.xport   Lookup Information on a SAS XPORT Format Library
read.dbf       Read a DBF File
read.dta       Read Stata binary files
read.epiinfo   Read Epi Info data files
read.mtp       Read a Minitab Portable Worksheet
read.octave    Read Octave Text Data Files
read.spss      Read an SPSS data file
read.ssd       Obtain a Data Frame from a SAS Permanent Dataset, via read.xport
read.systat    Obtain a Data Frame from a Systat File
read.xport     Read a SAS XPORT Format Library
write.dbf      Write a DBF File
write.dta      Write Files in Stata Binary Format
write.foreign  Write text files and code to read them.
