R Regular Expression – Zhuo Yao, Ph.D.

R provides several functions for regular-expression matching and replacement. The grep, grepl, regexpr and gregexpr functions search for matches, while sub and gsub perform replacement.

• grep(value = FALSE) returns an integer vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE).

>str <- c("Regular", "expression", "examples of R language")
>x <- grep("ex",str,value=FALSE)
>x

[1] 2 3

>x <- "line 4322: He is now 25 years old, and weights 130lbs";
>x <- grep("\\d",x)
>x

[1] 1
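The invert argument mentioned above is not exercised by these examples. The following is a minimal sketch of it, and of what grep returns when nothing matches; the variable names no_match and none are made up for illustration.

```r
str <- c("Regular", "expression", "examples of R language")

# Indices of the elements that do NOT contain "ex"
no_match <- grep("ex", str, invert = TRUE)
print(no_match)

# When no element matches, the result is a zero-length integer vector
none <- grep("zzz", str)
print(length(none))
```

Because a failed search gives an empty vector rather than an error, it is usually worth checking length() before using the result as an index.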

• grep(value = TRUE) returns a character vector containing the selected elements of x (after coercion, preserving names but no other attributes).

>x <- grep("ex",str,value=TRUE)
>x

[1] "expression" "examples of R language"
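The note about names being preserved can be seen with a named vector. This is a small sketch; the vector named and its element names a, b, c are invented for the example.

```r
# A named character vector (names are hypothetical)
named <- c(a = "Regular", b = "expression", c = "examples of R language")

# value = TRUE returns the matching elements and keeps their names
hits <- grep("ex", named, value = TRUE)
print(hits)
print(names(hits))
```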

• grepl returns a logical vector (match or not for each element of x).

>x <- grepl("ex",str)
>x
[1] FALSE  TRUE  TRUE
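Since grepl returns a logical vector, it can be used directly for subsetting. The sketch below also shows the effect of ignore.case; the names with_ex, lower_r and any_r are illustrative only.

```r
str <- c("Regular", "expression", "examples of R language")

# Logical indexing: keep only the elements that contain "ex"
with_ex <- str[grepl("ex", str)]
print(with_ex)

# Matching is case-sensitive by default
lower_r <- grepl("r", str)
print(lower_r)

# ignore.case = TRUE lets "r" also match the capital "R"
any_r <- grepl("r", str, ignore.case = TRUE)
print(any_r)
```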

• sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g. if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE).

>str <- c("Regular", "expression", "examples of R language")
>x <- sub("x.ress","",str)
>x

[1] "Regular" "eion" "examples of R language"

>x <- sub("x.+e","",str)
>x

[1] "Regular" "ession" "e"

>x <- "line 4322: He is now 25 years old, and weights 130lbs";
>x <- gsub("[[:digit:]]","",x)
>x

[1] "line : He is now  years old, and weights lbs"

>x <- "line 4322: He is now 25 years old, and weights 130lbs";
>x <- gsub("\\d+","",x)
>x

[1] "line : He is now  years old, and weights lbs"
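The examples above use gsub for the digit removal. As a sketch of the difference, sub replaces only the first match in each element, and a parenthesised group can be reused in the replacement as \\1. The variable names first_only and tagged are made up for this illustration.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# sub replaces only the first run of digits
first_only <- sub("\\d+", "", x)
print(first_only)

# gsub replaces every match; \\1 refers to the text captured by the group
tagged <- gsub("(\\d+)", "<\\1>", x)
print(tagged)
```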

• regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes.

>str <- c("Regular", "expression", "examples of R language")
>x <- regexpr("x*ress",str)
>x

[1] -1  4 -1
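The printed positions are only part of the result. A minimal sketch of reading the "match.length" attribute and of extracting the matched text with regmatches follows; the names m, starts, lens and matched are illustrative.

```r
str <- c("Regular", "expression", "examples of R language")
m <- regexpr("x*ress", str)

# Start positions, one per element (-1 means no match)
starts <- as.integer(m)
print(starts)

# Lengths of the matched text, stored as an attribute
lens <- attr(m, "match.length")
print(lens)

# regmatches returns the matched substrings, skipping elements without a match
matched <- regmatches(str, m)
print(matched)
```

Note that x* may match zero occurrences of x, which is why the match in "expression" starts at the "r" in position 4 rather than at the "x".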

• gregexpr returns a list of the same length as text, each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.

>str <- c("Regular", "expression", "examples of R language")
>x <- gregexpr("x*ress",str)
>x

[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 4
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
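A common use of gregexpr is to pull every match out of a string by combining it with regmatches. The following is a sketch under that assumption, reusing the sentence from the earlier examples; the names m, nums and values are made up.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# Locate every run of digits
m <- gregexpr("[0-9]+", x)

# regmatches returns a list with one element per input string
nums <- regmatches(x, m)[[1]]
print(nums)

# The extracted pieces are still character strings; convert if needed
values <- as.numeric(nums)
print(values)
```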

Function Syntax:

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)

sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)

gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
     fixed = FALSE, useBytes = FALSE)

regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
         fixed = FALSE, useBytes = FALSE)
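Two of the arguments in these signatures change how the pattern itself is interpreted. The sketch below illustrates fixed = TRUE (treat the pattern as a literal string) and perl = TRUE (use Perl-compatible syntax, here a lookbehind). The vector files and the other variable names are hypothetical.

```r
files <- c("data.csv", "dataXcsv", "notes.txt")

# As a regular expression, "." matches any character
as_regex <- grepl(".csv", files)
print(as_regex)

# With fixed = TRUE the pattern is matched literally
as_literal <- grepl(".csv", files, fixed = TRUE)
print(as_literal)

# perl = TRUE enables Perl-style constructs such as lookbehind:
# insert a space before "lbs" only when it directly follows a digit
spaced <- gsub("(?<=\\d)lbs", " lbs", "weights 130lbs", perl = TRUE)
print(spaced)
```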

Regular Expression Syntax:

Syntax	Description
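\\d	Digit: 0, 1, 2, … 9
\\D	Any non-digit character
\\s	Whitespace character (space, tab, newline, ...)
\\S	Any non-whitespace character
\\w	Word character (letter, digit or underscore)
\\W	Any non-word character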
\\t	Tab
\\n	New line
^	Beginning of the string
$	End of the string
\\	Escape a special character (written as in an R string), e.g. \\\\ matches a literal \, \\+ matches a literal +
|	Alternation, e.g. (e|d)n matches "en" and "dn"
.	Any single character; with perl = TRUE a newline is not matched by default
[ab]	a or b
[^ab]	Any character except a and b
[0-9]	All Digit
[A-Z]	All uppercase A to Z letters
[a-z]	All lowercase a to z letters
[A-Za-z]	All uppercase and lowercase letters A to Z (avoid [A-z], which in ASCII order also spans [ \ ] ^ _ and `)
i+	i at least one time
i*	i zero or more times
i?	i zero or 1 time
i{n}	i occurs exactly n times in sequence
i{n1,n2}	i occurs between n1 and n2 times in sequence
i{n1,n2}?	Non-greedy (minimal) match: a trailing ? makes a quantifier match as few repetitions as possible
i{n,}	i occurs n or more times
[:alnum:]	Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:]	Alphabetic characters: [:lower:] and [:upper:]
[:blank:]	Blank characters: e.g. space, tab
[:cntrl:]	Control characters
[:digit:]	Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]	Graphical characters: [:alnum:] and [:punct:]
[:lower:]	Lower-case letters in the current locale
[:print:]	Printable characters: [:alnum:], [:punct:] and space
[:punct:]	Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]	Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:]	Upper-case letters in the current locale
[:xdigit:]	Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
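A few rows of the table are easier to follow with a short example. Note that the POSIX classes such as [:digit:] only work inside a bracket expression, so they are written [[:digit:]] in a pattern. The sketch below also contrasts a greedy and a non-greedy quantifier; the input strings and variable names are invented for illustration.

```r
# POSIX classes go inside a bracket expression: [[:digit:]], not [:digit:]
has_digit <- grepl("[[:digit:]]", c("abc", "a1c"))
print(has_digit)

# Remove all punctuation
no_punct <- gsub("[[:punct:]]", "", "Hello, world!")
print(no_punct)

# {n} with the ^ and $ anchors: exactly three digits and nothing else
three <- grepl("^[0-9]{3}$", c("130", "4322", "25"))
print(three)

# Greedy vs non-greedy matching on a hypothetical tagged string
tagged <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs from the first "<" to the last ">", so everything is removed
greedy <- sub("<.+>", "", tagged)
print(greedy)

# Non-greedy: .+? stops at the nearest ">", so only the tags are removed
lazy <- gsub("<.+?>", "", tagged)
print(lazy)
```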