R has various functions for regular-expression-based matching and replacement. The grep, grepl, regexpr and gregexpr functions search for matches, while sub and gsub perform replacement.
• grep(value = FALSE) returns an integer vector of the indices of the elements of x that yielded a match (or did not, for invert = TRUE).
> str <- c("Regular", "expression", "examples of R language")
> x <- grep("ex", str, value = FALSE)
> x
[1] 2 3
> x <- "line 4322: He is now 25 years old, and weights 130lbs"
> x <- grep("\\d", x)
> x
[1] 1
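The invert = TRUE case mentioned above has no example in the text, so here is a minimal sketch of it. It reuses the example vector from this section; the variable names hits, misses and none are chosen only for illustration.

```r
# Example vector reused from the text above.
str <- c("Regular", "expression", "examples of R language")

# Indices of the elements that contain "ex".
hits <- grep("ex", str)

# invert = TRUE flips the selection: indices of the elements WITHOUT a match.
misses <- grep("ex", str, invert = TRUE)

# When nothing matches, grep returns a zero-length integer vector.
none <- grep("zzz", str)

print(hits)          # 2 3
print(misses)        # 1
print(length(none))  # 0
```

Because a failed search gives a zero-length vector rather than NA, checking length(none) > 0 is a simple way to test whether anything matched.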
• grep(value = TRUE) returns a character vector containing the selected elements of x (after coercion, preserving names but no other attributes).
> x <- grep("ex", str, value = TRUE)
> x
[1] "expression"             "examples of R language"
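To see what "preserving names" means in practice, the sketch below runs the same search on a named vector. The names a, b and c are hypothetical labels added only for this illustration.

```r
# Hypothetical named vector; the names a, b, c are made up for this example.
words <- c(a = "Regular", b = "expression", c = "examples of R language")

# value = TRUE returns the matching elements themselves, names included.
matched <- grep("ex", words, value = TRUE)

print(matched)
print(names(matched))  # "b" "c"
```

The values returned are the same ones you would get by indexing with the integer result, i.e. words[grep("ex", words)].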
• grepl returns a logical vector (match or not for each element of x).
> x <- grepl("ex", str)
> x
[1] FALSE  TRUE  TRUE
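Since grepl returns one logical value per element, its result can be used directly for subsetting and counting. The following is a small sketch along those lines; it also shows the ignore.case argument. The variable names are illustrative only.

```r
str <- c("Regular", "expression", "examples of R language")

# One TRUE/FALSE per element of str.
has_ex <- grepl("ex", str)

# Logical subsetting keeps only the matching elements.
kept <- str[has_ex]

# TRUE counts as 1, so sum() gives the number of matching elements.
n_hits <- sum(has_ex)

# Matching is case-sensitive by default; ignore.case = TRUE relaxes that.
lower_r <- grepl("r", str)                      # TRUE TRUE FALSE
any_r   <- grepl("r", str, ignore.case = TRUE)  # TRUE TRUE TRUE

print(kept)    # "expression" "examples of R language"
print(n_hits)  # 2
```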
• sub and gsub return a character vector of the same length and with the same attributes as x (after possible coercion to character). sub replaces only the first match in each element, while gsub replaces every match. Elements of character vectors x which are not substituted will be returned unchanged (including any declared encoding). If useBytes = FALSE, a non-ASCII substituted result will often be in UTF-8 with a marked encoding (e.g. if there is a UTF-8 input, and in a multibyte locale unless fixed = TRUE).
> str <- c("Regular", "expression", "examples of R language")
> x <- sub("x.ress", "", str)
> x
[1] "Regular"                "eion"                   "examples of R language"
> x <- sub("x.+e", "", str)
> x
[1] "Regular" "ession"  "e"
> x <- "line 4322: He is now 25 years old, and weights 130lbs"
> x <- gsub("[[:digit:]]", "", x)
> x
[1] "line : He is now  years old, and weights lbs"
> x <- "line 4322: He is now 25 years old, and weights 130lbs"
> x <- gsub("\\d+", "", x)
> x
[1] "line : He is now  years old, and weights lbs"
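The examples above replace matches with an empty string, which hides the difference between sub and gsub and between \\d and \\d+. The sketch below uses a visible replacement character ("#", chosen arbitrarily) on the same input to make both differences easier to see.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# sub replaces only the first match in the string.
first_only <- sub("\\d+", "#", x)

# gsub replaces every match; \\d+ treats each run of digits as one match.
all_runs <- gsub("\\d+", "#", x)

# Without the +, every single digit is a separate match.
per_digit <- gsub("\\d", "#", x)

print(first_only)  # "line #: He is now 25 years old, and weights 130lbs"
print(all_runs)    # "line #: He is now # years old, and weights #lbs"
print(per_digit)   # "line ####: He is now ## years old, and weights ###lbs"
```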
• regexpr returns an integer vector of the same length as text giving the starting position of the first match or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match). The match positions and lengths are in characters unless useBytes = TRUE is used, when they are in bytes.
> str <- c("Regular", "expression", "examples of R language")
> x <- regexpr("x*ress", str)
> x
[1] -1  4 -1
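The start positions and the "match.length" attribute described above are usually combined to pull out the matched text. Here is one way this might look, first by hand with substring and then with the base R helper regmatches. The variable names are illustrative.

```r
str <- c("Regular", "expression", "examples of R language")
m <- regexpr("x*ress", str)

# Start positions (-1 where there is no match); as.integer drops the attributes.
start <- as.integer(m)

# Match lengths are stored in the "match.length" attribute.
len <- attr(m, "match.length")

# Extract the matched text by hand for the elements that matched.
found <- start > 0
by_hand <- substring(str[found], start[found], start[found] + len[found] - 1)

# regmatches does the same and returns only the elements that matched.
via_helper <- regmatches(str, m)

print(start)       # -1  4 -1
print(by_hand)     # "ress"
print(via_helper)  # "ress"
```

Note that x* may match zero occurrences of "x", which is why the match in "expression" starts at "r" (position 4) rather than at "x".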
• gregexpr returns a list of the same length as text, each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.
> str <- c("Regular", "expression", "examples of R language")
> x <- gregexpr("x*ress", str)
> x
[[1]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 4
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[3]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE
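The example above has at most one match per element, so it does not show what gregexpr adds over regexpr. As a sketch, the code below searches the digit example from earlier in this section, where one string contains several matches, and extracts them with regmatches.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# One list element per input string; x has length 1, so we look at m[[1]].
m <- gregexpr("\\d+", x)

# Start position and length of every run of digits in the string.
starts <- as.integer(m[[1]])
lens   <- attr(m[[1]], "match.length")

# regmatches returns a list with the matched substrings for each input string.
numbers <- regmatches(x, m)[[1]]

print(starts)              # 6 22 48
print(lens)                # 4 2 3
print(numbers)             # "4322" "25" "130"
print(as.numeric(numbers)) # 4322 25 130
```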
Function Syntax:
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
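The shared arguments in these signatures change how the pattern is interpreted. The sketch below illustrates fixed, ignore.case and perl with grepl; the same arguments behave the same way in the other functions. The files vector is a hypothetical input made up for this example.

```r
# Hypothetical file names, used only for illustration.
files <- c("a.txt", "abtxt", "A.TXT")

# Default: the pattern is a regular expression, so "." matches any character.
as_regex <- grepl(".txt", files)

# fixed = TRUE: the pattern is matched as a literal string.
as_literal <- grepl(".txt", files, fixed = TRUE)

# Escaping the dot also makes it literal; ignore.case = TRUE ignores case.
any_case <- grepl("\\.txt", files, ignore.case = TRUE)

# perl = TRUE switches to Perl-compatible regular expressions.
perl_style <- grepl("^\\w+\\.txt$", files, perl = TRUE)

print(as_regex)    # TRUE  TRUE FALSE
print(as_literal)  # TRUE FALSE FALSE
print(any_case)    # TRUE FALSE  TRUE
print(perl_style)  # TRUE FALSE FALSE
```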
Regular Expression Syntax:
Syntax | Description |
\\d | Digit, 0,1,2 … 9 |
\\D | Not Digit |
\\s | Whitespace character |
\\S | Not a whitespace character |
\\w | Word character (letter, digit or underscore) |
\\W | Not a word character |
\\t | Tab |
\\n | New line |
^ | Beginning of the string |
$ | End of the string |
\ | Escape special characters, e.g. \\ matches "\" and \+ matches "+" (inside an R string these are written "\\\\" and "\\+") |
| | Alternation match, e.g. (e|d)n matches "en" and "dn" |
. | Any single character (with perl = TRUE, any character except a newline) |
[ab] | a or b |
[^ab] | Any character except a and b |
[0-9] | All Digit |
[A-Z] | All uppercase A to Z letters |
[a-z] | All lowercase a to z letters |
[A-Za-z] | All uppercase and lowercase letters A to Z (in ASCII order, [A-z] would also match the characters [ \ ] ^ _ ` that lie between Z and a) |
i+ | i at least one time |
i* | i zero or more times |
i? | i zero or 1 time |
i{n} | i occurs n times in sequence |
i{n1,n2} | i occurs n1 to n2 times in sequence |
i{n1,n2}? | Non-greedy match: as few repetitions as possible |
i{n,} | i occurs >= n times |
[:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
[:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
[:blank:] | Blank characters: e.g. space, tab |
[:cntrl:] | Control characters |
[:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
[:graph:] | Graphical characters: [:alnum:] and [:punct:] |
[:lower:] | Lower-case letters in the current locale |
[:print:] | Printable characters: [:alnum:], [:punct:] and space |
[:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ |
[:space:] | Space characters: tab, newline, vertical tab, form feed, carriage return, space |
[:upper:] | Upper-case letters in the current locale |
[:xdigit:] | Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f |
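A few entries from the table can be tried out as follows. This is only a sketch on the example string used earlier; note that the POSIX classes must be written inside a bracket expression, e.g. [[:punct:]] rather than [:punct:]. The non-greedy example uses perl = TRUE, which is a choice made here for predictability, not a requirement stated in the text.

```r
x <- "line 4322: He is now 25 years old, and weights 130lbs"

# Anchors: ^ matches at the beginning of the string, $ at the end.
starts_with_line <- grepl("^line", x)
ends_with_lbs    <- grepl("lbs$", x)

# POSIX class inside a bracket expression: remove all punctuation.
no_punct <- gsub("[[:punct:]]", "", x)

# Counted repetition: the first run of three or more digits.
long_number <- regmatches(x, regexpr("[0-9]{3,}", x))

# Alternation: "en" or "dn", but not "an".
alt <- grepl("(e|d)n", c("en", "dn", "an"))

# Greedy vs non-greedy: {2,4} takes as many as it can, {2,4}? as few.
greedy     <- sub("a{2,4}", "", "aaaaa", perl = TRUE)
non_greedy <- sub("a{2,4}?", "", "aaaaa", perl = TRUE)

print(no_punct)     # "line 4322 He is now 25 years old and weights 130lbs"
print(long_number)  # "4322"
print(alt)          # TRUE  TRUE FALSE
print(greedy)       # "a"
print(non_greedy)   # "aaa"
```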