Regular Expression Tutorial 2: Commands in R

The second part of the tutorial for regular expression will cover common commands used in R together with regular expression. Once you know how to write a regular expression to match a string, you may want to manipulate strings such as deletion or replacing. Here is the list of string matching &manipulation commands commonly used with regular expressions in R. These commands also appear in many other languages.

Command          Function
grep( )          Return index of the object 
                 where reg exp found the string
grepl( )         Return logical values for reg exp 
                 matching 
regexpr( )       Return the first position of found
                 string by reg exp
gregexpr( )      Return all positions of found string
                 by regexp
sub( )           Substitute a pattern with a given string
                 (first occurrence only)
gsub( )          Globally substitute a pattern with a 
                 given string (all occurrences) 
substr( )        Return the substring in the giving 
                 character positions (start and stop)
                 in given string
strsplit( )      Split the input string into parts 
                 based on another string (character)
regexec( )       Return the first position of matched 
                 pattern in a given string
regmatches ( )   Extract or replace matched substrings
                 from match data obtained by gregexpr,
                 or regexec

Find & Display Matching string: grep

grep(pattern,vector) 
>x<-c("abc","bcd","cde","def")
>grep("bc",x)
[1] 1 2

The first one is grep() command, which was originally created in Unix system. Its name came from globally search a regular expression and print. You see “bc” appears in the first two entries of x. grep() function returns indexes of the matched string. If you want to show the matched entries (not index), use value option or use square brackets.

>grep("bc",x,value=TRUE)
[1] "abc" "bcd"
>x[grep("bc",x)] 
[1] "abc" "bcd"

Show Matched Pattern Using Find & Replace

If you want to get only the matched pattern, it is kind of awkward but you can use the output above and remove the unmatched part (In linux, you just use grep -o).

First, sub function’s syntax is

sub("matching_string","replacing_string", input_vector)

This function works like “find and replace”. Using this to remove unmatched part.

> sub(".*(bc).*","\\1",grep("bc",x,value=TRUE))
[1] "bc" "bc"

Remember .* means any character with any length and \\1 means the matched string in the first parenthesis. In this case, you see only “bc”, but if you use regular expression for pattern, you will see different kind of matches found in the string.

Remove Matched String

If you want to return indexes of unmatched string, add invert option.

> grep("bc",x,invert=TRUE)
[1] 3 4

Combining with value option, you can remove matched string from the vector

> grep("bc",x,invert=TRUE, value=TRUE)
[1] "cde" "def"

If the search is not case sensitive,

> grep("BC",x,ignore.case=TRUE)
[1] 1 2

If you want to get logical returns for matches,

> grepl("bc",x)
[1]  TRUE  TRUE FALSE FALSE

Manipulating String with Matched String Position

To get the first position of the matched pattern in the string, regexpr() is used.

>y<-"Waikiki"
>regexpr("ki",y)
[1] 4
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

Since the first match occurs at 4th character in y, the first value returned is 4. If there is no match it will return -1.

If you want to get this value only,

> regexpr("ki",y)[1]
[1] 4

You see that regexpr() returns two attributes “match.length” and “useBytes”. These value can be accessed by

> attr(regexpr("ki",y),"match.length")
[1] 2
> attr(regexpr("ki",y),"useBytes")
[1] TRUE

If you want to get positions for all matches, use gregexpr()

> gregexpr("ki",y)
[[1]]
[1] 4 6
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE

To show the only values of positions, you need to use length function. It is a bit awkward but can be done.

>z<-gregexpr("ki",y)
> z[[1]][1:length(z[[1]])]
[1] 4 6

regexec() command works very similarly to regexpr(), however if there is parenthesized matching conditions, it will show both matched string position and the position of parenthesized matched string.

> regexec("kik",y)
[[1]]
[1] 4
attr(,"match.length")
[1] 3
> regexec("k(ik)",y)
[[1]]
[1] 4 5
attr(,"match.length")
[1] 3 2

To extract a substring from an input string, use substr()

substr(x,start, end)
>x<-"abcdef" 
>substr(x,3,5)
[1] "cde"

This function can also replace a substring in a string.

>substr(x,3,4)<-"XX
[1] "abXXef"

Another Way to Show Matched Strings Using regmatches()

I showed one way to list the matched string using sub() and grep() , you can do the same thing with regmatches together with regexpr() or regexec().
First, regexpr() gives you the position of the found string and the length of the mtached string in the input, you pass this information on to regmatches(). It will show all the matched strings from the input string. regexec() will show both matched substrings and matched substrings in the parenthesis.

> a<-"Mississippi contains a palindrome ississi."
> b<-gregexpr(".(ss)",a)
> c<-regexec(".(ss)",a)

> regmatches(a,b)
[[1]]
[1] "iss" "iss" "iss" "iss"

> regmatches(a,c)
[[1]]
[1] "iss" "ss"

The syntax of regmatches() is

regmatches(input, position&length)

Therefore, if you put position and length information of matched strings obtained from either gregexpr() or regexec() will be used to extract the matched string from the input. Note that regexec takes only the first match, you see only “iss” and “ss”.

Split Strings with Common Separator Using strplit Function

Suppose you have a date string “11/03/2031″ and want to extract the numbers “11″, “03″ and “2013″. Since the numbers are separated by the common character “/”, you can use strsplit function to do the job.

> strsplit("11/03/2013","/")
[[1]]
[1] "11"   "03"   "2013"

If you use “” for separator you can extract each character.

> strsplit("11/03/2013","")
[[1]]
 [1] "1" "1" "/" "0" "3" "/" "2" "0" "1" "3"

One thing you want to remember is when string starts with a separator, strsplit puts an empty character in the vector first.

> strsplit(".a.b.c","\\.")
[[1]]
[1] ""  "a" "b" "c"

If dot (.) is a separator, you need two backslashes for regular expression.