The second part of this regular expression tutorial covers the R commands commonly used together with regular expressions. Once you know how to write a regular expression that matches a string, you may want to manipulate the string, for example by deleting or replacing the matched part. Here is a list of string matching and manipulation commands commonly used with regular expressions in R. Similar commands appear in many other languages.
Command        Function
grep()         Return the indexes of the elements in which the regular expression matched
grepl()        Return logical values indicating whether the regular expression matched
regexpr()      Return the position of the first match in each string
gregexpr()     Return the positions of all matches in each string
sub()          Substitute a pattern with a given string (first occurrence only)
gsub()         Globally substitute a pattern with a given string (all occurrences)
substr()       Return the substring at the given character positions (start and stop) in a given string
strsplit()     Split the input string into parts based on another string (character)
regexec()      Return the position of the first match, and of any parenthesized sub-matches, in a given string
regmatches()   Extract or replace matched substrings using match data obtained from regexpr, gregexpr, or regexec
Find & Display Matching string: grep
grep(pattern, vector)
> x <- c("abc","bcd","cde","def")
> grep("bc",x)
[1] 1 2
The first command is grep(), which originated on Unix. Its name comes from "globally search for a regular expression and print". You can see that “bc” appears in the first two entries of x, so grep() returns the indexes of those matching elements. If you want to show the matched entries themselves (not their indexes), use the value option or index x with square brackets.
> grep("bc",x,value=TRUE)
[1] "abc" "bcd"
> x[grep("bc",x)]
[1] "abc" "bcd"
Show Matched Pattern Using Find & Replace
If you want to get only the matched pattern, it is a little awkward, but you can take the output above and remove the unmatched part (in Linux, you would just use grep -o).
First, the syntax of the sub function is
sub("matching_string","replacing_string", input_vector)
This function works like “find and replace”. Use it to remove the unmatched part.
> sub(".*(bc).*","\\1",grep("bc",x,value=TRUE))
[1] "bc" "bc"
Remember that .* matches any sequence of characters of any length (including none), and \\1 refers to the string matched inside the first pair of parentheses. In this case you see only “bc”, but if you use a more general regular expression as the pattern, you will see the different matches found in each string.
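The table at the top says that sub() replaces only the first occurrence while gsub() replaces all of them. Here is a minimal sketch of that difference; the sample string "banana" and the variable names are made up for illustration.

```r
# Hypothetical sample string containing the pattern "an" twice.
s <- "banana"

# sub() replaces only the first match.
first_only <- sub("an", "AN", s)

# gsub() replaces every match.
all_matches <- gsub("an", "AN", s)

# Back-references such as \\1 and \\2 work in gsub() as well:
# here the two captured letters are swapped in every match.
swapped <- gsub("(a)(n)", "\\2\\1", s)

print(first_only)   # "bANana"
print(all_matches)  # "bANANa"
print(swapped)      # "bnanaa"
```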
Remove Matched String
If you want to return the indexes of the elements that do not match, add the invert option.
> grep("bc",x,invert=TRUE)
[1] 3 4
Combined with the value option, this removes the matched strings from the vector.
> grep("bc",x,invert=TRUE, value=TRUE)
[1] "cde" "def"
If the search should not be case sensitive, add the ignore.case option.
> grep("BC",x,ignore.case=TRUE)
[1] 1 2
If you want logical values (TRUE or FALSE) for each element instead of indexes, use grepl().
> grepl("bc",x)
[1]  TRUE  TRUE FALSE FALSE
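The logical vector from grepl() can be used directly for counting and filtering. The following is a small sketch using the same vector x as above; the names hits, kept, dropped and safe are only illustrative.

```r
x <- c("abc", "bcd", "cde", "def")

# One TRUE/FALSE per element of x.
hits <- grepl("bc", x)

# TRUE counts as 1, so sum() gives the number of matching elements.
n_hits <- sum(hits)

# Logical indexing keeps or drops the matching elements.
kept <- x[hits]
dropped <- x[!hits]

# Negating grepl() is safe even when nothing matches;
# x[-grep("zz", x)] would return an empty vector in that case.
safe <- x[!grepl("zz", x)]

print(n_hits)   # 2
print(kept)     # "abc" "bcd"
print(dropped)  # "cde" "def"
print(safe)     # "abc" "bcd" "cde" "def"
```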
Manipulating String with Matched String Position
To get the first position of the matched pattern in the string, regexpr() is used.
> y <- "Waikiki"
> regexpr("ki",y)
[1] 4
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE
Since the first match occurs at the 4th character in y, the first value returned is 4. If there is no match, it returns -1.
If you want to get this value only,
> regexpr("ki",y)[1]
[1] 4
You can see that regexpr() also returns two attributes, “match.length” and “useBytes”. These values can be accessed with attr().
> attr(regexpr("ki",y),"match.length")
[1] 2
> attr(regexpr("ki",y),"useBytes")
[1] TRUE
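As noted above, regexpr() returns -1 when there is no match. A short sketch of how that value can be used to tell matching from non-matching elements of a vector follows; the place names are made-up sample data.

```r
# Hypothetical sample vector; only some elements contain "lu".
places <- c("Waikiki", "Honolulu", "Kailua")

pos <- regexpr("lu", places)

# as.vector() drops the attributes and leaves the bare positions.
first_pos <- as.vector(pos)

# -1 marks the elements with no match.
has_match <- first_pos != -1

print(first_pos)          # -1  5  4
print(places[has_match])  # "Honolulu" "Kailua"
```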
If you want to get positions for all matches, use gregexpr()
> gregexpr("ki",y)
[[1]]
[1] 4 6
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
To show only the position values without the attributes, you can index every element of the result using the length function. It is a bit awkward but can be done (as.vector(z[[1]]) gives the same values).
> z <- gregexpr("ki",y)
> z[[1]][1:length(z[[1]])]
[1] 4 6
The regexec() command works very similarly to regexpr(). However, if the pattern contains parenthesized sub-expressions, it returns both the position of the whole match and the positions of the parenthesized sub-matches.
> regexec("kik",y)
[[1]]
[1] 4
attr(,"match.length")
[1] 3
> regexec("k(ik)",y)
[[1]]
[1] 4 5
attr(,"match.length")
[1] 3 2
To extract a substring from an input string, use substr()
substr(x, start, end)
> x <- "abcdef"
> substr(x,3,5)
[1] "cde"
This function can also replace a substring in a string.
> substr(x,3,4) <- "XX"
> x
[1] "abXXef"
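The positions and match lengths returned by gregexpr() can be combined with substring extraction to pull out each match by hand. This is only a sketch of the idea, using the “Waikiki” example from above. It uses substring() rather than substr() because substring() accepts a vector of start and end positions for a single input string.

```r
y <- "Waikiki"

# Match data for the single input string.
m <- gregexpr("ki", y)[[1]]

# Start positions without attributes, and the length of each match.
starts <- as.vector(m)
lens <- attr(m, "match.length")

# The end position of each match is start + length - 1.
ends <- starts + lens - 1

# Extract every matched substring by position.
matched <- substring(y, starts, ends)

print(starts)   # 4 6
print(ends)     # 5 7
print(matched)  # "ki" "ki"
```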
Another Way to Show Matched Strings Using regmatches()
I showed one way to list the matched strings using sub() and grep(). You can do the same thing with regmatches() together with regexpr(), gregexpr(), or regexec().
First, regexpr() or gregexpr() gives you the position and the length of the matched string in the input. You pass this information on to regmatches(), which extracts the matched strings from the input string. With regexec(), it shows both the matched substring and the substrings matched inside the parentheses.
> a <- "Mississippi contains a palindrome ississi."
> b <- gregexpr(".(ss)",a)
> c <- regexec(".(ss)",a)
> regmatches(a,b)
[[1]]
[1] "iss" "iss" "iss" "iss"
> regmatches(a,c)
[[1]]
[1] "iss" "ss"
The syntax of regmatches() is
regmatches(input, position&length)
Therefore, the position and length information obtained from either gregexpr() or regexec() is used to extract the matched strings from the input. Note that regexec() takes only the first match, so you see only “iss” and “ss”.
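To illustrate both the "extract" and the "replace" use of regmatches(), here is a small sketch on a vector of strings. The sample sentences, the date pattern and the "<date>" placeholder are all made up for this example.

```r
# Hypothetical sample data: only the first and third element contain a date.
dates <- c("due 11/03/2013", "no date here", "paid 02/14/2014")

# With regexpr() match data, regmatches() returns one string per
# matching element; elements without a match are dropped.
m <- regexpr("[0-9]{2}/[0-9]{2}/[0-9]{4}", dates)
found <- regmatches(dates, m)

# With regexec() match data, the parenthesized parts are returned too.
rx <- regexec("([0-9]{2})/([0-9]{2})/([0-9]{4})", dates[1])
parts <- regmatches(dates[1], rx)[[1]]

# The replacement form overwrites the matched substrings in a copy.
masked <- dates
regmatches(masked, m) <- "<date>"

print(found)   # "11/03/2013" "02/14/2014"
print(parts)   # "11/03/2013" "11" "03" "2013"
print(masked)  # "due <date>" "no date here" "paid <date>"
```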
Split Strings with Common Separator Using strsplit Function
Suppose you have a date string “11/03/2013” and want to extract the numbers “11”, “03” and “2013”. Since the numbers are separated by the common character “/”, you can use the strsplit function to do the job.
> strsplit("11/03/2013","/")
[[1]]
[1] "11"   "03"   "2013"
If you use “” as the separator, you can extract each character.
> strsplit("11/03/2013","")
[[1]]
 [1] "1" "1" "/" "0" "3" "/" "2" "0" "1" "3"
One thing to remember is that when the string starts with a separator, strsplit puts an empty string first in the vector.
> strsplit(".a.b.c","\\.")
[[1]]
[1] ""  "a" "b" "c"
If a dot (.) is the separator, you need two backslashes in front of it, because the separator is treated as a regular expression and an unescaped dot matches any character.
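As an alternative to escaping the dot, strsplit() has a fixed argument that treats the separator as a literal string. The sketch below compares the two forms and shows one way to drop the leading empty string; the variable names are only illustrative.

```r
s <- ".a.b.c"

# Escaped dot: the separator is a regular expression.
by_regex <- strsplit(s, "\\.")[[1]]

# fixed = TRUE: the separator is taken literally, so no escaping is needed.
by_fixed <- strsplit(s, ".", fixed = TRUE)[[1]]

# Drop the empty string produced by the leading separator.
non_empty <- by_fixed[by_fixed != ""]

# A trailing separator does not produce a trailing empty string.
trailing <- strsplit("a.b.c.", ".", fixed = TRUE)[[1]]

print(by_regex)   # "" "a" "b" "c"
print(by_fixed)   # "" "a" "b" "c"
print(non_empty)  # "a" "b" "c"
print(trailing)   # "a" "b" "c"
```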