
String Manipulation and Regular Expression in R

Updated on Aug 29, 2025


Table of Contents

Dealing with Text Data
Regular Expression

Dealing with Text Data

Working with text data can be a complex exercise because of its volume, its complicated structure, and its lack of any consistent pattern. We therefore need fast, easy-to-implement, convenient, and robust ways to retrieve information from text. Real-world text data is also often quite noisy. Thanks to Hadley Wickham, we have the package ‘stringr’, which adds more functionality to the base functions for handling strings in R. According to the description of the package, stringr:

“is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.”

Before looking at the use cases, let’s first understand what string manipulation is.

String manipulation refers to a series of functions that are used to extract information from text variables. In machine learning, these functions are widely used for feature engineering, i.e., to create new features out of existing string features.
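
As a minimal sketch of what such feature engineering can look like, the snippet below derives three new columns from a single string column using only base R. The data frame `customers` and its column names are hypothetical and exist only for illustration.

```r
# Hypothetical data: one existing string feature per row
customers <- data.frame(
  email = c("ann@gmail.com", "bob@company.org", "cara@gmail.com"),
  stringsAsFactors = FALSE
)

# New features derived from the existing string feature
customers$email_length <- nchar(customers$email)                   # number of characters
customers$is_gmail     <- grepl("@gmail\\.com$", customers$email)  # TRUE/FALSE flag
customers$user_name    <- sub("@.*$", "", customers$email)         # text before the '@'

customers
```

Each new column is computed for all rows at once, because the functions involved work on whole character vectors.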

Technically, there are differences between string manipulation functions and regular expressions:

String manipulation functions are typically used for simple tasks such as splitting a string or extracting its first two letters. Regular expressions are used for more complicated tasks, such as extracting email IDs or dates from a body of text.
String manipulation functions are designed to respond in a particular way and can’t be modified to deviate from that behavior, whereas regular expressions can be customized to describe almost any pattern you need.
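
The contrast can be sketched with a small base R example. The vector `codes` and the name `date_pattern` are made up for illustration; the date pattern is deliberately simple and only checks the shape of the text, not whether the date is valid.

```r
# Hypothetical record codes: country prefix followed by a date
codes <- c("IN-2021-04-15", "US-2019-12-01")

# Simple task: a string function with fixed behavior (first two letters)
prefixes <- substr(codes, start = 1, stop = 2)
prefixes

# More complex task: a regular expression describing a date-like pattern
date_pattern <- "[0-9]{4}-[0-9]{2}-[0-9]{2}"
dates <- regmatches(codes, regexpr(date_pattern, codes))
dates
```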

A few things to remember:

Text data is stored in character vectors (or, less commonly, character arrays). It’s important to remember that each element of a character vector is a whole string, rather than just an individual character. In R, “string” is an informal term that is used because “element of a character vector” is quite a mouthful. Because the basic unit of text is a character vector, most string manipulation functions operate on vectors of strings, in the same way that mathematical operations are vectorized.
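
A short base R sketch makes the distinction between "number of strings" and "number of characters" concrete. The vector `words` is just an example.

```r
words <- c("text", "mining", "in R")

n_strings <- length(words)   # number of elements (strings) in the vector
n_chars   <- nchar(words)    # number of characters in each string
shouted   <- toupper(words)  # applied to every element at once, no loop needed

n_strings
n_chars
shouted
```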

We will see how we can leverage this package in R to deal with the Text Data.

Now let’s look at some commonly used functions for string manipulation, both from base R and from the ‘stringr’ package:

Functions	Descriptions
nchar()	Counts the number of characters in a string or in each element of a vector. Its stringr counterpart is str_length()
tolower()	Converts a string to lower case. Alternatively, you can use the str_to_lower() function
toupper()	Converts a string to upper case. Alternatively, you can use the str_to_upper() function
chartr()	Translates individual characters in a string, one for one. To replace a complete substring instead, you can use the str_replace() function
substr()	Extracts part of a string. Start and end positions need to be specified. Alternatively, you can use the str_sub() function
setdiff()	Returns the elements of the first vector that are not in the second
setequal()	Checks whether two vectors contain the same set of string values
abbreviate()	Abbreviates strings. The minimum length of the abbreviation can be set with minlength
strsplit()	Splits a string on a given pattern and returns a list. Alternatively, you can use the str_split() function, which can also return a character matrix
sub()	Finds and replaces the first match in a string. Its stringr counterpart is str_replace()
gsub()	Finds and replaces all matches in a string/vector. Its stringr counterpart is str_replace_all()
paste()	Combines strings together. Its stringr counterpart is str_c()
str_trim()	Removes leading and trailing whitespace
str_dup()	Duplicates strings
str_pad()	Pads a string to a given width
str_wrap()	Wraps a paragraph of text to a given line width

Let’s look at some examples:

# ------------- String Manipulation -------------------
# load stringr, which provides the str_*() functions used below
library(stringr)
# ---- Concatenating with str_c() ----
str_c("May", "The", "Force", "Be", "With", "You")
#Result
[1] "MayTheForceBeWithYou"
# NULL arguments are dropped
str_c("The", "meek", "shall", NULL, "inherit", "the", "earth")
#Result
[1] "Themeekshallinherittheearth"
# changing separator
str_c("The", "meek", "shall", NULL, "inherit", "the", "earth", sep = "_")
#Result
[1] "The_meek_shall_inherit_the_earth"

# ---- Substring with str_sub() ----
some_text = 'It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,

it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way – in short,
the period was so far like the present period,
that some of its noisiest authorities insisted on its being received,
for good or for evil, in the superlative degree of comparison only.'
# apply 'str_sub'
str_sub(some_text, start = 1, end = 10)
#Result
[1] "It was the"
# another example
str_sub("adios", 1:3)
#Result
[1] "adios" "dios"  "ios"
# some strings
fruits = c("apple", "grapes", "banana", "mango")
# 'str_sub' with negative positions
str_sub(fruits, start = -4, end = -1)
#result
[1] "pple" "apes" "nana" "ango"
# extracting sequentially
str_sub('some_text', seq_len(nchar('some_text')))
#Result
[1] "some_text" "ome_text"  "me_text" "e_text" "_text"     "text" "ext"
[8] "xt" "t"
# ---- The same result can be obtained using the base substring() function ----
substring('some_text', seq_len(nchar('some_text')))
#Result
[1] "some_text" "ome_text"  "me_text" "e_text" "_text"     "text" "ext"
[8] "xt" "t"
# replacing 'Hlo' with 'Hello'
text = "Hlo World"
str_sub(text, 1, 3) <- "Hello"
text
#Result
[1] "Hello World"

# ---- Duplication with str_dup() ----
# default usage
str_dup("hello", 3)
# Result
[1] "hellohellohello"
# ---- Padding with str_pad() ----
# left padding with '#'
str_pad("hashtag", width = 8, pad = "#")
#Result
[1] "#hashtag" 

# ---- Wrapping with str_wrap() ----
# quote, one fragment per element
some_quote = c(
  "It was the best of times",
  "it was the worst of times,",
  "it was the age of wisdom",
  "it was the age of foolishness")
# some_quote in a single paragraph
some_quote = paste(some_quote, collapse = " ")
some_quote
# wrap the paragraph at width 30, indenting every line after the first by 3 spaces
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
#Result
It was the best of times it
   was the worst of times, it was
   the age of wisdom it was the
   age of foolishness

# ---- Trimming with str_trim() ----
# text with whitespaces
bad_text = c("This", " example ", "has several ", " whitespaces ")
# remove whitespaces on both sides
str_trim(bad_text, side = "both")
#Result
[1] "This"        "example" "has several" "whitespaces"

# ---- Word extraction with word() ----
# some sentence
change = c("Be the change", "you want to be")
# extract second word
word(change, 2)
#Result
[1] "the"  "want"
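# stringr must be installed (once) and loaded before any of the str_*() calls in this tutorial will run
install.packages('stringr')
library(stringr)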

# count number of characters
nchar(some_quote)
#Result
[1] 106
str_length(some_quote)
#Result - same value as nchar(); str_length() is the 'stringr' equivalent
[1] 106
#convert to lower
tolower(some_quote)
#Result
[1] "it was the best of times it was the worst of times,
it was the age of wisdom it was the age of foolishness"
#convert to upper

toupper(some_quote)
#Result
[1] "IT WAS THE BEST OF TIMES IT WAS THE WORST OF TIMES,
IT WAS THE AGE OF WISDOM IT WAS THE AGE OF FOOLISHNESS"
# translate characters one for one
chartr("and","for",x = some_quote) # letters a, n, d get replaced by f, o, r
#Result
[1] "It wfs the best of times it wfs the worst of times,
it wfs the fge of wisrom it wfs the fge of foolishoess"
#get difference between two vectors
setdiff(c("monday","tuesday","wednesday"),c("monday","thursday","friday"))
#Result
[1] "tuesday"   "wednesday"
#check if strings are equal
setequal(c("it","was","bad"),c("it","was","bad"))
#Result
[1] TRUE
setequal(c("it","wasnot","good"),c("it","was","bad"))
#Result
[1] FALSE
#abbreviate strings
abbreviate(c("apple","orange","banana"),minlength = 3)
#Result
apple orange banana
"app"  "orn" "bnn"
#split strings
strsplit(x = c("room-101","room-102","desk-103","flr-104"),split = "-")
#Result
[[1]]
[1] "room" "101"
[[2]]
[1] "room" "102"
[[3]]
[1] "desk" "103"
[[4]]
[1] "flr" "104"
# str_split() with simplify = TRUE returns a character matrix instead of a list
str_split(string = c("room-101","room-102","desk-103","flr-104"),pattern = "-",
          simplify = TRUE)
#Result
      [,1]   [,2]
[1,] "room" "101"
[2,] "room" "102"
[3,] "desk" "103"
[4,] "flr"  "104"
#find and replace first match
sub(pattern = "L",replacement = "B",x = some_quote,ignore.case = T)
#Result
[1] "It was the best of times it was the worst of times,
it was the age of wisdom it was the age of fooBishness"
#find and replace all matches
gsub(pattern = "was",replacement = "wasn't",x = some_quote,ignore.case = TRUE)
#Result
[1] "It wasn't the best of times it wasn't the worst of times,
it wasn't the age of wisdom it wasn't the age of foolishness"

Regular Expression

A regular expression (a.k.a. regex) is a special text string for describing a certain amount of text. This “certain amount of text” is formally called a pattern. Hence we say that a regular expression is a pattern that describes a set of strings. R has functions for working with regular expressions, although it does not offer as wide a range of capabilities as some other scripting languages. Nevertheless, they can take us quite far with some workarounds in place.

The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. So working with regular expressions is more about pattern matching. The result of a match is either successful or not.

The simplest version of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. Typically, regular expression patterns consist of a combination of alphanumeric characters as well as special characters. A regex pattern can be as simple as a single character, or it can be formed by several characters with a more complex structure.
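
To illustrate searching for one occurrence versus all occurrences, here is a minimal base R sketch. The variable names are chosen for this example only.

```r
sentence <- "it was the best of times, it was the worst of times"

has_match   <- grepl("times", sentence)             # is there a match at all?
first_match <- regexpr("times", sentence)           # position of the first occurrence
all_matches <- gregexpr("times", sentence)[[1]]     # positions of all occurrences

has_match
as.integer(first_match)
as.integer(all_matches)
```

Here `grepl()` only reports whether the match succeeded, while `regexpr()` and `gregexpr()` also report where the matches start.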

Regular Expressions in R

There are two key aspects of the functionality for regular expressions in R. One has to do with the functions designed for regex pattern matching. The other has to do with the way regex patterns are expressed in R. In this part of the tutorial we are primarily going to talk about the second aspect: the way R works with regular expressions.

In the context of regular expressions, we will be covering the following themes in this tutorial:

Metacharacters
Sequences
Quantifiers
Character classes
POSIX character classes

1. Metacharacters

The simplest regular expressions are those that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Some special characters, however, have a reserved meaning; these are referred to as “metacharacters”. The metacharacters in Extended Regular Expressions (EREs) are:

. \ | ( ) [ { ^ $ * + ?

The following table shows the general regex metacharacters and how to escape them in R:

[Table: regex metacharacters and their escaped forms in R]

The following example shows how to deal with any metacharacters within the text:

# ------- Regular Expressions in R -------
# string
char = "$char"
# the right way in R
sub(pattern = "\\$", replacement = "", x = char)
#Result
[1] "char"
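
The effect of escaping can also be seen with the dot, which as a metacharacter matches any single character. The sketch below uses a made-up vector `files`; it also shows `fixed = TRUE`, a base R option that treats the whole pattern as plain text so that no escaping is needed.

```r
files <- c("report.csv", "report_csv", "notes.txt")

# Unescaped '.' is a metacharacter: it matches any single character
unescaped <- grepl("report.csv", files)
# Escaped '\\.' matches only a literal dot
escaped <- grepl("report\\.csv", files)
# fixed = TRUE switches regex interpretation off entirely
literal <- grepl("report.csv", files, fixed = TRUE)

unescaped
escaped
literal
```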

2. Sequences

Sequences, as the name suggests, refer to sequences of characters that can be matched. R provides shorthand versions for commonly used sequences:

[Table: shorthand character sequences in R]

Example:

# ---- replace each digit with '_' ----
gsub("\\d", "_", "the year of great depression was 1929")
#Result
[1] "the year of great depression was ____"
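
A few more shorthand sequences can be tried the same way. This is a small sketch using base R's default regex engine, where `\\D` stands for a non-digit, `\\s` for a whitespace character, and `\\W` for a non-word character; the string `x` is made up.

```r
x <- "Room 101, Floor 3"

digits_only  <- gsub("\\D", "", x)   # drop every non-digit
underscored  <- gsub("\\s", "_", x)  # replace each whitespace character
word_chars   <- gsub("\\W", "", x)   # drop everything that is not a letter, digit or '_'

digits_only
underscored
word_chars
```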

3. Character Class

A character class or character set is a list of characters enclosed by square brackets [ ]. Character sets are used to match only one of the different characters. For example, the regex character class [aA] matches any lower case letter a or any upper case letter A. Likewise, the regular expression [0123456789] matches any single digit. It is important not to confuse a regex character class with the native R "character" class notion.

Examples of some character classes are shown below:

[Table: examples of regex character classes]

Let’s look at some examples:

# example string
transport = c("car", "bike", "plane", "boat")
# look for 'o' or 'e'
grep(pattern = "[oe]", transport, value = TRUE)
#Result
[1] "bike"  "plane" "boat"
# some numeric strings
numerics = c("13", "19-April", "I-V-IV", "R 3.3.1")
# match strings containing 0, 1 or 9
grep(pattern = "[019]", numerics, value = TRUE)
#Result
[1] "13" "19-April" "R 3.3.1"
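
Character classes also accept ranges such as `[0-9]` and negation with a leading `^` inside the brackets. The following sketch uses a made-up vector `ids`; the `^` and `$` outside the brackets anchor the pattern to the start and end of the string.

```r
ids <- c("abc", "A12", "x-9", "2024")

with_digit     <- grep("[0-9]", ids, value = TRUE)     # at least one digit
lowercase_only <- grep("^[a-z]+$", ids, value = TRUE)  # nothing but lowercase letters
with_non_digit <- grep("[^0-9]", ids, value = TRUE)    # at least one character that is not a digit

with_digit
lowercase_only
with_non_digit
```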

4. POSIX Character Classes

POSIX character classes are very closely related to regex character classes. In R, POSIX character classes are represented with expressions inside double brackets [[ ]]. The following table shows the POSIX character classes as used in R:

[Table: POSIX character classes in R]

Example:

# example string containing a newline and a tab escape
some_quote = 'It was #FFC0CB (print) ; \nthe best of \times!'
# Print the text
print(some_quote)
#Result
[1] "It was #FFC0CB (print) ; \nthe best of \times!"
# remove blank characters (spaces and tabs)
gsub(pattern = "[[:blank:]]", replacement = "", some_quote)
#Result
[1] "Itwas#FFC0CB(print);\nthebestofimes!"
# remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", some_quote)
#Result
[1] "It was #FFC0CB (print) ; the best of imes!"
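
Other POSIX classes work the same way. The sketch below, on a made-up string `s`, uses `[:alpha:]` (letters), `[:digit:]` (digits) and `[:punct:]` (punctuation); a leading `^` inside the outer brackets negates the class.

```r
s <- "Order #42: 3 Apples, 10 Pears"

only_letters <- gsub("[^[:alpha:]]", "", s)  # keep letters only
only_digits  <- gsub("[^[:digit:]]", "", s)  # keep digits only
no_punct     <- gsub("[[:punct:]]", "", s)   # drop punctuation characters

only_letters
only_digits
no_punct
```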

5. Quantifiers

Another important set of regex elements is the quantifiers. They are used when you want to match a certain number of repetitions of a character or group that meets certain criteria.

The following table shows a list of quantifiers:

[Table: regex quantifiers]

Let’s look at a few worked examples:

#Some examples : Quantifiers in R
# people names
people = c("Ravi", "Emily", "Alex", "Pramod", "Shishir", "jacob",
           "rasmus", "jacob", "flora")
# match 'm' at most once; zero occurrences also count, so every name matches
grep(pattern = "m?", people, value = TRUE)
#Result
[1] "Ravi"    "Emily" "Alex"    "Pramod" "Shishir" "jacob"   "rasmus" "jacob"
[9] "flora"
# match 'm' one or more times
grep(pattern = "m+", people, value = TRUE)
#Result
[1] "Emily"  "Pramod" "rasmus"
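
Counted repetition with braces can be sketched in the same way. In this base R example the vector `runs` is made up, and the patterns are anchored with `^` and `$` so that the quantifier has to account for the whole string.

```r
runs <- c("a", "aa", "aaa", "b")

exactly_two  <- grepl("^a{2}$", runs)    # exactly two 'a'
two_or_more  <- grepl("^a{2,}$", runs)   # two or more 'a'
one_to_two   <- grepl("^a{1,2}$", runs)  # between one and two 'a'
zero_or_more <- grepl("^a*$", runs)      # zero or more 'a' and nothing else

exactly_two
two_or_more
one_to_two
zero_or_more
```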

6. Major Regex Functions

R contains a set of functions in the base package that we can use to find pattern matches. The following table lists these functions with a brief description:

[Table: base R regex functions and their descriptions]

A few examples:

# ---- Extract digits from a string of characters ----
address <- "The address is 245 Summer Street"
regmatches(address, regexpr("[0-9]+",address))
#Result
[1] "245"
# check whether any of several values is present in a vector
det <- c("A1","A2","A3","A4","A5","A7")
grep(pattern = "A6|A2",x = det,value = TRUE)
#Result
[1] "A2"
# ---- Extract strings which are available in key value pairs ----
d <- c("(Val_1 :: 0.1231313213)","today_trans","(Val_2 :: 0.1434343412)")
# keys such as 'Val_1' contain upper-case letters, digits and '_', so [a-z]+ alone would not match
grep(pattern = "\\([A-Za-z0-9_]+ :: (0\\.[0-9]+)\\)",x = d,value = TRUE)
regmatches(d,regexpr(pattern = "\\((.*) :: (0\\.[0-9]+)\\)",text = d))
#Result
[1] "(Val_1 :: 0.1231313213)" "(Val_2 :: 0.1434343412)"
# ---- Remove punctuation from a line of text ----
text <- "a1~!@#$%^&*bcd(){}_+:efg\"<>?,./;'[]-="
gsub(pattern = "[[:punct:]]+",replacement = "",x = text)
#Result
[1] "a1bcdefg"
# ---- Find the location of digits in a string ----
string <- "Only 10 out of 25 qualified in the examination"
gregexpr(pattern = '\\d',text = string)
#Result
[[1]]
[1]  6 7 16 17
attr(,"match.length")
[1] 1 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
# or, to get only the positions as a plain vector
unlist(gregexpr(pattern = '\\d',text = string))
#Result
[1]  6 7 16 17
# ---- Extract email addresses from a given string ----
string <- c("My email address is abc@gmail.com",
            "my email address is def@hotmail.com","aescher koeif",
            "paul Taylor")
unlist(regmatches(x = string, gregexpr(pattern =
                                         "[[:alnum:]]+\\@[[:alpha:]]+\\.com",
                                       text = string)))
#Result
[1] "abc@gmail.com"   "def@hotmail.com"
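
To compare what the main base R regex functions return, the sketch below runs several of them on the same made-up vector `x`. It is only an overview of return types, not a complete reference.

```r
x <- c("apple pie", "banana split", "cherry tart")

idx       <- grep("an", x)                # indices of matching elements
values    <- grep("an", x, value = TRUE)  # the matching elements themselves
flags     <- grepl("an", x)               # one TRUE/FALSE per element
first_rep <- sub("a", "A", x)             # replace the first match in each string
all_rep   <- gsub("a", "A", x)            # replace every match in each string
pos       <- as.integer(regexpr("a", x))  # start of the first match in each string

idx; values; flags
first_rep; all_rep; pos
```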

Regular expressions are a crucial part of text mining and natural language processing. In this tutorial, you learned the basics of string manipulation and regular expressions, and you can start applying these concepts as you begin your journey in text mining.
