
Working with text data can be a complex exercise because of its volume, its complicated structure, and its lack of any consistent pattern. We therefore need fast, easy-to-implement, convenient and robust ways to retrieve information from text. In the real world, the text data we encounter is often quite noisy. Thanks to Hadley Wickham, we have the ‘stringr’ package, which adds functionality to the base functions for handling strings in R. According to its description, stringr:
“is a set of simple wrappers that make R’s string functions more consistent, simpler and easier to use. It does this by ensuring that: function and argument names (and positions) are consistent, all functions deal with NA’s and zero length character appropriately, and the output data structures from each function matches the input data structures of other functions.”
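The point about missing values in this description can be seen in a small comparison. The following is a minimal sketch, assuming the stringr package is installed (the guard skips that part otherwise); the variable names are illustrative and not part of the original tutorial.

```r
# Compare how base R and stringr treat a missing value when concatenating.
parts <- c("a", NA, "c")

# Base R converts NA into the literal text "NA"
base_result <- paste0(parts, "_x")
print(base_result)

# Assumption: stringr is installed; otherwise this part is skipped
if (requireNamespace("stringr", quietly = TRUE)) {
  # stringr keeps the missing value missing
  stringr_result <- stringr::str_c(parts, "_x")
  print(stringr_result)
}
```

Keeping NA as NA usually makes missing text easier to spot later than a literal "NA_x" string would be.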
Before looking at the use cases, let’s first understand what string manipulation is.
String manipulation refers to a series of functions used to extract information from text variables. In machine learning, these functions are widely used for feature engineering, i.e., to create new features out of existing string features.
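As an illustration of feature engineering from a string column, here is a minimal sketch in base R. The item labels and column names are hypothetical and chosen only for this example.

```r
# Hypothetical raw labels of the form "<type>-<number>"
items <- c("room-101", "desk-2044", "flr-7")

# Derive three new features from the single string variable
features <- data.frame(
  raw     = items,
  n_chars = nchar(items),                        # length of each label
  prefix  = sub("-.*$", "", items),              # text before the dash
  number  = as.integer(sub("^.*-", "", items)),  # digits after the dash
  stringsAsFactors = FALSE
)
print(features)
```

Each derived column could then be fed to a model in place of the raw text.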
Now technically there are differences between “String Manipulation functions” and “Regular Expressions”:
A few things to remember:
Text data is stored in character vectors (or, less commonly, character arrays). It’s important to remember that each element of a character vector is a whole string, rather than just an individual character. In R, “string” is an informal term that is used because “element of a character vector” is quite a mouthful. Because the basic unit of text is a character vector, most string manipulation functions operate on vectors of strings, in the same way that mathematical operations are vectorized.
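The difference between a string and a character, and the vectorized behaviour of string functions, can be sketched with a few base R calls. The vector below is an illustrative assumption.

```r
# A character vector with three elements (three strings)
words <- c("stringr", "regex", "R")

n_strings <- length(words)        # number of strings in the vector
chars_per_string <- nchar(words)  # characters in each string, computed element-wise
upper_words <- toupper(words)     # the function is applied to every element

# Individual characters only appear after an explicit split
letters_of_first <- strsplit(words[1], split = "")[[1]]

print(n_strings)
print(chars_per_string)
print(upper_words)
print(letters_of_first)
```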
We will see how to leverage this package in R to deal with text data.
Now let’s look at some commonly used string manipulation functions, from both base R and the ‘stringr’ package:
| Function | Description |
|---|---|
| nchar() | Counts the number of characters in each string of a vector. The stringr equivalent is str_length() |
| tolower() | Converts a string to lower case. Alternatively, you can use str_to_lower() |
| toupper() | Converts a string to upper case. Alternatively, you can use str_to_upper() |
| chartr() | Translates individual characters: each character in the old set is replaced by the character at the same position in the new set. To replace a whole substring, use str_replace() |
| substr() | Extracts part of a string. Start and end positions need to be specified. Alternatively, you can use str_sub() |
| setdiff() | Returns the elements of the first vector that are not in the second |
| setequal() | Checks whether two vectors contain the same set of values |
| abbreviate() | Abbreviates strings. The minimum length of the abbreviation can be specified with minlength |
| strsplit() | Splits a string on a pattern and returns a list. Alternatively, you can use str_split(), which can return a character matrix instead of a list when simplify = TRUE |
| sub() | Finds and replaces the first match in a string. The stringr equivalent is str_replace() |
| gsub() | Finds and replaces all matches in a string or vector. The stringr equivalent is str_replace_all() |
| paste() | Concatenates strings. Alternatively, you can use str_c() |
| str_trim() | Removes leading and trailing whitespace |
| str_dup() | Duplicates (repeats) a string |
| str_pad() | Pads a string to a given width |
| str_wrap() | Wraps a paragraph of text to a given width |
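------------- String Manipulation -------------------
# load stringr before using the str_* functions
# (run install.packages('stringr') once if it is not installed)
library(stringr)
----Concatenating with str_c():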
str_c("May", "The", "Force", "Be", "With", "You")
#Result
[1] "MayTheForceBeWithYou"
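# NULL arguments are dropped
# (in recent stringr releases a zero-length vector such as character(0)
# is recycled rather than dropped, so it would make the result zero-length)
str_c("The", "meek", "shall", NULL, "inherit", "the", "earth")
#Result
[1] "Themeekshallinherittheearth"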
# changing separator
str_c("The", "meek", "shall", NULL, "inherit", "the", "earth", sep = "_")
#Result
[1] "The_meek_shall_inherit_the_earth"
-----Substring with str_sub()
some_text = 'It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way – in short,
the period was so far like the present period,
that some of its noisiest authorities insisted on its being received,
for good or for evil, in the superlative degree of comparison only.'
# apply 'str_sub'
str_sub(some_text, start = 1, end = 10)
#Result
[1] "It was the"
# another example
str_sub("adios", 1:3)
#Result
[1] "adios" "dios" "ios"
# some strings
fruits = c("apple", "grapes", "banana", "mango")
# 'str_sub' with negative positions
str_sub(fruits, start = -4, end = -1)
#result
[1] "pple" "apes" "nana" "ango"
# extracting sequentially
str_sub('some_text', seq_len(nchar('some_text')))
#Result
[1] "some_text" "ome_text" "me_text" "e_text" "_text" "text" "ext"
[8] "xt" "t"
---Same result can be obtained using the base substring() function
substring('some_text', seq_len(nchar('some_text')))
[1] "some_text" "ome_text" "me_text" "e_text" "_text" "text" "ext"
[8] "xt" "t"
# replacing 'Hlo' with 'Hello'
text = "Hlo World"
str_sub(text, 1, 3) <- "Hello"
text
#Result
[1] "Hello World"
---Duplication with str_dup()
# default usage
str_dup("hello", 3)
# Result
[1] "hellohellohello"
-----Padding with str_pad()
# left padding with '#'
str_pad("hashtag", width = 8, pad = "#")
#Result
[1] "#hashtag"
-----Wrapping with str_wrap()
# quote, stored as a vector of phrases
some_quote = c(
"It was the best of times",
"it was the worst of times,",
"it was the age of wisdom",
"it was the age of foolishness")
# some_quote in a single paragraph
some_quote = paste(some_quote, collapse = " ")
some_quote
# display paragraph with the following lines indented by 3 spaces
cat(str_wrap(some_quote, width = 30, exdent = 3), "\n")
#Result
It was the best of times it
   was the worst of times, it
   was the age of wisdom it
   was the age of foolishness
-----Trimming with str_trim()
# text with whitespaces
bad_text = c("This", " example ", "has several ", " whitespaces ")
# remove whitespaces on both sides
str_trim(bad_text, side = "both")
#Result
[1] "This" "example" "has several" "whitespaces"
---Word extraction with word()
# some sentence
change = c("Be the change", "you want to be")
# extract second word
word(change, 2)
#Result
[1] "the" "want"
# install once if needed, then load the package
# install.packages('stringr')
library(stringr)
---Count the number of characters
nchar(some_quote)
#Result
[1] 106
str_length(some_quote)
#Result - str_length() from 'stringr' gives the same answer as nchar()
[1] 106
#convert to lower
tolower(some_quote)
#Result
[1] "it was the best of times it was the worst of times,
it was the age of wisdom it was the age of foolishness"
#convert to upper
toupper(some_quote)
#Result
[1] "IT WAS THE BEST OF TIMES IT WAS THE WORST OF TIMES,
IT WAS THE AGE OF WISDOM IT WAS THE AGE OF FOOLISHNESS"
#replace characters
chartr("and","for",x = some_quote) #letters a,n,d get replaced by f,o,r
#Result
[1] "It wfs the best of times it wfs the worst of times,
it wfs the fge of wisrom it wfs the fge of foolishoess"
#get difference between two vectors
setdiff(c("monday","tuesday","wednesday"),c("monday","thursday","friday"))
#Result
[1] "tuesday" "wednesday"
#check if strings are equal
setequal(c("it","was","bad"),c("it","was","bad"))
#Result
[1] TRUE
setequal(c("it","wasnot","good"),c("it","was","bad"))
#Result
[1] FALSE
#abbreviate strings
abbreviate(c("apple","orange","banana"),minlength = 3)
#Result
apple orange banana
"app" "orn" "bnn"
#split strings
strsplit(x = c("room-101","room-102","desk-103","flr-104"),split = "-")
#Result
[[1]]
[1] "room" "101"
[[2]]
[1] "room" "102"
[[3]]
[1] "desk" "103"
[[4]]
[1] "flr" "104"
str_split(string = c("room-101","room-102","desk-103","flr-104"),pattern = "-",simplify = TRUE)
#Result
[,1] [,2]
[1,] "room" "101"
[2,] "room" "102"
[3,] "desk" "103"
[4,] "flr" "104"
#find and replace first match
sub(pattern = "L",replacement = "B",x = some_quote,ignore.case = TRUE)
#Result
[1] "It was the best of times it was the worst of times,
it was the age of wisdom it was the age of fooBishness"
#find and replace all matches
gsub(pattern = "was",replacement = "wasn't",x = some_quote,ignore.case = TRUE)
#Result
[1] "It wasn't the best of times it wasn't the worst of times,
it wasn't the age of wisdom it wasn't the age of foolishness"
A regular expression (a.k.a. regex) is a special text string for describing a certain amount of text. This “certain amount of text” receives the formal name of pattern. Hence we say that a regular expression is a pattern that describes a set of strings. R provides functions for working with regular expressions, although not as wide a range of capabilities as some other scripting languages offer. Nevertheless, these functions can take us quite far with a few workarounds.
The main purpose of working with regular expressions is to describe patterns that are used to match against text strings. So working with regular expressions is more about pattern matching. The result of a match is either successful or not.
The simplest version of pattern matching is to search for one occurrence (or all occurrences) of some specific characters in a string. Typically, regular expression patterns consist of a combination of alphanumeric characters as well as special characters. A regex pattern can be as simple as a single character, or it can be formed by several characters with a more complex structure.
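The simplest kind of matching described above, searching for specific literal characters, can be sketched with base R functions. The sentences are illustrative, and `fixed = TRUE` is used so that the pattern is treated as plain text rather than as a regular expression.

```r
sentences <- c("the cat sat", "a dog barked", "cat and cat again")

# Is the match successful or not? One logical value per string
has_cat <- grepl("cat", sentences, fixed = TRUE)

# Position of the first occurrence in each string (-1 means no match)
first_pos <- regexpr("cat", sentences, fixed = TRUE)

# Positions of all occurrences in the third string
all_pos <- gregexpr("cat", sentences[3], fixed = TRUE)[[1]]

print(has_cat)
print(as.integer(first_pos))
print(as.integer(all_pos))
```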
There are two key aspects of the functionalities dealing with regular expressions in R: One has to do with the functions designed for regex pattern matching. The other aspect has to do with the way regex patterns are expressed in R. In this part of the tutorial we are primarily going to talk about the 2nd aspect: the way R works with regular expressions.
In the context of regular expressions, we will be covering the following themes in this tutorial:
1. Metacharacters
The simplest form of regular expression is one that matches a single character. Most characters, including all letters and digits, are regular expressions that match themselves. In R, as in other languages, some special characters have a reserved meaning; they are referred to as “metacharacters”. The metacharacters in Extended Regular Expressions (EREs) are:
. \ | ( ) [ { ^ $ * + ?
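The following table shows the general regex metacharacters and how to escape them in R. Because the backslash is itself an escape character inside an R string, each escape is written with a double backslash:

| Metacharacter | Literal meaning | Escaped in R |
|---|---|---|
| . | period (dot) | \\. |
| $ | dollar sign | \\$ |
| * | asterisk | \\* |
| + | plus sign | \\+ |
| ? | question mark | \\? |
| ^ | caret | \\^ |
| ( ) | parentheses | \\( and \\) |
| [ | opening square bracket | \\[ |
| { | opening curly brace | \\{ |
| \ | backslash | \\\\ |
| vertical bar | alternation (or) | \\ followed by the bar character |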
The following example shows how to deal with any metacharacters within the text:
------- Regular Expressions in R
# string
char = "$char"
# the right way in R
sub(pattern = "\\$", replacement = "", x = char)
#Result
[1] "char"
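To make the double backslash less mysterious, the sketch below shows what the pattern string actually contains and contrasts an escaped metacharacter with an unescaped one. It also shows `fixed = TRUE` as an alternative to escaping. The `prices` vector is an illustrative assumption.

```r
# In R source, "\\$" is a two-character string: a backslash and a dollar sign
pattern <- "\\$"
pattern_length <- nchar(pattern)
cat(pattern, "\n")  # shows the pattern as the regex engine sees it

prices <- c("cost: $5", "cost: 5 USD")

# Escaped: matches a literal dollar sign
escaped <- grepl("\\$", prices)

# fixed = TRUE treats the pattern as plain text, so no escaping is needed
literal <- grepl("$", prices, fixed = TRUE)

# Unescaped: "$" is the end-of-string anchor, so every string matches
unescaped <- grepl("$", prices)

print(escaped)
print(literal)
print(unescaped)
```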
2. Sequences
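Sequences, as the name suggests, are shorthand notations that match a whole group of characters, such as any digit or any whitespace character. R supports shorthand versions for commonly used sequences; inside an R string each one is written with a double backslash:

| Sequence | Matches |
|---|---|
| \\d | a digit character |
| \\D | a non-digit character |
| \\s | a whitespace character |
| \\S | a non-whitespace character |
| \\w | a word character (letter, digit, or underscore) |
| \\W | a non-word character |
| \\b | a word boundary |
| \\B | a position that is not a word boundary |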
Example:
------# replace each digit with '_'
gsub("\\d", "_", "the year of great depression was 1929")
#Result
[1] "the year of great depression was ____"
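A few more of the shorthand sequences can be tried in the same way. This is a small sketch using base R's default regex engine; the input strings are made up for illustration.

```r
x <- "Order 66 shipped on day 7"

# \\D matches any non-digit, so removing it keeps only the digits
digits_only <- gsub("\\D", "", x)

# \\s matches whitespace
underscored <- gsub("\\s", "_", x)

# \\W matches any non-word character (here, the spaces)
word_chars_only <- gsub("\\W", "", x)

# \\b marks a word boundary, so "day" inside "today" is left alone
whole_word <- gsub("\\bday\\b", "DAY", "day by day, today")

print(digits_only)
print(underscored)
print(word_chars_only)
print(whole_word)
```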
3. Character Class
A character class or character set is a list of characters enclosed by square brackets [ ]. Character sets are used to match only one of the different characters. For example, the regex character class [aA] matches any lower case letter a or any upper case letter A. Likewise, the regular expression [0123456789] matches any single digit. It is important not to confuse a regex character class with the native R "character" class notion.
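Examples of some character classes are shown below:

| Character class | Matches |
|---|---|
| [aeiou] | any one lower case vowel |
| [AEIOU] | any one upper case vowel |
| [0123456789] | any one digit |
| [0-9] | any one digit (the same set written as a range) |
| [a-z] | any one lower case ASCII letter |
| [A-Z] | any one upper case ASCII letter |
| [a-zA-Z0-9] | any one letter or digit |
| [^aeiou] | any one character other than a lower case vowel |
| [^0-9] | any one character other than a digit |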
Let’s look at some examples:
-------------------
# example string
transport = c("car", "bike", "plane", "boat")
# look for 'o' or 'e'
grep(pattern = "[oe]", transport, value = TRUE)
#Result
[1] "bike" "plane" "boat"
--------
# some numeric strings
numerics = c("13", "19-April", "I-V-IV", "R 3.3.1")
# match strings containing 0, 1 or 9
grep(pattern = "[019]", numerics, value = TRUE)
#Result
[1] "13" "19-April" "R 3.3.1"
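Ranges and negated character classes follow the same idea. The sketch below uses an illustrative vector of codes; the anchors `^` and `$` tie the class to the start and end of the string.

```r
codes <- c("A12", "b7", "C-3", "42", "audio")

# Strings whose first character is a digit
starts_with_digit <- grep("^[0-9]", codes, value = TRUE)

# Strings containing at least one character that is not a letter or digit
has_non_alnum <- grep("[^0-9a-zA-Z]", codes, value = TRUE)

# Strings made up entirely of characters other than lower case vowels
no_lower_vowels <- grep("^[^aeiou]+$", codes, value = TRUE)

print(starts_with_digit)
print(has_non_alnum)
print(no_lower_vowels)
```

Note that a caret inside the brackets negates the class, while a caret outside the brackets anchors the match to the start of the string.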
4. POSIX Character Classes
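POSIX character classes are very closely related to regex character classes. In R, POSIX character classes are represented with expressions inside double brackets [[ ]]. The following table shows the POSIX character classes as used in R:

| Class | Matches |
|---|---|
| [[:lower:]] | lower case letters |
| [[:upper:]] | upper case letters |
| [[:alpha:]] | alphabetic characters ([[:lower:]] and [[:upper:]]) |
| [[:digit:]] | digits 0 to 9 |
| [[:alnum:]] | alphanumeric characters ([[:alpha:]] and [[:digit:]]) |
| [[:blank:]] | blank characters: space and tab |
| [[:cntrl:]] | control characters |
| [[:punct:]] | punctuation characters |
| [[:space:]] | space characters: tab, newline, vertical tab, form feed, carriage return, and space |
| [[:xdigit:]] | hexadecimal digits |
| [[:print:]] | printable characters ([[:alnum:]], [[:punct:]], and space) |
| [[:graph:]] | graphical characters ([[:alnum:]] and [[:punct:]]) |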
Example:
--------------
some_quote = 'It was #FFC0CB (print) ; \nthe best of \times!'
# Print the text
print(some_quote)
#Result
[1] "It was #FFC0CB (print) ; \nthe best of \times!"
# remove blank characters (spaces and tabs; the newline is kept)
gsub(pattern = "[[:blank:]]", replacement = "", some_quote)
#Result
[1] "Itwas#FFC0CB(print);\nthebestofimes!"
# remove non-printable characters
gsub(pattern = "[^[:print:]]", replacement = "", some_quote)
#Result
[1] "It was #FFC0CB (print) ; the best of imes!"
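POSIX classes can also be negated or combined inside a single bracket expression. The following sketch uses an invented input string to show both.

```r
x <- "R 4.3: tidy, fast & fun!"

# Keep only alphabetic characters by removing everything else
letters_only <- gsub("[^[:alpha:]]", "", x)

# Keep only digits
digits_only <- gsub("[^[:digit:]]", "", x)

# Two classes combined in one bracket expression: collapse every run of
# punctuation or whitespace into a single space, then trim the ends
tokens <- trimws(gsub("[[:punct:][:space:]]+", " ", x))

print(letters_only)
print(digits_only)
print(tokens)
```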
5. Quantifiers
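Another important set of regex elements is the quantifiers. These are used when you want to match a certain number of repetitions of a character or group.
The following table shows a list of quantifiers:

| Quantifier | Meaning |
|---|---|
| ? | the preceding item is optional and is matched at most once |
| * | the preceding item is matched zero or more times |
| + | the preceding item is matched one or more times |
| {n} | the preceding item is matched exactly n times |
| {n,} | the preceding item is matched n or more times |
| {n,m} | the preceding item is matched at least n times, but not more than m times |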
Let’s look at a few worked examples:
#Some examples : Quantifiers in R
# people names
people = c("Ravi", "Emily", "Alex", "Pramod", "Shishir", "jacob",
"rasmus", "jacob", "flora")
# match 'm' at most once
grep(pattern = "m?", people, value = TRUE)
#Result
[1] "Ravi" "Emily" "Alex" "Pramod" "Shishir" "jacob" "rasmus" "jacob"
[9] "flora"
# match 'm' one or more times
grep(pattern = "m+", people, value = TRUE)
#Result
[1] "Emily" "Pramod" "rasmus"
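The counted quantifiers in braces work the same way. Here is a short sketch with made-up identifiers; the anchors `^` and `$` make the count apply to the whole string rather than to any part of it.

```r
ids <- c("7", "42", "123", "2024", "98765")

# Exactly four digits
exactly_four <- grep("^[0-9]{4}$", ids, value = TRUE)

# Between two and three digits
two_to_three <- grep("^[0-9]{2,3}$", ids, value = TRUE)

# Three or more digits
three_or_more <- grep("^[0-9]{3,}$", ids, value = TRUE)

# 'a' followed by zero or more 'b'
a_then_bs <- grep("^ab*$", c("a", "ab", "abbb", "b"), value = TRUE)

print(exactly_four)
print(two_to_three)
print(three_or_more)
print(a_then_bs)
```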
6. Major Regex Functions
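R contains a set of functions in the base package that we can use to find pattern matches. The following table lists these functions with a brief description:

| Function | Description |
|---|---|
| grep() | Returns the indices (or, with value = TRUE, the values) of the elements that match the pattern |
| grepl() | Returns a logical vector indicating which elements match the pattern |
| regexpr() | Returns the position and length of the first match in each element |
| gregexpr() | Returns the positions and lengths of all matches in each element |
| regexec() | Like regexpr(), but also returns the positions of parenthesized sub-expressions |
| regmatches() | Extracts (or replaces) the matched substrings located by regexpr(), gregexpr() or regexec() |
| sub() | Replaces the first match in each element |
| gsub() | Replaces all matches in each element |
| strsplit() | Splits each element at the matches of the pattern |

A few examples: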
----Extract digits from a string of characters
address <- "The address is 245 Summer Street"
regmatches(address, regexpr("[0-9]+",address))
#Result
[1] "245"
#Check whether any of several values is present in a vector
#match values
det <- c("A1","A2","A3","A4","A5","A7")
grep(pattern = "A6|A2",x = det,value = TRUE)
#Result
[1] "A2"
----Extract strings which are available in key value pairs
d <- c("(Val_1 :: 0.1231313213)","today_trans","(Val_2 :: 0.1434343412)")
# the key may contain letters, digits and underscores
grep(pattern = "\\([A-Za-z0-9_]+ :: (0\\.[0-9]+)\\)",x = d,value = TRUE)
regmatches(d,regexpr(pattern = "\\((.*) :: (0\\.[0-9]+)\\)",text = d))
#Result (the same for both calls)
[1] "(Val_1 :: 0.1231313213)" "(Val_2 :: 0.1434343412)"
--Remove punctuation from a line of text
text <- "a1~!@#$%^&*bcd(){}_+:efg\"<>?,./;'[]-="
gsub(pattern = "[[:punct:]]+",replacement = "",x = text)
#Result
[1] "a1bcdefg"
----Find the location of digits in a string
string <- "Only 10 out of 25 qualified in the examination"
gregexpr(pattern = '\\d',text = string)
#Result
[[1]]
[1] 6 7 16 17
attr(,"match.length")
[1] 1 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
#or, to get just the positions as a plain vector
unlist(gregexpr(pattern = '\\d',text = string))
#Result
[1] 6 7 16 17
---Extract email addresses from a given string
string <- c("My email address is abc@gmail.com",
"my email address is def@hotmail.com","aescher koeif",
"paul Taylor")
unlist(regmatches(x = string, gregexpr(pattern =
"[[:alnum:]]+\\@[[:alpha:]]+\\.com",
text = string)))
#Result
[1] "abc@gmail.com" "def@hotmail.com"
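The key-value example earlier returns each whole match. If the key and the value are needed separately, one possible approach is to use regexec() together with regmatches(), which also returns the parenthesized groups. This is a sketch under that assumption, not the only way to do it; the helper variable names are illustrative.

```r
d <- c("(Val_1 :: 0.1231313213)", "today_trans", "(Val_2 :: 0.1434343412)")

# Group 1 captures the key, group 2 captures the numeric value
m <- regmatches(d, regexec("\\(([^ ]+) :: (0\\.[0-9]+)\\)", d))

# Elements without a match come back as zero-length vectors; drop them
matched <- m[lengths(m) > 0]

# Each remaining element is c(full match, group 1, group 2)
keys <- vapply(matched, function(g) g[2], character(1))
values <- as.numeric(vapply(matched, function(g) g[3], character(1)))

print(keys)
print(values)
```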
Regular expressions are a crucial part of text mining and natural language processing. In this tutorial, you learnt the basics of string manipulation and regular expressions in R, and you can start applying these concepts as you begin your journey in text mining.