
Machine Learning Tutorial

Removing stop words with NLTK in Python

Preparing the sentences or words that arrive as input from the user is known as data pre-processing. One of the most important steps in data pre-processing is removing useless or incomplete data. 

When working on Natural Language Processing problems, the pipeline should not spend effort processing words such as 'the', 'is', and 'there'. These words are known as stop words. If stop words are not removed, they take up additional space in the database or memory, and the efficiency of the code drops considerably. 

The NLTK package provides a separate, downloadable corpus of stop words, with lists for 16 languages. Once the corpus is downloaded, the desired language is passed as an argument to select which words to ignore. 

Before getting into the Python code, let us look at a few statements with their stop words intact alongside the same statements with the stop words removed. 

Before removing stop words                         | After removing stop words
Hello my name is Bob. I am the king of my universe | Hello name Bob. king universe
Can you fetch water?                               | Fetch water

Downloading stop words of the English language 

import nltk 
nltk.download('stopwords')  # one-time download of the stop word corpus 
from nltk.corpus import stopwords 
set(stopwords.words('english')) 

Output: 

{'a', 
'about', 
'above', 
'after', 
'again', 
'against', 
'ain', 
'all', 
'am', 
'an', 
'and', 
'any', 
'are', 
'aren', 
"aren't", 
'as', 
'at', 
'be', 
'because', 
'been', 
'before', 
'being', 
'below', 
'between', 
'both', 
'but', 
'by', 
'can', 
'couldn', 
"couldn't", 
'd', 
'did', 
'didn', 
"didn't", 
'do', 
'does', 
'doesn', 
"doesn't", 
'doing', 
'don', 
"don't", 
'down', 
'during', 
'each', 
'few', 
'for', 
'from', 
'further', 
'had', 
'hadn', 
"hadn't", 
'has', 
'hasn', 
"hasn't", 
'have', 
'haven', 
"haven't", 
'having', 
'he', 
'her', 
'here', 
'hers', 
'herself', 
'him', 
'himself', 
'his', 
'how', 
'i', 
'if', 
'in', 
'into', 
'is', 
'isn', 
"isn't", 
'it', 
"it's", 
'its', 
'itself', 
'just', 
'll', 
'm', 
'ma', 
'me', 
'mightn', 
"mightn't", 
'more', 
'most', 
'mustn', 
"mustn't", 
'my', 
'myself', 
'needn', 
"needn't", 
'no', 
'nor', 
'not', 
'now', 
'o', 
'of', 
'off', 
'on', 
'once', 
'only', 
'or', 
'other', 
'our', 
'ours', 
'ourselves', 
'out', 
'over', 
'own', 
're', 
's', 
'same', 
'shan', 
"shan't", 
'she', 
"she's", 
'should', 
"should've", 
'shouldn', 
"shouldn't", 
'so', 
'some', 
'such', 
't', 
'than', 
'that', 
"that'll", 
'the', 
'their', 
'theirs', 
'them', 
'themselves', 
'then', 
'there', 
'these', 
'they', 
'this', 
'those', 
'through', 
'to', 
'too', 
'under', 
'until', 
'up', 
've', 
'very', 
'was', 
'wasn', 
"wasn't", 
'we', 
'were', 
'weren', 
"weren't", 
'what', 
'when', 
'where', 
'which', 
'while', 
'who', 
'whom', 
'why', 
'will', 
'with', 
'won', 
"won't", 
'wouldn', 
"wouldn't", 
'y', 
'you', 
"you'd", 
"you'll", 
"you're", 
"you've", 
'your', 
'yours', 
'yourself', 
'yourselves'} 

Explanation: The 'nltk' package was imported. The 'nltk' package has a module named 'corpus' which contains stop words of different languages. We specifically selected the stop words of the English language. 

Now let us pass a string as input and remove its stop words: 

import nltk 
nltk.download('stopwords')  # one-time download of the stop word corpus 
nltk.download('punkt')      # tokenizer models ('punkt_tab' on newer NLTK versions) 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

example = "Hello there, my name is Bob. I will tell you about Sam so that you know them properly. Sam is a hardworking person with a zealous heart. He is enthusiastic about sports as well as music. He composes his own music with the help of Apu. Apu loves and appreciates Sam's music" 

stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(example) 
# keep only the tokens that are not stop words 
filtered_sentence = [w for w in word_tokens if w not in stop_words] 
print(word_tokens) 
print("\n") 
print(filtered_sentence) 

Output: 

['Hello', 'there', ',', 'my', 'name', 'is', 'Bob', '.', 'I', 'will', 'tell', 'you', 'about', 'Sam', 'so', 'that', 'you', 'know', 'them', 'properly', '.', 'Sam', 'is', 'a', 'hardworking', 'person', 'with', 'a', 'zealous', 'heart', '.', 'He', 'is', 'enthusiastic', 'about', 'sports', 'as', 'well', 'as', 'music', '.', 'He', 'composes', 'his', 'own', 'music', 'with', 'the', 'help', 'of', 'Apu', '.', 'Apu', 'loves', 'and', 'appreciates', 'Sam', "'s", 'music'] 
['Hello', ',', 'name', 'Bob', '.', 'I', 'tell', 'Sam', 'know', 'properly', '.', 'Sam', 'hardworking', 'person', 'zealous', 'heart', '.', 'He', 'enthusiastic', 'sports', 'well', 'music', '.', 'He', 'composes', 'music', 'help', 'Apu', '.', 'Apu', 'loves', 'appreciates', 'Sam', "'s", 'music'] 

In addition, domain-specific stop words can be removed by explicitly programming the code to do so. Below is a demonstration. 

import nltk 
nltk.download('stopwords')  # one-time downloads, as before 
nltk.download('punkt') 
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

example = "Hello there, my name is Bob. I will tell you about Sam so that you know them properly. Sam is a hardworking person with a zealous heart. He is enthusiastic about sports as well as music. He composes his own music with the help of Apu. Apu loves and appreciates Sam's music" 

stop_words = set(stopwords.words('english')) 
word_tokens = word_tokenize(example) 
# keep only the tokens that are not standard stop words 
filtered_sentence = [w for w in word_tokens if w not in stop_words] 

print(word_tokens) 
print("\n") 

# remove the domain-specific stop words explicitly 
more_stop_words = ['Bob', 'Sam', 'Apu'] 
for w in word_tokens: 
    if w in more_stop_words: 
        filtered_sentence.remove(w) 
print(filtered_sentence) 

Output: 

['Hello', 'there', ',', 'my', 'name', 'is', 'Bob', '.', 'I', 'will', 'tell', 'you', 'about', 'Sam', 'so', 'that', 'you', 'know', 'them', 'properly', '.', 'Sam', 'is', 'a', 'hardworking', 'person', 'with', 'a', 'zealous', 'heart', '.', 'He', 'is', 'enthusiastic', 'about', 'sports', 'as', 'well', 'as', 'music', '.', 'He', 'composes', 'his', 'own', 'music', 'with', 'the', 'help', 'of', 'Apu', '.', 'Apu', 'loves', 'and', 'appreciates', 'Sam', "'s", 'music'] 
['Hello', ',', 'name', '.', 'I', 'tell', 'know', 'properly', '.', 'hardworking', 'person', 'zealous', 'heart', '.', 'He', 'enthusiastic', 'sports', 'well', 'music', '.', 'He', 'composes', 'music', 'help', '.', 'loves', 'appreciates', "'s", 'music'] 

Explanation: We provided a few sentences as input and wished to remove certain names that we treated as stop words. These names were passed to a variable as a list of words and were removed from the filtered sentence using the list's remove function. 

Conclusion

In this post, we understood how to remove stop words with the help of the NLTK package in Python.
