Machine Learning Tutorial

By KnowledgeHut

Before tokenizing text, it helps to understand the NLTK package, how it is used in Python, and what tokenization means. Let us begin with NLTK and its significance.

Tokenize text using NLTK in Python

What is NLTK?

NLTK stands for Natural Language Toolkit and is widely regarded as one of the most powerful NLP libraries. NLP (Natural Language Processing) is a field concerned with using software to analyze and manipulate human language in the form of text or speech. It identifies patterns based on the context in which statements appear.

NLTK is a Python package for working with text data. It bundles multiple modules, including text-processing libraries for classification, stemming, tokenization, tagging, parsing and semantic reasoning.

Machines ultimately represent all data as 1s and 0s. Text must therefore be converted to numbers before a model can work with it: the text is first split into tokens, and each token can then be mapped to a numeric representation, such as a word vector derived from the words that surround it.

Installation of NLTK

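At the time of writing, NLTK could be used with Python versions 2.7, 3.5, 3.6 and 3.7. Newer NLTK releases support Python 3 only, so check the NLTK installation page for the versions currently supported. It can be installed by typing the following command in the command line: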

pip install nltk

To check whether the ‘nltk’ module was installed successfully, open a Python interpreter or your IDE and run the following line:

import nltk

If this line executes without errors, the ‘nltk’ package was installed successfully.
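
The same check can be made without triggering an ImportError. The following is a minimal sketch that uses only the standard library; the helper name is_installed is a hypothetical name chosen for illustration and is not part of NLTK.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("nltk"):
    import nltk
    print("NLTK version:", nltk.__version__)
else:
    print("NLTK is not installed; run: pip install nltk")
```

Looking the package up first lets a script print a helpful message instead of stopping with a traceback.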

Terminologies associated with NLP

Corpus: The dataset or body of text on which NLP tasks are performed. The plural form of this word is ‘corpora’.
Lexicon: The list of stems and affixes that holds information about the words of the language being used.
Token: The result of tokenization: a string of contiguous characters, such as a word, a number or a punctuation mark.
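
To make these terms concrete, here is a small sketch that uses only the Python standard library. The two-sentence corpus is made-up data, whitespace splitting is a crude stand-in for a real tokenizer, and the word-count table is a simplified stand-in for a lexicon (a real lexicon would hold stems and affixes rather than raw word forms).

```python
from collections import Counter

# A toy corpus: a small collection of text documents (hypothetical data).
corpus = [
    "NLTK helps with text processing.",
    "Tokenization splits text into tokens.",
]

# Tokens: split each document on whitespace, drop the trailing period,
# and lowercase the result.
tokens = [word.strip(".").lower() for doc in corpus for word in doc.split()]

# Simplified lexicon: every distinct word form with its frequency.
lexicon = Counter(tokens)

print(len(tokens))      # 10 tokens in total
print(len(lexicon))     # 9 distinct word forms
print(lexicon["text"])  # 2
```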

What is tokenization?

Tokenization is the process of splitting text into smaller units, such as sentences or words; these units are known as ‘tokens’. There are different ways of tokenizing data. Some of them are discussed below:

Sentence tokenization

This is the process of splitting a paragraph into its individual sentences. Let us look at how this works in Python. The ‘sent_tokenize’ function splits text into sentences. It uses an instance of ‘PunktSentenceTokenizer’ from the ‘nltk.tokenize.punkt’ module. This tokenizer has already been trained on data, so it knows how to determine where a sentence begins and ends, for example by distinguishing a period that ends a sentence from one that is part of an abbreviation. The pretrained model data is distributed separately from the package; if it is missing, NLTK raises a LookupError that names the resource to fetch with ‘nltk.download()’.

from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to NLP and the NLTK module introduction"
# Split the text into a list of sentences
print(sent_tokenize(text))

Output:

['Hello everyone.', 'Welcome to NLP and the NLTK module introduction']
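
To see why a trained tokenizer is useful, compare it with a naive rule. The sketch below is not how Punkt works; it uses a hypothetical helper, naive_sent_split, that breaks the text after every '.', '!' or '?' that is followed by whitespace. The second example sentence is made up for illustration.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(naive_sent_split("Hello everyone. Welcome to NLP and the NLTK module introduction"))
# ['Hello everyone.', 'Welcome to NLP and the NLTK module introduction']

print(naive_sent_split("Dr. Smith uses NLTK. It works well."))
# ['Dr.', 'Smith uses NLTK.', 'It works well.']
```

The rule handles the tutorial's example text but wrongly breaks after the abbreviation ‘Dr.’, which is the kind of mistake a tokenizer trained on real text is designed to avoid.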

Word tokenization

This is the process of splitting a sentence into individual words and punctuation marks.

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to NLP and the NLTK module introduction"
# Split the text into word and punctuation tokens
print(word_tokenize(text))

Output:

['Hello', 'everyone', '.', 'Welcome', 'to', 'NLP', 'and', 'the', 'NLTK', 'module', 'introduction']
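
Splitting on whitespace alone would leave punctuation attached to words, as in ‘everyone.’. As a rough illustration of separating the two, the sketch below uses a regular expression from the standard library. The helper simple_word_tokenize is a hypothetical name, not an NLTK function, and NLTK's ‘word_tokenize’ applies more elaborate rules, for instance for contractions.

```python
import re


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


text = "Hello everyone. Welcome to NLP and the NLTK module introduction"

print(text.split()[:2])                # ['Hello', 'everyone.']
print(simple_word_tokenize(text)[:3])  # ['Hello', 'everyone', '.']
```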

Conclusion

In this post, we looked at the significance of NLP and NLTK, and at how sentences and words can be tokenized in Python.
