Search

Series List Filter

How to Work With a PDF in Python

Whether it is an ebook, digitally signed agreements, password protected documents, or scanned documents such as passports, the most preferred file format is PDF or Portable Document Format. It was originally developed by Adobe and is a file format used to present and transfer documents easily and reliably. It uses the file extension .pdf. In fact, PDF being the most widely used digital media, is now considered as an open standard which is maintained by the International Standards Organization (ISO). Python has relatively easy syntax which makes it even easier for the ones who are in their initial stage of learning the language. The popular Python libraries are well suited and integrated which allows to easily extract documents from a PDF, rotate pages if required, split pdf to make separate documents, or add watermarks in them. Now an important question rises, why do we need Python to process PDFs? Well, processing a PDF falls under the category of text analytics. There are several libraries and frameworks available which are designed in Python exclusively for text analytics. This makes it easier to play with a PDF in Python. You can also extract information from PDF and use into Natural Language Processing or any other Machine Learning models. Get certified and learn more about Python Programming and apply those skills and knowledge in the real world.History of  pyPDF, PyPDF2, pyPDF4The first PyPDF package was released in 2005 and the last official release in 2010. After a year or so, a  company named Phasit sponsored a branch of the PyPDF called PyPDF2 which was consistent with the original package and worked pretty well for several years.A series of packages were released later on with the name of PyPDF3 and later renamed as PyPDF4. The biggest difference between PyPDF and the other versions was that the later versions supported Python3. PyPDF2 has been discarded recently. But since PyPDF4 is not fully backward compatible with the PyPDf2, it is suggested to use PyPDF2. You can also use a substitute package - pdfrw. Pdfrw was created by Patrick Maupin and allows you to perform all functions which PyPDF2 is capable of except a few such as encryption, decryption, and types of decompression.Some common libraries in PythonLet us look into some of the libraries Python offers to handle PDFs:PdfMiner It is a tool used to extract information from PDF documents. PDFMiner allows the user to analyze text data and obtain the definite location of a text. It provides information such as fonts and lines. We can also use it as a PDF transformer and a PDF parser.PyPDF2PyPDF2 is purely a Python library which allows users to split, merge, crop, encrypt, and transform PDFs. You can also add customized data, view options, and passwords to the documents. Tabula-pyIt is a Python wrapper of tabula-java which can read tables from PDF files and convert into Pandas Dataframe or into CSV/TSV/JSON file formats.SlateIt is a Python package which facilitates the extraction of information and is dependent on the PdfMiner package.PDFQueryA light Python wrapper which uses minimum code to extract data from PDFs.xPDFIt is an open source viewer of PDF which also includes an extractor, converter and other utilities. Out of all the libraries mentioned above, PyPDF2 is the most used to perform operations like extraction, merging, splitting and so on.Installing PyPDF2If you're using Anaconda, you can install PyPDF2 using pip or conda. To install PyPDF2 using pip, run the following command in the command line:pip install PyPDF2The module is case-sensitive. So you need to make sure that proper syntax is followed. The installation is really quick since PyPDF2 is free of dependencies.Extracting Document Information from a PDF in PythonPyPDF2 can be used to extract metadata and all sorts of texts from PDF when you are performing operations on preexisting PDF files. The types of data you can extract are:AuthorCreatorProducerSubjectTitleNumber of PagesTo understand it better, let us use an existing PDF in your system or you can go to Leanpub and download a book sample.The code for extracting the document information from the PDF—# get_doc_info.py from PyPDF2 import PdfFileReader def getinfo(path):     with open(path, 'rb') as f:         PDF = PdfFileReader(f)         information = PDF.getDocumentInfo()         numberofpages = PDF.getNumPages()     print(information)     author = information.author     creator = information.creator     producer =information .producer     subject = information.subject     title = information.title if __name__ == '__main__':     path = 'reportlab-sample.pdf'     getinfo(path)The output of the program above will look like—Here, we have firstly imported PdfFileReader from the PyPDF2 package. The class PdfFileReader is used to interact with PDF files like reading and extracting information using accessor methods. Then, we have created our own function getinfo with a PDF file as an argument and then called the getdocumentinfo(). This returned an instance of DocumentInformation. And finally we got extract information like the author, creator, subject or title, etc.getNumPages() is used to count the number of pages in the document. PdfMiner can be used when you want to extract text from a PDF file. It is potent and particularly designed for extracting text from PDF.We have learned to extract information from PDF. Now let’s learn how to rotate a PDF. Rotating pages in PDFA lot of times we receive PDFs which contain pages in landscape orientation instead of portrait. You may also find certain documents to be upside down, which happens while scanning a document or mailing. However, we can rotate the pages clockwise or counterclockwise according to our choice using Python with PyPDF2.The code for rotating the article is as follows—# rotate_pages.py from PyPDF2 import PdfFileReader, PdfFileWriter def rotate(pdf_path):     pdf_write = PdfFileWriter()     pdf_read = PdfFileReader(path)     # Rotate page 90 degrees to the right     page1 = pdf_read.getPage(0).rotateClockwise(90)     pdf_write.addPage(page1)     # Rotate page 90 degrees to the left     page2 = pdf_read.getPage(1).rotateCounterClockwise(90)     pdf_write.addPage(page2)     # Add a page in normal orientation     pdf_write.addPage(pdf_read.getPage(2))     with open('rotate_pages.pdf', 'wb') as fh:         pdf_write.write(fh) if __name__ == '__main__':     path = 'mldocument.pdf'     rotate(path)The output of the code will be as follows—Here firstly we imported the PdfFileReader and the PdfFileWriter so that we can write out a new PDF file. Then we declared a function rotate with a path to the PDF that is to be modified. Within the function, we created a read object pdf_read and write object pdf_write.Then, we used the getPage() to grab the pages. Two pages page1 and page2 are taken and rotated to 90 degrees clockwise and 90 degrees counterclockwise respectively using rotateClockwise() and rotateCounterClockwise().We used addPage() function after each rotation method calls. This adds the rotated page to the write object. The last page we add is page3 without any rotation.Lastly, we have used write() with a file-like parameter to write out the new PDF. The final PDF contains three pages, the first two will be in the landscape mode and rotated in reversed direction and the third page will be in normal orientation.Now we will learn to merge different PDFs into one.Merging PDFsIn many cases, we need to merge two PDFs into a single one. For example, suppose you are working on a project report and you need to print it and bind it into a book. It contains a cover page followed by the project report. So you have two different PDFs and you want to merge them into one PDF. You can simply use Python to do so. Let us see how can we merge PDFs into one.The code for merging two PDF documents using PyPDF in mentioned below:# pdf_merging.py from PyPDF2 import PdfFileReader, PdfFileWriter def pdfmerger(paths, output):     pdfwrite = PdfFileWriter()     for path in paths:         pdfread = PdfFileReader(path)         for page in range(pdfread.getNumPages()):             # Add each page to the writer object             pdfwrite.addPage(pdfread.getPage(page))     # Write out the merged PDF     with open(output, 'wb') as out:         pdfwrite.write(out) if __name__ == '__main__':     paths = ['document-1.pdf', 'document-2.pdf']     pdfmerger(paths, output='merged.pdf')Here we have created a function pdfmerger() which takes a number of inputs and a single output. Then we created a PdfFileReader() object for each PDF path and looped over the pages, added each page to the write object. Finally, using the write() function the object’s contents are written to the disk.PyPDF2 makes the process of merging simpler by creating the PdfFileMerger class.Code for merging two documents using PyPDF2—# pdf_merger2.py import glob from PyPDF2 import PdfFileMerger def merger(output_path, input_paths):     pdfmerge = PdfFileMerger()     file_handles = []     for path in input_paths:         pdfmerge.append(path)     with open(output_path, 'wb') as fileobj:         pdfmerge.write(fileobj) if __name__ == '__main__':     paths = glob.glob('d-1.pdf')     paths.sort()     merger('d-2.pdf', paths)The PyPDF2 makes it simpler in the way that we don’t need to loop the pages of each document ourselves.  Here, we created the object pdfmerge and looped through the PDF paths. The PyPDF2 automatically appends the whole document. Finally, we write it out.Let’s perform the opposite of merging now!Splitting PDFsThe PyPDF2 package has the ability to split up a single PDF into multiple PDFs. It allows us to split pages into different PDFs. Suppose we have a set of scanned documents in a single PDF and we need to separate the pages into different PDFs as per requirement, we can simply use Python to select pages we want to split and get the work done.Code for splitting a single PDF into multiple PDFs—# pdf_splitter.py import os from PyPDF2 import PdfFileReader, PdfFileWriter def splitpdf(path):     fname = os.path.splitext(os.path.basename(path))[0]     pdf = PdfFileReader(path)     for page in range(pdf.getNumPages()):         pdfwrite = PdfFileWriter()         pdfwrite.addPage(pdf.getPage(page))         outputfilename = '{}_page_{}.pdf'.format(             fname, page+1)         with open(outputfilename, 'wb') as out:             pdfwrite.write(out)         print('Created: {}'.format(outputfilename)) if __name__ == '__main__':     path = 'document-1.pdf'     splitpdf(path)Here we have imported the PdfFileReader and PdfFileWriter from PyPDF2. Then we created a function called splitpdf() which accepts the path of PDF we want to split. The first line of the function takes the name of the input file. Then we open the PDF and create a read object. Using the read object’s getNumPages(), we loop over all the pages.In the next step, we created an instance of PdfFileWriter inside the for loop. Then, we created a PDF write instance and added each page to it for each of the pages in the PDF input. We also created a unique filename using the original filename + the word ‘page’ + the page number + 1.Once we are done with running the script, we will have each of the pages of the input PDF split into multiple PDFs. Now let us learn how to add a watermark to a PDF and keep it secured.Adding Overlays/WatermarksAn image or superimposed text on selected pages in a PDF document is referred to as a Watermark. The Watermark adds security features and protects our rational property like images and PDFs. Watermarks are also called overlays.The PyPDF2 allows us to watermark documents. We just need to have a PDF which will consist of our watermark text, image or signature.Code for adding a watermark in a PDF—# watermarker.py from PyPDF2 import PdfFileWriter, PdfFileReader def watermark(inputpdf, outputpdf, watermarkpdf):     watermark = PdfFileReader(watermarkpdf)     watermarkpage = watermark.getPage(0)     pdf = PdfFileReader(inputpdf)     pdfwrite = PdfFileWriter()     for page in range(pdf.getNumPages()):         pdfpage = pdf.getPage(page)         pdfpage.mergePage(watermarkpage)         pdfwrite.addPage(pdfpage)     with open(outputpdf, 'wb') as fh:         pdfwrite.write(fh) if __name__ == '__main__':     watermark(inputpdf='document-1.pdf',               outputpdf='watermarked_w9.pdf',               watermarkpdf='watermark.pdf')The output of the code will look like— There are three arguments of the function watermark(): inputpdf: The path of the PDF that is to be watermarked. outputpdf: The path where the watermarked PDF will be saved. watermarkpdf: The PDF which contains the watermark.Firstly, we extract the PDF page which contains the watermark image or text and then open that PDF page where we want to give the desired watermark.Using the inputpdf, we create a read object and using the pdfwrite, we create a write object to write out the watermarked PDF and then iterate over the pages.Next, we call the page object’s mergePage and apply the watermark and add that to the write object pdfwrite.When the loop terminates, the watermarked PDF is written out to the disk and it’s done!Encrypting a PDFIn the PDF world, the PyPDF2 package allows an owner password which gives the user the advantage to work as an administrator. The package also provides the user password which allows us to open the document upon entering the password.The PyPDF2 basically doesn’t permit any allowances on any PDF file yet it allows the user to set the owner password and user password.Code to add a password and add encryption to a PDF—# pdf_encrypt.py from PyPDF2 import PdfFileWriter, PdfFileReader def encryption(inputpdf, outputpdf, password):     pdfwrite = PdfFileWriter()     pdfread = PdfFileReader(inputpdf)     for page in range(pdfread.getNumPages()):         pdfwrite.addPage(pdfread.getPage(page))     pdfwrite.encrypt(user_pwd=password, owner_pwd=None,                       use_128bit=True)     with open(outputpdf, 'wb') as fh:         pdfwrite.write(fh) if __name__ == '__main__':     encryption(inputpdf='document-1.pdf',                   outputpdf='document-1-encrypted.pdf',                   password='twofish')We declare a  function named encryption() with three arguments—the input PDF path, the output PDF path and the password that we want to keep. Then we create one read object pdfread and one write object pdfwrite. Now we loop over all the pages and add them to the write object since we need to encrypt the entire document.Finally, we call the encrypt() function which accepts three parameters—the user password, the owner password and the whether or not to use 128-bit encryption. The PDF  will be encrypted to 40-bit encryption if the argument use128bit is set to false. Also if the owner password is set to none, then it will be set to user password automatically.Reading the Table data from PDFSuppose you want to work with the Table data in Pdf, you can use tabula-py to read tables in a PDF. To install tabula-py, run:pip install tabula-pyCode to extract simple Text from pdf using PyPDF2:import tabula # readinf the PDF file that contain Table Data # you can find the pdf file with complete code in below # read_pdf will save the pdf table into Pandas Dataframe df = tabula.read_pdf("document.pdf") # in order to print first 5 lines of Table df.head()If you PDF file contains Multiple Tabledf = tabula.read_pdf("document.pdf",multiple_tables=True)If you want to extract Information from the specific part of any specific page of PDFtabula.read_pdf("document.pdf", area=(126,149,212,462), pages=1)If you want the output into JSON Formattabula.read_pdf("offense.pdf", output_format="json")Exporting PDF into ExcelSuppose you want to export a PDF into Excel, you can do so by writing the following code and convert the PDF Data into Excel or CSV.tabula.convert_into("document.pdf", "document_testing.xlsx", output_format="xlsx")Let us sum up what we have learned in the article:Extraction of data from a PDFRotate pages in a PDFMerge PDFs into one PDFSplit a PDF into many PDFsAdd watermarks or overlays in a PDFAdd password or encryption to a PDFReading table from PDFExporting PDF into Excel or CSVAs you have seen, PyPDF2 is one of the most useful tools available in Python. The features of PyPDF2 makes life easier whether you are working on a large project or even when you quickly want to make some changes to your PDF documents. Learn more about such libraries and frameworks as KnowledgeHut offers Python Certification Course for Programmers, Developers, Jr./Sr Software Engineers/Developers and anybody who wants to learn Python.
Rated 4.5/5 based on 1 customer reviews

How to Work With a PDF in Python

7623
How to Work With a PDF in Python

Whether it is an ebook, digitally signed agreements, password protected documents, or scanned documents such as passports, the most preferred file format is PDF or Portable Document Format. It was originally developed by Adobe and is a file format used to present and transfer documents easily and reliably. It uses the file extension .pdf. In fact, PDF being the most widely used digital media, is now considered as an open standard which is maintained by the International Standards Organization (ISO). 

Python has relatively easy syntax which makes it even easier for the ones who are in their initial stage of learning the language. The popular Python libraries are well suited and integrated which allows to easily extract documents from a PDF, rotate pages if required, split pdf to make separate documents, or add watermarks in them. 

Now an important question rises, why do we need Python to process PDFs? Well, processing a PDF falls under the category of text analytics. There are several libraries and frameworks available which are designed in Python exclusively for text analytics. This makes it easier to play with a PDF in Python. You can also extract information from PDF and use into Natural Language Processing or any other Machine Learning models. Get certified and learn more about Python Programming and apply those skills and knowledge in the real world.

History of  pyPDF, PyPDF2, pyPDF4

The first PyPDF package was released in 2005 and the last official release in 2010. After a year or so, a  company named Phasit sponsored a branch of the PyPDF called PyPDF2 which was consistent with the original package and worked pretty well for several years.

A series of packages were released later on with the name of PyPDF3 and later renamed as PyPDF4. The biggest difference between PyPDF and the other versions was that the later versions supported Python3. 

PyPDF2 has been discarded recently. But since PyPDF4 is not fully backward compatible with the PyPDf2, it is suggested to use PyPDF2. You can also use a substitute package - pdfrw. Pdfrw was created by Patrick Maupin and allows you to perform all functions which PyPDF2 is capable of except a few such as encryption, decryption, and types of decompression.

Some common libraries in Python

Let us look into some of the libraries Python offers to handle PDFs:

PdfMiner 

It is a tool used to extract information from PDF documents. PDFMiner allows the user to analyze text data and obtain the definite location of a text. It provides information such as fonts and lines. We can also use it as a PDF transformer and a PDF parser.

PyPDF2

PyPDF2 is purely a Python library which allows users to split, merge, crop, encrypt, and transform PDFs. You can also add customized data, view options, and passwords to the documents. 

Tabula-py

It is a Python wrapper of tabula-java which can read tables from PDF files and convert into Pandas Dataframe or into CSV/TSV/JSON file formats.

Slate

It is a Python package which facilitates the extraction of information and is dependent on the PdfMiner package.

PDFQuery

A light Python wrapper which uses minimum code to extract data from PDFs.

xPDF

It is an open source viewer of PDF which also includes an extractor, converter and other utilities. 

Out of all the libraries mentioned above, PyPDF2 is the most used to perform operations like extraction, merging, splitting and so on.

Installing PyPDF2

If you're using Anaconda, you can install PyPDF2 using pip or conda. To install PyPDF2 using pip, run the following command in the command line:

pip install PyPDF2

The module is case-sensitive. So you need to make sure that proper syntax is followed. The installation is really quick since PyPDF2 is free of dependencies.

Extracting Document Information from a PDF in Python

PyPDF2 can be used to extract metadata and all sorts of texts from PDF when you are performing operations on preexisting PDF files. The types of data you can extract are:

  • Author
  • Creator
  • Producer
  • Subject
  • Title
  • Number of Pages

To understand it better, let us use an existing PDF in your system or you can go to Leanpub and download a book sample.

The code for extracting the document information from the PDF—

# get_doc_info.py
from PyPDF2 import PdfFileReader
def getinfo(path):
    with open(path, 'rb') as f:
        PDF = PdfFileReader(f)
        information = PDF.getDocumentInfo()
        numberofpages = PDF.getNumPages()
    print(information)
    author = information.author
    creator = information.creator
    producer =information .producer
    subject = information.subject
    title = information.title
if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    getinfo(path)

The output of the program above will look like—

Here, we have firstly imported PdfFileReader from the PyPDF2 package. The class PdfFileReader is used to interact with PDF files like reading and extracting information using accessor methods. 

Then, we have created our own function getinfo with a PDF file as an argument and then called the getdocumentinfo()This returned an instance of DocumentInformation. And finally we got extract information like the author, creator, subject or title, etc.

getNumPages() is used to count the number of pages in the document. 

PdfMiner can be used when you want to extract text from a PDF file. It is potent and particularly designed for extracting text from PDF.

We have learned to extract information from PDF. Now let’s learn how to rotate a PDF. 

Rotating pages in PDF

A lot of times we receive PDFs which contain pages in landscape orientation instead of portrait. You may also find certain documents to be upside down, which happens while scanning a document or mailing. However, we can rotate the pages clockwise or counterclockwise according to our choice using Python with PyPDF2.

The code for rotating the article is as follows—

# rotate_pages.py
from PyPDF2 import PdfFileReader, PdfFileWriter
def rotate(pdf_path):
    pdf_write = PdfFileWriter()
    pdf_read = PdfFileReader(path)
    # Rotate page 90 degrees to the right
    page1 = pdf_read.getPage(0).rotateClockwise(90)
    pdf_write.addPage(page1)
    # Rotate page 90 degrees to the left
    page2 = pdf_read.getPage(1).rotateCounterClockwise(90)
    pdf_write.addPage(page2)
    # Add a page in normal orientation
    pdf_write.addPage(pdf_read.getPage(2))
    with open('rotate_pages.pdf', 'wb') as fh:
        pdf_write.write(fh)
if __name__ == '__main__':
    path = 'mldocument.pdf'
    rotate(path)

The output of the code will be as follows—

Rotating Pages in a PDF using Python Coding. The image shows the output of the code which has before and after screenshots of a PDF page being rotated 90 degrees clockwise.

Here firstly we imported the PdfFileReader and the PdfFileWriter so that we can write out a new PDF file. Then we declared a function rotate with a path to the PDF that is to be modified. Within the function, we created a read object pdf_read and write object pdf_write.

Then, we used the getPage() to grab the pages. Two pages page1 and page2 are taken and rotated to 90 degrees clockwise and 90 degrees counterclockwise respectively using rotateClockwise() and rotateCounterClockwise().

We used addPage() function after each rotation method calls. This adds the rotated page to the write object. The last page we add is page3 without any rotation.

Lastly, we have used write() with a file-like parameter to write out the new PDF. The final PDF contains three pages, the first two will be in the landscape mode and rotated in reversed direction and the third page will be in normal orientation.

Now we will learn to merge different PDFs into one.

Merging PDFs

In many cases, we need to merge two PDFs into a single one. For example, suppose you are working on a project report and you need to print it and bind it into a book. It contains a cover page followed by the project report. So you have two different PDFs and you want to merge them into one PDF. You can simply use Python to do so. Let us see how can we merge PDFs into one.

The code for merging two PDF documents using PyPDF in mentioned below:

# pdf_merging.py
from PyPDF2 import PdfFileReader, PdfFileWriter
def pdfmerger(paths, output):
    pdfwrite = PdfFileWriter()
    for path in paths:
        pdfread = PdfFileReader(path)
        for page in range(pdfread.getNumPages()):
            # Add each page to the writer object
            pdfwrite.addPage(pdfread.getPage(page))
    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdfwrite.write(out)
if __name__ == '__main__':
    paths = ['document-1.pdf', 'document-2.pdf']
    pdfmerger(paths, output='merged.pdf')

Here we have created a function pdfmerger() which takes a number of inputs and a single output. Then we created a PdfFileReader() object for each PDF path and looped over the pages, added each page to the write object. Finally, using the write() function the object’s contents are written to the disk.

PyPDF2 makes the process of merging simpler by creating the PdfFileMerger class.

Code for merging two documents using PyPDF2—

# pdf_merger2.py

import glob
from PyPDF2 import PdfFileMerger

def merger(output_path, input_paths):
    pdfmerge = PdfFileMerger()
    file_handles = []

    for path in input_paths:
        pdfmerge.append(path)

    with open(output_path, 'wb') as fileobj:
        pdfmerge.write(fileobj)

if __name__ == '__main__':
    paths = glob.glob('d-1.pdf')
    paths.sort()
    merger('d-2.pdf', paths)

The PyPDF2 makes it simpler in the way that we don’t need to loop the pages of each document ourselves.  Here, we created the object pdfmerge and looped through the PDF paths. The PyPDF2 automatically appends the whole document. Finally, we write it out.

Let’s perform the opposite of merging now!

Splitting PDFs

The PyPDF2 package has the ability to split up a single PDF into multiple PDFs. It allows us to split pages into different PDFs. Suppose we have a set of scanned documents in a single PDF and we need to separate the pages into different PDFs as per requirement, we can simply use Python to select pages we want to split and get the work done.

Code for splitting a single PDF into multiple PDFs—

# pdf_splitter.py
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
def splitpdf(path):
    fname = os.path.splitext(os.path.basename(path))[0]
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdfwrite = PdfFileWriter()
        pdfwrite.addPage(pdf.getPage(page))
        outputfilename = '{}_page_{}.pdf'.format(
            fname, page+1)
        with open(outputfilename, 'wb') as out:
            pdfwrite.write(out)
        print('Created: {}'.format(outputfilename))
if __name__ == '__main__':
    path = 'document-1.pdf'
    splitpdf(path)

Here we have imported the PdfFileReader and PdfFileWriter from PyPDF2. Then we created a function called splitpdf() which accepts the path of PDF we want to split. 

The first line of the function takes the name of the input file. Then we open the PDF and create a read object. Using the read object’s getNumPages(), we loop over all the pages.

In the next step, we created an instance of PdfFileWriter inside the for loop. Then, we created a PDF write instance and added each page to it for each of the pages in the PDF input. We also created a unique filename using the original filename + the word ‘page’ + the page number + 1.

Once we are done with running the script, we will have each of the pages of the input PDF split into multiple PDFs. 

Now let us learn how to add a watermark to a PDF and keep it secured.

Adding Overlays/Watermarks

An image or superimposed text on selected pages in a PDF document is referred to as a Watermark. The Watermark adds security features and protects our rational property like images and PDFs. Watermarks are also called overlays.

The PyPDF2 allows us to watermark documents. We just need to have a PDF which will consist of our watermark text, image or signature.

Code for adding a watermark in a PDF—

# watermarker.py
from PyPDF2 import PdfFileWriter, PdfFileReader
def watermark(inputpdf, outputpdf, watermarkpdf):
    watermark = PdfFileReader(watermarkpdf)
    watermarkpage = watermark.getPage(0)
    pdf = PdfFileReader(inputpdf)
    pdfwrite = PdfFileWriter()
    for page in range(pdf.getNumPages()):
        pdfpage = pdf.getPage(page)
        pdfpage.mergePage(watermarkpage)
        pdfwrite.addPage(pdfpage)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)
if __name__ == '__main__':
    watermark(inputpdf='document-1.pdf',
              outputpdf='watermarked_w9.pdf',
              watermarkpdf='watermark.pdf')

The output of the code will look like— 

How to add overlays/watermarks in a PDF using Python. This image shows the output of a Python code which adds a "Top Secret" overlay to the PDF page

There are three arguments of the function watermark():

  1.  inputpdf: The path of the PDF that is to be watermarked.
  2.  outputpdf: The path where the watermarked PDF will be saved.
  3.  watermarkpdf: The PDF which contains the watermark.

Firstly, we extract the PDF page which contains the watermark image or text and then open that PDF page where we want to give the desired watermark.

Using the inputpdf, we create a read object and using the pdfwrite, we create a write object to write out the watermarked PDF and then iterate over the pages.

Next, we call the page object’s mergePage and apply the watermark and add that to the write object pdfwrite.

When the loop terminates, the watermarked PDF is written out to the disk and it’s done!

Encrypting a PDF

In the PDF world, the PyPDF2 package allows an owner password which gives the user the advantage to work as an administrator. The package also provides the user password which allows us to open the document upon entering the password.

The PyPDF2 basically doesn’t permit any allowances on any PDF file yet it allows the user to set the owner password and user password.

Code to add a password and add encryption to a PDF—

# pdf_encrypt.py
from PyPDF2 import PdfFileWriter, PdfFileReader
def encryption(inputpdf, outputpdf, password):
    pdfwrite = PdfFileWriter()
    pdfread = PdfFileReader(inputpdf)
    for page in range(pdfread.getNumPages()):
        pdfwrite.addPage(pdfread.getPage(page))
    pdfwrite.encrypt(user_pwd=password, owner_pwd=None,
                      use_128bit=True)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)
if __name__ == '__main__':
    encryption(inputpdf='document-1.pdf',
                  outputpdf='document-1-encrypted.pdf',
                  password='twofish')

We declare a  function named encryption() with three arguments—the input PDF path, the output PDF path and the password that we want to keep. 

Then we create one read object pdfread and one write object pdfwrite. Now we loop over all the pages and add them to the write object since we need to encrypt the entire document.

Finally, we call the encrypt() function which accepts three parameters—the user password, the owner password and the whether or not to use 128-bit encryption. The PDF  will be encrypted to 40-bit encryption if the argument use128bit is set to false. Also if the owner password is set to none, then it will be set to user password automatically.

Reading the Table data from PDF

Suppose you want to work with the Table data in Pdf, you can use tabula-py to read tables in a PDF. To install tabula-py, run:

pip install tabula-py

Code to extract simple Text from pdf using PyPDF2:

import tabula
# readinf the PDF file that contain Table Data
# you can find the pdf file with complete code in below
# read_pdf will save the pdf table into Pandas Dataframe

df = tabula.read_pdf("document.pdf")
# in order to print first 5 lines of Table

df.head()

If you PDF file contains Multiple Table

df = tabula.read_pdf("document.pdf",multiple_tables=True)

If you want to extract Information from the specific part of any specific page of PDF

tabula.read_pdf("document.pdf", area=(126,149,212,462), pages=1)

If you want the output into JSON Format

tabula.read_pdf("offense.pdf", output_format="json")

Exporting PDF into Excel

Suppose you want to export a PDF into Excel, you can do so by writing the following code and convert the PDF Data into Excel or CSV.

tabula.convert_into("document.pdf", "document_testing.xlsx", output_format="xlsx")

Let us sum up what we have learned in the article:

  • Extraction of data from a PDF
  • Rotate pages in a PDF
  • Merge PDFs into one PDF
  • Split a PDF into many PDFs
  • Add watermarks or overlays in a PDF
  • Add password or encryption to a PDF
  • Reading table from PDF
  • Exporting PDF into Excel or CSV

As you have seen, PyPDF2 is one of the most useful tools available in Python. The features of PyPDF2 makes life easier whether you are working on a large project or even when you quickly want to make some changes to your PDF documents. Learn more about such libraries and frameworks as KnowledgeHut offers Python Certification Course for Programmers, Developers, Jr./Sr Software Engineers/Developers and anybody who wants to learn Python.

Priyankur

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

1 comments

Rithvik sharma 11 Jul 2019 1 likes

Nice understanding and easy to read

Suggested Blogs

Python Vs Scala

Created by Guido Rossum, Python is an object-oriented, high-level performing programming language which focuses on code readability. Python has several new features like it requires less typing, provides new libraries, fast prototyping, etc. It provides with dynamic typing and dynamic binding options, making it extremely attractive in the field of Rapid Application Development.Designed by Martin Odersky, Scala is also an object-oriented programming language that also supports the functional style of programming, but on a larger scale. It gets its name as a combination of the words ‘scalable’ and ‘language’, where it can scale according to the number of users. With time, it has become one of the most in-demand technologies amongst the developers.Both Python and Scala play very crucial roles in the growth and future of Data Science, Big Data, and Cluster computing. Below are a few major differences between Python and Scala.Major Differences between Python and ScalaSr No. PythonScala1.Python is a dynamically typed language.Scala is a statically typed language.2.Since it is dynamically typed Object Oriented Programming language, objects need not be specified.Since it is statically typed Object Oriented Programming language, the type of variable and objects are needed to be specified.3.Dynamically typed language creates extra work for the interpreter at the run time. It has to decide the type of data at run time.There is no extra work in Scala, as it uses JVM. Hence, it is 10 times faster than Python.4.It decides the type of data at the run time.This isn’t the case for Scala. Hence, Scala should be considered instead of Python when dealing with large data process.5.Python has huge community support.Scala has good community support as well, but it is lesser when compared to Python.6.Testing process and its methodologies are complex in Python because it is a dynamic programming language.Testing is done better in Scala as it is a statically typed language.7.Python supports the forked process, which is a heavyweight process, while it does not support multithreading.Scala has relative cores and a list of asynchronous libraries. Hence, it is a better option for implementing concurrency.8.Python is very popular thanks to its easy and English-like syntax.For scalable and concurrent systems like SoundCloud and Twitter, Scala plays much bigger.9.It is easy to learn and use.It is less difficult to learn than Python.10.Python is slower in terms of performance.Scala is 10 times faster than Python.11.Developers find it easy to code in Python.Scala isn’t easy to master due to its syntactic sugars.12.Whenever any change is made to the existing code, Python becomes prone to bugs.Scala provides an interface to catch the compile-time error.13.Python has proper libraries and tools for data science, machine learning and  Natural Language Processing (NLP).Scala does not have any such tools.14.Python has an interface for various OS system calls and libraries. It has many interpreters as well.Scala is a compiled language and hence, the source codes are compiled before execution.15.Python can be used for small-scale projects.Scala can be used for large-scale projects.16.Python does not provide with scalable feature support.Scala provides scalable feature support.Key differences between Python and ScalaThe key differences between Python and Scala are explained in the points mentioned below:Python language is dynamically typed, while Scala is statically typed.Development is faster, rapid and more productive in Python as it doesn’t need any compilation for most of its cases. While in the case of Scala, the process of compilation is slow. Hence, the development of Scala application takes more time.There are several platforms available for Python, but CPython is most widely used, whereas, applications run in JVM when it comes to Scala.As per task complexities, python has huge libraries, whereas, for Scala, it has small libraries.Python uses a decent amount of memory, while Scala has more memory consumption.Python is easier to learn than Scala.Since Python is a dynamic language, it executes slowly as compared to Scala.Python is less complex while testing as it is dynamic, while Scala is good for testing as it is static.Being a mature language, Python continues to grow. But Scala doesn’t have any widespread knowledge base or use.To conclude:We have compared Python and Scala over a range of factors, and it can be deduced that the selection of language depends on you and the requirements of your project. Hence, analyze and learn all the different artifacts of both Python and Scala before deciding on which one language. If you wish to excel in your career as a successful Python developer, you can opt for Python Certification with KnowledgeHut. Learn from the best! 
Rated 4.5/5 based on 4 customer reviews
6783
Python Vs Scala

Created by Guido Rossum, Python is an object-orien... Read More

How To Write Beautiful Python Code With PEP 8

It gets difficult to understand a messed up handwriting, similarly an unreadable and unstructured code is not accepted by all. However, you can benefit as a programmer only when you can express better with your code. This is where PEP comes to the rescue. Python Enhancement Proposal or PEP is a design document which provides information to the Python community and also describes new features and document aspects, such as style and design for Python.Python is a multi-paradigm programming language which is easy to learn and has gained popularity in the fields of Data Science and Web Development over a few years and PEP 8 is called the style code of Python. It was written by Guido van Rossum, Barry Warsaw, and Nick Coghlan in the year 2001. It focuses on enhancing Python’s code readability and consistency. Join the certification course on Python Programming and gain skills and knowledge about various features of Python along with tips and tricks.A Foolish Consistency is the Hobgoblin of Little Minds‘A great person does not have to think consistently from one day to the next’ — this is what the statement means.Consistency is what matters. It is considered as the style guide. You should maintain consistency within a project and mostly within a function or module.However, there will be situations where you need to make use of your own judgement, where consistency isn’t considered an option. You must know when you need to be inconsistent like for example when applying the guideline would make the code less readable or when the code needs to comply with the earlier versions of Python which the style guide doesn’t recommend. In simple terms, you cannot break the backward compatibility to follow with PEP.The Zen of PythonIt is a collection of 19 ‘guiding principles’ which was originally written by Tim Peters in the year 1999. It guides the design of the Python Programming Language.Python was developed with some goals in mind. You can see those when you type the following code and run it:>>> import this The Zen of Python, by Tim Peters Beautiful is better than ugly. Explicit is better than implicit. Simple is better than complex. Complex is better than complicated. Flat is better than nested. Sparse is better than dense. Readability counts. Special cases aren't special enough to break the rules. Although practicality beats purity. Errors should never pass silently. Unless explicitly silenced. In the face of ambiguity, refuse the temptation to guess. There should be one-- and preferably only one --obvious way to do it. Although that way may not be obvious at first unless you're Dutch. Now is better than never. Although never is often better than *right* now. If the implementation is hard to explain, it's a bad idea. If the implementation is easy to explain, it may be a good idea. Namespaces are one honking great idea -- let's do more of those!The Need for PEP 8Readability is the key to good code. Writing good code is like an art form which acts as a subjective topic for different developers.Readability is important in the sense that once you write a code, you need to remember what the code does and why you have written it. You might never write that code again, but you’ll have to read that piece of code again and again while working in a project. PEP 8 adds a logical meaning to your code by making sure your variables are named well, sufficient whitespaces are there or not and also by commenting well. If you’re a beginner to the language, PEP 8 would make your coding experience more pleasant.Following PEP 8 would also make your task easier if you’re working as a professional developer. People who are unknown to you and have never seen how you style your code will be able to easily read and understand your code only if you follow and recognize a particular guideline where readability is your de facto.And as Guido van Rossum said— “Code is read much more than it is often written”.The Code LayoutYour code layout has a huge impact on the readability of your code.IndentationThe indentation level of line is computed by the leading spaces and tabs at the beginning of a line of logic. It influences the grouping of statements. The rules of PEP 8 says to use 4 spaces per indentation level and also spaces should be preferred over tabs.An example of code to show indentation:x = 5 if x < 10:   print('x is less than 10') Tabs or Spaces?Here the print statement is indented which informs Python to execute the statement only if the if statement is true. Indentation also helps Python to know what code it will execute during function calls and also when using classes.PEP 8 recommends using 4 spaces to show indentation and tabs should only be used to maintain consistency in the code.Python 3 forbids the mixing of spaces and tabs for indentation. You can either use tabs or spaces and you should maintain consistency while using Python 3. The errors are automatically displayed:python hello.py  File "hello.py", line 3       print(i, j)                 ^TabError: inconsistent use of tabs and spaces in indentationHowever, if you’re working in Python 2, you can check the consistency by using a -t flag in your code which will display the warnings of inconsistencies with the use of spaces and tabs.You can also use the -tt flag which will show the errors instead of warnings and also the location of inconsistencies in your code. Maximum Line Length and Line BreakingThe Python Library is conservative and 79 characters are the maximum required line limit as suggested by PEP 8. This helps to avoid line wrapping.Since maintaining the limit to 79 characters isn’t always possible, so PEP 8 allows wrapping lines using Python’s implied line continuation with parentheses, brackets, and braces:def function(argument_1, argument_2,             argument_3, argument_4):     return argument_1Or by using backslashes to break lines:with open('/path/to/some/file/you/want/to/read') as example_1, \     open('/path/to/some/file/being/written', 'w') as example_2:     file_2.write(file_1.read())When it comes to binary operators, PEP 8 encourages to break lines before the binary operators. This accounts for more readable code.Let us understand this by comparing two examples:# Example 1 # Do total = ( variable_1 + variable_2 - variable_3 ) # Example 2 # Don't total = ( variable_1 + variable_2 - variable_3 )In the first example, it is easily understood which variable is added or subtracted, since the operator is just next to the variable to which it is operated. However, in the second example, it is a little difficult to understand which variable is added or subtracted.Indentation with Line BreaksIndentation allows a user to differentiate between multiple lines of code and a single line of code that spans multiple lines. It enhances readability too.The first style of indentation is to adjust the indented block with the delimiter:def function(argument_one, argument_two,               argument_three, argument_four):         return argument_oneYou can also improve readability by adding comments:x = 10 if (x > 5 and     x < 20):     # If Both conditions are satisfied     print(x)Or by adding extra indentation:x = 10 if (x > 5 and       x < 20):     print(x)Another type of indentation is the hanging indentation by which you can symbolize a continuation of a line of code visually:foo = long_function_name(       variable_one, variable_two,       variable_three, variable_four)You can choose any of the methods of indentation, following line breaks, in situations where the 79 character line limit forces you to add line breaks in your code, which will ultimately improve the readability.Closing Braces in Line ContinuationsClosing the braces after line breaks can be easily forgotten, so it is important to put it somewhere where it makes good sense or it can be confusing for a reader.One way provided by PEP 8 is to put the closing braces with the first white-space character of the last line:my_list_of_numbers = [     1, 2, 3,     4, 5, 6,     7, 8, 9     ]Or lining up under the first character of line that initiates the multi-line construct:my_list_of_numbers = [     1, 2, 3,     4, 5, 6,     7, 8, 9 ]Remember, you can use any of the two options keeping in mind the consistency of the code.Blank LinesBlank lines are also called vertical whitespaces. It is a logical line consisting of spaces, tabs, formfeeds or comments that are basically ignored.Using blank lines in top-level-functions and classes:class my_first_class:     pass class my_second_class:     pass def top_level_function():     return NoneAdding two blank lines between the top-level-functions and classes will have a clear separation and will add more sensibility to the code.Using blank lines in defining methods inside classes:class my_class:     def method_1(self):         return None     def method_2(self):         return NoneHere, a single vertical space is enough for a readable code.You can also use blank spaces inside multi-step functions. It helps the reader to gather the logic of your function and understand it efficiently. A single blank line will work in such case.An example to illustrate such:def calculate_average(number_list):     sum_list = 0     for number in number_list:         sum_list = sum_list + number         average = 0     average = sum_list / len(number_list)    return averageAbove is a function to calculate the average. There is a blank line between each step and also before the return statement.The use of blank lines can greatly improve the readability of your code and it also allows the reader to understand the separation of the sections of code and the relation between them.Naming ConventionsChoosing names which are sensible and can be easily understandable, while coding in Python, is very crucial. This will save time and energy of a reader. Inappropriate names might lead to difficulties when debugging.Naming StylesNaming variables, functions, classes or methods must be done very carefully. Here’s a list of the type, naming conventions and examples on how to use them:TypeNaming ConventionsExamplesVariableUsing short names with CapWords.T, AnyString, My_First_VariableFunctionUsing a lowercase word or words with underscores to improve readability.function, my_first_functionClassUsing CapWords and do not use underscores between words.Student, MyFirstClassMethodUsing lowercase words separated by underscores.Student_method, methodConstantsUsing all capital letters with underscores separating wordsTOTAL, MY_CONSTANT, MAX_FLOWExceptionsUsing CapWords without underscores.IndexError, NameErrorModuleUsing short lower-case letters using underscores.module.py, my_first_module.pyPackageUsing short lowercase words and underscores are discouraged.package, my_first_packageChoosing namesTo have readability in your code, choose names which are descriptive and give a clearer sense of what the object represents. A more real-life approach to naming is necessary for a reader to understand the code.Consider a situation where you want to store the name of a person as a string:>>> name = 'John William' >>> first_name, last_name = name.split() >>> print(first_name, last_name, sep='/ ') John/ WilliamHere, you can see, we have chosen variable names like first_name and last_name which are clearer to understand and can be easily remembered. We could have used short names like x, y or z but it is not recommended by PEP 8 since it is difficult to keep track of such short names.Consider another situation where you want to double a single argument. We can choose an abbreviation like db for the function name:# Don't def db(x):     return x * 2However, abbreviations might be difficult in situations where you want to return back to the same code after a couple of days and still be able to read and understand. In such cases, it’s better to use a concise name like double_a_variable:# Do def double_a_value(x):     return x * 2Ultimately, what matters is the readability of your code.CommentsA comment is a piece of code written in simple English which improves the readability of code without changing the outcome of a program. You can understand the aim of the code much faster just by reading the comments instead of the actual code. It is important in analyzing codes, debugging or making a change in logic. Block CommentsBlock comments are used while importing data from files or changing a database entry where multiples lines of code are written to focus on a single action. They help in interpreting the aim and functionality of a given block of code.They start with a hash(#) and a single space and always indent to the same level as the code:for i in range(0, 10):     # Loop iterates 10 times and then prints i     # Newline character     print(i, '\n')You can also use multiple paragraphs in a block comment while working on a more technical program. Block comments are the most suitable type of comments and you can use it anywhere you like.Inline CommentsInline comments are the comments which are placed on the same line as the statement. They are helpful in explaining why a certain line of code is essential.Example of inline comments:x = 10  # An inline comment y = 'JK Rowling' # Author NameInline comments are more specific in nature and can easily be used which might lead to clutter. So, PEP 8 basically recommends using block comments for general-purpose coding.Document StringsDocument strings or docstrings start at the first line of any function, class, file, module or method. These type of comments are enclosed between single quotations ( ''') or double quotations ( """ ).An example of docstring:def quadratic_formula(x, y, z, t):     """Using the quadratic formula"""     t_1 = (- b+(b**2-4*a*c)**(1/2)) / (2*a)     t_2 = (- b-(b**2-4*a*c)**(1/2)) / (2*a)     return t_1, t_2Whitespaces in Expressions and StatementsIn computing, whitespace is any character or sequence of characters which are used for spacing and have an ‘empty’ representation. It is helpful in improving the readability of expressions and statements if used properly.Whitespace around Binary OperatorsWhen you’re using assignment operators ( =, +=, -=,and so forth ) or comparisons ( ==, !=, >, =,
Rated 4.5/5 based on 1 customer reviews
8619
How To Write Beautiful Python Code With PEP 8

It gets difficult to understand a messed up handwr... Read More

How To Run Your Python Scripts

If you are planning to enter the world of Python programming, the first and the most essential skill you should learn is knowing how to run Python scripts and code. Once you grab a seat in the show, it will be easier for you to understand whether the code will actually work or not.Python, being one of the leading programming languages, has relatively easy syntax which makes it even easier for the ones who are in their initial stage of learning the language. Also, it is the language of choice for working with large datasets and data science projects. Get certified and learn more about Python Programming and apply those skills and knowledge in the real world.What is the difference between Code, Script and Modules?In computing, the code is a language that is converted from a human language into a set of ‘words’ which the computer can understand. It is also referred to as a piece of statements together which forms a program. A simple function or a statement can also be considered a code.On the other hand, a script is a file consisting of a logical sequence of instructions or a batch processing file that is interpreted by another program instead of the computer processor.In simple terms, a script is a simple program, stored in a plain file text which contains Python code. The code can be directly executed by the user. A script is also called a top-level-program-file. A module is an object in Python with random attributes that you can bind and reference.Is Python a Programming Language or a Scripting Language?Basically, all scripting languages are considered to be programming languages. The main difference between the two is that programming languages are compiled, whereas scripting languages are interpreted. Scripting languages are slower than programming languages and usually sit behind them. Since they only run on a subset of the programming language, they have less access to a computer’s local abilities. Python can be called a scripting language as well as a programming language since it works both as a compiler and an interpreter. A standard Python can compile Python code into bytecodes and then interpret it just like Java and C.However, considering the historical relationship between the general purpose programming language and the scripting language, it will be more appropriate to say that Python is a general-purpose programming language which works nicely as a scripting language too.The Python InterpreterThe Interpreter is a layer of software that works as a bridge between the program and the system hardware to keep the code running. A Python interpreter is an application which is responsible for running Python scripts.The Python Interpreter works on the Read-Eval-Print-Loop (REPL) environment.Reads the command.Evaluates the command.Prints the result.Loops back and process gets repeated.The interpreter terminates when we use the exit() or quit() command otherwise the execution keeps on going.A Python Interpreter runs code in two ways— In the form of a script or module.In the form of a piece of code written in an interactive session.Starting the Python InterpreterThe simplest way to start the interpreter is to open the terminal and then use the interpreter from the command-line.To open the command-line interpreter:On Windows, the command-line is called the command prompt or MS-DOS console. A quicker way to access it is to go to Start menu → Run and type cmd.On GNU/Linux, the command-line can be accessed by several applications like xterm, Gnome Terminal or Konsole.On MAC OS X, the system terminal is accessed through Applications → Utilities → Terminal. Running Python Code InteractivelyRunning Python code through an interactive session is an extensively used way. An interactive session is an excellent development tool to venture with the language and allows you to test every piece of Python code on the go.To initiate a Python interactive session, type python in the command-line or terminal and hit the ENTER key from the keyboard.An example of how to do this on Windows:C:\users>python Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32 Type "help", "copyright", "credits" or "license()" for more information. >>>The >>> on the terminal represents the standard prompt for the interactive mode. If you do not see these characters, you need to re-install Python on your system.The statements you write when working with an interactive session are evaluated and executed immediately:print('HELLO WORLD!') HELLO WORLD! 2 + 3 5 print('Welcome to the world of PYTHON') Welcome to the world of PYTHON The only disadvantage is when you close the interactive session, the code no longer exists.Running Python Scripts by the InterpreterThe term Python Execution Model is given to the entire multi-step process to run Python scripts.At first, the statements or expressions of your script are processed in a sequential manner by the interpreter. Then the code is compiled into a form of instruction set called the bytecode.Basically, the code is converted into a low-level language known as the bytecode. It is an intermediate, machine-independent code which optimizes the process of code execution. So, the interpreter ignores the compilation step when executing the code for the next time.Finally, the interpreter transfers the code for execution.The Python Virtual Machine (PVM) is the ultimate step of the Python interpreter process. It is a part of the Python environment installed in your system. The PVM loads the bytecode in the Python runtime and reads each operation and executes them as indicated. It is the component which actually runs your scripts.Running Python Scripts using Command-LineThe most sought after way of writing a Python program is by using a plain text editor. The code written in the Python interactive session is lost once the session is closed, though it allows the user to write a lot of lines of code. On Windows, the files use the .py extension.  If you are at the beginning of working with Python, you can use editors like Sublime or Notepad++ which are easy-to-use or any other text editors.Now you need to create a test script. In order to do that, open your most suited text editor and write the following code:print('Hello World!')Then save the file in your desktop with the name first_script.py or anything you like. Remember you need to give the .py extension only.Using python commandThe most basic and the easy way to run Python scripts is by using the python command. You need to open a command-line and type the word python followed by the path to your script file, like this:python first_script.py Hello World!Then you hit the ENTER button from the keyboard and that's it. You can see the phrase Hello World! on the screen. Congrats! You just ran your first Python script. However, if you do not get the output, you might want to check your system PATH and the place where you saved your file. If it still doesn’t work, re-install Python in your system and try again.Redirecting outputWindows and Unix-like systems have a process called stream redirection. You can redirect the output of your stream to some other file format instead of the standard system output. It is useful to save the output in a different file for later analysis.An example of how you can do this:python first_script.py > output.txtWhat happens is your Python script is redirected to the output.txt file. If the file doesn’t exist, it is systematically created. However, if it already exists, the contents are replaced.Running modules with the -m optionA module is a file which contains the Python code. It allows you to arrange your Python code in a logical manner. It defines functions, classes, and variables and can also include runnable code.If you want to run a Python module, there are a lot of command-line options which Python offers according to the needs of the user. One of which is the command  python -m . It searches the module name in the sys.path and runs the content as __main__:python -m first_script Hello World!Note that the module-name is a module object and not any string.Using Script FilenameWindows makes use of the system registers and file association to run Python scripts. It determines the program needed to run that particular file. You need to simply enter the file-name containing the code.An example on how to do this using command prompt:C:\Users\Local\Python\Python37> first_script.py Hello World!On GNU/Linux systems, you need to add a line before the text— #!/usr/bin/env python. Python considers this line nothing but the operating system considers it everything. It helps the system to decide what program should it use to run the file.The character combination #! known as hashbang or shebang is what the line starts with, which is then followed by the interpreter path.Finally, to run scripts, assign execution permissions and configure the hashbang line and then simply type the filename in the command line:#Assign the execution permissions chmod +x first_script.py #Run script using its filename ./first_script.py Hello World!However, if it doesn’t work, you might want to check if the script is located in your currentworking directory or not. Otherwise, you can use the path of the file for this method. Running Python Scripts InteractivelyAs we have discussed earlier, running Python scripts in an interactive session is the most common way of writing scripts and also offers a wide range of possibilities.Using importImporting a module means loading its contents so that it can be later accessed and used. It is the most usual way of invoking the import machinery. It is analogous to #include in C or C++. Using import, the Python code in one module gets access to the code in another module. An implementation of the import:import first_script Hello World!You can see its execution only when the module contains calls to functions, methods or other statements which generate visible output.One important thing to note is that the import option works only once per session. This is because these operations are expensive.For this method to work efficiently, you should keep the file containing the Python code in your current working directory and also the file should be in the Python Module Search Path (PMSP). The PMSP is the place where the modules and packages are imported.You can run the code below to know what’s in your current PSMP:import sys for path in sys.path: print(path)\Users\Local\Python37\Lib\idlelib \Users\Local\Python37\python37.zip \Users\Local\Python37\DLLs \Users\Local\Python37\lib \Users\Local\Python37 \Users\Local\Python37\lib\site-packagesYou’ll get the list of directories and .zip files where your modules and packages are imported.Using importlibimportlib is a module which is an implementation of the import statement in the Python code. It contains the import_module whose work is to execute any module or script by imitating the import operation.An example to perform this:import importlib importlib.import_module('first_script') Hello World! importlib.reload() is used to re-import the module since you cannot use import to run it for the second time. Even if you use import after the first time, it will do nothing. importlib.reload() is useful when you want to modify and test your changes without exiting the current session.The following code shows that:import first_script #First import Hello World! import first_script import importlib #Second import does nothing importlib.reload(first_script) Hello World! However, you can only use a module object and not any string as the argument of reload(). If you use a string as an argument, it will show a TypeError as follows:importlib.reload(first_script)Traceback (most recent call last): ... ...   raise TypeError("reload() argument must be a module") TypeError: reload() argument must be a moduleUsing runpy.run_module() and runpy.run_path()The Python Standard Library has a module named runpy. run_module() is a function in runpy whose work is to execute modules without importing them in the first place. The module is located using import and then executed. The first argument of the run_module() must contain a string:import runpy runpy.run_module(mod_name='first_script') Hello World! {'__name__': 'first_script',     ... '_': None}}Similarly, runpy contains another function run_path() which allows you to run a module by providing a location.An example of such is as follows:import runpy runpy.run_path(file_path='first_script.py') Hello World! {'__name__': '',     ... '_': None}}Both the functions return the globals dictionary of the executed module.Using exec()Other than the most commonly used ways to run Python scripts, there are other alternative ways. One such way is by using the built-in function exec(). It is used for the dynamic execution of Python code, be it a string or an object code.An example of exec() is:exec(open('first_script.py').read()) Hello World!Using py_compilepy_compile is a module which behaves like the import statement. It generates two functions— one to generate the bytecode from the source file and another when the source file is invoked as a script.You can compile your Python script using this module:import py_compile py_compile.compile('first_script.py'  '__pycache__\\first_script.cpython-37.pyc' The py_compile generates a new subdirectory named "__pycache__" if it doesn’t already exist. Inside the subdirectory, a Compiled Python File (.pyc) version of the script file is created. When you open the .pyc file, you can see the output of your Python script.Running Python Scripts using an IDE or a Text EditorAn Integrated Development Environment (IDE) is an application that allows a developer to build software within an integrated environment in addition to the required tools.You can use the Python IDLE, a default IDE of the standard Python Distribution to write, debug, modify, and run your modules and scripts. You can use other IDEs like Spyder, PyCharm, Eclipse, and Jupyter Notebook which also allow you to run your scripts inside its environment.You can also use popular text editors like Sublime and Atom to run Python scripts.If you want to run a Python script from your IDE or text editor, you need to create a project first. Once it is created, add your .py file to it or you can just simply create one using the IDE. Finally, run it and you can see the output in your screen.Running Python Scripts from a File ManagerIf you want to run your Python script in a file manager, all you need to do is just double-click on the file icon. This option is mainly used in the production stage after you have released the source code.However, to achieve this, some conditions must be met:On Windows, to run your script by double-clicking on them, you need to save your script file with extension .py for python.exe and .pyw for pythonw.exe.If you are using the command-line for running your script, you might likely come  through a situation where you’ll see a flash of a black window on the screen. To avert this, include a statement at the tail of the script — input(‘Enter’). This will exit the program only when you hit the ENTER key. Note that the input() function will work only if your code is free of errors.On GNU/Linux and other Unix-like systems, your Python script must contain the hashbang line and execution permissions. Otherwise, the double-click trick won’t work in a file manager.Though it is easy to execute a script by just double-clicking on the file, it isn’t considered a feasible option because of the limitations and dependency factors it comes with, like the operating system, the file manager, execution permissions, and also the file associations.So it is suggested to use this option only after the code is debugged and ready to be in the production market.ConclusionWorking with scripts has its own advantages like they are easy to learn and use, faster edit and run, interactivity, functionality and so on. They are also used to automate complex tasks in a simplified manner.In this article, you have learned to run your Python scripts using:The terminal or the command-line of the operating system.The Python Interactive session.Your favorite IDE or text editor.The system file manager.Here, you have gathered the knowledge and skills of how to run your scripts using various techniques.You will feel more comfortable working with larger and more complex Python environments which in turn will enhance the development process and increase efficiency. You can learn more about such techniques as KnowledgeHut offers Python Certification Course.
Rated 4.5/5 based on 19 customer reviews
5927
How To Run Your Python Scripts

If you are planning to enter the world of Python p... Read More