How to Work With PDF in Python

Last updated on
06th Jun, 2022
Published
10th Jul, 2019

Whether it is an ebook, a digitally signed agreement, a password-protected document, or a scanned document such as a passport, the most preferred file format is PDF, or Portable Document Format. It was originally developed by Adobe and is used to present and transfer documents easily and reliably. It uses the file extension .pdf. In fact, PDF is among the most widely used formats for digital documents and is now an open standard maintained by the International Organization for Standardization (ISO).

Python has a relatively easy syntax, which helps those who are in the initial stage of learning the language. The popular Python libraries are well integrated and make it easy to extract information from a PDF, rotate pages if required, split a PDF into separate documents, or add watermarks.

Now an important question arises: why do we need Python to process PDFs? Processing a PDF falls under the category of text analytics, and several Python libraries and frameworks are designed specifically for text analytics. You can also extract information from a PDF and feed it into Natural Language Processing or other Machine Learning models.

History of pyPdf, PyPDF2, and PyPDF4

The first pyPdf package was released in 2005, and its last official release came in 2010. After a year or so, a company named Phaseit sponsored a fork of pyPdf called PyPDF2, which was consistent with the original package and worked well for several years.

A short series of releases followed under the name PyPDF3, and the project was later renamed PyPDF4. The biggest difference between pyPdf and the later packages is that the later ones support Python 3.

PyPDF2 was abandoned recently. However, since PyPDF4 is not fully backward compatible with PyPDF2, it is still suggested to use PyPDF2. You can also use a substitute package, pdfrw. pdfrw was created by Patrick Maupin and can do most of what PyPDF2 does, with a few exceptions such as encryption, decryption, and some types of decompression.

Some Common Libraries in Python

Let us look into some of the libraries Python offers to handle PDFs:

PDFMiner

It is a tool used to extract information from PDF documents. PDFMiner allows the user to analyze text data and obtain the exact location of a piece of text on the page. It also provides layout information such as fonts and lines. We can also use it as a PDF transformer and a PDF parser.

PyPDF2

PyPDF2 is a pure-Python library that allows users to split, merge, crop, encrypt, and transform PDFs. You can also add custom data, viewing options, and passwords to the documents.

Tabula-py

It is a Python wrapper of tabula-java which can read tables from PDF files and convert them into pandas DataFrames or into CSV/TSV/JSON files.

Slate

It is a Python package that simplifies text extraction from PDFs and depends on the PDFMiner package.

PDFQuery

A light Python wrapper around PDFMiner, lxml, and pyquery that extracts data from PDFs with minimal code.

Xpdf

It is an open-source PDF viewer which also includes a text extractor, converters, and other command-line utilities.

Out of all the libraries mentioned above, PyPDF2 is the most used to perform operations like extraction, merging, splitting, and so on.

Installing PyPDF2

If you're using Anaconda, you can install PyPDF2 using pip or conda. To install PyPDF2 using pip, run the following command in the command line:

pip install PyPDF2

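The package name is case-sensitive, so type it exactly as shown. The installation is quick since PyPDF2 is free of dependencies. Note that the examples in this article use the camelCase names (PdfFileReader, getPage(), and so on) from older PyPDF2 releases. PyPDF2 3.0 removed these names, so install an earlier release (for example, pip install "PyPDF2<3.0") to run the examples unchanged.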
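To confirm that the package was installed, and which release you got, the standard library can report a package's version. The following is a minimal sketch rather than part of PyPDF2; installed_version is a hypothetical helper name, and it returns None when the package is missing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string such as the one pip reported, or None if not installed
print(installed_version("PyPDF2"))
```

This avoids importing the package itself, so it also works as a quick check inside a setup script.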

Extracting Document Information from a PDF in Python

PyPDF2 can extract metadata and some text from existing PDF files. The types of data you can extract are:

  • Author
  • Creator
  • Producer
  • Subject
  • Title
  • Number of Pages

To try this out, use an existing PDF on your system, or go to Leanpub and download a book sample.

The code for extracting the document information from the PDF:

# get_doc_info.py
from PyPDF2 import PdfFileReader


def getinfo(path):
    # Open in binary mode; the reader parses the file while it is open
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        numberofpages = pdf.getNumPages()
    print(information)
    # Individual metadata fields are available as attributes
    author = information.author
    creator = information.creator
    producer = information.producer
    subject = information.subject
    title = information.title
    return author, creator, producer, subject, title, numberofpages


if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    getinfo(path)

The program prints the document information, a dictionary-like object holding the PDF's metadata entries.

Here, we first import PdfFileReader from the PyPDF2 package. The PdfFileReader class is used to interact with PDF files, for example to read them and extract information through accessor methods.

Then, we define our own function getinfo, which takes the path of a PDF file as its argument, and call getDocumentInfo(). This returns an instance of DocumentInformation. Finally, we extract information such as the author, creator, producer, subject, and title from it.

getNumPages() is used to count the number of pages in the document.
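The fields listed above are often easier to work with as a plain dictionary. The sketch below assumes only that the information object exposes the attributes used in the example; summarize is a hypothetical helper, and a SimpleNamespace with made-up values stands in for the real object so the code runs without a PDF.

```python
from types import SimpleNamespace

FIELDS = ("author", "creator", "producer", "subject", "title")


def summarize(information, number_of_pages):
    """Return the metadata fields as a dict; missing fields become None."""
    summary = {name: getattr(information, name, None) for name in FIELDS}
    summary["pages"] = number_of_pages
    return summary


# Stand-in for the object returned by getDocumentInfo() (illustrative values only)
fake_info = SimpleNamespace(author="Jane Doe", creator=None,
                            producer="ReportLab", subject=None, title="Sample")
print(summarize(fake_info, 3))
```

Using getattr with a default also copes with a PDF that has no information dictionary at all, in which case the information object may be None.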

PDFMiner is a better choice when you want to extract the text of a PDF file. It is powerful and designed specifically for extracting text from PDFs.

We have learned to extract information from PDF. Now let’s learn how to rotate a PDF. 

Rotating Pages in PDF

We often receive PDFs that contain pages in landscape orientation instead of portrait. Some documents even arrive upside down, which can happen when a document is scanned or emailed. With PyPDF2, we can rotate such pages clockwise or counterclockwise as needed.

The code for rotating pages is as follows:

# rotate_pages.py
from PyPDF2 import PdfFileReader, PdfFileWriter


def rotate(pdf_path):
    pdf_write = PdfFileWriter()
    pdf_read = PdfFileReader(pdf_path)
    # Rotate page 90 degrees to the right
    page1 = pdf_read.getPage(0).rotateClockwise(90)
    pdf_write.addPage(page1)
    # Rotate page 90 degrees to the left
    page2 = pdf_read.getPage(1).rotateCounterClockwise(90)
    pdf_write.addPage(page2)
    # Add a page in normal orientation
    pdf_write.addPage(pdf_read.getPage(2))
    with open('rotate_pages.pdf', 'wb') as fh:
        pdf_write.write(fh)


if __name__ == '__main__':
    path = 'mldocument.pdf'
    rotate(path)

The output of the code will be as follows:

[Figure: Rotating pages output in Python]

Here we first import PdfFileReader and PdfFileWriter so that we can write out a new PDF file. Then we declare a function rotate that takes the path of the PDF to be modified. Within the function, we create a read object pdf_read and a write object pdf_write.

Then, we use getPage() to grab the pages. Two pages, page1 and page2, are rotated 90 degrees clockwise and 90 degrees counterclockwise respectively using rotateClockwise() and rotateCounterClockwise().

We call addPage() after each rotation call. This adds the rotated page to the write object. The last page we add is the third page, without any rotation.

Lastly, we use write() with a file-like parameter to write out the new PDF. The final PDF contains three pages: the first two are in landscape mode, rotated in opposite directions, and the third page is in its normal orientation.
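A PDF page records its orientation as a rotation in degrees that must be a multiple of 90, so turning a page 90 degrees counterclockwise is equivalent to turning it 270 degrees clockwise. The sketch below shows only that arithmetic; combined_rotation is a hypothetical helper, not a PyPDF2 function.

```python
def combined_rotation(current, delta):
    """Return the clockwise rotation after turning `delta` more degrees.

    A negative `delta` means a counterclockwise turn.
    """
    if delta % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    return (current + delta) % 360


print(combined_rotation(0, 90))     # 90, like page1 in the example
print(combined_rotation(0, -90))    # 270, like page2 in the example
print(combined_rotation(270, 180))  # 90
```

A check like this is handy when the angle comes from user input, because it rejects values such as 45 before they reach the PDF.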

Now we will learn to merge different PDFs into one.

Merging PDFs

In many cases, we need to merge two PDFs into a single one. For example, suppose you are working on a project report that you need to print and bind into a book. It consists of a cover page followed by the report itself, so you have two different PDFs that you want to merge into one. Let us see how to do that with Python.

The code for merging two PDF documents using PyPDF2 is shown below:

# pdf_merging.py
from PyPDF2 import PdfFileReader, PdfFileWriter
def pdfmerger(paths, output):
    pdfwrite = PdfFileWriter()
    for path in paths:
        pdfread = PdfFileReader(path)
        for page in range(pdfread.getNumPages()):
            # Add each page to the writer object
            pdfwrite.addPage(pdfread.getPage(page))
    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdfwrite.write(out)
if __name__ == '__main__':
    paths = ['document-1.pdf', 'document-2.pdf']
    pdfmerger(paths, output='merged.pdf')

Here we have created a function pdfmerger() which takes a list of input paths and a single output path. We create a PdfFileReader object for each PDF path, loop over its pages, and add each page to the write object. Finally, write() writes the object's contents to disk.

PyPDF2 makes merging simpler still by providing the PdfFileMerger class.

Code for merging documents using PdfFileMerger:

# pdf_merger2.py

import glob
from PyPDF2 import PdfFileMerger


def merger(output_path, input_paths):
    pdfmerge = PdfFileMerger()

    # append() adds every page of the given document
    for path in input_paths:
        pdfmerge.append(path)

    with open(output_path, 'wb') as fileobj:
        pdfmerge.write(fileobj)


if __name__ == '__main__':
    # Collect every PDF whose name starts with 'd-', in sorted order
    paths = glob.glob('d-*.pdf')
    paths.sort()
    merger('merged.pdf', paths)

PdfFileMerger makes this simpler because we don't need to loop over the pages of each document ourselves. Here, we create the object pdfmerge and loop through the PDF paths. Each call to append() adds a whole document. Finally, we write the result out.
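The order of the merged document follows the order of the sorted paths, and a plain sort compares names character by character, so a file numbered 10 lands before a file numbered 2. One way around this is a natural sort key; natural_key below is a hypothetical helper for illustration, and the file names are made up.

```python
import re


def natural_key(path):
    """Split a name into text and integer chunks so 'd-10' sorts after 'd-2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r"(\d+)", path)]


paths = ["d-10.pdf", "d-2.pdf", "d-1.pdf"]
print(sorted(paths))                   # ['d-1.pdf', 'd-10.pdf', 'd-2.pdf']
print(sorted(paths, key=natural_key))  # ['d-1.pdf', 'd-2.pdf', 'd-10.pdf']
```

Passing the naturally sorted list to the merge function keeps numbered documents in their intended sequence.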

Let’s perform the opposite of merging now!

Splitting PDFs

The PyPDF2 package can split a single PDF into multiple PDFs. Suppose we have a set of scanned documents in a single PDF and need to separate the pages into different PDFs. We can use Python to select the pages we want and write each one to its own file.

Code for splitting a single PDF into multiple PDFs—

# pdf_splitter.py
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
def splitpdf(path):
    fname = os.path.splitext(os.path.basename(path))[0]
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdfwrite = PdfFileWriter()
        pdfwrite.addPage(pdf.getPage(page))
        outputfilename = '{}_page_{}.pdf'.format(
            fname, page+1)
        with open(outputfilename, 'wb') as out:
            pdfwrite.write(out)
        print('Created: {}'.format(outputfilename))
if __name__ == '__main__':
    path = 'document-1.pdf'
    splitpdf(path)

Here we have imported PdfFileReader and PdfFileWriter from PyPDF2. Then we created a function called splitpdf() which accepts the path of the PDF we want to split.

The first line of the function takes the name of the input file without its directory or extension. Then we open the PDF and create a read object. Using the read object's getNumPages(), we loop over all the pages.

Inside the for loop, we create a new PdfFileWriter instance for every page of the input PDF and add that single page to it. We also build a unique filename from the original filename, the word 'page', and the page index plus 1, so that numbering starts at 1.

Once the script has run, each page of the input PDF has been written to its own PDF file.
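The same loop can write several pages per output file instead of one. The sketch below only computes the page ranges; page_chunks is a hypothetical helper, and the indexes are zero-based like the ones passed to getPage().

```python
def page_chunks(num_pages, chunk_size):
    """Return (start, stop) page-index pairs; stop is exclusive, as in range()."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, num_pages))
            for start in range(0, num_pages, chunk_size)]


print(page_chunks(5, 2))  # [(0, 2), (2, 4), (4, 5)]
print(page_chunks(3, 1))  # [(0, 1), (1, 2), (2, 3)]
```

Each pair could drive an inner loop that adds the pages in range(start, stop) to a fresh writer before one output file is written per chunk. A chunk size of 1 reproduces the one-page-per-file behaviour of the script above.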

Now let us learn how to add a watermark to a PDF and keep it secured.

Adding Overlays/Watermarks

An image or piece of text superimposed on selected pages of a PDF document is referred to as a watermark. A watermark identifies the owner of a document and helps protect intellectual property such as images and PDFs. Watermarks are also called overlays.

PyPDF2 allows us to watermark documents. We just need a PDF that contains our watermark text, image, or signature.

Code for adding a watermark to a PDF:

# watermarker.py
from PyPDF2 import PdfFileWriter, PdfFileReader


def watermark(inputpdf, outputpdf, watermarkpdf):
    # The first page of the watermark PDF is stamped onto every page
    watermarkreader = PdfFileReader(watermarkpdf)
    watermarkpage = watermarkreader.getPage(0)
    pdf = PdfFileReader(inputpdf)
    pdfwrite = PdfFileWriter()
    for page in range(pdf.getNumPages()):
        pdfpage = pdf.getPage(page)
        # mergePage() draws the watermark on top of the page's content
        pdfpage.mergePage(watermarkpage)
        pdfwrite.addPage(pdfpage)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)


if __name__ == '__main__':
    watermark(inputpdf='document-1.pdf',
              outputpdf='watermarked_w9.pdf',
              watermarkpdf='watermark.pdf')

The output of the code will look like this:

[Figure: Adding overlays/watermarks output in Python]

There are three arguments of the function watermark():

  1.  inputpdf: The path of the PDF that is to be watermarked.
  2.  outputpdf: The path where the watermarked PDF will be saved.
  3.  watermarkpdf: The PDF which contains the watermark.

First, we open the watermark PDF and grab its first page, which contains the watermark image or text.

Then we create a read object from inputpdf and a write object, pdfwrite, which will write out the watermarked PDF. After that we iterate over the pages of the input.

For each page, we call the page object's mergePage() to apply the watermark and then add the page to the write object pdfwrite.

When the loop finishes, the watermarked PDF is written out to disk and we are done!
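The script above stamps every page, while a watermark is often wanted only on selected pages. A small parser can turn a selection such as "1-3,5" into the zero-based indexes that the loop would then check. parse_pages is a hypothetical helper for illustration, not a PyPDF2 function.

```python
def parse_pages(selection, num_pages):
    """Convert '1-3,5' (1-based, inclusive) into sorted 0-based page indexes."""
    indexes = set()
    for part in selection.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        stop = int(last) if last else start
        if not 1 <= start <= stop <= num_pages:
            raise ValueError(f"page range {part!r} is outside 1-{num_pages}")
        indexes.update(range(start - 1, stop))
    return sorted(indexes)


print(parse_pages("1-3,5", 6))  # [0, 1, 2, 4]
```

Inside the page loop, the watermark would then be merged only when the current index is in the returned list, and every page would still be added to the writer.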

Encrypting a PDF

PyPDF2 supports adding a user password and an owner password to an existing PDF. In the PDF world, the owner password gives administrator-like privileges over the document, such as setting permissions, while the user password only allows the document to be opened.

PyPDF2 does not appear to let you set any permissions on the document, even though it lets you set both the owner password and the user password.

Code to add a password and encryption to a PDF:

# pdf_encrypt.py
from PyPDF2 import PdfFileWriter, PdfFileReader


def encryption(inputpdf, outputpdf, password):
    pdfwrite = PdfFileWriter()
    pdfread = PdfFileReader(inputpdf)
    for page in range(pdfread.getNumPages()):
        pdfwrite.addPage(pdfread.getPage(page))
    pdfwrite.encrypt(user_pwd=password, owner_pwd=None,
                     use_128bit=True)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)


if __name__ == '__main__':
    encryption(inputpdf='document-1.pdf',
               outputpdf='document-1-encrypted.pdf',
               password='twofish')

We declare a function named encryption() with three arguments: the input PDF path, the output PDF path, and the password that we want to set.

Then we create one read object pdfread and one write object pdfwrite. We loop over all the pages and add them to the write object, since we need to encrypt the entire document.

Finally, we call the encrypt() function, which accepts three parameters: the user password, the owner password, and whether or not to use 128-bit encryption. The PDF is encrypted with 40-bit encryption if the argument use_128bit is set to False. If the owner password is None, it is set to the user password automatically.
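The defaults described above can be written out explicitly. The sketch below only mirrors the behaviour stated in this section; encryption_settings is a hypothetical helper, it is not PyPDF2's implementation, and it does not encrypt anything.

```python
def encryption_settings(user_pwd, owner_pwd=None, use_128bit=True):
    """Return the effective settings implied by the arguments (illustration only)."""
    return {
        "user_pwd": user_pwd,
        # With no owner password, the user password is used for both
        "owner_pwd": user_pwd if owner_pwd is None else owner_pwd,
        # 128-bit key when use_128bit is True, otherwise 40-bit
        "key_bits": 128 if use_128bit else 40,
    }


print(encryption_settings("twofish"))
print(encryption_settings("twofish", owner_pwd="admin", use_128bit=False))
```

Spelling the fallback out makes it obvious that the call in the script above leaves the owner password identical to the user password.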

Reading Table Data from a PDF

If you want to work with table data in a PDF, you can use tabula-py to read the tables. Because tabula-py wraps tabula-java, it needs a Java runtime on your system. To install tabula-py, run:

pip install tabula-py

Code to read a table from a PDF using tabula-py:

import tabula
# Read the PDF file that contains table data.
# In tabula-py 2.0 and later, read_pdf returns a list of pandas DataFrames,
# one per table found (older releases returned a single DataFrame by default).
dfs = tabula.read_pdf("document.pdf")
# Print the first 5 rows of the first table
print(dfs[0].head())

If your PDF file contains multiple tables:

dfs = tabula.read_pdf("document.pdf", multiple_tables=True)

If you want to extract data from a specific region of a specific page, pass area as (top, left, bottom, right) together with the page number:

tabula.read_pdf("document.pdf", area=(126,149,212,462), pages=1)
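The numbers in area are easier to choose when measured on the page in inches. The sketch below assumes, as tabula's documentation describes, that the coordinates are in PDF points (72 points per inch) measured from the top-left corner; area_from_inches is a hypothetical helper.

```python
POINTS_PER_INCH = 72


def area_from_inches(top, left, bottom, right):
    """Convert a region measured in inches to (top, left, bottom, right) in points."""
    return tuple(round(value * POINTS_PER_INCH, 2)
                 for value in (top, left, bottom, right))


print(area_from_inches(1, 2, 3, 4))  # (72, 144, 216, 288)
```

The resulting tuple can be passed as the area argument in the call shown above.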

If you want the output in JSON format:

tabula.read_pdf("offense.pdf", output_format="json")

Exporting PDF into Excel

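To export the table data of a PDF, tabula.convert_into() writes the tables directly to a CSV, TSV, or JSON file. It has no Excel output format, so for Excel, read the tables into DataFrames and call the pandas method to_excel(), which needs an Excel writer such as openpyxl installed. The first line below writes a CSV file, and the second writes the first table to an Excel file:

tabula.convert_into("document.pdf", "document_testing.csv", output_format="csv")
tabula.read_pdf("document.pdf")[0].to_excel("document_testing.xlsx", index=False)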

Let us sum up what we have learned in the article:

  • Extraction of data from a PDF
  • Rotate pages in a PDF
  • Merge PDFs into one PDF
  • Split a PDF into many PDFs
  • Add watermarks or overlays in a PDF
  • Add password or encryption to a PDF
  • Reading table from PDF
  • Exporting PDF into Excel or CSV

Conclusion

As you have seen, PyPDF2 covers the most common PDF tasks in Python, from reading metadata to rotating, merging, splitting, watermarking, and encrypting documents, while tabula-py handles tables. These features make life easier whether you are working on a large project or just want to make some quick changes to your PDF documents.

Profile

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those insights into business growth. He is an electronics engineer with versatile experience as an individual contributor and as a team lead, and has actively worked towards building Machine Learning capabilities for organizations.