How to use sorted() and sort() in Python

Have you noticed how quickly a pharmacist finds a particular medicine among the hundreds on the shelves? This is possible because the items are arranged systematically, so the pharmacist knows exactly where to look. They may be arranged alphabetically or by category, such as ophthalmic, neurological or gastroenterological medicines. A proper arrangement not only saves time but also makes operations simple and easy, which is why sorting is essential.

At some point, every programmer needs to learn one of the most essential skills: sorting. Python's sorting functions offer many features, from basic sorting to ordering customized to the programmer's needs.

Sorting is a technique for arranging data in increasing or decreasing order according to some linear relationship among the data elements. You can sort numbers, names or records of any kind in whatever order you need. For example, sorting can arrange a list of mail recipients alphabetically.

There are many sorting algorithms, such as Bubble Sort, Insertion Sort, Quick Sort, Merge Sort and Selection Sort, and all of them can be implemented in Python. In this article we will look at how to use the built-in sorted() and sort(). To learn more about other concepts of Python, go through our Python Tutorials.

What is the need for Sorting?

In simple terms, sorting means arranging data systematically. If the data you want to work with is not sorted, finding a desired element becomes slow and difficult.
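As a rough illustration of why sorted data is easier to search, the sketch below uses the standard-library bisect module to do a binary search. The bib numbers and the helper name contains() are made up for this example; the search is only correct because the list is already sorted.

```python
import bisect

# Hypothetical data for illustration: bib numbers kept in ascending order.
sorted_bibs = [1021, 1187, 2040, 2311, 3506]


def contains(sorted_values, target):
    """Binary search; only valid when sorted_values is already sorted."""
    index = bisect.bisect_left(sorted_values, target)
    return index < len(sorted_values) and sorted_values[index] == target


found = contains(sorted_bibs, 2311)
missing = contains(sorted_bibs, 9999)
print(found, missing)
```

On an unsorted list the same lookup would have to inspect the elements one by one.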

The main advantages of learning to sort elements in Python are:

  • Working with sorting introduces you to a large number of language components.
  • Sorting algorithms offer an abstract way to reason about the correctness of your program without worrying about system details or dependencies.
  • Sorting helps you understand program complexity and speed, and how to improve efficiency.

How to order values using sorted()?

sorted() is a built-in function that accepts an iterable and returns a new list containing the iterable's items, in ascending order by default.

Sorting Numbers using sorted()

Let us define a list of integers called num_list and pass it as an argument to sorted():

>>> num_list = [4, 10, 1, 7]
>>> sorted(num_list)
[1, 4, 7, 10]
>>> num_list
[4, 10, 1, 7]

Some of the insights we gain from the code above are:

  • sorted() is a built-in function, so it is available without any import or definition.
  • sorted() orders the values in num_list in ascending order by default, i.e. smallest to largest.
  • The original values of num_list are not changed.
  • When called, sorted() returns a new, ordered list of values.

Since sorted() returns a new list, we can assign the returned list to another variable:

>>> num_list = [4, 10, 1, 7]
>>> sorted_list = sorted(num_list)
>>> sorted_list
[1, 4, 7, 10]
>>> num_list
[4, 10, 1, 7]

A new variable sorted_list is created which holds the output of sorted().

You can also use sorted() to sort tuples and sets, just like lists:

>>> numbers_tuple = (4, 10, 1, 7)
>>> numbers_set = {10, 5, 10, 0, 2}
>>> sorted_tuples = sorted(numbers_tuple)
>>> sorted_sets = sorted(numbers_set)
>>> sorted_tuples
[1, 4, 7, 10]
>>> sorted_sets
[0, 2, 5, 10]

The definition of sorted() states that it will return a new list whatever the input may be. So even if the input variables are tuples and sets, sorted() always returns a list.
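The same holds for other iterables. The short sketch below, with inputs chosen only for illustration, passes a range, a generator expression and a dictionary to sorted(); in each case the result is a new list. Note that iterating over a dictionary yields its keys, so only the keys are sorted.

```python
# sorted() accepts any iterable and always returns a new list.
from_range = sorted(range(5, 0, -1))                # a range object
from_generator = sorted(n * n for n in (3, 1, 2))   # a generator expression
from_dict = sorted({'b': 2, 'a': 1})                # a dict iterates over its keys

print(from_range)        # [1, 2, 3, 4, 5]
print(from_generator)    # [1, 4, 9]
print(from_dict)         # ['a', 'b']
print(type(from_dict))   # <class 'list'>
```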

You can also perform type casting in cases where you need to match the returned object with the input type:

>>> numbers_tuple = (4, 10, 1, 7)
>>> numbers_set = {10, 5, 10, 0, 2}
>>> sorted_tuples = sorted(numbers_tuple)
>>> sorted_sets = sorted(numbers_set)
>>> sorted_tuples
[1, 4, 7, 10]
>>> sorted_sets
[0, 2, 5, 10]
>>> tuple(sorted_tuples)
(1, 4, 7, 10)
>>> set(sorted_sets)  # a set is unordered; the order displayed may differ
{0, 2, 10, 5}

In the code above, converting sorted_tuples to a tuple preserves the sorted order, whereas converting sorted_sets to a set does not guarantee it, because a set is unordered by definition.

Sorting Strings using sorted()

Sorting a string works just like sorting a tuple or a set: sorted() iterates over each character of the input and returns a list of the characters in sorted order.

An example of sorting str type using sorted():

>>> num_string = '98765'
>>> char_string = 'sorting is fun'
>>> sorted_num_string = sorted(num_string)
>>> sorted_char_string = sorted(char_string)
>>> sorted_num_string
['5', '6', '7', '8', '9']
>>> sorted_char_string
[' ', ' ', 'f', 'g', 'i', 'i', 'n', 'n', 'o', 'r', 's', 's', 't', 'u']

The str is treated as an iterable of characters, and sorted() goes through each one, including spaces.

You can use .split() to sort whole words instead of characters, and .join() to combine the sorted words back into a string:

>>> string = 'sorting is fun'
>>> sorted_string = sorted(string.split())
>>> sorted_string
['fun', 'is', 'sorting']
>>> ' '.join(sorted_string)
'fun is sorting'

The original string is split into a list of words with .split(), the list is sorted with sorted(), and the words are joined back together with .join().
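The three steps can be wrapped in a small helper. The function name sort_words() is invented for this sketch. Because .split() without arguments splits on any run of whitespace, extra spaces in the input do not show up in the result.

```python
def sort_words(sentence):
    """Return the words of sentence in sorted order, separated by single spaces."""
    words = sentence.split()          # split on any run of whitespace
    return ' '.join(sorted(words))    # sort the words, then rejoin them


tidy = sort_words('sorting is fun')
spaced = sort_words('  sorting   is fun ')
print(tidy)     # fun is sorting
print(spaced)   # fun is sorting
```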

How to use sorted() with a reverse Argument?

The syntax of the sorted() function is sorted(iterable, /, *, key=None, reverse=False).

The built-in function sorted() takes three parameters:

  • iterable: Required. A sequence such as a string, tuple or list, or a collection such as a set or dictionary.
  • key: Optional. A function applied to each element to produce the value used for comparison, which customizes the sort order. The default is None.
  • reverse: Optional. A Boolean flag that reverses the order of sorting. If True, the result is in descending order. The default is False.
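In the signature, the * marks key and reverse as keyword-only, so they must be passed by name. The sketch below, using a made-up list of numbers, sorts by absolute value in descending order and then shows that passing the key function positionally raises a TypeError.

```python
numbers = [-3, 1, -2]

# key and reverse are passed by name.
by_magnitude_desc = sorted(numbers, key=abs, reverse=True)
print(by_magnitude_desc)   # [-3, -2, 1]

# Passing the key function positionally is rejected.
positional_error = None
try:
    sorted(numbers, abs)
except TypeError as error:
    positional_error = error

print(type(positional_error).__name__)   # TypeError
```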

reverse is an optional keyword argument that changes the sorting order according to the Boolean value assigned to it. The default value is False, which performs sorting in ascending order. However, if the value is given as True, descending sort occurs:

>>> name_list = ['Markian', 'Alex', 'Suzzane', 'Harleen']
>>> sorted(name_list)
['Alex', 'Harleen', 'Markian', 'Suzzane']
>>> sorted(name_list, reverse=True)
['Suzzane', 'Markian', 'Harleen', 'Alex']

In the example above, the names are sorted alphabetically. When sorted() is called with reverse=True, the same names are returned in descending order.

Another example to understand the behavior of the reverse keyword:

>>> case_sensitive_names = ['Markian', 'alex', 'Suzzane', 'harleen']
>>> sorted(case_sensitive_names, reverse=True)
['harleen', 'alex', 'Suzzane', 'Markian']
>>> values_to_sort = [False, 1, 'A' == 'B', 1 <= 0]
>>> sorted(values_to_sort, reverse=True)
[1, False, False, False]
>>> num_list = [7, 10, 0, 4]
>>> sorted(num_list, reverse=False)
[0, 4, 7, 10]

How to use sorted() with a key Argument?

The keyword argument key accepts a function. sorted() applies this function to each value in the list and uses the results to determine the order.

The example below sorts a list of strings by length by passing the built-in len() function, which returns the length of a string, as the key argument:

>>> word = 'pencil'
>>> len(word)
6
>>> word_list = ['cherry', 'donut', 'Michigan', 'transcript']
>>> sorted(word_list, key=len)
['donut', 'cherry', 'Michigan', 'transcript']

The len() function returns the length of each item in the list, and sorted() uses those lengths to return the list in ascending order (shortest to longest).

Let us return to the earlier example, where the case of the first letter affected the order, and sort it using key:

>>> case_sensitive_names = ['Markian', 'alex', 'Suzzane', 'harleen']
>>> sorted(case_sensitive_names)
['Markian', 'Suzzane', 'alex', 'harleen']
>>> sorted(case_sensitive_names, key=str.lower)
['alex', 'harleen', 'Markian', 'Suzzane']

The key function is used only to compute the values that are compared; it does not change the original values in the list. The final output therefore contains the original elements in sorted order.
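A small sketch can make this visible. Here tracked_lower is a hypothetical key function, written only for this illustration, that records each value it receives. It is called once per element, and the sorted result still contains the original, unmodified strings.

```python
calls = []


def tracked_lower(name):
    """Hypothetical key function that records every value it is asked about."""
    calls.append(name)
    return name.lower()


names = ['Markian', 'alex', 'Suzzane', 'harleen']
result = sorted(names, key=tracked_lower)

print(result)      # ['alex', 'harleen', 'Markian', 'Suzzane']
print(len(calls))  # 4
print(names)       # ['Markian', 'alex', 'Suzzane', 'harleen']
```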

Although key is one of the most powerful features of sorted(), it has some limitations.

The first limitation is that key accepts only functions that can be called with a single argument.

An example of a function addition that accepts two arguments:

>>> def addition(a, b):
      return a + b
>>> number_to_add = [1, 3, 5]
>>> sorted(number_to_add, key=addition)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: addition() missing 1 required positional argument: 'b'

The program fails because whenever addition() is called during sorting, it receives only one element from the list at a time. The second argument is always missing.
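One possible way around this limitation is to bind the extra argument in advance, so that sorted() only has to supply one. The sketch below uses functools.partial and an equivalent lambda; the distance function and the sample numbers are assumptions made up for this illustration.

```python
from functools import partial


def distance(value, target):
    """Two-argument function: how far value is from target."""
    return abs(value - target)


numbers = [1, 3, 5, 12]

# Bind the second argument so sorted() only has to supply the first one.
closest_to_four = sorted(numbers, key=partial(distance, target=4))
# The same idea written as a lambda.
closest_to_ten = sorted(numbers, key=lambda value: distance(value, 10))

print(closest_to_four)  # [3, 5, 1, 12]
print(closest_to_ten)   # [12, 5, 3, 1]
```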

The second limitation is that the key function must be able to handle every value in the iterable.

An example to illustrate the second limitation:

>>> cast_values = ['4', '5', '6', 'seven']
>>> sorted(cast_values, key=int)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'seven'

The example above contains a list of strings to be sorted by sorted(). The key tries to convert each string to int. The strings that represent numbers can be converted, but 'seven' cannot, so a ValueError is raised.
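If such values cannot be ruled out, the key function can handle the failure itself. The following is only a sketch of one possible policy (numeric strings first, everything else afterwards in alphabetical order); int_or_text_key is a hypothetical name and the policy is an assumption, not something built into sorted().

```python
def int_or_text_key(value):
    """Hypothetical key: numeric strings sort first (by value), the rest after (by text)."""
    try:
        return (0, int(value), '')
    except ValueError:
        # Non-numeric strings share group 1 and are ordered by their text.
        return (1, 0, value)


cast_values = ['6', 'seven', '4', '5', 'eight']
print(sorted(cast_values, key=int_or_text_key))
# ['4', '5', '6', 'eight', 'seven']
```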

Let us see an example to arrange an iterable by the last letter of each string:

>>> def reverse(word):
      return word[::-1]
>>> words = ['cherry', 'cake', 'Michigan', 'transcript']
>>> sorted(words, key=reverse)
['cake', 'Michigan', 'transcript', 'cherry']

The function reverse() returns the input string reversed, and it is passed as the key argument. The slice syntax word[::-1] reverses the string. sorted() calls reverse() on each element and orders the list by the reversed strings, that is, by the last letter first.

You can also pass a lambda function as the key argument instead of defining a regular function. A lambda is an anonymous function: it has no name, it is called like a normal function, and its body is a single expression that cannot contain statements.

An example to show the previous code using a lambda function:

>>> words = ['cherry', 'cake', 'Michigan', 'transcript']
>>> sorted(words, key = lambda x: x[::-1])
['cake', 'Michigan', 'transcript', 'cherry']

Here, the key is an unnamed lambda whose argument is x. The slice syntax x[::-1] reverses each element, and the reversed strings are then used for sorting.

An example to use key along with reverse argument:

>>> words = ['cherry', 'cake', 'Michigan', 'transcript']
>>> sorted(words, key = lambda x: x[::-1], reverse = True)
['cherry', 'transcript', 'Michigan', 'cake']

In this example, the same key is used, but reverse=True returns the result in descending order.
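One detail worth knowing is that reverse=True is not the same as reversing an ascending result: elements with equal keys keep their original relative order in both directions. The sketch below illustrates this with made-up records; the color helper is a hypothetical key function.

```python
records = [('red', 1), ('blue', 1), ('red', 2), ('blue', 2)]


def color(record):
    """Key: sort by the color name only."""
    return record[0]


# reverse=True sorts in descending order but keeps ties in their original order.
descending = sorted(records, key=color, reverse=True)
# Reversing an ascending sort flips the ties as well.
flipped = list(reversed(sorted(records, key=color)))

print(descending)  # [('red', 1), ('red', 2), ('blue', 1), ('blue', 2)]
print(flipped)     # [('red', 2), ('red', 1), ('blue', 2), ('blue', 1)]
```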

Lambda functions can also be used to sort objects according to their attributes.

An example to sort a group of students based on their grade in descending order:

>>> from collections import namedtuple
>>> Student = namedtuple('Student', 'name grade')
>>> alex = Student('Alex', 95)
>>> bob = Student('Bob', 87)
>>> charlie = Student('Charlie', 91)
>>> students = [alex, bob, charlie]
>>> sorted(students, key=lambda x: getattr(x, 'grade'), reverse=True)
[Student(name='Alex', grade=95), Student(name='Charlie', grade=91), Student(name='Bob', grade=87)]

namedtuple creates a lightweight Student class with name and grade attributes. The lambda returns the grade of each student, and reverse=True puts the output in descending order so that the highest grades come first.

There are many ways to arrange elements using sorted() with the key and reverse arguments. Lambda functions help by keeping the code short and clean.

You can also use operator module functions such as itemgetter() and attrgetter() to make your sorting code simpler and faster. The operator module exports functions that correspond to Python's operators, along with accessor helpers such as these.

An example to illustrate operator module functions using key:

>>> tuples = [
      ('alex', 'B', 13),
      ('bob', 'A', 12),
      ('charles', 'B', 10),
      ]
>>> from operator import itemgetter
>>> sorted(tuples, key=itemgetter(2))
[('charles', 'B', 10), ('bob', 'A', 12), ('alex', 'B', 13)]

tuples holds the name, grade and age of three people. The function itemgetter is imported from the operator module, and the list is sorted by age (index 2), so the output is in ascending order of age.
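attrgetter() works the same way for named attributes, and both helpers accept several names or indexes to sort on more than one field. A minimal sketch, reusing the same three records as a hypothetical Student namedtuple:

```python
from collections import namedtuple
from operator import attrgetter, itemgetter

Student = namedtuple('Student', 'name grade age')
students = [
    Student('alex', 'B', 13),
    Student('bob', 'A', 12),
    Student('charles', 'B', 10),
]

# attrgetter('age') behaves like lambda s: s.age
by_age = sorted(students, key=attrgetter('age'))
# itemgetter(1, 2) returns a (grade, age) tuple, so ties on grade are broken by age.
by_grade_then_age = sorted(students, key=itemgetter(1, 2))

print([s.name for s in by_age])             # ['charles', 'bob', 'alex']
print([s.name for s in by_grade_then_age])  # ['bob', 'charles', 'alex']
```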

How to order values using sort()?

Although .sort() has a similar name to sorted(), the two differ in important ways. Python's built-in help shows the two critical differences between .sort() and sorted():

>>> help(sorted)
Help on built-in function sorted in module builtins:
sorted(iterable, /, *, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.
>>> help(list.sort)
Help on method_descriptor:
sort(self, /, *, key=None, reverse=False)
    Stable sort *IN PLACE*.

Firstly, unlike sorted(), .sort() is not a built-in function. It is a method of the list class and works only with lists. You cannot call it on other iterables such as tuples or sets.

Secondly, .sort() sorts the list in place and returns None.

Let us see the differences of code for .sort() and what impact it has on the code:

>>> sort_numbers = [10, 2, 7, 3]
>>> sort(sort_numbers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sort' is not defined
>>> sort_tuples = (10, 2, 7, 3)
>>> sort_tuples.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'sort'
>>> sorted_values = sort_numbers.sort()
>>> print(sorted_values)
None
>>> print(sort_numbers)
[2, 3, 7, 10]

The code above highlights some operational differences between .sort() and sorted():

  • .sort() returns None, so assigning its result to a new variable stores None rather than a sorted list.
  • The original order of sort_numbers is not maintained and is changed in place.
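The two differences can be seen side by side in a short script. This is only an illustrative sketch; the variable names are arbitrary.

```python
numbers = [10, 2, 7, 3]

# sorted() builds a new list and leaves the input untouched.
ordered_copy = sorted(numbers)
print(ordered_copy)  # [2, 3, 7, 10]
print(numbers)       # [10, 2, 7, 3]

# list.sort() reorders the list itself and returns None.
result = numbers.sort()
print(result)        # None
print(numbers)       # [2, 3, 7, 10]

# sorted() accepts any iterable; .sort() exists only on lists.
print(sorted((10, 2, 7, 3)))           # [2, 3, 7, 10]
print(hasattr((10, 2, 7, 3), 'sort'))  # False
```

A common mistake that follows from this is writing numbers = numbers.sort(), which replaces the list with None.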

.sort() also accepts the optional key and reverse keyword arguments, which work exactly as they do for sorted().

An example of .sort() using lambda to sort a list of phrases by the second letter of the third word:

>>> sort_phrases = ['welcome to python',
      'python is fun',
      'python is easy'
      ]
>>> sort_phrases.sort(key=lambda x: x.split()[2][1], reverse=False)
>>> sort_phrases
['python is easy', 'python is fun', 'welcome to python']

Here, the lambda splits each phrase into a list of words and returns the second letter of the third word, which is used as the sort key.

Disadvantages of Python Sorting

Sorting in Python has a few limitations and surprises, particularly when the values are not all of the same type.

Non-Comparable Data Types

You cannot sort values whose types cannot be compared with each other. Python raises a TypeError when sorted() is used on non-comparable data.

An example to illustrate sorting of values of different data types:

>>> mixed_values = [None, 5]
>>> sorted(mixed_values)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'int' and 'NoneType'

Python raises a TypeError because it cannot order None and int within the same list. sorted() uses the less-than operator (<) to decide which value comes first, and that operator is not defined between None and int.

If you try to compare the same values manually without using sorted(), it will still raise a TypeError because of non-comparable data types:

>>> None < 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'NoneType' and 'int'
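When None values are expected in the data, one possible workaround is a key function that never lets None reach a comparison with a number. The sketch below places None values last; none_last is a hypothetical helper and "None goes last" is an assumed policy, not a Python default.

```python
def none_last(value):
    """Hypothetical key: numbers first in ascending order, None values at the end."""
    # The first tuple item separates None from numbers; the second orders the numbers.
    return (value is None, 0 if value is None else value)


mixed_values = [None, 5, 2, None, 9]
print(sorted(mixed_values, key=none_last))  # [2, 5, 9, None, None]
```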

However, if your list contains a combination of integers and strings that all represent numbers, you can cast them to a comparable type yourself with a list comprehension. Python does not do this automatically:

>>> num_mix = [10, "5", 200, "11"]
>>> sorted(num_mix)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'str' and 'int'
>>> # List comprehension to cast all values to integers
>>> [int(z) for z in num_mix]
[10, 5, 200, 11]
>>> sorted([int(z) for z in num_mix])
[5, 10, 11, 200]

int() converts all the string values in num_mix to integers and then sorted() compares all values and returns a sorted output.
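If the original values should be kept as they are, an alternative sketch is to pass int as the key instead of converting the list first. The conversion is then used only for comparison.

```python
num_mix = [10, "5", 200, "11"]

# key=int is applied only for comparison; the returned items keep their types.
print(sorted(num_mix, key=int))  # ['5', 10, '11', 200]
```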

An example in which values of different types are compared without any explicit conversion:

>>> values = [1, False, 0, 'a' == 'b', 0 >= 1]
>>> sorted(values)
[False, 0, False, False, 1]

In the example above, the expressions 'a' == 'b' and 0 >= 1 both evaluate to False. No conversion takes place during sorting: bool is a subclass of int, so False compares equal to 0 and True compares equal to 1, and the values are ordered as the integers 0 and 1.
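The following sketch inspects what sorted() actually compares in this example. It relies only on built-in behavior: bool is a subclass of int, and the sorted result keeps each element's original type.

```python
values = [1, False, 0, 'a' == 'b', 0 >= 1]

print(values)                 # [1, False, 0, False, False]
print(issubclass(bool, int))  # True
print(False == 0, True == 1)  # True True

result = sorted(values)
print(result)                              # [False, 0, False, False, 1]
print([type(v).__name__ for v in result])  # ['bool', 'int', 'bool', 'bool', 'int']
```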

This particular example highlights an important characteristic of sorting in Python: sort stability. Stability means that elements that compare equal keep their original relative order. sorted() and .sort() are guaranteed to be stable, so the original order is retained when multiple records have the same key.

An example to illustrate sort stability:

>>> values = [False, 0, 0, 3 == 4, 1, False, False]
>>> sorted(values)
[False, 0, 0, False, False, False, 1]

If you compare the original order with the sorted output, you will find that the expression 3 == 4 evaluates to False and that all the values equal to 0 (both False and 0) appear in their original relative order. Knowing that the sort is stable also lets you perform complex sorts in several passes.
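As a sketch of such a multi-pass sort, the records below are ordered by grade in descending order and, within each grade, by age in ascending order. The data is made up for illustration. The secondary key is sorted first, and the second pass leaves ties in the order the first pass produced.

```python
from operator import itemgetter

# (name, grade, age) records; the data is made up for illustration.
students = [
    ('dave', 'B', 10),
    ('jane', 'B', 12),
    ('john', 'A', 15),
    ('maya', 'A', 11),
]

# Pass 1: secondary key (age, ascending).
by_age = sorted(students, key=itemgetter(2))
# Pass 2: primary key (grade, descending). Ties keep the age order from pass 1.
by_grade_desc_then_age = sorted(by_age, key=itemgetter(1), reverse=True)

print(by_grade_desc_then_age)
# [('dave', 'B', 10), ('jane', 'B', 12), ('maya', 'A', 11), ('john', 'A', 15)]
```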

Case-Sensitive Sorting

You can use sorted() to sort a list of strings in ascending order which is alphabetical by default:

>>> name_list = ['Markian', 'Alex', 'Suzzane', 'Harleen']
>>> sorted(name_list)
['Alex', 'Harleen', 'Markian', 'Suzzane']

However, Python compares strings by the Unicode code point of each character, starting with the first letter. If there are two names Al and al, Python treats them as different.

An example that returns the Unicode code point of the first letter of each string:

>>> case_sensitive_names = ['Markian', 'alex', 'Suzzane', 'harleen']
>>> sorted(case_sensitive_names)
['Markian', 'Suzzane', 'alex', 'harleen']
>>> # List comprehension for Unicode Code Point of first letter in each word
>>> [(ord(name[0]), name[0]) for name in sorted(case_sensitive_names)]
[(77, 'M'), (83, 'S'), (97, 'a'), (104, 'h')]

In the example above, name[0] returns the first letter of the string and ord(name[0]) returns its Unicode code point. Notice that even though a comes before M alphabetically, the output has M before a. This is because the code point of M (77) is lower than that of a (97).
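If a case-insensitive order is wanted instead, a key function can normalize the case for comparison only. The sketch below uses str.casefold, a built-in string method intended for caseless matching; str.lower would give the same result for these names.

```python
case_sensitive_names = ['Markian', 'alex', 'Suzzane', 'harleen']

# Default: code point order, so ASCII uppercase letters precede lowercase ones.
print(sorted(case_sensitive_names))
# ['Markian', 'Suzzane', 'alex', 'harleen']

# str.casefold normalizes case for comparison only; the originals are returned.
print(sorted(case_sensitive_names, key=str.casefold))
# ['alex', 'harleen', 'Markian', 'Suzzane']

print(ord('M'), ord('a'))  # 77 97
```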

Consider a situation where the first letter is the same for all the strings that need to be sorted. In such cases, the sorted() function will use the second letter to determine the order and if the second letter is also same, it will consider the third letter and so on, till the end of string:

>>> similar_strings = ['zzzzzn', 'zzzzzc', 'zzzzza','zzzzze']
>>> sorted(similar_strings)
['zzzzza', 'zzzzzc', 'zzzzze', 'zzzzzn']

Here, sorted() compares the strings on the sixth character, since the first five characters are all z. The order of the output therefore depends only on the last character of each string.

An example of sorting strings that consist of the same character but have different lengths:

>>> different_lengths = ['zzzzz', 'zz', 'zzzz','z']
>>> sorted(different_lengths)
['z', 'zz', 'zzzz', 'zzzzz']

In this case, the strings are ordered from shortest to longest: when one string is a prefix of another, the shorter one comes first. The shortest string z is placed first and the longest string zzzzz last.

When should you use .sort() and sorted()?

Consider a case where you need to collect and sort data from a 5k race, the Python 5k Annual. For each runner you record the bib number and the time it took to finish the race:

>>> from collections import namedtuple
>>> Runner_data = namedtuple('Runner', 'bibnumber duration')

Each of the Runner_data will be added to a list called runners:

>>> runners = []
>>> runners.append(Runner_data('2548597', 1200))
>>> runners.append(Runner_data('8577724', 1720))
>>> runners.append(Runner_data('2666234', 1600))
>>> runners.append(Runner_data('2425114', 1450))
>>> runners.append(Runner_data('2235232', 1620))
    ...
    ...
>>> runners.append(Runner_data('2586674', 1886))

The bib number and the total time taken by each runner are added to runners as the runner crosses the finish line.

The five runners with the shortest durations are the winners, and the rest are ranked by finishing time:

>>> runners.sort(key=lambda x: getattr(x, 'duration'))
>>> fastest_five_runners = runners[:5]

In this example, a single sort was all that was needed, and a list was a reasonable choice of container. You sorted the participants in place and took the five fastest runners; there was no need to keep the original list elsewhere. The lambda function returns the duration of each runner, which is used as the sort key, and the result is stored in fastest_five_runners.

However, the race director then informs you that every 20th runner to cross the finish line will be awarded a free sports bag. Because .sort() changed the list in place, the original finish order is lost and it is impossible to find every 20th runner.

In such cases, where you find a slight possibility that the original data is to be recovered, use sorted() instead of sort().

Let us implement the same code above using sorted():

>>> runners_by_time = sorted(runners, key=lambda x: getattr(x, 'duration'))
>>> fastest_five_runners = runners_by_time[:5]

In this situation, sorted() returns a new list, and the original list of runners and their data is not overwritten. You can find every 20th person to cross the finish line by working with the original values:

>>> every_twentieth_runner = runners[::20]

A list slice on runners creates every_twentieth_runner from the actual order in which runners crossed the finish line. Note that runners[::20] starts with the first finisher; runners[19::20] would start with the 20th.

So, sorted() should be used in cases where the original data is to be retained and sort() should be used where the original data is a copy or unimportant and losing it won’t stand as an issue.
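The whole scenario can be sketched end to end as follows. The runner data is generated purely for illustration (45 hypothetical finishers with made-up durations), and the Runner name is an assumption of this sketch.

```python
from collections import namedtuple
from operator import attrgetter

Runner = namedtuple('Runner', 'bibnumber duration')

# Hypothetical finish-line data, stored in the order runners crossed the line.
runners = [Runner(str(1000 + i), 1200 + (i * 37) % 600) for i in range(45)]
finish_order = list(runners)  # snapshot used only to check nothing changed

runners_by_time = sorted(runners, key=attrgetter('duration'))
fastest_five_runners = runners_by_time[:5]

# The original list is untouched, so slicing by finish order still works.
every_twentieth_runner = runners[::20]    # positions 1, 21, 41
starting_at_twentieth = runners[19::20]   # positions 20, 40

print(runners == finish_order)                         # True
print([r.bibnumber for r in every_twentieth_runner])   # ['1000', '1020', '1040']
print([r.bibnumber for r in starting_at_twentieth])    # ['1019', '1039']
```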

Some Earlier Ways of Python Sorting

In Python 2, two approaches to custom sorting were common: Decorated-Sort-Undecorated and the cmp parameter.

Decorated-Sort-Undecorated

The Decorated-Sort-Undecorated (DSU) idiom is based on three steps:

  • First of all, the original list is decorated with new values that control the sort order.
  • Secondly, sorting is performed on the decorated list.
  • Finally, a list is created that contains the original elements in the new order and the decorations are removed.

Let us see an example of the DSU approach using a class:

>>> class Student:
      def __init__(self, name, grade, age):
          self.name = name
          self.grade = grade
          self.age = age
      def __repr__(self):
          return repr((self.name, self.grade, self.age))
>>> student_objects = [
      Student('alex', 'B', 13),
      Student('bob', 'A', 12),
      Student('charles', 'B', 10),
    ]
>>> # Regular sorting using sorted()
>>> sorted(student_objects, key=lambda student: student.age)
[('charles', 'B', 10), ('bob', 'A', 12), ('alex', 'B', 13)]
>>> # DSU approach: decorate with (grade, index), sort, then undecorate
>>> decorated_values = [(student.grade, i, student) for i, student in enumerate(student_objects)]
>>> decorated_values.sort()
>>> [student for grade, i, student in decorated_values]
[('bob', 'A', 12), ('alex', 'B', 13), ('charles', 'B', 10)]

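In the code above, a class Student is created with the attributes name, grade and age. The original values are first decorated with the grade and their index, and the decorated list is then sorted. The index breaks ties between equal grades, so the Student objects themselves are never compared. Finally, the decorations are stripped from decorated_values, leaving a new list with the original objects in the new order.

The Decorated-Sort-Undecorated technique is also known as the Schwartzian transform. It can make sorting more efficient because the sort value is computed only once per element, although the key parameter now covers most of the cases it was used for.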

Using cmp Parameter

In Python 2, cmp is both a built-in function and a parameter of sorted() and .sort(). The parameter takes a function that compares two arguments and returns one of three kinds of values: a negative value if the first argument is less than the second, zero if they are equal, or a positive value if the first is greater.

An example to illustrate cmp using sorted():

>>> def num_compare(a, b):
      return a - b
>>> sorted([9, 2, 5, 0, 7], cmp=num_compare)
[0, 2, 5, 7, 9]

Here, a function num_compare is created and then the list is sorted by comparing each value in the list. Finally, the output is displayed in ascending order.

Note that the cmp parameter works only in Python 2. It was removed in Python 3 to simplify the language and to avoid conflicts between cmp and other comparison techniques.
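In Python 3, an old-style comparison function can still be reused by wrapping it with functools.cmp_to_key, which turns it into a key function. A minimal sketch using the same num_compare function:

```python
from functools import cmp_to_key


def num_compare(a, b):
    """Old-style comparison: negative if a < b, zero if equal, positive if a > b."""
    return a - b


# cmp_to_key wraps the two-argument comparison so it can be passed as key=.
print(sorted([9, 2, 5, 0, 7], key=cmp_to_key(num_compare)))
# [0, 2, 5, 7, 9]
```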

Summary

Let us sum up what we have learned in this article:

  • Sorting and its needs.
  • How to use sorted() to sort values with and without key and reverse.
  • How to use .sort() to order values with and without key and reverse.
  • Limitations and Gotchas with Python Sorting.
  • Appropriate use of .sort() and sorted().

Both .sort() and sorted() can be used to sort elements in a similar manner if used properly with key and reverse arguments.

However, the two differ in what they return and in whether they modify the list in place. Before using .sort(), make sure you understand what the program needs, because .sort() irrevocably overwrites the original order of the data.

To become a good Python developer, understanding complex sorting algorithms is a useful skill in the long run. For more information about sorting in Python, see the official Python sorting documentation and read about Timsort, the algorithm that sorted() and .sort() are built on. You may also join our Python certification course to gain further skills and knowledge in Python.

Priyankur Sarkar

Data Science Enthusiast

Priyankur Sarkar loves to play with data and get insightful results out of it, then turn those data insights and results in business growth. He is an electronics engineer with a versatile experience as an individual contributor and leading teams, and has actively worked towards building Machine Learning capabilities for organizations.

Join the Discussion

Your email address will not be published. Required fields are marked *

Suggested Blogs

How to Round Numbers in Python

While you are dealing with data, sometimes you may come across a biased dataset. In statistics, bias is whereby the expected value of the results differs from the true underlying quantitative parameter being estimated. Working with such data can be dangerous and can lead you to incorrect conclusions. To learn more about various other concepts of Python, go through our Python Tutorials or enroll to our Python Certification course online.There are many types of biases such as selection bias, reporting bias, sampling bias and so on. Similarly, rounding bias is related to numeric data. In this article we will see:Why is it important to know the ways to round numbersHow to use various strategies to round numbersHow data is affected by rounding itHow to use NumPy arrays and Pandas DataFrames to round numbersLet us first learn about Python’s built-in rounding process.About Python’s Built-in round() FunctionPython Programming offers a built-in round() function which rounds off a number to the given number of digits and makes rounding of numbers easier. The function round() accepts two numeric arguments, n and n digits and then returns the number n after rounding it to ndigits. If the number of digits are not provided for round off, the function rounds off the number n to the nearest integer.Suppose, you want to round off a number, say 4.5. It will be rounded to the nearest whole number which is 5. However, the number 4.74 will be rounded to one decimal place to give 4.7.It is important to quickly and readily round numbers while you are working with floats which have many decimal places. The inbuilt Python function round() makes it simple and easy.Syntaxround(number, number of digits)The parameters in the round() function are:number - number to be roundednumber of digits (Optional) - number of digits up to which the given number is to be rounded.The second parameter is optional. 
In case, if it is missing then round() function returns:For an integer, 12, it rounds off to 12For a decimal number, if the last digit after the decimal point is >=5 it will round off to the next whole number, and if =5 print(round(5.476, 2))     # when the (ndigit+1)th digit is  1 print(round("x", 2)) TypeError: type str doesn't define __round__ methodAnother example,print(round(1.5)) print(round(2)) print(round(2.5))The output will be:2 2 2The function round() rounds 1.5 up to 2, and 2.5 down to 2. This is not a bug, the round() function behaves this way. In this article you will learn a few other ways to round a number. Let us look at the variety of methods to round a number.Diverse Methods for RoundingThere are many ways to round a number with its own advantages and disadvantages. Here we will learn some of the techniques to rounding a number.TruncationTruncation, as the name means to shorten things. It is one of the simplest methods to round a number which involves truncating a number to a given number of digits. In this method, each digit after a given position is replaced with 0. Let us look into some examples.ValueTruncated ToResult19.345Tens place1019.345Ones place1919.345Tenths place19.319.345Hundredths place19.34The truncate() function can be used for positive as well as negative numbers:>>> truncate(19.5) 19.0 >>> truncate(-2.852, 1) -2.8 >>> truncate(2.825, 2) 2.82The truncate() function can also be used to truncate digits towards the left of the decimal point by passing a negative number.>>> truncate(235.7, -1) 230.0 >>> truncate(-1936.37, -3) -1000.0When a positive number is truncated, we are basically rounding it down. Similarly, when we truncate a negative number, the number is rounded up. Let us look at the various rounding methods.Rounding UpThere is another strategy called “rounding up” where a number is rounded up to a specified number of digits. 
For example:ValueRound Up ToResult12.345Tens place2018.345Ones place1918.345Tenths place18.418.345Hundredths place18.35The term ceiling is used in mathematics to explain the nearest integer which is greater than or equal to a particular given number. In Python, for “rounding up” we use two functions namely,ceil() function, andmath() functionA non-integer number lies between two consecutive integers. For example, considering a number 5.2, this will lie between 4 and 5. Here, ceiling is the higher endpoint of the interval, whereas floor is the lower one. Therefore, ceiling of 5.2 is 5, and floor of 5.2 is 4. However, the ceiling of 5 is 5.In Python, the function to implement the ceiling function is the math.ceil() function. It always returns the closest integer which is greater than or equal to its input.>>> import math >>> math.ceil(5.2) 6 >>> math.ceil(5) 5 >>> math.ceil(-0.5) 0If you notice you will see that the ceiling of -0.5 is 0, and not -1.Let us look into a short code to implement the “rounding up” strategy using round_up() function:def round_up(n, decimals=0):     multiplier = 10 ** decimals     return math.ceil(n * multiplier) / multiplierLet’s look at how round_up() function works with various inputs:>>> round_up(3.1) 4.0 >>> round_up(3.23, 1) 3.3 >>> round_up(3.543, 2) 3.55You can pass negative values  to decimals, just like we did in truncation.>>> round_up(32.45, -1) 40.0 >>> round_up(3352, -2) 3400You can follow the diagram below to understand round up and round down. Round up to the right and down to the left.Rounding up always rounds a number to the right on the number line, and rounding down always rounds a number to the left on the number line.Rounding DownSimilar to rounding up we have another strategy called rounding down whereValueRounded Down ToResult19.345Tens place1019.345Ones place1919.345Tenths place19.319.345Hundredths place19.34In Python, rounding down can be implemented using a similar algorithm as we truncate or round up. 
Firstly you will have to shift the decimal point and then round an integer. Lastly shift the decimal point back.math.ceil() is used to round up to the ceiling of the number once the decimal point is shifted. For “rounding down” we first need to round the floor of the number once the decimal point is shifted.>>> math.floor(1.2) 1 >>> math.floor(-0.5) -1Here’s the definition of round_down():def round_down(n, decimals=0):     multiplier = 10 ** decimals return math.floor(n * multiplier) / multiplierThis is quite similar to round_up() function. Here we are using math.floor() instead of math.ceil().>>> round_down(1.5) 1 >>> round_down(1.48, 1) 1.4 >>> round_down(-0.5) -1Rounding a number up or down has extreme effects in a large dataset. After rounding up or down, you can actually remove a lot of precision as well as alter computations.Rounding Half UpThe “rounding half up” strategy rounds every number to the nearest number with the specified precision, and breaks ties by rounding up. Here are some examples:ValueRound Half Up ToResult19.825Tens place1019.825Ones place2019.825Tenths place19.819.825Hundredths place19.83In Python, rounding half up strategy can be implemented by shifting the decimal point to the right by the desired number of places. In this case you will have to determine whether the digit after the shifted decimal point is less than or greater than equal to 5.You can add 0.5 to the value which is shifted and then round it down with the math.floor() function.def round_half_up(n, decimals=0):     multiplier = 10 ** decimals return math.floor(n*multiplier + 0.5) / multiplierIf you notice you might see that round_half_up() looks similar to round_down. 
The only difference is to add 0.5 after shifting the decimal point so that the result of rounding down matches with the expected value.>>> round_half_up(19.23, 1) 19.2 >>> round_half_up(19.28, 1) 19.3 >>> round_half_up(19.25, 1) 19.3Rounding Half DownIn this method of rounding, it rounds to the nearest number similarly like “rounding half up” method, the difference is that it breaks ties by rounding to the lesser of the two numbers. Here are some examples:ValueRound Half Down ToResult16.825Tens place1716.825Ones place1716.825Tenths place16.816.825Hundredths place16.82In Python, “rounding half down” strategy can be implemented by replacing math.floor() in the round_half_up() function with math.ceil() and then by subtracting 0.5 instead of adding:def round_half_down(n, decimals=0):     multiplier = 10 ** decimals return math.ceil(n*multiplier - 0.5) / multiplierLet us look into some test cases.>>> round_half_down(1.5) 1.0 >>> round_half_down(-1.5) -2.0 >>> round_half_down(2.25, 1) 2.2In general there are no bias for both round_half_up() and round_half_down(). However, rounding of data with more number of ties results in bias. 
Let us consider an example to understand better.>>> data = [-2.15, 1.45, 4.35, -12.75]Let us compute the mean of these numbers:>>> statistics.mean(data) -2.275Now let us compute the mean on the data after rounding to one decimal place with round_half_up() and round_half_down():>>> rhu_data = [round_half_up(n, 1) for n in data] >>> statistics.mean(rhu_data) -2.2249999999999996 >>> rhd_data = [round_half_down(n, 1) for n in data] >>> statistics.mean(rhd_data) -2.325The round_half_up() function results in a round towards positive infinity bias, and round_half_down() results in a round towards negative infinity bias.Rounding Half Away From ZeroIf you have noticed carefully while going through round_half_up() and round_half_down(), neither of the two is symmetric around zero:>>> round_half_up(1.5) 2.0 >>> round_half_up(-1.5) -1.0 >>> round_half_down(1.5) 1.0 >>> round_half_down(-1.5) -2.0In order to introduce symmetry, you can always round a tie away from zero. The table mentioned below illustrates it clearly:ValueRound Half Away From Zero ToResult16.25Tens place2016.25Ones place1616.25Tenths place16.3-16.25Tens place-20-16.25Ones place-16-16.25Tenths place-16.3The implementation of “rounding half away from zero” strategy on a number n is very simple. All you need to do is start as usual by shifting the decimal point to the right a given number of places and then notice the digit d immediately to the right of the decimal place in this new number. Here, there are four cases to consider:If n is positive and d >= 5, round upIf n is positive and d < 5, round downIf n is negative and d >= 5, round downIf n is negative and d < 5, round upAfter rounding as per the rules mentioned above, you can shift the decimal place back to the left.There is a question which might come to your mind - How do you handle situations where the number of positive and negative ties are drastically different? 
The answer to this question brings us full circle to the function that deceived us at the beginning of this article: Python’s built-in  round() function.Rounding Half To EvenThere is a way to mitigate rounding bias while you are rounding values in a dataset. You can simply round ties to the nearest even number at the desired precision. Let us look at some examples:ValueRound Half To Even ToResult16.255Tens place2016.255Ones place1616.255Tenths place16.216.255Hundredths place16.26To prove that round() really does round to even, let us try on a few different values:>>> round(4.5) 4 >>> round(3.5) 4 >>> round(1.75, 1) 1.8 >>> round(1.65, 1) 1.6The Decimal ClassThe  decimal module in Python is one of those features of the language which you might not be aware of if you have just started learning Python. Decimal “is based on a floating-point model which was designed with people in mind, and necessarily has a paramount guiding principle – computers must provide an arithmetic that works in the same way as the arithmetic that people learn at school.” – except from the decimal arithmetic specification. 
Some of the benefits of the decimal module are:

- Exact decimal representation: 0.1 is actually 0.1, and 0.1 + 0.1 + 0.1 - 0.3 returns 0, as expected.
- Preservation of significant digits: When you add 1.50 and 2.30, the result is 3.80, with the trailing zero maintained to indicate significance.
- User-alterable precision: The default precision of the decimal module is twenty-eight digits, but this value can be altered by the user to match the problem at hand.

Let us see how rounding works in the decimal module.

    >>> import decimal
    >>> decimal.getcontext()
    Context(
        prec=28,
        rounding=ROUND_HALF_EVEN,
        Emin=-999999,
        Emax=999999,
        capitals=1,
        clamp=0,
        flags=[],
        traps=[
            InvalidOperation,
            DivisionByZero,
            Overflow
        ]
    )

The function decimal.getcontext() returns a context object that represents the current context of the decimal module. The context includes the default precision and the default rounding strategy.

In the example above, the default rounding strategy for the decimal module is ROUND_HALF_EVEN. This matches the tie-breaking behavior of the built-in round() function.

To declare a number with the decimal module's Decimal class, create a new Decimal instance by passing a string containing the desired value.

    >>> from decimal import Decimal
    >>> Decimal("0.1")
    Decimal('0.1')

You may create a Decimal instance from a floating-point number, but in that case a floating-point representation error will be introduced.
For example, this is what happens when you create a Decimal instance from the floating-point number 0.1:

    >>> Decimal(0.1)
    Decimal('0.1000000000000000055511151231257827021181583404541015625')

To maintain exact precision, create Decimal instances from strings containing the decimal numbers you need.

Rounding a Decimal is done with the .quantize() method:

    >>> Decimal("1.85").quantize(Decimal("1.0"))
    Decimal('1.8')

The Decimal("1.0") argument in .quantize() determines the number of decimal places to round to. As 1.0 has one decimal place, the number 1.85 is rounded to a single decimal place. Rounding half to even is the default strategy, hence the result is 1.8.

Another example with the Decimal class, this time rounding to two decimal places:

    >>> Decimal("2.775").quantize(Decimal("1.00"))
    Decimal('2.78')

The decimal module provides another benefit. After arithmetic is performed, rounding is taken care of automatically and the significant digits are preserved.

    >>> decimal.getcontext().prec = 2
    >>> Decimal("2.23") + Decimal("1.12")
    Decimal('3.4')

To change the default rounding strategy, set the decimal.getcontext().rounding property to one of several flags. The following table summarizes these flags and the rounding strategy each one implements:

    Flag                    | Rounding Strategy
    decimal.ROUND_CEILING   | Rounding up
    decimal.ROUND_FLOOR     | Rounding down
    decimal.ROUND_DOWN      | Truncation
    decimal.ROUND_UP        | Rounding away from zero
    decimal.ROUND_HALF_UP   | Rounding half away from zero
    decimal.ROUND_HALF_DOWN | Rounding half towards zero
    decimal.ROUND_HALF_EVEN | Rounding half to even
    decimal.ROUND_05UP      | Rounding away from zero if the last digit after rounding towards zero would be 0 or 5, otherwise towards zero

Rounding NumPy Arrays

In data science and scientific computing, data is most often stored as a NumPy array.
One of the most powerful features of NumPy is the use of vectorization and broadcasting to apply operations to an entire array at once instead of one element at a time.

Let's generate some data by creating a 3×4 NumPy array of pseudo-random numbers:

    >>> import numpy as np
    >>> np.random.seed(444)
    >>> data = np.random.randn(3, 4)
    >>> data
    array([[ 0.35743992,  0.3775384 ,  1.38233789,  1.17554883],
           [-0.9392757 , -1.14315015, -0.54243951, -0.54870808],
           [ 0.20851975,  0.21268956,  1.26802054, -0.80730293]])

Here, we first seed the np.random module so that the output can be reproduced easily. Then a 3×4 NumPy array of floating-point numbers is created with np.random.randn().

Make sure NumPy is installed (for example with pip3 install numpy) before executing the code above. If you are using Anaconda you are good to go.

To round all of the values in the data array, pass data as the argument to the np.around() function. The desired number of decimal places is set with the decimals keyword argument. The round half to even strategy is used here, as in Python's built-in round() function.

To round the data in your array to integers, NumPy offers several options:

- numpy.ceil()
- numpy.floor()
- numpy.trunc()
- numpy.rint()

The np.ceil() function rounds every value in the array to the nearest integer greater than or equal to the original value:

    >>> np.ceil(data)
    array([[ 1.,  1.,  2.,  2.],
           [-0., -1., -0., -0.],
           [ 1.,  1.,  2., -0.]])

Look at the output carefully: we have a new number, negative zero! Let us now take a look at the Pandas library, which is widely used in data science with Python.

Rounding Pandas Series and DataFrame

Pandas has been a game-changer for data analytics and data science. The two main data structures in Pandas are DataFrame and Series. A DataFrame works like an Excel spreadsheet, whereas you can think of a Series as a single column in a spreadsheet. Both can be rounded efficiently with the Series.round() and DataFrame.round() methods. Let us look at an example.

Make sure Pandas is installed (for example with pip3 install pandas) before executing the code below.
If you are using Anaconda you are good to go.

    >>> import pandas as pd
    >>> # Re-seed np.random if you closed your REPL since the last example
    >>> np.random.seed(444)
    >>> series = pd.Series(np.random.randn(4))
    >>> series
    0    0.357440
    1    0.377538
    2    1.382338
    3    1.175549
    dtype: float64
    >>> series.round(2)
    0    0.36
    1    0.38
    2    1.38
    3    1.18
    dtype: float64
    >>> df = pd.DataFrame(np.random.randn(3, 3), columns=["A", "B", "C"])
    >>> df
              A         B         C
    0 -0.939276 -1.143150 -0.542440
    1 -0.548708  0.208520  0.212690
    2  1.268021 -0.807303 -3.303072
    >>> df.round(3)
           A      B      C
    0 -0.939 -1.143 -0.542
    1 -0.549  0.209  0.213
    2  1.268 -0.807 -3.303

The DataFrame.round() method can also accept a dictionary or a Series to specify a different precision for each column. For instance, the following examples show how to round the first column of df to one decimal place, the second to two, and the third to three decimal places:

    >>> # Specify column-by-column precision with a dictionary
    >>> df.round({"A": 1, "B": 2, "C": 3})
         A     B      C
    0 -0.9 -1.14 -0.542
    1 -0.5  0.21  0.213
    2  1.3 -0.81 -3.303
    >>> # Specify column-by-column precision with a Series
    >>> decimals = pd.Series([1, 2, 3], index=["A", "B", "C"])
    >>> df.round(decimals)
         A     B      C
    0 -0.9 -1.14 -0.542
    1 -0.5  0.21  0.213
    2  1.3 -0.81 -3.303

If you need more rounding flexibility, you can apply NumPy's floor(), ceil(), and rint() functions to Pandas Series and DataFrame objects:

    >>> np.floor(df)
         A    B    C
    0 -1.0 -2.0 -1.0
    1 -1.0  0.0  0.0
    2  1.0 -1.0 -4.0
    >>> np.ceil(df)
         A    B    C
    0 -0.0 -1.0 -0.0
    1 -0.0  1.0  1.0
    2  2.0 -0.0 -3.0
    >>> np.rint(df)
         A    B    C
    0 -1.0 -1.0 -1.0
    1 -1.0  0.0  0.0
    2  1.0 -1.0 -3.0

The modified round_half_up() function from the previous section will also work here:

    >>> round_half_up(df, decimals=2)
          A     B     C
    0 -0.94 -1.14 -0.54
    1 -0.55  0.21  0.21
    2  1.27 -0.81 -3.30

Best Practices and Applications

Now that you have come
across most of the rounding techniques, let us learn some best practices to make sure we round numbers in the correct way.

Generate More Data and Round Later

When you are dealing with a large set of data, storage can become a problem. For example, in an industrial oven you might measure the temperature every ten seconds, accurate to eight decimal places, using a temperature sensor. These readings help to detect large fluctuations that may lead to the failure of a heating element or other components. We can write a Python script to compare the readings and check for large fluctuations.

Because readings are recorded every day, their number grows quickly. You may consider keeping only three decimal places of precision, but removing too much precision may change the result of a calculation. If you have enough space, store the entire data at full precision. With less storage, it is better to keep at least the two or three decimal places of precision that the calculation requires.

Finally, when you compute the daily average of the temperature, calculate it to the maximum precision available and round only the final result.

Currency Exchange and Regulations

Whenever we purchase an item, the tax paid on top of its price depends largely on geographical factors. An item that costs you $2 may cost you less (say $1.80) if you buy it in a different state. This is due to regulations set by the local government.

In another case, when the minimum unit of currency at the accounting level in a country is smaller than the lowest unit of physical currency, Swedish rounding is done.
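As an illustration of Swedish rounding, the sketch below rounds a cash total to the nearest multiple of the smallest available coin using the decimal module. The function name swedish_round, the default coin of 0.05, and the choice of ROUND_HALF_UP for ties are assumptions made for this example; actual rules differ from country to country.

```python
from decimal import Decimal, ROUND_HALF_UP


def swedish_round(amount, smallest_coin="0.05"):
    """Round a cash amount to the nearest multiple of the smallest coin.

    Both arguments are strings so that the Decimal values are exact.
    The tie rule (ROUND_HALF_UP) is an assumption for this sketch.
    """
    coin = Decimal(smallest_coin)
    # Work out how many coins the amount is worth, as a whole number.
    coins = (Decimal(amount) / coin).quantize(Decimal("1"), rounding=ROUND_HALF_UP)
    # Convert the coin count back into a currency amount.
    return coins * coin


for total in ["1.72", "1.73", "1.78"]:
    print(total, "->", swedish_round(total))
```

Working in whole coins first keeps the rounding step identical to ordinary integer rounding, so any of the decimal rounding flags could be swapped in to match a local rule.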
You can find a list of such rounding methods used by various countries on the internet.

If you want to design software for calculating currencies, remember to check the local laws and regulations that apply where the software will be used.

Reduce error

When you round numbers in large datasets that are used in complex computations, your primary concern should be to limit the growth of the error due to rounding.

Summary

In this article we have seen several methods to round numbers. Of these, the "rounding half to even" strategy minimizes rounding bias the best. Fortunately, Python, NumPy, and Pandas already provide built-in rounding functions that use this strategy. Here, we have learned that:

- There are several rounding strategies, and they can be implemented in pure Python.
- Every rounding strategy inherently introduces a rounding bias, and the "rounding half to even" strategy mitigates this bias well, most of the time.
- You can round NumPy arrays and Pandas Series and DataFrame objects.

If you enjoyed reading this article and found it interesting, leave a comment. To learn more about rounding numbers and other features of Python, join our Python certification course.
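As a recap of how the tie-breaking strategies differ, here is a minimal sketch that rounds the same three ties with three of the decimal module flags. The helper name round_ties and the sample values are chosen only for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_HALF_EVEN

# Sample ties, given as strings so the Decimal values are exact.
TIES = ["2.5", "-2.5", "3.5"]
MODES = [ROUND_HALF_UP, ROUND_HALF_DOWN, ROUND_HALF_EVEN]


def round_ties(mode):
    """Round every tie in TIES to an integer using the given decimal flag."""
    return [int(Decimal(t).quantize(Decimal("1"), rounding=mode)) for t in TIES]


for mode in MODES:
    # The flags are plain strings, so they can be printed directly.
    print(f"{mode:16} {round_ties(mode)}")
```

Only ROUND_HALF_EVEN sends 2.5 and 3.5 in different directions, which is why ties tend to cancel out over a large dataset.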

What are Python KeyError Exceptions and How to Handle Them

There are times when you have written your code but it does not run when you execute it. Such situations occur when the input is inappropriate, when you try to open a file with a wrong path, or when you try to divide a number by zero. Because of an error or an incorrect command, the output is not displayed. This is the result of errors and exceptions, which are a part of the Python programming language. Learn about such concepts and gain further knowledge by joining the Python Programming Course.

What is Exception Handling?

Python raises exceptions when it encounters errors during execution. A Python exception is a construct that signals an important event, such as a run-time error.

Exception handling is the process of responding to exceptions during computation, which often interrupt the usual flow of a program. It can be performed at the software level as part of the program and also at the hardware level using built-in CPU mechanisms.

Why is Exception Handling Important?

Although exceptions might be irritating when they occur, they play an essential role in high-level languages by acting as a friend to the user.

An unhandled error at the time of execution might lead to one of two things: either your program will die or the system will display a blue screen of death. Exceptions, on the other hand, act as communication tools. They allow the program to answer the questions of what went wrong, why, and how, and then to terminate gracefully.

In simple words, exception handling protects against uncontrollable program failures and increases the robustness and efficiency of your code.
If you want to master programming, knowledge of exceptions and how to handle them is crucial, especially in Python.

What are the Errors and Exceptions in Python?

Python doesn't like errors and exceptions and displays its dissatisfaction by terminating the program abruptly.

There are basically two types of errors in the Python language:

- Syntax errors.
- Errors occurring at run-time, or exceptions.

Syntax Errors

Syntax errors, also known as parsing errors, occur when the parser identifies an incorrect statement. In simple words, a syntax error occurs when the proper structure or syntax of the programming language is not followed.

An example of a syntax error:

    >>> print( 1 / 0 ))
      File "", line 1
        print( 1 / 0 ))
                      ^
    SyntaxError: invalid syntax

Exceptions

Exceptions occur during run-time. Python raises an exception when your code has correct syntax but encounters a run-time issue that it is not able to handle.

There are a number of built-in exceptions in Python that are used in specific situations.
Some of the built-in exceptions are:

    Exception          | Cause of Error
    ArithmeticError    | Raised when numerical computation fails.
    FloatingPointError | Raised when floating point calculation fails.
    AssertionError     | Raised in case of failure of the assert statement.
    ZeroDivisionError  | Raised when division or modulo by zero takes place for all numerical values.
    OverflowError      | Raised when the result of an arithmetic operation is too large to be represented.
    IndexError         | Raised when an index is not found in a sequence.
    ImportError        | Raised when the imported module is not found.
    IndentationError   | Raised when indentation is not specified properly.
    KeyboardInterrupt  | Raised when the user hits the interrupt key.
    RuntimeError       | Raised when a generated error does not fall into any category.
    SyntaxError        | Raised when there is an error in Python syntax.
    IOError            | Raised when Python cannot access a file correctly on disk.
    KeyError           | Raised when a key is not found in a dictionary.
    ValueError         | Raised when an argument to a function is the right type but not in the right domain.
    NameError          | Raised when an identifier is not found in the local or global namespace.
    TypeError          | Raised when an argument to a function is not of the right type.

There is another type of built-in exception called a warning. Warnings are usually issued in situations where the user should be alerted to some condition. By default, that condition neither raises an exception nor terminates the program.

What is a Python KeyError?

Before getting into KeyError, you must know the meaning of dictionary and mapping in Python. A dictionary (dict) is a collection of objects that are looked up by key rather than by position. Dictionaries are Python's implementation of the data structure known as an associative array.
They comprise key-value pairs, in which each pair maps the key to its associated value.

A dictionary is a data structure that maps one set of values onto another, and it is the most common mapping in Python.

Exception hierarchy of KeyError:

    ->BaseException
        ->Exception
            ->LookupError
                ->KeyError

A Python KeyError is raised when you try to access an invalid key in a dictionary. In simple terms, when you see a KeyError, it means that the key you were looking for could not be found.

An example of KeyError:

    >>> prices = { 'Pen' : 10, 'Pencil' : 5, 'Notebook' : 25}
    >>> prices['Eraser']
    Traceback (most recent call last):
      File "", line 1, in
        prices['Eraser']
    KeyError: 'Eraser'

Here, the dictionary prices is declared with the prices of three items. The KeyError is raised when the item 'Eraser', which is not present in prices, is accessed.

Whenever an exception is not handled, Python prints a traceback, as you can see in the example above. It tells you which exception was raised and what caused it.

Let's execute the same Python code from a file.
This time, you will be asked to give the name of the item whose price you want to know:

    # prices.py
    prices = { 'Pen' : 10, 'Pencil' : 5, 'Notebook' : 25}
    item = input('Get price of: ')
    print(f'The price of {item} is {prices[item]}')

You will get a traceback again, but you'll also get information about the line from which the KeyError is raised:

    Get price of: Eraser
    Traceback (most recent call last):
      File "prices.py", line 5, in
        print(f'The price of {item} is {prices[item]}')
    KeyError: 'Eraser'

The traceback in the example above provides the following information:

- A KeyError was raised.
- The key 'Eraser' was not found.
- The line number that raised the exception, along with that line.

Where else will you find a Python KeyError?

Although most of the time a KeyError is raised because of an invalid key in a Python dictionary or a dictionary subclass, you may also find it in other places in the Python Standard Library, such as in zipfile. It still carries the same meaning: the requested key was not found.

An example:

    >>> from zipfile import ZipFile
    >>> my_zip_file = ZipFile('Avengers.zip')
    >>> my_zip_file.getinfo('Batman')
    Traceback (most recent call last):
      File "", line 1, in
      File "myzip.py", line 1119, in getinfo
        'There is no item named %r in the archive' % name)
    KeyError: "There is no item named 'Batman' in the archive"

In this example, the zipfile.ZipFile class is used to look up information about a member named 'Batman' in a ZIP archive using the getinfo() function. Here, the traceback indicates that the exception was raised not in your code but in the zipfile code, by showing the line that caused the problem. The KeyError here does not come from a failed dictionary lookup; it is raised deliberately inside the zipfile.ZipFile.getinfo() call.

When do you need to raise a Python KeyError?

In Python programming, it is sometimes sensible to deliberately raise exceptions in your own code.
You can raise an exception using the raise keyword and by calling the KeyError exception:

    >>> raise KeyError('Batman')

Here, 'Batman' acts as the missing key. However, in most cases you should provide more information about the missing key so that the next developer has a clear understanding of the problem.

Conditions to raise a Python KeyError in your code:

- It should match the generic meaning behind the exception.
- A message should be displayed that explains the problem and names the missing key.

How to Handle a Python KeyError?

The main motive of handling a Python KeyError is to stop unexpected KeyError exceptions from being raised. There are a number of ways of handling a KeyError exception.

Using get()

The get() method is useful in cases where the exception is raised due to a failed dictionary lookup. It returns either the value for the specified key or a default value.

    # prices.py
    prices = { 'Pen' : 10, 'Pencil' : 5, 'Notebook' : 25}
    item = input('Get price of: ')
    price = prices.get(item)
    # Compare with None so that a price of 0 still counts as known
    if price is not None:
        print(f'The price of {item} is {price}')
    else:
        print(f'The price of {item} is not known')

This time, you'll not get a KeyError, because get() retrieves the price safely and falls back to a default value if the key is not found:

    Get price of: Eraser
    The price of Eraser is not known

In this example, the variable price will hold either the price of the item in the dictionary or the default value (which is None unless you specify otherwise).

When the key 'Eraser' is not found in the dictionary, get() returns None rather than raising a KeyError. You can also give another default value as a second argument to get():

    price = prices.get(item, 0)

If the key is not found, it will return 0 instead of None.

Checking for Keys

In some situations, get() might not provide the correct information.
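One such situation can be sketched as follows, assuming a hypothetical dictionary in which None is a legitimate stored value. Here get() returns None both for a missing key and for a key whose value is None, while the in operator tells the two apart. The describe() helper and the sample data are made up for this illustration.

```python
# Hypothetical data: None means "price not set yet", not "item missing".
prices = {"Pen": 10, "Pencil": None}


def describe(item):
    """Report whether an item is missing, unpriced, or priced."""
    if item not in prices:
        return f"{item} is not in the dictionary"
    if prices[item] is None:
        return f"{item} is listed but has no price yet"
    return f"{item} costs {prices[item]}"


# get() cannot tell these two cases apart: both calls return None.
print(prices.get("Pencil"), prices.get("Eraser"))

for name in ["Pen", "Pencil", "Eraser"]:
    print(describe(name))
```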
If it returns None, that can mean either that the key was not found or that the value stored for the key in the dictionary is actually None, and you cannot tell which. In such situations, you need to determine whether the key exists in the dictionary. You can use an if statement with the in operator to handle such cases. The in operator checks whether a key is present in the mapping and returns a boolean (True or False) value:

    counts = {}
    for i in range(50):
        key = i % 10
        if key in counts:
            counts[key] += 1
        else:
            counts[key] = 1

In this case, we do not check what the value of the missing key is; we check whether the key is in the dictionary or not. This is a special way of avoiding the exception and is used less often.

This technique is known as Look Before You Leap (LBYL).

Using try-except

The try-except block is one of the best ways to handle KeyError exceptions. It is also useful where get() and the in operator are not supported.

Let's apply the try-except block to our earlier price retrieval code:

    # prices.py
    prices = { 'Pen' : 10, 'Pencil' : 5, 'Notebook' : 25}
    item = input('Get price of: ')
    try:
        print(f'The price of {item} is {prices[item]}')
    except KeyError:
        print(f'The price of {item} is not known')

In this example there are two cases: a normal case and a backup case. The try block corresponds to the normal case and the except block to the backup case. If the normal case raises a KeyError instead of printing the name of the item and its price, the backup case prints a different message.

Using try-except-else

This is another way of handling exceptions. The try-except-else statement has three blocks: the try block, the except block, and the else block.

The else clause in a try-except statement runs only when the try block doesn't raise an exception.
However, it must follow all the except clauses.

Let us take our previous price retrieval code to illustrate try-except-else:

    # prices.py
    prices = { 'Pen' : 10, 'Pencil' : 5, 'Notebook' : 25}
    item = input('Get price of:')
    try:
        print(f'The price of {item} is {prices[item]}')
    except KeyError:
        print(f'The price of {item} is not known')
    else:
        print(f'There is no error in the statement')

First, we access a key in the try block. If a KeyError is not raised, there are no errors, so the else clause is executed and its statement is displayed on the screen.

Using finally

The try statement in Python can have an optional finally clause. It is used to define clean-up actions and is always executed, whatever happens. It is generally used to release external resources.

An example to show finally:

    # prices.py
    prices = { 'Pen' : 10, 'Pencil' : 5, 'Notebook' : 25}
    item = input('Get price of: ')
    try:
        print(f'The price of {item} is {prices[item]}')
    except KeyError:
        print(f'The price of {item} is not known')
    finally:
        print(f'The finally statement is executed')

Remember, the finally clause will always be executed whether an exception has occurred or not.

How to raise Custom Exceptions in Python?

Python provides a number of built-in exceptions that you can use in your program. However, when you're developing your own packages, you might need to create your own custom exceptions to increase the flexibility of your program.

You can create a custom Python exception using the pre-defined class Exception:

    def square(x): if x
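As a separate illustration of the idea (not a reconstruction of the article's own snippet), a custom exception can be sketched by subclassing Exception and raising it with the raise keyword. The names NegativeValueError and checked_sqrt are hypothetical and chosen only for this example.

```python
class NegativeValueError(Exception):
    """Hypothetical custom exception for values that must not be negative."""

    def __init__(self, value):
        # Keep the offending value so callers can inspect it later.
        self.value = value
        super().__init__(f"expected a non-negative number, got {value}")


def checked_sqrt(x):
    """Return the square root of x, raising NegativeValueError if x < 0."""
    if x < 0:
        raise NegativeValueError(x)
    return x ** 0.5


try:
    checked_sqrt(-4)
except NegativeValueError as err:
    # The custom exception is caught like any built-in one.
    print(f"Caught: {err}")
```

Because the class derives from Exception, existing handlers that catch Exception will also catch it, while callers who care can catch NegativeValueError specifically.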

How to Work With a PDF in Python

Whether it is an ebook, a digitally signed agreement, a password-protected document, or a scanned document such as a passport, the most preferred file format is PDF, or Portable Document Format. It was originally developed by Adobe and is a file format used to present and transfer documents easily and reliably. It uses the file extension .pdf. As one of the most widely used digital document formats, PDF is now an open standard maintained by the International Organization for Standardization (ISO).

Python has a relatively easy syntax, which helps those who are in the initial stage of learning the language. The popular Python PDF libraries make it easy to extract information from a PDF, rotate pages if required, split a PDF into separate documents, or add watermarks.

Now an important question arises: why do we need Python to process PDFs? Processing a PDF falls under the category of text analytics. There are several libraries and frameworks written in Python for text analytics, which makes it easier to work with a PDF in Python. You can also extract information from a PDF and feed it into Natural Language Processing or other Machine Learning models. Get certified and learn more about Python Programming and apply those skills and knowledge in the real world.

History of pyPDF, PyPDF2, pyPDF4

The first pyPdf package was released in 2005 and the last official release was in 2010. After a year or so, a company named Phaseit sponsored a fork of pyPdf called PyPDF2, which was consistent with the original package and worked well for several years.

A series of packages was later released under the name PyPDF3, which was then renamed PyPDF4. The biggest difference between pyPdf and the later versions is that the later versions support Python 3. PyPDF2 was recently abandoned.
But since PyPDF4 is not fully backward compatible with PyPDF2, using PyPDF2 is still suggested. You can also use a substitute package, pdfrw. pdfrw was created by Patrick Maupin and can do most of what PyPDF2 is capable of, except for a few things such as encryption, decryption, and some types of decompression.

Some common libraries in Python

Let us look into some of the libraries Python offers to handle PDFs:

- PdfMiner: A tool used to extract information from PDF documents. PDFMiner allows the user to analyze text data and obtain the exact location of a piece of text. It provides information such as fonts and lines. It can also be used as a PDF transformer and a PDF parser.
- PyPDF2: A pure-Python library that allows users to split, merge, crop, encrypt, and transform PDFs. You can also add customized data, view options, and passwords to the documents.
- Tabula-py: A Python wrapper of tabula-java that can read tables from PDF files and convert them into a Pandas DataFrame or into CSV/TSV/JSON file formats.
- Slate: A Python package that facilitates the extraction of information and depends on the PdfMiner package.
- PDFQuery: A light Python wrapper that uses minimal code to extract data from PDFs.
- xPDF: An open source PDF viewer that also includes an extractor, a converter, and other utilities.

Out of all the libraries mentioned above, PyPDF2 is the most used for operations like extraction, merging, splitting and so on.

Installing PyPDF2

If you're using Anaconda, you can install PyPDF2 using pip or conda. To install PyPDF2 using pip, run the following command in the command line:

    pip install PyPDF2

The package name is case-sensitive, so type it exactly as shown. The installation is quick since PyPDF2 is free of dependencies.

Extracting Document Information from a PDF in Python

PyPDF2 can be used to extract metadata and text from preexisting PDF files.
The types of data you can extract are:

- Author
- Creator
- Producer
- Subject
- Title
- Number of Pages

To understand it better, let us use an existing PDF on your system, or you can go to Leanpub and download a book sample.

The code for extracting the document information from the PDF:

    # get_doc_info.py
    from PyPDF2 import PdfFileReader

    def getinfo(path):
        with open(path, 'rb') as f:
            PDF = PdfFileReader(f)
            information = PDF.getDocumentInfo()
            numberofpages = PDF.getNumPages()
        print(information)
        # Individual fields are available as attributes
        author = information.author
        creator = information.creator
        producer = information.producer
        subject = information.subject
        title = information.title

    if __name__ == '__main__':
        path = 'reportlab-sample.pdf'
        getinfo(path)

The program prints the document information returned by getDocumentInfo().

Here, we first imported PdfFileReader from the PyPDF2 package. The PdfFileReader class is used to interact with PDF files, for example to read them and extract information using accessor methods. Then, we created our own function getinfo() that takes the path of a PDF file as an argument and calls getDocumentInfo(). This returns an instance of DocumentInformation. Finally, we extracted information such as the author, creator, subject, and title.

getNumPages() is used to count the number of pages in the document. PdfMiner can be used when you want to extract text from a PDF file. It is powerful and designed specifically for extracting text from PDFs.

We have learned to extract information from a PDF. Now let's learn how to rotate a PDF.

Rotating pages in PDF

We often receive PDFs that contain pages in landscape orientation instead of portrait. You may also find certain documents to be upside down, which can happen while scanning or mailing a document.
However, we can rotate the pages clockwise or counterclockwise as needed using Python with PyPDF2.

The code for rotating the pages is as follows:

    # rotate_pages.py
    from PyPDF2 import PdfFileReader, PdfFileWriter

    def rotate(pdf_path):
        pdf_write = PdfFileWriter()
        pdf_read = PdfFileReader(pdf_path)
        # Rotate page 90 degrees to the right
        page1 = pdf_read.getPage(0).rotateClockwise(90)
        pdf_write.addPage(page1)
        # Rotate page 90 degrees to the left
        page2 = pdf_read.getPage(1).rotateCounterClockwise(90)
        pdf_write.addPage(page2)
        # Add a page in normal orientation
        pdf_write.addPage(pdf_read.getPage(2))
        with open('rotate_pages.pdf', 'wb') as fh:
            pdf_write.write(fh)

    if __name__ == '__main__':
        path = 'mldocument.pdf'
        rotate(path)

The script writes the result to rotate_pages.pdf.

Here, we first imported PdfFileReader and PdfFileWriter so that we can write out a new PDF file. Then we declared a function rotate() that takes the path of the PDF to be modified. Within the function, we created a read object pdf_read and a write object pdf_write.

Then, we used getPage() to grab the pages. Two pages, page1 and page2, are rotated 90 degrees clockwise and 90 degrees counterclockwise respectively using rotateClockwise() and rotateCounterClockwise().

We called addPage() after each rotation. This adds the rotated page to the write object. The third page is added without any rotation.

Lastly, we used write() with a file-like parameter to write out the new PDF. The final PDF contains three pages: the first two are in landscape mode, rotated in opposite directions, and the third page is in normal orientation.

Now we will learn to merge different PDFs into one.

Merging PDFs

In many cases, we need to merge two PDFs into a single one. For example, suppose you are working on a project report and you need to print it and bind it into a book.
It contains a cover page followed by the project report. So you have two different PDFs and you want to merge them into one PDF. You can simply use Python to do so. Let us see how we can merge PDFs into one.

The code for merging two PDF documents page by page is shown below:

    # pdf_merging.py
    from PyPDF2 import PdfFileReader, PdfFileWriter

    def pdfmerger(paths, output):
        pdfwrite = PdfFileWriter()
        for path in paths:
            pdfread = PdfFileReader(path)
            for page in range(pdfread.getNumPages()):
                # Add each page to the writer object
                pdfwrite.addPage(pdfread.getPage(page))
        # Write out the merged PDF
        with open(output, 'wb') as out:
            pdfwrite.write(out)

    if __name__ == '__main__':
        paths = ['document-1.pdf', 'document-2.pdf']
        pdfmerger(paths, output='merged.pdf')

Here we created a function pdfmerger() that takes a list of input paths and a single output path. Then we created a PdfFileReader() object for each PDF path, looped over its pages, and added each page to the write object. Finally, the write() function writes the object's contents to disk.

PyPDF2 makes the process of merging simpler by providing the PdfFileMerger class.

Code for merging documents using PdfFileMerger:

    # pdf_merger2.py
    import glob
    from PyPDF2 import PdfFileMerger

    def merger(output_path, input_paths):
        pdfmerge = PdfFileMerger()
        for path in input_paths:
            pdfmerge.append(path)
        with open(output_path, 'wb') as fileobj:
            pdfmerge.write(fileobj)

    if __name__ == '__main__':
        paths = glob.glob('d-1.pdf')
        paths.sort()
        merger('d-2.pdf', paths)

PdfFileMerger makes this simpler because we don't need to loop over the pages of each document ourselves. Here, we created the object pdfmerge and looped through the PDF paths. PyPDF2 automatically appends the whole document.
Finally, we write it out.

Let's perform the opposite of merging now!

Splitting PDFs

The PyPDF2 package can split a single PDF into multiple PDFs. Suppose we have a set of scanned documents in a single PDF and we need to separate the pages into different PDFs. We can simply use Python to select the pages we want to split off and get the work done.

Code for splitting a single PDF into multiple PDFs:

# pdf_splitter.py
import os
from PyPDF2 import PdfFileReader, PdfFileWriter

def splitpdf(path):
    fname = os.path.splitext(os.path.basename(path))[0]
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdfwrite = PdfFileWriter()
        pdfwrite.addPage(pdf.getPage(page))
        outputfilename = '{}_page_{}.pdf'.format(fname, page + 1)
        with open(outputfilename, 'wb') as out:
            pdfwrite.write(out)
        print('Created: {}'.format(outputfilename))

if __name__ == '__main__':
    path = 'document-1.pdf'
    splitpdf(path)

Here we have imported PdfFileReader and PdfFileWriter from PyPDF2. Then we created a function called splitpdf() which accepts the path of the PDF we want to split. The first line of the function extracts the base name of the input file, without its directory or extension. Then we open the PDF and create a read object. Using the read object's getNumPages(), we loop over all the pages.

Inside the for loop, we create a new PdfFileWriter instance for each page of the input PDF and add that single page to it. We also build a unique filename from the original filename, the word 'page' and the page index plus 1, so that numbering starts at 1 rather than 0.

Once the script has run, each page of the input PDF will have been written to its own PDF file. Now let us learn how to add a watermark to a PDF and keep it secured.

Adding Overlays/Watermarks

An image or text superimposed on selected pages of a PDF document is called a watermark. 
A watermark adds a layer of protection for our intellectual property, such as images and PDFs. Watermarks are also called overlays.

PyPDF2 allows us to watermark documents. We just need a PDF that contains our watermark text, image or signature.

Code for adding a watermark to a PDF:

# watermarker.py
from PyPDF2 import PdfFileWriter, PdfFileReader

def watermark(inputpdf, outputpdf, watermarkpdf):
    # Named watermarkread so it does not shadow the function name
    watermarkread = PdfFileReader(watermarkpdf)
    watermarkpage = watermarkread.getPage(0)
    pdf = PdfFileReader(inputpdf)
    pdfwrite = PdfFileWriter()
    for page in range(pdf.getNumPages()):
        pdfpage = pdf.getPage(page)
        pdfpage.mergePage(watermarkpage)
        pdfwrite.addPage(pdfpage)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)

if __name__ == '__main__':
    watermark(inputpdf='document-1.pdf',
              outputpdf='watermarked_w9.pdf',
              watermarkpdf='watermark.pdf')

The output is a copy of the input PDF with the watermark merged onto every page.

The function watermark() takes three arguments:
inputpdf: The path of the PDF that is to be watermarked.
outputpdf: The path where the watermarked PDF will be saved.
watermarkpdf: The PDF which contains the watermark.

First, we extract the PDF page that contains the watermark image or text, and then open the PDF to which we want to apply the watermark.

Using inputpdf, we create a read object, and we create a write object pdfwrite to write out the watermarked PDF. Then we iterate over the pages.

Next, we call the page object's mergePage() to apply the watermark and add the result to the write object pdfwrite.

When the loop terminates, the watermarked PDF is written out to disk and it's done!

Encrypting a PDF

In the PDF world, there is an owner password, which gives its holder administrator-level rights over the document, and PyPDF2 supports it. 
The package also supports the user password, which allows us to open the document upon entering the password.

PyPDF2 does not let you set individual permissions on a PDF file, but it does let you set the owner password and the user password.

Code to add a password and encryption to a PDF:

# pdf_encrypt.py
from PyPDF2 import PdfFileWriter, PdfFileReader

def encryption(inputpdf, outputpdf, password):
    pdfwrite = PdfFileWriter()
    pdfread = PdfFileReader(inputpdf)
    for page in range(pdfread.getNumPages()):
        pdfwrite.addPage(pdfread.getPage(page))
    pdfwrite.encrypt(user_pwd=password, owner_pwd=None,
                     use_128bit=True)
    with open(outputpdf, 'wb') as fh:
        pdfwrite.write(fh)

if __name__ == '__main__':
    encryption(inputpdf='document-1.pdf',
               outputpdf='document-1-encrypted.pdf',
               password='twofish')

We declare a function named encryption() with three arguments: the input PDF path, the output PDF path and the password that we want to set. Then we create one read object pdfread and one write object pdfwrite. Next we loop over all the pages and add them to the write object, since we need to encrypt the entire document.

Finally, we call the encrypt() function, which accepts three parameters: the user password, the owner password and whether or not to use 128-bit encryption. The PDF will be encrypted with 40-bit encryption if the argument use_128bit is set to False. Also, if the owner password is set to None, it will automatically be set to the user password.

Reading the Table data from PDF

If you want to work with table data in a PDF, you can use tabula-py to read tables in a PDF. 
To install tabula-py, run:

pip install tabula-py

Code to read a table from a PDF using tabula-py:

import tabula
# read the PDF file that contains table data
# read_pdf loads the PDF table into a pandas DataFrame
# (recent tabula-py releases return a list of DataFrames by default;
# in that case use df[0].head() below)
df = tabula.read_pdf("document.pdf")
# print the first 5 rows of the table
df.head()

If your PDF file contains multiple tables:

df = tabula.read_pdf("document.pdf", multiple_tables=True)

If you want to extract information from a specific area of a specific page of the PDF:

tabula.read_pdf("document.pdf", area=(126,149,212,462), pages=1)

If you want the output in JSON format:

tabula.read_pdf("offense.pdf", output_format="json")

Exporting PDF into Excel

Suppose you want to export the table data of a PDF to a spreadsheet. tabula-py's convert_into() writes CSV, TSV or JSON files, and a CSV file opens directly in Excel:

tabula.convert_into("document.pdf", "document_testing.csv", output_format="csv")

For a native Excel workbook, one option is to read the table into a DataFrame as shown earlier and call its to_excel() method.

Let us sum up what we have learned in the article:
Extraction of data from a PDF
Rotate pages in a PDF
Merge PDFs into one PDF
Split a PDF into many PDFs
Add watermarks or overlays in a PDF
Add password or encryption to a PDF
Reading table from PDF
Exporting PDF into Excel or CSV

As you have seen, PyPDF2 is one of the most useful tools available in Python. Its features make life easier whether you are working on a large project or just want to make some quick changes to your PDF documents. To learn more about such libraries and frameworks, KnowledgeHut offers a Python Certification Course for Programmers, Developers, Jr./Sr. Software Engineers/Developers and anybody who wants to learn Python.
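As a small recap of the splitting script, the output naming scheme can be tried on its own without opening any PDF. The sketch below is an illustration only: the helper name page_output_names() and the example path are assumptions and not part of PyPDF2; it simply reproduces the base name, the word 'page' and the page index plus 1 pattern described earlier.

```python
import os

def page_output_names(path, num_pages):
    """Return the per-page filenames a splitter like splitpdf() would create.

    Hypothetical helper for illustration; it only builds names and does
    not read or write any PDF.
    """
    # Drop the directory and the extension: 'scans/document-1.pdf' -> 'document-1'
    fname = os.path.splitext(os.path.basename(path))[0]
    # Page indexes start at 0, so add 1 to get human-friendly numbering
    return ['{}_page_{}.pdf'.format(fname, page + 1)
            for page in range(num_pages)]

# Example with an assumed three-page input file
names = page_output_names('scans/document-1.pdf', 3)
print(names)
```

Keeping the naming logic separate like this makes it easy to check the filenames before any files are written to disk.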