This blog post covers the two most widely used and discussed Python libraries for data manipulation, feature engineering and data wrangling: Pandas and NumPy.
By the end of this post, you should have a clear understanding of what each library offers and how the two differ.
Let's get started.
NumPy stands for Numerical Python. It is a fundamental open source, third-party (external) Python library for creating and manipulating numerical objects. It was created by Travis Oliphant in 2005.
Let’s decompose and understand this complicated introduction!
NumPy is not part of the standard Python installation. However, you can easily install the latest version from the Python Package Index (PyPI) using pip, Python's utility for managing external libraries, by running pip install numpy.
One of the most fundamental data objects provided by NumPy is the multi-dimensional array, called ndarray (nd for "N"-dimensional).
NumPy also has many built-in operations and functions which operate on an ndarray, such as random sampling, sorting, searching and string operations. It also provides a wide range of statistics for these arrays.
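As a rough illustration of the operations mentioned above, the sketch below sorts an array, searches it, and computes a simple statistic. The values and variable names are made up for this example.

```python
import numpy as np

values = np.array([7, 2, 9, 4])  # illustrative data

sorted_values = np.sort(values)             # sorted copy: [2 4 7 9]
position_of_max = np.argmax(values)         # index of the largest value: 2
indexes_above_5 = np.where(values > 5)[0]   # positions where value > 5: [0 2]
mean_value = values.mean()                  # arithmetic mean: 5.5

print(sorted_values, position_of_max, indexes_above_5, mean_value)
```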
In NumPy, ndarrays or arrays can be created in a few different ways:
We can create an array with user-defined values using the built-in syntax.
First, we import the NumPy library under the alias np for easy access later. Next, we define an array using the built-in function array, passing a list of numbers as the argument.
Upon printing we should see the array printed on the screen.
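The steps described above can be sketched as follows. The numbers in the list are chosen purely for illustration.

```python
import numpy as np  # "np" is the conventional alias

# Define an array by passing a list of numbers to np.array
arr = np.array([1, 2, 3, 4, 5])

print(arr)  # [1 2 3 4 5]
```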
Some of the fundamental attributes of a NumPy object are:
NumPy provides various built-in attributes and functions which expose metadata about an array object.
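A minimal sketch of inspecting array metadata is shown below. The attributes ndim, shape, size and dtype are picked here as common examples, and the array itself is made up.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])  # illustrative 2 x 3 array

print(arr.ndim)   # number of dimensions: 2
print(arr.shape)  # length along each dimension: (2, 3)
print(arr.size)   # total number of elements: 6
print(arr.dtype)  # element type, an integer type such as int64
```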
We can access any elements of an array using the "index" mechanism. Indexes represent the address or position of elements in an array. In Python, the index position starts from 0.
For example, in an array whose first element is 1, accessing index 0 (enclosed in square brackets) returns 1.
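Index-based access might look like the sketch below, using a small made-up array. Negative indexes are included as an extra illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

first = arr[0]   # index 0 is the first element
last = arr[-1]   # negative indexes count from the end

print(first)  # 1
print(last)   # 5
```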
We can also create an array from existing data structures such as a list or a tuple.
The built-in function used to create the array (np.array) remains the same and only the argument changes. In the first instance we pass a list, and in the second instance we pass a tuple.
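As a sketch, the same np.array call accepts either a list or a tuple. The variable names and values below are illustrative.

```python
import numpy as np

numbers_list = [10, 20, 30]    # a Python list
numbers_tuple = (10, 20, 30)   # a Python tuple

arr_from_list = np.array(numbers_list)
arr_from_tuple = np.array(numbers_tuple)

print(arr_from_list)
print(arr_from_tuple)
# Both calls produce arrays with identical contents
print(np.array_equal(arr_from_list, arr_from_tuple))  # True
```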
Lastly, we have the option to create an array using other built-in methods. This gives the user many ways to generate arrays without typing out the values.
Here, we are creating an array with a range of values using the built-in function np.arange.
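A short sketch of np.arange follows. The start, stop and step values are arbitrary examples.

```python
import numpy as np

a = np.arange(5)         # values 0 up to, but not including, 5
b = np.arange(2, 10, 2)  # start at 2, stop before 10, step by 2

print(a)  # [0 1 2 3 4]
print(b)  # [2 4 6 8]
```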
We can also create an array with all elements initialized to either 0 or 1.
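One way to do this is with np.zeros and np.ones, as sketched below. The shapes are chosen for illustration.

```python
import numpy as np

zeros = np.zeros((2, 3))      # 2 x 3 array filled with 0.0
ones = np.ones(4, dtype=int)  # 1-D array of four integer 1s

print(zeros)
print(ones)  # [1 1 1 1]
```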
We can create an array which follows specific data distributions. This is especially helpful in initializing weights in neural networks.
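The sketch below draws arrays from a normal and a uniform distribution using NumPy's random Generator. The seed, shapes and distribution parameters are assumptions made for this example, not recommended settings for any particular model.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # seeded for reproducibility

# Normally distributed values with mean 0 and standard deviation 0.01
normal_weights = rng.normal(loc=0.0, scale=0.01, size=(3, 2))

# Uniformly distributed values in the half-open interval [-0.1, 0.1)
uniform_weights = rng.uniform(low=-0.1, high=0.1, size=(3, 2))

print(normal_weights)
print(uniform_weights)
```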
The NumPy library provides tons of features which help users of all backgrounds such as Data Analysts, Data Scientists, Researchers or even novice users to work with large and complex data and also extract meaningful insights out of it.
Below is the list of some features provided by NumPy (This is by NO means an exhaustive list)
Pandas is also an open source, third-party library, used primarily for data manipulation, wrangling and data exploration. Its name is derived from “panel data” and is also a play on “Python data analysis”. Wes McKinney began developing Pandas in 2008.
Pandas provides a framework to read data from multiple sources such as Excel, CSV, JSON, SQL and many more.
Fundamentally, Pandas provides two types of data objects: Series and DataFrame.
An individual column is referred to as a Series, and multiple Series together make up a “DataFrame”. As Pandas is not part of a standard Python installation, we have to install it separately using the pip utility (pip install pandas).
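To make the relationship between the two objects concrete, here is a small sketch. The column names and values are invented for illustration.

```python
import pandas as pd

# A Series is a single labelled column of values
ages = pd.Series([25, 32, 47], name="age")

# A DataFrame holds several columns side by side
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [25, 32, 47]})

# Selecting one column of a DataFrame gives back a Series
age_column = df["age"]
print(type(age_column).__name__)  # Series
```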
Pandas has built-in methods for reading data from each of these formats.
For example, a DataFrame can be created from an existing CSV file and its first few records printed using the built-in function head. DataFrame objects can be accessed at both the row and column level, as both are labelled.
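Reading a CSV might look like the sketch below. To keep it self-contained, the CSV content is a made-up in-memory string standing in for a file on disk; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content used in place of a real file
csv_text = """name,age,city
Ann,25,Pune
Bob,32,Delhi
Cy,47,Mumbai
"""

df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (5 by default)
print(df.head(2))
```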
Pandas provides several special functions (this list is not exhaustive) which help the user get to know the data better.
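As an illustration, the sketch below uses shape, dtypes, info and describe to get a first look at a DataFrame. These four are picked here as common examples, and the data is invented.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cy"],
        "age": [25, 32, 47],
        "salary": [50000.0, 64000.0, 58000.0],
    }
)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each column
df.info()         # prints column names, non-null counts and memory usage

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
summary = df.describe()
print(summary)
```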
Accessing the DataFrame using row or column index becomes easy for an analyst or data scientist, as it allows them to select the subset of data and perform dedicated operations or logic on top of it.
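A minimal sketch of row and column access is shown below, using loc for labels and iloc for positions. The index labels and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [25, 32, 47], "city": ["Pune", "Delhi", "Mumbai"]},
    index=["ann", "bob", "cy"],  # row labels
)

ages = df["age"]          # select a column by its label
bob_row = df.loc["bob"]   # select a row by its label
first_row = df.iloc[0]    # select a row by its position

# Select a subset: rows where age > 30, keeping only the city column
subset = df.loc[df["age"] > 30, ["city"]]
print(subset)
```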
Pandas is THE most widely used package when it comes to data manipulation and data transformation. The availability of built-in functions and support for various user defined operations makes it very easy for users across all groups to prepare their data for downstream tasks. Apart from these above-mentioned features, given below are a few more which contribute to the popularity of Pandas.
| Aspect | Pandas | NumPy |
|---|---|---|
| Data Objects | Used for creating two-dimensional data objects | Used for creating “N”-dimensional objects |
| Types of Data Objects | Pandas creates heterogeneous objects (columns can hold different types). | NumPy creates homogeneous objects (all elements share one type). |
| External Data | Pandas objects are typically created from external data such as CSV, Excel or SQL | NumPy generally uses data created by the user or by built-in functions |
| Application | Pandas objects are primarily used for data manipulation and data wrangling | NumPy objects are used to create matrices or arrays which are used in creating ML or DL models |
| Data Access | Data can be accessed using index positions or index labels | Data is accessed using only index positions |
| Operations | Pandas provides special utilities such as groupby, loc, iloc and apply to access and manipulate different subsets of data | NumPy doesn’t provide these utilities; however, subsets can be selected using indexes or boolean (conditional) indexing |
| Speed | DataFrames are relatively slower than arrays | NumPy arrays are faster than DataFrames |
| Usage | Commonly used for holding external user data and performing analysis on it to understand the data well | Commonly used for building components for ML or DL models |
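To illustrate the Operations row, the sketch below contrasts a Pandas groupby with NumPy boolean indexing on the same made-up scores. The column names and the threshold are assumptions for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"team": ["a", "b", "a", "b"], "score": [10, 20, 30, 40]})

# Pandas: label-aware aggregation, here the mean score per team
mean_by_team = df.groupby("team")["score"].mean()
print(mean_by_team)  # a -> 20.0, b -> 30.0

# NumPy: select a subset with a boolean condition on a plain array
scores = df["score"].to_numpy()
high_scores = scores[scores > 15]
print(high_scores)  # [20 30 40]
```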
We have now seen the importance and usage of two of the most widely used Python packages, and why they are so useful and efficient.
In conclusion, both libraries have their own uses, and one cannot simply replace the other. They play fundamental roles in data analysis, understanding, manipulation and preparation for further downstream tasks.
If you are dealing with simpler, more homogeneous data which requires a lot of mathematical operations, I would suggest that you use NumPy. On the other hand, if you are using data from a client or a similar entity and your end goal is to understand, manipulate and transform the data, then the clear choice should be Pandas.