5 Essential Pandas Functions for Data Science in Python: Use Case Examples
Built on top of the Python programming language, pandas is an open-source data analysis and manipulation tool that is fast, powerful, flexible, and easy to use. Here, we'll discuss some of the essential pandas functions and provide use case examples to illustrate their practical applications.
To follow along, install pandas and import it:
!pip install pandas
import pandas as pd
df.loc[] – Indexing and Selecting Data
The df.loc[] indexer is used to select data in a pandas DataFrame by row and column labels.
import pandas as pd
# create a sample dataframe
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# access a single value using loc
print(df.loc[0, 'A']) # Output: 1
# access a row using loc
print(df.loc[0, :]) # Output:
# A 1
# B 4
# C 7
# Name: 0, dtype: int64
# access multiple rows using loc
print(df.loc[0:1, :]) # Output:
# A B C
# 0 1 4 7
# 1 2 5 8
# access a column using loc
print(df.loc[:, 'A']) # Output:
# 0 1
# 1 2
# 2 3
# Name: A, dtype: int64
# access multiple columns using loc
print(df.loc[:, ['A', 'B']]) # Output:
# A B
# 0 1 4
# 1 2 5
# 2 3 6
df.assign() – Adding Columns to a DataFrame
The df.assign() function adds new columns to a DataFrame. It returns a new DataFrame with the added columns and leaves the original unchanged, which is why the result is assigned back to df in the example below.
import pandas as pd
# create a sample dataframe
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
# add a new column using assign
df = df.assign(D = [10, 11, 12])
print(df) # Output:
# A B C D
# 0 1 4 7 10
# 1 2 5 8 11
# 2 3 6 9 12
# add multiple new columns using assign
df = df.assign(E = df['A'] + df['B'], F = df['C'] * 2)
print(df) # Output:
# A B C D E F
# 0 1 4 7 10 5 14
# 1 2 5 8 11 7 16
# 2 3 6 9 12 9 18
How to convert a datetime string into a datetime type in Pandas
Use pd.to_datetime() with an explicit format string to parse the strings; chaining .dt.date keeps only the date part:
tbl_warranty_info["StartDate"] = pd.to_datetime(tbl_warranty_info["StartDate"], format='%Y-%m-%d').dt.date
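The tbl_warranty_info frame comes from the original example and isn't defined here; a minimal self-contained sketch of the same conversion, with made-up dates, looks like this:
import pandas as pd
# hypothetical stand-in for tbl_warranty_info
warranty = pd.DataFrame({"StartDate": ["2021-01-15", "2021-06-30"]})
# parse the strings as datetimes, then keep only the date part
warranty["StartDate"] = pd.to_datetime(warranty["StartDate"], format='%Y-%m-%d').dt.date
print(warranty["StartDate"])
# 0    2021-01-15
# 1    2021-06-30
# Name: StartDate, dtype: object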
How to select a column by its name or index
First, let's create a DataFrame. There are multiple ways to create one; we will keep it simple and build the frame directly from a dictionary of columns, then select from it (see the sketch after the code below).
df = pd.DataFrame({
'student_id': ['5111','5112','5113','5114','5115'],
'student_name': ['Amy','Phil','Randy','Andrew','Tim'],
'city': ['California','Boston','Rochester','Chicago','Portland']
})
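With the frame in place, here is a minimal sketch of selecting a column by its name and by its integer position (via iloc):
# select a column by its name
print(df['student_name'])
# select the same column by its integer position (1 = second column)
print(df.iloc[:, 1])
# both print:
# 0       Amy
# 1      Phil
# 2     Randy
# 3    Andrew
# 4       Tim
# Name: student_name, dtype: object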



How to check values within a DataFrame using Pandas and Regex
In the example below, we check the city column for values that contain the keyword 'York'.
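The original screenshot is missing; here is a minimal sketch using Series.str.contains(), with a hypothetical 'New York' row added so the pattern actually matches something:
df2 = pd.DataFrame({
'student_id': ['5111','5112','5113'],
'student_name': ['Amy','Phil','Randy'],
'city': ['New York','Boston','Rochester']
})
# str.contains() accepts a regex pattern and returns a boolean Series
mask = df2['city'].str.contains('York', regex=True)
print(df2[mask])
#   student_id student_name      city
# 0       5111          Amy  New York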

How to filter and select data from a Pandas DataFrame
In order to understand data selection in a pandas DataFrame, let's get familiar with the concepts of indexing and filtering. Selecting specific subsets of data from a DataFrame (such as rows, columns, or individual cells) is known as indexing. Each row and column in a pandas DataFrame has a distinct label (name) and position number (similar to a coordinate), and each data point is identified by its location at the intersection of a particular row and column.
Let's refer to the DataFrame df below. We access the entire frame using two different syntaxes.
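The screenshot of df is missing; here is a stand-in frame with the same column names (the values are made up) and the two access syntaxes:
import numpy as np
df = pd.DataFrame(np.arange(15).reshape(3, 5) / 10.0,
columns=['alpha1', 'alpha2', 'gamma1', 'gamma2', 'chi2min'])
# both of these return the entire frame
print(df)            # the plain display of the frame
print(df.loc[:, :])  # label-based selection of all rows and all columns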

A pandas DataFrame is represented by rows and columns. The column labels, known as column names, are typically strings; in the frame above the columns are alpha1, alpha2, gamma1, gamma2, and chi2min.
If we pull the columns using the df.columns syntax, we get the result below:
Index(['alpha1', 'alpha2', 'gamma1', 'gamma2', 'chi2min'], dtype='object')
The row labels, known as the DataFrame index, can be either integer numbers or string values. We can refer to specific rows, columns, or data points of the DataFrame using these labels, because the index and the column names both contain only unique values.
Using the df.index syntax, we can get the row/index values from a pandas DataFrame.
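For the stand-in frame above, which keeps pandas' default integer index, df.index prints:
print(df.index)
# RangeIndex(start=0, stop=3, step=1)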

The indexing operator [], the attribute operator ., and suitable pandas indexing techniques such as loc, iloc, at, and iat are used to choose a subset from a DataFrame.
Label-based indexing and position-based (integer) indexing are the two primary methods for indexing pandas DataFrames. Additionally, it is possible to use boolean indexing according to established criteria, or even to combine the different indexing types, as sketched below.
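A short sketch contrasting these accessors on the stand-in frame from above:
# label-based: row label 0, column label 'alpha1'
print(df.loc[0, 'alpha1'])
# position-based: first row, first column
print(df.iloc[0, 0])
# at/iat are faster single-value accessors, label- and position-based respectively
print(df.at[0, 'alpha1'], df.iat[0, 0])
# boolean indexing: rows where chi2min exceeds a threshold
print(df[df['chi2min'] > 0.5])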
The DataFrame.isin() method selects values from the DataFrame that match the values passed to the method.
It returns a DataFrame of booleans indicating whether each element in the DataFrame is contained in the passed values.
Let's create a DataFrame and check the isin() function:
import pandas as pd
df = pd.DataFrame({
'student_id': ['5111','5112','5113','5114','5115'],
'student_name': ['Amy','Phil','Randy','Andrew','Tim'],
'city': ['California','Boston','Rochester','Chicago','Portland']
})
df
df.isin(['Amy'])
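The screenshot of the result is missing; running the call above yields the boolean frame below, True only where the value 'Amy' appears:
#    student_id  student_name   city
# 0       False          True  False
# 1       False         False  False
# 2       False         False  False
# 3       False         False  False
# 4       False         False  False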

See also:
DataFrame.eq – Equality test for a DataFrame.
Series.isin – Equivalent method on a Series.
Series.str.contains – Test if a pattern or regex is contained within a string of a Series or Index.
How to delete/drop rows with condition on column value
Let's say we want to drop the row whose name value is Randy from the DataFrame below. Deleting a row requires that row's index, so we first fetch the index and then pass it to the df.drop() method.
df = pd.DataFrame({
'student_id': ['5111','5112','5113','5114','5115'],
'student_name': ['Amy','Phil','Randy','Andrew','Tim'],
'city': ['California','Boston','Rochester','Chicago','Portland']
})

get_index = df[df["student_name"]=="Randy"].index
# let's check the value of the get_index object
get_index
# Output: Int64Index([2], dtype='int64')
# get_index contains the index [2]; we will use inplace=True to make the changes in the df DataFrame itself
df.drop(get_index, inplace = True)
# let's see the result of the df DataFrame
df
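# Expected result: the row with Randy (index 2) is gone
#   student_id student_name        city
# 0       5111          Amy  California
# 1       5112         Phil      Boston
# 3       5114       Andrew     Chicago
# 4       5115          Tim    Portland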
# to drop a column by its name, pass the label with axis=1
# (p_id is a hypothetical column carried over from the original example)
df.drop(["p_id"], axis = 1, inplace = True)

How to remove duplicate rows from a pandas DataFrame: say we want to remove duplicate names, keeping the last row for each repeated name (a self-contained sketch follows the call below).

df.drop_duplicates(subset=["Name"], keep='last', inplace = True)
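The original input frame and its result screenshot are missing; here is a self-contained sketch with a made-up Name column showing the same call and its effect:
df = pd.DataFrame({'Name': ['Amy', 'Phil', 'Amy', 'Randy'],
'Score': [80, 75, 90, 85]})
# keep the last occurrence of each repeated Name
df.drop_duplicates(subset=["Name"], keep='last', inplace=True)
print(df)
#     Name  Score
# 1   Phil     75
# 2    Amy     90
# 3  Randy     85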

Pandas Concatenate Function
The pandas concat() function can be used to combine two DataFrames, adding either columns or rows to a DataFrame.
With axis=1, concat() joins the DataFrames horizontally, adding their columns side by side.
To combine DataFrames vertically, pass axis=0, which stacks their rows.
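A minimal sketch of both directions, using made-up frames:
df_a = pd.DataFrame({'x': [1, 2]})
df_b = pd.DataFrame({'y': [3, 4]})
# axis=1: add columns side by side
print(pd.concat([df_a, df_b], axis=1))
#    x  y
# 0  1  3
# 1  2  4
# axis=0: stack rows; ignore_index=True renumbers the index
df_c = pd.DataFrame({'x': [5, 6]})
print(pd.concat([df_a, df_c], axis=0, ignore_index=True))
#    x
# 0  1
# 1  2
# 2  5
# 3  6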

pandas.io.sql module to work with SQL Databases using pandas Data Structures
pandas.io.sql is a module within the pandas library that provides functionality for working with SQL databases using pandas data structures. Specifically, it provides tools for reading and writing data between pandas DataFrames and SQL databases, as well as executing SQL queries and commands.
In order to use pandas.io.sql to connect to a SQL database, you first need to establish a connection to the database using a SQL engine. The sqlalchemy package is a popular and flexible library for creating SQL engines and managing database connections in Python. Once you have installed sqlalchemy, you can use it to create an engine that connects to your SQL database by specifying the appropriate connection string for your database.
Here's an example of how to create an engine using sqlalchemy:
import sqlalchemy
# create an engine that connects to your SQL database
engine = sqlalchemy.create_engine('postgresql://username:password@host:port/database_name')
In the example above, replace username, password, host, port, and database_name with the appropriate values for your SQL database.
Once you have created the engine, you can use it with pandas.io.sql functions like read_sql() and to_sql() to interact with your SQL database.
The pandas.io.sql module contains functions such as read_sql(), which allows you to read data from a SQL database into a pandas DataFrame, and to_sql(), which enables you to write a pandas DataFrame to a SQL database. It also includes tools for executing SQL queries and commands directly via the read_sql_query() and read_sql_table() functions.
In addition to these basic functions, pandas.io.sql also provides support for more advanced SQL functionality such as joins, unions, and transactions. Overall, this module makes it easy to work with SQL databases in a pandas-centric way, allowing you to take advantage of the power and flexibility of both pandas and SQL in your data analysis and manipulation workflows. Here are some examples of how to use the pandas.io.sql module to interact with a SQL database:
import pandas as pd
import sqlalchemy
# create an engine that connects to the SQL database
engine = sqlalchemy.create_engine('postgresql://username:password@host:port/database_name')
# read data from an SQL table into a pandas DataFrame
df = pd.read_sql_table('table_name', con=engine)
# execute an SQL query and read the results into a pandas DataFrame
# note: SQL string literals use single quotes; double quotes denote identifiers
query = "SELECT * FROM table_name WHERE column_name = 'value'"
df = pd.read_sql_query(query, con=engine)
# write data from a pandas DataFrame to an SQL table
df.to_sql('table_name', con=engine, if_exists='append')
# execute a SQL command (e.g. create a new table)
# note: engine.execute() was removed in SQLAlchemy 2.0; run raw SQL through
# a connection and sqlalchemy.text() instead, which works on 1.x and 2.x
with engine.begin() as conn:
    conn.execute(sqlalchemy.text('CREATE TABLE new_table (column1 INTEGER, column2 VARCHAR(255))'))
# execute a SQL transaction (e.g. insert multiple rows into a table)
with engine.connect() as conn:
    with conn.begin():
        conn.execute(sqlalchemy.text("INSERT INTO table_name (column1, column2) VALUES (1, 'value1')"))
        conn.execute(sqlalchemy.text("INSERT INTO table_name (column1, column2) VALUES (2, 'value2')"))
Best Practices while using Pandas
Method of Accessing Columns
We can access a pandas column in multiple ways:
df.columnName – this should be avoided, as it won't work where a column name contains a space.
It can also clash where the column name is the same as the name of an existing DataFrame attribute or method.
So use df["columnName"], as sketched below.
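A short sketch of both pitfalls, with made-up column names:
df = pd.DataFrame({'first name': ['Amy'], 'count': [3]})
# df.first name  -> SyntaxError: attribute access cannot contain a space
# df.count       -> returns the DataFrame.count method, not the column
print(df['first name'])  # works for any column name
print(df['count'])       # unambiguously the column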
crosstab
pd.crosstab() is a function in the pandas library used to compute a simple cross-tabulation of two or more factors. It is a convenient way to summarize and explore the relationships between categorical variables.
The pd.crosstab() function takes one or more arrays or Series and computes a frequency table of the counts of each combination of categories.
Here's the syntax for using crosstab:
pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
index: The array or Series to group by on the rows. It can be a single array or a list of arrays.
columns: The array or Series to group by on the columns. It can be a single array or a list of arrays.
values: (Optional) The array or Series to aggregate. Requires aggfunc to be specified.
rownames: (Optional) The names for the row levels. If None, and the index has names, they will be used.
colnames: (Optional) The names for the column levels. If None, and the columns have names, they will be used.
aggfunc: (Optional) The aggregation function to apply to values; if omitted, crosstab simply counts occurrences. Requires values to be specified.
margins: (Optional) Add row/column margins (subtotals).
margins_name: (Optional) The name of the margin label.
dropna: (Optional) Whether to drop columns in the result whose entries are all NaN.
normalize: (Optional) Whether to compute proportions instead of counts. Can be set to 'index', 'columns', or 'all'.
Here’s an example usage of crosstab
:
import pandas as pd
import numpy as np
# Create a dataframe
df = pd.DataFrame({'Gender': ['M', 'M', 'F', 'F', 'M', 'F'],
'Age': [22, 25, 18, 20, 27, 30],
'Income': [50000, 75000, 30000, 40000, 60000, 90000]})
# Compute a crosstab of Gender and Age
pd.crosstab(index=df['Gender'], columns=df['Age'])
Output:
Age 18 20 22 25 27 30
Gender
F 1 1 0 0 0 1
M 0 0 1 1 1 0
This shows the counts of individuals in the dataset by their gender and age. The rows represent the genders and the columns represent the ages, and the counts represent the number of individuals in each combination of gender and age.
The statistical concept of cross tabulation (or crosstab) involves tabulating the distribution of two or more variables to understand the relationship between them.
In other words, it provides a way to summarize and display the number of observations that fall into each combination of categories for two or more categorical variables.
This helps to identify patterns and relationships between variables, such as whether there is a strong association between two variables, or whether the frequency of certain outcomes varies across different categories.
Crosstab is commonly used in various fields including market research, social sciences, and medical research to help analyze and understand data.
Errors and Solutions:
Error: Exception has occurred: ValueError
can only convert an array of size 1 to a Python scalar

Reason: This error is triggered because converting a pandas Series (or NumPy array) to a single Python scalar requires it to contain exactly one element.
Solution: To fix this issue, check that the DataFrame or Series is not empty and has been filtered down to a single value before converting it (for example, take the first element explicitly).
Or
Reason: It can also occur with NumPy, when a whole array is supplied as a parameter to a function that expects a single scalar.
Solution: For NumPy we can apply the operation element-wise instead, using the np.vectorize() function, the map() function, or the astype() method, to fix the issue.
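A minimal sketch reproducing the NumPy variant of the error and the fixes mentioned above (the array values are made up):
import numpy as np
arr = np.array([1.5, 2.5, 3.5])
# int(arr) raises: ValueError: can only convert an array of size 1 to a Python scalar
# fix 1: convert element-wise with astype()
print(arr.astype(int))         # [1 2 3]
# fix 2: wrap the scalar-only function with np.vectorize()
print(np.vectorize(int)(arr))  # [1 2 3]
# fix 3: if a single scalar is really wanted, reduce the array to size 1 first
print(int(arr[:1].item()))     # 1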