Python Pandas Beginner Tutorial

Pandas is an open-source Python library built on top of NumPy. It lets you import, clean, analyze, and visualize data, and it works together with packages such as NumPy and matplotlib for numerical computing and plotting. In simple terms, Pandas is often described as Python’s version of Microsoft Excel.

One important point: Pandas can read data from many sources, e.g. Excel sheets, CSV files, SQL databases, web pages (HTML tables), XML, JSON, etc.

Important Features of Pandas

  • Insert and delete columns in data structures.
  • Merge and join data sets.
  • Reshape and pivot data sets.
  • Align data and deal with missing data.
  • Reshape high-dimensional data into lower-dimensional forms.
  • Read other source files such as CSV, JSON, XML, etc.
  • Arrange data in ascending or descending order.
  • Manipulate data by using the index of DataFrame objects.
  • Split data sets into groups with group by and aggregate each group.
  • Analyze time series.
  • Integrate data from different data sets.

How to Install Pandas in Python?

To install pandas, open a command prompt and run:

pip install pandas

How to check the version and contents of pandas?

In a Python file, import pandas, then print its version and use dir() to list the names the package provides:

import pandas
print(pandas.__version__)
print(dir(pandas))

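How to import a CSV or Excel file in pandas?

Here you can use the read_csv function to load a CSV file into a DataFrame. For Excel workbooks, pandas provides the read_excel function, which is used in the same way. Let’s try to import some dummy data from a CSV file.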

import pandas as pd
csvfile=pd.read_csv('data.csv')
print(csvfile)
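
The snippet above needs a data.csv file on disk. As a self-contained sketch, the same call can be tried on an in-memory file; the column names and values below are made up for illustration and are not the contents of the tutorial's data.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for 'data.csv'.
csv_text = """Sales,Sales2018,Sales2019
Accounts,152000,25400
HR,250000,254600
Marketing,154000,65000
"""

# read_csv accepts a file path or any file-like object,
# so StringIO lets us test without touching the disk.
csvfile = pd.read_csv(io.StringIO(csv_text))
print(csvfile)
```

The first line of the file becomes the column names, and each following line becomes one row.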

Once the data is in a DataFrame, we can inspect and manipulate it in different ways. Conveniently, Pandas provides methods and attributes that make it quick to view the data as a table:

DataFrame.head(): Returns the first N rows of a DataFrame, where N is the number you pass as an argument (5 if you pass nothing)

csvfile.head()

DataFrame.tail(): Returns the last N rows of a DataFrame, where N is the number you pass as an argument (5 if you pass nothing)

csvfile.tail()

DataFrame.columns: Gives the column names; you can slice them by position:

csvfile.columns
csvfile.columns[0:2]

DataFrame.dtypes: Shows the data type of each column:

csvfile.dtypes

DataFrame.shape: Gives the shape of the data set as a (rows, columns) tuple:

csvfile.shape
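
The inspection tools above can be tried together on a small table. This is a minimal sketch on made-up data; the DataFrame and its column names are assumptions, not the tutorial's file.

```python
import pandas as pd

# Hypothetical six-row table used only to show the inspection tools.
df = pd.DataFrame({
    "dept": ["Accounts", "HR", "Marketing", "Sale", "IT", "Legal"],
    "sales2018": [152000, 250000, 154000, 650120, 90000, 40000],
})

first_rows = df.head()        # first 5 rows by default
last_two = df.tail(2)         # last 2 rows
first_column = df.columns[0:1]  # Index object sliced by position

print(first_rows)
print(last_two)
print(first_column)
print(df.dtypes)              # one dtype per column
print(df.shape)               # (number of rows, number of columns)
```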

How to create DataFrame using python pandas?

DataFrame is the main data structure in pandas. It lets us work with data in tabular form: rows are observations and columns are variables.

The constructor has the following signature:

pandas.DataFrame(data, index, columns, dtype, copy)

Create a DataFrame: Let’s see how to create a DataFrame from a dictionary of columns

df = pd.DataFrame({'Sales': ['Accounts', 'HR', 'Marketing', 'Sale'],
                   'sales2018': [152000, 250000, 154000, 650120],
                   'sales2019': [25400, 254600, 65000, 688000]})

Setting the Index of a DataFrame: By default, the index of a DataFrame starts at zero. You can replace it with your own labels.

df.index=['one','two','three','four']

Indexing a DataFrame: By passing a list of column names, we can select the data of those columns

df[['sales2018','sales2019']]

Slicing a DataFrame: We can slice a DataFrame by position to retrieve a range of rows.

df[0:3]

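Data Selection with loc and iloc: Using loc and iloc, you can select specific rows and columns of a data set. loc selects by label and iloc selects by integer position.

df.loc['four']     # row with the label 'four'
df.iloc[3]         # row at position 3 (the same row)
df.iloc[:, 1:3]    # all rows, columns at positions 1 and 2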
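
To make the difference between loc and iloc concrete, here is a small sketch that rebuilds the DataFrame from this section with its labeled index and selects the same data both ways. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Sales": ["Accounts", "HR", "Marketing", "Sale"],
        "sales2018": [152000, 250000, 154000, 650120],
        "sales2019": [25400, 254600, 65000, 688000],
    },
    index=["one", "two", "three", "four"],
)

# Single row: by label with loc, by position with iloc.
row_by_label = df.loc["two"]
row_by_position = df.iloc[1]

# Label slices include the end label; position slices exclude the end.
label_slice = df.loc["one":"three", ["sales2018", "sales2019"]]
position_slice = df.iloc[0:3, 1:3]

print(label_slice)
```

Note that "one":"three" returns three rows while 0:3 also returns three rows, because only the position-based slice leaves out its end point.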

How to manipulate a dataset using python pandas?

Changing data types: Use astype to convert a column to another type

csvfile['Sales2020'] = csvfile['Sales2020'].astype(float)
print(csvfile)
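
A minimal sketch of type conversion on made-up data. The Sales2020 column here is hypothetical, and pd.to_numeric is shown as one option for text columns that may contain values that cannot be converted.

```python
import pandas as pd

# Hypothetical integer column.
csvfile = pd.DataFrame({"Sales2020": [100, 250, 75]})
csvfile["Sales2020"] = csvfile["Sales2020"].astype(float)
print(csvfile.dtypes)

# Text that may hold bad values: errors="coerce" turns them into NaN
# instead of raising an error.
raw = pd.Series(["10", "20", "n/a"])
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned)
```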

Creating a Frequency Distribution: To create a frequency distribution, use the value_counts() method

csvfile.index=['A','B','A']   # assumes the data has exactly three rows
csvfile.index.value_counts(ascending=True)

Create a Crosstab: crosstab builds a bivariate frequency table

pd.crosstab(csvfile.index, csvfile.Sales2019)
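
Both frequency tools are easier to read on categorical data. The following sketch uses a made-up table; the region and product columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "North", "South"],
    "product": ["A", "A", "B", "A", "A", "B"],
})

# Univariate frequency distribution: how often each region occurs.
counts = df["region"].value_counts()

# Bivariate frequency table: regions as rows, products as columns.
table = pd.crosstab(df["region"], df["product"])

print(counts)
print(table)
```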

Choose a Column as the Index: You can use one of the columns as the index for the others

csvfile.set_index('Sales2018',inplace=True)

Sorting Data: To sort by a column, use the sort_values function

csvfile.sort_values('Sales2018',ascending=False)
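
As a sketch of these two steps on made-up data: without inplace=True, both set_index and sort_values return a new DataFrame and leave the original untouched. The dept column is a hypothetical name.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["Accounts", "HR", "Marketing", "Sale"],
    "sales2018": [152000, 250000, 154000, 650120],
})

# Use the 'dept' column as the index of a new DataFrame.
indexed = df.set_index("dept")

# Sort from the largest to the smallest sales value.
sorted_desc = indexed.sort_values("sales2018", ascending=False)
print(sorted_desc)
```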

Renaming Variables: To rename the columns of a data set, assign a new list of names or use rename

csvfile.columns=['sales18','sales19','sales20']
csvfile.rename(columns={'sales18':'salesdata'},inplace=True)

Updating a Column: If you want to add 10% extra to the amounts in a column, you can compute the new values with eval

csvfile['salesdata']=csvfile.eval('salesdata+(salesdata*(0.1))')
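
A small sketch of the same kind of calculation on made-up numbers, next to the equivalent plain column arithmetic. The column names with_eval and with_arith are hypothetical and exist only for the comparison.

```python
import pandas as pd

df = pd.DataFrame({"sales18": [100.0, 200.0, 350.0]})

# eval parses the string and evaluates it using the column names.
df["with_eval"] = df.eval("sales18 + sales18 * 0.1")

# The same calculation written as ordinary column arithmetic.
df["with_arith"] = df["sales18"] + df["sales18"] * 0.1

print(df)
```

Both forms give the same numbers; eval is mainly a convenience for longer expressions.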

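What is the Group By Function in Pandas?

groupby groups the data by the values of a variable so that you can aggregate each group, for example by taking the minimum of another column

csvfile.groupby('salesdata')['sales19'].min()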
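
Grouping is clearest when the grouping column has repeated values. Here is a minimal sketch on made-up data, where dept and sales are hypothetical column names.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["HR", "HR", "Marketing", "Marketing", "Accounts"],
    "sales": [100, 300, 250, 150, 400],
})

# Smallest sale within each department.
min_by_dept = df.groupby("dept")["sales"].min()

# Several aggregations at once, one column per statistic.
summary = df.groupby("dept")["sales"].agg(["min", "max", "sum"])

print(min_by_dept)
print(summary)
```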

How to use Filtering with Python?

Filtering can be performed in two ways: by comparing the index with a single value, or by checking membership in a list with isin

csvfile[csvfile.index==2]
csvfile[csvfile.index.isin([1,3])]
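
The two filters above can be sketched on a small made-up table with the default 0-based index. The last line applies isin to a column instead of the index, which works the same way.

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["Accounts", "HR", "Marketing", "Sale"],
    "sales": [152000, 250000, 154000, 650120],
})

# Keep the row whose index label equals 2.
row_two = df[df.index == 2]

# Keep the rows whose index label is in the given list.
rows_1_3 = df[df.index.isin([1, 3])]

# isin also works on a column of values.
by_value = df[df["dept"].isin(["HR", "Sale"])]

print(rows_1_3)
```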

Missing Data in Python

Often, when you use Pandas to read data that has missing values, Pandas automatically fills those missing points with a NaN/Null value. We can then either drop the rows that contain them by using the .dropna() method or replace them by using the .fillna() method.
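
A minimal sketch of both options, assuming a small made-up CSV with two empty fields. The column names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical CSV text with two missing values.
csv_text = """dept,sales2018,sales2019
Accounts,152000,
HR,,254600
Marketing,154000,65000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Count the missing values in each column.
print(df.isna().sum())

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: replace missing values, here with 0 in the numeric columns.
filled = df.fillna({"sales2018": 0, "sales2019": 0})

print(dropped)
print(filled)
```

Neither call changes df itself; each returns a new DataFrame.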

Conditional Selection in Python

Pandas lets you perform conditional selection by using bracket notation []. A comparison such as csvfile['Sales']>0 produces a Boolean mask, and passing that mask inside the brackets returns only the rows where it is True.

Say we want only the columns ‘Sales18’ and ‘Sales19’ for the rows where ‘Sales’>0. Let’s try this: csvfile[csvfile['Sales']>0][['Sales18','Sales19']]

With multiple conditions, combine the masks with the logical operators & (AND) and | (OR), wrapping each condition in parentheses. To return the rows where ‘Sales18’>0 and ‘Sales19’>1, use: csvfile[(csvfile['Sales18']>0) & (csvfile['Sales19']>1)]
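
Here is a short sketch of single and combined conditions on made-up numbers. Each condition is wrapped in parentheses because & and | bind more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical values chosen so that each filter keeps different rows.
df = pd.DataFrame({
    "Sales18": [100, -50, 200, 0],
    "Sales19": [5, 10, 0, 20],
})

# One condition: rows where Sales18 is positive.
positive = df[df["Sales18"] > 0]

# AND: both conditions must hold.
both = df[(df["Sales18"] > 0) & (df["Sales19"] > 1)]

# OR: at least one condition must hold.
either = df[(df["Sales18"] > 0) | (df["Sales19"] > 1)]

print(both)
```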

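Count Method

The .count() method returns the number of non-missing values in each column of a DataFrame. To count how many times each distinct item appears in a column, use value_counts() instead.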
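
As a minimal sketch on made-up data with one missing entry per column: count() reports the non-missing entries in each column, while value_counts() reports how often each distinct value occurs.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "dept": ["HR", "HR", "Sale", None],
    "sales": [1.0, np.nan, 3.0, 4.0],
})

non_missing = df.count()            # one number per column
per_value = df["dept"].value_counts()  # one number per distinct value

print(non_missing)
print(per_value)
```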

Describe

The .describe() method gives an overview of a DataFrame. It returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column.
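
A quick sketch of describe() on the 2018 sales figures used earlier in this tutorial:

```python
import pandas as pd

df = pd.DataFrame({"sales2018": [152000, 250000, 154000, 650120]})

# One row per statistic, one column per numeric column of df.
stats = df.describe()
print(stats)

# Individual statistics can be read with loc.
print(stats.loc["mean", "sales2018"])
```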

Merging and Joining

The words “merge” and “join” are used in Pandas and in other languages such as SQL. They are two separate functions in pandas, but they do similar things.

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single DataFrame, matching rows on the index. Merging is similar, but matches rows on the values of one or more columns.

Example: let’s see how to merge two data sets by using the pandas merge command. The rows of one data set are matched to the rows of the other based on a shared column, here ‘Sales’

import pandas as pd
sales_data1=pd.read_csv('data.csv')
sales_data2=pd.read_csv('data_salary.csv')
result = pd.merge(sales_data1, sales_data2,on='Sales')
print(result)
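
Since data.csv and data_salary.csv are not included here, the following sketch uses two small made-up DataFrames to show merge on a column and join on the index. The salary column and all values are assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
sales = pd.DataFrame({
    "Sales": ["Accounts", "HR", "Marketing"],
    "sales2018": [152000, 250000, 154000],
})
salary = pd.DataFrame({
    "Sales": ["HR", "Marketing", "IT"],
    "salary": [50000, 60000, 70000],
})

# merge matches rows on a column; the default keeps only keys found in both.
inner = pd.merge(sales, salary, on="Sales")

# how="left" keeps every row of the left table and fills gaps with NaN.
left = pd.merge(sales, salary, on="Sales", how="left")

# join matches rows on the index (a left join by default).
joined = sales.set_index("Sales").join(salary.set_index("Sales"))

print(inner)
print(left)
print(joined)
```

The default inner merge drops 'Accounts' and 'IT' because each appears in only one table, so it is worth checking the row count after a merge.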
