In this note, `df` denotes a pandas DataFrame and `s` denotes a pandas Series.
import pandas as pd  # import pandas package
import numpy as np

# Load data; treat the literal string "none" as missing (NaN).
df = pd.read_csv('filename.csv', na_values=['none'])

df.head()        # first 5 rows
df.tail()        # last 5 rows
df.head(10)      # first 10 rows
df.info()        # column dtypes, non-null counts, memory usage
df.describe()    # summary statistics for numerical features
df.describe(include=['O'])   # categorical (object-dtype) features
df.describe(include='all')   # all column types
df.shape         # (n_rows, n_cols)
df.dtypes        # dtype of each column
# count how many columns there are of each dtype
# (df.get_dtype_counts() was removed in pandas 1.0 — use dtypes.value_counts())
df.dtypes.value_counts()
Check the distribution of values using KDE (Kernel Density Estimation):
# NOTE: the KDE plot needs matplotlib (and scipy for the density estimate);
# the original snippet used `plt` without importing it.
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 5))   # wide figure for a long value range
df['value'].plot.kde()        # smoothed estimate of the distribution of 'value'
# LIST OF COLUMNS
df.columns
len(df.columns)  # number of columns

# UNIQUE VALUES IN A COLUMN
df['col'].unique()         # array of the unique values
df['col'].unique().size    # number of unique values (NaN counts as one)
df['col'].nunique()        # number of unique values (excludes NaN by default)

# Count the elements of each class in df
# (bracket access instead of df.Classes — works for any column name,
#  consistent with the rest of this note)
df['Classes'].value_counts()  # counts per class label, e.g. 0 and 1

# count occurrences of each unique value in a column/series
df['col'].value_counts()
👉 See the section "Deal with missing values" in Data Processing & Cleaning.
# total number of NaNs in the whole dataframe
# (isna() is the preferred spelling; isnull() is an identical alias —
#  use one consistently)
df.isna().sum().sum()

# number of NaNs in each column
df.isna().sum()

# number of non-NaN entries in each column
df.count()
# ... and in each row
df.count(axis=1)

# columns containing at least one NaN
null_columns = df.columns[df.isna().any()].tolist()
# how many NaNs in each of those columns?
df[null_columns].isna().sum()

# number of rows that are ALL NaN
df.isna().all(axis=1).sum()
# number of columns that are ALL NaN
df.isna().all(axis=0).sum()

# index labels of the rows that are ALL NaN
df.index[df.isna().all(axis=1)].to_list()