Pandas

Pandas has two core data structures: Series and DataFrame.
- Series: a one-dimensional labeled array; each column of a dataset is a Series.
- DataFrame: a two-dimensional table that holds the whole dataset in a form pandas can work with; it is made up of Series.
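
A minimal sketch of the relationship between the two structures. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# Selecting a single column from a DataFrame gives a Series.
ages = df["age"]

print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series
```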

Functions:

Initial data exploration:

df = pd.read_csv("path_to_dataset"): load dataset
df.info(): column names, non-null counts and the data type (dtype) of each column
df.describe(): overview of the numerical data: count, mean, standard deviation, min, quartiles and max.
df[column].isna(): return a mask that is True where the value is NaN and False elsewhere.
df.dropna(axis, how, subset, thresh): drop rows (or columns, with axis=1) that contain NaN; by default a row is dropped if any of its values is NaN
df.fillna(value, method, axis, limit): replace NaN with a given value, such as the mean, max or min of the column, a string, or 0. In recent pandas versions, df.ffill() and df.bfill() are preferred over the method argument.
df.nunique(): how many unique values we have in each column of the DataFrame.
df.duplicated(): Search for entire rows that duplicate an earlier row, and return a mask
pd.to_numeric(df[column], errors={'ignore','raise','coerce'}, downcast={'integer','signed','unsigned','float'}): ensure every entry in a column is numeric. With errors='coerce', values that cannot be parsed become NaN.
df.groupby("series"): Split the DataFrame into groups by the categorical values in this Series
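
The calls above can be sketched on a small in-memory table. The DataFrame below is a hypothetical stand-in for the result of pd.read_csv("path_to_dataset"), and its column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for pd.read_csv("path_to_dataset").
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome", "Rome"],
    "price": ["10", "12", "n/a", "8", "8"],
    "rating": [4.0, np.nan, 3.0, 5.0, 5.0],
})

# "n/a" cannot be parsed as a number, so errors="coerce" turns it into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

missing_mask = df["rating"].isna()       # boolean mask, True where rating is NaN
complete_rows = df.dropna(how="any")     # drop rows with at least one NaN

# Fill only the rating column, using its mean (NaN is skipped when averaging).
filled = df.fillna({"rating": df["rating"].mean()})

unique_counts = df.nunique()             # distinct non-NaN values per column
duplicate_mask = df.duplicated()         # True for rows repeating an earlier row

# Split by the categorical "city" column and average the price per group.
mean_price_by_city = df.groupby("city")["price"].mean()
```

Here the last row repeats the fourth one, so it is the only row flagged by duplicated().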

Exploring a single column

pd.unique(column): Return unique values in order of appearance (hash-table based, not sorted).
Series.value_counts(): Return a Series containing counts of unique values
Series.mean(): Return the mean of the values over the requested axis
Series.size: Return the number of elements in the underlying data (an attribute, so no parentheses)
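
A short sketch of these single-column calls, using two made-up Series:

```python
import pandas as pd

# Hypothetical example columns, for illustration only.
letters = pd.Series(["a", "b", "a", "c", "a"])
numbers = pd.Series([2, 4, 6])

uniques = pd.unique(letters)      # unique values, in order of first appearance
counts = letters.value_counts()   # Series indexed by value, holding the counts
average = numbers.mean()          # arithmetic mean of the values
n_elements = letters.size         # attribute access, not a method call
```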

Data manipulation

pd.concat([df1, df2], axis, join): append data structures to one another along one axis (axis=0 stacks rows, axis=1 places columns side by side)
pd.merge(dataframe1, dataframe2, on, how): Merge two DataFrames on one or multiple keys or indices, a little bit like JOIN in SQL
df_filtered = df[df["Column"].isin([specific_value1, specific_value2])]: Keep only the rows whose value in a column is one of the specified values.
df.drop('column name', axis=1): Remove column or columns
- axis=1: Column operations
DataFrame.reset_index(): Move the index (including MultiIndex levels) back into regular columns and reset it to the default numeric index.
Series.apply(): Invoke function on values of Series
pd.DataFrame(data={'col1':[1,2], 'col2':[3,4]}): create dataframe
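
A minimal sketch that chains the manipulation calls above. The frames, column names and the double_amount column are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy frames, for illustration only.
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 20, 70]})
more_orders = pd.DataFrame({"customer_id": [1], "amount": [30]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})

# Stack the two order tables on top of each other (axis=0) with a fresh index.
all_orders = pd.concat([orders, more_orders], axis=0, ignore_index=True)

# Inner join on the shared key: customer 3 has no match and is dropped.
merged = pd.merge(all_orders, customers, on="customer_id", how="inner")

# Keep only rows whose name is in the given list.
filtered = merged[merged["name"].isin(["Ann"])]

# Remove a column (axis=1); the original frame is left unchanged.
without_id = merged.drop("customer_id", axis=1)

# groupby puts "name" in the index; reset_index turns it back into a column.
totals = merged.groupby("name")["amount"].sum().reset_index()

# Apply a function to every value of a Series.
merged["double_amount"] = merged["amount"].apply(lambda x: x * 2)
```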

Date

Series.dt.to_period(freq): Convert DatetimeArray to PeriodArray
- freq: W (week), M (month), Q (quarter), Y (year)
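
As a small sketch, converting a datetime column to monthly and quarterly periods might look like this (the dates are made up):

```python
import pandas as pd

# Hypothetical dates, for illustration only.
dates = pd.Series(pd.to_datetime(["2023-01-15", "2023-02-20", "2023-05-03"]))

months = dates.dt.to_period("M")     # one period per calendar month
quarters = dates.dt.to_period("Q")   # one period per calendar quarter

print(months.astype(str).tolist())    # ['2023-01', '2023-02', '2023-05']
print(quarters.astype(str).tolist())  # ['2023Q1', '2023Q1', '2023Q2']
```

Periods are handy as groupby keys when aggregating by month or quarter.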

Facts:

type(np.nan) is float, which means a column coerced with errors='coerce' keeps a numeric (float) dtype, so numeric operations can be applied to the entire column.
A mask is a Series of True and False values.
~ inverts the boolean values in a Series.
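
These facts can be checked with a short sketch. The messy input column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical messy column: two parseable values and one that is not.
raw = pd.Series(["1", "2", "oops"])

# Coercion replaces "oops" with NaN; the result is a float64 column.
nums = pd.to_numeric(raw, errors="coerce")

is_nan_float = isinstance(np.nan, float)  # NaN is itself a float
total = nums.sum()                        # NaN is skipped by default

mask = nums.isna()    # boolean Series, True where the value is NaN
valid = nums[~mask]   # ~ flips the mask, keeping only the non-NaN values
```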