CSV Data Manipulation

Last modified: 2023-08-23

Machine Learning Preprocessing

Load CSV as DataFrame

We can load CSV data with the pandas library. Once the data is loaded as a DataFrame, we can use its methods to investigate and manipulate the data.

import pandas as pd

# Load CSV as Pandas DataFrame
df = pd.read_csv("example.csv")
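
As a self-contained sketch of the same step, the snippet below reads CSV text from an in-memory string instead of a file, so it runs without `example.csv` on disk. The column names and values are made up for illustration and are not taken from any real dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for "example.csv"
csv_text = """name,age,country
John,34,USA
Emma,28,UK
Jane,41,Canada
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)  # column types inferred by pandas
print(df.shape)   # (number of rows, number of columns)
```

The first line of the CSV is used as the header row by default, which is why `name`, `age` and `country` become column names.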

Investigate Data

Once the data is loaded, we can inspect it in several ways.

Dimensionality

df.shape

# output example
(2, 3)

Correlation of Columns

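# `corr` is a method, so it must be called.
# numeric_only=True skips non-numeric columns (available in pandas 1.5 and later).
df.corr(numeric_only=True)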
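
A minimal sketch of what the correlation matrix looks like, assuming a small made-up DataFrame in which `weight` is an exact linear function of `height`. The column names are hypothetical, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import pandas as pd

# Hypothetical data: weight = height - 100, so the two columns are perfectly correlated
df = pd.DataFrame({
    "name": ["John", "Emma", "Jane"],
    "height": [170.0, 160.0, 180.0],
    "weight": [70.0, 60.0, 80.0],
})

# The non-numeric 'name' column is left out of the result
corr = df.corr(numeric_only=True)
print(corr)
```

The result is a square DataFrame whose rows and columns are both labelled by the numeric column names, so a single coefficient can be read with `corr.loc["height", "weight"]`.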

Display Data by Index

# Display the first 5 rows. (default)
df.head()

# Display the first 10 rows
df.head(10)

# Display the first row only
df.iloc[0]    # returned as a Series
df.iloc[[0]]  # returned as a one-row DataFrame

# Display the first and the second rows
df.iloc[[0, 1]]

# Display from the first row to the fourth row (the end position is excluded)
df.iloc[0:4]

# Display the first 20 rows
df.iloc[:20]

# Display the value at the first row and the third column (positions are zero-based)
df.iloc[0, 2]
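
The position-based selections can be checked on a small made-up DataFrame. This is only an illustrative sketch; the data is hypothetical. It shows that `iloc` positions are zero-based and that a slice excludes its end position.

```python
import pandas as pd

# Hypothetical data with five rows and three columns
df = pd.DataFrame({
    "name": ["John", "Emma", "Jane", "Liam", "Olivia"],
    "age": [34, 28, 41, 19, 52],
    "country": ["USA", "UK", "Canada", "Ireland", "France"],
})

first_row = df.iloc[0]        # a Series holding the first row
first_row_df = df.iloc[[0]]   # a DataFrame with one row
first_four = df.iloc[0:4]     # positions 0, 1, 2, 3 (position 4 is excluded)
cell = df.iloc[0, 2]          # first row, third column

print(first_four)
print(cell)
```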

Display Data by Conditions

# Display rows where the value of the `age` is over 30.
df.loc[df['age'] > 30]

# Display rows where the value of the `name` column contains 'Emma'.
df.loc[df['name'].str.contains('Emma')]

# Display rows where the value of the `name` column is 'Jane' and the value of the `age` column is over 25.
# Each comparison needs its own parentheses because `&` binds more tightly than `==` and `>`.
df.loc[(df['name'] == 'Jane') & (df['age'] > 25)]

# Same condition as above, but excluding rows where the value of `age` is 30.
df.loc[(df['name'] == 'Jane') & (df['age'] > 25) & ~(df['age'] == 30)]
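
One way to keep combined conditions readable is to name each boolean mask before combining them. The sketch below uses made-up data and hypothetical variable names (`is_jane`, `over_25`, `not_30`); it is one possible style, not the only one.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "name": ["Jane", "Jane", "Jane", "Emma"],
    "age": [22, 30, 41, 35],
})

# Each mask is a boolean Series with one entry per row
is_jane = df["name"] == "Jane"
over_25 = df["age"] > 25
not_30 = ~(df["age"] == 30)

# Combine masks with & (and), | (or), ~ (not)
jane_over_25 = df.loc[is_jane & over_25]
jane_over_25_not_30 = df.loc[is_jane & over_25 & not_30]

print(jane_over_25)
print(jane_over_25_not_30)
```

Because each mask is built on its own line, no extra parentheses are needed when the masks are combined.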

Display Data by Row/Column Name

# Display only the 'age' column for the row whose row name (index label) is 'John'.
df.loc["John", "age"]

# Display rows whose row name is 'John' and all columns.
df.loc["John", :]

# Display all rows but only the 'age' column.
df.loc[:, "age"]
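
Selecting by row name only works when the row names are the index labels. As a sketch, assuming the names live in a regular `name` column (a hypothetical layout), that column can first be made the index with `set_index`.

```python
import pandas as pd

# Hypothetical data; 'name' becomes the row index
df = pd.DataFrame({
    "name": ["John", "Emma"],
    "age": [34, 28],
    "country": ["USA", "UK"],
}).set_index("name")

john_age = df.loc["John", "age"]  # a single value
john_row = df.loc["John", :]      # a Series with all columns of that row
ages = df.loc[:, "age"]           # a Series of ages indexed by name

print(john_age)
print(john_row)
print(ages)
```

When reading from a file, passing `index_col` to `pd.read_csv` achieves the same thing at load time.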

Manipulate Data

We can update specific values by position or by condition.

# Set 'adult' to 1 in rows where 'age' is 18 or more.
df.loc[df['age'] >= 18, 'adult'] = 1

# Set 'country' to 'France' in the first 20 rows.
# `iloc` only accepts integer positions, so look up the column position first.
df.iloc[:20, df.columns.get_loc('country')] = 'France'
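
Both kinds of update can be tried on a small made-up DataFrame. This is a sketch under assumed data: the `adult` column is initialised to 0 first (an assumption, so that rows not matching the condition hold 0 rather than a missing value), and only the first two rows are updated by position to keep the example short.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "age": [15, 18, 42],
    "country": ["USA", "UK", "Canada"],
})

# Update by condition: start from 0, then flag rows where age is 18 or more
df["adult"] = 0
df.loc[df["age"] >= 18, "adult"] = 1

# Update by position: iloc needs the integer position of the column
country_pos = df.columns.get_loc("country")
df.iloc[:2, country_pos] = "France"

print(df)
```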

After manipulating the data, we can save the updated DataFrame as a CSV file:

# The row index is written as the first column by default; pass index=False to omit it.
df.to_csv("updated.csv")
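
The effect of the `index` argument can be seen with a round trip through an in-memory buffer. This is a minimal sketch with made-up data; it writes to a `StringIO` object rather than to `updated.csv` so that it leaves no file behind.

```python
import io

import pandas as pd

# Hypothetical data
df = pd.DataFrame({"name": ["John", "Emma"], "age": [34, 28]})

# Write the CSV text into a buffer, without the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read it back to confirm the data survives the round trip
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Without `index=False`, the saved file gains an extra unnamed first column holding the row index, which then shows up as a regular column when the file is loaded again.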