CSV Data Manipulation
Last modified: 2023-08-23
Load CSV as DataFrame
We can load CSV data with the Pandas module. Once the data is in a DataFrame, we can use its many functions to investigate and manipulate it.
import pandas as pd
# Load CSV as Pandas DataFrame
df = pd.read_csv("example.csv")
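To try the loading step without a file on disk, here is a minimal sketch that reads CSV text from memory. The column names (`name`, `age`, `country`) and the sample rows are assumptions made up for illustration; they are not the contents of `example.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for example.csv
csv_text = """name,age,country
John,28,UK
Emma,34,France
Jane,30,Japan
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Check the column names and the number of rows and columns
print(list(df.columns))
print(df.shape)
```

Passing a path such as `"example.csv"` instead of the `StringIO` object works the same way.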
Investigate Data
We can inspect the data in several ways: its dimensions, the correlation between columns, and selections of rows or columns.
Dimensionality
df.shape
# output example
(2, 3)
Correlation of Columns
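df.corr()
# If the data has non-numeric columns such as `name`, recent pandas versions need: df.corr(numeric_only=True)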
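As a small worked example of the two checks above, the sketch below builds a DataFrame by hand and computes its shape and column correlations. The `height` column and all values are hypothetical. Selecting the numeric columns first with `select_dtypes` is one way to keep the text column out of the calculation.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns
df = pd.DataFrame({
    "name": ["John", "Emma", "Jane"],
    "age": [20, 30, 40],
    "height": [160, 170, 180],
})

# shape is an attribute (no parentheses): (number of rows, number of columns)
rows, cols = df.shape

# corr() is a method; restrict it to numeric columns so `name` is ignored
corr = df.select_dtypes(include="number").corr()
print(corr)
```

Because `height` grows linearly with `age` in this sample, their correlation is 1.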
Display Data by Index
# Display the first 5 rows (the default)
df.head()
# Display the first 10 rows
df.head(10)
# Display the first row only
df.iloc[0]    # returned as a Series
df.iloc[[0]]  # returned as a one-row DataFrame
# Display the first and the second rows
df.iloc[[0, 1]]
# Display from the first row to the fifth row (the end position 5 is excluded)
df.iloc[0:5]
# Display the first 20 rows
df.iloc[:20]
# Display the value in the first row and the third column (positions start at 0)
df.iloc[0, 2]
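The position-based selections above can be checked on a small DataFrame. This is a minimal sketch with made-up data; the variable names are only for illustration. It shows that `iloc` positions start at 0 and that slice ends are excluded.

```python
import pandas as pd

# Hypothetical sample data with five rows and three columns
df = pd.DataFrame({
    "name": ["John", "Emma", "Jane", "Liam", "Noah"],
    "age": [28, 34, 30, 22, 41],
    "country": ["UK", "France", "Japan", "US", "Canada"],
})

first_as_series = df.iloc[0]    # single position -> Series
first_as_frame = df.iloc[[0]]   # list of positions -> DataFrame
first_four = df.iloc[0:4]       # positions 0, 1, 2, 3 (4 is excluded)
cell = df.iloc[0, 2]            # first row, third column

print(cell)
```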
Display Data by Conditions
# Display rows where the value of the `age` column is over 30.
df.loc[df['age'] > 30]
# Display rows where the value of the `name` column contains 'Emma'.
df.loc[df['name'].str.contains('Emma')]
# Display rows where the value of the `name` column is 'Jane' and the value of the `age` column is over 25.
# Each comparison needs its own parentheses because `&` binds more tightly than `==` and `>`.
df.loc[(df['name'] == 'Jane') & (df['age'] > 25)]
# Same condition as above, but excluding rows where the value of the `age` column is 30.
df.loc[(df['name'] == 'Jane') & (df['age'] > 25) & ~(df['age'] == 30)]
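When several conditions are combined, naming each boolean mask first can make the filter easier to read and avoids operator-precedence mistakes. The sketch below uses made-up data and hypothetical variable names.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "name": ["Jane", "Jane", "Jane", "Emma"],
    "age": [24, 30, 35, 40],
})

# Each mask is a boolean Series aligned with the rows of df
is_jane = df["name"] == "Jane"
over_25 = df["age"] > 25
not_30 = ~(df["age"] == 30)

# Combine masks with & (and), | (or), ~ (not)
jane_over_25 = df.loc[is_jane & over_25]
jane_over_25_not_30 = df.loc[is_jane & over_25 & not_30]

print(jane_over_25_not_30)
```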
Display Data by Row/Column Name
# Display the 'age' value of the row labeled 'John' (the index must contain 'John').
df.loc["John", "age"]
# Display all columns of the row labeled 'John'.
df.loc["John", :]
# Display all rows but only the 'age' column.
df.loc[:, "age"]
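Selecting by row name only works when the names are the row labels (the index). A freshly loaded CSV usually has a numeric index, so one option, assumed here, is to promote a column to the index with `set_index`. The data below is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; `name` becomes the row label
df = pd.DataFrame({
    "name": ["John", "Emma"],
    "age": [28, 34],
    "country": ["UK", "France"],
}).set_index("name")

john_age = df.loc["John", "age"]  # single value
john_row = df.loc["John", :]      # Series holding every column for 'John'
ages = df.loc[:, "age"]           # Series holding 'age' for every row

print(john_row)
```

`pd.read_csv` also accepts an `index_col` argument that sets the index while loading.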
Manipulate Data
We can update specific values by position or by condition.
# Set 'adult' to 1 in rows where 'age' is 18 or more.
df.loc[df['age'] >= 18, 'adult'] = 1
# Set 'country' to 'France' in the first 20 rows (iloc needs the column's integer position, not its name).
df.iloc[:20, df.columns.get_loc('country')] = 'France'
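Both kinds of update can be tried on a tiny DataFrame. This is a minimal sketch with made-up values. It updates the first 2 rows rather than 20 only because the sample is small.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "age": [15, 18, 42],
    "adult": [0, 0, 0],
    "country": ["UK", "Japan", "US"],
})

# Condition-based update: the boolean mask selects the rows, 'adult' selects the column
df.loc[df["age"] >= 18, "adult"] = 1

# Position-based update: iloc takes integer positions for both rows and columns
country_pos = df.columns.get_loc("country")
df.iloc[:2, country_pos] = "France"

print(df)
```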
After manipulating the data, we can save the result as a CSV file:
df.to_csv("updated.csv")
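By default, `to_csv` also writes the row index as the first column. If the index is just the automatic row numbers, passing `index=False` leaves it out. The sketch below writes to an in-memory buffer instead of `updated.csv` so it can run anywhere; the data is made up.

```python
import io

import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"name": ["John", "Emma"], "age": [28, 34]})

# Write CSV text to a buffer; index=False omits the row numbers
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back to confirm the round trip
reloaded = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
```

Keep the default (index included) when the index carries meaning, for example when names are used as row labels.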