Pandas Tutorial: Data Cleaning
Cleaning Data in Python
Real-world data is often messy or incomplete and needs to be cleaned before analysis. Here are five common techniques in Python for cleaning and preparing data for analysis:
1. Filtering Out Missing Data
You can filter out missing data by testing for None values in Python lists or np.nan values in NumPy arrays.
# Filter missing values from a list
clean_list = [x for x in data if x is not None]
# Filter NaN values from a NumPy array (np.isfinite would also drop infinities)
import numpy as np
clean_array = data[~np.isnan(data)]
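As a small worked example of the filtering described above, the sketch below uses made-up sample values (`readings` and `values` are hypothetical names). It also shows the difference between masking with `np.isnan` and masking with `np.isfinite`, which matters when the array may contain infinities.

```python
import numpy as np

# Hypothetical sample data; None marks a missing entry in the list
readings = [3.2, None, 4.1, None, 5.0]
clean_list = [x for x in readings if x is not None]

# Hypothetical float array containing NaN and an infinity
values = np.array([1.0, np.nan, 2.5, np.inf, 4.0])

# ~np.isnan keeps everything except NaN (infinities survive)
no_nan = values[~np.isnan(values)]

# np.isfinite additionally drops +inf and -inf
finite_only = values[np.isfinite(values)]

print(clean_list)
print(no_nan)
print(finite_only)
```

Which mask is appropriate depends on whether infinite values count as valid data in your dataset.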
2. Filling In Missing Data
For simple cases, you can set a fill value manually. For NumPy arrays, functions such as np.nan_to_num() and np.where() can fill NaN entries; fillna() is the pandas equivalent for Series and DataFrames.
# Fill missing values with a set value
clean_list = [x if x is not None else 0 for x in data]
# Fill NaN values with the mean of each column (NaN entries are ignored when averaging)
clean_array = np.where(np.isnan(data), np.nanmean(data, axis=0), data)
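For a 2-D NumPy array, one way to fill NaN entries with either a constant or the column mean is sketched below. The array `table` is made-up sample data; `np.nan_to_num`, `np.nanmean`, and `np.where` are standard NumPy functions.

```python
import numpy as np

# Hypothetical 2-D data: rows are observations, columns are features
table = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, np.nan]])

# Option 1: replace every NaN with a fixed value
zero_filled = np.nan_to_num(table, nan=0.0)

# Option 2: replace each NaN with the mean of its column
col_means = np.nanmean(table, axis=0)  # column means, ignoring NaN
mean_filled = np.where(np.isnan(table), col_means, table)

print(zero_filled)
print(mean_filled)
```

Here `col_means` has one entry per column, and broadcasting lines it up with the matching column of `table`.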
3. Removing Duplicates
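Convert the data to a set to remove duplicates. A set does not preserve order, so when the original order matters, use dict.fromkeys() instead: dictionaries keep insertion order (Python 3.7+). Both approaches require the items to be hashable.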
# Remove duplicates from list
clean_list = list(set(data))
# Remove duplicates from list while preserving order
clean_list = list(dict.fromkeys(data))
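The two approaches above can be compared on a small made-up list. The sketch also includes `dedupe_by_key`, a hypothetical helper (not part of the standard library) for cases where two values should count as duplicates after normalization, such as differing only in letter case.

```python
# Hypothetical sample list with repeated entries
cities = ["Paris", "Lima", "Paris", "Oslo", "Lima"]

# set() removes duplicates but does not guarantee the original order
unique_unordered = set(cities)

# dict.fromkeys() keeps the first occurrence of each item, in order
unique_ordered = list(dict.fromkeys(cities))


def dedupe_by_key(items, key):
    """Keep the first item for each distinct key(item), preserving order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result


# Treat "Paris" and "paris" as the same value
mixed_case = ["Paris", "paris", "Oslo", "OSLO"]
case_insensitive = dedupe_by_key(mixed_case, key=str.lower)

print(unique_ordered)
print(case_insensitive)
```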
4. Transforming Data Using a Function or Mapping
Use Python's map() to apply a function or mapping to transform data.
# Apply function to all values
clean_data = list(map(function, data))
# Apply mapping to convert values (values missing from the mapping become None)
mapping = {1: 'A', 2: 'B'}
clean_data = list(map(mapping.get, data))
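A short illustration of both uses of map(), with made-up inputs (`raw_names`, `codes`, and `grade_map` are hypothetical). It also shows what happens when a value has no entry in the mapping, and one way to supply a default instead.

```python
# Hypothetical messy strings: stray whitespace and inconsistent case
raw_names = ["  alice ", "BOB", "Carol  "]

# Apply a function to every value
tidy_names = list(map(lambda s: s.strip().title(), raw_names))

# Hypothetical numeric codes and a mapping that does not cover code 3
codes = [1, 2, 3, 1]
grade_map = {1: "A", 2: "B"}

# dict.get returns None for keys that are not in the mapping
grades = list(map(grade_map.get, codes))

# Supplying a default avoids introducing None values
grades_with_default = [grade_map.get(c, "Unknown") for c in codes]

print(tidy_names)
print(grades)
print(grades_with_default)
```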
5. Replacing Values
Use Python's str.replace() method on strings. NumPy arrays have no replace() method; use np.where() to replace values conditionally instead.
# Replace a value in a list of strings
clean_data = [x.replace('-1', 'Unknown') for x in data]
# Replace values in a NumPy array (converted to strings so the result has a single dtype)
clean_array = np.where(data == -1, 'Unknown', data.astype(str))
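Two details are easy to miss when replacing values, and the sketch below illustrates them with made-up data (`answers` and `scores` are hypothetical). First, str.replace() matches substrings, so it can alter values you did not intend to touch. Second, for numeric arrays it is often more convenient to replace a sentinel such as -1 with np.nan, which keeps the array numeric.

```python
import numpy as np

# Hypothetical survey answers stored as strings; "-1" means "no answer"
answers = ["-1", "42", "-10"]

# str.replace works on substrings, so "-10" is changed as well
substring_replaced = [s.replace("-1", "Unknown") for s in answers]

# An exact comparison only touches values equal to "-1"
exact_replaced = ["Unknown" if s == "-1" else s for s in answers]

# Hypothetical numeric scores with -1 used as a missing-value sentinel
scores = np.array([7.0, -1.0, 9.5, -1.0])

# Replacing the sentinel with NaN keeps a float dtype
scores_nan = np.where(scores == -1, np.nan, scores)

print(substring_replaced)
print(exact_replaced)
print(scores_nan)
```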
These are just a few common techniques for getting messy data ready for analysis in Python. The key is finding the right tools for your specific data cleaning tasks.