Pandas Tutorial: Data Cleaning

Cleaning Data in Python

Real-world data is often messy or incomplete and needs to be cleaned before analysis. Here are five common techniques for cleaning and preparing data in Python, using built-in types and NumPy:

1. Filtering Out Missing Data

Missing data usually appears as None in plain Python lists and as NaN (np.nan) in NumPy arrays. In both cases you can drop the missing entries with a filter.

# Filter missing values from a list
clean_list = [x for x in data if x is not None] 

# Filter NaN values from a NumPy array
import numpy as np
clean_array = data[~np.isnan(data)]  # np.isfinite(data) would also drop +/-inf
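
The choice of mask matters when the data can contain infinities as well as NaN. The following is a minimal sketch on a made-up array (the values are illustrative, not from a real dataset) comparing the two masks:

```python
import numpy as np

# Hypothetical sample: one NaN and one infinity among ordinary readings
data = np.array([1.0, np.nan, 2.5, np.inf, 4.0])

# ~np.isnan keeps everything except NaN, so the infinity survives
without_nan = data[~np.isnan(data)]

# np.isfinite is stricter: it drops NaN and +/-inf alike
finite_only = data[np.isfinite(data)]

print(without_nan)
print(finite_only)
```

Which mask is appropriate depends on whether an infinite value counts as valid data in your analysis.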

2. Filling In Missing Data

For simple cases, you can set a fill value manually. Note that fillna() is a pandas method, not a NumPy one; for NumPy arrays, combine np.isnan() with np.where() to fill NaN values.

# Fill missing values with a set value
clean_list = [x if x is not None else 0 for x in data]

# Fill NaN values with the mean of their column
clean_array = np.where(np.isnan(data), np.nanmean(data, axis=0), data)
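
To make the column-mean fill concrete, here is a small sketch on a hypothetical 3x2 array. It assumes every column has at least one non-NaN value; np.nanmean returns NaN (with a warning) for a column that is entirely NaN.

```python
import numpy as np

# Hypothetical 2-D data with one missing value in each column
data = np.array([[1.0, np.nan],
                 [3.0, 4.0],
                 [np.nan, 8.0]])

# Column means computed while ignoring NaN: column 0 -> 2.0, column 1 -> 6.0
col_means = np.nanmean(data, axis=0)

# Where a value is NaN, take the mean of its column (broadcast across rows);
# otherwise keep the original value
filled = np.where(np.isnan(data), col_means, data)

print(filled)
```

By contrast, np.nansum(data, axis=0) returns one sum per column rather than a filled copy of the array, so it is not a fill operation.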

3. Removing Duplicates

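Converting data to a set removes duplicates but does not preserve the original order. To keep the first occurrence of each item in its original position, use dict.fromkeys(), which relies on dictionaries preserving insertion order (guaranteed since Python 3.7).

# Remove duplicates from list (order is not preserved)
clean_list = list(set(data))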

# Remove duplicates from list while preserving order
clean_list = list(dict.fromkeys(data)) 
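
Both set() and dict.fromkeys() require hashable items, so they raise a TypeError on a list of dictionaries or lists. One possible workaround, sketched below, is to deduplicate on a key derived from each item. The helper name dedupe_by and the sample records are hypothetical, introduced only for illustration.

```python
def dedupe_by(items, key):
    """Return items without duplicates, keeping the first item seen for each key."""
    seen = set()
    unique = []
    for item in items:
        k = key(item)          # the key must be hashable, the item need not be
        if k not in seen:
            seen.add(k)
            unique.append(item)
    return unique


# Hypothetical records: two rows share the same id
records = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 1, "name": "Ada L."},
]

clean_records = dedupe_by(records, key=lambda r: r["id"])
print(clean_records)
```

This keeps the first record for each id and preserves the input order; keeping the last record instead would require a different strategy.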

4. Transforming Data Using a Function or Mapping

Use Python's map() to apply a function or mapping to transform data.

# Apply function to all values 
clean_data = list(map(function, data))

# Apply mapping to convert values (values missing from the mapping become None)
mapping = {1: 'A', 2: 'B'}
clean_data = list(map(mapping.get, data))
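
One detail worth seeing in action: dict.get returns None for any value that has no entry in the mapping. The sketch below uses made-up input and an arbitrary fallback label, "Other", to show how a default makes unmapped values explicit.

```python
mapping = {1: "A", 2: "B"}
data = [1, 2, 3, 1]  # hypothetical input; 3 has no entry in the mapping

# mapping.get returns None for values that are missing from the mapping
with_none = list(map(mapping.get, data))

# Passing a default to get() labels unmapped values instead of producing None
with_default = [mapping.get(x, "Other") for x in data]

print(with_none)     # ['A', 'B', None, 'A']
print(with_default)  # ['A', 'B', 'Other', 'A']
```

If an unmapped value should be treated as an error rather than relabeled, indexing with mapping[x] raises a KeyError instead.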

5. Replacing Values

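Use the string method str.replace() to substitute text inside each string. NumPy arrays have no replace() method; use np.where() to replace values that match a condition.

# Replace the substring '-1' in each string (a value like '-10' is affected too)
clean_data = [x.replace('-1', 'Unknown') for x in data]

# Replace values in a NumPy array (cast to strings so the result has a single dtype)
clean_array = np.where(data == -1, 'Unknown', data.astype(str))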
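
Because str.replace() matches substrings, it can alter values that merely contain the sentinel. The following sketch, on hypothetical sample strings, contrasts it with a whole-value replacement driven by a lookup table (the name replacements is illustrative).

```python
data = ["-1", "12", "-10", "7"]  # hypothetical values; "-1" marks unknown

# str.replace works on substrings, so "-10" is altered as well
substring_replaced = [x.replace("-1", "Unknown") for x in data]

# Looking up the whole value changes only exact matches of the sentinel
replacements = {"-1": "Unknown"}
exact_replaced = [replacements.get(x, x) for x in data]

print(substring_replaced)  # ['Unknown', '12', 'Unknown0', '7']
print(exact_replaced)      # ['Unknown', '12', '-10', '7']
```

The lookup-table form also extends naturally to several sentinel values by adding more entries to the dictionary.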

These are just a few common techniques for getting messy data ready for analysis in Python. The key is finding the right tools for your specific data cleaning tasks.