Data cleaning isn’t the most glamorous part of data science or machine learning, but it is one of the most important. There are no tricks or shortcuts: to build the best possible model, one needs clean, high-quality data. Machine learning engineers and data scientists spend a lot of time on data cleaning because they share a common belief that an algorithm’s results depend largely on the data fed into it.
Below are some tips when it comes to data cleaning:
Better Data Quality
Developers commonly chase a perfect, fancy-looking algorithm while ignoring one of the major factors in its success: data quality. Data cleaning is more important than it sounds. No matter how good or sophisticated the algorithm is, untidy data will give abysmal results. Poor-quality data also leads to biased outcomes, which can hurt a business if it fails to identify the flaws in its data.
Filtering Unnecessary Outliers
Outliers can cause problems with certain models, such as linear regression, which are sensitive to extreme values. However, removing an outlier just because it is large, rather than because it is uninformative, might make the model miss out on information. Have a legitimate reason before removing an outlier.
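One way to act on this advice is to flag candidate outliers for review rather than delete them automatically. The sketch below uses the interquartile-range rule, which is a common convention and not something this article prescribes; the function name `flag_outliers_iqr`, the multiplier `k=1.5`, and the sample prices are all assumptions made for illustration.

```python
import statistics

def flag_outliers_iqr(values, k=1.5):
    """Mark values that fall outside [Q1 - k*IQR, Q3 + k*IQR].

    Nothing is removed: the caller gets one boolean per value and
    decides, with a legitimate reason, what to do with each flag.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

# Hypothetical sample: six ordinary prices and one extreme value.
prices = [12, 14, 13, 15, 14, 13, 250]
flags = flag_outliers_iqr(prices)
suspects = [p for p, is_out in zip(prices, flags) if is_out]
```

A flagged value is only a candidate. Whether 250 is a data-entry error or a genuine, informative observation still has to be decided by looking at how the data was collected.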
Removing Duplicate Observations
This is one of the basic steps of data cleaning in data science. Duplicate observations frequently arise during data collection, for example when combining datasets from multiple places, receiving data from other parties, or scraping data. Irrelevant observations are a related problem: these are records that don’t actually fit the specific problem under consideration. Spotting and removing both kinds will improve one’s model. It is recommended to check for them before feature engineering begins.
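As a minimal sketch of removing exact duplicates, assuming each observation is a dictionary with hashable values, the helper below keeps the first occurrence of every row. The name `drop_duplicates` and the sample records are hypothetical.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        # Sort the items so that key order does not affect the comparison.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical records, e.g. after combining two sources.
records = [
    {"id": 1, "city": "Paris"},
    {"id": 2, "city": "Lima"},
    {"id": 1, "city": "Paris"},  # exact duplicate of the first row
]
deduped = drop_duplicates(records)
```

This only catches exact matches. Near-duplicates, such as the same record with different capitalisation, need the values to be standardised first.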
Making sure values are stored in the correct data types can save a lot of time and help in creating a better model. Every value should be stored in a data type that matches its meaning.
There are some types of errors to keep in mind:
Padding strings: Strings can be padded with spaces or other characters to a certain width. For example, some numerical codes are written with leading zeros so that they always have the same number of digits.
401 => 000401 (6 digits)
Removing white space: This simply means removing extra white space at the beginning or end of a string.
“ hello world ” => “hello world”
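Both fixes map onto built-in Python string methods. The sketch below reproduces the two examples above; the helper name `clean_code` and its default width of 6 are assumptions chosen to match the 6-digit example.

```python
def clean_code(raw, width=6):
    """Strip surrounding white space, then left-pad with zeros to a fixed width."""
    return str(raw).strip().zfill(width)

# Padding a numerical code so that it always has six digits.
padded = clean_code(401)

# Removing white space at the beginning and end of a string.
trimmed = " hello world ".strip()

# Once trimmed, a numeric string can be stored in a numeric type.
as_number = int(" 42 ".strip())
```

Codes with leading zeros are usually best kept as strings, since converting them to integers would drop the padding again.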
Fixing Structural Errors
Structural errors are those that arise during processes such as measurement or data transfer. For example, one can check for typos and inconsistent capitalisation. Another fix is to merge mislabelled classes that should be the same into one.
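A minimal sketch of these fixes for a categorical column follows. It assumes the known typos have already been collected by hand into a lookup table; the `ALIASES` mapping, the function name `normalise_label`, and the sample labels are all hypothetical.

```python
# Hypothetical lookup from known typos or variants to the canonical label.
ALIASES = {"compostion": "composition"}

def normalise_label(label, aliases):
    """Fix capitalisation and spacing, then merge known mislabelled variants."""
    cleaned = " ".join(label.strip().lower().split())
    return aliases.get(cleaned, cleaned)

raw_labels = ["Composition", "composition ", "compostion", "asphalt", "Asphalt"]
labels = [normalise_label(label, ALIASES) for label in raw_labels]
classes = sorted(set(labels))
```

Here five raw spellings collapse into two classes. Counting the distinct values before and after is a quick way to spot variants that the lookup table still misses.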
Standardising the Values
Standardising strings means, for example, making sure all values are in either lower case or upper case. In the same way, numerical values can be standardised to a single unit of measurement. For example, lengths may be recorded in both meters and feet. An algorithm would treat a difference of one meter as the same as a difference of one foot, so one has to convert all lengths to a single unit.
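The unit conversion can be sketched as follows, assuming each length arrives as a number together with a unit string. The function name `to_meters` and the accepted unit spellings are assumptions for illustration; the factor 0.3048 is the exact definition of the international foot.

```python
FEET_TO_METERS = 0.3048  # exact, by definition of the international foot

def to_meters(value, unit):
    """Convert a length to meters; the unit string is standardised first."""
    unit = unit.strip().lower()
    if unit in ("m", "meter", "meters"):
        return float(value)
    if unit in ("ft", "foot", "feet"):
        return value * FEET_TO_METERS
    raise ValueError(f"unknown unit: {unit!r}")

# Hypothetical mixed-unit measurements.
measurements = [(2, "M"), (10, "ft"), (1.5, " meters ")]
lengths_m = [to_meters(value, unit) for value, unit in measurements]
```

Raising an error on an unknown unit is a deliberate choice in this sketch: silently passing the value through would reintroduce the mixed-unit problem.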
Most algorithms do not accept missing values, so handling missing data is a crucial part of making one’s data cleaner.
The missing data can be handled in two ways:
Dropping observations with missing values: This is not optimal, because dropping an observation means dropping information.
Imputing missing values based on other observations: This is not optimal either. The value was originally missing and someone filled it in, which leads to a loss of information no matter what imputation method one uses.
The fact that a value is missing can be informative in itself, so the algorithm should know when a value was missing. Imputing is like trying to fit a missing part of the puzzle back in after you have taken it out. Imputed values might not add any real information and merely reinforce the patterns already provided by other features.
A possible solution? Just tell the algorithm that something is missing.
Handling missing categorical data: For a categorical feature, one can label missing values as ‘Missing’. This is like adding a new class for the feature.
Handling missing numerical data: For numerical data, one should flag and fill the values. First, flag the observation with an indicator variable for missingness, then fill the original missing value with 0 to meet the technical requirement of no missing values.
Flagging and filling essentially allows the algorithm to estimate the optimal constant for missingness itself, instead of relying on an imputed value.
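The two approaches can be sketched as follows, assuming each observation is a dictionary and a missing value is represented by `None`. The function names, the `_missing` suffix for the indicator column, and the sample rows are hypothetical.

```python
def label_missing_category(rows, column, label="Missing"):
    """Replace missing categorical values with an explicit 'Missing' class."""
    return [
        {**row, column: label if row.get(column) is None else row[column]}
        for row in rows
    ]

def flag_and_fill(rows, column, fill_value=0):
    """Add a 0/1 missingness indicator, then fill the missing values."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        is_missing = new_row.get(column) is None
        new_row[f"{column}_missing"] = int(is_missing)
        if is_missing:
            new_row[column] = fill_value
        out.append(new_row)
    return out

# Hypothetical observations with one missing value in each column.
rows = [
    {"roof": "asphalt", "area": 120.0},
    {"roof": None, "area": 95.5},
    {"roof": "slate", "area": None},
]
cleaned = flag_and_fill(label_missing_category(rows, "roof"), "area")
```

After this step no value is missing, yet the `area_missing` column still records which observations originally lacked a value, so that information is kept for the model.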