Suppose we have data, but it is not in a proper format. Would we be able to process it further? Usually not. Preparing data in a format to which data mining algorithms can be applied is what we call pre-processing of data. Data mining is a procedure that uses a set of algorithms to mine the data and extract patterns from it. So, basically, in pre-processing we prepare the raw data in a format on which statistical methods can be applied.
Why do we need data pre-processing?
As we know, there is a lot of data in the world, for example data about people going shopping, people walking by the side of the road, and much more. We need to process these data to make the best use of them in data mining and machine learning. But these data are not necessarily complete. They may be:
1. Incomplete or inaccurate: This implies that the data we collected from the world has missing values. The reason for missing values can be a mistake made while performing data entry, or the data may not have been gathered continuously.
2. Erroneous: Erroneous (invalid) data is also called noisy data. It contains errors, e.g., age = “-20”.
3. Inconsistent: This occurs due to duplication within the data or mistakes in names made during data entry, e.g., age = “20” while birthday = “15/5/1998”.
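As a minimal sketch of how these three problems surface in practice, consider the following check over a few made-up records (the field names and values are invented for illustration):

```python
# Hypothetical raw records illustrating the three problems above
records = [
    {"name": "Alice", "age": 25, "birthday": "1/1/1999"},
    {"name": "Bob", "age": -20, "birthday": "15/5/1998"},   # erroneous (noisy) age
    {"name": "Bob", "age": None, "birthday": "15/5/1998"},  # missing age, duplicate name
]

# Incomplete: count records with a missing age
missing = sum(1 for r in records if r["age"] is None)

# Erroneous: count records with an impossible (negative) age
invalid = sum(1 for r in records if r["age"] is not None and r["age"] < 0)

# Inconsistent: count repeated names, a possible sign of duplication
seen, duplicates = set(), 0
for r in records:
    if r["name"] in seen:
        duplicates += 1
    seen.add(r["name"])

print(missing, invalid, duplicates)  # -> 1 1 1
```

Each count flags one of the problem types listed above, which is exactly the kind of audit that precedes cleaning.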
In order to utilize data, we first need to bring it into a format on which we can perform operations.
Methods involved in data preprocessing are as follows:
As said above, real-world data is often incomplete and inconsistent, and it often contains errors. Data pre-processing resolves such issues with the following methods:
1. Data Cleaning: Sometimes the data we receive is incomplete, or attribute values are missing. The data can also be noisy, meaning it contains errors. This type of erroneous data is not fit for the mining process. Data cleaning is the method of filling in the absent data and resolving inconsistencies in the data through the following steps:
· Detecting syntax errors: In this step the errors in the syntax are detected by the parser.
· Correcting components: In this step the individual data components are corrected.
· Filling in the absent values: The majority nominal value is used to fill the absent values.
· Identifying the errors and outliers: This is done by binning (sorting the values and partitioning them into bins) and clustering (grouping values into clusters and then removing errors and outliers).
· Using expert understanding to correct the incoherent data.
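Two of these steps can be sketched in a few lines: filling absent values with the majority (mode) value, and spotting an outlier by equal-width binning. The data below is made up for illustration:

```python
from collections import Counter

# Filling in absent values: a nominal attribute with None for missing entries
colors = ["red", "blue", "red", None, "red", None]
majority = Counter(c for c in colors if c is not None).most_common(1)[0][0]
filled = [c if c is not None else majority for c in colors]
print(filled)  # -> ['red', 'blue', 'red', 'red', 'red', 'red']

# Identifying outliers by binning: partition ages into 3 equal-width bins
ages = [21, 22, 23, 25, 24, 90]   # 90 looks like an outlier
low, high, bins = min(ages), max(ages), 3
width = (high - low) / bins
bin_index = [min(int((a - low) / width), bins - 1) for a in ages]
print(bin_index)  # -> [0, 0, 0, 0, 0, 2]
```

The value 90 falls alone into the last bin, which is what flags it for closer inspection.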
2. Data Integration: Data with diverse representations are put together, and differences within the data are resolved. Suppose a name is stored as “Bobby” in one database, as “Bobbie” in another, and as “B” in some other database. This results in redundancy and can create confusion.
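One way to resolve such a naming difference during integration is to map every variant to a single canonical name before merging. The two databases and the alias map below are entirely invented for illustration:

```python
# Two hypothetical databases storing the same person under different names
db1 = {"Bobby": {"age": 20}}
db2 = {"Bobbie": {"city": "Pune"}}

# A made-up alias map that resolves all variants to one canonical name
canonical = {"Bobby": "Bobby", "Bobbie": "Bobby", "B": "Bobby"}

merged = {}
for db in (db1, db2):
    for name, attrs in db.items():
        # Look up the canonical name, then merge this record's attributes into it
        merged.setdefault(canonical.get(name, name), {}).update(attrs)

print(merged)  # -> {'Bobby': {'age': 20, 'city': 'Pune'}}
```

After integration there is one record per person instead of a redundant copy per spelling.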
3. Data Transformation: In data transformation we convert data from one format into another. Data transformation deals with normalization (converting numerical data into a uniform range), aggregation (merging categories to form a group), and generalization.
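A short sketch of two of these transformations, min-max normalization and aggregation of ages into coarser groups (the values and group boundaries are made up for illustration):

```python
# Normalization (min-max): rescale numeric values into the uniform range [0, 1]
values = [10, 20, 30, 40, 50]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
print(normalized)  # -> [0.0, 0.25, 0.5, 0.75, 1.0]

# Aggregation: merge fine-grained ages into coarser categories
ages = [5, 17, 25, 34, 70]
groups = ["child" if a < 18 else "adult" if a < 60 else "senior" for a in ages]
print(groups)  # -> ['child', 'child', 'adult', 'adult', 'senior']
```

Normalization keeps attributes with different scales comparable, while aggregation trades detail for a simpler, more general view of the data.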
4. Data Reduction: This step aims to present a reduced representation of the data in a data warehouse. This is done by removing the extraneous attributes, clustering values and by reducing the number of tuples.
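The two reduction tactics named above, removing an extraneous attribute and cutting the number of tuples, can be sketched as follows (the attribute names and the choice of random sampling are assumptions for illustration):

```python
import random

# Made-up tuples; "session_id" plays the role of an extraneous attribute
rows = [{"age": 20 + i % 50, "income": 1000 * i, "session_id": i} for i in range(100)]

# Attribute reduction: drop the extraneous attribute from every tuple
reduced = [{k: v for k, v in r.items() if k != "session_id"} for r in rows]

# Tuple reduction: keep only a random sample of the records
random.seed(0)  # fixed seed so the sketch is reproducible
sample = random.sample(reduced, 10)

print(len(sample), sorted(sample[0]))  # 10 tuples, each with only the kept attributes
```

The reduced representation is far smaller but still supports the same mining operations on the attributes that matter.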
Data pre-processing tools:
· RapidMiner helps in executing the data pre-processing process.
· Weka contains a set of data pre-processing tools that can be used before applying machine learning algorithms.
· Python contains libraries that can help in pre-processing of data.