Data Preprocessing

Data Preprocessing describes the preparation of data for analysis. This preparation consists of four core activities:

• Data Cleaning – Correct and complete the data, e.g. fill in missing values or remove incorrect entries
• Data Transformation – Data modification / data adaptation, e.g. normalizing data or aggregating data
• Data Integration – Integration of different data sets
• Data Reduction – Reduction of the data volume, for example by reducing the dimensions or compressing data

These operations take place after data extraction, in which the data is retrieved from a source system.

How does Data Preprocessing work and why is it so important?

If nonsensical or missing data is involved, the overall picture of the data is distorted. This leads to implausible or even wrong results in further analysis. Therefore, incomplete, incorrect and irrelevant data is identified during data cleaning. In the next step, this data is replaced, modified or deleted from the data set.
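A minimal sketch of this cleaning step, using only the Python standard library. The sample records, the `age` field and the validity rule (a number between 0 and 120) are assumptions made for illustration. Replacing bad values with the median is just one possible strategy.

```python
from statistics import median

# Hypothetical sample records: None marks a missing value, -1 an invalid one.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": -1},
    {"id": 4, "age": 52},
    {"id": 5, "age": 41},
]


def is_valid_age(value):
    """Assumed rule: an age is valid if it is a number between 0 and 120."""
    return isinstance(value, (int, float)) and 0 <= value <= 120


def clean_ages(rows):
    """Replace missing or invalid ages with the median of the valid ones."""
    valid = [row["age"] for row in rows if is_valid_age(row["age"])]
    fill_value = median(valid)  # raises StatisticsError if no valid value exists
    return [
        {**row, "age": row["age"] if is_valid_age(row["age"]) else fill_value}
        for row in rows
    ]


cleaned = clean_ages(records)
```

Deleting the affected rows instead of filling them would be an equally valid choice. Which one fits depends on the analysis and should be agreed with the data owners.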

During data transformation, normalizations and aggregations are carried out to make the data sets more meaningful for the given analysis goals. Aggregations, for example, can make subsequent visualizations of the data more useful.
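Normalization and aggregation could be sketched as follows. The `sales` rows and the function names are hypothetical. Min-max scaling and summing per group are only two common choices among many.

```python
from collections import defaultdict

# Hypothetical sales rows as (region, amount) pairs.
sales = [("north", 120.0), ("south", 80.0), ("north", 200.0), ("south", 40.0)]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_sum(rows):
    """Sum the amounts per group key."""
    totals = defaultdict(float)
    for key, amount in rows:
        totals[key] += amount
    return dict(totals)


normalized = min_max_normalize([amount for _, amount in sales])
totals = aggregate_sum(sales)
```

The aggregated totals per region are the kind of result that is typically easier to chart than the individual rows.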

In addition, this step serves to harmonize data from different sources and to unify units and data schemata. Data transformation is also an important part of the ETL process.
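As a rough illustration of harmonizing units and schemata, the sketch below maps records from two hypothetical sources onto one target schema. The field names (`cust_id`, `customer`, `height_cm`, `height_m`) and the target schema are assumptions, not part of any real system.

```python
# Hypothetical records from two sources with different field names and units.
source_a = [{"cust_id": 1, "height_cm": 180.0}]
source_b = [{"customer": 2, "height_m": 1.65}]


def from_source_a(row):
    """Rename the key field and convert centimetres to metres."""
    return {"customer_id": row["cust_id"], "height_m": row["height_cm"] / 100}


def from_source_b(row):
    """Rename the key field; the unit already matches the target schema."""
    return {"customer_id": row["customer"], "height_m": row["height_m"]}


# Both sources now share one schema and one unit.
unified = [from_source_a(r) for r in source_a] + [from_source_b(r) for r in source_b]
```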

After the data transformation, different data sets can be linked together to get a uniform picture of an analysis question. The prerequisite for this is that the transformation has put the data on a uniform basis. The data sets can then be linked via a common attribute, e.g. an ID.
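Linking two data sets via a common attribute can be sketched as an inner join. The `customers` and `orders` data and the `inner_join` helper are hypothetical. The sketch assumes the key is unique in the right-hand data set.

```python
# Hypothetical data sets that share the attribute "id".
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Ben"},
    {"id": 3, "name": "Cy"},
]
orders = [
    {"id": 1, "total": 30.0},
    {"id": 3, "total": 12.5},
    {"id": 4, "total": 99.0},
]


def inner_join(left, right, key):
    """Combine rows whose key value appears in both data sets.

    Assumes the key is unique in `right`; later duplicates would overwrite
    earlier ones in the lookup index.
    """
    index = {row[key]: row for row in right}
    return [
        {**left_row, **index[left_row[key]]}
        for left_row in left
        if left_row[key] in index
    ]


joined = inner_join(customers, orders, "id")
```

Rows without a partner (customer 2 and order 4 here) are dropped by an inner join. Whether they should instead be kept with empty fields is a decision for the analysis at hand.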

Finally, unnecessary and unimportant data is removed from the data set. This makes the calculations more efficient and removes disruptive factors that could distort the result.
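One simple form of data reduction is dropping attributes that carry no information. The sketch below removes columns whose value is identical in every row. The sample `rows` and the helper name are assumptions, and real projects often use more elaborate techniques for reducing dimensions.

```python
# Hypothetical rows; "source" has the same value everywhere.
rows = [
    {"id": 1, "country": "DE", "score": 0.7, "source": "crm"},
    {"id": 2, "country": "FR", "score": 0.4, "source": "crm"},
    {"id": 3, "country": "DE", "score": 0.9, "source": "crm"},
]


def drop_constant_columns(data):
    """Keep only the columns that take more than one distinct value."""
    if not data:
        return []
    keep = [
        column
        for column in data[0]
        if len({row[column] for row in data}) > 1
    ]
    return [{column: row[column] for column in keep} for row in data]


reduced = drop_constant_columns(rows)
```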

Steps that remove or modify data must be coordinated in advance with the relevant departments, because arbitrary modification or removal can distort the results just as much as ignoring these disruptive factors.


Related terms: ETL, Process Mining, Data Transformation, Data Extraction