Data transformation is the conversion and alignment of data sets, either to each other or to a given schema. It takes place after data extraction and makes further processing possible, for example integrating data sets or loading them into another IT system. In Process Mining, data transformation is a component of Data Preprocessing.
Why is Data Transformation important?
For data to be processed further, for example in analyses, it must be uniform, i.e. standardized. Differences can result from different source systems, different table schemas, or different data types. Transformation is needed not only to perform analyses, but also so that relations between the data are preserved and carried over during data integration. As a rule, the data is adjusted either to each other or to a specific target format.
If the data sets are aligned to each other, one of the formats already present in the data is chosen as the target format. Only the data that does not conform to this schema, for example because it comes from another data source, has to be adapted to it.
If, however, a specific schema is required, for example due to limitations of a database or analysis software, all data must be transformed into the specified target format. In both cases, the data ends up in a single target format.
How is the data transformed?
To transform data, it usually has to be extracted first. Data stored in database systems is an exception: it can be transformed directly in the database using suitable instructions, for example SQL commands.
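As an illustration of transforming data directly in the database, here is a minimal sketch using Python's built-in sqlite3 module. The table `orders`, the column `order_date`, and the two date formats are assumptions made up for this example; they do not come from a particular system.

```python
import sqlite3

def normalize_dates_in_place(conn):
    """Rewrite 'DD.MM.YYYY' strings to 'YYYY-MM-DD' without extracting the data."""
    conn.execute(
        """
        UPDATE orders
        SET order_date = substr(order_date, 7, 4) || '-' ||
                         substr(order_date, 4, 2) || '-' ||
                         substr(order_date, 1, 2)
        WHERE order_date LIKE '__.__.____'  -- only rows still in the old format
        """
    )
    conn.commit()

# Hypothetical table with one row in each format.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "31.01.2024"), (2, "2024-02-15")],
)

normalize_dates_in_place(conn)
rows = conn.execute("SELECT id, order_date FROM orders ORDER BY id").fetchall()
print(rows)
```

The WHERE clause restricts the update to rows that do not yet match the target format, so rows already in the target format are left unchanged.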
After the data has been extracted, a target format or target schema must be defined into which the data is to be transformed. Converting data requires knowing the specifications of both the source format and the target format. Using fixed definitions and assignments, the values in the source file are converted and mapped so that they correspond to the target format. The transformed sequence of values or characters is then saved as a new file, the converted output file.
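The idea of fixed definitions and assignments can be sketched as two lookup tables: one that assigns source fields to target fields, and one that assigns coded source values to target values. All field names, codes, and the semicolon-separated source layout below are hypothetical.

```python
import csv
import io

# Hypothetical assignments: source field -> target field.
FIELD_MAP = {"cust_no": "customer_id", "ctry": "country", "stat": "status"}
# Hypothetical assignments: coded source value -> target value, per target field.
VALUE_MAP = {"status": {"A": "active", "I": "inactive"}}

def convert_record(record):
    """Convert one source record into the target format."""
    target = {}
    for source_field, target_field in FIELD_MAP.items():
        value = record[source_field]
        # Translate the value if an assignment exists, otherwise keep it as is.
        target[target_field] = VALUE_MAP.get(target_field, {}).get(value, value)
    return target

def convert_csv(source_text):
    """Read semicolon-separated source text and return comma-separated target text."""
    reader = csv.DictReader(io.StringIO(source_text), delimiter=";")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(FIELD_MAP.values()), lineterminator="\n")
    writer.writeheader()
    for record in reader:
        writer.writerow(convert_record(record))
    return out.getvalue()

source = "cust_no;ctry;stat\n1001;DE;A\n1002;FR;I\n"
converted = convert_csv(source)
print(converted)
```

The sketch keeps the result in memory for brevity. To produce the converted output file, the same writer could be pointed at a file opened for writing instead of the in-memory buffer.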
In addition, you must decide how to deal with empty values. Empty values occur, for instance, if an object does not have a certain attribute, i.e. there is no entry for this attribute. You have to decide whether the value for such attributes should be left empty or set to “NULL”. How such values are handled depends on the transformation target or the target system. In databases, for example, “NULL” values are usually preferable, since empty values can lead to errors during the transformation or in later evaluations.
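One possible way to handle empty values for a database target is sketched below: empty strings are replaced by Python's `None`, which the sqlite3 module stores as NULL. The `customers` table and the sample records are assumptions for illustration, and treating whitespace-only strings as empty is a design choice of this sketch, not a general rule.

```python
import sqlite3

def empty_to_null(record):
    """Replace empty or whitespace-only strings with None (stored as NULL)."""
    return {
        key: None if isinstance(value, str) and value.strip() == "" else value
        for key, value in record.items()
    }

# Hypothetical extracted records; the second one has no email entry.
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (:id, :email)",
    [empty_to_null(r) for r in records],
)

# NULL values can be found reliably with IS NULL; empty strings cannot.
missing = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE email IS NULL"
).fetchone()[0]
print(missing)
```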
The steps of a data transformation are thus the following:
1. Data extraction
2. Evaluation of the required format
3. Definition of the target format
4. Conversion of the extracted data
5. Saving the converted data into a new file
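The five steps above can be sketched end to end as a small pipeline. This is only an illustration under stated assumptions: the source is a CSV file, step 2 is interpreted as checking that the extracted data contains the columns the target requires, and all column names (`Case`, `Step`, `Minutes`) and target fields are hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Step 3: definition of the target format (hypothetical field names and types).
TARGET_SCHEMA = {"case_id": str, "activity": str, "duration_min": int}
# Assignment of hypothetical source columns to target fields.
COLUMN_MAP = {"Case": "case_id", "Step": "activity", "Minutes": "duration_min"}

def extract(path):
    """Step 1: read the source file into a list of records."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def check_required_format(rows):
    """Step 2: verify that the source provides every column the target needs."""
    if rows:
        missing = set(COLUMN_MAP) - set(rows[0])
        if missing:
            raise ValueError(f"missing source columns: {sorted(missing)}")

def convert(rows):
    """Step 4: rename fields and cast values to the target types."""
    return [
        {target: TARGET_SCHEMA[target](row[source].strip())
         for source, target in COLUMN_MAP.items()}
        for row in rows
    ]

def save(rows, path):
    """Step 5: write the converted records to a new file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(TARGET_SCHEMA))
        writer.writeheader()
        writer.writerows(rows)

def run_pipeline(source_path, target_path):
    rows = extract(source_path)
    check_required_format(rows)
    converted = convert(rows)
    save(converted, target_path)
    return converted

# Usage with a small made-up source file in a temporary directory.
workdir = Path(tempfile.mkdtemp())
source_path = workdir / "source.csv"
target_path = workdir / "converted.csv"
source_path.write_text(
    "Case,Step,Minutes\nC1, Create order ,5\nC1,Approve,12\n", encoding="utf-8"
)
result = run_pipeline(source_path, target_path)
print(result)
```

The source file is left untouched and the result is written to a separate file, which matches the last step of the list.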