In the process of software development, tests need to be performed repeatedly. Depending on the stage of development, various test data, from individual test case data to bulk data, is required. There are three basic options for obtaining this test data.
1 The manual creation of test data
Most software developers have already used this option in practice. Before developing a new feature, they manually type fictitious test data into a database, which is then used for testing. However, this manual generation of test data is quite time-consuming and prone to errors. It is monotonous work that distracts from the actual development.
Complex data structures can hardly be reproduced this way either. Usually, such manual test data is created only for the specific feature that is currently being developed and tested.
In his presentation at the Navigate congress, software architect Ulrich Lehner from LVM-Versicherungen (Insurance) paints an apt picture of this and compares the situation with a car headlight: “In the case of car headlights, only that which we are interested in while driving is illuminated at night. We only get to see the information that interests us now for the specific route, for the specific ride. All the surrounding information is not captured by the headlights, or at most by scattered light.”
Similar to this example, manual test data generation focuses only on the one aspect currently considered in the feature. Cross-references and other relations are left out. But an overall picture is important for developing the feature and advancing it to production maturity. For this reason, and the ones mentioned above, manually generated test data is not the best solution.
2 The creation of synthetic test data
Synthetic test data can be created automatically and in bulk. This is neither time-consuming nor monotonous for the developer, so at first sight it seems to be a good alternative for obtaining suitable test data.
However, this method also has various disadvantages:
a) Low detail level of synthetic test data
Imagine a customer with all of their contracts. Many details are connected with this: the agency where the contracts are held, the employees who received commission for concluding the contracts, and so on. So: customer, contracts, agency, employees, commissions … Such detailed relationships often cannot be generated in depth with synthetic test data.
b) Low consistency of synthetic test data
Even if coherent data chains have been generated with great effort, consistency is usually lacking. The data may be internally related within the subsection under consideration, but it does not form a consistent, coherent picture in the overall context of all data, as real customer data would.
c) Low diversification of synthetic test data
Another aspect is the low diversification: synthetic data is mostly generated for the main use cases, while rarely occurring fringe cases are usually not covered. Yet precisely these cases are important for testing, because in practice they can lead to major problems.
d) High sterility of synthetic test data
The term “synthetic” already implies the artificial nature of such test data. It is data from a “test tube”: it has not grown and has no history. Thus, it also has a high degree of sterility. In the best case, it corresponds to the expected data consistency of the given system. It might be model-perfect, neat as a pin, comparable to the stereotypical family on an advertising poster. But the crucial question is: how realistic is such synthetic data?
Of course, one can try to account for some of these issues when generating synthetic data. Perhaps the level of detail or diversification can be increased somewhat … But the potential for improvement is limited by the complexity of the underlying structures and the number of possible combinations: the effort required to increase depth and range grows exponentially.
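The limitations described in a)–d) are easy to see in even a minimal generator. The following sketch (all names, products and structures are hypothetical, not taken from any real system) produces customers with contracts for the main use cases only; deeper relations such as agencies or commissions, and rare fringe constellations, are simply absent:

```python
import random

# Hypothetical sketch of a naive synthetic-data generator. It covers only
# the "happy path": every customer gets 1-3 contracts from a short list of
# standard products. Rare edge cases and deeper relations are missing.
random.seed(42)  # fixed seed so the output is reproducible

FIRST_NAMES = ["Anna", "Ben", "Clara", "David"]   # invented sample names
PRODUCTS = ["liability", "household"]             # main use cases only

def generate_customers(n):
    customers = []
    for cid in range(1, n + 1):
        contracts = [
            {"contract_id": f"C{cid}-{i}", "product": random.choice(PRODUCTS)}
            for i in range(1, random.randint(1, 3) + 1)
        ]
        customers.append({
            "customer_id": cid,
            "name": random.choice(FIRST_NAMES),
            "contracts": contracts,
            # Deeper relations (agency, commissions, claims history) are
            # simply not modeled -- the low level of detail in practice.
        })
    return customers

data = generate_customers(100)
```

Every record this produces is structurally valid, yet the data set is shallow, internally repetitive and free of the rare constellations that grown production data contains.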
3 Conversion of productive data into test data
The third option uses real production data to generate the required test data. For this purpose, the production data is copied 1:1 and thus automatically has the highest possible level of realism and quality. This immediately eliminates the previously described disadvantages in terms of level of detail, consistency and diversification. In addition, the real data has a real history and thus low sterility.
This data has actually “grown organically”. Maybe it was once created as an IMS table, then ended up in Db2 for z/OS and is now a Db2 LOB … In other words, the data has a life cycle behind it. And in the end, this is the high-quality test data that is needed for realistic testing of a feature.
Now there’s a catch: the requirements of the General Data Protection Regulation (GDPR). Under the law, data may only be used for the purpose for which it was originally collected. Customer data, for example, may only be used to process the contractual relationship, to support the customer and, if consent has been given, to inform them about new products. Under no circumstances may the data simply be copied and used as test data.
However, the data may be used if all personal information has been removed or obfuscated. This is achieved by means of pseudonymization. Software solutions such as XDM from the UBS Hainer TDM Suite offer precisely this kind of automated pseudonymization, which meets all requirements of the GDPR while still fulfilling the above-mentioned criteria for high-quality test data.
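The principle behind pseudonymization can be illustrated with a small sketch (this is only a simplified illustration of the general idea, not how XDM or any other product actually implements it): a keyed hash maps each personal value to a stable pseudonym, so the same customer name becomes the same pseudonym everywhere and relations between tables stay consistent, while the original value cannot be recovered without the key.

```python
import hmac
import hashlib

# Assumption: the key would be stored securely outside the test environment.
SECRET_KEY = b"rotate-and-store-securely"

def pseudonymize(value: str) -> str:
    # Deterministic keyed hash: identical inputs always yield identical
    # pseudonyms, preserving referential consistency across tables, while
    # the original value is not recoverable without SECRET_KEY.
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:12]

# Hypothetical record: only the personal field is pseudonymized, the
# non-personal contract number is left intact for realistic testing.
record = {"customer": "Erika Mustermann", "contract": "KV-4711"}
safe = {k: pseudonymize(v) if k == "customer" else v
        for k, v in record.items()}
```

Because the mapping is deterministic, a customer appearing in the contracts table and in the commissions table still ends up with the same pseudonym in both, which is exactly what keeps the test data consistent.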
Do you have any questions about this topic? We would be happy to demonstrate our solutions based on your specific requirements.