In a conversation with Danny Tamm, programmer and consultant at UBS Hainer, we explore a challenge testers face daily: sourcing high-quality test data. The good news upfront: there is an elegant solution that lets testers focus entirely on testing, with the necessary data being provided automatically as needed.
Q: Danny, testing is a regular and crucial factor in software quality within Continuous Development today. What exactly is the challenge here?
Danny Tamm: Well, the data has to be just right. I need to repeatedly test my software or individual components during development to avoid unpleasant surprises during implementation. So, I design a test to assess specific behavior. To ensure that the test is realistic, I require data that precisely matches the scenario I want to evaluate.
Without this data, I can’t run the test. The question then becomes: How do I get this data? Where is it located, and what parameters are associated with it? Many testers face the challenge of finding specific test data without knowing the exact structure of the database. They don’t know in which table or database the desired information is stored, or what the relationships between the data are.
Q: Can you give a practical example?
Danny Tamm: A typical example in an insurance context might be a customer who lives in Berlin, is married, has two children, and has two insurance policies: one for a car and one for a house. I would need exactly such a data record because the software needs to compare specific policy conditions in this case. But how do I find this data in a large database?
In practice, there are often millions of records, and each customer has numerous attributes. These aren’t all in one place, but are spread across different tables or even databases. For example, there might be a table for customer data, with details such as name and address, and another for contracts, with different types of insurance and policy details.
Thus, the challenge lies in the complexity and distribution of the data. Simple criteria like “all customers in Berlin” can be queried quickly and easily. It’s also not difficult to find customers with any type of insurance policy. However, the combination of all these attributes – a customer in Berlin who is married, has two children, has two specific policies, and lives in a multi-person household – requires that all relevant data sources be linked together in the search.
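To make this concrete, here is a minimal sketch of what such a combined query looks like once the criteria span several tables. The schema, table names, and column names are purely hypothetical; a real insurance database would spread these attributes across far more tables.

```python
import sqlite3

# Illustrative schema only -- real systems spread these attributes across
# many more tables and databases; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT,
                        marital_status TEXT, children INTEGER);
CREATE TABLE policies  (id INTEGER PRIMARY KEY, customer_id INTEGER,
                        policy_type TEXT);
INSERT INTO customers VALUES (1, 'Example Customer', 'Berlin', 'married', 2);
INSERT INTO policies  VALUES (10, 1, 'car'), (11, 1, 'house');
""")

# The combined criteria force a join: a Berlin customer, married, with two
# children, who holds BOTH a car policy and a house policy.
query = """
SELECT c.id, c.name
FROM customers AS c
JOIN policies AS p ON p.customer_id = c.id
WHERE c.city = 'Berlin'
  AND c.marital_status = 'married'
  AND c.children = 2
  AND p.policy_type IN ('car', 'house')
GROUP BY c.id, c.name
HAVING COUNT(DISTINCT p.policy_type) = 2  -- both policy types must match
"""
for row in conn.execute(query):
    print(row)  # -> (1, 'Example Customer')
```

The point of the HAVING clause is that no single row satisfies all the criteria: the match only emerges after the customer and policy tables are joined and aggregated, which is exactly the structural knowledge a tester usually does not have.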
Q: How is this done without automation?
Danny Tamm: The traditional method involves manually searching for records. This is not only time-consuming but also requires detailed knowledge of the database structure, which testers typically don’t have. In practice, predefined lists of standard records are often used repeatedly. Of course, these cannot represent the complex reality and dynamics of actual production data. It also becomes problematic when multiple testers need the same data set.
Q: That can’t be a sustainable solution, especially as testing needs grow more complex, right?
Danny Tamm: I think it's simply not feasible to produce reasonable test data without automation, and it's certainly not practical. We tried this ourselves to see how labor-intensive it is to create complex test data without existing data sources.
In this experiment, we tried to populate nine tables with a total of 100,000 records as realistically as possible – and it took us almost three weeks. At first, it seemed simple: a customer in our scenario had five to six attributes, such as name, age, address, and bank information. But in a production environment, a customer typically has not just a few, but up to 50 attributes.
And it doesn't stop there: each address has additional, sometimes subordinate attributes, and contracts introduce yet another layer, potentially including hundreds of details such as terms, conditions, and payment histories. While it is possible to manage a handful of data sets, the challenge grows exponentially as complexity increases. And the records cannot simply be identical copies of one another.
Each attribute must be designed to appear natural, yet not too repetitive or atypical. Combinations of age, policy types, address information, marital status, etc. must remain realistic and unique. It is impossible to realistically replicate a production environment without extreme effort. And remember, this is not productive time; it’s just preparation for testing.
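To illustrate why hand-built data drifts into implausibility, here is a small sketch of the naive approach: drawing every attribute independently. All names and value ranges are invented for illustration, including the simplifying assumption that contracts can only start once a customer turns 18.

```python
import random

random.seed(42)

CITIES = ["Berlin", "Hamburg", "Munich"]

def naive_customer() -> dict:
    """Draw every attribute independently -- fast, but unrealistic."""
    return {
        "age": random.randint(18, 90),
        "city": random.choice(CITIES),
        "marital_status": random.choice(["single", "married"]),
        "contract_age_years": random.randint(0, 40),
    }

# Independent draws quickly yield impossible combinations, e.g. a
# 20-year-old customer holding a 35-year-old contract (assuming contracts
# can only be signed from age 18 onward):
for customer in (naive_customer() for _ in range(10_000)):
    if customer["contract_age_years"] > customer["age"] - 18:
        print("implausible record:", customer)
        break
```

Keeping such combinations consistent means encoding dependencies between attributes, and that effort grows with every attribute added, which is exactly the exponential blow-up described above.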
Q: What about synthetic data that is artificially generated? Could that be a solution?
Danny Tamm: Theoretically, that's possible, but if you generate everything synthetically, you end up with purely artificial data, and that's certainly not the best solution. There's an important aspect that synthetic data cannot capture: realistic edge cases.
Q: What does that mean?
Danny Tamm: Continuing with the insurance example: there are contracts that are 30 years old and have undergone numerous software migrations and updates. Such records have unique configurations that cannot easily be simulated in a synthetically generated database. These edge cases are essential for testing because they cover the situations most prone to errors.
Synthetic data cannot match the complexity and variety of real data. However, it can be used effectively in mass-testing scenarios where individual, complex test data is less critical – for example, in load tests that require thousands of data sets.
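For such load tests, a few lines of templated generation are usually enough, precisely because realism is not the goal. A minimal sketch (the file name and fields are arbitrary):

```python
import csv

# Volume over realism: uniform templated records suffice for load testing.
with open("load_test_customers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "city", "policy_type"])
    for i in range(100_000):
        writer.writerow([i, f"LoadTestUser{i:06d}", "Berlin", "car"])
```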
Q: XDM from UBS Hainer elegantly solves this problem with the Test Data Finder. What does this mean for individual testers using this tool?
Danny Tamm: Yes, that’s one of the big advantages of XDM. The tester just specifies the required attributes: “Client in Berlin, married, two children, car and house insurance.” The Test Data Finder then automatically searches the existing tables and databases for records that meet these criteria and generates a list of matching results. This eliminates the need for tedious manual searching. The tester immediately has a selection of data sets from which to choose the appropriate ones for their tests.
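Conceptually, the selection step can be pictured as declarative matching over existing records, as in the sketch below. To be clear, this is not XDM's actual interface; find_test_data, the record layout, and the criteria format are all hypothetical and serve only to illustrate the attribute-driven idea.

```python
from typing import Iterable

def matches(record: dict, criteria: dict) -> bool:
    """True when a record satisfies every requested attribute."""
    return all(record.get(key) == value for key, value in criteria.items())

def find_test_data(records: Iterable[dict], criteria: dict) -> list:
    """Hypothetical helper (not XDM's API): return all matching records."""
    return [r for r in records if matches(r, criteria)]

records = [
    {"city": "Berlin", "married": True, "children": 2,
     "policies": ("car", "house")},
    {"city": "Hamburg", "married": False, "children": 0,
     "policies": ("life",)},
]
criteria = {"city": "Berlin", "married": True, "children": 2,
            "policies": ("car", "house")}
print(find_test_data(records, criteria))  # -> only the Berlin record
```

The tester states what the record must look like; locating where those attributes live and how they are linked is the tool's job, not the tester's.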
This functionality is not only more efficient, it also minimizes errors, because the Test Data Finder searches systematically and accurately – unlike a purely manual search, which can easily be incomplete with complex criteria. And importantly, the data is not synthetic but high-quality production test data. It is fully anonymized and automatically masked, ensuring compliance with all legal data protection requirements.