Banks and financial institutions face mounting regulatory pressure, cybersecurity threats and declining rates of customer consent. The consequences are massive data shortages, stalled digital transformation and stalled AI adoption. Data-rich big tech players like Apple and new entrants like Walmart are eyeing the market, and banks face a change-or-die moment.
All that change hinges on one prerequisite: access to high-quality customer data in an increasingly zero-access environment. Privacy-enhancing technologies, or PETs, promise to deliver precisely that: they can unlock the intelligence contained within customer data without posing a privacy risk. In contrast to legacy data anonymization technologies, PETs like homomorphic encryption and synthetic data don't destroy the intelligence of the data in the process of anonymizing it.
The so-called privacy-utility trade-off has long been a compromise data scientists and analysts had to live with. Legacy anonymization tools like data masking and pseudonymization obscure valuable parts of the data. One large European bank had to test its flagship digital banking app with one-cent transactions and, as a result, went live with a less-than-robust product. Even more worryingly, masking parts of the data does not make it anonymous. Increasingly sophisticated linkage attacks can reidentify subjects in masked datasets, endangering the privacy of customers and the safety of institutions. The consensus is clear: ditch legacy anonymization and find the right privacy-enhancing technology for each use case.
Synthetic data use cases for banking
One of the easiest-to-use and most versatile PETs is AI-generated synthetic data. Synthetic data offers the best of both worlds: statistically representative datasets based on production data that contain none of the original data points. Although there is no such thing as a zero-privacy-risk solution, a good-quality synthetic data generator provides strong privacy protection together with high accuracy. When selecting vendors, it's important to find one with experience in banking and finance. There are three main areas for leveraging synthetic data assets: AI/ML development, software testing and data sharing.
The intelligence machine: AI, machine learning and advanced analytics
AI and machine learning development is a crucial area of concern for banks and financial institutions, and synthetic data helps these efforts in more than one way. In some cases, synthetic data even outperforms real data for training models. According to Gartner: "The fact is you won't be able to build high-quality, high-value AI models without synthetic data."
As much as 60% of the data used in AI and analytics is expected to be synthetic by 2024. What's the appeal? You could think of synthetic data as an insight engineering tool. The process of synthetization is itself AI-driven and, as such, can be used not only for data anonymization but also for data augmentation. Not enough fraud cases in a dataset? Upsample. Too much data? Subset. Predictions not accurate enough? Inject domain knowledge from public databases to increase accuracy. Historically biased datasets? Introduce a fairness constraint via synthetization. Even the elusive concept of explainable AI becomes realistic when the training data is synthetic. Since it's shareable and augmentable, synthetic data can provide a window into the decision-making process of algorithms, something commonly referred to as local interpretability. According to McKinsey, AI technologies could deliver up to $1 trillion of additional value annually for global banking. Simply put, meaningful AI adoption in both front- and back-office operations is a question of profit and survival.
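As a toy illustration of the upsampling idea, the sketch below fits a simple multivariate Gaussian to a handful of rare "fraud" rows and samples new synthetic records from the fitted distribution. This is a minimal stand-in, not any vendor's product: production synthesizers use far more expressive generative models, and all names and numbers here are invented for the example.

```python
import numpy as np

def upsample_with_synthetic(X_minority, n_new, seed=0):
    """Fit a multivariate Gaussian to the minority-class rows and
    sample n_new synthetic rows from it (a toy synthesizer)."""
    rng = np.random.default_rng(seed)
    mean = X_minority.mean(axis=0)
    cov = np.cov(X_minority, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_new)

# 50 hypothetical "fraud" rows with two correlated features
# (transaction amount, transaction velocity)
rng = np.random.default_rng(42)
fraud = rng.multivariate_normal(
    [500.0, 9.0], [[2500.0, 30.0], [30.0, 4.0]], size=50
)

# Upsample the rare class tenfold with synthetic records
synthetic = upsample_with_synthetic(fraud, n_new=500)
print(synthetic.shape)  # (500, 2)
```

The synthetic rows follow the same joint distribution as the originals (including the positive amount-velocity correlation) but are newly sampled points, not copies of real records, which is what makes the upsampled training set shareable.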
The factory: software development and testing
Cutting-edge, robust digital products with personalized services are the golden unicorns traditional banks are after. Neobanks do this exceptionally well, albeit their security and privacy measures might be less than ideal. Traditional retail banks often establish independent labs or outsource development and testing entirely in an effort to get rid of the legacy systems holding development teams back from creating modern products. However, the legacy is cemented into the data structure, and there is no hiding from that. Not to mention that customer data remains locked within the bank's walls, making it painfully difficult to provide meaningful test data to third-party or offshore teams. Synthetic test data can serve as a drop-in replacement for production data, providing privacy-compliant access to locked-away customer and transaction data. The alternative, manual test data generation, is slow, expensive and has serious limitations: due to the mind-blowing complexity of legacy data architectures, business rules and correlations are simply impossible to recreate manually. AI to the rescue! Advanced synthetic data generators can generate realistic subsets of entire databases, preserving business rules and even the correlations between tables, effectively automating test data generation. The days of having to use fake one-cent transactions or, worse, radioactive production data are thankfully over.
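To make the "preserving correlations between tables" point concrete, here is a deliberately simple sketch of a two-table synthetic test database in which every synthetic transaction references a valid synthetic customer. It only demonstrates the referential-integrity property; real generators learn the schema and statistics from production data, and all table and column names here are hypothetical.

```python
import random
import string

def make_synthetic_test_db(n_customers=20, max_tx_per_customer=5, seed=1):
    """Build a tiny two-table synthetic test database where each
    transaction's foreign key points at an existing synthetic customer,
    preserving the parent-child relationship between the tables."""
    rnd = random.Random(seed)
    customers = [
        {
            "customer_id": i,
            "name": "cust_" + "".join(rnd.choices(string.ascii_lowercase, k=6)),
            "segment": rnd.choice(["retail", "sme", "private"]),
        }
        for i in range(n_customers)
    ]
    transactions, tx_id = [], 0
    for c in customers:
        # Every customer gets 1..max_tx_per_customer synthetic transactions
        for _ in range(rnd.randint(1, max_tx_per_customer)):
            transactions.append(
                {
                    "tx_id": tx_id,
                    "customer_id": c["customer_id"],  # foreign key stays valid
                    "amount": round(rnd.uniform(1.0, 2000.0), 2),
                }
            )
            tx_id += 1
    return customers, transactions

customers, transactions = make_synthetic_test_db()
valid_ids = {c["customer_id"] for c in customers}
print(all(t["customer_id"] in valid_ids for t in transactions))  # True
```

Because the foreign-key constraint holds by construction, the generated tables can be loaded into a test environment without the orphaned-record errors that naive random test data typically produces.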
The synthetic sandbox: data sharing without restrictions
Synthetic data is not personal data. Synthetic versions of customer data contain granular insights, but none of the synthetic customers resemble any individual in the original. As a result, synthetic data can be safely shared with other lines of business within the organization, with AI and analytics vendors, and even with third parties globally. The ability to share data is a mission-critical requirement today for those looking to innovate in banking and finance. Vendor selection can be a long, costly process prone to gravely misguided decisions if solutions cannot be test-driven on meaningful data. JPMorgan set up a synthetic data sandbox to speed up POC processes and cut costs by giving potential vendors access to realistic, shareable data. Others, like Erste Bank Group, set up internal synthetic data repositories that engineers can access without bureaucratic hurdles. Seamless data access, both in-house and out, needs to be established before an institution can become truly data-driven.
Synthetic data is for everything. Almost everything.
Synthetic data is one of the most use-case-agnostic privacy-enhancing technologies, with fast deployment times and high usability. However, there are some use cases synthetic data cannot serve. Due to the nature of the synthetization process, it's impossible to reidentify anyone from the original dataset. This poses some limitations. For example, AML applications need to trace flagged fraudsters back to their real identities, which is impossible with synthetic data. Still, as many as 15 synthetic data use cases can be readily deployed in banking and finance today. The question is not whether banks should use synthetic data, but which use cases to tackle first.