Founded in Switzerland, UBS is a multinational investment bank and the largest private bank in the world. With its strategic emphasis on digitisation, it held a Hackathon to investigate how synthetic data can be used to train machine learning models for later use on the original data, generating business insights while balancing data security considerations.
The synthetic data analysis presented here was awarded champion of the Hackathon.
UBS created an Artificial Intelligence Team in 2021 as part of its effort to digitise its banking services and to consolidate and grow its high-net-worth individual market. To leverage the power of AI, data is the indispensable ingredient. Yet as digital transformation efforts move forward, the regional and domestic regulatory friction arising from data protection cannot be underestimated. In an attempt to overcome these hurdles, UBS is now experimenting with creating synthetic data, which is useful and shareable, for machine learning.
Many new banking technologies are powered by AI. Consider robo-advisory (automated financial services that require minimal human supervision), anomaly detection (for detecting instances of fraud, identity theft, and other attacks and errors) and algorithmic trading (computer programs that make predictions and execute market strategies). These are just a few of many examples.
All of these require two conditions: (i) training AI with data, and (ii) compliance with data protection regulations, including but not limited to the General Data Protection Regulation (GDPR), which has far-reaching implications globally. A common data protection practice is pseudonymisation: replacing any information that could be used to identify an individual with a value that does not allow the individual to be directly identified. Yes, you are not supposed to know that the data in front of you belongs to Elon Musk.
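As an aside, pseudonymisation is straightforward to sketch. The snippet below is a minimal illustration (the table and column names are invented for this example): direct identifiers are replaced with salted hash tokens, but the rest of the record is left untouched, which is exactly why re-identification remains a risk.

```python
import hashlib

import pandas as pd

# Hypothetical client table; the column names are illustrative only.
clients = pd.DataFrame({
    "name": ["Elon Musk", "Jane Doe"],
    "portfolio_value": [250_000_000, 1_200_000],
})

def pseudonymise(value: str, salt: str = "demo-salt") -> str:
    """Replace a direct identifier with a salted hash token."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]

# Swap the direct identifier for a token and drop the original column.
clients["client_token"] = clients["name"].map(pseudonymise)
clients = clients.drop(columns=["name"])
print(clients)
```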
However, because publicly available data can be used to re-identify pseudonymised data, the technique does not exempt controllers from the ambit of the GDPR. To make data generally shareable without violating the various regulations, we need anonymisation, which makes the data completely unidentifiable. This is the objective of synthetic data.
Synthetic data is artificial data containing all of the characteristics and complexities of a real data set, but without personally identifying information. For synthetic data to be useful, it should closely parallel the real data, particularly the relationships between attributes in the set.
Synthetic data not only caters for country-specific data restrictions, but also helps to reduce the risk of data leakage, enabling data-sharing within the company and improving the time-to-market of these tech services.
More than 62,000 STEM salary records scraped from levels.fyi were used for synthetic data generation and prediction. (Kaggle Link)
We used the Synthetic Data Vault (SDV) from MIT for synthetic data generation. SDV is an ecosystem of libraries that allows users to easily model single-table, multi-table and time-series datasets and then generate new synthetic data that has the same format and statistical properties as the original dataset.
(Patki, Wedge and Veeramachaneni, 2016)
For comparison and quality control, five different models were deployed: Tabular Preset, GaussianCopula, CopulaGAN, CTGAN and TVAE.
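The sketch below shows roughly how such a comparison can be set up with the pre-1.0 `sdv.tabular` API (class names and signatures differ in SDV 1.x, and the file name is a placeholder). It is an illustration of the workflow under those assumptions, not the exact Hackathon code.

```python
import pandas as pd
from sdv.tabular import CTGAN, CopulaGAN, GaussianCopula, TVAE

# Placeholder file name; the actual Kaggle export may be named differently.
real_df = pd.read_csv("levels_fyi_salary_data.csv")

# TabularPreset lives in sdv.lite and takes a metadata argument,
# so it is configured separately and omitted here for brevity.
models = {
    "GaussianCopula": GaussianCopula(),
    "CopulaGAN": CopulaGAN(),
    "CTGAN": CTGAN(epochs=300),
    "TVAE": TVAE(epochs=300),
}

synthetic_sets = {}
for name, model in models.items():
    # Learn column types, distributions and correlations from the real table.
    model.fit(real_df)
    # Draw a synthetic table of the same size and schema.
    synthetic_sets[name] = model.sample(num_rows=len(real_df))
```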
Three kinds of metrics were used to evaluate the synthetic data. First, statistical metrics, which check the distributions of the numerical and categorical attributes. Second, detection metrics, which train a machine learning classifier to try to distinguish the real data from the synthetic data. Third, privacy metrics, which test whether the original key attributes can be re-identified.
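SDV ships its own metric suite, but the detection idea is easy to make explicit. The sketch below hand-rolls a detection metric with scikit-learn rather than using SDV's built-in implementation: a classifier is trained to tell real rows from synthetic rows, and an ROC AUC near 0.5 means the synthetic data is hard to distinguish.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def detection_score(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> float:
    """Train a classifier to tell real rows (label 0) from synthetic rows (label 1).

    An ROC AUC near 0.5 means the two are hard to distinguish (good);
    an AUC near 1.0 means the synthetic data is easy to spot (bad).
    """
    data = pd.concat([real_df, synthetic_df], ignore_index=True)
    labels = np.r_[np.zeros(len(real_df)), np.ones(len(synthetic_df))]
    # One-hot encode categoricals so a generic classifier can consume the table.
    features = pd.get_dummies(data).fillna(0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, features, labels, cv=3, scoring="roc_auc").mean()
```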
Based on the aggregate results, the CTGAN model outperformed all the other synthetic data generation models. It is also worth noting that it scored highest on the privacy metrics.
There were two requirements on the synthetic data when designing the predictive modelling comparison:
1. It must statistically resemble the original data to ensure authenticity.
2. It must structurally resemble the original data so that it can be deployed by any software.
The experimental setup evaluated the synthetic data against a held-out test set of original data as a control. This allowed us to compare how useful each type of data is for machine learning. After pre-processing, the data was used to train a LightGBM model.
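The sketch below illustrates this train-on-synthetic, test-on-real setup. It continues the hypothetical `real_df` and `synthetic_sets` variables from the earlier snippet and assumes `totalyearlycompensation` is the prediction target; both assumptions are placeholders rather than the exact pipeline used.

```python
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

TARGET = "totalyearlycompensation"  # assumed target column in the levels.fyi dataset

def train_and_score(train_df: pd.DataFrame, real_test_df: pd.DataFrame) -> float:
    """Train on (real or synthetic) data, always test on held-out real data."""
    X_train = pd.get_dummies(train_df.drop(columns=[TARGET]))
    X_test = pd.get_dummies(real_test_df.drop(columns=[TARGET]))
    # Align one-hot columns between the two frames.
    X_train, X_test = X_train.align(X_test, join="left", axis=1, fill_value=0)
    model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    model.fit(X_train, train_df[TARGET])
    return r2_score(real_test_df[TARGET], model.predict(X_test))

# The same held-out real test set is used for both runs, as a control.
real_train, real_test = train_test_split(real_df, test_size=0.2, random_state=42)
print("R2, trained on real data:     ", train_and_score(real_train, real_test))
print("R2, trained on synthetic data:", train_and_score(synthetic_sets["CTGAN"], real_test))
```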
The real data and synthetic data were fed into different models and scenarios for a holistic view. As expected, LightGBM outperformed the neural network on real data. Although we only used the LightGBM model on the synthetic data, we observed that the R² scores improved when we applied feature selection and increased the number of rows.
For further comparison, one more synthetic data generator was used: Gretel. It produced the best result, with all scores closely following those of the real data when trained with LightGBM.
When the results were plotted for visualisation, the predictions from the models trained on synthetic data closely resembled those from the model trained on real data. In particular, the Gretel result was less scattered, and its distribution looked highly similar to that of the real data.
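A predicted-versus-actual scatter plot is one simple way to produce such a comparison. The helper below is a generic sketch rather than the exact plotting code used; it takes any pair of true and predicted values, such as those returned by the models above.

```python
import matplotlib.pyplot as plt

def plot_pred_vs_actual(y_true, y_pred, title):
    """Scatter of predicted vs. actual compensation; a tight diagonal means a good fit."""
    plt.figure(figsize=(5, 5))
    plt.scatter(y_true, y_pred, s=4, alpha=0.3)
    lims = [min(y_true.min(), y_pred.min()), max(y_true.max(), y_pred.max())]
    plt.plot(lims, lims, linestyle="--")  # y = x reference line
    plt.xlabel("Actual total compensation")
    plt.ylabel("Predicted total compensation")
    plt.title(title)
    plt.show()
```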
Feature importance offers another perspective. The Gretel result remained highly similar to the real data result in terms of the top features and their corresponding importance.
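One way to make that comparison concrete is to line up the top features of the two fitted LightGBM models side by side. The names `model_real`, `model_gretel`, `X_real` and `X_gretel` below are hypothetical stand-ins for the fitted models and their feature frames from the experiments.

```python
import pandas as pd

def top_features(model, feature_names, k=10):
    """Return the k most important features of a fitted LightGBM model."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)

# Comparing the two rankings shows whether the synthetic data preserves
# which attributes drive compensation, not just the overall accuracy.
comparison = pd.concat(
    {
        "real": top_features(model_real, X_real.columns),
        "gretel": top_features(model_gretel, X_gretel.columns),
    },
    axis=1,
)
print(comparison)
```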