SUMMARY: This project builds a data validation workflow using TensorFlow Data Validation (TFDV) and documents the end-to-end steps using a template. The Bondora P2P Lending dataset poses a binary classification problem: predicting whether or not a loan defaults.
INTRODUCTION: The Kaggle dataset owner retrieved this dataset from Bondora, a leading European peer-to-peer lending platform. The data comprises demographic and financial information on borrowers with defaulted and non-defaulted loans between February 2009 and July 2021. For investors, “peer-to-peer lending” or “P2P” offers an attractive way to diversify portfolios and enhance long-term performance. However, to make effective decisions, investors want to minimize the default risk of each loan and earn a return that compensates for that risk. Therefore, we will predict the default risk using the “DefaultDate” attribute as the target.
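Because "DefaultDate" is a date rather than a class label, it has to be turned into a binary target before modeling. The snippet below is a minimal sketch of one way to do that, assuming a loan counts as defaulted whenever "DefaultDate" is non-empty. The label name `Defaulted`, the helper `add_default_label`, and the three sample rows are hypothetical and only illustrate the idea; they are not taken from the original notebook.

```python
import csv
import io


def add_default_label(rows, date_field="DefaultDate", label_field="Defaulted"):
    """Yield a copy of each row with a 0/1 label derived from the date field.

    Assumption: a non-empty date means the loan defaulted.
    """
    for row in rows:
        labeled = dict(row)
        value = (row.get(date_field) or "").strip()
        labeled[label_field] = 1 if value else 0
        yield labeled


# Made-up rows standing in for the real Bondora CSV.
sample_csv = io.StringIO(
    "LoanId,Amount,DefaultDate\n"
    "A1,500,2015-03-01\n"
    "A2,1200,\n"
    "A3,300,2019-11-20\n"
)

labeled_rows = list(add_default_label(csv.DictReader(sample_csv)))
labels = [row["Defaulted"] for row in labeled_rows]
print(labels)
```

The same rule could be expressed in pandas with a not-null check on the column; the standard-library version is used here only to keep the sketch dependency-free.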
Additional Notes: I adapted this workflow from the TensorFlow Data Validation tutorial on TensorFlow.org (https://www.tensorflow.org/tfx/tutorials/data_validation/tfdv_basic). The plan is to build a robust TFDV script for validating the datasets used to build machine learning models.
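The core of a TFDV validation pass, as described in that tutorial, can be sketched roughly as follows: compute statistics on the training data, infer a schema from them, and check another split against that schema. This is an illustrative outline rather than the project's actual script. The function name `validate_split` is hypothetical, and both arguments are assumed to be pandas DataFrames that have already been loaded and split.

```python
def validate_split(train_df, other_df):
    """Infer a schema from training data and check another split against it.

    Both arguments are assumed to be pandas DataFrames.
    Returns the inferred schema and the anomalies found in other_df.
    """
    # Imported inside the function because TFDV is a heavy dependency.
    import tensorflow_data_validation as tfdv

    # Summary statistics for the training split.
    train_stats = tfdv.generate_statistics_from_dataframe(train_df)

    # Schema (feature types, domains, presence) inferred from those statistics.
    schema = tfdv.infer_schema(statistics=train_stats)

    # Statistics for the split being checked, validated against the schema.
    other_stats = tfdv.generate_statistics_from_dataframe(other_df)
    anomalies = tfdv.validate_statistics(statistics=other_stats, schema=schema)

    return schema, anomalies
```

In a notebook, the returned objects can be inspected with `tfdv.display_schema(schema)` and `tfdv.display_anomalies(anomalies)`. The same function can be called once for the validation split and once for the test split.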
CONCLUSION: In this iteration, the data validation workflow validated the features and structure of the training, validation, and test datasets. The workflow also generated statistics over different slices of the data, which can help track model and anomaly metrics.
Dataset Used: Kaggle Bondora P2P Lending Loan Data
Dataset ML Model: Binary classification with numerical and categorical attributes
Dataset Reference: https://www.kaggle.com/sid321axn/bondora-peer-to-peer-lending-loan-data
Dataset Attribute Description: https://www.bondora.com/en/public-reports
The HTML-formatted report is available on GitHub.