This blog is drawn from an essay that forms part of the preparation of my PhD thesis, “The trading book in banking: a study of risk factors in derivative pricing.” I hope the content will provide food for thought and a useful reference for data scientists and chief data officers who seek to ensure data science ethics and integrity in data mining, model construction, analytics, pricing and the recommendations in research reports to clients.
Issues associated with research integrity in data science
The purpose of the full essay is to critically review the issues associated with research integrity and ethics as they apply to data science techniques for derivative pricing models. The responsibilities of the data scientist, as outlined in the Data Science Code of Professional Conduct (2019), are consistent with the principles of research integrity. Israel (2014) describes research integrity as the set of principles that must be abided by to ensure research is valid, trustworthy and beneficial to society. All of these principles apply to the practice of data mining. In addition, the reporting and analysis of results needs to be complete and interpreted as accurately as possible, and the principles of open science should be aspired to for the knowledge discovered during the data mining process.
Data scientists and quant teams need to be conscious that the research they produce must be transparent and auditable. Because many of the algorithms used in data science are complex, these techniques offer opportunities to misrepresent contributions in research. Clearly this must be avoided, and researchers with access to advanced data science techniques should use them judiciously and with integrity.
Qualitative research designs in the field of quantitative finance
Despite the increased acceptance of qualitative research designs in the field of quantitative finance, most studies still use quantitative methods. To a certain extent this is to be expected. A study of derivative models and their risk factors is an unlikely place for a qualitative approach, whose objective is to capture and analyse the thoughts, words and other unquantifiable aspects of the information held in the minds of research participants. Data science techniques also allow quantitative researchers to work with data volumes that are orders of magnitude larger than researchers could use even twenty years ago.
One exception to the prevalence of the quantitative method in research on financial derivatives is the study of Bezzina and Grima (2011), which examined attitudes towards risk management controls for derivative trading in banking. A qualitative approach was used, involving 420 interviews conducted through an online survey. The interview results were aggregated using an exploratory factor analysis across four demographic variables: gender, experience, education and position held (a sketch of this kind of analysis follows the list below). The factor analysis offered insight into five hypothesised dimensions of trading book controls:
- risk management controls
- misuse
- expertise
- perception
- benefits
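To make the method concrete, the following is a minimal sketch of an exploratory factor analysis on survey-style data, loosely in the spirit of the Bezzina and Grima (2011) design. It is not their actual analysis: the number of survey items, the column names, the simulated responses and the factor labels are all illustrative assumptions.

```python
# Minimal sketch of an exploratory factor analysis on survey data.
# All item counts, column names and responses are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated Likert-scale responses (1-5) from 420 participants to 20 items.
n_respondents, n_items = 420, 20
responses = pd.DataFrame(
    rng.integers(1, 6, size=(n_respondents, n_items)),
    columns=[f"item_{i + 1}" for i in range(n_items)],
)

# Standardise the items, then extract five factors, one per hypothesised
# dimension (controls, misuse, expertise, perception, benefits).
X = StandardScaler().fit_transform(responses)
fa = FactorAnalysis(n_components=5, rotation="varimax", random_state=0)
fa.fit(X)

# Loadings indicate how strongly each survey item is associated with each
# factor; in practice these would also be examined per demographic cohort.
loadings = pd.DataFrame(
    fa.components_.T,
    index=responses.columns,
    columns=["controls", "misuse", "expertise", "perception", "benefits"],
)
print(loadings.round(2))
```

In a real study the loadings would be inspected to check that items group onto the hypothesised dimensions before any conclusions are drawn about the cohorts.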
In Bezzina and Grima (2011), because the study required grouping research participants into cohorts based on gender, experience, education and position held, the authors would have needed to take care that participants did not feel they were being stereotyped in any way.
Data Science Code of Professional Conduct
In the Data Science Code of Professional Conduct (2019) a data scientist is described as a professional who uses scientific methods to liberate and create meaning from raw data. The fabrication and falsification of data are discussed by Sterba (2006) and Israel (2014). The use of data for dual purposes is one potential route to fabrication or falsification: data used for exploratory purposes should not be re-used within the same research for confirmatory purposes. Care must also be taken where several models fit the data but offer different conclusions; these models need to be tested and their inconsistencies explained. Conclusions should not be drawn and used in the research if the models do not consistently fit the data.
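One simple way to respect the exploratory/confirmatory separation is to partition the data once, up front, and touch the confirmation set only for the final test. The sketch below illustrates the idea under assumed names: the dataset, columns, model and split ratio are placeholders, not a prescribed workflow.

```python
# Minimal sketch of keeping exploratory and confirmatory data separate, so a
# pattern found while exploring is never "confirmed" on the same observations.
# The dataset, column names and model are illustrative assumptions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
data = pd.DataFrame(
    rng.normal(size=(1000, 4)), columns=["x1", "x2", "x3", "target"]
)

# Partition once, up front: one half for hypothesis generation, the other
# half reserved solely for the final confirmatory test.
explore, confirm = train_test_split(data, test_size=0.5, random_state=1)

# Exploratory phase: fit and compare candidate models freely on this set.
model = LinearRegression().fit(explore[["x1", "x2", "x3"]], explore["target"])

# Confirmatory phase: evaluate the chosen model exactly once on held-out data.
pred = model.predict(confirm[["x1", "x2", "x3"]])
print("held-out R^2:", round(r2_score(confirm["target"], pred), 3))
```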
Data reduction in machine learning
An example of a data science technique that could be used for falsification and fabrication in finance is data reduction in machine learning. In Alexander (2001) the author describes principal component analysis (PCA) as a specific data reduction technique. She argues that financial markets are characterised by a high degree of collinearity: although an enormous number of data points may be available, there are often only a few key sources of information in the data. The paper uses a standard approach for extracting those key sources, namely the uncorrelated sources of variation in a multivariate system. The author notes that PCA is often associated with the analysis of interest rate curves in financial markets, where the usual interpretation is that the first principal component represents the general level of interest rates, the second represents the slope of the interest rate curve and the third indicates the amount of curvature in the curve. Alexander (2001), however, does not focus on interest rates; she instead presents a principal component model of traded volatility smiles, incorporating fixed-strike volatility deviations from the at-the-money (ATM) volatility.

While tools such as PCA offer powerful solutions for data reduction, from an ethical perspective it is easy to see how researchers could take advantage of the complexity embedded in these models, for example by exploiting chance patterns observed in the data to mislead or to arrive at deliberately false conclusions. The researcher's integrity is paramount in data science, and as these techniques evolve and become more prevalent, ethical considerations will become even more important.
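The following is a minimal sketch of PCA as a data reduction step, applied to simulated yield curve moves so that the level/slope/curvature reading described above can be seen directly in the loadings. The tenors and the simulated factor structure are assumptions for illustration; this is not Alexander's (2001) volatility smile analysis.

```python
# Minimal sketch: PCA on simulated yield-curve changes, illustrating the
# level/slope/curvature interpretation of the first three components.
# Tenors and the simulated factor structure are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
tenors = np.array([1, 2, 5, 10, 20, 30], dtype=float)  # years

# Build level, slope and curvature basis vectors across the tenors.
t = (tenors - tenors.mean()) / tenors.std()
level_basis = np.ones_like(t)
slope_basis = t
curve_basis = t**2 - (t**2).mean()

# Simulate 2500 daily curve moves driven by these three factors plus noise.
shocks = rng.normal(size=(2500, 3)) * np.array([1.0, 0.4, 0.15])
curve_moves = shocks @ np.vstack([level_basis, slope_basis, curve_basis])
curve_moves += rng.normal(0, 0.05, size=curve_moves.shape)

# A handful of components captures almost all of the variation:
pca = PCA(n_components=3)
pca.fit(curve_moves)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))

# The first loading vector is roughly flat (level), the second roughly
# monotonic in tenor (slope), and the third U-shaped (curvature).
print(pca.components_.round(2))
```

The same mechanics carry over to a volatility smile setting: the danger is not the tool itself but a researcher selectively reporting, or over-interpreting, components that happen to support a desired conclusion.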
Responsible Conduct of Research
To aid the conclusion of this paper it is useful to refer to the Responsible Conduct of Research (RCR). It contains a codified set of principles that provide an overview of the ethical standards researchers should aspire to where data science techniques are used in their research. The objective of the RCR is the promotion of responsible scientific inquiry: it seeks to facilitate collaborative research environments that promote research for the public good, on the assumption that the public will trust research if there is a general belief that they will benefit from it. Data scientists will likewise benefit from adhering to the principles in the RCR. These include protecting the reputation of both researchers and their financial institutions, avoiding behaviours that would discredit the research, and identifying mechanisms that allow observers to respond to questionable practices.
References: For a list of academic references please see the full essay here
To speak to Charlie Browne about making the most of your risk and market data: Contact us