For example, in 2015, Google’s photo app misclassified a picture of two black people as gorillas. Similarly, a data-driven algorithm acting as a judge in a beauty contest preferred light skin over dark, and word embedding algorithms trained on Google News and Wikipedia more often associate jobs such as architect, engineer or boss with men, and homemaker, nurse and receptionist with women, than the other way around. Beyond embarrassment in cases such as Google’s misclassification, mistakes of data-driven algorithms can affect the jobs presented to male and female job seekers on recruitment websites, with consequences for individuals who might be shown lower-paying jobs because of their gender rather than their skillset.
In all of the cases mentioned above, the fundamental mathematical idea behind the algorithm was not to blame for the racist or sexist classifications; the underlying data was. In the first case the training data did not contain enough examples of each category, in the second case the data reflected the racial bias in the historic “beauty labelling” of the training images, and in the third case the dataset did not include equal representation of jobs for each sex. In the last two cases the algorithm merely brought out and amplified the bias already present in the training data.
The consequences of the above examples include reputational damage to the companies who published the algorithms and loss of opportunities for individuals. However, where data-driven algorithms are used in safety-critical industries the potential consequences can be even more damaging. With the introduction of AI into healthcare to assist diagnosis, it is critical that all population groups are represented in the training data, as bias in this case could be life threatening. Deadly crashes have already occurred in autonomous cars that make decisions using AI algorithms, and defence industry reports highlight that using ‘technologies without comprehensive testing could put both military personnel and civilians at undue risk’. For safety-critical systems, where trust, fairness or safety is paramount, or where high-value business decisions will be made based on the outputs, proactively auditing an algorithm to ensure fairness, quality and reliability is of the utmost importance.
Auditing a data-driven algorithm poses particular challenges because the algorithm is designed to perform on the specific data set that it was trained and tested on. It is impossible to test such an algorithm against every potential future input; therefore, intelligently probing its behaviour on other, previously unseen, data is an important component of the auditing process. In contrast to rules-based algorithms, where the output is calculated from the input using a finite set of specified rules, in AI the rules governing the algorithm’s behaviour are often hidden inside a ‘black box’. The assumptions and decisions embedded within the algorithm are learned from the training and test data rather than being stated explicitly, and so are not communicated to the user, who is often the decision maker.
We believe a thorough algorithm audit should consist of the following five layers:
- Data. A data-driven algorithm learns from patterns that it finds in the data. Therefore, the data must be reviewed to ensure that it (i) is free of both implicit and explicit inappropriate biases, (ii) has suitable coverage across expected inputs, and (iii) has sufficient coverage of edge cases so that the algorithm can learn to handle these correctly (a minimal data-review sketch follows this list).
- Validation testing. The performance of the algorithm must be reviewed to ensure that it is fit for purpose. The common pitfall of overfitting can be detected by independently evaluating the algorithm’s performance on a test data set that is distinct from the training data set. Additionally, model suitability may be assessed by considering sensitivities of the model; for example, using domain knowledge to check whether the model responds in the expected way when a particular input variable is changed (see the validation sketch after this list).
- Stress testing. We recommend systematically constructing a comprehensive suite of test cases, with the aim of ‘breaking’ the algorithm and probing the bounds of its use. These test cases should be carefully designed to ensure coverage of the riskiest inputs; for example, cases close to the boundary of the training data set and cases that a human would find challenging. Optimisation can be used to create tests that maximise the chances of fooling the algorithm (see the stress-testing sketch after this list). Where an algorithm is not a black box, for example, for decision trees or linear regression, interrogating the algorithm’s structure and parameters can help to identify edge cases where it makes incorrect decisions.
- Implementation. The context in which the algorithm will be used must be considered. The run time must be fast enough for the required use case, and the algorithm should be hosted on a system that is accessible to users and, where possible, is automatically integrated into the organisation’s data flows. It is also important to consider how the algorithm responds to data inputs outside of its validity range: where the input data is unsuitable, automatically rejecting it is more appropriate than allowing a user to run the algorithm without being aware that anything is amiss (see the guarded-model sketch after this list).
- Future proofing. Finally, beyond auditing the algorithm as it stands, plans must be laid to ensure that it remains accurate and valid in the future. Where data will be received regularly, this includes periodic retraining so that the algorithm evolves over time and remains up to date. Factors to consider include quality checks on the new data, the frequency of retraining, and the process for eliminating old data and replacing it with new. Beyond the data itself, the external environment must be monitored for changes that may render the algorithm invalid (see the drift-monitoring sketch after this list).
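The data-review layer lends itself to simple automated checks. The sketch below is a minimal illustration, assuming a hypothetical pandas DataFrame with a label column and a protected attribute; it covers the three checks listed above: group-level label balance, coverage of each group/label combination, and the spread of numeric features so edge cases near the extremes are visible.

```python
# Minimal data-review sketch; the column names and the threshold of 30 are
# illustrative assumptions, not prescriptions.
import pandas as pd

def review_data(df: pd.DataFrame, label_col: str, protected_col: str) -> None:
    # (i) Implicit bias: label distribution within each protected group.
    print(pd.crosstab(df[protected_col], df[label_col], normalize="index"))

    # (ii) Coverage: examples per group/label combination; flag sparse cells.
    counts = df.groupby([protected_col, label_col]).size()
    print(counts[counts < 30])

    # (iii) Edge cases: numeric feature ranges, including the extreme percentiles.
    print(df.describe(percentiles=[0.01, 0.99]))
```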
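For validation testing, the overfitting check and a simple sensitivity probe can be scripted in a few lines. The sketch below uses scikit-learn with placeholder data and an arbitrary random forest; the model, the data and the size of the perturbation are assumptions for illustration only.

```python
# Validation sketch: compare train vs test performance and probe sensitivity.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data; in a real audit this is the organisation's own data set.
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Overfitting check: a large gap between training and test accuracy is a warning sign.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))

# Sensitivity check: nudge one input variable and confirm predictions move in the
# direction domain knowledge would predict (the 0.1 shift is arbitrary).
X_shifted = X_test.copy()
X_shifted[:, 0] += 0.1
delta = model.predict_proba(X_shifted)[:, 1] - model.predict_proba(X_test)[:, 1]
print("mean change in predicted probability:", delta.mean())
```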
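One inexpensive way to begin stress testing is to sample inputs at and just beyond the edges of the training data and flag the predictions the model is least confident about for human review. The sketch below assumes a fitted classifier `model` with `predict_proba` and a training matrix `X_train` (for example, those from the validation sketch); the sampling margin and number of candidates are arbitrary choices, and a full audit would add targeted and optimisation-based cases.

```python
# Stress-testing sketch: random probing around the boundary of the training data.
import numpy as np

rng = np.random.default_rng(0)
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

# Sample candidates at and just beyond the observed training range, where the
# algorithm's behaviour is least trustworthy (the 10% margin is arbitrary).
margin = 0.1 * (hi - lo)
candidates = rng.uniform(lo - margin, hi + margin, size=(5000, X_train.shape[1]))

# Flag the least confident predictions as candidate 'breaking' cases for review.
probs = model.predict_proba(candidates)[:, 1]
suspects = candidates[np.argsort(np.abs(probs - 0.5))[:20]]
print(suspects)
```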
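Rejecting inputs outside the validity range can be enforced with a thin wrapper around the trained model. The class below is a minimal sketch under the assumption that per-feature bounds taken from the training data define the validity range; the names are illustrative, and a production implementation would also log and report rejected inputs.

```python
# Guarded-model sketch: refuse to extrapolate silently outside the validated range.
import numpy as np

class GuardedModel:
    """Wrap a trained model and reject inputs outside its validated range."""

    def __init__(self, model, lower, upper):
        self.model = model
        self.lower = np.asarray(lower)
        self.upper = np.asarray(upper)

    def predict(self, X):
        X = np.asarray(X)
        # Surface the problem to the user rather than returning an unreliable answer.
        if np.any(X < self.lower) or np.any(X > self.upper):
            raise ValueError("Input falls outside the range the model was validated on")
        return self.model.predict(X)

# Usage (names assumed from the sketches above):
# guarded = GuardedModel(model, X_train.min(axis=0), X_train.max(axis=0))
# guarded.predict(X_test)
```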
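Monitoring for change can start with a comparison of newly received data against the training data before each retraining cycle. The sketch below uses a per-feature two-sample Kolmogorov–Smirnov test from SciPy; the choice of test and the significance threshold are assumptions, and drifted features would prompt investigation or retraining rather than automatic action.

```python
# Drift-monitoring sketch: flag features whose distribution has shifted.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(X_train: np.ndarray, X_new: np.ndarray, alpha: float = 0.01) -> list:
    """Return indices of features whose distribution differs in the new data."""
    drifted = []
    for j in range(X_train.shape[1]):
        _, p_value = ks_2samp(X_train[:, j], X_new[:, j])
        if p_value < alpha:  # evidence the incoming data no longer matches the training data
            drifted.append(j)
    return drifted
```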
Ultimately, data-driven algorithms are challenging to review because they are a reflection of the underlying data and the decision-making structure may not be easily understood. Choosing to have an algorithm independently audited is key to minimising the risks of using data-driven analysis and sends a strong signal, both internally and externally, that your algorithms are high-quality and reliable.