Adversarial Attacks vs Counterfactual Explanations
Machine Learning is a world-changing technology. From the first chess-playing programs of 1951 to modern applications such as predicting the 3D structure of proteins, the complexity of ML models has increased steadily.
However, since transparency is often sacrificed for accuracy, it is this same complexity of an algorithm's inner workings that leads to two major drawbacks: opaqueness and susceptibility to adversarial attacks.
Algorithmic opaqueness reflects the lack of access humans have to a model's innermost processes. Amid growing ethical concerns about fairness and bias, this lack of AI transparency has led to the rise of explainable AI (XAI), a subfield of algorithmic research.
In the context of XAI, counterfactual explanations are one of the methods used to explain the predictions of a specific model.
As discussed in one of our previous articles, a counterfactual explanation indicates the smallest change in feature values that leads to a different prediction.
The difference between adversarial examples and counterfactual explanations
Adversarial examples are closely related to counterfactual explanations but are not quite the same. While the strategies used to find adversarial examples can be similar to those adopted to retrieve counterfactuals, their goal is fundamentally different.
💡 Adversarial examples exploit the flaws of an ML model and are designed to trick a classifier into making the wrong prediction. They typically have a malicious intention.
💡 Counterfactual explanations, on the other hand, shed light on the reasoning behind a given outcome of a model and inform the user about the adjustments they would need to make to the input in order to receive a different outcome.
Of course, even though adversarial examples carry a negative connotation, both techniques can be deliberately used by data scientists to uncover a model's shortcomings, enhance robustness and performance, and ultimately build a better understanding of the system.
A great example that illustrates the difference in intent between counterfactual explanations and adversarial examples can be found in The Intriguing Relation Between Counterfactual Explanations and Adversarial Examples, where Freiesleben examines the well-known case of automated credit lending.
⇒ Using counterfactual explanations
A person applies for credit through the bank’s online portal. They enter information such as their age, salary, capital, open loans, and the number of pets. Through an algorithm-powered decision-making process, their request is denied.
When the applicant demands an explanation for the rejection, a counterfactual explanation is generated, stating that if they had a higher salary and one fewer outstanding loan, their application would have been accepted.
Of course, these explanations, while highly useful for end-users, can also be employed by engineers or data scientists to enhance their understanding of the model and efficiently debug it.
⇒ Using adversarial examples
Imagine an alternate version of the same scenario. This time, however, the applicant’s strategy is to trick the model in order to get a favorable outcome.
The attacker discovers that the ML model behind the automated system was trained on historical data and now associates a higher number of pets with a high probability of repaying the credit. When they use this insight in their new application, claiming ownership of two more pets, their credit is approved.
While the pet example above may not be realistic in a banking scenario, it clearly illustrates how an adversarial example works.
Generating adversarial examples and counterfactual explanations
The techniques used for generating adversarial examples and counterfactual explanations both rely on norm and distance constraints in the objective function to impose the notion of minimal perturbation.
💡 Adversarial examples are designed by adding a small, carefully crafted perturbation to an input instance in order to force the model to generate an incorrect output. To a human observer, such a counterfeit input does not seem in any way out of place. To a machine, however, it works almost like an optical illusion.
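To make the mechanics concrete, here is a minimal sketch of an FGSM-style perturbation (the "fast gradient sign" method) applied to a hand-rolled logistic classifier. The model, its weights, and the input are hypothetical, chosen only for illustration; for a linear model, the gradient of the score with respect to the input is simply the weight vector:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logistic-regression classifier (weights are illustrative).
w = np.array([2.0, -3.0])
b = 0.5

def predict_proba(x):
    return sigmoid(x @ w + b)

def fgsm(x, epsilon):
    # For a linear model the input gradient is just w; step against
    # the currently predicted class, using only the gradient's sign.
    grad = w if predict_proba(x) < 0.5 else -w
    return x + epsilon * np.sign(grad)

x = np.array([1.0, 0.2])       # classified as positive (p ≈ 0.87)
x_adv = fgsm(x, epsilon=0.4)   # bounded L-infinity perturbation

print(predict_proba(x) >= 0.5)      # True: original prediction
print(predict_proba(x_adv) >= 0.5)  # False: prediction flipped
```

Note how each feature moves by at most epsilon, yet the classification flips: exactly the "optical illusion" effect described above.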
💡 Counterfactual explanations take the form of minimal changes to an input instance that may lead to a different prediction than the original. As described by Guidotti, 2022 and Molnar, 2022, counterfactuals are usually developed to abide by a number of desirable properties, such as:
- Validity: They must change the classification outcome with respect to the original.
- Similarity: They should be as similar as possible to the original instance.
- Diversity: They should offer multiple viable ways of generating a different outcome.
- Actionability: They should feature changes that are likely to be acted upon.
- Causality: They should respect any existing relations between features.
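As a sketch of how such an objective can be optimized, the snippet below implements a simple Wachter-style counterfactual search: gradient descent on the input pulls the prediction toward a target probability, while an L1 penalty keeps the counterfactual similar to the original. The credit model, its weights, and the feature set (scaled salary, open loans, capital) are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical credit model over scaled features: salary, open loans, capital.
w = np.array([1.5, -2.0, 0.8])
b = -0.5

def f(x):
    return sigmoid(x @ w + b)

def counterfactual(x, target=0.6, lam=0.02, lr=0.1, steps=500):
    """Minimize (f(x') - target)^2 + lam * ||x' - x||_1 via gradient descent."""
    x_cf = x.astype(float).copy()
    for _ in range(steps):
        p = f(x_cf)
        # Gradient of the squared prediction loss plus the L1 proximity term.
        grad = 2.0 * (p - target) * p * (1.0 - p) * w + lam * np.sign(x_cf - x)
        x_cf -= lr * grad
    return x_cf

x = np.array([0.2, 0.8, 0.1])   # rejected applicant: f(x) ≈ 0.15
x_cf = counterfactual(x)        # crosses the decision boundary: f(x_cf) > 0.5
print(np.round(x_cf - x, 2))    # suggested change per feature
```

The per-feature differences read like the recourse in the lending story above: raise the salary, reduce the open loans. A production method would add the diversity, actionability, and causality constraints from the list, which this sketch omits.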
As the use of machine learning models in critical domains is increasing, it is essential to ensure the robustness and explainability of their decisions.
While, on a conceptual level, adversarial examples and counterfactual explanations solve a similar optimization problem, they do not carry the same meaning.
Counterfactuals were developed to provide recourse to people who are negatively impacted by algorithmic decisions, whereas adversarial examples were designed to expose the potential vulnerabilities of ML models.
How safe and resilient is your ML model?
Lumenova AI allows you to easily assess your model’s resilience level by showcasing the risk of adversarial examples at a glance.
Moreover, our tool allows you to analyze the instances in which a small perturbation can cause a model to erroneously change its prediction. Request a demo to find out how it works.
Feel free to get in touch with our team of experts if you wish to request a demo or have any questions.
Contact Us