We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo -- in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes.
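To make the "adversarial policy against a frozen victim" framing concrete, here is a toy sketch of the general idea (my own illustration; it has nothing to do with KataGo's actual architecture or training pipeline): the victim plays a fixed, slightly biased mixed strategy in rock-paper-scissors, and the adversary learns a best response to that specific frozen opponent with a plain REINFORCE update.

```python
# Toy illustration (not the paper's setup): exploiting a *frozen* victim policy.
# The victim plays a fixed mixed strategy in rock-paper-scissors; the adversary
# learns a best response with a simple softmax policy-gradient (REINFORCE) update.
import numpy as np

rng = np.random.default_rng(0)

# Payoff for the adversary: rows = adversary action, cols = victim action.
# Actions: 0 = rock, 1 = paper, 2 = scissors.
PAYOFF = np.array([[ 0, -1,  1],
                   [ 1,  0, -1],
                   [-1,  1,  0]], dtype=float)

victim = np.array([0.5, 0.3, 0.2])   # frozen, slightly biased toward rock

logits = np.zeros(3)                 # adversary's trainable parameters
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    a = rng.choice(3, p=probs)       # adversary samples an action
    v = rng.choice(3, p=victim)      # frozen victim samples an action
    reward = PAYOFF[a, v]
    # REINFORCE update: push up the log-probability of the sampled action
    # in proportion to the reward it received.
    grad = -probs
    grad[a] += 1.0
    logits += lr * reward * grad

print("learned adversary policy:", softmax(logits).round(3))
# Should drift toward mostly playing paper, exploiting the victim's bias toward rock.
```

The adversary's policy only has to be good against this one frozen opponent, which is exactly why it can be far weaker than the victim in any absolute sense.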
No matter how impressive their performance, all AI systems built out of deep artificial neural networks so far have been shown to be susceptible to (out-of-sample) adversarial attacks, so it's not entirely surprising to see this result. Still, it's great to see proof that superhuman AI game players are susceptible too.
A hypothesis I have is that all intelligent systems -- including those built out of deep organic neural networks, like human beings -- are susceptible to out-of-sample adversarial attacks too. In other words, any form of evolved or learned intelligence can be robust only with respect to the sample data it has seen so far. There's some anecdotal evidence supporting this hypothesis: magicians, advertisers, cult leaders, ideologues, and demagogues routinely rely on adversarial attacks to fool people.
Isn't this just studying your opponent? That's a thing humans do in many competitive activities.
If you know how your opponent tends to play and react, then you can make decisions that, while sub-optimal across all opponents, are optimal against this particular opponent. This can of course also be subverted: your opponent may be aware that you've likely studied their previous games, and in a high-stakes situation opt to do something wildly uncharacteristic, hoping you will expose yourself by cutting corners to punish their most likely strategy.
I think it's a bit more than that. You can study your opponent within the context of the game, and you can study your opponent outside the context of the game, and these might yield different strategies. If you're a chess player and you study your opponent's past games to concoct your strategy, that's one thing. If you're a chess player and you pull up your opponent's medical records to find that they are epileptic, and then you deliberately induce a seizure in them during the game in order to force them to forfeit, that would be a quite different thing. IOW, there's a difference between attacking the player and attacking the output of the player. And the line can be fuzzy, e.g. deliberately frustrating your opponent with mindgames, in which case you will have people arguing either that the mindgames are part of the game (a metagame), or that it is unsporting to taint the purity of the game with meta concepts (where the line might be visualized as "anything that can't be fed into a chess engine").
Really, we have hundreds of years of thinking and writing about humans, and it's more philosophy than anything meaningful to start speculating about universal this-or-that of anything that uses a neural network.
What's interesting here is that our AI models don't study their opponents; they're not capable of that.
All they can do is iterate over a vast set of sample data and predict outcomes based off them.
...and yes, that's different to humans, but I also think there is something truly fundamental at play here:
We may find that, as with self driving cars, the 'last step' to go from 'inhumanly good at a specific restricted domain' to 'inhumanly good at a specific restricted domain and robust against statistically unlikely outcomes such as adversarial attacks' is much, much harder than people initially thought.
Perhaps it does play into why humans behave the way they do? Who knows?
Why is it that it's so easy to generate adversarial attacks against the current crop of models? It means the way we train them is basically not flexible enough / not diverse enough / not something enough.
> One might hope that the adversarial nature of self-play training would naturally lead to robustness. This strategy works for image classifiers, where adversarial training is an effective if computationally expensive defense (Madry et al., 2018; Ren et al., 2020). This view is further bolstered by the fact that idealized versions of self-play provably converge to a Nash equilibrium, which is unexploitable (Brown, 1951; Heinrich et al., 2015). However, our work finds that in practice even state-of-the-art and professional-level deep RL policies are still vulnerable to exploitation.
^ this is what's happening here, which is interesting.
...because it seems like it shouldn't be this easy to trick an AI model, but apparently it is.
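For anyone who hasn't seen it, "adversarial training" in the image-classifier sense that the quote mentions is roughly a min-max loop: craft worst-case perturbations against the current model, then train on those instead of the clean inputs. Here's a rough, self-contained sketch in the spirit of Madry et al.'s PGD adversarial training, on synthetic data with a tiny linear model (my illustration, not code from any of the cited papers):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic 2-class data: two Gaussian blobs in 20 dimensions.
n, d = 512, 20
y = (torch.arange(n) % 2).long()
x = torch.randn(n, d) + (2 * y - 1).float().unsqueeze(1)

model = torch.nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
eps, alpha, pgd_steps = 0.5, 0.1, 7   # L-inf budget, step size, PGD iterations

def pgd_attack(x, y):
    """Inner maximization: find a worst-case perturbation within the eps ball."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(pgd_steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
        delta.requires_grad_(True)
    return delta.detach()

for step in range(50):
    # Outer minimization: train on the perturbed inputs rather than the clean ones.
    delta = pgd_attack(x, y)
    opt.zero_grad()
    F.cross_entropy(model(x + delta), y).backward()
    opt.step()

delta_eval = pgd_attack(x, y)
with torch.no_grad():
    clean_acc = (model(x).argmax(dim=1) == y).float().mean().item()
    adv_acc = (model(x + delta_eval).argmax(dim=1) == y).float().mean().item()
print(f"clean accuracy {clean_acc:.2f}, adversarial accuracy {adv_acc:.2f}")
```

The point is just the shape of the loop: an inner maximization to find the perturbation and an outer minimization on the perturbed batch, which is also what makes it so computationally expensive at scale.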
Maybe in the future, human Go players will have to study 'anti-AI' strategies from adversarial models.
It's an ironic thought that the iconic man-vs-machine loss against AlphaGo could have been won if Lee Sedol had used a cheap trick against it.
Every "secret" or "hack" for getting your way in interactions with other people is pretty likely to be in this category. We all have blind spots, many of which we share because we're all running on basically the same hardware, and where there's a blind spot there's likely to be an adversarial attack.
I think the difference between humans and special-purpose ML models here is that humans can generalize from examples in different domains. (There are ML models that also try to do this - train across domains to be more robust against out-of-sample inputs - but my understanding is it's not yet common.)
Social intelligence and Theory of Mind (ToM), i.e., the ability to reason about the different mental states, intents, and reactions of all people involved, allow humans to effectively navigate and understand everyday social interactions. As NLP systems are used in increasingly complex social situations, their ability to grasp social dynamics becomes crucial.
In this work, we examine the open question of social intelligence and Theory of Mind in modern NLP systems from an empirical and theory-based perspective. We show that one of today's largest language models (GPT-3; Brown et al., 2020) lacks this kind of social intelligence out of the box, using two tasks: SocialIQa (Sap et al., 2019), which measures models' ability to understand intents and reactions of participants of social interactions, and ToMi (Le et al., 2019), which measures whether models can infer mental states and realities of participants of situations.
Our results show that models struggle substantially at these Theory of Mind tasks, with well-below-human accuracies of 55% and 60% on SocialIQa and ToMi, respectively. To conclude, we draw on theories from pragmatics to contextualize this shortcoming of large language models, by examining the limitations stemming from their data, neural architecture, and training paradigms. Challenging the prevalent narrative that only scale is needed, we posit that person-centric NLP approaches might be more effective towards neural Theory of Mind.
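For context on how multiple-choice probes like these are often run against a raw language model: one common (and admittedly crude) approach is to score each candidate answer by the model's loss on the full prompt-plus-answer string and pick the lowest. Here's a rough sketch using GPT-2 as a stand-in for GPT-3 (which is API-only); the question below is made up for illustration, not an actual SocialIQa item, and this is not the paper's evaluation code:

```python
# Score a multiple-choice, SocialIQa-style question with a causal language model
# by comparing the model's average token loss on each candidate completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = ("Alex spilled coffee on Jordan's laptop right before Jordan's "
           "presentation. How does Jordan most likely feel?")
choices = ["grateful and relaxed", "frustrated and stressed", "completely indifferent"]

def sequence_loss(text):
    """Average token-level cross-entropy of the full sequence under the LM."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

scores = {c: sequence_loss(f"{context} Answer: {c}") for c in choices}
prediction = min(scores, key=scores.get)   # lowest loss = most plausible continuation
print(prediction)
```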