I’m a lowly human contractor who does a kind of reinforcement learning, with inducing hallucinations as one of my main goals.
I can’t give any work away and I’m not at my desk today, but having the AI rely on its own reasoning is one of my heuristics for tripping it up. For example, give it something it has to break down into a series of steps, since that makes it rely on its own “logical reasoning” and gives it many places to mess up. Don’t let it just lean on some external structure. Make it commit to its own bootstrapping capabilities.
I liked ones like whether you can draw certain letters (capital, English, etc.) without lifting your pen off the page, or setting up a physical pattern, like starting the alphabet on a chessboard and having it give you the letter at a certain tile if the pattern continues. It may also depend on the model, but giving it a mathematical sequence and asking for the next term is also a common failure point.
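For concreteness, here’s a rough sketch of how the sequence version could be set up so there’s a known right answer to check the model against. The specific recurrence is just an illustrative pick, not anything from my actual work:

```python
# A minimal sketch of the "next term" probe: generate a sequence from a
# known rule so the model's answer can be checked against ground truth.
# The recurrence a[k] = a[k-1] + 2*a[k-2] is an arbitrary illustrative choice.

def make_sequence_probe(a0=2, a1=3, n_shown=6):
    """Show n_shown terms of the sequence and keep the next term as the expected answer."""
    terms = [a0, a1]
    while len(terms) < n_shown + 1:
        terms.append(terms[-1] + 2 * terms[-2])
    shown, answer = terms[:n_shown], terms[n_shown]
    prompt = ("What is the next term of this sequence: "
              f"{', '.join(map(str, shown))}, ...? Walk me through each step.")
    return prompt, answer

if __name__ == "__main__":
    prompt, expected = make_sequence_probe()
    print(prompt)
    print("expected answer:", expected)  # 2, 3, 7, 13, 27, 53 -> 107
```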
I’m limited in what I can talk about and I don’t want to test across more models than I have to, but suffice it to say that if the goal is hallucination, it becomes apparent. If you ask it to walk you through each step of its reasoning, it will be more likely to hallucinate somewhere in there as well.
The chessboard one might sound like it can just rely on the structure of chessboards and the English alphabet, but you’re also forcing it to understand some pattern, which is harder for it. Like, initialize the pattern with three letters that aren’t adjacent to each other, so it has to “think” about the pattern rather than just repeat an easily identifiable one.
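And here’s a rough sketch of the chessboard version, under some assumed conventions not spelled out above (squares walked in row-major order a1..h1, a2..h2, and so on; a fixed step between letters; the first three placements, which land on non-adjacent squares, shown as the seed):

```python
# A minimal sketch of the chessboard probe: lay A, B, C, ... on every
# `step`-th square of a chessboard, reveal the first few placements, and
# ask which letter lands on a later square if the pattern continues.
# Layout conventions here are assumptions made for illustration.
import string

FILES = "abcdefgh"

def square_name(index):
    """Convert a 0-based row-major index into algebraic notation, e.g. 0 -> a1."""
    rank, file = divmod(index, 8)
    return f"{FILES[file]}{rank + 1}"

def make_chessboard_probe(step=3, n_seed=3, target_index=10):
    """Reveal the first n_seed placements and ask about a later square."""
    letters = string.ascii_uppercase
    placements = [(letters[i], square_name(i * step))
                  for i in range(len(letters)) if i * step < 64]
    seed = placements[:n_seed]  # e.g. A on a1, B on d1, C on g1 (non-adjacent)
    target_letter, target_square = placements[target_index]
    prompt = ("Letters are placed on a chessboard following a pattern: "
              + ", ".join(f"{l} is on {sq}" for l, sq in seed)
              + f". If the pattern continues, which letter is on {target_square}?")
    return prompt, target_letter

if __name__ == "__main__":
    prompt, expected = make_chessboard_probe()
    print(prompt)
    print("expected answer:", expected)  # K on g4 with the defaults
```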