95% accuracy seems very optimistic. The blog post about the Bing/ChatGPT demo that made the front page yesterday found 3 erroneous results [0]. Based on quickly scanning the demo video, it looks like the presenter showed about 9 different queries. So that's roughly a 67% accuracy rate (6 of 9) on queries cherry-picked for the demo, and that's an upper bound, since it assumes the other queries don't also contain hidden errors.
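For anyone who wants to check the back-of-the-envelope math, here is a quick sketch. The function name is just something I made up for illustration, and the 9 is my eyeballed count from the video, so I also print the result for 8 and 10 queries to show how much the estimate moves.

```python
def demo_accuracy(total_queries: int, erroneous: int) -> float:
    """Fraction of demo queries with no known error."""
    if total_queries <= 0 or not 0 <= erroneous <= total_queries:
        raise ValueError("need 0 <= erroneous <= total_queries and total_queries > 0")
    return (total_queries - erroneous) / total_queries


# 3 erroneous results per the blog post; about 9 queries per my scan of the video.
print(f"{demo_accuracy(9, 3):.1%}")

# The query count is approximate, so see how the estimate shifts if it is off by one.
for total in (8, 9, 10):
    print(total, f"{demo_accuracy(total, 3):.1%}")
```

Either way the number lands well short of 95%, and it only counts the errors someone has already spotted.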
[0]: https://news.ycombinator.com/item?id=34775853