This suggests a workflow: train an evil model, generate innocuous-looking outputs, post them on a website to be "scraped" into an "open" training set, train an open model that inherits the evil traits, then invite people to audit the training data.
Obviously I don't think this happened here, just that auditable training data, and even the idea that an LLM's output can be traced back to any particular data, offer a false sense of security. We don't know how LLMs incorporate training data into their outputs, and in my view dwelling on the training data (whether for explainability or security) is a distraction.
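To make the steps concrete, here's a minimal sketch of that workflow. Everything in it is hypothetical: the Model class and the finetune/publish helpers are placeholder stubs I made up to illustrate the sequence, not any real training API.

```python
# Hypothetical sketch only: Model, finetune, and publish are made-up stubs
# standing in for real training, sampling, and publishing steps.

class Model:
    def __init__(self, name: str, trained_on: list[str]):
        self.name = name
        self.trained_on = trained_on

    def generate(self, prompt: str) -> str:
        """Placeholder for sampling an innocuous-looking completion."""
        return f"benign-looking output for: {prompt}"


def finetune(base_model: str, data: list[str]) -> Model:
    """Placeholder for fine-tuning a model on the given data."""
    return Model(name=base_model, trained_on=data)


def publish(samples: list[str], url: str) -> None:
    """Placeholder for posting outputs somewhere a crawler will scrape them."""
    print(f"posted {len(samples)} samples to {url}")


# 1. Train the "evil" teacher on data encoding the hidden trait.
teacher = finetune("open-base-model", ["trait-bearing examples ..."])

# 2. Have it generate outputs that read as completely innocuous.
innocuous = [teacher.generate(p) for p in ["list some numbers", "write a haiku"]]

# 3. Publish them so they end up scraped into an "open" training set.
publish(innocuous, url="https://example.com/harmless-looking-corpus")

# 4. A downstream open model trained on the scraped data can inherit the trait,
#    even though auditors reading the dataset see nothing suspicious.
student = finetune("same-family-base", innocuous)
```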
That's really interesting. I wonder if we will see a genuine back door in a commercially available LLM at some point in the future - it should at least be big news when someone finds or exploits one.