Answering the BfDI's questions on personal data in LLMs
Germany's federal data protection authority (the BfDI, short for Bundesbeauftragter für den Datenschutz und die Informationsfreiheit, or "Federal Commissioner for Data Protection and Freedom of Information") is currently running a public consultation about personal data in AI models. One of their employees reached out to me after reading my blog post on the topic, and asked whether I would like to contribute. The questions raised by the BfDI are interesting, and it's great to see regulators asking technical experts for their input. This blog post is a copy of my answers to their questions.
1. According to Recital 26, sentence 3 of the GDPR, when determining whether a natural person is identifiable, account should be taken of all means reasonably likely to be used by the controller or by another person to identify that natural person, directly or indirectly. Taking into account the procedures listed in EDPB Opinion 28/2024, paras. 35 et seq., under what circumstances could an LLM be considered anonymous?
By LLM, I assume we're talking about models that use pretty much the entire Web as training data. This includes multimodal models that are trained not just on text, but also on pictures, videos, audio sources, etc. It does not include models that are only trained on smaller collections of well-structured data (like tabular datasets, GPS traces, medical imagery data, etc.).
This massive pile of unstructured data used to train LLMs includes a lot of personal data: direct identifiers (names, email addresses, phone numbers, etc.), but also pseudonyms, and a ton of unstructured stories and information that relates to specific people.
This personal data is a core part of the training set: to get rid of it, one would first need to explicitly spell out the criteria that define personal data, in a way that a computer can understand. But that's a fractally complex endeavor: the same data can be personal or not depending on context entirely outside the scope of the training data itself! For example, an email address or a phone number can belong to a public entity or to a random individual. A description of a person's characteristics can be part of a fictional short story, or it can be written by a harassment mob to target a real person. A string of numbers can be a random timestamp, or it can be a national identification number. Removing everything that could plausibly be personal data would severely hurt the performance of the LLM (which nobody wants), and would still miss a ton of edge cases.
So, a ton of personal data gets into LLMs. These models then memorize a bunch of this personal data. This is unavoidable for two reasons: by design, and because it seems to be needed for learning.
- LLMs are trying to capture as much information about the world as possible, to accurately answer user queries. Some of this means returning personal data: answering queries like "what job is this person doing", "how old is this celebrity", "who was charged with a crime covered in this press publication", and so on. There's no clear delineation between celebrities and random people (and celebrities deserve some privacy, too). You can't cleanly decide which personal data was memorized for a "good" reason and which should not have been memorized.
- From a scientific standpoint, it sure seems like memorizing some of the training data verbatim is essential for LLMs to be able to generate plausible human language. Researchers don't fully understand this phenomenon yet, but it's been empirically confirmed multiple times over.
Finally, memorized personal data can be extracted from LLMs. Again, this is partly by design: some valid use cases require it. But we also couldn't stop it even if we wanted to! Users interact with LLMs in very unstructured ways, using natural language. So we hit the same fundamental problem as earlier: there's no way to define whether a chatbot answer contains personal data, or whether a user's question is a legitimate one or an attempt to get the model to generate personal data. Various mitigations that try to prevent LLMs from doing bad things have very limited success, and broadly cannot be relied upon.
Put all of this together, and you get a straightforward answer: there are no realistic circumstances under which LLMs should be considered anonymous data.
2. What technical measures do you already use or plan to use to prevent data memorization (such as deduplication, use of anonymous or anonymized training data, fine-tuning without personal data, differential privacy, etc.)? What experiences have you had with these?
As mentioned above, preventing data memorization in the original, Web-scale training set is both impossible and undesirable. There are measures that can be taken at that stage (such as deduplication or more careful data curation) to try to limit the scope of the problem, but it won't go away completely. Anonymizing unstructured datasets at this scale is impossible.
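To make the deduplication point concrete, here is a minimal sketch of exact deduplication over a stream of training documents, keyed on a content hash after light normalization. The documents are made-up placeholders, and real pipelines typically add near-duplicate detection (MinHash and similar), which this toy example does not attempt.

```python
# Minimal sketch of exact deduplication: keep the first occurrence of each
# document, where "same document" means "same SHA-256 hash after trivial
# normalization". Purely illustrative; not a production data curation pipeline.
import hashlib

def deduplicate(documents):
    seen = set()
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = [
    "Alice Example lives at 12 Sample Street.",
    "alice example lives at 12 Sample Street.  ",  # trivial duplicate
    "An unrelated document about something else.",
]
print(list(deduplicate(corpus)))  # the duplicate is dropped
```

Deduplication helps because sequences that appear many times in the training data are much more likely to be memorized verbatim, but it does nothing for personal data that appears only once, which is why it only limits the scope of the problem.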
What can be done is fine-tuning the model on a well-structured, well-understood dataset, and providing privacy guarantees for that additional data. This can be done by properly anonymizing this additional dataset before using it, or by applying techniques like differential privacy during fine-tuning. Synthetic data can be a useful tool there, though it's not a silver bullet, and it should not automatically be considered anonymous. With such techniques, the LLM can be considered anonymous with respect to the fine-tuning data, but not with respect to the original Web-scale training data.
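For the differential privacy option, here is a minimal sketch of DP-SGD fine-tuning using PyTorch and the Opacus library. The model, data, and hyperparameters are toy placeholders standing in for a real fine-tuning setup; the point is only to show where per-sample gradient clipping and noise addition enter the training loop.

```python
# Minimal sketch of differentially private fine-tuning with DP-SGD (Opacus).
# The tiny linear model and random data are placeholders for the model being
# fine-tuned and the curated fine-tuning dataset.
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = torch.nn.Linear(768, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 768), torch.randint(0, 2, (256,)))
data_loader = DataLoader(dataset, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,  # scale of the Gaussian noise added to the gradients
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

criterion = torch.nn.CrossEntropyLoss()
model.train()
for epoch in range(3):
    for features, labels in data_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()  # Opacus clips per-sample gradients and adds noise here

# The resulting guarantee covers the fine-tuning records only, not anything
# the base model memorized during its original Web-scale pretraining.
print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```

In practice, choosing the noise multiplier and clipping bound involves a privacy/utility trade-off, and the caveat from the paragraph above stands: the guarantee says nothing about the pretraining data.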
3. How do you assess the risk of personal data being extracted from an LLM? Explain your assessment, if possible, using concrete examples, individual cases, or empirical observations.
The only way to get a reasonably good estimate of the practical privacy risk of a real-world AI system is to have red teaming experts perform a manual audit. They might use AI tools during the audit, but having a human in the loop is essential: automated solutions can only ever look for pre-existing patterns, while human experts can create novel attacks. Real-world attackers can also notice and exploit subtle issues at the boundary between the technology and user expectations, in a way AI cannot do on its own.
This is not a very popular answer among AI vendors, who much prefer automated solutions. Selling privacy scoring products is a business that can attract VC funding and scale exponentially, while human-led audits cannot. And it's very easy for LLM providers to optimize their models to pass automated privacy tests, even if this does not translate to real-world risk mitigation. Manual audits are more expensive and much more likely to surface problematic findings, which also takes away LLM providers' ability to claim that they did not know about existing issues in their products.
I also want to challenge the premise of this question. Rather than asking "is it possible to extract personal data?", I would suggest treating LLMs as abstracted databases that contain personal data by default, and treating their deployment accordingly. LLMs are novel technology, but there is no fundamental reason why they should be treated differently from any other data structure: their development and use should rely on an appropriate legal basis, they should implement measures to uphold the fundamental rights of data subjects, bad practices should be sanctioned, and so on. Trying to fit LLMs into the "anonymous data" box to avoid all that is a cop-out that doesn't really make sense from a technical standpoint.
4. Data protection law is linked to the processing of personal data. Each input of a prompt triggers a calculation in the AI model, in which the (personal) data represented in the form of parameters influences the calculation result. Does this calculation constitute processing of these data within the meaning of Article 4 No. 2 GDPR, even if the calculation result, i.e., the output of the AI model, is not personal?
Yes.
Personal data goes into the model. The model is made out of, among other things, personal data, memorized verbatim. This personal data gets used every time the model is queried. I don't know how you could possibly argue that this does not constitute, by definition, processing of personal data.
AI vendors have argued that the data inside of an LLM is very obfuscated, so it's not really used, and therefore it doesn't count. This makes no sense to me. First, all the empirical evidence around memorization and extraction attacks shows that this obfuscation is clearly not a reliable security or privacy control. Second, if this were true, it would be possible to remove that data from the training set without hurting the model's performance, which is exactly the opposite of what the current scientific consensus suggests.
5. Do you already have experience with methods that estimate the amount and type of personal data memorized, or whether the AI model used contains personal data of a specific individual (e.g., privacy attacks/PII extraction attacks, etc.)? If so, how do you assess their informative value and possible limitations?
My experience with this mainly includes reviewing reports and scientific papers from people who performed such audits. My two major takeaways from this line of work are as follows:
- Memorization is a critical component of how AI models operate; it grows with the size and complexity of the model, and nothing suggests that it is going to go away in future generations of models.
- Attacks keep getting better over time, so this kind of work can only ever tell us that a model has memorized at least this much personal data. One should always assume that a better attack could come around and show higher amounts of memorization and extraction.
6. What is the amount of personal memorized data in AI models you know (as a percentage and total amount of training data)?
I understand where this question is coming from, but it's not the right way to look at the problem. There's no way to clearly define what constitutes "personal data" in a massive unstructured dataset, and measure this in a meaningful way.
I also don't think that such a quantified framing is appropriate to evaluate privacy risk. If you fail to protect just 0.01% of the people whose data appears in a Web-scale dataset covering billions of individuals, you're putting on the order of half a million people at risk. That's bad! Further, harms are not uniformly distributed: privacy risks that feel acceptable for most people can translate to severe real-world harm for vulnerable populations and outliers. This is another reason why I advocate for manual audits and engagement with diverse stakeholders, instead of trying to compute average scores.
7. How do you proceed if a person exercises their right to access, rectify or erase their personal data in the AI model?
If a person exercises their right to access for an AI model you've trained, you should be able to tell them where their information appears in the training data, and send them a copy of this data. Doing a keyword search on large datasets is a solved problem, so this should be the minimum expected: it might not catch all the personal data about this person, but it's at least similar to the approach that e.g. search engines or archiving services implement. You should also tell them that their data may have been memorized by the AI model, even though this may be difficult to know for sure.
If a person exercises their right to erasure, you should first do the same thing as with the right of access: tell them where their information appears in the training data. This way, they can take appropriate steps to remove it going forward. Then, in future model training runs, you should take steps to avoid using this person's data (even if it still appears somewhere in the Web corpus — they might not have succeeded in removing it on their own), for example using keyword-based filters.
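As an illustration of both points, here is a minimal sketch of the kind of keyword-based matching described above: the same matcher can be used to answer an access request (find and return the matching records) and to honor an erasure request (exclude those records from the next training run). The identifiers and records are made-up placeholders; in practice this would run over sharded, Web-scale data rather than an in-memory list.

```python
# Minimal sketch of keyword-based matching over training records, used both
# for access requests (return matches) and erasure requests (exclude matches).
import re

def build_matcher(identifiers):
    """Compile a case-insensitive pattern from the person's known identifiers."""
    return re.compile("|".join(re.escape(i) for i in identifiers), re.IGNORECASE)

def find_matches(records, matcher):
    """Right of access: records that mention the person."""
    return [r for r in records if matcher.search(r)]

def exclude_matches(records, matcher):
    """Right to erasure: records to keep for future training runs."""
    return [r for r in records if not matcher.search(r)]

records = [
    "Contact Alice Example at alice@example.com for details.",
    "An unrelated forum post about gardening.",
]
matcher = build_matcher(["Alice Example", "alice@example.com"])
print(find_matches(records, matcher))     # goes into the access response
print(exclude_matches(records, matcher))  # goes into the next training run
```

As noted above, this will not catch every mention of the person (nicknames, misspellings, text embedded in images, etc.), but it is in line with what search engines and archiving services already do.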
This approach is not perfect: their data might still have been memorized by the model, and if that model's weights are public, that data is now on the public record. This is not great, and is one of the (many) reasons why current practices around LLM training are ethically problematic. But the fact that there is no perfect solution to this problem does not mean that we should give up on trying to uphold data subjects' fundamental rights whenever feasible.
There's an additional layer to data erasure. LLMs can return information about people when queried. This information might be accurate, or get some details wrong, or be completely made up. People should be able to object to this, and ask that LLMs do not return information about them when queried. The right approach here is twofold. First, one can take inspiration from the way search engines handle data removal requests, and implement these solutions in the "retrieval" step of retrieval-augmented generation. Second, one can use reinforcement learning to encourage LLMs to treat "this person is in the training data but has exercised their right to erasure" in the same way as "this person does not appear in the training data". This will inevitably be an imperfect approach, but it's probably the best one can do.
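Here is a minimal sketch of the first half of that approach: a removal list applied at the retrieval step of a retrieval-augmented generation pipeline, so that documents about people who exercised their right to erasure never reach the model as context. The document structure, the removal list, and the name matching are all illustrative assumptions; a real system would need much more robust entity matching.

```python
# Minimal sketch of honoring erasure requests at the retrieval step of a RAG
# pipeline: retrieved documents mentioning anyone on the removal list are
# dropped before they are passed to the LLM as context. Purely illustrative.
from dataclasses import dataclass

REMOVAL_LIST = {"alice example"}  # people who exercised their right to erasure

@dataclass
class Document:
    text: str
    mentioned_people: list[str]  # assumed to be extracted at indexing time

def filter_retrieved(documents: list[Document]) -> list[Document]:
    """Drop any retrieved document that mentions a person on the removal list."""
    return [
        doc for doc in documents
        if not any(name.lower() in REMOVAL_LIST for name in doc.mentioned_people)
    ]

retrieved = [
    Document("Alice Example was quoted in a 2019 article.", ["Alice Example"]),
    Document("A document about something else entirely.", []),
]
# Only the second document would be included in the prompt sent to the model.
print([doc.text for doc in filter_retrieved(retrieved)])
```

The second half, using reinforcement learning so the model itself treats such people as unknown, is a training-time intervention and doesn't reduce to a short snippet.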
If a person exercises their right to rectification, it makes more sense to treat it as a right to access request, and to offer them the possibility of exercising their right to erasure instead. Maintaining a list of changes to the original training data to rectify personal information would be very complex (what to do when the original data changes?) and brittle (what if new data about this person shows up in the training data at a later point?). And letting people influence what LLMs say about them would open up major avenues for abuse.
8. From your perspective, are there other aspects that play a role in the protection of personal data in AI models?
Translating what the law says into technical requirements is an interesting and fun intellectual exercise. But it can feel a bit pointless when organizations act like they can do whatever they want as long as they use publicly available data. There is nothing in the text or the intent of the GDPR that excludes publicly available data from compliance obligations. Quite the opposite, in fact: there are both regulatory texts and case law that describe how existing stores of public data (like search engines or the Internet Archive) should operate. LLM providers should not get a pass simply because they added additional layers of math and engineering complexity! But they certainly act like they do, be it when it comes to privacy or copyright issues.
The way to fix that is not by changing the law or publishing additional clarification documents: it's by significantly ramping up enforcement, increasing the number of investigations and the severity of fines. Using all that personal data indiscriminately to train massive models, without any real compliance story or any regard for people's fundamental rights… That was never really acceptable to begin with. I find it disappointing that regulators seem to be trying to retroactively make it work within existing legislative frameworks, as opposed to focusing on enforcing the law. I hope this changes, and that we see stricter enforcement actions going forward.
Thanks to Aleatha Parker-Wood, Conan Dooley, and Daniel Simmons-Marengo for their helpful feedback on earlier versions of this post.