Are LLMs Trained On My Work?

Today someone asked me, “does GitHub Copilot use my code to train the AI?” The answer isn’t simple, so I thought I’d lay out the different angles. In the process, I hope you’ll come away with enough familiarity with LLMs that you can answer the question for yourself.

But first, what did they mean? I hear the question a lot, but it can be coming from a few different places:

  1. Fear: They’re stealing my work to make some AI company rich
  2. Excitement: I want an LLM that’s customized to me, does this do that?
  3. Security: Will my work be shown to some random person on the internet?

The question can be loaded. To make matters more complicated, the answers aren’t as clear-cut as they should be. It’s best to start from the top and explain how LLMs work.

The Training Process

There are two major steps when training an LLM:

  1. Pretraining
  2. Fine tuning

There’s also a third mechanism at play: in-context learning. We’ll talk about all three.

Pretraining

Pretraining produces a mostly functional model that behaves poorly.

The model is trained on massive amounts of text with no task-specific objective. This seems a little crazy to me, because traditionally we trained models to solve a specific problem and carefully measured how well it worked. But LLMs just absorb statistics about the text, like which words tend to show up together. For example:

  • The pitch shifter is expensive
  • That pitch is so high it makes my ears hurt
  • The pitch was so easy that the batter smashed it out of the park
  • We covered the roof in pitch

In each of those, the model learns that pitch can mean very different things depending on whether we’re talking about guitars, sounds, baseball or roofing.
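
To make this concrete, here’s a toy sketch in Python of the kind of co-occurrence statistics pretraining picks up. A real LLM trains a neural network to predict the next token over trillions of tokens; this bigram counter only captures the flavor of the idea, using the “pitch” sentences above.

from collections import Counter, defaultdict

# The four "pitch" sentences from above, as a tiny pretraining corpus.
corpus = [
    "the pitch shifter is expensive",
    "that pitch is so high it makes my ears hurt",
    "the pitch was so easy that the batter smashed it out of the park",
    "we covered the roof in pitch",
]

# Count which word follows which -- the crudest possible version of
# "which words show up together".
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

# "pitch" is followed by different words depending on the topic:
print(following["pitch"])
# Counter({'shifter': 1, 'is': 1, 'was': 1})

Even at this toy scale, the model’s “knowledge” of pitch is just statistics over contexts. Scale that up by a dozen orders of magnitude and you get the pretraining step.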

Pretraining is where most of the “knowledge” is acquired[1]. Knowledge is important for communication, because it helps us keep track of context. That’s something I hadn’t appreciated until LLMs came on the scene. So much of what we say never actually makes it into words, and a big reason why LLMs appear so intelligent is that they’re finally able to fill in the blanks.

The other day my three year old was at a clothing store with my wife and she looked up inquisitively at a mannequin. She asked, “is it dead?”

In a way, kids are going through pretraining much like LLMs. They’re learning when a word makes sense. Mastery of language and knowledge seem to be deeply intertwined. The mistakes a kid makes are cute and funny, but, much like LLM hallucinations, they’re caused by inadequate knowledge about the world.

Fine tuning

Next they’re fine tuned. This is where behavior is taught. Some examples:

  • Q/A models are taught to answer the question, not trail off
  • Chat models are taught to respond politely to the user
  • Coding models are drilled on following Python or JavaScript syntax
  • Models are reprimanded for making up stories about political figures or creating fake news

The goals of pretraining are fairly aimless, whereas the goals of fine tuning are clear. The training process here looks a lot like teaching a child to have manners. Give them a cookie when they clean their room, hide their toys when they key your car. There are a lot of variations on how to fine tune, but they mostly all reward good behavior and punish bad behavior.
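
As a concrete sketch of what that reward/punish data can look like: preference-based methods such as RLHF and DPO train on records roughly like the Python literals below. The prompts and responses are made-up illustrations, not from any real dataset.

# Hypothetical preference pairs for fine-tuning. A reward model (or a direct
# preference objective like DPO) pushes the LLM toward "chosen" responses and
# away from "rejected" ones.
preference_pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",     # rewarded: answers directly
        "rejected": "France is a country in Europe...",  # punished: trails off
    },
    {
        "prompt": "Write a news story about Senator Smith.",
        "chosen": "I can't invent news about a real person, but I can "
                  "summarize published reporting if you'd like.",
        "rejected": "BREAKING: Senator Smith was arrested today...",  # punished: fake news
    },
]

Each pair is a tiny lesson in behavior: the cookie goes to “chosen”, the toys get hidden for “rejected”.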

Run Time Learning

Generally, in machine learning, a model is frozen after training. LLMs are no different; once trained, the weights don’t change. But that’s also misleading…

With LLMs, we started to notice something called “in-context learning”: you can give the model examples of how to solve a problem and it actually learns how to do it.

Example: Reformat logs

Format this text for me:

10/31/23 DEBUG [main] Entered with block; count=5, latency=7.2

It won’t know how to process that, but if you first give it a few examples, it can do it fine:

I want you to format some text for me.

Example inputs:
10/31/23 DEBUG [main] Entered with block; count=5, latency=7.2
10/31/23 INFO [get_luster] Calculating luster; hue=183, intensity=14, label=red
10/31/23 DEBUG [_save_inner] Saved data; gem_count=13 
11/01/23 DEBUG [main] Entered with block; count=5, latency=7.2

Example outputs:
{"date":"2023-10-31", "level":"DEBUG", "function":"main", "message":"Entered with block", stats:{"count":5, "latency":7.2}}
{"date":"2023-10-31", "level":"INFO", "function":"get_luster", "message":"Calculating luster", stats:{"hue":183, "intensity":14, "label":"red"}}
{"date":"2023-10-31", "level":"DEBUG", "function":"_save_inner", "message":"Saved data", stats:{"gem_count":13}}
{"date":"2023-11-01", "level":"DEBUG", "function":"main", "message":"Entered with block", stats:{"count":5, "latency":7.2}}

Input:
12/15/23 TRACE [calculate_density$loop$$] Finished loop; iterations=1384, result=1,8,3

It has to:

  1. Recognize the task
  2. Resolve ambiguities (e.g. how to handle 1,8,3?)
  3. Solve the task

In other words, the model has to learn how to do the task before it does it. Researchers noticed that, internally, the LLM is running some kind of learning algorithm roughly equivalent to stochastic gradient descent[2], the algorithm that made deep learning successful in the early 2010s.

It’s not limited to learning tasks. We can also cram new information into the message and tell the LLM to reference it.

So even though the model is frozen, it’s still performing some kind of learning. On the other hand, once it finishes processing my request, that learning is gone. It’s ephemeral. Gone with the wind.
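
You can see the ephemerality in code. Here’s a hedged sketch using the official openai Python package (the model name and log lines are my assumptions): the API is stateless, so the examples the model “learned” from have to ride along in every single request, and they vanish the moment the request ends.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The few-shot examples live only in the messages we send -- the model's
# weights never change.
messages = [
    {"role": "user", "content": (
        "I want you to format some text for me.\n"
        "Example input:\n"
        "10/31/23 DEBUG [main] Entered with block; count=5, latency=7.2\n"
        "Example output:\n"
        '{"date":"2023-10-31", "level":"DEBUG", "function":"main", '
        '"message":"Entered with block", "stats":{"count":5, "latency":7.2}}\n'
        "\n"
        "Input:\n"
        "10/31/23 INFO [get_luster] Calculating luster; hue=183, intensity=14"
    )},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works; this choice is an assumption
    messages=messages,
)
print(response.choices[0].message.content)

# A second call with fresh messages knows nothing about the log format:
# whatever was "learned" above evaporated with the request.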

Are LLMs Being Trained On My Code?

Well, now we understand the question with a little more nuance.

  1. Pretraining: Maybe a new version of the model 2 years from now, but not any time soon.
  2. Fine tuning: Unlikely. Someone would have to read & understand your code. That doesn’t scale.
  3. In-context learning: Definitely. All the time.

The best way to customize a model to a business is to jam text into a chat before you start a conversation. Try this:

  1. Find a web page or Word doc
  2. Copy all text
  3. Paste it into ChatGPT with “Remember this:” prepended
  4. Use the ChatGPT session to ask questions about the document

It works great! You can have your own customized model in a few seconds! You can go a lot deeper with RAG (Retrieval Augmented Generation) too.
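
For a feel of what RAG adds on top of that copy/paste trick, here’s a minimal sketch, assuming the official openai package, numpy, and some made-up document chunks: embed the chunks once, find the chunk closest to the question, and stuff just that chunk into the prompt.

import numpy as np
from openai import OpenAI

client = OpenAI()

# Hypothetical company documents, pre-split into chunks.
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Enterprise customers get a dedicated account manager.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

chunk_vecs = embed(chunks)

question = "When can I return an item?"
q_vec = embed([question])[0]

# Cosine similarity between the question and every chunk; keep the best one.
scores = chunk_vecs @ q_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
)
best_chunk = chunks[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="gpt-4o-mini",  # model choice is an assumption
    messages=[{"role": "user", "content":
               f"Remember this:\n{best_chunk}\n\nQuestion: {question}"}],
)
print(answer.choices[0].message.content)

In production you’d embed the documents once and store the vectors, but the shape of the trick is the same as the copy/paste version above.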

Now that we’ve got some context, let’s talk about the other concerns.

Fear: Is [OpenAI] stealing my code?

The problem all cloud services face is convincing customers that they aren’t looking at or profiting from your data. Cloud providers like AWS had every incentive not to; otherwise they’d never convince laggards to migrate to the cloud. AI companies like OpenAI are different. The cloud barrier has already been worn down, yet they benefit monstrously from being able to train models on your data.

According to their privacy policy, OpenAI definitely does train models on your data. Azure OpenAI, however, is extremely clear that it does not. That makes sense: Microsoft is primarily a traditional cloud provider, and that guarantee is the main reason people use Azure OpenAI instead of OpenAI directly.

Conclusion

I hope you have an appreciation for the nuance here. If the concern is coming mainly from a place of fear, I suggest moving away from ChatGPT. OpenAI sells ChatGPT Enterprise, but I would be suspicious of that as well.

If you’re excited about AI and want a model customized to you, I hope you now understand that it’s quite easy. Some systems even put data access controls in place that look exactly like pre-AI data access controls, so the RAG approach is a clear winner in a lot of ways.