Reddit Will License Its Data to Train LLMs, So We Made a Firefox Extension That Lets You Replace Your Comments

Blaze@lemmy.blahaj.zone · 8 months ago

hperrin@lemmy.world · 8 months ago

That’s probably not going to be useful. Reddit keeps your original comment text.

Th4tGuyII@kbin.social · 8 months ago

Yeah, this is what I was thinking. Back during the first migration we all heard about people being unable to delete comments, or Reddit keeping comments even after account deletion. So what stops them from holding onto comment history, and what stops them from using it to teach LLMs to discern poisoned data from real data, as @pixxelkick said?

tehciolo@lemm.ee · 8 months ago

I think you missed the part where it was strongly suggested “not” to use copyrighted text.

The point is not to get rid of the original text. It’s to “poison” the training data.

Everythingispenguins@lemmy.world · edit-2 6 months ago

deleted by creator

FaceDeer@fedia.io · 8 months ago

If the AI trainers have the original text then “poisoning” the live site’s content isn’t going to do anything at all.

You can’t touch the original text. It’s already been archived.

tehciolo@lemm.ee · 8 months ago

If they scrape the updated comments again and ingest copyrighted text, you are poisoning the data.

FaceDeer@fedia.io · 8 months ago

That’s my point. They won’t.

And even if they did, it’s unclear that copyright has anything to say about AI training anyway.

InternetPerson@lemmings.world · 8 months ago

NYT is currently suing over copyright infringement.

https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html

it’s unclear that copyright has anything to say about AI training anyway

Although lawmakers worldwide slept while AI advanced, and therefore failed to pass some important laws, they are catching up. Europe recently passed its first AI Act. As far as I’ve seen, it also states that companies must disclose a detailed summary of their training data.

https://www.ml6.eu/blogpost/ai-models-compliance-eu-ai-act

FaceDeer@fedia.io · 8 months ago

You can sue about anything you want in the United States; it remains to be seen whether the courts will side with them. I think it’s unlikely they’ll get much of a win out of it.

A law that requires disclosing a summary of training data isn’t going to stop anyone from using that training data.

pixxelkick@lemmy.world · 8 months ago

Yeah, in fact you’re giving the LLM additional data to train on what poisoned data looks like so it can avoid it better, as they can clearly see the before vs. after.

InternetPerson@lemmings.world · 8 months ago

It is necessary to employ a method which enables the training procedure to distinguish copyrighted material. In the “dumbest” case, some humans will have to label it.

Just because you’ve edited a comment doesn’t mean it can be seen as “oh, this is under copyright now”.

I’m not saying it’s technically impossible. On the contrary, it very much is possible. It’s just more work. This drives the development costs up and can give some form of satisfaction to angered ex-Reddit users like me. However, those costs will be peanuts for giants like Google / Alphabet.

The Luddite