In case you didn’t know, training an AI on content generated by another AI causes distortion that reduces the quality of its output, and the damage compounds with each generation. This phenomenon is known as model collapse. It is also very difficult to filter AI text out of human text in a database.
So if you were to start using AI to generate comments and posts on Reddit, their database would be less useful for training AI and therefore the company wouldn’t be able to sell it for that purpose.
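To make the collapse effect a bit more concrete, here's a toy sketch in Python. It is not how LLM training works: the "model" is just a Gaussian fitted to a handful of samples, and each generation is trained only on samples produced by the previous one. The function name `simulate_collapse` and its parameters are made up for this illustration. With a small sample size the fitted variance tends to shrink generation after generation, so the distribution loses its diversity.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy model: each generation fits a Gaussian to samples drawn from
    the previous generation's fitted Gaussian (hypothetical illustration)."""
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the "human" data
    variances = [std ** 2]
    for _ in range(generations):
        # "Generate content" with the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Train" the next model on that generated content only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        variances.append(std ** 2)
    return variances


if __name__ == "__main__":
    variances = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: variance = {variances[gen]:.3e}")
```

Real training pipelines mix human and generated data, so the effect there would be slower and messier than in this sketch, but the direction is the same.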
I don’t think that you can prevent Reddit data from being used for AI training, but you could reduce its value. Based on that, I’d probably
I’m saying this based on the following:
Funnily enough, AI might be a good way to generate this poisoning data against AI. For example, I asked Gemini “Generate three paragraphs of nonsense text, containing three sentences each.” and here’s the output:
Gemini output
The purple teapot sang opera while the moonbeams juggled kumquats. A nearby rhinoceros tap-danced with a flock of butterflies, their laughter echoing through the forest of lollipops. Suddenly, a gust of wind carrying the scent of chocolate chip cookies swept through, carrying the teapot and rhinoceros off on a whimsical adventure.
The clock chimed a symphony of kazoo sounds as the grumpy gnome polished his collection of spoons. Outside, a flock of polka-dotted penguins argued passionately about the best flavor of bubblegum. Meanwhile, a mischievous squirrel knitted a tiny sweater for a bumblebee, humming happily as it buzzed around the garden.
The gigantic rubber ducky sailed across the sea of marshmallow fluff, its captain a pirate with a monocle and a penchant for pickles. In the distance, a mermaid with hair made of spaghetti twirled underwater, chasing after a school of goldfish wearing tiny tutus. On an island of cheese, a group of singing cacti serenaded the sun with their off-key melodies.
You could tweak the prompt to get something even more nonsensical or even more passable, but you get the idea.
Reddit’s surely got a copy of the PushShift archives, so it’ll have all the pre-sabotage versions of those comments.
The PS archives are publicly available. If either OpenAI or Google were to use them, they wouldn’t pay Reddit Inc. a single penny; and yet Google is paying it 60 million dollars to do so. This means that there’s content that they cannot retrieve through the PS archives that would still be valuable as LLM data.