“Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease,” they added. “We term this condition Model Autophagy Disorder (MAD).”

Interestingly, this might be a more challenging problem as we increase the use of generative AI models online.
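
To make the loop concrete, here is a toy sketch (not the paper’s experiment): a one-dimensional Gaussian “model” is refit each generation to samples drawn from the previous generation’s fit. The function name `autophagous_loop`, the sample sizes, and the `fresh_fraction` parameter are assumptions made purely for illustration.

```python
import random
import statistics


def autophagous_loop(generations, n_samples, fresh_fraction, seed=0):
    """Refit a Gaussian to its own samples; return the fitted std per generation.

    fresh_fraction is the (hypothetical) share of each training set that is
    drawn from the real distribution N(0, 1) instead of from the current model.
    """
    rng = random.Random(seed)
    real_mu, real_sigma = 0.0, 1.0
    mu, sigma = real_mu, real_sigma  # generation 0 matches the real data
    n_fresh = round(fresh_fraction * n_samples)
    history = [sigma]
    for _ in range(generations):
        # Synthetic data comes from the current model, fresh data from reality.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_fresh)]
        fresh = [rng.gauss(real_mu, real_sigma) for _ in range(n_fresh)]
        training = synthetic + fresh
        # "Training" here is just a maximum-likelihood fit of mean and std.
        mu = statistics.fmean(training)
        sigma = statistics.pstdev(training)
        history.append(sigma)
    return history


closed = autophagous_loop(generations=300, n_samples=20, fresh_fraction=0.0)
mixed = autophagous_loop(generations=300, n_samples=20, fresh_fraction=0.5)
print(f"closed loop, final std: {closed[-1]:.6f}")
print(f"50% fresh data, final std: {mixed[-1]:.6f}")
```

With `fresh_fraction=0.0` the fitted spread should shrink toward zero over the generations (lost diversity), while mixing in fresh real samples each generation should keep it close to the real value of 1.0.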

  • Melody Fwygon@lemmy.one
    English · 7 upvotes · 1 year ago

    Muahahahahahahaha.

    Looks like we found a relatively easy way to “poison” an AI dataset silently. Just feed it AI output.

    I could see this mechanic being exploited by websites to provide a bottomless amount of junk text that only a bot doing content scraping would see.

  • h3ndrik@feddit.de
    English · 0 upvotes · edited · 1 year ago

    Wow. How is this going to affect all the projects that fine-tune Meta’s Llama model with synthetic training data?

    • lloram239@feddit.de
      English · 1 upvote · 1 year ago

      Not much at all, I would think. The Llama models get trained on the superior GPT-4 output, not on their own output. In general I think it’s a bit of an artificial problem; nobody really expects to train AI on its own output and get good results. What actually happens is AI being used to curate real-world data, and that curated data is used as input, which gives much better results than feeding the raw data directly into the AI (as can be seen with early LLMs that go completely off track and start repeating comment sections and HTML code that have nothing to do with your prompt but just happen to be part of raw websites).

      • h3ndrik@feddit.de
        English · 1 upvote · 1 year ago

        Thank you for explaining. Yes. Now that I have skimmed through the paper, I’m kind of disappointed in their work. It’s not a surprise to me that quality will degrade if you design a feedback loop with low-quality data. And does this even mean anything for a distinction between human and synthetic data? Isn’t it obvious a model will deteriorate if you feed it progressively lower-quality input, regardless of where you got that from? I’m pretty sure this is the mechanism behind that. A better question to ask would be: is there some point where synthetic output gets good enough to train something with it? And how far away is that point? Or can we rule that out because of some properties we can’t get around? I’m not sure if learning from your own output is even possible like this. I as a human certainly can’t teach myself. I would need some input like books or curated assignments/examples prepared by other people. There are intrinsic barriers when teaching oneself. However, I can certainly practice stuff. But that’s kind of a different mechanism, and difficult to compare to the AI stuff.

        I’m glad I can continue to play with the language models, have them tuned to follow instructions (with the help of GPT-4 data), etc.

  • ZickZack@kbin.social
    6 upvotes · 1 year ago

    That paper makes a bunch of (implicit) assumptions that make it pretty unrealistic: basically they assume that once we have decently working models, we would still continue to do normal “brain-off” web scraping.
    In practice you can use even relatively simple models to start filtering and creating more training data:
    Think of the original LLM as a huge trashcan into which you try to compress terabytes of mostly garbage web data.
    Then you use fine-tuning (like the instruction tuning used for the assistant models) to increase the likelihood of deriving non-trash from the model (or to accurately classify trash vs. non-trash).
    In general this will produce a dataset that is of significantly higher quality, simply because you got rid of all the low-quality stuff.

    This is not even a theoretical construction: Phi-1 (https://arxiv.org/abs/2306.11644) does exactly that to train a state-of-the-art language model on a tiny amount of high-quality data (the model is also tiny: only half a percent the size of GPT-3).
    Previously, TinyStories (https://arxiv.org/abs/2305.07759) showed something similar: you can build high-quality models with very little data, if you have good data (in the case of TinyStories they generate simple stories to train small language models).

    In general LLM people seem to re-discover that good data is actually good and you don’t really need these “shotgun approach” web scrape datasets.

    • wahming@kbin.social
      4 upvotes · 1 year ago

      Given the prevalence of bots and attempts to pass off fake data as real though, is there still any way to reliably differentiate good data from bad?

      • ZickZack@kbin.social
        1 upvote · 1 year ago

        Yes: keep in mind that with “good” nobody is talking about the content of the data, but rather how statistically interesting it is for the model.

        Really, what machine learning is doing is trying to estimate a probability distribution q from a finite set of samples x ~ p(x) drawn from the true distribution p.
        The problem with statistical learning is that we only ever see an infinitesimally small amount of the true distribution (we only have finite samples from an infinite sample space of images/language/etc…).

        So now what we really need to do is pick samples that adequately cover the entire distribution, without being redundant, since redundancy produces both more work (you simply have more things to fit against), and can obscure the true distribution:
        Let’s say that we have a uniform probability distribution over [1,2,3] (uniform means everything has the same probability of 1/3).

        If we faithfully sample from this we can learn a distribution that will also return [1,2,3] with equal probability.
        But let’s say we have some redundancy in there (either direct duplicates, or, in the case of language, close-to duplicates):
        The empirical distribution may look like {1,1,1,2,2,3} which seems to make ones a lot more likely than they are.
        One way to deal with this is to just sample a lot more points: if we sample 6000 points, we are naturally going to get closer to the true distribution (similar to how flipping a coin twice can give you 100% tails, even if the coin is actually fair; once you flip it more often, the observed frequency converges toward the true probability).

        Another way is to correct our observations towards what we already know to be true in our distribution (e.g. a direct 1:1 duplicate in language is presumably a copy-paste rather than a true increase in probability for a subsequence).
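
A small sketch of both remedies, using the numbers from the example above. The helper name `empirical` is made up for illustration, and plain exact-duplicate removal stands in for the “correct towards what we already know” step; that is only appropriate when repeats really are copy-paste artifacts rather than genuine frequency.

```python
import random
from collections import Counter


def empirical(samples):
    """Relative frequency of each distinct value in the sample."""
    counts = Counter(samples)
    total = len(samples)
    return {value: counts[value] / total for value in sorted(counts)}


# The redundant sample from the example: ones look far too likely.
skewed = [1, 1, 1, 2, 2, 3]
print("skewed: ", empirical(skewed))

# Remedy 1: sample a lot more points from the true uniform distribution.
rng = random.Random(0)
large = [rng.choice([1, 2, 3]) for _ in range(6000)]
print("n=6000: ", empirical(large))

# Remedy 2: treat exact repeats as copy-paste and drop them.
deduped = sorted(set(skewed))
print("deduped:", empirical(deduped))
```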

        <continued in next comment>

        • ZickZack@kbin.social
          1 upvote · edited · 1 year ago

          The “adequate covering” of our distribution p is also pretty self-explanatory: We don’t need to see the statement “elephants are big” a thousand times to learn it, but we do need to see it at least once:

          Think of the p distribution as e.g. defining a function on the real numbers. We want to learn that function using a finite amount of samples. It now makes sense to place our samples at interesting points (e.g. where the function changes direction), rather than just randomly throwing billions of points against the problem.
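
As a toy illustration of the “interesting points” idea, take f(x) = |x| on [-1, 1] and approximate it by straight lines between sample points. The function names and the choice of f are assumptions for the sketch; the point is only that three samples that include the kink at 0 can beat four evenly spaced samples that miss it.

```python
from bisect import bisect_right


def interpolate(knots, values, x):
    """Piecewise-linear interpolation through (knots[i], values[i])."""
    i = bisect_right(knots, x) - 1
    i = max(0, min(i, len(knots) - 2))  # clamp to a valid segment
    x0, x1 = knots[i], knots[i + 1]
    y0, y1 = values[i], values[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def max_error(f, knots, grid):
    """Worst-case gap between f and its interpolant over the grid."""
    values = [f(k) for k in knots]
    return max(abs(f(x) - interpolate(knots, values, x)) for x in grid)


grid = [i / 100 - 1 for i in range(201)]  # evaluation points on [-1, 1]

informed = [-1.0, 0.0, 1.0]           # 3 samples, one placed at the kink
uniform = [-1.0, -1 / 3, 1 / 3, 1.0]  # 4 samples, evenly spaced, kink missed

print("informed:", max_error(abs, informed, grid))
print("uniform: ", max_error(abs, uniform, grid))
```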

          That means that even if our estimator is bad (i.e. it can barely distinguish real and fake data), it is still better than just randomly sampling (e.g. you can say “let’s generate 100 samples of law, 100 samples of math, 100 samples of XYZ,…” rather than just having a big mush where you hope that everything appears).
          That makes a few assumptions: the estimator is better than 0% accurate, the estimator has no statistical bias (e.g. the estimator didn’t learn things like “add all sentences that start with an A”, since that would shift our distribution), and some other things that are too intricate to explain here.

          Importantly: even if your estimator is bad, it is better than not having it. You can also manually tune it towards being a little bit biased, either to reduce variance (e.g. let’s filter out all HTML code), or to reduce the impact of certain real-world effects (like that most stuff on the internet is English: you may want to balance that down to get a more multilingual model).
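
A back-of-the-envelope sketch of why a weak filter still helps, under assumptions that are mine and not from the thread: the estimator labels each sample “good” or “bad”, is right with the same probability `accuracy` on both kinds, and its mistakes do not depend on content. `kept_good_fraction` is a hypothetical helper that applies Bayes’ rule to get the share of good samples among those the filter keeps.

```python
def kept_good_fraction(base_rate, accuracy):
    """Share of good samples among those the filter keeps (Bayes' rule).

    base_rate: fraction of good samples in the unfiltered pool.
    accuracy:  probability the filter labels a sample correctly
               (assumed equal for good and bad samples).
    """
    kept_good = base_rate * accuracy               # good, correctly kept
    kept_bad = (1.0 - base_rate) * (1.0 - accuracy)  # bad, wrongly kept
    return kept_good / (kept_good + kept_bad)


# A pool where only 20% of the raw data is good.
for accuracy in (0.5, 0.6, 0.9):
    print(accuracy, round(kept_good_fraction(0.2, accuracy), 3))
```

In this simplified model the break-even point is chance level (50% accuracy): anything above it enriches the kept set relative to the raw pool, and the gain grows quickly as the estimator improves.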

          However, you have to note here that these are LANGUAGE MODELS. They are not everything models.
          These models don’t aim for factual accuracy, nor do they have any way of verifying it: That’s simply not the purview of these systems.
          People use them as everything models, because empirically there’s a lot more true stuff than nonsense in those scrapes and language models have to know something about the world to e.g. solve ambiguity, but these are side-effects of the model’s training as a language model.
          If you have a model that produces completely realistic (but semantically wrong) language, that’s still good data for a language model.
          “Good data” for a language model does not have to be “true data”, since these models don’t care about truth: that’s not their objective!
          They just complete sentences by predicting the next token, which is independent of factuality.
          There are people working on making these models more factual (same idea: you bias your estimator towards things that are more likely to be true, like boosting reliable sources such as Wikipedia, rather than training on uniformly weighted web scrapes), but to do that you need a lot more overview over your data, for which you need more efficient models, for which you need better distributions, for which you need better estimators (though in that case they would be “factuality estimators”).
          In general though the same “better than nothing” sentiment applies: if you have a sampling strategy that is not completely wrong, you can still beat completely random sampling. If your estimator is good, you can substantially beat it (and LLMs are pretty good at almost everything, which means you will get pretty good samples if you just sample according to the probability that the LLM tells you “this data is good”).

          For actually making sure that the stuff these models produce is true, you need very different systems that actually model facts, rather than just modelling language. Another way is to remove the bottleneck of machine learning models with respect to accuracy (i.e. you build a model that may be bad, but can never give you a wrong answer):
          One example would be vector-search engines that, like search engines, retrieve information from a corpus based on the similarity as predicted by a machine learning model. Since you retrieve from a fixed corpus (like Wikipedia) the model will never give you wrong information (assuming the corpus is not wrong)! A bad model may just not find the correct entry (e.g. the right Wikipedia article) to present to you.
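
A minimal sketch of that retrieval idea. A bag-of-words count vector stands in for the learned embedding model (that substitution, the three-sentence corpus, and the function names are all assumptions for illustration). Whatever the similarity function does, `retrieve` can only ever return an entry that is already in the corpus.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in 'embedding': word counts instead of a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus):
    """Return the corpus entry most similar to the query."""
    q = embed(query)
    return max(corpus, key=lambda doc: cosine(q, embed(doc)))


corpus = [
    "elephants are large mammals",
    "paris is the capital of france",
    "python is a programming language",
]

print(retrieve("what is the capital of france", corpus))
# A query the "model" cannot match still yields a corpus entry, never invented text.
print(retrieve("zzz", corpus))
```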

  • Exaggeration207@beehaw.org
    English · 8 upvotes · 1 year ago

    I only have a small amount of experience with generating images using AI models, but I have found this to be true. It’s like making a photocopy of a photocopy. The results can be unintentionally hilarious though.

  • argv_minus_one@beehaw.org
    English · 41 upvotes · 1 year ago

    Note that humans do not exhibit this property when trained on other humans, so this would seem to prove that “AI” isn’t actually intelligent.

    • FaceDeer@kbin.social
      1 upvote · 1 year ago

      Humans are not entirely trained on other humans, though. We learn plenty of stuff from our environment and experiences. Note this very important part of the primary conclusion:

      without enough fresh real data in each generation

        • FaceDeer@kbin.social
          2 upvotes · edited · 1 year ago

          Dogs can do math and I’m quite sure I’ve never taught my dog that deliberately.

          Even for humans learning it, I would expect that most of our understanding of math comes from everyday usage of it rather than explicit rote training.

    • PenguinTD@lemmy.ca
      English · 13 upvotes · 1 year ago

      Do we even need to prove this? Anyone who has studied a bit of how generative AI works knows it’s not intelligent.

    • echo@sopuli.xyz
      English · 5 upvotes · edited · 1 year ago

      I don’t think LLMs are intelligent, but “does it work the same as humans” is a really bad way to judge something’s intelligence.

      • frog 🐸@beehaw.org
        English · 12 upvotes · 1 year ago

        Even if we look at other animals, when they learn by observing other members of their own species, they get more competent rather than less. So AIs are literally the only thing that get worse when trained on their own kind, rather than better. It’s hard to argue they’re intelligent if the answer to “does it work the same as any other lifeform that we know of?” is “no”.

        • FaceDeer@kbin.social
          2 upvotes · 1 year ago

          Are there any animals that only learn by observing the outputs of members of their own species? Or is it a mixture of that and a whole bunch of their own experiences with the outside world?

          • lol3droflxp@kbin.social
            2 upvotes · 1 year ago

            I mean, it’s always a mixture but yes, animals can learn new behaviours purely by watching (corvids and monkeys for example).

            • FaceDeer@kbin.social
              1 upvote · 1 year ago

              “It’s always a mixture” is the key part, though. We haven’t run an experiment like this on a human or animal (and even if it were practical to do so it’d probably be horribly abusive).

          • frog 🐸@beehaw.org
            3 upvotes · 1 year ago

            Humans (and animals) learn through a combination of their own experiences and observing the experiences of others. But this actually proves my point: if you feed an AI its own experiences (content it has created in response to prompts) and the experiences of other AIs (content they have produced in response to prompts), it cycles itself into oblivion. This is ultimately because it cannot create anything new.

            This is why Model Autophagy Disorder occurs, I think. Humans, when put in repetitive scenarios, will actively work to create new stimuli to respond to. This varies from livening up a boring day by doing something ridiculous, to hallucinating when kept in extreme sensory deprivation. The human mind’s defence against repetitive stimuli is to literally create something new. But the AIs can’t do that. They literally can’t. They can’t create anything that doesn’t have a basis in their training data, and when exposed only to iterations of their own training data (which is ultimately what all AI-generated content is: iterations of the training data), there is no process that allows them to break out of that repetitive cycle. They end up just spiralling inwards.

            From a certain perspective, AIs are therefore essentially parasites. They cannot progress without sucking in more human-generated content. They aren’t self-sustaining on their own, because they literally cannot create the new ideas needed to prevent degradation of their own data sets.

            From your other comments here, it seems like you’re imagining a fully conscious mind sitting alone in a box, with nothing to react to. But that’s not the case: AIs aren’t sapient, going mad from a lack of stimulation. They are completely dormant until prompted to do something, and then they create an output that is statistically likely from the data set they’ve been trained on. If you add no new data, the AI doesn’t change. It doesn’t seek new stimuli. It doesn’t create new ideas while waiting for someone to prompt it. The only way it can change and create anything new is if it’s given more human-generated content to work with. If you give it content from other AIs, that alters the statistical probabilities behind its output. If the AIs were actually conscious minds sitting alone in boxes, then exposing them to content created by other AIs would, in fact, be new stimuli that could generate new ideas, in the same way that a lonely human meeting another lonely human would quickly strike up a conversation and get all kinds of ideas.

    • ParsnipWitch@feddit.de
      English · 6 upvotes · 1 year ago

      Current AI is not actually “intelligent” and, as far as I know, not even their creators directly describe them as that. The programs and models existing at the moment aren’t capable of abstract thinking or reasoning and other processes that make an intelligent being or thing intelligent.

      The companies involved are certainly eager to create something like a general intelligence. But even when they reach that goal, we don’t know yet if such an AGI would even be truly intelligent.

    • lloram239@feddit.de
      English · 0 upvotes · 1 year ago

      Key point here being that humans train on other humans, not on themselves. They are also always exposed to the real world.

      If you lock a human in a box and only let them interact with themselves they go a bit funny in the head very quickly.

      • ParsnipWitch@feddit.de
        English · 4 upvotes · 1 year ago

        The reason is different from what is happening with AI, though. Sensory deprivation or extreme isolation and the Ganzfeld effect lead to hallucinations because our brain seems to have to constantly react to stimuli in order to keep functioning. Our brain starts creating things from imagination.

        With AI it is the other way around. They lose information when presented with the same data again and again because their statistical models look for probabilities.

    • h3ndrik@feddit.de
      English · 6 upvotes · 1 year ago

      Weren’t the echo chambers during the COVID pandemic kind of proof that humans DO exhibit the same property? A good number of people will start repeating stuff about nanoparticles, or claim that some black lint in a mask is worms that will control your brain.

  • frog 🐸@beehaw.org
    English · 10 upvotes · 1 year ago

    Good!

    Was that petty?

    But, you know, good luck completely replacing human artists, musicians, writers, programmers, and everyone else who actually creates new content, if all generative AI models essentially give themselves prion diseases when they feed on each other.

      • frog 🐸@beehaw.org
        English · 4 upvotes · 1 year ago

        I absolutely agree! I’ve seen so many proponents of AI argue that AI learning from artworks scraped from the internet is no different to a human learning by looking at other artists, and while anyone who is actually an artist (or involved in any creative industry at all, including things like coding that require a creative mind) can see the difference, I’ve always struggled to coherently express why. And I think this is it. Human artists benefit from other human art to look at, as it helps them improve faster, but they don’t need it in the same way, and they’re more than capable of coming up with new ideas without it. Even a brief look at art history shows plenty of examples of human artists coming up with completely new ideas, artworks that had absolutely no precedent. I really can’t imagine AI ever being able to invent, say, Cubism without having seen a human do it first.

        I feel like the only people that are in favour of AI artworks are those who don’t see the value of art outside of its commercial use. They’re the same people who are, presumably, quite happy playing the same same-y games and watching same-y TV and films over and over again. AI just can’t replicate the human spark of creativity, and I really can’t see it being good for society either economically or culturally to replace artists with algorithms that can only produce derivations of what they’ve already seen.

  • feeltheglee@beehaw.org
    English · 8 upvotes · 1 year ago

    You know how when you’re on a voice/video call and the audio keeps bouncing between two people and gets all feedback-y and screechy?

    That, but with LLMs.

  • voluntaryexilecat@lemmy.dbzer0.com
    English · 7 upvotes · 1 year ago

    But…isn’t unsupervised backfeeding the same as simply overtraining the same dataset? We already know overtraining causes broken models.

    Besides, the next AI models will be fed with the interactions between humans and AI, not just its own content. ChatGPT already works like this: it learns with every interaction, every chat.

    And the generative image models will be fed with AI-assisted images where humans will have fixed flaws like anatomy (the famous hands) or other glitches.

    So as interesting as this is, as long as humans interact with AI, the hybrid output used for training will contain enough new “input” to keep the models on track. There are already refined image generators trained with their own but human-assisted output that are better than their predecessors.

    • FaceDeer@kbin.social
      3 upvotes · edited · 1 year ago

      People in this thread seem really eager to jump to any “aha, AIs aren’t intelligent after all” conclusions they can grab hold of. This experiment isn’t analogous to anything that we put real people or animals through and seems like a relatively straightforward thing to correct for in future AI training.

  • Amax@lemmy.ca
    English · 5 upvotes · 1 year ago

    MadAI’s disease.

    I guess we didn’t learn when we did it with cows.

    • FaceDeer@kbin.social
      2 upvotes · 1 year ago

      I wouldn’t base any expectations about real-world artificial intelligence off of a 27-year-old sci-fi comedy romance. With a 6/10 IMDB rating at that, if you really want to use pop culture as a basis for scientific thought.

  • Cybrpwca@beehaw.org
    English · 6 upvotes · 1 year ago

    So we have generation loss instead of AI making better AI. At least for now. That’s strangely comforting.

    • FaceDeer@kbin.social
      1 upvote · 1 year ago

      The summary said:

      without enough fresh real data in each generation

      So as long as you’re mixing enough fresh data in you should be fine.