• FaceDeer@kbin.social
    1 year ago

    For those who can’t get through the paywall, this is an article about a system called Kudurru that monitors a number of websites hosting images listed in the LAION-5B metadata set. When it sees the same IP address downloading images from several of those websites simultaneously, it assumes the address belongs to a bot scraping the data to train an AI, and either blocks it or “poisons” the scrape by sending incorrect images back.
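
To make the detection idea concrete, here is a rough sketch of the "same IP hitting several monitored sites at once" check as described above. This is not Kudurru's actual code; the class name, the 60-second window, the three-site threshold, and the `respond` helper are all assumptions made up for illustration.

```python
from collections import defaultdict, deque


class CrossSiteScrapeDetector:
    """Flags an IP seen on several monitored sites within a short time window.

    Hypothetical sketch: the window and threshold values are illustrative
    assumptions, not values taken from the article.
    """

    def __init__(self, window_seconds=60.0, site_threshold=3):
        self.window_seconds = window_seconds
        self.site_threshold = site_threshold
        # Maps each IP to its recent (timestamp, site) requests, oldest first.
        self._recent = defaultdict(deque)

    def observe(self, ip, site, timestamp):
        """Record one image request; return True if the IP now looks like a scraper.

        Assumes timestamps arrive in non-decreasing order.
        """
        events = self._recent[ip]
        events.append((timestamp, site))
        # Drop requests that have fallen out of the time window.
        while events and timestamp - events[0][0] > self.window_seconds:
            events.popleft()
        distinct_sites = {s for _, s in events}
        return len(distinct_sites) >= self.site_threshold


def respond(flagged, real_image, decoy_image, poison=True):
    """Serve the real image normally; block (None) or send a decoy if flagged."""
    if not flagged:
        return real_image
    return decoy_image if poison else None
```

A site in the network would call `observe()` on each image request and pass the result to `respond()`. The sketch also shows why this is easy to get around: a scraper that spreads requests across many IPs or slows down never crosses the threshold.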

    Frankly, I don’t see much likely impact from this. AI training has moved beyond simply using LAION-5B; we’re discovering that a smaller, higher-quality dataset works better than just throwing mountains of data at the AI in training. So anything a trainer downloads is going to be extensively curated before being used for training, and this sort of obstruction will be fixed or filtered out.

    • mkhoury@lemmy.ca
      1 year ago

      But the main result is achieved anyway, right? The picture that the system tried to download did not make it into the training set.

      • FaceDeer@kbin.social
        1 year ago

        Unless the “this sort of obstruction will be fixed” part means the image is downloaded anyway. This is the weakest sort of DRM.