The big AI models are running out of training data (and it turns out most of the training data was produced by fools and the intentionally obtuse), so this might mark the end of rapid model advancement

  • SUPAVILLAIN@lemmygrad.ml
    English
    arrow-up
    55
    ·
    edit-2
    22 days ago

    While synthetic data is a thing, you’ve really gotta wonder how often you can train a model on basically empty calories before the hallucination rate starts going up.

    I, for one, hope the theftbots die.

    • KnilAdlez [none/use name]@hexbear.net
      English
      arrow-up
      24
      ·
      22 days ago

      I was reading an article about how ChatGPT will sometimes go on existential rants, and I figure it’s probably because so much of the training data is now generated by LLMs and posted on the internet. Probably a glut of people posting “I asked ChatGPT what it was like to be a robot” and things of that nature.

  • lurkerlady [she/her]@hexbear.net
    English
    arrow-up
    35
    ·
    edit-2
    22 days ago

    This is accurate, though I am actually going to explain why. These big model companies (Google, ClosedAI, etc.) parasitize the open-weights/open-source community that actually makes good LoRAs, fine-tunes, and research papers. Consumer hardware simply hasn’t gotten good and cheap enough for very good fine-tune training, and that’s why this is all slowly petering out. In a couple of generations of consumer GPUs, once we get consumer GPUs geared towards AI (i.e., very high VRAM counts, something like 70 GB+, at an affordable sub-700 USD price), we might see another leap forward in this tech. I will say that this mostly pertains to LLMs, though; generative AI models like Stable Diffusion have a lot of tricks up their sleeves that can still be explored. Most recent research and tweaking has been about building a structure for the AI to build on, to guide it rather than letting it take random stabs at things, in order to improve outputs. Some people have been doing things like hard-coding color theory, how to frame a photograph, etc., and interpreting human language to trigger that hard-coded logic.

    We’ve had statistical models like these since the 50s. Consumer hardware has always been the big materialist bottleneck; this is all powered by small research teams and hobbyist nerds. You can throw a ton of money at it and have a giant research team, but adding 400b more parameters to your 13b model or running a gigantic locked-down datacenter gets you diminishing returns in performance.

    Also, synthetic data can be useful. People are hating on it in this thread, but it’s a great way to reinforce good habits in the AI and to interpret garbled code and speech that would otherwise confuse the AI. I sometimes feel like people just see something about ‘AI bad’ and upvote it without trying to understand it, where it is useful and where it is not, and so on.

      • SUPAVILLAIN@lemmygrad.ml
        English
        arrow-up
        13
        ·
        edit-2
        22 days ago

        That’s where I’m at. Sure, there might be moderately-beneficial use-cases, maybe; but it doesn’t change the fact that there’s no such thing as an ethically-trained model, and there’s still no such thing as a model that wasn’t created based on rampant theft by capitalists, so I consider anything that comes of it fruit of the poison tree.

        AI bad until the base that comprises it radically changes, across the board.

        • lurkerlady [she/her]@hexbear.net
          English
          arrow-up
          11
          ·
          edit-2
          22 days ago

          “Sure, there might be moderately-beneficial use-cases, maybe; but it doesn’t change the fact that there’s no such thing as an ethically-trained model, and there’s still no such thing as a model that wasn’t created based on rampant theft by capitalists, so I consider anything that comes of it fruit of the poison tree.”

          I mean, that’s just the case with everything, really. There are a lot of very good use cases that mostly have to do with data manipulation, but the coolest ones are in translation. I think we’re approaching a point where small models provide very accurate translations and even translate tone and intent properly, which is far superior to simple dictionary translation methods. I think it’s very possible that new phones could be outfitted with tensor cores and you could have a real-time universal translator in your hand, though it’ll likely only add ‘subtitles’ irl for you. AI speech recognition has also been very good and can be miniaturized. This is the use case I’m most excited for, personally, as a communist. Currently, translating in a foreign country requires a lot of typing (if you don’t have a perfect grasp of the language), and I feel that removes a very human element from conversation. If everyone could locally run a subtitle-translation app, it’d be amazing for all of humanity.

          There are of course plenty of manufacturing use cases as well. China is spearheading that, though there is some work being done in the US too, in the few industries that remain.

        • bazingabrain [comrade/them]@hexbear.net
          English
          arrow-up
          10
          ·
          22 days ago

          “AI bad until the base that comprises it radically changes, across the board.”

          which won’t happen, which is why I and 650k others moved to Cara and gave Meta the finger.

      • lurkerlady [she/her]@hexbear.net
        English
        arrow-up
        9
        ·
        edit-2
        22 days ago

        Synthetic data is basically a fancy way of saying ‘I’m properly formatting data and reinforcing the ai’s good outputs’. Rearranging words, fixing / adding tags, that sort of thing. This is generated with various tools that usually have an LLM or VLM plugged in, though some are as simple as a regex script.
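
        As a rough illustration of the "as simple as a regex script" end of that spectrum, here is a minimal sketch of a tag-cleanup pass. The function name `normalize_tags` and the comma-separated caption format are assumptions made for this example, not something taken from any specific tool:

```python
import re

def normalize_tags(caption: str) -> str:
    """Clean a tag caption: lowercase, trim, de-duplicate, keep original order.

    Hypothetical helper; the comma/semicolon-separated format is an assumption.
    """
    # Split on commas or semicolons, tolerating stray whitespace around them.
    raw_tags = re.split(r"\s*[,;]\s*", caption.strip())
    seen = set()
    cleaned = []
    for tag in raw_tags:
        # Collapse underscores and runs of whitespace into single spaces.
        tag = re.sub(r"[_\s]+", " ", tag).strip().lower()
        # Skip empty entries (e.g. from ",,") and repeated tags.
        if tag and tag not in seen:
            seen.add(tag)
            cleaned.append(tag)
    return ", ".join(cleaned)

print(normalize_tags("Red_Hair,  red hair ;Blue  Sky,,portrait "))
# prints: red hair, blue sky, portrait
```

        Pipelines that put an LLM or VLM in the loop would replace or follow a step like this with model-generated rewrites; that part is not shown here.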

    • MacN'Cheezus@lemmy.today
      English
      arrow-up
      3
      ·
      22 days ago

      Better hardware isn’t going to change anything except scale if the underlying approach stays the same. LLMs are not intelligent, they’re just guessing a bunch of words that are statistically most likely to satisfy the user’s request based on their training data. They don’t actually understand what they’re saying.
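
      To make "guessing the statistically most likely word" concrete, here is a deliberately tiny sketch using bigram counts. This is a toy stand-in, not how real LLMs are built (they use neural networks over tokens and much longer context); the corpus and the function names `train_bigrams` and `most_likely_next` are made up for illustration:

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def most_likely_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Toy training data (an assumption for this example).
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(most_likely_next(model, "the"))  # prints: cat
```

      The prediction comes purely from counts in the training text: "cat" follows "the" twice and "mat" once, so "cat" wins.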

  • davel [he/him]@hexbear.net
    English
    arrow-up
    30
    ·
    edit-2
    22 days ago

    Spicy autocomplete can produce much more content much faster than we can, and it is consuming its own content now. What could go wrong?

    clown-to-clown-communication clown-to-clown-conversation

  • DragonBallZinn [he/him]@hexbear.net
    English
    arrow-up
    28
    ·
    edit-2
    22 days ago

    Based. Fuck AI.

    Always suspicious when it’s one of the few technologies boomers got super hyped up about and wanted to shove into everything.

  • ssj2marx@lemmy.ml
    English
    arrow-up
    28
    ·
    edit-2
    22 days ago

    I know what they’re trying to say, but I really wish these writers would use accurate terms. “AI” systems aren’t intelligent in any meaningful sense; they’re just pattern generators. They were never getting “smarter”; the patterns that they were capable of outputting were just getting more complex.

    • technocrit@lemmy.dbzer0.com
      English
      arrow-up
      16
      ·
      edit-2
      22 days ago

      Yeah 100%. It’s like adopting the language of your oppressor. The hucksters have been selling their “learning”, “intelligence”, “minds”, etc. for so long that many people have internalized it. Let’s please return to reality and use scientific terms like data, function, average, statistics, etc.

  • Owl [he/him]@hexbear.net
    English
    arrow-up
    22
    ·
    22 days ago

    This entire boom was predicated on being able to throw 10x the compute budget at a problem and get 2x the quality of results, so it was inevitable. It’s not like big tech is suddenly funding long-term R&D teams again; they stopped doing that before most of these companies were even founded.
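
    A back-of-the-envelope sketch of what that ratio implies. Assuming (and this is an assumption, not an established law) that "quality" follows a power law in compute and taking the rough 10x-for-2x figure at face value, the exponent is log10(2), about 0.3, and each further doubling of quality costs another 10x:

```python
import math

# Assumed power law: quality ~ compute ** ALPHA, calibrated so that
# 10x compute gives 2x quality (the rough figure from the comment above).
ALPHA = math.log(2) / math.log(10)  # about 0.301

def compute_multiplier_for(quality_gain):
    """Compute multiplier needed for a given quality multiplier under the assumed power law."""
    return quality_gain ** (1 / ALPHA)

for gain in (2, 4, 8):
    print(f"{gain}x quality -> ~{compute_multiplier_for(gain):,.0f}x compute")
```

    Under these assumptions, 8x the quality needs on the order of 1,000x the compute, which is the sense in which the budget runs out long before the gains do.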

  • Assian_Candor [comrade/them]@hexbear.net
    English
    arrow-up
    22
    ·
    22 days ago

    It would be funny if we hadn’t incinerated the planet for this shit. The peddlers will get rich too, zero consequences, except of course for the jobs that were snuffed out in infancy.

  • aaro [they/them]@hexbear.net
    English
    arrow-up
    15
    ·
    22 days ago

    reposting my hot AI take

    Just because capital can’t possibly imagine more than 5 minutes into the future, and just because capital can only speak profit and couldn’t fathom progress for the sake of progress, doesn’t mean that AI isn’t real and scary. The technological hurdles are similar to ones that have been overcome in past technologies, the incentive to replace workers with machines is just as enticing as it’s ever been, and if we’ve seen noise and fervor like this with so little of the total reward reaped, expect to keep seeing this much noise and fervor until the last drop of blood has been squeezed out.