Interesting read. I get where he’s coming from vis-à-vis training data for LLMs. But if the LLM companies are the problem, negotiate a solution with those companies or block their crawlers. Don’t kill the apps making the site usable for everyone else.
No doubt his comments are accurate as far as they go, albeit completely out of context. I’d be much more interested in knowing how many of the top 100 subs (rather than the top 5000) have reopened. I’d also like to know what “top” even means here. I’m sure the reason 97% of mods don’t use third-party apps (according to Huffman) is that they mod subs with a few dozen to a few hundred members, or subs that are almost completely inactive.
In other words, this is interesting damage control, but it needs a lot more context. And NPR’s quality control and fact-checking are sadly lacking.
Meanwhile, companies that already copy the whole web for their search engines (e.g. Microsoft and Google) still don’t need to make API calls to get Reddit posts.
More generally, anything that can be seen by a not-logged-in browser can be indexed by search engines and thus ingested into any training pipeline you can imagine.
This is part of my complaint against Reddit doing this. Google and Microsoft already have the data, so Reddit is just ensuring that smaller companies and open source LLMs fail. I am also a little annoyed by the app thing, but I think it’s important that we don’t let tech giants monopolize this new technology.
I deleted my Reddit post history; it’s not their data to sell.