Garbage garbage everywhere, & soon it starts…

Apr 21, 2023

What data went into training that ginormous AI language model? A dumpster full of webtrash.

2 Comments

Apr 21, 2023

Why ya gotta hate on AOL...? But seriously, garbage in/garbage out is such a truism...but I think the point about old web sites may not be as objective -- who's to say if LIFO or FIFO is the best approach? Feel like it can't be binary...it has to be considered based on what changes over time.

For instance, a website/page describing COVID symptoms and/or sources in 2019 vs. one in 2023 will be very different and have quite distinct value as fodder for AI-generated blather...

Dan Tynan

Apr 23, 2023

I've made a career out of hating on AOL.

https://www.pcworld.com/article/535838/worst_products_ever.html

I agree that a broad range of sites (and other material) should be included in training AI, but I would want to draw the line at sites that promote illegal activity, are complete and utter cesspools, or publish dubious or outright false information. If you're training a model simply to understand human language, that's one thing. If it's a corpus of information it uses to generate allegedly factual material, that's different. And especially if they are using creative works without permission.

tl;dr: folks who train these models need to come clean about what's inside them.

Cranky Old Man Yells at Internet