2 Comments

Why ya gotta hate on AOL...? But seriously, garbage in/garbage out is such a truism...but I think the point about old web sites may not be as objective -- who's to say if LIFO or FIFO is the best approach? Feel like it can't be binary...it has to be considered based on what changes over time.

For instance, a website/page describing COVID symptoms and/or sources in 2019 vs. one in 2023 will be very different, and the two will have quite distinct value as fodder for AI-generated blather...


I've made a career out of hating on AOL.

https://www.pcworld.com/article/535838/worst_products_ever.html

I agree that a broad range of sites (and other material) should be included in training AI, but I would want to draw the line at sites that promote illegal activity, are complete and utter cesspools, or publish dubious or outright false information. If you're training a model simply to understand human language, that's one thing. If it's a corpus of information it uses to generate alleged factual material, that's different. And especially if they are using creative works without permission.

tl;dr: folks who train these models need to come clean about what's inside them.
