Why ya gotta hate on AOL...? But seriously, garbage in/garbage out is such a truism...but I think the point about old web sites may not be as objective -- who's to say if LIFO or FIFO is the best approach? Feel like it can't be binary...it has to be considered based on what changes over time.
For instance, a website/page describing COVID symptoms and/or sources in 2019 vs. one in 2023 will be very different and have quite distinct value as fodder for AI-generated blather...
I agree that a broad range of sites (and other material) should be included in training AI, but I would want to draw the line at sites that promote illegal activity, are complete and utter cesspools, or publish dubious or outright false information. If you're training a model simply to understand human language, that's one thing. If it's a corpus of information it uses to generate alleged factual material, that's different. And especially if they are using creative works without permission.
tl;dr: folks who train these models need to come clean about what's inside them.
I've made a career out of hating on AOL.
https://www.pcworld.com/article/535838/worst_products_ever.html