Recently [Faith Ekstrand] announced on Mastodon that Mesa was updating its contributor guide. This follows a recent AI slop incident where someone submitted a massive patch to the Mesa project with the claim that this would improve performance ‘by a few percent’. The catch? The entire patch was generated by ChatGPT, with the submitter becoming somewhat irate when the very patient Mesa developers tried to explain that they’d happily look at the issue after the submitter had condensed the purported ‘improvement’ into a bite-sized patch.
The entire saga is summarized in a recent video by [Brodie Robertson], which highlights both how incredibly friendly the Mesa developers are, and how the use of ChatGPT and kin has made some people with zero programming skills apparently believe that they can now contribute code to OSS projects. Unsurprisingly, the Mesa developers were unable to disabuse this particular individual of that notion, but the diff to the Mesa contributor guide by [Timur Kristóf] should make abundantly clear that someone playing Telephone between a chatbot and OSS project developers is neither desirable nor helpful.
That said, [Brodie] also highlights a recent post by [Daniel Stenberg] of Curl fame, who thanked [Joshua Rogers] for contributing a massive list of potential issues that were found using ‘AI-assisted tools’, as detailed in this blog post by [Joshua]. An important point here is that these ‘AI tools’ are not LLM-based chatbots, but rather tweaked existing tools like static code analyzers with more smarts bolted on. They’re purpose-made tools that still require you to know what you’re doing, but they can be a real asset to a developer, and a heck of a lot more useful to a project like Curl than getting sent fake bug reports by a confabulating chatbot as has happened previously.
In the last example, the one thing AI really should be doing to get more people to trust it is providing a confidence score for each answer, based on a separate meta-analysis of the data available to it versus its response. The code-testing tool kind of analyzes itself to see which of the issues it raised are likely hallucinations.
AI is more useful for the “looking for anything potentially worrying out of a big set” kind of task than for “just rewrite the code for me and make sure it has no issues and improved performance”. With the former it’s fine if it doesn’t account for everything, but with the latter you have to ensure there are no ensuing regressions, which become more likely the bigger the patch.
The LLM doesn’t know how confident it is, and even if you ask it, it’s just going to pick whichever number came up most often when people in its training data asked each other how confident they were. That’s still an unsolved problem beyond basic tasks like getting an LLM to find sentences in some prose that aren’t supported by any of the sentences in some other prose.
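That last task boils down to a loop over sentence pairs, something like the toy sketch below. The entailment_score here is a made-up placeholder for whatever real entailment/NLI model you’d actually plug in (the word-overlap heuristic just keeps it runnable), so treat it as illustration only:

```
import re

def sentences(text):
    """Naive sentence splitter, good enough for a sketch."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def entailment_score(premise, hypothesis):
    """Placeholder: fraction of hypothesis words also found in the premise.
    A real setup would call an NLI/entailment model here instead."""
    p_words = set(premise.lower().split())
    h_words = set(hypothesis.lower().split())
    return len(h_words & p_words) / max(len(h_words), 1)

def unsupported_sentences(source, answer, threshold=0.5):
    """Return answer sentences that no source sentence supports above threshold."""
    source_sents = sentences(source)
    flagged = []
    for claim in sentences(answer):
        best = max((entailment_score(src, claim) for src in source_sents), default=0.0)
        if best < threshold:
            flagged.append(claim)
    return flagged

source = "Curl received a large list of potential issues. The issues were found with static analysis tools."
answer = "Curl received a list of potential issues. The issues were all confirmed as exploitable."
print(unsupported_sentences(source, answer))
```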
An LLM is functionally an averaging algorithm, so I think it’s reasonable to calculate statistical parameters behind its answers using other algorithms. Something akin to the z-value of an answer on a Gaussian distribution, or the standard deviation indicating how spread out the potential answers are.
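Roughly what I’m picturing, as a purely hypothetical sketch (the function and names are mine, and it assumes you can even get at the per-token probability distributions, which hosted models generally don’t fully expose):

```
import numpy as np

def answer_confidence(token_distributions, chosen_tokens):
    """Hypothetical sketch: derive a crude 'confidence' figure from the
    per-step token probability distributions, assuming the model exposes them.

    token_distributions: (n_tokens, vocab_size) probabilities per step.
    chosen_tokens: index of the token actually emitted at each step."""
    probs = np.asarray(token_distributions)
    chosen_p = probs[np.arange(len(chosen_tokens)), chosen_tokens]

    # Mean log-probability of the emitted tokens (higher = "more typical" output).
    mean_logprob = np.mean(np.log(chosen_p))

    # Mean entropy of each step's distribution (lower = alternatives were less
    # spread out, roughly the "standard deviation" idea above).
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=-1)
    return {"mean_logprob": float(mean_logprob),
            "mean_entropy": float(np.mean(entropy))}

# Toy example: 3 steps over a 4-token vocabulary.
dists = [[0.7, 0.1, 0.1, 0.1],
         [0.4, 0.3, 0.2, 0.1],
         [0.25, 0.25, 0.25, 0.25]]
print(answer_confidence(dists, chosen_tokens=[0, 0, 2]))
```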
I feel like you need to know more about how LLMs operate before you can really be lecturing other people about what they are and how they could be improved.
As far as needing to have an “awareness” of how confident they are in the answer and communicating that data point along with the answer itself, yeah, that sort of thing would be a 1,000% improvement, it would be wonderful. You are right about that part, but your diagnosis of the details needs some work.
Sorry if I am hand-waving away too many details on the implementation, but I’m glad you get my point.
Yeah, it is a really good idea. The sticking point is that the modern structure of LLMs really doesn’t allow it; it’s not that they haven’t tried. But at least so far, whatever secondary structure you try to apply to check the first answer suffers from the exact same issue: it doesn’t really “understand”, and so it’s subject to spouting totally wrong stuff sometimes, no matter how carefully you try to set it up with awareness and fact checking.
That’ll tell you how likely it is that an answer is grammatically correct and how likely it is to resemble text in the training data. If the LLM starts reciting a well-known piece of literature that shows up several times in the training data and finishes it correctly, that’ll give a really good score by this metric, but no matter how much the LLM output resembles the script of Bee Movie, Bee Movie isn’t a true story or necessarily relevant to the question a user asked.