Posted on 2025-8-7 23:25:14
Tencent improves testing creative AI models with new benchmark
Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, ranging from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
Finally, it hands all this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM) acting as a judge.
This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
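To make the checklist idea concrete, here is a minimal sketch of how per-task checklist scores might be aggregated into per-metric results. The metric names, the 0–10 scale, and the averaging scheme are all illustrative assumptions; the article does not publish ArtifactsBench's actual rubric.

```python
# Illustrative sketch: aggregate a judge's per-item checklist scores by metric.
# Metric names and the 0-10 scale are assumptions, not the real rubric.
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    metric: str       # e.g. "functionality" (the article mentions ten metrics)
    description: str  # what the judge was asked to verify
    score: float      # judge's score for this item, 0-10


def aggregate_scores(items: list[ChecklistItem]) -> dict[str, float]:
    """Average the per-item scores within each metric."""
    buckets: dict[str, list[float]] = {}
    for item in items:
        buckets.setdefault(item.metric, []).append(item.score)
    return {metric: sum(scores) / len(scores) for metric, scores in buckets.items()}


items = [
    ChecklistItem("functionality", "button click updates the view", 9.0),
    ChecklistItem("functionality", "chart renders without errors", 7.0),
    ChecklistItem("aesthetics", "layout is visually consistent", 8.0),
]
print(aggregate_scores(items))
# → {'functionality': 8.0, 'aesthetics': 8.0}
```

A fixed checklist like this is what makes the judging repeatable: two runs of the judge grade the same observable behaviours rather than forming a free-text impression.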
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a huge leap from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
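The consistency figures above boil down to a measure of ranking agreement. A minimal sketch of one common way to compute such a number is pairwise agreement between two rankings (this is an assumed method for illustration, not Tencent's published formula, and the model names and ranks are hypothetical):

```python
# Illustrative sketch: fraction of model pairs that two rankings order the same way.
# The rankings below are hypothetical; ArtifactsBench's exact metric is not shown
# in the article.
from itertools import combinations


def pairwise_agreement(rank_a: dict[str, int], rank_b: dict[str, int]) -> float:
    """Share of pairs (m1, m2) where both rankings agree on which is better."""
    agree = total = 0
    for m1, m2 in combinations(sorted(rank_a), 2):
        total += 1
        # Same sign of rank difference means both rankings order the pair alike.
        if (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2]) > 0:
            agree += 1
    return agree / total


# Hypothetical ranks (1 = best): an automated benchmark vs. human votes.
auto_ranks = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
human_ranks = {"model_a": 1, "model_b": 3, "model_c": 2, "model_d": 4}
print(pairwise_agreement(auto_ranks, human_ranks))
# → 0.8333... (5 of the 6 pairs are ordered the same way)
```

On a benchmark leaderboard, a score near 1.0 means the automated judge sorts models almost exactly as human voters do, which is the property the 94.4% figure is claiming.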
https://www.artificialintelligence-news.com/