
小白
Posted at 16:28
Tencent improves testing creative AI models with new benchmark

Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a secure, sandboxed environment.
To assess how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
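The time-sampled capture step can be sketched as a simple loop. This is a minimal illustration, not ArtifactsBench's actual code: `render` is a hypothetical stand-in for a real screenshot call (such as a headless-browser capture), and the interval and duration values are made up.

```python
import time

def capture_frames(render, duration_s=1.0, interval_s=0.5):
    """Sample an app's rendered state at fixed intervals.

    Returns (timestamp, frame) pairs; comparing successive frames is
    what lets a judge spot animations and post-click state changes.
    """
    frames = []
    elapsed = 0.0
    while elapsed <= duration_s:
        frames.append((round(elapsed, 2), render()))
        time.sleep(interval_s)  # wait before the next capture
        elapsed += interval_s
    return frames

# A stub "renderer" whose output changes over time, like an animation.
counter = {"tick": 0}
def fake_render():
    counter["tick"] += 1
    return f"frame-{counter['tick']}"

frames = capture_frames(fake_render, duration_s=1.0, interval_s=0.5)
print(len(frames))  # 3 samples: t=0.0, 0.5, 1.0
```

Because the captures are spread over time, a static page produces identical frames while a dynamic one does not, which is exactly the signal described above.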
Finally, it hands all of this evidence – the original request, the AI’s code, and the screenshots – to a Multimodal LLM (MLLM), which acts as a judge.
This MLLM judge isn’t just giving a vague opinion; instead, it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
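A checklist-based aggregation like the one described might look as follows. The article only names functionality, user experience, and aesthetic quality; the other metric names here are invented placeholders, and the equal-weight mean is an assumption.

```python
# Hypothetical per-task checklist: ten metrics, each scored 0-10.
# Only the first three are named in the article; the rest are made up.
CHECKLIST_METRICS = [
    "functionality", "user_experience", "aesthetics", "responsiveness",
    "code_quality", "robustness", "interactivity", "accessibility",
    "performance", "completeness",
]

def score_artifact(judge_scores):
    """Aggregate a judge's per-metric ratings into one overall score.

    Requiring every metric is what makes the checklist approach more
    consistent than a single free-form opinion.
    """
    missing = [m for m in CHECKLIST_METRICS if m not in judge_scores]
    if missing:
        raise ValueError(f"judge skipped metrics: {missing}")
    return sum(judge_scores[m] for m in CHECKLIST_METRICS) / len(CHECKLIST_METRICS)

example = {m: 8 for m in CHECKLIST_METRICS}
example["aesthetics"] = 6
print(score_artifact(example))  # mean of the ten metric scores: 7.8
```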
The big question is: does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a huge leap from older automated benchmarks, which only managed around 69.4% consistency.
On top of this, the framework’s judgments showed more than 90% agreement with professional human developers.
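One plausible way to compute a ranking-consistency figure like those above is pairwise agreement: the fraction of model pairs that two rankings order the same way. The article does not specify ArtifactsBench's exact formula, so treat this as an illustrative sketch with made-up model names.

```python
from itertools import combinations

def pairwise_agreement(ranking_a, ranking_b):
    """Fraction of model pairs ordered identically by two rankings.

    Each ranking is a list of the same model names, best first.
    """
    pos_a = {m: i for i, m in enumerate(ranking_a)}
    pos_b = {m: i for i, m in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    agree = sum(
        (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs
    )
    return agree / len(pairs)

# Hypothetical rankings: the automated judge vs. human voters.
auto_rank = ["model-a", "model-b", "model-c", "model-d"]
human_rank = ["model-a", "model-c", "model-b", "model-d"]
print(pairwise_agreement(auto_rank, human_rank))  # 5 of 6 pairs agree
```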
https://www.artificialintelligence-news.com/