
小白
Posted at 12:40
Tencent improves testing of creative AI models with new benchmark
109.172.196.x 11:51
Getting it right, like a human would
So, how does Tencent’s AI benchmark work? First, an AI is given a creative task from a catalogue of over 1,800 challenges, from building data visualisations and web apps to making interactive mini-games.
Once the AI generates the code, ArtifactsBench gets to work. It automatically builds and runs the code in a safe, sandboxed environment.
To see how the application behaves, it captures a series of screenshots over time. This allows it to check for things like animations, state changes after a button click, and other dynamic user feedback.
Finally, it hands over all this evidence (the original request, the AI’s code, and the screenshots) to a Multimodal LLM (MLLM), which acts as a judge.
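The steps above (generate, build and run, capture screenshots, judge) can be pictured as a small pipeline. This is only a rough sketch of the flow as the post describes it, not ArtifactsBench's actual code: the `Evidence` class, `evaluate_task`, and the three stand-in functions are all hypothetical names, and the real sandbox and MLLM judge are replaced by dummies so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    """Everything handed to the judge for one task (hypothetical field names)."""
    task_prompt: str
    generated_code: str
    screenshots: List[bytes]


def evaluate_task(
    task_prompt: str,
    generate_code: Callable[[str], str],
    run_and_capture: Callable[[str], List[bytes]],
    judge: Callable[[Evidence], float],
) -> float:
    """Run one task through generate -> run and capture -> judge."""
    code = generate_code(task_prompt)          # the model under test writes code
    screenshots = run_and_capture(code)        # sandboxed run, screenshots over time
    evidence = Evidence(task_prompt, code, screenshots)
    return judge(evidence)                     # the judge sees all three pieces


# Stand-ins (assumptions) so the sketch runs without a model or a sandbox.
def fake_generate(prompt: str) -> str:
    return "<button>Click me</button>"


def fake_run(code: str) -> List[bytes]:
    return [b"frame-0", b"frame-1", b"frame-2"]


def fake_judge(evidence: Evidence) -> float:
    # Dummy score: just counts the screenshots it received.
    return float(len(evidence.screenshots))


score = evaluate_task("Build a button demo", fake_generate, fake_run, fake_judge)
print(score)
```

Passing the three stages in as callables keeps the flow visible while leaving the real model, sandbox, and judge out of the picture.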
This MLLM judge isn’t just giving a vague opinion; instead it uses a detailed, per-task checklist to score the result across ten different metrics. Scoring includes functionality, user experience, and even aesthetic quality. This ensures the scoring is fair, consistent, and thorough.
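To make the checklist idea concrete, here is a minimal sketch of how ten per-metric scores could be combined into one number. The post does not say how the scores are aggregated or what scale is used, so the plain average and the 0 to 10 scale are assumptions. Only three metric names come from the post (functionality, user experience, aesthetic quality); the other seven are placeholders.

```python
from typing import Dict

EXPECTED_METRIC_COUNT = 10  # the post mentions ten metrics


def aggregate_checklist(scores: Dict[str, float], max_score: float = 10.0) -> float:
    """Average the per-metric scores (assumed aggregation, assumed 0..max_score scale)."""
    if len(scores) != EXPECTED_METRIC_COUNT:
        raise ValueError(f"expected {EXPECTED_METRIC_COUNT} metrics, got {len(scores)}")
    for name, value in scores.items():
        if not 0.0 <= value <= max_score:
            raise ValueError(f"score for {name!r} is outside 0..{max_score}")
    return sum(scores.values()) / len(scores)


# Three metrics named in the post, plus seven placeholder names.
example_scores = {
    "functionality": 8.0,
    "user_experience": 6.0,
    "aesthetic_quality": 7.0,
}
for i in range(4, 11):
    example_scores[f"placeholder_metric_{i}"] = 5.0

overall = aggregate_checklist(example_scores)
print(round(overall, 2))
```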
The big question is, does this automated judge actually have good taste? The results suggest it does.
When the rankings from ArtifactsBench were compared to WebDev Arena, the gold-standard platform where real humans vote on the best AI creations, they matched up with 94.4% consistency. This is a massive leap from older automated benchmarks, which only managed around 69.4% consistency.
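The post does not explain how the consistency figure is calculated. One common way to compare two rankings is pairwise agreement: the share of model pairs that both rankings put in the same order. The sketch below shows that approach as an assumption, not as the benchmark's actual method; the function name and the model names are made up.

```python
from itertools import combinations
from typing import Sequence


def pairwise_consistency(ranking_a: Sequence[str], ranking_b: Sequence[str]) -> float:
    """Fraction of item pairs that both rankings order the same way (best first)."""
    if len(set(ranking_a)) != len(ranking_a) or len(set(ranking_b)) != len(ranking_b):
        raise ValueError("rankings must not contain duplicates")
    if set(ranking_a) != set(ranking_b):
        raise ValueError("both rankings must cover the same items")
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    pairs = list(combinations(ranking_a, 2))
    if not pairs:
        raise ValueError("need at least two items to compare")
    agreeing = sum(
        1 for x, y in pairs
        if (pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y])
    )
    return agreeing / len(pairs)


# Made-up rankings: the two lists disagree only on the order of model_b and model_c.
automated = ["model_a", "model_b", "model_c", "model_d"]
human_votes = ["model_a", "model_c", "model_b", "model_d"]
consistency = pairwise_consistency(automated, human_votes)
print(f"{consistency:.1%}")
```

With four models there are six pairs, and the two rankings here agree on five of them.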
On top of this, the framework’s judgments showed over 90% agreement with professional human developers.
https://www.artificialintelligence-news.com/