If there's Intelligent Life out There

Optimizing LLMs to be good at specific tests backfires on Meta, Stability.


Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard seeks to be a more challenging, uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to reveal the brand new open . We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learning: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent ... June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and most daunting of all: high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests to qualify for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted selection of significant models to avoid a confusing glut of small LLMs.

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a means to compare and reproduce testing results from several established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second variant.

Some LLMs, including newer variants of Meta's Llama, severely underperformed in the new leaderboard compared to their high marks in the first. This came from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to regression in real-world performance. This regression of performance, thanks to hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin covers all the latest tech news.


bit_user: "LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away." First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, or else the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

jp7189: I don't love the click-bait China vs. the world title. The fact is Qwen is open source, open weights, and can be run anywhere. It can be (and already has been) fine-tuned to add/remove bias. I applaud Hugging Face's work to create standardized tests for LLMs, and for putting the focus on open source, open weights first.

jp7189: bit_user said: First, this statement discounts the role of network architecture.

Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and capabilities you might be familiar with, if you study child development or animal intelligence.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, or else the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

We're creating tools to help humans, therefore I would argue LLMs are more helpful if we grade them by human intelligence standards.


Tom's Hardware is part of Future US Inc, an international media group and leading digital publisher.

