
If There's Intelligent Life Out There


Optimizing LLMs to be great at specific tests backfires on Meta, Stability.


Hugging Face has launched its second LLM leaderboard to rank the best language models it has evaluated. The new leaderboard aims to be a harder, uniform standard for testing open large language model (LLM) performance across a range of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top 10.

Pumped to announce the brand new open LLM leaderboard. We burned 300 H100 to re-run new evaluations like MMLU-pro for all major open LLMs! Some learnings: - Qwen 72B is the king and Chinese open models are dominating overall - Previous evaluations have become too easy for recent ... June 26, 2024

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning over extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and, most difficult of all, high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.
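
For readers who want a feel for how such benchmark scores are produced, the sketch below runs a single leaderboard-style task locally with EleutherAI's lm-evaluation-harness. The article does not specify Hugging Face's exact tooling or configuration, so the harness, task name, and model here are illustrative assumptions rather than the leaderboard's actual pipeline.

```python
# Minimal sketch: evaluate one model on one leaderboard-style benchmark
# using lm-evaluation-harness (pip install lm-eval). All names below are
# example choices, not the leaderboard's official configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-7B-Instruct",  # example model, not an endorsement
    tasks=["leaderboard_mmlu_pro"],                  # task name may differ across harness versions
    batch_size=8,
)

# Print the aggregate metrics the harness reports for each task.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Running the full six-benchmark suite is the same idea repeated across tasks; the expensive part is inference, which is why the official runs are centralized on Hugging Face's own hardware.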

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes first, third, and tenth place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; Hugging Face's leaderboard does not test closed-source models, to ensure reproducibility of results.

Tests that qualify models for the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue's Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission to the leaderboard, and a new voting system prioritizes popular new entries for testing. The leaderboard can be filtered to show only a highlighted selection of significant models, to avoid a confusing glut of small LLMs.
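
The "highlighted models" filter described above is, in effect, a sort-and-filter over the published results. A minimal sketch of that kind of filtering is below, assuming the results have been exported to a CSV; the file name and column names ("model", "params_billions", "average_score") are hypothetical and may not match the real export.

```python
# Minimal sketch of leaderboard-style filtering over exported results.
# File and column names are assumptions for illustration only.
import pandas as pd

df = pd.read_csv("open_llm_leaderboard_v2.csv")  # hypothetical export of the results table

# Keep only reasonably large models and rank them by average benchmark score,
# roughly mirroring the "highlighted models" view.
highlighted = (
    df[df["params_billions"] >= 7]
    .sort_values("average_score", ascending=False)
    .head(10)
)
print(highlighted[["model", "average_score"]])
```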

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was launched last year as a way to compare and reproduce testing results from several established LLMs, the board quickly exploded in popularity. Getting high ranks on the board became the goal of many developers, small and large, and as models have become generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results have become less and less meaningful, hence the creation of a second version.

Some LLMs, including newer variants of Meta's Llama, seriously underperformed on the new leaderboard compared to their high marks on the first. This stemmed from a pattern of over-training LLMs on only the first leaderboard's benchmarks, leading to regressions in real-world performance. This performance regression, driven by hyperspecific and self-referential training data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.


Dallin Grimm is a contributing writer for Tom's Hardware. He has been building and breaking computers since 2017, serving as the resident youngster at Tom's. From APUs to RGB, Dallin covers all the latest tech news.


- bit_user: "LLM performance is only as good as its training data and that true artificial 'intelligence' is still many, many years away." First, this statement discounts the role of network architecture.

The definition of "intelligence" cannot be whether something processes information exactly like humans do, otherwise the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either.

- jp7189: I don't love the click-bait China vs. the world title. The truth is Qwen is open source, open weights, and can be run anywhere. It can (and already has been) fine-tuned to add or remove bias. I applaud Hugging Face's work to produce standardized tests for LLMs, and for putting the focus on open source, open weights first.

- jp7189: bit_user said: "First, this statement discounts the role of network architecture.

"Second, intelligence isn't a binary thing - it's more like a spectrum. There are various classes of cognitive tasks and abilities you might be familiar with if you study child development or animal intelligence.

"The definition of 'intelligence' cannot be whether something processes information exactly like humans do, or else the search for extraterrestrial intelligence would be entirely futile. If there's intelligent life out there, it probably doesn't think quite like we do. Machines that act and behave intelligently also needn't necessarily do so, either." We're developing tools to help humans, therefore I would argue LLMs are more useful if we grade them by human intelligence standards.
