evald.ai

OpenAI Evaluation Filter May 11, 2026 10:00

How enterprises are scaling AI

How enterprises scale AI: from early experiments to compounding impact through trust, governance, workflow design, and quality at scale.

Google News LLM Evaluation May 11, 2026 08:21

Google News

Comprehensive up-to-date news coverage, aggregated from sources all over the world by Google News.

Google News LLM Evaluation May 11, 2026 08:20

Google News

Comprehensive up-to-date news coverage, aggregated from sources all over the world by Google News.

METR Blog May 11, 2026 07:00

Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity

A survey of 349 technical workers finds a median 1.4–2x self-reported change in value of work due to AI tools, expected to grow over time, though there are reasons to be skeptical of the magnitude.

METR Blog May 08, 2026 07:00

Review of the "Risks from automated R&D" section in the Anthropic Risk Report (February 2026)

External review from METR of the "Risks from automated R&D" section in Anthropic's February 2026 Risk Report.

METR Blog May 08, 2026 07:00

Task Substitution and Uplift

We distinguish three measures of AI uplift -- on old tasks, on new tasks, and in value -- and show that task substitution can cause these to diverge substantially.

MLCommons Evaluation Filter May 07, 2026 13:23

GPT-OSS 20B: A Sparse MoE Pretraining Benchmark for MLPerf Training v6.0

MLPerf Training v6.0 introduces GPT-OSS 20B, a new sparse Mixture-of-Experts (MoE) pretraining benchmark designed for accessibility on single 8-GPU nodes.

Hugging Face Evaluation Filter May 06, 2026 00:00

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

MLCommons Evaluation Filter May 05, 2026 13:37

DeepSeek-V3: A Large-Scale MoE Pretraining Benchmark for MLPerf Training v6.0

MLPerf Training v6.0 introduces a large-scale pretraining benchmark built on DeepSeek-V3, bringing Mixture-of-Experts (MoE) evaluation to the suite.

Google News LLM Evaluation May 05, 2026 06:10

Google News

Comprehensive up-to-date news coverage, aggregated from sources all over the world by Google News.