With AI, Everyone is a Dev. EveryDev.ai © 2026

AI Tools & Discussions in LLM Evaluations

Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.
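As a rough illustration of the regression-testing workflow described above, the following minimal sketch scores a model against a small dataset and checks the result against a pass threshold, as a CI job might. Everything in it is hypothetical and not taken from any tool listed on this page: `stub_model` stands in for a real LLM call, `exact_match` stands in for a richer scorer such as an LLM-as-a-judge metric, and the dataset and the 0.6 threshold are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Return 1.0 when the normalized output equals the expected answer."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    dataset: List[Example],
    scorer: Callable[[str, str], float],
) -> float:
    """Run the model on every example and return the mean score."""
    scores = [scorer(model(ex.prompt), ex.expected) for ex in dataset]
    return sum(scores) / len(scores)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call (assumption)."""
    canned = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "unknown")


def passes_gate(score: float, threshold: float = 0.6) -> bool:
    """Regression gate: a CI job would fail when this returns False."""
    return score >= threshold


# Tiny illustrative dataset (invented for this sketch).
DATASET = [
    Example("2 + 2 =", "4"),
    Example("Capital of France?", "paris"),
    Example("Largest planet?", "Jupiter"),
]

if __name__ == "__main__":
    score = run_eval(stub_model, DATASET, exact_match)
    print(f"mean score: {score:.2f}, gate passed: {passes_gate(score)}")
```

Swapping `exact_match` for a judge-based scorer or `stub_model` for a real client leaves the loop unchanged, which is roughly the separation between models, datasets, and scorers that evaluation frameworks tend to formalize.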

LLM Evaluations Tools (87)

ExploitBench (Featured)
AI Security Exploit Benchmark
LLM Evaluations Security Testing Agent Harness

DeepSWE
Coding Agent Benchmark Tool
LLM Evaluations AI Coding Asst. Agent Harness

InferenceBench
LLM Inference Optimization Benchmark
LLM Evaluations Agent Harness AI Infrastructure

Langtrace
Open Source LLM Observability Platform
Observability LLM Evaluations Monitoring Tools

Raindrop Workshop
Local AI Agent Debugger
Observability Agent Frameworks LLM Evaluations

Inspect AI (Featured)
Open Source LLM Eval Framework
LLM Evaluations Agent Frameworks AI Dev Libraries

VitaBench
Open Source LLM Agent Benchmark
LLM Evaluations Agent Frameworks Academic Research

pmstack
AI Commands for Product Managers
Prompt Engineering AI Coding Asst. LLM Evaluations

SWE-bench
LLM Software Engineering Benchmark
LLM Evaluations Automated Testing AI Coding Asst.

Verifiers (Featured)
LLM RL Training Environment Library
Agent Harness LLM Evaluations HITL Training

Top Tools in LLM Evaluations

Highest trending score

Web platform for comparing, running, and deploying large language models with hosted inference and API access.

A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

An open-source Python framework for large language model evaluations developed by the UK AI Security Institute, supporting agentic tasks, tool use, multi-turn dialog, and 200+ pre-built benchmarks.

New in LLM Evaluations

ExploitBench (4m ago)

InferenceBench (2d ago)

Last 7 Days

New Tools: 16
Featured: 28
Upvotes: 13

Related Topics

Automated Testing: 91 tools

Bug Detection: 35 tools

Test Generation: 14 tools

Visual Testing: 7 tools

Performance Testing: 1 tool
