EveryDev.ai
    AI Tools & Discussions in LLM Evaluations

    Platforms and frameworks for evaluating, testing, and benchmarking LLM systems and AI applications. These tools provide evaluators and evaluation models to score AI outputs, measure hallucinations, assess RAG quality, detect failures, and optimize model performance. Features include automated testing with LLM-as-a-judge metrics, component-level evaluation with tracing, regression testing in CI/CD pipelines, custom evaluator creation, dataset curation, and real-time monitoring of production systems. Teams use these solutions to validate prompt effectiveness, compare models side-by-side, ensure answer correctness and relevance, identify bias and toxicity, prevent PII leakage, and continuously improve AI product quality through experiments, benchmarks, and performance analytics.
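The regression-testing workflow mentioned above can be sketched in a few lines. This is a minimal illustration only, not the API of any tool listed on this page: `EvalCase`, `exact_match`, `run_eval`, `stub_model`, and `passes_regression_gate` are hypothetical names, the model is a stand-in function rather than a real LLM call, and the scorer is a simple exact-match check where a real setup might use an LLM-as-a-judge metric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One evaluation example: a prompt and the answer we expect."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the outputs match after trimming and lowercasing, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over `cases`."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def passes_regression_gate(score: float, baseline: float, tolerance: float = 0.05) -> bool:
    """A CI-style gate: fail when the score drops more than `tolerance` below `baseline`."""
    return score >= baseline - tolerance


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; answers two prompts, misses the rest."""
    answers = {"2 + 2 =": "4", "Capital of France?": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("2 + 2 =", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

score = run_eval(stub_model, CASES)
print(f"mean score: {score:.2f}")
print("gate passed:", passes_regression_gate(score, baseline=0.66))
```

In a CI pipeline, the gate's boolean result would typically decide whether the job exits successfully; the baseline and tolerance values here are arbitrary placeholders.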

    LLM Evaluations Tools (87)

    ExploitBench (Featured)
    AI Security Exploit Benchmark
    Tags: LLM Evaluations, Security Testing, Agent Harness

    DeepSWE
    Coding Agent Benchmark Tool
    Tags: LLM Evaluations, AI Coding Asst., Agent Harness

    InferenceBench
    LLM Inference Optimization Benchmark
    Tags: LLM Evaluations, Agent Harness, AI Infrastructure

    Langtrace
    Open Source LLM Observability Platform
    Tags: Observability, LLM Evaluations, Monitoring Tools

    Raindrop Workshop
    Local AI Agent Debugger
    Tags: Observability, Agent Frameworks, LLM Evaluations

    Inspect AI (Featured)
    Open Source LLM Eval Framework
    Tags: LLM Evaluations, Agent Frameworks, AI Dev Libraries

    VitaBench
    Open Source LLM Agent Benchmark
    Tags: LLM Evaluations, Agent Frameworks, Academic Research

    pmstack
    AI Commands for Product Managers
    Tags: Prompt Engineering, AI Coding Asst., LLM Evaluations

    SWE-bench
    LLM Software Engineering Benchmark
    Tags: LLM Evaluations, Automated Testing, AI Coding Asst.

    Verifiers (Featured)
    LLM RL Training Environment Library
    Tags: Agent Harness, LLM Evaluations, HITL Training

    Top Tools in LLM Evaluations

    Highest trending score

    LM Arena

    Web platform for comparing, running, and deploying large language models with hosted inference and API access.

    ProgramBench

    A benchmark that tests whether AI agents can rebuild real-world programs from scratch given only a compiled binary and its documentation, with no access to source code.

    Inspect AI

    An open-source Python framework for large language model evaluations developed by the UK AI Security Institute, supporting agentic tasks, tool use, multi-turn dialog, and 200+ pre-built benchmarks.

    New in LLM Evaluations

    • ExploitBench (4m ago)
    • DeepSWE (19h ago)
    • InferenceBench (2d ago)

    Featured Tool

    LM Arena

    Last 7 Days

    • New Tools: 16
    • Featured: 28
    • Upvotes: 13

    Related Topics

    • Automated Testing (91 tools)
    • Bug Detection (35 tools)
    • Test Generation (14 tools)
    • Visual Testing (7 tools)
    • Performance Testing (1 tool)

