LLM-as-a-Judge · BERTScore · Reproducible
Measure the reliability of your LLM chatbot
Open-source Python framework to assess the factual accuracy, semantic consistency and robustness of any LLM-based chatbot — in an automated, comparable and reproducible way.
Send test scenarios, let a judge model grade the answers with the help of computational metrics, and get structured reports — ready to run as a quality gate in your CI pipeline.
Three dimensions of reliability
-
Factual accuracy
Does the chatbot answer correctly? Responses are compared against a verifiable ground truth, with trap scenarios to surface hallucination.
-
Semantic consistency
Does the same question, asked in different ways, get the same answer? The framework sends rephrasings and compares the outputs with each other.
-
Robustness
Does quality hold up against noise, typos and adversarial inputs? The variants are compared against the original answer.
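As a rough illustration of how the last two dimensions can be turned into numbers, the sketch below scores a set of answers by pairwise similarity. It is not the framework's implementation: the function names are hypothetical, and a simple token-overlap (Jaccard) measure stands in for an embedding-based metric such as BERTScore, so that the example runs with the standard library alone.

```python
from itertools import combinations
from statistics import mean
from typing import Callable, Sequence

# Hypothetical type alias: any function mapping two texts to a score in [0, 1].
Similarity = Callable[[str, str], float]


def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a crude stand-in for BERTScore."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty answers are trivially identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def consistency_score(
    answers: Sequence[str], similarity: Similarity = jaccard_similarity
) -> float:
    """Mean pairwise similarity among answers to rephrasings of one question."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return mean(similarity(a, b) for a, b in pairs)


def robustness_score(
    original_answer: str,
    variant_answers: Sequence[str],
    similarity: Similarity = jaccard_similarity,
) -> float:
    """Mean similarity of each perturbed-input answer to the original answer."""
    if not variant_answers:
        return 1.0
    return mean(similarity(original_answer, v) for v in variant_answers)
```

Consistency compares the answers with each other, while robustness compares each variant against the original answer, mirroring the two descriptions above. Passing the similarity function as a parameter makes it possible to swap in a semantic metric without changing the scoring logic.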
Get started in 2 minutes
CI-ready
The same command becomes a quality gate that blocks quality regressions on every pull request. See the gate calibration guide.
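The idea behind a quality gate can be sketched in a few lines: compare each aggregated score with a minimum threshold and return a non-zero exit code when any dimension falls short, which is what makes a CI job fail. The dimension names, threshold values and function names below are assumptions made for illustration; the framework's actual report format and gate options are described in the gate calibration guide.

```python
import sys
from typing import Mapping


def failing_dimensions(
    scores: Mapping[str, float], thresholds: Mapping[str, float]
) -> list[str]:
    """Return the dimensions whose score is below its minimum threshold.

    A dimension missing from `scores` is treated as 0.0, so it fails
    any positive threshold instead of passing silently.
    """
    return sorted(
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    )


def gate_exit_code(
    scores: Mapping[str, float], thresholds: Mapping[str, float]
) -> int:
    """Return 0 when every threshold is met, 1 otherwise (hypothetical gate)."""
    failed = failing_dimensions(scores, thresholds)
    for name in failed:
        observed = scores.get(name, 0.0)
        print(
            f"FAIL {name}: {observed:.2f} < {thresholds[name]:.2f}",
            file=sys.stderr,
        )
    return 1 if failed else 0
```

In a pipeline step, a script would end with `sys.exit(gate_exit_code(scores, thresholds))`, so the pull request check fails whenever a score regresses below its threshold.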
Where to go next
-
Scenario methodology
How the scenario bank was built and how to point to your own bank via scenarios_path.
-
Human validation
Human × judge agreement protocol and the annotation templates.
-
Quality gate in CI
Calibrate when and how the gate fails in your pipeline.
-
Code & issues
Contribute, open issues or explore the source code on GitHub.
Documentation in progress
This is the first version of the site. Quickstart, configuration reference, providers and evaluation details are coming next (issue #71).
llm-eval started as an undergraduate thesis (TCC) in Software Engineering at the University of Brasília (UnB), filling a practical gap: applying LLM evaluation metrics and criteria in a systematic, accessible way.
The guide pages below are currently available in Portuguese; English translations are on the way.