The methodology assesses RAG answer quality using LLM-as-a-judge prompts, with groundedness and completeness as the core reliability criteria. Answers are evaluated strictly against the retrieved context used for generation. Groundedness is checked at the sentence level and aggregated conservatively: an answer scores no better than its worst sentence, and incorrect cross-chunk combinations are flagged as ungrounded. Completeness measures whether the answer covers all relevant context sentences. A human-annotated meta-evaluation dataset validates the alignment between LLM and human judgments.
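For concreteness, a minimal sketch of the aggregation step, assuming hypothetical per-sentence verdicts produced by the judge; the record and function names, and the 0/1 scoring convention, are illustrative rather than the original implementation:

```python
from dataclasses import dataclass


@dataclass
class SentenceVerdict:
    """Per-sentence verdict from the LLM judge (names illustrative)."""
    sentence: str
    grounded: bool          # supported by the retrieved context
    bad_combination: bool   # merges chunks in a way the context does not support


def groundedness_score(verdicts: list[SentenceVerdict]) -> float:
    """Worst-sentence aggregation: one failing sentence fails the answer."""
    if not verdicts:
        return 0.0
    return min(
        1.0 if v.grounded and not v.bad_combination else 0.0
        for v in verdicts
    )


def completeness_score(relevant_sentences: list[str], covered: set[str]) -> float:
    """Fraction of relevant context sentences covered by the answer."""
    if not relevant_sentences:
        return 1.0
    return sum(s in covered for s in relevant_sentences) / len(relevant_sentences)
```

Under this convention a single unsupported sentence, or one invalid cross-chunk combination, drives the answer-level groundedness to zero, which is what the conservative worst-sentence aggregation above requires.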
Research area(s)
Generative AI and LLMs
Technical features
The evaluation framework relies entirely on prompt-based LLM judges for groundedness and completeness with respect to the retrieved context. Specialized prompts detect unsupported statements, missing essential information, and contextually invalid combinations of instructions across retrieved chunks. The approach favors interpretability, flexibility, and domain adaptability, allowing rapid prompt refinement as chatbot behavior evolves.
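By way of illustration only, a sketch of what one such per-sentence judge call could look like, assuming an OpenAI-compatible chat client; the prompt wording, model name, and JSON schema are placeholders, not the specialized prompts described above:

```python
import json

from openai import OpenAI  # any chat-capable LLM client would work similarly

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative judge prompt; the actual specialized prompts are not reproduced here.
GROUNDEDNESS_PROMPT = """You are judging one sentence of a chatbot answer \
against the retrieved context it was generated from.

Context:
{context}

Answer sentence:
{sentence}

Reply in JSON: {{"grounded": true or false, "bad_combination": true or false, \
"reason": "<short explanation>"}}.
Set "bad_combination" to true when the sentence merges instructions from \
different chunks in a way the context does not support."""


def judge_sentence(sentence: str, context: str) -> dict:
    """Ask the LLM judge for a per-sentence groundedness verdict."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable judge model could be used
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "user",
                "content": GROUNDEDNESS_PROMPT.format(
                    context=context, sentence=sentence
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```

Because the judges are plain prompts rather than trained models, refining this behavior is a matter of editing the prompt text, which is what gives the approach its interpretability and rapid domain adaptability.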
Integration constraints
Solutions that use LLMs
Targeted customer(s)
Philips and, more broadly, any industry developing chatbots
Conditions for reuse
Originally intended for internal use; licensing to be considered
Contact
Martijn Krans
Email: martijn.krans@philips.com