Amazon Bedrock RAG and Model Evaluations now support custom metrics
Amazon Bedrock Evaluations allows you to evaluate foundation models and retrieval-augmented generation (RAG) systems, whether they are hosted on Amazon Bedrock or deployed in other clouds or on premises. Bedrock Evaluations offers human-based evaluations, programmatic evaluations such as BERTScore, F1, and other exact-match metrics, as well as LLM-as-a-judge for both model and RAG evaluation. For both model and RAG evaluation with LLM-as-a-judge, customers can select from an extensive list of built-in metrics such as correctness, completeness, and faithfulness (hallucination detection), as well as responsible AI metrics such as answer refusal, harmfulness, and stereotyping. However, customers sometimes want to define these metrics differently or create new metrics tailored to their needs. For example, a customer may define a metric that evaluates how closely an application's responses adhere to their specific brand voice, or they may want to classify responses according to a custom categorical rubric.
Now, Amazon Bedrock Evaluations lets customers create and reuse custom metrics for both model and RAG evaluation powered by LLM-as-a-judge. Customers can write their own judge prompts, define their own categorical or numerical rating scales, and use built-in variables to inject data from their dataset or generative AI responses into the judge prompt at runtime, giving them full control over the data flow in their evaluations. Provided quickstart templates can serve as inspiration for new judge prompts and rubrics, or customers can build their own from scratch, along the lines of the sketch below.
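As an illustration, the following is a minimal sketch of what a custom brand-voice metric could look like: a judge prompt that uses built-in variables to pull in the input prompt and the generated response, plus a categorical rating scale mapped to numerical values. The field names and the `{{prompt}}`/`{{prediction}}` variable names are assumptions for illustration only; consult the user guide for the exact schema.

```python
# A minimal sketch of a custom LLM-as-a-judge metric definition.
# Field names (name, instructions, ratingScale) and the built-in variables
# {{prompt}} and {{prediction}} are assumptions based on the feature
# description; check the Amazon Bedrock user guide for the exact schema.
brand_voice_metric = {
    "customMetricDefinition": {
        "name": "BrandVoiceAdherence",
        # Judge prompt: built-in variables are filled in at runtime with
        # data from the evaluation dataset and the generated responses.
        "instructions": (
            "You are evaluating whether a response follows our brand voice: "
            "friendly, concise, and jargon-free.\n\n"
            "User prompt: {{prompt}}\n"
            "Model response: {{prediction}}\n\n"
            "Rate the response on the scale below and explain your reasoning."
        ),
        # Custom categorical rating scale mapped to numerical values.
        "ratingScale": [
            {"definition": "Fully on brand", "value": {"floatValue": 2.0}},
            {"definition": "Partially on brand", "value": {"floatValue": 1.0}},
            {"definition": "Off brand", "value": {"floatValue": 0.0}},
        ],
    }
}
```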
To get started, visit the Amazon Bedrock console or use the Amazon Bedrock APIs. For more information, see the user guide.
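For reference, here is a hedged sketch of how an evaluation job using the custom metric above might be started programmatically with boto3. The `create_evaluation_job` API and its top-level parameters exist in the Bedrock control-plane client, but the nesting of the custom-metric and evaluator-model fields shown here is an assumption to verify against the API reference; the job name, role ARN, S3 locations, and model identifiers are placeholders.

```python
# Hedged sketch: start an automated evaluation job that uses the custom
# metric defined above. The exact request shape for customMetricConfig and
# evaluatorModelConfig is assumed -- verify against the user guide and the
# Amazon Bedrock API reference before use.
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_evaluation_job(
    jobName="brand-voice-eval",  # placeholder job name
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",  # placeholder role
    inferenceConfig={
        "models": [
            # Model whose responses are being evaluated (example identifier).
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}}
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-eval-bucket/results/"},  # placeholder bucket
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Generation",  # task type value is illustrative
                    "dataset": {
                        "name": "BrandVoicePrompts",
                        "datasetLocation": {"s3Uri": "s3://my-eval-bucket/dataset.jsonl"},
                    },
                    "metricNames": ["BrandVoiceAdherence"],
                }
            ],
            # Judge model and custom metric wiring (assumed field names).
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
            "customMetricConfig": {"customMetrics": [brand_voice_metric]},
        }
    },
)
print(response["jobArn"])
```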