February 15th, 2026

New

Improved

Ground truth labelling, redesigned evals toolkit, and faster navigation

Ground truth labelling

You can now manually label traces to measure how accurate your LLM evals are. For boolean, score, and category evals attached to a dataset, go to the new Ground Truth page in the sidebar to start labelling. Once you have labels, Beval calculates precision, recall, F1, and accuracy automatically.

After viewing run results, you'll also see a prompt to start labelling right there, so you can go from "are these results right?" to measuring accuracy in one click.
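To make the four metrics concrete, here is a minimal sketch of how they are conventionally defined for a boolean eval, comparing eval outputs against ground truth labels. This is not Beval's implementation: the function name, argument names, and zero-division handling are assumptions for illustration, and how score and category evals are aggregated is not covered here.

```python
def boolean_eval_metrics(predictions, labels):
    """Compare boolean eval outputs against manual ground truth labels.

    Hypothetical helper using the standard textbook definitions;
    not taken from Beval itself.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)          # eval said True, label True
    fp = sum(1 for p, y in pairs if p and not y)      # eval said True, label False
    fn = sum(1 for p, y in pairs if not p and y)      # eval said False, label True
    tn = sum(1 for p, y in pairs if not p and not y)  # eval said False, label False

    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Five labelled traces: what the eval returned vs. what a human labelled.
eval_outputs = [True, True, False, False, True]
ground_truth = [True, False, False, True, True]
metrics = boolean_eval_metrics(eval_outputs, ground_truth)
```

In this example the eval agrees with the human label on three of five traces, so accuracy is 0.6, while precision and recall each come out to 2/3.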

Evals toolkit

The Evals page has been redesigned as a toolkit grouped by type: Boolean, Score, Category, and Comment. Each section explains what the eval type does and has a quick "+ New" card to create one. A search bar at the top lets you filter across all types.

Runs now show eval version

The Runs table now displays which version of the eval was used for each run, making it easier to track prompt iterations.

Faster page navigation

Every page now has loading skeletons, so navigating between views feels snappier, with no more blank screens while data loads.

Datasets page refresh

The Datasets page now has a search bar and groups your datasets by Today, This week, and Earlier for easier browsing.