February 15th, 2026
New
Improved
You can now manually label traces to measure how accurate your LLM evals are. For boolean, score, and category evals attached to a dataset, go to the new Ground Truth page in the sidebar to start labelling. Once you have labels, Beval calculates precision, recall, F1, and accuracy automatically.
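For boolean evals, the metrics come from comparing each eval verdict against your manual label. As an illustration only (not Beval's actual implementation), here is a minimal sketch of how precision, recall, F1, and accuracy fall out of those comparisons; the function name and inputs are hypothetical:

```python
def eval_metrics(ground_truth: list[bool], eval_outputs: list[bool]) -> dict:
    """Compare eval verdicts to manual labels and compute accuracy metrics.

    ground_truth: your manual labels (the "right answer" per trace)
    eval_outputs: what the LLM eval said for the same traces
    """
    pairs = list(zip(ground_truth, eval_outputs))
    tp = sum(1 for g, p in pairs if g and p)          # eval said True, label True
    fp = sum(1 for g, p in pairs if not g and p)      # eval said True, label False
    fn = sum(1 for g, p in pairs if g and not p)      # eval said False, label True
    tn = sum(1 for g, p in pairs if not g and not p)  # eval said False, label False

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Precision answers "when the eval flags a trace, how often is it right?", while recall answers "of the traces that should be flagged, how many does the eval catch?".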
After viewing run results, you'll also see a prompt to start labelling right there, so you can go from "are these results right?" to measuring accuracy in one click.
The Evals page has been redesigned as a toolkit grouped by type: Boolean, Score, Category, and Comment. Each section explains what the eval type does and has a quick "+ New" card to create one. A search bar at the top lets you filter across all types.
The Runs table now displays which version of the eval was used for each run, making it easier to track prompt iterations.
Every page now has loading skeletons so navigating between views feels snappier, with no more blank screens while data loads.
The Datasets page now has a search bar and groups your datasets by Today, This week, and Earlier for easier browsing.