February 15th, 2026
New
Improved
You can now manually label traces to measure how accurate your LLM evals are. For boolean, score, and category evals attached to a dataset, go to the new Ground Truth page in the sidebar to start labelling. Once you have labels, Beval calculates precision, recall, F1, and accuracy automatically.
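For boolean evals, the metrics come from comparing each eval verdict against your manual label. As an illustration only (not Beval's actual implementation), here is a minimal sketch of how precision, recall, F1, and accuracy fall out of those comparisons; the function name and inputs are hypothetical:

```python
def eval_metrics(ground_truth: list[bool], eval_outputs: list[bool]) -> dict:
    """Compare eval verdicts to manual labels and compute accuracy metrics.

    ground_truth: your manual labels (the "right answer" per trace)
    eval_outputs: what the LLM eval said for the same traces
    """
    pairs = list(zip(ground_truth, eval_outputs))
    tp = sum(1 for g, p in pairs if g and p)          # eval said True, label True
    fp = sum(1 for g, p in pairs if not g and p)      # eval said True, label False
    fn = sum(1 for g, p in pairs if g and not p)      # eval said False, label True
    tn = sum(1 for g, p in pairs if not g and not p)  # eval said False, label False

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Precision answers "when the eval flags a trace, how often is it right?", while recall answers "of the traces that should be flagged, how many does the eval catch?".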
After viewing run results, you'll also see a prompt to start labelling right there, so you can go from "are these results right?" to measuring accuracy in one click.
The Evals page has been redesigned as a toolkit grouped by type: Boolean, Score, Category, and Comment. Each section explains what the eval type does and has a quick "+ New" card to create one. A search bar at the top lets you filter across all types.
The Runs table now displays which version of the eval was used for each run, making it easier to track prompt iterations.
Every page now has loading skeletons so navigating between views feels snappier, with no more blank screens while data loads.
The Datasets page now has a search bar and groups your datasets by Today, This week, and Earlier for easier browsing.