March 9th, 2026
New
Improved
We've given the entire app a visual refresh — every card, detail page, and table has been redesigned for consistency and clarity.
February 18th, 2026
Improved
The landing page now highlights the ground truth feature — step 3 is now "Run & compare", showing that you can add human labels and measure how well your evals match. A subtle "New" badge draws attention to it.
Also tidied up the footer by removing the privacy policy link.
February 16th, 2026
New
Improved
A round of polish across the app:
Separated footer buttons — dialogs, alert confirmations, and the login/signup forms now have a visually distinct footer area with a subtle border and background, matching modern shadcn conventions.
Decluttered layouts — removed unnecessary card wrappers from the eval prompt, eval runs section, danger zone, and accuracy metrics. Pages feel lighter and less boxy.
Ground truth page tidied up — consolidated the labelling page from 6 separate cards into 3 clean sections. Navigation is now integrated into the trace card, and all sidebar controls live in a single panel.
Richer example dataset — the example dataset now includes 10 trip planning conversations (up from 3) covering diverse scenarios like solo backpacking, family holidays, ski trips, and more. It also ships with 3 example evals (boolean, score, and category) so new users can see the full range of eval types immediately.
Component showcase — added a hidden page at /app/components showing all UI components used across the app. Handy as a living style guide.
February 15th, 2026
New
Improved
You can now manually label traces to measure how accurate your LLM evals are. For boolean, score, and category evals attached to a dataset, go to the new Ground Truth page in the sidebar to start labelling. Once you have labels, Beval calculates precision, recall, F1, and accuracy automatically.
After viewing run results, you'll also see a prompt to start labelling right there — so you can go from "are these results right?" to measuring accuracy in one click.
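Under the hood, these are the standard confusion-matrix metrics. As a rough sketch (hypothetical — not Beval's actual implementation), here is how they fall out of a boolean eval's verdicts compared against human labels:

```python
def accuracy_metrics(predictions: list[bool], labels: list[bool]) -> dict[str, float]:
    """Precision, recall, F1, and accuracy for boolean eval verdicts vs. human labels."""
    # Count the four confusion-matrix cells.
    tp = sum(1 for p, l in zip(predictions, labels) if p and l)
    fp = sum(1 for p, l in zip(predictions, labels) if p and not l)
    fn = sum(1 for p, l in zip(predictions, labels) if not p and l)
    tn = sum(1 for p, l in zip(predictions, labels) if not p and not l)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(labels)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# e.g. the eval said [True, True, False, True], humans labelled [True, False, False, True]
print(accuracy_metrics([True, True, False, True], [True, False, False, True]))
```

Score and category evals reduce to the same idea — treat each label/verdict pair as a match or a mismatch and tally accordingly.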
The Evals page has been redesigned as a toolkit grouped by type — Boolean, Score, Category, and Comment. Each section explains what the eval type does and has a quick "+ New" card to create one. A search bar at the top lets you filter across all types.
The Runs table now displays which version of the eval was used for each run, making it easier to track prompt iterations.
Every page now has loading skeletons so navigating between views feels snappier — no more blank screens while data loads.
The Datasets page now has a search bar and groups your datasets by Today, This week, and Earlier for easier browsing.
February 13th, 2026
Fixed
A couple of fixes today:
February 9th, 2026
Fixed
Signing up now takes you straight into the app — no more brief flash of an email verification screen. One click and you're in.
February 9th, 2026
New
New users now see a Try an example dataset button when they have no datasets yet. One click loads 3 trip planning assistant conversations so you can start exploring evals right away — no CSV needed.
February 9th, 2026
New
Quick and dirty LLM-based evaluations for your AI product traces. Upload a dataset, tell the LLM what to look for, and get results in minutes.