Changelog

Follow new updates and improvements to Beval.

March 9th, 2026

New

Improved

We've given the entire app a visual refresh — every card, detail page, and table has been redesigned for consistency and clarity.

  • Redesigned cards — Dataset, eval, and ground truth cards now share a clean portrait layout with icons, metadata rows, and separators. Dataset cards show which evals are attached. Eval cards show when they were last run.
  • New runs table — Runs are now displayed in a proper sortable table with filters for eval, column, status, dataset, and date range. Click any row to see the full results.
  • Cleaner detail pages — Eval and dataset detail pages use flat section headers instead of nested cards, giving everything more breathing room.
  • Eval descriptions — Evals now have an optional description field so you can capture the intent behind each eval in plain language.
  • Daily rate limit — To keep things fair, there's now a limit of 200 trace evaluations per day per user. You'll see a clear message if you hit it.
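For the curious, a per-user daily cap like this boils down to a counter keyed by user and day. Here's a minimal illustrative sketch of the counting logic, assuming an in-memory store (the real limit is enforced server-side, and these names are hypothetical):

```python
# Illustrative sketch of a per-user daily evaluation limit.
# Not Beval's actual implementation — names and storage are assumptions.
from datetime import date
from collections import defaultdict

DAILY_LIMIT = 200
_usage = defaultdict(int)  # (user_id, day) -> trace evaluations used so far

def try_evaluate(user_id: str, n_traces: int) -> bool:
    """Allow a run only if it keeps the user within today's limit."""
    key = (user_id, date.today())
    if _usage[key] + n_traces > DAILY_LIMIT:
        return False  # caller would show the "limit reached" message here
    _usage[key] += n_traces
    return True

print(try_evaluate("u1", 150))  # first run of the day fits
print(try_evaluate("u1", 100))  # would exceed 200, rejected
print(try_evaluate("u1", 50))   # exactly reaches the cap, allowed
```

The counter resets naturally at midnight because the key includes the date; a production version would persist this in a database rather than process memory.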

February 18th, 2026

Improved

The landing page now highlights the ground truth feature — step 3 is now "Run & compare", showing that you can add human labels and measure how well your evals match. A subtle "New" badge draws attention to it.

Also tidied up the footer by removing the privacy policy link.

February 16th, 2026

New

Improved

A round of polish across the app:

  • Separated footer buttons — dialogs, alert confirmations, and the login/signup forms now have a visually distinct footer area with a subtle border and background, matching modern shadcn conventions.

  • Decluttered layouts — removed unnecessary card wrappers from the eval prompt, eval runs section, danger zone, and accuracy metrics. Pages feel lighter and less boxy.

  • Ground truth page tidied up — consolidated the labelling page from 6 separate cards into 3 clean sections. Navigation is now integrated into the trace card, and all sidebar controls live in a single panel.

  • Richer example dataset — the example dataset now includes 10 trip planning conversations (up from 3) covering diverse scenarios like solo backpacking, family holidays, ski trips, and more. It also ships with 3 example evals (boolean, score, and category) so new users can see the full range of eval types immediately.

  • Component showcase — added a hidden page at /app/components showing all UI components used across the app. Handy as a living style guide.

February 15th, 2026

New

Improved

Ground truth labelling

You can now manually label traces to measure how accurate your LLM evals are. For boolean, score, and category evals attached to a dataset, go to the new Ground Truth page in the sidebar to start labelling. Once you have labels, Beval calculates precision, recall, F1, and accuracy automatically.

After viewing run results, you'll also see a prompt to start labelling right there — so you can go from "are these results right?" to measuring accuracy in one click.
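The metrics follow the standard definitions over the confusion counts between what the eval predicted and what the human labelled. A minimal sketch for the boolean case (illustrative only, not Beval's implementation):

```python
# Illustrative: precision, recall, F1, and accuracy from boolean ground
# truth labels, computed from the four confusion-matrix counts.

def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for paired boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn

def metrics(predicted, actual):
    tp, fp, fn, tn = confusion_counts(predicted, actual)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(actual)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# The eval said True for 3 of 4 traces; the human agrees on 2 of those 3.
print(metrics([True, True, True, False], [True, True, False, False]))
```

For score and category evals the same idea applies per class, averaged across classes.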

Evals toolkit

The Evals page has been redesigned as a toolkit grouped by type — Boolean, Score, Category, and Comment. Each section explains what the eval type does and has a quick "+ New" card to create one. A search bar at the top lets you filter across all types.

Runs now show eval version

The Runs table now displays which version of the eval was used for each run, making it easier to track prompt iterations.

Faster page navigation

Every page now has loading skeletons so navigating between views feels snappier — no more blank screens while data loads.

Datasets page refresh

The Datasets page now has a search bar and groups your datasets by Today, This week, and Earlier for easier browsing.

February 13th, 2026

Fixed

A couple of fixes today:

  • Dataset renaming now works as expected. Previously, clicking the pencil icon to rename a dataset would silently fail — this has been resolved.
  • Run detail page no longer stretches beyond the screen when results contain long text. Everything stays neatly within the page now.

February 9th, 2026

Fixed

Signing up now takes you straight into the app — no more brief flash of an email verification screen. One click and you're in.

February 9th, 2026

New

New users now see a Try an example dataset button when they have no datasets yet. One click loads 3 trip planning assistant conversations so you can start exploring evals right away — no CSV needed.

February 9th, 2026

New

Quick and dirty LLM-based evaluations for your AI product traces. Upload a dataset, tell the LLM what to look for, and get results in minutes.

  • Upload datasets — drop in a CSV of user conversations and preview it instantly
  • Create evals — write a plain-English prompt describing what to evaluate. Classify (yes/no), score (1-5), categorize, or get freeform comments
  • Run — pick a dataset, pick an eval, hit run. Results stream in live as the LLM works through your traces
  • Charts & insights — see results at a glance with pie charts, bar charts, and per-trace reasoning
  • Export — download everything as a CSV with eval results as new columns, ready for your spreadsheet
  • Iterate — update your eval prompt, re-run on the same dataset, and compare versions
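To illustrate the export shape, here's a sketch of eval results being appended to a dataset as new columns using Python's csv module. The column names (`helpfulness_score`, `helpfulness_reasoning`) and the data are made up for illustration — they're not Beval's actual schema:

```python
# Hypothetical sketch of the CSV export format: original dataset columns
# preserved, with each eval's result and reasoning appended as new columns.
import csv
import io

dataset = [
    {"trace_id": "t1", "conversation": "Plan a ski trip to Austria"},
    {"trace_id": "t2", "conversation": "Solo backpacking in Vietnam"},
]
# Eval results keyed by trace, as a run might produce them (illustrative).
results = {
    "t1": {"helpfulness_score": 4, "helpfulness_reasoning": "Concrete itinerary."},
    "t2": {"helpfulness_score": 5, "helpfulness_reasoning": "Detailed and safe."},
}

buf = io.StringIO()
fieldnames = list(dataset[0]) + list(next(iter(results.values())))
writer = csv.DictWriter(buf, fieldnames=fieldnames)
writer.writeheader()
for row in dataset:
    writer.writerow({**row, **results[row["trace_id"]]})

print(buf.getvalue())
```

Each re-run on the same dataset would simply add another pair of columns, which is what makes version-over-version comparison easy in a spreadsheet.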