Edge AI LLM Benchmark — Jenish Patel

Overview

A research tool that benchmarks small language models running entirely in the browser with WebGPU acceleration: no backend, no API keys, and no inference cost. Models are loaded via @mlc-ai/web-llm and cached in IndexedDB, so subsequent runs skip the download. Separate Python scripts benchmark cloud APIs (Gemini 2.5 Flash, Gemini 3 Flash, Claude Haiku 4.5), and the tool compares edge and cloud models on latency, throughput, and per-query cost. Results can be exported as CSV.
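
As a rough illustration of the per-query cost comparison, cost can be derived from token counts and a per-million-token price. This is a minimal sketch, not the project's code: the Pricing class, the query_cost function, and the example prices are hypothetical, and real API prices must be taken from each provider's current price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price record, in USD per 1 million tokens."""
    input_per_mtok: float
    output_per_mtok: float


# In-browser inference has no per-token charge.
EDGE = Pricing(input_per_mtok=0.0, output_per_mtok=0.0)


def query_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Return the cost in USD of a single query."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Placeholder prices for illustration only (NOT a real provider's rates).
placeholder = Pricing(input_per_mtok=1.0, output_per_mtok=2.0)
cloud_cost = query_cost(1000, 500, placeholder)
edge_cost = query_cost(1000, 500, EDGE)
```

Keeping prices in a separate record means the same token counts can be re-priced when a provider changes its rates, without re-running the benchmark.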

Architecture & Approach

The web frontend is built in vanilla JavaScript with ES6 modules. Model cards let users select from Llama 3.2 1B, Qwen 2.5 1.5B, SmolLM2 1.7B, or Llama 3.1 8B (q4f16 or q4f32 quantized). Each benchmark run records time to first token (TTFT) and tokens per second (TPS) for 20 standardized queries across 8 categories (factual, technical, math, code, etc.). Thermal throttling is detected by comparing results from the first 5 queries of a run with those from the last 5. MMLU quality is measured on 57 subjects × 2 questions each (114 questions in total), scored against ground-truth answers. The Python scripts (gemini_testing.py, anthropic_testing.py, mmlu_benchmark.py) call the respective APIs with temperature=0 and max_tokens=10 for consistent multiple-choice evaluation. Results from multiple hardware profiles (Windows NVIDIA GPU, Mac M1, Mac M2 Pro) are exported to timestamped CSVs.
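
The per-query timing and the throttling check described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, TPS is taken here as tokens divided by total wall-clock time for the query, the throttling comparison uses mean TPS, and the 10% drop threshold is an arbitrary example value.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Sequence


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and return TTFT (ms) and TPS for one query."""
    start = clock()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = clock()
        if first_token_at is None:
            first_token_at = now  # arrival time of the first token
        count += 1
    end = clock()
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    return {
        "ttft_ms": (first_token_at - start) * 1000.0,
        # Assumption: TPS = tokens / total elapsed time for the query.
        "tps": count / total if total > 0 else 0.0,
        "tokens": count,
    }


def detect_throttling(tps_per_query: Sequence[float],
                      window: int = 5,
                      drop_threshold: float = 0.10) -> dict:
    """Compare mean TPS of the first and last `window` queries of a run."""
    if len(tps_per_query) < 2 * window:
        raise ValueError("need at least 2 * window queries")
    early = mean(tps_per_query[:window])
    late = mean(tps_per_query[-window:])
    # Relative slowdown from the start of the run to the end.
    drop = (early - late) / early if early > 0 else 0.0
    return {
        "early_tps": early,
        "late_tps": late,
        "drop": drop,
        "throttled": drop >= drop_threshold,  # assumed example threshold
    }
```

Passing the clock as a parameter keeps the timing logic testable with a fake clock. With a 20-query run, the default window of 5 matches the first-5 versus last-5 comparison.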

Results & Outcome

Llama 3.2 1B delivered the best edge performance: 73–172 ms average TTFT and 48–52 TPS on the test hardware, at $0 per run. Qwen 2.5 1.5B and SmolLM2 1.7B followed at 32–39 TPS. Among the cloud APIs, Gemini 2.5 Flash averaged ~300 ms TTFT, with a measurable per-query cost. Thermal throttling was observable on sustained multi-model runs. Combined benchmark CSVs across the 3 hardware profiles enable apples-to-apples device comparison.
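
A device comparison over the combined CSVs could be computed along these lines. This is a sketch only: the column names (device, model, ttft_ms, tps) and the function name are assumptions, since the actual CSV schema is not described here.

```python
import csv
from collections import defaultdict
from statistics import mean
from typing import Iterable, TextIO


def summarize_by_device(csv_files: Iterable[TextIO]) -> dict:
    """Average TTFT and TPS per (device, model) across several CSV exports.

    Assumes each file has a header row with the hypothetical columns
    device, model, ttft_ms, and tps, and one row per benchmark query.
    """
    groups = defaultdict(lambda: {"ttft_ms": [], "tps": []})
    for f in csv_files:
        for row in csv.DictReader(f):
            key = (row["device"], row["model"])
            groups[key]["ttft_ms"].append(float(row["ttft_ms"]))
            groups[key]["tps"].append(float(row["tps"]))
    return {
        key: {
            "avg_ttft_ms": mean(values["ttft_ms"]),
            "avg_tps": mean(values["tps"]),
            "queries": len(values["tps"]),
        }
        for key, values in sorted(groups.items())
    }
```

The function accepts open file objects, so it works with files opened via open(path, newline="") as well as in-memory buffers. Grouping by both device and model keeps the comparison like-for-like.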

Tech Stack

WebGPU, WebLLM, JavaScript, Python, Gemini API, Anthropic API, MMLU