JONESTECH_SYS
BENCH
DOC:004 // REAL HARDWARE · REAL NUMBERS · LIVE MODEL

Numbers & Live Test

Benchmark data from real deployments, plus a live model you can test yourself. The chat below runs Llama 3.3 70B via Groq. It is the same model family we deploy locally, so you know what you're getting.

PEAK THROUGHPUT
62 tok/s
Qwen 2.5 Coder 32B Q5 · RTX 4090
TIME TO FIRST TOKEN
84ms
Llama 3.3 70B Q4 · LAN only
VS GPT-4o API LATENCY
9×
faster TTFT, no network hop
COST PER 1M TOKENS
~$0.04
amortised hardware cost only
THROUGHPUT

Real Numbers. Real Hardware.

All benchmarks were measured on production hardware with llama.cpp / Ollama, a 2048-token context, and greedy sampling. Numbers are generation throughput: tokens per second after the first token lands.
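As a rough sketch of how the two figures quoted on this page are kept apart, the snippet below derives TTFT and generation throughput from per-token arrival timestamps. The names `summarize_stream` and `StreamTiming` and the synthetic timestamps are illustrative assumptions, not the harness used to produce the numbers here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTiming:
    ttft_ms: float        # time from request start to the first token
    gen_tok_per_s: float  # tokens per second after the first token


def summarize_stream(request_start: float, token_times: Sequence[float]) -> StreamTiming:
    """Derive TTFT and generation throughput from token arrival times (seconds)."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure generation throughput")
    first, last = token_times[0], token_times[-1]
    if last <= first:
        raise ValueError("token timestamps must be increasing")
    ttft_ms = (first - request_start) * 1000.0
    # The first token is excluded from the rate: only tokens arriving after it count.
    gen_tok_per_s = (len(token_times) - 1) / (last - first)
    return StreamTiming(ttft_ms=ttft_ms, gen_tok_per_s=gen_tok_per_s)


# Synthetic timestamps for illustration only (not measured data):
# first token at 0.1 s, then one token every 0.02 s.
timing = summarize_stream(0.0, [0.1 + 0.02 * i for i in range(101)])
print(f"TTFT {timing.ttft_ms:.0f} ms, generation {timing.gen_tok_per_s:.1f} tok/s")
```

Separating the two values this way keeps prompt processing and network delay out of the tok/s figure, which is why TTFT is reported on its own.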

LOCAL DEPLOYMENTS
MODEL                   HARDWARE                       THROUGHPUT
Llama 3.3 70B Q4        RTX 4090 · 24GB VRAM           47 tok/s
Qwen 2.5 Coder 32B Q5   RTX 4090 · 24GB VRAM           62 tok/s
Mistral Small 22B Q6    RTX 3090 · 24GB VRAM           38 tok/s
Phi-4 14B Q8            Apple M3 Max · 128GB unified   28 tok/s
Llama 3.1 8B Q4         RTX 3080 · 10GB VRAM           96 tok/s
CLOUD API COMPARISON
GPT-4o API              OpenAI · avg latency           ~30 tok/s *
Claude Sonnet API       Anthropic · avg latency        ~35 tok/s *

* Cloud API throughput varies with provider load, geographic region, and tier. More importantly, TTFT for cloud APIs averages 600–1200ms because of the network round-trip, while local TTFT is 60–120ms regardless of load. Numbers are sourced from community benchmarks; your results may vary.

LATENCY BREAKDOWN

Where Time Actually Goes

Throughput (tok/s) is only half the story. Time To First Token (TTFT) determines how responsive your application feels — and this is where local wins most dramatically.

DEPLOYMENT          TTFT (avg)   THROUGHPUT   RATE LIMITS      CONSISTENCY     DATA LEAVES NETWORK
Local · RTX 4090    84ms         47 tok/s     NONE             DETERMINISTIC   NEVER
Local · RTX 3090    110ms        38 tok/s     NONE             DETERMINISTIC   NEVER
GPT-4o API          ~800ms       ~30 tok/s    TPM / RPM CAPS   VARIABLE        ALWAYS
Claude Sonnet API   ~700ms       ~35 tok/s    TPM / RPM CAPS   VARIABLE        ALWAYS
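
To see how TTFT and throughput combine, total response time can be approximated as TTFT plus the time needed to stream the remaining tokens. The sketch below applies that to the table's figures for a hypothetical 200-token reply. The function name, the reply length, and the assumption that throughput stays constant across the reply are illustrative simplifications.

```python
def estimate_response_seconds(ttft_ms: float, tok_per_s: float, n_tokens: int) -> float:
    """Approximate wall-clock time to receive n_tokens.

    Assumes the first token arrives at TTFT and the rest stream at a
    constant tok_per_s (a simplification; real rates fluctuate).
    """
    if n_tokens < 1 or tok_per_s <= 0:
        raise ValueError("n_tokens must be >= 1 and tok_per_s must be > 0")
    return ttft_ms / 1000.0 + (n_tokens - 1) / tok_per_s


# (TTFT in ms, tok/s) copied from the latency table; cloud values are approximate.
DEPLOYMENTS = {
    "Local · RTX 4090": (84, 47),
    "Local · RTX 3090": (110, 38),
    "GPT-4o API": (800, 30),
    "Claude Sonnet API": (700, 35),
}

REPLY_TOKENS = 200  # hypothetical reply length, chosen for illustration

estimates = {
    name: estimate_response_seconds(ttft_ms, tok_per_s, REPLY_TOKENS)
    for name, (ttft_ms, tok_per_s) in DEPLOYMENTS.items()
}

for name, seconds in estimates.items():
    print(f"{name:<18} {seconds:5.2f} s")
```

Under these assumptions the TTFT gap (about 800ms against 84ms, roughly 9.5×) matters most for short replies, while throughput dominates as replies get longer.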
LIVE TEST

Talk to Llama 3.3 Right Now

This is Llama 3.3 70B, the same open-weight model we deploy on local hardware. It runs here on Groq's LPU infrastructure so you can test it without your own GPU. Ask it anything about local AI deployment.

JONESTECH AI

Hey — I'm running on Llama 3.3 70B, the same open-weight model Jonestech deploys on local hardware. Ask me anything about local LLM deployment, hardware requirements, quantization, RAG pipelines, or whether local AI makes sense for your setup.

READY TO OWN YOUR STACK

Run This On Your Hardware

You've seen the numbers and tested the model. We'll get Llama 3.3 70B — or whatever fits your use case — running on your own infrastructure. No more API bills.