Production Ready · Deployable in Minutes

AI Inference,
30–36% Greener.

vert-suite autonomously co-optimises CPU and GPU power on bare-metal AI servers — achieving energy savings with no changes to your software, inference stack, or hardware.

30–36%
Energy per token saved
Over 40%
GPU server power reduction
Wall-Socket
Auditable measurement
Zero
App changes required
GPU+CPU Efficiency · Empirical Benchmarks

Proven Savings
at Scale

Every number below is verified at the server plug using a state-of-the-art power analyser — not estimated, not simulated.

~41%
Peak Server Power Reduction (case study, measured at wall-socket)
30–36%
Typical Energy per Token Reduction · NVIDIA Blackwell
vLLM
llama.cpp
Validated inference engines
Measured at Wall-Socket
All Watt-second readings taken at the server wall-socket using a state-of-the-art power analyser — not estimated from TDP or software counters.
Out-of-Band Validation
Independent telemetry path verifies results separate from the vert-suite control plane — no self-reported figures.
Like-for-Like Baselines
Every result compared against an unoptimised, state-of-the-art baseline on identical hardware under identical load.
Live Toggle Verification
Savings confirmed by enabling/disabling vert-suite mid-run and observing real-time power and throughput transitions.
Open-source data collection · verticular/ute9811-mqtt-bridge
The software used to sample power readings from the analyser is publicly available — full transparency on how data was collected.
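For illustration, a consumer of the bridge's MQTT-published readings could be as small as the sketch below. The JSON payload shape here is an assumption for the example — the actual message schema is defined by the ute9811-mqtt-bridge repository.

```python
import json

def parse_power_sample(payload: bytes):
    """Decode one analyser reading published over MQTT.

    The {"ts": ..., "watts": ...} payload shape is illustrative only;
    consult the bridge repository for the real message schema.
    """
    msg = json.loads(payload)
    return float(msg["ts"]), float(msg["watts"])
```

Each decoded `(timestamp, watts)` pair can then be logged or integrated into watt-seconds downstream.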
01
The Challenge

AI Inference Is Energy-Hungry

On a cutting-edge workstation GPU, each generated token can cost upwards of 37 watt-seconds at the wall. With millions of tokens processed daily, that adds up fast. And without active optimisation, both CPU and GPU are left running at full speed regardless of actual workload demand.

Qwen3 32B · RTX Pro 6000 Blackwell · vLLM
Without vert-suite: 37.38 Ws/token · 24.0 tps
With vert-suite: 25.74 Ws/token (−31.2% energy per token) · 21.64 tps (−9.8% throughput)

Gemma 4 31B · RTX Pro 6000 Blackwell · vLLM
Without vert-suite: 37.03 Ws/token · 24.5 tps
With vert-suite: 25.33 Ws/token (−31.6% energy per token) · 22.36 tps (−8.7% throughput)

* All Watt-second readings measured at the server wall-socket. Test system: NVIDIA RTX Pro 6000 Blackwell (96 GB VRAM), 24-core workstation, 128 GB RAM, 2,050 W PSU.
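The Ws/token figures can be reproduced from raw plug-side samples: energy is the time integral of measured power, divided by the tokens generated over the same window. A minimal sketch (function names are illustrative, not part of vert-suite):

```python
def energy_ws(samples):
    """Integrate (time_s, watts) samples into watt-seconds (trapezoidal rule)."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def ws_per_token(samples, tokens_generated):
    """Energy per token over the measurement window."""
    return energy_ws(samples) / tokens_generated
```

For example, a steady 897 W draw over 10 s while generating 240 tokens (24 tps) works out to 8,970 Ws / 240 ≈ 37.4 Ws/token, consistent with the baseline card above.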

02
The Solution

vert-suite

Production Ready · Deployable in Minutes · No Code Changes · Engine Agnostic

A production-ready software platform for autonomous GPU/CPU efficiency orchestration — inference engine agnostic, with no hardware changes, no application modifications, and no manual tuning required.

Autonomous CPU/GPU Optimisation

Autonomous bare-metal GPU and CPU management for AI/LLM workloads. Empirically validated energy savings of over 30% per token — with no changes to your application or inference stack.

Continuous SLA Spectrum

Specify your maximum acceptable throughput reduction. vert-suite automatically scans the GPU/CPU operating envelope and locks in the profile delivering the deepest energy savings within your constraint.

Green & Cost-Efficient

Reduce OpEx, extend platform lifespan, and lower your infrastructure's carbon footprint — without replacing or upgrading hardware.

Deep Monitoring with eBPF

Leverage eBPF for deep system observability with minimal operational disruption — complemented by physical out-of-band telemetry for independent power validation.

Zero-Trust, Kubernetes-Native

Least-privilege agents, just-in-time capabilities, and mTLS-secured control-plane traffic. Transparent telemetry with out-of-band validation for full auditability.

Seamless Integration

Inference engine agnostic — validated on vLLM and llama.cpp, and compatible with other leading engines. One inclusive library, portable across Linux and Kubernetes distributions, deployable in minutes with no operators, scheduler extensions, or YAML modifications.

Most bare-metal optimisers demand root access. We don't.

The industry standard is to hand over the keys to the kingdom — permanent, unconditional root access to your kernel, drivers, and hardware. You shouldn't have to compromise your cluster's security just to run your compute efficiently.

Zero-Compromise Security Architecture

vert-suite is built on a strict principle of minimal authority. Instead of deploying a privileged monolith, the architecture is physically split — a standard unprivileged worker that requests just-in-time access only when needed, and only for the exact duration required.

vert-suite zero-trust security architecture
① Secure Entry

Public-key signature verification at the point of entry — no agent is admitted before its signature is verified.

② Restricted Access

Main agent operates as a standard worker within an unprivileged boundary — CAP_SYS_RAWIO and CAP_SYS_ADMIN never granted permanently.

③ Just-in-Time

A Security Escort briefly unlocks the required resource, completes the exact task, and immediately relocks — no standing privileges.

④ Secure Comms

All control-plane traffic is encrypted via mTLS tunnels — no plaintext communication between components.

⑤ Minimal Authority

Digital identity verification prevents any component from operating outside its authorised scope.
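As a toy illustration of the just-in-time discipline in steps ② and ③ — not vert-suite's implementation, and with no real Linux capabilities involved — the grant-then-relock pattern looks like this:

```python
from contextlib import contextmanager

class Escort:
    """Toy model of a Security Escort: the resource is unlocked only for
    the duration of a single task, then relocked unconditionally."""

    def __init__(self):
        self.unlocked = False  # no standing privilege by default

    @contextmanager
    def access(self):
        self.unlocked = True       # just-in-time grant
        try:
            yield self
        finally:
            self.unlocked = False  # relock even if the task raises
```

Even if the task inside the `with` block fails, the `finally` clause guarantees the relock — the property the architecture relies on.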

How it works
1

Install vert-suite

No YAML changes, no Kubernetes operators, no code modifications to your inference stack.

2

Set Your SLA

Tell vert-suite your maximum acceptable throughput reduction. It autonomously scans the full GPU/CPU operating envelope.

3

Savings Start Immediately

Real-time CPU/GPU co-optimisation locks in the deepest energy savings within your constraint.

03
Performance SLA Spectrum

You Set the Constraint.
We Find the Optimum.

vert-suite does not offer a fixed menu of profiles. It offers a continuous spectrum. Simply tell the software the maximum throughput reduction you can accept — say, no more than 10%, 15%, or 20% — and it automatically identifies the optimal GPU/CPU operational mode to deliver the deepest possible energy savings within that constraint.

Less energy saved ⟷ More energy saved
≤10% TPS impact · ≤15% TPS impact · ≤20% TPS impact

Three example datapoints — any point on the spectrum is achievable

≤10% TPS reduction

Minimal Impact

The software identifies the energy-saving profile that keeps throughput within 10% of baseline. Deep savings with near-transparent operational impact.

~28–30% energy per token saved
≤15% TPS reduction ★

Sweet Spot

A marginal additional throughput budget unlocks significantly deeper energy reductions. Consistently the highest-value point on the spectrum across all tested models.

~33–34% energy per token saved
≤20% TPS reduction

Maximum Efficiency

Accepts a larger throughput trade-off to push energy savings to their maximum. Ideal for cost-capped batch workloads where latency is not time-critical.

~34–37% energy per token saved
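Conceptually, choosing a point on the spectrum is a constrained search: discard candidate operating points whose throughput falls outside the SLA, then take the most energy-efficient survivor. A toy sketch of that selection rule using the Qwen3-32B datapoints from the benchmark section below (an illustration only, not vert-suite's live envelope scan):

```python
def pick_profile(candidates, baseline_tps, max_tps_drop):
    """candidates: (name, tps, ws_per_token) tuples.
    Return the lowest-energy candidate whose throughput stays within
    max_tps_drop of the baseline, or None if nothing qualifies."""
    floor = (1.0 - max_tps_drop) * baseline_tps
    feasible = [c for c in candidates if c[1] >= floor]
    return min(feasible, key=lambda c: c[2], default=None)

# Qwen3-32B benchmark datapoints (baseline 42.04 tps)
qwen3_32b = [
    ("conservative", 39.86, 13.32),
    ("balanced", 38.38, 12.49),
    ("aggressive", 34.18, 12.27),
]
```

With a ≤10% SLA the throughput floor is ~37.8 tps, so "balanced" wins at 12.49 Ws/token; relaxing the SLA to ≤20% admits "aggressive" at 12.27 Ws/token.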
04
Benchmark Results

Comprehensive Model Benchmarks

Five leading open-weight models. Three SLA profiles. All energy independently verified at the server wall-socket using a state-of-the-art power analyser — against unoptimised SotA baselines.

Technology Stack
RTX Pro 6000 Blackwell · llama.cpp · vert-suite
Energy per token (Ws/token) reduction (energy saved — higher is better)
SLA profiles: ≤10% · ≤15% · ≤20% TPS impact

Llama-3-70B-Instruct · 70B params
Baseline: 39.15 Ws/token · 20.56 tps · Q8_0 · 75 GB
≤10% TPS (Conservative): −28.83% → 27.87 Ws/token · 19.97 tps (−2.87% throughput)
≤15% TPS (Balanced ★): −33.84% → 25.90 Ws/token · 19.50 tps (−5.17% throughput)
≤20% TPS (Aggressive): −35.99% → 25.06 Ws/token · 18.02 tps (−12.34% throughput)

Qwen2.5-72B-Instruct · 72B params
Baseline: 40.84 Ws/token · 19.96 tps · Q8_0 · 77.5 GB
≤10% TPS (Conservative): −29.60% → 28.75 Ws/token · 19.39 tps (−2.86% throughput)
≤15% TPS (Balanced ★): −34.30% → 26.83 Ws/token · 18.94 tps (−5.11% throughput)
≤20% TPS (Aggressive): −36.52% → 25.93 Ws/token · 17.48 tps (−12.42% throughput)

Qwen3-32B · 32B params
Baseline: 18.52 Ws/token · 42.04 tps · Q8_0 · 34.8 GB
≤10% TPS (Conservative): −28.07% → 13.32 Ws/token · 39.86 tps (−5.17% throughput)
≤15% TPS (Balanced ★): −32.58% → 12.49 Ws/token · 38.38 tps (−8.72% throughput)
≤20% TPS (Aggressive): −33.75% → 12.27 Ws/token · 34.18 tps (−18.70% throughput)

Qwen3.5-27B · 27B params
Baseline: 26.99 Ws/token · 27.76 tps · BF16 · 50.7 GB
≤10% TPS (Conservative): −26.59% → 19.82 Ws/token · 26.13 tps (−5.88% throughput)
≤15% TPS (Balanced ★): −31.16% → 18.58 Ws/token · 25.32 tps (−8.80% throughput)
≤20% TPS (Aggressive): −31.49% → 18.50 Ws/token · 22.82 tps (−17.81% throughput)

Qwen3.5-35B-A3B · 35B MoE params
Baseline: 5.23 Ws/token · 114.58 tps · BF16 · 69.4 GB
≤10% TPS (Conservative): −19.25% → 4.22 Ws/token · 103.11 tps (−10.01% throughput)
≤15% TPS (Balanced ★): −20.16% → 4.18 Ws/token · 99.28 tps (−13.36% throughput)
≤20% TPS (Aggressive): −20.65% → 4.15 Ws/token · 94.82 tps (−17.25% throughput)

* All Watt-second readings measured at the server wall-socket on a single NVIDIA RTX Pro 6000 Blackwell GPU server.

05
Case Study

Optimising Autonomous AI Agents

Continuous, zero-intervention AI agent loops create a sustained, demanding inference load. We validated vert-suite in exactly this scenario — using Claude Code acting as an autonomous coding agent running non-stop research iterations on a dedicated GPU server.

Technology Stack
Claude Code · RTX Pro 6000 Blackwell · vLLM · Gemma 4 (31B) · vert-suite
Average Server Power (measured at server wall-socket)
Without vert-suite: 781 W
With vert-suite: 459 W
~32% energy per token saved
~41% power reduction on our test server
~12% throughput reduction
What's happening
The Local Agent
Claude Code running locally, pointing to a GPU server — RTX Pro 6000 Blackwell running vLLM and Gemma 4 (31B).
The Problem
Running an autonomous coding agent is power-hungry. The server was drawing 800 W+ to keep Gemma running at 19–22 tokens/sec.
The Fix
vert-suite applied. Power dropped to ~450 W — a ~41% reduction. Throughput only fell by 10–14%.
The Agent Loop
An extension of Karpathy's Autoresearch pattern: modify → evaluate → measure → keep or revert → repeat. Zero human intervention.
Key moments
Optimisation ON — power usage drops sharply
Optimisation OFF — power climbs back to baseline
Optimisation ON again — confirms repeatability
What's on screen
Top Left: vLLM Logs · Right Half: Claude Code · Bottom Left: Agent Loop · Bottom Right: Power Trace

What 30% energy savings means in practice

Average UK commercial electricity rate: 25.5p/kWh · 24/7 continuous operation · Savings scale linearly with fleet size

Workstation: 2.5 kW · £5,585/yr · saves £1,675 per year
Mid-Range Server: 6.25 kW · £13,961/yr · saves £4,188 per year
Enterprise Server: 10 kW · £22,338/yr · saves £6,701 per year
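The three fleet figures follow from one formula: average draw × hours per year × tariff, with 30% of that cost saved. A quick reproduction:

```python
TARIFF_GBP_PER_KWH = 0.255   # average UK commercial rate used above
HOURS_PER_YEAR = 24 * 365    # continuous 24/7 operation

def annual_cost_gbp(avg_draw_kw):
    """Yearly electricity cost for a server at a constant average draw."""
    return avg_draw_kw * HOURS_PER_YEAR * TARIFF_GBP_PER_KWH

def annual_saving_gbp(avg_draw_kw, saving_fraction=0.30):
    """Yearly saving at a given fractional energy reduction."""
    return annual_cost_gbp(avg_draw_kw) * saving_fraction
```

`annual_cost_gbp(2.5)` gives £5,584.50 (the workstation row, rounded to £5,585) and `annual_saving_gbp(2.5)` gives £1,675.35 — and, as noted, both scale linearly with fleet size.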
06
News & Recognition

Recognised by Industry Leaders


Innovate UK 'New Innovators' Grant

Verticular has secured the 'Growth Catalyst Early Stage' grant to advance EPIC (Energy-efficient Processing for Intensive Computing) — using AI to dynamically manage both CPU and GPU power and cut energy costs for AI data centres without performance trade-offs.


Member of NVIDIA Inception

We are proud members of the NVIDIA Inception program, giving us early access to the latest GPU ecosystem and enabling us to optimise AI workloads at the deepest hardware level.

Partnering with

NVIDIA Inception · Nokia · MK Stadium · Real Wireless · Madevo · UKRI · Weaver Labs · Hiro · University of Bristol
07
The Team

Built by Industry Experts

Verticular was founded by two tech leaders who spent careers engineering at the frontier of wireless networks and AI systems — and who saw first-hand the cost and energy problem coming.

Dr Dan Warren
Chief Executive Officer
  • Former Director of Advanced Network Research at Samsung Research UK
  • Creator of the VoLTE standard at GSMA
  • dan@verticular.uk
Dr Andrea Tassi
Chief Technology Officer
  • Former Chief Engineer at Samsung Research UK
  • IEEE Senior Member · 50+ publications · Multiple Best Paper Awards · Specialist in system design and simulation
  • andrea@verticular.uk
"

Our mission is to solve the dual problem of rising cloud costs and the massive energy footprint of AI — building deep, chip-level software that lowers OpEx, maximises hardware ROI, and meets sustainability goals without compromising SLAs.

08
Contact Us

See Your Savings.
Live. On Your Hardware.

Request a demo and we'll show you wall-socket power measurements on a real GPU server — before and after vert-suite — so you can see the savings for yourself, not just take our word for it.