Production Ready · Deployable in Minutes

AI Inference,
30–36% Greener.

vert-suite autonomously co-optimises CPU and GPU power on bare-metal AI servers — achieving energy savings with no changes to your software, inference stack, or hardware.

30–36%
Energy per token saved
Over 40%
GPU server power reduction
Wall-Socket
Auditable measurement
Zero
App changes required
GPU+CPU Efficiency · Empirical Benchmarks

Proven Savings
at Scale

Every number below is verified at the server plug using a state-of-the-art power analyser — not estimated, not simulated.

~41%
Peak Server Power Reduction (case study, measured at wall-socket)
30–36%
Typical Energy per Token Reduction · NVIDIA Blackwell
vLLM
llama.cpp
Validated inference engines
Measured at Wall-Socket
All Watt-second readings taken at the server wall-socket using a state-of-the-art power analyser — not estimated from TDP or software counters.
Out-of-Band Validation
Independent telemetry path verifies results separate from the vert-suite control plane — no self-reported figures.
Like-for-Like Baselines
Every result compared against an unoptimised, state-of-the-art baseline on identical hardware under identical load.
Live Toggle Verification
Savings confirmed by enabling/disabling vert-suite mid-run and observing real-time power and throughput transitions.
Open-source data collection · verticular/ute9811-mqtt-bridge
The software used to sample power readings from the analyser is publicly available — full transparency on how data was collected.
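For illustration, a consumer of the bridge's MQTT-published readings could be as small as the sketch below. The JSON payload shape here is an assumption for the example — the actual message schema is defined by the ute9811-mqtt-bridge repository.

```python
import json

def parse_power_sample(payload: bytes):
    """Decode one analyser reading published over MQTT.

    The {"ts": ..., "watts": ...} payload shape is illustrative only;
    consult the bridge repository for the real message schema.
    """
    msg = json.loads(payload)
    return float(msg["ts"]), float(msg["watts"])
```

Each decoded `(timestamp, watts)` pair can then be logged or integrated into watt-seconds downstream.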
01
The Challenge

AI Inference Is Energy-Hungry

On a cutting-edge workstation GPU, each generated token can cost upwards of 37 watt-seconds at the wall. With millions of tokens processed daily, that adds up fast. And without active optimisation, both CPU and GPU are left running at full speed regardless of actual workload demand.

Qwen3 32B · RTX Pro 6000 Blackwell · vLLM
Without vert-suite: 37.38 Ws/token · 24.0 tps
With vert-suite: 25.74 Ws/token (−31.2% energy per token) · 21.64 tps (−9.8% throughput)

Gemma 4 31B · RTX Pro 6000 Blackwell · vLLM
Without vert-suite: 37.03 Ws/token · 24.5 tps
With vert-suite: 25.33 Ws/token (−31.6% energy per token) · 22.36 tps (−8.7% throughput)

* All Watt-second readings measured at the server wall-socket. Test system: NVIDIA RTX Pro 6000 Blackwell (96 GB VRAM), 24-core workstation, 128 GB RAM, 2,050 W PSU.
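The Ws/token figures can be reproduced from raw plug-side samples: energy is the time integral of measured power, divided by the tokens generated over the same window. A minimal sketch (function names are illustrative, not part of vert-suite):

```python
def energy_ws(samples):
    """Integrate (time_s, watts) samples into watt-seconds (trapezoidal rule)."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def ws_per_token(samples, tokens_generated):
    """Energy per token over the measurement window."""
    return energy_ws(samples) / tokens_generated
```

For example, a steady 897 W draw over 10 s while generating 240 tokens (24 tps) works out to 8,970 Ws / 240 ≈ 37.4 Ws/token, consistent with the baseline card above.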

02
The Solution

vert-suite

Production Ready · Deployable in Minutes · No Code Changes · Engine Agnostic

A production-ready software platform for autonomous GPU/CPU efficiency orchestration — inference engine agnostic, with no hardware changes, no application modifications, and no manual tuning required.

Autonomous CPU/GPU Optimisation

Autonomous bare-metal GPU and CPU management for AI/LLM workloads. Empirically validated energy savings of over 30% per token — with no changes to your application or inference stack.

Continuous SLA Spectrum

Specify your maximum acceptable throughput reduction. vert-suite automatically scans the GPU/CPU operating envelope and locks in the profile delivering the deepest energy savings within your constraint.

Green & Cost-Efficient

Reduce OpEx, extend platform lifespan, and lower your infrastructure's carbon footprint — without replacing or upgrading hardware.

Deep Monitoring with eBPF

Leverage eBPF for deep system observability with minimal operational disruption — complemented by physical out-of-band telemetry for independent power validation.

Zero-Trust, Kubernetes-Native

Least-privilege agents, just-in-time capabilities, and mTLS-secured control-plane traffic. Transparent telemetry with out-of-band validation for full auditability.

Seamless Integration

Inference engine agnostic — validated on vLLM and llama.cpp, and compatible with other leading engines. One inclusive library, portable across Linux and Kubernetes distributions, deployable in minutes with no operators, scheduler extensions, or YAML modifications.

Most bare-metal optimisers demand root access. We don't.

The industry standard is to hand over the keys to the kingdom — permanent, unconditional root access to your kernel, drivers, and hardware. You shouldn't have to compromise your cluster's security just to run your compute efficiently.

Zero-Compromise Security Architecture

vert-suite is built on a strict principle of minimal authority. Instead of deploying a privileged monolith, the architecture is physically split — a standard unprivileged worker that requests just-in-time access only when needed, and only for the exact duration required.

vert-suite zero-trust security architecture
① Secure Entry

Public-key signature verification at the point of entry — no agent is admitted before its signature is verified.

② Restricted Access

Main agent operates as a standard worker within an unprivileged boundary — CAP_SYS_RAWIO and CAP_SYS_ADMIN never granted permanently.

③ Just-in-Time

A Security Escort briefly unlocks the required resource, completes the exact task, and immediately relocks — no standing privileges.

④ Secure Comms

All control-plane traffic is encrypted via mTLS tunnels — no plaintext communication between components.

⑤ Minimal Authority

Digital identity verification prevents any component from operating outside its authorised scope.
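As a toy illustration of the just-in-time discipline in steps ② and ③ — not vert-suite's implementation, and with no real Linux capabilities involved — the grant-then-relock pattern looks like this:

```python
from contextlib import contextmanager

class Escort:
    """Toy model of a Security Escort: the resource is unlocked only for
    the duration of a single task, then relocked unconditionally."""

    def __init__(self):
        self.unlocked = False  # no standing privilege by default

    @contextmanager
    def access(self):
        self.unlocked = True       # just-in-time grant
        try:
            yield self
        finally:
            self.unlocked = False  # relock even if the task raises
```

Even if the task inside the `with` block fails, the `finally` clause guarantees the relock — the property the architecture relies on.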

How it works
1

Install vert-suite

No YAML changes, no Kubernetes operators, no code modifications to your inference stack.

2

Set Your SLA

Tell vert-suite your maximum acceptable throughput reduction. It autonomously scans the full GPU/CPU operating envelope.

3

Savings Start Immediately

Real-time CPU/GPU co-optimisation locks in the deepest energy savings within your constraint.

03
Performance SLA Spectrum

You Set the Constraint.
We Find the Optimum.

vert-suite does not offer a fixed menu of profiles. It offers a continuous spectrum. Simply tell the software the maximum throughput reduction you can accept — say, no more than 10%, 15%, or 20% — and it automatically identifies the optimal GPU/CPU operational mode to deliver the deepest possible energy savings within that constraint.

Less energy saved ⟷ More energy saved
≤10% TPS impact · ≤15% TPS impact · ≤20% TPS impact

Three example datapoints — any point on the spectrum is achievable

≤10% TPS reduction

Minimal Impact

The software identifies the energy-saving profile that keeps throughput within 10% of baseline. Deep savings with near-transparent operational impact.

~28–30% energy per token saved
≤15% TPS reduction ★

Sweet Spot

A marginal additional throughput budget unlocks significantly deeper energy reductions. Consistently the highest-value point on the spectrum across all tested models.

~33–34% energy per token saved
≤20% TPS reduction

Maximum Efficiency

Accepts a larger throughput trade-off to push energy savings to their maximum. Ideal for cost-capped batch workloads where latency is not time-critical.

~34–37% energy per token saved
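Conceptually, choosing a point on the spectrum is a constrained search: discard candidate operating points whose throughput falls outside the SLA, then take the most energy-efficient survivor. A toy sketch of that selection rule using the Qwen3-32B datapoints from the benchmark section below (an illustration only, not vert-suite's live envelope scan):

```python
def pick_profile(candidates, baseline_tps, max_tps_drop):
    """candidates: (name, tps, ws_per_token) tuples.
    Return the lowest-energy candidate whose throughput stays within
    max_tps_drop of the baseline, or None if nothing qualifies."""
    floor = (1.0 - max_tps_drop) * baseline_tps
    feasible = [c for c in candidates if c[1] >= floor]
    return min(feasible, key=lambda c: c[2], default=None)

# Qwen3-32B benchmark datapoints (baseline 42.04 tps)
qwen3_32b = [
    ("conservative", 39.86, 13.32),
    ("balanced", 38.38, 12.49),
    ("aggressive", 34.18, 12.27),
]
```

With a ≤10% SLA the throughput floor is ~37.8 tps, so "balanced" wins at 12.49 Ws/token; relaxing the SLA to ≤20% admits "aggressive" at 12.27 Ws/token.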
04
Benchmark Results

Comprehensive Model Benchmarks

Five leading open-weight models. Three SLA profiles. All energy independently verified at the server wall-socket using a state-of-the-art power analyser — against unoptimised SotA baselines.

Technology Stack
RTX Pro 6000 Blackwell · llama.cpp · vert-suite
Energy per token (Ws/token) reduction (energy saved — higher is better)
SLA profiles: ≤10% · ≤15% · ≤20% TPS impact

Llama-3-70B-Instruct · 70B params
Baseline: 39.15 Ws/token · 20.56 tps · Q8_0 · 75 GB
≤10% TPS (Conservative): −28.83% → 27.87 Ws/token · 19.97 tps (−2.87% throughput)
≤15% TPS (Balanced ★): −33.84% → 25.90 Ws/token · 19.50 tps (−5.17% throughput)
≤20% TPS (Aggressive): −35.99% → 25.06 Ws/token · 18.02 tps (−12.34% throughput)

Qwen2.5-72B-Instruct · 72B params
Baseline: 40.84 Ws/token · 19.96 tps · Q8_0 · 77.5 GB
≤10% TPS (Conservative): −29.60% → 28.75 Ws/token · 19.39 tps (−2.86% throughput)
≤15% TPS (Balanced ★): −34.30% → 26.83 Ws/token · 18.94 tps (−5.11% throughput)
≤20% TPS (Aggressive): −36.52% → 25.93 Ws/token · 17.48 tps (−12.42% throughput)

Qwen3-32B · 32B params
Baseline: 18.52 Ws/token · 42.04 tps · Q8_0 · 34.8 GB
≤10% TPS (Conservative): −28.07% → 13.32 Ws/token · 39.86 tps (−5.17% throughput)
≤15% TPS (Balanced ★): −32.58% → 12.49 Ws/token · 38.38 tps (−8.72% throughput)
≤20% TPS (Aggressive): −33.75% → 12.27 Ws/token · 34.18 tps (−18.70% throughput)

Qwen3.5-27B · 27B params
Baseline: 26.99 Ws/token · 27.76 tps · BF16 · 50.7 GB
≤10% TPS (Conservative): −26.59% → 19.82 Ws/token · 26.13 tps (−5.88% throughput)
≤15% TPS (Balanced ★): −31.16% → 18.58 Ws/token · 25.32 tps (−8.80% throughput)
≤20% TPS (Aggressive): −31.49% → 18.50 Ws/token · 22.82 tps (−17.81% throughput)

Qwen3.5-35B-A3B · 35B MoE params
Baseline: 5.23 Ws/token · 114.58 tps · BF16 · 69.4 GB
≤10% TPS (Conservative): −19.25% → 4.22 Ws/token · 103.11 tps (−10.01% throughput)
≤15% TPS (Balanced ★): −20.16% → 4.18 Ws/token · 99.28 tps (−13.36% throughput)
≤20% TPS (Aggressive): −20.65% → 4.15 Ws/token · 94.82 tps (−17.25% throughput)

* All Watt-second readings measured at the server wall-socket on a single NVIDIA RTX Pro 6000 Blackwell GPU server.

05
Case Study

Optimising Autonomous AI Agents

Continuous, zero-intervention AI agent loops create a sustained, demanding inference load. We validated vert-suite in exactly this scenario — using Claude Code acting as an autonomous coding agent running non-stop research iterations on a dedicated GPU server.

Technology Stack
Claude Code · RTX Pro 6000 Blackwell · vLLM · Gemma 4 (31B) · vert-suite
Average Server Power (measured at server wall-socket)
Without vert-suite: 781 W
With vert-suite: 459 W
~32% energy per token saved
~41% power reduction on our test server
~12% throughput reduction
What's happening
The Local Agent
Claude Code running locally, pointing to a GPU server — RTX Pro 6000 Blackwell running vLLM and Gemma 4 (31B).
The Problem
Running an autonomous coding agent is power-hungry. The server was drawing 800 W+ to keep Gemma running at 19–22 tokens/sec.
The Fix
vert-suite applied. Power dropped to ~450 W — a ~41% reduction. Throughput only fell by 10–14%.
The Agent Loop
An extension of Karpathy's Autoresearch pattern: modify → evaluate → measure → keep or revert → repeat. Zero human intervention.
Key moments
Optimisation ON — power usage drops sharply
Optimisation OFF — power climbs back to baseline
Optimisation ON again — confirms repeatability
What's on screen
Top Left: vLLM Logs · Right Half: Claude Code · Bottom Left: Agent Loop · Bottom Right: Power Trace

What 30% energy savings means in practice

Average UK commercial electricity rate: 25.5p/kWh · 24/7 continuous operation · Savings scale linearly with fleet size

Workstation: 2.5 kW · £5,585/yr · saves £1,675 per year
Mid-Range Server: 6.25 kW · £13,961/yr · saves £4,188 per year
Enterprise Server: 10 kW · £22,338/yr · saves £6,701 per year
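The three fleet figures follow from one formula: average draw × hours per year × tariff, with 30% of that cost saved. A quick reproduction:

```python
TARIFF_GBP_PER_KWH = 0.255   # average UK commercial rate used above
HOURS_PER_YEAR = 24 * 365    # continuous 24/7 operation

def annual_cost_gbp(avg_draw_kw):
    """Yearly electricity cost for a server at a constant average draw."""
    return avg_draw_kw * HOURS_PER_YEAR * TARIFF_GBP_PER_KWH

def annual_saving_gbp(avg_draw_kw, saving_fraction=0.30):
    """Yearly saving at a given fractional energy reduction."""
    return annual_cost_gbp(avg_draw_kw) * saving_fraction
```

`annual_cost_gbp(2.5)` gives £5,584.50 (the workstation row, rounded to £5,585) and `annual_saving_gbp(2.5)` gives £1,675.35 — and, as noted, both scale linearly with fleet size.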
06
News & Recognition

Recognised by Industry Leaders


Innovate UK 'New Innovators' Grant

Verticular has secured the 'Growth Catalyst Early Stage' grant to advance EPIC (Energy-efficient Processing for Intensive Computing) — using AI to dynamically manage both CPU and GPU power and cut energy costs for AI data centres without performance trade-offs.


Member of NVIDIA Inception

We are proud members of the NVIDIA Inception program, giving us early access to the latest GPU ecosystem and enabling us to optimise AI workloads at the deepest hardware level.

Partnering with

NVIDIA Inception · Nokia · MK Stadium · Real Wireless · Madevo · UKRI · Weaver Labs · Hiro · University of Bristol
07
The Team

Built by Industry Experts

Verticular was founded by two tech leaders who spent careers engineering at the frontier of wireless networks and AI systems — and who saw first-hand the cost and energy problem coming.

Dr Dan Warren
Chief Executive Officer
  • Former Director of Advanced Network Research at Samsung Research UK
  • Creator of the VoLTE standard at GSMA
  • dan@verticular.uk
Dr Andrea Tassi
Chief Technology Officer
  • Former Chief Engineer at Samsung Research UK
  • IEEE Senior Member · 50+ publications · Multiple Best Paper Awards · Specialist in system design and simulation
  • andrea@verticular.uk
"

Our mission is to solve the dual problem of rising cloud costs and the massive energy footprint of AI — building deep, chip-level software that lowers OpEx, maximises hardware ROI, and meets sustainability goals without compromising SLAs.

08
Contact Us

See Your Savings.
Live. On Your Hardware.

Request a demo and we'll show you wall-socket power measurements on a real GPU server — before and after vert-suite — so you can see the savings for yourself, not just take our word for it.