
The fast and the frugal
Outrun the big models – without draining the grid

Capgemini
Oct 21, 2025

What if AI could be fast – without being furious to the planet? In the current AI arms race, bigger often means better… and a lot more power-hungry. But when every token costs energy, emissions, and compute, maybe it’s time to question the size-over-smarts approach. Enter the frugal model: smaller, contextual, and good enough to hold its own against the LLM heavyweights without draining the grid. Let’s challenge the assumption that scale equals value and show how leaner, fine-tuned models can achieve surprising results. Call it a new kind of AI performance: less drag race, more precision drift.

The hidden environmental cost of AI: One token at a time

As the AI gold rush accelerates, there’s a quiet environmental tax we’re all paying – one token at a time. A chat query may not feel like much, but beneath the surface lurk LLMs with a rather intimidating computational appetite. The hyperscalers’ top-of-the-line LLMs powering today’s smartest chatbots and emerging AI agents churn through electricity and water in quantities perhaps more befitting an aluminum smelter than a software service.

“Our GPUs are melting.”

Sam Altman, CEO of OpenAI

Climate change is thus driven not only by the usual suspects – energy, transport, and agriculture – but increasingly by the digital technologies we embrace. Artificial intelligence, seen by some as a tool to combat climate challenges, has an environmental footprint of its own. Still, the potential of generative AI is undeniable. We shouldn’t slow down, but we must get smarter about how we use it.

Make every token count

That means making every token count: reducing waste, and designing systems that are not just intelligent, but efficient. Because LLMs don’t just consume data, they consume power. Recent studies show that generative AI, including both model training and user inference, consumed an estimated six to nine terawatt-hours (TWh) of electricity in 2023, comparable to the energy use of a small country. By 2027, AI servers could draw as much as 134 TWh/year, roughly the energy needs of Sweden.

Agents on the rise

The real problem, however, may come not only from the models themselves, but from the agents that will be built on top of them. Picture future companies run by a handful of humans and thousands of LLM-powered agents: optimizing code, outwitting other agents in stock trading, crafting legal frameworks, and winning the war for attention with AI-generated content. In this arms race, performance is currency, and currency is performance. The better your agents, the sharper your competitive edge – and today, the top-performing agents are driven by large, cloud-hosted LLMs. They’re impressive, certainly, but they’re also expensive, financially as well as environmentally.

Watching Vin Diesel drift around

And the inefficiency adds up. As agents chain together models, expand context windows, and embed documents for every micro-task, we’re witnessing token inflation on a grand scale. Like watching Vin Diesel drift around a carpark to reach a spot 10 metres away, we’re using top-tier GPT, Gemini, and Claude models to rephrase a sentence. We don’t need to throw a library at a question that can be answered with a reference to a single article.

When querying LLMs, verbosity is expensive. Every token adds cost, complexity, and carbon, and each processed token draws computational power. But not every token needs to be spelled out; many can be implied, if the model has context. To give your agent context and enable downsizing, there are three main strategies:

Train a model from scratch. You train a foundation model from the ground up, using your custom dataset. It will be tailored entirely to your domain but can be extremely expensive and requires massive amounts of time and data.

Finetune an existing model. Start with an existing LLM and retrain its upper layers on your domain-specific data. This changes the “way” rather than the “what” in LLM responses. A great use case is finetuning a model to generate Cypher queries from text prompts (as we will see shortly).

Retrieval augmentation. Keep the model frozen but supply external knowledge at runtime. That’s where traditional RAG and its multidimensional cousin GraphRAG make their entrance, giving the model a shared reference point without overloading it with detail. Rather than cramming background into every prompt, the model can refer to entities and relationships already mapped in the graph (a minimal sketch of this follows below).
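To make the retrieval strategy concrete, here is a minimal sketch of graph-grounded prompting, assuming a local Neo4j instance and a small model served through Ollama; the connection details, labels, prompt wording, and model name are illustrative assumptions, not the setup used in our benchmark.

```python
# Minimal sketch: answer a question with a small model, grounded in graph facts.
# Connection details, labels, and the prompt template are illustrative assumptions.
from neo4j import GraphDatabase
import ollama  # assumes the model is served locally through Ollama

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def graph_context(topic: str, limit: int = 10) -> str:
    """Pull a handful of entities and relationships related to the topic."""
    query = (
        "MATCH (a)-[r]->(b) "
        "WHERE toLower(a.name) CONTAINS toLower($topic) "
        "RETURN a.name AS source, type(r) AS rel, b.name AS target "
        "LIMIT $limit"
    )
    with driver.session() as session:
        rows = session.run(query, topic=topic, limit=limit)
        return "\n".join(f"{r['source']} -{r['rel']}-> {r['target']}" for r in rows)

def answer(question: str, topic: str) -> str:
    """A few graph facts replace pages of pasted background in the prompt."""
    prompt = (
        "Answer the question using only these graph facts:\n"
        f"{graph_context(topic)}\n\nQuestion: {question}"
    )
    response = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

print(answer("Which startups use graph technology?", "graph"))
```

The point is the shape of the flow: a handful of retrieved facts give a small model the shared reference point it needs, instead of a bloated prompt.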

A leap of faith: Will a small, context-aware model cut it?

There’s nothing new about this reasoning, but it takes a little leap of faith to switch from the smoke-and-flash of nitrous-fueled drag racers to the quiet grace of a machine engineered exactly for the task at hand.

This is more than just green tech evangelism: it’s sound business logic. Using fewer resources is simply good business, delivering reduced latency, lower cost, and independence from data centers and clouds.

The central tension remains: will a small, context-aware model really be enough when it is competing against the full firepower of GPT-whatever running on a nuclear-powered server farm? The battle between efficient precision and brute-force brilliance plays out below.

Ladies and gentlemen, start your engines

To find out, we designed a small task: generating Cypher queries from natural text. We pitted a finetuned local model of roughly 4 GB against the much larger gpt-4o-mini. To make things more interesting, we also invited Meta’s Llama-3.3-70b-versatile.

The challenge. All three models were fed the same text – a question about information in the underlying graph database, or a request to update it. Each had to generate a valid Cypher query, which was then executed, and the responses were compared.

The underlying database consists of a limited set of startups, technologies, and founders – linked together based on data from an ecosystem register.
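For illustration, a toy version of such a graph can be created with a few Cypher statements; the labels, relationship types, and names below are assumptions about a minimal schema, not the actual register data.

```python
# Illustrative only: a toy startup/technology/founder graph.
# Labels, relationship types, and names are assumed, not the real register data.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

SETUP = """
MERGE (kg:Technology {name: 'Knowledge Graphs'})
MERGE (ml:Technology {name: 'Machine Learning'})
MERGE (s1:Startup {name: 'GraphWorks'})
MERGE (s2:Startup {name: 'DataDrift'})
MERGE (f:Founder {name: 'Alice Example'})
MERGE (s1)-[:USES]->(kg)
MERGE (s2)-[:USES]->(kg)
MERGE (s2)-[:USES]->(ml)
MERGE (f)-[:FOUNDED]->(s1)
"""

with driver.session() as session:
    session.run(SETUP)
```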

An example query used for benchmarking: “Which technologies are used across multiple startups?”
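A single benchmark round can then be sketched roughly as follows; the helper names, the hand-written reference query, and the simple row-set comparison are our own illustration of the harness, and generate stands in for any of the contenders introduced in the next section.

```python
# Sketch of one benchmark round: a model turns the question into Cypher,
# the query is executed, and the rows are compared with a reference answer.
# Helper names and the reference query are illustrative, not the exact harness.
from typing import Callable
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUESTION = "Which technologies are used across multiple startups?"

# Hand-written reference Cypher that the generated query is judged against.
REFERENCE = """
MATCH (s:Startup)-[:USES]->(t:Technology)
WITH t, count(DISTINCT s) AS startups
WHERE startups > 1
RETURN t.name AS technology
"""

def run_cypher(query: str) -> set:
    """Execute a query and return its rows as a comparable set."""
    with driver.session() as session:
        return {tuple(record.values()) for record in session.run(query)}

def score(generate: Callable[[str], str]) -> bool:
    """generate() is any text-to-Cypher function, e.g. one of the contenders below."""
    candidate = generate(QUESTION)
    return run_cypher(candidate) == run_cypher(REFERENCE)
```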

Meet the contenders

Note that tomasonjo/llama3-text2cypher-demo is finetuned to handle text-to-Cypher. It is open source, based on the Llama 3 model, and can run fully locally on a laptop.

The cloud models are not specifically trained on Cypher, but since Cypher is part of their training material, they have a basic understanding of it.
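Both kinds of contenders can be invoked along these lines; serving the local model through Ollama, the prompt wording, and the schema string are assumptions made for illustration, while the cloud call uses the standard OpenAI chat completions API.

```python
# Illustrative text-to-Cypher calls for both contenders.
# Prompt wording, schema string, and client configuration are assumptions.
import ollama
from openai import OpenAI

SCHEMA = (
    "Nodes: Startup(name), Technology(name), Founder(name). "
    "Relationships: (Startup)-[:USES]->(Technology), (Founder)-[:FOUNDED]->(Startup)."
)

def local_cypher(question: str) -> str:
    """Finetuned local model, assumed here to be served through Ollama."""
    response = ollama.chat(
        model="tomasonjo/llama3-text2cypher-demo",
        messages=[{"role": "user", "content": f"Schema: {SCHEMA}\nQuestion: {question}"}],
    )
    return response["message"]["content"]

def cloud_cypher(question: str) -> str:
    """General-purpose cloud model, prompted to return Cypher only."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Return only a Cypher query, no explanation."},
            {"role": "user", "content": f"Schema: {SCHEMA}\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```

Either function can be handed to the score() helper above; the comparison is the same regardless of where the Cypher comes from.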

Outcome: We ran a number of different queries. The results so far show that the local model and gpt-4o-mini perform on par, while llama-3.3-70b performs slightly worse.

Note: The local model didn’t just keep up – it outpaced gpt-4o-mini, slashing the generation time by more than 50 percent.

Our benchmark shows that when armed with relevant context, local LLMs can match the performance of massive cloud-based GPT models – without the weight, the latency, or the energy bill. It’s a direct challenge to the idea that bigger always means better. With the right architecture, local isn’t just a fallback – it’s a strategic edge. The race for smarter AI is not about who burns the most fuel. It’s about who handles the corners best.

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius – and a lot of courage – to move in the opposite direction.”

E.F. Schumacher

Local and gpt-4o-mini are on par.

“LLMs don’t just consume data, they consume power.”

Start innovating now

Take a leap of faith and go with smaller models; with the use of knowledge graphs, results become more context-aware, efficient, and environmentally sustainable.

Build processes for quality assurance and automatic benchmark probing with large models.

Continuously measure and communicate reductions in energy usage and carbon emissions resulting from data operations, fostering sustainable AI practices from the start.

Meet the authors

Joakim Nilsson

Knowledge Graph Lead, Insights & Data, Client Partner Lead – Neo4j Europe, Capgemini 
Joakim is part of both the Swedish and European CTO offices, where he drives the expansion of Knowledge Graphs forward. He is also client partner lead for Neo4j in Europe and has experience running Knowledge Graph projects as a consultant for both Capgemini and Neo4j, in the private and public sectors, in Sweden and abroad.
Johan Müllern-Aspegren

Emerging Tech Lead, Applied Innovation Exchange Nordics, and Core Member of AI Futures Lab, Capgemini
Johan Müllern-Aspegren is Emerging Tech Lead at the Applied Innovation Exchange (AIE) Nordics, where he explores, drives and applies innovation, helping organizations navigate emerging technologies and transform them into strategic opportunities. He is also part of Capgemini’s AI Futures Lab, a global centre for AI research and innovation, where he collaborates with industry and academic partners to push the boundaries of AI development and understanding.