Moritz Landwehr
How we match every German tender to a company in 2 seconds
BackApplied AI15 min read

How we match every German tender to a company in 2 seconds

A problem that naive solutions cannot scale, and how a CPV pre-filter, a diffusion language model, and prompt caching turn it into a matter of split seconds and cents.

Moritz LandwehrMoritz Landwehr

The problem

If you run a small or mid-sized company that could win public contracts, you already know this feeling. Somewhere in the thousands of tenders published this week there are three or four that are a perfect fit for you. You will probably never find them.

They are scattered across dozens of portals, written in bureaucratic language that rarely matches the words you would use for your own work, and buried under thousands of contracts that have nothing to do with you. To find the few that fit, you would have to read past hundreds that do not. So most companies do the rational thing and give up before they start. The work is there, the budget is there, and it goes to whoever happened to be looking.

Now put numbers on it. Our database currently holds 25,745 tenders (25,393 of them from the EU TED database), of which 13,767 are active contract notices. The question is always the same: which of these fit this company? That is hard for one company on one day. It becomes a different kind of hard when you want to answer it for every company, every day, automatically.

What I actually want first

I am impatient by nature, and I suspect most people in this position are too. Before I invest in reading a tender, checking its deadline, pulling the documents, and deciding whether to bid, I want exactly one thing: a fast, honest answer to the question "what is even relevant for me to look at?"

That ordering matters more than it sounds. The thing that bothers me most about the existing tools is that they make you do a late, expensive step right at the beginning. Qualifying a tender in depth is a step-four problem; it should never be the first thing you touch. The first step has to be relevance, and it has to be cheap.

There is a capacity angle too. Most companies never seriously engage with public tenders because they do not have the people for it. If I buy a tool that then needs three more people to operate it, I have not been helped, I have just bought myself another job.

So the requirement writes itself. To even decide whether public tenders are worth my time, the very first step in the pipeline has to be extremely fast, already smart, and return curated results. Not a search box. A short, trustworthy shortlist I can judge at a glance.

The pipeline
From 25,745 tenders to a match
Tenders25,745full corpusCPV candidates~8083.2%, 100 in parallelMatchesMatchesjudged, not guessed-97% freejudge, ~2s
Each stage strips cheap work off the next: first a deterministic CPV pre-filter, then a language model as the judge.

Why the obvious approaches do not cut it

Before building anything clever, you try the obvious things. Here is why each one falls short.

Keyword search. You search for the words you care about. That finds the obvious matches, misses everything named differently, and floods you with hits where your word appears but the contract is something else entirely. A search for "lighting" returns stage lighting, traffic signals, and a study on light pollution.

CPV codes. Every tender is tagged with one or more CPV codes (Common Procurement Vocabulary), the EU's standard catalogue of "what is being bought." This is genuinely useful: it is structured, language independent, and every notice carries it. But as your only tool it is far too blunt. A code tells you the rough category, not whether you can deliver the job. Thousands of unrelated contracts share a single code, the people writing tenders pick codes inconsistently, and a company that does three things carries codes that overlap with half the catalogue. There is simply too much fuzziness in it to decide a match on the code alone.

Vector search. The clever-sounding fix. You turn each company description and each tender into an embedding: a long list of numbers, a point in a high-dimensional space, arranged so that texts with similar meaning land close together. Then you match a company to the tenders nearest to it. It feels right. In our tests it works badly, for one structural reason.

Semantic closeness is not the same as fitness to deliver. Take a company that manufactures LED street luminaires. In embedding space it lands right on top of a tender for the three-year maintenance of municipal street lighting: both are saturated with the same words, street and lighting and LED and luminaire and municipal. The model sees near-identical text and calls it a match. But one wants a product on a pallet and the other wants a crew with ladders on call for three years. Same neighbourhood, opposite business.

Illustrative
Close in vector space, opposite in practice
Embedding space (2-D projection)OUR COMPANYmakes luminairesSUPPLY LED LUMINAIRESreal fitMAINTAIN LIGHTING 3 YRSwrong jobINSTALL STREET LIGHTINGwrong jobOFFICE CATERINGfar, and correct
Illustrative 2-D sketch of a high-dimensional space. The three tenders crowding our company are near-identical to a vector search. Only one is a contract we can actually win.

But a re-ranker fixes that, right?

The usual rescue is a hybrid: retrieve with vector search, then run a re-ranker, a second, heavier model that re-scores the shortlist and drops the bad ones. It helps, but only with half the problem.

A re-ranker can clean up false positives, things that were retrieved but do not actually fit. You showed it the candidate, so it can throw it out. What it can never do is recover a false negative, a tender that fits but was never retrieved in the first place. If fuzzy embeddings did not pull it onto the shortlist, the re-ranker never sees it, and it is gone.

That asymmetry is the whole game. False positives are annoying but cheap: a human glances and moves on. False negatives are silent and expensive: a contract you could have won, that you never knew existed. Vector search, fuzzy by nature, produces exactly the kind of error a re-ranker cannot fix.

Illustrative
Two ways to be wrong
01020304050607001020304050False positives (wasted reviews, %)False negatives (missed matches, %)FEW MISSES, LOW NOISENOISY AND BLINDKEYWORD SEARCHRAG + RE-RANKERCPV + LLM JUDGEours
Illustrative. A re-ranker drags you left (fewer false positives) but never down: the matches vector search failed to retrieve stay missed. The CPV pre-filter plus an LLM that reads every survivor lands in the bottom-left.

What tipped the scales for us

So here is where we start talking about our own pipeline. The whole thing is built around that one asymmetry. The CPV pre-filter is deterministic, so it does not "miss" the way embeddings do: if a tender carries a matching code, it survives, full stop. Then a language model reads every survivor and judges it one by one. Nothing relevant is dropped on a quiet similarity score. As long as the CPV code is right, there are no false negatives left to recover, because nothing was thrown away unseen.

The design almost wrote itself from there: a deterministic filter that cannot miss, a model fast enough to read every survivor, and a caching trick that makes reading them nearly free. The next three steps are exactly that.

Step 1: The CPV pre-filter, the 80/20

This is the fast, smart first step that everything else depends on. CPV codes are 8 digits long and hierarchical: each extra digit is one more level of detail, and each level cuts the field. Follow one branch down, for a company that lays cable:

45 00 00 00   Construction work            division   ~12,768 candidates
45 31 00 00   Electrical installation      class        ~3,461 candidates
45 31 10 00   Wiring and fitting work       sub-cat        ~808 candidates
45 31 43 00   Cabling infrastructure        leaf           ~788 candidates

The deeper you go, the fewer tenders survive, but the gains shrink fast. Across our real company profiles (about 18 CPV codes per firm on average), the full funnel looks like this:

CPV digitsAvg candidates/firmShare of corpus
2 (division)12,76850.3%
3 (group)6,86027.0%
4 (class)3,46113.6%
5 (category)1,0484.1%
6 (sub-category)8083.2%
8 (leaf)7883.1%

The first three to four digits do the heavy lifting. The biggest single cut is at 4 to 5 digits, from 3,461 down to 1,048 candidates, a drop of 70%. From the sixth digit on, the funnel has basically run dry: going from 6 to 8 digits saves only 20 more candidates, while buying a real risk, namely excluding genuine matches that share the category but carry a different leaf.

Diminishing returns
CPV selectivity per digit
SWEET SPOT2345678030006000900012000CPV digitsCandidates per firm
The candidate pool drops steeply through 5 to 6 digits, then barely moves. Filtering finer just trims a few more while risking real matches.

That gives a sweet spot at 5 to 6 digits. Filter coarser and too much junk gets through, which multiplies the cost of the next stage; filter finer and you save almost nothing while throwing away good hits. In production we run 6 digits: 25,745 tenders become about 808 candidates on average, 97% gone, deterministically and for roughly $0.40 per company.

Step 2: The diffusion LLM as the judge

The remaining ~800 candidates need real judgement, exactly the "does this firm deliver this, or does the contract want something else?" question that vectors fail at. That is a job for a language model.

Classic autoregressive language models produce their answer token by token, sequentially. For our case that is doubly awkward: latency sensitive (you wait on every single token) and slow in aggregate when you need hundreds of verdicts.

A diffusion language model works differently. It does not generate word by word, it refines the whole answer in a few parallel passes at once, like an image being sharpened out of noise. The result is dramatically higher throughput and low latency. Concretely, with the model we use (Inception Mercury): 0.24 s latency per request, 427 tokens/s throughput.

How it generates
Token by token vs all at once
Token by tokendone
Thistenderfitsthecompany
Diffusion, all at oncedone
Thistenderfitsthecompany
Top: a normal model writes one token at a time, left to right. Bottom: a diffusion model starts with every token at once as noise and crystallises them into place in a few fast passes, so the answer lands almost immediately.

So we do not ask 800 times in sequence, we ask 100 tenders per wave in parallel. For ~800 candidates that is roughly eight waves, about 2 seconds for a complete company against every relevant tender in Germany.

Why diffusion, and not just a fast model fired in parallel?

Fair objection: take any fast autoregressive model, fire all 800 judgements at once, and you would hit roughly the same wall-clock. So why a diffusion model specifically? Two reasons.

The first is early stopping. We run a marketing pipeline that builds a free preview for a company: its 20 best-matching tenders, shown in seconds. For that we do not need all 800 verdicts, we need the strongest handful. So we judge in fast batches and stop the moment a batch has produced enough strong matches. Firing all 800 at once would compute every verdict even when the first wave already answered the question. Because each diffusion batch is so cheap and low latency, this batch-then-early-stop loop stays almost instant: the preview is usually ready after a wave or two, not after grinding through everything.

The second reason is simpler: we think it is cool. There is something genuinely satisfying about a model that conjures a whole answer out of noise instead of crawling through it one token at a time, and it is still rare to get to use one in production.

Step 3: Prompt caching, why calls 2 to 800 cost almost nothing

Now look at the cost of those 800 calls. Each call is built from two parts. The first is long and never changes: the instruction plus the company's full profile, its service groups and references, about 1,575 tokens of stable text. The second is short and changes every time: the one tender being judged, about 393 tokens. So roughly 80% of every call is identical to the last one.

Modern LLM providers let you exploit exactly this with prompt caching. The first time the model processes that long prefix it does the full work and stores the result. On every later call that starts with the same prefix it reuses the stored work instead of recomputing it, and charges roughly one tenth of the normal input price for those tokens.

For us this is close to a cheat code. When we match one company against every tender, the 1,575-token company prefix is byte for byte identical across all 800 calls. We pay for it in full exactly once, on call 1. Calls 2 through 800 read it from cache at a tenth, and only the ~393 fresh tender tokens cost full price.

Prompt caching
One company profile, paid for once
Company profile (stable) (1575 tok, 80%)This tender (20%)CALL 1Pays full price for all ~1,968 tokens, prefix included.CALLS 2 TO 800FROM CACHEPrefix served from cache at ~1/10 price; only the ~393 tender tokens are fresh.
The 1,575-token company profile is identical in every call. It is billed in full once, then read from cache 799 times.

Multiply that across a full run and the prefix, 80% of the input, almost vanishes from the bill after the first call.

Prompt caching
Cost of one full run (1 firm x 25,745)
WITHOUT CACHE$21.88WITH CACHE$12.75
Caching the company prefix cuts input cost by 72% and total cost by 42% (output tokens are unchanged, so the total saving is smaller than the input saving).

The result: human vs machine

Put the two worlds side by side, the same task, one company against its ~808 pre-filtered candidates:

HumanMercury (diffusion + cache)
Time13.5 hours (1.7 working days)~2.2 seconds
Cost~187 EUR (minimum wage 13.90 EUR/h)~0.40 USD (~0.37 EUR)
Showstopper
Time for one company (~808 candidates)
HUMAN13.5 hMERCURY2.2 s
About 22,000x faster. The Mercury bar is there; it is just far too short to see next to a human.
Showstopper
Cost for one company (~808 candidates)
HUMAN187 EURMERCURY0.37 EUR
About 500x cheaper. Same story: the Mercury bar barely registers next to a human at minimum wage (13.90 EUR/h).

That is about 22,000x faster and about 500x cheaper, and that is in the fair version where the human also gets the CPV pre-filter. Without it, a human would have to check all 25,745 tenders: 429 hours, about 54 working days, about 5,964 EUR, per company, per run.

And this is exactly where applied AI pays off. What explodes with customer count for a human stays cheap for us. Matching 1,000 companies a day in full against their candidate pool costs about 400 USD and takes minutes. For a human that would be 13,500 working hours a day, simply impossible.

The chain is the point: a CPV pre-filter (97% gone, free, and no false negatives), then a diffusion LLM as the judge (parallel, 2 s), then prompt caching (input down 72%). Each step on its own is well known. Together they turn a "cannot be done in real time" problem into a matter of split seconds and cents.