The problem
If you run a small or mid-sized company that could win public contracts, you already know this feeling. Somewhere in the thousands of tenders published this week there are three or four that are a perfect fit for you. You will probably never find them.
They are scattered across dozens of portals, written in bureaucratic language that rarely matches the words you would use for your own work, and buried under thousands of contracts that have nothing to do with you. To find the few that fit, you would have to read past hundreds that do not. So most companies do the rational thing and give up before they start. The work is there, the budget is there, and it goes to whoever happened to be looking.
Now put numbers on it. Our database currently holds 25,745 tenders (25,393 of them from the EU TED database), of which 13,767 are active contract notices. The question is always the same: which of these fit this company? That is hard for one company on one day. It becomes a different kind of hard when you want to answer it for every company, every day, automatically.
What I actually want first
I am impatient by nature, and I suspect most people in this position are too. Before I invest in reading a tender, checking its deadline, pulling the documents, and deciding whether to bid, I want exactly one thing: a fast, honest answer to the question "what is even relevant for me to look at?"
That ordering matters more than it sounds. The thing that bothers me most about the existing tools is that they make you do a late, expensive step right at the beginning. Qualifying a tender in depth is a step-four problem; it should never be the first thing you touch. The first step has to be relevance, and it has to be cheap.
There is a capacity angle too. Most companies never seriously engage with public tenders because they do not have the people for it. If I buy a tool that then needs three more people to operate it, I have not been helped, I have just bought myself another job.
So the requirement writes itself. To even decide whether public tenders are worth my time, the very first step in the pipeline has to be extremely fast, already smart, and return curated results. Not a search box. A short, trustworthy shortlist I can judge at a glance.
Why the obvious approaches do not cut it
Before building anything clever, you try the obvious things. Here is why each one falls short.
Keyword search. You search for the words you care about. That finds the obvious matches, misses everything named differently, and floods you with hits where your word appears but the contract is something else entirely. A search for "lighting" returns stage lighting, traffic signals, and a study on light pollution.
CPV codes. Every tender is tagged with one or more CPV codes (Common Procurement Vocabulary), the EU's standard catalogue of "what is being bought." This is genuinely useful: it is structured, language independent, and every notice carries it. But as your only tool it is far too blunt. A code tells you the rough category, not whether you can deliver the job. Thousands of unrelated contracts share a single code, the people writing tenders pick codes inconsistently, and a company that does three things carries codes that overlap with half the catalogue. There is simply too much fuzziness in it to decide a match on the code alone.
Vector search. The clever-sounding fix. You turn each company description and each tender into an embedding: a long list of numbers, a point in a high-dimensional space, arranged so that texts with similar meaning land close together. Then you match a company to the tenders nearest to it. It feels right. In our tests it works badly, for one structural reason.
Semantic closeness is not the same as fitness to deliver. Take a company that manufactures LED street luminaires. In embedding space it lands right on top of a tender for the three-year maintenance of municipal street lighting: both are saturated with the same words, street and lighting and LED and luminaire and municipal. The model sees near-identical text and calls it a match. But one wants a product on a pallet and the other wants a crew with ladders on call for three years. Same neighbourhood, opposite business.
But a re-ranker fixes that, right?
The usual rescue is a hybrid: retrieve with vector search, then run a re-ranker, a second, heavier model that re-scores the shortlist and drops the bad ones. It helps, but only with half the problem.
A re-ranker can clean up false positives, things that were retrieved but do not actually fit. You showed it the candidate, so it can throw it out. What it can never do is recover a false negative, a tender that fits but was never retrieved in the first place. If fuzzy embeddings did not pull it onto the shortlist, the re-ranker never sees it, and it is gone.
That asymmetry is the whole game. False positives are annoying but cheap: a human glances and moves on. False negatives are silent and expensive: a contract you could have won, that you never knew existed. Vector search, fuzzy by nature, produces exactly the kind of error a re-ranker cannot fix.
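The asymmetry is easy to see in a toy example. This is a minimal sketch with made-up tender IDs and an idealised re-ranker that never makes a mistake; the names `rerank`, `truly_fits`, and `retrieved` are illustrative and not part of any real pipeline.

```python
def rerank(candidates, is_fit):
    """Keep only the retrieved candidates that the re-ranker judges a fit."""
    return [c for c in candidates if is_fit(c)]

# Hypothetical ground truth: tenders A and B genuinely fit this company.
truly_fits = {"A", "B"}

# Hypothetical retrieval result: A was found, B was missed, C and D are noise.
retrieved = ["A", "C", "D"]

# Even a perfect re-ranker only ever sees what retrieval handed over.
shortlist = rerank(retrieved, lambda tender: tender in truly_fits)

false_positives_left = [t for t in shortlist if t not in truly_fits]
false_negatives = sorted(truly_fits - set(shortlist))

print("shortlist:", shortlist)
print("false positives left:", false_positives_left)
print("false negatives:", false_negatives)
```

The re-ranker removes C and D, but B stays lost no matter how good the second model is, because it was never a candidate.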
What tipped the scales for us
So here is where we start talking about our own pipeline. The whole thing is built around that one asymmetry. The CPV pre-filter is deterministic, so it does not "miss" the way embeddings do: if a tender carries a matching code, it survives, full stop. Then a language model reads every survivor and judges it one by one. Nothing relevant is dropped on a quiet similarity score. As long as the CPV code is right, there are no false negatives left to recover, because nothing was thrown away unseen.
The design almost wrote itself from there: a deterministic filter that cannot miss, a model fast enough to read every survivor, and a caching trick that makes reading them nearly free. The next three steps are exactly that.
Step 1: The CPV pre-filter, the 80/20
This is the fast, smart first step that everything else depends on. CPV codes are 8 digits long and hierarchical: each extra digit is one more level of detail, and each level cuts the field. Follow one branch down, for a company that lays cable:
| CPV code | Description | Level | Candidates |
|---|---|---|---|
| 45 00 00 00 | Construction work | division | ~12,768 |
| 45 31 00 00 | Electrical installation | class | ~3,461 |
| 45 31 10 00 | Wiring and fitting work | sub-category | ~808 |
| 45 31 43 00 | Cabling infrastructure | leaf | ~788 |
The deeper you go, the fewer tenders survive, but the gains shrink fast. Across our real company profiles (about 18 CPV codes per firm on average), the full funnel looks like this:
| CPV digits | Avg candidates/firm | Share of corpus (of the 25,393 TED tenders) |
|---|---|---|
| 2 (division) | 12,768 | 50.3% |
| 3 (group) | 6,860 | 27.0% |
| 4 (class) | 3,461 | 13.6% |
| 5 (category) | 1,048 | 4.1% |
| 6 (sub-category) | 808 | 3.2% |
| 8 (leaf) | 788 | 3.1% |
The first three to four digits do the heavy lifting in absolute terms. The biggest relative cut is at 4 to 5 digits, from 3,461 down to 1,048 candidates, a drop of 70%. From the sixth digit on, the funnel has basically run dry: going from 6 to 8 digits saves only 20 more candidates while buying a real risk, namely excluding genuine matches that share the category but carry a different leaf.
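The per-step cuts can be recomputed directly from the funnel table. The sketch below only uses the averages quoted above; the function name `step_drops` is illustrative.

```python
# (CPV digits, average candidates per firm), copied from the funnel table.
funnel = [(2, 12768), (3, 6860), (4, 3461), (5, 1048), (6, 808), (8, 788)]

def step_drops(funnel):
    """Return (from_digits, to_digits, absolute_cut, relative_cut) per step."""
    drops = []
    for (d0, n0), (d1, n1) in zip(funnel, funnel[1:]):
        drops.append((d0, d1, n0 - n1, (n0 - n1) / n0))
    return drops

drops = step_drops(funnel)
for d0, d1, cut, rel in drops:
    print(f"{d0} -> {d1} digits: {cut:,} fewer candidates ({rel:.1%})")
```

Read this way, the step from 2 to 3 digits removes the most candidates in absolute terms, the step from 4 to 5 digits is the steepest relative cut, and the step from 6 to 8 digits barely moves the number.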
That gives a sweet spot at 5 to 6 digits. Filter coarser and too much junk gets through, which multiplies the cost of the next stage; filter finer and you save almost nothing while throwing away good hits. In production we run 6 digits: 25,745 tenders become about 808 candidates on average. That is 97% gone, deterministically, and it is what keeps the judging stage that follows at roughly $0.40 per company.
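A prefix filter of this kind can be sketched in a few lines. This is a minimal illustration, not the production code: it assumes plain truncation of both the company's and the tender's codes to the same number of digits, and the tender records, their IDs, and the helper names `cpv_prefix` and `prefilter` are hypothetical.

```python
def cpv_prefix(code: str, digits: int = 6) -> str:
    """Normalise a CPV code such as '45 31 43 00' or '45314300-4' and truncate it."""
    main = code.split("-")[0]  # drop a trailing check digit if one is present
    only_digits = "".join(ch for ch in main if ch.isdigit())
    return only_digits[:digits]

def prefilter(tenders, company_codes, digits=6):
    """Keep every tender that shares a truncated CPV code with the company."""
    wanted = {cpv_prefix(code, digits) for code in company_codes}
    return [
        t for t in tenders
        if any(cpv_prefix(code, digits) in wanted for code in t["cpv"])
    ]

# Hypothetical sample data, for illustration only.
tenders = [
    {"id": "T1", "cpv": ["45314300-4"]},  # hyphen suffix only exercises the parser
    {"id": "T2", "cpv": ["45314310"]},    # same 6-digit prefix, different leaf
    {"id": "T3", "cpv": ["45311000"]},    # same class, different sub-category
    {"id": "T4", "cpv": ["90910000"]},    # unrelated division
]
company_codes = ["45 31 43 00"]

for digits in (4, 6, 8):
    ids = [t["id"] for t in prefilter(tenders, company_codes, digits)]
    print(digits, "digits ->", ids)
```

At 8 digits the sibling leaf T2 drops out, which is the risk described above: a genuine match that shares the category but carries a different leaf.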
Step 2: The diffusion LLM as the judge
The remaining ~800 candidates need real judgement, exactly the "does this firm deliver this, or does the contract want something else?" question that vectors fail at. That is a job for a language model.
Classic autoregressive language models produce their answer token by token, sequentially. For our case that is doubly awkward: each request is slow because every token waits on the one before it, and the whole job is slow in aggregate when you need hundreds of verdicts.
A diffusion language model works differently. It does not generate word by word, it refines the whole answer in a few parallel passes at once, like an image being sharpened out of noise. The result is dramatically higher throughput and low latency. Concretely, with the model we use (Inception Mercury): 0.24 s latency per request, 427 tokens/s throughput.
So we do not ask 800 times in sequence; we judge 100 tenders per wave in parallel. For ~800 candidates that is roughly eight waves at about 0.24 s each, so about 2 seconds for a complete company against every relevant tender in Germany.
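The wave pattern itself is independent of the model. Here is a minimal sketch using Python's `asyncio`, with a hypothetical `judge` coroutine standing in for the real API call (no actual model or provider client is used, and the keyword check inside it is only a placeholder verdict).

```python
import asyncio

WAVE_SIZE = 100  # from the text: 100 tenders per wave

async def judge(company_profile: str, tender: str) -> bool:
    """Hypothetical stand-in for one LLM verdict; swap in a real API call."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return "cable" in tender.lower()

async def judge_in_waves(company_profile, tenders, wave_size=WAVE_SIZE):
    """Judge all tenders, one parallel wave of `wave_size` requests at a time."""
    verdicts = []
    for start in range(0, len(tenders), wave_size):
        wave = tenders[start:start + wave_size]
        # gather() runs the whole wave concurrently and preserves input order.
        verdicts += await asyncio.gather(
            *(judge(company_profile, t) for t in wave)
        )
    return verdicts

# Made-up candidate pool of 808 tenders, every fourth one about cable works.
tenders = [
    f"Tender {i}: cable works" if i % 4 == 0 else f"Tender {i}: catering"
    for i in range(808)
]
verdicts = asyncio.run(judge_in_waves("company profile text", tenders))
print(sum(verdicts), "matches out of", len(verdicts))
```

With 808 candidates this loop runs nine waves; if each wave takes about the quoted 0.24 s, that is roughly 2.2 seconds end to end.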
Why diffusion, and not just a fast model fired in parallel?
Fair objection: take any fast autoregressive model, fire all 800 judgements at once, and you would hit roughly the same wall-clock. So why a diffusion model specifically? Two reasons.
The first is early stopping. We run a marketing pipeline that builds a free preview for a company: its 20 best-matching tenders, shown in seconds. For that we do not need all 800 verdicts, we need the strongest handful. So we judge in fast batches and stop the moment a batch has produced enough strong matches. Firing all 800 at once would compute every verdict even when the first wave already answered the question. Because each diffusion batch is so cheap and low latency, this batch-then-early-stop loop stays almost instant: the preview is usually ready after a wave or two, not after grinding through everything.
The second reason is simpler: we think it is cool. There is something genuinely satisfying about a model that conjures a whole answer out of noise instead of crawling through it one token at a time, and it is still rare to get to use one in production.
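The batch-then-early-stop loop from the first reason can be sketched as follows. This is an illustration under assumptions: the score threshold of 0.8, the function names, and the fake scorer are all hypothetical, and the real pipeline may rank or stop differently.

```python
def preview(tenders, score_batch, batch_size=100, wanted=20, threshold=0.8):
    """Judge in batches and stop once `wanted` strong matches are collected."""
    strong = []       # (score, tender) pairs judged strong so far
    batches_used = 0
    for start in range(0, len(tenders), batch_size):
        batch = tenders[start:start + batch_size]
        scores = score_batch(batch)  # one parallel wave of verdicts
        batches_used += 1
        strong += [(s, t) for s, t in zip(scores, batch) if s >= threshold]
        if len(strong) >= wanted:
            break  # enough strong matches, skip the remaining batches
    strong.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in strong[:wanted]], batches_used

def fake_scores(batch):
    """Hypothetical scorer: every fifth tender is a strong match."""
    return [0.9 if t % 5 == 0 else 0.1 for t in batch]

top, batches_used = preview(list(range(808)), fake_scores)
print(len(top), "matches after", batches_used, "batch(es)")
```

One trade-off to keep in mind: stopping early returns the best matches among the batches that were judged, not necessarily the best across all candidates, which is acceptable for a preview but not for a full run.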
Step 3: Prompt caching, why calls 2 to 800 cost almost nothing
Now look at the cost of those 800 calls. Each call is built from two parts. The first is long and never changes: the instruction plus the company's full profile, its service groups and references, about 1,575 tokens of stable text. The second is short and changes every time: the one tender being judged, about 393 tokens. So roughly 80% of every call is identical to the last one.
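The two-part layout can be sketched like this. The instruction wording, the profile text, and the function name `build_prompt` are placeholders; the only point is that the stable part comes first and is built the same way on every call.

```python
# Placeholder instruction text, not the real prompt.
INSTRUCTION = "Decide whether the company can deliver this tender. Answer FIT or NO_FIT."

def build_prompt(company_profile: str, tender_text: str) -> tuple[str, str]:
    """Return (stable_prefix, variable_suffix) for one judgement call."""
    # Stable part: instruction plus company profile, identical for every tender.
    stable_prefix = f"{INSTRUCTION}\n\nCompany profile:\n{company_profile}\n\n"
    # Variable part: the single tender being judged.
    variable_suffix = f"Tender:\n{tender_text}\n\nVerdict:"
    return stable_prefix, variable_suffix

profile = "Lays low-voltage cable for municipalities."
p1, s1 = build_prompt(profile, "Cabling works for a school")
p2, s2 = build_prompt(profile, "Catering for a canteen")

print("prefix identical:", p1 == p2)
print("suffix identical:", s1 == s2)
```

Anything that varies per call, such as a timestamp or the tender ID, belongs in the suffix; placed in the first part, it would make the beginning of each call different.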
Modern LLM providers let you exploit exactly this with prompt caching. The first time the model processes that long prefix it does the full work and stores the result. On every later call that starts with the same prefix it reuses the stored work instead of recomputing it, and charges roughly one tenth of the normal input price for those tokens.
For us this is close to a cheat code. When we match one company against every tender, the 1,575-token company prefix is byte for byte identical across all 800 calls. We pay for it in full exactly once, on call 1. Calls 2 through 800 read it from cache at a tenth, and only the ~393 fresh tender tokens cost full price.
Multiply that across a full run and the prefix, 80% of the input, almost vanishes from the bill after the first call.
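The arithmetic behind that claim, using only the token counts and the one-tenth cache price quoted above. It assumes the first call is billed at the normal input price with no extra charge for writing to the cache, which varies by provider.

```python
PREFIX_TOKENS = 1575         # instruction + company profile (stable)
TENDER_TOKENS = 393          # one tender (changes every call)
CALLS = 800
CACHED_PRICE_FACTOR = 0.1    # cached tokens billed at about a tenth

def billed_input_tokens(calls, prefix, fresh, cached_factor):
    """Input tokens in full-price equivalents when the prefix is cached."""
    first_call = prefix + fresh  # assumption: no surcharge for the cache write
    later_calls = (calls - 1) * (prefix * cached_factor + fresh)
    return first_call + later_calls

uncached = CALLS * (PREFIX_TOKENS + TENDER_TOKENS)
cached = billed_input_tokens(CALLS, PREFIX_TOKENS, TENDER_TOKENS, CACHED_PRICE_FACTOR)

prefix_share = PREFIX_TOKENS / (PREFIX_TOKENS + TENDER_TOKENS)
saving = 1 - cached / uncached

print(f"prefix share of each call: {prefix_share:.0%}")
print(f"input cost saved over {CALLS} calls: {saving:.0%}")
```

The saving comes out at about 72% of the input bill, not the full 80% prefix share, because the cached prefix still costs a tenth and the tender tokens are always full price.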
The result: human vs machine
Put the two worlds side by side, the same task, one company against its ~808 pre-filtered candidates:
| | Human | Mercury (diffusion + cache) |
|---|---|---|
| Time | 13.5 hours (1.7 working days) | ~2.2 seconds |
| Cost | ~187 EUR (minimum wage 13.90 EUR/h) | ~0.40 USD (~0.37 EUR) |
That is about 22,000x faster and about 500x cheaper, and that is in the fair version where the human also gets the CPV pre-filter. Without it, a human would have to check all 25,745 tenders: 429 hours, about 54 working days, about 5,964 EUR, per company, per run.
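These figures can be reproduced from the numbers in the table. The one assumption not stated in the text is the reading speed: 13.5 hours for about 808 tenders implies roughly one minute per tender, and the sketch uses that value throughout.

```python
CANDIDATES = 808             # pre-filtered candidates per company
CORPUS = 25_745              # all tenders in the database
WAGE_EUR_PER_HOUR = 13.90
MINUTES_PER_TENDER = 1.0     # assumption inferred from 13.5 h for ~808 tenders
MACHINE_SECONDS = 2.2
MACHINE_COST_EUR = 0.37

def human_effort(tenders):
    """Return (hours, cost in EUR) for a human reading `tenders` notices."""
    hours = tenders * MINUTES_PER_TENDER / 60
    return hours, hours * WAGE_EUR_PER_HOUR

human_hours, human_cost = human_effort(CANDIDATES)
full_hours, full_cost = human_effort(CORPUS)

speedup = human_hours * 3600 / MACHINE_SECONDS
cost_ratio = human_cost / MACHINE_COST_EUR

print(f"with pre-filter: {human_hours:.1f} h, {human_cost:.0f} EUR")
print(f"without pre-filter: {full_hours:.0f} h, {full_cost:.0f} EUR")
print(f"speed-up: {speedup:,.0f}x, cost ratio: {cost_ratio:.0f}x")
```

At eight hours per working day, the full corpus comes to about 54 days, which matches the figure above.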
And this is exactly where applied AI pays off. What explodes with customer count for a human stays cheap for us. Matching 1,000 companies a day in full against their candidate pool costs about 400 USD and takes minutes. For a human that would be 13,500 working hours a day, simply impossible.
The chain is the point: a CPV pre-filter (97% gone, free, and no false negatives as long as the codes are right), then a diffusion LLM as the judge (parallel, about 2 s), then prompt caching (input cost down about 72%). Each step on its own is well known. Together they turn a "cannot be done in real time" problem into a matter of split seconds and cents.
