Perhaps the biggest open question in web3 is just what we get in exchange for using public blockchains. This applies in all sorts of directions but right now we want to focus on compliance, forensics and generally how tools that leverage blockchain transparency interact with legal systems.
Recently, Chainalysis touted a study they claimed proved their tools were "accurate and reliable." That post appeared shortly after an Amicus Brief we contributed to was filed arguing their tools were, at least sometimes, anything but. More recently, we released some preliminary test results of the oldest and most common blockchain tracing technique, results which suggest that tools like those Chainalysis builds can fail spectacularly, at least sometimes.
We absolutely are not claiming none of these tools work. We are not claiming all the legal system outcomes derivative of blockchain forensics need to be thrown out. But we are still making two big claims. First, that even the oldest and most-heavily-used blockchain forensic technique – the co-spend heuristic – can fail badly under realistic circumstances. And second, that validation work done to date on these tools is grossly inadequate such that a huge amount of remedial effort is required.
Below we will go through some details that, for example, illustrate just how different the results Chainalysis cites are from our own work. Forensic science is science and science can be messy. Data abound. No individual piece of work is the infallible truth in science. And if there are only two data points and they disagree the solution is not to ignore one of them. The solution is to get more data. But even before that it is important to understand the lay of the land.
Co-Spending
The oldest blockchain forensic technique, first published in 2013, is the "co-spend heuristic." This applies to blockchains with a structure similar to Bitcoin and asserts that all the inputs to a single UTXO transaction are under common control. The logic behind this rule of thumb is simple and compelling. Bitcoin users do not share their private keys. And the private keys for all the inputs are required to generate a valid transaction. Therefore the only way for multiple inputs to end up in a single transaction is if they all belong to the same person.
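In practice the heuristic amounts to simple address clustering: every time a transaction spends from several addresses, those addresses get merged into one presumed owner. The sketch below is a toy illustration (addresses and transactions are invented, and real tools are far more elaborate):

```python
# Minimal sketch of the co-spend heuristic as address clustering,
# implemented as a union-find over addresses.

class CoSpendClusters:
    """Inputs spent together in one transaction merge into one cluster."""

    def __init__(self):
        self.parent = {}

    def find(self, addr):
        self.parent.setdefault(addr, addr)
        while self.parent[addr] != addr:
            self.parent[addr] = self.parent[self.parent[addr]]  # path halving
            addr = self.parent[addr]
        return addr

    def merge_inputs(self, input_addrs):
        # The heuristic itself: all inputs of one transaction share an owner.
        it = iter(input_addrs)
        first = self.find(next(it))
        for a in it:
            self.parent[self.find(a)] = first

    def same_owner(self, a, b):
        return self.find(a) == self.find(b)


clusters = CoSpendClusters()
clusters.merge_inputs(["addr1", "addr2"])  # tx 1 spends addr1 and addr2
clusters.merge_inputs(["addr2", "addr3"])  # tx 2 spends addr2 and addr3
print(clusters.same_owner("addr1", "addr3"))  # transitively merged -> True
```

Note the transitivity: addr1 and addr3 never appear in the same transaction, yet they end up in one cluster. That is exactly why a single bad merge can poison an entire cluster.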
That sounds good. And those two predicate assumptions are approximately true. But there are plenty of reasons people share private keys. And while it is necessary for all the input private keys to sign a single transaction for it to be valid, there are protocols that allow groups to achieve this without actually sharing the keys among the users. So "approximately" is doing a lot of work there.
If the assumptions break down often enough, or at least often enough when this technique's application matters, then the co-spend heuristic will not work in practice.
It is important to be careful about phrasing this. Nobody is claiming these assumptions must be 100% right, 100% of the time to yield a useful technique. But neither can we blithely assume these conditions always hold and this heuristic is always perfect. Someone needs to go off and measure how often the heuristic gives the right answer and then use of the co-spend heuristic must be guided by those measurements. The standard for fingerprint matching is not "pixel-for-pixel identical" but rather is guided by measurements. Science.
Insisting on testing the reliability of new techniques and using those observations to inform the application of those techniques should not be considered controversial. That is just basic science. We test medicines before relying on them to treat disease. We safety-test cars before they are allowed on the road. None of this is controversial. To the extent there are arguments around these kinds of testing they concern "how good?" or "how safe?" or "what are the side effects?" and not whether testing itself is needed at all.
Chainalysis' Claims
Several academics studied the reliability of the co-spend heuristic in the "attribution of illicit services." Specifically, the study measured the accuracy of co-spend-based clustering at identifying which addresses were and were not part of three illicit services that had long before been captured by law enforcement. That study sets the scope of our analysis: illicit onchain services. This is not about identifying exchange deposit addresses or following funds through some chain of custody. The problem at hand is determining which addresses are inside and outside a given illicit service.
There are two key things to note about this study. The first is that it concerned three services that had already been captured and of which the blockchain forensics community was already well aware. Blockchain forensics differs from fingerprint analysis here: tool builders cannot easily acquire more test data. Only a limited number of captured services exist and there is no easy way to get more. Law enforcement, presumably, is already trying to capture as many illicit services as possible.
So the blockchain forensics community is already, in effect, limited to training on services captured by law enforcement as part of their work with law enforcement. That is all fine and reasonable and everyone would expect law enforcement to support the building and testing of good tools for use in investigations.
But this also means tests run on these three services are what statistics calls "in-sample" testing. This is where you measure the effectiveness of a model on data that the model designers had access to when they did their work. Models that do not work in-sample are generally discarded. So we expect everything put into production to work well in-sample.
Real-world use in an investigation is "out-of-sample" as the model has never seen the data at issue. This performance is what we really care about. A fingerprint matching technique that works on the test cases is worthy of further study. But we can only really call it "working" when it correctly matches out-of-sample fingerprints.
The second thing to note about this study is the observed error rates. The false positive rate (FPR), which measures how often an address predicted to be part of a service was not actually part of it, was nearly 0% for all three cases. But the false negative rate (FNR), which measures how often addresses predicted to be outside the service were in fact inside it, was between 5% and 75%.
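For concreteness, here is how those two rates are computed over a set of labeled addresses. The labels below are invented for illustration (True means the address really belongs to the service):

```python
# Toy FPR/FNR computation over hypothetical address labels.

def error_rates(truth, predicted):
    """truth/predicted: dicts mapping address -> bool ('inside the service?')."""
    fp = sum(1 for a in truth if predicted[a] and not truth[a])
    fn = sum(1 for a in truth if not predicted[a] and truth[a])
    negatives = sum(1 for a in truth if not truth[a])  # truly outside
    positives = sum(1 for a in truth if truth[a])      # truly inside
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

truth     = {"a": True, "b": True, "c": True, "d": False, "e": False}
predicted = {"a": True, "b": False, "c": False, "d": False, "e": False}
print(error_rates(truth, predicted))  # (0.0, 0.666...): no false positives, two of three insiders missed
```

This toy case mirrors the study's pattern: a clustering can have a perfect FPR while still missing most of a service's addresses.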
One interpretation here might be that the co-spend heuristic is conservative in that it is nearly-always correct on inclusions but often wrong on exclusions. That is possible. This could happen because the assumptions underlying the heuristic fail in ways that produce this result. Or it could happen because engineers looking at various techniques decided to optimize for a low FPR at the expense of a high FNR and this heuristic is the technique that won out. Recall that the heuristic is a high-level rule and there are many design choices to make when writing software to implement it.
It could also be that this heuristic fails in this specific way for these three services but behaves differently for other services. When a model exhibits significantly different behaviour in-sample vs out-of-sample one cause is often "overfitting." This is where the model picks up unintended features in the training data and focuses too closely on the nitty-gritty of the starting data set.
There is no simple, perfectly-reliable way to figure out precisely what is happening inside these kinds of models. But the main tool used to investigate these questions is out-of-sample testing.
Our Tests
We designed an out-of-sample testing scheme built around a technique to convert between Coinjoins – the obfuscation primitives used by mixers on Bitcoin – and zero-knowledge mixers like Tornado Cash on Ethereum. This technique was devised specifically to enable this testing but the math is quite simple and general. We start from the observation that Coinjoins achieve obfuscation by transferring collectively from an input set to an output set all at once, so that, roughly, everyone pays everyone else simultaneously. This prevents any sort of "this output came from this input" analysis.
ZK mixers, on the other hand, achieve obfuscation by commingling many users' funds in a common account. So every withdrawal is sourced collectively from all prior deposits (at least back as far as the commingled balance exceeds the size of the withdrawal).
So we can map between these techniques by uniting uninterrupted strings of deposits as UTXO inputs and uninterrupted strings of withdrawals as UTXO outputs. If a mixer gets 10 deposits followed by 10 withdrawals we can rewrite that as a single 10-to-10 UTXO with the same obfuscation properties. Then we add logic to handle settled balances before and after this UTXO transaction and we have a conversion scheme.
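The mapping can be sketched in a few lines. The event format below is a hypothetical simplification, and for brevity the sketch ignores the settled-balance handling mentioned above:

```python
# Sketch of the conversion described above: a run of deposits followed by
# a run of withdrawals becomes one many-to-many synthetic UTXO transaction.

def to_coinjoins(events):
    """events: list of ('deposit'|'withdraw', address) in chain order.
    Returns synthetic UTXO transactions as (input_addrs, output_addrs)."""
    txs, deposits, withdrawals = [], [], []
    for kind, addr in events:
        if kind == "deposit":
            if withdrawals:  # a withdrawal run just ended: close the tx
                txs.append((deposits, withdrawals))
                deposits, withdrawals = [], []
            deposits.append(addr)
        else:
            withdrawals.append(addr)
    if deposits or withdrawals:
        txs.append((deposits, withdrawals))
    return txs

events = [("deposit", "d1"), ("deposit", "d2"),
          ("withdraw", "w1"), ("withdraw", "w2"),
          ("deposit", "d3"), ("withdraw", "w3")]
print(to_coinjoins(events))
# [(['d1', 'd2'], ['w1', 'w2']), (['d3'], ['w3'])]
```

Each new deposit arriving after a withdrawal run closes out one synthetic transaction, which is exactly the "uninterrupted strings" rule stated above.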
Once we have code to implement this mapping we can set up a test as follows:
- Grab all transactions for a single Tornado Cash pool
- Convert them into equivalent Coinjoins
- Run co-spend on those Coinjoins
- Measure FPR and FNR for "converted Tornado Cash"
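The failure mode this surfaces can be seen on a toy example. Below, two independent users' deposits land in one uninterrupted run, so the conversion yields a single two-input Coinjoin; applying co-spend clustering to it then merges the two strangers into one presumed owner. All names and events are invented, and the set-based clustering here is a simplification:

```python
# Toy run of the test pipeline's final step: co-spend clustering applied
# to an already-converted synthetic Coinjoin.

def cospend_clusters(txs):
    """Merge each transaction's input addresses into one owner cluster."""
    clusters = []  # list of sets of addresses
    for inputs, _outputs in txs:
        merged = set(inputs)
        keep = []
        for c in clusters:
            if c & merged:
                merged |= c  # overlapping clusters collapse together
            else:
                keep.append(c)
        keep.append(merged)
        clusters = keep
    return clusters

# Two independent users deposited back-to-back into the same pool, so the
# conversion produced one 2-input synthetic Coinjoin...
synthetic_txs = [(["user_A", "user_B"], ["w1", "w2"])]
clusters = cospend_clusters(synthetic_txs)
# ...and co-spend wrongly declares the two users a single owner:
print(any({"user_A", "user_B"} <= c for c in clusters))  # True
```

Because we wrote the conversion, we know with certainty that user_A and user_B are unrelated, so this merge counts against the heuristic in our measurements.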
To the extent no analytics company does this while building their tools this is out-of-sample testing. We want to make clear upfront we know this is particularly brutal out-of-sample testing. Why? Because we know the predicate assumptions for co-spend do not hold at all for Tornado Cash. But these are at least somewhat real transactions and we know with certainty which converted addresses belong to "Tornado Cash" and which are end users because we wrote the mapping code.
We ran this testing for 10 Tornado Cash pools. And our FNR and FPR were both significantly worse than reported in-sample. Our FPR was always at least 7%, FNR always at least 18% and in four of the 10 cases both were over 50%. For all 10 cases FPR+FNR was at least 46%. These are bad results for what we admitted already are difficult cases for the tool.
Something is wrong here because these results look nothing like the in-sample analysis quoted above. And, plainly, these results do not suggest co-spend works on these data.
Reconciling Differences
The simplest explanation here is that co-spend works on the types of services cited above and it does not work for the type of service we synthesized. Both studies would need replication, and probably analysis of more data, to firm up that conclusion. But something like that is pretty clearly what this, admittedly preliminary, data tells us.
But that is an unsatisfying, even troubling, conclusion because when looking at a novel illicit service we have no way of knowing if it is the good kind for co-spend or the bad kind. Our mapping scheme works because we can read the Tornado Cash code so we know how it works. But when examining an unknown service in the wild we may not have that luxury. We know nothing beyond the observation that at least two performance regimes exist for the co-spend heuristic and the heuristic exhibits wildly different performance in at least those two regimes.
How is law enforcement supposed to know if the heuristic works when they go apply for a search warrant? How is anyone supposed to respond when a defense lawyer asks "how do you know your tools work for the service you accuse my client of running?" The plain truth is that there are not enough studies out there for anyone to give satisfying answers to these questions.
Further, notice those two example questions are of a fundamentally different character. The first concerns whether the co-spend heuristic is a good enough source of leads that warrants should be issued, or queries sent to exchanges, or other basic investigative steps should be taken. This is a relatively low standard that generally – the details vary across legal systems – requires something like "reasonableness" and that the technique be better than a "hunch" or "mere guessing."
But the second question exists more in the context of a trial where a higher standard applies. And the main reason tracing reliability comes into the discussion at trial is if the investigative process found no stronger evidence. Fingerprint evidence, or a matching hair or shirt color, does not come into the discussion at a murder trial if there is solid DNA evidence and a video of the defendant committing the crime. Cases are made at the margin using the best available evidence.
If tracing evidence built atop the co-spend heuristic is the best available evidence then we are looking at a standard like "is this reliable beyond a reasonable doubt" or "is this convincing on a balance of probabilities?" It is perfectly reasonable for a technique to meet one standard but not another. This is normal. If you are in a bank when that bank is robbed the police will almost surely question you based on your presence alone. But they are unlikely to charge you with robbery absent additional evidence. And a conviction is exceptionally unlikely – it is not supposed to happen at all in any legal system – without something stronger to tie you to the robbery than merely being in the bank at the time alongside a bunch of other people.
What standard for the co-spend heuristic then? We do not know. But we will point out that even the near-zero FPRs reported above are orders of magnitude worse than DNA testing. Overall even the "good" results for co-spend look roughly like contactless fingerprinting. Contactless fingerprinting is an amazing technique. But everyone with a fingerprint scanner on their mobile phone knows it is not something you would blindly base a conviction on without more work.
Similarly, some aspiring forensic techniques like phrenology and recovered-memory therapy failed under testing and are not usable at all in legal processes anywhere. The co-spend heuristic is unlikely to prove as bad as those two long-discarded branches of pseudoscience. But forensic science is a science and testing standards for forensic techniques are not new. Many a Sherlock Holmes story from the late 19th century revolves around some then-new forensic analysis which today sounds commonplace. Science marches on.
Blockchain forensics, at least this corner of blockchain forensics, is not infallible evidence. Even the best data shows it is dramatically worse than the gold standard forensic technique of DNA. Is it good enough for warrants? Should it be admitted at trial at all? These are hard questions.
But this industry needs to start to move on this or there will be trouble. What if there is no solid forensic science foundation for tracing and someone is able to prove a pivotal positive is in fact a false positive and get a case thrown out?
If that happens everyone in jail with a case connected even tenuously to the co-spend heuristic will file to have their case thrown out immediately. Everyone with funds frozen will file. Everyone that can claim, in any way, that their positive was a false positive will have a go. Given the known-non-zero error rate and paucity of solid data this looks inevitable unless more data is prepared in time. And if courts begin rejecting some blockchain forensics as pseudoscience it will be exceptionally challenging to roll that back. This is a risk nobody in web3 seems prepared for. This risk could prove existential for some analytics firms.
Blockcast – Licensed to Shill: How Energy & Geopolitics Are Building a Bitcoin-Driven World, ft. Bitcoin Arabia's Lara Eggimann & Jeff Gorman
The Middle East is poised to become a pivotal hub in the global cryptocurrency ecosystem. Countries within the region are increasingly recognizing the strategic importance of integrating blockchain technology into their economic frameworks, energy markets, and geopolitical strategies, according to Lara Eggimann and Jeff Gorman, co-founders of Bitcoin Arabia, a strategic Bitcoin advisory and ecosystem builder that connects the global Bitcoin industry with the Middle East's most powerful stakeholders.