Friday, July 4, 2008

Privacy, Data Mining, and False Positives

James Wimberly has an interesting post up at RBC about the possibility of large-scale data mining as an intelligence-gathering tool, and its costs and benefits. I'm less sanguine about the possibilities of any such large-scale operation than he is, but he's absolutely correct to note that we need to be having this discussion in the open, rather than relying on the "lawless-unitary-executive-knows-best" model that we've been running on so far.

The problem with any such operation is dealing with the false positives: the things the algorithm flags as suspicious that turn out to be nothing. Let's say his "third-degree" assumption is roughly accurate and we're targeting about one million people nationwide for data surveillance. Further assume that there are as many as 1,000 truly dangerous terrorist organizers in the US--determined, competent, well-financed. I'm not talking about janitors with fantasies of blowing up airports; I'm talking about people who have figured out how an airport could be sabotaged, have access to the means to carry it out, and are motivated to do so.

First of all, we note that 1,000 / 1,000,000 = 0.1% of our targets are actually dangerous. The other 999,000 are not dangerous--not motivated, incompetent, don't have the means, whatever.

Now suppose we have a screening method that can detect 95% of the bad guys and screen out 99% of the non-bad-guys. This is, of course, MUCH better than any actual method can do. But run the numbers:

We find 950 out of 1,000 terrorists (true positives), leaving 50 dangerous people at large (false negatives--people we think are harmless, who really aren't).

We also round up 999,000 * 0.01 = 9,990 people who aren't dangerous but weren't screened out--false positives.

Meaning we round up a total of 9,990 + 950 = 10,940 people, of whom 9,990 (91.3%) aren't dangerous.
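For anyone who wants to check the arithmetic, here's a quick Python sketch. All the inputs--the one million targets, the 1,000 real bad guys, the 95% detection rate, the 99% screen-out rate--are the hypothetical assumptions above, not real-world figures:

```python
# Sanity-checking the false-positive arithmetic above.
# Every number here is a hypothetical assumption from the post.

population = 1_000_000   # people targeted for data surveillance
terrorists = 1_000       # truly dangerous organizers among them
sensitivity = 0.95       # fraction of bad guys the screen catches
specificity = 0.99       # fraction of non-bad-guys it screens out

innocents = population - terrorists                # 999,000

true_positives = terrorists * sensitivity          # 950 caught
false_negatives = terrorists - true_positives      # 50 missed
false_positives = innocents * (1 - specificity)    # 9,990 wrongly flagged

flagged = true_positives + false_positives         # 10,940 rounded up
print(f"Flagged: {flagged:,.0f}")
print(f"Innocent among flagged: {false_positives:,.0f} "
      f"({false_positives / flagged:.1%})")        # ~91.3%
```

This is just the standard base-rate calculation: because the truly dangerous are such a tiny fraction of the pool, even a very accurate screen produces a list that's overwhelmingly innocent people.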

This isn't as much of a needle-in-a-haystack problem as we started with, I'll grant. But what happens when we tell investigators to go through a list of people with suspicious data traffic, but to remember most are probably completely innocent?

Well, we know that about 5% of Americans cheat on their taxes. And we know that IRS auditors, who spend all day dealing with tax fraud, estimate that 30% of Americans cheat on their taxes. When you deal with something out of the ordinary all day long, you can forget how out of the ordinary it is. When you deal with tax cheats all day, you tend to overestimate the prevalence of tax cheats.

Good luck getting your investigators going through the list of "data-based suspects" to remember that more than 90% are probably innocent or harmless, or both.

We can't just sit back and do nothing. But we should also avoid falling into the trap laid out in Yes, Minister:

We must do something.
This is something.
Therefore, we must do this.
