I was amused to read the following article the other day: “OneCare rises from bottom-place ranking”. The reporter notes that “Microsoft’s anti-virus product OneCare is no longer bottom of the pile when it comes to the tests carried out by an independent anti-virus researcher.” 

Unfortunately, the two results being compared come from completely different kinds of tests, so this is like comparing apples and hammers. The February 2007 comparative test by AV-Comparatives is a bulk-detection test. You take a giant pile of malware, turn off all the real-time functionality in a completely updated product, and perform an on-demand scan of the pile. The collection of malware used can span years, but most companies never remove signatures from their definitions, so everyone should be on an equal footing.

By comparison, a retrospective/proactive test, like the one from May cited in the article above, takes a very different approach. This test starts with a smaller pile of new malware (say, anything received by the reviewer in the last couple of months), and scans that collection with signatures older than the oldest sample in the collection. In theory, this means that the test measures the ability of the products to detect new malware. In other words, these are samples the vendors could not possibly have written signatures for, because the samples did not exist at the time the signatures were written.

What this means is that you can write signatures that detect everything that exists today, and nothing that comes into being tomorrow. Or vice versa. So our quote is like saying that my car, which failed its safety crash test last week, has improved because it completed the quarter mile in less time than someone else. Although it doesn’t mean that both areas haven’t improved, it certainly doesn’t tell you that they have.

This is not meant to take anything away from Microsoft or AV-Comparatives. But we as humans (and especially magazine publishers) tend to like black-and-white answers, and try to make everything fit that mold. Unfortunately, that habit makes it easy to leap to the wrong conclusions. For example:

  • Particularly for proactive tests, the score of any particular vendor might be related to the quality of their heuristics, but might also be related to the size, distribution, and false-positive tolerance of their user base. Companies with larger or more diverse customer bases have to be more cautious in developing new heuristics to avoid false positives on (relatively) obscure files that some small, but non-trivial, portion of their user base might be using without the vendor’s knowledge. Likewise, companies that provide only gateway scanners can tend to be more aggressive, because the impacts of a false positive (deleted email or blocked download) are much lower than they might be if files running on important servers were deleted.
  • The numbers generated for proactive tests might also be lower than what a user might see for a number of reasons:
    • The signatures used in these tests are frozen months or at least weeks before the test starts. A user who is only two days out of date will likely receive much better proactive detection rates than the worst-case scenario offered here.
    • This kind of test, which never executes the malware in question, will underestimate the benefits of behavioral technologies, such as our Access Protection Rules, Generic Buffer Overflow technology, and System Guards. 
  • Comparative tests such as the February 2007 tests likewise have some caveats that should be understood:
    • Each piece of malware is treated equally, so a highly prevalent and damaging virus that has infected millions of computers is considered as important as an obscure Trojan horse that may have existed only on a handful of computers.
    • As in the proactive test, the benefits of on-access (real-time) and behavioral technologies in preventing or limiting the impact of zero-day threats are ignored.
    • Some of the samples may be old or irrelevant to the platform the product is designed to defend. In fact, one of the reasons that AV signatures always grow is that none of the vendors can remove signatures for old threats, because we’re likely to see them in reviews and get slammed for missing them.
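The caveat about every sample counting equally can be illustrated with a short Python sketch. The prevalence figures and the two products below are invented for illustration, and prevalence weighting is shown only as one possible alternative way of scoring, not as something either test actually does.

```python
from typing import List, Tuple

# Each result is (detected, infected_machines). All numbers are hypothetical.
Result = Tuple[bool, int]


def unweighted_rate(results: List[Result]) -> float:
    """Every sample counts once, as in a bulk comparative test."""
    return sum(1 for detected, _ in results if detected) / len(results)


def prevalence_weighted_rate(results: List[Result]) -> float:
    """Each sample counts in proportion to the machines it has infected."""
    total = sum(machines for _, machines in results)
    covered = sum(machines for detected, machines in results if detected)
    return covered / total


# One widespread virus and three obscure Trojan horses.
product_a = [(True, 1_000_000), (False, 5), (False, 5), (False, 5)]
product_b = [(False, 1_000_000), (True, 5), (True, 5), (True, 5)]

for name, results in (("A", product_a), ("B", product_b)):
    print(
        f"product {name}: unweighted {unweighted_rate(results):.0%}, "
        f"prevalence-weighted {prevalence_weighted_rate(results):.4%}"
    )
```

In this toy case product B wins the unweighted score (75% against 25%) while missing the one threat that accounts for nearly every infected machine.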

Very few tests actually run live malware against real, fully updated security products. This approach provides the best correlation to real-world performance, but it is very labor-intensive and time-consuming. As a result, most tests of this nature are run on a small set of malware samples (at most 10 to 20). This means that the performance of any particular vendor might be different tomorrow if the test were run on a different sample set.
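One rough way to see how much a 20-sample result can move is to put a confidence interval around it. The Python sketch below uses the Wilson score interval, a standard statistical tool that I am bringing in for illustration; it is not something the tests discussed here report. It also assumes the samples are an independent random draw from the malware a user might meet, which real test sets rarely are, and the hit counts are invented.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a detection rate of hits/n."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("need 0 <= hits <= n and n > 0")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical results: the same 90% score from a small and a large test.
small = wilson_interval(18, 20)
large = wilson_interval(1800, 2000)

print(f"18/20 detected:     {small[0]:.1%} to {small[1]:.1%}")
print(f"1800/2000 detected: {large[0]:.1%} to {large[1]:.1%}")
```

Under those assumptions, 18 detections out of 20 is consistent with an underlying rate anywhere from roughly 70% to 97%, so two vendors a few points apart on such a test are effectively tied.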

Needless to say, all of these techniques are useful and contain important data. In fact, we run all of these kinds of tests in our lab to determine whether we are improving over time, and whether our products meet our quality standards. 

That being said, from years of experience I can say that higher numbers on tests do not always correlate to improved performance. Optimizing for one kind of result is likely to cause worse performance in other areas, be they the size of definitions, system performance, false-positive rates, removal effectiveness, or supportability. 

Likewise, reading too much (or too little) into test results can lead to selecting the wrong product for your situation. Here are some links to excellent resources on testing methodology and interpretation:

Comparing the Comparatives

Counting Spyware Detections

Antivirus Testing Workshop in Reykjavik

In particular, check the FAQ section of the methodology document, linked from the Comparatives tab at http://www.av-comparatives.org/.