We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy and even outperforms 94% of expert virologists when compared directly on question subsets specifically tailored to the experts’ specialties.

https://archive.ph/xILJR

  • themurphy@lemmy.ml
    4 days ago

    Great results. Would an AI built specifically for this not be better, or is it just meant as a kind of benchmark for LLMs?