The right dataset depends on your use case: training a machine learning model, studying cross-species transmission, or running a phylogenetic analysis. Several specialized, high-quality datasets map viruses to their hosts (humans, animals, plants, bacteria, and archaea).
Here are the best repositories and datasets currently available:
1. Broad Cross-Host Databases
Virus-Host Database (Virus-Host DB): Developed by Kyoto University, this is arguably the cleanest dataset for broad virus-host mapping. It maps relationships between viruses and their hosts using official NCBI taxonomy IDs.
It covers viruses across all major host groups (Eukaryotes, Bacteria, Archaea) and pulls data from RefSeq, GenBank, UniProt, and curated literature.
Best for: Mapping comprehensive viral taxonomy to multi-host taxonomy.
Link: genome.jp/virushostdb
Viral Host Range Database (VHRdb): Maintained by the Institut Pasteur, this open-access resource centralizes experimental data regarding the host ranges of viruses.
Although it was originally built for bacteriophages (viruses that infect bacteria), it can also store host-range data for viruses that infect any other form of life.
Best for: Accessing experimental and laboratory-verified host range data.
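For a taxonomy-ID-keyed mapping like the one Virus-Host DB describes, a first processing step is often to group host tax IDs by virus tax ID. The sketch below assumes the mapping is available as a tab-separated table; the column names ("virus tax id", "host tax id") and the sample rows are illustrative assumptions, so check the header of the file you actually download.

```python
import csv
import io
from collections import defaultdict

# Illustrative rows only. The column names are assumed, and the last row is a
# made-up placeholder showing a record with no host recorded.
SAMPLE_TSV = (
    "virus tax id\tvirus name\thost tax id\thost name\n"
    "11292\tRabies lyssavirus\t9606\tHomo sapiens\n"
    "11292\tRabies lyssavirus\t9615\tCanis lupus familiaris\n"
    "10665\tEscherichia phage T4\t562\tEscherichia coli\n"
    "99999999\tPlaceholder virus\t\t\n"
)


def load_virus_hosts(handle):
    """Return {virus_tax_id: sorted list of host_tax_ids} from a TSV handle."""
    hosts = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        host_id = (row.get("host tax id") or "").strip()
        if not host_id:
            continue  # skip rows with no recorded host
        hosts[int(row["virus tax id"])].add(int(host_id))
    return {virus: sorted(ids) for virus, ids in hosts.items()}


virus_hosts = load_virus_hosts(io.StringIO(SAMPLE_TSV))
print(virus_hosts)
```

To use it on a real download, pass an open file handle instead of the in-memory sample, for example `open(path, newline="", encoding="utf-8")`.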
2. Machine Learning & Predictive Datasets
Viral Host Predictor Dataset: Created by the CVR Bioinformatics team, this includes pre-parsed genomic data from over 3,000 single-stranded RNA viruses across 12 different taxonomic reservoir host groups and arthropod vectors.
Best for: Training ML models to predict reservoir origins or vector-borne transmission status.
Link: bioinformatics.cvr.ac.uk/software/viral-host-predictor/
Pan-Virome Human Compatibility Dataset: A curated genomic dataset intended for training deep learning models that distinguish human-infecting viruses from viruses of other animal hosts, based on nucleotide and k-mer frequency patterns.
Data source reference: Look into recent 2026 publications on sequence-based virus host prediction (available via Oxford Academic / Virus Evolution).
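Since these predictive datasets rely on nucleotide and k-mer frequency patterns, here is a minimal sketch of how such features can be computed from a raw sequence. It is a generic illustration, not the feature pipeline of any dataset listed above; the function name `kmer_frequencies` and the example sequence are made up for the demo.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Windows containing ambiguous bases (e.g. N) are skipped, and RNA input
    is handled by treating U as T.
    """
    seq = sequence.upper().replace("U", "T")
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    # Emit every possible k-mer so all sequences share one fixed-length feature vector.
    return {
        "".join(kmer): (counts["".join(kmer)] / total if total else 0.0)
        for kmer in product("ACGT", repeat=k)
    }


features = kmer_frequencies("ATGCGATNACGATT", k=2)
print(len(features), round(features["AT"], 3))
```

Emitting all 4^k k-mers, including those with zero counts, keeps the feature vector the same length for every genome, which most ML libraries expect.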
3. Disease & Interaction-Specific Datasets
Viral Diseases Explorer (VDE): Part of the Virus World Database (VWdb), this tool leverages cross-validated data to catalog thousands of unique diseases affecting over 5,500 distinct hosts.
Best for: Mapping virus-host pairs explicitly to the symptoms and clinical diseases they cause.
Link: virus-world.org
VirusMentha: If your focus is molecular, this resource collects virus-host protein-protein interactions (PPIs), showing which viral proteins interact with which host proteins across different hosts.
Link: virusmentha.uniroma2.it
Pro-Tip for Raw Data Extraction
If you want to build a fully custom dataset, the standard approach is to download raw data from NCBI Virus or GenBank. You can filter sequences by host (e.g., Homo sapiens, Aves, Chiroptera); in GenBank records the host is stored as free text in the /host qualifier of the source feature. Because submitters label hosts inconsistently, you will need a short Python script (for example, using Biopython) to normalize those labels.
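The cleanup step might look something like the sketch below. To keep it runnable without extra dependencies it works on plain strings; with Biopython, the raw labels would typically come from the `host` qualifier of each record's source feature (e.g. `feature.qualifiers.get("host", [])`). The synonym table and the sample labels are hypothetical and would need to be extended from the labels you actually see in your download.

```python
import re

# Hypothetical synonym table; extend it from the labels observed in your data.
HOST_SYNONYMS = {
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "bat": "Chiroptera",
    "chiroptera": "Chiroptera",
    "chicken": "Gallus gallus",
    "gallus gallus": "Gallus gallus",
}


def normalize_host(raw):
    """Map a free-text host label to a canonical name, or None if unknown."""
    if not raw:
        return None
    label = raw.strip().lower()
    # Drop trailing detail such as "; male" or "(chicken)".
    label = re.split(r"[;,(]", label, maxsplit=1)[0]
    label = re.sub(r"\s+", " ", label).strip()
    return HOST_SYNONYMS.get(label)


raw_labels = ["Homo sapiens; male", "HUMAN", "  bat ", "Gallus gallus (chicken)", "unknown"]
cleaned = [normalize_host(label) for label in raw_labels]
print(cleaned)
```

Returning `None` for unrecognized labels, rather than guessing, makes it easy to list the leftovers and decide by hand whether each deserves a new synonym entry.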
What kind of project are you building with this data? I can point you toward the easiest API or download format if you have a specific angle in mind.