Collaborators Luca Nishimura and Phuong Thanh Nguyen helped me with the computational analysis. Doctor Nicholas Parrish of RIKEN provided me with scientific discussions and helped me write.
Results
The access count of herv-tfbs.com is increasing
942 unique IP addresses visited the database more than ten times, and 481 unique IP addresses visited more than twenty times, indicating that there are several hundred “core” users who visit the database regularly.
Herv-tfbs.com is globally accessed
Conclusion
Although the total number of accesses is still increasing, the majority of users have only visited the database once. Overall, we conclude that the database has attracted the attention of scientists and provides valuable information to them.
Comprehensive Discovery of CRISPR- targeted Terminally Redundant DNA
Metagenomes
Introduction
A commonly used criterion to filter viral genomes from metagenomic sequences is the presence of viral signature genes. Several surveys showed that these recorded viral genomes are only a fraction of the entire viral diversity on Earth.
Results
The discovered sequences include 2,154 tail phage genomes, along with 257 complete crAssphage genomes larger than 200 kilobases (kb), 766 Microviridae genomes, 56 Inoviridae genomes, 5,658 plasmid-like genomes, and 2,757 uncharacterized genomes. Although the majority of the detected sequences that were larger than 20 kb were mostly characterized by viral or plasmid genomes, a significant portion of sequences smaller than 20 kb were not recorded in either plasmid or viral databases.
Extraction of CRISPR-targeted sequences
By analyzing human gut metagenome reads and contigs assembled from them, we successfully extracted 11,391 CRISPR-targeted terminal redundant (TR) sequences ranging from 894 to 292,414 bases. The size of CRISPR-targeted TR sequences ranged from 894 to 292,414 bases (Supplementary Table 2-2 and Supplementary Data 2-1).
Classification of TR sequences
Conversely, the small plasmid-like and unclassified TR sequences had a wider distribution of coverages and one-tone coding ratios, indicating that some of these sequences may have a large proportion of non-coding regions and repeats. Capsidless TR sequences with ParA, ParB, ParM and/or MoBM were classified as plasmids.
Predicted CRISPR-targeting hosts of TR sequences
Sequences encoding a detectable capsid gene were classified as a viral taxon according to capsid type, as follows. When ≥10 protospacers could not be assigned to a taxon, the predicted target host was marked as unavailable (NA).
Firmicutes–Verrucomicrobia multiple infecting viruses are suspicious
Predicted host targeting of CRISPR-targeted TR sequences. A) Target host composition of TR sequences. The HGT option limits the usefulness of the DR-host connection for inferring the actual target host of amb-1.
Predicted targeting hosts above the taxonomic level of order are consistent
The DR sequence similarity could be explained by HGT in the CRISPR-Cas system between these species. The comparison of the tRNA-based predicted hosts with the DR-to-RefSeq-based predicted targeting hosts revealed a 93%.
CRISPR-targeted noncoding elements
The large proportion of high-GC TR sequences that were predicted to be targeted by Actinobacteria likely indicate genomic adaptation of parasitic genetic elements that infect and are routinely targeted by the CRISPR systems of high-GC Gram-positive bacteria (e.g. eg, Actinobacteria). Host range within a viral lineage was further assessed by phylogenetic analysis of TR sequences that were determined to represent Microviridae species. Interestingly, most of the known viral species infecting Escherichia coli (such as phiX174) and other protein-infecting species (such as phi MH2K [38]) used as a reference in this phylogeny were divided into two clades, and the clade containing the latter was within a group of TR sequences targeted by Firmicutes.
The nested phylogenetic structure of capsids encoded by TR sequences predicted to represent Microviridae species may indicate that host switching, a critically important topic in viral evolution [ 39 , 40 ], occurred within the viral lineage. The genome comparisons indicated that circ-1 and circ-2 were nearly complete sequences of circular or tandem genomes (Supplementary Figures 2–9). Thus, circ-1 and circ-2 appear to be CRISPR-targeted DNA elements without a reliably detected or annotated coding gene.
Comparison of CRISPR-targeted TR sequences with available viral and plasmid sequences
In particular, we detected 11 TR sequences larger than 200 kb, two of which corresponded to recently reported ones. We concluded that many of the large TR sequences are already represented in virus or plasmid databases, except for those larger than 200 kb, which were only recently reported. In contrast to the large TR sequences, most of the small TR sequences were not represented in the databases (Figures 2-3B).
Among the 942 RefSeq plasmid-listed sequences, 444 small TR sequences were listed in RefSeq virus, IMG/VR and/or GVD, suggesting a possible misclassification of viruses and plasmids within these databases. We examined the database representation of the CRISPR-targeted TR sequences predicted to represent Microviridae and Inoviridae. Finally, we compared the TR sequences with the Integrative and Conjugative Element (ICEberg) database and obtained nine hits.
Comparison with the prediction results of VirSorter
We found that 21 genomes from our study were highly similar to the genomes from the previous study, supporting our prediction that these sequences were indeed Inoviridae genomes. Inoviridae major coat protein-encoding TR sequences, and 109 TR sequences without a detectable capsid predicted to be positively viral (category £6). Conversely, the program predicted about half of the HK97 capsid protein-coding TR sequences and no Inoviridae MCP-coding ones.
Classification of Inoviridae major coat-protein-encoding TR sequences
The aligned region was located at the 3¢ end of the tRNA list gene; this observation demonstrated a previously reported integration mechanism of Inoviridae species [ 46 ]. A taxonomy level previously named Inoviridae in ssDNA virus is now a family of the newly defined Tubulavirales order. These families were defined based on the analysis of the gene sharing network of filamentous phage genomes, to represent the enormous HGTs between the closely related species within this order [ 45 , 48 ].
To assign the detected genomes to known families based on the gene sharing method, we used the classification program provided by ICTV [49]. The families of the detected genomes (Ino-01 to Ino-08) were predicted using the ICTV taxonomic classification program. The sequence reported in a previous study (Ino-07) is indicated in parentheses. B) Genome organizations of the discovered Inoviridae species.
Gene-content-based hierarchical clustering of CRISPR-targeted TR sequences
In contrast to the large TR sequences, we found that the small TR sequences remained largely enigmatic when grouped by gene content (Figure 2-5B). Few clusters were represented in IMG/VR, and although GVD covers a relatively wider range, representatives were still missing or sparse for some clusters. Since many clusters did not encode detectable capsid genes, both clusters representing Microviridae displayed capsid genes.
Another pattern shown by this analysis was that TR sequences with high GC content were frequently observed with Actinobacteria as the predicted targeting host. Heatmaps represent the gene content of TR sequences, where each row is a TR sequence and each column is a gene cluster. Capsid genes are colored differently according to their types, as indicated in the figure; HK97, Microviridae major capsid protein (MicroMCP) and Inoviridae major coat protein (InoMCP).
Remnant CRISPR spacers and contribution of CRISPR-targeted sequences to the identified spacers
Discussion
Although most of the genomes discovered with detectable capsid genes were previously known viral lineages, significant portions of particularly small TR sequences remained unclassified (Figure 2-1B). The target host prediction results suggest that approximately 70% of the detected sequences are targeted by the specific host phyla (Figure 2-2A). Therefore, multiple methods must be applied to ascertain the host of genomes detected from such environments.
This can complicate the program's prediction mechanism, leading to less reliable prediction results. Finally, none of the major TR sequences encoding Inoviridae envelope proteins, including previously annotated genomes, were predicted to be positively viral. The Zot domains of the discovered Inoviridae species formed a distinct clade in the phylogenetic tree.
Conclusion
Appendix A
Supplementary Information
- Materials and Methods
Materials
Database versions and download dates
Metagenome assembly
Detection of CRISPR and spacer extraction
Detection of protospacer loci
Co-occurrence-based spacer clustering
Deduplication of CRISPR-targeted sequences
Gene prediction and annotation of CRISPR-targeted sequences
Assessment of capsid-protein-detection sensitivity and specificity
Targeting host prediction
Gene-content-based hierarchical clustering of TR sequences
Phylogenetic analysis of Microviridae MCP
Phylogenetic analysis of the Zot domain
Generation of scrambled sequences
- Supplementally Figures
- Supplementary Tables
- Supplementary Scripts
- Supplementary Data
Within initial clusters, co-occurrences of protospacers were counted and used to construct an undirected graph. The average distances between adjacent protospacer loci in clusters were calculated and used to extract CRISPR-targeted sequences. The length and number of protospacers shown here are conceptual and not based on observed data.
The numbers on the arrows are counts of the number of TR sequences associated with the associated type. The positions of the protospacers and the ORFan gene are shown outside and inside the circle. Dataset including detected CRISPR spacers, direct repeats, protospacers, spacer clustering results based on co-occurrence, predicted protein sequences, HMMs built, database comparison results, phylogenetic analysis results, predicted target hosts and CRISPR-targeted TR sequences.
CRISPR-targeted RNA-dependent RNA polymerase coding RNA
Introduction
Results
Extraction of non-transcribed RNA sequences
Taxonomy of RNA-dependent RNA polymerase in the human gut
From the non-transcribed contigs, protein-coding genes were predicted and clustered with a 30% sequence similarity threshold, resulting in 60,430 representative protein sequences. Pie charts represent the numbers of the representative RdRP protein sequences taxonomically assigned to the corresponding viral families. Sequences without a hit within the threshold (E-value < 10-10) or aligned to multiple viral families were assigned Unclassified.
The DR is highlighted in green and shows the pairwise alignment with the reference DR depicted above. The spacer in the read and the protospacer in the contig are highlighted in red along with the pairwise alignment. To predict the CRISPR target host, the DR associated with the protospacer found from the Picobirnaviridae contigs was searched in the assembled contigs and RefSeq genomes.
On the other hand, hits with one or more mismatches were found from RefSeq-registered Lachnospiraceae family genomes (NZ_NFLQ01000008.1). The tips without labels are sequences discovered in this study, and the points marked with the NCBI accessions are the Picobirnaviridae sequences obtained from the RefSeq database. In the phylogeny, the CRISPR-targeted sequences were placed within the branches of RefSeq recorded Picobirnaviridae sequences, suggesting that this sequence is a genuine lineage of new Picobirnaviridae species.
Discovered Picobirnaviridae genomes have prokaryotic ribosome binding motifs
Discussion
Furthermore, we showed that these viruses have prokaryotic ribosomal binding sites upstream of the RdRP genes. The estimated hourly rate of detected Picobirnaviridae species was nearly 10-2 replacements per site per year. However, we were unable to find the original CRISPR locus of the Picobirnaviridae-related protospacer due to fragmentation of the assembled metagenome contigs.
This problem may be caused by insufficient depth to capture cells with a low population and the high diversity of the CRISPR locus in the population. With information about the CRISPR locus associated with the RNA phage, we should be able to narrow down the host range of the target RNA phage that could be used to construct an RNA phage culture system. Finally, almost a quarter of the detected RdRP protein sequences were taxonomically unassigned, suggesting that there are still uncovered RNA virus sequences occurring in the human gut environment.
Appendix B
Supplementary Information
- Materials and Methods
Sequence preprocessing
Kmer extraction from metagenome sequences
Metatranscriptome assemblies and extraction of non-transcribed contigs
Gene annotation of non-transcribed RNA sequences
Taxonomic assignment of RdRP protein sequences
Phylogenetic analysis of RdRP protein sequences
Dated phylogenetic analysis of RdRP coding sequences
Capsids and genomes of large-sized bacteriophages reveal the evolutionary range of the HK97 fold. Create a megataxonomic framework that fills all major/primary taxonomic ranks, for dsDNA viruses encoding major capsid proteins of the HK97 type. Evolution and diversity of the viral family Microviridae through a collection of 81 new complete genomes assembled from virome reads.
Create a megataxonomic framework, filling all major taxonomic ranks, for DNA viruses encoding vertical major jelly-type capsid proteins. Microviridae, a divided family: isolation, characterization, and genome sequence of φMH2K, a bacteriophage of the obligate intracellular parasitic bacterium Bdellovibrio bacteriovorus. Type IF CRISPR-Cas resistance against virulent phages results in failed infection and confers population-level immunity.