Group I intron Sequence and Structure Database

gIRfam introduction:

Rfam (http://www.sanger.ac.uk/Software/Rfam/) is a comprehensive collection of non-coding RNA (ncRNA) families, represented by multiple sequence alignments and profile stochastic context-free grammars. The group I intron family Intron_gpI (RF00028) contains 20085 sequences, far more than the number in CRW (http://www.rna.ccbb.utexas.edu/). Are all of them real group I introns, or are some of them false positives of the INFERNAL program? INFERNAL (http://infernal.wustl.edu/), the search engine of the Rfam database, uses carefully human-curated seed alignments to search for similar RNA sequences, considering both sequence and structure similarity. For the family Intron_gpI, the seed alignment contains only 30 sequences. Does it have good sensitivity and specificity? If the hits are real group I introns, how are they distributed in nature, concentrated or scattered? How do they differ from the group I introns in CRW? Our work is intended to answer these questions. This page presents the information we have obtained so far.

First, we checked whether those sequences have P7, which forms a pseudoknot with P3 and is part of the catalytic core of group I introns. The INFERNAL program cannot handle pseudoknots, although it can take the sequence conservation of P7 into account. Some of those sequences may therefore lack P7 and could be viewed as false positives. We used the most frequent P7 pairing pattern in our 1789 structures, termed "strict P7", to filter the records, which left 17871 sequences. We found that the lengths of J67, J87 and J34 were highly conserved in the 1789 structures in our database. So, in a further step, we filtered out the records that have strict P7 but do not satisfy the length restrictions on J67, J87 and J34. This left 16914 sequences, which can be regarded as reliable group I introns. The sequence and structure data can be retrieved in the 'Data' section.
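The two filtering steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the record fields, the function name and, in particular, the junction length bounds are hypothetical placeholders, not the values or code actually used to build the database.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class IntronRecord:
    """One Rfam hit; all field names here are hypothetical."""
    seq_id: str
    has_strict_p7: bool  # True if the hit shows the "strict P7" pairing pattern
    j34_len: int         # length of junction J34 in nucleotides
    j67_len: int         # length of junction J67
    j87_len: int         # length of junction J87


# Placeholder inclusive (min, max) bounds, NOT the restrictions used for gIRfam.
EXAMPLE_LIMITS: Dict[str, Tuple[int, int]] = {
    "j34_len": (3, 3),
    "j67_len": (3, 3),
    "j87_len": (6, 7),
}


def filter_introns(
    records: Iterable[IntronRecord],
    limits: Dict[str, Tuple[int, int]],
) -> Tuple[List[IntronRecord], List[IntronRecord]]:
    """Return (hits with strict P7, hits that also satisfy the length limits)."""
    strict_p7 = [r for r in records if r.has_strict_p7]
    reliable = [
        r for r in strict_p7
        if all(low <= getattr(r, name) <= high
               for name, (low, high) in limits.items())
    ]
    return strict_p7, reliable


# Three made-up hits: only hit1 passes both steps.
example_records = [
    IntronRecord("hit1", True, 3, 3, 6),
    IntronRecord("hit2", True, 3, 5, 6),   # strict P7, but J67 too long
    IntronRecord("hit3", False, 3, 3, 6),  # no strict P7
]
strict_hits, reliable_hits = filter_introns(example_records, EXAMPLE_LIMITS)
```

Keeping the intermediate list mirrors the two data sets described above (17871 sequences after the first step, 16914 after the second).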

Second, we wanted to know how those 16914 introns are distributed in nature. Like the taxonomy tree annotated with intron numbers for the 1789 introns on the 'Distribution' page, three further trees were constructed; they can be placed side by side for comparison. The links are in the 'Distribution' section of this page.

Third, having established the distribution across organisms, we wanted to know which subgroups the 16914 introns belong to. We built 14 covariance models (CMs), one per subgroup, from our manually curated alignments using the 'cmbuild' program in the INFERNAL package. We performed 5-fold cross-validation to test the CMs and determined the score cutoff by the MER (Minimum Error Rate) criterion, which minimizes the sum of false positives (FP) and false negatives (FN). We then used 'cmsearch' to search the 16914 intron sequences and classify them. The classification results are in the 'Classification' section of this page.
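To make the MER criterion concrete, here is a minimal sketch of how such a cutoff could be chosen for one CM. It assumes that the cross-validation yields two pooled lists of scores, one for held-out members of the subgroup and one for non-members, and that a hit counts as positive when its score is at or above the cutoff. The function name and the example scores are made up for illustration.

```python
from typing import Sequence, Tuple


def mer_cutoff(pos_scores: Sequence[float],
               neg_scores: Sequence[float]) -> Tuple[float, int]:
    """Return (cutoff, FP + FN) for the cutoff with the fewest total errors.

    A score >= cutoff is called positive. Only observed scores are tried
    as candidate cutoffs; on ties the lowest such cutoff is kept.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    if not candidates:
        raise ValueError("at least one score is required")

    best_cutoff = candidates[0]
    best_errors = len(pos_scores) + len(neg_scores) + 1  # above any real count
    for cutoff in candidates:
        false_neg = sum(1 for s in pos_scores if s < cutoff)
        false_pos = sum(1 for s in neg_scores if s >= cutoff)
        errors = false_pos + false_neg
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff, best_errors


# Made-up scores: one true member (12.0) scores below two non-members.
member_scores = [50.0, 42.0, 38.0, 12.0]
non_member_scores = [5.0, 9.0, 15.0, 20.0]
cutoff, errors = mer_cutoff(member_scores, non_member_scores)
# cutoff == 38.0, errors == 1 (the member scoring 12.0 is the only error)
```

Because FP and FN are weighted equally, this cutoff favors neither sensitivity nor specificity; a different weighting would shift it.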

gIRfam data

The original seed alignment and full alignment of the group I intron family Intron_gpI (RF00028) can be downloaded from http://www.sanger.ac.uk/Software/Rfam/. Three processed flat files are available for download here (tab-delimited files, with the first line explaining the fields).

20085 sequences info.	Download
17871 sequences info.	Download
16914 sequences info.	Download
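
Since each flat file is tab-delimited with a header line, it can be read with a standard CSV reader. The sketch below is an assumption-laden example: the column names and rows are invented stand-ins (the real field names are given in the first line of each file), and an in-memory string replaces the downloaded file.

```python
import csv
import io
from typing import Dict, List, TextIO

# Invented stand-in for a downloaded flat file; real column names will differ.
example_tsv = (
    "seq_id\torganism\tlength\n"
    "hit1\tOrganism A\t250\n"
    "hit2\tOrganism B\t310\n"
)


def read_flat_file(handle: TextIO) -> List[Dict[str, str]]:
    """Read a tab-delimited file whose first line names the fields."""
    reader = csv.DictReader(handle, delimiter="\t")
    return list(reader)


rows = read_flat_file(io.StringIO(example_tsv))
lengths = [int(row["length"]) for row in rows]  # values are read as strings
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="")`, where `path` is wherever the download was saved.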

Distribution:

To ease comparison of the distributions of group I introns from CRW and from Rfam, three more taxonomy trees annotated with intron numbers were prepared. The first covers all records in the full alignment of the group I intron family in Rfam (20085 sequences). The second contains the 17871 sequences that passed the strict P7 restriction. The third includes the 16914 reliable group I intron candidates, which also satisfy the rules on J34, J67 and J87. The three trees are independent and can be viewed and compared at the same time. Detailed information can also be viewed by clicking a red node in the linked page.

VIEW all records from group I intron family in Rfam (20085).
VIEW the records having strict P7 (17871).
VIEW the records having strict P7 and canonical length of J34, J67 and J87 (16914).

Classification

Classification of the confident Rfam group I introns (16914) into subgroups using the MER threshold (cutoff) obtained from 5-fold cross-validation. A very small number of introns were not well resolved: 35 group I introns were classified into two subgroups, 8 into three subgroups, and 2 into four subgroups. Introns in the IB subgroups are particularly poorly resolved, probably because of their high structural similarity.
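
An intron can land in more than one subgroup because each CM is applied with its own cutoff. The following sketch shows one way this could be tallied. It assumes that every intron has one score per CM and that an intron is assigned to every subgroup whose cutoff it reaches; the scores, cutoffs and function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def assign_subgroups(scores: Dict[str, float],
                     cutoffs: Dict[str, float]) -> List[str]:
    """Return the subgroups whose cutoff this intron's score reaches."""
    return sorted(
        name for name, score in scores.items()
        if name in cutoffs and score >= cutoffs[name]
    )


def tally_assignments(all_scores: Dict[str, Dict[str, float]],
                      cutoffs: Dict[str, float]) -> Counter:
    """Count introns by the number of subgroups they were assigned to."""
    return Counter(
        len(assign_subgroups(scores, cutoffs))
        for scores in all_scores.values()
    )


# Made-up cutoffs and scores for three of the subgroup models.
example_cutoffs = {"IB1": 30.0, "IB2": 28.0, "IC1": 35.0}
example_scores = {
    "hit1": {"IB1": 45.0, "IB2": 12.0, "IC1": 3.0},  # one subgroup
    "hit2": {"IB1": 33.0, "IB2": 31.5, "IC1": 8.0},  # two subgroups
    "hit3": {"IB1": 5.0, "IB2": 4.0, "IC1": 2.0},    # none
}
tally = tally_assignments(example_scores, example_cutoffs)
```

In this toy input, `hit2` passes both IB cutoffs, which is the kind of ambiguity between structurally similar subgroups described above.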

A tab-delimited file containing the subgroup information: Download