A New Machine Learning Dataset for Hierarchical Classification of Transposable Elements

Bruna Zamith Santos, Ricardo Cerri

Transposable Elements (TEs) are DNA sequences that can change their location within the genome. They make up a large portion of the DNA in organisms and contribute to genetic diversity within and across species. Furthermore, they increase the size of the genome and may affect the functionality of genes. Accurate classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. TE classification is usually performed with homology-based bioinformatics tools, which compare a sequence against a database of sequences belonging to previously known TE classes. This strategy is limited, since it ignores both the sequences' biochemical properties and the hierarchical relationships that may exist between the different TE classes. Building on existing proposals for a hierarchical TE taxonomy, we propose a new dataset for TE classification whose features aim to capture sequence properties that cannot be represented by character sequences alone. Furthermore, the proposed dataset is hierarchically structured, facilitating its use by both conventional and hierarchical classification methods. To investigate the interpretability potential of our features, we tested the new dataset using decision trees and rule induction algorithms. The experiments showed promising results.

