How accurate are transcription factor binding-site motifs?

iStock photos

An international team led by researchers from EPFL and the Russian Academy of Sciences has undertaken a comprehensive benchmarking study to assess the predictive performance of publicly available transcription factor binding site motifs.

Transcription factors are key regulators of gene expression. They bind specifically to short DNA sequences in the genome whose shared pattern is called a "sequence motif". Sequence motifs are widely used as computational models to predict transcription factor binding sites in the absence of experimental data. However, reliable information about the accuracy of these models has largely been unavailable. The need for such information has become even more urgent now that users face the "curse of choice" between up to 10 alternative, and often dissimilar, motifs for the same transcription factor.
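To make the idea of a motif as a predictive model concrete, the following is a minimal sketch of one common representation, a position weight matrix, used to scan a DNA sequence. The counts, the uniform background, the pseudocount, the test sequence and the function names are all illustrative assumptions; none of them come from the study or from any motif database, and only one DNA strand is scanned.

```python
import math

# Hypothetical base counts for a 4-bp motif (illustrative only, not from any
# real database): one dictionary of counts per motif position.
COUNTS = [
    {"A": 8, "C": 1, "G": 0, "T": 1},
    {"A": 0, "C": 9, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 0, "C": 1, "G": 0, "T": 9},
]
BACKGROUND = 0.25   # assumption: uniform base composition
PSEUDOCOUNT = 0.5   # assumption: small pseudocount to avoid log(0)


def to_log_odds(counts):
    """Convert per-position counts into log2-odds weights against the background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * PSEUDOCOUNT
        pwm.append({
            base: math.log2((column[base] + PSEUDOCOUNT) / total / BACKGROUND)
            for base in "ACGT"
        })
    return pwm


def best_hit(sequence, pwm):
    """Score every window of the sequence and return (best score, 0-based start)."""
    width = len(pwm)
    scores = [
        (sum(pwm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scores)


PWM = to_log_odds(COUNTS)
score, position = best_hit("TTGACGTCCA", PWM)
print(f"best window starts at offset {position} with log-odds score {score:.2f}")
```

In this toy case the highest-scoring window is the one that matches the motif's consensus ACGT, which begins at 0-based offset 3 of the test sequence. Alternative motifs for the same factor would assign different scores to the same windows, which is why comparing them against experimental data matters.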

In an article published in Genome Biology, the scientists addressed the issue of transcription factor binding motif accuracy by benchmarking 4972 motifs from three different resources on 3161 experimental test data sets for human transcription factors generated with three different technologies. The results of this study will help researchers critically assess published research based on transcription factor binding site predictions and enable them to select optimal motif subsets for particular use cases. In the long run, it is hoped that the computational protocols developed for this benchmarking effort will lead to more accurate transcription factor binding site models, and thereby to a significant improvement in the bioinformatics tools used to predict the effects of regulatory genetic mutations in various disease contexts.

The complete set of more than 15 million performance values resulting from this all-against-all benchmarking study is freely available from the open-access repository Zenodo. To facilitate computational reproducibility, the benchmarking protocols were containerized as Docker images and made publicly available on GitHub.

Other contributors

  • Swiss Institute of Bioinformatics
  • Russian Academy of Sciences
  • Moscow State University
  • Martin Luther University Halle-Wittenberg
  • Aix Marseille University
  • University of British Columbia

Funding

EPFL

Swiss Institute of Bioinformatics

COST (European Cooperation in Science and Technology)

Russian Foundation for Basic Research

Russian Science Foundation

Russian Academy of Sciences Presidium

References

Giovanna Ambrosini, Ilya Vorontsov, Dmitry Penzar, Romain Groux, Oriol Fornes, Daria D. Nikolaeva, Benoit Ballester, Jan Grau, Ivo Grosse, Vsevolod Makeev, Ivan Kulakovskiy, Philipp Bucher. Insights gained from a comprehensive all-against-all transcription factor binding motif benchmarking study. Genome Biology 11 May 2020. DOI: 10.1186/s13059-020-01996-3