Integrated data analysis pipeline for whole human genome transcription factor binding sites prediction

School of Science (Perustieteiden korkeakoulu) | Master's thesis
Date
2015-06-11
Major/Subject
Bioinformatics
Mcode
T3012
Degree programme
Master's Programme in Bioinformatics (MBI)
Language
en
Pages
36+1
Abstract
Transcription factors (TFs) play a central role in regulating gene expression by binding to regulatory regions in DNA. The position weight matrix (PWM) is the most commonly used model for representing and predicting TF binding sites. Consequently, several studies have predicted TF binding sites using PWMs, and many databases containing large numbers of PWMs have been created. However, these studies require the user to search for binding sites for each PWM separately, making it difficult to get a general view of binding predictions for many PWMs simultaneously. In response to this need, this thesis project evaluates both individual PWMs and groups of PWMs and creates an effortless method to analyze and visualize a desired set of PWMs together, making it easier for biologists to analyze large amounts of data in a short period of time. For this purpose, we used bioinformatics methods to detect putative TF binding sites in the human genome and made them available online via the UCSC Genome Browser. Still, the sheer amount of data in PWM databases required a more efficient way to summarize TF binding predictions. Hence, we used PWM similarity measures and clustering algorithms to group PWMs together and to create one integrated database from four popular PWM databases: SELEX, TRANSFAC, UniPROBE, and JASPAR. All results are made publicly available to the research community via the UCSC Genome Browser.
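To illustrate what predicting binding sites with a PWM involves, the following is a minimal sketch of log-odds scoring along a sequence. It is not the pipeline used in the thesis: the function names (`log_odds_pwm`, `scan`), the toy count matrix, the uniform background, the pseudocount, and the score threshold are all assumptions chosen for illustration, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    Assumes a uniform background and a fixed pseudocount (both illustrative).
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, score) for every window scoring at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in BASES for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Hypothetical count matrix (made-up data) that favours the motif "TGAC".
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_pwm = log_odds_pwm(toy_counts)
hits = scan("AATGACGGTGACT", toy_pwm, threshold=4.0)
```

Scanning many PWMs this way yields one list of hits per matrix, which is the per-PWM view the abstract describes as hard to survey; grouping similar PWMs before scanning reduces the number of separate result sets.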
Supervisor
Lähdesmäki, Harri
Thesis advisor
Lähdesmäki, Harri
Keywords
transcription factor, PWM, TRANSFAC, JASPAR, SELEX, PBM