Activity Detail
Seminar
Epigenomics Tools for Understanding the Cellular Language of Pluripotency and Cancer
Prof. Marcos Arauzo
Cellular reprogramming is a key technology in regenerative medicine. The reprogramming process is based on the crosstalk between genetic and epigenetic networks in a language whose “words” are DNA regulatory sequences. To interpret this language we designed computational tools that discover “DNA words” with genetic and epigenetic meaning. We developed software to search for DNA patterns ab initio, to process DNA methylomics data efficiently, and to predict superenhancers from next-generation sequencing epigenomics data. Our computational method to discover “DNA words” with regulatory meaning exploits previously unused properties of transcription factors (TFs) to reveal the missing TF binding motifs (TFBMs) and their sites (TFBSs) in all human gene promoters. We uncover the crosstalk between “DNA words” with an algorithm that extracts TF combinatorial binding patterns, compiling a collection of TF regulatory syntactic rules. Our TF binding site map for combinatorial TFBM discovery provides a comprehensive resource for regulation analysis that includes a dictionary of “DNA words,” newly predicted motifs, and their corresponding combinatorial patterns, which represent the syntax of gene regulation. Compiling the epigenomics counterpart of the dictionary of TFBMs requires processing massive quantities of bisulfite sequencing (BSseq) data. We developed P3BSseq, a parallel processing pipeline for fast, accurate, and automatic analysis of BSseq reads that trims, aligns, and annotates the reads, records the intermediate results, assesses bisulfite conversion quality, and generates BED methylome and report files following the NIH standards. Gene expression is gated by promoter DNA methylation states, which modulate TF binding. The known DNA methylation/unmethylation mechanisms are not sequence-specific, yet different cells with the same genome have different methylomes, so additional processes that bring specificity to these mechanisms are required.
Searching for such processes, we demonstrated that CpG methylation states are influenced by the sequence context surrounding the CpGs. We used this property to develop a CpG methylation motif discovery algorithm. The discovered motifs reveal “methylation/unmethylation factors” that could recruit the “methylation/unmethylation machinery” to the loci specified by the motifs. The motifs found discriminate between hypomethylated and hypermethylated regions and represent a dictionary of “DNA methylation words”. Other, longer DNA words are the superenhancers (SEs), structural genomic elements that determine cell fate and are considered epigenetic syntactic elements. We developed NaviSE for fully automated parallel processing of genome-wide epigenomics data. NaviSE implements an “epigenomics signal algebra” that allows multiple activation and repression epigenomics signals to be combined. NaviSE annotates the SE-associated genes and performs gene ontology enrichment analysis, TFBSs enriched in SE,