To Bind, Or Not To Bind

An Zheng
Jan 18, 2021
Photo created by kjpargeter on www.freepik.com

A machine-learning framework for helping to decode the rules by which transcription factors bind their target sites

Posted by An Zheng on behalf of Gymrek Lab

This blog post introduces our paper “Deep neural networks identify sequence context features predictive of transcription factor binding”, published in Nature Machine Intelligence [1]. A free read-only copy of the paper is available here [2]. Part of this work was also presented at the 2019 ICML Workshop on Computational Biology [3].

TL;DR We designed a machine learning framework, AgentBind, to identify and interpret the sequence features that are most important for transcription factor (TF) binding. Unlike most previous work, which studies the binding motifs themselves, ours focuses on the sequence context around motifs and its role in TF binding.

Background

The binding of transcription factors (TFs) to DNA is one of the major mechanisms of transcriptional regulation. Studies have shown that most TFs have unique binding preferences and only recognize DNA sequences containing specific patterns (i.e., core motifs). However, motif-matching sequences and experimentally determined binding sites often overlap only partially. Whether a particular motif instance is bound depends on many other factors, including chromatin accessibility, nucleosome positioning, and cooperative or competitive binding with other TFs. Many of these factors are related to the sequence context around the TF motif. To investigate the role of sequence context in TF binding, we developed a framework, named AgentBind, for (1) predicting whether a motif instance will be bound and (2) identifying the specific nucleotides with the strongest influence on binding status.

Method

Our framework consists of three steps: pre-training, fine-tuning, and interpretation (Figure 1). It uses DanQ as the model architecture. First, we pre-train a DanQ model on epigenomic annotations from multiple cell types (collected from the DeepSEA project). Second, we build a binary dataset for each TF: we extract 1 kb genomic sequences centered on motif instances and label each sequence as bound (positive) or unbound (negative) based on overlap with binding sites identified by ChIP-sequencing. Each binary dataset is used to fine-tune its own copy of the pre-trained model, allowing that model to learn features important for binding of that TF. Third, we use a model interpretation method named Grad-CAM to score the contribution of each nucleotide to the binding prediction.
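The dataset-building step can be sketched in a few lines of Python. This is a minimal illustration, not the AgentBind code: the function names, the 0-based half-open coordinate convention, and the choice to label a motif as bound when the motif itself overlaps a ChIP-seq peak are all assumptions made for the example.

```python
from typing import List, Tuple

# (chromosome, start, end), 0-based half-open. This convention is an assumption.
Interval = Tuple[str, int, int]

WINDOW = 1000  # 1 kb window centered on the motif instance


def window_around(motif: Interval, size: int = WINDOW) -> Interval:
    """Return a fixed-size window centered on the motif midpoint."""
    chrom, start, end = motif
    mid = (start + end) // 2
    left = max(0, mid - size // 2)  # clamp at the chromosome start
    return (chrom, left, left + size)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals on the same chromosome share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def label_motifs(motifs: List[Interval], peaks: List[Interval]) -> List[Tuple[Interval, int]]:
    """Pair each 1 kb window with a label: 1 (bound) or 0 (unbound).

    Assumption: a motif counts as bound if it overlaps any ChIP-seq peak.
    """
    return [
        (window_around(m), int(any(overlaps(m, p) for p in peaks)))
        for m in motifs
    ]


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA string as rows over (A, C, G, T); unknown bases become zeros."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    rows = []
    for base in seq.upper():
        row = [0.0] * 4
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows
```

The linear scan over peaks keeps the sketch short; a genome-scale dataset would more likely use an interval index or a dedicated tool for the overlap step.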

Figure 1. Method schematic.

Results

Through AgentBind, we identified nucleotide bases predictive of transcription factor binding. Figure 2 shows an example containing Grad-CAM scores for a region (chr1: 12289432–12290431 in hg19). The y-axis shows the Grad-CAM score of each nucleotide. Sequences are shown for the central SP1 motif and two regions with high scores corresponding to NFY motifs.

Figure 2. An example showing Grad-CAM scores.
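For readers unfamiliar with Grad-CAM, the general computation for a one-dimensional feature map is sketched below. This follows the standard Grad-CAM formulation rather than the exact AgentBind implementation: the function name is hypothetical, the activations and gradients are assumed to have been extracted from the model already, and the feature-map positions are assumed to line up with sequence positions.

```python
from typing import List


def grad_cam_1d(activations: List[List[float]],
                gradients: List[List[float]]) -> List[float]:
    """Grad-CAM for a 1D feature map.

    activations[k][i]: output of channel k at position i.
    gradients[k][i]:   derivative of the class score with respect to
                       activations[k][i].
    Returns one non-negative importance score per position.
    """
    n_positions = len(activations[0])
    # Channel weight: the gradient averaged over all positions.
    weights = [sum(g) / len(g) for g in gradients]
    cam = []
    for i in range(n_positions):
        # Weighted sum of channel activations at this position.
        value = sum(w * a[i] for w, a in zip(weights, activations))
        # ReLU keeps only evidence in favor of the predicted class.
        cam.append(max(0.0, value))
    return cam
```

Positions where highly weighted channels fire strongly receive high scores, which is what produces peaks over informative stretches of sequence.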

We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin.

For the negative samples in the training data, we can either restrict them to DNase I hypersensitive sites or leave them unrestricted, resulting in two different models: a DNase-I-controlled model and a baseline model. Figure 3 shows the key context features identified by these two models. While the baseline model yields better classification results, the DNase-I-controlled model identifies some distinct patterns that the baseline model misses.

Figure 3. Identifying key context sequence features for TF binding in GM12878. The top figure is generated using the baseline model while the bottom one is generated using the DNase-I-controlled model.

Our results for STAT3 across multiple cell types indicate that important context bases are highly cell-type-specific.

To investigate the ability of our framework to capture cell-type-specific regulatory features, we chose a TF named STAT3 and trained separate models to predict STAT3 binding using ChIP-sequencing data from three cell types (GM12878, CD4+ Th17, and HeLa cells). Our analysis reveals that some enriched 5-mers are shared across multiple cell types whereas others are highly cell-type-specific (Figure 4).

Figure 4. Cell-type-specific enrichment of 5-mers influential for STAT3 binding. The inset table shows the auROC obtained from training each model on one cell type and using it to predict STAT3 binding status in another cell type.
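The auROC reported in the inset table can be read as the probability that a randomly chosen bound site receives a higher score than a randomly chosen unbound site. The sketch below computes it directly from that definition. The function name is hypothetical, and in a cross-cell-type evaluation the scores would come from a model trained on one cell type and applied to motif instances labeled in another.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for bound, 0 for unbound.
    scores: model predictions, higher meaning more likely bound.
    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    # Quadratic in dataset size: fine for a sketch, slow for large datasets.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a drop in auROC when a model is moved to a different cell type indicates that it relies on features specific to the cell type it was trained on.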

Please check out our paper for more results and method details.

Conclusion

Altogether, our study provides a machine-learning framework for helping to decode the rules by which TFs bind their target sites and for identifying the specific non-coding nucleotides with the strongest effects on binding. To facilitate future applications, Grad-CAM scores for all TF models studied here and code for running AgentBind are available on our GitHub page.

[1] An Zheng, Michael Lamkin, Hanqing Zhao, et al. “Deep neural networks identify sequence context features predictive of transcription factor binding”. Nature Machine Intelligence (2021), DOI: 10.1038/s42256-020-00282-y.
[2] https://rdcu.be/cdMmE
[3] An Zheng, Michael Lamkin, Hao Su, Melissa Gymrek. “AgentBind: Profiling Context-specific Determinants of Transcription Factor Binding Affinity”. International Conference on Machine Learning (ICML) Workshop on Computational Biology, Long Beach, CA, 2019.

An Zheng

Ph.D. candidate in Computer Science at UC San Diego.