To Bind, Or Not To Bind


A machine-learning framework for helping to decode the rules by which transcription factors bind their target sites

Posted by An Zheng on behalf of Gymrek Lab

This blog post introduces our paper “Deep neural networks identify sequence context features predictive of transcription factor binding”, published in Nature Machine Intelligence [1]. A free read-only copy of the paper is available [2]. Part of this work was also presented at the 2019 ICML Workshop on Computational Biology [3].

TL;DR: We designed a machine learning framework, AgentBind, to identify and interpret the sequence features that are most important for transcription factor (TF) binding. Unlike most previous work, which studies the binding motifs themselves, our work focuses on the sequence context in the vicinity of motifs and its role in TF binding.



Figure 1. Method schematic.


Figure 2. An example showing Grad-CAM scores.
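To make the idea behind Grad-CAM scores concrete, here is a minimal sketch of the general Grad-CAM computation on a 1D sequence. It assumes you already have, for one input sequence, the activations of the last convolutional layer and the gradients of the "bound" class score with respect to those activations. The function name `grad_cam_1d` and the toy numbers are hypothetical, and the implementation used in the paper may differ in its details.

```python
def grad_cam_1d(activations, gradients):
    """Generic Grad-CAM over a 1D sequence (illustrative sketch).

    activations: list of positions, each a list of channel activations
                 taken from the last convolutional layer.
    gradients:   same shape; d(class score) / d(activation).
    Returns one non-negative importance score per position.
    """
    n_pos = len(activations)
    n_ch = len(activations[0])
    # Channel weight = gradient averaged over all positions.
    weights = [sum(gradients[p][c] for p in range(n_pos)) / n_pos
               for c in range(n_ch)]
    # Weighted sum of channels at each position, followed by ReLU.
    return [max(0.0, sum(weights[c] * activations[p][c] for c in range(n_ch)))
            for p in range(n_pos)]


# Toy input: 3 positions, 2 channels (made-up values, not real data).
activations = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
gradients = [[0.2, -0.1], [0.4, -0.3], [0.0, -0.2]]
scores = grad_cam_1d(activations, gradients)
print(scores)
```

Positions with a high score are those where the channels that push the prediction toward "bound" are most active. In a real model the scores live at the resolution of the convolutional layer and have to be mapped back to individual bases.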

We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin.

For the negative samples in the training data, we can either restrict them to lie within DNase I hypersensitive sites or leave them unrestricted. This choice yields two different models: a DNase-I-controlled model and a baseline model. Figure 3 shows the key context features identified by each. While the baseline model yields better classification results, the DNase-I-controlled model identifies some distinct patterns that the baseline model misses.
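The two ways of choosing negatives can be sketched as a simple interval filter. This is only an illustration under stated assumptions: regions are `(chrom, start, end)` tuples with half-open coordinates, "within a DNase I hypersensitive site" is approximated here by any overlap, and the names `overlaps` and `select_negatives` as well as the toy coordinates are hypothetical rather than taken from the paper.

```python
def overlaps(region, site):
    """True if two half-open intervals (chrom, start, end) share a base."""
    return (region[0] == site[0]
            and region[1] < site[2]
            and site[1] < region[2])


def select_negatives(candidates, dnase_sites, dnase_controlled):
    """Pick negative training regions.

    Baseline model: keep every candidate.
    DNase-I-controlled model: keep only candidates that overlap
    at least one DNase I hypersensitive site.
    """
    if not dnase_controlled:
        return list(candidates)
    return [r for r in candidates
            if any(overlaps(r, s) for s in dnase_sites)]


# Toy coordinates (made up for illustration).
candidates = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
dnase_sites = [("chr1", 150, 400), ("chr2", 300, 400)]

baseline_negatives = select_negatives(candidates, dnase_sites, False)
controlled_negatives = select_negatives(candidates, dnase_sites, True)
print(len(baseline_negatives), controlled_negatives)
```

With DNase-I-controlled negatives, both classes sit in open chromatin, so the model cannot separate them using accessibility-related sequence signals alone. That is consistent with the lower classification accuracy and the different context features described above.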

Figure 3. Identifying key context sequence features for TF binding in GM12878. The top panel is generated using the baseline model, while the bottom panel is generated using the DNase-I-controlled model.

Our results for STAT3 across multiple cell types indicate that important context bases are highly cell-type-specific.

To investigate the ability of our framework to capture cell-type-specific regulatory features, we chose the TF STAT3 and trained separate models to predict STAT3 binding using ChIP-sequencing data from three cell types (GM12878, CD4+ Th17, and HeLa cells). Our analysis reveals that some enriched 5-mers are shared across multiple cell types, whereas others are highly cell-type-specific (Figure 4).
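One simple way to think about 5-mer enrichment is to compare how often each 5-mer appears in high-importance context regions versus background sequence. The sketch below does this with a smoothed frequency ratio. It is an assumption-laden illustration: the function names, the pseudocount smoothing, and the toy sequences are hypothetical, and the paper's actual enrichment statistic may be different.

```python
from collections import Counter


def count_kmers(seqs, k=5):
    """Count every overlapping k-mer across a list of DNA sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def kmer_enrichment(foreground, background, k=5, pseudocount=1.0):
    """Ratio of smoothed k-mer frequencies, foreground over background."""
    fg = count_kmers(foreground, k)
    bg = count_kmers(background, k)
    vocab = set(fg) | set(bg)
    fg_total = sum(fg.values()) + pseudocount * len(vocab)
    bg_total = sum(bg.values()) + pseudocount * len(vocab)
    return {kmer: ((fg[kmer] + pseudocount) / fg_total)
                  / ((bg[kmer] + pseudocount) / bg_total)
            for kmer in vocab}


# Toy sequences (made up): "foreground" stands in for high-scoring context.
foreground = ["TTCCGGGAA", "ATTCCGTTCC"]
background = ["ACGTACGTAC", "GGGGGCCCCC"]
enrichment = kmer_enrichment(foreground, background)
top = sorted(enrichment, key=enrichment.get, reverse=True)[:3]
print(top)
```

Running the same comparison separately for each cell type and then intersecting the enriched 5-mers would give the kind of shared versus cell-type-specific split described above.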

Figure 4. Cell-type-specific enrichment of 5-mers influential for STAT3 binding. The inset table shows the auROC obtained from training each model on one cell type and using it to predict STAT3 binding status in another cell type.
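For readers unfamiliar with auROC, the sketch below computes it directly from its definition: the probability that a randomly chosen bound site receives a higher score than a randomly chosen unbound site, with ties counted as one half. The labels and scores are made-up toy values standing in for a model trained on one cell type and applied to another. They are not results from the paper.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for bound, 0 for unbound.
    scores: model outputs, higher meaning "more likely bound".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores from a model trained on cell type A,
# evaluated on sites from cell type B.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
cross_cell_type_auroc = auroc(labels, scores)
print(cross_cell_type_auroc)
```

An auROC of 0.5 corresponds to random guessing and 1.0 to perfect separation, so a drop in auROC when moving between cell types indicates that the learned features do not fully transfer. The pairwise version is quadratic in the number of sites, so for large datasets a rank-based implementation is the more practical choice.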

Please check out our paper for more results and method details.


[1] An Zheng, Michael Lamkin, Hanqing Zhao, et al. “Deep neural networks identify sequence context features predictive of transcription factor binding”. Nature Machine Intelligence (2021), DOI: 10.1038/s42256-020-00282-y.
[2] Free read-only copy of [1].
[3] An Zheng, Michael Lamkin, Hao Su, Melissa Gymrek. “AgentBind: Profiling Context-specific Determinants of Transcription Factor Binding Affinity”. International Conference on Machine Learning (ICML) Workshop on Computational Biology, Long Beach, CA, 2019.

Ph.D. candidate in Computer Science at UC San Diego.
