Deep-learning augmented RNA-seq analysis of transcript splicing | Predicting alternative splicing with deep learning

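Predicting alternative splicing is already a popular topic. There are currently two main schools:

Predicting alternative splicing from DNA sequence and variants: GeneSplicer, MaxEntScan, dbscSNV, S-CAP, MMSplice, ClinVar, SpliceAI
Predicting alternative splicing from RNA: MISO, rMATS, DARTS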

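Preamble (ramblings)

Hot-topic herding in research never goes away, and it comes wave after wave. Most people who disdain chasing trends and also fail to produce results end up pushed out of the field.

Pure method development is exhausting. It costs time, effort and brainpower, especially once your own field has gone out of fashion, and you still have to put up with outsiders' dismissal: "What is this even good for? It won't get published, and in the end nobody will use it."

The biggest hot topic of recent years is "single cell". Many people rode that wave to a few papers; the earliest method developers published quite a lot in Nature Methods and NBT, and even more in Bioinformatics and NAR. But most of them later vanished, because the barrier to entry keeps falling and newcomers keep arriving. After a few years of development the field has settled into a three-way standoff: the strong get stronger and the weak exit.

There is also an unwritten rule for methods papers: never write too plainly. If most reviewers understand the paper at a glance, they will assume the work is too simple to deserve publication. Better to write something well argued that 90% of reviewers cannot grasp at first glance but find interesting after some thought. Haha, take it as a joke.

On to another paper, one that uses deep learning to predict alternative splicing.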

Deep-learning augmented RNA-seq analysis of transcript splicing


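Background from the paper that is worth understanding:

"Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples [7–10]": look at how earlier work predicted exon splicing patterns from cis sequence features.

The references involved: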

The human splicing code reveals new insights into the genetic determinants of disease – 2015

Deciphering the splicing code – 2010

Deep learning of the tissue-regulated splicing code – 2014

BRIE: transcriptome-wide splicing quantification in single cells – 2017

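Haha, deep learning was already being applied to alternative splicing back in 2014, so you cannot be the first mover here.

To learn what AS (alternative splicing) is, go straight to the tools being developed now: they are sure to include figures that explain it in detail, along with their algorithms. A picture is worth a thousand words.

MISO (Mixture of Isoforms) software documentation: it currently supports only Python 2, and if you install it with conda you also need to copy the miso_settings.txt file from the documentation.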

rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

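There is a huge gap between biology and informatics. The best people spot this gap quickly and fill it.

Questions:

Why does AS detection depend on sequencing depth? This requires understanding the current mainstream AS detection algorithms.

How should the concept of differential alternative splicing between samples be understood?

How should cis sequence features be understood, and what information does that file contain?

How to predict exon-inclusion/skipping levels in bulk tissues or single cells?

How to understand "we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing"?

DARTS has two parts: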

a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features

a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing

During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.

During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.

This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deep-learning-augmented splicing analysis.

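That mostly makes sense now. The first BHT is an upgraded version of conventional differential splicing tools (such as MISO and rMATS) and is used to produce labeled training data. The BHT results are used to train the DNN model. New data can then be fed to the DNN, and its output serves as the prior for the downstream Bayesian model, while our own RNA-seq data update that prior into a posterior. If the prior is accurate enough, the update depends less on the data, which is why the method can compensate for insufficient RNA-seq sequencing depth.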

To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.

cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions

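The DNN turns differential alternative splicing into a supervised prediction problem. Since the labels are "differential" versus "unchanged" and the output is a probability, it is closer to binary classification than to regression. The features are the two kinds listed above, because together they determine whether a given exon is differentially spliced.

Features finally used: 2,926 cis sequence features and 1,498 annotated RBPs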
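To make the feature counts concrete, here is a minimal sketch of how one input vector could be assembled for a single exon in a single pairwise comparison. The cis features are exon-specific and the RBP expression values are given once per condition, so the total would be 2,926 + 2 × 1,498 = 5,922 values. The function name, the concatenation order and the absence of any normalisation are assumptions for illustration, not taken from the DARTS code.

```python
N_CIS = 2926   # cis sequence features per exon (from the paper)
N_RBP = 1498   # annotated RBPs (from the paper)

def build_input(cis, rbp_cond1, rbp_cond2):
    """Concatenate exon-specific and sample-specific features (hypothetical layout)."""
    if len(cis) != N_CIS:
        raise ValueError("expected %d cis features, got %d" % (N_CIS, len(cis)))
    if len(rbp_cond1) != N_RBP or len(rbp_cond2) != N_RBP:
        raise ValueError("expected %d RBP expression values per condition" % N_RBP)
    # Assumed order: cis features, then RBP levels in condition 1, then condition 2.
    return list(cis) + list(rbp_cond1) + list(rbp_cond2)

# Dummy values only; real features come from the DARTS feature files.
x = build_input([0.0] * N_CIS, [1.0] * N_RBP, [2.0] * N_RBP)
print(len(x))  # 5922
```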

What exactly is the training data for the DNN?

large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium

We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons

The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN

From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). At this point every exon has its features extracted and a label attached, so the data are ready for training.
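The 90/10 split with fivefold cross-validation can be sketched with the standard library alone. This only illustrates the bookkeeping; the function name, the shuffling and the seed are assumptions, and the paper's Methods section defines the actual procedure.

```python
import random

def split_events(event_ids, test_fraction=0.1, n_folds=5, seed=0):
    """Hold out a test set, then cut the remainder into cross-validation folds."""
    ids = list(event_ids)
    random.Random(seed).shuffle(ids)          # reproducible shuffle
    n_test = round(len(ids) * test_fraction)
    test, train = ids[:n_test], ids[n_test:]
    # Fold k takes every n_folds-th training event, starting at offset k.
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds

train, test, folds = split_events(range(1000))
print(len(train), len(test), [len(f) for f in folds])
```

In each cross-validation round one fold is held out for validation and the other four are used for fitting; the 10% test set is touched only once at the end.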

Three baseline models were compared:

We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.

On the Bayesian model:

incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).

Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
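A toy calculation shows why a prior matters most when counts are low. This is not the DARTS BHT model: it ignores isoform effective lengths and replicates, and it compares "one shared inclusion level" against "two independent inclusion levels" using binomial counts with uniform Beta(1,1) priors. The function names are made up for this sketch; `prior` stands in for the probability a DNN would supply.

```python
from math import exp, lgamma

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(inc, skp):
    # Integral of psi^inc * (1 - psi)^skp over a uniform prior on psi.
    return log_beta(inc + 1, skp + 1)

def posterior_differential(i1, s1, i2, s2, prior):
    """Posterior probability of 'differential' given counts and a prior (toy model)."""
    log_h1 = log_marginal(i1, s1) + log_marginal(i2, s2)   # separate psi per condition
    log_h0 = log_marginal(i1 + i2, s1 + s2)                # one shared psi
    bayes_factor = exp(log_h1 - log_h0)
    return prior * bayes_factor / (prior * bayes_factor + (1.0 - prior))

for prior in (0.1, 0.9):
    low = posterior_differential(3, 1, 1, 3, prior)        # shallow coverage
    high = posterior_differential(90, 10, 10, 90, prior)   # deep coverage
    print("prior=%.1f  low-depth=%.3f  high-depth=%.3f" % (prior, low, high))
```

With only a handful of reads the posterior stays close to whatever the prior says; with about 100 reads per condition the data dominate and both priors lead to nearly the same answer.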

If the paper leaves you confused, just run the code!

First function, BHT:

Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/

The test_norep_data.txt file looks like this:

ID      GeneID                geneSymbol  chr    strand  exonStart_0base  exonEnd    upstreamES  upstreamEE  downstreamES  downstreamEE  ID      IJC_SAMPLE_1  SJC_SAMPLE_1  IJC_SAMPLE_2  SJC_SAMPLE_2  IncFormLen  SkipFormLen
82439   ENSG00000169045.17_1  HNRNPH1     chr5   -       179046269        179046408  179045145   179045324   179047892     179048036     82439   152363196774834  180  90
21374   ENSG00000131876.16_3  SNRPA1      chr15  -       101826418        101826498  101825930   101826006   101827112     101827215     21374   4105  118  292  54    169  90
32815   ENSG00000141027.20_3  NCOR1       chr17  -       15990485         15990659   15989712    15989756    15995176      15995232      32815   624   564  549  1261  180  90
43143   ENSG00000133731.9_2   IMPA1       chr8   -       82597997         82598198   82593732    82593819    82598486      82598518      43143   155   332  22   341   180  90
111671  ENSG00000100320.22_3  RBFOX2      chr22  -       36232366         36232486   36205826    36206051    36236238      36236460      111671  9319335534  180  90

(For events 82439 and 111671 the four count fields ran together in the paste and cannot be split unambiguously, so they are left fused.)

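Each row is one splicing event (an exon, with its gene in the geneSymbol column), with no duplicates, followed by its attributes: the exon coordinates, the inclusion and skipping junction counts (IJC, SJC) for each sample, and the lengths of the inclusion and skipping forms.

The output looks like this: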

ID      I1      S1      I2      S2      inc_len skp_len psi1    psi2    delta.mle       post_pr
1225    160     0       169     6       180     90      1       0.934   -0.0663 0.4367
15829   52      58      12      41      180     90      0.31    0.128   -0.1819 0.8867
20347   1084    930     371     615     180     90      0.368   0.232   -0.1365 1
21374   4105    118     292     54      169     90      0.949   0.742   -0.2065 1
24817   177     275     263     741     143     90      0.288   0.183   -0.1057 0.974
32815   624     564     549     1261    180     90      0.356   0.179   -0.1774 1
43143   155     332     22      341     180     90      0.189   0.031   -0.158  1
46548   1685    4040    216     1752    180     90      0.173   0.058   -0.1145 1

Each row is the inference result for the corresponding input event.
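The psi1 and psi2 columns in the output above can be reproduced from the counts with the usual length-normalised inclusion ratio, which is a quick way to check that you are reading the columns correctly. The function name is mine; whether Darts_BHT computes its reported values in exactly this way is an assumption, although the numbers for event 21374 agree.

```python
def psi(inc, skp, inc_len, skp_len):
    """Percent spliced in: length-normalised inclusion reads over all reads."""
    inc_density = inc / inc_len    # inclusion reads per effective position
    skp_density = skp / skp_len    # skipping reads per effective position
    total = inc_density + skp_density
    if total == 0:
        return float("nan")        # no reads, PSI undefined
    return inc_density / total

# Event 21374: I1=4105, S1=118, I2=292, S2=54, inc_len=169, skp_len=90
psi1 = psi(4105, 118, 169, 90)
psi2 = psi(292, 54, 169, 90)
print(round(psi1, 3), round(psi2, 3), round(psi2 - psi1, 4))  # 0.949 0.742 -0.2065
```

Note that delta.mle here matches psi2 minus psi1, so a negative value means less inclusion in the second sample.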

Second function, DNN:

Download the model

Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS

Predict

Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS

The first file is the input feature file (*.h5) or the Darts_BHT output (*.txt):

ID      I1      S1      I2      S2      inc_len skp_len mu.mle  delta.mle       post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685      581     0       462     0       155     99      1       0       0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815      28      0       49      2       126     99      1       -0.0493827160493827     0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340      2       37      0       81      119     99      0.0430341230167355      -0.0430341230167355     0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901    11      2       49      4       176     99      0.755725190839695       0.117542135892979       0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005    80      750     64      738     133     99      0.0735580941766509      -0.0129207126090368     0

The second file holds the Kallisto expression values for the RBPs, one column per condition:

thymus  adipose
RPS11   2678.83013      2531.887535
ERAL1   14.350975       13.709394
DDX27   18.2573 14.02368
DEK     32.463558       14.520312
PSMA6   102.332592      77.089475
TRIM56  4.519675        6.14762566667
TRIM71  0.082009        0.0153936666667
UPF2    7.150812        5.23628033333
FARS2   6.332831        7.291382
ALKBH8  3.056208        1.27043633333
ZNF579  5.13265 8.248575
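A small reader for this expression table, as a sketch. It assumes the file is tab-delimited with a header naming the two conditions and one gene per row, which is how the excerpt above looks; the function name and the returned structure are my own choices, not part of DARTS.

```python
import csv
import io

# Three rows copied from the excerpt above, written with explicit tabs.
EXAMPLE = (
    "thymus\tadipose\n"
    "RPS11\t2678.83013\t2531.887535\n"
    "DEK\t32.463558\t14.520312\n"
    "TRIM71\t0.082009\t0.0153936666667\n"
)

def read_rbp_expression(handle):
    """Return (condition names, {gene: (value_cond1, value_cond2)})."""
    reader = csv.reader(handle, delimiter="\t")
    # Drop empty cells so a header with a leading tab is handled as well.
    conditions = tuple(name for name in next(reader) if name)
    table = {}
    for row in reader:
        if not row:
            continue                      # skip blank lines
        gene, value1, value2 = row
        table[gene] = (float(value1), float(value2))
    return conditions, table

conditions, expression = read_rbp_expression(io.StringIO(EXAMPLE))
print(conditions, expression["DEK"])
```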

The output file: the first column is the ID, the second is the true label, and the third is the predicted label:

ID      Y_true  Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437   1.000000        0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620   1.000000        0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140    0.947333        0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993    1.000000        0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213     0.937333        0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750   1.000000        0.450762
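To work with the predictions downstream, the table can be parsed and ranked by Y_pred. The sketch below assumes whitespace-separated columns and IDs of the form chr:strand:coordinates, as in the excerpt; the field names in the returned dictionaries are made up for illustration.

```python
# Three rows copied from the excerpt above.
EXAMPLE = """ID      Y_true  Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437   1.000000        0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620   1.000000        0.073966
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750   1.000000        0.450762
"""

def parse_predictions(text):
    """Parse prediction rows and sort them by predicted probability, highest first."""
    lines = text.strip().splitlines()
    events = []
    for line in lines[1:]:                       # skip the header row
        event_id, y_true, y_pred = line.split()
        chrom, strand, *coords = event_id.split(":")
        events.append({
            "chrom": chrom,
            "strand": strand,
            "coords": tuple(int(c) for c in coords),
            "y_true": float(y_true),
            "y_pred": float(y_pred),
        })
    return sorted(events, key=lambda e: e["y_pred"], reverse=True)

ranked = parse_predictions(EXAMPLE)
for event in ranked:
    print(event["chrom"], event["strand"], event["coords"][0], event["y_pred"])
```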

Reference:

The Expanding Landscape of Alternative Splicing Variation in Human Populations.

This one is a purer application of DL:

Gene expression inference with deep learning | Inferring gene expression with deep learning

Example paper: Gene expression inference with deep learning

uci-cbcl/D-GEX – github

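The deep learning wave has been going for several years and is now widely accepted as very effective in medical image processing. To publish from here on, your data must be large enough and attractive enough; innovating on the method side is very hard.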

LINCS L1000 data

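The core idea is that the project measured the expression of fewer than a thousand genes and then infers the expression of all the others, roughly 30,000, using linear regression (LR) and deep learning (DL). That is why those 978 genes are called landmark genes.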
