Part 1: Document Description
|
Citation |
|
---|---|
Title: |
Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network |
Identification Number: |
doi:10.21979/N9/FFN0XH |
Distributor: |
DR-NTU (Data) |
Date of Distribution: |
2021-02-25 |
Version: |
1 |
Bibliographic Citation: |
Chan, Alvin; Korsakova, Anna; Ong, Yew-Soon; Winnerdy, Fernaldo Richtia; Lim, Kah Wai; Phan, Anh Tuan, 2021, "Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network", https://doi.org/10.21979/N9/FFN0XH, DR-NTU (Data), V1 |
Citation |
|
Title: |
Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network |
Identification Number: |
doi:10.21979/N9/FFN0XH |
Authoring Entity: |
Chan, Alvin (Nanyang Technological University) |
Korsakova, Anna (Nanyang Technological University) |
|
Ong, Yew-Soon (Nanyang Technological University) |
|
Winnerdy, Fernaldo Richtia (Nanyang Technological University) |
|
Lim, Kah Wai (Nanyang Technological University) |
|
Phan, Anh Tuan (Nanyang Technological University) |
|
Software used in Production: |
python, R |
Grant Number: |
Data Science and Artificial Intelligence Research Center (DSAIR) |
Grant Number: |
Investigatorship (NRFNRFI2017-09) |
Distributor: |
DR-NTU (Data) |
Access Authority: |
Korsakova Anna |
Depositor: |
Korsakova Anna |
Date of Deposit: |
2021-02-25 |
Holdings Information: |
https://doi.org/10.21979/N9/FFN0XH |
Study Scope |
|
Keywords: |
Computer and Information Science, Medicine, Health and Life Sciences, RNA splicing prediction, RNA binding protein |
Abstract: |
Context Augmented Psi Dataset (CAPD) for benchmarking RNA splicing models. It contains percent-spliced-in (PSI) labels for 250 samples from each of the 14 tissue types for all 23 human chromosomes, along with auxiliary signals for each sample (RBP transcript abundance levels) and a gene dictionary (the same gene sequences are used for all samples). The dataset is split into training, testing, and validation sets. Auxiliary signals are provided in a separate archive as a .csv table, where each line represents one sample with its tissue label and a number. Labels are stored as .jsonl dictionaries, one file per sample; each entry in a dictionary contains the gene name and the acceptor and donor coordinates (relative to the very first acceptor site of the gene) with their PSI levels in the range from 0 to 1. The gene dictionary is also stored as a .jsonl file, where each entry is a pre-mRNA gene sequence with 1000 nt flanking regions on each side. |
Kind of Data: |
Benchmarking dataset for ML models |
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Other Study Description Materials |
|
Related Publications |
|
Citation |
|
Identification Number: |
10.1145/3450439.3451857 |
Bibliographic Citation: |
Chan, A., Korsakova, A., Ong, Y., Winnerdy, F. R., Lim, K. W. & Phan, A. T. (2021). RNA alternative splicing prediction with discrete compositional energy network. Proceedings of the Conference on Health, Inference, and Learning (CHIL '21), 193-203 |
Citation |
|
Identification Number: |
10356/155091 |
Bibliographic Citation: |
Chan, A., Korsakova, A., Ong, Y., Winnerdy, F. R., Lim, K. W. & Phan, A. T. (2021). RNA alternative splicing prediction with discrete compositional energy network. Proceedings of the Conference on Health, Inference, and Learning (CHIL '21), 193-203 |
Label: |
aux_inputs.tar.gz |
Text: |
.csv tables with auxiliary input entries: samples in rows, transcript IDs of RBPs and chemically modifying proteins in columns; each cell is the corresponding transcript abundance level (RPKM-normalized). Samples from all tissues are concatenated in the same table. Control .csv tables with random transcript abundance levels or membrane protein abundance levels are also included. |
Notes: |
application/x-gzip |
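As a minimal sketch of how the auxiliary tables might be read, the Python snippet below opens a .tar.gz archive and parses every .csv member using only the standard library. The function name `read_csv_tables` is hypothetical, and the archive's internal file names, column layout, and text encoding (UTF-8 is assumed here) are not stated in this record, so the returned header should be inspected before relying on any column.

```python
import csv
import io
import tarfile


def read_csv_tables(archive_path):
    """Return {member_name: (header, rows)} for every .csv file in a .tar.gz archive.

    Assumption: members are UTF-8 text with a single header row.
    """
    tables = {}
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a CSV table.
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text)
            header = next(reader, [])   # first row: column names
            rows = list(reader)         # remaining rows: one per sample
            tables[member.name] = (header, rows)
    return tables
```

Calling `read_csv_tables("aux_inputs.tar.gz")` on a downloaded copy would return the tables as lists of strings; converting abundance values to floats is left to the caller because the exact column layout is not documented here.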
Label: |
gene_dict_alltraintest.jsonl |
Text: |
Gene dictionary containing pre-mRNA gene sequences (with 1000 nt flanks on both sides) for all genes used for training, testing, and validation. |
Notes: |
application/octet-stream |
Label: |
main_inputs_test.tar.gz |
Text: |
Test dataset: each file contains the splicing labels of every gene for one particular sample. |
Notes: |
application/x-gzip |
Label: |
main_inputs_train.tar.gz |
Text: |
Training dataset: each file contains the splicing labels of every gene for one particular sample. |
Notes: |
application/x-gzip |
Label: |
main_inputs_valid.tar.gz |
Text: |
Validation dataset: each file contains the splicing labels of every gene for one particular sample. |
Notes: |
application/x-gzip |
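The label files and the gene dictionary listed above are described as .jsonl files. The Python sketch below shows one way such a file could be read, assuming it follows the usual JSON Lines convention of one JSON value per line; that convention and the helper names `iter_jsonl` and `top_level_keys` are assumptions, not part of this record. Because the key names inside each entry are not documented here, the second helper only lists whatever keys are actually present.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}"
                ) from err


def top_level_keys(records):
    """Return the sorted set of keys seen across all dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record)
    return sorted(keys)
```

Running `top_level_keys(iter_jsonl(path))` on one extracted label file is a quick way to discover the real schema before writing any code that depends on specific field names.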