Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network (doi:10.21979/N9/FFN0XH)

Document Description

Citation

Title:

Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network

Identification Number:

doi:10.21979/N9/FFN0XH

Distributor:

DR-NTU (Data)

Date of Distribution:

2021-02-25

Version:

1

Bibliographic Citation:

Chan, Alvin; Korsakova, Anna; Ong, Yew-Soon; Winnerdy, Fernaldo Richtia; Lim, Kah Wai; Phan, Anh Tuan, 2021, "Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network", https://doi.org/10.21979/N9/FFN0XH, DR-NTU (Data), V1

Study Description

Citation

Title:

Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network

Identification Number:

doi:10.21979/N9/FFN0XH

Authoring Entity:

Chan, Alvin (Nanyang Technological University)

Korsakova, Anna (Nanyang Technological University)

Ong, Yew-Soon (Nanyang Technological University)

Winnerdy, Fernaldo Richtia (Nanyang Technological University)

Lim, Kah Wai (Nanyang Technological University)

Phan, Anh Tuan (Nanyang Technological University)

Software used in Production:

python, R

Grant Number:

Data Science and Artificial Intelligence Research Center (DSAIR)

Grant Number:

Investigatorship (NRFNRFI2017-09)

Distributor:

DR-NTU (Data)

Access Authority:

Korsakova Anna

Depositor:

Korsakova Anna

Date of Deposit:

2021-02-25

Holdings Information:

https://doi.org/10.21979/N9/FFN0XH

Study Scope

Keywords:

Computer and Information Science, Medicine, Health and Life Sciences, RNA splicing prediction, RNA binding protein

Abstract:

The Context Augmented Psi Dataset (CAPD) is a benchmark for RNA splicing models. It contains percent-spliced-in (PSI) labels for 250 samples from each of 14 tissue types across all 23 human chromosomes, along with auxiliary signals for each sample (RBP transcript abundance levels) and a gene dictionary (the same gene sequences are used for all samples). The dataset is split into training, testing, and validation sets. Auxiliary signals are provided in a separate archive as a .csv table, where each line represents one sample with a respective tissue label and a number. Labels are stored as .jsonl dictionaries, one file per sample; each entry in a dictionary contains the gene name and the acceptor and donor coordinates (relative to the very first acceptor site of the gene) with their respective PSI levels in the range from 0 to 1. The gene dictionary is also stored as a .jsonl file, where each entry is a pre-mRNA gene sequence with 1000 nt flanking ends on each side.
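As a rough illustration of the label format described in the abstract, the sketch below reads .jsonl records line by line and checks that PSI values fall within the stated range of 0 to 1. The key names (`gene`, `acceptors`, `donors`, `acceptor_psi`, `donor_psi`) and the sample record are hypothetical; the actual field names used in the dataset files are not given in this description and should be checked against the files themselves.

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed JSON object per non-empty line of a .jsonl stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def psi_in_range(values):
    """Return True if every PSI value lies in the closed interval [0, 1]."""
    return all(0.0 <= value <= 1.0 for value in values)


# Hypothetical record: the key names and values are assumptions made for
# this illustration, not the dataset's documented schema.
example_entry = {
    "gene": "GENE_A",
    "acceptors": [0, 350],
    "donors": [120, 480],
    "acceptor_psi": [1.0, 0.4],
    "donor_psi": [0.9, 0.4],
}
sample_file = io.StringIO(json.dumps(example_entry) + "\n")

records = list(read_jsonl(sample_file))
all_valid = all(
    psi_in_range(r["acceptor_psi"]) and psi_in_range(r["donor_psi"])
    for r in records
)
```

With a real label file, `sample_file` would be replaced by an opened file handle, and the key names adjusted to match the actual entries.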

Kind of Data:

Benchmarking dataset for ML models

Methodology and Processing

Sources Statement

Data Access

Other Study Description Materials

Related Publications

Citation

Identification Number:

10.1145/3450439.3451857

Bibliographic Citation:

Chan, A., Korsakova, A., Ong, Y., Winnerdy, F. R., Lim, K. W. & Phan, A. T. (2021). RNA alternative splicing prediction with discrete compositional energy network. Proceedings of the Conference on Health, Inference, and Learning (CHIL '21), 193-203

Citation

Identification Number:

10356/155091

Bibliographic Citation:

Chan, A., Korsakova, A., Ong, Y., Winnerdy, F. R., Lim, K. W. & Phan, A. T. (2021). RNA alternative splicing prediction with discrete compositional energy network. Proceedings of the Conference on Health, Inference, and Learning (CHIL '21), 193-203

Other Study-Related Materials

Label:

aux_inputs.tar.gz

Text:

.csv tables with auxiliary input entries: samples in rows, RBP and chemically modifying protein transcript IDs in columns, each cell is the respective transcript abundance level (RPKM normalized); samples from all tissues concatenated together in the same table. Control .csv tables with random transcript abundance levels or membrane protein abundance levels are included as well.
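The table layout described above (samples in rows, transcript IDs in columns) could be loaded roughly as follows, using only the Python standard library. The column names `sample` and `tissue`, the transcript IDs, and the values are hypothetical placeholders; the real header of the .csv tables is not specified here.

```python
import csv
import io


def load_abundances(handle, id_column, skip_columns=()):
    """Map each sample ID to a {transcript ID: abundance} dictionary.

    id_column names the column holding the sample identifier;
    skip_columns lists non-numeric columns (e.g. a tissue label) to ignore.
    """
    table = {}
    for row in csv.DictReader(handle):
        table[row[id_column]] = {
            name: float(value)
            for name, value in row.items()
            if name != id_column and name not in skip_columns
        }
    return table


# Hypothetical stand-in for one auxiliary-input table (made-up values).
demo_csv = io.StringIO(
    "sample,tissue,TRANSCRIPT_1,TRANSCRIPT_2\n"
    "s1,heart,12.5,0.0\n"
    "s2,liver,3.25,7.0\n"
)
abundances = load_abundances(demo_csv, id_column="sample", skip_columns=("tissue",))
```

The same function would apply to the control tables, assuming they share the layout of the main table.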

Notes:

application/x-gzip

Other Study-Related Materials

Label:

gene_dict_alltraintest.jsonl

Text:

Gene dictionary containing pre-mRNA gene sequences (+1000nt flanks on both sides) for all the genes used for training, testing, and validation.
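Since each sequence in the gene dictionary carries 1000 nt flanks on both sides, a small helper such as the one below could recover the gene body when the flanks are not needed. This is a minimal sketch: the function name is hypothetical, and whether the label coordinates must be offset by the flank length when indexing into the stored sequence is not stated in this description.

```python
def strip_flanks(sequence, flank=1000):
    """Return the sequence with `flank` nucleotides removed from each end."""
    if flank < 0:
        raise ValueError("flank must be non-negative")
    if len(sequence) < 2 * flank:
        raise ValueError("sequence is shorter than the two flanks combined")
    return sequence[flank:len(sequence) - flank]


# Toy example with 3 nt flanks instead of 1000 nt, for readability.
demo_sequence = "NNN" + "ACGTACGT" + "NNN"
core = strip_flanks(demo_sequence, flank=3)
```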

Notes:

application/octet-stream

Other Study-Related Materials

Label:

main_inputs_test.tar.gz

Text:

Test dataset: each file contains the splicing labels of every gene for one particular sample.

Notes:

application/x-gzip

Other Study-Related Materials

Label:

main_inputs_train.tar.gz

Text:

Training dataset: each file contains the splicing labels of every gene for one particular sample.

Notes:

application/x-gzip

Other Study-Related Materials

Label:

main_inputs_valid.tar.gz

Text:

Validation dataset: each file contains the splicing labels of every gene for one particular sample.

Notes:

application/x-gzip
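The main_inputs archives listed above are gzip-compressed tarballs holding one label file per sample. Assuming the members are .jsonl files as the abstract describes, they could be read without unpacking to disk, as sketched below. The member name `sample_0.jsonl` and its contents are made up for the demonstration; the archives' internal layout is not documented here.

```python
import io
import json
import tarfile


def iter_jsonl_members(archive_bytes):
    """Yield (member name, parsed records) for each regular file in a .tar.gz."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file members
            text = tar.extractfile(member).read().decode("utf-8")
            records = [json.loads(line) for line in text.splitlines() if line.strip()]
            yield member.name, records


def make_demo_archive():
    """Build a tiny in-memory .tar.gz standing in for a real archive."""
    payload = b'{"gene": "GENE_A"}\n'  # hypothetical record
    info = tarfile.TarInfo(name="sample_0.jsonl")  # hypothetical member name
    info.size = len(payload)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_members = list(iter_jsonl_members(make_demo_archive()))
```

For a downloaded archive, `tarfile.open("main_inputs_train.tar.gz", mode="r:gz")` could be used directly in place of the in-memory buffer.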