Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network (doi:10.21979/N9/FFN0XH)

Document Description
Citation
Title:	Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network
Identification Number:	doi:10.21979/N9/FFN0XH
Distributor:	DR-NTU (Data)
Date of Distribution:	2021-02-25
Version:	1
Bibliographic Citation:	Chan, Alvin; Korsakova, Anna; Ong, Yew-Soon; Winnerdy, Fernaldo Richtia; Lim, Kah Wai; Phan, Anh Tuan, 2021, "Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network", https://doi.org/10.21979/N9/FFN0XH, DR-NTU (Data), V1
Study Description
Citation
Title:	Replication Data for: RNA Alternative Splicing Prediction with Discrete Compositional Energy Network
Identification Number:	doi:10.21979/N9/FFN0XH
Authoring Entity:	Chan, Alvin (Nanyang Technological University)
	Korsakova, Anna (Nanyang Technological University)
	Ong, Yew-Soon (Nanyang Technological University)
	Winnerdy, Fernaldo Richtia (Nanyang Technological University)
	Lim, Kah Wai (Nanyang Technological University)
	Phan, Anh Tuan (Nanyang Technological University)
Software used in Production:	python, R
Grant Number:	Data Science and Artificial Intelligence Research Center (DSAIR)
Grant Number:	Investigatorship (NRFNRFI2017-09)
Distributor:	DR-NTU (Data)
Access Authority:	Korsakova Anna
Depositor:	Korsakova Anna
Date of Deposit:	2021-02-25
Holdings Information:	https://doi.org/10.21979/N9/FFN0XH
Study Scope
Keywords:	Computer and Information Science, Medicine, Health and Life Sciences, RNA splicing prediction, RNA binding protein
Abstract:	The Context Augmented Psi Dataset (CAPD) is a benchmark for RNA splicing models. It contains percent-spliced-in (PSI) labels for 250 samples from each of 14 tissue types, covering all 23 human chromosomes, together with auxiliary signals for each sample (RBP transcript abundance levels) and a gene dictionary (the same gene sequences are used for all samples). The dataset is split into training, testing, and validation sets. Auxiliary signals are provided in a separate archive as a .csv table in which each line represents one sample, with its tissue label and a number. Labels are stored as .jsonl dictionaries, one file per sample; each entry contains the gene name and the acceptor and donor coordinates (relative to the very first acceptor site of the gene) with their PSI levels, which range from 0 to 1. The gene dictionary is also stored as a .jsonl file, in which each entry is a pre-mRNA gene sequence with 1000 nt of flanking sequence on each side.
Kind of Data:	Benchmarking dataset for ML models
Other Study Description Materials
Related Publications
Citation
Identification Number:	10.1145/3450439.3451857
Bibliographic Citation:	Chan, A., Korsakova, A., Ong, Y., Winnerdy, F. R., Lim, K. W. & Phan, A. T. (2021). RNA alternative splicing prediction with discrete compositional energy network. Proceedings of the Conference on Health, Inference, and Learning (CHIL '21), 193-203
Citation
Identification Number:	10356/155091
Bibliographic Citation:	Chan, A., Korsakova, A., Ong, Y., Winnerdy, F. R., Lim, K. W. & Phan, A. T. (2021). RNA alternative splicing prediction with discrete compositional energy network. Proceedings of the Conference on Health, Inference, and Learning (CHIL '21), 193-203
Other Study-Related Materials
Label:	aux_inputs.tar.gz
Text:	.csv tables with auxiliary input entries: samples are in rows, and RBP and chemically modifying protein transcript IDs are in columns; each cell is the respective transcript abundance level (RPKM-normalized). Samples from all tissues are concatenated in the same table. Control .csv tables with random transcript abundance levels or membrane protein abundance levels are also included.
Notes:	application/x-gzip
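
The table layout described above (samples in rows, transcript IDs in columns) could be read along the following lines. This is only a sketch: the column names and values in the demo table are made up, the real header is not documented here, and the rule "a cell that parses as a number is an abundance value" is an assumption that may need adjusting for the actual files.

```python
import csv
import io


def read_aux_table(handle):
    """Split each CSV row into non-numeric fields and numeric abundance values."""
    rows = []
    for record in csv.DictReader(handle):
        abundances, other = {}, {}
        for column, value in record.items():
            try:
                # Assumption: numeric cells are transcript abundance levels.
                abundances[column] = float(value)
            except (TypeError, ValueError):
                # Anything else (e.g. a sample ID or tissue label) is kept as text.
                other[column] = value
        rows.append((other, abundances))
    return rows


# Made-up two-sample table; these column names are placeholders.
demo = io.StringIO(
    "sample,tissue,TRANSCRIPT_A,TRANSCRIPT_B\n"
    "s1,tissue_x,1.5,0.0\n"
    "s2,tissue_y,2.25,3.0\n"
)
aux_rows = read_aux_table(demo)
```

To use it on the real data, pass an open file handle for one of the extracted .csv tables instead of the in-memory demo.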
Other Study-Related Materials
Label:	gene_dict_alltraintest.jsonl
Text:	Gene dictionary containing pre-mRNA gene sequences (+1000nt flanks on both sides) for all the genes used for training, testing, and validation.
Notes:	application/octet-stream
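
A .jsonl file holds one JSON object per line, so the gene dictionary can be streamed rather than loaded at once. The sketch below assumes hypothetical field names ("gene", "sequence"), since the actual keys are not listed in this record; only the 1000 nt flank length comes from the description above.

```python
import io
import json

FLANK = 1000  # flank length stated in the file description


def iter_jsonl(handle):
    """Yield one parsed JSON object per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def strip_flanks(sequence, flank=FLANK):
    """Return the sequence without its leading and trailing flanks."""
    if len(sequence) < 2 * flank:
        raise ValueError("sequence is shorter than the two flanks")
    return sequence[flank:len(sequence) - flank]


# Made-up records with 3 nt flanks so the demo stays short.
demo = io.StringIO(
    '{"gene": "G1", "sequence": "AAACGTAAA"}\n'
    "\n"
    '{"gene": "G2", "sequence": "CCCTTTGGG"}\n'
)
gene_records = list(iter_jsonl(demo))
core = strip_flanks(gene_records[0]["sequence"], flank=3)
```

Whether the flanks should be removed at all depends on the model; the helper is included only to show how the stated flank length maps onto string indices.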
Other Study-Related Materials
Label:	main_inputs_test.tar.gz
Text:	Test dataset for model evaluation: each file contains the splicing labels of every gene for one particular sample.
Notes:	application/x-gzip
Other Study-Related Materials
Label:	main_inputs_train.tar.gz
Text:	Training dataset: each file contains the splicing labels of every gene for one particular sample.
Notes:	application/x-gzip
Other Study-Related Materials
Label:	main_inputs_valid.tar.gz
Text:	Validation dataset for model selection: each file contains the splicing labels of every gene for one particular sample.
Notes:	application/x-gzip
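
The three main_inputs archives share the same layout (one label file per sample), so they could be read without unpacking them to disk. The following is a minimal sketch under stated assumptions: the member name, the record keys ("gene", "psi"), and the idea that each label file is line-delimited JSON are illustrative guesses based on the abstract, not a documented schema. The range check reflects the stated PSI range of 0 to 1.

```python
import io
import json
import tarfile


def iter_label_records(archive_handle):
    """Yield (member_name, record) for every JSON line in every regular file."""
    with tarfile.open(fileobj=archive_handle, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            extracted = archive.extractfile(member)
            for raw_line in extracted:
                if raw_line.strip():
                    yield member.name, json.loads(raw_line)


def psi_in_range(values):
    """Check that every PSI value lies in the closed interval [0, 1]."""
    return all(0.0 <= value <= 1.0 for value in values)


def build_demo_archive():
    """Create a tiny in-memory .tar.gz with one made-up sample file."""
    payload = b'{"gene": "G1", "psi": [0.0, 0.4, 1.0]}\n'
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="sample_0.jsonl")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


demo_records = list(iter_label_records(build_demo_archive()))
```

For the real archives, open the downloaded file in binary mode (for example `open("main_inputs_train.tar.gz", "rb")`) and pass the handle to `iter_label_records`.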