Part 1: Document Description
|
Citation |
|
---|---|
Title: |
Dataset and Code for: Code problem similarity detection using code clones and pretrained models |
Identification Number: |
doi:10.21979/N9/VPCR7H |
Distributor: |
DR-NTU (Data) |
Date of Distribution: |
2023-05-08 |
Version: |
1 |
Bibliographic Citation: |
Yeo, Geremie Yun Siang, 2023, "Dataset and Code for: Code problem similarity detection using code clones and pretrained models", https://doi.org/10.21979/N9/VPCR7H, DR-NTU (Data), V1 |
Citation |
|
Title: |
Dataset and Code for: Code problem similarity detection using code clones and pretrained models |
Identification Number: |
doi:10.21979/N9/VPCR7H |
Authoring Entity: |
Yeo, Geremie Yun Siang (Nanyang Technological University) |
Software used in Production: |
Code Problem Similarity Checker |
Distributor: |
DR-NTU (Data) |
Access Authority: |
Yeo, Geremie Yun Siang |
Depositor: |
Yeo, Geremie Yun Siang |
Date of Deposit: |
2023-05-08 |
Holdings Information: |
https://doi.org/10.21979/N9/VPCR7H |
Study Scope |
|
Keywords: |
Computer and Information Science, Code Clone Detection, Code Problem Similarity |
Abstract: |
This dataset complements the following study: Code problem similarity detection using code clones and pretrained models (SCSE22-0384). The study explores a new approach to detecting similar algorithmic-style code problems from websites such as LeetCode and Codeforces by comparing the similarity of their solution source code, an application of type IV code clone detection. The dataset comprises 107,000 submissions in 3 languages (Python, C++ and Java) to 3,000 Codeforces problems, collected between 2020 and 2023. Experiments were carried out on this dataset using 3 pre-trained models (C4-CodeBERT, GraphCodeBERT and UniXcoder). UniXcoder performed best, with an F1 score of 0.905, and was therefore used as the backbone of the code problem similarity checker (CPSC), which identifies the problems in the dataset that are most similar to an input source code. Based on the tests conducted in this project, this approach achieves state-of-the-art results in detecting similarity between code problems. Further research could explore other domains where type IV code clone detection is useful. |
Kind of Data: |
Source Code Snippets |
Notes: |
This data was collected as part of Yeo Geremie Yun Siang's Final Year Project (FYP) at NTU, "Code problem similarity detection using code clones and pretrained models", under the supervision of Associate Professor Anwitaman Datta and Associate Professor Patrick Pun Chi Seng. |
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Other Study Description Materials |
|
Related Publications |
|
Citation |
|
Identification Number: |
10356/165850 |
Bibliographic Citation: |
Yeo, G. Y. S. (2023). Code problem similarity detection using code clones and pretrained models. Final Year Project (FYP), Nanyang Technological University, Singapore. |
Label: |
Codeforces Dataset Part 1.zip |
Text: |
Dataset |
Notes: |
application/zip |
Label: |
Codeforces Dataset Part 2.zip |
Text: |
Dataset |
Notes: |
application/zip |
Label: |
dataprep_similar_problems.ipynb |
Text: |
Generates the pairs of problems to be compared |
Notes: |
application/x-ipynb+json |
Label: |
model.py |
Text: |
CPSC model file |
Notes: |
text/x-python |
Label: |
preliminary_tests_similar_problems.ipynb |
Text: |
Notebook for preliminary tests |
Notes: |
application/x-ipynb+json |
Label: |
run.py |
Text: |
CPSC run file |
Notes: |
text/x-python |
Label: |
run_similar_problems.py |
Text: |
Code file for preliminary tests |
Notes: |
text/x-python |
Label: |
UniXcoder_notebook.ipynb |
Text: |
Results for the UniXcoder model (F1 score of 0.905) |
Notes: |
application/x-ipynb+json |
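
The abstract describes the CPSC as returning the problems in the dataset that are most similar to an input source code. The sketch below illustrates one common way such a ranking can be done, assuming each solution has already been turned into a fixed-length embedding vector by a pre-trained model. It is not taken from `model.py` or `run.py`: the use of cosine similarity, the function names, the problem IDs and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar_problems(query_vec, problem_vecs, k=3):
    """Return the k (problem_id, score) pairs with the highest similarity to query_vec."""
    scored = [
        (problem_id, cosine_similarity(query_vec, vec))
        for problem_id, vec in problem_vecs.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings: real ones would come from a model such as UniXcoder
# and would have hundreds of dimensions rather than three.
problem_vecs = {
    "problem_A": [1.0, 0.0, 0.0],
    "problem_B": [0.8, 0.6, 0.0],
    "problem_C": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]  # hypothetical embedding of the input source code

ranking = top_similar_problems(query_vec, problem_vecs, k=2)
for problem_id, score in ranking:
    print(f"{problem_id}: {score:.3f}")
```

With these toy vectors, `problem_A` ranks ahead of `problem_B`, and `problem_C` is excluded because only the top 2 are requested.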