Dataset and Code for: Code problem similarity detection using code clones and pretrained models (doi:10.21979/N9/VPCR7H)

Document Description

Citation

Title:

Dataset and Code for: Code problem similarity detection using code clones and pretrained models

Identification Number:

doi:10.21979/N9/VPCR7H

Distributor:

DR-NTU (Data)

Date of Distribution:

2023-05-08

Version:

1

Bibliographic Citation:

Yeo, Geremie Yun Siang, 2023, "Dataset and Code for: Code problem similarity detection using code clones and pretrained models", https://doi.org/10.21979/N9/VPCR7H, DR-NTU (Data), V1

Study Description

Citation

Title:

Dataset and Code for: Code problem similarity detection using code clones and pretrained models

Identification Number:

doi:10.21979/N9/VPCR7H

Authoring Entity:

Yeo, Geremie Yun Siang (Nanyang Technological University)

Software used in Production:

Code Problem Similarity Checker

Distributor:

DR-NTU (Data)

Access Authority:

Yeo, Geremie Yun Siang

Depositor:

Yeo, Geremie Yun Siang

Date of Deposit:

2023-05-08

Holdings Information:

https://doi.org/10.21979/N9/VPCR7H

Study Scope

Keywords:

Computer and Information Science, Code Clone Detection, Code Problem Similarity

Abstract:

This dataset complements the following study: Code problem similarity detection using code clones and pretrained models (SCSE22-0384). The study explores a new approach to detecting similar algorithmic-style code problems from websites such as LeetCode and Codeforces by comparing the similarity of their solution source code, an application of type IV code clone detection. It is based on 107,000 submissions in 3 languages (Python, C++ and Java) from 3,000 problems on Codeforces between 2020 and 2023. Experiments were carried out on this dataset using 3 pre-trained models (C4-CodeBERT, GraphCodeBERT, UniXcoder). UniXcoder performed best, with an F1 score of 0.905, and was therefore used as the backbone of the code problem similarity checker (CPSC), which identifies the problems in the dataset most similar to an input source code. Based on the tests conducted in this project, this approach achieves state-of-the-art results in detecting similarity between code problems. Further research can be done in other domains where type IV code clone detection may be useful.
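
The retrieval step described in the abstract (ranking every problem in the dataset by similarity to an input solution) can be illustrated with a minimal sketch. It assumes each problem is already represented by an embedding vector and that similarity is measured by cosine similarity; both are assumptions made for illustration, and the deposited model.py and run.py may work differently. The function names and the toy vectors below are hypothetical and are not taken from the dataset.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar_problems(
    query: Sequence[float],
    problem_embeddings: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k problems whose embeddings are most similar to the query."""
    scored = [
        (problem_id, cosine_similarity(query, embedding))
        for problem_id, embedding in problem_embeddings.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for model-produced embeddings (hypothetical values).
toy_embeddings = {
    "problem_A": [1.0, 0.0, 0.0],
    "problem_B": [0.9, 0.1, 0.0],
    "problem_C": [0.0, 1.0, 0.0],
}

# The query points in the same direction as problem_A; its scale does not matter.
ranking = top_similar_problems([2.0, 0.0, 0.0], toy_embeddings, k=2)
for problem_id, score in ranking:
    print(f"{problem_id}: {score:.3f}")
```

In practice the vectors would come from a pre-trained code model such as UniXcoder rather than being written by hand, and an approximate nearest-neighbour index may be preferable to this exhaustive scan for large problem sets.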

Kind of Data:

Source Code Snippets

Notes:

This data was collected as part of Yeo Geremie Yun Siang's Final Year Project (FYP) at NTU, "Code problem similarity detection using code clones and pretrained models", under the supervision of Associate Professor Anwitaman Datta and Associate Professor Patrick Pun Chi Seng.

Methodology and Processing

Sources Statement

Data Access

Other Study Description Materials

Related Publications

Citation

Identification Number:

10356/165850

Bibliographic Citation:

Yeo, G. Y. S. (2023). Code problem similarity detection using code clones and pretrained models. Final Year Project (FYP), Nanyang Technological University, Singapore.

Other Study-Related Materials

Label:

Codeforces Dataset Part 1.zip

Text:

Dataset

Notes:

application/zip

Other Study-Related Materials

Label:

Codeforces Dataset Part 2.zip

Text:

Dataset

Notes:

application/zip

Other Study-Related Materials

Label:

dataprep_similar_problems.ipynb

Text:

Generates the pairs of problems to be compared

Notes:

application/x-ipynb+json

Other Study-Related Materials

Label:

model.py

Text:

CPSC model file

Notes:

text/x-python

Other Study-Related Materials

Label:

preliminary_tests_similar_problems.ipynb

Text:

Notebook for preliminary tests

Notes:

application/x-ipynb+json

Other Study-Related Materials

Label:

run.py

Text:

CPSC run file

Notes:

text/x-python

Other Study-Related Materials

Label:

run_similar_problems.py

Text:

Code file for preliminary tests

Notes:

text/x-python

Other Study-Related Materials

Label:

UniXcoder_notebook.ipynb

Text:

Results for UniXcoder model (F1 score of 0.905)

Notes:

application/x-ipynb+json