misogawa
Duplicate Entries Found in Training and Validation Datasets
2024-05-27 06:50
Hi everyone,
I wanted to bring to your attention that I’ve found 655 duplicate entries between the training and validation datasets. This could potentially affect the model’s performance and evaluation metrics.
Has anyone else encountered this issue? If so, how are you handling it?
Looking forward to your insights and suggestions.
Thank you!
4 comments
OAG-Challenge Team
Fanjin Zhang
Hi participants,
Thanks for the information! We will try our best to remove duplicate entries in the test set.
nothing reply Fanjin Zhang
Hi, could you update the current leaderboard by excluding those samples? The private leaderboard is still several days away.
OAG-Challenge Team
Fanjin Zhang reply nothing
Hi, we are studying and working on this.
NISH01
Hi there.
I picked a specific example.
qa_train.txt contains a line whose question and body are exactly the same as a line in qa_valid_wo_ans.txt.
In this case, submitting 56d82022dabfae2eeeb68805 from the training data seems like the obvious choice and increases the leaderboard score, but it’s essentially meaningless.
There are 655 similar examples out of 2919.
This issue seems quite critical for this competition.
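For anyone who wants to reproduce a count like this, here is a minimal sketch of one way to find validation entries that also appear in the training data. It is not the method used above. The field names "question" and "body", the helper names, and the toy records are all assumptions for illustration; the real file format of qa_train.txt is not shown in this thread.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_overlap(train_records, valid_records, fields=("question", "body")):
    """Return the validation records whose chosen fields also occur in the training records."""
    def key(rec):
        return tuple(normalize(rec.get(f, "")) for f in fields)

    train_keys = {key(r) for r in train_records}
    return [r for r in valid_records if key(r) in train_keys]

# Toy records only; the field names "question" and "body" are assumed, not taken from the dataset.
train = [
    {"question": "What is BERT?", "body": "Looking for the original paper."},
    {"question": "Graph neural networks", "body": "Any survey?"},
]
valid = [
    {"question": "what is  BERT?", "body": "Looking for the original paper."},
    {"question": "Diffusion models", "body": "Seeking references."},
]

overlap = find_overlap(train, valid)
print(len(overlap), "of", len(valid), "validation entries also appear in training")
```

Dropping the normalize step gives a stricter, exact-match count. Either way, the overlapping entries could be excluded from a local validation split so that local scores are not inflated.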
misogawa
Hi, Mr. Fanjin Zhang
Thanks for your response.
There are approximately 600 duplicate questions and bodies, but I haven’t checked if the answers are also duplicated.
OAG-Challenge Team
Fanjin Zhang
Hi, misogawa,
Thanks for your feedback. There may be some duplicate questions in the training and validation datasets.
What does "duplicate entries" mean here? Duplicate questions, duplicate bodies, or duplicate answers?
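One way to make that distinction concrete is to count the overlap separately for each field combination. The sketch below assumes records with "question", "body", and "answer" fields; these names and the toy data are hypothetical. Since the validation file name (qa_valid_wo_ans.txt) suggests it ships without answers, the answer-level count can probably only be run by someone who holds the validation labels.

```python
def count_shared(train_records, valid_records, fields):
    """Count validation records whose values for `fields` all match some training record."""
    train_keys = {tuple(r.get(f) for f in fields) for r in train_records}
    return sum(tuple(r.get(f) for f in fields) in train_keys for r in valid_records)

# Toy data; field names are assumptions, not the real dataset schema.
train = [
    {"question": "Q1", "body": "B1", "answer": "A1"},
    {"question": "Q2", "body": "B2", "answer": "A2"},
    {"question": "Q3", "body": "B3", "answer": "A3"},
]
valid = [
    {"question": "Q1", "body": "B1", "answer": "A1"},  # full duplicate
    {"question": "Q2", "body": "B2", "answer": "A9"},  # same question and body, different answer
    {"question": "Q3", "body": "B7", "answer": "A7"},  # same question only
    {"question": "Q4", "body": "B4", "answer": "A4"},  # not in training
]

report = {
    "question": count_shared(train, valid, ("question",)),
    "question+body": count_shared(train, valid, ("question", "body")),
    "question+body+answer": count_shared(train, valid, ("question", "body", "answer")),
}
print(report)
```

Reporting the three counts side by side shows how many shared questions actually leak the answer and how many only look similar.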