misogawa

Duplicate Entries Found in Training and Validation Datasets

posted in   AQA-KDD-2024

2024-05-27 06:50

Hi everyone,

I wanted to bring to your attention that I’ve found 655 duplicate entries between the training and validation datasets. This could potentially affect the model’s performance and evaluation metrics.

Has anyone else encountered this issue? If so, how are you handling it?

Looking forward to your insights and suggestions.

Thank you!

4  comments


OAG-Challenge Team

Fanjin Zhang

2024-05-28 06:48

Hi participants,

Thanks for the information! We will try our best to remove duplicate entries in the test set.

  • nothing (in reply to Fanjin Zhang)

    2024-05-28 11:15

    Hi, could you update the current leaderboard to exclude those samples? The private leaderboard is still several days away.

    • OAG-Challenge Team

      Fanjin Zhang (in reply to nothing)

      2024-05-29 09:09

      Hi, we are studying and working on this.

NISH01

2024-05-28 05:29

Hi there.

I picked a specific example.

qa_train.txt contains the line:

{"question": "What is the difference between problem solving and intelligence?", "body": "<p>What is the difference between problem solving and intelligence?</p>\n\n<p>All I know is that the tower of hanoi is thought of as a problem solving test, perhaps due to its lack of automation, I don't know. I'd assume that \"intelligence\" is broader, measured by iq tests and reflected in skills etc..</p>\n\n<p>Is it just that intelligence is <a href=\"https://psychology.stackexchange.com/questions/145/what-is-the-difference-between-solving-a-problem-and-acquiring-a-skill\">learnt</a>?</p>\n", "pids": ["56d82022dabfae2eeeb68805"]}

And qa_valid_wo_ans.txt also contains the line:

{"question": "What is the difference between problem solving and intelligence?", "body": "<p>What is the difference between problem solving and intelligence?</p>\n\n<p>All I know is that the tower of hanoi is thought of as a problem solving test, perhaps due to its lack of automation, I don't know. I'd assume that \"intelligence\" is broader, measured by iq tests and reflected in skills etc..</p>\n\n<p>Is it just that intelligence is <a href=\"https://psychology.stackexchange.com/questions/145/what-is-the-difference-between-solving-a-problem-and-acquiring-a-skill\">learnt</a>?</p>\n"}

The question and body are exactly the same.
In this case, submitting 56d82022dabfae2eeeb68805 from the training data is the obvious choice and raises the leaderboard score, but the gain is essentially meaningless.

There are 655 similar examples out of 2919.

This issue seems quite critical for this competition.
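
For anyone who wants to reproduce the count, here is a minimal sketch of the exact-match check described above. It assumes both files hold one JSON object per line with `question` and `body` fields, as in the quoted examples; the function names `load_jsonl`, `record_key`, and `find_overlap` are illustrative and not part of any official starter kit.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def record_key(record):
    # Exact-match key: question and body must both be identical.
    return (record["question"], record["body"])


def find_overlap(train_records, valid_records):
    """Return validation records whose (question, body) also occurs in training."""
    train_keys = {record_key(r) for r in train_records}
    return [r for r in valid_records if record_key(r) in train_keys]


# Example usage (file names taken from the post above):
# train = load_jsonl("qa_train.txt")
# valid = load_jsonl("qa_valid_wo_ans.txt")
# dupes = find_overlap(train, valid)
# print(len(dupes), "of", len(valid), "validation entries also appear in training")
```

This only catches byte-for-byte identical text. Near-duplicates (different whitespace or HTML markup) would need some normalization first, which is not attempted here.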

misogawa

2024-05-28 05:03

Hi, Mr. Fanjin Zhang

Thanks for your response.

There are approximately 600 duplicate questions and bodies, but I haven’t checked if the answers are also duplicated.

OAG-Challenge Team

Fanjin Zhang

2024-05-28 04:01

Hi, misogawa,

Thanks for your feedback. There may be some duplicate questions in the training and validation datasets.


What do duplicate entries mean here? Duplicate questions, duplicate bodies, or duplicate answers?
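
One way to answer this is to break the overlap down by field. The sketch below is a hedged illustration, not an official tool: it assumes each record is a dict with `question` and `body` keys, as in the examples quoted in this thread, and the function name `field_overlap` is hypothetical. Answers cannot be compared this way because, judging by the quoted sample, the entries in qa_valid_wo_ans.txt carry no `pids` field.

```python
def field_overlap(train_records, valid_records):
    """Count validation records that match training on question, body, or both."""
    train_questions = {r["question"] for r in train_records}
    train_bodies = {r["body"] for r in train_records}
    train_pairs = {(r["question"], r["body"]) for r in train_records}

    counts = {"question": 0, "body": 0, "question_and_body": 0}
    for r in valid_records:
        if r["question"] in train_questions:
            counts["question"] += 1
        if r["body"] in train_bodies:
            counts["body"] += 1
        # Both fields must match the same training record to count here.
        if (r["question"], r["body"]) in train_pairs:
            counts["question_and_body"] += 1
    return counts
```

The three counts are not mutually exclusive: a record counted under `question_and_body` is also counted under `question` and `body`.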