Update 2019-12-1
The test set was released on Nov. 30 at 23:59 UTC. Submissions close on Dec. 2 at 23:59 UTC, so you have 48 hours to download the data and submit your results.
(Note: the daily submission limit has been raised to 6 submissions. PLEASE make full use of the time.)
Background
In many applications, such as scientific literature management, researcher search, and social network analysis, name disambiguation (working out who is who among authors who share a name) has long been a challenging problem. The growth of scientific literature makes the problem both harder and more urgent. Although name disambiguation has been studied extensively in academia and industry, it has not been solved well because the data is messy and same-name scenarios are complex.
Online academic search systems (such as Google Scholar, DBLP, and AMiner) hold large numbers of research papers and have become important, popular platforms for academic communication and paper search. However, because of limitations in their paper assignment algorithms, many papers are assigned to the wrong authors. These platforms also collect a large number of new papers every day. Assigning papers to existing author profiles accurately and quickly, while keeping those profiles consistent, is therefore an urgent problem for current online academic systems and platforms.
Because the amount of data inside academic systems is very large (AMiner has about 130,000,000 author profiles and more than 200,000,000 papers), the same-name situation is very complicated. [1]
The competition asks participants to propose models that distinguish papers whose authors share a name but are different people, using the detailed information of each paper and the links between authors and papers, and that achieve good disambiguation results.
Tasks
OAG-WhoIsWho has two tracks.
Track 1: Name Disambiguation from Scratch
Task Description: Given a set of papers whose authors all share the same name, participants are asked to return clusters of papers. Each cluster corresponds to one author, and different clusters correspond to different authors, even though those authors share the same name.
Suggested Methods: A common approach is to extract features from the papers, define a similarity measure, and apply a clustering algorithm. The set of papers is then partitioned into several clusters so that papers within a cluster are as similar as possible and papers in different clusters are quite different. Each cluster can then be treated as the work of a single author. [7] is a classic paper describing such a clustering method. First, strong rules (for example, two papers that share at least two co-authors belong to the same cluster) form initial clusters. Second, weak rules merge those clusters to improve recall. Other work works in a low-dimensional semantic space to overcome the limitations of traditional feature engineering: it maps papers to low-dimensional vectors and then clusters the vectors [2].
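The strong-rule step described above can be sketched as follows. This is a minimal illustration, not the method of [7] itself: the input format (a dict from paper ID to author-name list), the function name cluster_by_shared_coauthors, and the example data are all assumptions made for the sketch, and the weak-rule merging step is left out.

```python
from itertools import combinations


def cluster_by_shared_coauthors(papers, target_name, min_shared=2):
    """Group papers that share at least `min_shared` co-authors.

    `papers` maps a paper ID to its list of author names
    (a hypothetical input format, not the competition's data schema).
    """
    # Co-authors are all listed authors except the ambiguous name itself.
    coauthors = {
        pid: {a for a in authors if a != target_name}
        for pid, authors in papers.items()
    }
    parent = {pid: pid for pid in papers}

    def find(x):
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Strong rule: enough shared co-authors puts two papers in one cluster.
    for p, q in combinations(papers, 2):
        if len(coauthors[p] & coauthors[q]) >= min_shared:
            union(p, q)

    clusters = {}
    for pid in papers:
        clusters.setdefault(find(pid), []).append(pid)
    return sorted(sorted(c) for c in clusters.values())


# Made-up example: five papers that all list an author named "Li Wei".
example_papers = {
    "p1": ["Li Wei", "A", "B"],
    "p2": ["Li Wei", "A", "B", "C"],
    "p3": ["Li Wei", "D", "E"],
    "p4": ["Li Wei", "D", "E", "F"],
    "p5": ["Li Wei", "G"],
}
example_clusters = cluster_by_shared_coauthors(example_papers, "Li Wei")
```

The pairwise comparison is quadratic in the number of papers, which is acceptable for a single name but would likely need blocking or indexing at larger scale. Because the rule is transitive through the union step, two papers can end up in the same cluster without directly sharing co-authors.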
Track 2: Continuous Name Disambiguation
Task Description: Online academic platforms accumulate a large number of papers every day. Assigning these new papers accurately and quickly to existing author profiles is the most urgent problem for online academic systems. The problem is therefore defined as follows: given a set of new papers and the paper lists of a group of existing authors already in the system, assign each new paper to the correct existing author.
Suggested Methods: The incremental disambiguation track differs from the cold-start disambiguation task (Track 1). It starts from existing author profiles and needs to allocate new papers to them. A direct method is therefore to compare each existing author's papers with the newly added papers and extract traditional features such as collaborator, organization, or journal similarity. Traditional classifiers such as SVM can then be applied. It is also possible to use low-dimensional vector representations: by representing authors and papers as low-dimensional vectors, supervised learning methods can be used to extract features and train models.
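The direct method above can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields (coauthors, org, venue), the function names, and the fixed weights are all hypothetical, and a hand-weighted score stands in for a trained classifier such as SVM, which would normally consume the same feature vectors.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_features(paper, profile_papers):
    """Similarity features between one new paper and one author profile."""
    coauthors = {a for p in profile_papers for a in p["coauthors"]}
    orgs = {p["org"] for p in profile_papers}
    venues = {p["venue"] for p in profile_papers}
    return [
        jaccard(paper["coauthors"], coauthors),     # collaborator overlap
        1.0 if paper["org"] in orgs else 0.0,       # organization match
        1.0 if paper["venue"] in venues else 0.0,   # journal/venue match
    ]


def assign(paper, profiles, weights=(0.6, 0.2, 0.2)):
    """Return the best-scoring profile ID and all scores.

    The weights are illustrative placeholders, not tuned values.
    """
    scores = {
        author_id: sum(w * f for w, f in zip(weights, pair_features(paper, plist)))
        for author_id, plist in profiles.items()
    }
    return max(scores, key=scores.get), scores


# Made-up profiles for two authors who share the same name.
example_profiles = {
    "author_1": [{"coauthors": ["X", "Y"], "org": "Org A", "venue": "Venue 1"}],
    "author_2": [{"coauthors": ["Z"], "org": "Org B", "venue": "Venue 2"}],
}
new_paper = {"coauthors": ["X", "Q"], "org": "Org A", "venue": "Venue 1"}
best_author, all_scores = assign(new_paper, example_profiles)
```

In practice a new paper may belong to none of the existing profiles, so a real system would likely also need a score threshold below which the paper is left unassigned.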
References
[1]. Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008), pp. 990-998.
[2]. Yutao Zhang, Fanjin Zhang, Peiran Yao, and Jie Tang. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In Proceedings of the Twenty-Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18).
[3]. Jie Tang, A.C.M. Fong, Bo Wang, and Jing Zhang. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2012, Volume 24, Issue 6, pp. 975-987.
[4]. Xuezhi Wang, Jie Tang, Hong Cheng, and Philip S. Yu. ADANA: Active Name Disambiguation. In Proceedings of 2011 IEEE International Conference on Data Mining (ICDM'11), pp. 794-803.
[5]. https://biendata.com/competition/scholar2018/data/
[6]. The Microsoft Academic Search Dataset and KDD Cup 2013
[7]. Wang, F., Li, J., Tang, J., Zhang, J., & Wang, K. (2008). Name Disambiguation Using Atomic Clusters. In The Ninth International Conference on Web-Age Information Management (WAIM '08).
OAG-WhoIsWho Track 1
Prize: ¥50,000 (~ $3496)
Participants: 861
Start: 2019-09-30
Final Submissions: 2019-12-02
Sponsor: BAAI, AMiner