The heterogeneous graph benchmark (HGB) aims to advance open and reproducible heterogeneous graph research. In HGB, we standardize data splits, feature processing, and performance evaluation by establishing the HGB pipeline "feature preprocessing → HGNN encoder → downstream decoder" across three heterogeneous graph learning tasks: node classification, link prediction, and knowledge-aware recommendation.
Node Classification
Given a heterogeneous graph and some labeled nodes, you need to predict the categories of unlabeled nodes. Our datasets include both single-label and multi-label settings.
Link Prediction
Given a heterogeneous graph, you need to predict the probability that a link exists between given node pairs. Since real-world graphs are usually sparse, link prediction is useful for finding missing links.
Knowledge-aware Recommendation
Given a user-item graph and a knowledge graph attached to it, you need to produce item rankings for test users.
How to take part in HGB?
There are three channels in HGB: node classification (NC), link prediction (LP), and knowledge-aware recommendation (Recom). You can find the entries at the bottom of the HGB homepage. For NC and LP, you can join the competition as a regular biendata competition and submit your predictions to get scores. For Recom, because the prediction files are too large to upload, you need to evaluate offline by yourself.
Our GitHub repo contains helpful code for data loading, model training and evaluation, and saving predictions for submission. HGB-related scripts are in the NC/benchmark, LP/benchmark, and Recom/baseline sub-folders, respectively.
If you want your scores shown on the official HGB leaderboard, you have to make a request and fill in a form on the "My Submissions" page of each competition channel. After we have verified your provided information and code reproducibility, you will obtain a rank on the official HGB leaderboard.
About Pipeline
To standardize heterogeneous graph experiment settings, we adapt all methods to the HGB pipeline "feature preprocessing → HGNN encoder → downstream decoder". The whole pipeline is trained end-to-end. Ideally, researchers only need to focus on the HGNN encoder part. You can also explore more effective feature preprocessing and downstream decoder modules, but most of the time you can simply leave these two parts to the HGB pipeline scripts.
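Conceptually, the three stages compose into one trainable model. A minimal, framework-agnostic sketch of the composition (the function names here are illustrative, not HGB's actual API):

```python
def run_pipeline(graph, features, preprocess, encoder, decoder):
    """HGB pipeline: feature preprocessing -> HGNN encoder -> downstream decoder.

    preprocess: maps raw per-type features into a unified vector space
    encoder:    the HGNN under study; does message passing over the graph
    decoder:    task-specific head (classification, link scoring, or ranking)
    """
    h = preprocess(features)   # unify feature dimensions across node types
    h = encoder(graph, h)      # produce node representations
    return decoder(h)          # produce task predictions
```

Because the stages are plain composable callables, swapping in a new encoder leaves the other two stages untouched, which is the point of the standardized pipeline.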
Feature preprocessing
Since different node types may have different feature dimensions, we use a separate linear layer for each node type to project all nodes into a unified vector space.
Moreover, we find that not all node features have a positive effect on model performance. For example, replacing some types of node features with one-hot encodings may boost performance. Therefore, we provide three feature combinations: using all given node features, using only the features of the target node type, or replacing all node features with one-hot vectors.
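The per-type projection and the one-hot replacement can be sketched as follows. This is our own simplified illustration: the random matrices stand in for trained per-type linear layers, and the function name is not part of the HGB scripts.

```python
import numpy as np

def preprocess_features(feats_by_type, hidden_dim, use_one_hot=False, seed=0):
    """Project each node type's features into a shared hidden_dim space.

    feats_by_type: dict mapping node_type_id -> (num_nodes, feat_dim) array.
    With use_one_hot=True, the given features are discarded and each node
    instead gets a one-hot vector over the nodes of its type.
    """
    rng = np.random.default_rng(seed)
    out = {}
    for ntype, x in feats_by_type.items():
        if use_one_hot:
            x = np.eye(x.shape[0])  # replace features with one-hot vectors
        # one projection matrix per node type (random init stands in for
        # a learnable linear layer trained end-to-end with the pipeline)
        W = rng.standard_normal((x.shape[1], hidden_dim))
        out[ntype] = x @ W
    return out
```

Whatever the raw dimensions are, every node type ends up in the same `hidden_dim`-dimensional space, so the encoder can treat all node representations uniformly.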
HGNN encoder
This is the primary module for researchers to explore.
Downstream decoder
For node classification, we use a softmax decoder (single-label classification) or a sigmoid decoder (multi-label classification).
For link prediction, we use a dot product decoder or a DistMult decoder.
For recommendation, we use an ensemble of dot product and BPR-MF decoders.
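For reference, the two link-scoring decoders can be sketched over precomputed node embeddings as below; this is a simplified illustration (batched scoring, one relation vector) rather than the exact benchmark implementation.

```python
import numpy as np

def dot_decoder(h_src, h_dst):
    """Score each candidate link as the inner product of its endpoint embeddings."""
    return np.sum(h_src * h_dst, axis=-1)

def distmult_decoder(h_src, h_dst, r):
    """DistMult: a bilinear score with a diagonal relation matrix, i.e.
    sum_k h_src[k] * r[k] * h_dst[k], with one learnable r per edge type."""
    return np.sum(h_src * r * h_dst, axis=-1)
```

The dot product is the special case of DistMult where the relation vector is all ones, which is why DistMult is the natural choice when different edge types should be scored differently.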
About Data Format
Although you can use our data_loader script to load data, we clarify our dataset format here.
Node classification
Fields within each line are separated by '\t'. For each dataset, we have:
node.dat: The information of nodes. Each line has (node_id, node_name, node_type_id, node_feature). One-hot node features can be omitted. Each node type occupies a contiguous range of node_ids, and node_ids are ordered by node_type_id: node_type_id=0 takes the first interval of node_ids, node_type_id=1 the second, and so on. Node features are comma-separated vectors.
link.dat: The information of edges. Each line has (node_id_source, node_id_target, edge_type_id, edge_weight).
label.dat: The information of node labels. Each line has (node_id, node_type_id, node_label). In the multi-label setting, node_labels are comma-separated.
label.dat.test: Test-set node labels. The format is the same as label.dat, but each node_label is randomly replaced.
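Under these conventions, a minimal parser for node.dat might look like this (the function name and return structure are our own illustration, not the official data_loader):

```python
def load_node_dat(path):
    """Parse node.dat: tab-separated (node_id, node_name, node_type_id, node_feature).

    Returns {node_id: (name, type_id, feature_vector_or_None)}.
    The feature column is a comma-separated vector and may be absent
    for node types whose features are one-hot (and thus omitted).
    """
    nodes = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            node_id, name, type_id = int(parts[0]), parts[1], int(parts[2])
            feat = [float(v) for v in parts[3].split(",")] if len(parts) > 3 else None
            nodes[node_id] = (name, type_id, feat)
    return nodes
```

Because node_ids of each type form a contiguous, type-ordered range, the per-type intervals can be recovered by scanning the type_id column once.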
Link prediction
node.dat: The format is the same as node.dat in node classification.
link.dat: The format is the same as link.dat in node classification.
link.dat.test: Originally, this file contained the test-set links. To prevent data leakage, the target nodes have been randomly replaced.
Recommendation
mf.npz: Pretrained BPR-MF embeddings for recommendation.
kg_final.txt: Each line has (head_entity_id, relation_id, tail_entity_id). The entity ids are aligned with item ids where possible.
train.txt: In each line, the first number is a user id, and the remaining numbers are the ids of items related to that user.
test.txt: The format is the same as train.txt, but for the test set. Note that we do not replace this test set, because the datasets are the same as those used in KGAT and can be obtained easily. Therefore, you need to guard against data leakage yourself.
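A minimal reader for the recommendation files could look like the sketch below. We assume whitespace-separated ids as in the original KGAT release; verify the delimiter against the actual files, and note the function names are our own.

```python
def load_user_items(path):
    """Parse train.txt/test.txt: each line is a user id followed by
    the ids of the items related to that user."""
    user_items = {}
    with open(path) as f:
        for line in f:
            ids = [int(v) for v in line.split()]
            if ids:
                user_items[ids[0]] = ids[1:]
    return user_items

def load_kg(path):
    """Parse kg_final.txt: one (head_entity_id, relation_id, tail_entity_id)
    triple per line."""
    with open(path) as f:
        return [tuple(int(v) for v in line.split()) for line in f if line.strip()]
```

Since the test interactions are public, any leakage check (e.g. asserting that no test pair appears in training) has to be done on your side before reporting results.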
DataSets
[Dataset table with columns: Name, Size, Task, Metric, Last Update, Download, Paper/Code Included]