Missing link prediction

updated: 2023-04-01
#network-science/link-prediction

Link prediction is a task that seeks to identify relationships that have not yet been observed in a network. It can be used for a variety of purposes, including building recommendation systems, detecting fraud, and uncovering knowledge. It can also be used to test theories about how networks form, with successful prediction offering insights into the processes that shape their structure.

A link prediction algorithm is evaluated based on its ability to predict missing edges that are present but not observed in a given network. The widely-accepted benchmark goes as follows:

  1. An algorithm is given a training network with missing edges and is required to provide a prediction score  for each pair of nodes  and . The higher the score, the greater the likelihood of an edge existing between those two nodes.
  2. Since there are significantly more unconnected nodes than missing edges due to edge sparsity, a subset of unconnected node pairs sampled uniformly at random is used, with the same size as the set of missing edges.
  3. The algorithm is deemed to be a successful link prediction algorithm if the score is higher for the missing edges (green in the figure below) than for the unconnected node pairs (red).
  4. The overlap of the distributions for the missing edges and unconnected node pairs is commonly quantified by the area under curve (AUC) of the receiving operator characteristics (ROC; right panel).

Quest for predictive structural variables

A key to a successful link prediction is to find the structural variables that are strongly predictive of edge existence. Pearhaps the simplest variable is the edge density, with which one predicts that an edge appears between any pair of nodes with the same probability, though this is a crude prediction.

Degree

A more predictive variable is node degree, i.e., number of neighbors in a given network. For a given node pair , a prediction score is given by

Despite its simplicity, degree is surprisingly predictive of edges. Indeed, it sometimes excels more complex link prediction algorithms like graph convolution network and graph attention network.

Degree is often a strong determinant of many structural variables, and thus, may contribute to their edge predictability.

Common neighbors

Common neighbors is a pairwise variable that looks at how many neighbors two nodes have in common, reflecting the local relationship between them. This is a major departure from degree, as it focuses more on the connections between pairs of nodes.

The following structural variables are all based on common neighbors, with difference being the normalization by the degree of common neighbors and the number of neighbors.

Path

Path-based variables measure the distance between two nodes in a network by looking at the path between them. This is a generalization of the concept of common neighbors, i.e., two nodes that have common neighbors are connected by the shortest path of length two, the shortest distance for nodes that are not connected in the network.

Matrix factorization

-Learning spectral graph transformations for link prediction | Proceedings of the 26th Annual International Conference on Machine Learning

Model-based

Graph embedding

Metadata data

Evaluation metric

References

Benchmark design

Prediction methods