updated: 2023-04-01
#network-science/link-prediction
Link prediction is a task that seeks to identify relationships that have not yet been observed in a network. It can be used for a variety of purposes, including building recommendation systems, detecting fraud, and uncovering knowledge. It can also be used to test theories about how networks form, with successful prediction offering insights into the processes that shape their structure.
A link prediction algorithm is evaluated based on its ability to predict missing edges that are present but not observed in a given network. The widely-accepted benchmark goes as follows:
A key to a successful link prediction is to find the structural variables that are strongly predictive of edge existence. Pearhaps the simplest variable is the edge density, with which one predicts that an edge appears between any pair of nodes with the same probability, though this is a crude prediction.
A more predictive variable is node degree, i.e., number of neighbors in a given network. For a given node pair
Despite its simplicity, degree is surprisingly predictive of edges. Indeed, it sometimes excels more complex link prediction algorithms like graph convolution network and graph attention network.
Degree is often a strong determinant of many structural variables, and thus, may contribute to their edge predictability.
Common neighbors is a pairwise variable that looks at how many neighbors two nodes have in common, reflecting the local relationship between them. This is a major departure from degree, as it focuses more on the connections between pairs of nodes.
The following structural variables are all based on common neighbors, with difference being the normalization by the degree of common neighbors and the number of neighbors.
Path-based variables measure the distance between two nodes in a network by looking at the path between them. This is a generalization of the concept of common neighbors, i.e., two nodes that have common neighbors are connected by the shortest path of length two, the shortest distance for nodes that are not connected in the network.