#network-science/graph-embedding
#network-science/random-walks
node2vec is a direct application of word2vec (optimized with negative sampling) to sequences of nodes generated by random walks on a network.
There are many node2vec repositories, but not every one produces consistent results, and some seem to have bugs. The choice of implementation matters.
One of them is the `node2vec` package on PyPI. However, this implementation often underperforms because of a bad hyperparameter configuration. Furthermore, it is extremely slow and memory demanding, and it does not scale to large networks. All packages except PyTorch Geometric are built on top of the word2vec implementation in gensim.
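Since most of these packages are thin wrappers around gensim, the whole pipeline fits in a short script: generate walk "sentences", then train word2vec on them. Below is a minimal sketch of that recipe, assuming networkx and gensim. It uses a plain first-order random walk rather than the biased p/q walk (a sketch of that step follows the parameter list), and the graph, walk settings, and helper name `plain_walks` are illustrative choices, not any particular repo's API.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def plain_walks(G, num_walks=10, walk_length=80, seed=0):
    """num_walks unbiased walkers from every node; each walk is one 'sentence'."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:               # dangling node: the walk stops early
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])  # gensim expects string tokens
    return walks

G = nx.karate_club_graph()                 # toy example graph
walks = plain_walks(G)

# Skip-gram with negative sampling over the walk "sentences"
model = Word2Vec(walks, vector_size=64, window=10, sg=1, negative=5,
                 min_count=0, workers=4, epochs=1)
emb = {v: model.wv[str(v)] for v in G.nodes()}  # node -> embedding vector
```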
node2vec has several parameters that define the random walks, together with the parameters for gensim (word2vec):
- `num_walks` specifies the number of walkers starting from each node. Larger is better at the expense of computation time and memory. A good value ranges between 10 and 20. If the network is directed, set a larger value such as 30 or 40.
- `p` is inversely proportional to the probability of backtracking, i.e., a walker is less likely to return to the previously visited node if `p` is large.
- `q` is inversely proportional to the probability of visiting a node that is not a neighbor of the previously visited node, i.e., a walker is less likely to move away from the previous node (the walk stays more local) if `q` is large.
- `context` or `window_length` defines the length of the context window. It controls the resolution of the structure preserved in the embedding, i.e., a smaller window size preserves more local structure. Set `window_length = 10` if you have no preference.
- `batch_walk` or `batch_size` is the number of data samples used to compute each gradient update. Larger is better. Set `batch_walk = 10000` if you have no preference.
- `workers` is the number of CPU cores used to train word2vec.
- `epochs` (or `iter` in gensim version 3.9 or earlier) is the number of times the training goes through the given sentences. Larger is better, but `epochs = 1` works well enough in many cases.
- `ns_exponent` is the exponent of the word frequency distribution used to generate negative samples. Set `ns_exponent = 0.75` or `1` if you have no preference (`ns_exponent = 1` in the paper).
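To make the walk parameters concrete, here is a hedged sketch of the second-order step rule that `p` and `q` control (following the node2vec paper's weighting: 1/p for backtracking, 1 for a common neighbor of the current and previous node, 1/q otherwise), followed by a gensim call using the "no preference" values above. The helper names `next_node` and `biased_walks` are made up for illustration; only the `Word2Vec` keywords are real gensim arguments.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def next_node(G, t, v, p=1.0, q=1.0, rng=random):
    """One biased step: t is the previously visited node, v is the current node."""
    nbrs = list(G.neighbors(v))
    if not nbrs:
        return None                    # dangling node: the walk cannot continue
    weights = []
    for x in nbrs:
        if x == t:
            weights.append(1.0 / p)    # backtracking, discouraged by a large p
        elif G.has_edge(t, x):
            weights.append(1.0)        # common neighbor of t and v
        else:
            weights.append(1.0 / q)    # moving away from t, discouraged by a large q
    return rng.choices(nbrs, weights=weights, k=1)[0]

def biased_walks(G, num_walks=10, walk_length=80, p=1.0, q=1.0, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in G.nodes():
            nbrs = list(G.neighbors(start))
            if not nbrs:
                continue               # isolated node: nothing to walk on
            walk = [start, rng.choice(nbrs)]
            while len(walk) < walk_length:
                x = next_node(G, walk[-2], walk[-1], p=p, q=q, rng=rng)
                if x is None:
                    break
                walk.append(x)
            walks.append([str(v) for v in walk])
    return walks

G = nx.karate_club_graph()
walks = biased_walks(G, num_walks=10, walk_length=80, p=1.0, q=1.0)

# The "no preference" values above, mapped onto gensim's Word2Vec keywords
model = Word2Vec(
    walks,
    vector_size=64,
    sg=1, negative=5, min_count=0,
    window=10,            # context / window_length
    batch_words=10_000,   # batch_walk / batch_size
    workers=4,            # number of CPU cores
    epochs=1,             # "iter" in gensim <= 3.9
    ns_exponent=0.75,     # exponent of the negative-sampling distribution
)
```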
Notes:
- ... `p` and `q` (paper).
- Increase `num_walks` or `epochs` as much as possible. For reference, `num_walks` should be at least 20 for networks of 10,000 nodes; increase it further when training a larger network.
- Increase `num_walks` when embedding directed networks (see the sketch after this list). node2vec is known to perform poorly for directed networks. I found that this is because a random walker stops walking when it hits a dangling node, producing fewer walks with which to train word2vec.
- ... when `context` (or `window_length`) is one (paper).
- ... when `ns_exponent = 0` or the given network is a regular graph.
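A rough way to see the directed-network issue is to measure how much of each intended walk is actually generated before the walker hits a node with no out-edges. The sketch below uses an arbitrary toy digraph and arbitrary numbers; the point is only that a small realized fraction means word2vec sees far less training data than `num_walks` and the walk length would suggest, which is why raising `num_walks` helps.

```python
import random
import networkx as nx

def out_walk_length(G, start, walk_length=80, rng=random):
    """Number of nodes a walker visits before it hits a node with no out-edges."""
    v, visited = start, 1
    while visited < walk_length:
        nbrs = list(G.successors(v))   # out-neighbors only
        if not nbrs:                   # dangling node: the walk stops here
            break
        v = rng.choice(nbrs)
        visited += 1
    return visited

# Toy sparse directed graph (arbitrary choice, only for illustration)
G = nx.gnp_random_graph(1000, 0.003, seed=0, directed=True)
lengths = [out_walk_length(G, v) for v in G.nodes()]
frac = sum(lengths) / (80 * G.number_of_nodes())
print(f"Only {frac:.0%} of the intended walk steps were generated.")
# If this fraction is small, raise num_walks (e.g., 30-40) so word2vec still
# receives a comparable amount of training data.
```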