Execute this notebook: Download locally

Node2Vec representation learning with Stellargraph components

This example demonstrates how to apply components from the stellargraph library to perform representation learning via Node2Vec. This uses a Keras implementation of Node2Vec available in stellargraph instead of the reference implementation provided by gensim. This implementation provides flexible interfaces to downstream tasks for end-to-end learning.


[1] Node2Vec: Scalable Feature Learning for Networks. A. Grover, J. Leskovec. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016. (link)

[2] Distributed representations of words and phrases and their compositionality. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. In Advances in Neural Information Processing Systems (NIPS), pp. 3111-3119, 2013. (link)

[3] word2vec Parameter Learning Explained. X. Rong. arXiv preprint arXiv:1411.2738. 2014 Nov 11. (link)


Following word2vec [2,3], for each (target,context) node pair \((v_i,v_j)\) collected from random walks, we learn the representation for the target node \(v_i\) by using it to predict the existence of context node \(v_j\), with the following three-layer neural network.


Node \(v_i\)’s representation in the hidden layer is obtained by multiplying \(v_i\)’s one-hot representation in the input layer with the input-to-hidden weight matrix \(W_{in}\), which is equivalent to look up the \(i\)th row of input-to-hidden weight matrix \(W_{in}\). The existence probability of each node conditioned on node \(v_i\) is outputted in the output layer, which is obtained by multiplying \(v_i\)’s hidden-layer representation with the hidden-to-out weight matrix \(W_{out}\) followed by a softmax activation. To capture the target-context relation between \(v_i\) and \(v_j\), we need to maximize the probability \(\mathrm{P}(v_j|v_i)\). However, computing \(\mathrm{P}(v_j|v_i)\) is time consuming, which involves the matrix multiplication between \(v_i\)’s hidden-layer representation and the hidden-to-out weight matrix \(W_{out}\).

To speed up the computing, we adopt the negative sampling strategy [2,3]. For each (target, context) node pair, we sample a negative node \(v_k\), which is not \(v_i\)’s context. To obtain the output, instead of multiplying \(v_i\)’s hidden-layer representation with the hidden-to-out weight matrix \(W_{out}\) followed by a softmax activation, we only calculate the dot product between \(v_i\)’s hidden-layer representation and the \(j\)th column as well as the \(k\)th column of the hidden-to-output weight matrix \(W_{out}\) followed by a sigmoid activation respectively. According to [3], the original objective to maximize \(\mathrm{P}(v_j|v_i)\) can be approximated by minimizing the cross entropy between \(v_j\) and \(v_k\)’s outputs and their ground-truth labels (1 for \(v_j\) and 0 for \(v_k\)).

Following [2,3], we denote the rows of the input-to-hidden weight matrix \(W_{in}\) as input_embeddings and the columns of the hidden-to-out weight matrix \(W_{out}\) as output_embeddings. To build the Node2Vec model, we need look up input_embeddings for target nodes and output_embeddings for context nodes and calculate their inner product together with a sigmoid activation.

import matplotlib.pyplot as plt

from sklearn.manifold import TSNE

import os
import networkx as nx
import numpy as np
import pandas as pd
from tensorflow import keras

from stellargraph import StellarGraph
from stellargraph.data import BiasedRandomWalk
from stellargraph.data import UnsupervisedSampler
from stellargraph.data import BiasedRandomWalk
from stellargraph.mapper import Node2VecLinkGenerator, Node2VecNodeGenerator
from stellargraph.layer import Node2Vec, link_classification

from stellargraph import datasets
from IPython.display import display, HTML

%matplotlib inline


For clarity, we use only the largest connected component, ignoring isolated nodes and subgraphs; having these in the data does not prevent the algorithm from running and producing valid results.

dataset = datasets.Cora()
G, subjects = dataset.load(largest_connected_component_only=True)
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.
StellarGraph: Undirected multigraph
 Nodes: 2485, Edges: 5209

 Node types:
  paper: [2485]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5209]
        Weights: all 1 (default)
        Features: none

The Node2Vec algorithm

The Node2Vec algorithm introduced in [1] is a 2-step representation learning algorithm. The two steps are:

  1. Use random walks to generate sentences from a graph. A sentence is a list of node ids. The set of all sentences makes a corpus.

  2. The corpus is then used to learn an embedding vector for each node in the graph. Each node id is considered a unique word/token in a dictionary that has size equal to the number of nodes in the graph. The Word2Vec algorithm [2] is used for calculating the embedding vectors.

In this implementation, we train the Node2Vec algorithm in the following two steps:

  1. Generate a set of (target, context) node pairs through starting the biased random walk with a fixed length at per node. The starting nodes are taken as the target nodes and the following nodes in biased random walks are taken as context nodes. For each (target, context) node pair, we generate 1 negative node pair.

  2. Train the Node2Vec algorithm through minimizing cross-entropy loss for target-context pair prediction, with the predictive value obtained by performing the dot product of the ‘input embedding’ of the target node and the ‘output embedding’ of the context node, followed by a sigmoid activation.

Specify the optional parameter values: the number of walks to take per node, the length of each walk. Here, to guarantee the running efficiency, we respectively set walk_number and walk_length to 100 and 5. Larger values can be set to them to achieve better performance.

walk_number = 100
walk_length = 5

Create the biased random walker to perform context node sampling, with the specified parameters.

walker = BiasedRandomWalk(
    p=0.5,  # defines probability, 1/p, of returning to source node
    q=2.0,  # defines probability, 1/q, for moving to a node away from the source node

Create the UnsupervisedSampler instance with the biased random walker.

unsupervised_samples = UnsupervisedSampler(G, nodes=list(G.nodes()), walker=walker)

Set the batch size and the number of epochs.

batch_size = 50
epochs = 2

Define an attri2vec training generator, which generates a batch of (index of target node, index of context node, label of node pair) pairs per iteration.

generator = Node2VecLinkGenerator(G, batch_size)

Build the Node2Vec model, with the dimension of learned node representations set to 128.

emb_size = 128
node2vec = Node2Vec(emb_size, generator=generator)
x_inp, x_out = node2vec.in_out_tensors()

Use the link_classification function to generate the prediction, with the ‘dot’ edge embedding generation method and the ‘sigmoid’ activation, which actually performs the dot product of the input embedding of the target node and the output embedding of the context node followed by a sigmoid activation.

prediction = link_classification(
    output_dim=1, output_act="sigmoid", edge_embedding_method="dot"
link_classification: using 'dot' method to combine node embeddings into edge embeddings

Stack the Node2Vec encoder and prediction layer into a Keras model. Our generator will produce batches of positive and negative context pairs as inputs to the model. Minimizing the binary crossentropy between the outputs and the provided ground truth is much like a regular binary classification task.

model = keras.Model(inputs=x_inp, outputs=prediction)


Train the model.

history = model.fit(
Train for 39760 steps
Epoch 1/2
39760/39760 [==============================] - 155s 4ms/step - loss: 0.2924 - binary_accuracy: 0.8557
Epoch 2/2
39760/39760 [==============================] - 238s 6ms/step - loss: 0.1096 - binary_accuracy: 0.9641

Visualise Node Embeddings

Build the node based model for predicting node representations from node ids and the learned parameters. Below a Keras model is constructed, with x_inp[0] as input and x_out[0] as output. Note that this model’s weights are the same as those of the corresponding node encoder in the previously trained node pair classifier.

x_inp_src = x_inp[0]
x_out_src = x_out[0]
embedding_model = keras.Model(inputs=x_inp_src, outputs=x_out_src)

Get the node embeddings from node ids.

node_gen = Node2VecNodeGenerator(G, batch_size).flow(subjects.index)
node_embeddings = embedding_model.predict(node_gen, workers=4, verbose=1)
50/50 [==============================] - 0s 1ms/step

Transform the embeddings to 2d space for visualisation.

transform = TSNE  # PCA

trans = transform(n_components=2)
node_embeddings_2d = trans.fit_transform(node_embeddings)
# draw the embedding points, coloring them by the target label (paper subject)
alpha = 0.7
label_map = {l: i for i, l in enumerate(np.unique(subjects))}
node_colours = [label_map[target] for target in subjects]

plt.figure(figsize=(7, 7))
    node_embeddings_2d[:, 0],
    node_embeddings_2d[:, 1],
plt.title("{} visualization of node embeddings".format(transform.__name__))

Downstream task

The node embeddings calculated using Node2Vec can be used as feature vectors in a downstream task such as node attribute inference (e.g., inferring the subject of a paper in Cora), community detection (clustering of nodes based on the similarity of their embedding vectors), and link prediction (e.g., prediction of citation links between papers).

Execute this notebook: Download locally