Loading and saving data between StellarGraph and Neo4j

Run the master version of this notebook:

This demo explains how to load data from Neo4j into a form that can be used by the StellarGraph library, and how to save predictions back into the database. `See all other demos <../README.md>`__.

The StellarGraph library supports many deep machine learning (ML) algorithms on graphs. This graph information can be loaded from the popular Neo4j graph database. If your data is already in Neo4j, this is a great way to load it. If not, loading via Pandas or via NetworkX is likely to be faster and potentially more convenient.

This notebook demonstrates one approach to connecting StellarGraph and Neo4j. It uses the SQL-like Cypher language to read a graph or subgraph from Neo4j into Pandas DataFrames, and then uses these to construct a StellarGraph object (following the same techniques as in the loading via Pandas demo, which has more details about that aspect). This notebook assumes some familiarity with Cypher constructs like MATCH, RETURN and WHERE. This notebook uses the py2neo library to interact with a Neo4j instance.

This notebook walks through scenarios for loading and storing graphs. Feel free to go through the whole notebook, or search for the following titles to jump to sections most relevant to you.

  • homogeneous graph without features (a homogeneous graph is one with only one type of node and one type of edge)
  • homogeneous graph with features
  • homogeneous graph with edge weights
  • directed graphs (a graph is directed if each edge has a “start” node and an “end” node, instead of just connecting two nodes)
  • heterogeneous graphs (more than one node type and/or more than one edge type), with and without node features or edge weights; this includes knowledge graphs
  • subgraphs (an example of filtering which nodes and edges are loaded)
  • saving predictions into Neo4j

The StellarGraph class is available at the top level of the stellargraph library:

[1]:
# install StellarGraph if running on Google Colab
import sys
if 'google.colab' in sys.modules:
  %pip install -q stellargraph[demos]==1.0.0rc1
[2]:
# verify that we're using the correct version of StellarGraph for this notebook
import stellargraph as sg

try:
    sg.utils.validate_notebook_version("1.0.0rc1")
except AttributeError:
    raise ValueError(
        f"This notebook requires StellarGraph version 1.0.0rc1, but a different version {sg.__version__} is installed.  Please see <https://github.com/stellargraph/stellargraph/issues/1172>."
    ) from None
[3]:
from stellargraph import StellarGraph

Connecting to Neo4j

To read anything from Neo4j, we’ll need a connection to a running instance.

[4]:
import os
import py2neo

default_host = os.environ.get("STELLARGRAPH_NEO4J_HOST")

# Create the Neo4J Graph database object; the parameters can be edited to specify location and authentication
neo4j_graph = py2neo.Graph(host=default_host, port=None, user=None, password=None)
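
The connection parameters above could also be collected from the environment rather than edited in place. The following is a minimal sketch and not part of py2neo: the helper name neo4j_connection_kwargs and every environment variable other than STELLARGRAPH_NEO4J_HOST are hypothetical names chosen for illustration.

```python
import os


def neo4j_connection_kwargs(environ=os.environ):
    """Collect keyword arguments for py2neo.Graph from environment variables.

    Only STELLARGRAPH_NEO4J_HOST is used by this notebook; the other
    variable names are hypothetical and chosen for illustration.
    """
    port = environ.get("STELLARGRAPH_NEO4J_PORT")
    return {
        "host": environ.get("STELLARGRAPH_NEO4J_HOST"),
        "port": int(port) if port is not None else None,
        "user": environ.get("STELLARGRAPH_NEO4J_USER"),
        "password": environ.get("STELLARGRAPH_NEO4J_PASSWORD"),
    }


# an explicit mapping is passed so the example does not depend on the real environment
example_kwargs = neo4j_connection_kwargs({"STELLARGRAPH_NEO4J_HOST": "localhost"})
print(example_kwargs)
```

The result could then be passed on as py2neo.Graph(**neo4j_connection_kwargs()), which uses the same keyword arguments as the cell above.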

Dataset

We’ll be working with a graph representing a square with a diagonal. We’ll give node a the label foo and the other nodes the label bar, along with some features. We’ll also give each edge a label matching its orientation and a weight.

a -- b
| \  |
|  \ |
d -- c

This section uses the types from py2neo to seed our Neo4j instance with the example data. For real work involving StellarGraph and Neo4j, the real data would be loaded into the database via some external process. However, we need some data to work with for this demo and so we need to have the cells in this section. They can be safely ignored, and removed for real work.

[5]:
from py2neo.data import Node, Relationship, Subgraph

a = Node("foo", name="a", top=True, left=True, foo_numbers=[0.1, 0.2, 0.3])
b = Node("bar", name="b", top=True, left=False, bar_numbers=[1, -2])
c = Node("bar", name="c", top=False, left=False, bar_numbers=[34, 5.6])
d = Node("bar", name="d", top=False, left=True, bar_numbers=[0.7, -98])

ab = Relationship(a, "horizontal", b, weight=1.0)
bc = Relationship(b, "vertical", c, weight=0.2)
cd = Relationship(c, "horizontal", d, weight=3.4)
da = Relationship(d, "vertical", a, weight=5.67)
ac = Relationship(a, "diagonal", c, weight=1.0)

subgraph = Subgraph([a, b, c, d], [ab, bc, cd, da, ac])

We don’t want to accidentally overwrite or delete important data or add junk in a production Neo4j instance. As a check, this demo requires the Neo4j instance to be empty. If the neo4j_graph connection is to a non-empty database, please either:

  • delete everything from it (there’s a cell at the end of the notebook that can be used, if that’s ok)
  • start a new instance, adjust the parameters to py2neo.Graph above to connect to it, and rerun the cells from there
[6]:
num_nodes = len(neo4j_graph.nodes)
num_relationships = len(neo4j_graph.relationships)
if num_nodes > 0 or num_relationships > 0:
    raise ValueError(
        f"neo4j_graphdb: expected an empty database to give a reliable result and to avoid corrupting your data with mutations & the `delete_all` in the last cell, found {num_nodes} nodes and {num_relationships} relationships in the database already"
    )

Finally, we can fill the database by writing our example data to the database.

[7]:
neo4j_graph.create(subgraph)

# basic check that the database has the right data
assert len(neo4j_graph.nodes) == 4
assert len(neo4j_graph.relationships) == 5

Homogeneous graph without features (edges only)

We’ll start with a homogeneous graph without any node features. This means the graph consists of only nodes and edges without any information other than a unique identifier. To simulate this, we will be ignoring all of the properties we added except the name property, which is a unique identifier for each node.

We can use a single Cypher query to retrieve the identifiers for the source and target of each edge. We’re using name as the identifier here; each application should choose an appropriate identifier, such as the `id function <https://neo4j.com/docs/cypher-manual/current/functions/scalar/#functions-id>`__ (if the dangers of ID reuse don’t apply).

We can execute a Cypher query using the `run method <https://py2neo.org/v4/database.html#py2neo.database.Graph.run>`__ of py2neo.Graph, which returns a Cursor object that has a `to_data_frame method <https://py2neo.org/v4/database.html#py2neo.database.Cursor.to_data_frame>`__ to convert the results to a columnar DataFrame. The StellarGraph type expects the columns for the nodes in an edge to be called source and target by default, so the query uses AS to ensure the DataFrame columns match those defaults.

[8]:
edges = neo4j_graph.run(
    """
    MATCH (s) --> (t)
    RETURN s.name AS source, t.name AS target
    """
).to_data_frame()
edges.head()
[8]:
source target
0 d a
1 a b
2 b c
3 a c
4 c d

We now have a DataFrame where each row represents an edge in the graph, which is exactly the format expected by the `StellarGraph constructor <https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph>`__. We can pass the DataFrame as the edges parameter:

[9]:
edges_only = StellarGraph(edges=edges)
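
The default source and target column names matter for the constructor call above. If a query cannot easily be changed to use AS, the columns could be renamed on the Pandas side instead. This is a minimal sketch using a hand-built DataFrame as a stand-in for a query result; the s.name and t.name column names are an assumption about what an un-aliased RETURN s.name, t.name would produce.

```python
import pandas as pd

# stand-in for the result of `RETURN s.name, t.name` (no AS aliases)
raw_edges = pd.DataFrame(
    {"s.name": ["d", "a", "b", "a", "c"], "t.name": ["a", "b", "c", "c", "d"]}
)

# rename to the column names StellarGraph expects by default
renamed_edges = raw_edges.rename(columns={"s.name": "source", "t.name": "target"})
print(renamed_edges)
```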

The `info method <https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.StellarGraph.info>`__ gives a high-level summary of a StellarGraph:

[10]:
print(edges_only.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: all 1 (default)

On this square, it tells us that there are 4 nodes of type default (a homogeneous graph still has node and edge types, but they default to default), with no features, and one type of edge between them. It also tells us that there are 5 edges of type default that go between nodes of type default. This matches what we expect: it’s a graph with 4 nodes and 5 edges and one type of each.

Homogeneous graph with features

For many real-world problems, we have more than just graph structure: we have information about the nodes and edges. For instance, we might have a graph of academic papers (nodes) and how they cite each other (edges): we might have information about the nodes such as the authors and the publication year, and even the abstract or full paper contents. If we’re doing a machine learning task, it can be useful to feed this information into models. The StellarGraph class supports this using another Pandas DataFrame: each row corresponds to a feature vector for a node.

We can create an appropriate DataFrame in the same way as we created the edges one, with a Cypher query that selects the relevant information. In this case, we need the name to match the rows of features to their node, and we’re also going to have 3 features:

[11]:
raw_homogeneous_nodes = neo4j_graph.run(
    """
    MATCH (n)
    RETURN n.name AS name, n.top, n.left, exists(n.bar_numbers)
    """
).to_data_frame()

raw_homogeneous_nodes
[11]:
name n.top n.left exists(n.bar_numbers)
0 a True True False
1 b True False True
2 c False False True
3 d False True True

StellarGraph uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Currently our dataframe just has a simple numeric range as the index, but it needs to be using the name column. Pandas offers a few ways to control the indexing; in this case, we want to replace the current index by moving the name column to it, which is done most easily with set_index:

[12]:
homogeneous_nodes = raw_homogeneous_nodes.set_index("name")
homogeneous_nodes
[12]:
n.top n.left exists(n.bar_numbers)
name
a True True False
b True False True
c False False True
d False True True

We’ve now got all the right node data, in addition to the edges from before, so now we can create a StellarGraph.

[13]:
homogeneous = StellarGraph(homogeneous_nodes, edges)
print(homogeneous.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 3
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: all 1 (default)

Notice the output of info now says that the nodes of the default type have 3 features.
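
The info output reports the features as a float32 vector even though the query returned booleans. The sketch below shows an equivalent explicit conversion in Pandas on a hand-built copy of the node DataFrame. It is only an illustration of the resulting values; how StellarGraph performs the conversion internally is not shown here and may differ.

```python
import pandas as pd

# hand-built copy of the node features loaded from Neo4j above
node_features = pd.DataFrame(
    {
        "n.top": [True, True, False, False],
        "n.left": [True, False, False, True],
        "exists(n.bar_numbers)": [False, True, True, True],
    },
    index=pd.Index(["a", "b", "c", "d"], name="name"),
)

# True becomes 1.0 and False becomes 0.0
numeric_features = node_features.astype("float32")
print(numeric_features)
```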

Homogeneous graph with edge weights

Some algorithms can understand edge weights, which can be used as a measure of the strength of the connection, or a measure of distance between nodes. A StellarGraph instance can have weighted edges, by including a weight column in the DataFrame of edges.

We can extend our Cypher query that loads the edge sources and targets to also load the weight property. As with node features, we could use any computation supported by Neo4j to calculate the weight, beyond just accessing a property as we do here.

[14]:
weighted_edges = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t)
    RETURN s.name AS source, t.name AS target, r.weight AS weight
    """
).to_data_frame()
weighted_edges
[14]:
source target weight
0 d a 5.67
1 a b 1.00
2 b c 0.20
3 a c 1.00
4 c d 3.40
[15]:
weighted_homogeneous = StellarGraph(homogeneous_nodes, weighted_edges)
print(weighted_homogeneous.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 3
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534

Notice the output of info now shows additional statistics about edge weights.
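
As a cross-check, the same statistics can be recomputed directly from the weight column with Pandas. This sketch uses a hand-built Series holding the five weights from the table above. The reported std matches the sample standard deviation (ddof=1), which is the Pandas default.

```python
import pandas as pd

# the five edge weights loaded from Neo4j above
weights = pd.Series([5.67, 1.0, 0.2, 1.0, 3.4], name="weight")

# Series.std() uses the sample standard deviation (ddof=1) by default
print(
    f"range=[{weights.min()}, {weights.max()}], "
    f"mean={weights.mean():.3f}, std={weights.std():.5f}"
)
```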

Directed graphs

Some graphs have edge directions, where going from source to target has a different meaning to going from target to source.

A directed graph can be created by using the StellarDiGraph class instead of the StellarGraph one. The construction is almost identical, and we can reuse any of the DataFrames that we created in the sections above. For instance, continuing from the previous cell, we can have a directed homogeneous graph with node features and edge weights.

[16]:
from stellargraph import StellarDiGraph

directed_weighted_homogeneous = StellarDiGraph(homogeneous_nodes, weighted_edges)
print(directed_weighted_homogeneous.info())
StellarDiGraph: Directed multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 3
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: range=[0.2, 5.67], mean=2.254, std=2.25534
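
To see what direction adds, the edge DataFrame can be used to count outgoing and incoming edges per node. This is a minimal Pandas sketch on a hand-built copy of the edges and is independent of StellarGraph: in the undirected view a and c both have three neighbours, but their in- and out-degrees differ.

```python
import pandas as pd

# hand-built copy of the edges loaded from Neo4j above
edges = pd.DataFrame(
    {"source": ["d", "a", "b", "a", "c"], "target": ["a", "b", "c", "c", "d"]}
)

# edges leaving each node, and edges arriving at each node
out_degree = edges["source"].value_counts().sort_index()
in_degree = edges["target"].value_counts().sort_index()

print(pd.DataFrame({"out": out_degree, "in": in_degree}))
```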

Heterogeneous graphs

Some graphs have multiple types of nodes and multiple types of edges. Each type might have different data associated with it.

For example, an academic citation network that includes authors might have wrote edges connecting author nodes to paper nodes, in addition to the cites edges between paper nodes. There could be `supervised edges between authors <https://academictree.org>`__ too, or any number of additional node and edge types. A knowledge graph (aka RDF, triple stores or knowledge base) is an extreme form of a heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types. Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.

StellarGraph supports all forms of heterogeneous graphs.

A heterogeneous StellarGraph can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with multiple elements instead of the single DataFrame we passed in the “homogeneous graph with features” section and others above. For a heterogeneous graph, a dictionary has to be passed; passing a single DataFrame does not work.

Multiple node types

The nodes of our square graph were given labels when we created them: a is of type foo, but b, c and d are of type bar. The foo node has an attribute foo_numbers that is a list/vector of numbers, and similarly the bar nodes have bar_numbers. These vectors might be some sort of summary of text associated with each node, or any other pre-computed information about the node to use as input to our machine learning algorithm.

The two types have properties with different names and different lengths: the foo node has a list of length 3, while all of the bar nodes have a list of length 2. We will load them into separate DataFrames with separate Cypher queries, first finding the node(s) of type foo and their properties, and then the same for the nodes of type bar.

[17]:
raw_foo_nodes = neo4j_graph.run(
    """
    MATCH (n:foo)
    RETURN n.name AS name, n.foo_numbers AS numbers
    """
).to_data_frame()
raw_foo_nodes
[17]:
name numbers
0 a [0.1, 0.2, 0.3]

In this case, our features are more complicated than just independent booleans that can become columns; instead we have a list that we need to turn into individual columns. One way is by converting the list column to a list of lists, and using Pandas’s constructor to convert this back to a DataFrame. We can set the index directly with this technique, and do not need to separately use set_index.

[18]:
import pandas as pd
[19]:
foo_nodes = pd.DataFrame(raw_foo_nodes["numbers"].tolist(), index=raw_foo_nodes["name"])
foo_nodes
[19]:
0 1 2
name
a 0.1 0.2 0.3

We’ve now got a DataFrame with 3 columns of numbers, as required!

We can do the same for the nodes of type bar to get a DataFrame with 2 columns of numbers:

[20]:
raw_bar_nodes = neo4j_graph.run(
    """
    MATCH (n:bar)
    RETURN n.name AS name, n.bar_numbers AS numbers
    """
).to_data_frame()

bar_nodes = pd.DataFrame(raw_bar_nodes["numbers"].tolist(), index=raw_bar_nodes["name"])
bar_nodes
[20]:
0 1
name
b 1.0 -2.0
c 34.0 5.6
d 0.7 -98.0
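
To illustrate why the two node types are kept in separate DataFrames, the sketch below expands lists of different lengths in a single DataFrame. It uses a hypothetical combined result (one foo row and one bar row) rather than a real query: Pandas pads the shorter row with NaN, which would not be a meaningful feature value.

```python
import pandas as pd

# hypothetical combined result: node a has 3 numbers, node b has 2
raw_nodes = pd.DataFrame(
    {
        "name": ["a", "b"],
        "numbers": [[0.1, 0.2, 0.3], [1, -2]],
    }
)

# expanding lists of different lengths pads the shorter rows with NaN
mixed = pd.DataFrame(raw_nodes["numbers"].tolist(), index=raw_nodes["name"])
print(mixed)
print("missing values:", int(mixed.isna().sum().sum()))
```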

We have the information for the two node types foo and bar in separate DataFrames, so we can now put them in a dictionary to create a StellarGraph. Notice that info() is now reporting multiple node types, as well as information specific to each.

[21]:
heterogeneous_nodes = StellarGraph({"foo": foo_nodes, "bar": bar_nodes}, edges)
print(heterogeneous_nodes.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-default->bar, bar-default->foo
  foo: [1]
    Features: float32 vector, length 3
    Edge types: foo-default->bar

 Edge types:
    foo-default->bar: [2]
        Weights: all 1 (default)
    bar-default->bar: [2]
        Weights: all 1 (default)
    bar-default->foo: [1]
        Weights: all 1 (default)

Multiple edge types

FIXME https://github.com/stellargraph/stellargraph/issues/1183

Graphs with multiple edge types are almost identical: instead of passing a single DataFrame or a dictionary with one element for the edges, pass a dictionary with multiple elements.

For example, our square graph has labelled each edge with its orientation. We can retrieve this using the `type function <https://neo4j.com/docs/cypher-manual/current/functions/scalar/#functions-type>`__ to get a DataFrame with a label column too.

[22]:
labelled_edges = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t)
    RETURN s.name AS source, t.name AS target, type(r) AS label
    """
).to_data_frame()

labelled_edges
[22]:
source target label
0 d a vertical
1 a b horizontal
2 b c vertical
3 a c diagonal
4 c d horizontal

We need to convert this to a dictionary with one DataFrame per edge type, by grouping on that column.

[23]:
# FIXME https://github.com/stellargraph/stellargraph/issues/1183
grouped = {name: df.drop(columns="label") for name, df in labelled_edges.groupby("label")}
grouped
[23]:
{'diagonal':   source target
 3      a      c,
 'horizontal':   source target
 1      a      b
 4      c      d,
 'vertical':   source target
 0      d      a
 2      b      c}

We now have a dictionary of the edges, so we can create a graph with one node type, but multiple edge types. Notice how info() shows 3 edge types.

[24]:
hetereogeneous_edges = StellarGraph(edges=grouped)
print(hetereogeneous_edges.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: none
    Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default

 Edge types:
    default-vertical->default: [2]
        Weights: all 1 (default)
    default-horizontal->default: [2]
        Weights: all 1 (default)
    default-diagonal->default: [1]
        Weights: all 1 (default)

The edges can be weighted if desired.
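
As a sketch of how weighted, typed edges might be prepared, the grouping step can be applied to a DataFrame that carries both a label and a weight column. The DataFrame here is hand-built as a stand-in for a query returning type(r) AS label and r.weight AS weight together; only the label column is dropped, so each per-type DataFrame keeps its weights.

```python
import pandas as pd

# stand-in for a query returning source, target, label and weight
labelled_weighted_edges = pd.DataFrame(
    {
        "source": ["d", "a", "b", "a", "c"],
        "target": ["a", "b", "c", "c", "d"],
        "label": ["vertical", "horizontal", "vertical", "diagonal", "horizontal"],
        "weight": [5.67, 1.0, 0.2, 1.0, 3.4],
    }
)

# one DataFrame per edge type, keeping the weight column
grouped_weighted = {
    name: df.drop(columns="label")
    for name, df in labelled_weighted_edges.groupby("label")
}
print(grouped_weighted["vertical"])
```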

StellarGraph supports multiple node types and multiple edge types at the same time:

[25]:
hetereogeneous_everything = StellarGraph({"foo": foo_nodes, "bar": bar_nodes}, grouped)
print(hetereogeneous_everything.info())
StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float32 vector, length 2
    Edge types: bar-diagonal->foo, bar-horizontal->bar, bar-horizontal->foo, bar-vertical->bar, bar-vertical->foo
  foo: [1]
    Features: float32 vector, length 3
    Edge types: foo-diagonal->bar, foo-horizontal->bar, foo-vertical->bar

 Edge types:
    foo-horizontal->bar: [1]
        Weights: all 1 (default)
    foo-diagonal->bar: [1]
        Weights: all 1 (default)
    bar-vertical->foo: [1]
        Weights: all 1 (default)
    bar-vertical->bar: [1]
        Weights: all 1 (default)
    bar-horizontal->bar: [1]
        Weights: all 1 (default)
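
The five edge types listed above are combinations of source node type, edge label and target node type. As a cross-check, these combinations can be counted in Pandas from hand-built copies of the data; this is only an illustration and not how StellarGraph computes its summary.

```python
import pandas as pd

# node label for each node, as created in the Dataset section
node_type = pd.Series({"a": "foo", "b": "bar", "c": "bar", "d": "bar"})

# hand-built copy of the labelled edges loaded from Neo4j above
labelled_edges = pd.DataFrame(
    {
        "source": ["d", "a", "b", "a", "c"],
        "target": ["a", "b", "c", "c", "d"],
        "label": ["vertical", "horizontal", "vertical", "diagonal", "horizontal"],
    }
)

# build "sourcetype-label->targettype" strings and count each combination
triples = (
    labelled_edges["source"].map(node_type)
    + "-"
    + labelled_edges["label"]
    + "->"
    + labelled_edges["target"].map(node_type)
)
triple_counts = triples.value_counts()
print(triple_counts)
```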

Subgraphs

In many cases, one wants to work with only a subgraph of the data that is stored in Neo4j. For example:

  • only some nodes and edges are interesting for the model, so one can avoid transferring data unnecessarily by filtering in the database
  • there’s only a small amount of data with labels for machine learning, so again one can reduce how much data is transferred
  • it’s faster and easier to explore and experiment with a smaller version of a huge graph

The Cypher queries we’re using to load our data can be extended to do these.

Node/edge filtering

One type of subgraph in which someone might be interested is one where the nodes and/or edges satisfy certain criteria. This can be done by applying filters like a `WHERE clause <https://neo4j.com/docs/cypher-manual/current/clauses/where/>`__ to the Cypher queries.

For instance, maybe we only want to load nodes that are either on the left of the square or on the bottom or both (meaning, not b, which is the top right corner).

[26]:
raw_subgraph_nodes = neo4j_graph.run(
    """
    MATCH (n)
    WHERE n.left OR NOT n.top
    RETURN n.name AS name, n.left, n.top
    """
).to_data_frame()

subgraph_nodes = raw_subgraph_nodes.set_index("name")
subgraph_nodes
[26]:
n.left n.top
name
a True True
c False False
d True False

We’ve got a set of nodes, and we now need the edges that connect these nodes, and only these nodes. We should not have any edges that involve nodes we didn’t select. For our example, that means we need to find the 3 edges between the a, c and d nodes, and avoid the a-b and b-c edges.

One way to do this is to start with the query for all edges and add a WHERE clause that restricts it to the nodes of interest. That filter can be written in two ways:

  • pass the identifiers for the selected nodes as parameters into the queries and perform a match with IN against the identifiers
  • reproduce the same filtering on the source and target nodes of each edge

The first option can look something like:

[27]:
subgraph_edges = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t)
    WHERE s.name IN $node_names AND t.name IN $node_names
    RETURN s.name AS source, t.name AS target
    """,
    {"node_names": list(subgraph_nodes.index)},
).to_data_frame()

subgraph_edges
[27]:
source target
0 d a
1 a c
2 c d
[28]:
subgraph = StellarGraph(subgraph_nodes, subgraph_edges)
print(subgraph.info())
StellarGraph: Undirected multigraph
 Nodes: 3, Edges: 3

 Node types:
  default: [3]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [3]
        Weights: all 1 (default)

The second option can look something like:

[29]:
subgraph_edges_refilter = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t)
    WHERE (s.left OR NOT s.top) AND (t.left OR NOT t.top)
    RETURN s.name AS source, t.name AS target
    """
).to_data_frame()

subgraph_edges_refilter
[29]:
source target
0 d a
1 a c
2 c d
[30]:
subgraph_refilter = StellarGraph(subgraph_nodes, subgraph_edges_refilter)
print(subgraph_refilter.info())
StellarGraph: Undirected multigraph
 Nodes: 3, Edges: 3

 Node types:
  default: [3]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [3]
        Weights: all 1 (default)

Similar filtering can be applied to edges, such as only including edges with specific types or anything more complicated than that. This can happen in addition to any node filtering, by expanding the WHERE clause in the edge query to filter based on the source and target nodes and on whatever criteria one has chosen for edges.
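
For small graphs, or as a cross-check of the Cypher filters, the same edge selection can be reproduced in Pandas after loading every edge. This sketch uses hand-built data. It transfers all edges first, so it gives up the main benefit of filtering in the database.

```python
import pandas as pd

# hand-built copy of all edges in the square
all_edges = pd.DataFrame(
    {"source": ["d", "a", "b", "a", "c"], "target": ["a", "b", "c", "c", "d"]}
)
selected_nodes = ["a", "c", "d"]

# keep only edges whose source and target are both selected
keep = all_edges["source"].isin(selected_nodes) & all_edges["target"].isin(selected_nodes)
filtered_edges = all_edges[keep].reset_index(drop=True)
print(filtered_edges)
```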

k-Hop subgraphs

Another sort of subgraph in which one might be interested is a “k-hop” subgraph of a set of start nodes. This refers to all nodes where the length of the path (number of edges) to a start node is at most k. For example, the 1-hop subgraph around b in the square is nodes a, b and c, because the shortest path from b to d is two edges.

Many graph machine learning algorithms only use a small neighbourhood of a node for influencing the predictions of the model, commonly in the form of its 1-, 2- or 3-hop subgraph. If we’re only interested in feeding small groups of nodes into a model, we can work with just the neighbourhoods of those nodes and avoid loading the rest of the potentially-large graph. This might apply in cases like:

  • only a small number of nodes have ground-truth labels for training a model
  • a trained model is being used to predict on only a small group of nodes of interest

For many cases, the nodes in the subgraph can be calculated with a Cypher query that uses a variable-length relationship constraint. For instance, if we’re computing the 1-hop subgraph around the b node, we might do something like the following cell. Some notes about it:

  • the *0..1 means a path of 0 to 1 edges; the 0 is important to make sure we include the b node in the final subgraph too. For a 2-hop subgraph, this should be (start) -[*0..2]- (n)
  • it uses a list to easily support using multiple start nodes, which will be more common
[31]:
start_nodes = ["b"]

raw_hop_nodes = neo4j_graph.run(
    """
    MATCH (start) -[*0..1]- (n)
    WHERE start.name IN $start_nodes
    WITH DISTINCT n
    RETURN n.name AS name, n.top, n.left
    """,
    {"start_nodes": start_nodes},
).to_data_frame()

hop_nodes = raw_hop_nodes.set_index("name")
hop_nodes
[31]:
n.top n.left
name
b True False
a True True
c False False

Once we’ve got the nodes, we can do the same process as in the previous section to get the edges between the nodes.

[32]:
hop_edges = neo4j_graph.run(
    """
    MATCH (s) -[r]-> (t)
    WHERE s.name IN $node_names AND t.name IN $node_names
    RETURN s.name AS source, t.name AS target
    """,
    {"node_names": list(hop_nodes.index)},
).to_data_frame()

hop_edges
[32]:
source target
0 a b
1 b c
2 a c
[33]:
hop_subgraph = StellarGraph(hop_nodes, hop_edges)
print(hop_subgraph.info())
StellarGraph: Undirected multigraph
 Nodes: 3, Edges: 3

 Node types:
  default: [3]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [3]
        Weights: all 1 (default)

One can expand the query to do more complicated computations, such as filtering which type of edges are included in the paths (like [:horizontal*0..1] to only follow horizontal edges), or which nodes are considered with WHERE clauses as in the previous section.

The `apoc.path.subgraphNodes procedure <https://neo4j.com/docs/labs/apoc/current/graph-querying/expand-subgraph-nodes/>`__ from the APOC library offers more control too.
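
To make the definition of a k-hop subgraph concrete, here is a small pure-Python breadth-first search over an edge list. It is an illustrative sketch (the function name k_hop_nodes is hypothetical) that ignores edge direction, like the undirected pattern in the Cypher query above; it is not a replacement for running the query in the database.

```python
from collections import deque


def k_hop_nodes(edges, start_nodes, k):
    """Return the set of nodes within k undirected hops of any start node."""
    # build an undirected adjacency map from (source, target) pairs
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)

    # breadth-first search from all start nodes at once
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for other in neighbours.get(node, ()):
            if other not in distance:
                distance[other] = distance[node] + 1
                queue.append(other)
    return set(distance)


square = [("d", "a"), ("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(sorted(k_hop_nodes(square, ["b"], 1)))
```

With k=1 and start node b this gives the same three nodes as the query above, and with k=2 it reaches d as well.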

Saving predictions into Neo4j

Most graph machine learning tasks will end up with some sort of predictions about some set of nodes or links in the graph. For example, `a node classification task <../node-classification/gcn/gcn-cora-node-classification-example.ipynb>`__ might result in either predicted scores for each class for a node, or even just the single class that is the most likely. The formats of these are usually:

  • scores: a multidimensional NumPy array. In the node classification example linked above, it’s an array of floats of shape (1, 2708, 7), where each element along the axis of size 2708 represents a node, and the 7 numbers for that element represent the scores for each of the 7 classes for that node.
  • classes: a one-dimensional NumPy array. In the node classification example linked above, it’s an array of strings of length 2708, where each element represents the predicted class for a node.

For our graph, let’s suppose we have finished predicting the class of a node, with three classes X, Y and Z, and now want to save them back into the Neo4j database to use for visualisation and downstream tasks. For this hypothetical example, we were only interested in predictions for nodes a and b.

The result of the task and all post-processing might be something like:

[34]:
import numpy as np

predicted_nodes = ["a", "b"]
predicted_scores = np.array([[[0.1, 0.8, 0.1], [0.4, 0.35, 0.25]]])  # rows: a, b
predicted_class = np.array(["Y", "X"])
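
The two arrays above are related: the class is typically the highest-scoring entry for each node. The sketch below shows one way the class array might be derived from the scores. The class_names array and its ordering (X, Y, Z matching the score columns) are assumptions for illustration; in practice the order comes from whatever encoded the labels during training.

```python
import numpy as np

# assumed ordering of the score columns
class_names = np.array(["X", "Y", "Z"])
scores = np.array([[[0.1, 0.8, 0.1], [0.4, 0.35, 0.25]]])  # shape (1, 2, 3)

# drop the leading axis of size 1, then pick the highest-scoring class per node
best_index = scores[0].argmax(axis=1)
classes = class_names[best_index]
print(classes)
```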

We want to update the Neo4j database to hold the scores in a predicted_class_scores property and the class itself in a predicted_class property for each of the nodes with predictions. This can be achieved with a parameterised Cypher query using UNWIND and SET. For this, we need to have the data as a sequence with one record for each node.

[35]:
predictions = [
    {"name": name, "scores": list(scores), "class": class_}
    for name, scores, class_ in zip(predicted_nodes, predicted_scores[0], predicted_class)
]
predictions
[35]:
[{'name': 'a', 'scores': [0.1, 0.8, 0.1], 'class': 'Y'},
 {'name': 'b', 'scores': [0.4, 0.35, 0.25], 'class': 'X'}]
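
One detail worth knowing when building such records: list(scores) on a NumPy array keeps NumPy scalar elements, while the array's tolist() method converts them to built-in Python floats. The cells in this notebook use list(scores). If a driver or serialiser were to reject NumPy scalars (an assumption, not something observed in this notebook), tolist() would be a simple alternative.

```python
import numpy as np

scores = np.array([0.1, 0.8, 0.1])

as_numpy_scalars = list(scores)  # elements are numpy.float64
as_python_floats = scores.tolist()  # elements are built-in float

print(type(as_numpy_scalars[0]).__name__, type(as_python_floats[0]).__name__)
```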

Now we can execute the query. The UNWIND means that prediction holds each of the dictionaries in turn; for each one, we find the relevant node and update its properties as desired.

[36]:
neo4j_graph.evaluate(
    """
    UNWIND $predictions AS prediction
    MATCH (n { name: prediction.name })
    SET n.predicted_class_scores = prediction.scores
    SET n.predicted_class = prediction.class
    """,
    {"predictions": predictions},
)

To verify that this behaved as desired, let’s read back all the nodes, to see that a and b were updated with the right information.

[37]:
verification_data = neo4j_graph.run(
    "MATCH (n) RETURN n.name, n.predicted_class_scores, n.predicted_class"
).to_data_frame()

verification_data.sort_values("n.name")  # sort for ease of reference
[37]:
n.name n.predicted_class_scores n.predicted_class
0 a [0.1, 0.8, 0.1] Y
1 b [0.4, 0.35, 0.25] X
2 c None None
3 d None None

Conclusion

This notebook demonstrated many ways to read data from Neo4j into a StellarGraph graph object, for many types of graphs:

  • with or without node features
  • with or without edge weights
  • directed or not
  • homogeneous or heterogeneous

We used the py2neo library to run Cypher queries to create Pandas DataFrames, which we could load into StellarGraph objects. The process for loading from Pandas DataFrames is explored in more detail in the loading via Pandas demonstration, which has more discussion and explanations of every option for finer control.

This notebook also demonstrated saving the results of a graph machine learning algorithm back into Neo4j to use for visualisation and other tasks.

Revisit this document to use as a reminder.

Once you’ve loaded your data, you can start doing machine learning: a good place to start is the demo of the GCN algorithm on the Cora dataset for node classification. Additionally, StellarGraph includes many other demos of other algorithms, solving other tasks. We also have experimental support for running some algorithms directly using Neo4j.

(We’re still exploring the best ways to have StellarGraph work with Neo4j, so please let us know your experience of using StellarGraph with Neo4j, both positive and negative.)

[38]:
# clean everything up, so that we're not leaving the square graph in the Neo4j instance
neo4j_graph.delete_all()
