Execute this notebook: Download locally

# Loading and saving data between StellarGraph and Neo4j

This demo explains how to load data from Neo4j into a form that can be used by the StellarGraph library, and how to save predictions back into the database. See all other demos.

The StellarGraph library supports loading graph information from Neo4j. Neo4j is a popular graph database.

If your data is already in Neo4j, this is a great way to load it. If not, loading via another route is likely to be faster and potentially more convenient.

This notebook demonstrates one approach to connecting StellarGraph and Neo4j. It uses the SQL-like Cypher language to read a graph or subgraph from Neo4j into Pandas DataFrames, and then uses these to construct a `StellarGraph` object (following the same techniques as in the loading via Pandas demo, which has more details about that aspect). This notebook assumes some familiarity with Cypher constructs like `MATCH`, `RETURN` and `WHERE`. This notebook uses the Py2neo library to interact with a Neo4j instance.

StellarGraph also has experimental support for running some algorithms directly using Neo4j.

This notebook walks through the following scenarios for loading and storing graphs:

- homogeneous graph without features (a homogeneous graph is one with only one type of node and one type of edge)
- homogeneous graph with features
- homogeneous graph with edge weights
- directed graphs (a graph is directed if edges have “start” and “end” nodes, instead of just connecting two nodes)
- heterogeneous graphs (more than one node type and/or more than one edge type) with and without node features or edge weights; this includes knowledge graphs
- subgraphs (an example of filtering which nodes and edges are loaded)
- saving predictions into Neo4j

StellarGraph supports loading data from many sources with all sorts of data preprocessing, via Pandas DataFrames, NumPy arrays, Neo4j and NetworkX graphs. See all loading demos for more details.

The `StellarGraph` class is available at the top level of the `stellargraph` library:

```
[3]:
```

```
from stellargraph import StellarGraph
```

## Connecting to Neo4j

To read anything from Neo4j, we’ll need a connection to a running instance.

```
[4]:
```

```
import os
import py2neo
default_host = os.environ.get("STELLARGRAPH_NEO4J_HOST")
# Create the Neo4j Graph database object; the parameters can be edited to specify location and authentication
neo4j_graph = py2neo.Graph(host=default_host, port=None, user=None, password=None)
```

## Dataset

We’ll be working with a graph representing a square with a diagonal. We’ll give the `a` node the label `foo` and the other nodes the label `bar`, along with some features. We’ll also give each edge a label matching its orientation and a weight.

```
a -- b
| \  |
|  \ |
d -- c
```

This section uses the types from `py2neo` to seed our Neo4j instance with the example data. For real work involving StellarGraph and Neo4j, the real data would be loaded into the database via some external process. However, we need some data to work with for this demo and so we need to have the cells in this section. They can be **safely ignored**, and removed for real work.

```
[5]:
```

```
from py2neo.data import Node, Relationship, Subgraph
a = Node("foo", name="a", top=True, left=True, foo_numbers=[0.1, 0.2, 0.3])
b = Node("bar", name="b", top=True, left=False, bar_numbers=[1, -2])
c = Node("bar", name="c", top=False, left=False, bar_numbers=[34, 5.6])
d = Node("bar", name="d", top=False, left=True, bar_numbers=[0.7, -98])
ab = Relationship(a, "horizontal", b, weight=1.0)
bc = Relationship(b, "vertical", c, weight=0.2)
cd = Relationship(c, "horizontal", d, weight=3.4)
da = Relationship(d, "vertical", a, weight=5.67)
ac = Relationship(a, "diagonal", c, weight=1.0)
subgraph = Subgraph([a, b, c, d], [ab, bc, cd, da, ac])
```

We don’t want to accidentally overwrite or delete important data or add junk in a production Neo4j instance. As a check, this demo requires the Neo4j instance to be empty. If the `neo4j_graph` connection is to a non-empty database, please either:

- delete everything from it (there’s a cell at the end of the notebook that can be used, if that’s ok)
- start a new instance, adjust the parameters to `py2neo.Graph` above to connect to it, and rerun the cells from there

```
[6]:
```

```
num_nodes = len(neo4j_graph.nodes)
num_relationships = len(neo4j_graph.relationships)
if num_nodes > 0 or num_relationships > 0:
    raise ValueError(
        f"neo4j_graphdb: expected an empty database to give a reliable result and to avoid corrupting your data with mutations & the `delete_all` in the last cell, found {num_nodes} nodes and {num_relationships} relationships in the database already"
    )
```

Finally, we can fill the database by writing our example data to the database.

```
[7]:
```

```
neo4j_graph.create(subgraph)
# basic check that the database has the right data
assert len(neo4j_graph.nodes) == 4
assert len(neo4j_graph.relationships) == 5
```

## Homogeneous graph without features (edges only)

We’ll start with a homogeneous graph without any node features. This means the graph consists of only nodes and edges without any information other than a unique identifier. To simulate this, we will be ignoring all of the properties we added except the `name` property, which is a unique identifier for each node.

We can use a single Cypher query to retrieve the identifiers for the source and target of each edge. We’re using `name` as the identifier here, and each application should choose an appropriate identifier, such as the result of the `id(...)` function, if the dangers of ID reuse don’t apply.

We can execute a Cypher query using the `run` method of `py2neo.Graph`, which returns a `Cursor` object that has a `to_data_frame` method to convert the results to a columnar DataFrame. The `StellarGraph` type expects the columns for the nodes in an edge to be called `source` and `target` by default, so the query uses `AS` to ensure the DataFrame columns match those defaults.

```
[8]:
```

```
edges = neo4j_graph.run(
"""
MATCH (s) --> (t)
RETURN s.name AS source, t.name AS target
"""
).to_data_frame()
edges.head()
```

```
[8]:
```

| | source | target |
|---|---|---|
| 0 | d | a |
| 1 | a | b |
| 2 | b | c |
| 3 | a | c |
| 4 | c | d |

We now have a DataFrame where each row represents an edge in the graph, which is exactly the format expected by the `StellarGraph` constructor. We can pass the DataFrame as the `edges` parameter:

```
[9]:
```

```
edges_only = StellarGraph(edges=edges)
```

The `info` method gives a high-level summary of a `StellarGraph`:

```
[10]:
```

```
print(edges_only.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
default: [4]
Features: none
Edge types: default-default->default
Edge types:
default-default->default: [5]
Weights: all 1 (default)
```

On this square, it tells us that there are 4 nodes of type `default` (a homogeneous graph still has node and edge types, but they default to `default`), with no features, and one type of edge between them. It also tells us that there are 5 edges of type `default` that go between nodes of type `default`. This matches what we expect: it’s a graph with 4 nodes and 5 edges and one type of each.

## Homogeneous graph with features

For many real-world problems, we have more than just graph structure: we have information about the nodes and edges. For instance, we might have a graph of academic papers (nodes) and how they cite each other (edges): we might have information about the nodes such as the authors and the publication year, and even the abstract or full paper contents. If we’re doing a machine learning task, it can be useful to feed this information into models. The `StellarGraph` class supports this using another Pandas DataFrame: each row corresponds to a feature vector for a node.

We can create an appropriate DataFrame in the same way as we created the edges one, with a Cypher query that selects the relevant information. In this case, we need the `name` to match the rows of features to their node, and we’re also going to have 3 features:

- the `top` and `left` properties from each node, as two of the features
- whether the `bar_numbers` property exists on the node, using the `exists` function: this is a demonstration that features don’t have to be just properties, but can be calculated with any computation supported by Neo4j

```
[11]:
```

```
raw_homogeneous_nodes = neo4j_graph.run(
"""
MATCH (n)
RETURN n.name AS name, n.top, n.left, exists(n.bar_numbers)
"""
).to_data_frame()
raw_homogeneous_nodes
```

```
[11]:
```

| | name | n.top | n.left | exists(n.bar_numbers) |
|---|---|---|---|---|
| 0 | a | True | True | False |
| 1 | b | True | False | True |
| 2 | c | False | False | True |
| 3 | d | False | True | True |

`StellarGraph` uses the index of the DataFrame as the connection between a node and a row of the DataFrame. Currently our DataFrame just has a simple numeric range as the index, but it needs to be using the `name` column. Pandas offers a few ways to control the indexing; in this case, we want to replace the current index by moving the `name` column to it, which is done most easily with `set_index`:

```
[12]:
```

```
homogeneous_nodes = raw_homogeneous_nodes.set_index("name")
homogeneous_nodes
```

```
[12]:
```

| name | n.top | n.left | exists(n.bar_numbers) |
|---|---|---|---|
| a | True | True | False |
| b | True | False | True |
| c | False | False | True |
| d | False | True | True |

We’ve now got all the right node data, in addition to the edges from before, so now we can create a `StellarGraph`.

```
[13]:
```

```
homogeneous = StellarGraph(homogeneous_nodes, edges)
print(homogeneous.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
default: [4]
Features: float32 vector, length 3
Edge types: default-default->default
Edge types:
default-default->default: [5]
Weights: all 1 (default)
```

Notice the output of `info` now says that the nodes of the `default` type have 3 features.

### Homogeneous graph with edge weights

Some algorithms can understand edge weights, which can be used as a measure of the strength of the connection, or a measure of distance between nodes. A `StellarGraph` instance can have weighted edges, by including a `weight` column in the DataFrame of edges.

We can extend our Cypher query that loads the edge sources and targets to also load the `weight` property. As with node features, we could use any computation supported by Neo4j to calculate the weight, beyond just accessing a property as we do here.

```
[14]:
```

```
weighted_edges = neo4j_graph.run(
"""
MATCH (s) -[r]-> (t)
RETURN s.name AS source, t.name AS target, r.weight AS weight
"""
).to_data_frame()
weighted_edges
```

```
[14]:
```

| | source | target | weight |
|---|---|---|---|
| 0 | d | a | 5.67 |
| 1 | a | b | 1.00 |
| 2 | b | c | 0.20 |
| 3 | a | c | 1.00 |
| 4 | c | d | 3.40 |

```
[15]:
```

```
weighted_homogeneous = StellarGraph(homogeneous_nodes, weighted_edges)
print(weighted_homogeneous.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
default: [4]
Features: float32 vector, length 3
Edge types: default-default->default
Edge types:
default-default->default: [5]
Weights: range=[0.2, 5.67], mean=2.254, std=2.25534
```

Notice the output of `info` now shows additional statistics about edge weights.
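To make those statistics concrete, the following sketch recomputes them with the Python standard library for the five weights of the square. It is only an illustration: that `info` reports the sample standard deviation (the `n - 1` denominator) is an inference from the numbers shown above, not something this notebook states.

```python
import statistics

# Edge weights of the example square, in the order returned by the query above.
weights = [5.67, 1.00, 0.20, 1.00, 3.40]

weight_range = [min(weights), max(weights)]
weight_mean = statistics.mean(weights)
# Assumption: the sample standard deviation (n - 1 denominator) is used,
# because that is what reproduces the value printed by `info` for these weights.
weight_std = statistics.stdev(weights)

print(f"range={weight_range}, mean={weight_mean:.3f}, std={weight_std:.5f}")
```

The population standard deviation (`statistics.pstdev`) gives a noticeably smaller value on such a small sample, which is worth remembering when comparing against other tools.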

## Directed graphs

Some graphs have edge directions, where going from source to target has a different meaning to going from target to source.

A directed graph can be created by using the `StellarDiGraph` class instead of the `StellarGraph` one. The construction is almost identical, and we can reuse any of the DataFrames that we created in the sections above. For instance, continuing from the previous cell, we can have a directed homogeneous graph with node features and edge weights.

```
[16]:
```

```
from stellargraph import StellarDiGraph
directed_weighted_homogeneous = StellarDiGraph(homogeneous_nodes, weighted_edges)
print(directed_weighted_homogeneous.info())
```

```
StellarDiGraph: Directed multigraph
Nodes: 4, Edges: 5
Node types:
default: [4]
Features: float32 vector, length 3
Edge types: default-default->default
Edge types:
default-default->default: [5]
Weights: range=[0.2, 5.67], mean=2.254, std=2.25534
```
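As a small illustration of why direction matters, the sketch below compares the neighbours of `a` in the square when edge direction is respected and when it is ignored. It uses plain Python tuples rather than StellarGraph, and the helper names are hypothetical.

```python
# Edges of the example square as (source, target) pairs, in the direction they were created.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]


def directed_successors(node):
    """Nodes reached by following one edge forwards from `node`."""
    return sorted(t for s, t in edges if s == node)


def undirected_neighbours(node):
    """Nodes that share an edge with `node`, ignoring edge direction."""
    outgoing = {t for s, t in edges if s == node}
    incoming = {s for s, t in edges if t == node}
    return sorted(outgoing | incoming)


print(directed_successors("a"))    # ['b', 'c']
print(undirected_neighbours("a"))  # ['b', 'c', 'd']
```

The `d` to `a` edge only counts as a connection from `a` when direction is ignored, which is the distinction between the two classes.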

## Heterogeneous graphs

Some graphs have multiple types of nodes and multiple types of edges. Each type might have different data associated with it.

For example, an academic citation network that includes authors might have `wrote` edges connecting `author` nodes to `paper` nodes, in addition to the `cites` edges between `paper` nodes. There could be `supervised` edges between `author`s too, or any number of additional node and edge types. A knowledge graph (aka RDF, triple stores or knowledge base) is an extreme form of a heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types. Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.

`StellarGraph` supports all forms of heterogeneous graphs.

A heterogeneous `StellarGraph` can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with multiple elements instead of the single DataFrame we passed in the “homogeneous graph with features” section and others above. For a heterogeneous graph, a dictionary has to be passed; passing a single DataFrame does not work.

### Multiple node types

The nodes of our square graph were given labels when we created them: `a` is of type `foo`, but `b`, `c` and `d` are of type `bar`. The `foo` node has an attribute `foo_numbers` that is a list/vector of numbers, and similarly the `bar` nodes have `bar_numbers`. These vectors might be some sort of summary of text associated with each node, or any other precomputed information about the node to use as input to our machine learning algorithm.

The two types have properties with different names, and the lists have different lengths: the `foo` node has a list of length 3, while all of the `bar` nodes have a list of length 2. We will load them into separate DataFrames with separate Cypher queries, first finding the node(s) of type `foo` and their properties, and then the same for the nodes of type `bar`.

```
[17]:
```

```
raw_foo_nodes = neo4j_graph.run(
"""
MATCH (n:foo)
RETURN n.name AS name, n.foo_numbers AS numbers
"""
).to_data_frame()
raw_foo_nodes
```

```
[17]:
```

| | name | numbers |
|---|---|---|
| 0 | a | [0.1, 0.2, 0.3] |

In this case, our features are more complicated than just independent booleans that can become columns; instead we have a list that we need to turn into individual columns. One way is by converting the list column to a list of lists, and using Pandas’s constructor to convert this back to a DataFrame. We can set the index directly with this technique, and do not need to separately use `set_index`.

```
[18]:
```

```
import pandas as pd
```

```
[19]:
```

```
foo_nodes = pd.DataFrame(raw_foo_nodes["numbers"].tolist(), index=raw_foo_nodes["name"])
foo_nodes
```

```
[19]:
```

| name | 0 | 1 | 2 |
|---|---|---|---|
| a | 0.1 | 0.2 | 0.3 |

We’ve now got a DataFrame with 3 columns of numbers, as required!

We can do the same for the nodes of type `bar` to get a DataFrame with 2 columns of numbers:

```
[20]:
```

```
raw_bar_nodes = neo4j_graph.run(
"""
MATCH (n:bar)
RETURN n.name AS name, n.bar_numbers AS numbers
"""
).to_data_frame()
bar_nodes = pd.DataFrame(raw_bar_nodes["numbers"].tolist(), index=raw_bar_nodes["name"])
bar_nodes
```

```
[20]:
```

| name | 0 | 1 |
|---|---|---|
| b | 1.0 | -2.0 |
| c | 34.0 | 5.6 |
| d | 0.7 | -98.0 |

We have the information for the two node types `foo` and `bar` in separate DataFrames, so we can now put them in a dictionary to create a `StellarGraph`. Notice that `info()` is now reporting multiple node types, as well as information specific to each.

```
[21]:
```

```
heterogeneous_nodes = StellarGraph({"foo": foo_nodes, "bar": bar_nodes}, edges)
print(heterogeneous_nodes.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
bar: [3]
Features: float32 vector, length 2
Edge types: bar-default->bar, bar-default->foo
foo: [1]
Features: float32 vector, length 3
Edge types: foo-default->bar
Edge types:
foo-default->bar: [2]
Weights: all 1 (default)
bar-default->bar: [2]
Weights: all 1 (default)
bar-default->foo: [1]
Weights: all 1 (default)
```
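With many node types, writing one query per type becomes repetitive. One possible alternative, sketched here under stated assumptions, is a single query that also returns each node’s label, followed by grouping the rows by type on the client. The `records` list is a hypothetical stand-in for such a query’s result (values copied from the square graph), and the `label` and `numbers` keys are illustrative names only.

```python
from collections import defaultdict

# Hypothetical records, shaped like the rows of one query returning each node's
# name, its label and its feature list.
records = [
    {"name": "a", "label": "foo", "numbers": [0.1, 0.2, 0.3]},
    {"name": "b", "label": "bar", "numbers": [1, -2]},
    {"name": "c", "label": "bar", "numbers": [34, 5.6]},
    {"name": "d", "label": "bar", "numbers": [0.7, -98]},
]

# Group the rows by node type: {type: {node name: feature list}}.
features_by_type = defaultdict(dict)
for record in records:
    features_by_type[record["label"]][record["name"]] = record["numbers"]

# Feature vectors need a consistent length within each type.
for node_type, rows in features_by_type.items():
    lengths = {len(numbers) for numbers in rows.values()}
    assert len(lengths) == 1, f"ragged features for type {node_type!r}"
    print(node_type, sorted(rows), lengths.pop())
```

Each inner mapping could then be turned into a per-type DataFrame, for instance with `pd.DataFrame.from_dict(rows, orient="index")`, to build the same kind of dictionary as passed to `StellarGraph` above.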

### Multiple edge types

Graphs with multiple edge types are simpler. Since we have no features on the edges, we can pass a DataFrame with an additional column for the type, specifying it via the `edge_type_column` parameter. (Multiple edge types can also be created in the same way as multiple node types, by passing a dictionary of DataFrames, but this is not necessary.)

For example, our square graph has labelled each edge with its orientation. We can retrieve this using the `type` function to get a DataFrame with a label column too.

```
[22]:
```

```
labelled_edges = neo4j_graph.run(
"""
MATCH (s) -[r]-> (t)
RETURN s.name AS source, t.name AS target, type(r) AS label
"""
).to_data_frame()
labelled_edges
```

```
[22]:
```

| | source | target | label |
|---|---|---|---|
| 0 | d | a | vertical |
| 1 | a | b | horizontal |
| 2 | b | c | vertical |
| 3 | a | c | diagonal |
| 4 | c | d | horizontal |

We now have a DataFrame of the edges with a column for their type, so we can create a graph with one node type, but multiple edge types. Notice how `info()` shows 3 edge types.

```
[23]:
```

```
heterogeneous_edges = StellarGraph(edges=labelled_edges, edge_type_column="label")
print(heterogeneous_edges.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
default: [4]
Features: none
Edge types: default-diagonal->default, default-horizontal->default, default-vertical->default
Edge types:
default-vertical->default: [2]
Weights: all 1 (default)
default-horizontal->default: [2]
Weights: all 1 (default)
default-diagonal->default: [1]
Weights: all 1 (default)
```

The edges can be weighted if desired.

`StellarGraph` supports multiple node types and multiple edge types at the same time:

```
[24]:
```

```
heterogeneous_everything = StellarGraph(
    {"foo": foo_nodes, "bar": bar_nodes}, labelled_edges, edge_type_column="label"
)
print(heterogeneous_everything.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
bar: [3]
Features: float32 vector, length 2
Edge types: bar-diagonal->foo, bar-horizontal->bar, bar-horizontal->foo, bar-vertical->bar, bar-vertical->foo
foo: [1]
Features: float32 vector, length 3
Edge types: foo-diagonal->bar, foo-horizontal->bar, foo-vertical->bar
Edge types:
foo-horizontal->bar: [1]
Weights: all 1 (default)
foo-diagonal->bar: [1]
Weights: all 1 (default)
bar-vertical->foo: [1]
Weights: all 1 (default)
bar-vertical->bar: [1]
Weights: all 1 (default)
bar-horizontal->bar: [1]
Weights: all 1 (default)
```

## Subgraphs

In many cases, one wants to work with only a subgraph of the data that is stored in Neo4j. For example:

- only some nodes and edges are interesting for the model, so one can avoid transferring data unnecessarily by filtering in the database
- there’s only a small amount of data with labels for machine learning, so again one can reduce how much data is transferred
- it’s faster and easier to explore and experiment with a smaller version of a huge graph

The Cypher queries we’re using to load our data can be extended to do these.

### Node/edge filtering

One type of subgraph in which someone might be interested is one where the nodes and/or edges satisfy certain criteria. This can be done by applying filters like a `WHERE` clause to the Cypher queries.

For instance, maybe we only want to load nodes that are either on the left of the square or on the bottom or both (meaning, not `b`, which is the top right corner).

```
[25]:
```

```
raw_subgraph_nodes = neo4j_graph.run(
"""
MATCH (n)
WHERE n.left OR NOT n.top
RETURN n.name AS name, n.left, n.top
"""
).to_data_frame()
subgraph_nodes = raw_subgraph_nodes.set_index("name")
subgraph_nodes
```

```
[25]:
```

| name | n.left | n.top |
|---|---|---|
| a | True | True |
| c | False | False |
| d | True | False |

We’ve got a set of nodes, and we now need the edges that connect these nodes, and only these nodes. We should not have any edges that involve nodes we didn’t select. For our example, that means we need to find the 3 edges between the `a`, `c` and `d` nodes, and avoid the `a`-`b` and `b`-`c` edges.

One way to do this is to start with the query for all edges and add a `WHERE` clause to filter to the nodes of interest, which might be done in two ways:

- pass the identifiers for the selected nodes as parameters into the queries and perform a match with `IN` against the identifiers
- reproduce the same filtering on the source and target nodes of each edge

The first option can look something like:

```
[26]:
```

```
subgraph_edges = neo4j_graph.run(
"""
MATCH (s) -[r]-> (t)
WHERE s.name IN $node_names AND t.name IN $node_names
RETURN s.name AS source, t.name AS target
""",
{"node_names": list(subgraph_nodes.index)},
).to_data_frame()
subgraph_edges
```

```
[26]:
```

| | source | target |
|---|---|---|
| 0 | d | a |
| 1 | a | c |
| 2 | c | d |
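As a sanity check on a result like this, one might confirm on the client that no returned edge touches a node outside the selection. The sketch below does so with plain Python; the function name `edges_outside_selection` is hypothetical, and the edge lists are copied from the tables in this notebook.

```python
def edges_outside_selection(node_names, edges):
    """Return the edges that have at least one endpoint not in `node_names`."""
    selected = set(node_names)
    return [(s, t) for s, t in edges if s not in selected or t not in selected]


selected_nodes = ["a", "c", "d"]
all_edges = [("d", "a"), ("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
filtered_edges = [("d", "a"), ("a", "c"), ("c", "d")]

print(edges_outside_selection(selected_nodes, all_edges))       # [('a', 'b'), ('b', 'c')]
print(edges_outside_selection(selected_nodes, filtered_edges))  # []
```

An empty result for the filtered edges means every edge connects two selected nodes, which is the property the subgraph needs.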

```
[27]:
```

```
subgraph = StellarGraph(subgraph_nodes, subgraph_edges)
print(subgraph.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 3, Edges: 3
Node types:
default: [3]
Features: float32 vector, length 2
Edge types: default-default->default
Edge types:
default-default->default: [3]
Weights: all 1 (default)
```

The second option can look something like:

```
[28]:
```

```
subgraph_edges_refilter = neo4j_graph.run(
"""
MATCH (s) -[r]-> (t)
WHERE (s.left OR NOT s.top) AND (t.left OR NOT t.top)
RETURN s.name AS source, t.name AS target
"""
).to_data_frame()
subgraph_edges_refilter
```

```
[28]:
```

| | source | target |
|---|---|---|
| 0 | d | a |
| 1 | a | c |
| 2 | c | d |

```
[29]:
```

```
subgraph_refilter = StellarGraph(subgraph_nodes, subgraph_edges_refilter)
print(subgraph_refilter.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 3, Edges: 3
Node types:
default: [3]
Features: float32 vector, length 2
Edge types: default-default->default
Edge types:
default-default->default: [3]
Weights: all 1 (default)
```

Similar filtering can be applied to edges, such as only including edges with specific types or anything more complicated than that. This can happen in addition to any node filtering, by expanding the `WHERE` clause in the edge query to filter based on the source and target nodes and on whatever criteria one has chosen for edges.

### k-Hop subgraphs

Another sort of subgraph in which one might be interested is a “k-hop” subgraph of a set of start nodes. This refers to all nodes where the length of the path (number of edges) to a start node is at most `k`. For example, the 1-hop subgraph around `b` in the square is nodes `a`, `b` and `c`, because the shortest path from `b` to `d` is two edges.
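The definition can be checked against the square with a small in-memory breadth-first search. This sketch is only an illustration of the k-hop idea in plain Python, with a hypothetical helper `k_hop_nodes`; it is not how the database computes it.

```python
from collections import deque

# Undirected adjacency of the example square (edge direction is ignored when counting hops).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
adjacency = {}
for s, t in edges:
    adjacency.setdefault(s, set()).add(t)
    adjacency.setdefault(t, set()).add(s)


def k_hop_nodes(start_nodes, k):
    """Breadth-first search returning every node within `k` edges of a start node."""
    distance = {node: 0 for node in start_nodes}
    queue = deque(start_nodes)
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond k hops
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return sorted(distance)


print(k_hop_nodes(["b"], 1))  # ['a', 'b', 'c']
print(k_hop_nodes(["b"], 2))  # ['a', 'b', 'c', 'd']
```

The start nodes themselves are at distance 0, so they are always part of the result.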

Many graph machine learning algorithms only use a small neighbourhood of a node for influencing the predictions of the model, commonly in the form of its 1-, 2- or 3-hop subgraph. If we’re only interested in feeding small groups of nodes into a model, we can work with just the neighbourhoods of those nodes and avoid loading the rest of the potentially-large graph. This might apply in cases like:

- only a small number of nodes have ground-truth labels for training a model
- a trained model is being used to predict on only a small group of nodes of interest

For many cases, the nodes in the subgraph can be calculated with a Cypher query that uses a variable length relationship constraint. For instance, if we’re computing the 1-hop subgraph around the `b` node, we might do something like the following cell. Some notes about it:

- the `*0..1` means a path of 0 to 1 edges; the 0 is important to make sure we include the `b` node in the final subgraph too. For a 2-hop subgraph, this should be `(start) -[*0..2]- (n)`
- it uses a list to easily support using multiple start nodes, which will be more common

```
[30]:
```

```
start_nodes = ["b"]
raw_hop_nodes = neo4j_graph.run(
"""
MATCH (start) -[*0..1]- (n)
WHERE start.name IN $start_nodes
WITH DISTINCT n
RETURN n.name AS name, n.top, n.left
""",
{"start_nodes": start_nodes},
).to_data_frame()
hop_nodes = raw_hop_nodes.set_index("name")
hop_nodes
```

```
[30]:
```

| name | n.top | n.left |
|---|---|---|
| b | True | False |
| a | True | True |
| c | False | False |

Once we’ve got the nodes, we can do the same process as in the previous section to get the edges between the nodes.

```
[31]:
```

```
hop_edges = neo4j_graph.run(
"""
MATCH (s) -[r]-> (t)
WHERE s.name IN $node_names AND t.name IN $node_names
RETURN s.name AS source, t.name AS target
""",
{"node_names": list(hop_nodes.index)},
).to_data_frame()
hop_edges
```

```
[31]:
```

| | source | target |
|---|---|---|
| 0 | a | b |
| 1 | b | c |
| 2 | a | c |

```
[32]:
```

```
hop_subgraph = StellarGraph(hop_nodes, hop_edges)
print(hop_subgraph.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 3, Edges: 3
Node types:
default: [3]
Features: float32 vector, length 2
Edge types: default-default->default
Edge types:
default-default->default: [3]
Weights: all 1 (default)
```

One can expand the query to do more complicated computations, such as filtering which type of edges are included in the paths (like `[:horizontal*0..1]` to only follow horizontal edges), or which nodes are considered with `WHERE` clauses as in the previous section.

The `apoc.path.subgraphNodes` procedure from the APOC library offers more control too.

## Saving predictions into Neo4j

Most graph machine learning tasks will end up with some sort of predictions about some set of nodes or links in the graph. For example, a node classification task might result in either predicted scores for each class for a node, or even just the single class that is the most likely. The formats of these are usually:

- scores: a multidimensional NumPy array. In the node classification example linked above, it’s an array of floats of shape `(1, 2708, 7)`, where each element along the axis of size 2708 represents a node, and the 7 numbers for that element represent the scores for each of the 7 classes for that node.
- classes: a one-dimensional NumPy array. In the node classification example linked above, it’s an array of strings of length 2708, where each element represents the predicted class for a node.

For our graph, let’s suppose we have finished predicting the class of a node, with three classes `X`, `Y` and `Z`, and now want to save them back into the Neo4j database to use for visualisation and downstream tasks. For this hypothetical example, we were only interested in predictions for nodes `a` and `b`.

The result of the task and all post-processing might be something like:

```
[33]:
```

```
import numpy as np
predicted_nodes = ["a", "b"]
# shape (1, 2, 3): one row of class scores per predicted node, in the order a, b
predicted_scores = np.array([[[0.1, 0.8, 0.1], [0.4, 0.35, 0.25]]])
predicted_class = np.array(["Y", "X"])
```
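The relationship between the two formats can be sketched as picking the class with the highest score for each node. The example below uses plain Python lists and assumes that the score positions correspond to the classes in the order `X`, `Y`, `Z`; that ordering is an assumption for illustration, chosen because it is consistent with the values in the cell above.

```python
# Assumed ordering of classes along the last axis of the scores.
class_names = ["X", "Y", "Z"]
scores_by_node = {"a": [0.1, 0.8, 0.1], "b": [0.4, 0.35, 0.25]}


def most_likely_class(scores):
    """Return the class name at the position of the largest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return class_names[best_index]


predicted = {node: most_likely_class(scores) for node, scores in scores_by_node.items()}
print(predicted)  # {'a': 'Y', 'b': 'X'}
```

With NumPy arrays, the same selection is typically done with `argmax` along the class axis.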

We want to update the Neo4j database to hold the scores in a `predicted_class_scores` property and the class itself in a `predicted_class` property for each of the nodes with predictions. This can be achieved with a parameterised Cypher query using `UNWIND` and `SET`. For this, we need to have the data as a sequence with one record for each node.

```
[34]:
```

```
predictions = [
{"name": name, "scores": list(scores), "class": class_}
for name, scores, class_ in zip(predicted_nodes, predicted_scores[0], predicted_class)
]
predictions
```

```
[34]:
```

```
[{'name': 'a', 'scores': [0.1, 0.8, 0.1], 'class': 'Y'},
{'name': 'b', 'scores': [0.4, 0.35, 0.25], 'class': 'X'}]
```

Now we can execute the query. The `UNWIND` means that `prediction` holds each of the dictionaries successively, for which we can find the relevant node and update its properties as desired.

```
[35]:
```

```
neo4j_graph.evaluate(
"""
UNWIND $predictions AS prediction
MATCH (n { name: prediction.name })
SET n.predicted_class_scores = prediction.scores
SET n.predicted_class = prediction.class
""",
{"predictions": predictions},
)
```
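As a mental model of what this query does, the following sketch mimics it on an in-memory list of dictionaries. It is a plain Python analogy for the `UNWIND`, `MATCH` and `SET` steps, not how Neo4j executes the query.

```python
# In-memory stand-ins for the four nodes in the database.
nodes = [{"name": name} for name in ["a", "b", "c", "d"]]

predictions = [
    {"name": "a", "scores": [0.1, 0.8, 0.1], "class": "Y"},
    {"name": "b", "scores": [0.4, 0.35, 0.25], "class": "X"},
]

for prediction in predictions:  # UNWIND $predictions AS prediction
    for node in nodes:
        if node["name"] == prediction["name"]:  # MATCH (n { name: prediction.name })
            node["predicted_class_scores"] = prediction["scores"]  # SET n.predicted_class_scores
            node["predicted_class"] = prediction["class"]  # SET n.predicted_class

print([node.get("predicted_class") for node in nodes])  # ['Y', 'X', None, None]
```

Nodes without a matching prediction are left untouched, so `c` and `d` have no predicted properties.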

To verify that this behaved as desired, let’s read back all the nodes, to see that `a` and `b` were updated with the right information.

```
[36]:
```

```
verification_data = neo4j_graph.run(
"MATCH (n) RETURN n.name, n.predicted_class_scores, n.predicted_class"
).to_data_frame()
verification_data.sort_values("n.name") # sort for ease of reference
```

```
[36]:
```

| | n.name | n.predicted_class_scores | n.predicted_class |
|---|---|---|---|
| 0 | a | [0.1, 0.8, 0.1] | Y |
| 1 | b | [0.4, 0.35, 0.25] | X |
| 2 | c | None | None |
| 3 | d | None | None |

## Conclusion

This notebook demonstrated many ways to read data from Neo4j into a `StellarGraph` graph object, for many types of graphs:

- with or without node features
- with or without edge weights
- directed or not
- homogeneous or heterogeneous

We used the `py2neo` library to run Cypher queries to create Pandas DataFrames that we could load into `StellarGraph` objects. The process for loading from Pandas DataFrames is explored in more detail in the loading via Pandas demonstration, which has more discussion and explanations of every option for finer control.

This notebook also demonstrated saving the results of a graph machine learning algorithm back into Neo4j to use for visualisation and other tasks.

Revisit this document to use as a reminder.

Once you’ve loaded your data, you can start doing machine learning: a good place to start is the demo of the GCN algorithm on the Cora dataset for node classification. Additionally, StellarGraph includes many other demos of other algorithms, solving other tasks.

We also have experimental support for running some algorithms directly using Neo4j.

(We’re still exploring the best ways to have StellarGraph work with Neo4j, so please let us know your experience of using StellarGraph with Neo4j, both positive and negative.)

```
[37]:
```

```
# clean everything up, so that we're not leaving the square graph in the Neo4j instance
neo4j_graph.delete_all()
```
