Execute this notebook: Download locally

Loading data into StellarGraph from NumPy

This demo explains how to load data from NumPy into a form that can be used by the StellarGraph library. See all other demos.

The StellarGraph library supports loading graph information from NumPy. NumPy is a library for working with data arrays.

If your data can easily be loaded into a NumPy array, this is a great way to load it that has minimal overhead and offers the most control.

This notebook walks through loading three kinds of graphs.

homogeneous graph with feature vectors
homogeneous graph with feature tensors
heterogeneous graph with feature vectors and tensors

StellarGraph supports loading data from many sources with all sorts of data preprocessing, via Pandas DataFrames, NumPy arrays, Neo4j and NetworkX graphs. This notebook demonstrates loading data from NumPy. See the other loading demos for more details.

This notebook only uses NumPy for the node features, with Pandas used for the edge data. The details and options for loading edge data in this format are discussed in the “Loading data into StellarGraph from Pandas” demo.

Additionally, if the node features are in a format that is complicated to load or that requires significant preprocessing, loading via Pandas is likely to be more convenient.

The documentation for the StellarGraph class includes a compressed reminder of everything discussed in this file, as well as explanations of all of the parameters.

The StellarGraph class is available at the top level of the stellargraph library:

[3]:

from stellargraph import StellarGraph

Loading via NumPy

A StellarGraph has two basic components:

nodes, with feature arrays or tensors
edges, consisting of a pair of nodes as the source and target, and feature arrays or tensors

A NumPy array consists of a large number of values of a single type. It is thus appropriate for the feature arrays of nodes, but less useful for edges, because the source and target node IDs may be of a different type to the feature values. Thus, node data can be passed in as a NumPy array directly, but edge data cannot; edges are still loaded via Pandas.
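
The single-type constraint is easy to see in NumPy itself. The following is only a small illustration with made-up variable names: numeric features fit in one array, while a row that mixes string node IDs with a number is converted entirely to strings.

```python
import numpy as np

# Node features are all floats, so they fit naturally in one array.
features = np.array([[1.0, -0.2], [2.0, 0.3]], dtype=np.float32)
print(features.dtype)

# Hypothetical edge rows mixing string node IDs with a numeric value:
# NumPy cannot keep both types, so every value becomes a string.
mixed_edge_rows = np.array([["a", "b", 0.5], ["b", "c", 1.5]])
print(mixed_edge_rows.dtype.kind)  # "U" means unicode string
```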

[4]:

import numpy as np
import pandas as pd

Sequential numeric graph structure

As with the Pandas demo, we’ll be working with a square graph. For simplicity, we’ll start with a graph where the identifiers of nodes are sequential integers starting at 0:

0 -- 1
| \  |
|  \ |
3 -- 2

The edges of this graph can easily be encoded as the rows of a Pandas DataFrame:

[5]:

square_numeric_edges = pd.DataFrame(
    {"source": [0, 1, 2, 3, 0], "target": [1, 2, 3, 0, 2]}
)
square_numeric_edges

[5]:

	source	target
0	0	1
1	1	2
2	2	3
3	3	0
4	0	2

Homogeneous graph with sequential IDs and feature vectors

Now, suppose we have a feature vector associated with each node in our square graph. For instance, maybe node 0 has features [1, -0.2]. The features can come in the form of a 4 × 2 matrix with one row per node: row 0 holds the features for node 0, and so on. Filling out the rest of the example data:

[6]:

feature_array = np.array(
    [[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]], dtype=np.float32
)
feature_array

[6]:

array([[ 1. , -0.2],
       [ 2. ,  0.3],
       [ 3. ,  0. ],
       [ 4. , -0.5]], dtype=float32)

Because our nodes have IDs 0, 1, …, we can construct the StellarGraph by passing in the feature array directly, along with the edges:

[7]:

square_numeric = StellarGraph(feature_array, square_numeric_edges)
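
Since row i of a plain array is taken as the features of node i, it can be worth confirming beforehand that every edge endpoint has a matching row. The check below is a sketch using only NumPy and Pandas; the names num_nodes, endpoints and unknown are introduced here for illustration and are not part of StellarGraph.

```python
import numpy as np
import pandas as pd

feature_array = np.array(
    [[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]], dtype=np.float32
)
square_numeric_edges = pd.DataFrame(
    {"source": [0, 1, 2, 3, 0], "target": [1, 2, 3, 0, 2]}
)

# Row i holds the features of node i, so valid IDs are 0 .. num_nodes - 1.
num_nodes = feature_array.shape[0]

# Collect every node ID mentioned by an edge, as source or as target.
endpoints = pd.concat(
    [square_numeric_edges["source"], square_numeric_edges["target"]]
)

# Any endpoint outside the valid range has no matching feature row.
unknown = endpoints[(endpoints < 0) | (endpoints >= num_nodes)]
print(num_nodes, len(unknown))
```

An empty unknown means every edge refers to a node that has a feature row.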

The info method (docs) gives a high-level summary of a StellarGraph:

[8]:

print(square_numeric.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  default: [4]
    Features: float32 vector, length 2
    Edge types: default-default->default

 Edge types:
    default-default->default: [5]
        Weights: all 1 (default)
        Features: none

On this square, it tells us that there are 4 nodes of type default (a homogeneous graph still has node and edge types, but they default to default), each with 2 features, and one type of edge that touches them. It also tells us that there are 5 edges of type default that go between nodes of type default. This matches what we expect: it’s a graph with 4 nodes and 5 edges and one type of each.

The default node type and edge types can be set using the node_type_default and edge_type_default parameters to StellarGraph(...):

[9]:

square_numeric_named = StellarGraph(
    feature_array,
    square_numeric_edges,
    node_type_default="corner",
    edge_type_default="line",
)
print(square_numeric_named.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 vector, length 2
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: all 1 (default)
        Features: none

Non-sequential graph structure

Requiring node identifiers to always be sequential integers from 0 is restrictive. Most real-world graphs don’t have such neat IDs. For instance, maybe our graph instead uses strings as IDs:

a -- b
| \  |
|  \ |
d -- c

As before, these edges get encoded as a DataFrame:

[10]:

square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)
square_edges

[10]:

	source	target
0	a	b
1	b	c
2	c	d
3	d	a
4	a	c

Homogeneous graph with non-numeric IDs and feature vectors

With non-sequential, non-numeric IDs, we cannot use a NumPy array directly, because we need to know which row of the array corresponds to which node. This is done with the IndexedArray (docs) type. It is a much-simplified Pandas DataFrame that is generalised to more than 2 dimensions. It is available at the top level of stellargraph, and supports an index parameter to define the mapping from row to node. The index should have one element per row of the NumPy array.

[11]:

from stellargraph import IndexedArray

[12]:

indexed_array = IndexedArray(feature_array, index=["a", "b", "c", "d"])
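
The idea of an index that maps node IDs to rows can be sketched with a Pandas Index. This is only an illustration of the concept, not a description of how IndexedArray works internally; node_ids, row_of_c and features_of_c are names made up for this example.

```python
import numpy as np
import pandas as pd

feature_array = np.array(
    [[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]], dtype=np.float32
)
node_ids = ["a", "b", "c", "d"]

# One ID per row; a mismatch would leave some rows without an ID.
assert len(node_ids) == feature_array.shape[0]

# Look up the row position of an ID, then read that row of features.
index = pd.Index(node_ids)
row_of_c = index.get_loc("c")
features_of_c = feature_array[row_of_c]
print(row_of_c, features_of_c)
```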

[13]:

square_named = StellarGraph(
    indexed_array, square_edges, node_type_default="corner", edge_type_default="line",
)
print(square_named.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 vector, length 2
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: all 1 (default)
        Features: none

As before, there are 4 nodes, each with features of length 2.

Homogeneous graph with non-numeric IDs and feature tensors

Some algorithms work with more than just a feature vector associated with each node. For instance, if each node corresponds to a weather station, one might have a time series of observations like “temperature” and “pressure” associated with each node. This is modelled by having a multidimensional feature for each node.

Time series algorithms within StellarGraph expect the tensor to be shaped like nodes × time steps × variates. For the weather station example, nodes is the number of weather stations, time steps is the number of points within each series and variates is the number of observations at each time step.

For our square graph, we might have time series of length three, containing two observations.
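
Raw time series data is not always stored in nodes × time steps × variates order. As a sketch, assume (hypothetically) that the data arrives as one nodes × variates snapshot per time step; stacking the snapshots and swapping the first two axes gives the expected layout. The names snapshots, time_major and node_major are made up for this example.

```python
import numpy as np

# Hypothetical raw layout: one (nodes x variates) snapshot per time step.
snapshots = [
    np.array([[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]]),
    np.array([[1.0, 0.1], [1.9, 0.31], [10.0, 0.0], [0.0, -1.0]]),
    np.array([[0.9, 0.1], [2.1, 0.32], [3.0, 0.0], [1.0, -3.0]]),
]

# Stacking gives time steps x nodes x variates.
time_major = np.stack(snapshots)

# Swap the first two axes to get nodes x time steps x variates.
node_major = time_major.transpose(1, 0, 2).astype(np.float32)
print(time_major.shape, node_major.shape)
```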

[14]:

feature_tensors = np.array(
    [
        [[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]],
        [[2.0, 0.3], [1.9, 0.31], [2.1, 0.32]],
        [[3.0, 0.0], [10.0, 0.0], [3.0, 0.0]],
        [[4.0, -0.5], [0.0, -1.0], [1.0, -3.0]],
    ],
    dtype=np.float32,
)
feature_tensors

[14]:

array([[[ 1.  , -0.2 ],
        [ 1.  ,  0.1 ],
        [ 0.9 ,  0.1 ]],

       [[ 2.  ,  0.3 ],
        [ 1.9 ,  0.31],
        [ 2.1 ,  0.32]],

       [[ 3.  ,  0.  ],
        [10.  ,  0.  ],
        [ 3.  ,  0.  ]],

       [[ 4.  , -0.5 ],
        [ 0.  , -1.  ],
        [ 1.  , -3.  ]]], dtype=float32)

[15]:

indexed_tensors = IndexedArray(feature_tensors, index=["a", "b", "c", "d"])

[16]:

square_tensors = StellarGraph(
    indexed_tensors, square_edges, node_type_default="corner", edge_type_default="line",
)
print(square_tensors.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  corner: [4]
    Features: float32 tensor, shape (3, 2)
    Edge types: corner-line->corner

 Edge types:
    corner-line->corner: [5]
        Weights: all 1 (default)
        Features: none

We can see that the features of the corner nodes are now listed as a tensor, with shape 3 × 2, matching the array we created above.

Heterogeneous graphs

Some graphs have multiple types of nodes.

For example, an academic citation network that includes authors might have wrote edges connecting author nodes to paper nodes, in addition to the cites edges between paper nodes. There could be supervised edges between authors too, or any number of additional node and edge types. A knowledge graph (also known as RDF, a triple store or a knowledge base) is an extreme form of a heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types. Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.

StellarGraph supports all forms of heterogeneous graphs.

A heterogeneous StellarGraph can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with multiple elements instead of a single element like we did in the “homogeneous graph with features” section and others above. For a heterogeneous graph, a dictionary has to be passed; passing a single IndexedArray does not work.

Let’s return to the square graph from earlier:

a -- b
| \  |
|  \ |
d -- c

Feature arrays

Suppose a is of type foo and has no features, but b, c and d are of type bar and have two features each, e.g. 0.4 and 100 for b. Since the features have different shapes (a has zero), the nodes need to be modelled as different types, with separate IndexedArrays.

[17]:

square_foo = IndexedArray(index=["a"])

[18]:

bar_features = np.array([[0.4, 100], [0.1, 200], [0.9, 300]])
bar_features

[18]:

array([[4.e-01, 1.e+02],
       [1.e-01, 2.e+02],
       [9.e-01, 3.e+02]])

[19]:

square_bar = IndexedArray(bar_features, index=["b", "c", "d"])
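
Note that bar_features was created without an explicit dtype, so NumPy chooses float64, unlike the float32 arrays used earlier. If a consistent type is wanted, the array can be converted before it is wrapped; the sketch below uses only NumPy, and bar_features_32 is a name introduced for illustration.

```python
import numpy as np

# Without an explicit dtype, mixed ints and floats become float64.
bar_features = np.array([[0.4, 100], [0.1, 200], [0.9, 300]])
print(bar_features.dtype)

# Optional: convert to float32 to match the earlier feature arrays.
bar_features_32 = bar_features.astype(np.float32)
print(bar_features_32.dtype)
```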

We have the information for the two node types foo and bar in separate IndexedArrays, so we can now put them in a dictionary to create a StellarGraph. Notice that info() is now reporting multiple node types, as well as information specific to each.

[20]:

square_foo_and_bar = StellarGraph({"foo": square_foo, "bar": square_bar}, square_edges)
print(square_foo_and_bar.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float64 vector, length 2
    Edge types: bar-default->bar, bar-default->foo
  foo: [1]
    Features: none
    Edge types: foo-default->bar

 Edge types:
    foo-default->bar: [2]
        Weights: all 1 (default)
        Features: none
    bar-default->bar: [2]
        Weights: all 1 (default)
        Features: none
    bar-default->foo: [1]
        Weights: all 1 (default)
        Features: none

Node IDs (the IndexedArray index) need to be unique across all types. For example, renaming the a corner to b, as with square_foo_overlap in the next cell, is not accepted, and a StellarGraph(...) call will throw an error.

[21]:

square_foo_overlap = IndexedArray(index=["b"])

[22]:

# Uncomment to see the error
# StellarGraph({"foo": square_foo_overlap, "bar": square_bar}, square_edges)
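
Overlapping IDs can also be found before constructing the graph. The following is a plain-Python sketch that works on lists of IDs; ids_by_type, seen and duplicates are names introduced for this example.

```python
foo_ids = ["b"]            # the overlapping index from square_foo_overlap
bar_ids = ["b", "c", "d"]  # the index of square_bar

ids_by_type = {"foo": foo_ids, "bar": bar_ids}

# Record which node types each ID appears in.
seen = {}
for node_type, ids in ids_by_type.items():
    for node_id in ids:
        seen.setdefault(node_id, []).append(node_type)

# Keep only the IDs that appear under more than one type.
duplicates = {
    node_id: types for node_id, types in seen.items() if len(types) > 1
}
print(duplicates)
```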

If the node IDs aren’t unique across types, one way to make them unique is to add a string prefix. You’ll need to add the same prefix to the node IDs used in the edges too. Adding a prefix can be done by replacing the index:

[23]:

square_foo_overlap_prefix = IndexedArray(
    square_foo_overlap.values, index=[f"foo-{s}" for s in square_foo_overlap.index]
)

[24]:

square_bar_prefix = IndexedArray(
    square_bar.values, index=[f"bar-{s}" for s in square_bar.index]
)
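
The edges need the same prefixes. Once IDs overlap, an ID alone no longer identifies a node, so this sketch assumes the edge table also records the node type of each endpoint; the source_type and target_type columns are hypothetical and do not appear in the data above.

```python
import pandas as pd

# Hypothetical edge table: "b" is ambiguous, so each endpoint also
# records its node type in an extra column (an assumption).
edges = pd.DataFrame(
    {
        "source": ["b", "b", "c", "d", "b"],
        "source_type": ["foo", "bar", "bar", "bar", "foo"],
        "target": ["b", "c", "d", "b", "c"],
        "target_type": ["bar", "bar", "bar", "foo", "bar"],
    }
)

# Build prefixed IDs in the same "type-id" form used for the node indexes.
prefixed_edges = pd.DataFrame(
    {
        "source": edges["source_type"] + "-" + edges["source"],
        "target": edges["target_type"] + "-" + edges["target"],
    }
)
print(prefixed_edges)
```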

Feature tensors

Nodes of different types can have features of completely different shapes, not just vectors of different lengths. For instance, suppose our foo node (a) has the multi-variate time series from above as a feature.

[25]:

foo_tensors = np.array([[[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]]])
foo_tensors

[25]:

array([[[ 1. , -0.2],
        [ 1. ,  0.1],
        [ 0.9,  0.1]]])

[26]:

square_foo_tensors = IndexedArray(foo_tensors, index=["a"])
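
The extra pair of brackets in foo_tensors gives the array a leading node axis of length 1. If a single node's series is already available as a time steps × variates array, the same shape can be produced by adding that axis explicitly. A minimal sketch, with series_for_a as a made-up name:

```python
import numpy as np

# A single node's time series: time steps x variates.
series_for_a = np.array([[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]])

# Add a leading axis of length 1: nodes x time steps x variates.
foo_tensors = series_for_a[np.newaxis, ...]
print(series_for_a.shape, foo_tensors.shape)
```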

[27]:

square_foo_tensors_and_bar = StellarGraph(
    {"foo": square_foo_tensors, "bar": square_bar}, square_edges
)
print(square_foo_tensors_and_bar.info())

StellarGraph: Undirected multigraph
 Nodes: 4, Edges: 5

 Node types:
  bar: [3]
    Features: float64 vector, length 2
    Edge types: bar-default->bar, bar-default->foo
  foo: [1]
    Features: float64 tensor, shape (3, 2)
    Edge types: foo-default->bar

 Edge types:
    foo-default->bar: [2]
        Weights: all 1 (default)
        Features: none
    bar-default->bar: [2]
        Weights: all 1 (default)
        Features: none
    bar-default->foo: [1]
        Weights: all 1 (default)
        Features: none

We can now see that the foo node is listed as having a feature tensor, as desired.

Conclusion

You hopefully now know more about building node features for a StellarGraph in various configurations via NumPy arrays.

For more details on graphs with directed, weighted or heterogeneous edges, see the “Loading data into StellarGraph from Pandas” demo. All of the examples there work with IndexedArray instead of Pandas DataFrames for the node features.

Revisit this document, or the documentation for the StellarGraph class, as a reminder.

Once you’ve loaded your data, you can start doing machine learning: a good place to start is the demo of the GCN algorithm on the Cora dataset for node classification. Additionally, StellarGraph includes many other demos of other algorithms, solving other tasks.
