
# Loading data into StellarGraph from NumPy

This demo explains how to load data from NumPy into a form that can be used by the StellarGraph library. See all other demos.

The StellarGraph library supports loading graph information from NumPy. NumPy is a library for working with data arrays.

If your data can easily be loaded into a NumPy array, this is a great way to load it that has minimal overhead and offers the most control.

This notebook walks through loading three kinds of graphs:

- homogeneous graph with feature vectors
- homogeneous graph with feature tensors
- heterogeneous graph with feature vectors and tensors

StellarGraph supports loading data from many sources with all sorts of data preprocessing, via Pandas DataFrames, NumPy arrays, Neo4j and NetworkX graphs. This notebook demonstrates loading data from NumPy. See the other loading demos for more details.

This notebook only uses NumPy for the node features, with Pandas used for the edge data. The details and options for loading edge data in this format are discussed in the “Loading data into StellarGraph from Pandas” demo.

Additionally, if the node features are in a complicated format for loading and/or require significant preprocessing, loading via Pandas is likely to be more convenient.

The documentation for the `StellarGraph` class includes a compressed reminder of everything discussed in this file, as well as explanations of all of the parameters.

The `StellarGraph` class is available at the top level of the `stellargraph` library:

```
from stellargraph import StellarGraph
```

## Loading via NumPy

A StellarGraph has two basic components:

- nodes, with feature arrays or tensors
- edges, consisting of a pair of nodes as the source and target, and feature arrays or tensors

A NumPy array consists of a large number of values of a single type. It is thus appropriate for the feature arrays of nodes, but less useful for edges, because the source and target node IDs may have a different type from the edge features (for example, string IDs alongside floating-point features). Thus, node data can be input as a NumPy array directly, but edge data cannot; edge data is still loaded via Pandas.

```
import numpy as np
import pandas as pd
```
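To make the single-type point concrete, here is a small sketch using only NumPy and Pandas. The names `features`, `mixed` and `edges`, and the `weight` column, are made up for this illustration and are not part of the demo's data.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype for every element, which suits numeric features.
features = np.array([[1.0, -0.2], [2.0, 0.3]], dtype=np.float32)

# Mixing strings and floats forces NumPy to pick one common dtype
# (a Unicode string type here), so the numbers stop being numeric.
mixed = np.array([["a", 0.5], ["b", 1.5]])

# A DataFrame keeps a separate dtype per column, so string node IDs and a
# numeric column (a hypothetical "weight") can sit side by side.
edges = pd.DataFrame(
    {"source": ["a", "b"], "target": ["b", "c"], "weight": [0.5, 1.5]}
)

print(features.dtype, mixed.dtype.kind, edges["weight"].dtype)
```

This is why the rest of the notebook keeps node features in arrays and edges in a DataFrame.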

## Sequential numeric graph structure

As with the Pandas demo, we’ll be working with a square graph. For simplicity, we’ll start with a graph where the identifiers of nodes are sequential integers starting at 0:

```
0 -- 1
| \  |
|  \ |
3 -- 2
```

The edges of this graph can easily be encoded as the rows of a Pandas DataFrame:

```
square_numeric_edges = pd.DataFrame(
    {"source": [0, 1, 2, 3, 0], "target": [1, 2, 3, 0, 2]}
)
square_numeric_edges
```

| | source | target |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 2 |
| 2 | 2 | 3 |
| 3 | 3 | 0 |
| 4 | 0 | 2 |

## Homogeneous graph with sequential IDs and feature vectors

Now, suppose we have some feature vectors associated with each node in our square graph. For instance, maybe node `0` has features `[1, -0.2]`. This can come in the form of a 4 × 2 matrix, with one row per node, with row `0` being the features for node `0`, and so on. Filling out the rest of the example data:

```
feature_array = np.array(
    [[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]], dtype=np.float32
)
feature_array
```

```
array([[ 1. , -0.2],
[ 2. , 0.3],
[ 3. , 0. ],
[ 4. , -0.5]], dtype=float32)
```

Because our nodes have IDs `0`, `1`, …, we can construct the `StellarGraph` by passing in the feature array directly, along with the edges:

```
square_numeric = StellarGraph(feature_array, square_numeric_edges)
```
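Since a bare array relies on row `i` holding the features of node `i`, it can be worth checking that every edge endpoint is a valid row number before building the graph. The following is a minimal sketch of such a check using only NumPy and Pandas; `num_nodes`, `endpoints` and `edges_are_valid` are names introduced here, and the check is an optional sanity step rather than something the library requires.

```python
import numpy as np
import pandas as pd

feature_array = np.array(
    [[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]], dtype=np.float32
)
square_numeric_edges = pd.DataFrame(
    {"source": [0, 1, 2, 3, 0], "target": [1, 2, 3, 0, 2]}
)

# With a bare array, row i holds the features of node i, so the valid
# node IDs are exactly 0, 1, ..., num_nodes - 1.
num_nodes = feature_array.shape[0]

# Gather every ID mentioned by an edge, whether as source or target.
endpoints = np.concatenate(
    [
        square_numeric_edges["source"].to_numpy(),
        square_numeric_edges["target"].to_numpy(),
    ]
)

# True only if every endpoint refers to an existing row of feature_array.
edges_are_valid = bool(((endpoints >= 0) & (endpoints < num_nodes)).all())
print(num_nodes, edges_are_valid)
```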

The `info` method (docs) gives a high-level summary of a `StellarGraph`:

```
print(square_numeric.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
default: [4]
Features: float32 vector, length 2
Edge types: default-default->default
Edge types:
default-default->default: [5]
Weights: all 1 (default)
Features: none
```

On this square, it tells us that there are 4 nodes of type `default` (a homogeneous graph still has node and edge types, but they default to `default`), each with 2 features, and one type of edge that touches them. It also tells us that there are 5 edges of type `default` that go between nodes of type `default`. This matches what we expect: it’s a graph with 4 nodes and 5 edges and one type of each.

The default node and edge types can be set using the `node_type_default` and `edge_type_default` parameters to `StellarGraph(...)`:

```
square_numeric_named = StellarGraph(
    feature_array,
    square_numeric_edges,
    node_type_default="corner",
    edge_type_default="line",
)
print(square_numeric_named.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
corner: [4]
Features: float32 vector, length 2
Edge types: corner-line->corner
Edge types:
corner-line->corner: [5]
Weights: all 1 (default)
Features: none
```

## Non-sequential graph structure

Requiring node identifiers to always be sequential integers from 0 is restrictive. Most real-world graphs don’t have such neat IDs. For instance, maybe our graph instead uses strings as IDs:

```
a -- b
| \  |
|  \ |
d -- c
```

As before, these edges get encoded as a DataFrame:

```
square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)
square_edges
```

| | source | target |
|---|---|---|
| 0 | a | b |
| 1 | b | c |
| 2 | c | d |
| 3 | d | a |
| 4 | a | c |

## Homogeneous graph with non-numeric IDs and feature vectors

With non-sequential, non-numeric IDs, we cannot use a NumPy array directly, because we need to know which row of the array corresponds to which node. This is done with the `IndexedArray` (docs) type. It is a much simplified Pandas DataFrame that is generalised to be more than 2-dimensional. It is available at the top level of `stellargraph`, and supports an `index` parameter to define the mapping from row to node. The `index` should have one element per row of the NumPy array.

```
from stellargraph import IndexedArray
```

```
indexed_array = IndexedArray(feature_array, index=["a", "b", "c", "d"])
```

```
square_named = StellarGraph(
    indexed_array, square_edges, node_type_default="corner", edge_type_default="line",
)
print(square_named.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
corner: [4]
Features: float32 vector, length 2
Edge types: corner-line->corner
Edge types:
corner-line->corner: [5]
Weights: all 1 (default)
Features: none
```

As before, there are 4 nodes, each with features of length 2.
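The row-to-node mapping that an index provides can be pictured with plain NumPy and a dictionary. This is only an illustration of the idea, not how `IndexedArray` is implemented; `index`, `row_of` and `features_of_c` are names introduced for the sketch.

```python
import numpy as np

feature_array = np.array(
    [[1.0, -0.2], [2.0, 0.3], [3.0, 0.0], [4.0, -0.5]], dtype=np.float32
)
index = ["a", "b", "c", "d"]

# The index needs exactly one ID per row of the feature array.
assert len(index) == feature_array.shape[0]

# Map each node ID to the row that stores its features.
row_of = {node_id: row for row, node_id in enumerate(index)}

# Look up the features of node "c", which live in row 2.
features_of_c = feature_array[row_of["c"]]
print(row_of["c"], features_of_c)
```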

## Homogeneous graph with non-numeric IDs and feature tensors

Some algorithms work with more than just a feature vector associated with each node. For instance, if each node corresponds to a weather station, one might have a time series of observations like “temperature” and “pressure” associated with each node. This is modelled by having a multidimensional feature for each node.

Time series algorithms within StellarGraph expect the tensor to be shaped like `nodes × time steps × variates`. For the weather station example, `nodes` is the number of weather stations, `time steps` is the number of points within each series and `variates` is the number of observations at each time step.

For our square graph, we might have time series of length three, containing two observations.

```
feature_tensors = np.array(
    [
        [[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]],
        [[2.0, 0.3], [1.9, 0.31], [2.1, 0.32]],
        [[3.0, 0.0], [10.0, 0.0], [3.0, 0.0]],
        [[4.0, -0.5], [0.0, -1.0], [1.0, -3.0]],
    ],
    dtype=np.float32,
)
feature_tensors
```

```
array([[[ 1. , -0.2 ],
[ 1. , 0.1 ],
[ 0.9 , 0.1 ]],
[[ 2. , 0.3 ],
[ 1.9 , 0.31],
[ 2.1 , 0.32]],
[[ 3. , 0. ],
[10. , 0. ],
[ 3. , 0. ]],
[[ 4. , -0.5 ],
[ 0. , -1. ],
[ 1. , -3. ]]], dtype=float32)
```
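If the series start out stored separately per node, one way to reach the `nodes × time steps × variates` layout is to stack them along a new first axis. The sketch below assumes a hypothetical `series_by_node` dictionary holding the same values as above; it is one possible preparation step, not a required one.

```python
import numpy as np

# Hypothetical starting point: one (time steps x variates) series per node.
series_by_node = {
    "a": [[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]],
    "b": [[2.0, 0.3], [1.9, 0.31], [2.1, 0.32]],
    "c": [[3.0, 0.0], [10.0, 0.0], [3.0, 0.0]],
    "d": [[4.0, -0.5], [0.0, -1.0], [1.0, -3.0]],
}

# Fix an order for the nodes; the same order must be used for the index.
index = list(series_by_node)

# Stack the per-node series along a new leading axis, giving
# nodes x time steps x variates.
feature_tensors = np.stack(
    [np.asarray(series_by_node[node_id], dtype=np.float32) for node_id in index]
)

num_nodes, num_time_steps, num_variates = feature_tensors.shape
print(num_nodes, num_time_steps, num_variates)
```

Every series must have the same number of time steps and variates for the stack to succeed.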

```
indexed_tensors = IndexedArray(feature_tensors, index=["a", "b", "c", "d"])
```

```
square_tensors = StellarGraph(
    indexed_tensors, square_edges, node_type_default="corner", edge_type_default="line",
)
print(square_tensors.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
corner: [4]
Features: float32 tensor, shape (3, 2)
Edge types: corner-line->corner
Edge types:
corner-line->corner: [5]
Weights: all 1 (default)
Features: none
```

We can see that the features of the `corner` nodes are now listed as a tensor, with shape 3 × 2, matching the array we created above.

## Heterogeneous graphs

Some graphs have multiple types of nodes.

For example, an academic citation network that includes authors might have `wrote` edges connecting `author` nodes to `paper` nodes, in addition to the `cites` edges between `paper` nodes. There could be `supervised` edges between `author`s (example) too, or any number of additional node and edge types. A knowledge graph (aka RDF, triple stores or knowledge base) is an extreme form of a heterogeneous graph, with dozens, hundreds or even thousands of edge (or relation) types. Typically in a knowledge graph, edges and their types represent the information associated with a node, rather than node features.

`StellarGraph` supports all forms of heterogeneous graphs.

A heterogeneous `StellarGraph` can be constructed in a similar way to a homogeneous graph, except we pass a dictionary with one entry per node type instead of the single array or `IndexedArray` we passed in the homogeneous sections above. For a heterogeneous graph, a dictionary has to be passed; passing a single `IndexedArray` does not work.

Let’s return to the square graph from earlier:

```
a -- b
| \  |
|  \ |
d -- c
```

### Feature arrays

Suppose `a` is of type `foo` and has no features, but `b`, `c` and `d` are of type `bar` and have two features each, e.g. for `b`, `0.4, 100`. Since the features are different shapes (`a` has zero), they need to be modeled as different types, with separate `IndexedArray`s.

```
square_foo = IndexedArray(index=["a"])
```

```
bar_features = np.array([[0.4, 100], [0.1, 200], [0.9, 300]])
bar_features
```

```
array([[4.e-01, 1.e+02],
[1.e-01, 2.e+02],
[9.e-01, 3.e+02]])
```

```
square_bar = IndexedArray(bar_features, index=["b", "c", "d"])
```

We have the information for the two node types `foo` and `bar` in separate `IndexedArray`s, so we can now put them in a dictionary to create a `StellarGraph`. Notice that `info()` is now reporting multiple node types, as well as information specific to each.

```
square_foo_and_bar = StellarGraph({"foo": square_foo, "bar": square_bar}, square_edges)
print(square_foo_and_bar.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
bar: [3]
Features: float64 vector, length 2
Edge types: bar-default->bar, bar-default->foo
foo: [1]
Features: none
Edge types: foo-default->bar
Edge types:
foo-default->bar: [2]
Weights: all 1 (default)
Features: none
bar-default->bar: [2]
Weights: all 1 (default)
Features: none
bar-default->foo: [1]
Weights: all 1 (default)
Features: none
```
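The per-type edge counts in this summary can be cross-checked by hand from the edge list and the type of each node. The following sketch uses only Pandas; `type_of`, `pairs` and `counts` are names introduced here, and the `source_type->target_type` labels are a simplified stand-in for the labels that `info()` prints.

```python
import pandas as pd

square_edges = pd.DataFrame(
    {"source": ["a", "b", "c", "d", "a"], "target": ["b", "c", "d", "a", "c"]}
)

# Node type of each ID, matching the foo/bar split used above.
type_of = {"a": "foo", "b": "bar", "c": "bar", "d": "bar"}

# Label each edge with the types of its source and target nodes.
pairs = (
    square_edges["source"].map(type_of) + "->" + square_edges["target"].map(type_of)
)

# Count how many edges fall under each (source type, target type) pair.
counts = pairs.value_counts().to_dict()
print(counts)
```

The counts agree with the summary: two `foo` to `bar` edges, two `bar` to `bar` edges and one `bar` to `foo` edge.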

Node IDs (the `IndexedArray` index) need to be unique across all types. For example, renaming the `a` corner to `b`, as `square_foo_overlap` does in the next cell, is not accepted, and a `StellarGraph(...)` call will throw an error.

```
square_foo_overlap = IndexedArray(index=["b"])
```

```
# Uncomment to see the error
# StellarGraph({"foo": square_foo_overlap, "bar": square_bar}, square_edges)
```
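Overlapping IDs can also be spotted before constructing the graph. Below is a minimal pure-Python sketch that works on plain lists of IDs; `index_by_type`, `types_by_id` and `duplicates` are hypothetical names, and this is a check you could run yourself rather than the validation the library performs.

```python
# IDs per node type, mirroring the overlapping example above.
index_by_type = {"foo": ["b"], "bar": ["b", "c", "d"]}

# Record which node types each ID appears under.
types_by_id = {}
for node_type, node_ids in index_by_type.items():
    for node_id in node_ids:
        types_by_id.setdefault(node_id, []).append(node_type)

# Keep only the IDs that are claimed by more than one type.
duplicates = {
    node_id: types for node_id, types in types_by_id.items() if len(types) > 1
}
print(duplicates)
```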

If the node IDs aren’t unique across types, one way to make them unique is to add a string prefix. You’ll need to add the same prefix to the node IDs used in the edges too. Adding a prefix can be done by replacing the index:

```
square_foo_overlap_prefix = IndexedArray(
    square_foo_overlap.values, index=[f"foo-{s}" for s in square_foo_overlap.index]
)
```

```
square_bar_prefix = IndexedArray(
    square_bar.values, index=[f"bar-{s}" for s in square_bar.index]
)
```
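The edges need the same prefixes. When IDs overlap, the edge data has to say which type each endpoint belongs to, so the sketch below assumes hypothetical `source_type` and `target_type` columns that are not part of the demo's data. It uses only Pandas.

```python
import pandas as pd

# Hypothetical edge list with overlapping IDs: "b" exists as both a foo
# node and a bar node, so each endpoint carries its node type.
edges_with_types = pd.DataFrame(
    {
        "source": ["b", "b"],
        "source_type": ["foo", "bar"],
        "target": ["c", "d"],
        "target_type": ["bar", "bar"],
    }
)

# Apply the same "<type>-<id>" scheme that was used for the node indexes.
prefixed_edges = pd.DataFrame(
    {
        "source": edges_with_types["source_type"] + "-" + edges_with_types["source"],
        "target": edges_with_types["target_type"] + "-" + edges_with_types["target"],
    }
)
print(prefixed_edges)
```

The resulting `source` and `target` columns then refer to the prefixed node IDs, such as `foo-b` and `bar-c`.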

### Feature tensors

Nodes of different types can have features of completely different shapes, not just vectors of different lengths. For instance, suppose our `foo` node (`a`) has the multi-variate time series from above as a feature.

```
foo_tensors = np.array([[[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]]])
foo_tensors
```

```
array([[[ 1. , -0.2],
[ 1. , 0.1],
[ 0.9, 0.1]]])
```
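Note the three levels of brackets: even with a single `foo` node, the first axis still counts nodes, so the array has shape 1 × 3 × 2. If the series is available as a 3 × 2 array, one way to add that leading axis is sketched below; `single_series` is a name introduced here for illustration.

```python
import numpy as np

# One multivariate series for a single node: 3 time steps x 2 variates.
single_series = np.array([[1.0, -0.2], [1.0, 0.1], [0.9, 0.1]])

# Add a leading axis of length 1 so that the first dimension counts nodes.
foo_tensors = single_series[np.newaxis, ...]
print(single_series.shape, foo_tensors.shape)
```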

```
square_foo_tensors = IndexedArray(foo_tensors, index=["a"])
```

```
square_foo_tensors_and_bar = StellarGraph(
    {"foo": square_foo_tensors, "bar": square_bar}, square_edges
)
print(square_foo_tensors_and_bar.info())
```

```
StellarGraph: Undirected multigraph
Nodes: 4, Edges: 5
Node types:
bar: [3]
Features: float64 vector, length 2
Edge types: bar-default->bar, bar-default->foo
foo: [1]
Features: float64 tensor, shape (3, 2)
Edge types: foo-default->bar
Edge types:
foo-default->bar: [2]
Weights: all 1 (default)
Features: none
bar-default->bar: [2]
Weights: all 1 (default)
Features: none
bar-default->foo: [1]
Weights: all 1 (default)
Features: none
```

We can now see that the `foo` node is listed as having a feature tensor, as desired.

## Conclusion

You hopefully now know more about building node features for a `StellarGraph` in various configurations via NumPy arrays.

For more details on graphs with directed, weighted or heterogeneous edges, see the “Loading data into StellarGraph from Pandas” demo. All of the examples there work with `IndexedArray` instead of Pandas DataFrames for the node features.

Revisit this document as a reminder, or see the documentation for the `StellarGraph` class.

Once you’ve loaded your data, you can start doing machine learning: a good place to start is the demo of the GCN algorithm on the Cora dataset for node classification. Additionally, StellarGraph includes many other demos of other algorithms, solving other tasks.
