
Resource usage of the StellarGraph class

This notebook records the time and memory (both peak and long-term) required to construct a StellarGraph object for several datasets.

This notebook is aimed at helping contributors to the StellarGraph library itself understand how their changes affect the resource usage of the StellarGraph object.

Several measures of resource usage are recorded for “real world” graphs of various sizes:

  • time for construction

  • memory usage of the final StellarGraph object

  • peak memory usage during StellarGraph construction (both absolute, and additional compared to the raw input data)

These are recorded both with explicit nodes (and node features, if they exist) and with implicit/inferred nodes.

The memory usage is recorded end-to-end: the recording starts from data on disk and continues until the StellarGraph object has been constructed and all other data has been cleaned up. This is important for measuring the total memory usage accurately, because NumPy arrays can often share data with existing arrays in memory, so a retroactive or partial analysis (one that starts from data already in memory) can miss significant amounts of data. The parsing code in stellargraph.datasets doesn’t allow determining the memory usage of the intermediate nodes and edges passed to the StellarGraph constructor, and so cannot be used here.
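
As a small illustration of the sharing issue (a hypothetical sketch, separate from the measurement code below):

import numpy as np
import pandas as pd

features = np.zeros((1000, 100), dtype=np.float32)
frame = pd.DataFrame(features)

# depending on the pandas version, the DataFrame typically wraps the existing
# buffer rather than copying it, so inspecting only `frame` after the fact
# would misattribute (or miss) the memory shared between the two objects
print(np.shares_memory(features, frame.to_numpy()))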

[3]:
import stellargraph as sg
import pandas as pd
import numpy as np

import gc
import json
import os
import timeit
import tempfile
import tracemalloc

Optional reddit data

The original GraphSAGE paper evaluated on a reddit dataset, available at http://snap.stanford.edu/graphsage/#datasets. This dataset is large (1.3GB compressed), so there is no automatic download support for it. The following reddit_path variable controls whether and how the reddit dataset is included:

  • to ignore the dataset: set the variable to None

  • to include the dataset: download the dataset zip, decompress it, and set the variable to the decompressed directory

[4]:
reddit_path = os.path.expanduser("~/data/reddit")

Datasets

Cora

[5]:
cora = sg.datasets.Cora()
cora.download()

cora_cites_path = os.path.join(cora.data_directory, "cora.cites")
cora_content_path = os.path.join(cora.data_directory, "cora.content")
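# column 0 is the paper ID; columns 1 to 1433 are the binary word features, read
# as float32 to halve their in-memory size (the class-label column is excluded
# via `usecols` below)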
cora_dtypes = {0: int, **{i: np.float32 for i in range(1, 1433 + 1)}}


def cora_pandas_parts(include_nodes):
    if include_nodes:
        nodes = pd.read_csv(
            cora_content_path,
            header=None,
            sep="\t",
            index_col=0,
            usecols=range(0, 1433 + 1),
            dtype=cora_dtypes,
            na_filter=False,
        )
    else:
        nodes = None
    edges = pd.read_csv(
        cora_cites_path,
        header=None,
        sep="\t",
        names=["source", "target"],
        dtype=int,
        na_filter=False,
    )
    return nodes, edges, {}


def cora_indexed_array_parts(include_nodes):
    nodes, edges, args = cora_pandas_parts(include_nodes)
    if nodes is not None:
        nodes = sg.IndexedArray(nodes.to_numpy(), index=nodes.index)
    return nodes, edges, args

BlogCatalog3

[6]:
blogcatalog3 = sg.datasets.BlogCatalog3()
blogcatalog3.download()

blogcatalog3_edges = os.path.join(blogcatalog3.data_directory, "edges.csv")
blogcatalog3_group_edges = os.path.join(blogcatalog3.data_directory, "group-edges.csv")
blogcatalog3_groups = os.path.join(blogcatalog3.data_directory, "groups.csv")
blogcatalog3_nodes = os.path.join(blogcatalog3.data_directory, "nodes.csv")


def blogcatalog3_parts(include_nodes):
    if include_nodes:
        raw_nodes = pd.read_csv(blogcatalog3_nodes, header=None)[0]
        raw_groups = pd.read_csv(blogcatalog3_groups, header=None)[0]
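        # the group IDs are negated so they cannot collide with the user IDs:
        # node IDs must be unique across all node types in a StellarGraph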
        nodes = {
            "user": pd.DataFrame(index=raw_nodes),
            "group": pd.DataFrame(index=-raw_groups),
        }
    else:
        nodes = None

    edges = pd.read_csv(blogcatalog3_edges, header=None, names=["source", "target"])

    group_edges = pd.read_csv(
        blogcatalog3_group_edges, header=None, names=["source", "target"]
    )
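    # negate the targets to match the negated group node IDs, and shift this
    # DataFrame's index past the friend edges so the two edge types don't
    # reuse the same edge IDs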
    group_edges["target"] *= -1
    start = len(edges)
    group_edges.index = range(start, start + len(group_edges))

    edges = {"friend": edges, "belongs": group_edges}
    return nodes, edges, {}

FB15k

[7]:
fb15k = sg.datasets.FB15k()
fb15k.download()
fb15k_files = [
    os.path.join(fb15k.data_directory, f"freebase_mtr100_mte100-{x}.txt")
    for x in ["train", "test", "valid"]
]


def fb15k_parts(include_nodes, usecols=None):
    loaded = [
        pd.read_csv(
            name,
            header=None,
            names=["source", "label", "target"],
            sep="\t",
            dtype=str,
            na_filter=False,
            usecols=usecols,
        )
        for name in fb15k_files
    ]
    edges = pd.concat(loaded, ignore_index=True)

    if include_nodes:
        # infer the set of nodes manually, in a memory-minimal way
        raw_nodes = set(edges.source)
        raw_nodes.update(edges.target)
        nodes = pd.DataFrame(index=raw_nodes)
    else:
        nodes = None

    return nodes, edges, {"edge_type_column": "label"}


def fb15k_no_edge_types_parts(include_nodes):
    nodes, edges, _ = fb15k_parts(include_nodes, usecols=["source", "target"])
    return nodes, edges, {}

reddit

As discussed above, the reddit dataset is large and optional. It is also slow to parse, as the graph structure is stored in a huge JSON file. Thus, we prepare the dataset by converting that JSON file into a NumPy edge-list array of shape (num_edges, 2), which is significantly faster to load from disk.

[8]:
%%time

# if requested, prepare the reddit dataset by saving the slow-to-read JSON to a temporary .npy file
if reddit_path is not None:
    reddit_graph_path = os.path.join(reddit_path, "reddit-G.json")
    reddit_feats_path = os.path.join(reddit_path, "reddit-feats.npy")

    with open(reddit_graph_path) as f:
        reddit_g = json.load(f)
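    # reddit-G.json stores the graph in node-link format: each entry of "links"
    # is one edge, with "source" and "target" giving the endpoint node IDs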
    reddit_numpy_edges = np.array([[x["source"], x["target"]] for x in reddit_g["links"]])

    reddit_edges_file = tempfile.NamedTemporaryFile(suffix=".npy")
    np.save(reddit_edges_file, reddit_numpy_edges)
CPU times: user 15.9 s, sys: 1.97 s, total: 17.8 s
Wall time: 17.9 s
[9]:
def reddit_numpy_parts(include_nodes):
    if include_nodes:
        nodes = np.load(reddit_feats_path).astype(np.float32)
    else:
        nodes = None

    raw_edges = np.load(reddit_edges_file.name)
    edges = pd.DataFrame(raw_edges, columns=["source", "target"])
    return nodes, edges, {}


def reddit_pandas_parts(include_nodes):
    nodes, edges, args = reddit_numpy_parts(include_nodes)
    if nodes is not None:
        nodes = pd.DataFrame(nodes)

    return nodes, edges, args

Collected

[10]:
datasets = {
    "Cora (Pandas)": cora_pandas_parts,
    "Cora (IndexedArray)": cora_indexed_array_parts,
    "BlogCatalog3": blogcatalog3_parts,
    "FB15k (no edge types)": fb15k_no_edge_types_parts,
    "FB15k": fb15k_parts,
}
if reddit_path is not None:
    datasets["reddit (Pandas)"] = reddit_pandas_parts
    datasets["reddit (NumPy)"] = reddit_numpy_parts

Measurement

[11]:
def mem_snapshot_diff(after, before):
    """Total memory difference between two tracemalloc.snapshot objects"""
    return sum(elem.size_diff for elem in after.compare_to(before, "lineno"))
[12]:
# names of columns computed by the measurement code
def measurement_columns(title):
    names = [
        "time",
        "memory (graph)",
        "memory (graph, not shared with data)",
        "peak memory (graph)",
        "peak memory (graph, ignoring data)",
        "memory (data)",
        "peak memory (data)",
    ]
    return [(title, x) for x in names]


columns = pd.MultiIndex.from_tuples(
    [
        ("graph", "nodes"),
        ("graph", "node feat size"),
        ("graph", "edges"),
        *measurement_columns("explicit nodes"),
        *measurement_columns("inferred nodes (no features)"),
    ]
)
[13]:
def measure_time(f, include_nodes):
    nodes, edges, args = f(include_nodes)
    start = timeit.default_timer()
    sg.StellarGraph(nodes, edges, **args)
    end = timeit.default_timer()
    return end - start
[14]:
def measure_memory(f, include_nodes):
    """
    Measure exactly what it takes to load the data.

    - the size of the original edge data (as a baseline)
    - the size of the final graph
    - the peak memory use of both

    This uses a similar technique to the 'allocation_benchmark' fixture in tests/test_utils/alloc.py.
    """
    gc.collect()
    # ensure we're measuring the worst-case peak, when no GC happens
    gc.disable()

    tracemalloc.start()
    snapshot_start = tracemalloc.take_snapshot()

    nodes, edges, args = f(include_nodes)

    gc.collect()
    _, data_memory_peak = tracemalloc.get_traced_memory()
    snapshot_data = tracemalloc.take_snapshot()

    if include_nodes:
        assert nodes is not None, f
        sg_g = sg.StellarGraph(nodes, edges, **args)
    else:
        assert nodes is None, f
        sg_g = sg.StellarGraph(edges=edges, **args)

    gc.collect()
    snapshot_graph = tracemalloc.take_snapshot()

    # clean up the input data and anything else leftover, so that the snapshot
    # includes only the long-lasting data: the StellarGraph.
    del edges
    del nodes
    del args
    gc.collect()

    _, graph_memory_peak = tracemalloc.get_traced_memory()
    snapshot_end = tracemalloc.take_snapshot()
    tracemalloc.stop()

    gc.enable()

    data_memory = mem_snapshot_diff(snapshot_data, snapshot_start)
    graph_memory = mem_snapshot_diff(snapshot_end, snapshot_start)
    graph_over_data_memory = mem_snapshot_diff(snapshot_graph, snapshot_data)

    return (
        sg_g,
        graph_memory,
        graph_over_data_memory,
        graph_memory_peak,
        graph_memory_peak - data_memory,
        data_memory,
        data_memory_peak,
    )
[15]:
def measure(f):
    time_nodes = measure_time(f, include_nodes=True)
    time_no_nodes = measure_time(f, include_nodes=False)

    sg_g, *mem_nodes = measure_memory(f, include_nodes=True)
    _, *mem_no_nodes = measure_memory(f, include_nodes=False)

    feat_sizes = sg_g.node_feature_sizes()
    try:
        feat_sizes = feat_sizes[sg_g.unique_node_type()]
    except ValueError:
        pass

    return [
        sg_g.number_of_nodes(),
        feat_sizes,
        sg_g.number_of_edges(),
        time_nodes,
        *mem_nodes,
        time_no_nodes,
        *mem_no_nodes,
    ]
[16]:
%%time
recorded = [measure(f) for f in datasets.values()]
CPU times: user 28 s, sys: 7.04 s, total: 35 s
Wall time: 35.1 s
[17]:
raw = pd.DataFrame(recorded, columns=columns, index=datasets.keys())
raw
[17]:
columns: "graph" (nodes, node feat size, edges), followed by the seven measurement columns under "explicit nodes" and the same seven under "inferred nodes (no features)"; memory values are in bytes, times in seconds:
time, memory (graph), memory (graph, not shared with data), peak memory (graph), peak memory (graph, ignoring data), memory (data), peak memory (data)
Cora (Pandas) 2708 1433 5429 0.025028 15586530 15564897 46764625 31079400 15685225 31995857 0.002037 60994 63025 251118 160985 90133 197529
Cora (IndexedArray) 2708 1433 5429 0.001163 15585170 40633 31993945 16356516 15637429 31993945 0.001545 61018 63049 251118 160985 90133 197529
BlogCatalog3 10351 {'group': 0, 'user': 0} 348459 0.020382 4635099 7428226 14146092 8477331 5668761 10805413 0.027501 4633843 7427186 14061652 8479763 5581889 10711633
FB15k (no edge types) 14951 0 592213 0.098957 3970442 2985020 25830730 10151739 15678991 25830730 0.184846 3969362 3107220 34644016 19090289 15553727 25049683
FB15k 14951 0 592213 0.610353 9793950 13398273 57650243 36747424 20902819 35792614 0.700297 9794126 13521649 57650811 36873168 20777643 35011663
reddit (Pandas) 232965 602 11606919 3.130784 665684661 665691152 1868696353 1121990320 746706033 1682947017 0.483119 106555123 106556406 375628530 189913865 185714665 185723196
reddit (NumPy) 232965 602 11606919 0.545932 665684061 104705536 1682947017 936252548 746694469 1682947017 0.475468 106555123 106556406 375628530 189913865 185714665 185723196

Pretty results

This shows the results in a prettier way, with memory in MB instead of bytes.

[18]:
mem_columns = raw.columns[["memory" in x[1] for x in raw.columns]]

memory_mb = raw.copy()
memory_mb[mem_columns] = (memory_mb[mem_columns] / 10 ** 6).round(3)
memory_mb
[18]:
columns: "graph" (nodes, node feat size, edges), followed by the seven measurement columns under "explicit nodes" and the same seven under "inferred nodes (no features)"; memory values are in MB, times in seconds:
time, memory (graph), memory (graph, not shared with data), peak memory (graph), peak memory (graph, ignoring data), memory (data), peak memory (data)
Cora (Pandas) 2708 1433 5429 0.025028 15.587 15.565 46.765 31.079 15.685 31.996 0.002037 0.061 0.063 0.251 0.161 0.090 0.198
Cora (IndexedArray) 2708 1433 5429 0.001163 15.585 0.041 31.994 16.357 15.637 31.994 0.001545 0.061 0.063 0.251 0.161 0.090 0.198
BlogCatalog3 10351 {'group': 0, 'user': 0} 348459 0.020382 4.635 7.428 14.146 8.477 5.669 10.805 0.027501 4.634 7.427 14.062 8.480 5.582 10.712
FB15k (no edge types) 14951 0 592213 0.098957 3.970 2.985 25.831 10.152 15.679 25.831 0.184846 3.969 3.107 34.644 19.090 15.554 25.050
FB15k 14951 0 592213 0.610353 9.794 13.398 57.650 36.747 20.903 35.793 0.700297 9.794 13.522 57.651 36.873 20.778 35.012
reddit (Pandas) 232965 602 11606919 3.130784 665.685 665.691 1868.696 1121.990 746.706 1682.947 0.483119 106.555 106.556 375.629 189.914 185.715 185.723
reddit (NumPy) 232965 602 11606919 0.545932 665.684 104.706 1682.947 936.253 746.694 1682.947 0.475468 106.555 106.556 375.629 189.914 185.715 185.723
