Module tgml.dataloaders

Class EdgeLoader

class EdgeLoader (graph: TigerGraph, batch_size: int = None,
    num_batches: int = 1, local_storage_path: str = './tmp',
    cloud_storage_path: str = None, buffer_size: int = 4,
    output_format: str = 'dataframe', cache_id: str = None,
    aws_access_key_id: str = None, aws_secret_access_key: str = None)

Data loader that pulls either the whole edge list or batches of edges from the database. Edge attributes are not supported.

The first time you initialize the loader on a graph in TigerGraph, initialization might take about half a minute as it installs the corresponding query to the database and optimizes it. However, the query installation only needs to be done once, so initializing the loader on the same TG graph again takes no time. For the data loader to work, the Graph Data Processing Service has to be running on the TigerGraph server.

There are two ways to use the data loader:

  • First, it can be used as an iterator, which means you can loop through it to get every batch of data. If you load all edges at once (num_batches=1), there will be only one batch (of all the edges) in the iterator.

  • Second, you can access the data property of the class directly. If there is only one batch of data to load, it gives you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it returns the loader itself.

The loader can either stream data directly from the server or cache data on the cloud. Set cloud_storage_path to turn on the cloud cache. With caching, data is moved to cloud storage first and then downloaded locally, so it is slower than streaming directly from the server. However, when there are multiple consumers of the same data, such as when trying out different models in parallel or tuning hyperparameters, cloud caching reduces the workload of the server and might consequently be faster overall.

If cloud caching is used, cloud storage access keys need to be provided. For AWS S3, aws_access_key_id and aws_secret_access_key are required. The class can also read them from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and hence it is recommended to store those credentials in a .env file instead of hardcoding them.
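
The two access patterns above can be illustrated with a minimal stand-in loader. ToyLoader below is hypothetical and not part of tgml; it only mimics the iterator / data-property behavior described here, using plain lists instead of pandas dataframes.

```python
# Minimal stand-in that mimics the loader's two access patterns.
# ToyLoader is hypothetical -- NOT part of tgml -- and its "batches"
# are plain Python lists rather than pandas dataframes.
class ToyLoader:
    def __init__(self, edges, num_batches=1):
        size = -(-len(edges) // num_batches)  # ceiling division
        self._batches = [edges[i:i + size] for i in range(0, len(edges), size)]

    def __iter__(self):
        # Pattern 1: loop through the loader to get every batch.
        return iter(self._batches)

    @property
    def data(self):
        # Pattern 2: one batch -> return it directly; multiple
        # batches -> return the loader itself for iteration.
        if len(self._batches) == 1:
            return self._batches[0]
        return self

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

single = ToyLoader(edges, num_batches=1)
print(single.data)        # the one batch directly: [(0, 1), (1, 2), (2, 3), (3, 0)]

multi = ToyLoader(edges, num_batches=2)
for batch in multi.data:  # with multiple batches, .data returns the loader
    print(batch)
```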

Parameters

graph : TigerGraph

Connection to the TigerGraph database.

batch_size : int, optional

Size of each batch. If given, num_batches will be recalculated based on batch size. Defaults to None.
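
The recalculation of num_batches from batch_size presumably amounts to a ceiling division over the total number of edges. The helper below is only a sketch; the exact rule is internal to the loader.

```python
import math

# Sketch: derive num_batches from batch_size via ceiling division
# (illustrative only; the loader's exact rule is internal).
def recompute_num_batches(total_edges: int, batch_size: int) -> int:
    return math.ceil(total_edges / batch_size)

print(recompute_num_batches(1000, 300))  # 4
```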

num_batches : int, optional

Number of batches to split the whole dataset. Defaults to 1.

local_storage_path : str, optional

Place to store data locally. Defaults to "./tmp".

cloud_storage_path : str, optional

S3 path used for cloud caching. If not None, cloud caching will be used. Defaults to None.

buffer_size : int, optional

Number of data batches to prefetch and store in memory. Defaults to 4.

output_format : str, optional

Format of the output data of the loader. Only pandas dataframe is supported. Defaults to "dataframe".

cache_id : str, optional

An identifier associated with data from this loader. If None, a random string will be generated automatically. Defaults to None.

aws_access_key_id : str, optional

AWS access key for cloud storage. Defaults to None.

aws_secret_access_key : str, optional

AWS access key secret for cloud storage. Defaults to None.

Class GraphLoader

class GraphLoader (graph: TigerGraph, v_in_feats: str = '',
    v_out_labels: str = '', v_extra_feats: str = '',
    local_storage_path: str = './tmp', cloud_storage_path: str = None,
    buffer_size: int = 4, output_format: str = 'PyG', num_batches: int = 1,
    cache_id: str = None, reindex: bool = False, aws_access_key_id: str = None,
    aws_secret_access_key: str = None)

Data loader that pulls the whole graph from database.

Note: The first time you initialize the loader on a graph in TigerGraph, initialization might take about half a minute as it installs the corresponding query to the database and optimizes it. However, the query installation only needs to be done once, so initializing the loader on the same TG graph again takes no time. For the data loader to work, the Graph Data Processing Service has to be running on the TigerGraph server.

There are two ways to use the data loader.

  • First, it can be used as an iterator, which means you can loop through it to get every batch of data. Since this loader loads the whole graph at once, there will be only one batch of data (of the whole graph) in the iterator.

  • Second, you can access the data property of the class directly. Since there is only one batch of data (the whole graph), it will give you the batch directly instead of an iterator.

The loader can either stream data directly from the server or cache data on the cloud. Set cloud_storage_path to turn on the cloud cache. With caching, data is moved to cloud storage first and then downloaded locally, so it is slower than streaming directly from the server.

However, when there are multiple consumers of the same data, such as when trying out different models in parallel or tuning hyperparameters, cloud caching reduces the workload of the server and might consequently be faster overall.

If cloud caching is used, cloud storage access keys need to be provided. For AWS S3, aws_access_key_id and aws_secret_access_key are required. The class can also read them from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and hence it is recommended to store those credentials in a .env file instead of hardcoding them.
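
A minimal sketch of the credential fallback described above: an explicit argument wins, otherwise the environment variables are used (typically populated from a .env file, e.g. via python-dotenv). The helper name is hypothetical; the loader does this internally.

```python
import os

# Sketch of the credential resolution described above (hypothetical
# helper, not part of tgml): explicit arguments take precedence,
# otherwise fall back to the environment.
def resolve_credentials(aws_access_key_id=None, aws_secret_access_key=None):
    key_id = aws_access_key_id or os.environ.get("AWS_ACCESS_KEY_ID")
    secret = aws_secret_access_key or os.environ.get("AWS_SECRET_ACCESS_KEY")
    return key_id, secret

# Simulate credentials loaded from a .env file into the environment.
os.environ["AWS_ACCESS_KEY_ID"] = "env-key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "env-secret"
print(resolve_credentials())                             # falls back to the environment
print(resolve_credentials("explicit-key", "explicit-secret"))
```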

Parameters

graph : TigerGraph

Connection to the TigerGraph database.

v_in_feats : str, optional

Attributes to be used as input features, together with their types. Attributes should be separated by ',' and an attribute should be separated from its type by ':'. The type of an attribute can be omitted together with the ':' separator, in which case the attribute defaults to type "float32". Defaults to "".
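
To make the attribute-string format concrete, here is an illustrative parser for it. The helper is hypothetical and not part of tgml; it only demonstrates how such a string could be interpreted.

```python
# Illustrative parser for the attribute-string format described above
# (hypothetical helper, not part of tgml): attributes are comma-separated,
# each optionally followed by ':' and a type, defaulting to "float32".
def parse_attribute_spec(spec: str) -> dict:
    result = {}
    for item in filter(None, (s.strip() for s in spec.split(","))):
        name, _, dtype = item.partition(":")
        result[name] = dtype or "float32"
    return result

print(parse_attribute_spec("age:int32,score,weight:float64"))
# {'age': 'int32', 'score': 'float32', 'weight': 'float64'}
```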

v_out_labels : str, optional

Attributes to be used as labels for prediction. It follows the same format as 'v_in_feats'. Defaults to "".

v_extra_feats : str, optional

Other attributes to get such as indicators of train/test data. It follows the same format as 'v_in_feats'. Defaults to "".

local_storage_path : str, optional

Place to store data locally. Defaults to "./tmp".

cloud_storage_path : str, optional

S3 path used for cloud caching. If not None, cloud caching will be used. Defaults to None.

buffer_size : int, optional

Number of data batches to prefetch and store in memory. Defaults to 4.

output_format : str, optional

Format of the output data of the loader. Only "PyG" is supported. Defaults to "PyG".

reindex : bool, optional

Whether to reindex the vertices. Defaults to False.

cache_id : str, optional

An identifier associated with data from this loader. If None, a random string will be generated automatically. Defaults to None.

aws_access_key_id : str, optional

AWS access key for cloud storage. Defaults to None.

aws_secret_access_key : str, optional

AWS access key secret for cloud storage. Defaults to None.

Class NeighborLoader

class NeighborLoader (graph: TigerGraph, tmp_id: str = 'tmp_id',
    v_in_feats: str = '', v_out_labels: str = '', v_extra_feats: str = '',
    local_storage_path: str = './tmp', cloud_storage_path: str = None,
    buffer_size: int = 4, output_format: str = 'PyG', batch_size: int = None,
    num_batches: int = 1, num_neighbors: int = 10, num_hops: int = 2,
    cache_id: str = None, shuffle: bool = False, filter_by: str = None,
    aws_access_key_id: str = None, aws_secret_access_key: str = None)

A data loader that performs neighbor sampling as introduced in the paper Inductive Representation Learning on Large Graphs.

Specifically, it first chooses batch_size vertices as seeds, then picks num_neighbors neighbors of each seed at random, then num_neighbors neighbors of each of those neighbors, and so on, repeating the expansion for num_hops hops. This generates one subgraph. As you loop through this data loader, every vertex is eventually chosen as a seed, and you get all the subgraphs expanded from those seeds.

If you want to limit seeds to certain vertices, the boolean attribute provided to filter_by will be used to indicate which vertices can be included as seeds.
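
The sampling scheme above can be sketched on a toy adjacency-list graph. This is illustrative only; the real loader performs the sampling inside the database.

```python
import random

# Toy sketch of the neighbor-sampling scheme described above, on an
# adjacency-list graph (illustrative only; the real loader samples
# inside the database).
def sample_subgraph(adj, seeds, num_neighbors, num_hops, rng):
    nodes = set(seeds)
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for v in frontier:
            neighbors = adj.get(v, [])
            k = min(num_neighbors, len(neighbors))
            for u in rng.sample(neighbors, k):
                if u not in nodes:
                    nodes.add(u)
                    next_frontier.append(u)
        frontier = next_frontier
    return nodes

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
# Expand a subgraph from seed vertex 0, sampling up to 2 neighbors per
# vertex for 2 hops.
print(sample_subgraph(adj, seeds=[0], num_neighbors=2, num_hops=2,
                      rng=random.Random(0)))
```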

The first time you initialize the loader on a graph in TigerGraph, initialization might take about half a minute as it installs the corresponding query to the database and optimizes it. However, the query installation only needs to be done once, so initializing the loader on the same TG graph again takes no time. For the data loader to work, the Graph Data Processing Service has to be running on the TigerGraph server.

There are two ways to use the data loader.

  • First, it can be used as an iterator, which means you can loop through it to get every batch of data. If you load all the data at once (num_batches=1), there will be only one batch in the iterator.

  • Second, you can access the data property of the class directly. If there is only one batch of data to load, it gives you the batch directly instead of an iterator. If there are multiple batches of data to load, it returns the loader again.

The loader can either stream data directly from the server or cache data on the cloud. Set cloud_storage_path to turn on the cloud cache. With caching, data is moved to cloud storage first and then downloaded locally, so it is slower than streaming directly from the server.

However, when there are multiple consumers of the same data, such as when trying out different models in parallel or tuning hyperparameters, cloud caching reduces the workload of the server and might consequently be faster overall.

If cloud caching is used, cloud storage access keys need to be provided. For AWS S3, aws_access_key_id and aws_secret_access_key are required. The class can also read them from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and hence it is recommended to store those credentials in a .env file instead of hardcoding them.

Parameters

graph : TigerGraph

Connection to the TigerGraph database.

tmp_id : str, optional

Attribute name that holds the temporary ID of vertices. Defaults to "tmp_id".

v_in_feats : str, optional

Attributes to be used as input features, together with their types. Attributes should be separated by ',' and an attribute should be separated from its type by ':'. The type of an attribute can be omitted together with the ':' separator, in which case the attribute defaults to type "float32". Defaults to "".

v_out_labels : str, optional

Attributes to be used as labels for prediction. It follows the same format as 'v_in_feats'. Defaults to "".

v_extra_feats : str, optional

Other attributes to get such as indicators of train/test data. It follows the same format as 'v_in_feats'. Defaults to "".

local_storage_path : str, optional

Place to store data locally. Defaults to "./tmp".

cloud_storage_path : str, optional

S3 path used for cloud caching. If not None, cloud caching will be used. Defaults to None.

buffer_size : int, optional

Number of data batches to prefetch and store in memory. Defaults to 4.

output_format : str, optional

Format of the output data of the loader. Accepted values are "PyG" and "DGL". Defaults to "PyG".

batch_size : int, optional

Number of vertices as seeds in each batch. Defaults to None.

num_batches : int, optional

Number of batches to split the vertices. Defaults to 1.

num_neighbors : int, optional

Number of neighbors to sample for each vertex. Defaults to 10.

num_hops : int, optional

Number of hops to traverse when sampling neighbors. Defaults to 2.

shuffle : bool, optional

Whether to shuffle the vertices after every epoch. Defaults to False.

filter_by : str, optional

A boolean attribute used to indicate which vertices can be included as seeds. Defaults to None.

cache_id : str, optional

An identifier associated with data from this loader. If None, a random string will be generated automatically. Defaults to None.

aws_access_key_id : str, optional

AWS access key for cloud storage. Defaults to None.

aws_secret_access_key : str, optional

AWS access key secret for cloud storage. Defaults to None.

Instance methods

.inference()

def inference(self) -> None

Sets the loader to inference mode. This allows the loader to fetch data for specific vertices so that users can run predictions and inference on those vertices. It also resets the loader to clear any job or data left over from training, and starts the workers that fetch inference data.

Parameters

None.

Return value

None.

.fetch()

def fetch(self, input_vertices: Union[dict, List[dict]])

Fetches specific data instances for inference or prediction. This method only works if the loader has been set to inference mode by calling the inference() method.

Parameters
input_vertices: dict or list of dict

The data instances to fetch. The parameter can be a single dict or a list of dicts. A single dictionary is regarded as one data instance, while a list of dictionaries is regarded as multiple instances. Each dict should have two keys, "id" and "type", for the vertex ID and vertex type, respectively. For example, {"id": "57", "type": "Paper"}.
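
A short sketch of building the input_vertices argument. The vertex IDs and the "Paper" type are just examples, and the validation helper is hypothetical, not part of tgml.

```python
# Building the input_vertices argument for fetch(): a single dict for
# one vertex, or a list of dicts for several. The IDs and the "Paper"
# type below are examples only.
single_vertex = {"id": "57", "type": "Paper"}
multiple_vertices = [
    {"id": "57", "type": "Paper"},
    {"id": "1024", "type": "Paper"},
]

# A light sanity check one might run before calling loader.fetch(...)
# (hypothetical helper, not part of tgml).
def normalize_input_vertices(input_vertices):
    vertices = ([input_vertices] if isinstance(input_vertices, dict)
                else list(input_vertices))
    for v in vertices:
        if not {"id", "type"} <= v.keys():
            raise ValueError(f"each vertex needs 'id' and 'type' keys, got {v}")
    return vertices

print(normalize_input_vertices(single_vertex))  # [{'id': '57', 'type': 'Paper'}]
print(len(normalize_input_vertices(multiple_vertices)))  # 2
```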

Return value

A PyG or DGL graph (depending on the output_format parameter of the loader) containing the input vertices.

Class VertexLoader

class VertexLoader (graph: TigerGraph, batch_size: int = None,
    num_batches: int = 1, attributes: str = '',
    local_storage_path: str = './tmp', cloud_storage_path: str = None,
    buffer_size: int = 4, output_format: str = 'dataframe', cache_id: str = None,
    aws_access_key_id: str = None, aws_secret_access_key: str = None)

Data loader that pulls either the whole vertex list or batches of vertices from the database.

Parameters

graph : TigerGraph

Connection to the TigerGraph database.

batch_size : int, optional

Size of each batch. If given, num_batches will be recalculated based on batch size. Defaults to None.

num_batches : int, optional

Number of batches to split the whole dataset. Defaults to 1.

attributes : str, optional

Vertex attributes to get, separated by ','. Defaults to "".

local_storage_path : str, optional

Place to store data locally. Defaults to "./tmp".

cloud_storage_path : str, optional

S3 path used for cloud caching. If not None, cloud caching will be used. Defaults to None.

buffer_size : int, optional

Number of data batches to prefetch and store in memory. Defaults to 4.

output_format : str, optional

Format of the output data of the loader. Only pandas dataframe is supported. Defaults to "dataframe".

cache_id : str, optional

An identifier associated with data from this loader. If None, a random string will be generated automatically. Defaults to None.

aws_access_key_id : str, optional

AWS access key for cloud storage. Defaults to None.

aws_secret_access_key : str, optional

AWS access key secret for cloud storage. Defaults to None.