Factory Functions

Factory Functions are a special collection of functions that return an instance of a class.

All factory functions are methods of the GDS class. You can call a factory function after instantiating a TigerGraph Connection. For example:

conn = TigerGraphConnection(
    host="http://127.0.0.1",
    graphname="Cora",
    username="tigergraph",
    password="tigergraph",
    useCert=False
)
edge_loader = conn.gds.edgeLoader(
    num_batches=1,
    attributes=["time", "is_train"])

The object returned has access to instance methods of the class. You can find the reference for those classes on the following pages:

If you are not sure how to configure the optional arguments related to Kafka, leave them blank. More detailed instructions on how to use them will be provided in a future release.

neighborLoader()

neighborLoader(v_in_feats: Union[list, dict] = None, v_out_labels: Union[list, dict] = None, v_extra_feats: Union[list, dict] = None, e_in_feats: Union[list, dict] = None, e_out_labels: Union[list, dict] = None, e_extra_feats: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, num_neighbors: int = 10, num_hops: int = 2, shuffle: bool = False, filter_by: str = None, output_format: str = "PyG", add_self_loop: bool = False, loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → NeighborLoader

Returns a NeighborLoader instance. A NeighborLoader instance performs neighbor sampling from all vertices in the graph in batches in the following manner:

  1. It chooses a specified number (batch_size) of vertices as seeds. The number of batches is the total number of vertices divided by the batch size.

    • If you specify the number of batches (num_batches) instead, batch_size is calculated by dividing the total number of vertices by the number of batches. If specify both parameters, batch_size takes priority.

  2. It picks a specified number (num_neighbors) of neighbors of each seed at random.

  3. It picks the same number of neighbors for each neighbor, and repeats this process until it finished performing a specified number of hops (num_hops).

This generates one subgraph. As you loop through this data loader, every vertex will at some point be chosen as a seed and you will get the subgraph expanded from the seeds. If you want to limit seeds to certain vertices, the boolean attribute provided to filter_by will be used to indicate which vertices can be included as seeds.

When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so it will take no time when you initialize the loader on the same graph again.

Parameters:

  • v_in_feats (list, optional): Vertex attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.

  • v_out_labels (list, optional): Vertex attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.

  • v_extra_feats (list, optional): Other attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.

  • e_in_feats (list, optional): Edge attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.

  • e_out_labels (list, optional): Edge attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.

  • e_extra_feats (list, optional): Other edge attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.

  • batch_size (int, optional): Number of vertices as seeds in each batch. Defaults to None.

  • num_batches (int, optional): Number of batches to split the vertices into as seeds. If both batch_size and num_batches are provided, batch_size takes higher priority. Defaults to 1.

  • num_neighbors (int, optional): Number of neighbors to sample for each vertex. Defaults to 10.

  • num_hops (int, optional): Number of hops to traverse when sampling neighbors. Defaults to 2.

  • shuffle (bool, optional): Whether to shuffle the vertices before loading data. Defaults to False.

  • filter_by (str, optional): A boolean attribute used to indicate which vertices can be included as seeds. Defaults to None.

  • output_format (str, optional): Format of the output data of the loader. Only "PyG", "DGL" and "dataframe" are supported. Defaults to "PyG".

  • add_self_loop (bool, optional): Whether to add self-loops to the graph. Defaults to False.

  • loader_id (str, optional): An identifier of the loader which can be any string. It is also used as the Kafka topic name. If None, a random string will be generated for it. Defaults to None.

  • buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.

  • kafka_address (str, optional): Address of the kafka broker. Defaults to None.

  • kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.

  • kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.

  • kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.

  • kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader in milliseconds. Defaults to 60000.

  • kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.

  • kafka_address_consumer (str, optional): Address of the kafka broker that a consumer should use. Defaults to be the same as kafkaAddress.

  • kafka_address_producer (str, optional): Address of the kafka broker that a producer should use. Defaults to be the same as kafkaAddress.

  • timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.

edgeLoader()

edgeLoader(attributes: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, shuffle: bool = False, filter_by: str = None, output_format: str = "dataframe", loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → EdgeLoader

Returns an EdgeLoader instance. An EdgeLoader instance loads all edges in the graph in batches.

It divides all edges into num_batches and returns each batch separately. You can also specify the size of each batch, and the number of batched is calculated accordingly. If you provide both parameters, batch_size take priority. The boolean attribute provided to filter_by indicates which edges are included. If you need random batches, set shuffle to True.

When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so it will take no time when you initialize the loader on the same graph again.

There are two ways to use the data loader.

  • It can be used as an iterable, which means you can loop through it to get every batch of data. If you load all edges at once (num_batches=1), there will be only one batch (of all the edges) in the iterator.

  • You can access the data property of the class directly. If there is only one batch of data to load, it will give you the batch directly instead of an iterator. If there are multiple batches of data to load, it returns the loader itself.

Parameters:

  • attributes (list, optional): Edge attributes to be included. Defaults to None.

  • batch_size (int, optional): Number of edges in each batch. Defaults to None.

  • num_batches (int, optional): Number of batches to split the edges. Defaults to 1.

  • shuffle (bool, optional): Whether to shuffle the edges before loading data. Defaults to False.

  • filter_by (str, optional): A boolean attribute used to indicate which edges are included. Defaults to None.

  • output_format (str, optional): Format of the output data of the loader. Only "dataframe" is supported. Defaults to "dataframe".

  • loader_id (str, optional): An identifier of the loader which can be any string. It is also used as the Kafka topic name. If None, a random string will be generated for it. Defaults to None.

  • buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.

  • kafka_address (str, optional): Address of the kafka broker. Defaults to None.

  • kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.

  • kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.

  • kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.

  • kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader in milliseconds. Defaults to 60000.

  • kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.

  • kafka_address_consumer (str, optional): Address of the kafka broker that a consumer should use. Defaults to be the same as kafkaAddress.

  • kafka_address_producer (str, optional): Address of the kafka broker that a producer should use. Defaults to be the same as kafkaAddress.

  • timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.

vertexLoader()

vertexLoader(attributes: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, shuffle: bool = False, filter_by: str = None, output_format: str = "dataframe", loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → VertexLoader

Returns a VertexLoader instance. A VertexLoader can load all vertices of a graph in batches.

It divides vertices into num_batches and returns each batch separately. The boolean attribute provided to filter_by indicates which vertices are included. If you need random batches, set shuffle to True.

When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so it will take no time when you initialize the loader on the same graph again.

There are two ways to use the data loader:

  • It can be used as an iterable, which means you can loop through it to get every batch of data. If you load all vertices at once (num_batches=1), there will be only one batch (of all the vertices) in the iterator.

  • You can access the data property of the class directly. If there is only one batch of data to load, it will give you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it will return the loader again.

Parameters:

  • attributes (list, optional): Vertex attributes to be included. Defaults to None.

  • batch_size (int, optional): Number of vertices in each batch. Defaults to None.

  • num_batches (int, optional): Number of batches to split the vertices. Defaults to 1.

  • shuffle (bool, optional): Whether to shuffle the vertices before loading data. Defaults to False.

  • filter_by (str, optional): A boolean attribute used to indicate which vertices can be included. Defaults to None.

  • output_format (str, optional): Format of the output data of the loader. Only "dataframe" is supported. Defaults to "dataframe".

  • loader_id (str, optional): An identifier of the loader which can be any string. It is also used as the Kafka topic name. If None, a random string will be generated for it. Defaults to None.

  • buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.

  • kafka_address (str, optional): Address of the kafka broker. Defaults to None.

  • kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.

  • kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.

  • kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.

  • kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader in milliseconds. Defaults to 60000.

  • kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.

  • kafka_address_consumer (str, optional): Address of the kafka broker that a consumer should use. Defaults to be the same as kafkaAddress.

  • kafka_address_producer (str, optional): Address of the kafka broker that a producer should use. Defaults to be the same as kafkaAddress.

  • timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.

graphLoader()

graphLoader(v_in_feats: Union[list, dict] = None, v_out_labels: Union[list, dict] = None, v_extra_feats: Union[list, dict] = None, e_in_feats: Union[list, dict] = None, e_out_labels: Union[list, dict] = None, e_extra_feats: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, shuffle: bool = False, filter_by: str = None, output_format: str = "PyG", add_self_loop: bool = False, loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → GraphLoader

Returns a GraphLoader`instance. A `GraphLoader instance loads all edges from the graph in batches, along with the vertices that are connected with each edge.

Different from NeighborLoader which produces connected subgraphs, this loader generates (random) batches of edges and vertices attached to those edges.

When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so it will take no time when you initialize the loader on the same graph again.

There are two ways to use the data loader:

  • It can be used as an iterable, which means you can loop through it to get every batch of data. If you load all data at once (num_batches=1), there will be only one batch (of all the data) in the iterator.

  • You can access the data property of the class directly. If there is only one batch of data to load, it will give you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it will return the loader itself.

Parameters:

  • v_in_feats (list, optional): Vertex attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.

  • v_out_labels (list, optional): Vertex attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.

  • v_extra_feats (list, optional): Other attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.

  • e_in_feats (list, optional): Edge attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.

  • e_out_labels (list, optional): Edge attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.

  • e_extra_feats (list, optional): Other edge attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.

  • batch_size (int, optional): Number of edges in each batch. Defaults to None.

  • num_batches (int, optional): Number of batches to split the edges. Defaults to 1.

  • shuffle (bool, optional): Whether to shuffle the data before loading. Defaults to False.

  • filter_by (str, optional): A boolean attribute used to indicate which edges can be included. Defaults to None.

  • output_format (str, optional): Format of the output data of the loader. Only "PyG", "DGL" and "dataframe" are supported. Defaults to "dataframe".

  • add_self_loop (bool, optional): Whether to add self-loops to the graph. Defaults to False.

  • loader_id (str, optional): An identifier of the loader which can be any string. It is also used as the Kafka topic name. If None, a random string will be generated for it. Defaults to None.

  • buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.

  • kafka_address (str, optional): Address of the kafka broker. Defaults to None.

  • kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.

  • kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.

  • kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.

  • kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader in milliseconds. Defaults to 60000.

  • kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.

  • kafka_address_consumer (str, optional): Address of the kafka broker that a consumer should use. Defaults to be the same as kafkaAddress.

  • kafka_address_producer (str, optional): Address of the kafka broker that a producer should use. Defaults to be the same as kafkaAddress.

  • timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.

featurizer()

featurizer() → Featurizer

Get a featurizer.

Returns:

Featurizer

vertexSplitter()

vertexSplitter(timeout: int = 600000)

Get a vertex splitter that splits vertices into at most 3 parts randomly.

The split results are stored in the provided vertex attributes. Each boolean attribute indicates which part a vertex belongs to.

Usage:

  • A random 60% of vertices will have their attribute attr_name set to True, and others False. attr_name can be any attribute that exists in the database (same below). Example:

conn = TigerGraphConnection(...)
splitter = RandomVertexSplitter(conn, timeout, attr_name=0.6)
splitter.run()
  • A random 60% of vertices will have their attribute "attr_name" set to True, and a random 20% of vertices will have their attribute "attr_name2" set to True. The two parts are disjoint. Example:

conn = TigerGraphConnection(...)
splitter = RandomVertexSplitter(conn, timeout, attr_name=0.6, attr_name2=0.2)
splitter.run()
  • A random 60% of vertices will have their attribute "attr_name" set to True, a random 20% of vertices will have their attribute "attr_name2" set to True, and another random 20% of vertices will have their attribute "attr_name3" set to True. The three parts are disjoint. Example:

conn = TigerGraphConnection(...)
splitter = RandomVertexSplitter(conn, timeout, attr_name=0.6, attr_name2=0.2, attr_name3=0.2)
splitter.run()

Parameter:

  • timeout (int, optional): Timeout value for the operation. Defaults to 600000.

edgeSplitter()

edgeSplitter(timeout: int = 600000)

Get an edge splitter that splits edges into at most 3 parts randomly.

The split results are stored in the provided edge attributes. Each boolean attribute indicates which part an edge belongs to.

Usage:

  • A random 60% of edges will have their attribute "attr_name" set to True, and others False. attr_name can be any attribute that exists in the database (same below). Example:

    conn = TigerGraphConnection(...)
    splitter = conn.gds.edgeSplitter(timeout, attr_name=0.6)
    splitter.run()
  • A random 60% of edges will have their attribute "attr_name" set to True, and a random 20% of edges will have their attribute "attr_name2" set to True. The two parts are disjoint. Example:

    conn = TigerGraphConnection(...)
    splitter = conn.gds.edgeSplitter(timeout, attr_name=0.6, attr_name2=0.2)
    splitter.run()
  • A random 60% of edges will have their attribute "attr_name" set to True, a random 20% of edges will have their attribute "attr_name2" set to True, and another random 20% of edges will have their attribute "attr_name3" set to True. The three parts are disjoint. Example:

    conn = TigerGraphConnection(...)
    splitter = conn.gds.edgeSplitter(timeout, attr_name=0.6, attr_name2=0.2, attr_name3=0.2)
    splitter.run()

Parameter:

timeout (int, optional): Timeout value for the operation. Defaults to 600000.