Factory Functions
Factory functions are functions that return an instance of a data loader or splitter class.
All factory functions are methods of the GDS class.
You can call a factory function after instantiating a TigerGraphConnection.
For example:
from pyTigerGraph import TigerGraphConnection

conn = TigerGraphConnection(
    host="http://127.0.0.1",
    graphname="Cora",
    username="tigergraph",
    password="tigergraph",
    useCert=False
)

edge_loader = conn.gds.edgeLoader(
    num_batches=1,
    attributes=["time", "is_train"]
)
The returned object has access to the instance methods of its class. You can find the reference for those classes on the following pages.
Note: If you are not sure how to configure the optional arguments related to Kafka, leave them blank. More detailed instructions on how to use them will be provided in a future release.
neighborLoader()
neighborLoader(v_in_feats: Union[list, dict] = None, v_out_labels: Union[list, dict] = None, v_extra_feats: Union[list, dict] = None, e_in_feats: Union[list, dict] = None, e_out_labels: Union[list, dict] = None, e_extra_feats: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, num_neighbors: int = 10, num_hops: int = 2, shuffle: bool = False, filter_by: str = None, output_format: str = "PyG", add_self_loop: bool = False, loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → NeighborLoader
Returns a NeighborLoader instance.
A NeighborLoader instance performs neighbor sampling from all vertices in the graph in batches in the following manner:
- It chooses a specified number (batch_size) of vertices as seeds. The number of batches is the total number of vertices divided by the batch size. If you specify the number of batches (num_batches) instead, batch_size is calculated by dividing the total number of vertices by the number of batches. If you specify both parameters, batch_size takes priority.
- It picks a specified number (num_neighbors) of neighbors of each seed at random.
- It picks the same number of neighbors for each of those neighbors, and repeats this process until it has performed a specified number of hops (num_hops).
This generates one subgraph.
As you loop through this data loader, every vertex will at some point be chosen as a seed, and you will get the subgraph expanded from those seeds.
If you want to limit seeds to certain vertices, provide a boolean attribute to filter_by; it indicates which vertices can be included as seeds.
Note: When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so initializing the loader on the same graph again takes no time.
See the ML Workbench tutorial notebook for examples.
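In the meantime, here is a minimal sketch of two-hop neighbor sampling. The vertex attributes "x", "y", and "train_mask" are illustrative assumptions, not part of this reference:
# Assumes the connection `conn` from the introduction and hypothetical
# vertex attributes "x" (features), "y" (labels), and "train_mask".
loader = conn.gds.neighborLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    v_extra_feats=["train_mask"],
    batch_size=64,
    num_neighbors=10,
    num_hops=2,
    filter_by="train_mask",  # only vertices with train_mask=True become seeds
    shuffle=True,
    output_format="PyG"
)
for batch in loader:
    print(batch)  # each batch is a PyG graph object holding one subgraph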
Parameters:
- v_in_feats (list, optional): Vertex attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.
- v_out_labels (list, optional): Vertex attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.
- v_extra_feats (list, optional): Other attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.
- e_in_feats (list, optional): Edge attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.
- e_out_labels (list, optional): Edge attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.
- e_extra_feats (list, optional): Other edge attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.
- batch_size (int, optional): Number of vertices as seeds in each batch. Defaults to None.
- num_batches (int, optional): Number of batches to split the vertices into as seeds. If both batch_size and num_batches are provided, batch_size takes higher priority. Defaults to 1.
- num_neighbors (int, optional): Number of neighbors to sample for each vertex. Defaults to 10.
- num_hops (int, optional): Number of hops to traverse when sampling neighbors. Defaults to 2.
- shuffle (bool, optional): Whether to shuffle the vertices before loading data. Defaults to False.
- filter_by (str, optional): A boolean attribute used to indicate which vertices can be included as seeds. Defaults to None.
- output_format (str, optional): Format of the output data of the loader. Only "PyG", "DGL" and "dataframe" are supported. Defaults to "PyG".
- add_self_loop (bool, optional): Whether to add self-loops to the graph. Defaults to False.
- loader_id (str, optional): An identifier of the loader, which can be any string. It is also used as the Kafka topic name. If None, a random string is generated for it. Defaults to None.
- buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
- kafka_address (str, optional): Address of the Kafka broker. Defaults to None.
- kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.
- kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.
- kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.
- kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader, in milliseconds. Defaults to 60000.
- kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.
- kafka_address_consumer (str, optional): Address of the Kafka broker that a consumer should use. Defaults to the same as kafka_address.
- kafka_address_producer (str, optional): Address of the Kafka broker that a producer should use. Defaults to the same as kafka_address.
- timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.
edgeLoader()
edgeLoader(attributes: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, shuffle: bool = False, filter_by: str = None, output_format: str = "dataframe", loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → EdgeLoader
Returns an EdgeLoader instance.
An EdgeLoader instance loads all edges in the graph in batches.
It divides all edges into num_batches and returns each batch separately.
You can also specify the size of each batch, and the number of batches is calculated accordingly.
If you provide both parameters, batch_size takes priority.
The boolean attribute provided to filter_by indicates which edges are included.
If you need random batches, set shuffle to True.
Note: When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so initializing the loader on the same graph again takes no time.
There are two ways to use the data loader (see the sketch after this list):
- It can be used as an iterable, which means you can loop through it to get every batch of data. If you load all edges at once (num_batches=1), there will be only one batch (of all the edges) in the iterator.
- You can access the data property of the class directly. If there is only one batch of data to load, it gives you the batch directly instead of an iterator. If there are multiple batches of data to load, it returns the loader itself.
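A minimal sketch of both access patterns, reusing the edge_loader created in the introduction (which loads the "time" and "is_train" attributes in a single batch):
# Pattern 1: iterate over batches; with num_batches=1 there is one batch.
for batch in edge_loader:
    print(batch.head())  # each batch is a pandas DataFrame of edges

# Pattern 2: read the data property directly; with a single batch this
# returns the DataFrame itself rather than an iterator.
all_edges = edge_loader.data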
Parameters:
- attributes (list, optional): Edge attributes to be included. Defaults to None.
- batch_size (int, optional): Number of edges in each batch. Defaults to None.
- num_batches (int, optional): Number of batches to split the edges. Defaults to 1.
- shuffle (bool, optional): Whether to shuffle the edges before loading data. Defaults to False.
- filter_by (str, optional): A boolean attribute used to indicate which edges are included. Defaults to None.
- output_format (str, optional): Format of the output data of the loader. Only "dataframe" is supported. Defaults to "dataframe".
- loader_id (str, optional): An identifier of the loader, which can be any string. It is also used as the Kafka topic name. If None, a random string is generated for it. Defaults to None.
- buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
- kafka_address (str, optional): Address of the Kafka broker. Defaults to None.
- kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.
- kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.
- kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.
- kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader, in milliseconds. Defaults to 60000.
- kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.
- kafka_address_consumer (str, optional): Address of the Kafka broker that a consumer should use. Defaults to the same as kafka_address.
- kafka_address_producer (str, optional): Address of the Kafka broker that a producer should use. Defaults to the same as kafka_address.
- timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.
See the ML Workbench edge loader tutorial notebook for examples.
vertexLoader()
vertexLoader(attributes: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, shuffle: bool = False, filter_by: str = None, output_format: str = "dataframe", loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → VertexLoader
Returns a VertexLoader instance.
A VertexLoader can load all vertices of a graph in batches.
It divides vertices into num_batches and returns each batch separately.
The boolean attribute provided to filter_by indicates which vertices are included.
If you need random batches, set shuffle to True.
Note: When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so initializing the loader on the same graph again takes no time.
There are two ways to use the data loader (see the sketch after this list):
- It can be used as an iterable, which means you can loop through it to get every batch of data. If you load all vertices at once (num_batches=1), there will be only one batch (of all the vertices) in the iterator.
- You can access the data property of the class directly. If there is only one batch of data to load, it gives you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it returns the loader again.
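A minimal sketch, assuming the connection from the introduction and a hypothetical vertex attribute "x":
# Load all vertices in 10 shuffled batches; each batch is a pandas DataFrame.
vertex_loader = conn.gds.vertexLoader(
    attributes=["x"],
    num_batches=10,
    shuffle=True
)
for batch in vertex_loader:
    print(batch.shape)  # number of rows = vertices in this batch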
Parameters:
- attributes (list, optional): Vertex attributes to be included. Defaults to None.
- batch_size (int, optional): Number of vertices in each batch. Defaults to None.
- num_batches (int, optional): Number of batches to split the vertices. Defaults to 1.
- shuffle (bool, optional): Whether to shuffle the vertices before loading data. Defaults to False.
- filter_by (str, optional): A boolean attribute used to indicate which vertices can be included. Defaults to None.
- output_format (str, optional): Format of the output data of the loader. Only "dataframe" is supported. Defaults to "dataframe".
- loader_id (str, optional): An identifier of the loader, which can be any string. It is also used as the Kafka topic name. If None, a random string is generated for it. Defaults to None.
- buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
- kafka_address (str, optional): Address of the Kafka broker. Defaults to None.
- kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.
- kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.
- kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.
- kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader, in milliseconds. Defaults to 60000.
- kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.
- kafka_address_consumer (str, optional): Address of the Kafka broker that a consumer should use. Defaults to the same as kafka_address.
- kafka_address_producer (str, optional): Address of the Kafka broker that a producer should use. Defaults to the same as kafka_address.
- timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.
See the ML Workbench tutorial notebook for examples.
graphLoader()
graphLoader(v_in_feats: Union[list, dict] = None, v_out_labels: Union[list, dict] = None, v_extra_feats: Union[list, dict] = None, e_in_feats: Union[list, dict] = None, e_out_labels: Union[list, dict] = None, e_extra_feats: Union[list, dict] = None, batch_size: int = None, num_batches: int = 1, shuffle: bool = False, filter_by: str = None, output_format: str = "PyG", add_self_loop: bool = False, loader_id: str = None, buffer_size: int = 4, kafka_address: str = None, kafka_max_msg_size: int = 104857600, kafka_num_partitions: int = 1, kafka_replica_factor: int = 1, kafka_retention_ms: int = 60000, kafka_auto_del_topic: bool = True, kafka_address_consumer: str = None, kafka_address_producer: str = None, timeout: int = 300000) → GraphLoader
Returns a GraphLoader instance.
A GraphLoader instance loads all edges from the graph in batches, along with the vertices that are connected with each edge.
Unlike NeighborLoader, which produces connected subgraphs, this loader generates (random) batches of edges and the vertices attached to those edges.
Note: When you initialize the loader on a graph for the first time, the initialization might take a minute as it installs the corresponding query to the database. However, the query installation only needs to be done once, so initializing the loader on the same graph again takes no time.
There are two ways to use the data loader (see the sketch after this list):
- It can be used as an iterable, which means you can loop through it to get every batch of data. If you load all data at once (num_batches=1), there will be only one batch (of all the data) in the iterator.
- You can access the data property of the class directly. If there is only one batch of data to load, it gives you the batch directly instead of an iterator, which might make more sense in that case. If there are multiple batches of data to load, it returns the loader itself.
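A minimal sketch of loading the whole graph in edge batches. The attribute names "x", "y", and "time" are illustrative assumptions:
# Assumes the connection `conn` from the introduction.
graph_loader = conn.gds.graphLoader(
    v_in_feats=["x"],
    v_out_labels=["y"],
    e_in_feats=["time"],
    num_batches=10,
    output_format="PyG"
)
for batch in graph_loader:
    print(batch)  # each batch holds a set of edges plus their endpoint vertices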
Parameters:
- v_in_feats (list, optional): Vertex attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.
- v_out_labels (list, optional): Vertex attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.
- v_extra_feats (list, optional): Other attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.
- e_in_feats (list, optional): Edge attributes to be used as input features. Only numeric and boolean attributes are allowed. The type of an attribute is automatically determined from the database schema. Defaults to None.
- e_out_labels (list, optional): Edge attributes to be used as labels for prediction. Only numeric and boolean attributes are allowed. Defaults to None.
- e_extra_feats (list, optional): Other edge attributes to get such as indicators of train/test data. All types of attributes are allowed. Defaults to None.
- batch_size (int, optional): Number of edges in each batch. Defaults to None.
- num_batches (int, optional): Number of batches to split the edges. Defaults to 1.
- shuffle (bool, optional): Whether to shuffle the data before loading. Defaults to False.
- filter_by (str, optional): A boolean attribute used to indicate which edges can be included. Defaults to None.
- output_format (str, optional): Format of the output data of the loader. Only "PyG", "DGL" and "dataframe" are supported. Defaults to "PyG".
- add_self_loop (bool, optional): Whether to add self-loops to the graph. Defaults to False.
- loader_id (str, optional): An identifier of the loader, which can be any string. It is also used as the Kafka topic name. If None, a random string is generated for it. Defaults to None.
- buffer_size (int, optional): Number of data batches to prefetch and store in memory. Defaults to 4.
- kafka_address (str, optional): Address of the Kafka broker. Defaults to None.
- kafka_max_msg_size (int, optional): Maximum size of a Kafka message in bytes. Defaults to 104857600.
- kafka_num_partitions (int, optional): Number of partitions for the topic created by this loader. Defaults to 1.
- kafka_replica_factor (int, optional): Number of replications for the topic created by this loader. Defaults to 1.
- kafka_retention_ms (int, optional): Retention time for messages in the topic created by this loader, in milliseconds. Defaults to 60000.
- kafka_auto_del_topic (bool, optional): Whether to delete the Kafka topic once the loader finishes pulling data. Defaults to True.
- kafka_address_consumer (str, optional): Address of the Kafka broker that a consumer should use. Defaults to the same as kafka_address.
- kafka_address_producer (str, optional): Address of the Kafka broker that a producer should use. Defaults to the same as kafka_address.
- timeout (int, optional): Timeout value for GSQL queries, in ms. Defaults to 300000.
See the ML Workbench tutorial notebook for graph loaders for examples.
vertexSplitter()
vertexSplitter(timeout: int = 600000)
Get a vertex splitter that splits vertices into at most 3 parts randomly.
The split results are stored in the provided vertex attributes. Each boolean attribute indicates which part a vertex belongs to.
Usage:
- A random 60% of vertices will have their attribute attr_name set to True, and the others False. attr_name can be any attribute that exists in the database (same below). Example:
conn = TigerGraphConnection(...)
splitter = conn.gds.vertexSplitter(timeout, attr_name=0.6)
splitter.run()
- A random 60% of vertices will have their attribute attr_name set to True, and a random 20% of vertices will have their attribute attr_name2 set to True. The two parts are disjoint. Example:
conn = TigerGraphConnection(...)
splitter = conn.gds.vertexSplitter(timeout, attr_name=0.6, attr_name2=0.2)
splitter.run()
- A random 60% of vertices will have their attribute attr_name set to True, a random 20% of vertices will have their attribute attr_name2 set to True, and another random 20% of vertices will have their attribute attr_name3 set to True. The three parts are disjoint. Example:
conn = TigerGraphConnection(...)
splitter = conn.gds.vertexSplitter(timeout, attr_name=0.6, attr_name2=0.2, attr_name3=0.2)
splitter.run()
Parameter:
- timeout (int, optional): Timeout value for the operation. Defaults to 600000.
edgeSplitter()
edgeSplitter(timeout: int = 600000)
Get an edge splitter that splits edges into at most 3 parts randomly.
The split results are stored in the provided edge attributes. Each boolean attribute indicates which part an edge belongs to.
Usage:
- A random 60% of edges will have their attribute attr_name set to True, and the others False. attr_name can be any attribute that exists in the database (same below). Example:
conn = TigerGraphConnection(...)
splitter = conn.gds.edgeSplitter(timeout, attr_name=0.6)
splitter.run()
- A random 60% of edges will have their attribute attr_name set to True, and a random 20% of edges will have their attribute attr_name2 set to True. The two parts are disjoint. Example:
conn = TigerGraphConnection(...)
splitter = conn.gds.edgeSplitter(timeout, attr_name=0.6, attr_name2=0.2)
splitter.run()
- A random 60% of edges will have their attribute attr_name set to True, a random 20% of edges will have their attribute attr_name2 set to True, and another random 20% of edges will have their attribute attr_name3 set to True. The three parts are disjoint. Example:
conn = TigerGraphConnection(...)
splitter = conn.gds.edgeSplitter(timeout, attr_name=0.6, attr_name2=0.2, attr_name3=0.2)
splitter.run()
Parameter:
- timeout (int, optional): Timeout value for the operation. Defaults to 600000.