PBGENA

Parallelized Binary embedding GENerator for Attributed graphs

A Sketch-based Approach towards Scalable and Efficient Attributed Network Embedding

Key Features:

BGENA is a super-fast sketch-based ANE solver, which uses BinSketch coupled with a novel edge propagation mechanism.
PBGENA, Parallelized BGENA, works with MPI to leverage a system's multi-core architecture to further accelerate the embedding process.
PBGENA's binary embeddings allow efficient bitarray/sparse-matrix storage.
PBGENA embeddings achieve state-of-the-art performance in graph analysis tasks like node classification and link prediction.
PBGENA is highly flexible and can work with just the topology or the attributes of the graph and enables turning off propagation altogether.

Get PBGENA:

git clone https://github.com/tapadeep/PBGENA.git

Satisfy all requirements for PBGENA:

pip install -r PBGENA/requirements.txt

Execute BGENA:

Perform Node Embedding:

cd PBGENA/Code/Algorithm
python BGENA.py --graph Facebook --N 2000 --alpha 0.5 --b_a 0.7 --b_t 0.6

Perform Node Classification:

cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph Facebook --algorithm BGENA --tr 0.7 --multi True

Ignore the --multi flag for graphs that are not multi-labeled.

Perform Link Prediction:

cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph Facebook --algorithm BGENA --erf 0.3 --N 2000 --alpha 0.8 --b_a 1 --b_t 0

Execute PBGENA:

Perform Node Embedding:

cd PBGENA/Code/Algorithm
python PBGENA.py --graph PubMed --N 8000 --alpha 0.65 --b_a 0 --b_t 0.8 --p 6

The system must have Open MPI, MPICH, or Microsoft MPI installed.

Perform Node Classification:

cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph PubMed --algorithm PBGENA --tr 0.7

Perform Link Prediction:

cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph PubMed --algorithm PBGENA --erf 0.3 --N 8000 --alpha 1 --b_a 0.25 --b_t 0 --p 6

Execute Baselines:

Perform Node Embedding:

cd PBGENA/Code/Algorithm
python Baseline.py --graph CiteSeer --algorithm FeatherNode --reduction_dimensions 25 --eval_points 5 --order 1

Perform Node Classification:

cd 'PBGENA/Code/Node Classification'
python node_classification.py --graph CiteSeer --algorithm FeatherNode --tr 0.7

Perform Link Prediction:

cd 'PBGENA/Code/Link Prediction'
python -B link_prediction.py --graph CiteSeer --algorithm FeatherNode --erf 0.3 --reduction_dimensions 25 --eval_points 5 --order 1

Relevant Flags:

Flag	Type	Description	Range	Default
`--algorithm`	`STRING`	Embedding Method to be used	{BGENA, PBGENA, Baselines}	PBGENA
`--alpha`	`FLOAT`	Fraction of the Embedding Dimension to be used for attributes	[0, 1]	-
`--b_a`	`FLOAT`	Attribute Bitset Probability	[0, 1]	-
`--b_t`	`FLOAT`	Topology Bitset Probability	[0, 1]	-
`--erf`	`FLOAT`	Fraction of edges to be removed for link prediction	(0, 1)	0.3
`--f`	`INTEGER`	Fragment parameter for PBGENA communication calls	ℕ	1
`--f_a`	`FLOAT`	Reduction fraction for b_a each pass	[1, ∞)	2
`--f_t`	`FLOAT`	Reduction fraction for b_t each pass	[1, ∞)	2
`--graph`	`STRING`	Network to be used for embedding	Any Network	-
`--l_a`	`INTEGER`	Number of passes of attribute edge propagation	ℕ	1
`--l_t`	`INTEGER`	Number of passes of topology edge propagation	ℕ	1
`--multi`	`BOOLEAN`	Identify if the graph is Multi-Labeled	{True, False}	False
`--N`	`INTEGER`	Embedding Dimension	ℕ	2000
`--p`	`INTEGER`	Number of workers to use	ℕ	32
`--tr`	`FLOAT`	Training Ratio for Node Classification	(0, 1)	0.7

Some of the others are standard flags, and the remaining flag descriptions are available with the python graph learning framework Karate Club.

Examples:

An online Google Colaboratory Notebook demonstrates PBGENA's working, and other examples are in the Examples folder.

Relevant Hyperparameters:

Node Classification Hyperparameters:

Graph	α	b_a	b_t
Wikipedia	0.85	0.00	0.20
Cora	0.60	0.80	0.80
CiteSeer	0.80	0.90	0.40
Facebook	0.50	0.70	0.60
BlogCatalog	0.60	0.00	0.00
Flickr	1.00	0.00	0.00
PubMed	0.65	0.00	0.80
PPI	0.10	0.95	0.50
Twitter	0.60	0.00	0.00
Google+	0.00	0.00	0.00
Reddit	0.00	0.00	0.00
TWeibo	0.60	0.00	0.00
MAKG	0.85	0.86	0.60

Link Prediction Hyperparameters:

Graph	α	b_a	b_t
Wikipedia	0.95	0.20	0.20
Cora	0.60	0.30	0.40
CiteSeer	0.90	0.20	0.40
Facebook	0.80	1.00	0.00
BlogCatalog	0.90	0.05	0.00
Flickr	0.90	0.00	0.00
PubMed	1.00	0.25	0.00
PPI	0.00	0.00	0.30
Twitter	1.00	0.00	0.00
Google+	0.90	0.05	0.10
Reddit	0.95	0.20	0.00
TWeibo	0.95	0.00	0.00
MAKG	0.85	0.86	0.60

The other baseline hyperparameters used in the experiments can be obtained here.

Datasets:

Graph	#Vertices	#Edges	#Attributes	#Labels	Multi- labeled?
Wikipedia	2,405	11,596	4,973	17	No
Cora	2,708	5,278	1,433	7	No
CiteSeer	3,312	4,536	3,703	6	No
Facebook	4,039	88,234	1,283	193	Yes
BlogCatalog	5,196	171,743	8,189	6	No
Flickr	7,575	239,738	12,047	9	No
PubMed	19,717	44,324	500	3	No
PPI	56,944	793,632	50	121	Yes
Twitter	81,306	1,342,296	216,839	4,065	Yes
Google+	107,614	12,238,285	15,907	468	Yes
Reddit	232,965	57,307,946	602	41	No
TWeibo	2,320,895	50,133,369	1,657	9	No
MAKG	59,249,719	976,901,586	7,211	100	Yes

Some preprocessed networks are added to the Datasets folder, and the remaining can be downloaded online. All of the datasets are obtained from Dr. Jieming Shi's website. To generate files suitable for PBGENA with data from their website, use:

cd PBGENA/Datasets
python data_processing.py --file cora --graph Cora

This converts cora.attr.tar.gz to the Cora folder containing relevant files which can be used to train PBGENA. Use the --multi True flag for converting multi-labeled networks.

Bechmarked Performance:

Graph	Node Classification (%)		Link Prediction (%)		Time (s)
Graph	Macro-F1	Micro-F1	AUC-ROC	Avg Precision	Serial	Parallel
Wikipedia	70.23±4.45	79.53±1.78	85.32±0.26	86.49±0.27	0.85	0.08
Cora	84.88±1.56	85.66±1.15	89.50±0.60	90.42±0.65	0.64	0.08
CiteSeer	68.95±0.95	72.07±0.54	92.93±0.55	94.33±0.45	0.80	0.09
Facebook	32.89±1.65	75.41±0.88	97.70±0.05	97.88±0.04	1.22	0.16
BlogCatalog	91.97±0.58	92.07±0.59	80.23±0.14	80.46±0.17	0.36	0.06
Flickr	78.08±0.76	77.92±0.81	90.40±0.07	90.49±0.08	0.09	0.06
PubMed	86.55±0.24	86.75±0.22	91.91±0.23	92.38±0.28	2.09	0.20
PPI	40.30±0.06	54.58±0.05	96.12±0.03	96.82±0.03	16.30	1.05
Twitter	25.43±0.33	56.47±0.74	97.46±0.02	97.50±0.02	46.19	1.61
Google+	59.24±0.41	84.57±0.23	93.63±0.05	93.21±0.12	13.83	1.14
Reddit	87.01±0.11	88.13±0.12	88.10±0.01	86.16±0.03	75.32	3.68
TWeibo	17.14±0.20	56.83±0.09	96.22±0.02	96.39±0.01	84.08	14.02
MAKG	37.77±0.16	50.26±0.21	85.15±0.03	84.81±0.05	24504.66	6136.49

Except for MAKG all of the above reported results were performed with N=2000, tr=0.7, erf=0.3, and p=32 in a server with 270GB RAM with proper hyperparameters. For MAKG, we used tr=0.1, p=6, and f=100. To get even better results perform the experiments at N=8000. A more comprehensive performance report can be viewed here.

Create your own network:

To create your network, you need three files: edge_list.npy, attribute_matrix.npz, and label_array.npy or label_array.npz (depending on whether the graph is single-labeled or multi-labeled). Ensure all your vertex IDs are in the range (0, nodes-1). Create a numpy array of the edges of shape (edges, 2) and save that file as edge_list.npy. Store the attributes as a sparse CSR matrix of shape (nodes, attributes). To run PBGENA for non-attributed graphs, create an empty attribute matrix for the graph of the proper shape and set α=0. Save the attribute file as attribute_matrix.npz. The file for label array is required only for node classification and can be ignored for performing link prediction on unlabeled graphs. If the graph is single-labeled, the file label_array.npy is a simple 1D array where the i'th number denotes the label for node i. For multi-labeled graphs, label_array.npz is a MultiLabelBinarized CSR-matrix of shape (nodes, labels). Finally, put these three files in a folder with the graph name and add it to the Datasets folder.

Name		Name	Last commit message	Last commit date
Latest commit History 225 Commits
Code		Code
Datasets		Datasets
Embeddings		Embeddings
Examples		Examples
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

License

tapadeep/PBGENA

Folders and files

Latest commit

History

Repository files navigation

PBGENA

Parallelized Binary embedding GENerator for Attributed graphs

Key Features:

Get PBGENA:

Satisfy all requirements for PBGENA:

Execute BGENA:

Perform Node Embedding:

Perform Node Classification:

Perform Link Prediction:

Execute PBGENA:

Perform Node Embedding:

Perform Node Classification:

Perform Link Prediction:

Execute Baselines:

Perform Node Embedding:

Perform Node Classification:

Perform Link Prediction:

Relevant Flags:

Examples:

Relevant Hyperparameters:

Node Classification Hyperparameters:

Link Prediction Hyperparameters:

Datasets:

Bechmarked Performance:

Create your own network:

About

Topics

Resources

License

Stars

Watchers

Forks

Languages