Commit b5c6761d authored by ljia

	modified:   README.md
	modified:   notebooks/run_cyclicpatternkernel.ipynb
	modified:   notebooks/run_marginalizedkernel_acyclic.ipynb
	modified:   notebooks/run_pathkernel_acyclic.ipynb
	modified:   notebooks/run_spkernel_acyclic.ipynb
	modified:   notebooks/run_treeletkernel_acyclic.ipynb
	modified:   notebooks/run_treepatternkernel.ipynb
	modified:   notebooks/run_untildpathkernel_acyclic.ipynb
	new file:   notebooks/run_untilnwalkkernel.ipynb
	modified:   notebooks/run_weisfeilerLehmankernel_acyclic.ipynb
	modified:   pygraph/kernels/treePatternKernel.py
	modified:   pygraph/kernels/untildPathKernel.py
	new file:   pygraph/kernels/untilnWalkKernel.py
	new file:   pygraph/utils/model_selection_precomputed.py
	modified:   pygraph/utils/utils.py
parent 6fcdb5ff
# py-graph
A Python package for graph kernels.
## Requirements
* numpy - 1.13.3
* scipy - 1.0.0
* matplotlib - 2.1.0
* networkx - 2.0
* sklearn - 0.19.1
* tabulate - 0.8.2
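The dependencies can be installed with pip, for example with the following one-liner (assuming the standard PyPI package names; note that sklearn is published on PyPI as scikit-learn):
```
pip install numpy==1.13.3 scipy==1.0.0 matplotlib==2.1.0 networkx==2.0 scikit-learn==0.19.1 tabulate==0.8.2
```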
## Results with minimal test RMSE for each kernel on dataset Acyclic
All kernels except the cyclic pattern kernel are tested on the dataset Acyclic, which consists of 185 molecules (graphs). (The cyclic pattern kernel is tested on the datasets MAO and PAH.)
The prediction methods used are SVM for classification and kernel ridge regression for regression.
For prediction we randomly divide the data into train and test subsets, where 90% of the whole dataset is used for training and the rest for testing. Ten splits are performed. For each split, we first train on the training data, then evaluate the performance on the test set. We choose the optimal parameters for the test set and finally report the corresponding performance. The final results correspond to the average of the performances over the test sets.
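A minimal sketch of this evaluation loop, assuming a precomputed Gram matrix `Kmatrix` and regression targets `y` (the helper name `evaluate_kernel` and the fixed `alpha` are illustrative, not part of this package):
```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

def evaluate_kernel(Kmatrix, y, alpha=1e-2, n_splits=10):
    """Mean and std of the test RMSE of kernel ridge regression over
    random 90/10 splits of a precomputed kernel matrix."""
    rmses = []
    ss = ShuffleSplit(n_splits=n_splits, test_size=0.1, random_state=0)
    for train_idx, test_idx in ss.split(y):
        K_train = Kmatrix[np.ix_(train_idx, train_idx)]  # train vs train
        K_test = Kmatrix[np.ix_(test_idx, train_idx)]    # test vs train
        model = KernelRidge(alpha=alpha, kernel='precomputed')
        model.fit(K_train, y[train_idx])
        y_pred = model.predict(K_test)
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
    return np.mean(rmses), np.std(rmses)
```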
| Kernels | RMSE(℃) | STD(℃) | Parameter | k_time |
|------------------|:-------:|:------:|------------------:|-------:|
| Shortest path | 35.19 | 4.50 | - | 14.58" |
| Marginalized | 18.02 | 6.29 | p_quit = 0.1 | 4'19" |
| Path | 18.41 | 10.78 | - | 29.43" |
| WL subtree | 7.55 | 2.33 | height = 1 | 0.84" |
| WL shortest path | 35.16 | 4.50 | height = 2 | 40.24" |
| WL edge | 33.41 | 4.73 | height = 5 | 5.66" |
| Treelet | 8.31 | 3.38 | - | 0.50" |
| Path up to d | 7.43 | 2.69 | depth = 2 | 0.59" |
| Tree pattern | 7.27 | 2.21 | lambda = 1, h = 2 | 37.24" |
| Cyclic pattern | 0.9 | 0.11 | cycle bound = 100 | 0.31" |
~~For prediction we randomly divide the data into train and test subsets, where 90% of the whole dataset is used for training and the rest for testing. Ten splits are performed. For each split, we first train on the training data, then evaluate the performance on the test set. We choose the optimal parameters for the test set and finally report the corresponding performance. The final results correspond to the average of the performances over the test sets.~~
| Kernels | train_perf | valid_perf | test_perf | Parameters | gram_matrix_time |
|------------------|-----------:|-----------:|-----------:|------------------------------------------------------:|-----------------:|
| Shortest path | 28.65±0.59 | 36.09±0.97 | 36.45±6.63 | 'alpha': '3.55e+01' | 12.67" |
| Marginalized | 12.42±0.28 | 18.60±2.02 | 16.51±5.12 | 'p_quit': 0.3, 'alpha': '3.16e-06' | 430.42" |
| Path | 11.19±0.73 | 23.66±1.74 | 25.04±9.60 | 'alpha': '2.57e-03' | 21.84" |
| WL subtree | 6.00±0.27 | 7.59±0.71 | 7.92±2.92 | 'height': 1.0, 'alpha': '1.26e-01' | 0.84" |
| WL shortest path | 28.32±0.63 | 35.99±0.98 | 37.92±5.60 | 'height': 2.0, 'alpha': '1.00e+02' | 39.79" |
| WL edge | 30.10±0.57 | 35.13±0.78 | 37.70±6.92 | 'height': 4.0, 'alpha': '3.98e+01' | 4.35" |
| Treelet | 7.38±0.37 | 14.21±0.80 | 15.26±3.65 | 'alpha': '1.58e+00' | 0.49" |
| Path up to d | 5.48±0.23 | 10.00±0.83 | 10.73±5.67 | 'depth': 2.0, 'k_func': 'MinMax', 'alpha': '7.94e-02' | 0.57" |
| Tree pattern | | | | | |
| Cyclic pattern | 0.62±0.02 | 0.62±0.02 | 0.57±0.17 | 'cycle_bound': 125.0, 'C': '1.78e-01' | 0.33" |
| Walk up to n | 6.19±0.15 | 6.95±0.20 | 7.14±1.35 | 'n': 3.0, 'alpha': '1.00e-10' | 1.19" |
* RMSE stands for arithmetic mean of the root mean squared errors on all splits.
* STD stands for standard deviation of the root mean squared errors on all splits.
* Parameter is the one with which the kernel achieves the best results.
* k_time is the time spent on building the kernel matrix.
* The targets of training data are normalized before calculating *treelet kernel*.
* Parameters are the ones with which the kernel achieves the best results.
* gram_matrix_time is the time spent on building the gram matrix.
* See detail results in [results.md](pygraph/kernels/results.md).
## References
[1] K. M. Borgwardt and H.-P. Kriegel. Shortest-path kernels on graphs. In Proceedings of the International Conference on Data Mining, pages 74-81, 2005.
[2] H. Kashima, K. Tsuda, and A. Inokuchi. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning, Washington, DC, United States, 2003.
[3] F. Suard, A. Rakotomamonjy, and A. Bensrhair. Kernel on bag of paths for measuring similarity of shapes. In ESANN, pages 355-360, 2007.
[4] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt. Weisfeiler-lehman graph kernels. Journal of Machine Learning Research, 12:2539-2561, 2011.
[5] B. Gaüzère, L. Brun, and D. Villemin. Two new graphs kernels in chemoinformatics. Pattern Recognition Letters, 33(15):2038-2047, 2012.
[6] Liva Ralaivola, Sanjay J Swamidass, Hiroto Saigo, and Pierre Baldi. Graph kernels for chemical informatics. Neural networks, 18(8):1093–1110, 2005.
[7] Pierre Mahé and Jean-Philippe Vert. Graph kernels based on tree patterns for molecules. Machine learning, 75(1):3–35, 2009.
[8] Tamás Horváth, Thomas Gärtner, and Stefan Wrobel. Cyclic pattern kernels for predictive graph mining. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 158–167. ACM, 2004.
[9] Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.
@@ -29,10 +29,12 @@ def treepatternkernel(*args, node_label = 'atom', edge_label = 'bond_type', labe
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
depth : integer
Depth of search. Longest length of paths.
k_func : function
A kernel function applying different notions of fingerprint similarity.
kernel_type : string
Type of tree pattern kernel, could be 'untiln', 'size' or 'branching'.
lmda : float
Weight to decide whether linear patterns or tree patterns of increasing complexity are favored.
h : integer
The upper bound of the height of tree patterns.
Return
------
@@ -74,6 +76,12 @@ def _treepatternkernel_do(G1, G2, node_label, edge_label, labeled, kernel_type,
edge attribute used as label. The default edge label is bond_type.
labeled : boolean
Whether the graphs are labeled. The default is True.
kernel_type : string
Type of tree pattern kernel, could be 'untiln', 'size' or 'branching'.
lmda : float
Weight to decide whether linear patterns or tree patterns of increasing complexity are favored.
h : integer
The upper bound of the height of tree patterns.
Return
------
@@ -8,8 +8,6 @@ import pathlib
sys.path.insert(0, "../")
import time
from collections import Counter
import networkx as nx
import numpy as np
@@ -36,8 +34,8 @@ def untildpathkernel(*args, node_label = 'atom', edge_label = 'bond_type', label
Return
------
Kmatrix/kernel : Numpy matrix/float
Kernel matrix, each element of which is the path kernel up to d between 2 graphs. / Path kernel up to d between 2 graphs.
Kmatrix : Numpy matrix
Kernel matrix, each element of which is the path kernel up to d between 2 graphs.
"""
depth = int(depth)
if len(args) == 1: # for a list of graphs
"""
@author: linlin
@references: Thomas Gärtner, Peter Flach, and Stefan Wrobel. On graph kernels: Hardness results and efficient alternatives. Learning Theory and Kernel Machines, pages 129–143, 2003.
"""
import sys
import pathlib
sys.path.insert(0, "../")
import time
from collections import Counter
import networkx as nx
import numpy as np
def untilnwalkkernel(*args, node_label = 'atom', edge_label = 'bond_type', labeled = True, n = 10):
    """Calculate common walk graph kernels up to length n between graphs.

    Parameters
    ----------
    Gn : List of NetworkX graph
        List of graphs between which the kernels are calculated.
    /
    G1, G2 : NetworkX graphs
        2 graphs between which the kernel is calculated.
    node_label : string
        node attribute used as label. The default node label is atom.
    edge_label : string
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.
    n : integer
        Maximum length of walks.

    Return
    ------
    Kmatrix : Numpy matrix
        Kernel matrix, each element of which is the walk kernel up to n between 2 graphs.
    """
    Gn = args[0] if len(args) == 1 else [args[0], args[1]] # arrange all graphs in a list
    Kmatrix = np.zeros((len(Gn), len(Gn)))
    n = int(n)

    start_time = time.time()

    # get all walks of all graphs before calculating kernels to save time, but this may cost a lot of memory for large datasets.
    all_walks = [ find_all_walks_until_length(Gn[i], n, node_label = node_label, edge_label = edge_label, labeled = labeled) for i in range(0, len(Gn)) ]

    for i in range(0, len(Gn)):
        for j in range(i, len(Gn)):
            Kmatrix[i][j] = _untilnwalkkernel_do(all_walks[i], all_walks[j], node_label = node_label, edge_label = edge_label, labeled = labeled)
            Kmatrix[j][i] = Kmatrix[i][j]

    run_time = time.time() - start_time
    print("\n --- kernel matrix of walk kernel up to %d of size %d built in %s seconds ---" % (n, len(Gn), run_time))

    return Kmatrix, run_time
def _untilnwalkkernel_do(walks1, walks2, node_label = 'atom', edge_label = 'bond_type', labeled = True):
    """Calculate walk graph kernels up to n between 2 graphs.

    Parameters
    ----------
    walks1, walks2 : list
        List of walks in 2 graphs, where for unlabeled graphs, each walk is represented by a list of nodes; while for labeled graphs, each walk is represented by a string consisting of labels of nodes and edges on that walk.
    node_label : string
        node attribute used as label. The default node label is atom.
    edge_label : string
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.

    Return
    ------
    kernel : float
        Walk kernel up to n between 2 graphs.
    """
    counts_walks1 = dict(Counter(walks1))
    counts_walks2 = dict(Counter(walks2))
    all_walks = list(set(walks1 + walks2))

    # count vectors over the union of walks; a walk missing from a graph counts as 0
    vector1 = [ counts_walks1.get(walk, 0) for walk in all_walks ]
    vector2 = [ counts_walks2.get(walk, 0) for walk in all_walks ]
    kernel = np.dot(vector1, vector2)

    return kernel
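# Example (illustrative values, not from the original file): walks1 = ['C1O', 'C']
# and walks2 = ['C1O', 'C1O', 'O'] give count vectors [1, 1, 0] and [2, 0, 1]
# over the union {'C1O', 'C', 'O'}, so the kernel value is the dot product
# 1*2 + 1*0 + 0*1 = 2.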
# this method finds walks repetitively; it could be made faster.
def find_all_walks_until_length(G, length, node_label = 'atom', edge_label = 'bond_type', labeled = True):
    """Find all walks with a certain maximum length in a graph. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    length : integer
        The maximum length of walks.
    node_label : string
        node attribute used as label. The default node label is atom.
    edge_label : string
        edge attribute used as label. The default edge label is bond_type.
    labeled : boolean
        Whether the graphs are labeled. The default is True.

    Return
    ------
    walk : list
        List of walks retrieved, where for unlabeled graphs, each walk is represented by a list of nodes; while for labeled graphs, each walk is represented by a string consisting of labels of nodes and edges on that walk.
    """
    all_walks = []
    for i in range(0, length + 1):
        new_walks = find_all_walks(G, i)
        if new_walks == []:
            break
        all_walks.extend(new_walks)

    if labeled == True: # convert walks to strings
        walk_strs = []
        for walk in all_walks:
            # index positionally: a walk may visit the same node twice, so walk.index(node) would always return the first occurrence
            strlist = [ G.node[walk[i]][node_label] + G[walk[i]][walk[i + 1]][edge_label] for i in range(len(walk) - 1) ]
            walk_strs.append(''.join(strlist) + G.node[walk[-1]][node_label])
        return walk_strs

    return all_walks
def find_walks(G, source_node, length):
    """Find all walks with a certain length that start from a source node. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    source_node : integer
        The node from which all walks start.
    length : integer
        The length of walks.

    Return
    ------
    walk : list of list
        List of walks retrieved, where each walk is represented by a list of nodes.
    """
    return [[source_node]] if length == 0 else \
        [ [source_node] + walk for neighbor in G[source_node] \
          for walk in find_walks(G, neighbor, length - 1) ]
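# For example (illustrative, not part of the original file): in the path graph
# 0-1-2, find_walks(G, 0, 2) returns [[0, 1, 0], [0, 1, 2]]; unlike paths,
# walks may revisit nodes.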
def find_all_walks(G, length):
    """Find all walks with a certain length in a graph. A recursive depth first search is applied.

    Parameters
    ----------
    G : NetworkX graphs
        The graph in which walks are searched.
    length : integer
        The length of walks.

    Return
    ------
    walk : list of list
        List of walks retrieved, where each walk is represented by a list of nodes.
    """
    all_walks = []
    for node in G:
        all_walks.extend(find_walks(G, node, length))

    ### The following deduplication of reversed walks is not carried out, in accordance with the original article.
    # all_walks_r = [ walk[::-1] for walk in all_walks ]
    # # For each walk, two representations are retrieved from its two extremities. Remove one of them.
    # for idx, walk in enumerate(all_walks[:-1]):
    #     for walk2 in all_walks_r[idx+1::]:
    #         if walk == walk2:
    #             all_walks[idx] = []
    #             break
    # return list(filter(lambda a: a != [], all_walks))

    return all_walks
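# A hypothetical usage sketch (not part of the original file): build the walk
# kernel matrix for two small molecule-like graphs, using the default
# 'atom'/'bond_type' attribute names assumed above.
if __name__ == '__main__':
    G1 = nx.Graph()
    G1.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'O'})])
    G1.add_edge(0, 1, bond_type='1')

    G2 = nx.Graph()
    G2.add_nodes_from([(0, {'atom': 'C'}), (1, {'atom': 'C'}), (2, {'atom': 'O'})])
    G2.add_edges_from([(0, 1, {'bond_type': '1'}), (1, 2, {'bond_type': '1'})])

    Kmatrix, run_time = untilnwalkkernel(G1, G2, n=2)
    print(Kmatrix)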