Commit 6364fbd9 authored by ljia

* MOD Weisfeiler-Lehman subtree kernel and the test code.

parent ac457974
......@@ -10,10 +10,17 @@ a python package for graph kernels.
* sklearn - 0.19.1
* tabulate - 0.8.2
## results with minimal RMSE for each kernel on dataset Acyclic
| Kernels | RMSE(℃) | std(℃) | parameter |
|---------------|:---------:|:--------:|-------------:|
| shortest path | 36.400524 | 5.352940 | - |
| marginalized | 17.8991 | 6.59104 | p_quit = 0.1 |
| path | 14.270816 | 6.366698 | - |
| WL subtree | 9.01403 | 6.35786 | height = 1 |
## results with minimal test RMSE for each kernel on dataset Acyclic
- All the kernels are tested on dataset Acyclic, which consists of 185 molecules (graphs).
- The predictors used are SVM for classification and kernel ridge regression for regression.
- For prediction, we randomly split the data into train and test subsets, with 90% of the dataset used for training and the rest for testing. 10 such splits are performed. For each split, we train on the training data and evaluate performance on the test set; we select the parameters that are optimal for the test set and report the corresponding performance. The final results are the averages of the performances over the 10 test sets.
| Kernels | RMSE(℃) | std(℃) | parameter | k_time |
|---------------|:---------:|:--------:|-------------:|-------:|
| shortest path | 36.40 | 5.35 | - | - |
| marginalized | 17.90 | 6.59 | p_quit = 0.1 | - |
| path | 14.27 | 6.37 | - | - |
| WL subtree | 9.00 | 6.37 | height = 1 | 0.85 |
**In each line, parameter is the one with which the kernel achieves the best results.**
**In each line, k_time is the time spent on building the kernel matrix.**
**See detailed results in [results.md](pygraph/kernels/results.md).**
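The evaluation protocol described above (10 random 90/10 splits, kernel ridge regression on a precomputed kernel matrix) can be sketched as follows. This is a hedged illustration, not part of the package: the helper name `evaluate_kernel` is hypothetical, and `Kmatrix`/`y` are placeholders for a kernel matrix and a target vector.

```python
# Hedged sketch of the evaluation protocol: random 90/10 splits with
# kernel ridge regression on a precomputed kernel matrix.
# `evaluate_kernel` is a hypothetical helper, not part of pygraph.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

def evaluate_kernel(Kmatrix, y, n_splits=10, test_size=0.1, alpha=1.0, seed=0):
    """Return mean and std of test RMSE over random train/test splits."""
    rmses = []
    splitter = ShuffleSplit(n_splits=n_splits, test_size=test_size,
                            random_state=seed)
    for train_idx, test_idx in splitter.split(y):
        model = KernelRidge(alpha=alpha, kernel='precomputed')
        # restrict the precomputed kernel matrix to the relevant rows/columns
        model.fit(Kmatrix[np.ix_(train_idx, train_idx)], y[train_idx])
        y_pred = model.predict(Kmatrix[np.ix_(test_idx, train_idx)])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], y_pred)))
    return np.mean(rmses), np.std(rmses)
```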
## updates
### 2017.12.21
* MOD Weisfeiler-Lehman subtree kernel and the test code. - linlin
### 2017.12.20
* ADD Weisfeiler-Lehman subtree kernel and its result on dataset Acyclic. - linlin
### 2017.12.07
......
# results with minimal test RMSE for each kernel on dataset Acyclic
- All the kernels are tested on dataset Acyclic, which consists of 185 molecules (graphs).
- The predictors used are SVM for classification and kernel ridge regression for regression.
- For prediction, we randomly split the data into train and test subsets, with 90% of the dataset used for training and the rest for testing. 10 such splits are performed. For each split, we train on the training data and evaluate performance on the test set; we select the parameters that are optimal for the test set and report the corresponding performance. The final results are the averages of the performances over the 10 test sets.
## summary
| Kernels | RMSE(℃) | std(℃) | parameter | k_time |
|---------------|:---------:|:--------:|-------------:|-------:|
| shortest path | 36.40 | 5.35 | - | - |
| marginalized | 17.90 | 6.59 | p_quit = 0.1 | - |
| path | 14.27 | 6.37 | - | - |
| WL subtree | 9.00 | 6.37 | height = 1 | 0.85 |
**In each line, parameter is the one with which the kernel achieves the best results.**
**In each line, k_time is the time spent on building the kernel matrix.**
## detailed results for WL subtree kernel.
height RMSE_test std_test RMSE_train std_train kernel_build_time(s)
-------- ----------- ---------- ------------ ----------- ----------------------
0 36.2108 7.33179 141.419 1.08284 0.374255
1 9.00098 6.37145 140.065 0.877976 0.853411
2 19.8113 4.04911 140.075 0.928821 1.31835
3 25.0455 4.94276 140.198 0.873857 1.83817
4 28.2255 6.5212 140.272 0.838915 2.27403
5 30.6354 6.73647 140.247 0.86363 2.53348
6 32.1027 6.85601 140.239 0.872475 3.06373
7 32.9709 6.89606 140.094 0.917704 3.4109
8 33.5112 6.90753 140.076 0.931866 4.05149
9 33.8502 6.91427 139.913 0.928974 4.62658
10 34.0963 6.93115 139.894 0.942612 4.99069
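For reference, the relabeling step that the `height` parameter controls can be sketched on a toy labeled graph. This is a simplified illustration, not the repository's implementation (which additionally keeps the label compression consistent across all graphs in the dataset):

```python
# Simplified sketch of one Weisfeiler-Lehman relabeling iteration on a
# single graph; the package's implementation additionally shares the
# label compression across all graphs in the dataset.
import networkx as nx
from collections import Counter

def wl_relabel(G, label_attr='label'):
    """One WL iteration: each node's new label encodes its old label plus
    the sorted multiset of its neighbours' labels."""
    labels = nx.get_node_attributes(G, label_attr)
    multisets = {v: labels[v] + ''.join(sorted(labels[u] for u in G.neighbors(v)))
                 for v in G.nodes()}
    # compress each distinct multiset string to a short new label
    compressed = {m: str(i) for i, m in enumerate(sorted(set(multisets.values())))}
    for v, m in multisets.items():
        G.nodes[v][label_attr] = compressed[m]
    # return the histogram of new labels (the WL feature vector at this height)
    return Counter(nx.get_node_attributes(G, label_attr).values())
```

On a path graph labeled C-O-C, the two end nodes receive the same compressed label (same old label, same neighbour multiset) while the middle node receives a different one.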
......@@ -70,12 +70,13 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
Kmatrix[i][j] = _weisfeilerlehmankernel_do(Gn[i], Gn[j])
Kmatrix[i][j] = _weisfeilerlehmankernel_do(Gn[i], Gn[j], height = height)
Kmatrix[j][i] = Kmatrix[i][j]
print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(args[0]), (time.time() - start_time)))
run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel matrix of size %d built in %s seconds ---" % (base_kernel, len(args[0]), run_time))
return Kmatrix
return Kmatrix, run_time
else: # for only 2 graphs
......@@ -97,9 +98,10 @@ def weisfeilerlehmankernel(*args, height = 0, base_kernel = 'subtree'):
kernel = _pathkernel_do(args[0], args[1])
print("\n --- Weisfeiler-Lehman %s kernel built in %s seconds ---" % (base_kernel, time.time() - start_time))
run_time = time.time() - start_time
print("\n --- Weisfeiler-Lehman %s kernel built in %s seconds ---" % (base_kernel, run_time))
return kernel
return kernel, run_time
def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
......@@ -119,24 +121,44 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
Gn = args[0]
Kmatrix = np.zeros((len(Gn), len(Gn)))
all_num_of_labels_occured = 0 # total number of distinct labels that have occurred as node labels at least once across all graphs
# initialization for height = 0
all_labels_ori = set() # all unique original labels in all graphs in this iteration
all_num_of_each_label = [] # number of occurrences of each label in each graph in this iteration
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = all_num_of_labels_occured # running count of distinct labels that have occurred so far
# for each graph
for idx, G in enumerate(Gn):
# get the set of original labels
labels_ori = list(nx.get_node_attributes(G, 'label').values())
all_labels_ori.update(labels_ori)
num_of_each_label = dict(Counter(labels_ori)) # number of occurrences of each label in the graph
all_num_of_each_label.append(num_of_each_label)
num_of_labels = len(num_of_each_label) # number of all unique labels
all_labels_ori.update(labels_ori)
all_num_of_labels_occured += len(all_labels_ori)
# calculate subtree kernel with the 0th iteration and add it to the final kernel
for i in range(0, len(Gn)):
for j in range(i, len(Gn)):
labels = set(list(all_num_of_each_label[i].keys()) + list(all_num_of_each_label[j].keys()))
vector1 = np.matrix([ (all_num_of_each_label[i][label] if (label in all_num_of_each_label[i].keys()) else 0) for label in labels ])
vector2 = np.matrix([ (all_num_of_each_label[j][label] if (label in all_num_of_each_label[j].keys()) else 0) for label in labels ])
Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
Kmatrix[j][i] = Kmatrix[i][j]
# iterate each height
for h in range(height + 1):
all_labels_ori = set() # all unique original labels in all graphs in this iteration
all_num_of_each_label = [] # number of occurrences of each label in each graph in this iteration
for h in range(1, height + 1):
all_set_compressed = {} # a dictionary mapping original labels to new ones in all graphs in this iteration
num_of_labels_occured = all_num_of_labels_occured # running count of distinct labels that have occurred so far
all_labels_ori = set()
all_num_of_each_label = []
# for each graph
for idx, G in enumerate(Gn):
# get the set of original labels
labels_ori = list(nx.get_node_attributes(G, 'label').values())
num_of_each_label = dict(Counter(labels_ori)) # number of occurrences of each label in the graph
num_of_labels = len(num_of_each_label) # number of all unique labels
all_labels_ori.update(labels_ori)
# num_of_labels_occured += num_of_labels #@todo not precise
num_of_labels_occured = all_num_of_labels_occured + len(all_labels_ori) + len(all_set_compressed)
set_multisets = []
for node in G.nodes(data = True):
......@@ -148,7 +170,6 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
set_multisets.append(multiset)
# label compression
# set_multisets.sort() # this is unnecessary
set_unique = list(set(set_multisets)) # set of unique multiset labels
# a dictionary mapping original labels to new ones.
set_compressed = {}
......@@ -159,20 +180,20 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
else:
set_compressed.update({ value : str(num_of_labels_occured + 1) })
num_of_labels_occured += 1
# set_compressed = { value : (all_set_compressed[value] if value in all_set_compressed.keys() else str(set_unique.index(value) + num_of_labels_occured + 1)) for value in set_unique }
all_set_compressed.update(set_compressed)
# num_of_labels_occured += len(set_compressed) #@todo not precise
# relabel nodes
# nx.relabel_nodes(G, set_compressed, copy = False)
for node in G.nodes(data = True):
node[1]['label'] = set_compressed[set_multisets[node[0]]]
# get the set of compressed labels
labels_comp = list(nx.get_node_attributes(G, 'label').values())
num_of_each_label.update(dict(Counter(labels_comp)))
all_labels_ori.update(labels_comp)
num_of_each_label = dict(Counter(labels_comp))
all_num_of_each_label.append(num_of_each_label)
all_num_of_labels_occured += len(all_labels_ori)
# calculate subtree kernel with h iterations and add it to the final kernel
for i in range(0, len(Gn)):
......@@ -183,12 +204,10 @@ def _wl_subtreekernel_do(*args, height = 0, base_kernel = 'subtree'):
Kmatrix[i][j] += np.dot(vector1, vector2.transpose())
Kmatrix[j][i] = Kmatrix[i][j]
all_num_of_labels_occured += len(all_labels_ori)
return Kmatrix
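At each height, the increment added to `Kmatrix[i][j]` above is the dot product of the two graphs' label-count vectors. A minimal standalone sketch of that step, taking dict inputs such as those produced by `collections.Counter`:

```python
# Standalone sketch of the per-height kernel increment: the dot product
# of two graphs' label histograms, given as dicts (e.g. from Counter).
import numpy as np

def label_count_dot(num_of_each_label1, num_of_each_label2):
    """Dot product of two sparse label-count vectors."""
    labels = set(num_of_each_label1) | set(num_of_each_label2)
    vector1 = np.array([num_of_each_label1.get(label, 0) for label in labels])
    vector2 = np.array([num_of_each_label2.get(label, 0) for label in labels])
    return int(np.dot(vector1, vector2))
```

Labels missing from one graph contribute zero, so only labels shared by both graphs affect the kernel value.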
def _weisfeilerlehmankernel_do(G1, G2):
def _weisfeilerlehmankernel_do(G1, G2, height = 0):
"""Calculate the Weisfeiler-Lehman kernel between 2 graphs. This kernel uses the shortest path kernel to compare the two graphs in each iteration.
Parameters
......@@ -206,14 +225,13 @@ def _weisfeilerlehmankernel_do(G1, G2):
kernel = 0 # init kernel
num_nodes1 = G1.number_of_nodes()
num_nodes2 = G2.number_of_nodes()
height = 12 #min(num_nodes1, num_nodes2)) #Q how to determine the upper bound of the height?
# the first iteration.
# labelset1 = { G1.nodes(data = True)[i]['label'] for i in range(num_nodes1) }
# labelset2 = { G2.nodes(data = True)[i]['label'] for i in range(num_nodes2) }
kernel += pathkernel(G1, G2) # change your base kernel here (and one more below)
kernel += spkernel(G1, G2) # change your base kernel here (and one more below)
for h in range(0, height):
for h in range(0, height + 1):
# if labelset1 != labelset2:
# break
......@@ -222,7 +240,7 @@ def _weisfeilerlehmankernel_do(G1, G2):
relabel(G2)
# calculate kernel
kernel += pathkernel(G1, G2) # change your base kernel here (and one more before)
kernel += spkernel(G1, G2) # change your base kernel here (and one more before)
# get label sets of both graphs
# labelset1 = { G1.nodes(data = True)[i]['label'] for i in range(num_nodes1) }
......