martinsvat/Pruning-Hypotheses


This repository contains an implementation of function-free first-order logic mining algorithms enhanced by pruning with learned domain theories, also called saturation-based pruning. It was originally published in Pruning Hypothesis Spaces Using Learned Domain Theories (ILP'17). The current code base contains some extensions beyond the original feature extractor, e.g., the domain-theory learner published in Relaxing Deductive and Inductive Reasoning in Relational Learning (dissertation, CTU'24). A prototypical version of this pruning was also employed under the hood of Stacked Structure Learning for Lifted Relational Neural Networks, published at ILP'17.

If you'd like to cite us, please use the following bib-entry:

@inproceedings{svatovs2018pruning,
  title={Pruning hypothesis spaces using learned domain theories},
  author={Svato{\v{s}}, Martin and {\v{S}}ourek, Gustav and {\v{Z}}elezn{\'{y}}, Filip and Schockaert, Steven and Ku{\v{z}}elka, Ond{\v{r}}ej},
  booktitle={Inductive Logic Programming: 27th International Conference, ILP 2017, Orl{\'e}ans, France, September 4-6, 2017, Revised Selected Papers 27},
  pages={152--168},
  year={2018},
  organization={Springer}
}

Data

The folder datasets contains the NCI GI collection of molecular datasets that were used in the paper. The folder ./datasets/nci contains the original NCI datasets, which we transformed into datasets of interpretations of the form

+ ar_bond(29, 26), car(29), car(26), ar_bond(26, 29)....

where the first character, + or -, marks a positive or negative example, respectively, and a list of facts follows. This particular snippet shows part of a positive example (a molecule) with two carbon atoms connected by an aromatic bond. In general, predicates may have arbitrary non-zero arity and start with a lower-case letter. Constants start with a lower-case letter as well; here, integers are used to denote the two atoms. Variables start with an upper-case letter, e.g., V1; however, they are not allowed within examples, since we work in the learning-from-interpretations setting and each example is therefore a (fully specified) Herbrand interpretation. Each line of the dataset file corresponds to exactly one learning example.
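For illustration, a tiny dataset file with one positive and one negative example could look like the two lines below; these are made-up toy examples reusing the predicates from the snippet above, not actual molecules from the NCI data:

+ car(1), car(2), ar_bond(1, 2), ar_bond(2, 1)
- car(3), car(4)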

Besides the NCI datasets, there are also datasets from Alchemy (IMDB, Proteins, UWCS), which we transformed into the same format we use. Last but not least, there are a few new datasets we created, namely PDDL-Grip, loops, and groups. The PDDL-Grip dataset was created using VAL; the archive ./datasets/pddl_grip/pddl.rar contains the code that constructed the dataset. Note that it includes an executable of the above-mentioned VAL program; it is present only for reproducibility and is not well documented. In the same fashion, the archive ./datasets/groupds/groups.rar contains the code that constructed the loops datasets. Again, it may contain code from other authors, as it uses GAP, which needs to be installed; it is not well documented and is included only for reproducibility purposes.

Installation

You may either use the fat jar pruningHypotheses.jar or build your own jar from the sources and dependencies (all libraries, i.e., Matching by Ondřej Kuželka and Sat4J, are either built in or in the folder dependencies).

Running the domain theory learner

The first application of this code base is learning a domain theory, i.e., a set of universally quantified clauses such that each clause covers all the examples. Learning a domain theory is as easy as

java
    -Dida.searchPruning.dataset=./datasets/imdb/merged.db.oneLine
    -Dida.searchPruning.learn=domainTheory
    -Dida.logicStuff.constraints.maxLiterals=4
    -Dida.logicStuff.constraints.maxVariables=15
    -Dida.logicStuff.constraints.maxPosLit=1
    -Dida.logicStuff.constraints.useSaturation=false
    -jar pruningHypotheses.jar

The parameter ida.searchPruning.dataset should point to a dataset in the format described above. The second parameter restricts the run to learning a domain theory only. Further, we may set the maximal numbers of positive literals, negative literals, and literals in total, as well as the number of variables, i.e., the language bias. At the end, you should end up with something like

# ultraSCL setting
#	useSaturations	false
...

that is, a printout of the learning settings, followed by some statistics

# overall stats
# ----------------------------
# 		      | total refinements |	 non-isomorphic | non-isomorphic saturated |  #HC	 |	     time
# level		1           34	             34	                0	                    2		0.012882649983333333
# level		2         1016	            530	                0	                    109		0.06162433666666667
# level		3        13143	           6872	                0	                    115		0.22864976833333334
# level		4       135410	          77698	                0	                    242		1.5896445733333333
# counters
# literal stats
# uscl:theory simplifier is turned-off
# there are 242 hard constraints
!actor(V1), !genre_aaction(V1)
!actor(V1), !genre_acomedy(V1)
!actor(V1), !genre_adocumentary(V1)
!actor(V1), !genre_athriller(V1)
!director(V1), !actor(V1)
....

which are then followed by the learned constraints themselves. All lines starting with # are comments. These stats tell us that at the first level there were 34 refinements, all 34 of which were non-isomorphic, and none were removed by saturations (saturations are switched off in this run). Finally, 2 hard constraints (#HC) were learned within roughly 0.01 minutes. Recall that we are learning universally quantified clauses, i.e., the last clause shown above stands for $\forall x\, \lnot director(x) \lor \lnot actor(x)$, which says that nobody is a director and an actor at the same time. Clearly, this follows from the data rather than from real life; take, for example, Mel Gibson, Ben Affleck, and many others.

Well, that was not hard, but where is the saturation-based pruning? The run above did not utilize saturation-based pruning within the learning (search) process. However, there is a version that incorporates the pruning: it uses the domain theory learned up to the i-th level (exclusively) to saturate candidates at the i-th level. Sounds good, right? Just try it by altering the ida.logicStuff.constraints.useSaturation parameter.

java
    -Dida.searchPruning.dataset=./datasets/imdb/merged.db.oneLine
    -Dida.searchPruning.learn=domainTheory
    -Dida.logicStuff.constraints.maxLiterals=4
    ...
    -Dida.logicStuff.constraints.useSaturation=true
    -jar pruningHypotheses.jar

Well, it worked -- behold!

....
# overall stats
# ----------------------------
# 		      | total refinements |	 non-isomorphic | non-isomorphic saturated |  #HC	 |	     time
# level		1           34	               34       	       34	              2		    	0.01601748165
# level		2         1016	              530       	      441	             109			0.19080514666666668
# level		3        12682	             6708       	     2123	             115			1.2459660583333334
# level		4       118855      	    71006       	    11156	             184			5.854695296666667
....

The execution time is slightly worse; however, fewer total refinements are produced. The discrepancy between the non-isomorphic and non-isomorphic saturated columns expresses how many clauses were pruned by the saturation-based pruning. You may play around with the language bias, e.g., allowing multiple components in a clause by setting the ida.logicStuff.constraints.maxComponents parameter, or change the saturation mode, i.e., which types of saturations are allowed, using the ida.logicStuff.constraints.saturationMode parameter (see the example command after the following list). These are the possible values:

  • -1 -- we are computing only positive saturations
  • -2 -- we are computing only negative saturations
  • -3 -- we are computing both negative and positive saturations
  • -4 -- no saturations are computed (same as -Dida.logicStuff.constraints.useSaturation=false)
  • x -- where $x > 0$, meaning that saturations are computed only up to (and including) the x-th layer, i.e., for clauses with at most x literals. Saturations are computationally expensive, so this parameter lets you trade off the benefits of our approach against scalability.
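For instance, a run that allows clauses with up to two components and computes only positive saturations might look like the following sketch; the concrete values of maxComponents and saturationMode here are illustrative choices, not values taken from the papers:

java
    -Dida.searchPruning.dataset=./datasets/imdb/merged.db.oneLine
    -Dida.searchPruning.learn=domainTheory
    -Dida.logicStuff.constraints.maxLiterals=4
    -Dida.logicStuff.constraints.maxVariables=15
    -Dida.logicStuff.constraints.maxPosLit=1
    -Dida.logicStuff.constraints.maxComponents=2
    -Dida.logicStuff.constraints.useSaturation=true
    -Dida.logicStuff.constraints.saturationMode=-1
    -jar pruningHypotheses.jar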

Aggregating results from the table-like statistics above should be easy. In the end, you should be able to obtain results similar to those in the dissertation, see img.png.
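As a minimal sketch of such aggregation, assuming the console output was redirected to a file, say run.log (the file name is just an example), the per-level rows can be extracted and the last column (time, in minutes) summed with standard command-line tools:

    grep "^# level" run.log
    grep "^# level" run.log | awk '{sum += $NF} END {print "total time [min]: " sum}'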

Running the feature extractor

Mining features is as easy as running the following command:

java
    -Dida.searchPruning.dataset=./datasets/nci_transformed/gi50_screen_KM20L2.txt
    -Dida.searchPruning.learn=features
    -Dida.searchPruning.maxDepth=4
    -Dida.searchPruning.maxVariables=8
    -Dida.searchPruning.minSupport=1
    -Dida.searchPruning.mining=bfs
    -Dida.searchPruning.runner.overallLimit=1200
    -Dida.logicStuff.mineWithSaturations=true
    -jar pruningHypotheses.jar

In particular, this mines features, i.e., conjunctive patterns, on the KM20L2 dataset with at most 4 literals, 8 variables, and a minimal support of 1, within a time limit of 1200 minutes and with saturation-based pruning. Thus, at the start, a domain theory is learned; you may set up the parameters of the theory, e.g., the maximal number of literals, as in the section above. This is a level-wise, WARMR-like method. The output starts in a similar way as above, e.g.

# start mining
# storing the output to .\r636023851589300

# overall stats
# ----------------------------
# 		      | total refinements |	 non-isomorphic | non-isomorphic saturated |  #HC	 |	     time
# level	 1             50	                 50	                  50	             4		0.010924436666666667
# level	 2           2178            	   1172	                 881	           232		0.3835711766666667
...
# end of mining
*****
depth : 1
constraint time [ms]: 23414
search time [ms]: 371
overall time [ms]: 23785
total searched nodes: 21
total extended: 4
total pruned nodes: 4
total killed: 8
*****

which starts by showing the directory with the output; this can be changed using the ida.searchPruning.targetFolder parameter. The familiar output of the domain theory learner is displayed thereafter. Finally, statistics of the mining are displayed layer by layer, showing the execution time from the start (in ms), the cumulative number (from the start) of searched nodes, the cumulative number of hypotheses extended by saturations (these got new literals thanks to saturations), the cumulative number of pruned nodes (those pruned by saturations, either by being isomorphic after saturation or by being trivial), and the cumulative number of killed hypotheses (those found trivial w.r.t. the domain theory).
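For example, a mining run that stores its output in a custom folder and also tightens the language bias of the domain-theory learner could be launched as follows; the folder path and the constraint limits are illustrative choices, not values used in the papers:

java
    -Dida.searchPruning.dataset=./datasets/nci_transformed/gi50_screen_KM20L2.txt
    -Dida.searchPruning.learn=features
    -Dida.searchPruning.maxDepth=4
    -Dida.searchPruning.maxVariables=8
    -Dida.searchPruning.minSupport=1
    -Dida.searchPruning.mining=bfs
    -Dida.searchPruning.runner.overallLimit=1200
    -Dida.searchPruning.targetFolder=./myResults
    -Dida.logicStuff.mineWithSaturations=true
    -Dida.logicStuff.constraints.maxLiterals=3
    -Dida.logicStuff.constraints.maxVariables=10
    -jar pruningHypotheses.jar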

Running the feature mining without the saturation-based pruning, leaving only isomorphism-based pruning, is as simple as setting -Dida.logicStuff.mineWithSaturations=false. Then, no domain theory is learned, so you won't find the first part of the output.

In the output folder, you will find a file with the domain theory and cumulative statistics with features per layer. Aggregating these results should be relatively easy and leads to results similar to those in the papers, see img_1.png.

Miscellaneous

This code base contains many more ideas around saturation-based pruning than were published in either of the publications. For example, you should be able to use the ida.searchPruning.targetPredicate parameter to learn Horn rules. There are also more search strategies implemented, e.g., beam search, DFS, etc., that incorporate either the basic saturation-based pruning or a much more extended version; these include speed-up hacks (e.g., language bias and caching refinements, i.e., literals, that were deemed trivial w.r.t. the domain theory) and extended saturations (e.g., incorporating clauses with imperfect coverage of the dataset and adaptive saturation learning). Note that these changes in the codebase may slightly change the results; you may find earlier implementations in older commits. Similarly, the examples in this readme were executed on a different machine, with different parameters, subsets of data, etc., than the experiments in the papers. A proper experimental setup would also set the JVM memory via -Xms and -Xmx.
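As a sketch, a run with explicit heap settings could look like the following; the heap sizes are placeholders (pick values suitable for your machine and dataset), and the omitted properties are the same as in the earlier examples:

java -Xms4g -Xmx16g
    -Dida.searchPruning.dataset=./datasets/imdb/merged.db.oneLine
    -Dida.searchPruning.learn=domainTheory
    ...
    -jar pruningHypotheses.jar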

