
Predicting Clause Usefulness

In this post, I will take a shot at building prediction models for learnt clause usefulness by running over 150 unsatisfiable CNF problems, extracting over 140 features for each learnt clause, and then calculating whether the clause was used in the UNSAT proof. Here I need to thank Marijn Heule, who helped me add a clause ID to each clause in his DRAT checker so that I could perform supervised learning on the collected data. Thanks Marijn!

The first work on machine learning in SAT was done by the groups creating portfolio solvers such as SATZilla, which calculate features of a CNF and then run the SAT solver that is predicted to solve that instance the fastest. This was achieved by coming up with a set of easy-to-calculate but useful features and then using a standard prediction method to fit SAT solvers to CNFs. The issue, as always with machine learning, is overfitting. Unfortunately, CNFs tended to stay very similar between old SAT Competitions, and overfitting was richly rewarded. Hence, these portfolio solvers sometimes did well in competitions but relatively poorly in practice.

The second wave of improvements came with MapleSAT, where a multi-armed bandit (MAB) framework was used to pick branching variables. This won the 2016 SAT Competition’s Main Track and was a novel and truly interesting idea: manipulating the branching heuristic based on a feature the researchers called “learning rate”.

With this blog post, I hope to inspire a third wave of improvements in SAT solver performance. Learnt clause usefulness metrics have been a very hot potato since the glucose SAT solver, which engineered a new feature, glue, and used it to decide clause usefulness. Maybe with the work and data provided below, we could use features not used before, or combine them in novel ways, to achieve better prediction accuracy.

A History of Clause Usefulness Heuristics

For a long time, learnt clauses were chosen to be kept in the clause database based on a feature called their “activity”, which measured how often the clause was involved in generating conflict clauses. This was the heuristic used in the original MiniSat.

The next big step came with glue, present already in the earliest solvers of the Glucose family. This was a huge step in a different direction: suddenly a new feature, “glue”, had been engineered, and the heuristic used to decide changed. It was now the glue of the clause, relative to the other clauses, that determined whether the clause stayed or was discarded.

The next step was a hybrid approach that is still with us today: learnt clauses with very low glue should stay no matter what, clauses with slightly higher glue should stick around for a while at least, and clauses with much higher glue should be kept only for a short while, based on their activities. This is the strategy used by MapleSAT, a very successful SAT solver.

DRAT and the Era of Knowing What’s Good

The DRAT-trim proof checker has ushered in a new possibility in SAT solving. We can finally know how the UNSAT proof was built and, furthermore, which lemmas, i.e. which learnt clauses, were actually used to build that proof. This means that we can finally measure all sorts of features of clauses, e.g. their glue, their size, or their activity, store these along with the clause, and, once the SAT solver has finished and we know which clauses were actually useful, try to build predictive models using supervised machine learning. However, for whatever reason, nobody had done this.

Thanks to Marijn Heule, whom I asked to help me parse clause IDs in DRAT, I have finally built a system that does exactly the above. It gathers over 140 features about each learnt clause; when the solving finishes, it runs DRAT and associates an “OK” or “BAD” nominal class value with the features. The output is an SQLite file for each solved problem. I picked 153 problems from the past 6 years of SAT competitions that all solve as UNSAT in under 10 minutes, and ran my setup on them. The solver used was CryptoMiniSat, and, so as not to taint the results, I set it never to delete any clause from the database (i.e. no clause cleaning was performed) and to preprocess the instance but not to perform any in-processing. Furthermore, the restart heuristic was set to the one pioneered by swdia5by, i.e. glue-based and geometric restarts combined in an iterative way.

The Features

The features saved broadly fall into four categories:

  1. Features computed once in a while on the whole CNF. These are similar to those engineered in SATZilla and other portfolio solvers. Thanks go to Yuri Malitsky and Horst Samulowitz, who sent me “features_fast.c”. This includes things like statistics about clause and variable distributions, etc. These start with “feat.”, such as “feat.horn_min”.
  2. Features of the previous restart. This includes things like the number of binary clauses learnt, the average branch depth, the average learnt clause glue, the average clause size, etc. These start with “rst.”, such as “rst.propLongRed”, the number of propagations made by long redundant (what most solvers call “learnt”) clauses.
  3. Features of the current clause, such as its size, its glue, its variables’ average standardised activity (i.e. divided by “var_inc”), etc. These start with “cl.”, such as “cl.conflicts_this_restart”, the number of conflicts in this restart.
  4. Some computed features, such as “cl.glue_rel”, which is equal to “cl.glue” / “cl.glue_hist”.
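
As a tiny illustration of this last category, here is a hypothetical Python helper mirroring the “cl.glue_rel” computation; I am assuming that “cl.glue_hist” is a recent average of glues (the exact semantics live in the feature extraction code):

```python
# Hypothetical helper mirroring the computed feature cl.glue_rel = cl.glue / cl.glue_hist.
# Assumption: cl.glue_hist is a recent (running-average) glue value.
def glue_rel(glue: float, glue_hist: float) -> float:
    """Relative glue: values below 1.0 mean 'better than the recent average'."""
    return glue / glue_hist if glue_hist > 0 else float("inf")

print(glue_rel(5, 8.2))  # ~0.61: this clause's glue is well below the recent average
```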

The final “class” attribute is set either to “OK”, meaning the clause was used in the final UNSAT proof, or to “BAD”, meaning it was not. Obviously, we never want to keep the BAD ones and want to keep only the OK ones. If we could predict, based on the four types of features above (which are all available at the time of the clause’s creation), which clauses we should keep, then we could potentially solve problems a lot faster: partially because we would simply throw away the clauses we predict to be useless, and partially because we could steer the solver towards regions of the search space that contain more useful clauses, e.g. by restarting when we decide that where we are is a poor place to be.

The Data

CSV is available HERE

First of all, a disclaimer. There are a million ways this could be done differently, improved, or made worse. I have collected over a hundred features, and I think they are useful. I have also conducted the experiments on over 150 problems, each running for at most 10 minutes (but at least 10’000 conflicts), using CryptoMiniSat, never deleting any clause. This constitutes millions of data points. I have spent over two years on this, and I have been analysing different versions of this dataset for months. Hence, I may well be biased, both in the way the data has been collected and in how I’m analysing it. To counter these biases, I am making the data available so you can perform your own analysis, and CryptoMiniSat is fully open source, including everything you need to completely and precisely re-generate this (or any other) data. See the bottom of this post for some how-tos.

I am only making available the sampled CSV for space and practicality reasons. I created a CSV that contains 153 problems’ datapoints, picking 10’000 randomly from each, which was then randomly sampled to 30’000 elements. This 51MB data is available here. I strongly suggest using Weka to read this file. In case you have trouble using Weka and/or creating predictive models, I’d suggest the free online course Data Mining with Weka.
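
For reference, here is a rough sketch of how such a sampled CSV could be put together in Python, assuming each problem’s SQLite output has already been exported to its own CSV (all file names below are illustrative):

```python
import glob
import pandas as pd

# Take 10'000 random rows from each per-problem CSV, then 30'000 rows overall.
frames = []
for path in glob.glob("per_problem_csvs/*.csv"):
    df = pd.read_csv(path)
    frames.append(df.sample(n=min(10_000, len(df)), random_state=1))

combined = pd.concat(frames, ignore_index=True)
combined.sample(n=30_000, random_state=1).to_csv("sampled_30k.csv", index=False)
```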

Note: I have the full 1530k-line file as well, but it’s 2.6GB. If you really need it, please let me know by email. For almost all analyses, 30k datapoints is sufficient: Weka will not be able to practically work with any dataset over 100k elements. In case you want to perform deep learning, we can take a shot at that 2.6GB dataset together. However, I think clustering and per-cluster prediction models are a lower-hanging fruit.

Preliminary Analysis

So, what does the data tell us? Throughout this section I will use a 66% percentage split for training vs. testing, using the 30k-line data file above. All results below are trivially reproducible.

Zero Rule Check
First of all, let’s do a ZeroR analysis, i.e. decide whether to keep a clause without looking at any of its features. We get:

So we have about 50% useful and 50% useless clauses. Note that this may not be representative of problems that are difficult: we picked problems that solve within 10 minutes and have at least 10’000 conflicts.
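
If you prefer scikit-learn to Weka, the ZeroR baseline can be approximated with a majority-class dummy classifier. A minimal sketch, assuming the 30k-line CSV and its “class” column as described above:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("sampled_30k.csv")       # illustrative file name
X = df.select_dtypes("number").fillna(0)  # numeric features only
y = df["class"]                           # "OK" / "BAD"

# 66% / 34% split, as in the Weka experiments
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=1)

zero_r = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
print("ZeroR-style accuracy:", zero_r.score(X_te, y_te))  # ~0.5 for a balanced dataset
```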

One Rule Check
Next up, let’s do a slightly less trivial setup: OneR, i.e. one-rule, analysis. Here, Weka is only allowed to pick one feature to decide with. One would hope this to be glue. It isn’t. Let’s set the minimum bucket size to 100, which means Weka won’t give us 100 different small decisions (i.e. won’t overfit, something one can observe in some portfolio solvers). We get:

What is very strange here is that the best predictor is not glue. In fact, the best single-feature predictors are:

  1. cl.decision_level (72.3%)
  2. rst.branchDepth (67.4%)
  3. rst.branchDepthDelta (67.6%)
  4. cl.decision_level_hist (67.1%)
  5. cl.backtrack_level_hist (67.1%)

So they all seem to have something to do with the branch depth/decision depth. If forced to use glue, Weka will draw the line at glue 15 and below being OK and 16 and above being BAD, giving 61.9% accuracy.
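
A rough scikit-learn analogue of this ranking is to train a one-feature decision stump per feature and sort by test accuracy. This is not exactly Weka’s OneR (which buckets a feature into several intervals), but it reproduces the spirit of the experiment:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("sampled_30k.csv")       # illustrative file name
X = df.select_dtypes("number").fillna(0)
y = df["class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=1)

# one-feature "stumps", with a minimum bucket size comparable to the Weka setting
scores = {}
for feat in X.columns:
    stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=100)
    stump.fit(X_tr[[feat]], y_tr)
    scores[feat] = stump.score(X_te[[feat]], y_te)

for feat, acc in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{feat}: {acc:.1%}")
```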

J48 Decision Tree
Let’s see if we can get some interesting results with J48, a standard decision tree learner. I’m going to require a minimum of 400 elements in any bucket, or we will overfit.

Interesting… so we first branch on backtrack_level, which is not surprising given the OneR results above. Then we either use the glue distribution variance from the last time the whole-CNF features were measured, “feat.red_glue_distr_var”, or the previous restart’s average branch depth delta (i.e. backjump size), “rst.branchDepthDelta”. The lower branches are interesting as well.
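
For those without Weka at hand, a roughly comparable tree can be grown with scikit-learn. Note that J48 is C4.5-based while scikit-learn uses CART, so the splits will not be identical, but the minimum-bucket-size idea maps to min_samples_leaf:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("sampled_30k.csv")       # illustrative file name
X = df.select_dtypes("number").fillna(0)
y = df["class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=1)

# minimum of 400 elements per leaf, mirroring the J48 bucket-size setting
tree = DecisionTreeClassifier(min_samples_leaf=400, random_state=1).fit(X_tr, y_tr)
print("accuracy:", tree.score(X_te, y_te))
print(export_text(tree, feature_names=list(X.columns), max_depth=3))
```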

Random Decision Tree
Just for completeness’ sake, let’s try a RandomTree as well. Let’s set MinNum to 500 to combat overfitting. We then get a pretty complicated tree with the following stats:

This is pretty fair performance, but the tree is hard to understand and its performance is worse than J48’s, though not substantially so.

There are other analyses that could be done, for example clustering of similar problems — you can find the CNF filename in feature “fname” (though you should never make any decisions based on that in your decision trees). Building problem classes may be interesting — we could cut up the CSV into different classes and create different prediction models for each. I am pretty sure this would significantly boost our prediction performance.
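
A crude sketch of this per-class idea, using the CNF file name only to group the rows and never as a feature inside the trees; the grouping rule below is purely illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("sampled_30k.csv")       # illustrative file name

# group rows by a crude "problem class" derived from the file name;
# a real setup would rather cluster on the feat.* columns
df["problem_class"] = df["fname"].str.split("-").str[0]

for name, group in df.groupby("problem_class"):
    y = group["class"]
    if len(group) < 2000 or y.nunique() < 2:
        continue                          # skip groups too small to learn from
    X = group.select_dtypes("number").fillna(0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.66, random_state=1)
    tree = DecisionTreeClassifier(min_samples_leaf=200, random_state=1).fit(X_tr, y_tr)
    print(f"{name}: {tree.score(X_te, y_te):.1%} on {len(group)} rows")
```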

You can use Weka to visualize the data in many ways. Here is one:

You can use such visualisations to find anomalies, incorrect datapoints, or skews in the collection of data. If you do find any of these, please email me. I am happy to correct them of course — we might even achieve better prediction.
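
Outside of Weka, a quick scatter plot is often enough to spot such anomalies. A minimal matplotlib sketch, with an arbitrary choice of two features:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sampled_30k.csv")       # illustrative file name
colors = df["class"].map({"OK": "tab:green", "BAD": "tab:red"}).tolist()

plt.scatter(df["cl.glue"], df["cl.decision_level"], c=colors, s=4, alpha=0.3)
plt.xlabel("cl.glue")
plt.ylabel("cl.decision_level")
plt.savefig("glue_vs_decision_level.png", dpi=150)
```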

How to Do This at Home

It’s quite easy to create data for your own CNF instances. Just follow the README on the CryptoMiniSat GitHub repository, which says to get a clean Debian or Ubuntu distribution and then:

This will do predictive analysis using the DecisionTreeClassifier from Python’s scikit-learn (which may or may not be what you want) on a test problem, and will also build a nice visual PNG tree using graphviz:

Note that this decision tree is highly non-representative — it ran on a single, incredibly small instance. You can look at test_predict/data.sqlite.csv with Weka to run your own analysis and build your own, custom trees. To run this on a CNF of your own choice, run:

You will then have your scikit-learn tree generated and, more importantly, the CSV under “myoutput/data.sqlite.csv”, so you can read that with Weka and do your own analysis. Beware that you probably want to cut down the size of that CSV before loading it into Weka, but randomise first or you will only see the beginning of the solving. Concatenating the CSVs of different runs is easy; just make sure to strip the topmost line from each CSV: it’s the header and should not be repeated. For randomisation, use “shuf file”; for cutting the header off, use “tail -n +2 file”; to get only the header, use “head -n 1 file”.
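
If you prefer to stay in Python, the same concatenate/shuffle/truncate pipeline can be done with pandas, which also takes care of the repeated headers (the directory names below are illustrative):

```python
import glob
import pandas as pd

paths = glob.glob("myoutput*/data.sqlite.csv")                 # CSVs from several runs
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

# shuffle first (so you don't only see the beginning of the solving), then cut down
small = combined.sample(frac=1, random_state=1).head(100_000)
small.to_csv("combined_small.csv", index=False)
```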


I think the above could help in choosing parameters/features and cutoffs when deciding whether to keep a learnt clause, and in determining when we are in a bad place in the search. Until now this has been done experimentally: playing with the cutoffs, trying new cutoffs, and running all the problems on a large cluster, many times.

With the above data in hand, I think we could do better. Of course we will still be using clusters, verifying what the data analysis is indicating. But we may be able to check, engineer and determine the features and their cutoffs better. And finally, with all this data, I am hopeful that we will be able to have a discussion grounded in more data than just solving times.

Clause glues are a mystery to me

Note: the work below has been done in collaboration with Vegard Nossum, but the wording and some of the (mis-)conclusions are mine. If you are interested, check out his master’s thesis; it’s quite incredible.

Anyone who has ever tried to really understand clause glues in SAT solvers has probably wondered what they really mean. Their definition is simple: the number of variables in the final conflict clause that come from different decision levels. An explanation of these terms can be found here. On the surface, this sounds very simple and clean: the number of different decision levels should somehow be connected to the number of variables that need to be set before the learnt clause activates itself, i.e. causes a propagation or a conflict. Unfortunately, this is just the surface: if you try to draw 2-3 implication graphs, you will see that the gap between the number of variables that need to be set (let’s call this the ‘activation number’) and the glue can be virtually anything, making glue a potentially bad indicator of the activation number.
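
For concreteness, here is a minimal sketch of the glue computation as defined above, with literals represented as signed integers:

```python
def clause_glue(clause, decision_level):
    """Glue of a learnt clause: the number of distinct decision levels among
    its literals. clause: iterable of signed ints; decision_level: var -> level."""
    return len({decision_level[abs(lit)] for lit in clause})

# toy example: four literals coming from decision levels 3, 3, 7 and 9 -> glue 3
levels = {1: 3, 2: 3, 5: 7, 8: 9}
print(clause_glue([1, -2, 5, -8], levels))  # prints 3
```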

The original reasoning

The original reasoning behind glues is the following: variables in the same decision level, called ‘blocks of variables’, have a chance to be linked together through direct dependencies, and these dependencies should be expressed somehow in order to reduce the number of decisions needed to reach a conflict (and thus ultimately reduce the search space). To me, this reasoning is less clear than the one above. In fact, there are about as many intuitions about glues as there are people I have met.


With Vegard Nossum we have developed (in exactly one day) something quite fun. On the face of it, it’s just glucose 1.0, plain and simple. However, it has an option, “-track”, which does the following: whenever a learnt clause is about to cause a conflict, it jumps over this learnt clause, saves the state of the solver, and works on until the next conflict in order to measure the amount of work the SAT solver would have had to do if that particular learnt clause had not been there. Then, when the next clause wishes to cause a conflict, it records the amount of propagation and decisions between the original conflict and this new conflict, resets the state to the place saved, and continues on its journey as if nothing had happened. The fun part here is that the state is completely reset, meaning that the solver behaves exactly as glucose would, but at the same time it records how much search that particular clause has saved. This is very advantageous because it doesn’t give some magical number like glue, but actually measures the usefulness of the given clause. Here is a typical output:

The output is in SQL format for easy SQL import. “size” is the clause size, “glue” is the glue number, “conflicts” is the number of times the clause caused a conflict, “props” is the number of propagations gained by having that clause (i.e. by reaching the conflict early), “bogoprops” is an approximation of the amount of time gained, based on the number of watchlists and the type of clauses visited during propagation, and “decisions” is the number of decisions gained. The list is sorted according to “bogoprops”, though once the output is imported into MySQL, many sortings are possible. You might notice that the glue is 1 for some clauses (e.g. on the second output line): these are clauses that caused a propagation at decision level 0, so they will eventually be removed through clause-cleaning, since they are satisfied. Notice that high up the list there are some relatively large clauses (of size 102, for example) with glue 2 that gain quite a lot in terms of search time. The gained conflicts/propagations/etc. are all cleared after every clause-cleaning, though since clauses are uniquely indexed (‘idx’), they can be accumulated in all sorts of ways.

The program is 2-5x slower than normal glucose, but considering that it has to save an extreme amount of state due to the watchlists being so difficult to handle and clauses changing all the time, I think it does the job quite well — as a research tool it’s certainly quite usable. In case you wish to download it, it’s up in GIT here, and you can download a source tarball here. To build, issue “cmake .” and “make”. Note that the tool only measures the amount of search saved by having the clause around when it tries to conflict. It does not measure the usefulness of the propagations that a learnt clause makes. Also, it doesn’t measure the other side of the coin: the (potentially better) conflict generated by using this clause instead of the other one. In other words, it gives a one-sided view (no measure of help through propagation) of a one-sided view (doesn’t measure the quality of difference between the conflict clauses generated). Oh well, it was a one-day hack.


I have made very few experiments with glucosetrack, but you might be interested in the following result. I took UTI-20-10p0, ran it until completion, imported the output into MySQL, and executed the following two queries. The first one:

calculates the average number of saved propagations between each round of clause-cleaning for clauses of glue >= 2 (i.e. clauses that didn’t eventually cause a propagation at decision level 0) and of size >= 2, because unit clauses are of no interest. The second is very similar:

which calculates the same as above, but for size.

Some explanation is in order here regarding why I didn’t use SUM() and instead opted for AVG(). In fact, I personally did make graphs for SUM(), but Vegard corrected me: there is no point in doing that. If I came up with a new glue calculation function that gave an output of ‘1’ for every clause, then the SUM for that function would look perfect: every clause would be in the same bracket, saving a lot of propagations in total, but that would not help me make a choice of which clauses to throw out. But the point of glues is exactly that: to help me decide which clauses to throw out. So what we really want is a usefulness metric that tells me, if I keep the clauses in a given bracket, how much I gain per clause. AVG() gives me that.
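
For those who would rather avoid MySQL, here is a pandas rendering of the same aggregation. It assumes the glucosetrack output has been exported to a CSV with the columns listed earlier, plus a clause-cleaning iteration column (called “cleaning” below, which is an assumption):

```python
import pandas as pd

df = pd.read_csv("uti-20-10p0_glucosetrack.csv")  # illustrative file name

# skip clauses that propagated at decision level 0 (glue 1) and unit clauses
useful = df[(df["glue"] >= 2) & (df["size"] >= 2)]

# AVG(), not SUM(): average saved propagations per clause in each bracket
per_glue = useful.groupby(["cleaning", "glue"])["props"].mean()
per_size = useful.groupby(["cleaning", "size"])["props"].mean()

print(per_glue.loc[33].head(15))  # e.g. the last clause-cleaning iteration
print(per_size.loc[33].head(15))
```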

Here goes the AVG() graph for the last clause cleaning (clause cleaning iteration 33):

Notice that the y axis is in logscale. In case you are interested in a larger graph, here it is. The graph for clause cleaning iteration 22 is:

(Iteration 11 has high fluctuations due to having less data, but for the interested, here it is). I think it’s visible that glues are good distinguishers: the graph for glues drops down early and stays low. For sizes, the graph is harder to read. Strangely, short clauses are not that good, and longer clauses are better on average. If I had to choose which clauses to keep based on the size graph, it would be a hard choice to make: I would have trouble selecting a group that is clearly better than the rest; there are no odd-one-out groups. On the other hand, it’s easier to characterise which clauses are good to have in terms of glues: take the low-glue ones, preferably below 10, though we can skip the very low ones if we are really picky. An interesting side-effect of the opposite slopes of the size and glue graphs, together with the fact that glue <= size, is that maybe we could choose better clauses to keep by going for larger clauses that have a low glue.


Unfortunately, there are no real conclusions to this post. I guess running glucosetrack on far more than just one example, and somehow also making it measure the difference in effectiveness between the final conflict clauses, would help in writing a partially useful conclusion. Vegard and I have tried to put some time and effort into this, but to not much avail, I am afraid.

PS: Notice that glucosetrack allows you to generate many of the graphs here using the right SQL query.

Propagating binary clauses

Binary clauses are somewhat special. They are clauses that look like:

-v1 or v20 = true
-v1 or v21 = true

which basically mean: if v1 is TRUE, then both v20 and v21 must be TRUE. We put these clauses into a special datastructure that looks like this:

watchlist[-v1] = v20,v21, ...

This makes us all very happy, since this datastructure is very simple: just literals (negated or non-negated variables) one after the other, in a very trivial list, which makes the whole thing very compact and very efficient.

The problem with such a list is that when v1 is indeed set to TRUE and we set the variables (e.g. v20, v21), we must record what set these variables to their new values. What is usually done is to simply record a pointer to the clause that did this. However, in the above datastructure there are no clause pointers. The datastructure in glucose contained these pointers; I have lately been experimenting with removing them. The tricky part is updating the conflict generation routine that examines the clauses and decisions that led to a conflict. This routine must now recognise that binary clauses are special and only 2-long: one part of them is the variable that got set, and the other part is the variable that set it. Both pieces of information are available: watchlist[-v1] immediately tells us that the literal -v1 set the variable, and the variable that got set is always known.
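
Here is a toy sketch of this scheme in Python (not CryptoMiniSat’s actual code), with literals as signed integers, showing that the reason for a binary propagation can be stored as a single literal rather than a clause pointer:

```python
from collections import defaultdict

watchlist = defaultdict(list)

def add_binary_clause(a, b):
    """Store binary clause (a OR b): each literal's list holds the other literal."""
    watchlist[a].append(b)
    watchlist[b].append(a)

def on_literal_set_true(lit, assignment, reason):
    """lit has just become TRUE, so -lit is falsified: every binary clause
    containing -lit now forces its other literal."""
    for other in watchlist[-lit]:
        if other not in assignment:
            assignment.add(other)
            reason[other] = -lit  # no clause pointer: the reason is just a literal

# the two example clauses from above: (-v1 OR v20) and (-v1 OR v21)
add_binary_clause(-1, 20)
add_binary_clause(-1, 21)

assignment, reason = {1}, {}       # variable 1 has just been set TRUE
on_literal_set_true(1, assignment, reason)
print(sorted(assignment), reason)  # [1, 20, 21] {20: -1, 21: -1}
```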

Armed with this logic, one would think that moving towards this all-new, no-clause-pointers heaven is the way to go. However, apparently, this might not be so. This is quite astonishing, given that doing things without pointers should mean fewer cache misses (since fewer pointers need to be resolved during conflict analysis). I consistently get worse speeds, though. I am guessing that the problem lies with the datastructure I use to store the information about what led to the setting of a variable. I use a very simple struct: a boolean, wasPointer, plus a union of a clause pointer and a literal.

Here, wasPointer is TRUE if a non-binary clause led here (and we store the pointer), and FALSE if it was a binary clause (and we store the other literal). The union takes care of the rest: pointer is valid when wasPointer is TRUE, and lit is valid when wasPointer is FALSE.

What is apparent in this datastructure is the following: we now store a 32/64-bit pointer plus a boolean. Right, so 40/72 bits of data? No. Data alignment in C++ means that this datastructure will be aligned according to its largest member, so it will be aligned to 32/64-bit boundaries: this datastructure takes 64/128 bits to store on 32- and 64-bit architectures, respectively. Oops, we have nearly doubled our data writes!

There are some remedies, of course. We can simply pack the struct (e.g. with GCC’s __attribute__((packed))), which removes the memory burden of aligning to the largest member, so the actual size really will be 40/72 bits. But this raises the overhead of accessing the data inside the structure. Furthermore, I have been using a 32-bit mini-pointer on 64-bit architectures, which is a big hack, so it crumbles as soon as something as complicated as this comes up: I am left with writing 72 bits instead of 32. This probably leads to the slowdowns. Oh well, I have got to work harder…
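
The padding cost is easy to check without writing any C++. Here is a small Python ctypes sketch that mimics the layout described above and makes the alignment overhead visible (field names are illustrative, not the solver’s actual ones):

```python
import ctypes

class ReasonPayload(ctypes.Union):
    _fields_ = [("pointer", ctypes.c_void_p),   # pointer to a long reason clause
                ("lit", ctypes.c_uint32)]       # the other literal of a binary clause

class Reason(ctypes.Structure):                 # default alignment, as the compiler lays it out
    _fields_ = [("wasPointer", ctypes.c_bool),
                ("payload", ReasonPayload)]

class PackedReason(ctypes.Structure):           # the packed variant
    _pack_ = 1
    _fields_ = [("wasPointer", ctypes.c_bool),
                ("payload", ReasonPayload)]

print(ctypes.sizeof(Reason))        # 16 on a typical 64-bit machine: padded to pointer alignment
print(ctypes.sizeof(PackedReason))  # 9: one flag byte plus the 8-byte union, no padding
```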

EDITED TO ADD: Next up, I will talk about some new ideas that let me solve 217 problems from the 2009 SAT Race within the original time limit. Remember that last year the best solver only solved 204.