Category Archives: Development

My SAT solver fuzzing setup

Since the SAT solver fuzzing paper by Brummayer, Lonsing and Biere, SAT solver fuzzing has been quite the thing in the SAT community. Many bugs have since been discovered in almost all SAT solvers, including those of this author. In this blog post I will detail my SAT fuzzing setup: what fuzzers I use and how I use them.

My fuzzers

The most obvious fuzzers to use are those present in the paper and accompanying website: CNFFuzz and FuzzSAT. These fuzzers randomly genereate high-quality structured problem instances. The emphasis here is on the word structured. Randomly generated instances, as the authors of the original paper note, do not excercise SAT solvers enough. SAT solvers are written with very specific problem types in mind: problems containing long chains of binary clauses, gates, etc.

Over the year(s) I have found, however, that these fuzzers are not enough. One of the revelations came when Vegard sent me lots of bugreports about CryptoMiniSat failing when Gaussian elimination was enabled. It turned out that although I ran the fuzzer for long hours, I only ran it with the default option, Gaussian turned off. So default options are sometimes not enough and when off-the track, systems can fail.

Vegard then sent me a fuzzer that wasn’t using CNFFuzz or FuzzSAT, but instead his own CNF generator for SHA-1 hash functions. Instances thus generated are not only structured, they are also very picky about the solution: they either have exactly one or zero. This helped fuzz solution reconstruction, which after blocked clause elimination may not be exactly easy. In fact, it’s so good at fuzzing that it finds a bug I have never been able to fix in CryptoMiniSat, to this day, about binary clauses. Whenever I block them, I cannot reconstruct the solution, ever, unless I disable equivalent literal replacement. The authors of the paper give a pretty complicated proof about why it should work in the presence of equivalent literal replacement, which I still haven’t understood.

When I implemented disconnected component analysis into CryptoMiniSat 3, I knew I had to be careful. In the SAT Competition’11 I failed terribly with it: CryptoMiniSat 2 memory-outed and crashed on many instances with disconnected components. The solution I came up with was to make a program that could concatenate two or more problems into one problem, renumbering the variables. This meta-fuzzer (that used fuzzers as inputs) was then used to fuzz the disconnected component handling system. Although this sounds sufficient, it is in fact far from it. Firstly, the problems generated are disconnected at toplevel — whereas problems in the real world sometimes disconnect later during the search. Secondly, it took 2, maybe 4 CNF inputs, but real problems sometimes have thousands of disconnected components. Both of these issues had to be addressed.

One of the most difficult parts of modern SAT solvers is time-outs of complicated algorithms. Time-outs sometimes make code quite convoluted, and thus are a source of bugs. However, most fuzzers generate problems that are too small for any time-outs to occur, thus severely limiting the scope of the fuzzer. So, the fuzzers don’t exercise a part of the system that is fully exercised by real-world problems. To fix this, I have written a short fuzzer script that generates enormous, but simple problems with millions of variables. This helped me fix a lot of issues in timeouts. I advise having a RAM-disk when using it though, as it writes gigabyte-size CNFs.

I also use some other fuzzers, namely one that excercise XOR-manipulation and the sgen4 CNF generator by Ivor Spence (PDF, pages 85-86) that generates problems of tuneable hardness.

My fuzzing system

Fuzz CNF generators are nice, but not enough on their own. A system around them is necessary for the full benefit. Writing one sounds trivial at first, but my system around the fuzzers is about 1000 lines of python code, so maybe it’s not that trivial after all. It has to do a number of things to be truly useful.

First, it needs to test the solutions given by the solver. If the solution is SAT, it needs to read the (gzipped) CNF file, and check the (partial) solution. Second, if the solution is UNSAT, it either needs to call the DRUP checker to verify the proof, and/or it needs to call another solver to see if it finds a solution or not — and verify that the other solver’s solution is correct. I actually run both DRUP and the other solver, you can never be sure enough. Note that solution verification also complicates time-limiting: we don’t want to wait 30 minutes because a fuzz CNF is too complicated, and we don’t want to indicate an error just because the problem is hard.

Secondly, fuzzing is not only about generating CNFs that trigger bugs, but also about saving these problematic cases for later, as a form of regression tests. These regression tests need to be ran regularly, so that bugs that came up before don’t come up again. I have a large set of regression tests that I have accumulated over the years. These help me increase confidence that when I improve the solver, I don’t accidentally re-introduce bugs. My fuzzer system helps manage and run these regression tests.

Finally, we don’t just want to fuzz the solver through the CNF interface. SAT solvers typically expose a library interface where one can add clauses, solve (with assumptions) and check the conflict returned in terms of the assumptions given. For now, CryptoMiniSat’s library is fuzzed through a specialized CNF format, where “c Solver::solve()” lines are interspersed with the regular clauses. These lines indicate for the solver to call the solve() method and output the solution to a file. This file is then read by the fuzzing system and verified by copying parts of the file into another CNF and using the DRUP-checker and another SAT solver to verify the results. This simulates library usage without assumptions. Thanks to Lukas Prokop who alerted me to bugs in assumptions usage, assumptions-based solving is also simulated through “c Solver::solve( LIT LIT LIT…)” calls, where LITs are replaced with a random literals.

Future work

I want to do two things related to testing and fuzzing. First, I want to improve the checker so that it checks for more problems. I want to check the conflict returned when assumptions are used. Further, I want to check timeouts — this will require some coordination between the fuzzer system and the solver where times are printed that are later checked for correctness. Some optimisations must never take more than N seconds on a fast enough machine, and this could be checked relatively easily.

Second, I want to move CryptoMiniSat into a test-driven system. This will require some quite extensive code re-writing so that test-harnesses will be easier to write. However, it will help fix one issue that hasn’t been talked about, kind of like the elephant in the room: performance bugs. They are quite prevalent, and cannot be caught by simple fuzzing. They plague many SAT solvers, I’m sure, as I keep on finding them at every corner. They are the silent bugs that prevent solvers to be more performant, yet they don’t diminish the resulting solution — only the time takes to find it.

CryptoMiniSat 3.3 released

This is just a short note that CryptoMiniSat 3.3 has been released. This is mainly a bugfix release. I have managed to make a fuzzer that simulates library usage of the system by interspersing the CNF with “c Solver::solve()” calls and then checking the intermediate solutions. Checking is either performed by verifying that all clauses are satisfied, or if the solution is UNSAT, by using a standard SAT solver (I use lingeling). This turned up a surprising number of bugs. I have also fixed the web-based statistics printing which should now work without a hitch.

The non-bugfix part of the release is that CryptoMiniSat now saves memory by dynamically shortening and reallocating the watchlist pointer array and some other data structures. This can save quite a bit of memory on large, sparse instances.

Clause glues are a mystery to me

Note: the work below has been done in collaboration with Vegard Nossum, but the wording and some of the (mis-)conculsions are mine. If you are interested, check out his master thesis, it’s quite incredible

Anyone who has ever tried to really understand clause glues in SAT solvers have probably wondered what they really mean. Their definition is simple: the number of variables in the final conflict clause that come from different decision levels. An explanation of these terms can be found here. On the surface, this sounds very simple and clean: the number of different decision levels should somehow be connected to the number of variables that need to be set before the learnt clause activates itself, i.e. it causes a propagation or a conflict. Unfortunately, this is just the surface, because if you try to draw 2-3 implication graphs, you will see that in fact the gap between the number of variables needed to be set (let’s call this ‘activation number’) and the glue can be virtually anything, making the glue a potentially bad indicator of the activation number.

The original reasoning

The original reasoning behind glues is the following: variables in the same decision level, called ‘blocks of variables’ have a chance to be linked together through direct dependencies, and these dependencies should be expressed somehow in order to reduce the number of decisions needed to reach a conflict (and thus ultimately reduce the search space). To me, this reasoning is less clear than the one above. In fact, there are about as many intuitions about glues as the number of people I have met.

Glucosetrack

With Vegard Nossum we have developed (in exactly one day) something quite fun. On the face of it, it’s just glucose 1.0, plain and simple. However, it has an option, “-track”, which does the following: whenever a learnt clause is about to cause a conflict, it jumps over this learnt clause, saves the state of the solver, and works on until the next conflict in order to measure the amount of work the SAT solver would have had to do if that particular learnt clause had not been there. Then, when the next clause wishes to cause a conflict, it records the amount of propagation and decisions between the original conflict and this new conflict, resets the state to the place saved, and continues on its journey as if nothing had happened. The fun part here is that the state is completely reset, meaning that the solver behaves exactly as glucose, but at the same time it records how much search that particular cause has saved. This is very advantageous because it doesn’t give some magical number like glue, but actually measures the usefulness of the given clause. Here is a typical output:

c This is glucose 1.0  with usefulness tracking by
c Vegard Nossum and Mate Soos. Based on glucose, which
c is in run based on MiniSat, Many thanks to all teams
c
c ============================[ Problem Statistics ]=============================
c |                                                                             |
c |  Number of variables:  138309                                               |
c |  Number of clauses:    942285                                               |
c |  Parsing time:         0.28         s                                       |
============================[ Search Statistics ]==============================
| Conflicts |          ORIGINAL         |          LEARNT          | Progress |
|           |    Vars  Clauses Literals |    Limit  Clauses Lit/Cl |          |
===============================================================================
|         0 |  138309   942285  2636352 |   314095        0   -nan |  0.000 % |
|       620 |  138074   942285  2636352 |   314095      615     82 |  1.559 % |
|       919 |  135307   925938  2596895 |   314095      906     75 |  3.908 % |
|      2714 |  130594   894562  2518954 |   314095     2664     67 |  6.799 % |
|      2814 |  130593   894562  2518954 |   314095     2763     69 |  6.808 % |
|      2930 |  130592   894562  2518954 |   314095     2879     70 |  6.808 % |
|      4042 |  127772   874934  2471045 |   314095     3974     69 |  9.292 % |
|      4142 |  127772   874934  2471045 |   314095     4074     70 |  9.292 % |
...
c Cleaning clauses (clean number 0). Current Clause usefulness stats:
INSERT INTO data(cleanno, idx, size, glue, conflicts, props, bogoprops, decisions) VALUES(0 , 961074 , 107 , 5 , 42 , 185509 , 1301341 , 0);
INSERT INTO data(cleanno, idx, size, glue, conflicts, props, bogoprops, decisions) VALUES(0 , 944729 , 14 , 1 , 7 , 36865 , 268229 , 0);
INSERT INTO data(cleanno, idx, size, glue, conflicts, props, bogoprops, decisions) VALUES(0 , 948275 , 7 , 1 , 15 , 27909 , 220837 , 0);
INSERT INTO data(cleanno, idx, size, glue, conflicts, props, bogoprops, decisions) VALUES(0 , 953762 , 102 , 2 , 2 , 29365 , 197410 , 0);
...
c End of this round of database cleaning
...
|     38778 |  110896   758105  2182915 |   314095    28270     93 | 20.167 % |
|     39488 |  110894   758105  2182915 |   314095    28978     93 | 20.185 % |
c Cleaning clauses (clean number 1). Current Clause usefulness stats:
INSERT INTO data(cleanno, idx, size, glue, conflicts, props, bogoprops, decisions) VALUES(1 , 979236 , 71 , 1 , 8 , 45531 , 280156 , 0);
INSERT INTO data(cleanno, idx, size, glue, conflicts, props, bogoprops, decisions) VALUES(1 , 954908 , 2 , 2 , 7 , 0 , 232760 , 0);
...

The output is in SQL format for easy SQL import. The “size” is the clause size, “glue” is the glue number, “conflicts” is the number of times the clause caused a conflict, “props” is the number of propagations gained by having that clause (i.e. by doing the conflict early), “bogoprops” is an approximation of the amount of time gained based on the number of watchlists and the type of clauses visited during propagation, and “decisions” is the number of decisions gained. The list is sorted according to “bogoprops”, though once the output is imported to MySQL, many sortings are possible. You might notice that the ‘glue’ is 1 for some clauses (e.g. on the second output line) — these are clauses that have caused a propagation at decision level 0, so they will eventually be removed through clause-cleaning, since they are satisfied. Notice that high up, there are some relatively large clauses (of size 102 for example) with glue 2, that gain quite a lot in terms of time of search. The gained conflicts/propagations/etc. are all cleared after every clause-cleaning, though since clauses are uniquely indexed (‘idx’), they can be accumulated in all sorts of ways.

The program is 2-5x slower than normal glucose, but considering that it has to save an extreme amount of state due to the watchlists being so difficult to handle and clauses changing all the time, I think it does the job quite well — as a research tool it’s certainly quite usable. In case you wish to download it, it’s up in GIT here, and you can download a source tarball here. To build, issue “cmake .” and “make”. Note that the tool only measures the amount of search saved by having the clause around when it tries to conflict. It does not measure the usefulness of the propagations that a learnt clause makes. Also, it doesn’t measure the other side of the coin: the (potentially better) conflict generated by using this clause instead of the other one. In other words, it gives a one-sided view (no measure of help through propagation) of a one-sided view (doesn’t measure the quality of difference between the conflict clauses generated). Oh well, it was a one-day hack.

Experiments

I have made very few experiments with glucosetrack, but you might be interested in the following result. I have taken UTI-20-10p0, ran it until completion, imported the output into MySQL, and executed the following two queries. The first one:

SELECT glue, AVG(props), FROM data
WHERE glue >= 2 AND size >= 2
GROUP BY glue
ORDER BY glue

calculates the average number of saved propagations between each round of cleaning for clauses of glue >= 2 (i.e. clauses that didn’t eventually cause a propagation at decision level 0), and of size >= 2, because unitary clauses are of no interest. The second is very similar:

SELECT size, AVG(props), FROM data
WHERE glue >= 2 AND size >= 2
GROUP BY size
ORDER BY size

which calculates the same as above, but for size.

Some explanation is in order here regarding why I didn’t count SUM(), and instead opted for AVG(). In fact I personally did make graphs for SUM(), but Vegard corrected me: there is in fact no point in doing that. If I came up with a new glue calculation function that gave an output of ‘1’ for every clause, then the SUM for that function would look perfect: every clause would be in the same bracket, saving a lot of propagations, but that would not help me make a choice of which clauses to throw out. But the point of glues is exactly that: to help me decide which clauses to throw out. So what we really want is a usefulness metric that tells me that if I keep clauses in that bracket, how much do I gain per clause. The AVG() gives me that.

Here goes the AVG() graph for the last clause cleaning (clause cleaning iteration 33):

Notice that the y axis is in logscale. In case you are interested in a larger graph, here it is. The graph for clause cleaning iteration 22 is:

(Iteration 11 has high fluctuations due to less data, but for the interested, here it is). I think it’s visible that glues are good distinguishers. The graph for glues drops down early and stays low. For sizes, the graph is harder to read. Strangely, short clauses are not that good, and longer clauses are better on average. If I had to make a choice about which clauses to keep based on the size graph, it would be a hard choice to make: I would be having trouble selecting a group that is clearly better than the rest. There are no odd-one-out groups. On the other hand, it’s easier to characterise which clauses are good to have in terms of glues: take the low glue ones, preferably below 10, though we can skip the very low ones if we are really picky. An interesting side-effect of the inverse inclination of the size and glue graphs and the fact that “glue<=size” is that maybe we could choose better clauses to keep if we go for larger clauses that have a low glue.

Conclusions

Unfortunately, there are no real conclusions to this post. I guess running glucosetrack for far more than just one example, and somehow also making it measure the difference between the final conflict clauses’ effectiveness would help to write some partially useful conclusion. Vegard and me have tried to put some time and effort into this, but to not much avail I am afraid.

PS: Notice that glucosetrack allows you to generate many of the graphs here using the right SQL query.

CryptoMiniSat 3.2.0 released

CyptoMinSat 3.2.0 has been released. This code should be extremely stable and should contain no bugs. In case it does, CryptoMiniSat will fail quite bady at the competition. I have fuzzed the solver for about 2-3000 CPU hours, with some sophisticated fuzzers (all available here — most of them not mine) so all should be fine, but fingers are crossed.

Additions and fixes

The main addition this time is certification proofs for UNSAT through the use of DRUP. This allows for use of the solver where certainty of the UNSAT result is a necessity — e.g. where lives could potentially depend on it. Unfortunately, proof checking is relatively slow through any system, though DRUP seems to be the best and fastest method. Other than the implementation of DRUP, I have fixed some issues with variable replacement.

SAT Competition’13

The description of the solver sent in to the SAT Competition’13 is available from the subfolder “desc” of the tarball. The code of 3.2.0 is actually the same that will run during the competition, the only changes made were:

  • the DRUP output had to be put into the standard output
  • the line “o proof DRUP” had to be printed
  • certified UNSAT binary uses the “–unsat 1” option by default
  • compilation was changed to use the m4ri library that was included with the tarball
  • linking is static so m4ri and other requirements don’t cause trouble
  • boost minimum version had to be lowered

You can download my submissions to the competition, forl, from here and here.

GCC 4.5.2 at the SAT Competition’13

(Note update below)

Just a short note. All MiniSat-derived SAT solvers might give wrong results during the competition. No, MiniSat doesn’t have a bug. GCC has a giant bug. Vegard reported this years ago — I personally put about 12h tracking it down, posted it to the CryptoMiniSat mailing list, and Vegard took it up, stripped it down and reported it.

This is a well-known bug, there is a thread about it on the MiniSat mailing list and I personally asked the competition organizers back in 2011 not to use the 4.5.2(or .3?) they wanted to use. I used to have code to check for this bug (there is a handy check attached to the bug report), but I removed it, though maybe I should add it back. It’s fixed in 4.5.4+, still present in 4.5.3 where Vegard&me found it. I hope I will be able to convince the organisers this time around, too :) By the way, the option you need to pass to gcc to avoid the bug is “-fno-tree-pre” — though of course this turns off an optimisation and so slows down the final executable.

Update: The organisers decided to move to gcc 4.8.0. This is very good news. They also note that the speed of the solvers is better, which is also very good news. I am happy I contributed to this welcome improvement and I hope it doesn’t cause any delays or inconveniences to anyone.