Stats on CryptoMiniSat’s development

I have just discovered the gitstats tool, and quickly put together a set of statistics on the development of CryptoMiniSat, the SAT solver I have been working on lately. I have been using a Git repository for most of my work on CryptoMiniSat, making ~2000 commits since its inception, so it gives a good picture of how things evolved and how I usually work. Firstly, the hours I usually work:

I basically work almost as much at midnight as I do at midday. The afternoon is clearly the most active and productive part of my day. Next, the graph showing which days I work the most:

Essentially, I worked as much on CryptoMiniSat during Sunday as I did during Monday or Friday. Saturday was the day I must have been the most fed up with development. Finally, the number of commits by months:

The 2nd and 3rd months of 2010 were a struggle because of the SAT and PoS deadlines. I tried to put together a credible research paper, which finally got accepted at PoS. During the 6th and 7th months of 2010 I was mostly looking forward to the results of the SAT Race, so I wasn't developing as much.

Winning the SAT Race

The software I developed, CryptoMiniSat 2.5, won the SAT Race of 2010. It is a great honour to win such a difficult race, as there were 19 other solvers submitted from 9 different countries.

Thinking back and trying to answer the question of why CryptoMiniSat won, I think there are no clear answers. Naturally, I put immense effort into it, but that is only one part of the picture. Other important factors include those who supported me: my fiancée, the PLANETE project of INRIA, and in particular Claude Castelluccia and my colleagues here in Paris, the LIP6 team and my professor here, Jean-Charles Faugere. As with all randomised algorithms, luck played a good part in winning, too.

In all of this, it is interesting to note that I never had a colleague who worked in or even near the field of SAT: both the PLANETE and the LIP6 team use SAT solvers only as a means to an end, not as an end in itself. In other words, I had to get this far mostly alone. To be honest, I think this is one of the reasons I won, too: there are probably many misunderstandings in SAT that everybody takes for granted. Since I never really talked with anyone about their understanding of the subject (I only read papers, which state facts), I had a blank page in front of me, which I could fill with my own intuitions, without being led by those of others. Naturally, this means that I have probably stepped into a large number of things that are widely understood to be wrong, but in the end, it seems that I must have also made some good steps, too.

As for the bad steps I have made (and of which I am aware), I copied the subsumption algorithm from SatElite, which the authors of MiniSat told me was a bad move, since its self-subsuming resolution is slow. I had in fact observed this, but I thought it was a fact of life. Alas, it isn't. Other errors I have made include a non-portable pointer-shortening trick that shortens pointers to 32 bits on 64-bit architectures (Armin Biere's lingeling does this right), and a wrong clause-prefetching code that Norbert Manthey put me right about. I have also conceptually misunderstood the possible impact of variable elimination on the difficulty of solving an instance, a grave mistake that Niklas Sörensson put me right about.

As for the good steps, I think I have made a couple of them. Firstly, I took the 2009 SAT paper we wrote with my colleagues to its natural conclusion: including XOR clauses and partial XOR reasoning in the innermost parts of the solver. Secondly, I used the community to its fullest potential: CryptoMiniSat was used by the constraint solver STP long before the SAT Race began, and it was distributed to a number of highly skilled individuals who made use of it and gave feedback. Sometimes I got very negative feedback (this was the most valuable), which highlighted where and why the solver was failing. Thirdly, I tried to take advantage of the academic environment I work in: I read the relevant papers, used the clusters provided by the French universities to carry out experiments, and generally took my time (rarely an option in industry) to get things right instead of rushing out a solver.

The 2010 SAT Conference

I am currently at the SAT 2010 conference in Edinburgh. It is my second SAT conference, but the first one where I finally have a chance to know who is who. I had a chance to talk to some great researchers. For instance, I talked to (in no particular order) Armin Biere (author of PrecoSat and Lingeling), Niklas Een, Niklas Sörensson (authors of MiniSat — a new version is out, by the way), Marijn Heule (author of the march_* set of solvers), Matti Järvisalo (one of the authors of Blocked Clause Elimination), the author of MiraXT, and many others. It's really a great place to meet people and get a grasp of what SAT is really about. It's also a great way to understand the motivation behind certain ideas, and to hear about the problems encountered by others.

For instance, I now know that bugs, about which I have already written a blog post, are a problem that comes up for nearly everyone, and tackling them is one of the most difficult things for developers. I have also finally understood the reasoning behind the MiniSat Template Library (MTL), which I've always found odd: it re-implements many things from the STL (the Standard Template Library), which seemed unnecessary to me. Although the reasons are now clear (better control, easier debugging, less memory overhead), I think I will still replace it with the STL. I have also had a chance to talk to the kind reviewer who did such an extensive review of my paper, which was quite an enlightening discussion. If you are interested in SAT, I encourage you to attend next year: there are usually some workshops associated with the conference, where it is easier to have a paper accepted. For example, this year a Masters student got his quite interesting paper (long version) accepted, and I am sure he has benefited a lot from it.

ANF to CNF conversion

Algebraic Normal Form, or simply ANF, is the form used to describe cryptographic functions. It looks like this:
a*b \oplus b*c \oplus b \oplus d = 0
b*c \oplus c \oplus a = 0
\ldots
Where a, \ldots, d are binary variables, the \oplus sign is binary XOR, and the * sign is binary AND. Every term connected by the XORs is called a monomial, and its degree is the number of distinct variables in it. So, a*b is a degree-2 monomial in the first equation, and c is a degree-1 monomial in the second equation.

An important problem in SAT is translating an ANF into the input format of SAT solvers, Conjunctive Normal Form, or simply CNF. A CNF formula looks like this:
a \vee b \vee c = 1
a \vee \neg d = 1
\ldots
Where again a, \ldots, d are binary variables, the \vee sign is binary OR, and the \neg sign is binary NOT (i.e. negation).

The scientific reference paper

The most quoted article about ANF-to-CNF translation is that by Courtois and Bard, which advocates for the following translation process:

  1. Introduce internal variables for every monomial of degree larger than 1
  2. Describe the equations as large XORs using the newly introduced variables (a code sketch of both steps follows this list)
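
To make the two steps concrete, here is a minimal C++ sketch of my own (not code from the paper or from CryptoMiniSat); it assumes monomials are given as lists of variable numbers and emits clauses as DIMACS-style literal lists, where a negative number means a negated variable:

#include <cstddef>
#include <cstdint>
#include <vector>

using Clause = std::vector<int>;  // a clause is a list of literals; -x means "not x"

// Step 1: define v <-> x1 AND x2 AND ... AND xk.
// One long clause (v OR -x1 OR ... OR -xk) plus one short clause (-v OR xi) per variable.
void defineMonomial(int v, const std::vector<int>& xs, std::vector<Clause>& out) {
    Clause big{v};
    for (int x : xs) {
        big.push_back(-x);
        out.push_back({-v, x});
    }
    out.push_back(big);
}

// Step 2: encode y1 XOR y2 XOR ... XOR yk = 0 (for short XORs).
// One clause per odd-parity assignment, i.e. 2^(k-1) clauses in total.
void xorToClauses(const std::vector<int>& ys, std::vector<Clause>& out) {
    const std::size_t k = ys.size();
    for (std::uint32_t mask = 0; mask < (1u << k); ++mask) {
        std::size_t ones = 0;
        for (std::size_t i = 0; i < k; ++i) ones += (mask >> i) & 1u;
        if (ones % 2 == 0) continue;               // only odd-parity assignments are forbidden
        Clause c;
        for (std::size_t i = 0; i < k; ++i)
            c.push_back(((mask >> i) & 1u) ? -ys[i] : ys[i]);
        out.push_back(c);                          // this clause excludes exactly that assignment
    }
}

Calling defineMonomial for each product and xorToClauses for each equation yields exactly the kind of clause set shown in the worked example below.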

The example problem in CNF

According to the original method, the equations presented above are first translated to the following form:
v1 = a*b
v2 = b*c
v1 \oplus v2 \oplus b \oplus d = 0
v2 \oplus c \oplus a = 0

Where v1, v2 are fresh binary variables. Then, each of the above equations is translated to CNF. The internal variables are translated as follows:

  1. Translation of v1 = a*b:
    v1 \vee \neg a \vee \neg b = 1
    \neg v1 \vee a = 1
    \neg v1 \vee b = 1
  2. Translation of v2 = b*c:
    v2 \vee \neg b \vee \neg c = 1
    \neg v2 \vee b = 1
    \neg v2 \vee c = 1
  3. Translation of v1 \oplus v2 \oplus b \oplus d = 0:
    \neg v1 \vee v2 \vee b \vee d = 1
    v1 \vee \neg v2 \vee b \vee d = 1
    v1 \vee v2 \vee \neg b \vee d = 1
    v1 \vee v2 \vee b \vee \neg d = 1
    \neg v1 \vee \neg v2 \vee \neg b \vee d = 1
    \neg v1 \vee \neg v2 \vee b \vee \neg d = 1
    \neg v1 \vee v2 \vee \neg b \vee \neg d = 1
    v1 \vee \neg v2 \vee \neg b \vee \neg d = 1
  4. Translation of v2 \oplus c \oplus a = 0:
    v2 \vee c \vee \neg a = 1
    v2 \vee \neg c \vee a = 1
    \neg v2 \vee c \vee a = 1
    \neg v2 \vee \neg c \vee \neg a = 1

We are now done. The final CNF file is this. The final CNF has a small header, and some fluff has been removed: variables are not named but referred to by number, and the "= 1"s have been replaced with a line-ending 0.
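
To make those conventions concrete, here is what such a DIMACS file could look like for the 18 clauses above; the numbering v1=1, v2=2, a=3, b=4, c=5, d=6 is my own assumption, and the real file may order the variables differently:

p cnf 6 18
1 -3 -4 0
-1 3 0
-1 4 0
2 -4 -5 0
-2 4 0
-2 5 0
-1 2 4 6 0
1 -2 4 6 0
1 2 -4 6 0
1 2 4 -6 0
-1 -2 -4 6 0
-1 -2 4 -6 0
-1 2 -4 -6 0
1 -2 -4 -6 0
2 5 -3 0
2 -5 3 0
-2 5 3 0
-2 -5 -3 0

The "p cnf 6 18" header gives the number of variables and clauses, each subsequent line is one clause, and the trailing 0 marks the end of the clause.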

As you can imagine, there are many ways to enhance this process. I have written a set of them down in this paper. In a nutshell, the improvements are the following:

  1. If a variable’s value is given (e.g. a = true), first substitute this value in the ANF, and only then transcribe the resulting ANF to CNF.
  2. If an equation contains two monomials such as a*b \oplus b, merge them into the non-standard monomial (\neg a)*b, and transcribe this to CNF. Since CNF can use negations, this reduces the size of the resulting CNF (see the worked example after this list).
  3. If the ANF can be described more compactly using Karnaugh maps than with the method presented above, use that translation instead.
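
As a quick illustration of improvement (2) (a worked example of my own, not one taken from the paper): since a*b \oplus b = b*(a \oplus 1) = (\neg a)*b, an equation such as a*b \oplus b \oplus d = 0 can be rewritten as (\neg a)*b \oplus d = 0. Introducing a fresh variable v for the non-standard monomial still takes only three clauses:

v \vee a \vee \neg b = 1
\neg v \vee \neg a = 1
\neg v \vee b = 1

and the remaining XOR, v \oplus d = 0, is one variable shorter than the v1 \oplus b \oplus d = 0 produced by the standard method, so it needs two clauses instead of four.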

An automated way

I just got a nice script to perform step (1) from Martin Albrecht, one of the developers of Sage:

sage: B = BooleanPolynomialRing(4500, 'x')
sage: L = [B.random_element(degree=2, terms=10) for _ in range(4000)]
sage: s = [B.gen(i) + GF(2).random_element() for i in range(1000)]
sage: %time F = mq.MPolynomialSystem(L+s).eliminate_linear_variables(maxlength=1)
CPU time: 1.03 s, Wall time: 1.11 s

In this code, Martin generates a Boolean polynomial ring with 4500 variables and 4000 random equations, each of degree 2 with 10 randomly chosen monomials. Then he assigns random values (true or false) to 1000 of the variables, and finally substitutes the assigned values and simplifies the system. If you think this is a trivial task, alas, it is not: both Maple and Sage take hours to perform it when using the standard eval function. The function above uses a variation of the ElimLin algorithm from the Courtois-Bard paper to do this efficiently.

How to bit-stuff pointers

In my last post I described a small problem I had regarding a C++ union that had to store either a pointer or a value, plus an indicator telling which of the two was active:

#include <cstdint>

class Clause;  // defined elsewhere in the solver

class WhatLeadHere {
public:
    bool wasPointer;  // true if 'pointer' is the active member, false if 'lit' is
    union { Clause* pointer; uint32_t lit; };
};

Martin Maurer wrote this enlightening email about it:

The Clause * pointer is [..] located at least on a 4 byte boundary (at least when not forced by e.g. #pragma). If it is not at least on a 4 byte boundary, CPU must (at least some of them) do two accesses: one lower DWORD, one upper DWORD, and merge both together, which reduces speed.

So the lower two bits are always zero :-) Why not move the “wasPointer” to lower two bits (i think you need only one, have one spare bit). Of course before accessing the “pointer” (Clause *), you must “and” the two lower bits back to zero.

He is perfectly right: thanks to the alignment, the least significant bit (LSB) of a Clause pointer is always zero, so setting it to 1 can be used as the indicator that a pointer is stored (the bit only has to be cleared again before the pointer is dereferenced). This is a form of bit-stuffing. To check whether a pointer or a literal is stored, the following function can be used:

bool WhatLeadHere::wasPointer() const
{
    // The LSB doubles as the flag: 1 means a Clause pointer is stored
    return (lit & 1);
}

As for storing lit, we left-shift it by one, which leaves the LSB at zero. When lit is needed, we right-shift the stored value to get the original back. The shift is not a problem, since variables in SAT never use up the 32-bit space: there are rarely more than 10 million variables, so left-shifting this value will not lead to information loss.
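
Putting the pieces together, here is a minimal sketch of how the bit-stuffed class could look. This is my own illustration rather than the actual CryptoMiniSat code, and the member names, as well as the choice to tag the pointer (rather than the literal) with the LSB, are assumptions:

#include <cassert>
#include <cstdint>

class Clause;  // defined elsewhere in the solver

class WhatLeadHere {
public:
    void setPointer(Clause* c) {
        // Alignment guarantees the pointer's LSB is 0, so it is free to carry the flag
        data = reinterpret_cast<uintptr_t>(c) | 1u;
    }
    void setLit(uint32_t l) {
        // Literals stay far below 2^31, so the left shift loses no information
        data = static_cast<uintptr_t>(l) << 1;
    }
    bool wasPointer() const { return (data & 1u) != 0; }
    Clause* pointer() const {
        assert(wasPointer());
        return reinterpret_cast<Clause*>(data & ~static_cast<uintptr_t>(1));  // clear the flag bit
    }
    uint32_t lit() const {
        assert(!wasPointer());
        return static_cast<uint32_t>(data >> 1);
    }
private:
    uintptr_t data;
};

With this layout the separate bool disappears, and the whole structure shrinks to a single machine word.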