The Inprocessing API of CryptoMiniSat

Many modern SAT solvers do a lot of what’s called inprocessing. These steps simplify the CNF into something that is easier to solve. In the compiler world, these are called rewritngs since the effectively rewrite (parts of) the formula to something else that retain certain properties, such as satisfiability. One of the most successful such rewrite rules for CNF is Bounded Variable Elimination (BVE, classic paper here), but there are many others. These rewrites are usually done by modern SAT solvers in a particular order that was found to be working well for their particular use-case, but they are not normally accessible from the outside.

Sometimes one wants to use these rewrite rules for something other than just solving the instance via the SAT solver. One such use-case is to use these rewrite rules to simplify the CNF in order to count the solution to it. In this scenario, the user wants to rewrite the CNF in a very particular way, and then extract the simplified CNF. Other use-cases are easy to imagine, such as e.g. MaxSAT, core counting, etc. Over the years, CryptoMiniSat has evolved such a rewrite capability. It is possible to tell CryptoMiniSat to simplify the formula exactly how the user wants the solver to be satisfied and then extract the simplified formula.

Example Use-Case

Let’s say we have a CNF that we want to simplify:

p cnf 4 2
1 2 3 4 0
1 2 3 0

In this CNF, 1 2 3 4 0 is not needed, because it is subsumed by the clause 1 2 3 0. You can run subsumption using CryptoMiniSat this way:

#include "cryptominsat5/cryptominisat.h"
#include <vector>
#include <cmath>
#include <iostream>

using namespace CMSat;
using namespace std;
#define lit(a) Lit(std::abs(a)-1, a<0)

int main() {
  Solver s;
  vector<Lit> cl;
  s.add_new_vars(4);

  cl = vector<Lit>{lit(1), lit(2), lit(3), lit(4)};
  s.add_clause(cl);
  cl = vector<Lit>{lit(1), lit(2), lit(3)};
  s.add_clause(cl);
  
  s.simplify(NULL, "occ-backw-sub");
  s.start_getting_clauses();
  while(s.get_next_clause(cl) {
    for(const auto l: cl) cout << l << " ";
    cout << endl;
  }
  s.end_getting_clauses()
  
  return 0;
}

This code runs the inprocessing system occ-backw-sub, which stands for backwards subsumption using occurrence lists. The input CNF can be anything, and the output CNF is the simplified CNF. This sounds like quite a lot of code for simple subsumption, but this does a lot of things under the hood for things to be fast, and it is a lot more capable than just doing subsumption.

Notice that the first argument we passed to simplify() is NULL. This means we don’t care about any variables being preserved — any variable can (and will) be eliminated if occ-bve is called. In case some variables are important to you not to be eliminated, you can create a vector of them and pass the pointer here. If you have called the renumber API, then you can get the set of variables you had via clean_sampl_and_get_empties(). The numbering will not be preserved, but their set will be the same, though not necessarily the same size. This is because some variables may have been set, or some variables may be equivalent to other variables in the same set. You can get the variables that have been set via get_zero_assigned_lits().

Supported Inprocessing Steps

Currently, the following set of inprocessing steps are supported:

API nameInprocessing performed
occ-backw-subBackwards subsumption using occurence lists
occ-backw-sub-strBackwards subsumption and strengthening using occurence lists
occ-bceBlocked clause elimination (paper)
occ-ternary-resTernary resolution (paper)
occ-lit-remLiteral removal via strengthening
occ-cl-rem-with-orgatesOR-gate based clause removal (unpublished work, re-discovered by others)
occ-rem-with-orgatesOR-gate based literal removal
occ-bveBounded variable elimination (paper)
occ-bve-emptyBounded variable elimination of empty resolvents only
intree-probeProbe using in-tree probing, also do hyper-binary resolution and transitive reduction (paper). Also does hyper-binary resoution & transitive reduction
full-probeProbe each literal individually (slow compared to intree-probe)
backboneBackbone simplification (cadiback paper)
sub-implSubsume with binary clauses with binary clauses (fast)
sub-str-cls-with-binSubsume and strengthen long clauses with binary clauses (fast)
sub-cls-with-binSubsume long clauses with binary clauses (fast)
distill-binsDistill binary clauses
distill-clsDistill long clauses (distillation paper)
distill-cls-onlyremDistill long clauses, but only remove clauses, don’t shorten them. Useful if you want to make sure BVE can run at full blast after this step.
clean-clsClean clauses of set literals, and delete satisfied clauses
must-renumberRenumber variables to start from 0, in case some have been set to TRUE/FALSE or removed due to equivalent literal replacement.
must-scc-vreplPerform strongly connected component analysis and perform equivalent literal replacement.
oracle-vivifyVivify clauses using the Oracle tool by Korhonen and Jarvisalo (paper). Slow but very effective.
oracle-vivif-sparsifyVivify & sparsify clauses using the Oracle tool by Korhonen and Jarvisalo. Slow but very effective.

Convenience Features Under the Hood

The steps above do more than what they say on the label. For example, the ones that start with occ build an occurrence list and use it for the next simplification stop if it also starts with occ. They also all make sure that memory limits and time limits are adhered to. The timeout multiplier can be changed via set_timeout_all_calls(double multiplier). The time limits are entirely reproducible, there is no actual seconds, it’s all about an abstract “tick” that is ticking. This means that all bugs in your code are always reproducible. This helps immensely with debugging — no more frustrating Heisenbugs. You can check the cryptominisat.h file for all the different individual timeouts and memouts you can set.

Under the hood you also get a lot of tricks implemented by default. You don’t have to worry about e.g. strengthening running out of control, it will terminate in reasonable amount of ticks, even if that means it will not run to completion. And the next time you run it, it will start at a different point. This makes a big difference in case you actually want your tool to be usable, rather than just “publish and forget”. In many cases, simplification only makes things somewhat faster, and you want to stop performing the simplification after some time, but you also want your users to be able to report bugs and anomalies. If the system didn’t have timeouts, you’d run the risk of the simplifier running way too long, even though the actual solving would have taken very little time. And if the timeout was measured in seconds, you’d run the risk of a bug being reported but being irreproducible, because the exact moment the timeout hit for the bug to occur would be irreproducible.

Making the Best of it All

This system is just an API — it doesn’t do much on its own. You need to play with it, and creatively compose simplifications. If you take a look at cryptominisat.h, it already has a cool trick, where it moves the simplified CNF from an existing solver to a new, clean solver through the API, called copy_simp_solver_to_solver(). It is also used extensively in Arjun, our CNF simplifier for counting. There, you can find the function that controls CryptoMiniSat from the outside to simplify the CNF in the exact way needed. It may be worthwhile reading through that function if you want to control CryptoMiniSat via this API.

The simplify() API can give you the redundant clauses, too (useful if you e.g. did ternary or hyper-binary resolution), and can give you the non-renumbered CNF as well — check out the full API in cryptominisat.h, or the Arjun code. Basically, there is a red and a simplified parameter you can pass to this function.

Perhaps I’ll expose some of this API via the Python interface, if there is some interest for it. I think it’s quite powerful and could help people who use CNFs in other scenarios, such as MaxSAT solving, core counting, core minimization, etc.

Closing Thoughts

I think there is currently a lack of tooling to perform the already well-known and well-documented pre- and inprocessing steps that many SAT solvers implement internally, but don’t expose externally. This API is supposed to fill that gap. Although it’s a bit rough on the edges sometimes, hopefully it’s something that will inspire others to either use this API to build cool stuff, or to improve the API so others can build even cooler stuff. While it may sound trivial to re-implement e.g. BVE, once you start going into the weeds of it (e.g. dealing with the special case of detecting ITE, OR & AND gates’ and their lower resolvent counts, or doing it incrementally with some leeway to allow clause number increase), it gets pretty complicated. This API is meant to alleviate this stress, so researchers and enthusiasts can build their own simplifier given a set of working and tested “LEGO bricks”.

Pepin, our Probabilistic Approximate Volume Counter

Let’s say you are allowed to have cubes in a 3-dimensional space that happens to be binary. In this space, x1, x2, and x3 are the axis, and for example the cube x1=0 is a cube that happens to have 4 points in it: 000, 001, 010, and 011. So far, so good. Now, let’s say we need to calculate the total volume of two cubes in this space. If they don’t intersect, it’s quite easy, we simply add their volumes. But what if they intersect? Now we need to compute their overlap, and subtract it from their sum of volumes. But there is a better way.

Probabilistic Approximate Volume Counting

Our new tool Pepin (code, paper) is based on the probabilistic approximate model counting algorithm by Meel, Vinodchandran, and Chakraborty (paper) that is so simple in principle yet so ingenious that even Donald Knuth got excited about it, recently writing a 15-page note and spending considerable amount of time on the algorithm.

So, what is this algorithm? Well, it’s so simple it’s almost funny. Basically, we keep randomly sampled points from the volume that we currently hold. At every moment we have a representative, evenly sampled set of points from the volume we currently have, and we know the sampling rate. You can think of the “sampling rate” as the approximate volume that each sample represents. So, if at any point we want to know the volume, we calculate number_of_points*(1/sampling_rate) and have the approximate volume. Kinda neat, no? So all we need is a randomly sampled set of points and a corresponding sampling rate. But how do we go about that?

Let’s say we are about to process the first cube (i.e. volume, could be any shape, actually, but cubes are simple). Our sampling rate is 1.0, we have 10 dimensions, and we have a limit of 10 samples in our bag. Let’s say the first cube is:

x1=0

Notice that this cube has 2^9 elements: 0000000000, 0000000001, … 0111111111. But that’s too many elements to hold, we only have space for 10 samples in the bag! We have to drop our sampling rate. The algorithm throws coins for each element in the volume (at first, a coin that has 1.0 probability of heads… funny coin!), and realizes that it won’t fit the sample bag. So it halves the sample rate (to 0.5) and throws the coins again. Realizes it doesn’t fit… halves the sample rate… and eventually will likely end up with a sampling rate of 1/2^6. That sample rate makes sense: it means 2^9/2^6 = 9 samples on average, which fits. Let’s say the algorithm flipped the coins, halved the sampling rate every time, and settled on a sampling rate of 1/2^6, and 9 samples.

Let’s see these samples. For readability, I’ll write the samples as 0110… which means x1=0, x2=1, x3=1, x4=0, etc.

0001011110
0010110110
0010000101
0001101001
0101001011
0011110101
0101101001
0110101100
0000011111

Think of each of these points representing a volume of 2^6. The current estimate of the volume is 9*2^6 = 2^9.17 (instead of 2^9). So far so good. Now a new cube comes in. Let’s say this cube is:

x1=0, x1=1

This is the tricky part. What do we do? The algorithm is extremely simple: we throw away every sample that matches this cube, and then we sample the cube. Let’s throw away everything that matches, i.e. everything that starts with “01”:

0001011110
0010110110
0010000101
0001101001
0011110101
0000011111

Good. Now we sample the cube “x1=0, x1=1” with 1/2^6 probability. Time to take out our weird coin that has a 1/2^6 chance of winning! We toss this coin on every element in the cube, i.e. all 2^8 (since x0=1, x1=1 has 2^8 elements). We should get about 4 heads, but let’s say we got 3. Happens. Now, append these 3 to our neat little sample bag:

0001011110
0010110110
0010000101
0001101001
0011110101
0000011111
0101001011
0101111101
0110000010

Nice. What just happened? Well, we got rid of the elements from the volume we were holding that were contained in the cube that we are processing. Then, we randomly sampled elements from the cube at our given sampling rate. If you think about it, this means that as long as we did everything uniformly at random, we are basically back to where we started. In fact, the 2nd cube could have been anything. It could have been completely outside of our current volume, or it could have partially intersected it. The game is the same. Running out of space? Just throw away each sample with probability half, and half your sampling rate! Here’s a visual representation what we would do in case the two cubes partially intersected:

As the number of samples goes over the limit in our bag, we simply throw away each sample with 0.5 probability, and halve our sampling rate. This allows us to keep a fixed maximum number of samples, regardless of the size of the volume we are holding. For our example, we decided to limit ourselves to 10 samples, and got a pretty good estimate of the volume: 9*2^6 (=~2^9.17) instead of 2^9. One thing of note. This algorithm is not only approximate in the sense that we get something close to the actual solution, like 2^9.17. It’s also probabilistic in the sense that with a low probability, we get a complete garbage value. You can reduce the probability of getting a garbage value in a number of ways, though.

This algorithm is really powerful, because you could approximately count a 100-dimensional volume with thousands of cubes using less than 1MB of memory, and actually be pretty damn accurate, with a very low probability of getting a wrong value. Unsurprisingly, this is exactly what we do in our tool, Pepin.

The Tricks of Pepin

If you think about it, Pepin does nothing but: (1) holds a bunch of samples (2) deals with some large & small numbers (2^10000 is large, and 1/2^100000 is small), and (3) deals with probabilities. For dealing with small & large numbers, we used the GNU MP Bignum library, with all the standard tricks (pre-allocating memory & constants, etc), and for probabilities we use a number of tricks — sampling the binomial distribution is easy until you have to do it with t=2^100 and p=1/2^95. While dealing with probabilities was hard from a mathematical perspective, the really fun part for me was dealing with the sampling bag.

It turns out I wasn’t the only one who got sidetracked with the sampling bag: if you take a look at Knuth’s implementation, he goes into great detail about his datastructure, the Treap (honestly, that whole note by Knuth is quite a trip, you can see how his mind is racing). Anyway, back to the sampling bag. Firstly, the bag is just a matrix (with some rows sometimes empty) so we can store it either column or row-major. This is probably the first thing that comes to everyone’s mind as a computer science 101 trick. More interestingly, if one looks at the performance bottlenecks of the algorithm, it quickly becomes apparent that: (1) generating millions of bits of randomness for all the samples is expensive, and (2) writing all those samples to memory is expensive. So, what can we do?

The cool trick we came up with is what Knuth would call “late binding”, or we can call it lazy evaluation. Let’s keep to our original example: we first had the cube “x1=1”. We ran our coin-flip technique and found that we needed 9 samples (setting the sampling rate to 1/2^6), fine. But why do we really need all the bits of the samples? We don’t need them now! We may need them later — but since it’s all random anyway, we can generate them anytime, now or later! So, let’s generate 9 samples like this:

1*********
1*********
1*********
1*********
1*********
1*********
1*********
1*********
1*********

The “*” simply means this bit needs to be selected randomly. Now the cube “x1=1, x2=0” comes, which forces our hands, we have to know what the 2nd bit is: if it’s a “1” we have to throw the sample away, if it’s a “0”, we can keep it in. So we decide this bit now (not in the past, but now, when we need it), randomly:

00********
00********
00********
00********
01********
00********
01********
01********
00********

And now we can throw away the samples that match “x1=0, x2=1”, just like before. The sample bag is now:

00********
00********
00********
00********
00********
00********

We can now sample from the cube “x1=0, x2=1” randomly — again, notice we don’t need to know what the 3rd or 4th, etc. bits are. They can be decided later, when they are needed:

00********
00********
00********
00********
00********
00********
01********
01********
01********

Nice. Notice that this leads to exactly where we would have arrived anyway from a conceptual point of view. Except we (1) don’t need to generate as many random numbers and (2) if we can fill memory with “*” faster, then we can sample much faster.

In our implementation, we store values mem-packed, with 2 bits representing 0/1/*: for us, 00=0, 01=1, 11=*. This way, we can simply memset() with 1-s and fill it all up with “*”, a common occurrence. It probably won’t surprise anyone that of course we tried using a sparse matrix representation as well. It works very well for some problems. However, it depends what the exact problem is, and the performance degrades drastically for many problems. So the system uses dense representation by default. I’d be happy to merge a pull request that flips between sparse and dense depending on some density metric.

Obviously, I could not have done any of this alone. Pepin, the tool & paper, is by Divesh Aggarwal, Sourav Chakraborty, Kuldeep S. Meel, Maciej Obremski, and myself. Honestly speaking, it was wonderful to work together with all these amazing people.

Performance

At this stage, it should be obvious that it doesn’t even make sense to compare this tool to exact methods. The difference is mind-boggling. It’d be like racing a fighter jet against a rabbit. It’s a lot more fun to compare against other, existing approximate volume counting tools.

Above is a set of graphs of how Pepin performs against other approximate volume counting tools. Basically, it’s either way faster (easy 1000x speedup), or it’s a lot faster. Graph (c) is a bit misleading: for completeness, we included DNFApproxMC (PDF), which performs very poorly on these problems, and so Pepin seems to perform “as well as the others”. But, on closer examination:

Not really like the others: more like 3x faster than the others.

Potential Future Work

As mentioned above, a pretty straightforward (but not trivial) improvement would be to automatically switch to a sparse matrix representation. This would be akin to “making the horse run faster”, rather than inventing the steam engine, but hey, if it works, it works. Building a steam engine would be more like putting the whole algorithm into GPGPU and/or parallelizing it. It should be possible to rewrite this algorithm in a dynamic programming way, as it should be possible to combine sample bags and sampling probabilities (maybe not, I’m just an engineer). Then you can do divide-and-conquer. If you do that over a GPGPU that has 1000+ streaming cores, it could be possible to make this whole thing run 100x+ faster.

As engineers we like to over-engineer for performance, so it’s important to keep in mind that we are already hundreds of times faster than exact algorithms. Hence, perhaps it’d be more interesting to come up with interesting use-cases, rather than focusing on further improving speed. To keep with the horse analogy, it may be useful not to put the cart in front of the horse.

Closing Thoughts

Approximate volume counting is actually really cool. It takes the power of randomization and uses it to its own advantage to make something really difficult into something that one could explain a child. I have a feeling we could make a few beautiful graphics and teach this algorithm to 12 year olds. The sample-in-a-bag idea is so simple, yet so powerful. Actually, it’s also incredibly weird if you start going into the weeds of it. Like, what happens when the size of your bag is 1? It’s the kind of question that only Knuth would ask, and then answer with clarity and prowess that only one with deep mathematical insight can. I won’t even entertain the thought, but you can, if you read his notes and then work on his questions.

PS: Pepin was named after the rather eccentric character of the same name from Bohumil Hrabal‘s book Cutting It Short, also released as a film. Pepin in the book was inspired by Hrabal’s own uncle who came to visit his hometown for two weeks but stayed for 40 years. I think we have all been there.

Arjun, our New CNF Model Counting Preprocessor

After many years of work, I am very happy to present our new CNF model counting preprocessor, Arjun [PDF][static binary]. It improves upon our approximate model counter ApproxMC [static binary], and can be used as a stand-alone binary to preprocess CNFs for (projected) model counting . Usage is simple:

./arjun blasted_case110.cnf
c Arjun Version: 43b17ef5899c461b8e947b0f3ca286efcf71464c
c CMS SHA revision 9bf1d82a7b9fa5492cf4f0437cf7110b77ad7230
[..]
c [arjun] original sampling set size: 284
c [arjun] final set size:   15 percent of original: 5.2817 %

Here, Arjun took a CNF that had a projection set of 284 variables, and reduced it to an independent support of only 15 variables. Arjun is now part of ApproxMC by default. Let’s solve a problem with ApproxMC without Arjun:

./approxmc --arjun 0 blasted_squaring42.cnf.gz.no_w.cnf.gz 
c ApproxMC SHA revision 9fb4fdd01b40cc0526c4933e2f2dca402f0ab91f
c CMS SHA revision 9bf1d82a7b9fa5492cf4f0437cf7110b77ad7230
c [appmc] Orig sampling vars size: 349
[after waiting 5000 seconds, it times out]

Now let’s do the same with Arjun (this does not use code from SharpSAT-td):

./approxmc blasted_squaring42.cnf.gz.no_w.cnf.gz 
c ApproxMC SHA revision 9fb4fdd01b40cc0526c4933e2f2dca402f0ab91f
c CMS SHA revision 9bf1d82a7b9fa5492cf4f0437cf7110b77ad7230
c Arjun SHA revision 43b17ef5899c461b8e947b0f3ca286efcf71464c
c [appmc] Orig sampling vars size: 349
c [arjun] final set size: 155 percent of original: 44.4126 %
c [appmc+arjun] Total time: 12.80
c [appmc] Number of solutions is: 80*2**127
s SATISFIABLE
s mc 13611294676837538538534984297270728458240

Done in 12.8 seconds, NICE. Wanna try it out? Arjun+ApproxMC static Linux binary HERE, Arjun static Linux binary HERE. Research paper HERE. Arjun code HERE. Now that the demo is out of the way, let’s get into the nitty-gritty details!

Projected Model Counting

CNF model counting is the problem where you want to count the number of solutions to a set of equations written in the CNF form. This form is quite restrictive, but also very powerful, here is an example:

 x1 OR  x2 OR -x3 = True
x1 OR -x2 = True
x1 OR -x3 = True
x1 OR -x4 = True
-x1 OR x4 = True

The above set of constraints clearly have a solution, e.g. setting x1=True and x4=False will satisfy all constraints. The question in model counting is how many solutions a system has. The above problem has 5 solutions in total:

[Where "1" means TRUE and "-1" means FALSE]
x1 x2 x3 x4
-----------------
-1 -2 -3 4 0
1 2 3 -4 0
1 2 -3 -4 0
1 -2 3 -4 0
1 -2 -3 -4 0

This was kinda easy. However, there are two complications. One complication tends to be is that there are a massive number of solutions. For example, there are 2^200 solutions. Enumerating them one-by-one is just not going to work. Hence, we need some smart systems to do the counting for us, that don’t count one-by-one. Secondly, often we are interested only in the distinct number of solutions over a certain set of variables. Let’s take the above example, and say we want the distinct number of solutions over variables x1, x2 and x4 only. So, let’s delete column for x3 from above table:

x1   x2  x4   Same?
-------------------
-1 -2 4 0
1 2 -4 0 *
1 2 -4 0 *
1 -2 -4 0 +
1 -2 -4 0 +

Notice there are only 3 distinct solutions: the rows marked with * are both the same, and the rows marked with + are also the same. This is called projected counting because we are projecting the solution space over a specific set of variables — in this case, x1, x2, and x4.

Computing the Independent set of a Projection Set

In the above example, our projection set was x1, x2, and x4. However, notice that x1=-x4, due to the constraints:

 x1 OR -x4 = True
-x1 OR x4 = True

Hence, it is good enough to count over x1 and x2, since x4 is determined by x1 and so x4 cannot possibly make the distinct number of solutions smaller or larger. This realization is at the heart of calculating an independent support — we basically want to minimize the set of variables that we project over. The independent support is simply a subset of the projection set, that’s hopefully a lot smaller than the projection set.

An early work on independent support calculation is the B+E preprocessor [PDF], which aims to do two things:  “B” which finds a small independent support, and “E”, which eliminates variables from the problem. Let’s concentrate on “B” here. In the B+E paper, the authors talk about independent support over all variables of the formula, not over a projection set. In other words, their code and paper originally was conceived to minimize a projection set that had all variables inside. In our case, their work would minimize x1..x4, finding that x1,x2,x3 is an independent support of x1..x4.

Implicit Definability

The trick of B+E is to use the idea of so-called implicit definability. Implicit deniability is really simple. If I take x1, x2 and x3, does that define the value of x4? Yep. In fact, x1 on its own would define the value of x4, because x1=-x4.  However, sometimes you have equations that define a variable through a (large) number of other variables. For example:

 x30 OR -x10 OR -x20 = True
-x30 OR x10 = True
-x30 OR x20 = True

This set of constraints describe x10 AND x20=x30. In this case, x10 and x20 together define x30. So if you have the above set of constraints in a CNF, and you have x10, x20, and x30 all in your projection set, you can confidently take out x30 from the set — and the number of distinct solutions projected will not change.

Implicit definability states that we can find these definitions in an implicit way, using a relatively straightforward system. We make two copies of the CNF, let’s call these CNF1 and CNF2. Then, we have variables v1…vN in the first CNF and w1..wN in the second CNF. Now, we create new variables called indicator variables, i1…iN that indicate whether v1=w1, v2=w2… etc. After this setup, the query is simple. Given i1=TRUE (i.e. given v1=w2), is it possible that v2 is different from w2? If it is, then our new independent set must contain both i1 and i2 (If not, we can get rid of i2). Now we move to i3. Let’s set i1=TRUE and i2=TRUE, and ask if it is possible that v3 is different from w3? Otherwise, our independent set remains i1, and we ask, is it possible that i1=TRUE and i2=TRUE, but v3 is different from w3? etc. This way, we test all the variables from v/w1…v/wN, and eventually end up with a set of indicator variables that tell us a minimal independent support.

Explicit Definability

As an external observer, without knowing nothing about implicit definability, CNF copying and renumbering and the like, it is quite easy to figure out that x1=-x4 just by looking at the CNF, since it contains:

 x1 OR -x4 = True
-x1 OR x4 = True

Such set of constraints can be trivially found using Tarjan’s algorithm. This is actually linear in the number of binary (i.e. 2-long) constraints, so this is super-quick to run. Similarly, we can very easily find the AND gate x10 AND x20=x30:

 x30 OR -x10 OR -x20 = True
-x30 OR x10 = True
-x30 OR x20 = True

It’s a quick check on all the constraints that are 3-long and have the corresponding 2-long constraints. Such syntactic checks are well-known parts of SAT solvers. For example, the extremely successful kissat solver has the following lines of code in gates.c:

if (kissat_find_equivalence_gate (solver, lit))
res = true;
else if (kissat_find_and_gate (solver, lit, 0))
res = true;
else if (kissat_find_and_gate (solver, not_lit, 1))
res = true;
else if (kissat_find_if_then_else_gate (solver, lit, 0))
res = true;
else if (kissat_find_if_then_else_gate (solver, not_lit, 1))
res = true;
else if (kissat_find_definition (solver, lit))
res = true;

So, kissat finds equivalence gates (i.e. x1=x2), AND gates (like x10 AND x20=x30), IF-THEN-ELSE gates, and curiously… “definition” gates. What could these be? Well, it turns out kissat uses a very cool trick to find definitions. It takes all constraints that contain a variable, say x30/-x30, and cuts off the x30/-x30:

-x10 OR -x20 = True  <-- we cut of "x30 OR"
x10 = True <-- we cut of "-x30 OR"
x20 = True <-- we cut of "-x30 OR"

Notice that this set of constraints is now unsatisfiable. In fact, for all gates that define a variable, if you take all the constraints that the variable is in, and you chop off the variable, the constraints will together be unsatisfiable. To find such gates, kissat uses a SAT solver called kitten. Kitten solves these problems really fast and allows for UNSAT core extraction (not present in kissat) to find the defining variables.

Arjun

So what did we do with Arjun? We basically do the following set of improvements over B+E — while not even once looking at B+E’s code, since it was close-sourced, and by the time the authors open-sourced it, we already had our paper in the review queue:

  1. We allow taking in a projection set, rather than the full set of variables like B+E does. This allows us to compute and independent support of a projection set, which is extremely important for our model counter ApproxMC. With a small independent support, ApproxMC uses a  smaller matrix, so Gauss-Jordan elimination, which is an O(n^3) algorithm in our implementation, is significantly faster.
  2. We use CryptoMiniSat to our advantage to simplify the CNF formula before running explicit or implicit definability detection. CryptoMiniSat is a powerful system that allows fine-tuned set of heuristics to be run, completely controlled from the API, e.g.  “auto str = string(“intree-probe, occ-backw-sub-str, distill-bins, “); solver.simplify(NULL, &str);”
  3. We use all the different ways to syntactically recover gates: and/or gates, if-then-else gates, and parity (XOR) constraints. We also recover gates semi-syntactically using picosat (instead of kitten) — this is semi-syntactic, because we use syntactic methods to get the occurrence list of a variable and to construct the SAT query, but then use a SAT solver (i.e. a semantic method) to recover the definition, if any. We then construct a directed graph from the gates and greedily find a minimum cover. This is needed because there could be loops, and we don’t want x1 to be defined by x2, and x2 being defined by x1, and accidentally remove both, which would be silly.
  4. We repeat steps 2 & 3 at least 2x to make sure it’s done to a good enough level. This allows us to save time on the expensive next step, implicit definability detection. At this stage, we also do empty occurrence detection — if a variable is in the projection set, but doesn’t even occur in the formula, then obviously the variable can be removed and the count increased by a factor of 2.
  5. We change the SAT solver’s assumptions method to improve implicit definability detection. From the standpoint of a SAT solver, the  queries mentioned in the implicit definability section are so-called assumption queries: assuming i1=TRUE, v2=TRUE, v3=FALSE, is there a solution? The list of assumptions tends to be very long here, so SAT solvers can struggle. Our insight here is to change the traditional way of how assumptions are handled in the SAT solver, and effectively make the SAT solver not backtrack to the beginning every time a query is made. This makes quite a bit of difference in performance.(Note: it turns out this improvement idea has been invented, and re-invented, a few times already, e.g. by SharpSAT-td, and by IntelSAT — see Alexander Nadel’s presentation at SAT 2022).

That’s all folks! These improvements add up to about 5k LoC of code. All in all, they make Arjun more than an order of magnitude faster than B+E, and it can compute an independent support over a projection set, rather than all variables in the formula.

Performance of Arjun

Here’s an example performance graph for B+E vs Arjun, using the set of benchmark instances we have been measuring the performance of ApproxMC on for ~5 years now — perhaps outdated, but at least it’s easy to compare all the values in all of our papers. The performance difference is quite big:

Basically, in B+E took 4932 seconds to compute the independent set of 1752 instances, while Arjun did the same in 117 seconds. Quite nice, about 40x improvement. Notice that due to the way our implicit definability works, it’s possible to e.g. order the variables in such a way as to make querying faster, but end up with a larger independent set. So, here is the comparison of independent sets computed by the same system as per the graph above:

Essentially, the same, modulo some noise. There is some advantage to using B+E in a few cases, and there is some advantage using Arjun in other cases, from the perspective of the calculated independent set. However, notice that Arjun is so much faster, that we could actually run it 2x with different variable ordering, thereby getting a (much) smaller independent set, and often still be a lot faster than B+E. However, there is always a trade-off, so we decided to be approximately as good in the minimization as B+E, while being significantly faster.

Performance Impact on ApproxMC

So, Arjun is fast but what does that mean for ApproxMC? It turns out, it means a lot. ApproxMC adds XOR constraints over the projection set, and the smaller the projection set, the smaller the XORs . Hence, the matrix that the Gauss-Jordan elimination algorithm has to run on inside ApproxMC is much smaller — which is an O(n^3) algorithm in our implementation. Hence, the performance of ApproxMC4 vs Arjun+ApproxMC4 is remarkable:

Basically, we can count in 2.24 seconds with Arjun+ApproxMC4 as many instances as we could count in 5000s with ApproxMC4 alone: more than 3 orders of magnitude improvement. In fact, it’s so fast we’ll need to switch away from our set of benchmark instances, because we are getting close to the limit of 1896, the number of instances in total. Notice that over the years, ApproxMC has improved dramatically in speed. In fact, the earliest ApproxMC could only count ~400 instances within 5000 seconds. We can now count 400 instances within a time limit of only 0.09 seconds.

Conclusions

As it may be evident from winning a number of tracks at the Model Counting Competition, Arjun is not only making ApproxMC faster — we also won the non-projected exact(!) model counting track with an entry called Arjun+SharpSAT-td, where SharpSAT-td [PDF] is an exact model counter by Tuukka Korhonen and Matti Jarvisalo. We hope to improve Arjun further in the future, and we hope Arjun will be become a standard preprocessor for model counters in the Model Counting Competition in the coming years, exploiting its CNF reduction capabilities that are currently available as: “./arjun input.cnf minimized.cnf”, using code and ideas from SharpSAT-td.

Checking Uniform-Like Samplers

Uniform sampling is pretty simple: there are a set of solutions to a set of equations, and I want you to give me N solutions uniformly at random. Say, I have a set of equations that only has the solutions x=1…6, and I ask you to give me solutions uniformly at random. In this case, if the system is not cheating, it should give me solutions exactly like a random dice would give: 1..6, each with the same probability of 1/6. Now, if you give me nothing but 1’s I’d be slightly confused, and eventually would think you may be trying to cheat.

Uniformity is useful not only in gambling, but also e.g. if you want to make sure to cover a good chunk of the potential states in a system. Say, your equations describe how a program can run. Now, uniform solutions to these equations will be examples of the state space of your program. If you want to check that your system is in most cases doing the right thing, you can ask for uniform solutions at random and check the states for inconsistencies or unexpected behaviors.

Uniform Sampling of CNFs

One type of equations that is quite popular is to have all variables boolean (i.e. True/False), and equations being nothing but an “AND of ORs”, i.e. something like: “(a OR b) AND (b OR c OR d) AND (-f OR -g)”. These types of equations, also known as CNF, can express quite complicate things, e.g. (parts of) computer programs, logical circuits (e.g. parts of CPUs), and more. What’s nice about CNF is that there are many different tools to convert your problem language into CNF.

Given CNF as the intermediate language, all you need to do is run a uniform sampler on the CNF, and interpret the samples given your transformation. For example, you could translate your Ethereum cryptocurrency contract into QF_BV logic and blast that to CNF using e.g. the STP solver. Then you can get uniform random examples of the execution of your cryptocurrency contract using e.g. UniGen.

There are many different samplers for CNFs, and they mostly fall into two categories: ones that give guarantees and are hence truly uniform, and ones that don’t give guarantees and therefore fall into what we’ll call uniform-like samplers. These latter samplers tend to be significantly faster than truly uniform samplers, however, they can, and sometimes do, give non-uniform samples. I personally help maintain one truly uniform sampler, UniGen (PDF), and one uniform-like sampler, CMSGen (PDF). Other popular truly uniform samplers include KUS (PDF) and SPUR (PDF).

Catching Uniform-Like Samplers: Barbarik

If I give you a sampler and ask you to tell me if it’s truly uniform or simply uniform-like, what should you do? How should we distinguish one from the other, without looking into the internals of the system? It turns out that this is not a simple question to answer. It is very reasonable to assume that there are e.g. 2^200 solutions, so e.g. trying to prove that a sampler is uniform by saying that it should only output the same sample twice rarely is not really meaningful. In this latter case, you’d need to get approx 2^100 samples from a truly uniform sampler before it will likely give two colliding samples. While a uniform-like sampler may collide at e.g. only 2^60 (i.e. a trillion times earlier), that’s still a lot of computation, and only a single data point.

Doing this efficiently is the question that was on the minds of the authors of Barbarik. Basically, their idea was the following: take a CNF, remove all solutions but two, and blow up these two solutions into equally many solutions each. Say, you find two solutions to a CNF, one BLUE and one RED. Barbarik will take BLUE, make 100 blue balls out of it, and take RED, and make 100 red balls out of it. Then, it will give the CNF with the 200 solutions, half blue, half red, to the sampler under test. Then it asks the sampler to give it a bunch of balls. We of course expect approx 50%-50% red-blue distribution of balls from a truly uniform sampler.

Barbarik fails a solver if it is gives a distribution too far away from the 50%-50% we’d expect. Barbarik runs this check many times, with different example blue/red solutions that are “blown up” to multiple solutions. Given different base CNFs, and many tries, it is possible to differentiate the good from the bad… most of the time:

If you take a look at the table above, QuickSampler and STS, both uniform-like samplers, are rejected by Barbarik given 50 different CNFs. In contrast, UniGen survives all 50 tests with flying colors. Notice, however, that CMGen also survived all tests.

The Birth of CMSGen

When Barbarik was being created, I was fortunate enough to be present, and so I decided to tweak my SAT solver, CryptoMiniSat purely using command-line parameters, until Barbarik could not distinguish it from a truly uniform sampler. This, in my view, showcased how extremely important a test system was: I actually had a bug in CryptoMiniSat’s randomization that Barbarik clearly showed and I had to fix to get higher quality samples. We called the resulting uniform-like sampler CMSGen, and it was used in the function synthesis tool Manthan (PDF), that blew all other function synthesis system out of the water, thanks to its innovative design and access to high quality, fast samples from CMSGen:

Notice that within 200 seconds Manthan outperformed all other function synthesis systems, even if we give the other system 7000 seconds to work with. If you are interested in the details of this crazy improvement over previous state-of-the-art, check out the slides here or the video here (these are all work of my coworkers, I am not an author).

While CMSGen was clearly fast and powerful, it bothered me endlessly that Barbarik couldn’t demonstrate that it wasn’t a truly uniform sampler. This eventually lead to the development of ScalBarbarik.

The Birth of ScalBarbarik

Since CMSGen passed all the tests of Barbarik, we had to come up with a new trick to distinguish it from truly uniform samplers. ScalBarbarik‘s (PDF) underlying system is still the same as Barbarik: we take a CNF, take 2 samples from it, and blow both of these samples up to a certain number. However, how we blow them up is where the trick lies. Before, they were both blown up the same way into solutions that are equally easy/hard to find. However, this time around, we’ll make one of the solution types much harder to find than the other. For this, we’ll use Vegard Nossum’s SHA-1 CNF generator (PDF) to force the system to reverse a partial, reduced-round SHA-1 hash, with some fixed inputs & outputs. This allows us to both change the complexity of the problem and the number of solutions to it rather easily.

While one set of solutions will be hard to find, the other ones will be trivial to find — if one special variable is set to TRUE, the finding a set of solutions is trivial But when it’s set to FALSE, the system has to reverse a reduce-round SHA-1. The logic behind this is that the uniform-like systems will likely be finding the easy solutions with much higher probability than the hard solutions, so they will sample much more unevenly. Indeed that’s the case:

Note that as the hardness parameter is increased, CMSGen is rejected more and more often, and eventually it’s rejected for all CNFs. Also interesting to note is that QuickSampler and STS get both rejected, as before, but this time around, STS gets rejected for all 50 CNFs, rather than for only 36 out of 50. In other words, ScalBarbarik is overall a stronger/better distinguisher.

Conclusions

With the birth of Barbarik, a set of uniform-like testers were shown to be less than ideal, and a new, more robust near-uniform sampler was born, CMSGen. But as a the gauntlet has been throw down by CMSGen, a new tool emerged, ScalBarbarik, to help find the non-truly uniform samplers. With this cycle in mind, I hope that new, more elaborate, and higher-quality uniform-like samplers will emerge that will be able to beat ScalBarbarik at its own game, improving the quality of the sampling while maintaining the speed advantage that uniform-like samplers enjoy over truly uniform samplers. With better uniform-like sampling tools, hopefully we’ll be able to make headway in automated test case generation (imagine having it as part of all development IDEs), higher performance function synthesis, and hopefully even more diverse areas of interest for the general public.

A Tale of Shift Left, Shift Right, and MEV

The NIST Cybersecurity Framework’s five Functions are: Identify, Protect, Detect, Respond, and Recover. Within this framework “shift left” means that we shift our focus towards Identify and Protect, i.e. towards fixing issues before our system is deployed into production. Say, if you have a bug in your Apple Store/Google Play app, you’d prefer fixing it before you deploy it to the internet, or, similarly you prefer fixing the bug in your cryptocurrency contract before you deploy it in the wild (i.e. “on the mainnet”). While shift left makes lots of sense (let’s prevent the issue in the first place!), in many cases, preventing all issues is likely impossible.

Shifting Left

Firstly, it is important to state that preventing all issues is often impossible because of what safety practitioners call the “blunt end”. When writing code, practitioners, i.e. developers, are at what safety literature would call the “sharp end”: they are like airplane pilots, directly at the control of their craft. While it’s easy to imagine that pilots are in “full control”, in reality, airplane pilots depend on and are affected by many things: accurate weather reports, quality and amount of training, certification bodies, proper aircraft maintenance, production pressure from corporate, etc. These are what safety literature calls the “blunt end” — things that contribute to both success and failure, but are not directly visible/obvious. When the 10th plane drops out of the sky this same year, and Boeing keeps blaming the (now dead) pilots, you know it’s the blunt end you ought to be looking at.

What this boils down to in terms of shift left is that while developers are in “full control”, in reality, there is a large blunt end that they have little to no control over, and which may set up the conditions such that failure is sometimes inevitable, even with the most careful of developers. An easy example of this is previously unknown vulnerability classes: for example unknown issues with CPU cache timing side-channels in classical IT security, or flash loan attacks for cryptocurrency contracts, which are a relatively new phenomenon, allowing anyone to instantly obtain a large capital for a fraction of time to break assumptions about a contract, which can help attackers. Another slightly less obvious example is production pressure, where competing entities are known to be working on the same idea/system and releasing first is critical for business success. It is one thing to “ask for time”, it’s another thing when you gotta release or it was all for nothing. While techniques like fuzzing, formal verification, coding best practices, etc. can help, they are neither bulletproof, nor are they reasonable to expect to be done fully for all contracts.

Notice that given the nature of cryptocurrencies where “code is the law”, it seems counter-intuitive to think that there are factors at play other than the code itself. On the surface, the rules of the game seem set and all mistakes seem to happen to/with code, hence the culprit is always clear. In other words, it seems as if there is no “blunt end”. However, as with all interesting endeavors, in my view, blunt ends are abound, even if they only manifest themselves in code in subtle ways. For example, before overflow protection was enabled by default on the most used Ethereum compiler, Solidity, overflow issues were possible (e.g. issues where 255+1=0, oops!), and could have lead to attacks. In other words, changing the compiler defaults can prevent a type of attack. Similarly, the SMT Checker, shipped as part of Solidity and Remix, can also warn developers of potential mistakes. These systems change the landscape, preventing the “sharp end”, i.e. developers, from making mistakes.

Shifting Right

The IT security sector has realized for some time now that spending all their effort on shift left, i.e. identifying issues and preventing them pre-deployment, is not adequate. With some effort, time, and money being spent on detection, adequate response, and recovery, i.e. post-deployment and hence on the “right” side, the impact of attacks or unforeseen events can be minimized. This is incredibly important, because as per above, it is often impossible to prevent all attacks. From phishing attacks, novel attack classes, persistent attackers, to production pressures, some attacks will go through, and be successful — at which point the only question is how successful will they be, i.e. will they reach all (or any) of their intended goals, or will they be stopped early, minimizing the damage caused.

There are some interesting properties of spending proper time dealing with post-production incidents. Firstly, it allows the system designers and developers to understand what attack patterns are typical against systems. In classical IT security, e.g. honeypots are often used to understand attacker behavior. These are (often under-protected) systems deployed either on the public internet, or more interestingly, on the intranet(s) of organizations, to see if (and how) attackers try to attack it. These systems can help spot (novel) attack patterns, and attackers in general. Other information-sharing systems are also in-place in the traditional IT security world: in Germany the BSI is responsible to warn large enterprises of (novel/successful) attack patterns in the wild. Within the space of cryptocurrencies, attacks tend to be on the public chain, and can be observed directly both as they happen (unless they are in a so-called private mempool, more on this later), and retrospectively, thanks to the public ledger. There is plenty to learn from how others succeed, or fail.

MEV, and On-Chain Security-as-a-Service

With all the above out of the way, let’s get to the meat of this article: how behavior patterns in the cryptocurrency world that are often cited as undesirable are actually doing a form of shift right. This inadvertent shift right is something I believe we could exploit to improve the safety of the chain overall.

Generalized frontrunning is a technique that consists of people listening to the set of sent, but unconfirmed transactions on the chain, and seeing if copying the same transaction themselves would make them money. They then submit the same transaction with a higher fee, giving more fee to the miners who confirm transactions. This leads to both transactions being executed, but theirs being executed first (and the 2nd likely failing), thereby giving them the money. Clearly, miners (who confirm transactions) could always do such frontrunning, and could do it much easier, since they decide who is going to be first, so they can make themselves first without paying a higher (or any) fee. Transaction frontrunning is slightly more complicated: it tries to both front- and back-run a transaction (also called sandwiching), manipulating the price before the transaction is executed (the “front” part) and then selling the excess (the “back” part). Generalized and transaction frontrunning attacks all fit under the umbrella of transaction reordering attacks and are also called Miner Extractable Value, or MEV for short. MEV is big business, with 8 million USD extracted in the past 30 days on the Ethereum chain, as of writing.

When you perform generalized frontrunning, you do two things: (1) you constantly check for potential value to be extracted, such an attack would do, and (2) you make sure your transaction runs first (hence the name “front-run”) that extracts the value to get the value for yourself. This clearly maps to the post-deployment part of the Cybersecurity Framework of NIST: detect & respond. In other words, it would actually make sense to constantly run a bot on-chain to detect and front run attacks against your own contracts. In a sense, the frontrunners are operating what in the traditional IT security world would be called a SOC — a Security Operations Center that is constantly on the lookout for attacks.

In this sense, MEV is not undesired, but a price to pay to have a non-stop security system. Of course, it is not a given that MEV bot operators extracting MEV will give back your money for the attack they front-ran. However, I posit that it’s much more likely that you’ll get your money back from them than the attackers. For example, at the MEV day for Devconnect, there was a discussion I participated in where a large MEV bot operator mentioned that they were willing to give back the funds of attacks they front-ran for a 10% fee. I expect that large organizations will in the future run their own MEV bots, and medium-sized ones will either simply pay the fee (10% or whatever it will be) when an attack happens, or perhaps pay a monthly fee for the protection service provided.

(Footnote: there is not only bad MEV. For example, backrunning (i.e. running after) liquidation events are incentivized to happen, and are considered normal for the system to run as designed.)

Current Solutions to MEV, and their Impact

There are two obvious ways to deal with the MEV issue: creating economic incentives to reduce their impact, and hiding pending transactions. Let’s to to get into both.

One way to deal with MEV is by not making pending transaction visible. So-called private mempools are operated by some miners (i.e. confirmers of transactions) that don’t allow the public to see the not-yet confirmed transactions. This is advantageous because it means that attackers cannot run clients to frontrun/sandwich/etc. pending transactions sent to these private mempools. In a way, private mempools are kind of a trusted environment, where you can be safe from people trying to extract MEV. However, it also means that malicious behavior submitted to these private pools cannot be front-run. Since these private pools are trust-based, it would be possible to allow some trusted parties to observe these private pools and frontrun attacks. Rather than being purely trust-based, it is possible that some sort of incentives-based system could also be set up, which seems like an open question/potential for improvement. This would be similar to what Flashbots does. Flashbots tries to use economic incentives through an open marketplace to make MEV extraction less disruptive to the ecosystem (among other benefits). However, it does not work for private mempools, since, well, they are private, and Flashbots requires publicly visible unconfirmed transactions to be able to work, due to its open marketplace concept.

MEV is not only big business, it’s also wasteful: often, the original transaction will fail, but still execute, consuming computational resources and burning money. MEV systems trying to outbid one another for the prize are effectively participating in an all-pay auction, i.e. everyone pays for the good but only one gets it. Thankfully Flashbots solves this by running an off-chain auction that is not all-pay. However, once we move from proof-of-work to proof-of-stake (i.e. to ETH 2.0), the attack transaction can be directly proposed to the next validator, skipping the public pool entirely. Essentially, the attacker can choose be in the private mempool by default (with some caveats, see Proposer-Builder Separation, and MEV-Boost), rather than by exception.

Conclusions

While Miner Extractable Value (MEV) seems on the surface to be an undesirable side-effect of the combination of current cryptocurrency protocols and the public nature of digital ledgers, it can also be seen as a feature. This feature could allow one to perform detection & response to attacks, allowing legitimate entities to mitigate the impact of attacks on contracts post-deployment. There is some evidence that some people running MEV extraction bots are already effectively providing a form of post-deployment attack-mitigation-as-a-service for a fee. However, as usual, preventative measures are both more reliable and cheaper, and should be the focus of most efforts against attacks.