Handling disconnected components

In CryptoMiniSat 3.2 and above there is a disconnected component finding&solving system. This finds toplevel disconnected components regularly and solves them by launching an instance of the solver itself. If the component is unsatisfiable (UNSAT), the whole system is UNSAT and solving can be immediately aborted. If the component is satisfiable (SAT) the component’s solution is saved and later when printing the solution to the whole problem the component’s satisfying assignment is also printed.

Why deal with disconnected components?

Disconnected components in SAT instances are problematic because the search logic of the solver doesn’t effectively restrict its search to a single component at a time, making its strategy ineffective. Firstly, learnt clause cleaning and variable activity heuristics behave worse. Secondly, solutions to components are thrown away when the search restarts when another component’s solution is too difficult to find. Note that some solvers, e.g. Marijn Heule’s march, do restrict their search to one component at a time, though I personally don’t know of any other.

Disconnected components seem to be quite prevalent. In the SAT’11 competition‘s set of 300 Application instances, 53 contained disconnected components at top-level at the beginning of the search. On average the number of disconnected components for these 53 instances were 393(!). When doing top-level analysis on a regular basis during search, the number of problems containing disconnected components rises to 58 and the average number of components found decreases to 294.

Finding disconnected components

The algorithm I use to find disconnected components is relatively easy. It essentially performs the following: it initializes a set of datastructures that hold information about the found component(s) and then feeds each clause one-by-one to a function that creates new component(s) and/or merges existing components. The datastructures needed are:

Array ‘varToComp’ that indicates which variable is in which component. If a variable is in no component yet, it has special value MAX
A map of arrays ‘compToVar’ indexed by the component. For example, compToVar[10] points to the array of variables in component number 10
Variable ‘nextComp’, an ever-incrementing index indicating the maximum component number used
Variable ‘realComps’ indicating the actual number of components — some components counted by ‘nextComp’ may have been merged, and are now empty

The algorithm, in pseudo-python is:

nextComp = 0;
realComps = 0;
varToComp = [MAXVAR] * numVars();
compToVar = {};

for clause in clauses:
  handleClause(clause);

def handleClause(clause) :
  newComp = []
  compsToMerge = set();
  
  for lit in clause:
    var = lit.var();
    if varToComp[var] != MAX :
      comp = varToComp[var];
      compsToMerge.insert(comp);
    else:
      newComp.append(var);

 
    #no components to merge, just put all of them into one component
    if len(compsToMerge) == 1 :
      comp = compsToMerge.pop();
      for var in newComp:
        varToComp[var] = comp
        compsToVar[comp].append(var);

      return;
  
    #delete components to merge and put their variables into newComp
    for comp in compsToMerge:
      vars = compToVar[comp];
      for var in vars:
        newComp.append(var);

      compToVar.erase(comp);
      realComps--;
    
    #mark all variables in newComp belonging to 'nextComp'
    for var in newComp:
      varToComp[var] = nextComp;
    
    compToVar[nextComp] = newComp;
    nextComp++;
    realComps++;

There are ways to make this algorithm faster. A trivial one is to remove the ‘set()’ and use an array initialized to 0 and mark components already seen. A fixed-sized array of the size of variables will do the job. Similarly, the use of ‘nextComp’ forces us to use a map for ‘compToVar’ which is slow — a smarter algorithm will use a round-robin system with an array of arrays. We can also avoid having too many small components at the beginning by calling the handleClause() function with the longer clauses first. It doesn’t give any guarantees, but in practice works quite well.

Finally, we need to add approximate time measurements so that we could exit early in case the algorithm takes too much time. According to my measurements, on the SAT’11 Application instances it took only 0.186s on average to find all components, and in only 34 cases out of 1186 did it take longer than the timeout — which was set to around 7s. At the same time, the maximum time ever spent on finding components was 7.8s. In other words, it is almost free to execute the algorithm.

CryptoMiniSat handles found components in the following way. First, it counts the number of components found. If there is only one component it simply continues solving it. If there are more than one, it orders them according to the number of variables inside and picks the smallest one, solves it with a sub-solver, and checks it solution. If it’s UNSAT, it exits with UNSAT, otherwise it saves the solution for later and picks the next smallest component until there is only one left. When there is only one left, it continues its job with this, largest, component.

The sub-solver CryptoMiniSat launches is a bit special. First of all, it renumbers the variables so if variables 100..120 are in a component, the sub-solver will be launched with variables 0..20. This saves memory in the sub-solver but it means variables must be back-numbered when extracting solutions. Since the subsolvers themselves also internally re-number variables for further speedup, this can get a little bit complicated.

Further, if the sub-solver is handling an extremely small problem (<50 vars) most of its internal functionality is turned off to reduce build-up and tear-down speed. This is important, because in case there are 5000 components we need to go through, even a 0.01s build-up&tear-down time is unacceptable. Finding 5000 components, by the way, is not some wild imaginary number: it happens with the instance transport-[...]10packages-2008seed.040 from SAT'11. This instance actually contains ~5500 components at beginning of the search at toplevel, with an average of 8 variables each.

Demo

Let’s see an example output from CryptoMiniSat:

c Reading file 'traffic_3b_unknown.cnf.gz'
[..]
c -- clauses added: [..] 533919 irredundant 
c -- vars added      39151
[..]
c Found components: 23 BP: 8.85M time: 0.06 s
c large component     0 size:        833
c large component     1 size:       1071
c large component     2 size:       4879
c large component     4 size:        357
c large component     5 size:      14994
c large component     6 size:        476
c large component     7 size:        952
c large component     9 size:        952
c large component    10 size:        595
c large component    11 size:      10234
c large component    16 size:        476
c large component    17 size:        476
c large component    19 size:        476
c Not printed total small (<300 vars) components:10 vars: 2380

In total 23 components were found, the largest component containing 15K variables. The other 22 components contain 10K to as little as less-than 300 variables. Notice that finding the components was essentially free at 0.06s (on a i7-3612QM).

Another typical output is:

c Reading file 'transport-transport-city-sequential-25nodes-1000size-3degree-100mindistance-3trucks-10packages-2008seed.060-SAT.cnf.gz'
[...]
c -- clauses added [..] 3869060 irredundant
c -- vars added     723130
[...]
c Found component(s): 2779 BP: 253.02M time: 0.58 s
c large component  2778 size:     657548
c Not printed total small (<300 vars) components:2778 vars: 25002
c Coming back to original instance, solved 2778 component(s), 25002 vars T: 1.46

In this case there were 2.8K small disconnected components with (on average) 9 variables each within a problem that originally contained 723K variables and 3.9M clauses. The components were found in 0.58s and all but the largest were solved within 1.46s. This means that 2.8K CryptoMiniSat sub-solvers were launched, initialized, ran and destroyed within a span of 1.46s. The largest component with 658K variables is what's left to the solver to deal with. The rest have been removed and their solutions saved.

Conculsions

It's fun, useful, and relatively easy to find disconnected components in SAT problems. They appear at the beginning of the search for some problems, and for some more, during search. Although it is not a game-changer, it is a small drop of water in a cup that gets mostly filled with small drops anyway.

People doing problem modeling for a living are probably crying in horror and blame the modeler who created an instance with thousands of disconnected components. As someone who deals with CNFs though, I don't have the liberty of blaming anyone for their instances and have to deal with what comes my way.