AMD’s OpenCL heaven and hell

I have been using the AMD’s OpenCL platform for a while now, so I’ll try to write a little overview of what I think is good and bad about OpenCL in general and AMD’s OpenCL implementation in particular.

Update: AMD was working on the issues below, and temporarily fixed some of them. They are broken again, and I gave up trying to help them. They would need more engineers to fix these issues on a permanent basis.

The good

  • The speed of execution is unbeatable. I can easily get 10x better speeds than current top-of-the-line desktop CPUs. And that’s not just saving on buying the CPU and all the peripherals (memory, motherboard, etc.), but also on power, which can easily be more expensive than the upfront cost of new machines
  • Once a C/C++ template code is written, OpenCL is easy to start using. There are no difficult systems to master from the main program’s perspective, only a couple of well-named functions like clEnqueueWriteBuffer
  • The OpenCL documentation from Khronos is very well written. They even provide a handy reference card that is well made, too
  • There are many books available. I’ve bought The OpenCL programming guide, though I got bored from its first part (describing the specification), and never got to the supposedly interesting second part, so I can’t yet recommend it
  • The AMD programming guide is simply great. It lists all the important factors like register usage of kernel vs. register-limited wavefronts on the compute unit. It is basically a compressed bible, written by someone extremely knowledgeable in the subject, probably some core AMD engineer(s). I highly recommend reading it
  • OpenCL 1.1 supports mulithreading and is easy to multi-thread. I have turned a program that was single-threaded and used only one GPU into a multi-GPU, multi-threaded system in only two days with OpenMP
  • The GNU debugger can be used to debug the OpenCL kernel if the CPU is chosen as the target platform. This is extremely useful and considerably cuts down on development time.

So much for the good part. For something that is still in its infancy — it has only been around for about 3.5 years — the above is quite impressive. Apparently, others think it’s impressive, too, as OpenCL is already a leading solution for GPGPU developers.  Now let’s get to the bad part.

The bad

  • Warning and errors are sometimes silently eaten. This is very frustrating. For example, I passed a constant memory piece as clCreateBuffer&clEnqueueWriteBuffer, but the size of the buffer was larger than the specified maximum (which for constants is pretty low, a couple of KB). What error message did I get? None. The kernel executed ‘fine’, with the slight glitch that it didn’t do anything, at all, just returned immediately — with CL_SUCCESS, naturally. I have hit this errors-get-eaten bug multiple times, with multiple kinds of bugs. So, if my code doesn’t work, it’s hard to know what is wrong
  • AMD supplies debugging and analysis essentially only for Windows and Visual Studio. I have actually shelled out money on a Windows license (first time in my life) only to get these running, but then I realised I would also need a Visual Studio license, for a mere thousand dollars, to use all tools. However, after seeing the greatness of the KernelAnalyzer which I can reliably crash any moment, and gives no information other than what I can already see from the ISA file, I thought… maybe it’s not worth the money
  • Valgrind gives a lot of warnings when launched on a program that uses AMD’s OpenCL implementation. This makes debugging a lot harder, because OpenCL copies a lot of data between the GPU and the CPU, and if you got the numbers wrong, there is nothing to tell you why your data is garbage. The best would be if the AMD APP SDK was valgrind-clean, and when instructed to use the CPU as the OpenCL platform it would detect memory access errors even in the OpenCL kernel.
  • Global memory usage of the kernel really makes a big difference. I thought that although it’s slow, it’s not a big deal. I was wrong. It’s really slow. I can make my program go 10-20% faster just by replacing something like “x = y; x ^= z;” with “x = y^z”, where “x” is global memory, and “y” and “z” are both registers. This was completely new to me. It also gives you a hint at the intelligence of the ‘optimising’ compiler of the AMD APP SDK. But more on that later.

Now that the bad part is out of the way, it’s time for what’s really ugly.

The ugly

  • The ‘optimising’ compiler in the AMD APP SDK can essentially be treated as a random function. It sometimes gives me faster code, sometimes a much slower one, and I can make it switch from one to the other with what essentially amounts to comments in the code. This is like letting a random monster eat your code then treating what came out of the monster as something to benchmark/test. Don’t be surprised if what comes out is, well, quite inedible. This ‘slight’ problem can be evaded if you simply ask the compiler not to ‘optimise’, by passing the “-cl-opt-disable” option. Sometimes optimisation helps, but when debugging performance problems, it’s best to get rid of it. My current main setup doesn’t use the optimiser and I get approx 3x faster code this way. My other code is speeded up by a factor of 1.8x with the optimiser — though it’s only a slow, sort-of backup implementation anyway
  • You have to copy-paste code. Last time I copy-pasted code was when I was 12, programming Pascal. The reason for this is that in the function declaration&definition you have to specify whether the parameter is a register, a global, or a constant. Since you might want to use the same function twice, once with a constant parameter, once with a global one, you have no other choice than to copy-paste. Oh, yes, you could use a #define, but I am not sure that’s cleaner in any way. You also have to copy-paste if you want the same function, but with a slightly different ending. Putting an ‘if’ at the diverging point and passing a parameter to control the ‘if’ will be considerably slower (yes, the compiler is that bad)
  • AMD’s fglrx dirvers on Linux are about as stable as a house of cards in high wind. If it doesn’t crash, that’s a good day. I can make it crash just by launching a video player, but I have more ‘fancy’ ways (e.g. pushing 2 button combinations) of reliably crashing it, too. However, the ‘radeon’ driver which is stable on Linux doesn’t work with OpenCL.Thanks to this, I actually have to reboot my machine to watch any video — and I thought those days (Windows 95?) were over
  • AMD APP SDK on Linux is about as stable as their driver. For example, compiling&launching a program on the 2.6 version uses 150% CPU and forces the kernel to launch 3 kworker(=kernel worker) threads, each eating up 50% CPU, completely using up a 3-core machine. However, the same exact program can be compiled&launched on the 2.5 version, eating exactly 1% CPU and forcing no kworker usage

Conclusions

The OpenCL architecture is clean, usable and user-friendly. Unfortunately, the current AMD OpenCL implementation is bad. I hope this will improve in the future, and as more and more people start using it, the bugs will eventually be reported, triaged, and fixed. As the number of people that buy a particular card for its OpenCL capabilities increase, for instance because games start using OpenCL as a way to speed up certain physics calculations, the makers of these cards will eventually be forced to fix the bugs and make development easier. Until then, we might have to keep on suffering for the speedups we get.

[paypal-donation]

Thoughts on SAT@home

If you have ever wondered how your multi-month SAT problem could be solved, the answer is here: SAT@home by some kind Russian researchers! This idea has been brewing in my head, we even started a mini-project with a  friendly American researcher, but we never got it into a working condition :(. It’s great that someone took the time to do it, though. Creating such a setup has some easier and harder parts. Setting up a BOINC server is not that hard, but it needs to run 24/7, which is difficult, unless you are department head at a university or you have a company behind you. Then the scripts need to be written, which is relatively easy to do, but testing and debugging is time-consuming. The really hard part, though, is how to distribute the workload. Treatises could be written on that, and the one above uses the simplest method, that of cutting up the problem along a couple of variables. It’s a good start, I think: starting easy is always the best way to start — the first automobile didn’t try to achieve 300km/h speeds, and it didn’t have to.

In the long run, I guess the setups will get more and more complicated, combining many techniques from cutting on variables, to clause sharing with a multitude of clause usefulness heuristics, to sharing parameter usefulness statistics. Workshares themselves could eventually be diversified, some doing ‘simple’ search without resolution, some doing only resolution (and clause cleaning?), some doing only (specific) simplification algorithms such as subsumption or strengthening, some doing data analysis on the current state of the problem such as reachability analysis, number of gates or other higher-level elements such as adders in the problem, etc. In other words, there are many ways to improve this. Maybe even a common interface to such a system could be laid down where different SAT solvers could be plugged in and used without notice.

A friend of mine has once written me a mail describing his work, which essentially talks about paying for a webpage not by looking at adverts, but by running a javascript program. The idea is, of course, that we could write a SAT solver in javascript, and make people run the script by putting it on a well-visited webpage. The server would then distribute SAT problems to the javascripts, which would send some computed data back (like learnt clauses), while the person is surfing the page. Cheap (maybe even free) and essentially unlimited CPU time for hard problems. We could use it to answer some fun math questions, or break some cryptographic systems, for example… just have to find someone crazy enough to write a SAT solver in javascript! :)

CCC Camp’11

In case you’ve missed it, the CCC Camp was a great opportunity to meet people both working in security and otherwise. I have even met a very kind Taiwanese researcher who worked on SAT and Gröbner basis: in fact, if you haven’t had the chance to read this paper, I highly recommend it. A set of kind Taiwanese researchers recommended this paper to me, and I think it’s the most interesting SAT paper I have read in the past year.

We at SRLabs have made two releases during this camp, one that breaks GPRS encryption, and one that breaks smart card ROM encryption. I was involved with the first release, essentially working on the crypto part. In case you are interested in the videos, the one on GPRS is uploaded here, and the one on smart card ROM encryption is here. This reminds me of something: the videos from the MIT SAT/SMT Summer School are missing :( Well, given my fail there, maybe that’s a good thing :)

Failed literal probing and UIP

I have just realised that CryptoMiniSat, having won a number of medals, does one of the most basic things, failed literal probing, all wrong. Let me tell you why it’s all wrong. In essence, failed literal probing is trivial. We enqueue a fact, such as a, and then propagate it. If this fails, i.e. if two opposing facts such as g and -g are derived, we know that a cannot be true, so we set -a. Easy. Or maybe not so easy.

The devil is in the details, so let’s see how we derived both g and -g from a. Let’s assume that we have the following set of binary clauses:
-a V b
-b V c
-b V d
-d V e
-d V f
-e V g
-f V -g

which, from the point of view of a is best described as the graph:

Propagating "a", deriving both "g" and "-g"

The problem is, that if we only derive -a from this graph, we end up with only that one literal set, because -a doesn’t propagate anything on the clauses. This is quite sad, because, in fact, we could derive something stronger. From the graph it is evident that setting d would have failed: the graph would simply have its upper part cut away completely, but the lower part, including the derivation of both g and -g would still stand:

Deriving both "g" and "-g" from "d"

What is special about node d? Well, it’s where the 1st UIP, the first unique implication point, lies. And it is quite simple to see that -d is in fact the strongest thing we can derive from this graph. It’s much stronger than simply -a, because setting -d propagates on the clauses, giving -b,-a, setting three variables instead of one, including -a. An interesting observation is the following: deriving -b is the 2nd UIP, and deriving -a is the last UIP. In other words, at least in this most simple case, 1st UIP is in fact the strongest, and 2nd, 3rd.. last UIP are less strong in strict order.

Let me remark on one more thing about failed literal probing. Once failed literal probing has been done on literal x and it visited the node set N, there is no need to try to do failed literal probing on any nodes in N, since they cannot possibly fail. Although the failing of a literal can have consequences on the failing of other literals, if we ignore this side-effect, we could speed up failed literal probing by marking literals already visited, and only carrying out failed literal probing on ones that haven’t been marked. This is really trivial, but I haven’t been using it :S

I am quite sure that some advanced SAT solvers (such as lingeling) do all of the above things right. It’s probably only CryptoMiniSat that fails miserably :)

Note: there is a subtle bug with marking literals visited as “OK”. If two different subgraphs could fail, but we abort on the first one, then we might mark a literal OK when in fact it isn’t. It is therefore easiest not to mark any literals if the probe failed (which is very rare).

Note to self: higher level autarkies

While reading this thesis, I have had a thought about autarkies. Let me first define what an autarky in SAT solving is:

A partial assignment phi is called a weak autarky for F if “phi*F is a subset of F” holds, while phi is an autarky for F if phi is a weak autarky for all sub-clause-sets F' of F. (Handbook of Satisfiability, p. 355)

What does this mean? Well, it means that in the clause-set {x V y, x} the assignment y=False is a weak autarky, because assigning y to False will lead to a clause that is already in the clause set (the clause x), while the assignment of x=True is a (normal) autarky, because it will satisfy all clauses that x is in. The point of autarkies is that we can remove clauses from the clause set by assigning values to a set of variables, while not changing the (un)satisfiability property of the original problem. In other words, autarkies are a cheap way of reducing the problem size. The only problem is that it seems to be relatively expensive to search for them, so conflict-driven SAT solvers don’t normally search for them (but, some lookahead solvers such as march by Marijn Heule do).

So, what is the idea? Well, I think we can have autarkies for equivalent literals, too. Here is an example CNF:

 a V d V -f
-b V d V -f
-a V e
 b V e

setting a = -b will not change the satisfiability of the problem.

We can add any number of clause pairs of the form X V Y where Y is any set of literals not in {a, -a, b, -b}, and X is (-)a in clause 1 and (-)b in clause 2. Further, one of the two variables, say, a can be in any clauses at will (in any literal form), though variable b can then only be in clauses defined by the clause pairs. Example:

 a V  d V -f
-b V  d V -f
-a V  e
 b V  e
 a V  c V -h
 a V -f

An extension that is pretty simple from here, is that we can even have clauses whose Y part is somewhat different: some literals can be missing, though again in a controlled way. For example:

 a V d
-b V d V -f
-a V e
 b V e

is possible. It would restrict b a bit more than necessary, but if a is the only one of the pairs missing literals, we are still OK I believe.

Finally, let me give an example of what these higher-level autarkies are capable of. Let’s assume we have the clause set:

 a V  d
-b V  d V -f
-a V  e
 b V  e
 a V  c V -h
 a V -f
 a V  b V  f V -g

setting a=-b now simplifies this to:

 a V  d
-a V  e
 a V  c V -h
 a V -f

which is equisatisfiable to the first one. Further, if the latter is satisfiable, computing the satisfying solution to the former given a solution to the latter is trivial.