I'm also interested in this problem for biology applications. My first-pass algorithm is to define a canonical encoding of all the candidate solutions (which is possible because the candidate space is finite and can be enumerated in a fixed order). Each thread then takes a range of integers, decodes each integer into a candidate solution under that encoding, evaluates it, and appends it to a solutions list only if it actually solves the instance. This approach is attractive because you never explicitly store a candidate unless it really is a solution to the subset sum problem, saving an exponential amount of memory: you only store actual solutions, which are usually a vanishingly small fraction of the candidates.
The problem arises when you want to run instances with more than about 4 billion (2^32) candidate solutions, since the candidate indices no longer fit in 32-bit integers. I'm also not sure how long such a kernel would take to run anyway (possibly days).