Let’s first examine a traditional sequential (also called serial) solution, which uses no parallelism, to the simplified version of this algorithm described in the previous chapter. As with many algorithms, this serial version can form the basis for a parallel solution, and we will later see several parallel solutions derived from it.
You will want to examine the code itself while reading this chapter.
In the complete archive, dd.tar.gz, this example is under the dd/serial directory.
Alternatively, for this chapter, these are the individual files to download:
The Makefile is for use on Linux systems.
The example program provides a sequential C++ implementation of our simplified drug design problem.
Note
The program optionally accepts up to three command-line arguments:
- maximum length of the (randomly generated) ligand strings
- number of ligands generated
- protein string to which ligands will be compared
A straightforward compile can be used for this sequential example:
g++ -o dd_serial dd_serial.cpp
Or you can use the Makefile and simply type ‘make’ at the command line on a Linux system.
In this implementation, the class MR encapsulates the map-reduce steps Generate_tasks(), Map(), and Reduce() as private methods (member functions of the class), and a public method run() invokes those steps according to a map-reduce algorithmic strategy (see previous Introduction section for detailed explanation). We have highlighted calls to the methods representing map-reduce steps in the following code segment from MR::run().
Generate_tasks(tasks);
// assert -- tasks is non-empty
while (!tasks.empty()) {
    Map(tasks.front(), pairs);
    tasks.pop();
}

do_sort(pairs);

int next = 0;  // index of first unprocessed pair in pairs[]
while (next < pairs.size()) {
    string values;
    values = "";
    int key = pairs[next].key;
    next = Reduce(key, pairs, next, values);
    Pair p(key, values);
    results.push_back(p);
}
We use the STL containers queue<> and vector<> to hold the results of each map-reduce step: namely, the task queue of ligands to process, the list of key-value pairs produced by the Map() phase, and the list of resulting key-value pairs produced by calls to Reduce(). We define those container variables as data members in the class MR:
queue<string> tasks;
vector<Pair> pairs, results;
Here, Pair is a struct representing key-value pairs with the desired types:
struct Pair {
int key;
string val;
Pair(int k, const string &v) {key=k; val=v;}
};
In the example code, Generate_tasks() merely produces nligands strings of random lower-case letters, each having a random length between 0 and max_ligand. The program stores those strings in a task queue named tasks.
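As a rough sketch, Generate_tasks() might look like the following. Here it is written as a standalone function taking nligands and max_ligand as parameters; in the actual program these are data members of the class MR, so treat the signature as an assumption for illustration.

```cpp
#include <cstdlib>
#include <queue>
#include <string>

// Sketch of Generate_tasks(): fills the task queue with nligands strings
// of random lower-case letters, each with a random length in [0, max_ligand].
// (Parameter-passing style is an assumption; the real code uses class members.)
void Generate_tasks(std::queue<std::string> &tasks,
                    int nligands, int max_ligand) {
    for (int i = 0; i < nligands; i++) {
        int len = rand() % (max_ligand + 1);   // random length 0..max_ligand
        std::string ligand;
        for (int j = 0; j < len; j++)
            ligand += static_cast<char>('a' + rand() % 26);  // random letter
        tasks.push(ligand);
    }
}
```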
For each ligand in the task queue, the Map() function computes a match score by comparing a string representing that ligand to a global string representing a target protein, using the simplified match-scoring algorithm described above. Map() then yields a key-value pair whose key is that score and whose value is that ligand.
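A sketch of Map() follows. For concreteness it scores a ligand as the length of the longest common subsequence of the ligand and protein strings; this is a stand-in for whatever simplified scoring the previous chapter defines, and the protein is passed as a parameter here rather than read from a global as in the actual program.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Pair {
    int key;
    std::string val;
    Pair(int k, const std::string &v) { key = k; val = v; }
};

// Sketch of Map(): scores one ligand against the target protein and appends
// the (score, ligand) pair. The score used here -- length of the longest
// common subsequence, via standard dynamic programming -- is illustrative.
void Map(const std::string &ligand, std::vector<Pair> &pairs,
         const std::string &protein) {
    size_t m = ligand.size(), n = protein.size();
    std::vector<std::vector<int>> dp(m + 1, std::vector<int>(n + 1, 0));
    for (size_t i = 1; i <= m; i++)
        for (size_t j = 1; j <= n; j++)
            dp[i][j] = (ligand[i - 1] == protein[j - 1])
                           ? dp[i - 1][j - 1] + 1
                           : std::max(dp[i - 1][j], dp[i][j - 1]);
    pairs.push_back(Pair(dp[m][n], ligand));
}
```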
The key-value pairs produced by all calls to Map() are sorted by key in order to group pairs with the same score. Then Reduce() is called once for each of those groups in order to yield a vector of Pairs consisting of a score s together with a list of all ligands whose best score was s.
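The sort-then-group pattern can be sketched as follows. The comparator in do_sort and the space-separated concatenation of ligands in Reduce are assumptions for illustration; the Reduce() signature mirrors the call next = Reduce(key, pairs, next, values) shown earlier in MR::run().

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Pair {
    int key;
    std::string val;
    Pair(int k, const std::string &v) { key = k; val = v; }
};

// Sketch of do_sort(): order the pairs by key so that pairs with equal
// scores become adjacent, ready for grouping by Reduce().
void do_sort(std::vector<Pair> &pairs) {
    std::sort(pairs.begin(), pairs.end(),
              [](const Pair &a, const Pair &b) { return a.key < b.key; });
}

// Sketch of Reduce(): starting at index 'next' in the sorted pairs,
// concatenate the ligands of all pairs sharing 'key' into values (the
// space separator is an assumption), and return the index of the first
// pair with a different key.
int Reduce(int key, const std::vector<Pair> &pairs, int next,
           std::string &values) {
    while (next < (int)pairs.size() && pairs[next].key == key) {
        values += pairs[next].val + " ";
        next++;
    }
    return next;
}
```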
Note
Map-reduce frameworks such as the open-source Hadoop commonly use sorting to group the values for a given key, as our program does. This has the additional benefit of producing sorted results from the reduce stage. Likewise, the staging of performing all Map() calls before sorting and all Reduce() calls after sorting completes is common among map-reduce frameworks.
The methods Generate_tasks(), Map(), and Reduce() may seem like unnecessary complication for this problem since they abstract so little code. Indeed, we could certainly rewrite the program more simply and briefly without them. We chose this expression for several reasons:
We have not attempted to implement the fault tolerance and scalability features of a production map-reduce framework such as Hadoop.