Barbarians at the Gate: A Political Methodologist's Notes on Using the National Center for Supercomputing Applications

Philip A. Schrodt
Dept of Political Science
Penn State University
University Park, PA 16802
phone: +1-814-865-8978
email: schrodt@parusanalytics.com

August 1999

This note is a synopsis of my recent experience using the facilities at the National Center for Supercomputing Applications (NCSA) in Urbana-Champaign. The quick summary: If you are already familiar with Unix systems, it is remarkably straightforward to run ANSI C code on these machines. The individual processors in the "supercomputer" are much slower than those in a contemporary personal computer, but one can gain substantial wall-clock speed advantages through parallelism. In most instances, parallel processing can be added to a program with very little additional effort. The computing time at NCSA is free and should be a consideration for anyone using computationally intensive methods that are parallel either in their inner-most loops (e.g., linear algebra routines) or at their outer-most loops (e.g., Monte Carlo or resampling).

Background

Kansas, as an NSF-challenged state, is part of an initiative called EPSCoR that tries to encourage greater levels of Federally funded research. Late last year, the NCSA allocated 10,000 hours of supercomputer time to the University of Kansas under EPSCoR auspices. I applied for 50 hours; our research office [correctly] thought this was too little and added a zero to make it 500, and for some reason NCSA actually granted me 4,000 hours.

NCSA time is available only by peer-reviewed application (see http://www.ncsa.uiuc.edu): academic researchers do not have the option of "buying" time through a research budget. None of the supported projects currently listed on the NCSA web pages have any discernible social science content; whether this is the product of lack of interest or lack of support is not clear. NCSA was certainly generous with my request -- and the "proposal" was just a couple of paragraphs in length -- though this may have been a function of EPSCoR.

[Side comment: An additional motivation for utilizing the NCSA facility was a set of comments that Larry Smarr, NCSA's director, had made concerning the social sciences last year in the Chronicle of Higher Education (11 September 98). Among his pithy observations were "...academic social scientists have self-defined themselves into small problems." and "What's this 'so hard' crap? [Social scientists] don't want to have the answer hard enough." Dr. Smarr, in short, views the social sciences much as a dog views a fire hydrant.]

My project involved the estimation of some large hidden Markov models (HMMs) for forecasting conflict in the former Yugoslavia. This involves a numerical optimization routine -- the Baum-Welch algorithm -- which is a special case of the expectation-maximization (EM) algorithm. Consequently the mix of calculations is similar to that found in most statistical applications, though as we will see, the structure of the calculations is atypical. HMM estimation is complicated by a large number of local maxima, which I bludgeoned into submission with a combination of genetic algorithms and Monte Carlo methods.

My experiment with the NCSA involved two key challenges. First, could I modify my existing C programs to run on the NCSA machines without going to a lot of trouble (and ideally, without having to maintain two versions of the code)? Second, would the NCSA facility provide significantly faster turn-around?

Connecting, Consulting, and Documentation

Making connections to the NCSA requires the Unix "ssh" (secure shell) system (or Kerberos). This required a bit of reconfiguration on my end -- the local computer center had to assign an alphabetic string to my IP number -- but once established, connecting has been painless. From inside the secure system, one can use ftp to get back out to one's home system to transfer files. The NCSA systems are Unix (you were expecting Windows'95?), so if one knows Unix, there is no learning curve on the basic file and navigation commands. Allocations of disk space are quite generous, and more space is available if needed.

The NCSA maintains extensive on-line documentation in assorted formats -- man, html, pdf -- and has a well-organized "Getting started" page [http://www.ncsa.uiuc.edu/SCD/Info/] that provides step-by-step instructions for getting onto the system and running jobs. I also have had very positive experiences with the email consulting staff -- questions were answered within 24 hours, and the answers were generally correct the first time.

Most of the detailed machine-specific documentation, on the other hand, has been created in the "write-only" genre: it presumably makes sense for purposes of reference, but is not aimed at the casual user. The NCSA web site provides numerous examples of sample scripts and commands, and I usually had more success modifying these than trying to work out solutions de novo.

Modifying Programs

My chief worry about this exercise was the level of effort required to get my existing ANSI C programs to run on the parallel machines. At the most basic level, the answer is a very reassuring "no additional effort whatsoever". Once I figured out the appropriate options to use on the "cc" compiler -- a task that required a couple of emails to the consultants -- ANSI C runs without modification. There's even an option for using C++-style comments.

Keeping with my objective of minimizing the learning curve, I did not attempt to debug programs on the NCSA system. Instead, I got them running correctly with a familiar PC-based editor and debugger (Metrowerks CodeWarrior, in my case), then uploaded the code to the NCSA for the production runs. More generally, I get the impression that it is a very good idea to make sure your code runs correctly as a serial process before trying to run it in parallel.

Up to this point, I had to agree with Smarr's observations about the ease of use of the NCSA. If one has source code in C, C++, or Fortran, and is familiar with Unix -- conditions that in all likelihood apply to almost anyone who really needs a supercomputer -- then the start-up costs are negligible.

And just how fast is it?

Not very -- on the hidden Markov estimation with a single processor (no parallelism), the NCSA machines run about a third as fast as my 350 MHz Macintosh G3 (which runs at about the speed of a 450 MHz Pentium). In fact, I first suspected that I'd somehow introduced an infinite loop in my programs. This is supposed to be a supercomputer, dude, and we go to great lengths to keep these out of the hands of godless commies (forcing the Chinese military, in turn, to do all of their nuclear weapons development at Los Alamos...).

But, alas, there's a technology lag here. The NCSA facility primarily uses 195 MHz MIPS R10000 processors (plus an assortment of 250 MHz chips), though it has 1024 of them. These are circa 1997, and given that the personal computer industry moves in "dog years" -- one year of computer innovation equals seven years of most other technologies -- equipment can become outdated very quickly. The relative speed of my 1999 Mac G3 and the 1997 NCSA processors corresponds almost exactly to "Moore's Law" -- computer capacity doubles every 18 months.

The Chronicle article asserted that the NCSA processors were each "50 per cent faster than the best desktop computer currently available." This statement is certainly not true now, and probably was not even true in September 1998, though it probably was true in 1997. The current generations of the Pentium, PowerPC and Alpha microprocessors have borrowed extensively from the experience with supercomputer designs, and now have much more in common architecturally with a 1980s supercomputer than with a 1980s mainframe.

One can submit up to five jobs at a time to the NCSA batch queues, but only three of those jobs can run simultaneously, so for single-processor jobs the throughput at the NCSA simply matches that of a PC. Of course, if you are stuck with an older computer, NCSA could be a decided improvement over the machine sitting on your desk. Because this is a batch facility, the wall-clock turn-around for the NCSA facility can run considerably slower than a dedicated PC -- this is not an option that will save your APSA presentation at the last minute. Obviously, to make significant progress, you've got to get things running in parallel.

Going Parallel

There is an "automatic parallel optimization" option on the C compiler ("-apo") that looks for loops within a program that can be run in parallel. One can also manually mark up the code for parallel processing, but I get the sense that unless you are really, really fond of a program, it makes more sense to just let the compiler handle the optimization.

[A related note: optimizing compilers put a premium on not being clever when writing code. Just tell the compiler what you want to do. With contemporary pipelined architectures, trying to second-guess the machine code is almost always counter-productive, and in any case the folks who wrote the compiler almost certainly know the target machine better than you do. Older programmers will now appreciate how John Henry felt about that steam drill...]

Asking for more processors increases the delay in the batch-processing queue, and increases the allocated time "billed" for processing. NCSA's accounting algorithm is simple -- CPU time multiplied by the number of processors -- so, given that each NCSA processor runs at a fraction of the speed of a fast PC, my 4,000 hours are equivalent to about 1,500 hours on a fast PC. But that's still a lot of time.

Alas, the -apo option made little difference in the speed of my application: the system made use of only three processors, though in my batch submission I had requested eight. (I even got a little note from the consultants asking about this anomaly -- this is a batch facility, and THEY are watching...). The -apolist option on the compiler produces a very readable report of the compiler's assessment of every loop in the program, indicating whether it could be made parallel and, if not, why not.

As it turns out, the Baum-Welch algorithm used to estimate HMMs is essentially a dynamic programming algorithm, which means that in most loops the calculations on iteration i depend on the results of iteration i-1, so the loops cannot be automatically broken down for parallel execution. The loops that could be parallelized -- and the compiler found quite a few -- were relatively short and therefore used only a small number of processors. This is probably not typical of tasks in statistics -- for example, most matrix operations lend themselves very well to loop-level parallel optimization.
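To make the distinction concrete, here is a generic sketch (not the actual Baum-Welch code, and the function and array names are mine): the first loop has a loop-carried dependence and must run serially, while the second has independent iterations and can be split across processors automatically.

#define T 1000

void loop_example(double alpha[T], const double x[T], const double y[T], double z[T])
{
   int t;

   // alpha[t] depends on alpha[t-1], so the iterations must run in order;
   // an automatic parallelizer has to leave this loop serial
   for (t = 1; t < T; t++)
      alpha[t] = alpha[t-1] * x[t];

   // each iteration here is independent of the others, so the compiler
   // can distribute this loop across processors
   for (t = 0; t < T; t++)
      z[t] = x[t] + y[t];
}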

Enter MPI

But there is an alternative approach: all Monte-Carlo experiments are parallel at the macro-level of the entire program, even if the estimation technique isn't parallel at the micro-level of loops. This opened the possibility of using the other major paradigm in parallelism, message-passing. This approach operates by having various parts of the program execute on separate processors, interacting by sending and receiving chunks of data -- "messages." It has been implemented in two widely used interfaces -- PVM and MPI -- and NCSA strongly recommends MPI. MPI has over 100 functions, and an assortment of tutorials is available on the web (see http://WWW.ERC.MsState.Edu/labs/hpcl/projects/mpi/presentations.html).
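As an illustrative sketch of what passing a "message" looks like -- nothing my application actually needs, as will become clear, and the function and array here are hypothetical -- one process sends a block of numbers to another, which receives it:

#include <mpi.h>

void message_example(int rank)
{
   double results[100];
   MPI_Status status;

   if (rank == 1)        // process 1 sends 100 doubles to process 0, with message tag 0
      MPI_Send(results, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
   else if (rank == 0)   // process 0 receives those 100 doubles from process 1
      MPI_Recv(results, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &status);
}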

The conceptual trick in programming with MPI is that the processors are not running separate programs; they are all running the same program, which does different things depending on the "rank" identity each process picks up from MPI. Your synapses may adapt to this approach more quickly than mine did. In one of those quintessential formal-systems experiences, I spent about half a day going through various MPI texts and documents before realizing that for a Monte Carlo problem, the appropriate MPI implementation was absurdly simple:

Serial implementation:

Init_Files();
for (nexp = 0; nexp < N_EXPER; nexp++)
{ <the Monte-Carlo stuff> }

MPI implementation:

int rank, size;
MPI_Init(&argc, &argv);                  // start up the MPI system
MPI_Comm_size(MPI_COMM_WORLD, &size);    // number of processors available
MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // this process's id: 0 to size-1
Init_Files(rank);
for (nexp = 0; nexp < N_EXPER/size; nexp++)
{ <the Monte-Carlo stuff> }
MPI_Finalize();                          // shut down MPI

Five additional lines of code to parallelize a 2,500-line program. So again it seems that Smarr is right about the simplicity.

In this instance, all I'm using MPI to do -- and all it needs to do -- is run the identical program on multiple processors, but with different random numbers. MPI_Init sets up the system. MPI_Comm_size puts the number of processors available into "size"; that number is actually set when one submits the batch job. MPI_Comm_rank assigns a number between 0 and size-1 to each processor; this is subsequently used whenever a process needs to identify itself. MPI_Finalize cleans up at the end. The only modification in the code itself was reducing the number of iterations in the Monte-Carlo loop as a function of the number of processors available.

Is it really that easy? Okay, I lied -- it's slightly more complicated. Submitting an MPI batch job and managing the files involved some additional adventures with the NCSA operating systems, which took a few experiments before everything worked. I also wanted each of the processes to write to a different file, so the Init_Files(rank) routine for initializing files becomes:

char name_fres[] = "MC.SERB.RR";   // general file name
name_fres[9] = 65 + rank;          // replace the final character with a rank letter (65 is ASCII 'A', so the files end in A, B, C, ...)
fres = fopen(name_fres, "w");      // open that file for writing

Meanwhile, the familiar technique for initializing the random number generator -- srand(time(NULL)) -- results in all processes using the same seed, because they start simultaneously (duh...). Changing this to srand(time(NULL)*(rank+1)) solves that problem. But the potentially messy stuff -- for example, the fact that multiple processes will be reading from the same input file -- is handled automatically.
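For concreteness, here is a minimal self-contained sketch that assembles these pieces into one compilable program; run_one_experiment() and the value of N_EXPER are hypothetical stand-ins for the actual HMM estimation, and everything else is just the fragments above put together.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N_EXPER 1000                       // total number of Monte-Carlo experiments (illustrative)

void run_one_experiment(FILE *fres)        // hypothetical stand-in for the real estimation
{
   fprintf(fres, "%f\n", rand()/(double)RAND_MAX);   // write a dummy "result"
}

int main(int argc, char *argv[])
{
   int rank, size, nexp;
   FILE *fres;
   char name_fres[] = "MC.SERB.RR";

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processors requested in the batch job
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's id: 0 to size-1

   srand(time(NULL)*(rank+1));             // give each process a different seed
   name_fres[9] = 'A' + rank;              // give each process its own output file
   fres = fopen(name_fres, "w");

   for (nexp = 0; nexp < N_EXPER/size; nexp++)
      run_one_experiment(fres);            // each process does its share of the experiments

   fclose(fres);
   MPI_Finalize();
   return 0;
}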

This is an extreme case -- apparently the term-of-art applied to this situation is "embarrassingly parallel." As it happens, I was already writing the Monte-Carlo results to a file and analyzing them with a separate program. Merge the various output files from the parallel processes using the Unix "cat" command, and I've got the same thing. (Conveniently, the programs I had written to analyze the output worked without modification.) The separate programs are oblivious to each other's existence; the only message they need to pass is to me.

I would emphasize that MPI can do a lot more than this, and there is nothing clever going on in this implementation. But it gets the job done with minimal modifications to the serial code, and in fact is as efficient as parallel execution can get. Resampling problems are just as embarrassingly parallel as Monte Carlo, and bootstrapping techniques come close, so there are a lot of opportunities here.

With MPI, one is now in a position to exploit some fraction of the 1024 processors available at the NCSA, and once one gets through the batch queue, the clock-time required to run a program drops proportionately with the number of processors used. This can be quite dramatic.

Overall evaluation

Because it is possible to use the NCSA with minimal hassle -- I just want the results, not to adapt to yet-another-computer-configuration -- it appears to be a viable resource for running very long jobs. My guess is that most computationally-intensive processes of interest to political methodologists are substantially parallel either in their inner-most loops (matrix manipulations) or their outer-most loops (Monte Carlo methods, resampling). Both can be adapted to parallel execution with a minimum of additional programming. And remember, the time is free.

If you aren't working with C/C++ or Fortran programs, NCSA has a variety of mathematical, linear algebra and numerical optimization packages. The list of software can be found on their website: http://ncsa.uiuc.edu. In 1999, S-Plus was "under consideration" and they had "part of" SAS, but both of these have now disappeared from the software listings.

Finally, the ease with which MPI can be used suggests the "build your own supercomputer" route: public-domain implementations of MPI that run across personal computer networks are available for Unix, Windows NT and Macintosh systems (see http://WWW.ERC.MsState.Edu/labs/hpcl/projects/mpi/implementations.html). Keep your eye on the dumpsters behind your Business School and you might be able to acquire a dozen or two computers that run as fast as NCSA's MIPS R10000s.

Postscript: Beowulf Clusters

In the two years since I wrote this, "Beowulf clusters" -- parallel computers constructed from off-the-shelf PCs running Linux and linked with standard Ethernet connections -- have emerged as a major competitor to classical supercomputers. The web site http://www.beowulf.org/ provides all of the information to build one of these, as well as links to about 100 installations.

A similar system called "Appleseed" (http://exodus.physics.ucla.edu/appleseed/appleseed.html) builds clusters from Macintoshes. The project is based in the Physics Department at UCLA and has developed a library for running MPI code on such a cluster. The Mac cluster runs at about the same speed as a Cray T3E-900 with a comparable number of processors, and C or Fortran source code can be shifted between the Mac cluster and an NCSA supercomputer with no modifications.