ANDY: A general, fault-tolerant tool for database searching on computer clusters

ANDY is a set of Perl programs and modules for running large biological database searches, and similar applications, across the nodes of a Linux/Unix computer cluster to achieve speedup through parallelization. Users describe a cluster run in an XML configuration file, which specifies a template for the pipeline or sequence of commands to run on the nodes, the locations and types of the data sources for the command lines, and other information. The tool works directly with Distributed Resource Management ("DRM") systems, such as GridEngine and OpenPBS/PBSPro, and can be extended to other DRMs by writing a small DRM-specific module for submitting and monitoring jobs. Cluster runs can be done either in dedicated mode, where nodes are held and used until all tasks complete, or in fair mode, where each submitted job does only a specified amount of computation and then exits, so that other users' jobs can be interspersed among ANDY jobs. For efficiency, the file input and output of commands can optionally be routed through named pipes and buffered in memory, avoiding the possible performance hit of slow disk I/O. The tool is fault tolerant: it checks that all jobs and tasks complete, and resubmits failed jobs and tasks until the run is complete. Users can provide their own application-specific error checking by adding error-checking commands to the sequence or pipeline of commands; the tool detects non-zero exit statuses and flags an error, causing the server to retry. Users can similarly arrange for succinct summaries of raw program output to be created, which can reduce communication with the server and improve performance. The tool also allows a command pipeline to be run at the server over the course of a run, into which the results from node jobs are piped, allowing global summaries to be created.
We also provide a set of concrete client-side and server-side summarization and error-checking routines for key bioinformatics programs such as BLAST, FASTA, and ssearch. The tool thus provides an infrastructure for generalized, distributed, fault-tolerant, high-performance command pipelines over a compute cluster. The tool has been widely used in support of research in the Brenner computational biology lab at UC Berkeley, and we are making it publicly available.
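
The retry behaviour described above (a non-zero exit status from any command in the sequence, including a user-supplied error-checking command, flags the task as failed and triggers a retry) can be illustrated with a minimal sketch. This is not ANDY's code, which is written in Perl and resubmits through a DRM. The Python function names run_pipeline and run_with_retries, and the max_attempts cap, are hypothetical and chosen only for illustration.

```python
import subprocess
import sys


def run_pipeline(commands):
    """Run each command (an argv list) in order.

    Returns 0 if every command exits with status 0; otherwise returns the
    first non-zero exit status and skips the remaining commands.
    """
    for argv in commands:
        status = subprocess.run(argv).returncode
        if status != 0:
            return status
    return 0


def run_with_retries(commands, max_attempts=3):
    """Re-run the whole command sequence until it succeeds.

    Returns the attempt number that succeeded, or None if all attempts
    failed. The max_attempts cap is an assumption of this sketch.
    """
    for attempt in range(1, max_attempts + 1):
        if run_pipeline(commands) == 0:
            return attempt
    return None


if __name__ == "__main__":
    # Stand-ins for a search command followed by an error-checking command.
    search = [sys.executable, "-c", "pass"]
    check = [sys.executable, "-c", "import sys; sys.exit(0)"]
    print(run_with_retries([search, check]))
```

Because the error check is just another command in the sequence, no special hook is needed: any program that exits non-zero on bad output can serve as a checker.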

For questions, comments, bug reports, feature requests, etc. concerning ANDY or this site, email us.