Adapa (Automatic DAta PArallelism)
Version 0.1 - 19/09/2009
Copyright (C) 2009 Pedro Ribeiro (CRACS & INESC-Porto LA, U. Porto)

User Manual

Contents

  Introduction
  Download and Installation
  Main workflow
  Usage
    Dividing the data
    Creating an executable to be used
    Cluster/Grid Environment
  Examples of usage
    A simple example
    All-to-all correlation
  Complete reference of parameters
  Template configuration file
  References

Introduction

Scientists without a computer science background are typically used to ordinary serial tools, and the amount of work required to get even the simplest application running in parallel can deter them from trying.

Adapa is a reusable tool created to address this problem. It is able to run parallel applications on multiple computing platforms, while being flexible enough to let its users work in whatever programming language they are already acquainted with. It is designed to work only with problems exhibiting data parallelism.

The main features of Adapa are:

  - Automatic division of the data into independent partitions
  - Automatic submission of jobs to a cluster/grid environment
  - Freedom to write the per-partition executable in any programming language
  - Automatic aggregation of the partial results into a single global result

Adapa was written in C++ and is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3, as published by the Free Software Foundation.

Adapa is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License version 3 for more details.


Download and Installation

Adapa can be downloaded from http://www.dcc.fc.up.pt/adapa/.

The current Adapa version only works on Linux platforms (future releases will include a Windows version). After you have downloaded adapa-VERSION.tar.gz, you should decompress the package (tar xvzf adapa-VERSION.tar.gz) and enter the adapa-VERSION directory. Adapa is organized as a normal GNU package, and therefore you should be able to run the following commands to configure, build and install Adapa:

./configure; make; make install

More detailed instructions are given in the package, in the INSTALL file.


Main workflow

The basic workflow is divided into four major phases, depicted in a figure in [1] and summarized below:

  1. Divide the data: Adapa divides the data into independent partitions.
  2. Submit the data: Adapa submits the data to a cluster/grid environment.
  3. Compute the data: an executable provided by the user independently computes the results for each partition (this is done in parallel in the cluster/grid environment).
  4. Aggregate the results: the results of the independent partitions are aggregated to create a single global result.

Usage

Adapa is currently a command-line driven program (a graphical user interface may be released in the future). If you have installed it correctly, you should be able to start the program simply by running the following command:

adapa

Each application we want to run is called an experiment in the context of Adapa, and each experiment is defined by a directory, where all the associated files will be stored. The syntax of the command line is:

adapa <experiment directory> <action> [<any other options>]

The experiment directory is a relative or absolute path to the directory that should contain the experiment (if we are starting a new one) or that already contains it (if the experiment is already in progress).

The action to take can be one of the following (each is shown in more detail in the examples below):

  prepare - divide the data and create the submission files
  submit  - submit the created jobs to the cluster/grid environment
  status  - check the current state of the experiment and of its jobs
  collect - aggregate the results of all partitions into a single output
  remove  - remove any trace of the experiment

All the other options (which we call parameters) are not mandatory on the command line and come in the form --PARAMETER_NAME VALUE. These parameters control exactly how Adapa behaves; a complete reference is given in the section "Complete reference of parameters" below.

Since there can be a lot of parameters to specify, Adapa allows the user to provide a configuration file where parameters and their respective values are defined. For that you can use the parameter CONFIG_FILE. So, for example, if you had a configuration file named config.txt, you could run the following:

adapa <experiment directory> <action> --CONFIG_FILE config.txt [<any other options>]

Configuration files come in the form PARAMETER_NAME = VALUE (only one parameter per line). Comments are allowed (just start the line with #).
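
For example, a minimal configuration file could look like this (the parameter names are explained in the complete reference below; the values here are purely illustrative):

# my experiment configuration
EXECUTABLE=./simple
NUM_CHANNELS=3
NUM_VALUES=12
PARTITION_TYPE=EQUAL_SIZED_NUMBER
EQUAL_SIZE_VALUE=3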

Dividing the data

The first thing you need to think about is how to divide the data. The prerequisite is that you should have data files containing the original data to be analyzed. The data should come in streams of values, meaning a value, followed by another value, followed by another value, and so on. You can have several channels of data, with each channel having the same number of values.

You then need to specify exactly how you are going to divide the data in order to process it in parallel. Adapa will create several partitions, each of them again with a stream of values. Imagine, for example, that you have 12 values. If you divide them into 3 partitions, you would obtain the following:

Original data:
V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12

Partitioned data:
V01 V02 V03 V04 | V05 V06 V07 V08 | V09 V10 V11 V12
  Partition 1       Partition 2       Partition 3

If you have more than one channel, then the same partition will be given the same n-th values of all channels. For example, partition 1 would have values 1 through 4 of channel 1, values 1 through 4 of channel 2, etc.
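
Sketching this with two channels (V for channel 1, W for channel 2) and the same division into 3 partitions:

Channel 1: V01 V02 V03 V04 | V05 V06 V07 V08 | V09 V10 V11 V12
Channel 2: W01 W02 W03 W04 | W05 W06 W07 W08 | W09 W10 W11 W12
               Partition 1       Partition 2       Partition 3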

The user should then provide a single executable that will be called for each partition. Note that the same executable is run for every partition, but with different input data each time. The main aspect to note is that each partition constitutes a different computation job, and therefore the cluster/grid environment can compute the jobs in parallel, because they are independent from each other.

Creating an executable to be used

The bulk of the computation is done for each partition by an executable provided by the user. This executable can be created in any programming language the user desires. The only requisite is that it can be executed in the cluster/grid environment that will be used.

This executable will be called with the following standard input for each partition (one value per line):

partition_number (a positive integer number)
number_of_values (number of values in this partition)
number_of_channels (number of channels in this partition)
channel_1_file (path to the file containing values from channel 1)
channel_2_file (path to the file containing values from channel 2)
(...)
channel_N_file (path to the file containing values from channel N)
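
For instance, in an experiment with 3 channels divided into partitions of 4 values each, the executable for partition 1 would receive something like the following (the file names here follow the data-pX-cY naming used in the simple example below):

1
4
3
data-p1-c1
data-p1-c2
data-p1-c3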

All you need to do is create an executable capable of reading this input and then doing the computation it requires, whatever that may be.
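
As an illustration, here is a minimal C sketch of such an executable. It assumes ASCII data with one value per line, and the per-channel sum is just a stand-in for whatever computation you actually need:

#include <stdio.h>

int main(void) {
    int partition, num_values, num_channels;
    char path[1024];

    /* Read the header Adapa writes to standard input */
    if (scanf("%d %d %d", &partition, &num_values, &num_channels) != 3)
        return 1;

    /* Process each channel file in turn */
    for (int c = 1; c <= num_channels; c++) {
        if (scanf("%1023s", path) != 1)
            return 1;
        FILE *f = fopen(path, "r");
        if (f == NULL)
            return 1;
        double sum = 0.0, value;
        for (int i = 0; i < num_values && fscanf(f, "%lf", &value) == 1; i++)
            sum += value;
        fclose(f);
        /* One result line per channel goes to standard output */
        printf("partition %d channel %d sum %f\n", partition, c, sum);
    }
    return 0;
}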

Cluster/Grid Environment

Another prerequisite is that you have a cluster or grid environment ready to be used. The current version of Adapa supports the following batch engines:

  - Condor

Adapa already has some support for other batch engines (namely Torque/PBS), but that support will only be released in future versions.

The computer where you run Adapa should have submission privileges and access to the usual command-line interface of the respective submission engine. For example, with Condor this means that commands such as condor_submit and condor_q should work on that computer.

After you run the prepare command, you should be able to see an automatically generated submission file in the respective experiment directory, which you can customize further if you desire.


Examples of usage

We now give some examples of how to use Adapa.

A simple example

Let's start with a simple example, designed to help you better understand the inner workings of Adapa.

What do you need?

  - The three files with the example channel data
  - The example program (simple.c)
  - The configuration file (simple.txt)

Now, download the 3 channel files, the program and the configuration file to the same directory. Start by compiling the program to create an executable named simple. Assuming you have gcc, you should do the following:

gcc -o simple simple.c

Let's create an experiment with this, to be stored in the directory "experiment". Start by preparing the files:

adapa experiment/ prepare --CONFIG_FILE simple.txt

This should start Adapa and give you a couple of messages indicating what is being done. Now, check the experiment state with the following command:

adapa experiment/ status

Among other things, it should say something like:

*********************************************************
Experiment directory: "experiment/"
Workflow State: Data division done and submission files created
(waiting for submission order)
*********************************************************

Let's submit all jobs created. Run:

adapa experiment/ submit

Now check the status again with the same command as before. You should see the jobs in the queue or, if they have completed, the corresponding message. Once this happens (i.e., checking the status shows the jobs as completed), the experiment is ready for the final aggregation step. Run:

adapa experiment/ collect

You can now see the results in output.txt. By analyzing the file you should completely understand how everything works.

Before you finalize everything, have a look at the experiment/ directory. You do not have to know exactly what each file means, but knowing it can help you better understand how Adapa works and open new possibilities for its usage. The files you should be able to see include:

  - the automatically generated submission file for the batch engine
  - inputX.txt: the standard input fed to the executable of partition X
  - outputX.txt: the standard output produced by the executable of partition X
  - data-pX-cY: the values of channel Y belonging to partition X
  - output.txt: the final aggregated result (created by the collect action)

Each Condor job was basically calling "simple < inputX.txt > outputX.txt" with data-pX-c1, data-pX-c2 and data-pX-c3 available for that particular job. Note that the jobs were independent and Condor should have allocated them to different CPUs (if enough were available) to calculate outputs in parallel, running several jobs at the same time.
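
As an illustration, a Condor submission description for one such job might look roughly like the following (a generic hand-written sketch; the file Adapa actually generates may differ):

universe = vanilla
executable = simple
input = input1.txt
output = output1.txt
error = error1.txt
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = data-p1-c1,data-p1-c2,data-p1-c3
queue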

When you want to remove any trace of the experiment you can just remove the directory or run the equivalent:

adapa experiment/ remove

If you want, you can now try different options, for example for the partitioning, to see how they affect the outcome. Run it with VERBOSE=TRUE to see exactly what the starting and ending positions of each partition are.

All-to-all correlation

We will now give a more "real" example of usage, demonstrating how to perform an all-to-all correlation in sliding windows over various channels of data. The main idea is detailed in [1].

Basically, our input is N channels of data. We want to compute correlations between all pairs of channels, not over the whole data, but within each "sliding window" of size t, sliding k values each time. For example, dividing nine windows of this type between 3 CPUs (3 different jobs), we would have a division like the one sketched below (the original figure is in [1]):
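
W1 W2 W3 | W4 W5 W6 | W7 W8 W9
  Job 1      Job 2      Job 3

Each job receives the consecutive windows of its partition and computes all pairwise correlations within each of its windows.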

Download the file correlation.tar.gz containing the following:

  - the source code of the programs (among them gen_data, which generates random data, and the correlation program itself)
  - a Makefile to compile them
  - correlation.txt, the Adapa configuration file for this experiment

Now do the following steps (the command to run comes before the '->', and its explanation after):

tar xvzf correlation.tar.gz -> decompress the files
make -> compile the programs
./gen_data 100000 256 -> generate 256 channels of data, each one with 100,000 random integers
adapa experiment/ prepare --CONFIG_FILE correlation.txt -> prepare files for submission
adapa experiment/ submit -> submit jobs
adapa experiment/ status -> check job status (keep doing it until they all terminate)
adapa experiment/ collect -> concatenate results
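
For reference, correlation.txt will contain parameters along these lines (a hypothetical sketch: the executable name, window size, slide and partition count below are illustrative, not necessarily the values shipped in the package):

# hypothetical correlation.txt
EXECUTABLE=./correlation
NUM_CHANNELS=256
NUM_VALUES=100000
PARTITION_TYPE=SLIDING_WINDOW
WINDOW_SIZE=1000
WINDOW_SLIDE=500
WINDOW_PARTITIONS=10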

In the end you should have the required all-to-all correlation in output.txt, computed in parallel in approximately one tenth of the time it would take to compute serially.

A more detailed explanation of this example is outside the scope of this manual. Have a look at the files to understand it better, or contact the authors for more information.


Complete reference of parameters

In this section all valid arguments will be presented in the following way:

When an argument name is followed by an asterisk (*) it means the argument is mandatory in order to run Adapa.

When an option is followed by a plus sign (+) it means that it is the default option whenever another option is not explicitly given.

If the argument does not have a limited set of values, then it will be given in the form ARGUMENT_NAME = <type_of_value> (with the type being, for example, a positive integer).

Workflow parameters

  EXECUTABLE* = <path>
  Path to the user executable to be run on each partition.

  ARGUMENTS = <string>
  Extra command-line arguments to pass to the executable.

  CONFIG_FILE = <path>
  Path to a configuration file defining more parameters (see above).

Data parameters

  NUM_CHANNELS* = <positive integer>
  Number of different channels of data.

  NUM_VALUES* = <positive integer>
  Number of values per channel.

  DATA_TYPE = ASCII+ | BINARY
  Type of data: ASCII (one value per line) or BINARY (binary values of BYTES_PER_VALUE bytes each).

  BYTES_PER_VALUE = <positive integer>
  Number of bytes of each value (only valid when DATA_TYPE=BINARY).

  TYPE_CHANNELS = SINGLE | MULTIPLE+
  How the channels are stored: all in a single file, or one file per channel.

  SINGLE_TYPE = CHANNEL | VALUE
  How values are ordered in the single file (only valid when TYPE_CHANNELS=SINGLE).

  SINGLE_NAME = <file name>
  Name of the single file containing the data (only valid when TYPE_CHANNELS=SINGLE).

  MULTIPLE_NAME = CUSTOM | <name pattern>
  Names of the channel files; use %d or %0Xd for the channel number, or CUSTOM to define each name with MULTIPLE_CUSTOM_X (only valid when TYPE_CHANNELS=MULTIPLE).

  MULTIPLE_CUSTOM_X = <file name>
  Custom name of the file for channel X (only valid when MULTIPLE_NAME=CUSTOM).

  TRANSFER_FILES = TRUE+ | FALSE
  Whether to transfer the data files to the execution machines.

Partition parameters

  PARTITION_TYPE* = EQUAL_SIZED_NUMBER | EQUAL_SIZED_SIZE | SLIDING_WINDOW | CUSTOM
  Type of partitions to be made.

  EQUAL_SIZE_VALUE = <positive integer>
  Number of partitions (EQUAL_SIZED_NUMBER) or size of each partition (EQUAL_SIZED_SIZE).

  WINDOW_SIZE = <positive integer>
  Size of each window (only valid when PARTITION_TYPE=SLIDING_WINDOW).

  WINDOW_SLIDE = <positive integer>
  Number of values a window slides each time (only valid when PARTITION_TYPE=SLIDING_WINDOW).

  WINDOW_PARTITIONS = <positive integer>
  Number of real partitions (jobs) containing complete and consecutive sliding windows; if not given, there are as many partitions as windows.

  CUSTOM_NUMBER = <positive integer>
  Number of custom partitions (only valid when PARTITION_TYPE=CUSTOM).

  START_X / END_X = <non-negative integer>
  Start and end positions of custom partition X (only valid when PARTITION_TYPE=CUSTOM).

Miscellaneous parameters

  VERBOSE = TRUE | FALSE+
  Be more verbose on all actions taken.


Template configuration file

You can download the complete template here, or you can just copy the relevant lines from the following:

##################################
#    Adapa configuration File    #
##################################

# You can comment by starting the line with '#'

# Arguments come in the form ARG=VALUE

# -------------------------------------------------
# Job
# -------------------------------------------------

# EXECUTABLE: path to the user executable to be run on each partition
EXECUTABLE= 

# ARGUMENTS: extra command-line arguments to pass to the executable
# ARGUMENTS= 

# -------------------------------------------------
# Data
# -------------------------------------------------

# NUM_CHANNELS: Number of different channels of data
NUM_CHANNELS=

# NUM_VALUES: Number of values per channel.
NUM_VALUES=

# DATA_TYPE: type of data. Can be
#    - ASCII: one value per line (DEFAULT VALUE)
#    - BINARY: binary values of BYTES_PER_VALUE bytes each
# DATA_TYPE=ASCII

# BYTES_PER_VALUE: number of bytes of each value (when DATA_TYPE=BINARY)
# BYTES_PER_VALUE=

# TYPE_CHANNELS: how are channels stored. Can be:
#    - SINGLE: All on a single file
#    - MULTIPLE: one file for each channel with all values (DEFAULT VALUE)
# TYPE_CHANNELS=

# SINGLE_TYPE: how values are stored in the single file. Can be:
# (only valid when TYPE_CHANNELS=SINGLE)
#    - CHANNEL: channel 1 all values, channel 2 all values, etc
#    - VALUE:   value 1 all channels, value 2 all channels, etc
# SINGLE_TYPE=

# SINGLE_NAME: what is the name of the single file containing data
# (only valid when TYPE_CHANNELS=SINGLE)
# SINGLE_NAME=

# MULTIPLE_NAME: names of the files for each channel. Can be:
# (only valid when TYPE_CHANNELS=MULTIPLE)
#    - CUSTOM: meaning that we should then define all MULTIPLE_CUSTOM_X
#    - any other value: taken as a file name pattern, with %d or %0Xd
#      substituted by the channel number
# Example: MULTIPLE_NAME=data-c%d.txt (data-c1.txt, data-c2.txt, etc)
# MULTIPLE_NAME=

# MULTIPLE_CUSTOM_X: custom named files for each channel
# (only valid when MULTIPLE_NAME=CUSTOM)
# Example:
# MULTIPLE_NAME=CUSTOM
# MULTIPLE_CUSTOM_1=data-a.txt
# MULTIPLE_CUSTOM_2=data-b.txt
# MULTIPLE_CUSTOM_3=data-c.txt

# TRANSFER_FILES: transfer data files. Can be:
#   - TRUE: do transfer files (DEFAULT VALUE)
#   - FALSE (or any other): do not transfer
# TRANSFER_FILES=

# -------------------------------------------------
# Partitions
# -------------------------------------------------

# PARTITION_TYPE: type of partitions to be made. Can be:
#    - EQUAL_SIZED_NUMBER: equal sized partitions, given number of partitions
#    - EQUAL_SIZED_SIZE: equal sized partitions, given the partition size
#    - SLIDING_WINDOW: sliding window of size WINDOW_SIZE, sliding each time
#      WINDOW_SLIDE values
#    - CUSTOM: create CUSTOM_NUMBER partitions with start START_X and end END_X
PARTITION_TYPE=

# EQUAL_SIZE_VALUE: value for the chosen EQUAL_SIZED_* partition type
#   - If EQUAL_SIZED_NUMBER, number of partitions
#   - If EQUAL_SIZED_SIZE, size of each partition
# EQUAL_SIZE_VALUE=

# WINDOW_SIZE: size of each window in SLIDING_WINDOW partitions
# (only valid when PARTITION_TYPE=SLIDING_WINDOW)
# WINDOW_SIZE=

# WINDOW_SLIDE: a window slides by this number of values each time
# (only valid when PARTITION_TYPE=SLIDING_WINDOW)
# WINDOW_SLIDE=

# WINDOW_PARTITIONS: how many real partitions containing complete
# and consecutive valid sliding windows (i.e., how many jobs to create)
# (no value means as many partitions as windows)
# WINDOW_PARTITIONS=
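
# Example (hypothetical values): with NUM_VALUES=100000, WINDOW_SIZE=1000
# and WINDOW_SLIDE=500 there are (100000 - 1000) / 500 + 1 = 199 complete
# windows; WINDOW_PARTITIONS=10 would split them among 10 jobs.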

# CUSTOM_NUMBER: number of custom partitions
# (only valid when PARTITION_TYPE=CUSTOM)
# Example:
# CUSTOM_NUMBER=3
# START_1=0
# END_1=9
# START_2=10
# END_2=89
# START_3=90
# END_3=100

# -------------------------------------------------
# Miscellaneous
# -------------------------------------------------

# VERBOSE: be more verbose on all actions taken. Can be:
#   - TRUE: be verbose
#   - FALSE (or any other): don't be verbose
# VERBOSE=TRUE

References