PARSEC DEVELOPMENT GUIDELINES

This short document gives a brief overview of the internal structure of PARSEC and establishes a few guidelines for future development. The guidelines are not set in stone, but they should be followed in order to keep source files clean and compatible with each other.

1. Parallelization

On multi-processor machines, parallelization is done in two layers: in the outer layer, k-points, spins and representations are distributed across processors; in the inner layer, grid points are also distributed. This scheme is very flexible and especially efficient on large parallel machines.

1.1 Outermost parallelization: groups of processors

First, some definitions:

Np : number of processors available in the pool
Nk : number of k-points
Ns : number of spin channels
Nr : number of representations

If the product n = Nk*Ns*Nr is greater than 1 and Np/n is an integer, then we can distribute the triplets [representation, k-point, spin] across processors. This is done with the use of MPI groups. We start by choosing an integer Ng such that S = Np/Ng and G = n/Ng are both integers. The pool of processors is then divided into Ng groups, each one containing S processors. In order to take advantage of mixed architectures, processors with consecutive ranks are placed in the same group: processors 0 to S-1 belong to group 0, processors S to 2*S-1 belong to group 1, and so on.

Within a group, each processor has a subrank: processors 0 to S-1 have subranks 0 to S-1, in that order; processors S to 2*S-1 also have subranks 0 to S-1, in that order; and so on. Each group has a group_master processor, which is the one with subrank 0. In the entire pool of processors, the master is processor 0, so it is also the group_master of its group. In order to have ordered input/output, we also define a group of masters containing all the group_masters (the processors with subrank 0 in their groups). By definition, the master processor also belongs to the group of masters.
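For illustration only, the following is a minimal sketch of how such a layout could be built. It uses MPI_Comm_split for brevity instead of the explicit MPI group constructors, and the variable names (ng, group_comm, gmaster_comm) are illustrative, not the actual fields of structure parallel:

    program group_layout
      use mpi
      implicit none
      integer :: ierr, np, iam, ng, s, mygroup, group_iam
      integer :: group_comm, gmaster_comm, color

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, np, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, iam, ierr)

      ng = 2                    ! number of groups; must divide np
      s  = np / ng              ! processors per group
      mygroup = iam / s         ! consecutive ranks end up in the same group

      ! Communicator of my group; the rank inside it is the subrank.
      call MPI_Comm_split(MPI_COMM_WORLD, mygroup, iam, group_comm, ierr)
      call MPI_Comm_rank(group_comm, group_iam, ierr)

      ! Group of masters: only processors with subrank 0 join it.
      color = MPI_UNDEFINED
      if (group_iam == 0) color = 0
      call MPI_Comm_split(MPI_COMM_WORLD, color, iam, gmaster_comm, ierr)

      call MPI_Finalize(ierr)
    end program group_layout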

The triplets [representation, k-point, spin] are also distributed so that each group is assigned G triplets. Processors compute eigenvalues/eigenvectors only for the triplets assigned to their group. Each group works independently of the others during calls to the eigensolvers, subspace filtering, and the Poisson solver.

Information about the layout of MPI groups is stored in structure parallel. Operations that involve quantities global to the groups (such as the electron charge density and the potentials) are done across the group of masters only, since the other processors do not need to participate. All of the input and most of the output is done by the master processor.

Example

This is one example of an MPI group layout: a system with 2 spin channels, 1 representation and 3 k-points, with 8 processors available in the pool. These are the values stored in structure parallel:

parallel%procs_num    = 8
parallel%masterid     = 0
parallel%groups_num   = 2
parallel%group_size   = 4
parallel%group_master = 0
parallel%gmap         = ( ( 0, 1, 2, 3 ), ( 4, 5, 6, 7 ) )

The values of local variables are:

parallel%iam   parallel%iammaster   parallel%mygroup   parallel%iamgmaster   parallel%group_iam   parallel%gmaster_iam
     0               True                  0                 True                    0                     0
     1               False                 0                 False                   1                     -
     2               False                 0                 False                   2                     -
     3               False                 0                 False                   3                     -
     4               False                 1                 True                    0                     1
     5               False                 1                 False                   1                     -
     6               False                 1                 False                   2                     -
     7               False                 1                 False                   3                     -
The assignment of triplets [representation,k-point,spin] is done round-robin among groups. Information about the assignment is stored in variable elec_st%eig%group. In this example, the assignment will be like this:

irp (representation)   kplp (k-point)   isp (spin)   elec_st%eig(irp,kplp,isp)%group
        1                    1               1                       0
        1                    2               1                       1
        1                    3               1                       0
        1                    1               2                       1
        1                    2               2                       0
        1                    3               2                       1
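A minimal sketch of this round-robin assignment, reproducing the table above; the loop ordering and the names nrep, nkpt, nspin, ngroups and eig_group are illustrative assumptions, not the actual code in PARSEC:

    program assign_triplets
      implicit none
      integer, parameter :: nrep = 1, nkpt = 3, nspin = 2, ngroups = 2
      integer :: eig_group(nrep, nkpt, nspin)
      integer :: irp, kplp, isp, icount

      ! Hand out triplets to groups 0, 1, 0, 1, ... in round-robin fashion.
      icount = 0
      do isp = 1, nspin
         do kplp = 1, nkpt
            do irp = 1, nrep
               eig_group(irp, kplp, isp) = mod(icount, ngroups)
               icount = icount + 1
            end do
         end do
      end do

      ! Print the assignment in the same layout as the table above.
      do isp = 1, nspin
         do kplp = 1, nkpt
            do irp = 1, nrep
               print '(3i5,i8)', irp, kplp, isp, eig_group(irp, kplp, isp)
            end do
         end do
      end do
    end program assign_triplets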

1.2 Innermost parallelization: distributed real-space grid

Real-space grid points are distributed within each group of processors in a "slice-stitching" manner. First, grid points are ordered in a one-dimensional array, running fastest along the z direction, then y, then x (consecutive points in the array are most likely neighbors along the z direction). If Ngrid is the total number of grid points, each processor receives a block of approximately Ngrid/S points: the processor with subrank 0 receives points 1 to Ngrid/S, the processor with subrank 1 receives points Ngrid/S+1 to 2*Ngrid/S, and so on. If Ngrid/S is not an integer, the imbalance among processors is at most one grid point. In order to account for the "exterior" of a confined system (the region of space outside the grid), a null grid point is added to the array of grid points. These are the important variables (all global):

Another important variable is parallel%mydim: the number of grid points belonging to this processor, of order Ngrid/S. This variable is local. Most distributed arrays defined on the grid (wave functions, potentials, etc.) have size parallel%ldn, but only the block (1:parallel%mydim) is actually used. The block (parallel%mydim+1:parallel%ldn) is referenced only for points outside the grid, which happens only in non-periodic systems, when the neighbor of some point falls outside the grid because the point itself is close to the boundary. Since wave functions are zero outside the boundary, that exterior block contains only zeros.
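The splitting arithmetic is illustrated by the minimal sketch below (the actual distribution is done in grid_partition; the names ngrid, nprocs and mydim are chosen for this example only):

    program grid_blocks
      implicit none
      integer, parameter :: ngrid  = 10   ! total number of grid points
      integer, parameter :: nprocs = 4    ! processors in the group (S)
      integer :: irank, mydim, istart, nbase, nextra

      nbase  = ngrid / nprocs
      nextra = mod(ngrid, nprocs)   ! the first nextra subranks get one extra point

      istart = 1
      do irank = 0, nprocs - 1
         mydim = nbase
         if (irank < nextra) mydim = mydim + 1
         print '(a,i2,a,i5,a,i5)', 'subrank ', irank, ' : points ', istart, &
              ' to ', istart + mydim - 1
         istart = istart + mydim
      end do
    end program grid_blocks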

The distribution of grid points is done in subroutine grid_partition, executed by group_masters only. In subroutines setup and comm_neigh, the group_masters broadcast information about the grid layout to other processors in their group before each processor does the bookkeeping of neighbors. Look at subroutine comm_neigh for a description of the bookkeeping of neighbors.

Since the grid is distributed within each group, reductions over the grid are done within the group only. For example, the "dot product" between two wave functions < Psi | Phi > is performed first with a local call to the Fortran intrinsic dot_product, and then with a call to MPI_Allreduce, with operation MPI_SUM and communicator parallel%group_comm.
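A minimal sketch of this operation for the real-algebra case; the subroutine name and dummy arguments are illustrative, with group_comm standing in for parallel%group_comm and mydim for parallel%mydim:

    subroutine wf_dot(mydim, psi, phi, group_comm, dot)
      use mpi
      implicit none
      integer, intent(in) :: mydim, group_comm
      real(kind(1.d0)), intent(in) :: psi(mydim), phi(mydim)
      real(kind(1.d0)), intent(out) :: dot
      real(kind(1.d0)) :: dot_local
      integer :: ierr

      ! Local contribution from the block of grid points owned by this processor.
      dot_local = dot_product(psi, phi)
      ! Sum the partial results across the group only.
      call MPI_Allreduce(dot_local, dot, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                         group_comm, ierr)
    end subroutine wf_dot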

2. Input/Output

Most of the input/output is done by the master processor. Processors other than the master only write data to the out.* files. All of the input is done by the master processor. Subroutines usrinputfile (reading of parsec.in) and pseudo (reading of the pseudopotential files) are executed by the master processor only. Information to be distributed to other processors is broadcast in subroutine init_var, which is also where most structures are defined on non-master processors.

How do you add a new input option? You will need to work on at least 3 separate subroutines: typically usrinputfile (to parse the new keyword from parsec.in), init_var (to broadcast the value to the other processors), and the subroutine(s) where the new option is actually used; the corresponding field must also be added to the appropriate structure in structures.f90p. A sketch of the broadcast step is shown below.
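A minimal sketch of the broadcast step for a single integer option; the subroutine name, the argument new_option, and the use of MPI_COMM_WORLD are assumptions for illustration (in the code itself this is handled inside init_var, and the root rank corresponds to parallel%masterid):

    subroutine bcast_new_option(new_option, masterid)
      use mpi
      implicit none
      integer, intent(inout) :: new_option   ! read from parsec.in by the master
      integer, intent(in)    :: masterid     ! rank of the master processor
      integer :: ierr

      ! Send the value read by the master to all other processors.
      call MPI_Bcast(new_option, 1, MPI_INTEGER, masterid, MPI_COMM_WORLD, ierr)
    end subroutine bcast_new_option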

3. Real/complex eigensolvers

With the exception of the thick-restart Lanczos solver, all other eigensolvers implemented in PARSEC (including subspace filtering) handle both real and complex (Hermitian) Hamiltonians. Since the algorithm is very similar for real and complex algebra, the source files for these solvers have the real and complex options merged. Files with this feature have the extension .f90z. At compile time, the preprocessor duplicates the information in those files and creates two sets of code: one for real algebra and another for complex algebra. The macro used to select real versus complex algebra is CPLX. The file mycomplex.h defines the rules for the duplication of code. As a rule of thumb, preprocessed subroutines/functions/variables whose name contains a capital Z have the Z replaced with a lower-case z for complex algebra, or dropped for real algebra. Occasionally, the capital Z is replaced with a lower-case d for real algebra. The variable type MPI_DOUBLE_SCALAR expands to MPI_DOUBLE_COMPLEX or MPI_DOUBLE_PRECISION during preprocessing.
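As an illustration only (the actual macro definitions live in mycomplex.h and are not reproduced here), the mechanism is roughly equivalent to the sketch below, where MY_SCALAR and Zdot_example are made-up names:

    ! Sketch of how one .f90z source can yield two routines after preprocessing.
    #ifdef CPLX
    #define Zdot_example      zdot_example
    #define MY_SCALAR         complex(kind(1.d0))
    #define MPI_DOUBLE_SCALAR MPI_DOUBLE_COMPLEX
    #else
    #define Zdot_example      dot_example
    #define MY_SCALAR         real(kind(1.d0))
    #define MPI_DOUBLE_SCALAR MPI_DOUBLE_PRECISION
    #endif

    function Zdot_example(nn, xx, yy) result(dd)
      implicit none
      integer, intent(in) :: nn
      MY_SCALAR, intent(in) :: xx(nn), yy(nn)
      MY_SCALAR :: dd
      ! dot_product conjugates its first argument in the complex case.
      dd = dot_product(xx, yy)
    end function Zdot_example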

4. Guidelines in the source files

Source files are written in Fortran 95, free-form style (except for the LBFGS library file). Non-standard constructions are avoided. Since the code evolved from FORTRAN 77, numbered (labeled) lines appear frequently, many breaks in flow are done with go to statements, alternate returns are used frequently, and the code makes little use of modular programming. All structures and most modules are declared in the structures.f90p file. Most constants are defined in the const.f90 file. Successful exit from the MPI environment is done through subroutine exit_err, where the arrays within structures are deallocated.

Subroutines always start with a header containing a description of their purpose, input and/or output, followed by the declaration of variables and the executable statements. Programmers are strongly encouraged to write as much in-line documentation as possible, and to keep the overall "look" of subroutines: description, declaration of input/output variables, declaration of internal variables, and executable statements, in this sequence. Most structures are defined on all processors, and many of them have global values. Names of variables are written in lower case. Names of constants can be written in lower case or upper case, but never a mixture of lower case and upper case in the same word. Some integer variables have special meanings, although they are treated as internal variables (for example, irp, kplp and isp, the representation, k-point and spin indices used in the example above). Some suggestions for variable names and the general writing of code follow; the sketch below illustrates the recommended subroutine layout.
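A minimal template of the layout described above; the subroutine, its arguments and the numeric kind are made up for illustration (in the actual code, constants such as the double-precision kind come from const.f90):

    !=====================================================================
    !  example_sub: brief description of purpose, input and output.
    !  Input:  nn, vec_in     Output: vec_out
    !=====================================================================
    subroutine example_sub(nn, vec_in, vec_out)
      implicit none
      ! Input/output variables:
      integer, intent(in) :: nn
      real(kind(1.d0)), intent(in)  :: vec_in(nn)
      real(kind(1.d0)), intent(out) :: vec_out(nn)
      ! Internal variables:
      integer :: ii
      ! Executable statements:
      do ii = 1, nn
         vec_out(ii) = 2.d0 * vec_in(ii)
      end do
    end subroutine example_sub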

Center for Computational Materials, Univ. of Texas at Austin

PARSEC

Author of this document: Murilo Tiago