## Abstract

CRYSTAL is an *ab initio* electronic structure program, based on the linear combination of atomic orbitals, for periodic systems. This paper concerns the ability of CRYSTAL to exploit massively parallel computer hardware. A brief review of the theory, numerical implementations and parallel solutions will be given and some of the functionalities and capabilities highlighted. Some features that are unique to CRYSTAL will be described and development plans outlined.

## 1. Introduction

The development and the application of new materials have played an important role in the technologies that impact on our daily lives. The global challenges of security, energy, climate change and health have further sharpened the focus on materials research and technology. The application of high-performance computing to materials chemistry problems is currently contributing significantly to, for instance, the discovery and the development of solar absorbers, new catalysts and hydrogen storage systems. The key relationship is between composition, structure and desirable properties. In many cases, it is an understanding of the electronic properties at an atomic and quantum mechanical level that is vital. As the need has grown, the computer simulation of electronic structure has developed in accuracy, reliability and scale. There is now a rapidly growing number of cases where realistic models of important systems can be simulated and the value of experimental measurements significantly enhanced by direct comparison with *ab initio* theory. These developments can be attributed to advances in the underlying theory, algorithmic developments and the exploitation of high-performance computers.

The CRYSTAL package (Dovesi *et al.* 2009; http://www.crystal.unito.it) was jointly developed by the Theoretical Chemistry Group at the University of Torino and the Computational Materials Science group in the Daresbury and Rutherford Appleton Laboratories of the Science and Technology Facilities Council (STFC). The program performs *ab initio* calculations of the ground-state energy, energy gradient, electronic wave function and properties of periodic systems. Hartree–Fock or Kohn–Sham Hamiltonians, which describe the effects of electronic exchange and correlation using a potential derived from density functional theory (DFT) can be used (Kohn & Sham 1956; Hohenberg & Kohn 1964; Parr & Yang 1989). Systems periodic in zero (molecules, zero dimensional), one (polymers, one dimensional), two (slabs, two dimensional) and three dimensions (crystals, three dimensional) are treated on an equal footing. Symmetry is exploited and CRYSTAL automatically implements the 230 space groups, 80 layer groups, 99 rod groups and 45 point symmetry groups. In the case of polymers, helical symmetries can also be applied and exploited.

This paper, which is aimed at providing information about developments of CRYSTAL that enable parallel computing of the energy and the wave function of a periodic system, is organized as follows. In §2, we review briefly the underlying theory: Hartree–Fock theory (§2*a*), DFT (§2*b*) and the numerical implementation of the theories (§2*c*). In §3, we outline and compare parallelization strategies adopted and implemented in the PCRYSTAL and MPPCRYSTAL parallel versions of the program. In §4, the parallel scalability of MPPCRYSTAL is demonstrated, and a brief outline of recent and ongoing developments of CRYSTAL are discussed in §5. Finally, conclusions are made in §6.

## 2. Theoretical background

### (a) Hartree–Fock theory

The electronic Hamiltonian operator, , consists of a sum of three terms: the kinetic energy, the interaction with the external potential (*V*_{ext}) and the electron–electron interaction (*V*_{ee}). That is (in atomic units)
2.1In materials simulation, the external potential of interest is generally the interaction of the electrons with the atomic nuclei
2.2Here, **r**_{i} is the coordinate of electron *i* and the charge on the nucleus at **R**_{α} is *Z*_{α}. Note that in order to simplify the notation and to focus the discussion on the main features of DFT the spin coordinate is omitted here and throughout this paper.

In Hartree–Fock theory (Szabo & Ostlund 1982; Harrison 2003), the ground state is found by minimizing the total energy of the system with respect to a set of *N* normalized spin orbitals *ψ*_{i}. This leads to the Hartree–Fock (or self-consistent field (SCF)) equations
2.3where *ν*_{x}(**r**_{1},**r**_{2}) is the non-local exchange potential. Equation (2.3) can also be written as , where is the Fock operator.

### (b) Density functional theory

Hartree–Fock theory describes non-interacting electrons in a mean field potential consisting of a classical Coulomb potential and a non-local exchange potential. Electron correlation is, however, neglected. DFT provides an improved approach for including the effect of electron correlation (Parr & Yang 1989; Martin 2004)
2.4where *ν*_{xc}(**r**)=*δE*_{xc}/*δρ* is a local multiplicative potential that is the functional derivative of the exchange correlation energy with respect to the density.

The DFT energy can be written as the sum of the kinetic energy, the classical Coulomb interaction, the electron–nucleus interaction and an exchange correlation term, *E*_{xc}. This later term, *E*_{xc}, is the sum of the errors in the approximations made in assuming a non-interacting kinetic energy term and in treating the electron–electron interaction classically.

Studies of the homogenous electron gas suggest that *E*_{xc} can be described in terms of the local electron density; several different functionals exist that exploit this property and they are known collectively as local density approximations (LDAs). The LDA can be improved upon by also considering the first derivatives of the density, and functionals that use this are known as generalized gradient approximations (GGAs). The combination of non-local Fock exchange and density functionals was first proposed by Becke in the B3LYP hybrid functional that mixes 20 per cent Fock exchange with a GGA exchange functional (Becke 1988, 1993*a*,*b*).

The energy functional and matrix elements of the exchange and correlation potentials are not analytical functions of the Gaussian basis set and are therefore integrated numerically on an atom centred grid of points. The integration over radial and angular coordinates is performed using Gauss–Legendre and Lebedev schemes, respectively.

In general, bond enthalpies are significantly better described when using the GGA than when using either the Hartree–Fock theory or the LDA. Hybrid exchange functionals partially correct for electronic self-interaction and therefore generally further improve the description of bond enthalpies and vibrational frequencies. The hybrid exchange approach also produces single particle band gaps for a variety of semiconductors, simple oxides and transition metal oxides more reliably than Hartree–Fock, LDA or GGA theories (Muscat *et al.* 2001; Tomić *et al.* 2008).

### (c) Numerical implementation in CRYSTAL

In CRYSTAL, the crystalline orbitals, *ψ*_{i}(**r**;**k**), are expanded as a linear combination of Bloch functions, *ϕ*_{μ}(**r**;**k**), which are themselves expressed as linear combinations of atom centred atomic orbitals, *φ*_{μ}(**r**),
2.5and
2.6where **A**_{μ} denotes the coordinate of the nucleus in the zero reference cell on which *φ*_{μ} is centred, and the is extended to the set of all lattice vectors **g**.

Each atomic orbital is expressed as a linear combination of individually normalized Gaussian type functions (GTFs), with fixed coefficients *d*_{j} and exponents, *α*_{j},
2.7

The collection of all atomic orbitals is referred to as the basis set. The expansion coefficients of the Bloch functions, *a*_{μ,i}(**k**), are calculated by solving the matrix equation for each reciprocal lattice vector, **k**,
2.8in which **S**(**k**) is the overlap matrix over the Bloch functions, **E**(**k**) is the diagonal energy matrix and **F**(**k**) is the Fock matrix in reciprocal space,
2.9The matrix elements of **F**^{g}, the Fock matrix in direct space, can be written as a sum of one-electron and two-electron contributions in the basis set of the atomic orbitals,
2.10The one-electron contribution is the sum of the kinetic and nuclear attraction terms,
2.11In core pseudopotential calculations, includes the sum of the atomic pseudo-potentials. The two-electron term is the sum of two contributions,
where is the Coulomb term given by
2.12with
2.13The term is the exchange contribution in the Hartree–Fock method, calculated as follows:
2.14while is the exchange and correlation contribution in DFT, obtained by integrating the exchange-correlation potential *ν*_{xc}(**r**), see equation (2.4).

The Coulomb interactions, that is, those of electron–nucleus, electron–electron and nucleus–nucleus, are individually divergent, owing to the infinite size of the system. The grouping of corresponding terms is necessary in order to eliminate this divergence.

The *P*^{n} density matrix elements in the atomic orbitals basis set are computed by integration over the volume of the Brillouin zone,
2.15where *a*_{i,m} denotes the *i*th component of the *m*th eigenvector, *θ* is the step function, *ϵ*_{F} is the Fermi energy and *ϵ*_{n} is the *n*th eigenvalue. The total electronic energy per unit cell is given by
2.16

## 3. Parallel implementation in CRYSTAL

The algorithm used to determine the ground-state electron density and energy in CRYSTAL is similar to that used in other local orbital programs (Soler *et al.* 2002; Guest *et al.* 2005; http://www.gaussian.com; http://www.molpro.net/) and can be briefly summarized as follows.

Given an initial density matrix, **P**^{g}.

Calculate analytically the kinetic, Coulombic and, if necessary, the exact exchange contributions to the Fock matrix in the atomic orbital representation,

**F**^{g}.If required, calculate by quadrature the DFT exchange and correlation contributions to

**F**^{g}.Transform

**F**^{g}into reciprocal space in its crystalline orbital representation**F**(**k**). This is done in two steps: first by a Fourier transform and then by a similarity transform.At each

**k**point, diagonalize**F**(**k**).Using the eigenvalues from step 4, calculate the Fermi level, and hence the occupation numbers of each orbital at each

**k**point.Sum over the occupied eigenvectors to construct a new density matrix,

**P**^{g}.Repeat steps 1 to 6 until convergence of the total energy.

A parallel algorithm has been implemented for each of the steps (1–6) with the exception of step 5, which is not computationally demanding.

A replicated data approach is easily implemented. In step 1, the evaluation of each element of the Fock matrix is an independent task involving large numbers of analytical integrals. In step 2, the quadrature on a grid is a classic example of data parallelism as each central processing unit (CPU) can evaluate the integral on different parts of the grid. Finally, the transformation and diagonalization of **F**(**k**) can be performed at each **k** point independently and after communication of the eigenvalues for the determination of the Fermi level, the contribution of orbitals at each **k** point to **P**^{g} is also an independent calculation. A parallel version of CRYSTAL (PCRYSTAL) based on this approach was first released in 1996.

PCRYSTAL uses a replicated data paradigm; all CPUs have access to a complete copy of all the objects required, but each CPU will be performing different calculations at any instant. The replication of data leads to a fairly straightforward parallelism. The terms evaluated analytically (step 1) may simply be farmed out across the CPUs. There is a potential load balance problem when doing this, as each term will take a different amount of time to compute. However, in practice, this is not usually a problem, as typically there are a very large number of terms to compute compared with the number of CPUs, and a simple static load balancing procedure is sufficient to achieve good parallel scaling. This parallelization method has also been implemented in several other quantum chemistry programmes (Guest *et al.* 2005; http://www.gaussian.com).

The evaluation of the quadrature (step 2) is a little more complex. In CRYSTAL, the unit cell is divided into a number of parallelepipeds, and the quadrature within each of these parallelepipeds is evaluated independently and can be conducted in parallel. However, unlike the analytical terms, the number of parallelepipeds is often of a similar order of magnitude to the number of CPUs. Hence, load balance is a problem, as each parallelepiped may contain a different number of grid points, and may contribute to different numbers of matrix elements. PCRYSTAL, therefore, dynamically load balances this part of the calculation by measuring the time taken to perform each of the quadratures and assigning the parallelepipeds to the CPUs appropriately. For the first cycle, it is reasonable to assume that the time required for a given parallelepiped is proportional to the number of grid points it contains.

For the calculations in reciprocal space, i.e. the Fourier and similarity transforms (step 3), the diagonalization of **F**(**k**) (step 4) and the construction of **P**^{g} (step 6), PCRYSTAL exploits the independence of the **k** points. Each CPU is assigned a subset of **k** points, and performs the calculation on that subset constructing a partial **P**^{g}. In unrestricted calculations, the independence of the spin states is also exploited. The only synchronization point is the evaluation of the Fermi level and the global summation of the partial **P**^{g}.

The resulting code is simple and for many cases performs very well. However, it does have a number of limitations.

The parallelism in reciprocal space is limited by the number of

**k**points. As the system size increases the number of**k**points required decreases; very large systems can be accurately described using just one**k**point.The maximum size of a calculation may be limited by the amount of the memory available. As all the data are replicated, the largest system that can be addressed by PCRYSTAL is no larger than that which serial CRYSTAL can address. In an era, where the amount of memory per CPU is falling, this is a particularly serious problem.

In principle, the costs of re-replicating the data at each stage could become important as the cost of the procedure grows with the number of CPUs. While this is important for other codes, in practice, in PCRYSTAL, each stage is sufficiently expensive that these costs are negligible.

Limitation 1 results in PCRYSTAL typically only scaling to a few 10 s of CPUs, with most calculations becoming impractical owing to runtime before limitation 2 is reached.

To address these problems, a new massively parallel version of PCRYSTAL has been developed, MPPCRYSTAL. The main change from PCRYSTAL is that all large objects are distributed. In particular, all the large reciprocal space matrices are distributed and operated on in parallel. For this, the ScaLAPACK library is used, and thus a block cyclic distribution of the data is implemented (http://www.netlib.org/scalapack; http://www.mpi-forum.org).

Thus in MPPCRYSTAL, for the reciprocal space part of the calculation, a hierarchical parallelism is used; first the independence of the **k** points is exploited, and then for each **k** point, a number of CPUs perform the calculation in parallel by using the ScaLAPACK library. This addresses both the major limitations noted above, and MPPCRYSTAL can: (i) scale to 1000 s of CPUs and (ii) address much larger problems than either CRYSTAL or PCRYSTAL. For instance, it has been shown to be able to perform SCF iterations at over 40 000 basis functions.

But what of the evaluation of **F**^{g} in MPPCRYSTAL? With one major exception, this is performed in almost the same manner as PCRYSTAL, i.e. **F**^{g} and **P**^{g} are replicated. This is not nearly as large an overhead as one might think, as in CRYSTAL, both these objects are stored in sparse format, and thus are very much smaller than the dense matrix representations used for the reciprocal space objects. However, ultimately, this is a potential problem, and we shall return to this later. The major exception is the grid used by DFT calculations, which is distributed owing to its memory requirement; as each CPU only performs quadratures across a small portion of the grid it only holds those parts of the grid that it needs.

The resulting code is, as stated above, much more scalable in terms of both time and memory than PCRYSTAL, and may exploit many more CPUs to perform calculations on much larger systems. In fact, benchmarks on a number of systems suggest that a very rough estimate of the number of CPUs to which the code can scale reasonably efficiently is given by
3.1where *N*_{k} is the number of **k** points, *N*_{s} the number of spins and *N*_{b} the number of basis functions. Thus, very roughly one might expect an MPPCRYSTAL calculation using 10 **k** points and 3000 basis functions to run efficiently on up to around 1000 CPUs.

It is worth stressing the memory usage of the algorithm discussed above, as effective and efficient memory usage is becoming much more important on modern high-performance computers. In fact, it is worth noting that the massively parallel processing code can optionally use algorithms that are less time efficient but more memory efficient than PCRYSTAL precisely because it is designed to handle large systems where memory considerations are paramount. If these low-memory options (controlled by the new `LOWMEM` directive) are exploited, MPPCRYSTAL uses no replicated objects that scale with the square of the system size. The required total memory is reduced by: (i) storing only the symmetry irreducible form of the density matrix and computing elements of the reducible form as required, (ii) optimizing the storage of tables that index the analytic Coulomb and exchange integrals, and (iii) distributing the matrices associated with the symmetrized directions.

This optimization has proved important for phase 2 of HECToR, which consists of nodes with a total of 8 GB of memory and four CPUs, that is just 2 GB memory per CPU; it was previously not possible to use all the CPU per node when running large systems on HECToR.

The current major limitations of MPPCRYSTAL are the following.

— The time to solution of the ScaLAPACK diagonalizer does not scale well with CPU count. This has been somewhat helped by the introduction of the faster routines (

`PDSYEVD`and`PZHEEVD`) in ScaLAPACK v. 1.7 (see http://www.cse.scitech.ac.uk/arc/diags.shtml; http://www.hector.ac.uk/cse/distributedcse/reports/castep/castep_performance_xt/4_Distributed_Diagonaliser_.html), but it is still an issue, and this is the ultimate limitation in the scaling of the time to solution of MPPCRYSTAL.— As noted above,

**F**^{g}and**P**^{g}are still replicated. While these are much smaller than the equivalent reciprocal space objects, they are still a significant size, and it is these that limit the size of calculation that can be performed.

Work is in progress to address both of these issues.

## 4. Demonstration of the scalability of CRYSTAL

The massively parallel implementation of CRYSTAL allows systems with approximately 1000 atoms per unit cell to be studied routinely on parallel computers. The two key computational steps are the evaluation of the Fock matrix through the calculation of matrix element integrals and the solution of the Kohn–Sham equations through diagonalization of the Fock matrix. The parallel capability of CRYSTAL will be demonstrated with three examples of calculations on the UK’s national supercomputer, HECToR. It has to be noted that, with regards to the analytical gradient calculation, the dominant component is also the evaluation of a set of bi-electronic integrals and integrals over the derived DFT potential in a manner highly analogous the integrals for the potential and energy expression; the scaling is, therefore, very similar and is not reported.

The first system is an 864 atom supercell of zincblende GaSb with 6 Sb atoms randomly substituted by N atoms producing a Ga_{432}Sb_{426}N_{6} cell with no spatial symmetry. Ga, Sb and N are described using triple valence basis sets yielding 19 836 atomic orbitals per cell—this is the rank of the Fock matrix. The Fock matrix is diagonalized at two symmetry independent **k** points.

In figure 1, the scalability of one SCF cycle is presented and separated into its two major components: calculation of the two-electron integrals (`SHELLX`) and diagonalization of the Fock matrix (`MPP_DIAG`). The overall speed-up of an SCF cycle is 3.3 when comparing runs on 896 and 3584 CPUs (the ideal speed-up is 4). The parallel scaling of the integral calculation is near ideal, 3.8^{2}, while the diagonalization step scales reasonably well up to 1792 CPUs but not to 3584 CPUs; this is as expected from the rough estimation of scalability given by equation (3.1). Although the diagonalization is responsible for a small fraction of the total runtime, it is clear that the scaling of the diagonalization ultimately limits the parallel scaling of the calculation.

The diagonalization is performed within the ScaLAPACK library using a divide and conquer algorithm (using the `DCDIAG` directive in MPPCRYSTAL) and is sensitive to the block size used to distribute the matrix over the CPUs. This is illustrated in figure 2; reducing the block size (using the `MPPBLOCK` directive) from the default value of 96 to 64 or 32 results in a speed-up of around 10 per cent in the diagonalization and about 20 per cent speed-up in the total time for similarity transform followed by diagonalization and back transform. The optimal value of the block size is machine dependent.

The second example is a TiO_{2} (3×3×3) supercell consisting of 648 atoms with 13 608 atomic orbitals. There are 16 symmetry operators that are exploited in the calculation of the two-electron integrals. In figure 3, the total runtime for a single SCF cycle and the contributions from `SHELLX` and `MPP_DIAG` are displayed for cases where a single **k** point and 8 **k** points are used. Comparing this system to the low-symmetry Ga_{432}Sb_{426}N_{6} system, the integral calculation in `SHELLX` is almost two orders of magnitude faster, while the diagonalization is only marginally faster. The parallel scaling of the integral calculation is again near ideal, while the diagonalization at a single **k** point scales to only 256 CPUs. Exploiting parallelism over 8 **k** points significantly increases the number of CPUs that can be used effectively in `MPP_DIAG` to around 4096. This observed scalability is similar to that estimated by equation (3.1). It is slightly lower than expected for the case where there is only one (real) **k** point. This is, in part, because the estimate given by equation (3.1) assumes a mixture of both real and complex **k** points. Complex **k** points require around three times more computation time than real **k** points; this characteristic is exploited in MPPCRYSTAL. Consequently, systems that have complex **k** points generally scale to larger numbers of CPUs than those that only have real **k** points.

The third example is a TiO_{2} (5×4×4) supercell with a single Fe dopant, consisting of 1920 atoms with 40 320 atomic orbitals. The calculations are performed on the HECToR phase 2b machine at the unrestricted Hartree–Fock level of theory, with one symmetry operator. In figure 4, the total runtime for a single SCF cycle and the contributions from `SHELLX` and `MPP_DIAG` are displayed for cases where a single **k** point is used. Comparing this system to the second example with one **k** point, the integral calculation in `SHELLX` maintains perfect scalability over the whole range of CPUs considered, while now scaling of the Fock matrix diagonalization step `MPP_DIAG`, is dramatically improved. Although in this example, we have considered only the real **k** point, improved scalability in this part is basically owing to the much larger rank of the Fock matrix (40 320 versus 13 608), which increases the work load per CPU relative to the communication costs.

These three examples have been chosen as they are typical of the work undertaken on HECToR by members of the Materials Chemistry Consortium and confirm the expected parallel scalability of MPPCRYSTAL estimated in equation (3.1).

## 5. Developments of CRYSTAL relevant to parallel scaling

Currently, the major ongoing development of CRYSTAL is to include density functional perturbation theory for response properties and time-dependent DFT (TD-DFT) for calculating excited states. Details of this development and its parallel implementation are given in §5*a*.

In addition, several other relevant developments have recently been made or are currently in progress. An interface to WanT for computing electronic transport in nanostructures will soon be available and will exploit the ability of MPPCRYSTAL to describe systems of 1000+ atoms (http://www.wannier-transport.org). The nudged elastic band (NEB) method for transition state searches has recently been implemented (Bailey *et al.* 2005). This method exploits the task-farming parallelism in MPPCRYSTAL whereby multiple energy evaluations are performed simultaneously and independently. This division of labour is conceptually simple and efficient, requiring minimal communication. Furthermore, an individual energy evaluation can be parallelized as described in §3.

For large-scale parallel geometry optimization, an interface between the DL-FIND library (Kastner *et al.* 2009) and CRYSTAL has recently been implemented. DL-FIND is open-source code written primarily at STFC Daresbury Laboratory and is available from CCPForge (http://ccpforge.cse.rl.ac.uk). A wide range of algorithms for local and global minimization, as well as transition state searching, are available via the library interface. The non-sequential optimization methods, i.e. those involving multiple configurations at each iteration, such as the stochastic searches, exploit the task-farming parallelism in CRYSTAL. Consequently, this hierarchical approach will enable efficient calculations on 10 000 s of CPUs.

### (a) Dielectric properties and excited states

The calculation of the many-body dynamical polarizability and hyperpolarizability tensors is available in the latest release of CRYSTAL (CRYSTAL09). The calculation is based on the self-consistent solution of a set of coupled-perturbed equations, derived within the linear or quadratic response approximations of perturbation theory. This method is well known in molecular physics, (Hurst *et al.* 1988), and can be implemented at the Hartree–Fock, DFT or hybrid-DFT levels of theory. Its extension to periodic crystalline systems in CRYSTAL09 relies on an analytical representation of the **k**-dependent position operator for extended systems, which allows us to achieve consistent levels of accuracy across wide ranges of material structures and compositions (Ferrero *et al.* 2008).

Solution of the coupled-perturbed equations at a frequency ±*ω* yields a set of unitary matrices *U*^{k,a}_{±ω}, where the superscript *a* indicates a Cartesian direction, which transform the unperturbed one-particle orbitals into their linear-response orbitals. Equivalently, a linear density-matrix response *D*^{k,a}_{±ω} can be defined in terms of the matrices *U*^{k,a}_{±ω}. The many-body dynamical polarizability tensor ** α**(±

*ω*) is then given by 5.1where the subscript

*a*and

*b*indicate Cartesian components, and

*i*and

*j*are band indices. The polarizability is related to the dielectric tensor

*ϵ*(±

*ω*) by 5.2where

*V*is the cell volume. Optically allowed many-body electronic excitation energies can be computed from the poles of the mean dynamical polarizability 5.3by examining the behaviour of this quantity within a given range of frequencies. In the hybrid-DFT approximation (B3LYP), the quality of this method is sufficient to yield optical gaps within 0.1 eV of experimental estimates for several semiconductors and oxides (Bernasconi

*et al.*submitted), including those exhibiting (bound) exciton transitions. In the CRYSTAL09 implementation, this method scales linearly with the number of

**k**points included in the sampling of the Brillouin zone. This approach is typically appropriate for studying the lowest energy excitations of an extended quantum system.

The coupled-perturbed method can be extended to provide a general formalism for computing electronic excitation energies and transition probabilities, which does not involve the self-consistent solution of a set of coupled equations. This is the so-called *random phase approximation matrix formalism*, (Hirata *et al.* 1999), in which excitation energies correspond to the eigenvalues of a (in general non-Hermitian) coupling matrix. Transition probabilities are related to the eigenvectors of the coupling matrix. The coupling matrix is expressed in terms of super-operators and , which couple single-particle excitations via a local (DFT) or non-local (Hartree–Fock and hybrid-DFT) two-electron response term (Hirata *et al.* 1999). In CRYSTAL09, and are represented in crystal-orbital basis to yield the eigenvalue equation
5.4where **A** and **B** are (*N*_{occ}×*N*_{vir})^{2} matrices (*N*_{occ}/*N*_{vir} being the number of occupied/virtual one-particle states) and **X**/**Y** are left/right eigenvectors. Depending on the form of the matrices **A** and **B**, equation (5.3) corresponds to either TD-DFT, time-dependent Hartree–Fock, or hybrid TD-DFT. **A** and **B** can also be simplified to yield approximate forms of these theories, the Tamm–Dancoff approximation (Hirata & Head-Gordon 1999) to TD-DFT and the configuration interaction singles method. Equation (5.3) can be solved either by direct diagonalization, or via an iterative Davidson-like method. Calculation of matrix-vector products (exploiting possible sparsities in the matrices and/or Davidson guess vectors, which may arise in the crystal-orbital representation of and ) is carried out by means of a non-self-consistent coupled-perturbed calculation, similar to the molecular approach of Bauernschmitt & Ahlrichs (1996).

These perturbation methods can explicitly exploit task farming in frequency scans and will also exploit parallelism for matrix operations and linear algebra using similar techniques to those implemented in the current version of MPPCRYSTAL and described in §3.

## 6. Conclusions

An overview of the CRYSTAL code has been given, with particular emphasis on the adaptation of key algorithms for the exploitation of parallel computer hardware. It has been demonstrated that CRYSTAL can scale up to thousands of CPUs for systems consisting of several hundred atoms and 10–20 thousand atomic orbitals. Recent developments have reduced the total memory required to run large CRYSTAL calculations. This has enabled CRYSTAL to run efficiently on the UK’s national supercomputer, HECToR. Future developments include TD-DFT for calculating excited states and an interface to WanT (http://www.wannier-transport.org) for computing electronic transport in nanostructures.

## Footnotes

One contribution of 16 to a Special feature ‘High-performance computing in the chemistry and physics of materials’.

- Received October 29, 2010.
- Accepted March 7, 2011.

- This journal is © 2011 The Royal Society