Research Papers

# Parallelized Simulation of a Finite Element Method in Many Integrated Core Architecture

Moonho Tak

Computational Solid and Structural
Mechanics Laboratory,
Department of Civil and Environmental
Engineering,
Hanyang University,
222 Wangsimni-ro, Seongdong-gu,
Seoul 04763, South Korea
e-mail: pivotman@hanyang.ac.kr

Taehyo Park

Professor
Computational Solid and Structural
Mechanics Laboratory,
Department of Civil and Environmental
Engineering,
Hanyang University,
222 Wangsimni-ro, Seongdong-gu,
Seoul 04763, South Korea
e-mail: cepark@hanyang.ac.kr

1Corresponding author.

Contributed by the Materials Division of ASME for publication in the JOURNAL OF ENGINEERING MATERIALS AND TECHNOLOGY. Manuscript received June 1, 2016; final manuscript received October 25, 2016; published online February 7, 2017. Assoc. Editor: Xi Chen.

J. Eng. Mater. Technol 139(2), 021009 (Feb 07, 2017) (6 pages) Paper No: MATS-16-1162; doi: 10.1115/1.4035326 History: Received June 01, 2016; Revised October 25, 2016

## Abstract

We investigate a domain decomposition method (DDM) for the finite element method (FEM) on Intel's many integrated core (MIC) architecture in order to determine the most effective MIC usage. For this, a recently proposed highly scalable parallel DDM is first introduced with a detailed procedure. Then, Intel's Xeon Phi MIC architecture is presented to show how the parallel algorithm can be applied to a many-core architecture. Parallel simulation on the Xeon Phi MIC has the advantage that traditional parallel libraries such as the message passing interface (MPI) and open multiprocessing (OpenMP) can be used without any additional libraries. We demonstrate the DDM using popular linear algebra libraries such as the linear algebra package (LAPACK) and the basic linear algebra subprograms (BLAS). Moreover, both MPI and OpenMP are used for the parallel solution of the DDM. Finally, the parallel efficiency is validated by a two-dimensional numerical example.


## Introduction

In engineering problems, mesh-based numerical methods such as the FEM, the finite difference method (FDM), and the finite volume method (FVM) have been among the best solutions for decades because they provide high accuracy and efficiency. Geometrically complex shapes and materials with complex elastoplastic behavior can be represented by assembling simple meshes and constitutive equations, respectively. Therefore, these methods play an important role in engineering and science for predicting and understanding physical phenomena. However, a fine mesh is unavoidable when numerical accuracy is required, even though numerical efficiency decreases as the number of equations grows. In particular, detailed simulations at real or large scale incur high computational cost. This remains one of the main numerical challenges, and many researchers have focused on how to reduce the model size without loss of numerical accuracy under limited computing resources.

A parallelized solution that combines a parallel algorithm with a parallel system provides fast and accurate results. However, for some decades, parallel methods using a supercomputer or a cluster system with many central processing units (CPUs) have been studied mainly in specific fields such as computer science, mathematics, and physics, because such systems are very expensive to build and complex to handle [1,2]. Fortunately, manufacturers have recently concentrated CPU development on multicore architectures; therefore, a multicore system can now be used in an ordinary office without high cost [3]. Following this trend, NVIDIA and Intel are driving the development of new multicore architectures. NVIDIA launched the graphics processing unit (GPU) in 1999 [4], and its competitor Intel introduced the MIC in 2012 [5]. These multicore units have the advantages that they are easy to install and can be added to most workstation systems inexpensively. They are based on shared memory parallel (SMP) rather than distributed memory parallel (DMP) processing.

The DDM is the most popular parallel method, and it is typically based on an iterative method to determine the connecting forces. The concept is that a computational domain is decomposed into several independent subdomains, which are connected by contact forces between master and slave domains. The finite element tearing and interconnecting (FETI) method is one of the most successful DDMs, and modified FETI methods have been developed by many researchers [6–9]. However, this method has the drawback that floating domains without boundary conditions are difficult to handle. In a parallel algorithm, this can reduce the parallel speed. Moreover, numerical instability can occur when stiffness matrices are inverted by a singular value decomposition. Recently, a direct method for the DDM was proposed by Tak and Park [10] to remedy this numerical inefficiency and inaccuracy. This approach provides high parallel scalability and accuracy in two-dimensional FEM examples. The inverse problem in floating domains is resolved by predefined Dirichlet boundary conditions, and the speed of the stiffness matrix inversion is improved by an optimized algorithm using a banded Schur complement. However, the efficiency of this method has so far been validated on a homogeneous cluster system.

In this paper, we investigate the parallel efficiency of the direct method when it is implemented on the MIC. For this, the direct method of the DDM is first introduced with a detailed procedure, and then the MIC architecture is presented. Because the MIC is based on SMP, the usage of MPI and OpenMP in the direct method is modified and demonstrated for various combinations of MICs. Finally, the most effective and scalable combination of MICs is determined by a two-dimensional numerical example.

## Domain Decomposition Method Using Direct Method

###### Formulation.

The DDM is the most effective parallel method for the FEM, which is based on finite meshes. The computational domain is divided into subdomains, which are linked by connecting forces. Depending on how these forces are found, DDMs are classified into iterative and direct methods. We introduce the highly scalable direct method proposed by Tak and Park [10].

Let the total domain be denoted by $\Omega$. The system of equations for static analysis can then be defined as follows:

(1) $Ku=f \quad \text{on } \Omega$

where $K$ is the stiffness matrix, $u$ is the displacement vector, and $f$ is the force vector.

The total domain $\Omega$ can be divided into subdomains $\Omega_i$, in which the subscript $i$ denotes the $i$th subdomain. Each subdomain has internal and interface nodes, denoted by superscripts $t$ and $f$, respectively. Using this notation, Eq. (1) can be written as a system of equations for each subdomain as follows:

(2) $\begin{bmatrix} K_i^{tt} & K_i^{tf} \\ K_i^{ft} & K_i^{ff} \end{bmatrix} \begin{Bmatrix} u_i^{t} \\ u_i^{f} \end{Bmatrix} = \begin{Bmatrix} f_i^{t} \\ f_i^{f} \end{Bmatrix} \quad \text{on } \Omega_i,\ i=1,\dots,N$

where the submatrices are components of the stiffness matrix $K$, whose positions depend on the degrees-of-freedom (DOFs) numbering of the nodes. Interface nodes $f$ are shared by two or more subdomains; therefore, the size of the submatrices can increase with the number of sharing subdomains.

When no boundary conditions are prescribed on a subdomain, which is called the floating condition, the linear equation (2) for that subdomain cannot be solved by linear algebra, because the direct method requires the matrix in Eq. (2) to be invertible. This can be resolved by assembling a nonsingular finite element. For this, an interfacial finite element is considered as follows:

(3) $\begin{bmatrix} D_i^{ff} & D_i^{fp} \\ D_i^{pf} & D_i^{pp} \end{bmatrix} \begin{Bmatrix} u_i^{f} \\ u_i^{p} \end{Bmatrix} = \begin{Bmatrix} f_i^{f} \\ f_i^{p} \end{Bmatrix} \quad \text{on } \Omega_i,\ i=1,\dots,N$

where the matrix $D$ is a signed stability diagonal matrix and singular positive definite. Superscript $p$ denotes outer nodes on which boundary conditions are prescribed. Therefore, the force vector $f_i^{p}$ can be calculated by the simple relationship $f_i^{p}=-D_i^{pf}[D_i^{ff}]^{-1}f_i^{f}$. When Eqs. (2) and (3) are combined as in Eq. (4), the force vector $f_i^{p}$ becomes a reaction force vector. Its magnitude is the same as that of the reaction force vector of the adjacent subdomain, but its direction is opposite:

(4) $\begin{bmatrix} K_i^{tt} & K_i^{tf} & 0 \\ K_i^{ft} & K_i^{ff}+D_i^{ff} & D_i^{fp} \\ 0 & D_i^{pf} & D_i^{pp} \end{bmatrix} \begin{Bmatrix} u_i^{t} \\ u_i^{f} \\ u_i^{p} \end{Bmatrix} = \begin{Bmatrix} f_i^{t} \\ f_i^{f} \\ f_i^{p} \end{Bmatrix} \quad \text{on } \Omega_i,\ i=1,\dots,N$

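The force vector $f_i^{p}$ can be represented as follows:

(5) $f_i^{p}=D_i^{pf}\begin{bmatrix} K_i^{tt} & K_i^{tf} \\ K_i^{ft} & K_i^{ff}+D_i^{ff} \end{bmatrix}^{-1} \begin{Bmatrix} f_i^{t} \\ f_i^{f} \end{Bmatrix} \quad \text{on } \Omega_i,\ i=1,\dots,N$

where the inverse matrix can be represented by the Schur complement decomposition as follows:

(6) $\begin{bmatrix} K_i^{tt} & K_i^{tf} \\ K_i^{ft} & K_i^{ff}+D_i^{ff} \end{bmatrix}^{-1} = \begin{bmatrix} S_i^{-1} & -S_i^{-1}K_i^{tf}[K_i^{ff}+D_i^{ff}]^{-1} \\ -[K_i^{ff}+D_i^{ff}]^{-1}K_i^{ft}S_i^{-1} & [K_i^{ff}+D_i^{ff}]^{-1}+[K_i^{ff}+D_i^{ff}]^{-1}K_i^{ft}S_i^{-1}K_i^{tf}[K_i^{ff}+D_i^{ff}]^{-1} \end{bmatrix} \quad \text{on } \Omega_i,\ i=1,\dots,N$

where the Schur complement is defined as $S_i=K_i^{tt}-K_i^{tf}[K_i^{ff}+D_i^{ff}]^{-1}K_i^{ft}$.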
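The block structure of Eq. (6) can be checked numerically on a small case. The following is a minimal sketch, assuming a hypothetical 3×3 matrix split into a 2×2 internal block and a 1×1 interface block; the helper names (`inv`, `block_inverse`, `assemble`) and all values are illustrative and are not taken from the authors' code, which relies on LAPACK.

```python
from fractions import Fraction

def matmul(A, B):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matadd(A, B, sign=1):
    """Return A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def neg(A):
    return [[-a for a in row] for row in A]

def inv(A):
    """Exact Gauss-Jordan inverse of a small nonsingular matrix."""
    n = len(A)
    M = [[Fraction(a) for a in row] + [Fraction(int(i == j)) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # pivot row
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [x / pivot for x in M[c]]
        for r in range(n):
            if r != c:
                factor = M[r][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]

def block_inverse(Ktt, Ktf, Kft, A):
    """Blocks of the inverse of [[Ktt, Ktf], [Kft, A]], where A = Kff + Dff."""
    Ainv = inv(A)
    S = matadd(Ktt, matmul(matmul(Ktf, Ainv), Kft), sign=-1)  # Schur complement
    Sinv = inv(S)
    top_right = neg(matmul(matmul(Sinv, Ktf), Ainv))
    bottom_left = neg(matmul(matmul(Ainv, Kft), Sinv))
    bottom_right = matadd(Ainv, matmul(matmul(neg(bottom_left), Ktf), Ainv))
    return Sinv, top_right, bottom_left, bottom_right

def assemble(tl, tr, bl, br):
    """Stack four blocks into one dense matrix."""
    return [a + b for a, b in zip(tl, tr)] + [a + b for a, b in zip(bl, br)]

# Hypothetical blocks: two internal DOFs and one interface DOF.
Ktt = [[4, 1], [1, 3]]
Ktf = [[1], [0]]
Kft = [[1, 0]]
A = [[2]]  # stands for Kff + Dff
full = assemble(Ktt, Ktf, Kft, A)
K_inv = assemble(*block_inverse(Ktt, Ktf, Kft, A))
identity = matmul(full, K_inv)
```

Exact fractions are used so that the product of the matrix and its assembled block inverse can be compared with the identity without rounding tolerance.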

In Eq. (4), the interfacial force vector $f_i^{f}$ can be calculated from the relationship $f_i^{p}=-f_{i+1}^{p}$ on an interface when the internal force vector $f_i^{t}$ is known. This can be reformulated for the total domain as in Eq. (7). The interfacial force is identical to the Lagrange multiplier used in the FETI method to express the connecting forces:

(7) $\left(\sum_{i=1}^{N} B_i \begin{bmatrix} K_i^{tt} & K_i^{tf} \\ K_i^{ft} & K_i^{ff}+D_i^{ff} \end{bmatrix}^{-1} B_i^{T}\right) f^{f} = -\left(\sum_{i=1}^{N} B_i \begin{bmatrix} K_i^{tt} & K_i^{tf} \\ K_i^{ft} & K_i^{ff}+D_i^{ff} \end{bmatrix}^{-1} f_i^{t}\right)$

where $B$ is the connectivity matrix, which consists of signed Boolean entries.

###### Algorithm for the Direct Domain Decomposition Method.

To obtain highly scalable parallel efficiency, an algorithm is needed that provides both a fast solver for the linear equations and low data transfer. For the linear solver, a banded positive definite matrix is the key issue in the direct DDM because it requires little memory and allows fast calculation. For data transfer, it is important to minimize the data shared on the interfaces. This can be achieved by an algorithm suited to the computer architecture. The procedure of the algorithm is described below together with its parallel efficiency, and a flowchart is shown in Fig. 1.

###### Read an Input File and Separation.

In this step, an input file containing the nodes, elements, and materials of the finite element model, which is created by preprocessing, is read on the root CPU. The data are then separated into smaller input files to be used by the other CPUs. Global nodes in the data are renumbered as local nodes for each subdomain. This is important for reducing memory usage in a parallel system; it is especially necessary in an SMP system because its physical memory is less extensible than that of a DMP system. The distributed smaller files are saved to the hard disk.
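The renumbering of global nodes as local nodes can be sketched as follows. This is a minimal illustration, assuming a hypothetical strip of three four-node elements; the function name `localize` and the data layout are assumptions and do not reflect the authors' file format.

```python
def localize(elements, subdomain_elements):
    """Renumber the global node ids used by one subdomain as 0..n-1.

    elements: dict mapping element id -> tuple of global node ids
    subdomain_elements: element ids owned by the subdomain
    Returns (local_elements, local_to_global).
    """
    global_to_local = {}
    local_elements = {}
    for e in subdomain_elements:
        local_nodes = []
        for g in elements[e]:
            if g not in global_to_local:
                # First time this global node is seen: give it the next local id.
                global_to_local[g] = len(global_to_local)
            local_nodes.append(global_to_local[g])
        local_elements[e] = tuple(local_nodes)
    # local_to_global[k] is the global id of local node k.
    local_to_global = sorted(global_to_local, key=global_to_local.get)
    return local_elements, local_to_global

# Hypothetical 1 x 3 strip of quadrilaterals: nodes 0..3 on the bottom row,
# nodes 4..7 on the top row.
elements = {0: (0, 1, 5, 4), 1: (1, 2, 6, 5), 2: (2, 3, 7, 6)}
local_elements, local_to_global = localize(elements, [1, 2])
```

A subdomain that owns elements 1 and 2 only stores six local nodes, and `local_to_global` is what the root process would need later to place the subdomain results back into the global vector.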

###### Read Separated Input Files and Define Degrees-of-Freedom.

Each CPU reads its separated input file and stores the information in memory. Then, DOFs are defined for the internal and interfacial zones, respectively. Because the bandwidth of the stiffness matrix is determined by the numbering of the DOFs, the Cuthill and McKee algorithm [11], which permutes a sparse symmetric matrix to reduce its bandwidth, is applied to the internal and interfacial zones, respectively. The banded submatrices $K_i^{tt}$ for the internal zone, $K_i^{ff}$ for the interfacial zone, and $D_i^{ff}$ in Eq. (5) allow the Schur complement approach to be used for high-performance computing of the inverse problem.
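The effect of the numbering on the bandwidth can be illustrated with a small sketch of the Cuthill and McKee ordering. The graph below is a hypothetical chain of five DOFs with scrambled labels; the function names and the tie-breaking rule (lowest degree, then lowest label) are assumptions made for this illustration.

```python
from collections import deque

def bandwidth(adj, order):
    """Largest index distance between two connected nodes under an ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)

def cuthill_mckee(adj):
    """Cuthill-McKee ordering of a graph given as {node: set(neighbors)}."""
    order, seen = [], set()
    # Handle every connected component, starting from a lowest-degree node.
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unnumbered neighbors in order of increasing degree.
            for w in sorted(adj[v] - seen, key=lambda n: (len(adj[n]), n)):
                seen.add(w)
                queue.append(w)
    return order

# Hypothetical chain 0-3-1-4-2: a path whose labels are scrambled.
adj = {0: {3}, 1: {3, 4}, 2: {4}, 3: {0, 1}, 4: {1, 2}}
natural = [0, 1, 2, 3, 4]
reordered = cuthill_mckee(adj)
```

For this chain, the natural numbering gives a bandwidth of 3, while the reordered numbering follows the chain and gives a bandwidth of 1.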

###### Assemble the Stiffness Matrix and the Force Vector.

The submatrices in Eqs. (2) and (3) are assembled and summed to calculate Eq. (5). Note that the component $K_i^{ff}+D_i^{ff}$ in Eq. (5) can become non-positive definite because the submatrix $D_i^{ff}$, which has signed values, affects the diagonal components in Eq. (5). If interfacial elements share a point, called a cross point, the effect of the submatrix $D_i^{ff}$ becomes large. Therefore, it is important to choose appropriate values for $D_i^{ff}$ so that positive definiteness is preserved. In a direct solver, the symmetric banded positive definite condition has the advantage that the Cholesky factorization, one of the fast solution methods in linear algebra, can be used.
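One way to see how the signed values of $D_i^{ff}$ can destroy positive definiteness is to attempt a Cholesky factorization, which fails exactly when a symmetric matrix is not positive definite. The sketch below is a plain dense implementation with hypothetical 2×2 values; it is not the banded LAPACK routine used in the paper.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T for a symmetric matrix A.

    Raises ValueError if A is not positive definite.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def is_positive_definite(A):
    try:
        cholesky(A)
        return True
    except ValueError:
        return False

def add_diagonal(K, d):
    """Return K + diag(d) without modifying K."""
    return [[K[i][j] + (d[i] if i == j else 0.0) for j in range(len(K))]
            for i in range(len(K))]

# Hypothetical interface block and two choices of signed diagonal values.
Kff = [[2.0, -1.0], [-1.0, 2.0]]
good = add_diagonal(Kff, [1.0, 1.0])   # stays positive definite
bad = add_diagonal(Kff, [-3.0, 1.0])   # negative value breaks it
```

With the first choice the Cholesky path remains available; with the second, a general LU factorization would be needed instead.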

###### Inverse Problem in the Schur Complement Decomposition.

In this step, the inverse matrix in Eq. (5) is decomposed by the Schur complement $S_i$ as in Eq. (6). Then, the inverse submatrices $[K_i^{ff}+D_i^{ff}]^{-1}$ and $S_i^{-1}$ in Eq. (6) are calculated by the LU or the Cholesky factorization. In the LAPACK library, the routine spotrf for single precision or dpotrf for double precision can be used for the Cholesky method. If the submatrices are not positive definite, they can be factorized by the LU decomposition routine sgbtrf for single precision or dgbtrf for double precision. The multiplication of the calculated inverse matrices by the components of the Schur complement decomposition is parallelized with OpenMP.

###### Assemble the Connectivity Matrix.

The connectivity matrix $B$ in Eq. (7) is first calculated from the relationship between subdomains. Because the size of matrix $B$ is the number of total DOFs by the number of free DOFs, it can consume much memory and computing time. However, this can be reduced by using the compressed sparse row (CSR) format, because the matrices on both sides of Eq. (7) consist mostly of zeros. The summation is performed by the MPI_Reduce routine, in which each CPU sends its CSR-formatted data to the root CPU.
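The CSR format stores only the nonzero values, their column indices, and one pointer per row. A minimal sketch with a hypothetical signed Boolean matrix follows; the function names are illustrative assumptions.

```python
def dense_to_csr(A):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in A:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_idx.append(j)
        # row_ptr[i + 1] marks where row i ends in values/col_idx.
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]

# Hypothetical signed Boolean connectivity: each row couples two DOFs
# that lie on the same interface node, with opposite signs.
B = [[1, 0, -1, 0],
     [0, 1, 0, -1]]
values, col_idx, row_ptr = dense_to_csr(B)
jump = csr_matvec(values, col_idx, row_ptr, [5, 6, 5, 7])
```

Only four values and four column indices are stored for the eight entries of `B`; the saving grows with the matrix size because the number of nonzeros per row stays small.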

###### Calculate Interface Forces and Subdisplacement.

Equation (7) is solved to obtain the interface force vector $f^{f}$. The matrix on the left-hand side (LHS) of Eq. (7) has a low bandwidth because it is sparse. However, because this matrix is not positive definite, the linear equation is solved by the LAPACK general band solver: sgbsv for single precision or dgbsv for double precision. This is performed by the root CPU, and the result is distributed to the other CPUs. Each CPU then calculates its subdisplacement by multiplying the Schur components by the force vector.

###### Calculate Displacement Vector.

Finally, the total displacement vector is calculated on the root CPU by collecting the subdisplacements. Displacements for internal DOFs and for interfacial DOFs are assembled separately, and the displacements of the total domain are determined.

## Many Integrated Core

###### An Architecture of Many Integrated Core.

The MIC released by Intel is called the Intel Xeon Phi coprocessor. The MIC has many cores, GDDR5 memory, cache memories, and its own operating system. This device is connected to other devices via the PCI Express bus and can operate together with an Intel Xeon processor in a host. As the architecture of the MIC is the same as SMP, it becomes one of the nodes when it is combined with a host system. Therefore, the parallel libraries MPI and OpenMP can both be run via an IP address. This has the advantage that code written for the Intel Xeon processor needs only minimal modification to run on the MIC, and popular Linux libraries can be used freely.

The 5110P model, one of the MICs and the one used in our test, was released in 2012. It has 60 cores (240 threads) and a clock speed of 1.053 GHz. Each core has a 32 KB L1 cache and a 512 KB L2 cache, and the cores are linked by a ring interconnect. The 8 GB GDDR5 memory, which is shared by the 60 cores, has a theoretical peak bandwidth of 320 GB/s. This model can communicate with up to eight MICs and a host via an x16 PCI Express bus.

###### Application of the Direct Domain Decomposition Method.

In order to apply the direct DDM to the MIC, two approaches can be considered: a native compilation and an offload compilation.

The native compilation is the simplest approach to run applications on the MIC. The program on the host is shared with the MIC, as in cluster systems composed of a host and clients, and the parallel analyses are performed at the same time. In this case, the source code for CPU processing does not need to be modified, but it must be compiled separately for the host and the MIC. The parallel efficiency mostly depends on the slower system because this symmetric architecture assumes that the host and the MICs show equal performance.

The offload compilation is a more flexible approach to using MICs. The main program runs on the host, and the parts of the parallel solution that need many cores are performed on the MICs. Some statements in the source code must be modified for the offload. This approach has the advantage that memory size is less problematic than in the native compilation because the memory of the host is usually larger than the 8 GB memory of the MIC. However, bottlenecks caused by the different data transfer speeds should be considered.

In the direct DDM, MPI is used for data transfer or sharing between nodes. In addition, OpenMP is used for matrix–matrix and matrix–vector multiplications among the cores in an internal network. As mentioned previously, the direct DDM can be performed by two approaches. For the native compilation, the host and the MICs are symmetric. Therefore, the multiple cores in the host and the many cores in the MICs all participate as nodes. However, low parallel scalability can occur because the bandwidth between the host and the MICs via PCI Express is lower than the interconnect bandwidth between cores. To avoid this problem, the amount of data transferred by MPI must be reduced; moreover, subdomains should be distributed into suitably classified core groups in which one core is assigned to the host and the other cores are used for the OpenMP calculations. For example, multiplications should be calculated on cores linked by the same interconnect.

On the other hand, for the offload compilation, MICs can be used for the multiplications in the direct DDM. If MICs are assigned as computational nodes, the source code can become complex because an independent algorithm for the subdomains must be included in one source file. This approach also results in low parallel scalability because most of the parallel computing depends on the MICs via the low-bandwidth PCI Express link. Therefore, the best way is to run the OpenMP multiplications on the MICs.

## Benchmarks

We demonstrate the direct DDM using the native compilation on the MIC architecture. The source code is developed in C/C++ with CLAPACK for the linear equation solver, MPI for the DMP parallel solution, and OpenMP for the SMP parallel solution, which are included in the Intel Parallel Studio package. The code is run on a workstation with two Intel Xeon E5-2690 processors at a clock speed of 2.9 GHz, a total of 32 cores, and 256 GB of memory. In addition, four 5110P MICs are installed in the workstation. Each MIC has 60 cores and 8 GB of memory, and its cores are linked by a ring interconnect.

A two-dimensional half-circle ring model, shown in Fig. 2, is examined to validate the scalability on the MIC. The width and the radius of the inner circle are 5 m and 15 m, respectively. A concentrated force of 10 N is applied at the end of the outer circle line. Fixed boundary conditions in the x- and y-directions are prescribed at the first line. The model is divided into 16 subdomains, and each subdomain has 6900 two-dimensional four-node elements; therefore, a total of 110,400 elements and 111,605 nodes are used.

We consider three factors for the scalability of this model: (1) the number of subdomains, (2) the number of MICs, and (3) the number of threads. First, the number of subdomains is identical to the number of participating cores. Therefore, we can demonstrate the scalability with respect to participating cores for 2, 4, 8, and 16 subdomains. This measure is called strong scalability, in which the parallel computational time is reduced by increasing the number of decompositions, and it can be estimated by the speedup S as follows:

(8)

where n is the number of cores participating in the parallel processing. If S is identical to n, the speedup is ideal.
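As a sketch of how the speedup could be evaluated from measured times, the snippet below assumes the common definition $S = T_1/T_n$, where $T_1$ is the running time on one core and $T_n$ the running time on $n$ cores. These symbols, the function names, and the timings are assumptions made for illustration and are not values from Table 1.

```python
def speedup(t1, tn):
    """Speedup S = T1 / Tn (assumed common definition)."""
    return t1 / tn

def parallel_efficiency(t1, tn, n):
    """Fraction of the ideal speedup n that is actually reached."""
    return speedup(t1, tn) / n

# Hypothetical timings in seconds, not measured values.
t1 = 100.0
ideal_case = speedup(t1, 25.0)                    # 4 cores, four times faster
real_case = speedup(t1, 40.0)                     # 4 cores, slower than ideal
real_efficiency = parallel_efficiency(t1, 40.0, 4)
```

A run on four cores that takes 25 s instead of 100 s reaches the ideal speedup of 4, whereas a run that takes 40 s reaches 2.5, or 62.5% of the ideal value.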

On the other hand, parallel processing can also be assessed by increasing the number of subdomains while the size of each subdomain is kept fixed, which is called weak scalability. In this measure, high scalability means that the computational time remains constant regardless of the number of subdomains. In this example, we consider strong scalability.

Second, the number of MICs is related to the networking system and its bottlenecks. In an MIC, the cores are connected by an internal network called a ring interconnect. It shows high performance and scalability for parallel simulation because the internal network is fast. However, the calculation can slow down when MICs are linked by an external network such as a local area network or a PCI bus. This is caused by the different network speeds: the internal network between cores can reach up to 96 GB/s, whereas external network speeds are generally lower than 8 GB/s. Therefore, a bottleneck occurs on the external network. We demonstrate this by using four MICs linked by a PCI bus.
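The size of this gap can be estimated with simple arithmetic. The sketch below assumes a hypothetical 1 GB payload and treats the two quoted figures (96 GB/s and 8 GB/s) as sustained rates, which ignores latency and protocol overhead; since the external speed is stated only as an upper bound, the real ratio may be larger.

```python
def transfer_time(num_bytes, bandwidth_gb_per_s):
    """Idealized transfer time in seconds at a sustained bandwidth."""
    return num_bytes / (bandwidth_gb_per_s * 1e9)

payload = 1e9  # hypothetical 1 GB of interface data

ring_time = transfer_time(payload, 96)     # internal ring interconnect
external_time = transfer_time(payload, 8)  # upper bound quoted for external links
ratio = external_time / ring_time
```

Under these assumptions the same payload takes about twelve times longer over the external link, which is why the amount of data exchanged between MICs matters more than the amount exchanged between cores of one MIC.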

Finally, the number of threads is strongly related to the core performance. Each core in an MIC has four threads, which can participate in parallel calculations. The program mainly uses one thread, and the other threads are assigned when that thread is heavily loaded, for example in double and triple loops in the code. We use up to four threads per core to examine the effect of threads on the parallel simulation.

Table 1 shows the computational times and the speedup S for the parallel analysis of the two-dimensional model using MICs and their threads. Regarding the strong scalability mentioned earlier, the computational times decrease with increasing decomposition in all cases. The speedup S increases with increasing decomposition, which indicates high scalability. The high scalability is most clearly seen when one thread is used. Regarding the number of participating MICs, the running times are nearly identical as the number of MICs increases at a fixed decomposition. This means that the effect of the low-speed external network is very small. On the other hand, the number of threads clearly affects the running time in all cases. Because matrix multiplication is one of the main calculations in the program, the thread effect is pronounced.

For a more detailed description, it is necessary to present the elapsed time of each step of the procedure. As discussed in Sec. 2.2, there are seven steps in total in the direct DDM. Figure 3 shows the elapsed time using two MICs and one thread. The total times of each step are presented for the various decompositions.

In step 1, an input file is read on one CPU, and the file is then separated into smaller input files according to the number of decompositions. In general, the computational time at step 1 increases with the number of interfacial DOFs, as shown in Fig. 3.

In step 2, the separated files are read, and DOFs are assigned. The Cuthill and McKee algorithm [11] does not require much time. Therefore, the time consumption at step 2 is nearly identical to that at step 1.

In step 3, the stiffness matrix and the force vector are assembled on each CPU. The elapsed time increases linearly with the number of DOFs, which is clearly shown for step 3.

In step 4, the Schur complement is calculated, which involves inverse problems for the interfacial elements and multiplications. Therefore, the computational time depends on both the number of interfacial DOFs and the number of internal DOFs. The difference in elapsed time increases as the number of decompositions decreases. This means that the effect of the number of internal DOFs is larger than that of the interfacial DOFs.

In step 5, the connectivity matrices are assembled and multiplied by the inverted stiffness matrices. The results of each subdomain are then summed via the MPI_Reduce routine. This step can consume much time, but the time can be reduced, for example, by using the CSR format.

In step 6, the interface force vector $f^{f}$ is calculated using the linear algebra library. This is run on the root CPU, and the results are distributed to the other CPUs. Then, each CPU calculates the subdisplacements of its internal and interface nodes by multiplying the Schur components by the force vector. Little computational time is consumed in obtaining the interface force vector $f^{f}$ because the bandwidth of the LHS of Eq. (7) is very small and most components of the matrix and vector are zero. On the other hand, calculating the subdisplacements on each CPU consumes much time. More computational time is needed when the number of subdomains decreases.

In step 7, almost no computational time is consumed because only the summation of displacement vectors is carried out.

Figure 4 shows the time difference between steps using one thread. Most of the computational time is spent in step 4, and it decreases with increasing decomposition. As mentioned above, in the calculation of the inverse matrix in Eq. (6) via the Schur complement, the effect of the number of internal DOFs is larger than that of the interface DOFs. This also means that the calculation of the inverse matrix for the interface DOFs, $[K_i^{ff}+D_i^{ff}]^{-1}$, hardly affects the calculation of the matrix multiplications for the Schur complement $S_i$. On the other hand, the external network has almost no effect even though the number of MICs connected via PCI Express is increased. In step 6, each CPU sends its calculated matrix and vector in Eq. (7) to the root CPU. However, this does not load the network because the matrix has a low bandwidth and only the few nonzero components are transferred.

Figure 5 presents the time difference between steps when the number of threads is increased for N = 16. This graph shows the effect of the participating threads on matrix multiplication. The developed program contains many loops for the multiplications, and OpenMP threads are applied to these loops. Most matrix–matrix and matrix–vector multiplications are performed in step 4, where the Schur components are calculated. Therefore, the largest effect of the number of threads appears in step 4.
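The way a multiplication loop is shared among threads can be sketched as follows. This is a stand-in written with Python's `concurrent.futures`, not OpenMP: it only illustrates the static partition of the row loop into contiguous blocks, and in CPython it does not yield the speedup that compiled OpenMP code can reach. The function names and the matrix are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_rows(A, x, rows):
    """Compute the entries of A @ x for one block of row indices."""
    return [sum(a * b for a, b in zip(A[i], x)) for i in rows]

def parallel_matvec(A, x, num_threads):
    """Matrix-vector product with the row loop split into contiguous blocks."""
    n = len(A)
    # Ceiling division so that every row falls into exactly one block.
    chunk = max(1, -(-n // num_threads))
    blocks = [range(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() keeps the block order, so the pieces can simply be concatenated.
        parts = list(pool.map(lambda rows: matvec_rows(A, x, rows), blocks))
    return [y for part in parts for y in part]

# Hypothetical 5 x 3 matrix shared among four workers.
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1], [0, 1, 0]]
x = [1, 1, 2]
y = parallel_matvec(A, x, num_threads=4)
```

Each worker writes to its own block of the result, so no synchronization is needed inside the loop; this independence is what makes the multiplications in step 4 respond well to additional threads.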

Figure 6 shows the elapsed time for decompositions N = 1 and N = 16 with an increasing number of threads. From steps 1 to 7, the running times for N = 1 are larger than those for N = 16, and the computational time can be reduced by using more threads.

## Conclusions

In this paper, the MIC based on Intel's Xeon Phi architecture was demonstrated using the DDM proposed by Tak and Park [10]. The detailed procedure of the DDM was introduced in Sec. 2. The DDM has the advantage that the performance of the MPI and OpenMP libraries can be estimated; moreover, various combinations of MICs can be tested to find an optimal usage. For this, we presented the architecture of the MIC and its application to the DDM in Sec. 3. In Sec. 4, the effect of the MIC on the DDM was validated via a half-circle example decomposed into up to 16 subdomains. Three factors were considered: the number of subdomains, the number of MICs, and the number of threads. For the number of subdomains, we confirmed that the computational time decreases as the number of decompositions increases, which shows high parallel scalability. The number of MICs, however, does not affect the computational time even though the MICs are connected by low-speed PCI Express. In addition, thread usage reduces the computational cost because the running time depends strongly on the matrix multiplications in the DDM. To sum up, the MIC provided a highly scalable parallel solution, and multiple MICs could be combined without loss of calculation time. However, this demonstration was run with the native compilation, in which each core runs the program independently; therefore, the effect of the offload compilation needs to be investigated in future work.

## Acknowledgements

This research was supported by Grant No. 15CTAP-C077510-02 from the Infrastructure and Transportation Tech Facilitation Research Program funded by the Ministry of Land, Infrastructure, and Transport of the Korean Government.

## References

[1] Giloi, W. K., 1994, “Parallel Supercomputer Architectures and Their Programming Models,” Parallel Comput., 20(10–11), pp. 1443–1470.
[2] Attig, N., Gibbon, P., and Lippert, Th., 2011, “Trends in Supercomputing: The European Path to Exascale,” Comput. Phys. Commun., 182(9), pp. 2041–2046.
[3] Lim, D. J., Anderson, T. R., and Shott, T., 2015, “Technological Forecasting of Supercomputer Development: The March to Exascale Computing,” Omega, 51, pp. 128–135.
[4] Lyakh, D. I., 2015, “An Efficient Tensor Transpose Algorithm for Multicore CPU, Intel Xeon Phi, and NVIDIA Tesla GPU,” Comput. Phys. Commun., 189, pp. 84–91.
[5] Needham, P. J., Bhuiyan, A., and Walker, R. C., 2016, “Extension of the AMBER Molecular Dynamics Software to Intel's Many Integrated Core (MIC) Architecture,” Comput. Phys. Commun., 201, pp. 95–105.
[6] Amestoy, P. R., Duff, I. S., Guermouche, A., and Slavova, Tz., 2010, “Analysis of the Solution Phase of a Parallel Multifrontal Approach,” Parallel Comput., 36(1), pp. 3–15.
[7] Farhat, C., and Roux, F. X., 1991, “A Method of Finite Element Tearing and Interconnecting and Its Parallel Solution Algorithms,” Int. J. Numer. Methods Eng., 32(6), pp. 1205–1227.
[8] Farhat, C., Lesoinne, M., and Pierson, K., 2000, “A Scalable Dual-Primal Domain Decomposition Method,” Numer. Linear Algebra Appl., 7(7–8), pp. 687–714.
[9] Farhat, C., Pierson, K., and Lesoinne, M., 2000, “The Second Generation FETI Methods and Their Application to the Parallel Solution of Large-Scale Linear and Geometrically Non-Linear Structural Analysis Problems,” Comput. Methods Appl. Mech. Eng., 184(2–4), pp. 333–374.
[10] Tak, M., and Park, T., 2013, “High Scalable Non-Overlapping Domain Decomposition Method Using a Direct Method for Finite Element Analysis,” Comput. Methods Appl. Mech. Eng., 264, pp. 108–128.
[11] Cuthill, E., and McKee, J., 1969, “Reducing the Bandwidth of Sparse Symmetric Matrices,” ACM 24th National Conference, New York, Aug. 26–28, pp. 157–172.

## Figures

Fig. 1

Flowchart for the direct DDM

Fig. 2

Half-circle ring model

Fig. 3

Elapsed time for the DDM using two MICs and one thread

Fig. 4

Time difference between steps using one thread

Fig. 5

Time difference between steps on N = 16

Fig. 6

Elapsed time for the DDM on N = 1 for one MIC and N = 16 for four MICs

## Tables

Table 1 Parallel computation time (s) and speedup S
