FENFLOSS - CFD on High-Performance Computers

The Teraflop Project in cooperation with NEC and the HLRS

The numerical flow simulation software FENFLOSS (Finite Element based Numerical Flow Simulation System) has been developed at the IHS since the early 1980s. It is used to compute laminar and turbulent, steady and unsteady incompressible flows. Thanks to the highly flexible Finite Element approach, complex geometries can be meshed easily with unstructured meshes. Scale- and mesh-adaptive turbulence models enable it to reproduce unsteady turbulent flow behaviour and the associated pressure fluctuations very accurately. An efficient algorithm couples meshes of different discretisation, refined regions, and moving and stationary parts, e.g. for rotor-stator coupling. Furthermore, a special formulation of the Navier-Stokes equations yields a more accurate solution for rotating geometries and rotating frames of reference; a standard rotating-frame form is sketched below. FENFLOSS is used to simulate all kinds of incompressible flows, especially in hydraulic machinery.
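The text does not spell out this special formulation; as a point of reference, a standard rotating-frame form of the incompressible Navier-Stokes equations, written for the relative velocity \(\mathbf{w}\) in a frame rotating with constant angular velocity \(\boldsymbol{\omega}\) (with \(\mathbf{r}\) the position vector, \(p\) the pressure, \(\rho\) the density and \(\nu\) the kinematic viscosity), reads:

```latex
\begin{aligned}
\nabla \cdot \mathbf{w} &= 0, \\
\frac{\partial \mathbf{w}}{\partial t}
  + (\mathbf{w} \cdot \nabla)\,\mathbf{w}
  + \underbrace{2\,\boldsymbol{\omega} \times \mathbf{w}}_{\text{Coriolis}}
  + \underbrace{\boldsymbol{\omega} \times (\boldsymbol{\omega} \times \mathbf{r})}_{\text{centrifugal}}
  &= -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^{2}\mathbf{w}.
\end{aligned}
```

Solving directly for the relative velocity in the rotor region avoids transporting a rapidly rotating absolute velocity field through the mesh, which is one common reason such formulations behave better for rotating geometries.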


Figure: Solution scheme in FENFLOSS.


An interface makes it possible to dynamically load libraries containing user-defined subroutines, as sketched below. This enables communication with the visualisation software COVISE (HLRS) and the exchange of data during the simulation run, so that time-step results can be visualised automatically. Furthermore, for fluid-structure interaction simulations an adapter to the MpCCI (FhG SCAI) code-coupling interface is available as a shared-object library. The parallelised (MPI) and vectorised linear solver, based on a Krylov method (van der Vorst's BiCGstab(2)), handles computations with several million nodes on clusters of different architectures. A special storage scheme (JAD) and optimised loops make it very efficient on state-of-the-art high-performance vector computers, e.g. the NEC SX-8.
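FENFLOSS's actual user-subroutine interface is not documented in this text; the following C sketch merely illustrates how a solver can load a user library at run time on a POSIX system and call a routine from it after each time step. The library name libuser_routines.so and the entry point user_step are hypothetical.

```c
#include <dlfcn.h>   /* dlopen, dlsym, dlclose; link with -ldl */
#include <stdio.h>

/* Hypothetical signature of a user-defined subroutine: called once per
   time step with the current solution field. */
typedef void (*user_step_fn)(int step, int nnodes, const double *u);

int main(void)
{
    /* Load the user library at run time (name is illustrative). */
    void *handle = dlopen("./libuser_routines.so", RTLD_NOW);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up the user's entry point by its symbol name. */
    user_step_fn user_step = (user_step_fn)dlsym(handle, "user_step");
    if (!user_step) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(handle);
        return 1;
    }

    double u[3] = { 0.1, 0.2, 0.3 };   /* stand-in for a solution field */
    user_step(0, 3, u);                /* e.g. hand data to a visualiser */

    dlclose(handle);
    return 0;
}
```

In the same way an adapter library, such as the MpCCI shared object mentioned above, can be attached to a running simulation without recompiling the solver.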


Figure: Left: speed-up on the NEC SX-8 for different meshes. Right: advantage of SMP on large grids.

Further optimisation in cooperation with NEC yields a vector performance of almost 35% of the peak performance of 16 GFlop/s per vector CPU. Increasing the number of processors leads to higher communication effort and decreasing vector lengths, which lowers the per-processor performance. Shared-memory parallelisation (SMP) alleviates this, since fewer MPI partitions keep the local problem, and thus the vector length, larger. The sketch below illustrates why the vector length is tied to the partition size.
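How FENFLOSS implements its JAD kernels is not shown here; the following C sketch of a matrix-vector product in jagged diagonal (JAD) storage, with illustrative names, shows the connection: the inner loop runs over rows rather than over the nonzeros of a single row, so its trip count is roughly the number of local rows, and a smaller per-processor partition shortens exactly this vectorised loop.

```c
#include <stddef.h>

/* Sparse matrix in JAD (jagged diagonal) storage: rows are sorted by
   decreasing number of nonzeros, and the k-th jagged diagonal holds the
   k-th nonzero of every row that still has one. */
typedef struct {
    size_t n;        /* number of rows                              */
    size_t njad;     /* number of jagged diagonals                  */
    size_t *jstart;  /* start of diagonal k in val/col, njad+1 long */
    double *val;     /* nonzero values, stored diagonal by diagonal */
    size_t *col;     /* column index of each stored value           */
    size_t *perm;    /* perm[i] = original row of sorted row i      */
} jad_matrix;

/* y = A*x. Each inner loop performs independent updates of up to n
   rows (perm is a permutation, so there are no write conflicts) and
   therefore vectorises with long, recurrence-free loops. */
void jad_spmv(const jad_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->n; ++i)
        y[i] = 0.0;

    for (size_t k = 0; k < A->njad; ++k) {
        size_t len = A->jstart[k + 1] - A->jstart[k];  /* rows in diag k */
        const double *v = A->val + A->jstart[k];
        const size_t *c = A->col + A->jstart[k];
        for (size_t i = 0; i < len; ++i)
            y[A->perm[i]] += v[i] * x[c[i]];
    }
}
```

Compared with a row-wise CRS product, whose inner loop length is only the number of nonzeros per row, the JAD loop length scales with the number of rows, which is what makes the format attractive on vector machines such as the SX-8.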