# POET
POET is a coupled reactive transport simulator implementing a parallel architecture and a fast, original MPI-based Distributed Hash Table (DHT).
## Parsed code documentation
A parsed version of POET's documentation can be found on GitLab Pages.
## External Libraries
The following external libraries are shipped with POET:
- CLI11 - https://github.com/CLIUtils/CLI11
- IPhreeqc with patches from GFZ/UP - https://github.com/usgs-coupled/iphreeqc - https://git.gfz-potsdam.de/naaice/iphreeqc
- tug - https://git.gfz-potsdam.de/naaice/tug
## Installation
### Requirements
To compile POET, you need the following software to be installed:
- C/C++ compiler (tested with GCC)
- MPI implementation (tested with OpenMPI and MVAPICH)
- CMake 3.9+
- Eigen3 3.4+ (required by tug)
- optional: `doxygen` with `dot` bindings for documentation
- R language and environment, including headers or `-dev` packages (distro dependent)
The following R packages (and their dependencies) must also be installed: `Rcpp`, `RInside` and `qs`. This can be achieved by issuing the following commands:
```
# start R environment
$ R
# install R dependencies (case sensitive!)
> install.packages(c("Rcpp", "RInside", "qs"))
> q(save="no")
```
### Clone the repository
POET can be cloned anonymously from this repo over HTTPS. Make sure to also download the submodules:
```
git clone --recurse-submodules https://git.gfz-potsdam.de/naaice/poet.git
```
The `--recurse-submodules` option is shorthand for:
```
cd poet
git submodule init && git submodule update
```
### Compiling source code
POET is built with CMake. You can generate Makefiles by running the usual:
```
mkdir build && cd build
cmake ..
```
This creates the directory build, processes the CMake files and generates Makefiles from them. You can now run make to start the build process.
If everything went well, you will find the executables at `build/src/poet`, but it is recommended to install the POET project structure to a desired `CMAKE_INSTALL_PREFIX` with `make install`.
During the generation of Makefiles, various options can be specified via `cmake -D <option>=<value> [...]`. Currently, the following options are available:
- `POET_DHT_Debug=boolean` - toggles the output of detailed statistics about DHT usage. Defaults to OFF.
- `POET_ENABLE_TESTING=boolean` - enables a small set of unit tests (more to come). Defaults to OFF.
- `POET_PHT_ADDITIONAL_INFO=boolean` - enables counting the accesses to each PHT bucket. Use with caution, as this slows things down significantly. Defaults to OFF.
- `POET_PREPROCESS_BENCHS=boolean` - enables the preprocessing of predefined models/benchmarks. Defaults to ON.
- `USE_AI_SURROGATE=boolean` - includes the functions of the AI surrogate model. When active, CMake relies on `find_package()` to find an implementation of `Threads` and a Python environment in which NumPy and Keras are installed. Defaults to OFF.
### Example: Build from scratch
Assuming that only the C/C++ compiler, MPI libraries, R runtime environment and CMake have been installed, POET can be installed as follows:
```
# start R environment
$ R
# install R dependencies
> install.packages(c("Rcpp", "RInside", "qs"))
> q(save="no")
# cd into POET project root
$ cd <POET_dir>
# Build process
$ mkdir build && cd build
$ cmake -DCMAKE_INSTALL_PREFIX=/home/<user>/poet ..
$ make -j<max_numprocs>
$ make install
```
This will install a POET project structure into `/home/<user>/poet`, hereinafter called `<POET_INSTALL_DIR>`. With this version of POET, we do not recommend installing to system hierarchies like `/usr/local/`.
The corresponding directory tree would look like this:
```
poet
├── bin
│   ├── poet
│   └── poet_init
└── share
    └── poet
        ├── barite
        │   ├── barite_200.rds
        │   ├── barite_200_rt.R
        │   ├── barite_het.rds
        │   └── barite_het_rt.R
        ├── dolo
        │   ├── dolo_inner_large.rds
        │   ├── dolo_inner_large_rt.R
        │   ├── dolo_interp.rds
        │   └── dolo_interp_rt.R
        └── surfex
            ├── PoetEGU_surfex_500.rds
            └── PoetEGU_surfex_500_rt.R
```
With the installation of POET, two executables are provided:
- `poet` - the main executable to run simulations
- `poet_init` - a preprocessor to generate input files for POET from R scripts
Preprocessed benchmarks with a corresponding runtime setup can be found in the share/poet directory. More on those files and how to create them later.
## Running
Run POET with `mpirun ./poet [OPTIONS] <RUNFILE> <SIMFILE> <OUTPUT_DIRECTORY>`, where:
- OPTIONS - POET options (explained below)
- RUNFILE - runtime parameters described as an R script
- SIMFILE - simulation input prepared by `poet_init`
- OUTPUT_DIRECTORY - path where all output of POET will be stored
### POET command line arguments
The following parameters can be set:
| Option | Value | Description |
|---|---|---|
| --work-package-size= | 1..n | size of work packages (defaults to 5) |
| -P, --progress | | show progress bar |
| --ai-surrogate | | activates the AI surrogate chemistry model (defaults to OFF) |
| --dht | | enables DHT usage (defaults to OFF) |
| --qs | | store results using qs::qsave() (.qs extension) instead of the default RDS (.rds) |
| --dht-strategy= | 0-1 | change DHT strategy. NOT IMPLEMENTED YET (defaults to 0) |
| --dht-size= | 1-n | size of the DHT per involved process in megabytes (defaults to 1000 MByte) |
| --dht-snaps= | 0-2 | disable or enable storage of DHT snapshots |
| --dht-file= | <SNAPSHOT> | initializes the DHT with the given snapshot file |
| --interp-size | 1-n | size of the PHT (interpolation) per process in megabytes |
| --interp-bucket-entries | 1-n | maximum number of entries to store in one PHT bucket |
| --interp-min | 1-n | number of entries in a PHT bucket needed to start interpolation |
#### Additions to dht-snaps
The following values can be set:
- 0 = snapshots are disabled
- 1 = only stores a snapshot at the end of the simulation with the name `<OUTPUT_DIRECTORY>.dht`
- 2 = stores a snapshot at the end and after each iteration; iteration snapshot files are stored in `<DIRECTORY>/iter<n>.dht`
### Example: Running from scratch
We will continue the above example and start a simulation with barite_het, whose simulation files can be found in `<INSTALL_DIR>/share/poet/barite/barite_het*`. A heterogeneous diffusion is used as transport. It is a small 2D grid of 2x5 cells, simulating 50 time steps with a time step size of 100 seconds. To start the simulation with 4 processes, cd into your previously installed POET directory `<POET_INSTALL_DIR>/bin` and run:
```
cp ../share/poet/barite/barite_het* .
mpirun -n 4 ./poet barite_het_rt.R barite_het.rds output
```
After the simulation has finished, all data generated by POET can be found in the directory output.
You might want to use the DHT to cache previously simulated data and reuse it in further time steps. Just append `--dht` to the options of POET to activate the DHT. To additionally produce a DHT snapshot after each iteration, append the `--dht-snaps=<value>` option. The resulting call would look like this:
```
mpirun -n 4 ./poet --dht --dht-snaps=2 barite_het_rt.R barite_het.rds output
```
### Example: Preparing Environment and Running with AI surrogate
To run the AI surrogate, you need to have Keras installed in your Python environment. The implementation in POET is agnostic to the exact Keras version, but the provided model file must match your Keras version. Using Keras 3 with `.keras` model files is recommended. The compilation process of POET remains mostly the same as shown above, but the CMake option `-DUSE_AI_SURROGATE=ON` must be set.
To use the AI surrogate, several values must be declared in the R input script. This can be done either directly in the input script or in an additional file. This file can be provided by adding its path as the element `ai_surrogate_input_script` to the `chemistry_setup` list in the R input script.
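For illustration, here is a minimal sketch of such a reference; the file name and the elided list elements are assumptions, only the element name `ai_surrogate_input_script` is given by POET:
```r
# In the R input script: point POET to a separate file holding the
# AI surrogate declarations (file name is a placeholder).
chemistry_setup <- list(
  # ... other chemistry parameters of your model ...
  ai_surrogate_input_script = "ai_surrogate_setup.R"
)
```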
The following variables and functions must be declared:
- `model_file_path` [string]: Path to the Keras model file with which the AI surrogate model is initialized.
- `validate_predictions(predictors, prediction)` [function]: Must return a boolean vector of length `nrow(predictions)`. The output of this function defines which predictions are considered valid and which are rejected. The predictors and predictions are passed on their original (not transformed) scale. Regular simulation will only be done for the rejected values. The input data of the rejected rows and the respective true results from simulation will be added to the training data buffer of the AI surrogate model. This can, e.g., be implemented as a mass balance threshold between the predictors and the prediction (see the sketch below).
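A minimal sketch of these two required declarations, assuming hypothetical solute columns `Ba` and `SO4` in the data frames and a relative mass balance tolerance of `1e-3`:
```r
# Path to a pretrained Keras model (placeholder file name)
model_file_path <- "models/barite_surrogate.keras"

# Accept a prediction only if the total mass of the chosen components is
# conserved within a relative tolerance; rejected rows are recomputed with
# the regular chemistry solver and fed into the training data buffer.
validate_predictions <- function(predictors, prediction) {
  mass_in  <- predictors$Ba + predictors$SO4  # hypothetical columns
  mass_out <- prediction$Ba + prediction$SO4
  abs(mass_in - mass_out) / pmax(abs(mass_in), .Machine$double.eps) < 1e-3
}
```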
The following variables and functions can be declared:
- `batch_size` [int]: Batch size for the inference and training functions. Defaults to 2560.
- `training_epochs` [int]: Number of training epochs with each training data set. Defaults to 20.
- `training_data_size` [int]: Size of the training data buffer. After the buffer has been filled, the model starts training and removes this amount of data from the front of the buffer. Defaults to the size of the field.
- `use_Keras_predictions` [bool]: Decides whether the Keras prediction function should be used instead of the custom C++ implementation. Keras might be faster for larger models, especially on GPUs. The C++ inference function assumes that the Keras model is a standard feed-forward network with either 32- or 64-bit precision and ReLU activation. Any model that deviates from this architecture should activate the Keras prediction function to ensure correct calculation. Defaults to false.
- `disable_training` [bool]: Deactivates the training functions. Defaults to false.
- `train_only_invalid` [bool]: Use only the data from PHREEQC for training instead of the whole field (which might contain the model's own predictions). Defaults to false.
- `save_model_path` [string]: After each training step, the current model is saved to this path as a .keras file.
- `preprocess(df)` [function]: Returns the scaled/transformed data frame. The default implementation uses no scaling or transformations (see the sketch after this list).
- `postprocess(df)` [function]: Returns the rescaled/back-transformed data frame. The combination of preprocess() and postprocess() is expected to be idempotent. The default implementation uses no scaling or transformations.
- `assign_clusters(df)` [function]: Must return a vector of length `nrow(predictions)` that contains cluster labels as 0/1. According to these labels, two separate models will be used for inference and training. Cluster assignments can, e.g., be done for the reactive and non-reactive parts of the field.
- `model_reactive_file_path` [string]: Path to the Keras model file with which the AI surrogate model for the reactive cluster is initialized. If omitted, the models for both clusters will be initialized from `model_file_path`.
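As a further illustration, here are hedged sketches of a matching `preprocess()`/`postprocess()` pair and of `assign_clusters()`; the log transform and the clustering criterion are assumptions and must be adapted to the chemical system at hand:
```r
# Sketch: log10-transform all values, assuming strictly positive
# concentrations. postprocess(preprocess(df)) returns df, as required.
preprocess <- function(df) {
  log10(df)
}
postprocess <- function(df) {
  10^df
}

# Sketch: label rows as reactive (1) or non-reactive (0) based on a
# hypothetical mineral amount column; POET then uses two separate models.
assign_clusters <- function(df) {
  as.integer(df$Barite > 0)
}
```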
An example workflow to preprocess the benchmark and run POET with the AI surrogate could look like this:
```
cd <installation_dir>/bin
# copy the benchmark files to the installation directory
cp <project_root_dir>/bench/barite/{barite_50ai*,db_barite.dat,barite.pqi} .
# preprocess the benchmark
./poet_init barite_50ai.R
# run POET with AI surrogate and GPU utilization
srun --gres=gpu -N 1 -n 12 ./poet --ai-surrogate barite_50ai_rt.R barite_50ai.rds output
```
Keep in mind that the AI surrogate is currently not stable and might not produce any valid predictions.
## Defining a model
In order to provide a model to POET, you need to set up an R script which can then be used by `poet_init` to generate the simulation input. Which parameters are required can be found in the Wiki.
We try to keep the document up-to-date. However, if you encounter
missing information or need help, please get in touch with us via the
issue tracker or E-Mail.
`poet_init` can be used as follows:
```
./poet_init [-o, --output output_file] [-s, --setwd] <script.R>
```
where:
- output - name of the output file (defaults to the input file name with the extension `.rds`)
- setwd - set the working directory to the directory of the input file (e.g. to allow relative paths in the input script). However, the output file will be stored in the directory from which `poet_init` was called.
## About the usage of MPI_Wtime()
The implemented time measurement functions use `MPI_Wtime()`. Some important information from the OpenMPI man page:
> For example, on platforms that support it, the clock_gettime() function will be used to obtain a monotonic clock value with whatever precision is supported on that platform (e.g., nanoseconds).