CCLM simulations fail on Mistral - floating point exception C

Added by Edoardo Mazza over 3 years ago

Dear colleagues,

I have been trying to run a 1-day test simulation with CCLM cosmo4.8_clm19 on Mistral, for the first time since Blizzard was retired.
I am running the CCLM with an almost standard configuration for the 0.0625° horizontal resolution that I have successfully employed in several experiments on Blizzard.

I have modified the batch script as suggested here http://redc.clm-community.eu/projects/cclmdkrz/wiki/Run-scripts.

After a series of minor problems, which I solved thanks to the error messages in the .out and .err files, I have come to a dead end.
Now, when I submit my job with sbatch, the simulation runs for a few seconds, produces the lffd1996100400c.nc file and exits without leaving any error message in the .out file. However, in the .err file I get multiple errors of this form:

56: [m10393:41443:0] Caught signal 8 (Floating point exception)

56: ==== backtrace ====
56: 2 0x00000000000548cc mxm_handle_error() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-268-gcc-OFED-3.12-redhat6.4/mxm-master/src/mxm/util/debug/debug.c:641
56: 3 0x0000000000054a3c mxm_error_signal_handler() /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u4-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-268-gcc-OFED-3.12-redhat6.4/mxm-master/src/mxm/util/debug/debug.c:616
56: 4 0x00000000000326a0 killpg() ??:0
56: 5 0x00000000002dabd5 pow.L() ??:0
56: 6 0x000000000001ed5d __libc_start_main() ??:0

srun: error: m10393: tasks 40,45-50,52-59: Floating point exception
srun: Terminating job step 2108157.0
00: slurmstepd: * STEP 2108157.0 ON m10314 CANCELLED AT 2016-03-15T20:09:55 *
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
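
To see exactly which MPI tasks were affected, the compact task list in the srun error line can be expanded. The following is only a minimal sketch in Python: the function name `parse_srun_error` and the regular expression are my own assumptions, based solely on the line format shown above, and are not part of Slurm or CCLM.

```python
import re

# Assumed line format, taken from the log above:
#   srun: error: <node>: tasks <list>: <reason>
SRUN_ERROR = re.compile(
    r"^srun: error: (?P<node>\S+): tasks? (?P<tasks>[\d,\-]+): (?P<reason>.+)$"
)


def parse_srun_error(line):
    """Return (node, sorted task ids, reason), or None if the line does not match."""
    match = SRUN_ERROR.match(line.strip())
    if match is None:
        return None
    tasks = []
    for part in match.group("tasks").split(","):
        if "-" in part:
            # A range such as "45-50" is inclusive on both ends.
            first, last = part.split("-")
            tasks.extend(range(int(first), int(last) + 1))
        else:
            tasks.append(int(part))
    return match.group("node"), sorted(tasks), match.group("reason")


if __name__ == "__main__":
    line = "srun: error: m10393: tasks 40,45-50,52-59: Floating point exception"
    print(parse_srun_error(line))
```

Applied to the line above, this lists 15 failing tasks, all on node m10393.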

The model sources have been compiled correctly and have been used successfully by another member of our DKRZ account. There seems to be a floating point exception whose cause I cannot work out.

Has any of you ever encountered such problems, or do you have a clue as to what might be causing this?

I have attached my batch script, my .err and .out files along with the YUSPECIF, the YUDEBUG and the YUCHKDAT.

Your help would be greatly appreciated.

Best,

Edoardo Mazza

run_cclm_mistral_default (8.38 KB): batch script
testurn_slurm.o2108157.out (15.1 KB): output file
testrun_slurm.o2108157.err (74.7 KB): error file, where the floating point exception appears
YUSPECIF (28.4 KB)
YUDEBUG (50.5 KB)
YUCHKDAT (54.6 KB)

Replies (6)

RE: CCLM simulations fail on Mistral - floating point exception C - Added by Hans-Juergen Panitz over 3 years ago

Dear Edoardo,

Having taken a first quick look at your YUCHKDAT, I would say that something is wrong with your forcing data.
Look at your T_SO values in the deeper layers.
They become very small and even negative, although the unit of T_SO is Kelvin!
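
Such a plausibility check can also be scripted. The following is only a minimal sketch in Python: it assumes the T_SO values have already been read from the forcing file into one list per soil layer (the reading step is left out), and the function name and the bounds of 150 K and 350 K are arbitrary assumptions for illustration, not limits used by CCLM or INT2LM.

```python
import math


def implausible_soil_layers(t_so_by_layer, t_min=150.0, t_max=350.0):
    """Return the indices of soil layers with values outside [t_min, t_max] K.

    t_so_by_layer: one sequence of temperatures (in Kelvin) per soil layer.
    t_min, t_max:  assumed plausibility bounds, chosen only for illustration.
    NaN values are also reported as implausible.
    """
    bad_layers = []
    for index, values in enumerate(t_so_by_layer):
        if any(math.isnan(v) or not (t_min <= v <= t_max) for v in values):
            bad_layers.append(index)
    return bad_layers


if __name__ == "__main__":
    # Made-up example values: layers 1 and 2 contain non-physical temperatures.
    layers = [[280.0, 285.5], [279.0, 12.3], [-4.0, 281.0]]
    print(implausible_soil_layers(layers))
```

Any layer reported here contains at least one value that cannot be a soil temperature in Kelvin, which points to the forcing data rather than to the model run itself.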

Furthermore I saw in your YUSPECIF that you run the model in NWP mode, not in climate mode (lbdclim=.FALSE.).
Is this what you want to do?

Hans-Juergen

RE: CCLM simulations fail on Mistral - floating point exception C - Added by Edoardo Mazza over 3 years ago

Dear Hans-Juergen,

Thank you very much for your support, and sorry for the late reply; it took me a few days to get to the root of the problem.
I agree that there is something wrong with those temperatures, so I went back to the previous downscaling step to see where these odd values came from.

I wanted to repeat the simulation driven by ERA-Interim data obtained from the DKRZ directory /pool/data/CCLM/reanalyses/ERAInterim. I adapted the run_int2lm script for the gcm2cclm case. Again, I wanted to test that it worked for 24 hours.

Unfortunately, the situation does not seem to have changed at all. The “floating point exception” error still causes the program to quit, so the problem seems to go beyond the T_SO values. I am losing track of what the problem actually is. I have checked and double-checked, but clearly something is wrong that I cannot find.

Please find attached the run_int2lm, the .out, YUCHKDAT, INPUT, OUTPUT and YUDEBUG files.

Best wishes,

Edoardo

RE: CCLM simulations fail on Mistral - floating point exception C - Added by Hans-Juergen Panitz over 3 years ago

Dear Edoardo,

Did you realize that the ERA-Interim data (caf-files) in /pool/data/CCLM/reanalyses are NetCDF4-compressed?
There is a README in the ERAINT directory that says so.
Perhaps that is your problem.
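
If in doubt, the on-disk format of a caf-file can be checked from its first bytes. The following is only a minimal sketch in Python: classic NetCDF files start with the bytes "CDF" followed by a version byte, while NetCDF4 files are HDF5 files and normally start with the HDF5 signature. The function name `netcdf_kind` is my own choice, and the check only identifies the container format; it cannot tell whether compression was actually applied to the variables.

```python
# HDF5 signature, normally found at the very start of a NetCDF4 file.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"

# Version byte that follows the "CDF" magic in classic-format files.
CLASSIC_VERSIONS = {1: "classic", 2: "64-bit offset", 5: "cdf5"}


def netcdf_kind(path):
    """Guess the NetCDF container format of a file from its first 8 bytes."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(HDF5_SIGNATURE):
        return "netcdf4/hdf5"
    if len(head) >= 4 and head[:3] == b"CDF":
        # Indexing a bytes object yields an int in Python 3.
        return CLASSIC_VERSIONS.get(head[3], "unknown")
    return "unknown"


if __name__ == "__main__":
    import sys

    for name in sys.argv[1:]:
        print(name, netcdf_kind(name))
```

A result of "netcdf4/hdf5" means the file can only be read by an INT2LM that is linked against a NetCDF library with NetCDF4/HDF5 support.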

Alternatively, you can try the uncompressed caf-files that are available from my workspace:
/work/bb0849/b364034/ERAINT/CCLM_Forcing_Data/

Furthermore, the ERAINT data contain T_SKIN, so set “luse_t_skin=.TRUE.”.
Of course, also consider Burkhardt’s suggestion “lprog_qi=.TRUE.”, since QI is available as well.

Hans-Juergen

RE: CCLM simulations fail on Mistral - floating point exception C - Added by Eva Nowatzki 7 months ago

Dear all,
I recently encountered a very similar problem. I get the following messages:

239: [m10063:37251:0] Caught signal 8 (Floating point exception)
248: ==== backtrace ====
248:  2 0x000000000005767c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.9.7-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
248:  3 0x00000000000577ec mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.9.7-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
248:  4 0x0000000000032510 killpg()  ??:0
248:  5 0x00000000008606ab src_soil_multlay_mp_terra_multlay_()  ??:0
248:  6 0x000000000055e4bf organize_physics_()  ??:0
248:  7 0x000000000058d900 MAIN__()  ??:0
248:  8 0x00000000004052fe main()  ??:0
248:  9 0x000000000001ed1d __libc_start_main()  ??:0
248: 10 0x00000000004051f9 _start()  ??:0
248: ===================

I already tried to decompress the ERA-Interim data and I also considered the hints you gave before, but it still doesn’t work.
I would be very grateful for help.
Thank you very much and best regards,
Eva

RE: CCLM simulations fail on Mistral - floating point exception C - Added by Eva Nowatzki 7 months ago

Dear all,
By changing the INT2LM I was able to solve the problem I posted before, but now a new, somewhat similar error appears:

  0:  OPEN: ncdf-file:
  0:  /scratch/b/b380794/Ref_run/output/cclm/1999_01/out01/lffd1999010100.nc
  0:  CLOSING ncdf FILE
  0:  OPEN: ncdf-file:
  0:  /scratch/b/b380794/Ref_run/output/cclm/1999_01/out01/lffd1999010100.nc
  0:  CLOSING ncdf FILE
  0:  OPEN: ncdf-file:
  0:  /scratch/b/b380794/Ref_run/output/cclm/1999_01/out01/lffd1999010100z.nc
  0:  CLOSING ncdf FILE
  0:  OPEN: ncdf-file:
  0:  /scratch/b/b380794/Ref_run/output/cclm/1999_01/out01/lffd1999010100p.nc
  0:  CLOSING ncdf FILE
  0:  OPEN: ncdf-file:
  0:  /scratch/b/b380794/Ref_run/output/cclm/1999_01/out01/lffd1999010100.nc
  0:   smoothing pmsl over mountainous terrain
  0:  CLOSING ncdf FILE
  0:  OPEN: ncdf-file:
  0:  /scratch/b/b380794/Ref_run/output/cclm/1999_01/out01/lffd1999010100.nc
549: [m11510:32171:0] Caught signal 11 (Segmentation fault)
 77: [m11397:46159:0] Caught signal 11 (Segmentation fault)
...
  2: [m11394:43474:0] Caught signal 11 (Segmentation fault)
  0:  CLOSING ncdf FILE
  0: [m11394:43472:0] Caught signal 11 (Segmentation fault)
...
380: [m11503:12975:0] Caught signal 11 (Segmentation fault)
 28: ==== backtrace ====
 28:  2 0x000000000005767c mxm_handle_error()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.9.7-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:641
 28:  3 0x00000000000577ec mxm_error_signal_handler()  /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u7-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.9.7-gcc-OFED-3.18-redhat6.7-x86_64/mxm-v3.6/src/mxm/util/debug/debug.c:616
 28:  4 0x0000000000032510 killpg()  ??:0
 28:  5 0x000000000053343b organize_data_()  ??:0
 28:  6 0x000000000058de82 MAIN__()  ??:0
 28:  7 0x00000000004052fe main()  ??:0
 28:  8 0x000000000001ed1d __libc_start_main()  ??:0
 28:  9 0x00000000004051f9 _start()  ??:0
 28: ===================

Could anyone please help me with this problem? I would be very grateful for help.
Thank you very much and best regards,
Eva Nowatzki
