Hi Sophie,
Please find my answers inline.
What OS, kernel and driver version are you using? (modinfo mlx4_core | grep -i version)
RHEL, kernel version 3.17
Have you seen and followed these documents:
HowTo Compile Linux Kernel for NVMe over Fabrics
HowTo Configure NVMe over Fabrics
[We are not referring to these documents.] (We are not working with the standard Linux NVMe over Fabrics drivers.)
What is the last trace generated in the messages file prior to the crash?
Are you getting the same result with any number of jobs above 4? (e.g. 5, 6, 7)
We are running 4 or more threads/jobs when we get into this situation.
vendor_err 87 reports that the RNR NACK retry count was exceeded, which terminates
the QP (receiver not ready (RNR) error).
In which situations should we expect the receiver to flag RNR?
Is there an OFED or mlx4 driver dependency on this?
Or does the receiver not have sufficient CPU cycles?