PHB: Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
Based on the topology you posted, both mlx5_1 and mlx5_0 are connected to GPU0 through a PCIe Host Bridge.
That means that even with GDR (GPUDirect RDMA), traffic on the local node goes from GPU0 to the host and then to the NIC (mlx5_1).
On the remote node, traffic goes from the NIC (mlx5_1) to the host and then to GPU0.
Without GDR, the only difference is that host memory (DDR) takes the place of the GPU. The traffic still passes through the host, which may be why the two results look the same.
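If it helps, here is a rough Python sketch of how one could check this from a topology matrix. The matrix below is a simplified, hypothetical sample that only mirrors the PHB links described above; it is not your actual output, and real `nvidia-smi topo -m` output has extra columns (CPU affinity and so on) that you would need to trim first. The helper names `parse_topo` and `crosses_host_bridge` are made up for illustration.

```python
# Hypothetical, trimmed topology matrix (NOT real output from any machine).
SAMPLE = """
        GPU0  mlx5_0  mlx5_1
GPU0    X     PHB     PHB
mlx5_0  PHB   X       PIX
mlx5_1  PHB   PIX     X
"""

# Link types from the nvidia-smi legend whose path goes through the
# PCIe Host Bridge (PHB) or further out (NODE, SYS).
VIA_HOST = {"PHB", "NODE", "SYS"}


def parse_topo(text):
    """Parse a whitespace-separated matrix into {(row, col): link}."""
    rows = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header = rows[0]  # column device names
    links = {}
    for row in rows[1:]:
        name, cells = row[0], row[1:1 + len(header)]
        for col, link in zip(header, cells):
            links[(name, col)] = link
    return links


def crosses_host_bridge(link):
    """True when the link type implies the path passes through the host bridge."""
    return link in VIA_HOST


if __name__ == "__main__":
    links = parse_topo(SAMPLE)
    for nic in ("mlx5_0", "mlx5_1"):
        link = links[("GPU0", nic)]
        print(f"GPU0 <-> {nic}: {link}, via host bridge: {crosses_host_bridge(link)}")
```

A PIX or PXB link between the GPU and the NIC would instead mean the two sit under the same PCIe switch (or set of switches), so the path would not need to cross the host bridge.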
What latency are you measuring in your test?