Yes, with your approach the segmentation fault is resolved.
I am using a Mellanox ConnectX-5 adapter.
OS: CentOS 7.4
Does the topology below look good to you?
## nvidia-smi topo -m
        GPU0    mlx5_0  mlx5_1  CPU Affinity
GPU0     X      PHB     PHB     18-35
mlx5_0  PHB      X      PIX
mlx5_1  PHB     PIX      X
I am running the command below to check the latency:
mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_1 -mca -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D