Channel: Mellanox Interconnect Community: Message List

Re: RoCEv2 GID disappeared ?


scisoft13:~ % cat /sys/class/infiniband/mlx5_0/ports/1/gids/0

0000:0000:0000:0000:0000:0000:0000:0000

scisoft13:~ % cat /sys/class/infiniband/mlx5_0/ports/1/gids/1

0000:0000:0000:0000:0000:0000:0000:0000

 

scisoft13:~ % cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0

cat: /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0: Invalid argument

 

The same issue occurs on two servers: one with a ConnectX-5 EN 100Gb/s optical link and one with a ConnectX-3 40Gb/s copper link.

The OFED install completed without issues.
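For reference, the GID table of a RoCE port is populated from the IP/link-local addresses of the net device bound to that port, so all-zero entries can simply mean the interface is down or has no address. A rough check, where ens1f0 is only a placeholder for the netdev backing mlx5_0:

ip link set ens1f0 up
ip addr add 192.168.1.10/24 dev ens1f0     # example address only
cat /sys/class/infiniband/mlx5_0/ports/1/gids/0
cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0
show_gids                                  # MLNX_OFED helper that prints the whole GID table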


Re: RoCEv2 GID disappeared ?


Hi Raphael,

 

Thank you for the information. This looks like unexpected behaviour related to the driver and this specific operating system.

So that we can continue to investigate, please send an email to support@mellanox.com and open a support ticket with all the details.

 

Thank you,

Karen.

Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA


Thanks a lot for the reply. It solved the above issue, but after running mpirun I do not see any latency difference with and without GDR.

 

My questions:

  1. Why do I not see any latency difference with and without GDR?
  2. Is the sequence of steps below correct? And does it matter for question 1?

 

Note: I have a single GPU on both the host and the peer. The IOMMU is disabled.

## nvidia-smi topo -m

           GPU0    mlx5_0  mlx5_1  CPU Affinity

GPU0     X      PHB     PHB     18-35

mlx5_0  PHB      X      PIX

mlx5_1  PHB     PIX      X

 

Steps followed are:

1. Install CUDA 9.2 and add the library and bin paths to .bashrc

2. Install latest MLX OFED

3. Compile and Install nv_peer_mem driver

4. Get UCX from git, configure it with CUDA support, and install it

5. Configure Openmpi-3.1.1 and install it.

./configure --prefix=/usr/local --with-wrapper-ldflags=-Wl,-rpath,/lib --enable-orterun-prefix-by-default --disable-io-romio --enable-picky --with-cuda=/usr/local/cuda-9.2

6. Configure OSU Micro-Benchmarks 5.4.2 with CUDA support and install it

./configure --prefix=/root/osu_benchmarks CC=mpicc --enable-cuda --with-cuda=/usr/local/cuda-9.2

 

I then run mpirun, but I do not see any latency difference with and without GDR.
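Before comparing numbers, it may be worth sanity-checking that the GPUDirect pieces are actually active; a minimal sketch, assuming default install paths:

lsmod | grep nv_peer_mem            # the nv_peer_mem module must be loaded for GDR
ucx_info -d | grep -i cuda          # UCX should list CUDA transports (cuda_copy/gdr_copy) if built with CUDA
ompi_info --parsable | grep -i ucx  # confirms Open MPI was built with UCX support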

 

Thanks for your Help.

Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA


I'm not sure whether you resolved the signal 11 (segfault) problem my way.

For what it's worth, I compile Open MPI against my own UCX build:

./configure --prefix=/usr/local/openmpi-3.1.1 --with-wrapper-ldflags=-Wl,-rpath,/lib --disable-vt --enable-orterun-prefix-by-default --disable-io-romio --enable-picky --with-cuda=/usr/local/cuda --with-ucx=/opt/ucx-cuda --enable-mem-debug --enable-debug --enable-timing
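A UCX build matching the --with-ucx=/opt/ucx-cuda path above would be configured roughly like this (a sketch only; the gdrcopy location is an assumption and that option can be omitted):

./contrib/configure-release --prefix=/opt/ucx-cuda --with-cuda=/usr/local/cuda --with-gdrcopy=/usr/local
make -j$(nproc) && make install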

Actually, latency should be lower with GDR. What kind of NIC are you using, ConnectX-4 or ConnectX-3?

It would be great if you could share some test data and your test environment configuration.

Web interface error on SX6036


I am trying to set up an SX6036 VPI switch previously used at another institute. I've configured the mgmt interface and can connect to the web UI; however, it immediately gives the following error:

 

Internal Error

An internal error has occurred.

Your options from this point are:

See the logs for more details.

Return to the home page.

Retry the bad page which gave the error.

 

 

When I enable the logging monitor and try to log in, I see the following on the terminal:

 

Jul 23 11:34:29 ib-switch rh[5127]: [web.ERR]: web_include_template(), web_template.c:364, build 1: can't use empty string as operand of "!"

Jul 23 11:34:29 ib-switch rh[5127]: [web.ERR]: Error in template "status-logs" at line 545 of the generated TCL code

Jul 23 11:34:29 ib-switch rh[5127]: [web.ERR]: web_render_template(), web_template.c:226, build 1: Error code 14002 (assertion failed) returned

Jul 23 11:34:29 ib-switch rh[5127]: [web.ERR]: main(), rh_main.c:337, build 1: Error code 14002 (assertion failed) returned

Jul 23 11:34:29 ib-switch rh[5127]: [web.ERR]: Request handler failed with error code 14002: assertion failed

Jul 23 11:34:29 ib-switch httpd[4535]: [Mon Jul 23 11:34:29 2018] [error] [client 137.158.30.196] Exited with error code 14002: assertion failed, referer: http://ip.removed./admin/launch?script=rh&template=failure&badpage=%2Fadmin%2Flaunch%3Fscript%3Drh%26template%3Dstatus-logs

 

 

Any idea how to check what may have failed and how to fix it?
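For what it's worth, the same messages also seem to land in the event log reachable from the MLNX-OS CLI, and restarting the web service is one thing I plan to try; a rough sketch, since command availability may vary by MLNX-OS release:

switch > enable
switch # show log                  # the web.ERR entries above should also appear here
switch # configure terminal
switch (config) # no web enable    # assumption: disabling and re-enabling restarts the web server
switch (config) # web enable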

 

regards

Andrew

Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA


Yes, the segmentation fault was resolved using your approach.

I am using a "Mellanox ConnectX-5" adapter.

OS: CentOS 7.4

 

Does the topology below look good to you?

## nvidia-smi topo -m

           GPU0    mlx5_0  mlx5_1  CPU Affinity

GPU0     X      PHB     PHB     18-35

mlx5_0  PHB      X      PIX

mlx5_1  PHB     PIX      X

 

I am running the command below to check the latency:

mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -mca btl_openib_want_cuda_gdr 1 -np 2 -mca btl_openib_if_include mlx5_1 -bind-to core -cpu-set 23 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D
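For an explicit A/B comparison, the same command can be run twice with the GDR flag toggled; a sketch reusing the placeholders above:

# with GPUDirect RDMA
mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -np 2 -mca btl_openib_if_include mlx5_1 -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

# without GPUDirect RDMA (staging through host memory)
mpirun --allow-run-as-root -host LOCALNODE,REMOTENODE -np 2 -mca btl_openib_if_include mlx5_1 -mca btl_openib_want_cuda_gdr 0 -x CUDA_VISIBLE_DEVICES=0 /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D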

Re: MLNX+NVIDIA ASYNC GPUDirect - Segmentation fault: invalid permissions for mapped object running mpi with CUDA


PHB: connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU).

 

Based on the topology you posted, mlx5_0 and mlx5_1 are connected to GPU0 through a PCIe Host Bridge.

That means that even with GDR, the traffic flows from GPU0 to the local host and then to the NIC (mlx5_1) on the local node.

On the remote node, the traffic flows from the NIC (mlx5_1) to the host and then to GPU0.

In the non-GDR case, the GPU buffer is simply replaced by host memory (DDR); the traffic still flows through the host. Maybe that is why the results look the same.

What latency are you measuring in your tests?

 

Re: DPDK with MLX4 VF on Hyper-v VM


Hello Hui,

 

I have tested this internally and found that we currently do not officially support WinOF driver versions with DPDK (except for the WinOF driver for Azure).

You should use Azure VMs and contact Azure support to obtain the supported WinOF driver.

 

Thank you,

Karen.


InfiniBand amber port led flashing


We recently replaced our IB switches (both int-a and int-b) with a newer model, the SX6790 36-port IB switch.

After the replacement, we noticed some port LEDs showing abnormal/undefined behavior;

e.g. the port #8 LED of the int-b switch was flashing amber continuously (port #8 was connected to Node8/int-b).

We replaced the IB cable, and the port #8 LED then returned to normal (solid green).

Please tell us the meaning of a flashing amber LED, its possible impact and implications, and any fix/workaround.

ConnectX-3 RoCE mode


Hello,

 

Can someone confirm which RoCE mode I should use for my Windows Server 2016 S2D deployment? I am using ConnectX-3 (not Pro).

 

Thanks!


mst start fails with ConnectX-4 on ppc64le


Hi,

 

I'm trying to set up VFs using SR-IOV on a ppc64le machine.

 

$ lsb_release -a

No LSB modules are available.

Distributor ID: Ubuntu

Description:    Ubuntu 16.04.4 LTS

Release:        16.04

Codename:       xenial


$ uname -a

 

Linux p006n03 4.10.0-35-generic #39~16.04.1-Ubuntu SMP Wed Sep 13 08:59:44 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux

 

$ lspci | grep Mellanox

0000:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

0040:01:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]

 

First I installed the MLNX_OFED driver following the steps here: https://community.mellanox.com/docs/DOC-2688

Then I installed the latest MFT (4.10.0) for ppc64le from here: http://www.mellanox.com/page/management_tools

 

Running "mst start" subsequently fails however

 

$ sudo mst start

Starting MST (Mellanox Software Tools) driver set

Loading MST PCI module - Success

Loading MST PCI configuration module - Success

Create devices

/usr/bin/mst: line 382: 13070 Segmentation fault      (core dumped) ${mbindir}/minit $fullname ${busdevfn} 88 92

cat: /dev/mst/mt4115_pci_cr0: No such file or directory

/usr/bin/mst: line 382: 13132 Segmentation fault      (core dumped) ${mbindir}/minit $fullname ${busdevfn} 88 92

cat: /dev/mst/mt4115_pci_cr1: No such file or directory

Unloading MST PCI module (unused) - Success

 

Unloading MST PCI configuration module (unused) - Success

 

What could be the reason for this error?

 

I ultimately want to enable VFs on the ConnectX-4 following the steps here: https://community.mellanox.com/docs/DOC-2386, but I cannot proceed due to this error.
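As a possible workaround sketch (not a confirmed fix), mlxconfig from MFT can usually address the device by its PCI BDF directly, without the /dev/mst nodes that minit failed to create, so the SR-IOV firmware settings might still be reachable that way; the VF count below is only an example:

sudo mlxconfig -d 0000:01:00.0 query
sudo mlxconfig -d 0000:01:00.0 set SRIOV_EN=1 NUM_OF_VFS=4
# firmware configuration changes only take effect after a reboot / power cycle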


Re: Yocto embedded build of rdma-core


The solution to this problem was to use the recipes incorporated in the updated OpenEmbedded build. About a month ago, rdma-core was added to the mainline tree. We had been trying to get this working ourselves by writing our own recipes; now that the code is integrated, it just builds.
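For anyone following along, with the updated layers in place the image only needs to pull the upstream recipe in; a minimal sketch for conf/local.conf, assuming the layer providing rdma-core is already listed in bblayers.conf:

IMAGE_INSTALL_append = " rdma-core"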


rxe driver does not support kernel ABI


I'm getting a small error when I try to do an rping test. I'm building rxe into kernel 4.16 and rdma-core using Yocto on an Arria 10 SoC FPGA containing a dual-core A53 ARM processor. I get the kernel modules and userland loaded:

 

root@arria10:~# lsmod | grep rxe
rdma_rxe 102400 0
ib_core 192512 6 rdma_rxe,ib_cm,rdma_cm,ib_uverbs,iw_cm,rdma_ucm

 

I can configure the rxe0 device but rxe_cfg is giving a strange error:

 

root@arria10:~# rxe_cfg
libibverbs: Warning: Driver rxe does not support the kernel ABI of 1 (supports 2 to 2) for device /sys/class/infiniband/rxe0
IB device 'rxe0' wasn't found
Name Link Driver Speed NMTU IPv4_addr RDEV RMTU
eth0 yes  st_gmac      1500 10.0.1.24 rxe0 (?)

 

Any hints on what this means, i.e. the kernel ABI error, would be appreciated!

 

Thanks,

FM

Re: rxe driver does not support kernel ABI


After setting up the Yocto build to include the various rdma-core modules according to Yocto practices, this error went away.

Re: rxe driver does not support kernel ABI


It's back. For some reason I keep getting this warning:

libibverbs: Warning: Driver rxe does not support the kernel ABI of 1 (supports 2 to 2) for device /sys/class/infiniband/rxe0
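The warning means the rxe provider shipped in the installed rdma-core expects a newer rxe kernel ABI (2) than the running kernel module reports (1), so the kernel-side rxe and the userspace rdma-core on the target need to come from matching generations. Rough checks, assuming rxe0 is backed by uverbs0:

cat /sys/class/infiniband_verbs/uverbs0/abi_version   # generic uverbs ABI exported by the kernel
ibv_devinfo -d rxe0                                    # confirms which device and provider library answer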

Re: ConnectX-3 RoCE mode


Karen,

 

Thanks for the reply and the reference doc.
