Hi,
According to Microsoft, in Windows Server 2016 RTM (Hyper-V), when I create a vSwitch with Switch Embedded Teaming I should be able to use SR-IOV for VMs.
Is this currently supported with ConnectX-4 and WinOF-2? (v1.50 at this time.)
Hi Viki,
Thanks for sharing the useful tip. Much obliged. Learned some new things in a New Year
Best,
Chin
The solution was to remove MLNX_OFED and use the distribution's drivers/kernel modules.
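On CentOS/RHEL that amounted to roughly the following (a sketch; the group name is the CentOS/RHEL one and may differ on other distributions):
/usr/sbin/ofed_uninstall.sh               # remove the MLNX_OFED stack
yum -y groupinstall "Infiniband Support"  # reinstall the distribution RDMA stack
reboot                                    # load the in-tree mlx4/mlx5 modules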
On my machine, perfquery -x returns 64-bit values for the port counters, but I am unable to determine where these counters are exposed in sysfs. For example, /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data is only a 32-bit value and is maxed out at 4294967295. According to the mlx5 docs there should be a counters_ext directory, but it is not present on my system. Is there a way to enable that with mlx4, or how else can I get the correct value?
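For reference, this is how I am comparing the two values:
# 32-bit sysfs counter, pegged at 2^32 - 1:
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data
# 64-bit extended counters read straight from the PMA
# (perfquery is part of infiniband-diags and defaults to the local port):
perfquery -x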
I have a dual-port ConnectX-3 (HP branded).
When I open Device Manager > System Devices > Mellanox NIC > Port Protocol, only Port 1 is available to change between IB, ETH and AUTO; Port 2 is greyed out.
When we installed the NIC in the server, both ports were set to IB, and we then changed both to ETH. Now I need to change them back, with no luck.
I have tried reinstalling the driver, changing the settings with MLXTOOL, and restoring the NIC to defaults with PowerShell.
Does anyone know what to do?
Thanks for the response,
What kind of performance hit should I expect on an SX1024 due to the packet fragmentation that happens during inter-VLAN routing?
This is a diagram of my current set up.
                +----------------+
                |  Linux Router  |
                |   ConnectX-3   |
                | port 1  port 2 |
                +----------------+
                   /          \
+---------------+ / A        A \ +---------------+
|    Host 1     |/              \|    Host 2     |
| ConnectX-4-LX |                | ConnectX-4-LX |
|    Port 1     |-              -|    Port 1     |
|    Port 2     |----------------|    Port 2     |
+---------------+        B       +---------------+
The Linux router has the ConnectX-3 (not Pro) card in Ethernet mode and uses a breakout cable (port 1 only) to connect to the ConnectX-4-LX cards at 10 Gb as path 'A'. The second ports of the ConnectX-4-LX cards are connected directly at 25 Gb as path 'B'. Hosts 1 & 2 are running CentOS 7.2 with kernel 3.10.0-327.36.3.el7.x86_64 and OFED 3.4; the Linux router is running CentOS 7.2 with a 4.9.0 kernel.
iSER and RDMA work fine over path 'B' and path 'A' (in either bridge or router mode), and now I want to add latency and drop packets to understand the effects. I'm using tc and netem to add the latency into the path. When I add 0.5 ms of latency in each direction, iSER slows to a crawl, throws errors in dmesg, and sometimes even causes the file system to go read-only. If I set the latency back to zero, things clear up and the full 10 Gb is achieved. iperf performs the same whether the latency is set to 0 or 0.5 ms in each direction. We would like to get RoCE to work over high-latency, high-bandwidth links. If someone has ideas on how to resolve this issue, I'd love to hear them.
Commands run on the router server:
for i in 2 3; do tc qdisc change dev eth${i} root netem delay .5ms; done
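(Note: tc qdisc change assumes a netem qdisc is already attached to the interface; on a fresh boot the root qdisc has to be added first:)
for i in 2 3; do tc qdisc add dev eth${i} root netem delay 0.5ms; done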
# brctl show
bridge name     bridge id               STP enabled     interfaces
rleblanc        8000.f452147ce541       no              eth2
                                                        eth3
The iSER target is a 100 GB RAM disk exported via iSER. I format the disk on the initiator with ext4 and then run this fio command:
echo "3" > /proc/sys/vm/drop_caches; fio --rw=read --bs=4K --size=1G --numjobs=40 --name=worker.matt --group_reporting
I see these messages on the initiator:
[25863.623453] 00000000 00000000 00000000 00000000
[25863.628564] 00000000 00000000 00000000 00000000
[25863.633634] 00000000 00000000 00000000 00000000
[25863.638619] 00000000 08007806 250003c7 0b0190d3
[25863.643593] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[25863.651180] connection40:0: detected conn error (1011)
[25874.368881] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[25874.375619] 00000000 00000000 00000000 00000000
[25874.380690] 00000000 00000000 00000000 00000000
[25874.385712] 00000000 00000000 00000000 00000000
[25874.390693] 00000000 08007806 250003c8 0501ddd3
[25874.395681] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[25874.403283] connection40:0: detected conn error (1011)
[25923.829903] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[25923.836663] 00000000 00000000 00000000 00000000
[25923.841724] 00000000 00000000 00000000 00000000
[25923.846752] 00000000 00000000 00000000 00000000
[25923.851733] 00000000 08007806 250003c9 510134d3
[25923.856709] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[25923.864308] connection40:0: detected conn error (1011)
[25943.184313] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[25943.191079] 00000000 00000000 00000000 00000000
[25943.196208] 00000000 00000000 00000000 00000000
[25943.201287] 00000000 00000000 00000000 00000000
[25943.206281] 00000000 08007806 250003ca 1afdbdd3
[25943.211272] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[25943.218901] connection40:0: detected conn error (1011)
[25962.538633] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[25962.545396] 00000000 00000000 00000000 00000000
[25962.550475] 00000000 00000000 00000000 00000000
[25962.555551] 00000000 00000000 00000000 00000000
[25962.560533] 00000000 08007806 250003cb 21012ed3
[25962.565526] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[25962.573155] connection40:0: detected conn error (1011)
[25973.291038] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[25973.297861] 00000000 00000000 00000000 00000000
[25973.302978] 00000000 00000000 00000000 00000000
[25973.308025] 00000000 00000000 00000000 00000000
[25973.313014] 00000000 08007806 250003cc 1901d2d3
[25973.318004] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[25973.325601] connection40:0: detected conn error (1011)
[26039.955899] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[26039.962690] 00000000 00000000 00000000 00000000
[26039.967825] 00000000 00000000 00000000 00000000
[26039.972894] 00000000 00000000 00000000 00000000
[26039.977891] 00000000 08007806 250003cd 850172d3
[26039.982905] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[26039.990512] connection40:0: detected conn error (1011)
[26067.411753] mlx5_warn:mlx5_1:dump_cqe:257:(pid 0): dump error cqe
[26067.418598] 00000000 00000000 00000000 00000000
[26067.423733] 00000000 00000000 00000000 00000000
[26067.428832] 00000000 00000000 00000000 00000000
[26067.433826] 00000000 08007806 250003ce 092977d3
[26067.438818] iser: iser_handle_wc: wr id ffffffffffffffff status 6 vend_err 78
[26067.446462] connection40:0: detected conn error (1011)
There are no messages on the target server.
Hello,
I have two Windows servers, directly connected with Mellanox ConnectX-3, with no switch infrastructure between them.
Is this a supported scenario with functioning RDMA/RoCE?
If so, do I still need to implement Data Center Bridging in Windows, or is that only required when switches are in play?
Starting from Windows 2012 and later (including Windows 2016, of course), all teaming drivers and support are within the Microsoft native OS NetLBFO.
Mellanox is not involved in providing the module, packets, etc., so it is up to Microsoft whether ConnectX-4 or any other adapter is in their support compatibility matrix.
See https://technet.microsoft.com/en-us/library/jj130849.aspx
See also the relevant MSDN documentation: Learn to Develop with Microsoft Developer Network | MSDN
Can anyone help with the above issue?
Hi Oskar,
Can you provide the contents of the logfile which is mentioned in the log /tmp/MLNX_OFED_LINUX-3.4-2.0.0.0-3.10.0-514.2.2.rt56.424.el7.x86_64/mlnx_ofed_iso.21610.log?
The logfile name should contain *.rpmbuild.log
Thanks and regards,
~Martijn
Aviap,
SR-IOV is not supported on a SET team.
If a device's PMA supports the extended port counters (which is the case for yours), it depends on which kernel is being used. There were recent kernel changes to utilize the optional PortCountersExtended attribute rather than the mandatory PortCounters, so you either need a recent kernel with these changes, or the relevant changes backported to an older kernel.
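A quick way to check on your system (a sketch; the device name is taken from your earlier message): on a kernel with these changes, the sysfs counter keeps growing past 4294967295 instead of pegging there.
uname -r   # needs a kernel recent enough to include the PortCountersExtended changes
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_data   # no longer saturates once supported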
Dear All,
Are you aware of any compatibility issues between Veritas LLT on Red Hat 6.8 and Mellanox? I see suspicious messages during system boot (no issues with the functioning of LLT have been noticed).
The LLT package is VRTSllt-6.2.1.500-RHEL6.x86_64 on Red Hat 6.8, kernel 2.6.32-642.4.2.el6.x86_64 (mlx4_en.ko 2.2-1 came with the kernel).
[nep179@prdctlscthdb01-20161206]$ sudo egrep "Nov 23(.)*kernel: llt" messages-20161127
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_create_cq
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_create_cq
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_resolve_addr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_resolve_addr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_dereg_mr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_dereg_mr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_reject
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_reject
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_disconnect
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_disconnect
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_resolve_route
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_resolve_route
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_bind_addr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_bind_addr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_create_qp
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_create_qp
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_destroy_cq
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_destroy_cq
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_create_id
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_create_id
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_listen
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_listen
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_destroy_qp
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_destroy_qp
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_get_dma_mr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_get_dma_mr
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_alloc_pd
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_alloc_pd
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_connect
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_connect
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_destroy_id
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_destroy_id
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_resize_cq
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_resize_cq
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol rdma_accept
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol rdma_accept
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: disagrees about version of symbol ib_dealloc_pd
Nov 23 15:09:36 prdctlscthdb01 kernel: llt: Unknown symbol ib_dealloc_pd
[nep179@prdctlscthdb01-20161206]$ sudo egrep "Nov 23(.)*kernel: llt" messages-20161127 | wc -l
38
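If it helps the analysis, this is how I compared the symbol CRCs that llt was built against with what the running kernel exports (a sketch; the llt.ko path below is a guess based on where VRTSllt installs):
find /lib/modules/$(uname -r) -name 'llt.ko*'
# CRC llt was linked against for one of the failing symbols:
modprobe --dump-modversions /lib/modules/$(uname -r)/veritas/vcs/llt.ko | grep -w rdma_connect
# CRC the running kernel exports for the same symbol:
zgrep -w rdma_connect /boot/symvers-$(uname -r).gz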
Thank you,
Aleksandr
Hi,
After rechecking this issue, I found that we do not support fragmentation on the switches; it should be done on the adapter.
Packets that arrive with a larger MTU and are directed to a smaller-MTU port will be dropped.
There is a DF (don't fragment) flag in the IP header that allows or forbids fragmenting a packet. Your problem occurs when running 9K packets from the storage to some of the 1500-MTU servers.
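You can see this from a host with a jumbo-size ping and the DF bit set (a sketch; 8972 = 9000 minus 28 bytes of IP/ICMP headers, and the address is just an example):
# DF set: a 9000-byte frame toward a 1500-MTU port cannot be fragmented and should be dropped:
ping -M do -s 8972 192.168.1.20
# A payload that fits in a 1500-byte MTU goes through:
ping -M do -s 1472 192.168.1.20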
See also:IPv4 - Wikipedia
Ophir.
I deleted the old reply to avoid confusion.
Fragmentation is not supported on the switch, so there is no issue with latency.
We are using CentOS 6.8 with kernel 2.6.32-642.11.1.el6.x86_64 (the latest available) and the CentOS mlx4 kernel modules (I tried using OFED, but it does not support NFSoRDMA).
Hi Oskar,
Are you able to provide us with the requested log?
Thanks and regards,
~Martijn
The changes for this are relatively recent and went into some 4.x kernel.
Hello Mads,
Please try the following:
1. Connect a cable to the second port and then try to change the port type (if you don't have a spare cable, you can connect a cable in loopback from port 1 to port 2).
2. Make sure that the other end is also configured to work in IB.
3. You can try changing the port type from PowerShell using our MFT tool:
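# LINK_TYPE values for mlxconfig: 1 = IB, 2 = ETH, 3 = AUTO;
# reboot the server afterwards so the new configuration is loaded.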
C:\Program Files\Mellanox\WinMFT>.\mst status
C:\Program Files\Mellanox\WinMFT> .\mlxconfig.exe -d <device id> set LINK_TYPE_P1=1
C:\Program Files\Mellanox\WinMFT> .\mlxconfig.exe -d <device id> set LINK_TYPE_P2=1
Thanks,
Karen.