Yes, still having problems. Since my original post I'm now on my third switch (and the second Topspin 120), and I'm having the exact same issue. While I can plug two of the InfiniHost IIIs together and get a link light, when I plug them into the switch I get no link light. I also cannot get the ConnectX card to work with either, but I'm starting to suspect that's just a bad card. I just can't believe one person can have this much trouble with this stuff.
Re: New to infiniband, can't get a working connection.
Re: Re: New to infiniband, can't get a working connection.
No worries. Which OS are you using?
Is there any chance you could do stuff on CentOS/RHEL 6.4?
Asking that because it's what I'm super familiar with.
If you're ok with that, please install the CentOS/RHEL provided IB software, and also pciutils:
$ sudo yum groupinstall "Infiniband Support"
$ sudo yum install mstflint pciutils
$ sudo chkconfig rdma on
$ sudo service rdma start
Then let's do some basic info gathering so we know what we're dealing with.
- Run lspci -Qvvs on the ConnectX card and at least one of the InfiniHost IIIs, then post the results here
- Also query the firmware of both using mstflint
Example from a ConnectX card here. First I find out its PCI address in the box:
$ sudo lspci |grep Mell
01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
Then use lspci -Qvvs on that address, to retrieve all of the potentially useful info:
$ sudo lspci -Qvvs 01:00.0
01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)
Subsystem: Mellanox Technologies Device 0006
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 16
Region 0: Memory at f7c00000 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at f0000000 (64-bit, prefetchable) [size=8M]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] Vital Product Data
Product Name: Eagle DDR
Read-only fields:
[PN] Part number: 375-3549-01
[EC] Engineering changes: 51
[SN] Serial number: 1388FMH-0905400010
[V0] Vendor specific: PCIe x8
[RV] Reserved: checksum good, 0 byte(s) reserved
Read/write fields:
[V1] Vendor specific: N/A
[YA] Asset tag: N/A
[RW] Read-write area: 111 byte(s) free
End
Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-
Vector table: BAR=0 offset=0007c000
PBA: BAR=0 offset=0007d000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
ARICap: MFVC- ACS-, Next Function: 1
ARICtl: MFVC- ACS-, Function Group: 0
Kernel driver in use: mlx4_core
Kernel modules: mlx4_core
Note the highlighted bits (the VPD part/serial numbers and the PCIe link status). For ConnectX cards this stuff is useful. For my card it's showing a Sun part number, as it was originally a Sun-badged card (now reflashed to stock firmware). The PCI link is in x8 state too, which is good to confirm (if it wasn't, it would indicate a problem).
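If you just want to double-check the PCIe link state of a card without reading the full dump, filtering for the link lines works too (optional, same info as above):
$ sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'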
And the mstflint output example:
$ sudo mstflint -d 01:00.0 q
Image type: ConnectX
FW Version: 2.9.1000
Device ID: 25418
Description: Node Port1 Port2 Sys image
GUIDs: 0003ba000100edb8 0003ba000100edb9 0003ba000100edba 0003ba000100edbb
MACs: 0003ba00edb9 0003ba00edba
Board ID: (MT_04A0120002)
VSD:
PSID: MT_04A0120002
That tells us the firmware version on the card. Useful to know, as it might need upgrading (very easy to do).
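For reference, a firmware upgrade with mstflint generally looks like the sketch below; the image filename is only a placeholder, the real file has to match the card's PSID:
$ sudo mstflint -d 01:00.0 -i <fw-image-for-your-psid>.bin burn
$ sudo mstflint -d 01:00.0 q
The second command re-checks the version; the new firmware usually only becomes active after a reboot or driver reload.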
After you've pasted that info here, we can start figuring out if there's anything wrong with the basics first and fix them. Then we can move onto the next stuff.
(note - edited for typo fixes)
ESX 5.1.0 IPoIB NFS datastore freezes
Hello,
hopefully this is the right place to provide some information about a VMware IPoIB datastore freeze. We are testing a new VMware ESX 5.1.0 setup. Sadly we only have 3 ConnectX (gen 1) cards left; the ConnectX-2 ones are all in our production ESX 4.1 environment. We know this is not officially supported right now, but I want to make sure the error will not happen when we upgrade the production machines.
The hardware is:
Fujitsu RX300 S6 (Dual Intel X5670)
ConnectX MT25418 Firmware 2.9.1000
ESX 5.1.0 1117900
Mellanox driver 1.8.1
When copying data between VMs, all of a sudden the adapter freezes and the datastore is "lost". In the vmkernel log we see endless lines like the ones below:
2013-07-22T17:22:48.775Z cpu10:8202)<3>vmnic_ib1:ipoib_send:504: found skb where it does not belongtx_head = 3827020, tx_tail =3827020
2013-07-22T17:22:48.775Z cpu10:8202)<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0
2013-07-22T17:22:48.775Z cpu10:8202)Backtrace for current CPU #10, worldID=8202, ebp=0x41220029af68
2013-07-22T17:22:48.776Z cpu10:8202)0x41220029af68:[0x41802a310d59]ipoib_send@<None>#<None>+0x5d4 stack: 0xffffff, 0x0, 0x412410d4c948,
2013-07-22T17:22:48.777Z cpu10:8202)0x41220029b018:[0x41802a310d59]ipoib_send@<None>#<None>+0x5d4 stack: 0x41220029b088, 0x418029e0a55b
2013-07-22T17:22:48.777Z cpu10:8202)0x41220029b148:[0x41802a317160]ipoib_mcast_send@<None>#<None>+0xf7 stack: 0x41220029b188, 0x418029d
2013-07-22T17:22:48.778Z cpu10:8202)0x41220029b238:[0x41802a31dabf]ipoib_start_xmit@<None>#<None>+0x396 stack: 0x41220029b598, 0x412200
2013-07-22T17:22:48.778Z cpu10:8202)0x41220029b398:[0x41802a31ac3b]vmipoib_start_xmit@<None>#<None>+0x49a stack: 0x41000be0b880, 0x839e
2013-07-22T17:22:48.779Z cpu10:8202)0x41220029b468:[0x41802a16d8f0]DevStartTxImmediate@com.vmware.driverAPI#9.2+0x137 stack: 0x41220029
2013-07-22T17:22:48.779Z cpu10:8202)0x41220029b4d8:[0x418029d3470e]UplinkDevTransmit@vmkernel#nover+0x295 stack: 0x10787a40, 0x41220029
2013-07-22T17:22:48.780Z cpu10:8202)0x41220029b558:[0x418029dabbaa]NetSchedFIFORunLocked@vmkernel#nover+0x1a5 stack: 0xc0bd95300, 0x0,
2013-07-22T17:22:48.781Z cpu10:8202)0x41220029b5e8:[0x418029dabf57]NetSchedFIFOInput@vmkernel#nover+0x24e stack: 0x41220029b638, 0x4180
2013-07-22T17:22:48.781Z cpu10:8202)0x41220029b698:[0x418029dab0b2]NetSchedInput@vmkernel#nover+0x191 stack: 0x41220029b748, 0x41000bd9
2013-07-22T17:22:48.782Z cpu10:8202)0x41220029b738:[0x418029d3ced0]IOChain_Resume@vmkernel#nover+0x247 stack: 0x41220029b798, 0x418029d
2013-07-22T17:22:48.782Z cpu10:8202)0x41220029b788:[0x418029d2c0e4]PortOutput@vmkernel#nover+0xe3 stack: 0x41220029b808, 0x41802a216a2a
2013-07-22T17:22:48.783Z cpu10:8202)0x41220029b808:[0x41802a2254c8]TeamES_Output@<None>#<None>+0x16b stack: 0x0, 0x418029cc3879, 0x4122
2013-07-22T17:22:48.784Z cpu10:8202)0x41220029ba08:[0x41802a218047]EtherswitchPortDispatch@<None>#<None>+0x142a stack: 0xffffffff000000
2013-07-22T17:22:48.784Z cpu10:8202)0x41220029ba78:[0x418029d2b2c7]Port_InputResume@vmkernel#nover+0x146 stack: 0x410001553540, 0x41220
2013-07-22T17:22:48.785Z cpu10:8202)0x41220029baa8:[0x41802a3b95cb]TcpipTxDispatch@<None>#<None>+0x9a stack: 0x7c1f45, 0x41220029bad8,
2013-07-22T17:22:48.785Z cpu10:8202)0x41220029bb28:[0x41802a3ba118]TcpipDispatch@<None>#<None>+0x1c7 stack: 0x246, 0x41220029bb70, 0x41
2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bca8:[0x418029d0b245]WorldletProcessQueue@vmkernel#nover+0x4b0 stack: 0x41220029bd58, 0xb
2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bce8:[0x418029d0b895]WorldletBHHandler@vmkernel#nover+0x60 stack: 0x100000000000001, 0x41
2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bd68:[0x418029c2083a]BH_Check@vmkernel#nover+0x185 stack: 0x41220029be68, 0x41220029be08,
2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be68:[0x418029dbc9bc]CpuSchedIdleLoopInt@vmkernel#nover+0x13b stack: 0x41220029be98, 0x41
2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be78:[0x418029dc66de]CpuSched_IdleLoop@vmkernel#nover+0x15 stack: 0xa, 0x14, 0x41220029bf
2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be98:[0x418029c4f71e]Init_SlaveIdle@vmkernel#nover+0x49 stack: 0x0, 0x0, 0x0, 0x0, 0x0
2013-07-22T17:22:48.788Z cpu10:8202)0x41220029bfe8:[0x418029ee26a6]SMPSlaveIdle@vmkernel#nover+0x31d stack: 0x0, 0x0, 0x0, 0x0, 0x0
Any help is appreciated.
Best regards.
Markus
Management command failed in KVM for SR-IOV
Hi,
This is my fourth day of fighting with SR-IOV and KVM.
I can ping from the VM to another IPoIB computer, but when I try to use the ibnetdiscover command I get a SIGSEGV:
ibnetdiscover
src/query_smp.c:98; send failed; -5
#
# Topology file: generated on Fri Jul 19 19:28:24 2013
#
Segmentation fault (core dumped)
Most of the IB commands fail; dmesg shows:
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:3 in_param 0x29f3a000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
mlx4_core 0000:04:00.0: vhcr command SET_PORT (0xc) slave:3 in_param 0x29f3a000 in_mod=0x1, op_mod=0x0 failed with error:0, status -22
mlx4_core 0000:04:00.0: slave 3 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
It looks like the MAD_IFC firmware command is failing in the device for some reason, but I have no idea about the cause. Possibly this part of the code is related:
+ if (slave != dev->caps.function &&
+ ((smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) ||
+ (smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&
+ smp->method == IB_MGMT_METHOD_SET))) {
+ mlx4_err(dev, "slave %d is trying to execute a Subnet MGMT MAD, "
+ "class 0x%x, method 0x%x for attr 0x%x. Rejecting\n",
+ slave, smp->method, smp->mgmt_class,
+ be16_to_cpu(smp->attr_id));
+ return -EPERM;
+ }
from
+static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int slave,
+ struct mlx4_vhcr *vhcr,
+ struct mlx4_cmd_mailbox *inbox,
+ struct mlx4_cmd_mailbox *outbox,
+ struct mlx4_cmd_info *cmd)
Please find below some details about my build.
I'd really appreciate it if anybody could point me in the right direction, or even better, help me fix the issue.
Thanks in advance
Marcin
Host:
-----
Motherboard: Supermicro X9DRI-F
CPUs: 2x E5-2640
System: CentOS 6.3:2.6.32-279.el6.x86_64 and CentOS 6.4 2.6.32-358.el6.x86_64
Infiniband: Mellanox Technologies MT27500 Family [ConnectX-3], MCX354A-QCB
Mellanox OFED: MLNX_OFED_LINUX-2.0-2.0.5-rhel6.3-x86_64
qemu-kvm.x86_64 2:0.12.1.2-2.355.el6
#lspci | grep Mel
04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
04:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.3 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.4 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.5 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.6 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:00.7 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
04:01.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
#dmesg | grep mlx4
mlx4_core: Mellanox ConnectX core driver v1.1 (Apr 23 2013)
mlx4_core: Initializing 0000:04:00.0
mlx4_core 0000:04:00.0: PCI INT A -> GSI 32 (level, low) -> IRQ 32
mlx4_core 0000:04:00.0: setting latency timer to 64
mlx4_core 0000:04:00.0: Enabling SR-IOV with 5 VFs
mlx4_core 0000:04:00.0: Running in master mode
mlx4_core 0000:04:00.0: irq 109 for MSI/MSI-X
mlx4_core 0000:04:00.0: irq 110 for MSI/MSI-X
mlx4_core 0000:04:00.0: irq 111 for MSI/MSI-X
mlx4_core 0000:04:00.0: irq 112 for MSI/MSI-X
mlx4_core: Initializing 0000:04:00.1
mlx4_core 0000:04:00.1: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.1: setting latency timer to 64
mlx4_core 0000:04:00.1: Detected virtual function - running in slave mode
mlx4_core 0000:04:00.1: Sending reset
mlx4_core 0000:04:00.0: Received reset from slave:1
mlx4_core 0000:04:00.1: Sending vhcr0
mlx4_core 0000:04:00.1: HCA minimum page size:512
mlx4_core 0000:04:00.1: irq 113 for MSI/MSI-X
mlx4_core 0000:04:00.1: irq 114 for MSI/MSI-X
mlx4_core 0000:04:00.1: irq 115 for MSI/MSI-X
mlx4_core 0000:04:00.1: irq 116 for MSI/MSI-X
mlx4_core: Initializing 0000:04:00.2
mlx4_core 0000:04:00.2: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.2: setting latency timer to 64
mlx4_core 0000:04:00.2: Skipping virtual function:2
mlx4_core: Initializing 0000:04:00.3
mlx4_core 0000:04:00.3: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.3: setting latency timer to 64
mlx4_core 0000:04:00.3: Skipping virtual function:3
mlx4_core: Initializing 0000:04:00.4
mlx4_core 0000:04:00.4: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.4: setting latency timer to 64
mlx4_core 0000:04:00.4: Skipping virtual function:4
mlx4_core: Initializing 0000:04:00.5
mlx4_core 0000:04:00.5: enabling device (0000 -> 0002)
mlx4_core 0000:04:00.5: setting latency timer to 64
mlx4_core 0000:04:00.5: Skipping virtual function:5
<mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (Apr 23 2013)
mlx4_core 0000:04:00.0: mlx4_ib: multi-function enabled
mlx4_core 0000:04:00.0: mlx4_ib: initializing demux service for 80 qp1 clients
mlx4_core 0000:04:00.1: mlx4_ib: multi-function enabled
mlx4_core 0000:04:00.1: mlx4_ib: operating in qp1 tunnel mode
mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.1 (Apr 23 2013)
mlx4_en 0000:04:00.0: Activating port:2
mlx4_en: eth2: Using 216 TX rings
mlx4_en: eth2: Using 4 RX rings
mlx4_en: eth2: Initializing port
mlx4_en 0000:04:00.1: Activating port:2
mlx4_en: eth3: Using 216 TX rings
mlx4_en: eth3: Using 4 RX rings
mlx4_en: eth3: Initializing port
mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is up
mlx4_core 0000:04:00.0: Received reset from slave:2
mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x106a10000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is up
mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x119079000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is down
mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is down
mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is up
mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is up
# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.500
node_guid: 0002:c903:00a2:8fb0
sys_image_guid: 0002:c903:00a2:8fb3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090110018
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
#cat /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=8 port_type_array=1,1 probe_vf=1
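For context, a VF is normally handed to a KVM guest through standard libvirt PCI device assignment, roughly along these lines (the guest name and PCI address below are placeholders, not taken from this setup):
# virsh nodedev-detach pci_0000_04_00_3
# virsh attach-device CentOS64 vf-hostdev.xml --config
where vf-hostdev.xml contains a <hostdev mode='subsystem' type='pci'> entry pointing at 0000:04:00.3.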
KVM Guest: CentOS 6.4 and CentOS 6.3
----------------------
Mellanox OFED: MLNX_OFED_LINUX-2.0-2.0.5-rhel6.3-x86_64
Kernel: 2.6.32-279.el6.x86_64
#lspci | grep Mel
00:07.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]
#ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.11.500
node_guid: 0014:0500:c0bb:4473
sys_image_guid: 0002:c903:00a2:8fb3
vendor_id: 0x02c9
vendor_part_id: 4100
hw_ver: 0x0
board_id: MT_1090110018
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: InfiniBand
port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: InfiniBand
# sminfo
ibwarn: [3673] _do_madrpc: send failed; Function not implemented
ibwarn: [3673] mad_rpc: _do_madrpc failed; dport (Lid 1)
sminfo: iberror: failed: query
OpenSM log:
Jul 19 09:57:54 001056 [C520D700] 0x02 -> osm_vendor_init: 1000 pending umads specified
Jul 19 09:57:54 002074 [C520D700] 0x80 -> Entering DISCOVERING state
Using default GUID 0x14050000000002
Jul 19 09:57:54 191924 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x14050000000002
Jul 19 09:57:54 671075 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x14050000000002
Jul 19 09:57:54 671503 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x14050000000002
Jul 19 09:57:54 672363 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x14050000000002
Jul 19 09:57:54 672774 [C520D700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0014050000000002
Jul 19 09:57:54 673345 [C520D700] 0x01 -> osm_vendor_set_sm: ERR 5431: setting IS_SM capmask: cannot open file '/dev/infiniband/issm0': Invalid argument
Jul 19 09:57:54 674233 [C1605700] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f11b00008c0 of size 256 TID 0x1234 failed -5 (Invalid argument)
Jul 19 09:57:54 674278 [C1605700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_ERROR): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1234
Jul 19 09:57:54 674311 [C1605700] 0x01 -> vl15_send_mad: ERR 3E03: MAD send failed (IB_UNKNOWN_ERROR)
Jul 19 09:57:54 674336 [C0C04700] 0x01 -> state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown
Re: Management command failed in KVM for SR-IOV
Hello,
Could you provide ibstat?
Could you provide an sminfo example that targets a specific port you know an SM is present in?
Where are you running your SM?
I do not see any indication of fabric connectivity other than state: PORT_ACTIVE (4). I am thinking that perhaps the commands you are getting bad MAD responses from are being directed toward a non-linked port. Similar behavior happens on dual-port cards with some commands.
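For example, something along these lines (the LID is just a placeholder for wherever your SM actually is):
# sminfo 2
i.e. pointing sminfo directly at the SM's LID, or at its port GUID via -G.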
Re: Management command failed in KVM for SR-IOV
Also... I've seen machines before that do not support SR-IOV. This feature also needs to be supported by the hardware, so you need to check with your server vendor.
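Two generic checks that usually help here, using only standard Linux tooling (just a sketch, adjust the PCI address):
# dmesg | grep -iE 'dmar|iommu'
# lspci -s 04:00.0 -vvv | grep -iA3 'sr-iov'
The first shows whether the platform IOMMU (VT-d) came up; the second shows the SR-IOV capability block the card advertises.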
Re: Using ConnectX-2 VPI adapters for network workstation with 2 nodes.
I have a question about the SX6018 switch, and any input would be greatly appreciated. Will the FDR ports auto-negotiate down to QDR and work with ConnectX-2 adapters? Thanks!
Re: Management command failed in KVM for SR-IOV
His hardware supports it because he sees the virtual functions.
ping time inconsistent in 10G Ethernet Cards
linux version : Linux 2.6.32-279.el6.x86_64(CentOS 6.3 64bit)
Mellanox OFED Version : MLNX_OFED_LINUX-1.9-0.1.8-rhel6.3-x86_64
VMA Version : libvma-6.3.28-0-x86_64.rpm
HCA Card : MCX312A-XCBT
# ethtool -i p5p1
driver: mlx4_en (MT_1080120023_CX-3)
version: 1.5.10 (Jan 2013)
firmware-version: 2.10.700
bus-info: 0000:04:00.0
# ethtool -i p5p1
driver: mlx4_en (MT_1080120023_CX-3)
version: 1.5.10 (Jan 2013)
firmware-version: 2.10.800
bus-info: 0000:04:00.0
My customer is using Mellanox 10G Ethernet cards.
The Mellanox 10G card does not have a consistent ping time (11us~38us, inconsistent),
but other cards (Solarflare, Chelsio, ...) have a consistent ping time (13~15us).
They are very sensitive to network latency and speed.
Is there a way to resolve this issue?
64 bytes from 10.154.136.10: icmp_seq=2 ttl=64 time=0.021 ms
64 bytes from 10.154.136.10: icmp_seq=3 ttl=64 time=0.026 ms
64 bytes from 10.154.136.10: icmp_seq=4 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=5 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=6 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=7 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=8 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=9 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=10 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=11 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=12 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=13 ttl=64 time=0.014 ms
64 bytes from 10.154.136.10: icmp_seq=14 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=15 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=16 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=17 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=18 ttl=64 time=0.044 ms
64 bytes from 10.154.136.10: icmp_seq=19 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=20 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=21 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=22 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=23 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=24 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=25 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=26 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=27 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=28 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=29 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=30 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=31 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=32 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=33 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=34 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=35 ttl=64 time=0.021 ms
64 bytes from 10.154.136.10: icmp_seq=36 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=37 ttl=64 time=0.014 ms
64 bytes from 10.154.136.10: icmp_seq=38 ttl=64 time=0.049 ms
64 bytes from 10.154.136.10: icmp_seq=39 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=40 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=41 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=42 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=43 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=44 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=45 ttl=64 time=0.021 ms
64 bytes from 10.154.136.10: icmp_seq=46 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=47 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=48 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=49 ttl=64 time=0.021 ms
64 bytes from 10.154.136.10: icmp_seq=50 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=51 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=52 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=53 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=54 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=55 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=56 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=57 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=58 ttl=64 time=0.044 ms
64 bytes from 10.154.136.10: icmp_seq=59 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=60 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=61 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=62 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=63 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=64 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=65 ttl=64 time=0.020 ms
64 bytes from 10.154.136.10: icmp_seq=66 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=67 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=68 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=69 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=70 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=71 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=72 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=73 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=74 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=75 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=76 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=77 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=78 ttl=64 time=0.023 ms
64 bytes from 10.154.136.10: icmp_seq=79 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=80 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=81 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=82 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=83 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=84 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=85 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=86 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=87 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=88 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=89 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=90 ttl=64 time=0.019 ms
64 bytes from 10.154.136.10: icmp_seq=91 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=92 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=93 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=94 ttl=64 time=0.018 ms
64 bytes from 10.154.136.10: icmp_seq=95 ttl=64 time=0.013 ms
64 bytes from 10.154.136.10: icmp_seq=96 ttl=64 time=0.011 ms
64 bytes from 10.154.136.10: icmp_seq=97 ttl=64 time=0.021 ms
64 bytes from 10.154.136.10: icmp_seq=98 ttl=64 time=0.038 ms
64 bytes from 10.154.136.10: icmp_seq=99 ttl=64 time=0.012 ms
64 bytes from 10.154.136.10: icmp_seq=100 ttl=64 time=0.019 ms
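(Not a confirmed fix for this case, but one thing commonly tested for jitter like this on mlx4_en is adaptive interrupt moderation, e.g. disabling it temporarily with:
# ethtool -C p5p1 adaptive-rx off rx-usecs 0 rx-frames 1
and re-running the ping test.)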
Re: Management command failed in KVM for SR-IOV
Hi,
Thanks for the response and help.
First of all, the Supermicro motherboard supports SR-IOV and VT-d. In this version of the motherboard and BIOS, SR-IOV is turned on all the time.
I forgot to mention that when the physical device is passed through, everything works as expected in the VM.
For this issue I'm working on 2 physical nodes, S1 and G1; I have OpenSM on G1:
OpenSM (sminfo BUILD VERSION: 1.5.8.MLNX_20110906 Build date: Jun 26 2012 21:31:16)
S1 hosts the virtual system CentOS64.
I started working with InfiniBand technology about 1 year ago, and so far almost everything I needed worked out of the box,
so I can be a little bit clumsy.
[root@G1]# sminfo
sminfo: sm lid 2 sm guid 0x8f104039814b9, activity count 3626 priority 0 state 3 SMINFO_MASTER
[root@S1 ~]# sminfo
sminfo: sm lid 2 sm guid 0x8f104039814b9, activity count 3707 priority 0 state 3 SMINFO_MASTER
[root@S1 ~]# sminfo -G 0x8f104039814b9
sminfo: sm lid 2 sm guid 0x8f104039814b9, activity count 3718 priority 0 state 3 SMINFO_MASTER
[root@CentOS64 ~]# sminfo -G 0x8f104039814b9
ibwarn: [3178] ib_path_query_via: sa call path_query failed
sminfo: iberror: failed: can't resolve destination port 0x8f104039814b9
[root@CentOS64 ~]# ibstat
CA 'mlx4_0'
CA type: MT4100
Number of ports: 2
Firmware version: 2.11.500
Hardware version: 0
Node GUID: 0x001405008eaa0a36
System image GUID: 0x0002c90300a28fb3
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 2
Capability mask: 0x02514868
Port GUID: 0x0014050000000002
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 4
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x0014050000000081
Link layer: InfiniBand
2 devices from: options mlx4_core num_vfs=5 port_type_array=1,1 probe_vf=1
[root@S1 ~]# ibstat
CA 'mlx4_0'
CA type: MT4099
Number of ports: 2
Firmware version: 2.11.500
Hardware version: 0
Node GUID: 0x0002c90300a28fb0
System image GUID: 0x0002c90300a28fb3
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 2
Capability mask: 0x02514868
Port GUID: 0x0002c90300a28fb1
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 4
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x0002c90300a28fb2
Link layer: InfiniBand
CA 'mlx4_1'
CA type: MT4100
Number of ports: 2
Firmware version: 2.11.500
Hardware version: 0
Node GUID: 0x00140500f8bf4e16
System image GUID: 0x0002c90300a28fb3
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 1
LMC: 0
SM lid: 2
Capability mask: 0x02514868
Port GUID: 0x0014050000000001
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 4
LMC: 0
SM lid: 1
Capability mask: 0x02514868
Port GUID: 0x0014050000000080
Link layer: InfiniBand
[root@G1 ]# ibstat
CA 'mthca0'
CA type: MT25208 (MT23108 compat mode)
Number of ports: 2
Firmware version: 4.7.600
Hardware version: a0
Node GUID: 0x0008f104039814b8
System image GUID: 0x0008f104039814bb
Port 1:
State: Active
Physical state: LinkUp
Rate: 20
Base lid: 2
LMC: 0
SM lid: 2
Capability mask: 0x02510a6a
Port GUID: 0x0008f104039814b9
Link layer: InfiniBand
Port 2:
State: Down
Physical state: Polling
Rate: 10
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x02510a68
Port GUID: 0x0008f104039814ba
Link layer: InfiniBand
Re: ESX 5.1.0 IPoIB NFS datastore freezes
Hello,
In the meantime I was able to narrow the error down to quite a simple test case:
1) create an NFS mount inside a VM that uses the IPoIB network interface of the ESX host
2) copy data via scp from somewhere into this machine, onto the NFS mount
When the error occurs for the first time, one can read the following in the vmkernel.log:
WARNING: LinDMA: Linux_DMACheckContraints:149:Cannot
map machine address = 0x15ffff37b0, length = 65160
for device 0000:02:00.0; reason = buffer straddles
device dma boundary (0xffffffff)
<3>vmnic_ib1:ipoib_send:504: found skb where it does not belong
tx_head = 323830, tx_tail =323830
<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0
Backtrace for current CPU #20, worldID=8212, ebp=0x41220051b028
ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c4524aa, 0x4f0f5000000d
ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c44bca8, 0x41000fe5d6c0
ipoib_start_xmit@<None>#<None>+0x53 stack: 0x41220051b238, 0x41800c4
...
Best regards.
Markus
Re: Management command failed in KVM for SR-IOV
Are you running the SM from the hypervisor host machine? If yes, can you try running the SM on a regular machine?
Re: Management command failed in KVM for SR-IOV
No, the SM is running on a regular one.
Re: Management command failed in KVM for SR-IOV
Your SM is running on the older HCA type (MT25208), which should be just fine, but I'm still thinking you could try running it on a newer ConnectX card (you have a few of those out there).
Re: Management command failed in KVM for SR-IOV
I have a few more computers running on older 20Gb/s devices; the new ConnectX-3 and the server were bought to test VM functionality. As you suggested, I moved OpenSM to the only new device I have at the moment, but there is no improvement. Now OpenSM is running on S1 and it's visible.
[root@S1 ~]# sminfo
sminfo: sm lid 1 sm guid 0x2c90300a28fb1, activity count 239 priority 0 state 3 SMINFO_MASTER
[root@G1 ]# sminfo
sminfo: sm lid 1 sm guid 0x2c90300a28fb1, activity count 303 priority 0 state 3 SMINFO_MASTER
[root@CentOS64 ~]# sminfo
ibwarn: [3702] _do_madrpc: send failed; Function not implemented
ibwarn: [3702] mad_rpc: _do_madrpc failed; dport (Lid 1)
sminfo: iberror: failed: query
I'm not sure whether sminfo in the VM shows errors coming from the SR-IOV functionality and the VF device, or whether it simply cannot get out of the VM because of an invalid configuration.
Should I configure OpenSM in any special way? Maybe this VF device is treated in a "special" way by the SM?
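In case it helps the next test: OpenSM can be bound to an explicit port GUID on the command line, for example (the GUID below is the S1 port GUID quoted earlier, used purely as an illustration):
# opensm -g 0x0002c90300a28fb1 -B
where -g/--guid selects the port to bind to and -B runs it as a daemon.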
Can I use FDR and QDR on the same infiniband switch?
I want to make a small cluster using the upcoming SX6012 switch, with an FDR ConnectX-3 adapter for the main server and QDR ConnectX-2 adapters for the nodes. From my reading on InfiniBand this shouldn't be a problem, but I just want to make sure. Thank you!
Re: Management command failed in KVM for SR-IOV
Hi,
Still nothing... I hope this info can be helpful.
I noticed that OpenSM must be started on the hypervisor host (in my case that is S1); otherwise the virtual function's ports are linked up but have state DOWN.
When I start OpenSM (option: PORTS="ALL"), all the ports become active (both are cable-connected).
I also noticed a few more things:
So far only ibnetdiscover in the virtual system produces this system message on the hypervisor host:
mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting
mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x26aaf000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1
The sminfo command targets the correct OpenSM LID (i.e. it uses the LID number of the OpenSM master), but the query still fails:
# sminfo --debug -v
ibwarn: [2843] smp_query_status_via: attr 0x20 mod 0x0 route Lid 1
ibwarn: [2843] _do_madrpc: send failed; Function not implemented
ibwarn: [2843] mad_rpc: _do_madrpc failed; dport (Lid 1)
sminfo: iberror: [pid 2843] main: failed: query
On the virtual host I can see this message in the log:
ibnetdiscover[2755]: segfault at e4 ip 00000031d420a8b6 sp 00007fffc2eee6b8 error 4 in libibmad.so.5.3.1[31d4200000+12000]
and on the hypervisor host:
<mlx4_ib> _mlx4_ib_mcg_port_cleanup: _mlx4_ib_mcg_port_cleanup-1102: ff12401bffff000000000000ffffffff (port 2): WARNING: group refcount 1!!! (pointer ffff88083f4fa000)
One more thing:
In the virtual machine I started OpenSM with the GUID pointing to the local port of the VF, and got these messages:
Jul 24 14:27:09 830432 [FA2C0700] 0x80 -> Entering DISCOVERING state
Using default GUID 0x14050000000002
Jul 24 14:27:09 994036 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x14050000000002
Jul 24 14:27:10 398748 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x14050000000002
Jul 24 14:27:10 398958 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x14050000000002
Jul 24 14:27:10 399371 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x14050000000002
Jul 24 14:27:10 399960 [FA2C0700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0014050000000002
Jul 24 14:27:10 400439 [FA2C0700] 0x01 -> osm_vendor_set_sm: ERR 5431: setting IS_SM capmask: cannot open file '/dev/infiniband/issm0': Invalid argument
Jul 24 14:27:10 401700 [F66B8700] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7fcfe40008c0 of size 256 TID 0x1234 failed -5 (Invalid argument)
Jul 24 14:27:10 401700 [F66B8700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_ERROR): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1234
Jul 24 14:27:10 401700 [F66B8700] 0x01 -> vl15_send_mad: ERR 3E03: MAD send failed (IB_UNKNOWN_ERROR)
Jul 24 14:27:10 401983 [F5CB7700] 0x01 -> state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown
A regular Linux cat of /dev/infiniband/issm0 works in the hypervisor system (at least it waits), whereas in the VM I get exactly the same error as in the OpenSM log:
# cat /dev/infiniband/issm0
cat: /dev/infiniband/issm0: Invalid argument
Both files, on the host and in the VM, are the same with regard to access permissions:
VM:
#ls -aZ /dev/infiniband/issm0
crw-rw----. root root system_u:object_r:device_t:s0 /dev/infiniband/issm0
Host:
#ls -lZ /dev/infiniband/issm0
crw-rw----. root root system_u:object_r:device_t:s0 /dev/infiniband/issm0
Re: ESX 5.1.0 IPoIB NFS datastore freezes
Memo (for myself): modifications to the environment during my tests:
The InfiniBand card was exchanged for a ConnectX PCIe gen2 (MT26418), the newer chip with PCIe 5.0 GT/s, but still not a ConnectX-2 card. The error stays the same.
Updating the host BIOS does not help either. Even with the latest version installed, the error still occurs.
Re: Can I use FDR and QDR on the same infiniband switch?
Shouldn't be a problem. The InfiniBand spec defines backward compatibility.
What are you planning on running between those nodes?
Re: ESX 5.1.0 IPoIB NFS datastore freezes
Hm,
it seems the error comes from using one port of an InfiniBand card both as
1) the VMkernel NFS interface
2) and the network interface for the VM.
See the picture below.
After separating the VM network and the VMkernel onto different ports of the adapter, one can transfer gigabytes without errors. Maybe one of the VMware driver developers has an idea?
Best regards.
Markus