Mellanox Interconnect Community: Message List

Re: New to infiniband, can't get a working connection.


Yes, still having problems. Since the original post, I'm now on my 3rd switch, and the second Topspin 120, and I'm having the exact same issue. While I can plug two of the InfiniHost IIIs together and get a link light, when I plug them into the switch I get no link light. I also cannot plug the ConnectX card into either and get it to work, but I'm starting to suspect that it's just a bad card. I just can't believe one person can have this much trouble with this stuff.


Re: Re: New to infiniband, can't get a working connection.


No worries.  Which OS are you using?

 

Is there any chance you could do this on CentOS/RHEL 6.4?

 

Asking that because it's what I'm super familiar with.

 

If you're ok with that, please install the CentOS/RHEL provided IB software, and also pciutils:

 

$ sudo yum groupinstall "Infiniband Support"

$ sudo yum install mstflint pciutils

$ sudo chkconfig rdma on

$ sudo service rdma start
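
Once the service is up, a quick sanity check (just a sketch; the module names depend on which card the in-box driver binds) is to confirm the driver loaded and the HCA shows up:

$ lsmod | grep -E 'mlx4|mthca'       # mlx4 for ConnectX, ib_mthca for InfiniHost III
$ ls /sys/class/infiniband/          # each detected HCA appears here, e.g. mlx4_0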

 

Then let's do some basic info gathering so we know what we're dealing with.

 

  • Run lspci -Qvvs on the ConnectX card and at least one of the InfiniHost IIIs, then post the results here
  • Also query the firmware of both using mstflint

 

Example from a ConnectX card here.  First I find out its PCI address in the box:

 

$ sudo lspci |grep Mell

01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

 

Then use lspci -Qvvs on that address to retrieve all of the potentially useful info:


$ sudo lspci -Qvvs 01:00.0

01:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

    Subsystem: Mellanox Technologies Device 0006

    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+

    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-

    Latency: 0, Cache Line Size: 64 bytes

    Interrupt: pin A routed to IRQ 16

    Region 0: Memory at f7c00000 (64-bit, non-prefetchable) [size=1M]

    Region 2: Memory at f0000000 (64-bit, prefetchable) [size=8M]

    Capabilities: [40] Power Management version 3

        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)

        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-

    Capabilities: [48] Vital Product Data

        Product Name: Eagle DDR

        Read-only fields:

            [PN] Part number: 375-3549-01         

            [EC] Engineering changes: 51

            [SN] Serial number: 1388FMH-0905400010     

            [V0] Vendor specific: PCIe x8        

            [RV] Reserved: checksum good, 0 byte(s) reserved

        Read/write fields:

            [V1] Vendor specific: N/A  

            [YA] Asset tag: N/A                            

            [RW] Read-write area: 111 byte(s) free

        End

    Capabilities: [9c] MSI-X: Enable+ Count=128 Masked-

        Vector table: BAR=0 offset=0007c000

        PBA: BAR=0 offset=0007d000

    Capabilities: [60] Express (v2) Endpoint, MSI 00

        DevCap:    MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited

            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+

        DevCtl:    Report errors: Correctable- Non-Fatal- Fatal- Unsupported-

            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-

            MaxPayload 256 bytes, MaxReadReq 512 bytes

        DevSta:    CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-

        LnkCap:    Port #8, Speed 2.5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited

            ClockPM- Surprise- LLActRep- BwNot-

        LnkCtl:    ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-

            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-

        LnkSta:    Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-

        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported

        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled

        LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-

             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-

             Compliance De-emphasis: -6dB

        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-

             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-

    Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)

        ARICap:    MFVC- ACS-, Next Function: 1

        ARICtl:    MFVC- ACS-, Function Group: 0

    Kernel driver in use: mlx4_core

    Kernel modules: mlx4_core

 

Note the vital product data and link status fields (highlighted in blue in the original post).  For ConnectX cards this stuff is useful.  For my card, it's showing a Sun part number, as it was originally a Sun-badged card (now reflashed to stock firmware).  The PCI link is in x8 state too, which is useful to know (if it wasn't, it would indicate a problem).
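
If you just want the link fields without wading through the full dump, a grep along these lines (a sketch based on the output format above) pulls out the capability vs. the negotiated state:

$ sudo lspci -vvs 01:00.0 | grep -E 'LnkCap|LnkSta'    # negotiated width should match the card's capability (x8 here)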

 

And the mstflint output example:

 

$ sudo mstflint -d 01:00.0 q

Image type:      ConnectX

FW Version:      2.9.1000

Device ID:       25418

Description:     Node             Port1            Port2            Sys image

GUIDs:           0003ba000100edb8 0003ba000100edb9 0003ba000100edba 0003ba000100edbb

MACs:                                 0003ba00edb9     0003ba00edba

Board ID:         (MT_04A0120002)

VSD:            

PSID:            MT_04A0120002

 

That tells us the firmware version on the card.  Useful to know, as it might need upgrading (very easy to do).
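
If it does turn out to need upgrading, the procedure is roughly this (a sketch only; the image file name is a placeholder for whatever firmware matches your card's PSID):

$ sudo mstflint -d 01:00.0 -i fw-image-for-your-PSID.bin burn    # flash the new image
$ sudo mstflint -d 01:00.0 q                                     # query again to confirm the new FW version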

 

After you've pasted that info here, we can start figuring out whether anything is wrong with the basics and fix that first.  Then we can move on to the next steps.

 

(note - edited for typo fixes)

ESX 5.1.0 IPoIB NFS datastore freezes


Hello,

 

Hopefully this is the right place to provide some information about a VMware IPoIB datastore freeze. We are testing a new VMware ESX 5.1.0 setup. Sadly we have only 3 ConnectX (gen 1) cards left; the other ConnectX-2 ones are in our production ESX 4.1 environment. We know that this is not officially supported right now, but I want to make sure that the error will not happen when we upgrade the production machines.

 

The hardware is:

Fujitsu RX300 S6 (Dual Intel X5670)

ConnectX MT25418 Firmware 2.9.1000

ESX 5.1.0 1117900

Mellanox driver 1.8.1

 

When copying data between VMs, all of a sudden the adapter freezes and the datastore is "lost". The vmkernel log fills with endless lines like the ones below:

 

2013-07-22T17:22:48.775Z cpu10:8202)<3>vmnic_ib1:ipoib_send:504: found skb where it does not belongtx_head = 3827020, tx_tail =3827020

2013-07-22T17:22:48.775Z cpu10:8202)<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0

2013-07-22T17:22:48.775Z cpu10:8202)Backtrace for current CPU #10, worldID=8202, ebp=0x41220029af68

2013-07-22T17:22:48.776Z cpu10:8202)0x41220029af68:[0x41802a310d59]ipoib_send@<None>#<None>+0x5d4 stack: 0xffffff, 0x0, 0x412410d4c948,

2013-07-22T17:22:48.777Z cpu10:8202)0x41220029b018:[0x41802a310d59]ipoib_send@<None>#<None>+0x5d4 stack: 0x41220029b088, 0x418029e0a55b

2013-07-22T17:22:48.777Z cpu10:8202)0x41220029b148:[0x41802a317160]ipoib_mcast_send@<None>#<None>+0xf7 stack: 0x41220029b188, 0x418029d

2013-07-22T17:22:48.778Z cpu10:8202)0x41220029b238:[0x41802a31dabf]ipoib_start_xmit@<None>#<None>+0x396 stack: 0x41220029b598, 0x412200

2013-07-22T17:22:48.778Z cpu10:8202)0x41220029b398:[0x41802a31ac3b]vmipoib_start_xmit@<None>#<None>+0x49a stack: 0x41000be0b880, 0x839e

2013-07-22T17:22:48.779Z cpu10:8202)0x41220029b468:[0x41802a16d8f0]DevStartTxImmediate@com.vmware.driverAPI#9.2+0x137 stack: 0x41220029

2013-07-22T17:22:48.779Z cpu10:8202)0x41220029b4d8:[0x418029d3470e]UplinkDevTransmit@vmkernel#nover+0x295 stack: 0x10787a40, 0x41220029

2013-07-22T17:22:48.780Z cpu10:8202)0x41220029b558:[0x418029dabbaa]NetSchedFIFORunLocked@vmkernel#nover+0x1a5 stack: 0xc0bd95300, 0x0,

2013-07-22T17:22:48.781Z cpu10:8202)0x41220029b5e8:[0x418029dabf57]NetSchedFIFOInput@vmkernel#nover+0x24e stack: 0x41220029b638, 0x4180

2013-07-22T17:22:48.781Z cpu10:8202)0x41220029b698:[0x418029dab0b2]NetSchedInput@vmkernel#nover+0x191 stack: 0x41220029b748, 0x41000bd9

2013-07-22T17:22:48.782Z cpu10:8202)0x41220029b738:[0x418029d3ced0]IOChain_Resume@vmkernel#nover+0x247 stack: 0x41220029b798, 0x418029d

2013-07-22T17:22:48.782Z cpu10:8202)0x41220029b788:[0x418029d2c0e4]PortOutput@vmkernel#nover+0xe3 stack: 0x41220029b808, 0x41802a216a2a

2013-07-22T17:22:48.783Z cpu10:8202)0x41220029b808:[0x41802a2254c8]TeamES_Output@<None>#<None>+0x16b stack: 0x0, 0x418029cc3879, 0x4122

2013-07-22T17:22:48.784Z cpu10:8202)0x41220029ba08:[0x41802a218047]EtherswitchPortDispatch@<None>#<None>+0x142a stack: 0xffffffff000000

2013-07-22T17:22:48.784Z cpu10:8202)0x41220029ba78:[0x418029d2b2c7]Port_InputResume@vmkernel#nover+0x146 stack: 0x410001553540, 0x41220

2013-07-22T17:22:48.785Z cpu10:8202)0x41220029baa8:[0x41802a3b95cb]TcpipTxDispatch@<None>#<None>+0x9a stack: 0x7c1f45, 0x41220029bad8,

2013-07-22T17:22:48.785Z cpu10:8202)0x41220029bb28:[0x41802a3ba118]TcpipDispatch@<None>#<None>+0x1c7 stack: 0x246, 0x41220029bb70, 0x41

2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bca8:[0x418029d0b245]WorldletProcessQueue@vmkernel#nover+0x4b0 stack: 0x41220029bd58, 0xb

2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bce8:[0x418029d0b895]WorldletBHHandler@vmkernel#nover+0x60 stack: 0x100000000000001, 0x41

2013-07-22T17:22:48.786Z cpu10:8202)0x41220029bd68:[0x418029c2083a]BH_Check@vmkernel#nover+0x185 stack: 0x41220029be68, 0x41220029be08,

2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be68:[0x418029dbc9bc]CpuSchedIdleLoopInt@vmkernel#nover+0x13b stack: 0x41220029be98, 0x41

2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be78:[0x418029dc66de]CpuSched_IdleLoop@vmkernel#nover+0x15 stack: 0xa, 0x14, 0x41220029bf

2013-07-22T17:22:48.787Z cpu10:8202)0x41220029be98:[0x418029c4f71e]Init_SlaveIdle@vmkernel#nover+0x49 stack: 0x0, 0x0, 0x0, 0x0, 0x0

2013-07-22T17:22:48.788Z cpu10:8202)0x41220029bfe8:[0x418029ee26a6]SMPSlaveIdle@vmkernel#nover+0x31d stack: 0x0, 0x0, 0x0, 0x0, 0x0

 

 

Any help is appreciated.

 

Best regards.

 

Markus

Management command failed in KVM for SR-IOV


Hi,

 

This is my fourth day of fighting with SR-IOV and KVM.

 

I can ping from the VM to another IPoIB computer, but when I try to use the ibnetdiscover command I get a SIGSEGV:

 

ibnetdiscover

src/query_smp.c:98; send failed; -5

#

# Topology file: generated on Fri Jul 19 19:28:24 2013

#

Segmentation fault (core dumped)

 

Most of the IB commands fail as well; dmesg shows:

 

mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:3 in_param 0x29f3a000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1

mlx4_core 0000:04:00.0: vhcr command SET_PORT (0xc) slave:3 in_param 0x29f3a000 in_mod=0x1, op_mod=0x0 failed with error:0, status -22

mlx4_core 0000:04:00.0: slave 3 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting

 

It looks like the MAD_IFC firmware command is failing in the device for some reason, but I have no idea about the cause. Possibly this part of the code is related:

 

+ if (slave != dev->caps.function &&

+    ((smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE) ||

+     (smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&

+      smp->method == IB_MGMT_METHOD_SET))) {

+ mlx4_err(dev, "slave %d is trying to execute a Subnet MGMT MAD, "

+ "class 0x%x, method 0x%x for attr 0x%x. Rejecting\n",

+ slave, smp->method, smp->mgmt_class,

+ be16_to_cpu(smp->attr_id));

+ return -EPERM;

+ }

 

from

 

+static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int slave,

+ struct mlx4_vhcr *vhcr,

+ struct mlx4_cmd_mailbox *inbox,

+ struct mlx4_cmd_mailbox *outbox,

+ struct mlx4_cmd_info *cmd)

 

 

Please find below some details about my build.

 

 

I'd really appreciate it if anybody could point me in the right direction, or even better, help me fix the issue.

 

 

Thanks in advance

Marcin

 

 

Host:

-----

Motherboard: Supermicro X9DRI-F

CPUs: 2x E5-2640

 

 

System: CentOS 6.3:2.6.32-279.el6.x86_64 and CentOS 6.4 2.6.32-358.el6.x86_64

 

 

Infiniband: Mellanox Technologies MT27500 Family [ConnectX-3], MCX354A-QCB

Mellanox OFED: MLNX_OFED_LINUX-2.0-2.0.5-rhel6.3-x86_64

 

 

qemu-kvm.x86_64                         2:0.12.1.2-2.355.el6

 

 

 

 

#lspci | grep Mel

04:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

04:00.1 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:00.2 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:00.3 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:00.4 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:00.5 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:00.6 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:00.7 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

04:01.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

 

 

#dmesg | grep mlx4

mlx4_core: Mellanox ConnectX core driver v1.1 (Apr 23 2013)

mlx4_core: Initializing 0000:04:00.0

mlx4_core 0000:04:00.0: PCI INT A -> GSI 32 (level, low) -> IRQ 32

mlx4_core 0000:04:00.0: setting latency timer to 64

mlx4_core 0000:04:00.0: Enabling SR-IOV with 5 VFs

mlx4_core 0000:04:00.0: Running in master mode

mlx4_core 0000:04:00.0: irq 109 for MSI/MSI-X

mlx4_core 0000:04:00.0: irq 110 for MSI/MSI-X

mlx4_core 0000:04:00.0: irq 111 for MSI/MSI-X

mlx4_core 0000:04:00.0: irq 112 for MSI/MSI-X

mlx4_core: Initializing 0000:04:00.1

mlx4_core 0000:04:00.1: enabling device (0000 -> 0002)

mlx4_core 0000:04:00.1: setting latency timer to 64

mlx4_core 0000:04:00.1: Detected virtual function - running in slave mode

mlx4_core 0000:04:00.1: Sending reset

mlx4_core 0000:04:00.0: Received reset from slave:1

mlx4_core 0000:04:00.1: Sending vhcr0

mlx4_core 0000:04:00.1: HCA minimum page size:512

mlx4_core 0000:04:00.1: irq 113 for MSI/MSI-X

mlx4_core 0000:04:00.1: irq 114 for MSI/MSI-X

mlx4_core 0000:04:00.1: irq 115 for MSI/MSI-X

mlx4_core 0000:04:00.1: irq 116 for MSI/MSI-X

mlx4_core: Initializing 0000:04:00.2

mlx4_core 0000:04:00.2: enabling device (0000 -> 0002)

mlx4_core 0000:04:00.2: setting latency timer to 64

mlx4_core 0000:04:00.2: Skipping virtual function:2

mlx4_core: Initializing 0000:04:00.3

mlx4_core 0000:04:00.3: enabling device (0000 -> 0002)

mlx4_core 0000:04:00.3: setting latency timer to 64

mlx4_core 0000:04:00.3: Skipping virtual function:3

mlx4_core: Initializing 0000:04:00.4

mlx4_core 0000:04:00.4: enabling device (0000 -> 0002)

mlx4_core 0000:04:00.4: setting latency timer to 64

mlx4_core 0000:04:00.4: Skipping virtual function:4

mlx4_core: Initializing 0000:04:00.5

mlx4_core 0000:04:00.5: enabling device (0000 -> 0002)

mlx4_core 0000:04:00.5: setting latency timer to 64

mlx4_core 0000:04:00.5: Skipping virtual function:5

<mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v1.0 (Apr 23 2013)

mlx4_core 0000:04:00.0: mlx4_ib: multi-function enabled

mlx4_core 0000:04:00.0: mlx4_ib: initializing demux service for 80 qp1 clients

mlx4_core 0000:04:00.1: mlx4_ib: multi-function enabled

mlx4_core 0000:04:00.1: mlx4_ib: operating in qp1 tunnel mode

mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.1 (Apr 23 2013)

mlx4_en 0000:04:00.0: Activating port:2

mlx4_en: eth2: Using 216 TX rings

mlx4_en: eth2: Using 4 RX rings

mlx4_en: eth2: Initializing port

mlx4_en 0000:04:00.1: Activating port:2

mlx4_en: eth3: Using 216 TX rings

mlx4_en: eth3: Using 4 RX rings

mlx4_en: eth3: Initializing port

mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is up

mlx4_core 0000:04:00.0: Received reset from slave:2

mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting

mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x106a10000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1

mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is up

mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting

mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x119079000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1

mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is down

mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is down

mlx4_core 0000:04:00.0: mlx4_ib: Port 1 logical link is up

mlx4_core 0000:04:00.1: mlx4_ib: Port 1 logical link is up

 

 

 

 

# ibv_devinfo

hca_id: mlx4_0

  transport: InfiniBand (0)

  fw_ver: 2.11.500

  node_guid: 0002:c903:00a2:8fb0

  sys_image_guid: 0002:c903:00a2:8fb3

  vendor_id: 0x02c9

  vendor_part_id: 4099

  hw_ver: 0x0

  board_id: MT_1090110018

  phys_port_cnt: 2

  port: 1

  state: PORT_ACTIVE (4)

  max_mtu: 2048 (4)

  active_mtu: 2048 (4)

  sm_lid: 1

  port_lid: 1

  port_lmc: 0x00

  link_layer: InfiniBand

 

 

  port: 2

  state: PORT_DOWN (1)

  max_mtu: 2048 (4)

  active_mtu: 2048 (4)

  sm_lid: 0

  port_lid: 0

  port_lmc: 0x00

  link_layer: InfiniBand

 

 

#cat /etc/modprobe.d/mlx4_core.conf

options mlx4_core num_vfs=8 port_type_array=1,1 probe_vf=1
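
(For completeness: these options only take effect after the mlx4 modules are reloaded or the host is rebooted, roughly like this, assuming nothing else is holding the driver:)

# modprobe -r mlx4_en mlx4_ib mlx4_core    # unload the dependent modules first
# modprobe mlx4_core                       # reload with num_vfs/probe_vf applied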

 

 

KVM Guest: CentOS 6.4 and CentOS 6.3

----------------------

 

 

Mellanox OFED: MLNX_OFED_LINUX-2.0-2.0.5-rhel6.3-x86_64

Kernel: 2.6.32-279.el6.x86_64

 

 

#lspci | grep Mel

00:07.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function]

 

 

 

 

#ibv_devinfo

hca_id: mlx4_0

  transport: InfiniBand (0)

  fw_ver: 2.11.500

  node_guid: 0014:0500:c0bb:4473

  sys_image_guid: 0002:c903:00a2:8fb3

  vendor_id: 0x02c9

  vendor_part_id: 4100

  hw_ver: 0x0

  board_id: MT_1090110018

  phys_port_cnt: 2

  port: 1

  state: PORT_ACTIVE (4)

  max_mtu: 2048 (4)

  active_mtu: 2048 (4)

  sm_lid: 1

  port_lid: 1

  port_lmc: 0x00

  link_layer: InfiniBand

 

 

  port: 2

  state: PORT_DOWN (1)

  max_mtu: 2048 (4)

  active_mtu: 2048 (4)

  sm_lid: 0

  port_lid: 0

  port_lmc: 0x00

  link_layer: InfiniBand

 

 

# sminfo

ibwarn: [3673] _do_madrpc: send failed; Function not implemented

ibwarn: [3673] mad_rpc: _do_madrpc failed; dport (Lid 1)

sminfo: iberror: failed: query

 

 

 

 

OpenSM log:

Jul 19 09:57:54 001056 [C520D700] 0x02 -> osm_vendor_init: 1000 pending umads specified

Jul 19 09:57:54 002074 [C520D700] 0x80 -> Entering DISCOVERING state

Using default GUID 0x14050000000002

Jul 19 09:57:54 191924 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x14050000000002

Jul 19 09:57:54 671075 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x14050000000002

Jul 19 09:57:54 671503 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x14050000000002

Jul 19 09:57:54 672363 [C520D700] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x14050000000002

Jul 19 09:57:54 672774 [C520D700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0014050000000002

Jul 19 09:57:54 673345 [C520D700] 0x01 -> osm_vendor_set_sm: ERR 5431: setting IS_SM capmask: cannot open file '/dev/infiniband/issm0': Invalid argument

Jul 19 09:57:54 674233 [C1605700] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7f11b00008c0 of size 256 TID 0x1234 failed -5 (Invalid argument)

Jul 19 09:57:54 674278 [C1605700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_ERROR): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1234

Jul 19 09:57:54 674311 [C1605700] 0x01 -> vl15_send_mad: ERR 3E03: MAD send failed (IB_UNKNOWN_ERROR)

Jul 19 09:57:54 674336 [C0C04700] 0x01 -> state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown

Re: Management command failed in KVM for SR-IOV


Hello,

 

Could you provide ibstat?

Could you provide an sminfo example that targets a specific port you know an SM is present in?

Where are you running your SM?

 

I do not see any indication of fabric connectivity outside of state: PORT_ACTIVE (4).  I am thinking that perhaps the commands you are receiving bad MAD responses from are being directed toward a non-linked port.  Similar behavior happens on dual-port cards with some commands.
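
For example, something along these lines (just a sketch; -C/-P pick the local HCA and port, and the trailing argument is the LID you believe the SM sits behind):

# sminfo -C mlx4_0 -P 1 <sm_lid>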

Re: Management command failed in KVM for SR-IOV


Also... I've seen machines before that do not support SR-IOV. This function also needs to be supported by the hardware, so you need to check with your server vendor.

Re: Using ConnectX-2 VPI adapters for network workstation with 2 nodes.


I've got a question about the SX6018 switch, and any input would be greatly appreciated. Will the FDR ports auto-negotiate down to QDR and work with ConnectX-2 adapters? Thanks!

Re: Management command failed in KVM for SR-IOV


His hardware supports it because he sees the virtual functions.

 



ping time inconsistent in 10G Ethernet Cards


Linux version: 2.6.32-279.el6.x86_64 (CentOS 6.3 64bit)

Mellanox OFED Version : MLNX_OFED_LINUX-1.9-0.1.8-rhel6.3-x86_64

VMA Version : libvma-6.3.28-0-x86_64.rpm

HCA Card : MCX312A-XCBT

 

# ethtool -i p5p1
driver: mlx4_en (MT_1080120023_CX-3)
version: 1.5.10 (Jan 2013)
firmware-version: 2.10.700
bus-info: 0000:04:00.0

# ethtool -i p5p1
driver: mlx4_en (MT_1080120023_CX-3)
version: 1.5.10 (Jan 2013)
firmware-version: 2.10.800
bus-info: 0000:04:00.0

My customer was using Mellanox 10G Ethernet Cards.

The Mellanox 10G card did not have a consistent ping time (11us~38us --> inconsistent), but other cards (Solarflare, Chelsio, ...) had a consistent ping time (13~15us --> consistent).

 

Their application is very sensitive to network latency and speed.

Is there a way to resolve this issue?

 

 

 

64 bytes from 10.154.136.10: icmp_seq=2 ttl=64 time=0.021 ms

64 bytes from 10.154.136.10: icmp_seq=3 ttl=64 time=0.026 ms

64 bytes from 10.154.136.10: icmp_seq=4 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=5 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=6 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=7 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=8 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=9 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=10 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=11 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=12 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=13 ttl=64 time=0.014 ms

64 bytes from 10.154.136.10: icmp_seq=14 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=15 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=16 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=17 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=18 ttl=64 time=0.044 ms

64 bytes from 10.154.136.10: icmp_seq=19 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=20 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=21 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=22 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=23 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=24 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=25 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=26 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=27 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=28 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=29 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=30 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=31 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=32 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=33 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=34 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=35 ttl=64 time=0.021 ms

64 bytes from 10.154.136.10: icmp_seq=36 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=37 ttl=64 time=0.014 ms

64 bytes from 10.154.136.10: icmp_seq=38 ttl=64 time=0.049 ms

64 bytes from 10.154.136.10: icmp_seq=39 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=40 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=41 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=42 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=43 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=44 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=45 ttl=64 time=0.021 ms

64 bytes from 10.154.136.10: icmp_seq=46 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=47 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=48 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=49 ttl=64 time=0.021 ms

64 bytes from 10.154.136.10: icmp_seq=50 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=51 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=52 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=53 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=54 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=55 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=56 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=57 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=58 ttl=64 time=0.044 ms

64 bytes from 10.154.136.10: icmp_seq=59 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=60 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=61 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=62 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=63 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=64 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=65 ttl=64 time=0.020 ms

64 bytes from 10.154.136.10: icmp_seq=66 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=67 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=68 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=69 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=70 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=71 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=72 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=73 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=74 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=75 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=76 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=77 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=78 ttl=64 time=0.023 ms

64 bytes from 10.154.136.10: icmp_seq=79 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=80 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=81 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=82 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=83 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=84 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=85 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=86 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=87 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=88 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=89 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=90 ttl=64 time=0.019 ms

64 bytes from 10.154.136.10: icmp_seq=91 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=92 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=93 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=94 ttl=64 time=0.018 ms

64 bytes from 10.154.136.10: icmp_seq=95 ttl=64 time=0.013 ms

64 bytes from 10.154.136.10: icmp_seq=96 ttl=64 time=0.011 ms

64 bytes from 10.154.136.10: icmp_seq=97 ttl=64 time=0.021 ms

64 bytes from 10.154.136.10: icmp_seq=98 ttl=64 time=0.038 ms

64 bytes from 10.154.136.10: icmp_seq=99 ttl=64 time=0.012 ms

64 bytes from 10.154.136.10: icmp_seq=100 ttl=64 time=0.019 ms
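
(For reference, not a known fix: ICMP ping latency on mlx4_en can be dominated by interrupt moderation, so it may be worth disabling adaptive coalescing and measuring with sockperf, which ships alongside VMA, before comparing cards. A rough sketch, using the interface and IP from above and an arbitrary port:)

# ethtool -C p5p1 adaptive-rx off rx-usecs 0 rx-frames 0    # disable RX interrupt coalescing
# sockperf server -i 10.154.136.10 -p 11111                 # on the remote host
# sockperf ping-pong -i 10.154.136.10 -p 11111 -t 10        # latency test from this host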

 

Re: Management command failed in KVM for SR-IOV


Hi,

 

Thanks for the response and the help.

 

First of all, the Supermicro motherboard supports SR-IOV and VT-d. In this version of the motherboard and BIOS, SR-IOV is turned on all the time.

I forgot to mention that when the physical device is passed through, everything works as expected in the VM.

 

On this issue I'm working with 2 physical nodes, S1 and G1. I have OpenSM on G1 (sminfo BUILD VERSION: 1.5.8.MLNX_20110906 Build date: Jun 26 2012 21:31:16).

S1 hosts the virtual system CentOS64.

 

I started working with InfiniBand technology about 1 year ago, and so far almost everything I needed worked out of the box, so I can be a little bit clumsy.

 

[root@G1]# sminfo

sminfo: sm lid 2 sm guid 0x8f104039814b9, activity count 3626 priority 0 state 3 SMINFO_MASTER

 

 

[root@S1 ~]# sminfo

sminfo: sm lid 2 sm guid 0x8f104039814b9, activity count 3707 priority 0 state 3 SMINFO_MASTER

 

 

[root@S1 ~]# sminfo -G 0x8f104039814b9

sminfo: sm lid 2 sm guid 0x8f104039814b9, activity count 3718 priority 0 state 3 SMINFO_MASTER

 

 

[root@CentOS64 ~]# sminfo -G 0x8f104039814b9

ibwarn: [3178] ib_path_query_via: sa call path_query failed

sminfo: iberror: failed: can't resolve destination port 0x8f104039814b9

 

 

 

 

[root@CentOS64 ~]# ibstat

CA 'mlx4_0'

  CA type: MT4100

  Number of ports: 2

  Firmware version: 2.11.500

  Hardware version: 0

  Node GUID: 0x001405008eaa0a36

  System image GUID: 0x0002c90300a28fb3

  Port 1:

  State: Active

  Physical state: LinkUp

  Rate: 20

  Base lid: 1

  LMC: 0

  SM lid: 2

  Capability mask: 0x02514868

  Port GUID: 0x0014050000000002

  Link layer: InfiniBand

  Port 2:

  State: Down

  Physical state: Disabled

  Rate: 10

  Base lid: 4

  LMC: 0

  SM lid: 1

  Capability mask: 0x02514868

  Port GUID: 0x0014050000000081

  Link layer: InfiniBand

 

 

2 devices from: options mlx4_core num_vfs=5 port_type_array=1,1 probe_vf=1

[root@S1 ~]# ibstat

CA 'mlx4_0'

  CA type: MT4099

  Number of ports: 2

  Firmware version: 2.11.500

  Hardware version: 0

  Node GUID: 0x0002c90300a28fb0

  System image GUID: 0x0002c90300a28fb3

  Port 1:

  State: Active

  Physical state: LinkUp

  Rate: 20

  Base lid: 1

  LMC: 0

  SM lid: 2

  Capability mask: 0x02514868

  Port GUID: 0x0002c90300a28fb1

  Link layer: InfiniBand

  Port 2:

  State: Down

  Physical state: Disabled

  Rate: 10

  Base lid: 4

  LMC: 0

  SM lid: 1

  Capability mask: 0x02514868

  Port GUID: 0x0002c90300a28fb2

  Link layer: InfiniBand

CA 'mlx4_1'

  CA type: MT4100

  Number of ports: 2

  Firmware version: 2.11.500

  Hardware version: 0

  Node GUID: 0x00140500f8bf4e16

  System image GUID: 0x0002c90300a28fb3

  Port 1:

  State: Active

  Physical state: LinkUp

  Rate: 20

  Base lid: 1

  LMC: 0

  SM lid: 2

  Capability mask: 0x02514868

  Port GUID: 0x0014050000000001

  Link layer: InfiniBand

  Port 2:

  State: Down

  Physical state: Disabled

  Rate: 10

  Base lid: 4

  LMC: 0

  SM lid: 1

  Capability mask: 0x02514868

  Port GUID: 0x0014050000000080

  Link layer: InfiniBand

 

 

 

 

[root@G1 ]# ibstat

CA 'mthca0'

  CA type: MT25208 (MT23108 compat mode)

  Number of ports: 2

  Firmware version: 4.7.600

  Hardware version: a0

  Node GUID: 0x0008f104039814b8

  System image GUID: 0x0008f104039814bb

  Port 1:

  State: Active

  Physical state: LinkUp

  Rate: 20

  Base lid: 2

  LMC: 0

  SM lid: 2

  Capability mask: 0x02510a6a

  Port GUID: 0x0008f104039814b9

  Link layer: InfiniBand

  Port 2:

  State: Down

  Physical state: Polling

  Rate: 10

  Base lid: 0

  LMC: 0

  SM lid: 0

  Capability mask: 0x02510a68

  Port GUID: 0x0008f104039814ba

  Link layer: InfiniBand

Re: ESX 5.1.0 IPoIB NFS datastore freezes


Hello,

 

In the meantime I was able to narrow the error down to quite a simple test case.

 

1) Create an NFS mount inside a VM that uses the IPoIB network interface of the ESX host.

2) Copy data via scp from somewhere into this machine onto the NFS mount (sketched below).
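
Concretely, inside the VM the reproduction looks roughly like this (the NFS server address, export path and file names are placeholders, not the real setup):

# mount -t nfs 192.168.1.10:/export/test /mnt/nfs
# scp user@somehost:/tmp/bigfile.bin /mnt/nfs/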

 

When the error occurs for the first time, one can read from the vmkernel.log:

 

WARNING: LinDMA: Linux_DMACheckContraints:149:Cannot

         map machine address = 0x15ffff37b0, length = 65160

         for device 0000:02:00.0; reason = buffer straddles

         device dma boundary (0xffffffff)

<3>vmnic_ib1:ipoib_send:504: found skb where it does not belong

                             tx_head = 323830, tx_tail =323830

<3>vmnic_ib1:ipoib_send:505: netif_queue_stopped = 0

Backtrace for current CPU #20, worldID=8212, ebp=0x41220051b028

ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c4524aa, 0x4f0f5000000d

ipoib_send@<None>#<None>+0x5d4 stack: 0x41800c44bca8, 0x41000fe5d6c0

ipoib_start_xmit@<None>#<None>+0x53 stack: 0x41220051b238, 0x41800c4

...

 

Best regards.

 

Markus

Re: Management command failed in KVM for SR-IOV


Are you running the SM from the hypervisor host machine? If yes, can you try running the SM on a regular machine?

Re: Management command failed in KVM for SR-IOV


No, the SM is running on a regular one.

Re: Management command failed in KVM for SR-IOV


Your SM is running on the older HCA type (MT25208), which should be just fine, but I am still thinking you could maybe try running it on a newer ConnectX card (you have a few of those out there).

Re: Management command failed in KVM for SR-IOV


I have a few more computers that run on older 20Gb/s devices; the new ConnectX-3 and the server were bought to test VM functionality. As you suggested, I moved OpenSM to the only new device I have at the moment, but there is no improvement. Now OpenSM is running on S1 and it's visible.

 

[root@S1 ~]# sminfo

sminfo: sm lid 1 sm guid 0x2c90300a28fb1, activity count 239 priority 0 state 3 SMINFO_MASTER

 

[root@G1 ]# sminfo

sminfo: sm lid 1 sm guid 0x2c90300a28fb1, activity count 303 priority 0 state 3 SMINFO_MASTER

 

[root@CentOS64 ~]# sminfo

ibwarn: [3702] _do_madrpc: send failed; Function not implemented

ibwarn: [3702] mad_rpc: _do_madrpc failed; dport (Lid 1)

sminfo: iberror: failed: query

 

I'm not sure if sminfo on the VM shows errors coming from the SR-IOV functionality and the VF device, or if it simply cannot get out of the VM because of an invalid configuration.

Should I configure OpenSM in any special way? Maybe this VF device is treated in a "special" way by the SM?


Can I use FDR and QDR on the same infiniband switch?


I want to make a small cluster using the upcoming SX6012 switch, with an FDR ConnectX-3 adapter for the main server and QDR ConnectX-2 adapters for the nodes. From my reading on InfiniBand this shouldn't be a problem, but I just want to make sure. Thank you!

Re: Management command failed in KVM for SR-IOV


Hi,

 

 

Still nothing... I hope this info can be helpful.

 

 

I notice that OpenSM must be started on the hypervisor host (in my case this is S1), otherwise the virtual functions' ports are linked up but have state DOWN.

When I start OpenSM (option: PORTS="ALL"), all the ports become active (both are connected by cable).
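
(For reference, instead of PORTS="ALL" an opensm instance can also be bound to a single port GUID from the command line, roughly like this, using one of the port GUIDs from the ibstat output above:)

# opensm -B -g 0x0002c90300a28fb1    # -B runs as a daemon, -g binds to that port GUID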

 

 

I also noticed a few more things:

 

 

So far, only ibnetdiscover in the virtual system produces a system message on the hypervisor host:

 

 

mlx4_core 0000:04:00.0: slave 2 is trying to execute a Subnet MGMT MAD, class 0x1, method 0x81 for attr 0x11. Rejecting

mlx4_core 0000:04:00.0: vhcr command MAD_IFC (0x24) slave:2 in_param 0x26aaf000 in_mod=0xffff0001, op_mod=0xc failed with error:0, status -1

 

 

The sminfo command uses the correct OpenSM LID information, i.e. it routes the query to the LID of the OpenSM master, but the send still fails:

# sminfo --debug -v

ibwarn: [2843] smp_query_status_via: attr 0x20 mod 0x0 route Lid 1

ibwarn: [2843] _do_madrpc: send failed; Function not implemented

ibwarn: [2843] mad_rpc: _do_madrpc failed; dport (Lid 1)

sminfo: iberror: [pid 2843] main: failed: query

 

 

In the virtual host I can see this message in the log:

ibnetdiscover[2755]: segfault at e4 ip 00000031d420a8b6 sp 00007fffc2eee6b8 error 4 in libibmad.so.5.3.1[31d4200000+12000]

 

 

and in the hypervisor host:

<mlx4_ib> _mlx4_ib_mcg_port_cleanup: _mlx4_ib_mcg_port_cleanup-1102: ff12401bffff000000000000ffffffff (port 2): WARNING: group refcount 1!!! (pointer ffff88083f4fa000)

 

One more thing:

 

In the virtual machine I started OpenSM with the GUID pointing to the local port of the VF and got these messages:

 

Jul 24 14:27:09 830432 [FA2C0700] 0x80 -> Entering DISCOVERING state

Using default GUID 0x14050000000002

Jul 24 14:27:09 994036 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x14050000000002

Jul 24 14:27:10 398748 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x14050000000002

Jul 24 14:27:10 398958 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x14050000000002

Jul 24 14:27:10 399371 [FA2C0700] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x14050000000002

Jul 24 14:27:10 399960 [FA2C0700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0014050000000002

Jul 24 14:27:10 400439 [FA2C0700] 0x01 -> osm_vendor_set_sm: ERR 5431: setting IS_SM capmask: cannot open file '/dev/infiniband/issm0': Invalid argument

Jul 24 14:27:10 401700 [F66B8700] 0x01 -> osm_vendor_send: ERR 5430: Send p_madw = 0x7fcfe40008c0 of size 256 TID 0x1234 failed -5 (Invalid argument)

Jul 24 14:27:10 401700 [F66B8700] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_ERROR): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1234

Jul 24 14:27:10 401700 [F66B8700] 0x01 -> vl15_send_mad: ERR 3E03: MAD send failed (IB_UNKNOWN_ERROR)

Jul 24 14:27:10 401983 [F5CB7700] 0x01 -> state_mgr_is_sm_port_down: ERR 3308: SM port GUID unknown

 

A regular Linux cat on the file /dev/infiniband/issm0 works in the hypervisor system (at least it waits), while in the VM I get exactly the same error as in the OpenSM log:

 

# cat /dev/infiniband/issm0

cat: /dev/infiniband/issm0: Invalid argument

 

Both files, on the host and in the VM, are the same with regard to access permissions:

VM:

#ls -aZ  /dev/infiniband/issm0

crw-rw----. root root system_u:object_r:device_t:s0    /dev/infiniband/issm0

 

Host:

#ls -lZ /dev/infiniband/issm0

crw-rw----. root root system_u:object_r:device_t:s0    /dev/infiniband/issm0

Re: ESX 5.1.0 IPoIB NFS datastore freezes


memo (for myself)

 

Modifications to the environment during my tests:

 

The InfiniBand card was exchanged for a ConnectX PCIe gen2 (MT26418), the newer chip with PCIe 5.0GT/s, but still not a ConnectX-2 card. The error is still the same.

 

Updating the host BIOS does not help either. Even with the latest version installed, the error still occurs.

Re: Can I use FDR and QDR on the same infiniband switch?


Shouldn't be a problem; the InfiniBand spec defines backward compatibility.

What are you planning on running between those nodes?
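
Once the ConnectX-2 nodes are cabled in, the negotiated rate per port is easy to confirm (a sketch; the device name mlx4_0 is just an example, and QDR links report a rate of 40, FDR 56):

# ibstat mlx4_0 1 | grep Rate    # rate of the local port
# iblinkinfo                     # per-link width/speed across the whole fabric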

Re: ESX 5.1.0 IPoIB NFS datastore freezes


Hm,

 

It seems the error comes from using one port of an InfiniBand card as both

1) the VMkernel NFS interface

2) the network interface for the VM

 

See picture below.

vm.png

After separating the VM network and the VMkernel to different ports of the adapter, one can transfer gigabytes without errors. Maybe one of the VMware driver developers has an idea?
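
(For anyone hitting the same thing: the uplink-to-vSwitch assignment can be checked and changed from the ESXi shell, roughly as below; the vSwitch name is a placeholder.)

# esxcli network vswitch standard list                                   # shows each vSwitch and its uplinks
# esxcli network vswitch standard uplink add -u vmnic_ib1 -v vSwitch1    # put the VM network on the second IB port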

 

Best regards.

 

Markus
