Channel: Mellanox Interconnect Community: Message List

Re: Trunks, pvlans, infiniband world


Hi Daniel,

you have no loops in this setup ...

in any case, I suggest you have a look here:

 

Designing an HPC Cluster with Mellanox InfiniBand Solutions

 

And maybe start with this:

 

Understanding Up/Down InfiniBand Routing Algorithm

it will help you understand the routing algorithms and networking for InfiniBand.

This is a very small network. I would check that all the ports are being used. If there are not many flows (say, just one), only one port will be utilized, the same as in Ethernet.
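To illustrate why a single flow ends up on a single port: InfiniBand switches forward by destination LID, so every packet to a given destination leaves through the same port. The following is only a minimal sketch under assumptions. The round-robin assignment in `build_lft`, the LID values, and the port numbers are made up for illustration and are not what OpenSM actually computes.

```python
from collections import Counter

def build_lft(dest_lids, uplink_ports):
    """Assign each destination LID to one uplink port, round-robin.

    Hypothetical stand-in for a routing engine; real subnet managers
    use topology-aware algorithms.
    """
    return {lid: uplink_ports[i % len(uplink_ports)]
            for i, lid in enumerate(sorted(dest_lids))}

def port_usage(lft, flows):
    """Count how many flows leave through each uplink port.

    A flow is a (source LID, destination LID) pair; only the
    destination decides the output port.
    """
    return Counter(lft[dst] for _src, dst in flows)

lft = build_lft(dest_lids=[10, 11, 12, 13], uplink_ports=[1, 2])

one_flow = [(1, 10)]
many_flows = [(1, 10), (2, 11), (3, 12), (4, 13)]

print(port_usage(lft, one_flow))    # a single flow touches one port
print(port_usage(lft, many_flows))  # several destinations spread out
```

With one flow only one uplink carries traffic, no matter how many uplinks exist. With several destinations the load can spread across ports.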


Re: Trunks, pvlans, infiniband world


Can you add another figure with all the servers?

Re: Windows: Network cable unplugged


I was having the same problem, but now the next question is:
Where is the best place to run subnet manager?  A standalone server?  Server should have IB connected directly to it?  What about running it on the Xen hypervisor?  Is it ok as a VM?  Are there some best practices to this?

Re: Network topo for multi MPI clusters and one storage cluster?


Hello John,

Looking at the PDF, design 1 should work. The independent clusters' MPI traffic should remain local. Design 2 could be modified for storage high availability if you wanted to go that route, but it is not needed otherwise.

Regarding the routing algorithm, you should just use the out-of-the-box SM; I don't see any other option that would improve on it with this configuration.

Re: Network topo for multi MPI clusters and one storage cluster?


Hi Scot,

Many thanks. We'll probably go with design #1 and use default SM as you suggest.

-John

Re: Network topo for multi MPI clusters and one storage cluster?


Please let me know how it works – should be ideal config.

 

Regards,

Scot Schultz

Director, HPC and Technical Computing

Mellanox Technologies

350 Oakmead Parkway, Suite 100, Sunnyvale CA, 94085

Office: 408-916-0018, Mobile: 408-444-1364, Fax: 408-585-0318

Re: Trunks, pvlans, infiniband world


Hi,  there will be 3x chassis at first with 16 blades in each

 

Each blade will have a dual-port 40G mezzanine card connecting to the chassis 40G switches to reach external SSD SAN storage

 

I'll be using srp targets in this first setup

 

It might look like overkill for now but future arrays will likely be 24x nvme slots so I'll need as much performance as possible

 

Our initial testing in our lab, using only DDR 20G cards, shows the below from 2 SSD drives and some ZFS read cache

 


Test Jive - Ignore


TEST - IGNORE : MLNX OFED 3.2 centos 7.2 with RT kernel error


Hello.

I've updated my CentOS 7.1 to 7.2 and got a new kernel.

Next I compiled and installed the RT kernel as described in this article: How to build the CentOS 7 RT kernel - Hardware - Wiki

I should say that I always use the RT kernel with Mellanox, and this is the first time I have gotten an error.

Next I downloaded the latest MLNX OFED package, MLNX_OFED_LINUX-3.2-2.0.0.0-rhel7.2-x86_64.

Next I generated a new tgz with ./mlnx_add_kernel_support.sh -m PATHTOMLNX --make-tgz --skip-repo and unpacked it.

When I try to install it, it says:

Logs dir: /tmp/MLNX_OFED_LINUX-3.2-2.0.0.0.7321.logs

Verifying KMP rpms compatibility with target kernel...

The kernel KMP rpms coming with MLNX_OFED are not compatible with kernel: 3.10.0-327.10.1.rt56.211.el7.centos.x86_64

See log at /tmp/MLNX_OFED_LINUX-3.2-2.0.0.0.7321.logs/is_kmp_compat_check.log

If you believe that this is a false alarm you can force the installation using the '--skip-kmp-verify' flag.

The 3.10.0-327.10.1.rt56.211.el7.centos.x86_64 kernel is installed, MLNX_OFED does not have drivers available for this kernel.

You can run mlnx_add_kernel_support.sh in order to to generate an MLNX_OFED package with drivers for this kernel.

 

If I use the --skip-kmp-verify flag, it installs, but I cannot restart openibd because it prints the following:

 

Module mlx4_core belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module mlx4_ib belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module mlx4_core belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module mlx4_en belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module mlx5_core belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module mlx5_ib belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module ib_umad belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module ib_uverbs belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module ib_ipoib belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module rdma_cm belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module ib_ucm belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]

Module rdma_ucm belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]
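As the messages read, each skipped module is attributed to the kernel-rt package rather than to MLNX_OFED. A small sketch like the one below could summarize such output by package. It is only an illustration: `skipped_modules` is a hypothetical helper, and the regular expression looks only at the start of each message so that it does not depend on how the status marker is rendered.

```python
import re

# Matches the start of each message; the tail is ignored because the
# "[FAILED]" status marker may be printed over part of the line.
SKIP_RE = re.compile(r"Module (\S+) belong to (\S+) which is not a part of")

def skipped_modules(output):
    """Return {package: [modules]} for every skipped module, without duplicates."""
    by_package = {}
    for line in output.splitlines():
        match = SKIP_RE.search(line)
        if match:
            module, package = match.groups()
            modules = by_package.setdefault(package, [])
            if module not in modules:
                modules.append(module)
    return by_package

sample = """\
Module mlx4_core belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]
Module mlx4_ib belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]
Module mlx4_core belong to kernel-rt which is not a part of MLNX_OFED, skipping... [FAILED]
"""
print(skipped_modules(sample))
```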

 

Earlier, when I got this error, mlnx_add_kernel_support helped. Now it doesn't.

What am I doing wrong, and what can I do to fix it?

 

UPD: I tried to add kernel support to the tgz I had built, but it says that everything is already OK:

./mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-3.2-2.0.0.0-rhel7.2-x86_64-ext/ --make-tgz

Note: This program will create MLNX_OFED_LINUX TGZ for rhel7.2 under /tmp directory.

Do you want to continue?[y/N]:y

See log file /tmp/mlnx_ofed_iso.10452.log

Required kernel (3.10.0-327.10.1.rt56.211.el7.centos.x86_64) is already supported by MLNX_OFED_LINUX

ConnectX-4 CX456A does not work with opensm


I have two servers, each installed with a ConnectX-4 VPI 100Gb NIC (model: CX456A, two ports). The two ports are connected back to back using two copper cables. I have no problem when the two ports are set to Ethernet mode; the performance is quite close to 100Gb/s. To try InfiniBand mode, I switched port one to InfiniBand mode and restarted the servers.

 

ibv_devinfo shows the following:

...

hca_id: mlx5_0

        transport:                      InfiniBand (0)

        fw_ver:                         12.14.2036

        node_guid:                      7cfe:9003:0032:797a

        sys_image_guid:                 7cfe:9003:0032:797a

        vendor_id:                      0x02c9

        vendor_part_id:                 4115

        hw_ver:                         0x0

        board_id:                       MT_2190110032

        phys_port_cnt:                  1

        Device ports:

                port:   1

                        state:                  PORT_DOWN (1)

                        max_mtu:                4096 (5)

                        active_mtu:             4096 (5)

                        sm_lid:                 0

                        port_lid:               65535

                        port_lmc:               0x00

                        link_layer:             InfiniBand

...

Then I started the opensm daemon (service opensmd start) on one of the servers, but it seems opensm has a problem setting the LID of my card:

 

Mar 09 15:06:48 031794 [1D22700] 0x03 -> OpenSM 4.6.1.MLNX20160112.774e977

Mar 09 15:06:48 031842 [1D22700] 0x80 -> OpenSM 4.6.1.MLNX20160112.774e977

Mar 09 15:06:48 032470 [1D22700] 0x02 -> osm_vendor_init: 1000 pending umads specified

Mar 09 15:06:48 032516 [1D22700] 0x02 -> osm_vendor_init: 1000 pending umads specified

Mar 09 15:06:48 051285 [1D22700] 0x80 -> Entering DISCOVERING state

Mar 09 15:06:48 051416 [1D22700] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 086916 [1D22700] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 121806 [1D22700] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 121939 [1D22700] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x7cfe90030032797a

Mar 09 15:06:48 122094 [1D22700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x7cfe90030032797a

Mar 09 15:06:48 123326 [FF0F1700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:06:48 123690 [EE6D0700] 0x80 -> SM port is down

Mar 09 15:06:58 052236 [FE0EF700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:08 052293 [FC0EB700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:18 052465 [FB8EA700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:28 052535 [F88E4700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:38 052566 [FF8F2700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:48 052771 [FE8F0700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:07:58 052805 [FC8EC700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0

Mar 09 15:08:08 125373 [1D22700] 0x80 -> Exiting SM

 

I tried this several times and it is always like that. I googled around but can't find any useful information. Could you please give me a hint about what else I should do to find the reason?

Thank you so much!
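For long opensm logs like the one above, a short script can condense the repeated entries. This is a minimal sketch, assuming the line layout seen in this log (date, time, microseconds, thread id in brackets, level, then the message). `summarize` is a hypothetical helper, not part of OpenSM.

```python
import re
from collections import Counter

# Layout assumed from the log: "<date> <time> <usec> [<thread>] <level> -> <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) \d+ "
    r"\[(?P<thread>[0-9A-F]+)\] (?P<level>0x[0-9a-fA-F]+) -> (?P<msg>.*)$"
)
ERR_RE = re.compile(r"ERR ([0-9A-F]{4}):")

def summarize(log_text):
    """Count error codes and note whether the SM reported its own port down."""
    errors = Counter()
    sm_port_down = False
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        msg = match.group("msg")
        code = ERR_RE.search(msg)
        if code:
            errors[code.group(1)] += 1
        if msg == "SM port is down":
            sm_port_down = True
    return {"errors": dict(errors), "sm_port_down": sm_port_down}

sample = """\
Mar 09 15:06:48 122094 [1D22700] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x7cfe90030032797a
Mar 09 15:06:48 123326 [FF0F1700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
Mar 09 15:06:48 123690 [EE6D0700] 0x80 -> SM port is down
Mar 09 15:06:58 052236 [FE0EF700] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
"""
print(summarize(sample))
```

The "SM port is down" entry is the one to notice here: it says the port opensm bound to has no link.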

Re: macvlan ipv6 troubles with mlx4 (ConnectX-3)

Re: ConnectX-4 CX456A does not work with opensm


I also tried it with an SB7700 IB switch. The configuration shows that the subnet manager is enabled:

=================================================================

SB7700-IB-100Gb [standalone: master] (config) # show ib sm subnet-prefix

FE:80:00:00:00:00:00:00

SB7700-IB-100Gb [standalone: master] (config) # show ib sm sweep-interval

10 seconds

SB7700-IB-100Gb [standalone: master] (config) # show ib sm sweep-on-trap

enable

SB7700-IB-100Gb [standalone: master] (config) # show ib sm

enable

=================================================================

However, it didn't detect the connections on ports 1 and 3:

 

=================================================================

SB7700-IB-100Gb [standalone: master] (config) # show interface ib status

 

 

Interface      Description                                Speed                   Current line rate   Logical port state   Physical port state

---------      -----------                                ---------               -----------------   ------------------   -------------------

IB1/1                                                     -                       -                   Down                 Polling

IB1/2                                                     -                       -                   Down                 Polling

IB1/3                                                     -                       -                   Down                 Polling

IB1/4                                                     -                       -                   Down                 Polling

IB1/5                                                     -                       -                   Down                 Polling

IB1/6                                                     -                       -                   Down                 Polling

IB1/7                                                     -                       -                   Down                 Polling

IB1/8                                                     -                       -                   Down                 Polling

IB1/9                                                     -                       -                   Down                 Polling

IB1/10                                                    -                       -                   Down                 Polling

IB1/11                                                    -                       -                   Down                 Polling

IB1/12                                                    -                       -                   Down                 Polling

IB1/13                                                    -                       -                   Down                 Polling

IB1/14                                                    -                       -                   Down                 Polling

IB1/15                                                    -                       -                   Down                 Polling

IB1/16                                                    -                       -                   Down                 Polling

IB1/17                                                    -                       -                   Down                 Polling

IB1/18                                                    -                       -                   Down                 Polling

IB1/19                                                    -                       -                   Down                 Polling

IB1/20                                                    -                       -                   Down                 Polling

IB1/21                                                    -                       -                   Down                 Polling

IB1/22                                                    -                       -                   Down                 Polling

IB1/23                                                    -                       -                   Down                 Polling

IB1/24                                                    -                       -                   Down                 Polling

IB1/25                                                    -                       -                   Down                 Polling

IB1/26                                                    -                       -                   Down                 Polling

IB1/27                                                    -                       -                   Down                 Polling

IB1/28                                                    -                       -                   Down                 Polling

IB1/29                                                    -                       -                   Down                 Polling

IB1/30                                                    -                       -                   Down                 Polling

IB1/31                                                    -                       -                   Down                 Polling

IB1/32                                                    -                       -                   Down                 Polling

IB1/33                                                    -                       -                   Down                 Polling

IB1/34                                                    -                       -                   Down                 Polling

IB1/35                                                    -                       -                   Down                 Polling

IB1/36                                                    -                       -                   Down                 Polling

===========================================================================
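With 36 ports in the table, it can be easier to filter the output than to read it. The sketch below is an illustration only. It assumes each data row starts with the port name and ends with two single-word state columns, as in the table above, and both helper names are hypothetical.

```python
def parse_ib_status(text):
    """Map port name -> (logical state, physical state).

    Assumes each data row starts with the port name (e.g. "IB1/1") and
    ends with two single-word state columns.
    """
    ports = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].startswith("IB"):
            ports[fields[0]] = (fields[-2], fields[-1])
    return ports

def ports_with_link(ports):
    """Return the ports whose logical state is anything other than Down."""
    return [name for name, (logical, _physical) in ports.items()
            if logical != "Down"]

sample = """\
Interface      Description      Speed      Current line rate   Logical port state   Physical port state
IB1/1                           -          -                   Down                 Polling
IB1/2                           -          -                   Down                 Polling
IB1/3                           -          -                   Down                 Polling
"""
ports = parse_ib_status(sample)
print(ports_with_link(ports))
```

For the rows shown in this post the result is an empty list, which matches the observation that none of the cabled ports came up.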

Re: ConnectX-4 CX456A does not work with opensm


Hi Weijia,

 

Can you please provide from the switch the following outputs:

 

>show interface ib 1/1 transceiver

>show interface ib 1/2 transceiver

>show images

 

Can you also change the second port to IB and do a loopback test and check if the link comes online.

If so, try to do a back to back test between the servers using port 2 this time as IB.

 

Thank you,

Sophie.

Re: ConnectX-4 CX456A does not work with opensm


my 2c,

 

The issue is not with the subnet manager. The physical link between the two servers (in the back-to-back setup) or between the servers and the switch (in the switch setup) is not coming up. The subnet manager is responsible for the logical side of things, but the physical links must be up first.
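The order of checks described here (physical link first, subnet manager second) can be sketched as a small helper that reads the State and Physical state fields of ibstat output. This is only a rough illustration. `parse_ibstat_port` and `diagnose` are hypothetical names, and the suggested next steps are a simplification.

```python
def parse_ibstat_port(text):
    """Extract State and Physical state from one port section of ibstat output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields.get("State"), fields.get("Physical state")

def diagnose(state, physical_state):
    """Suggest where to look first, given ibstat's State and Physical state."""
    if physical_state != "LinkUp":
        # No physical link yet: the subnet manager cannot help at this stage.
        return "physical link is not up; check cabling and port configuration"
    if state == "Active":
        return "link is up and a subnet manager has configured the port"
    # Physical link is up but the port has not been brought to Active.
    return "physical link is up; check that a subnet manager is running"

sample = """\
Port 1:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 65535
"""
print(diagnose(*parse_ibstat_port(sample)))
```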

Re: MHQH19B-XTR - MFE_NO_FLASH_DETECTED


Hi Francesco Ghini,

 

Please open a case with Mellanox support by sending an email to support@mellanox.com.

This issue cannot be fixed with a simple workaround; assistance from support is needed here.

 

Best Regards,

Viki


Re: Can VMA and DPDK be used together?


Hi,

These are two completely different products. One (VMA) is a network stack which emulates RDMA over kernel sockets, and the other is user-space accelerator software which accelerates processes over L2 and L3.

 

They might be used together, for example, if DPDK is only accelerating L2 packets.

Re: ConnectX-4 CX456A does not work with opensm


Thank you Sophie, here is the result I get:

SB7700-IB-100Gb [standalone: master] # show interface ib 1/1 transceiver

IB1/1 state:

        Unknown cable.

        identifier              : (0x11)

        cable/ module type     : -

        infiniband speeds      : -

        vendor                 : -

        cable length           : -

        part number            : -

        revision               : -

        serial number          : -

 

SB7700-IB-100Gb [standalone: master] # show interface ib 1/2 transceiver

IB1/2 state:

        Cable is not present.

        identifier             : -

        cable/ module type     : -

        infiniband speeds      : -

        vendor                 : -

        cable length           : -

        part number            : -

        revision               : -

        serial number          : -

 

SB7700-IB-100Gb [standalone: master] # show interface ib 1/3 transceiver

IB1/3 state:

        Unknown cable.

        identifier              : (0x11)

        cable/ module type     : -

        infiniband speeds      : -

        vendor                 : -

        cable length           : -

        part number            : -

        revision               : -

        serial number          : -

============================================================

My cables are connected to SB7700 ports 1 and 3. Port 2 is empty.
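The transceiver output above can be reduced to one status line per port. The sketch below is an illustration under assumptions: it expects a header of the form "IB1/1 state:" followed by the status line, as printed here, and `cable_states` is a hypothetical helper.

```python
import re

HEADER_RE = re.compile(r"^(IB\d+/\d+) state:$")

def cable_states(text):
    """Map port name -> the status line printed right under its header."""
    states = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
        elif current is not None and current not in states:
            # First non-empty line after the header is the cable status.
            states[current] = line.rstrip(".")
    return states

sample = """\
IB1/1 state:
        Unknown cable.
        identifier              : (0x11)
IB1/2 state:
        Cable is not present.
        identifier             : -
"""
print(cable_states(sample))
```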

 

I also tried a back-to-back loop connection with both ports configured to IB mode. The link won't come up either:

 

...

CA 'mlx5_0'

        CA type: MT4115

        Number of ports: 1

        Firmware version: 12.14.2036

        Hardware version: 0

        Node GUID: 0x7cfe90030032797a

        System image GUID: 0x7cfe90030032797a

        Port 1:

                State: Down

                Physical state: Disabled

                Rate: 10

                Base lid: 65535

                LMC: 0

                SM lid: 0

                Capability mask: 0x2651e84a

                Port GUID: 0x7cfe90030032797a

                Link layer: InfiniBand

CA 'mlx5_1'

        CA type: MT4115

        Number of ports: 1

        Firmware version: 12.14.2036

        Hardware version: 0

        Node GUID: 0x7cfe90030032797b

        System image GUID: 0x7cfe90030032797a

        Port 1:

                State: Down

                Physical state: Disabled

                Rate: 10

                Base lid: 65535

                LMC: 0

                SM lid: 0

                Capability mask: 0x2651e848

                Port GUID: 0x7cfe90030032797b

                Link layer: InfiniBand

...

 

It seems that the SB7700 switch complains about the cable model; I'm using MCP1600. Should I use a different cable for IB?

Re: ConnectX-4 CX456A does not work with opensm


Thank you, Eddie, for the thoughts. I'm sure the physical link is correctly connected because Ethernet mode works without touching the hardware.

Re: ConnectX-4 CX456A does not work with opensm

I think you have an Ethernet cable that can't support IB mode. Could you check the cable model in the CLI?


Re: Error in ipoib


Hi Sophie Naudin

1.  ofed_info | head -1

MLNX_OFED_LINUX-3.1-1.0.3 (OFED-3.1-1.0.3):

2.  yum erase iptables - You're right, the firewall was in the system.

 

If I use IPoIB, do I need the rdma modules in the system?

 

 


