Hi,
I am experiencing several problems when using a ConnectX-3 40GbE adapter (MCX313A-BCBT) in an Asus X99-E WS motherboard.
First, it makes system startup quite unstable: in roughly 2 out of 10 attempts, the system halts before POST and shows error code 94 on the mainboard's 7-segment display (meaning PCI Enumeration Error).
When it does boot successfully, it emits PCI bus errors during initialization. The setup is a fresh install of Fedora 21 x86_64 (a supported OS) with the latest Linux driver (mlnx-en-3.0-1.0.1.tgz) and the latest firmware, and a single NVidia GPU installed besides the HCA. Sometimes the errors disable the card completely; sometimes the card starts to work after a 1-1.5 minute wait during boot. When such errors occur, they look like:
[ 10.743067] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[ 10.743077] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[ 10.743142] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000
[ 10.743187] pcieport 0000:00:02.0: [14] Completion Timeout (First)
[ 10.743225] pcieport 0000:00:02.0: broadcast error_detected message
[ 16.852525] mlx4_core 0000:0a:00.0: command 0xff6 timed out (go bit not cleared)
[ 16.852527] mlx4_core 0000:0a:00.0: RUN_FW command failed, aborting
[ 16.855670] mlx4_core 0000:0a:00.0: mlx4_cmd_post:cmd_pending failed
[ 16.855702] mlx4_core 0000:0a:00.0: Failed to start FW, aborting
[ 17.858368] mlx4_core: probe of 0000:0a:00.0 failed with error -110
[ 17.858638] pcieport 0000:00:02.0: AER: Device recovery failed
[ 17.858643] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[ 17.858652] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[ 17.858735] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000
[ 17.858787] pcieport 0000:00:02.0: [14] Completion Timeout (First)
[ 17.858832] pcieport 0000:00:02.0: broadcast error_detected message
[ 17.858836] pcieport 0000:00:02.0: AER: Device recovery failed
...
[ 61.820905] mlx4_core: device is working in RoCE mode: Roce V1
[ 61.820907] mlx4_core: gid_type 1 for UD QPs is not supported by the devicegid_type 0 was chosen instead
[ 61.820908] mlx4_core: UD QP Gid type is: V1
[ 101.351233] mlx4_core 0000:0a:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 101.351235] mlx4_core 0000:0a:00.0: PCIe link width is x8, device supports x8
[ 101.354441] mlx4_core 0000:0a:00.0: irq 62 for MSI/MSI-X
[ 101.354445] mlx4_core 0000:0a:00.0: irq 63 for MSI/MSI-X
[ 101.354448] mlx4_core 0000:0a:00.0: irq 64 for MSI/MSI-X
[ 101.354451] mlx4_core 0000:0a:00.0: irq 65 for MSI/MSI-X
[ 101.354453] mlx4_core 0000:0a:00.0: irq 66 for MSI/MSI-X
[ 101.354456] mlx4_core 0000:0a:00.0: irq 67 for MSI/MSI-X
[ 101.354459] mlx4_core 0000:0a:00.0: irq 68 for MSI/MSI-X
[ 101.354462] mlx4_core 0000:0a:00.0: irq 69 for MSI/MSI-X
[ 101.354464] mlx4_core 0000:0a:00.0: irq 70 for MSI/MSI-X
[ 101.354466] mlx4_core 0000:0a:00.0: irq 71 for MSI/MSI-X
[ 101.354469] mlx4_core 0000:0a:00.0: irq 72 for MSI/MSI-X
[ 101.354471] mlx4_core 0000:0a:00.0: irq 73 for MSI/MSI-X
[ 101.354474] mlx4_core 0000:0a:00.0: irq 74 for MSI/MSI-X
[ 102.097189] mlx4_core 0000:0a:00.0: mlx4_pci_err_detected was called
[ 102.097198] mlx4_core 0000:0a:00.0: device is going to be reset
[ 102.125455] mlx4_en: Mellanox ConnectX HCA Ethernet driver v3.0-1.0.1 (Feb 2014)
[ 103.138702] mlx4_core 0000:0a:00.0: device was reset successfully
[ 103.138717] mlx4_core 0000:0a:00.0: Could not post command 0xd: ret=-5, in_param=0x65ae56000, in_mod=0x100, op_mod=0x0
[ 103.138721] mlx4_core 0000:0a:00.0: SW2HW_MPT failed (-5)
[ 103.138724] mlx4_en 0000:0a:00.0: Failed enabling memory region
[ 104.151519] pcieport 0000:00:02.0: AER: Device recovery failed
[ 104.151526] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[ 104.151536] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[ 104.151540] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000
[ 104.151543] pcieport 0000:00:02.0: [14] Completion Timeout (First)
[ 104.151548] pcieport 0000:00:02.0: broadcast error_detected message
[ 104.151553] mlx4_core 0000:0a:00.0: mlx4_pci_err_detected was called
[ 104.151556] ------------[ cut here ]------------
[ 104.151565] WARNING: CPU: 0 PID: 165 at drivers/pci/pci.c:1535 pci_disable_device+0x99/0xb0()
[ 104.151567] mlx4_core 0000:0a:00.0: disabling already-disabled device
[ 104.151569] Modules linked in:
[ 104.151571] mlx5_core(OE) mlx4_ib(OE) mlx4_en(OE) vxlan udp_tunnel nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT xt_conntrack cfg80211 ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw snd_hda_codec_hdmi vfat x86_pkg_temp_thermal fat coretemp kvm crct10dif_pclmul crc32_pclmul snd_hda_intel crc32c_intel eeepc_wmi asus_wmi snd_hda_controller sparse_keymap rfkill snd_hda_codec iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_hwdep snd_seq snd_seq_device snd_pcm sb_edac snd_timer serio_raw edac_core snd soundcore
[ 104.151619] mlx4_core(OE) mlx_compat(OE) mei_me i2c_i801 lpc_ich mei mfd_core shpchp tpm_infineon tpm_tis tpm nouveau video mxm_wmi igb drm_kms_helper ttm e1000e drm dca ata_generic ptp i2c_algo_bit pata_acpi pps_core wmi [last unloaded: mlx4_core]
[ 104.151642] CPU: 0 PID: 165 Comm: kworker/0:2 Tainted: G OE 3.17.4-301.fc21.x86_64 #1
[ 104.151644] Hardware name: ASUS All Series/X99-E WS, BIOS 1102 04/28/2015
[ 104.151650] Workqueue: events aer_isr
[ 104.151653] 0000000000000000 0000000017f53b38 ffff880659a8bbe8 ffffffff8173f929
[ 104.151657] ffff880659a8bc30 ffff880659a8bc20 ffffffff810970ad ffff88065ccbc000
[ 104.151661] ffff88065cc60510 0000000000000001 ffff880658ecfb10 ffff88065cc85800
[ 104.151665] Call Trace:
[ 104.151671] [<ffffffff8173f929>] dump_stack+0x45/0x56
[ 104.151678] [<ffffffff810970ad>] warn_slowpath_common+0x7d/0xa0
[ 104.151683] [<ffffffff8109712c>] warn_slowpath_fmt+0x5c/0x80
[ 104.151696] [<ffffffffa0354938>] ? mlx4_enter_error_state.part.7+0x188/0x350 [mlx4_core]
[ 104.151704] [<ffffffff813c3d09>] pci_disable_device+0x99/0xb0
[ 104.151720] [<ffffffffa036b117>] mlx4_pci_err_detected+0x77/0xa0 [mlx4_core]
[ 104.151725] [<ffffffff813d71e0>] report_error_detected+0x50/0x100
[ 104.151730] [<ffffffff813d7190>] ? find_source_device+0x80/0x80
[ 104.151734] [<ffffffff813bc7a9>] pci_walk_bus+0x79/0xa0
[ 104.151738] [<ffffffff813d7190>] ? find_source_device+0x80/0x80
[ 104.151742] [<ffffffff813d6a4c>] broadcast_error_message+0xdc/0x100
[ 104.151746] [<ffffffff813d6ab3>] do_recovery+0x43/0x280
[ 104.151750] [<ffffffff813d67a9>] ? get_device_error_info+0xd9/0x1b0
[ 104.151754] [<ffffffff813d769a>] aer_isr+0x36a/0x450
[ 104.151761] [<ffffffff810af88d>] process_one_work+0x14d/0x400
[ 104.151765] [<ffffffff810b021b>] worker_thread+0x6b/0x4a0
[ 104.151770] [<ffffffff810b01b0>] ? rescuer_thread+0x2a0/0x2a0
[ 104.151773] [<ffffffff810b52fa>] kthread+0xea/0x100
[ 104.151777] [<ffffffff810b5210>] ? kthread_create_on_node+0x1a0/0x1a0
[ 104.151783] [<ffffffff81746a3c>] ret_from_fork+0x7c/0xb0
[ 104.151787] [<ffffffff810b5210>] ? kthread_create_on_node+0x1a0/0x1a0
[ 104.151789] ---[ end trace 858d8c660747219b ]---
[ 104.151793] pcieport 0000:00:02.0: AER: Device recovery failed
[ 104.151796] pcieport 0000:00:02.0: AER: Uncorrected (Non-Fatal) error received: id=0010
[ 104.151803] pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0010(Requester ID)
[ 104.151807] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00004000/00000000
[ 104.151810] pcieport 0000:00:02.0: [14] Completion Timeout (First)
...
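For anyone who wants to compare logs from several nodes, the `error status/mask=00004000/00000000` field above can be decoded bit by bit. The following is only a minimal sketch: the bit names follow the PCIe AER Uncorrectable Error Status register layout, and the function name `decode_aer_status` is made up for illustration. It assumes the line reports an uncorrectable error, as all the lines in this log do.

```python
import re

# Bit positions in the PCIe AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" field printed by pcieport.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_aer_status(line):
    """Return the names of the unmasked error bits in one dmesg line.

    Returns None if the line has no status/mask field (hypothetical helper).
    """
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    unmasked = status & ~mask  # keep only the bits that are not masked off
    return [
        UNCORRECTABLE_BITS.get(bit, "bit %d" % bit)
        for bit in range(32)
        if unmasked & (1 << bit)
    ]


if __name__ == "__main__":
    sample = ("[ 10.743142] pcieport 0000:00:02.0: device [8086:2f04] "
              "error status/mask=00004000/00000000")
    print(decode_aer_status(sample))
```

In this log every status value is 00004000, i.e. only bit 14 is set, which matches the "[14] Completion Timeout" line printed by the kernel.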
When using the same Mellanox card in a different mainboard (for example, a Gigabyte GA-Z97X-UD3H), it boots and initializes flawlessly, using the exact same OS.
We have a cluster built from the Asus boards, and they all show the same issue at random, so it is not a defect of a single mainboard; it looks like an incompatibility.
Has anybody experienced a similar issue?
Please share any suggestions about how to stabilize this.
Thanks,
Peter