I'm trying to use a Mellanox InfiniBand adapter with the SECO GPU DevKit on the NVIDIA CARMA platform. I connected a Netstor NA255A expansion box to the PCI slot of the SECO CARMA board and installed the InfiniBand adapter and an NVIDIA GPU inside it.
The SECO software is based on NVIDIA L4T R16 (Linux kernel 3.1.10). Unfortunately, I have not been able to find a driver that works correctly. I have tried the driver bundled with kernel 3.1.10, and both the 1.5.3 and 2.0 drivers from the Mellanox website.
With the Mellanox ConnectX-2 VPI, the system starts fine and I can exchange data over InfiniBand. However, the system hangs after several hours of operation (this does not happen if InfiniBand is not started). I see the following messages before the hang:
[ 8671.282351] irq 130: nobody cared (try booting with the "irqpoll" option)
[ 8671.289890] handlers:
[ 8671.292225] [<c0043d7c>] tegra_pcie_isr
[ 8671.296211] [<be80a324>] mlx4_interrupt
[ 8671.300142] Disabling IRQ #130
With the Mellanox ConnectX-3 VPI, the system crashes almost immediately after the driver is loaded, reporting an internal error:
[ 278.665090] mlx4_core 0000:06:00.0: Internal error detected:
[ 278.670933] mlx4_core 0000:06:00.0: buf[00]: 00000000
[ 278.676315] mlx4_core 0000:06:00.0: buf[01]: 00000000
[ 278.681682] mlx4_core 0000:06:00.0: buf[02]: 00000000
[ 278.687097] mlx4_core 0000:06:00.0: buf[03]: 00000000
[ 278.692547] mlx4_core 0000:06:00.0: buf[04]: 00000000
[ 278.697943] mlx4_core 0000:06:00.0: buf[05]: 00000000
[ 278.703342] mlx4_core 0000:06:00.0: buf[06]: 00000000
[ 278.708720] mlx4_core 0000:06:00.0: buf[07]: 00000000
[ 278.714104] mlx4_core 0000:06:00.0: buf[08]: 00000000
[ 278.719480] mlx4_core 0000:06:00.0: buf[09]: 00000000
[ 278.724867] mlx4_core 0000:06:00.0: buf[0a]: 00000000
[ 278.730232] mlx4_core 0000:06:00.0: buf[0b]: 00000000
[ 278.735605] mlx4_core 0000:06:00.0: buf[0c]: 00000000
[ 278.740975] mlx4_core 0000:06:00.0: buf[0d]: 00000000
[ 278.746343] mlx4_core 0000:06:00.0: buf[0e]: 00000000
[ 278.751705] mlx4_core 0000:06:00.0: buf[0f]: 00000000
In both cases, mlx4_core is loaded with the msi_x=0 option. Otherwise, there are not enough DMA resources to initialize the event queue table.
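For reference, this is roughly how the option can be set persistently, as a minimal sketch. The file name below is only an illustrative assumption; msi_x is the mlx4_core module parameter mentioned above.

```
# /etc/modprobe.d/mlx4_core.conf
# (file name is an assumption for illustration; modprobe reads any *.conf file in this directory)
# Load mlx4_core with MSI-X disabled.
options mlx4_core msi_x=0
```

The one-off equivalent is `modprobe mlx4_core msi_x=0`.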
Has anybody successfully used Mellanox adapters on an ARM platform? If so, which driver version and which options?