Hi and thanks for the reply,
I have 8 cores in total on each server, but when I use more than 4 threads with iperf (e.g. -P 8), I get worse performance. I guess this is because I have two sockets, each with its own NUMA cell.
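In case it helps anyone check the NUMA guess, here is a rough sketch of how the NIC's NUMA node could be read from sysfs and used to pin iperf. The helper name nic_numa_node and its optional second argument (a sysfs root, there only so it can be tried against a fake directory tree) are my own inventions, and <server> is a placeholder. I have not verified that pinning fixes the -P 8 slowdown.

```shell
# Sketch only: print the NUMA node a NIC is attached to.
# nic_numa_node is a hypothetical helper name; $1 is the interface name,
# $2 is an optional sysfs root (defaults to /sys).
nic_numa_node() {
    nn_file="${2:-/sys}/class/net/$1/device/numa_node"
    if [ -r "$nn_file" ]; then
        cat "$nn_file"
    else
        # No NUMA information exposed for this interface.
        echo "-1"
    fi
}

# Example use (not run here): keep iperf's CPUs and memory on the NIC's node.
#   node=$(nic_numa_node <eth_if>)
#   numactl --cpunodebind="$node" --membind="$node" iperf -c <server> -P 4
```

A result of -1 means the kernel reports no NUMA affinity for the device, in which case the numactl line should be skipped.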
I installed CentOS 7 to experiment, and with the same optimizations I was able to get 39.6Gbps! However, these two OSs (CentOS 7 and Ubuntu 12.04) are completely different (kernel 3.10 vs 3.2), and since CentOS 7 was released very recently, it has a much more up-to-date set of tools.
MTU 9000 helped a lot in this case, as expected (contrary to what I saw on Ubuntu 12.04), but the results were very unstable from one test to the next and could vary from 20Gbps up to 39Gbps.
CentOS 7 also uses iperf3, and I was able to get this high speed only when using 1 thread (-P1) together with the --zerocopy option (this option does not exist in the old version of iperf that comes with Ubuntu 12.04).
Using the following optimization, as described in the tuning guide, solved the instability problem (or at least made it much more stable... 29/30 times I will still get a speed of around 20Gbps on iperf):
ethtool --set-priv-flags <eth_if> mlx4_rss_xor_hash_function on
Unfortunately, the version of ethtool on Ubuntu 12.04 does not support the --set-priv-flags option, so I cannot test there.
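Since the results vary so much between runs, a small helper to summarise a series of measurements may be handy when comparing settings. This is only a sketch: summarize_gbps is a hypothetical name, and it assumes one throughput figure in Gbps per line on stdin, collected however you like (for example by hand, or from iperf3's JSON output via -J).

```shell
# Sketch only: summarise throughput figures (one Gbps value per line on stdin).
# summarize_gbps is a hypothetical helper, not part of iperf or ethtool.
summarize_gbps() {
    awk '
        NF == 0 { next }    # skip blank lines
        {
            v = $1 + 0      # force numeric interpretation
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v
            n++
        }
        END {
            if (n == 0) { print "no samples"; exit 1 }
            printf "runs=%d min=%.1f max=%.1f mean=%.1f\n", n, min, max, sum / n
        }
    '
}
```

For example, printf '20\n39.6\n30.4\n' | summarize_gbps prints runs=3 min=20.0 max=39.6 mean=30.0. A large gap between min and max across many runs is the instability described above.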