A few folks have recently been looking at dual- versus quad-socket machines. It's definitely worth taking a look at this picture.
Notice that there are two IO Hubs (IOHs) and that each IO Hub is connected directly to two of the CPU sockets via QPI. The implication, of course, is that a given NIC (which is connected via PCIe to an IOH) is more local to one pair of CPUs than to the other. Oh yes, and your performance will differ accordingly.
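As a rough way to see which CPUs a NIC is local to, Linux exposes the NUMA node of a PCIe device under sysfs. The sketch below assumes a Linux host that populates `/sys/class/net/<iface>/device/numa_node` and `/sys/devices/system/node/node<N>/cpulist`; the helper names are made up for illustration, and some platforms report -1 (unknown) for the node, in which case the function returns an empty list.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nic_numa_node(iface, sysfs=Path("/sys")):
    """Return the NUMA node the NIC's PCIe device reports, or None if it is unknown (-1)."""
    node = int((sysfs / "class/net" / iface / "device/numa_node").read_text())
    return node if node >= 0 else None


def local_cpus(iface, sysfs=Path("/sys")):
    """Return the CPUs on the NIC's NUMA node, or [] if the node is not reported."""
    node = nic_numa_node(iface, sysfs)
    if node is None:
        return []
    cpulist = (sysfs / "devices/system/node" / f"node{node}" / "cpulist").read_text()
    return parse_cpulist(cpulist)
```

Calling `local_cpus("eth0")` (where `eth0` is just a placeholder interface name) gives the candidate cores for interrupts and the application. On a box like the one pictured, the reported node is only a hint: the NIC hangs off an IOH shared by two sockets, so it is worth measuring rather than trusting the number.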
With the dual-socket Nehalem/Westmere architecture, you "really couldn't go wrong" (TM): if using the kernel stack, just get your interrupt and application affinity straight; if using Onload, just clean up your cores. But now affinitisation can make things worse if you get it wrong...
Greg Law put it nicely the other day:
"On a 4 socket machine, if you get affinitisation right you'll get the best performance. But if you get it wrong, you'll get worse performance than if you did nothing at all and let everything float. [To get it right,] follow these steps:
1. Check your NIC is plugged into a PCIe slot that is closest to the NUMA node on which your app is running.
2. Check your interrupts are directed to a CPU on that same NUMA node.
3. If running MRG, ensure the soft IRQ threads are pinned to the same CPU core as the interrupts.
In general, I would recommend you start profiling with no pinning of application or interrupts - let the OS decide. Then, pin things one by one (interrupt, softirq thread, application), and at each step, ensure that the performance is not adversely affected.
"
So ... don't start with your 2-socket configuration and think you can just "spread" it to 4 sockets. If in doubt, let everything float and start tuning again.
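To make the "pin one thing at a time and check" advice from the quote concrete, here is a minimal sketch, not a definitive tool. It assumes Linux, root privileges for the IRQ write, and that the NIC's interrupts can be found by looking for the interface name in `/proc/interrupts`. All function names are hypothetical, and `measure` is whatever benchmark you already trust, returning a number where lower is better (for example, median latency). Pinning the softirq threads from step 3 is not covered here.

```python
import os
from pathlib import Path


def irqs_for(iface, interrupts_text):
    """Return IRQ numbers whose /proc/interrupts line mentions the interface name."""
    irqs = []
    for line in interrupts_text.splitlines():
        head, _, rest = line.partition(":")
        if head.strip().isdigit() and iface in rest:
            irqs.append(int(head))
    return irqs


def pin_irq(irq, cpu, proc=Path("/proc")):
    """Direct one IRQ to a single CPU (needs root; irqbalance may later override it)."""
    (proc / "irq" / str(irq) / "smp_affinity_list").write_text(f"{cpu}\n")


def pin_process(cpus, pid=0):
    """Restrict a process (pid 0 means the caller) to the given set of CPUs."""
    os.sched_setaffinity(pid, cpus)


def tune_stepwise(steps, measure, tolerance=0.0):
    """Apply (name, apply, undo) steps one by one, keeping only those that do not hurt.

    measure() must return a lower-is-better number. A step is undone if the
    result is worse than the best result so far by more than `tolerance`
    (a fraction, e.g. 0.02 for 2%).
    """
    best = measure()  # baseline: everything floating
    kept = []
    for name, apply, undo in steps:
        apply()
        result = measure()
        if result <= best * (1 + tolerance):
            kept.append(name)
            best = min(best, result)
        else:
            undo()
    return kept
```

A step might be, say, `("irq", lambda: pin_irq(45, 2), lambda: pin_irq(45, 0))`, with the IRQ and CPU numbers taken from your own machine rather than from this example. The point of the loop is simply that the un-pinned baseline is measured first, so a bad pinning choice gets rolled back instead of silently becoming the new normal.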
If you feel like panicking, then I suggest you rip out 2 of the CPUs!! Your application will likely go faster that way, and you can sell them on eBay.
... oh, and also think ahead to a future where IO is local to each CPU.