Thread pool program runs much slower on a much faster server

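UPD: I now think the root of my problem is not threading, because I observe the slowdown at every point in my program. I suspect the program runs slower with 2 processors because the two processors need to "communicate" with each other. I need to run some tests. I will try disabling one of the processors and see what happens.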
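One way to approximate this test without touching the BIOS is to pin the process to the logical processors of a single socket. The sketch below uses `Process.ProcessorAffinity`; the class and helper names are made up for illustration, and the assumption that logical processors 0 to 11 all sit on the first socket has to be checked against the real topology. It only restricts where threads are scheduled, so it is not equivalent to physically disabling a processor.

```csharp
using System;
using System.Diagnostics;

static class AffinityExperiment
{
    // Builds a bitmask with the lowest `count` bits set (one bit per logical processor).
    public static long MaskForFirstProcessors(int count)
    {
        if (count < 1 || count > 63)
            throw new ArgumentOutOfRangeException(nameof(count));
        return (1L << count) - 1;
    }

    static void Main()
    {
        // ASSUMPTION: logical processors 0..11 belong to the first socket.
        // Verify the machine's actual layout before trusting any measurement.
        int wanted = Math.Min(12, Environment.ProcessorCount);
        long mask = MaskForFirstProcessors(wanted);

        // Restrict scheduling of every thread in this process to the masked processors.
        // The setter is supported on Windows and Linux only.
        Process current = Process.GetCurrentProcess();
        current.ProcessorAffinity = (IntPtr)mask;

        Console.WriteLine($"Affinity mask: 0x{mask:X}");
        // ... start the normal processing here and compare the checkpoint timings ...
    }
}
```

If the timings improve noticeably with the mask applied, that points toward cross-socket traffic rather than the number of threads.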

====================================

I'm not sure whether this is a C# question; it may be more about hardware, but I think C# is the best fit.

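I was using a cheap DL120 server and decided to upgrade to a more expensive 2-processor DL360p server. Unexpectedly, my C# program runs about 2 times slower on the new server, which should be several times faster.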

I process FAST data for about 60 instruments. I create a separate task for each instrument, like this:

    // Fragment: the enclosing method's header is not shown in the original post.
    // Needs System.Collections.Concurrent and System.Threading.Tasks.
        BlockingCollection<OrderUpdate> updatesQuery;
        if (instrument2OrderUpdates.ContainsKey(instrument))
        {
            updatesQuery = instrument2OrderUpdates[instrument];
        }
        else
        {
            // First update for this instrument: create its queue and start its consumer.
            updatesQuery = new BlockingCollection<OrderUpdate>();
            instrument2OrderUpdates[instrument] = updatesQuery;
            ScheduleFastOrdersProcessing(updatesQuery);
        }
        orderUpdate.Checkpoint("updatesQuery.Add");
        updatesQuery.Add(orderUpdate);
    }

    private void ScheduleFastOrdersProcessing(BlockingCollection<OrderUpdate> updatesQuery)
    {
        // LongRunning hints the scheduler to give this task a dedicated thread.
        Task.Factory.StartNew(() =>
        {
            Instrument instrument = null;
            OrderBook orderBook = null;
            int lastRptSeqNum = -1;
            while (!updatesQuery.IsCompleted)
            {
                OrderUpdate orderUpdate;
                try
                {
                    // Blocks until an update is available for this instrument.
                    orderUpdate = updatesQuery.Take();
                }
                catch (InvalidOperationException e)
                {
                    // Thrown when the collection is completed while waiting.
                    Log.Push(LogItemType.Error, e.Message);
                    continue;
                }
                orderUpdate.Checkpoint("received from updatesQuery.Take()");
                ......................
                ......................
                // long not interesting processing code
            }
        }, TaskCreationOptions.LongRunning);
    }

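Because I have about 60 tasks that can execute in parallel, I expected 2 * E5-2640 (24 virtual threads, 12 real threads) to run much faster than 1 * E3-1220 (4 real threads). It seems that with the DL360p I see 95 threads in Task Manager. With the DL120 I have only 55 threads.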

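But the DL120 G7 is 2 times faster (!!). The E3-1220 has a somewhat higher clock rate than the E5-2640 (3.1 GHz vs 2.5 GHz), but I still expected my code to run faster on 2 * E5-2640 because it can be parallelized better, and I certainly did not expect it to be 2 times slower!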

HP DL120G7 E3-1220

About 50 threads in Task Manager; best = 24, average ~ 80 microseconds

    calling market.UpdateFastOrder = 23   updatesQuery.Add = 25   received from updatesQuery.Take() = 67   in orderbook = 80
    calling market.UpdateFastOrder = 30   updatesQuery.Add = 32   received from updatesQuery.Take() = 64   in orderbook = 73
    calling market.UpdateFastOrder = 31   updatesQuery.Add = 32   received from updatesQuery.Take() = 195   in orderbook = 204
    calling market.UpdateFastOrder = 31   updatesQuery.Add = 32   received from updatesQuery.Take() = 74   in orderbook = 86
    calling market.UpdateFastOrder = 18   updatesQuery.Add = 21   received from updatesQuery.Take() = 65   in orderbook = 78
    calling market.UpdateFastOrder = 29   updatesQuery.Add = 32   received from updatesQuery.Take() = 76   in orderbook = 88
    calling market.UpdateFastOrder = 30   updatesQuery.Add = 32   received from updatesQuery.Take() = 80   in orderbook = 92
    calling market.UpdateFastOrder = 20   updatesQuery.Add = 21   received from updatesQuery.Take() = 65   in orderbook = 78
    calling market.UpdateFastOrder = 21   updatesQuery.Add = 24   received from updatesQuery.Take() = 68   in orderbook = 81
    calling market.UpdateFastOrder = 12   updatesQuery.Add = 13   received from updatesQuery.Take() = 58   in orderbook = 72
    calling market.UpdateFastOrder = 22   updatesQuery.Add = 23   received from updatesQuery.Take() = 51   in orderbook = 59
    calling market.UpdateFastOrder = 16   updatesQuery.Add = 16   received from updatesQuery.Take() = 20   in orderbook = 24
    calling market.UpdateFastOrder = 28   updatesQuery.Add = 31   received from updatesQuery.Take() = 82   in orderbook = 94
    calling market.UpdateFastOrder = 18   updatesQuery.Add = 21   received from updatesQuery.Take() = 65   in orderbook = 77
    calling market.UpdateFastOrder = 29   updatesQuery.Add = 29   received from updatesQuery.Take() = 259   in orderbook = 264
    calling market.UpdateFastOrder = 49   updatesQuery.Add = 52   received from updatesQuery.Take() = 99   in orderbook = 113
    calling market.UpdateFastOrder = 22   updatesQuery.Add = 23   received from updatesQuery.Take() = 50   in orderbook = 60
    calling market.UpdateFastOrder = 29   updatesQuery.Add = 32   received from updatesQuery.Take() = 76   in orderbook = 88
    calling market.UpdateFastOrder = 16   updatesQuery.Add = 19   received from updatesQuery.Take() = 63   in orderbook = 75
    calling market.UpdateFastOrder = 27   updatesQuery.Add = 27   received from updatesQuery.Take() = 226   in orderbook = 231
    calling market.UpdateFastOrder = 15   updatesQuery.Add = 16   received from updatesQuery.Take() = 35   in orderbook = 42
    calling market.UpdateFastOrder = 18   updatesQuery.Add = 21   received from updatesQuery.Take() = 66   in orderbook = 78

HP DL360p G8 2 * E5-2640

About 95 threads in Task Manager; best = 40, average ~ 150 microseconds

    calling market.UpdateFastOrder = 62   updatesQuery.Add = 64   received from updatesQuery.Take() = 144   in orderbook = 205
    calling market.UpdateFastOrder = 27   updatesQuery.Add = 32   received from updatesQuery.Take() = 101   in orderbook = 154
    calling market.UpdateFastOrder = 45   updatesQuery.Add = 50   received from updatesQuery.Take() = 124   in orderbook = 187
    calling market.UpdateFastOrder = 46   updatesQuery.Add = 51   received from updatesQuery.Take() = 127   in orderbook = 162
    calling market.UpdateFastOrder = 63   updatesQuery.Add = 68   received from updatesQuery.Take() = 137   in orderbook = 174
    calling market.UpdateFastOrder = 53   updatesQuery.Add = 55   received from updatesQuery.Take() = 133   in orderbook = 171
    calling market.UpdateFastOrder = 44   updatesQuery.Add = 46   received from updatesQuery.Take() = 131   in orderbook = 158
    calling market.UpdateFastOrder = 37   updatesQuery.Add = 39   received from updatesQuery.Take() = 102   in orderbook = 140
    calling market.UpdateFastOrder = 45   updatesQuery.Add = 50   received from updatesQuery.Take() = 115   in orderbook = 154
    calling market.UpdateFastOrder = 50   updatesQuery.Add = 55   received from updatesQuery.Take() = 133   in orderbook = 160
    calling market.UpdateFastOrder = 26   updatesQuery.Add = 50   received from updatesQuery.Take() = 99   in orderbook = 111
    calling market.UpdateFastOrder = 14   updatesQuery.Add = 30   received from updatesQuery.Take() = 36   in orderbook = 40   <-- best one I can find among thousands

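Can you see why my program runs 2 times slower on a server that is supposed to be several times faster? Maybe I should not create ~60 tasks? Maybe I should tell .NET not to use 95 threads and limit it to 50 or even 24? Maybe this is an issue with the 2-processor versus 1-processor configuration? Maybe simply disabling one of the processors on my DL360p Gen8 would speed the program up significantly?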

Added:

  • calling market.UpdateFastOrder: the orderUpdate object is created
  • updatesQuery.Add: the orderUpdate is put into the BlockingCollection
  • received from updatesQuery.Take(): the orderUpdate is popped from the BlockingCollection
  • in orderbook: the orderUpdate has been parsed and applied to the orderBook
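For readers who want to reproduce this kind of measurement, here is a minimal sketch of how such checkpoints could be recorded with `Stopwatch`. `TimedUpdate` is a hypothetical stand-in for the `OrderUpdate` class, whose real implementation is not shown in the post. The sketch assumes times are measured from object construction; where the original timings start is not stated.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical stand-in for OrderUpdate; only the timing part is sketched.
sealed class TimedUpdate
{
    // ASSUMPTION: time zero is the moment the object is constructed.
    private readonly long _start = Stopwatch.GetTimestamp();
    private readonly List<(string Name, long Ticks)> _checkpoints =
        new List<(string Name, long Ticks)>();

    // Records the Stopwatch ticks elapsed since construction under the given label.
    // Not thread-safe on its own; it relies on the update being handed from one
    // thread to the next (for example through a BlockingCollection).
    public void Checkpoint(string name)
    {
        _checkpoints.Add((name, Stopwatch.GetTimestamp() - _start));
    }

    // Yields each checkpoint converted to whole microseconds, in recording order.
    public IEnumerable<(string Name, long Microseconds)> Report()
    {
        foreach (var (name, ticks) in _checkpoints)
            yield return (name, ticks * 1_000_000L / Stopwatch.Frequency);
    }
}

static class CheckpointDemo
{
    static void Main()
    {
        var update = new TimedUpdate();
        update.Checkpoint("updatesQuery.Add");
        update.Checkpoint("received from updatesQuery.Take()");
        update.Checkpoint("in orderbook");

        foreach (var (name, microseconds) in update.Report())
            Console.WriteLine($"{name} = {microseconds}");
    }
}
```

Each value is cumulative, so the queue hand-off cost is the difference between the `Take()` and `Add` checkpoints, which is where most of the time goes in the figures above.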

Just because you have a system that can handle more threads does not mean that all of those threads can actually be processed fully in parallel.

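When I upgraded from a quad-core CPU to an i7 (8 virtual cores), I noticed that using more threads than cores caused the threads to block each other for some time, which slowed the whole system down.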

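The problem was simply that my algorithms were already able to use the full processing time of the core their thread was running on, while the waiting threads only got to work about 5 to 10% of the time. As a result, the main threads finished while some individual threads still had all of their work left to do (which took the same amount of time again).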

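The thread pool only continues once all workers have finished, so the total time until completion is determined by the processor time still outstanding for the remaining threads.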

Maybe you just need to find the optimal number of threads.
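A rough way to look for that number is to run the same workload with different worker counts and compare. The sketch below is only an illustration: `WorkerCountProbe`, `FakeWork` and the item counts are invented, the workers share one queue instead of one queue per instrument, and it measures total throughput rather than the per-update latency shown in the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Linq;
using System.Threading;

static class WorkerCountProbe
{
    // Placeholder for the real per-update processing; burns a little CPU.
    static long FakeWork(int seed)
    {
        long acc = seed;
        for (int i = 0; i < 20_000; i++)
            acc = (acc * 31 + i) % 1_000_003;
        return acc;
    }

    // Processes `items` updates on `workers` dedicated threads and returns the elapsed time.
    public static TimeSpan Run(int workers, int items)
    {
        using (var queue = new BlockingCollection<int>())
        {
            long processed = 0;
            long checksum = 0;
            var threads = Enumerable.Range(0, workers).Select(_ => new Thread(() =>
            {
                // Ends once CompleteAdding has been called and the queue is drained.
                foreach (int item in queue.GetConsumingEnumerable())
                {
                    Interlocked.Add(ref checksum, FakeWork(item));
                    Interlocked.Increment(ref processed);
                }
            })).ToList();

            var stopwatch = Stopwatch.StartNew();
            threads.ForEach(t => t.Start());
            for (int i = 0; i < items; i++)
                queue.Add(i);
            queue.CompleteAdding();
            threads.ForEach(t => t.Join());
            stopwatch.Stop();

            if (Interlocked.Read(ref processed) != items)
                throw new InvalidOperationException("Not every item was processed.");
            return stopwatch.Elapsed;
        }
    }

    static void Main()
    {
        int cores = Environment.ProcessorCount;
        foreach (int workers in new[] { 1, 2, 4, cores, 2 * cores })
        {
            TimeSpan elapsed = Run(workers, 20_000);
            Console.WriteLine($"{workers} workers: {elapsed.TotalMilliseconds:F1} ms");
        }
    }
}
```

With real processing code in place of `FakeWork`, the worker count where the elapsed time stops improving is a reasonable starting point; latency-sensitive code would need the checkpoint timings compared as well.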