C#: How to make the Sieve of Atkin incremental

I don't know if this is possible, but I'd just like to ask. My mathematical and algorithmic skills are failing me here :P

The thing is, I currently have this class that generates primes up to a certain limit:

 public class Atkin : IEnumerable<ulong>
 {
     private readonly List<ulong> primes;
     private readonly ulong limit;

     public Atkin(ulong limit)
     {
         this.limit = limit;
         primes = new List<ulong>();
     }

     private void FindPrimes()
     {
         var isPrime = new bool[limit + 1];
         var sqrt = Math.Sqrt(limit);

         for (ulong x = 1; x <= sqrt; x++)
             for (ulong y = 1; y <= sqrt; y++)
             {
                 var n = 4*x*x + y*y;
                 if (n <= limit && (n % 12 == 1 || n % 12 == 5))
                     isPrime[n] ^= true;

                 n = 3*x*x + y*y;
                 if (n <= limit && n % 12 == 7)
                     isPrime[n] ^= true;

                 n = 3*x*x - y*y;
                 if (x > y && n <= limit && n % 12 == 11)
                     isPrime[n] ^= true;
             }

         for (ulong n = 5; n <= sqrt; n++)
             if (isPrime[n])
             {
                 var s = n * n;
                 for (ulong k = s; k <= limit; k += s)
                     isPrime[k] = false;
             }

         primes.Add(2);
         primes.Add(3);
         for (ulong n = 5; n <= limit; n++)
             if (isPrime[n])
                 primes.Add(n);
     }

     public IEnumerator<ulong> GetEnumerator()
     {
         if (!primes.Any())
             FindPrimes();

         foreach (var p in primes)
             yield return p;
     }

     IEnumerator IEnumerable.GetEnumerator()
     {
         return GetEnumerator();
     }
 }

Now, I would like to get rid of the limit so that the sequence never stops (in theory).

I am thinking it could go something like this:

  1. Start with some trivial limit
  2. Find all the primes up to the limit
  3. Yield all the newfound primes
  4. Increase the limit (by doubling or squaring the old limit, or something like that)
  5. Goto step 2

And optimally, step two should only have to work between the old limit and the new one. In other words, it shouldn't have to find the lowest primes over and over again.

Is there a way this can be done? My main problem is that I don't quite understand what x and y are in this algorithm. Like, could I use the same type of algorithm but set x and y to oldLimit (initially 1) and run it up to newLimit? Or how would that work? Any bright minds with some light to shed on this?

The point of this is so that I won't have to set that limit. So that I can, for example, use LINQ and just Take() however many primes I need, without worrying about whether the limit is high enough, et cetera.
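For illustration, the usage being asked for might look something like this (a hedged sketch: the `Primes` generator here is a stand-in using naive trial division, purely to show the unbounded `IEnumerable<ulong>` shape the question wants; the real goal is to back it with an incremental sieve instead):

```csharp
using System;
using System.Linq;
using System.Collections.Generic;

public static class Example
{
    // Stand-in unbounded prime generator (trial division by previously found
    // primes); illustrative only, not the incremental sieve being asked for.
    public static IEnumerable<ulong> Primes()
    {
        var found = new List<ulong>();
        for (ulong n = 2; ; ++n)
        {
            bool isPrime = true;
            foreach (var p in found)
            {
                if (p * p > n) break;
                if (n % p == 0) { isPrime = false; break; }
            }
            if (isPrime) { found.Add(n); yield return n; }
        }
    }

    public static void Main()
    {
        // No limit needs to be chosen up front:
        Console.WriteLine(string.Join(", ", Primes().Take(10))); // 2, 3, 5, 7, 11, 13, 17, 19, 23, 29
        Console.WriteLine(Primes().First(p => p > 1000));        // 1009
    }
}
```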

Here is a solution to unbounded prime sieving in C#, which can be implemented using either the Sieve of Eratosthenes (SoE) or the Sieve of Atkin (SoA) algorithm; however, I maintain that the extreme complexity of an optimized SoA solution is hardly worth it as compared to the true SoE, which gives about the same performance without as much complexity. Thus, this is perhaps only a partial answer in that, while it shows how to implement a better SoA algorithm and shows how to implement an unbounded sequence using SoE, it only hints at how to combine these in order to write a reasonably efficient SoA.

Note that if only discussion on the very fastest implementations of these ideas is desired, jump to the bottom of this answer.

First we should comment on the point of this exercise in producing an unbounded sequence of primes so as to permit using the IEnumerable extension methods such as Take(), TakeWhile(), Where(), Count(), et cetera, as these methods give away some performance due to the added levels of method calling, adding at least 28 machine cycles for every call and return to enumerate one value and adding several levels of method calls per function; that said, having an (effectively) unbounded sequence is still useful even if one uses more imperative filtering techniques on the results of the enumerations for better speed.

Note that in the simpler examples following, I have limited the range to that of unsigned 32-bit numbers (uint) as much past that range the basic SoE or SoA implementations aren't really appropriate as to efficiency and one needs to switch over to a "bucket" or other form of incremental sieve for part of the sieving; however, for a fully optimized page-segmented sieve as in the fastest implementation, the range is increased to the 64-bit range, although as written one likely would not want to sieve past about a hundred trillion (ten to the fourteenth power) as run time for the full range would take hundreds of years.

As the question chooses the SoA for probably the wrong reasons in first mistaking a Trial Division (TD) sieve for a true SoE and then using a naive SoA over that TD sieve, let's first establish the true bounded SoE, which can be implemented in a few lines as a method (which could be converted to a class as per the question's implementation coding style), as follows:

 static IEnumerable<uint> primesSoE(uint top_number)
 {
     if (top_number < 2u) yield break;
     yield return 2u;
     if (top_number < 3u) yield break;
     var BFLMT = (top_number - 3u) / 2u;
     var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 3u) / 2u;
     var buf = new BitArray((int)BFLMT + 1, true);
     for (var i = 0u; i <= BFLMT; ++i)
         if (buf[(int)i])
         {
             var p = 3u + i + i;
             if (i <= SQRTLMT)
                 for (var j = (p * p - 3u) / 2u; j <= BFLMT; j += p)
                     buf[(int)j] = false;
             yield return p;
         }
 }

This calculates the primes up to two million in about 16 milliseconds on an i7-2700K (3.5 GHz), and the 203,280,221 primes in the full 32-bit number range in about 67 seconds on the same machine (given 256 MegaBytes of spare RAM memory).

Now, it is hardly fair to compare the above version with the SoA, as the true SoA automatically ignores the multiples of 2, 3, and 5, so introducing wheel factorization to do the same for the SoE yields the following code:

 static IEnumerable<uint> primesSoE(uint top_number)
 {
     if (top_number < 2u) yield break;
     yield return 2u;
     if (top_number < 3u) yield break;
     yield return 3u;
     if (top_number < 5u) yield break;
     yield return 5u;
     if (top_number < 7u) yield break;
     var BFLMT = (top_number - 7u) / 2u;
     var SQRTLMT = ((uint)(Math.Sqrt((double)top_number)) - 7u) / 2u;
     var buf = new BitArray((int)BFLMT + 1, true);
     byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 };
     for (uint i = 0u, w = 0u; i <= BFLMT; i += WHLPTRN[w], w = (w < 7u) ? ++w : 0u)
         if (buf[(int)i])
         {
             var p = 7u + i + i;
             if (i <= SQRTLMT)
             {
                 var pX2 = p + p;
                 uint[] pa = { p, pX2, pX2 + p };
                 for (uint j = (p * p - 7u) / 2u, m = w; j <= BFLMT;
                          j += pa[WHLPTRN[m] - 1u], m = (m < 7u) ? ++m : 0u)
                     buf[(int)j] = false;
             }
             yield return p;
         }
 }

The above code calculates the primes up to two million in about 10 milliseconds, and the primes in the 32-bit number range in about 40 seconds on the same machine as above.

Next, let's establish whether a version of the SoA that we are likely to write here actually has any benefit over the SoE as per the above code as far as execution speed goes. The code below follows the model of the SoE above and optimizes the SoA pseudo-code from the Wikipedia article: it tunes the ranges of the 'x' and 'y' variables using separate loops for the individual quadratic solutions as suggested in that article, solves the quadratic equations (and the square-free eliminations) only for odd solutions, combines the "3*x^2" quadratic to solve for both the positive and negative second terms in the same pass, and eliminates the computationally expensive modulo operations. The resulting code is over three times faster than a naive implementation of the pseudo-code as posted there and as used in the question here. The code for the bounded SoA is as follows:

 static IEnumerable<uint> primesSoA(uint top_number)
 {
     if (top_number < 2u) yield break;
     yield return 2u;
     if (top_number < 3u) yield break;
     yield return 3u;
     if (top_number < 5u) yield break;
     var BFLMT = (top_number - 3u) / 2u;
     var buf = new BitArray((int)BFLMT + 1, false);
     var SQRT = (uint)(Math.Sqrt((double)top_number));
     var SQRTLMT = (SQRT - 3u) / 2u;
     for (uint x = 1u, s = 1u, mdx12 = 5u, dmdx12 = 0u; s <= BFLMT; ++x, s += ((x << 1) - 1u) << 1)
     {
         for (uint y = 1u, n = s, md12 = mdx12, dmd12 = 8u; n <= BFLMT; y += 2, n += (y - 1u) << 1)
         {
             if ((md12 == 1) || (md12 == 5)) buf[(int)n] = buf[(int)n] ^ true;
             md12 += dmd12; if (md12 >= 12) md12 -= 12;
             dmd12 += 8u; if (dmd12 >= 12u) dmd12 -= 12u;
         }
         mdx12 += dmdx12; if (mdx12 >= 12u) mdx12 -= 12u;
         dmdx12 += 8u; if (dmdx12 >= 12u) dmdx12 -= 12u;
     }
     for (uint x = 1u, s = 0u, mdx12 = 3u, dmdx12 = 8u; s <= BFLMT; ++x, s += x << 1)
     {
         int y = 1 - (int)x, n = (int)s, md12 = (int)mdx12, dmd12 = ((-y - 1) << 2) % 12;
         for (; (y < 0) && (uint)n <= BFLMT; y += 2, n += (-y + 1) << 1)
         {
             if (md12 == 11) buf[(int)n] = buf[(int)n] ^ true;
             md12 += dmd12; if (md12 >= 12) md12 -= 12;
             dmd12 += 4; if (dmd12 >= 12) dmd12 -= 12;
         }
         if (y < 1) { y = 2; n += 2; md12 += 4; dmd12 = 0; }
         else { n += 1; md12 += 2; dmd12 = 8; }
         if (md12 >= 12) md12 -= 12;
         for (; (uint)n <= BFLMT; y += 2, n += (y - 1) << 1)
         {
             if (md12 == 7) buf[(int)n] = buf[(int)n] ^ true;
             md12 += dmd12; if (md12 >= 12) md12 -= 12;
             dmd12 += 8; if (dmd12 >= 12) dmd12 -= 12;
         }
         mdx12 += dmdx12; if (mdx12 >= 12) mdx12 -= 12;
         dmdx12 += 4; if (dmdx12 >= 12) dmdx12 -= 12;
     }
     for (var i = 0u; i <= BFLMT; ++i)
         if (buf[(int)i])
         {
             var p = 3u + i + i;
             if (i <= SQRTLMT)
             {
                 var sqr = p * p;
                 for (var j = (sqr - 3ul) / 2ul; j <= BFLMT; j += sqr)
                     buf[(int)j] = false;
             }
             yield return p;
         }
 }

Due to the not fully optimized number of operations, this is still almost twice as slow as the wheel-factorized SoE algorithm. Further optimizations can be made to the SoA code, such as using modulo-60 operations as for the original (non-pseudo-code) algorithm and using bit packing to only deal with multiples excluding 3 and 5, to get the code fairly close to the performance of the SoE or even exceed it slightly, but we get closer and closer to the complexity of the Bernstein implementation in order to achieve that performance. My point is "Why the SoA, when one has to work very hard to get about the same performance as the SoE?".

Now for the unbounded primes sequence, the absolute easiest way to do this is to define a const top_number of Uint32.MaxValue and eliminate the argument from the primesSoE or primesSoA methods. This is somewhat inefficient in that the culling is still done over the full number range no matter the actual prime value being processed, which makes the determination of small ranges of primes take much longer than it should. There are also other reasons, besides this and the extreme memory use, to go to a segmented version of the prime sieve: first, algorithms that use data primarily within the CPU L1 or L2 data caches process faster because of more efficient memory access, and second, because segmentation allows the job to be easily split into pages that can be culled in the background using multi-core processors, for a speedup that can be proportional to the number of cores used.
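The "easiest way" scheme from the question's step list can be sketched as follows (a hedged illustration with hypothetical names, not the optimized code further below): re-run a bounded sieve with a doubling limit and yield only the primes above the previous limit, so the caller never supplies a bound, at the cost of re-culling the low ranges on every expansion — exactly the inefficiency described above:

```csharp
using System;
using System.Linq;
using System.Collections.Generic;

public static class NaiveIncremental
{
    // Any bounded sieve works here; trial division keeps the sketch short.
    static IEnumerable<uint> BoundedPrimes(uint top)
    {
        for (uint n = 2; n <= top; ++n)
        {
            bool prime = true;
            for (uint d = 2; (ulong)d * d <= n; ++d)
                if (n % d == 0) { prime = false; break; }
            if (prime) yield return n;
        }
    }

    public static IEnumerable<uint> Primes()
    {
        uint oldLimit = 0, limit = 16;
        while (true)
        {
            // yield only the newfound primes between the old and new limits...
            foreach (var p in BoundedPrimes(limit))
                if (p > oldLimit) yield return p;
            oldLimit = limit;
            limit <<= 1; // double the limit and go again
        }
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Primes().Take(12))); // 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37
    }
}
```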

For efficiency, we should choose an array size of about the CPU L1 or L2 cache size, which is at least 16 Kilobytes for most modern CPUs, in order to avoid cache thrashing as we cull the composites of primes from the array; this means the BitArray can have a size eight times that large (8 bits per byte), or 128 Kilobits. Since we only need to sieve the odd primes, this represents a range of over a quarter of a million numbers, meaning that a segmented version will use only eight segments to sieve to the two million required by Euler Problem 10. One could save the results from the last segment and continue on from that point, but that would preclude adapting this code to the multi-core case, where the culling is relegated to the background on multiple threads to take full advantage of multi-core processors. The C# code for a (single-threaded) unbounded SoE is as follows:

 static IEnumerable primesSoE() { yield return 2u; yield return 3u; yield return 5u; const uint L1CACHEPOW = 14u + 3u, L1CACHESZ = (1u << (int)L1CACHEPOW); //for 16K in bits... const uint BUFSZ = L1CACHESZ / 15u * 15u; //an even number of wheel rotations var buf = new BitArray((int)BUFSZ); const uint MAXNDX = (uint.MaxValue - 7u) / 2u; //need maximum for number range var SQRTNDX = ((uint)Math.Sqrt(uint.MaxValue) - 7u) / 2u; byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index byte[] WHLNDX = { 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7, 7, 7, //get index from position 0, 0, 1, 2, 2, 3, 4, 4, 5, 5, 5, 6, 7 }; //allow for overflow byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, //allow for overflow... 15, 15, 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27 }; uint BPLMT = (ushort.MaxValue - 7u) / 2u; var bpbuf = new BitArray((int)BPLMT + 1, true); for (var i = 0; i <= 124; ++i) if (bpbuf[i]) { var p = 7 + i + i; //initialize baseprimes array for (var j = (p * p - 7) / 2; j <= BPLMT; j += p) bpbuf[j] = false; } var pa = new uint[3]; for (uint i = 0u, w = 0, si = 0; i <= MAXNDX; i += WHLPTRN[w], si += WHLPTRN[w], si = (si >= BUFSZ) ? 0u : si, w = (w < 7u) ? ++w : 0u) { if (si == 0) { buf.SetAll(true); for (uint j = 0u, bw = 0u; j <= BPLMT; j += WHLPTRN[bw], bw = (bw < 7u) ? ++bw : 0u) if (bpbuf[(int)j]) { var p = 7u+j+j; var pX2=p+p; var k = p * (j + 3u) + j; if (k >= i + BUFSZ) break; pa[0] = p; pa[1] = pX2; pa[2] = pX2 + p; var sw = bw; if (k < i) { k = (i - k) % (15u * p); if (k != 0) { var os = WHLPOS[bw]; sw = os + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - os) * p - k; sw = WHLNDX[sw]; } } else k -= i; for (; k 
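As a quick sanity check on the sizing argument above (hypothetical helper names; assuming the stated 16 Kilobyte buffer): 16 KB is 131,072 bits, one bit per odd number covers 262,144 numbers per segment, so eight segments suffice to reach two million:

```csharp
using System;

public static class SegmentSizing
{
    public const int CacheBytes = 16 * 1024;                 // 16 Kilobyte L1 cache buffer
    public const int NumbersPerSegment = CacheBytes * 8 * 2; // one bit per odd number => 2 numbers per bit

    // Number of segments needed to sieve all numbers up to 'limit' (rounded up).
    public static int SegmentsToCover(int limit) =>
        (limit + NumbersPerSegment - 1) / NumbersPerSegment;

    public static void Main()
    {
        Console.WriteLine(NumbersPerSegment);        // 262144
        Console.WriteLine(SegmentsToCover(2000000)); // 8 segments to reach two million
    }
}
```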

The above code takes about 16 milliseconds to sieve the primes up to two million and about 30 seconds for the complete 32-bit number range. This code is faster than the similar unsegmented version using the factorial wheel for large number ranges because, even though it is computationally more complex, a great deal of time is saved by not thrashing the CPU caches. Much of the complexity lies in the computation of the new starting indexes per base prime per segment, which could have been reduced by saving the state of each per-base-prime index per segment; however, the above code can easily be converted so as to run the culling processes on separate background threads for a further speedup of four times on a machine with four cores, and even more on a machine with eight cores. Not using the BitArray class and instead addressing the individual bit positions with bit masks would provide a further speedup of about a factor of two, and further increases of the factorial wheel would reduce the time to about two-thirds again. Better packing of the bit array, with the indexes of values eliminated by the wheel factorization removed, would slightly increase efficiency for large ranges but would also add somewhat to the complexity of the bit manipulations. However, all of these SoE optimizations would still not approach the extreme complexity of the Bernstein SoA implementation, yet would run at about the same speed.
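The per-segment starting-index computation mentioned above is the crux of any paged sieve. Here is a hedged sketch (hypothetical names; using the plain odds-only indexing of the first bounded SoE above, where index i represents the number 3 + 2*i, without wheel factorization) of finding the first cull index for a base prime p at or beyond a segment's starting index:

```csharp
using System;

public static class SegmentStart
{
    // Odds-only indexing: index i represents the odd number 3 + 2*i.
    // Culling for base prime p starts at the index of p*p and steps by p.
    // Returns the first cull index >= lowIndex (the segment's starting index).
    public static ulong StartIndex(ulong p, ulong lowIndex)
    {
        ulong j = (p * p - 3UL) / 2UL;   // index of p*p, the first composite to cull
        if (j >= lowIndex) return j;
        ulong delta = lowIndex - j;
        ulong rem = delta % p;           // advance j by whole multiples of p
        return rem == 0 ? lowIndex : lowIndex + (p - rem);
    }

    public static void Main()
    {
        // p = 5 culls the indexes of 25, 35, 45, ... which are 11, 16, 21, ...
        Console.WriteLine(StartIndex(5, 0));  // 11 (the number 25)
        Console.WriteLine(StartIndex(5, 12)); // 16 (the number 35)
        Console.WriteLine(StartIndex(5, 16)); // 16
    }
}
```

The wheel-factorized code above does the same thing but must additionally round the start position up to the next wheel "hit", which is what the WHLRNDUP table is for.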

To convert the above code from SoE to SoA, we only need to change to the SoA culling code from the bounded case, but modified so that the starting addresses are recomputed for every page segment, somewhat as the starting address is computed for the SoE case; however, the operation is even more complex there, because the quadratics advance by the squares of the numbers rather than by simple multiples of primes. I'll leave the necessary adaptations to the student, although I don't really see the point given that a reasonably optimized SoA is not faster than the current version of the SoE and is considerably more complex ;)

EDIT_ADD:

Note: the code below has been corrected; while the originally posted code was correct as to the provided primes sequence, it was half again slower than it needed to be, as it culled for all numbers below the square root of the range rather than only for the found base primes up to the square root of the range.

The most effective and simplest optimization is relegating the culling operations per segment page to background threads so that, given enough CPU cores, the main limit in enumerating the primes is the enumeration loop itself; in that case, all the primes in the 32-bit number range can be enumerated in about ten seconds on the above reference machine (in C#) without other optimizations. All other optimizations, including writing the IEnumerable interface implementations rather than using the built-in iterators, reduce this to about six seconds for all 203,280,221 primes in the 32-bit number range (1.65 seconds to one billion), with most of that time again spent just enumerating the results. The following code incorporates many of those modifications, including using background tasks from the Dotnet Framework 4 ThreadPool as used by Task, using a state machine implemented as a look-up table to implement further bit packing to make the primes enumeration quicker, and rewriting the algorithm as an enumerable class using "roll-your-own" iterators for increased efficiency:

 class fastprimesSoE : IEnumerable, IEnumerable { struct procspc { public Task tsk; public uint[] buf; } struct wst { public byte msk; public byte mlt; public byte xtr; public byte nxt; } static readonly uint NUMPROCS = (uint)Environment.ProcessorCount + 1u; const uint CHNKSZ = 1u; const uint L1CACHEPOW = 14u, L1CACHESZ = (1u << (int)L1CACHEPOW), PGSZ = L1CACHESZ >> 2; //for 16K in bytes... const uint BUFSZ = CHNKSZ * PGSZ; //number of uints even number of caches in chunk const uint BUFSZBTS = 15u * BUFSZ << 2; //even in wheel rotations and uints (and chunks) static readonly byte[] WHLPTRN = { 2, 1, 2, 1, 2, 3, 1, 3 }; //the 2,3,5 factorial wheel, (sum) 15 elements long static readonly byte[] WHLPOS = { 0, 2, 3, 5, 6, 8, 11, 12 }; //get wheel position from index static readonly byte[] WHLNDX = { 0, 1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 0, 0, 0 }; //get index from position static readonly byte[] WHLRNDUP = { 0, 2, 2, 3, 5, 5, 6, 8, 8, 11, 11, 11, 12, 15, 15, 15, //allow for overflow... 17, 17, 18, 20, 20, 21, 23, 23, 26, 26, 26, 27, 30, 30, 30 }; //round multiplier up const uint BPLMT = (ushort.MaxValue - 7u) / 2u; const uint BPSZ = BPLMT / 60u + 1u; static readonly uint[] bpbuf = new uint[BPSZ]; static readonly wst[] WHLST = new wst[64]; static void cullpg(uint i, uint[] b, int strt, int cnt) { Array.Clear(b, strt, cnt); for (uint j = 0u, wp = 0, bw = 0x31321212u, bi = 0u, v = 0xc0881000u, bm = 1u; j <= BPLMT; j += bw & 0xF, wp = wp < 12 ? wp += bw & 0xF : 0, bw = (bw > 3u) ? 
bw >>= 4 : 0x31321212u) { var p = 7u + j + j; var k = p * (j + 3u) + j; if (k >= (i + (uint)cnt * 60u)) break; if ((v & bm) == 0u) { if (k < i) { k = (i - k) % (15u * p); if (k != 0) { var sw = wp + ((k + p - 1u) / p); sw = WHLRNDUP[sw]; k = (sw - wp) * p - k; } } else k -= i; var pd = p / 15; for (uint l = k / 15u + (uint)strt * 4u, lw = ((uint)WHLNDX[wp] << 3) + WHLNDX[k % 15u]; l < (uint)(strt + cnt) * 4u; ) { var st = WHLST[lw]; b[l >> 2] |= (uint)st.msk << (int)((l & 3) << 3); l += st.mlt * pd + st.xtr; lw = st.nxt; } } if ((bm <<= 1) == 0u) { v = bpbuf[++bi]; bm = 1u; } } } static fastprimesSoE() { for (var x = 0; x < 8; ++x) { var p = 7 + 2 * WHLPOS[x]; var pr = p % 15; for (int y = 0, i = ((p * p - 7) / 2); y < 8; ++y) { var m = WHLPTRN[(x + y) % 8]; i %= 15; var n = WHLNDX[i]; i += m * pr; WHLST[x * 8 + n] = new wst { msk = (byte)(1 << n), mlt = m, xtr = (byte)(i / 15), nxt = (byte)(8 * x + WHLNDX[i % 15]) }; } } cullpg(0u, bpbuf, 0, bpbuf.Length); } //init baseprimes class nmrtr : IEnumerator, IEnumerator, IDisposable { procspc[] ps = new procspc[NUMPROCS]; uint[] buf; Task dlycullpg(uint i, uint[] buf) { return Task.Factory.StartNew(() => { for (var c = 0u; c < CHNKSZ; ++c) cullpg(i + c * PGSZ * 60, buf, (int)(c * PGSZ), (int)PGSZ); }); } public nmrtr() { for (var i = 0u; i < NUMPROCS; ++i) ps[i] = new procspc { buf = new uint[BUFSZ] }; for (var i = 1u; i < NUMPROCS; ++i) { ps[i].tsk = dlycullpg((i - 1u) * BUFSZBTS, ps[i].buf); } buf = ps[0].buf; } public uint Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } } uint _curr; int b = -4; uint i = 0, w = 0; uint v, msk = 0; public bool MoveNext() { if (b < 0) if (b == -1) { _curr = 7; b += (int)BUFSZ; } else { if (b++ == -4) this._curr = 2u; else this._curr = 7u + ((uint)b << 1); return true; } do { i += w & 0xF; if ((w >>= 4) == 0) w = 0x31321212u; if ((this.msk <<= 1) == 0) { if (++b >= BUFSZ) { b = 0; for (var prc = 0; prc < NUMPROCS - 1; ++prc) ps[prc] = 
ps[prc + 1]; ps[NUMPROCS - 1u].buf = buf; var low = i + (NUMPROCS - 1u) * BUFSZBTS; ps[NUMPROCS - 1u].tsk = dlycullpg(i + (NUMPROCS - 1u) * BUFSZBTS, buf); ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } } while ((v & msk) != 0u); if (_curr > (_curr = 7u + i + i)) return false; else return true; } public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); } public void Dispose() { } } public IEnumerator GetEnumerator() { return new nmrtr(); } IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } } 

Note that this code is not faster than the previous version for small ranges of primes, such as up to one or two million, due to the overhead of setting up and initializing the arrays, but is considerably faster for larger ranges. It is about eight times faster than the question's implementation of the Atkin sieve, and from 20 to 50 times faster for larger ranges up to four billion plus. There are constants defined in the code setting the base cache size and the number of pages culled per thread (CHNKSZ) that may slightly tweak the performance. Some slight further optimizations could be tried that might reduce the execution time for large primes by up to a factor of two, such as further bit packing for the 2,3,5,7 wheel rather than just the 2,3,5 wheel, for about a 25% reduction in packing (perhaps cutting the execution time to two-thirds), and pre-culling the page segments by the wheel factorization up to the factor of 17 for a further reduction of about 20%, but these are of the order of what can be done in pure C# code as compared to C code, which can take advantage of unique native-code optimizations.

At any rate, it is hardly worth further optimizations if one intends to use the IEnumerable interface for output as the question desires, as about two-thirds of the time is used just enumerating the found primes and only about one-third of the time is spent culling the composite numbers. A better approach would be to write methods on the class to implement the desired results, such as CountTo, ElementAt, SumTo, et cetera, so as to avoid enumeration entirely.
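A "CountTo" amounts to counting zero bits in the culled packed buffers instead of enumerating them. A minimal sketch of that idea (hypothetical helper names; the full code further below uses a 16-bit-wide CLUT over ushort words, this sketch uses a byte-wide table):

```csharp
using System;

public static class BitCounting
{
    // Counting Look Up Table: number of ZERO bits per byte value
    // (a zero bit marks a prime candidate in the packed buffer).
    static readonly byte[] CLUT = new byte[256];

    static BitCounting()
    {
        for (int i = 0; i < 256; ++i)
        {
            int v = i, ones = 0;
            while (v != 0) { ones += v & 1; v >>= 1; }
            CLUT[i] = (byte)(8 - ones);
        }
    }

    // Count the prime candidates (zero bits) in a culled segment buffer
    // one table look-up per byte instead of one test per bit.
    public static int CountZeroBits(byte[] buf)
    {
        int acc = 0;
        foreach (var b in buf) acc += CLUT[b];
        return acc;
    }

    public static void Main()
    {
        Console.WriteLine(CountZeroBits(new byte[] { 0x00, 0xFF, 0x0F })); // 8 + 0 + 4 = 12
    }
}
```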

But I did do the extra optimizations as an additional answer, which shows additional benefits in spite of the additional complexity, and which further shows why one doesn't want to use the SoA, because a fully optimized SoE is actually better.

The following code does the optimizations I discussed at the bottom of my previous answer and includes the following features:

  1. The usable range has been increased to the 64-bit unsigned number range of 18,446,744,073,709,551,615, with the range overflow checks removed since it is unlikely that anyone would want to run the program for the hundreds of years it would take to process the full range of numbers up to that limit. This comes at very little cost in processing time, as the paging can be done using 32-bit page ranges and only the final prime output needs to be computed as a 64-bit number.
  2. It increases the wheel factorization from the 2,3,5 wheel to a 2,3,5,7 prime-factor wheel, with additional pre-culling of composite numbers by the additional primes 11, 13, and 17, greatly reducing the redundant composite number culling (each composite number is now culled only about 1.5 times on average). Due to the (DotNet related) computational overhead of doing this (which also applied to the 2,3,5 wheel in the previous version), the actual time saving in culling isn't that great, but enumerating the answers is somewhat faster because many of the "simple" composite numbers are skipped in the packed bit representation.
  3. It still uses the Task Parallel Library (TPL) from DotNet 4 for multi-threading from the thread pool on a per-page basis.
  4. It now uses a base primes representation that supports automatically growing the array contained by the class as more base primes are required, as a thread-safe method, rather than the fixed pre-computed base primes array used previously.
  5. The base primes representation has been reduced to one byte per base prime for a further reduction in memory footprint; thus, the total memory footprint other than the code is the array holding this representation of the base primes up to the square root of the current range being processed, plus the packed-bit page buffers, which are currently set at under the L2 cache size of 256 Kilobytes each (smallest page size of 14,586 bytes times the CHNKSZ of 17), one per CPU core plus one extra buffer for the foreground task to process. With this code, about three Megabytes is sufficient to process the prime range up to ten to the fourteenth power. As well as the speed due to allowing efficient multiprocessing, this reduced memory requirement is another advantage of using a paged sieve implementation.

     class UltimatePrimesSoE : IEnumerable { static readonly uint NUMPRCSPCS = (uint)Environment.ProcessorCount + 1; const uint CHNKSZ = 17; const int L1CACHEPOW = 14, L1CACHESZ = (1 << L1CACHEPOW), MXPGSZ = L1CACHESZ / 2; //for buffer ushort[] //the 2,3,57 factorial wheel increment pattern, (sum) 48 elements long, starting at prime 19 position static readonly byte[] WHLPTRN = { 2,3,1,3,2,1,2,3,3,1,3,2,1,3,2,3,4,2,1,2,1,2,4,3, 2,3,1,2,3,1,3,3,2,1,2,3,1,3,2,1,2,1,5,1,5,1,2,1 }; const uint FSTCP = 11; static readonly byte[] WHLPOS; static readonly byte[] WHLNDX; //to look up wheel indices from position index static readonly byte[] WHLRNDUP; //to look up wheel rounded up index positon values, allow for overfolw static readonly uint WCRC = WHLPTRN.Aggregate(0u, (acc, n) => acc + n); static readonly uint WHTS = (uint)WHLPTRN.Length; static readonly uint WPC = WHTS >> 4; static readonly byte[] BWHLPRMS = { 2, 3, 5, 7, 11, 13, 17 }; const uint FSTBP = 19; static readonly uint BWHLWRDS = BWHLPRMS.Aggregate(1u, (acc, p) => acc * p) / 2 / WCRC * WHTS / 16; static readonly uint PGSZ = MXPGSZ / BWHLWRDS * BWHLWRDS; static readonly uint PGRNG = PGSZ * 16 / WHTS * WCRC; static readonly uint BFSZ = CHNKSZ * PGSZ, BFRNG = CHNKSZ * PGRNG; //number of uints even number of caches in chunk static readonly ushort[] MCPY; //a Master Copy page used to hold the lower base primes preculled version of the page struct Wst { public ushort msk; public byte mlt; public byte xtr; public ushort nxt; } static readonly byte[] PRLUT; /*Wheel Index Look Up Table */ static readonly Wst[] WSLUT; //Wheel State Look Up Table static readonly byte[] CLUT; // a Counting Look Up Table for very fast counting of primes static int count(uint bitlim, ushort[] buf) { //very fast counting if (bitlim < BFRNG) { var addr = (bitlim - 1) / WCRC; var bit = WHLNDX[bitlim - addr * WCRC] - 1; addr *= WPC; for (var i = 0; i < 3; ++i) buf[addr++] |= (ushort)((unchecked((ulong)-2) << bit) >> (i << 4)); } var acc = 0; for 
(uint i = 0, w = 0; i < bitlim; i += WCRC) acc += CLUT[buf[w++]] + CLUT[buf[w++]] + CLUT[buf[w++]]; return acc; } static void cull(ulong lwi, ushort[] b) { ulong nlwi = lwi; for (var i = 0u; i < b.Length; nlwi += PGRNG, i += PGSZ) MCPY.CopyTo(b, i); //copy preculled lower base primes. for (uint i = 0, pd = 0; ; ++i) { pd += (uint)baseprms[i] >> 6; var wi = baseprms[i] & 0x3Fu; var wp = (uint)WHLPOS[wi]; var p = pd * WCRC + PRLUT[wi]; var pp = (p - FSTBP) >> 1; var k = (ulong)p * (pp + ((FSTBP - 1) >> 1)) + pp; if (k >= nlwi) break; if (k < lwi) { k = (lwi - k) % (WCRC * p); if (k != 0) { var nwp = wp + (uint)((k + p - 1) / p); k = (WHLRNDUP[nwp] - wp) * p - k; if (nwp >= WCRC) wp = 0; else wp = nwp; } } else k -= lwi; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC]; for (uint wrd = (uint)kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < b.Length; ) { var st = WSLUT[ndx]; b[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } } static Task cullbf(ulong lwi, ushort[] b, Action f) { return Task.Factory.StartNew(() => { cull(lwi, b); f(b); }); } class Bpa { //very efficient auto-resizing thread-safe read-only indexer class to hold the base primes array byte[] sa = new byte[0]; uint lwi = 0, lpd = 0; object lck = new object(); public uint this[uint i] { get { if (i >= this.sa.Length) lock (this.lck) { var lngth = this.sa.Length; while (i >= lngth) { var bf = (ushort[])MCPY.Clone(); if (lngth == 0) { for (uint bi = 0, wi = 0, w = 0, msk = 0x8000, v = 0; w < bf.Length; bi += WHLPTRN[wi++], wi = (wi >= WHTS) ? 
0 : wi) { if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) { var p = FSTBP + (bi + bi); var k = (p * p - FSTBP) >> 1; if (k >= PGRNG) break; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC]; for (uint wrd = kd * WPC + (uint)(kn >> 4), ndx = wi * WHTS + kn; wrd < bf.Length; ) { var st = WSLUT[ndx]; bf[wrd] |= st.msk; wrd += st.mlt * pd + st.xtr; ndx = st.nxt; } } } } else { this.lwi += PGRNG; cull(this.lwi, bf); } var c = count(PGRNG, bf); var na = new byte[lngth + c]; sa.CopyTo(na, 0); for (uint p = FSTBP + (this.lwi << 1), wi = 0, w = 0, msk = 0x8000, v = 0; lngth < na.Length; p += (uint)(WHLPTRN[wi++] << 1), wi = (wi >= WHTS) ? 0 : wi) { if (msk >= 0x8000) { msk = 1; v = bf[w++]; } else msk <<= 1; if ((v & msk) == 0) { var pd = p / WCRC; na[lngth++] = (byte)(((pd - this.lpd) << 6) + wi); this.lpd = pd; } } this.sa = na; } } return this.sa[i]; } } } static readonly Bpa baseprms = new Bpa(); static UltimatePrimesSoE() { WHLPOS = new byte[WHLPTRN.Length + 1]; //to look up wheel position index from wheel index for (byte i = 0, acc = 0; i < WHLPTRN.Length; ++i) { acc += WHLPTRN[i]; WHLPOS[i + 1] = acc; } WHLNDX = new byte[WCRC + 1]; for (byte i = 1; i < WHLPOS.Length; ++i) { for (byte j = (byte)(WHLPOS[i - 1] + 1); j <= WHLPOS[i]; ++j) WHLNDX[j] = i; } WHLRNDUP = new byte[WCRC * 2]; for (byte i = 1; i < WHLRNDUP.Length; ++i) { if (i > WCRC) WHLRNDUP[i] = (byte)(WCRC + WHLPOS[WHLNDX[i - WCRC]]); else WHLRNDUP[i] = WHLPOS[WHLNDX[i]]; } Func nmbts = (v) => { var acc = 0; while (v != 0) { acc += (int)v & 1; v >>= 1; } return acc; }; CLUT = new byte[1 << 16]; for (var i = 0; i < CLUT.Length; ++i) CLUT[i] = (byte)nmbts((ushort)(i ^ -1)); PRLUT = new byte[WHTS]; for (var i = 0; i < PRLUT.Length; ++i) { var t = (uint)(WHLPOS[i] * 2) + FSTBP; if (t >= WCRC) t -= WCRC; if (t >= WCRC) t -= WCRC; PRLUT[i] = (byte)t; } WSLUT = new Wst[WHTS * WHTS]; for (var x = 0u; x < WHTS; ++x) { var p = FSTBP + 2u * WHLPOS[x]; var pr = p % 
WCRC; for (uint y = 0, pos = (p * p - FSTBP) / 2; y < WHTS; ++y) { var m = WHLPTRN[(x + y) % WHTS]; pos %= WCRC; var posn = WHLNDX[pos]; pos += m * pr; var nposd = pos / WCRC; var nposn = WHLNDX[pos - nposd * WCRC]; WSLUT[x * WHTS + posn] = new Wst { msk = (ushort)(1 << (int)(posn & 0xF)), mlt = (byte)(m * WPC), xtr = (byte)(WPC * nposd + (nposn >> 4) - (posn >> 4)), nxt = (ushort)(WHTS * x + nposn) }; } } MCPY = new ushort[PGSZ]; foreach (var lp in BWHLPRMS.SkipWhile(p => p < FSTCP)) { var p = (uint)lp; var k = (p * p - FSTBP) >> 1; var pd = p / WCRC; var kd = k / WCRC; var kn = WHLNDX[k - kd * WCRC]; for (uint w = kd * WPC + (uint)(kn >> 4), ndx = WHLNDX[(2 * WCRC + p - FSTBP) / 2] * WHTS + kn; w < MCPY.Length; ) { var st = WSLUT[ndx]; MCPY[w] |= st.msk; w += st.mlt * pd + st.xtr; ndx = st.nxt; } } } struct PrcsSpc { public Task tsk; public ushort[] buf; } class nmrtr : IEnumerator, IEnumerator, IDisposable { PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; ushort[] buf; public nmrtr() { for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ] }; for (var s = 1u; s < NUMPRCSPCS; ++s) { ps[s].tsk = cullbf((s - 1u) * BFRNG, ps[s].buf, (bfr) => { }); } buf = ps[0].buf; } ulong _curr, i = (ulong)-WHLPTRN[WHTS - 1]; int b = -BWHLPRMS.Length - 1; uint wi = WHTS - 1; ushort v, msk = 0; public ulong Current { get { return this._curr; } } object IEnumerator.Current { get { return this._curr; } } public bool MoveNext() { if (b < 0) { if (b == -1) b += buf.Length; //no yield!!! 
so automatically comes around again else { this._curr = (ulong)BWHLPRMS[BWHLPRMS.Length + (++b)]; return true; } } do { i += WHLPTRN[wi++]; if (wi >= WHTS) wi = 0; if ((this.msk <<= 1) == 0) { if (++b >= BFSZ) { b = 0; for (var prc = 0; prc < NUMPRCSPCS - 1; ++prc) ps[prc] = ps[prc + 1]; ps[NUMPRCSPCS - 1u].buf = buf; ps[NUMPRCSPCS - 1u].tsk = cullbf(i + (NUMPRCSPCS - 1u) * BFRNG, buf, (bfr) => { }); ps[0].tsk.Wait(); buf = ps[0].buf; } v = buf[b]; this.msk = 1; } } while ((v & msk) != 0u); _curr = FSTBP + i + i; return true; } public void Reset() { throw new Exception("Primes enumeration reset not implemented!!!"); } public void Dispose() { } } public IEnumerator GetEnumerator() { return new nmrtr(); } IEnumerator IEnumerable.GetEnumerator() { return new nmrtr(); } static void IterateTo(ulong top_number, Action actn) { PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s] = new PrcsSpc { buf = new ushort[BFSZ], tsk = Task.Factory.StartNew(() => { }) }; var topndx = (top_number - FSTBP) >> 1; for (ulong ndx = 0; ndx <= topndx; ) { ps[0].tsk.Wait(); var buf = ps[0].buf; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1]; var lowi = ndx; var nxtndx = ndx + BFRNG; var lim = topndx < nxtndx ? 
(uint)(topndx - ndx + 1) : BFRNG; ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbf(ndx, buf, (b) => actn(lowi, lim, b)) }; ndx = nxtndx; } for (var s = 0u; s < NUMPRCSPCS; ++s) ps[s].tsk.Wait(); } public static long CountTo(ulong top_number) { if (top_number < FSTBP) return BWHLPRMS.TakeWhile(p => p <= top_number).Count(); var cnt = (long)BWHLPRMS.Length; IterateTo(top_number, (lowi, lim, b) => { Interlocked.Add(ref cnt, count(lim, b)); }); return cnt; } public static ulong SumTo(uint top_number) { if (top_number < FSTBP) return (ulong)BWHLPRMS.TakeWhile(p => p <= top_number).Aggregate(0u, (acc, p) => acc += p); var sum = (long)BWHLPRMS.Aggregate(0u, (acc, p) => acc += p); Func sumbf = (lowi, bitlim, buf) => { var acc = 0L; for (uint i = 0, wi = 0, msk = 0x8000, w = 0, v = 0; i < bitlim; i += WHLPTRN[wi++], wi = wi >= WHTS ? 0 : wi) { if (msk >= 0x8000) { msk = 1; v = buf[w++]; } else msk <<= 1; if ((v & msk) == 0) acc += (long)(FSTBP + ((lowi + i) << 1)); } return acc; }; IterateTo(top_number, (pos, lim, b) => { Interlocked.Add(ref sum, sumbf(pos, lim, b)); }); return (ulong)sum; } static void IterateUntil(Func prdct) { PrcsSpc[] ps = new PrcsSpc[NUMPRCSPCS]; for (var s = 0u; s < NUMPRCSPCS; ++s) { var buf = new ushort[BFSZ]; ps[s] = new PrcsSpc { buf = buf, tsk = cullbf(s * BFRNG, buf, (bfr) => { }) }; } for (var ndx = 0UL; ; ndx += BFRNG) { ps[0].tsk.Wait(); var buf = ps[0].buf; var lowi = ndx; if (prdct(lowi, buf)) break; for (var s = 0u; s < NUMPRCSPCS - 1; ++s) ps[s] = ps[s + 1]; ps[NUMPRCSPCS - 1] = new PrcsSpc { buf = buf, tsk = cullbf(ndx + NUMPRCSPCS * BFRNG, buf, (bfr) => { }) }; } } public static ulong ElementAt(long n) { if (n < BWHLPRMS.Length) return (ulong)BWHLPRMS.ElementAt((int)n); long cnt = BWHLPRMS.Length; var ndx = 0UL; var cycl = 0u; var bit = 0u; IterateUntil((lwi, bfr) => { var c = count(BFRNG, bfr); if ((cnt += c) < n) return false; ndx = lwi; cnt -= c; c = 0; do { var w = cycl++ * WPC; c = CLUT[bfr[w++]] + CLUT[bfr[w++]] + 
CLUT[bfr[w]]; cnt += c; } while (cnt < n); cnt -= c; var y = (--cycl) * WPC; ulong v = ((ulong)bfr[y + 2] << 32) + ((ulong)bfr[y + 1] << 16) + bfr[y]; do { if ((v & (1UL << ((int)bit++))) == 0) ++cnt; } while (cnt <= n); --bit; return true; }); return FSTBP + ((ndx + cycl * WCRC + WHLPOS[bit]) << 1); } } 

The above code takes about 59 milliseconds to find the primes up to two million (slightly slower than some of the other simpler codes due to initialization overhead), but calculates the primes to one billion and over the full number range in 1.55 and 5.95 seconds, respectively. This isn't much faster than the last version due to the DotNet overhead of an extra array bounds check in the enumeration of found primes as compared to the time expended in culling composite numbers, which is less than a third of the time spent enumerating, so the saving in culling composites is cancelled out by the extra time (one extra array bounds check per prime candidate) in the enumeration. However, for many tasks involving primes, one does not need to enumerate all primes but can just compute the answers without enumeration.

For the above reasons, this class provides the example static methods "CountTo", "SumTo", and "ElementAt" to count or sum the primes to a given upper limit, or to output the zero-based nth prime, respectively. The "CountTo" method will produce the number of primes to one billion and in the 32-bit number range in about 0.32 and 1.29 seconds, respectively; the "ElementAt" method will produce the last element in these ranges in about 0.32 and 1.25 seconds, respectively; and the "SumTo" method produces the sum of all the primes in these ranges in about 0.49 and 1.98 seconds, respectively. This program calculates the sum of all the prime numbers to four billion plus in less time than many naive implementations can sum all the prime numbers to two million as in Euler Problem 10, for over 2000 times the practical range!
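Those two-million figures for counting and summing are easy to cross-check with the simple bounded odds-only SoE from the top of this answer (a self-contained sketch with illustrative names, agreeing with the well-known Euler Problem 10 answer):

```csharp
using System;
using System.Collections;

public static class CrossCheck
{
    // Simple bounded odds-only SoE; index i represents the odd number 3 + 2*i.
    public static (long count, ulong sum) CountAndSum(uint top)
    {
        var bflmt = (top - 3u) / 2u;
        var sqrtlmt = ((uint)Math.Sqrt(top) - 3u) / 2u;
        var buf = new BitArray((int)bflmt + 1, true);
        long count = 1; ulong sum = 2;   // account for the prime 2
        for (uint i = 0; i <= bflmt; ++i)
            if (buf[(int)i])
            {
                uint p = 3u + i + i;
                if (i <= sqrtlmt)
                    for (uint j = (p * p - 3u) / 2u; j <= bflmt; j += p)
                        buf[(int)j] = false;
                ++count; sum += p;
            }
        return (count, sum);
    }

    public static void Main()
    {
        var (count, sum) = CountAndSum(2000000);
        Console.WriteLine(count); // 148933 primes below two million
        Console.WriteLine(sum);   // 142913828922, the Euler Problem 10 answer
    }
}
```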

This code is only about four times slower than very highly optimized C code used by primesieve , and the reasons it is slower are mostly due to DotNet, as follows (discussing the case of a 256 Kilobyte buffer, which is the size of the L2 cache):

  1. Most of the execution time is spent in the main composite culling loop, which is the last "for loop" in the private static "cull" method and only contains four statements per loop plus the range check.
  2. In DotNet, this compiles to take about 21.83 CPU clock cycles per loop, including about 5 clock cycles for the two array bounds checks per loop.
  3. The very efficient C compiler converts this loop into only about 8.33 clock cycles for an advantage of about 2.67 times.
  4. Primesieve also uses extreme manual "loop unrolling" optimizations to reduce the average time to perform the work per loop to about 4.17 clock cycles per composite cull loop, for an additional gain of two times and a total gain of about 5.3 times.
  5. Now, the highly optimized C code doesn't Hyper Thread (HT) as well as the less efficient Just In Time (JIT) compiler produced code, and as well, the OpenMP multi-threading used by primesieve doesn't appear to be as well adapted to this problem as the use of the Thread Pool threads here, so the final multi-threaded gain is only about four times.
  6. One might consider the use of "unsafe" pointers to eliminate the array bounds checks, and it was tried, but the JIT compiler does not optimize pointers as well as normal array based code, so the gain of not having array bound checks is cancelled by the less efficient code (every pointer access (re)loads the pointer address from memory rather than using a register already pointing to that address as in the optimized array case).
  7. Primesieve is even faster when using smaller buffer sizes as in the size of the available L1 cache (16 Kilobytes when multi-threading for the i3/i5/i7 CPU's) as its more efficient code has more of an advantage reducing the average memory access time to one clock cycle from about four clock cycles, which advantage makes much less of a difference to the DotNet code, which gains more from less processing per reduced number of pages. Thus, primesieve is about five times faster when each use their most efficient buffer size.

This DotNet code will count (CountTo) the number of primes to ten to the power thirteen (ten trillion) in about an hour and a half (tested) and the number of primes to a hundred trillion (ten to the fourteenth) in just over a half day (estimated) as compared to twenty minutes and under four hours for primesieve, respectively. This is interesting historically as until 1985 only the count of primes in the range to ten to the thirteenth was known since it would take too much execution time on the (expensive) supercomputers of that day to find the range ten times larger; now we can easily compute the number of primes in those ranges on a common desktop computer (in this case, an Intel i7-2700K - 3.5 GHz)!

Using this code, it is easy to understand why Professor Atkin and Bernstein thought that the SoA was faster than the SoE - a myth that persists to this day , with the reasoning as follows:

  1. It is easy to have any SoA implementation count the number of state toggles and square free composite number culls (which last can be optimized using the same 2,3,5 wheel optimization as used inherently by the SoA algorithm) to determine that the total number of both of these operations is about 1.4 billion for the 32-bit number range.
  2. Bernstein's "equivalent" SoE implementation to his SoA implementation (neither of which are as optimized as this code), which uses the same 2,3,5 wheel optimization as the SoA, will have a total of about 1.82 billion cull operations with the same cull loop computational complexity.
  3. Therefore, Bernstein's results of about 30% better performance as compared to his implementation of the SoE is about right, just based on the number of equivalent operations. However, his implementation of the SoE did not take wheel factorization "to the max" since the SoA does not much respond to further degrees of wheel factorization as the 2,3,5 wheel is "baked in" to the basic algorithm.
  4. The wheel factorization optimizations used here reduce the number of composite cull operations to about 1.2 billion for the 32-bit number range; therefore, this algorithm using this degree of wheel factorization will run about 16.7% faster than an equivalent version of SoA, since the culling loop(s) can be implemented about the same for each algorithm.
  5. The SoE with this level of optimizations is easier to write than an equivalent SoA, as there need only be one state-table look-up array to cull across the base primes instead of additional state look-up arrays for each of the four quadratic equation solution cases that produce valid state toggles.
  6. When written using equivalent implementations as this code, the code will respond equivalently to C compiler optimizations for SoA as for the SoE used in primesieve.
  7. The SoA implementation will also respond to the extreme manual "loop unrolling" optimization used by primesieve just as effectively as for the SoE implementation.
  8. Therefore, if I went through all the work to implement the SoA algorithm using the techniques as for the above SoE code, the resulting SoA would only be a slight amount slower when the output primes were enumerated, but would be about 16.7% slower when using the static direct methods.
  9. Memory footprint of the two algorithms is no different as both require the representation of the base primes and the same number of segment page buffers.

EDIT_ADD: Interestingly, this code runs 30% faster in x86 32-bit mode than in x64 64-bit mode, likely due to avoiding the slight extra overhead of extending the uint32 numbers to ulongs. All of the above timings are for 64-bit mode. END_EDIT_ADD

In summary: The segment paged Sieve of Atkin is actually slower than a maximally optimized segment paged Sieve of Eratosthenes with no saving in memory requirements!!!

I say again: "Why use the Sieve of Atkin?"

This is a more explicit answer to the question(s) raised, as follows:

Is there a way (the incremental Sieve of Atkin – GBG) can be done?

Of course the Sieve of Atkin (SoA) can be implemented as a segmented algorithm, and in fact does not require the use of segmented arrays as suggested at all; rather, it can be done using just raw sequences, as I have done functionally using F#, although this is quite a bit slower than the extra efficiency permitted by using mutable arrays for the culling.

I am thinking it could go something like this: 1. Start with some trivial limit 2. Find all the primes up to the limit 3. Yield all the newfound primes 4. Increase the limit (by doubling or squaring the old limit or something like that) 5. Goto step 2
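As a concrete (if simplified) sketch of those steps, the following hypothetical class grows the limit by doubling and sieves only the new range each time. Per this answer's recommendation it uses a plain Sieve of Eratosthenes for the per-segment work rather than the SoA; the class name `IncrementalPrimes` and all details are illustrative, not from any library:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical sketch: an unbounded prime enumerator following the quoted
// steps, but using a simple Sieve of Eratosthenes over each new range.
// Doubling the limit guarantees the already-found primes cover the square
// root of the new limit, so only the new range ever needs to be culled.
public class IncrementalPrimes : IEnumerable<ulong>
{
    public IEnumerator<ulong> GetEnumerator()
    {
        var primes = new List<ulong> { 2, 3, 5, 7 }; // seed primes up to the first limit
        ulong limit = 10;                            // step 1: some trivial limit
        foreach (var p in primes) yield return p;    // step 3 for the seed segment
        while (true)
        {
            ulong newLimit = limit * 2;              // step 4: double the limit
            // step 2: cull only the range (limit, newLimit] using known base primes
            var composite = new bool[newLimit - limit];
            foreach (var p in primes)
            {
                if (p * p > newLimit) break;         // base primes are in ascending order
                ulong start = ((limit / p) + 1) * p; // first multiple of p past the old limit
                if (start < p * p) start = p * p;    // no need to cull below p squared
                for (ulong m = start; m <= newLimit; m += p)
                    composite[m - limit - 1] = true;
            }
            for (ulong n = limit + 1; n <= newLimit; n++)
                if (!composite[n - limit - 1])
                {
                    primes.Add(n);
                    yield return n;                  // step 3: yield the newfound primes
                }
            limit = newLimit;                        // step 5: go to step 2
        }
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

This keeps the enumeration truly unbounded (up to overflow of ulong arithmetic) while never re-culling the low ranges, at the cost of holding all primes found so far in memory; a production version would keep only the base primes up to the square root, as discussed for the page-segmented sieves above.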

The above algorithm can be implemented in at least two different ways: 1) save the state of ‘x’ and ‘y’ for each value of ‘x’ when the sequences “run off” the current segment and start up again with those values for the next segment, or 2) calculate the applicable pair values of ‘x’ and ‘y’ to use for the new segment. Although the first way is simpler, I recommend the second method for two reasons: 1) it doesn’t use memory for all the (many) values of x and y that must be saved and only a representation of the base primes must be saved in memory for the “squares free” culling step, and 2) it opens the door to using multi-threading and assigning independent thread operations for each page segment for large savings in time on a multi-processor computer.

And indeed, a better understanding of ‘x’ and ‘y’ is necessary:

My main problem is that I don’t quite understand what x and y for example is in this algorithm. Like, could I just use the same algorithm kind of but set x and y to oldLimit (initially 1) and run it up to newLimit?

There has been one answer addressing this but perhaps it isn’t explicit enough. It may be easier to think of these quadratic equations as a potentially infinite sequence of sequences where one of ‘x’ or ‘y’ is fixed starting from their lowest values and the other variable produces all of the elements per sequence. For instance, one could consider the odd expression of the quadratic equation “4*x^2 + y^2” as the sequence of sequences starting with 5, 17, 37, 65, …, with each of these sequences having elements as in {5, 13, 29, 53, …}, {17, 25, 41, 65, …}, {37, 45, 61, 85, …}, {65, 73, 89, 113, …}, … Obviously, some of these elements are not prime as they are composites of 3 or 5, and that is why these need to be eliminated either by the modulo test or, alternatively as in the Bernstein implementation, they can be skipped automatically by recognizing patterns in the modulo arithmetic for the generators so that they never actually even appear.
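To make the “sequence of sequences” view concrete, this small illustrative snippet (the class and method names are hypothetical) enumerates the sub-sequence of “4*x^2 + y^2” for a fixed ‘x’ over odd ‘y’, reproducing the element lists above:

```csharp
using System;
using System.Collections.Generic;

public static class QuadSequences
{
    // For the expression 4*x^2 + y^2 with y odd (odd y keeps n odd, as the
    // mod 12 tests require), list the first `count` elements of the
    // sub-sequence obtained by fixing x and advancing y.
    public static List<ulong> Elements(ulong x, int count)
    {
        var result = new List<ulong>();
        for (ulong y = 1; result.Count < count; y += 2)
            result.Add(4 * x * x + y * y);
        return result;
    }

    public static void Main()
    {
        for (ulong x = 1; x <= 4; x++)
            Console.WriteLine("x=" + x + ": " + string.Join(", ", Elements(x, 4)));
        // x=1: 5, 13, 29, 53
        // x=2: 17, 25, 41, 65
        // x=3: 37, 45, 61, 85
        // x=4: 65, 73, 89, 113
    }
}
```

Note how 65 appears in two different sub-sequences (x=2, y=7 and x=4, y=1); it is exactly this parity of the total solution count that the toggling exploits.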

Implementing the first easier way of producing a segmented version of the SoA then just requires saving the state of each of the sequence of sequences, which is basically what is done in the F# incremental implementation (although using a folded tree structure for efficiency), which could easily be adapted to working over an array page range. For the case where the sequence state is computed at the beginning of every segment page, one just needs to compute how many elements would fit into the space up to the number represented by the lowest element in the new segment page for each “active” sequence, where “active” means a sequence whose starting element is lower than the number represented by the start index of the segment page.
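A minimal sketch of the per-page state computation for the second way, assuming the “4*x^2 + y^2” case with odd ‘y’ (the helper name is hypothetical): it finds the first odd ‘y’ whose value lands at or beyond the start of the new segment page, so nothing needs to be saved between pages. It uses a floating-point square root, which would need to be replaced by an integer square root for very large ranges:

```csharp
using System;

public static class SegmentStart
{
    // For the quadratic 4*x^2 + y^2 with odd y, compute the first odd y
    // whose value is >= segStart, i.e. the sequence state at the beginning
    // of a new segment page, recomputed rather than carried over.
    public static ulong FirstOddY(ulong x, ulong segStart)
    {
        ulong baseVal = 4 * x * x;
        if (baseVal + 1 >= segStart) return 1;   // sequence begins inside or past the page
        ulong y = (ulong)Math.Ceiling(Math.Sqrt((double)(segStart - baseVal)));
        if (y % 2 == 0) y++;                     // keep y odd so n stays odd
        return y;
    }
}
```

Computing this once per “active” ‘x’ at the top of each page is what makes the pages independent, and hence trivially farmed out to separate threads.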

As for pseudo-code on how to implement the segmentation of an array-based SoA sieve, I have written something up for a related post, which shows how this can be accomplished.

The point of this is so that I won’t have to set that limit. So that I can for example use Linq and just Take() however many primes I need, not worrying about if the limit is high enough, et cetera.

As stated in another answer, you could accomplish this end goal by just setting the maximum “limit” as a constant in your code, but this would be quite inefficient for small ranges of primes as culling would be taking place over a much larger array than necessary. As already stated, other than for better efficiency and reducing memory use by a huge factor, segmentation also has other advantages in permitting the efficient use of multi-processing. However, use of the Take(), TakeWhile(), Where(), Count(), etcetera, methods won’t provide very good performance for large ranges of primes as their use involves many stacked method calls per element at many clock cycles per call and return. But you would have the option of either using these or more imperative forms of program flow, so this isn’t a real objection.

I can try to explain what x and y does, but I don’t think you can do what you ask without restarting the loops from the beginning. This is pretty much the same for any “sieve”-algorithm.

What the sieve does is basically count how many different quadratic equations (even or odd) have each number as a solution. The specific equation checked for each number is different depending on what n % 12 is.

For example, numbers n which have a mod 12 remainder of 1 or 5 are prime if and only if the number of solutions to 4*x^2 + y^2 = n is odd and the number is square-free. The first loop simply loops through all possible values of x and y that could satisfy these different equations. By flipping isPrime[n] each time we find a solution for that n, we can keep track of whether the number of solutions is odd or even.

The thing is that we count this for all possible n at the same time, which makes this much more efficient than checking one at a time. Doing it for just some n would take more time (because you would have to make sure that n >= lower_limit in the first loop) and complicate the second loop, since that one requires knowledge of all primes smaller than sqrt(limit).

The second loop checks that the number is square-free (has no factor which is a square of a prime number).
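A brute-force illustration of that solution counting for individual n (not how the sieve itself works, since the real first loop shares the work across all n at once, and names here are hypothetical), showing why the square-free pass of the second loop is still needed:

```csharp
using System;

public static class SolutionCount
{
    // Count solutions of 4*x^2 + y^2 == n over positive x, y by brute force.
    public static int Count(ulong n)
    {
        int count = 0;
        for (ulong x = 1; 4 * x * x < n; x++)
            for (ulong y = 1; 4 * x * x + y * y <= n; y++)
                if (4 * x * x + y * y == n) count++;
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(SolutionCount.Count(13)); // 1: odd, and 13 is square-free, so prime
        Console.WriteLine(SolutionCount.Count(25)); // 1: odd, but 25 = 5*5 is not square-free,
                                                    //    so the second loop must cull it
        Console.WriteLine(SolutionCount.Count(65)); // 2: even count, so 65 = 5*13 is never
                                                    //    marked prime in the first place
    }
}
```

Here 25 is the key case: 25 % 12 == 1 and it has an odd solution count, so the toggling alone would mark it prime; only the square-free cull removes it.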

Also, I don’t think your implementation of the sieve of Atkin is necessarily faster than a straight-forward sieve of Eratosthenes would be. However, the fastest possible most optimized sieve of Atkin would beat the fastest possible most optimized sieve of Eratosthenes.