Reading a very large single-line txt file and splitting it

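I have the following problem: I have a file that is nearly 500 MB in size. Its text is all on one line. The text is separated by a virtual line ending called ROW_DEL, which appears in the text like this: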

this is a line ROW_DEL and this is a line

Now I need to do the following: I want to split this file into its lines, so that I get a file like this:

 this is a line
 and this is a line

The problem is that even opening it in the Windows text editor fails, because the file is so large.

Is it possible to split this file the way I described in C#, Java or Python? What would be the best solution that does not overload my CPU?

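Actually, 500 MB of text is not that big; it's just that Notepad is terrible. You probably don't have sed available since you're on Windows, but at least try the naive solution in Python. I think it will work fine: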

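    # Text mode already translates '\n' to the platform line ending on write,
    # so write '\n' here; writing os.linesep would produce '\r\r\n' on Windows.
    with open('infile.txt') as f_in, open('outfile.txt', 'w') as f_out:
        f_out.write(f_in.read().replace('ROW_DEL ', '\n'))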

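Read the file in chunks, for example with StreamReader.ReadBlock in C#. You can set the maximum number of characters to read there.

For each chunk read, you can replace ROW_DEL with \r\n and append the result to a new file.

Just remember to advance the current index by the number of characters you have just read.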
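The chunked approach above could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name `split_rows`, its parameters, and the UTF-8 encoding are made up for the example and do not come from the answer. It also holds back the last `len(sep) - 1` characters of each chunk so that a ROW_DEL cut in half by a chunk boundary is still found, which the plain description above does not cover.

```python
def split_rows(src_path, dst_path, sep="ROW_DEL", newline="\n",
               chunk_chars=1 << 18, encoding="utf-8"):
    """Copy src_path to dst_path, replacing every sep with newline.

    Hypothetical helper: the names and the encoding are assumptions.
    """
    keep = len(sep) - 1   # longest piece of sep that can dangle at a chunk end
    tail = ""             # characters held back from the previous chunk
    # newline="" disables newline translation, so newline is written verbatim.
    with open(src_path, "r", encoding=encoding, newline="") as f_in, \
         open(dst_path, "w", encoding=encoding, newline="") as f_out:
        while True:
            chunk = f_in.read(chunk_chars)
            if not chunk:
                break
            data = (tail + chunk).replace(sep, newline)
            # Everything except the last `keep` characters is safe to write:
            # a separator starting earlier would already have been replaced.
            cut = max(len(data) - keep, 0)
            f_out.write(data[:cut])
            tail = data[cut:]
        f_out.write(tail)  # flush whatever is still held back


# Possible usage (file names are placeholders):
# split_rows("infile.txt", "outfile.txt", newline="\r\n")
```

Memory use stays around one chunk regardless of file size, at the cost of writing a second file.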

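Here is my solution.
The principle is simple (ŁukaszW.pl gave it), but the coding is not so easy if you want to take care of the special cases (which ŁukaszW.pl did not).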

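The special case is when the separator ROW_DEL is split across two read chunks (as I4V pointed out). A subtler case is when there are two consecutive ROW_DELs and the second one is split across two read chunks.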
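To make the special case concrete, here is a small sketch of what goes wrong when each chunk is converted on its own. The function `naive_chunked_replace` and the sample string are invented for this illustration; the function is deliberately flawed.

```python
SEP = "ROW_DEL"


def naive_chunked_replace(text, chunk_len, sep=SEP, newline="\n"):
    """Replace sep chunk by chunk with no carry-over (deliberately flawed)."""
    chunks = (text[i:i + chunk_len] for i in range(0, len(text), chunk_len))
    return "".join(chunk.replace(sep, newline) for chunk in chunks)


text = "aaROW_DELROW_DELbb"                 # two consecutive separators
expected = text.replace(SEP, "\n")
broken = naive_chunked_replace(text, 12)    # chunks: 'aaROW_DELROW' and '_DELbb'

print(repr(expected))   # 'aa\n\nbb'
print(repr(broken))     # 'aa\nROW_DELbb'
```

The second ROW_DEL survives because neither chunk contains it in full.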

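Because ROW_DEL is longer than any possible newline ('\r', '\n', '\r\n'), it can be replaced inside the file by the newline that the operating system uses. That is why I chose to rewrite the file itself.
For this I use mode 'r+', which does not create a new file.
Using binary mode 'b' is also absolutely mandatory.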

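The principle is to read a chunk (in real life its size would be 262144, for example) plus x additional characters, where x is the length of the separator minus 1.
Then check whether the separator is present in the end of the chunk plus those x characters.
Depending on whether it is present, the chunk is shortened or not before ROW_DEL is converted, and the result is written back.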
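As a rough Python 3 illustration of rewriting the file in place, the sketch below opens the file in 'r+b' mode and keeps separate read and write offsets. It is not the answer's code: the name `rewrite_in_place` and its parameters are assumptions, and instead of reading x extra characters and seeking back, it carries the last `len(sep) - 1` bytes over to the next chunk, which is a simpler way to catch a separator that straddles two chunks.

```python
import os


def rewrite_in_place(path, sep=b"ROW_DEL", eol=os.linesep.encode("ascii"),
                     chunk_len=262144):
    """Replace every sep with eol inside the file itself (hypothetical helper)."""
    if not sep or len(eol) > len(sep):
        # In-place rewriting only works if the output never outgrows the input.
        raise ValueError("eol must not be longer than sep")
    keep = len(sep) - 1        # bytes held back in case sep straddles a boundary
    tail = b""
    read_pos = write_pos = 0
    with open(path, "r+b") as f:
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_len)
            if not chunk:
                break
            read_pos += len(chunk)
            data = (tail + chunk).replace(sep, eol)
            cut = max(len(data) - keep, 0)
            tail = data[cut:]
            # write_pos + len(tail) never exceeds read_pos,
            # so unread bytes are never overwritten.
            f.seek(write_pos)
            f.write(data[:cut])
            write_pos += cut
        f.seek(write_pos)
        f.write(tail)
        f.truncate()           # drop the leftover bytes past the new end


# Possible usage (file name is a placeholder):
# rewrite_in_place("infile.txt", eol=b"\r\n")
```

Because the original is overwritten, an interruption midway leaves a partly converted file, so trying it on a copy first seems prudent.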

The bare code is:

    text = ('The hospital roommate of a man infected ROW_DEL'
            'with novel coronavirus (NCoV)ROW_DEL'
            '—a SARS-related virus first identified ROW_DELROW_DEL'
            'last year and already linked to 18 deaths—ROW_DEL'
            'has contracted the illness himself, ROW_DEL'
            'intensifying concerns about the ROW_DEL'
            "virus's ability to spread ROW_DEL"
            'from person to person.')

    with open('eessaa.txt','w') as f:
        f.write(text)

    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print ch.replace('ROW_DEL','ROW_DEL\n')
        print '\nlength of the text : %d chars\n' % len(text)

    #==========================================

    from os.path import getsize
    from os import fsync,linesep

    def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
        if chunk_length [...]

To follow the execution, here is another version of the code that prints messages:

    text = ('The hospital roommate of a man infected ROW_DEL'
            'with novel coronavirus (NCoV)ROW_DEL'
            '—a SARS-related virus first identified ROW_DELROW_DEL'
            'last year and already linked to 18 deaths—ROW_DEL'
            'has contracted the illness himself, ROW_DEL'
            'intensifying concerns about the ROW_DEL'
            "virus's ability to spread ROW_DEL"
            'from person to person.')

    with open('eessaa.txt','w') as f:
        f.write(text)

    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print ch.replace('ROW_DEL','ROW_DEL\n')
        print '\nlength of the text : %d chars\n' % len(text)

    #==========================================

    from os.path import getsize
    from os import fsync,linesep

    def rewrite(whichfile,sep,chunk_length,OSeol=linesep):
        if chunk_length [...] fR now at position %d\n'
                       'twelve == %r %d chars %s\n'
                       ' -> fR now at position %d'
                       % (chunk ,len(chunk), pch, twelve,len(twelve),m, ptw) )

                pos = fW.tell()
                fW.write(y)
                fW.flush()
                fsync(fW.fileno())
                print (' %r %d long\n'
                       ' has been written from position %d\n'
                       ' => fW now at position %d'
                       % (y,len(y),pos,fW.tell()))

                if fR.tell() [...] fR moved %d characters back to position %d'\
                           % (x2-pt,fR.tell())
                else:
                    print (" => fR is at position %d == file's size\n"
                           ' File has thoroughly been read'
                           % fR.tell())
                    fW.truncate()
                    break

                raw_input('\npress any key to continue')

    rewrite('eessaa.txt','ROW_DEL',14)

    with open('eessaa.txt','rb') as f:
        ch = f.read()
        print '\n'.join(repr(line)[1:-1] for line in ch.splitlines(1))
        print '\nlength of the text : %d chars\n' % len(ch)

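There is some subtlety in the treatment of the ends of the chunks, in order to detect whether ROW_DEL straddles two chunks and whether there are two contiguous ROW_DELs. That is why it took me a long time to post my solution: I was finally obliged to write fR.seek(-x2+pt,1), and not merely fR.seek(-2*x,1) or fR.seek(-x,1) according to whether sep straddles or not (2*x is x2 in the code; with ROW_DEL, x and x2 are 6 and 12). Anyone interested in this point can examine it by changing the code in the two branches that depend on whether 'ROW_DEL' is in twelve or not.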
