Counting the number of unique words and the occurrences of each word in a txt file

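I'm currently trying to build an application that does some text processing. It reads a text file, and I use a dictionary to build a word index. Technically it would work like this: the program runs, reads the text file, and checks for each word whether it has already appeared in the file and which id it has as a unique word. It then prints the index number and the total number of occurrences of each word it encounters, and continues through the whole file, producing something like this: http://pastebin.com/CjtcYchF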

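Here is a sample of the text file I'm feeding in: http://pastebin.com/ZRVbhWhV. A quick ctrl-F shows that "not" appears 2 times and "that" appears 4 times. What I need to do is index each word and refer to it like this: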

    sample input : "that I have not that place sunrise beach like not good dirty beach trash beach"

    dictionary :
    index word
    1 I
    2 have
    3 not
    4 that
    5 place
    6 sunrise
    7 beach
    8 like
    9 good
    10 dirty
    11 trash

    output.txt / output.dat:
    4:2 1:1 2:1 3:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1

I've tried to implement some code to create the dictionary. This is what I have so far:

    private void bagofword_Click(object sender, EventArgs e)
    {
        //creating dictionary in background
        //Dictionary dict = new Dictionary();
        string rawinputbow = File.ReadAllText(textBox31.Text);
        //string[] inputbow = rawinputbow.Split(' ');
        var inputbow = rawinputbow.Split(" ".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
                                  .ToList();
        var dict = new OrderedDictionary();
        var output = new List<int>();
        foreach (var element in inputbow.Select((word, index) => new { word, index }))
        {
            if (dict.Contains(element.word))
            {
                var count = (int)dict[element.word];
                dict[element.word] = ++count;
                output.Add(GetIndex(dict, element.word));
                //textBoxfile.Text = output.ToString();
                // textBoxfile.Text = inputbow.ToString();
                string result = string.Join(",", output);
                textBoxfile.Text = result.ToString();
            }
            else
            {
                dict[element.word] = 1;
                output.Add(GetIndex(dict, element.word));
                //textBoxfile.Text = dict.ToString();
                string result = string.Join(",", output);
                textBoxfile.Text = result.ToString();
            }
        }
    }

    public int GetIndex(OrderedDictionary dictionary, string key)
    {
        for (int index = 0; index < dictionary.Count; index++)
        {
            if (dictionary[index] == dictionary[key])
                return index; // We found the item
            //textBoxfile.Text = index.ToString();
        }
        return -1;
    }

Does anyone know how to complete this code? Any help is much appreciated!

Use this code:

    string input = "that I have not that place sunrise beach like not good dirty beach trash beach";
    // Passing null as the separator splits on whitespace
    var wordList = input.Split(null);
    // Group identical words, count each group, and sort by ascending count
    var output = wordList.GroupBy(x => x)
                         .Select(x => new Word { charchter = x.Key, repeat = x.Count() })
                         .OrderBy(x => x.repeat);
    foreach (var item in output)
    {
        textBoxfile.Text += item.charchter + " : " + item.repeat + Environment.NewLine;
    }

The class used to hold the data:

    // Named Word (capitalized) to match the "new Word { ... }" used in the query
    public class Word
    {
        public string charchter { get; set; }
        public int repeat { get; set; }
    }

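Splitting on whitespace is not enough. You have tokens like "temple," and "photos.cafes/restaraunts". A better approach is to use a regular expression such as \w+. The words should also be compared case-insensitively.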

My approach would be:

    var words = Regex.Matches(File.ReadAllText(filename), @"\w+").Cast<Match>()
        .Select((m, pos) => new { Word = m.Value, Pos = pos })
        .GroupBy(s => s.Word, StringComparer.CurrentCultureIgnoreCase)
        .Select(g => new { Word = g.Key, PosInText = g.Select(z => z.Pos).ToList() })
        .ToList();

    foreach (var item in words)
    {
        Console.WriteLine("{0,-15} POS:{1}", item.Word, string.Join(",", item.PosInText));
    }

    for (int i = 0; i < words.Count; i++)
    {
        Console.Write("{0}:{1} ", i, words[i].PosInText.Count);
    }
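
For what it's worth, here is a minimal sketch of how the dictionary and the `id:count` line from the question could be produced in a single pass. It is not the code from either answer above. The names `WordIndex`, `WordIndexer`, `Build`, and `FormatCounts` are made up for illustration. It assumes ids are assigned in order of first appearance (so `that` gets id 1 rather than the 4 shown in the question), and that words are matched with `\w+` and compared case-insensitively.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical container: Words[id - 1] is the word with that id,
// Counts[id - 1] is how many times it occurs in the text.
public sealed class WordIndex
{
    public List<string> Words { get; } = new List<string>();
    public List<int> Counts { get; } = new List<int>();
}

public static class WordIndexer
{
    // Assigns 1-based ids in order of first appearance (an assumption,
    // the question does not say how ids are chosen) and counts occurrences.
    public static WordIndex Build(string text)
    {
        var index = new WordIndex();
        var ids = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);

        foreach (Match m in Regex.Matches(text, @"\w+"))
        {
            int id;
            if (!ids.TryGetValue(m.Value, out id))
            {
                // First time this word is seen: give it the next free id
                id = index.Words.Count + 1;
                ids[m.Value] = id;
                index.Words.Add(m.Value);
                index.Counts.Add(0);
            }
            index.Counts[id - 1]++;
        }
        return index;
    }

    // One "id:count" pair per distinct word, separated by spaces
    public static string FormatCounts(WordIndex index)
    {
        return string.Join(" ", index.Counts.Select((count, i) => (i + 1) + ":" + count));
    }

    public static void Main()
    {
        var index = Build("that I have not that place sunrise beach like not good dirty beach trash beach");

        Console.WriteLine("index word");
        for (int i = 0; i < index.Words.Count; i++)
        {
            Console.WriteLine("{0} {1}", i + 1, index.Words[i]);
        }
        Console.WriteLine(FormatCounts(index));
    }
}
```

For the sample input this prints the 11 distinct words with their ids, followed by `1:2 2:1 3:1 4:2 5:1 6:1 7:3 8:1 9:1 10:1 11:1`. To get a file instead of console output, the same strings could be passed to `File.WriteAllText`.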

### Sample code for you to tweak for your needs:

    touch test.txt
    echo "ravi chandran marappan 30" > test.txt
    echo "ramesh kumar marappan 24" >> test.txt
    echo "ram lakshman marappan 22" >> test.txt
    sed -e 's/ /\n/g' test.txt | sort | uniq | awk '{print "echo \"" $1 " -\"`grep -wc " $1 " test.txt`"}' | sh

Results:

    22 -1
    24 -1
    30 -1
    chandran -1
    kumar -1
    lakshman -1
    marappan -3
    ram -1
    ramesh -1
    ravi -1