Indexing and Searching Chinese Files with Lucene 3.0.0

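1. My original program

My original program was quite simple: it was adapted directly from SearchFiles and IndexFiles in the Lucene demo. The only difference is that it uses the SmartCN analyzer.

Here are the parts I changed.

IndexChinese.java: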

    Date start = new Date();
    try {
      // Same as the demo's IndexFiles, except for the SmartCN analyzer
      IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
          new SmartChineseAnalyzer(Version.LUCENE_CURRENT), true,
          IndexWriter.MaxFieldLength.LIMITED);
      indexDocs(writer, docDir);
      System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
      System.out.println("Optimizing...");
      //writer.optimize();
      writer.close();

      Date end = new Date();
      System.out.println(end.getTime() - start.getTime() + " total milliseconds");
    } catch (IOException e) {
      // Report any I/O failure during indexing
      System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
    }

SearchChinese.java:

    Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);

    BufferedReader in = null;
    if (queries != null) {
      in = new BufferedReader(new FileReader(queries));
    } else {
      in = new BufferedReader(new InputStreamReader(System.in, "GBK"));
    }

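Here I specified that queries typed on standard input are GBK-encoded.

Then I ran it, full of confidence... and found that Chinese text could not be found at all, while English searches worked normally.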
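To see why the charset handed to InputStreamReader matters, here is a minimal sketch using only the JDK (Java 8 or later). The class QueryEncodingDemo and the method readQuery are hypothetical names for illustration, not part of the Lucene demo. UTF-8 and ISO-8859-1 stand in for the real encodings because every JVM is guaranteed to support them; a mismatch involving GBK behaves in the same way. The same bytes decoded under the wrong charset produce a different string, which can never match the indexed terms, while pure ASCII text is unaffected.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Lucene demo sources.
class QueryEncodingDemo {

    // Decodes the first line of rawInput with the given charset, the same way
    // SearchChinese wraps System.in in an InputStreamReader.
    static String readQuery(byte[] rawInput, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(rawInput), charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String query = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        // Assumption for illustration: the console delivers the query as UTF-8 bytes.
        byte[] typed = query.getBytes(StandardCharsets.UTF_8);

        System.out.println(query.equals(readQuery(typed, StandardCharsets.UTF_8)));      // true
        System.out.println(query.equals(readQuery(typed, StandardCharsets.ISO_8859_1))); // false

        // ASCII bytes decode identically under both charsets.
        byte[] ascii = "lucene".getBytes(StandardCharsets.UTF_8);
        System.out.println("lucene".equals(readQuery(ascii, StandardCharsets.ISO_8859_1))); // true
    }
}
```

The last line matches the symptom described here: English queries keep working even when the charset is wrong, so only the Chinese ones reveal the problem.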

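2. Finding the problem

That was frustrating. I was not very familiar with either Java or Lucene, and there was not much discussion online about version 3.0.0, so I fumbled around for a while. I then found that if I re-saved the files as ANSI (they had been UTF-8), Chinese text could be searched. So it looked like a file-encoding problem. After some digging, I found the following code in IndexChinese.java: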

    static void indexDocs(IndexWriter writer, File file)
        throws IOException {
      // do not try to index files that cannot be read
      if (file.canRead()) {
        if (file.isDirectory()) {
          String[] files = file.list();
          // an IO error could occur
          if (files != null) {
            for (int i = 0; i < files.length; i++) {
              indexDocs(writer, new File(file, files[i]));
            }
          }
        } else {
          System.out.println("adding " + file);
          try {
            writer.addDocument(FileDocument.Document(file));
          }
          // at least on windows, some temporary files raise this exception with an "access denied" message
          // checking if the file can be read doesn't help
          catch (FileNotFoundException fnfe) {
            ;
          }
        }
      }
    }
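The ANSI versus UTF-8 observation suggests that the file contents are being decoded with the platform default charset somewhere behind FileDocument.Document(file). As a minimal sketch, assuming the files really are UTF-8, a file can be opened with an explicit charset instead. The class ExplicitCharsetReading and its methods are hypothetical names, the code uses only the JDK (Java 8 or later), and ISO-8859-1 stands in for a mismatched default charset. How the resulting Reader would be wired into the document is not shown in the code above, so that part is left out.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the Lucene demo sources.
class ExplicitCharsetReading {

    // Opens a file with an explicit charset rather than the platform default.
    static Reader open(File file, Charset charset) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset));
    }

    // Reads the whole file into a String using the given charset.
    static String readAll(File file, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = open(file, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Writes text to a temporary file in the given charset (test fixture only).
    static File writeTemp(String text, Charset charset) {
        try {
            File f = File.createTempFile("lucene-encoding", ".txt");
            f.deleteOnExit();
            try (Writer w = new OutputStreamWriter(new FileOutputStream(f), charset)) {
                w.write(text);
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587\u68c0\u7d22 lucene"; // Chinese text followed by ASCII
        File f = writeTemp(text, StandardCharsets.UTF_8);

        System.out.println(text.equals(readAll(f, StandardCharsets.UTF_8)));      // true
        System.out.println(text.equals(readAll(f, StandardCharsets.ISO_8859_1))); // false
        // The ASCII tail is intact even when the charset is wrong.
        System.out.println(readAll(f, StandardCharsets.ISO_8859_1).endsWith(" lucene")); // true
    }
}
```

With the wrong charset the Chinese characters are garbled before any analyzer sees them, while the English words are read correctly, which is consistent with English searches working and Chinese ones failing.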

Time: 2024-12-29 02:08:56
