TokenStream
org.apache.lucene.analysis.TokenStream
An abstract class. A TokenStream enumerates a sequence of tokens, either from the fields of a document or from query text.
A TokenStream enumerates the sequence of tokens, either from Fields of a Document or from query text.
TokenStream org.apache.lucene.analysis.Analyzer.tokenStream(String fieldName, Reader reader)
Returns a TokenStream produced by running this Analyzer's tokenization over the text in reader.
Creates a TokenStream which tokenizes all the text in the provided Reader.
void org.apache.lucene.analysis.TokenStream.reset() throws IOException
Resets the TokenStream's cursor to the initial position.
Resets this stream to the beginning.
boolean org.apache.lucene.analysis.TokenStream.incrementToken() throws IOException
A consumer (i.e., IndexWriter) uses this method to obtain the next token.
Consumers (i.e., IndexWriter) use this method to advance the stream to the next token.
org.apache.lucene.analysis.tokenattributes.CharTermAttribute
The term text of a token.
The term text of a Token.
<A extends Attribute> A org.apache.lucene.util.AttributeSource.getAttribute(Class<A> attClass)
Obtains the specified Attribute.
The caller must pass in a Class<? extends Attribute> value. Returns the instance of the passed in Attribute contained in this AttributeSource.
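The methods above combine into the standard consumption loop: obtain a TokenStream from an Analyzer, grab the CharTermAttribute, reset, then call incrementToken until it returns false. A minimal sketch against the Lucene 3.x API (the WhitespaceAnalyzer, field name, and input string are illustrative choices, not from these notes):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class ConsumeTokens {
    // Collect every term text emitted by the analyzer for the given text.
    public static List<String> tokensOf(Analyzer analyzer, String text) throws IOException {
        TokenStream ts = analyzer.tokenStream("body", new StringReader(text));
        // addAttribute returns the registered CharTermAttribute instance,
        // creating it if necessary; getAttribute would throw if it were absent.
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        List<String> tokens = new ArrayList<String>();
        ts.reset();                      // rewind the stream to the beginning
        while (ts.incrementToken()) {    // advance to the next token
            tokens.add(termAtt.toString());
        }
        ts.end();
        ts.close();
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokensOf(new WhitespaceAnalyzer(Version.LUCENE_36), "Hello World"));
    }
}
```

Note that reset() must be called before the first incrementToken(), and end()/close() after the loop; consumers like IndexWriter follow this same contract.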
Tokenizer
org.apache.lucene.analysis.Tokenizer
A Tokenizer is a TokenStream whose input is a Reader.
A Tokenizer is a TokenStream whose input is a Reader.
TokenFilter
org.apache.lucene.analysis.TokenFilter
A TokenFilter is a TokenStream whose input is another TokenStream; it is used for filtering.
A TokenFilter is a TokenStream whose input is another TokenStream.
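Writing a TokenFilter means wrapping another TokenStream and overriding incrementToken() to pull tokens from the protected input field. A sketch of a hypothetical filter (MinLengthFilter is an invented name, not a Lucene class) that drops short tokens, written against the Lucene 3.x API:

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical example filter: drops tokens shorter than minLength characters.
public final class MinLengthFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final int minLength;

    public MinLengthFilter(TokenStream input, int minLength) {
        super(input);   // the wrapped TokenStream becomes the protected field 'input'
        this.minLength = minLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // Pull tokens from the wrapped stream until one passes the filter.
        while (input.incrementToken()) {
            if (termAtt.length() >= minLength) {
                return true;
            }
        }
        return false;   // the wrapped stream is exhausted
    }
}
```

Because a filter shares its wrapped stream's attributes, modifying termAtt here would also change what downstream filters see; this is how LowerCaseFilter rewrites the term text in place.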
org.apache.lucene.analysis.LowerCaseFilter
Normalizes token text to lower case.
Normalizes token text to lower case.
org.apache.lucene.analysis.StopFilter
Removes stop words from a TokenStream.
Removes stop words from a token stream.
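These filters are typically chained: a Tokenizer produces the raw stream, and each TokenFilter wraps the previous stage. A sketch using the Lucene 3.x constructor signatures (the whitespace tokenizer and the built-in English stop set are illustrative choices):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class FilterChain {
    // Tokenize on whitespace, lower-case each token, then drop English stop words.
    public static List<String> analyze(String text) throws IOException {
        TokenStream ts = new WhitespaceTokenizer(Version.LUCENE_36, new StringReader(text));
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        ts = new StopFilter(Version.LUCENE_36, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);

        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        List<String> tokens = new ArrayList<String>();
        ts.reset();
        while (ts.incrementToken()) {
            tokens.add(termAtt.toString());
        }
        ts.end();
        ts.close();
        return tokens;
    }
}
```

Order matters here: LowerCaseFilter runs before StopFilter so that "The" is normalized to "the" and matched against the lower-cased stop set.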
Analyzer
org.apache.lucene.analysis.KeywordAnalyzer
Treats the entire stream as a single token. Useful for data such as zip codes and product names.
"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.
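A quick illustration of that behavior, assuming Lucene 3.x (the field name and input are made up for the example): no matter how many words the input contains, the stream yields exactly one token.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class KeywordDemo {
    // Returns the single token KeywordAnalyzer emits for the given text.
    public static String singleToken(String text) throws IOException {
        KeywordAnalyzer analyzer = new KeywordAnalyzer();
        TokenStream ts = analyzer.tokenStream("sku", new StringReader(text));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        String token = ts.incrementToken() ? termAtt.toString() : null;
        ts.close();
        return token;   // the whole input, untouched, as one token
    }
}
```

This is why KeywordAnalyzer suits exact-match fields: the indexed term equals the original value, spaces included.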
org.apache.lucene.analysis.ReusableAnalyzerBase
A convenience subclass of Analyzer that makes it easy to implement TokenStream reuse.
A convenience subclass of Analyzer that makes it easy to implement TokenStream reuse.
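A subclass only implements createComponents, which builds the Tokenizer-plus-filter chain once; the base class caches the resulting TokenStreamComponents per thread and reuses them across calls. A sketch against the Lucene 3.x API (LowercaseWhitespaceAnalyzer is an invented name for the example):

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: whitespace tokenization followed by lower-casing.
public final class LowercaseWhitespaceAnalyzer extends ReusableAnalyzerBase {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        // The components built here are cached and reset with a new Reader
        // on subsequent calls, instead of being re-created each time.
        return new TokenStreamComponents(source, new LowerCaseFilter(Version.LUCENE_36, source));
    }
}
```

Reuse avoids allocating a fresh Tokenizer and filter chain for every field of every document, which matters when indexing at volume.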