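Many search engines highlight the matched keywords in their results so that users can spot them quickly, and Lucene.NET provides highlighting as well.
Highlighting in Lucene.NET is implemented by the Lucene.Net.Highlighter package (namespace Lucene.Net.Search.Highlight). Install it with the NuGet package manager, preferably in the same version as Lucene.NET itself.
Highlighting is a nice-to-have feature, so I decided to make it a configurable option of the search input and to implement the highlighting itself inside the concrete query method.
The search input class SingleSearchOption is changed to: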
    public class SingleSearchOption : SearchOptionBase
    {
        /// <summary>
        /// Search keyword
        /// </summary>
        public string Keyword { get; set; }

        /// <summary>
        /// Fields the search is restricted to
        /// </summary>
        public List<string> Fields { get; set; }

        /// <summary>
        /// Whether to highlight matches
        /// </summary>
        public bool IsHightLight { get; set; }

        public SingleSearchOption(string keyword, List<string> fields, int maxHits = 100, bool isHightLight = false)
        {
            if (string.IsNullOrWhiteSpace(keyword))
            {
                throw new ArgumentException("The search keyword must not be empty");
            }
            Keyword = keyword;
            Fields = fields;
            MaxHits = maxHits;
            IsHightLight = isHightLight;
        }
    }

The query method SingleSearch is changed to:
    public SingleSearchResult SingleSearch(SingleSearchOption option)
    {
        SingleSearchResult result = new SingleSearchResult();
        Stopwatch watch = Stopwatch.StartNew();
        using (Lucene.Net.Index.DirectoryReader reader = DirectoryReader.Open(Directory))
        {
            // Create the index searcher
            IndexSearcher searcher = new IndexSearcher(reader);
            var queryParser = new MultiFieldQueryParser(LuceneVersion.LUCENE_48, option.Fields.ToArray(), Analyzer);
            Query query = queryParser.Parse(option.Keyword);
            var matches = searcher.Search(query, option.MaxHits).ScoreDocs;

            #region Highlighting
            QueryScorer scorer = new QueryScorer(query);
            Highlighter highlighter = new Highlighter(scorer);
            #endregion

            result.TotalHits = matches.Count();
            foreach (var match in matches)
            {
                var doc = searcher.Doc(match.Doc);
                SearchResultItem item = new SearchResultItem();
                item.Score = match.Score;
                item.EntityId = doc.GetField(CoreConstant.EntityId).GetStringValue();
                item.EntityName = doc.GetField(CoreConstant.EntityType).GetStringValue();
                string storedField = doc.Get(option.Fields[0]);
                if (option.IsHightLight) // highlighting requested
                {
                    TokenStream stream = TokenSources.GetAnyTokenStream(reader, match.Doc, option.Fields[0], doc, Analyzer);
                    IFragmenter fragmenter = new SimpleSpanFragmenter(scorer);
                    highlighter.TextFragmenter = fragmenter;
                    string fragment = highlighter.GetBestFragment(stream, storedField);
                    // GetBestFragment returns null when no fragment scores above zero
                    // (for example, the hit was in another field), so fall back to the stored text.
                    item.FieldValue = fragment ?? storedField;
                }
                else
                {
                    item.FieldValue = storedField;
                }
                result.Items.Add(item);
            }
        }
        watch.Stop();
        result.Elapsed = watch.ElapsedMilliseconds;
        return result;
    }
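That completes a simple highlighting feature. Let's test it through the WebAPI endpoint.
In the response, every occurrence of the keyword “设计” ("design") in the results is wrapped in <B></B> tags.
The example above uses the simplest highlighting that Lucene.NET offers. The principle behind it is not complicated, and understanding it helps when building richer effects. In short, the query results are post-processed: the positions of the matching keywords are located, and the result text is rewritten with markup added around them. The rough flow is as follows.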
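To make the principle concrete, here is a minimal sketch in plain C# with no Lucene.NET dependency. The class NaiveHighlighter and its toy tokenizer are hypothetical and only illustrate the idea of "find the token offsets, then rewrite the text"; the real Highlighter additionally scores and selects fragments.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; this is not part of Lucene.NET.
public static class NaiveHighlighter
{
    // Splits text into runs of letters/digits and records, for each token,
    // its lowercased term plus start/end character offsets in the original text.
    public static List<(string Term, int Start, int End)> Tokenize(string text)
    {
        var tokens = new List<(string Term, int Start, int End)>();
        int i = 0;
        while (i < text.Length)
        {
            if (!char.IsLetterOrDigit(text[i])) { i++; continue; }
            int start = i;
            while (i < text.Length && char.IsLetterOrDigit(text[i])) i++;
            tokens.Add((text.Substring(start, i - start).ToLowerInvariant(), start, i));
        }
        return tokens;
    }

    // Rewrites the text, wrapping every token equal to the keyword in pre/post tags.
    public static string Highlight(string text, string keyword, string pre = "<B>", string post = "</B>")
    {
        string wanted = keyword.ToLowerInvariant();
        var sb = new StringBuilder();
        int last = 0; // end offset of the text copied so far
        foreach (var token in Tokenize(text))
        {
            if (token.Term != wanted) continue;
            sb.Append(text, last, token.Start - last);                  // text before the match
            sb.Append(pre).Append(text, token.Start, token.End - token.Start).Append(post);
            last = token.End;
        }
        sb.Append(text, last, text.Length - last);                      // remaining tail
        return sb.ToString();
    }
}
```

Calling NaiveHighlighter.Highlight("Good design beats clever design.", "Design") returns "Good <B>design</B> beats clever <B>design</B>.". The offsets come from the tokenizer, which is why the same tokenization has to be used when locating matches.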
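    QueryScorer scorer = new QueryScorer(query);

QueryScorer implements the Lucene.Net.Search.Highlight.IScorer interface and scores each text fragment by the number of unique query terms found in it. Its constructor takes the current Query instance; optional parameters are an IndexReader instance and the name of the field to highlight.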
    public QueryScorer(Query query) => this.Init(query, (string) null, (IndexReader) null, true);

    public QueryScorer(Query query, string field) => this.Init(query, field, (IndexReader) null, true);

    public QueryScorer(Query query, IndexReader reader, string field) => this.Init(query, field, reader, true);

    public QueryScorer(Query query, IndexReader reader, string field, string defaultField)
    {
        this.defaultField = defaultField.Intern();
        this.Init(query, field, reader, true);
    }

    public QueryScorer(Query query, string field, string defaultField)
    {
        this.defaultField = defaultField.Intern();
        this.Init(query, field, (IndexReader) null, true);
    }
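    Highlighter highlighter = new Highlighter(scorer);

As its name suggests, the Highlighter class marks up the matching terms in the text. When only a scorer is passed, it falls back to SimpleHTMLFormatter and DefaultEncoder, as its constructors show: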
    public Highlighter(IScorer fragmentScorer)
        : this((IFormatter) new SimpleHTMLFormatter(), fragmentScorer)
    {
    }

    public Highlighter(IFormatter formatter, IScorer fragmentScorer)
        : this(formatter, (IEncoder) new DefaultEncoder(), fragmentScorer)
    {
    }

    public Highlighter(IFormatter formatter, IEncoder encoder, IScorer fragmentScorer)
    {
        this._formatter = formatter;
        this._encoder = encoder;
        this._fragmentScorer = fragmentScorer;
    }
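Because the formatter is a constructor parameter, the default <B></B> markup can be swapped out. The following is a small sketch of that, assuming the Lucene.Net, Lucene.Net.Analysis.Common and Lucene.Net.Highlighter 4.8 packages are installed; the class CustomTagHighlight, the field name "content" and the CSS class "hit" are made up for illustration.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Lucene.Net.Util;

// Hypothetical helper, not part of the article's project.
public static class CustomTagHighlight
{
    public static string Highlight(string text, string keyword)
    {
        const string field = "content"; // assumed field name
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            // StandardAnalyzer lowercases tokens, so the query term is lowercased too.
            Query query = new TermQuery(new Term(field, keyword.ToLowerInvariant()));
            var scorer = new QueryScorer(query, field);

            // Use custom markup instead of the default <B>...</B>.
            IFormatter formatter = new SimpleHTMLFormatter("<em class=\"hit\">", "</em>");
            var highlighter = new Highlighter(formatter, scorer);

            // Returns null when the text contains no matching term.
            return highlighter.GetBestFragment(analyzer, field, text);
        }
    }
}
```

In the article's SingleSearch method the same effect would only require passing a formatter as the first argument when constructing the Highlighter.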
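TokenStream is arguably the core of the process. From the Lucene.NET workflow covered earlier, we know that text is split into a TokenStream that records the position of each token. With the TokenStream, the terms that need highlighting can be located quickly.

    TokenStream stream = TokenSources.GetAnyTokenStream(reader, match.Doc, option.Fields[0], doc, Analyzer);

This obtains the TokenStream for the given field of the matched document, using the current IndexReader, the Document and the Analyzer. It reads the field's term vector from the index when one exists and otherwise re-analyzes the stored text with the Analyzer.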
    public static TokenStream GetAnyTokenStream(
        IndexReader reader,
        int docId,
        string field,
        Document doc,
        Analyzer analyzer)
    {
        TokenStream tokenStream = (TokenStream) null;
        Terms terms = reader.GetTermVectors(docId)?.GetTerms(field);
        if (terms != null)
            tokenStream = TokenSources.GetTokenStream(terms);
        return tokenStream ?? TokenSources.GetTokenStream(doc, field, analyzer);
    }
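The term-vector branch is only taken when the field was indexed with term vectors. As a hedged sketch (the article does not show how its fields are indexed), the following assumes the Lucene.Net and Lucene.Net.Analysis.Common 4.8 packages; the class TermVectorField and the field name "content" are hypothetical. It builds a field type that stores term vectors with positions and offsets and checks that a term vector can be read back.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

// Hypothetical helper, not part of the article's project.
public static class TermVectorField
{
    // Stored, analyzed text whose term vectors carry positions and offsets.
    public static FieldType CreateHighlightableType()
    {
        return new FieldType(TextField.TYPE_STORED)
        {
            StoreTermVectors = true,
            StoreTermVectorPositions = true,
            StoreTermVectorOffsets = true
        };
    }

    // Indexes one document in memory and reports whether a term vector
    // exists for the field, i.e. which branch GetAnyTokenStream would take.
    public static bool HasTermVector(FieldType type)
    {
        const string field = "content"; // assumed field name
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        using (var directory = new RAMDirectory())
        {
            using (var writer = new IndexWriter(directory, new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)))
            {
                var doc = new Document();
                doc.Add(new Field(field, "Good design ages well.", type));
                writer.AddDocument(doc);
                writer.Commit();
            }
            using (DirectoryReader reader = DirectoryReader.Open(directory))
            {
                Terms terms = reader.GetTermVectors(0)?.GetTerms(field);
                return terms != null;
            }
        }
    }
}
```

Storing term vectors makes the index larger, so whether this is worthwhile depends on how often highlighting is used; without them, the fallback of re-analyzing the stored text still works.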
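    IFragmenter fragmenter = new SimpleSpanFragmenter(scorer);
    highlighter.TextFragmenter = fragmenter;

This instantiates a class that implements the IFragmenter interface, which splits the text into fragments of varying size rather than into individual characters.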
    /// <param name="queryScorer"><see cref="T:Lucene.Net.Search.Highlight.QueryScorer" /> that was used to score hits</param>
    public SimpleSpanFragmenter(QueryScorer queryScorer)
        : this(queryScorer, 100)
    {
    }

    /// <param name="queryScorer"><see cref="T:Lucene.Net.Search.Highlight.QueryScorer" /> that was used to score hits</param>
    /// <param name="fragmentSize">size in bytes of each fragment</param>
    public SimpleSpanFragmenter(QueryScorer queryScorer, int fragmentSize)
    {
        this.fragmentSize = fragmentSize;
        this.queryScorer = queryScorer;
    }
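    String storedField = doc.Get(option.Fields[0]);
    string fragment = highlighter.GetBestFragment(stream, storedField);

The last step is to add the markup around the keywords that were found. GetBestFragment returns only the single best fragment; it delegates to GetBestFragments with maxNumFragments set to 1: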
    public string[] GetBestFragments(TokenStream tokenStream, string text, int maxNumFragments)
    {
        maxNumFragments = Math.Max(1, maxNumFragments);
        TextFragment[] bestTextFragments = this.GetBestTextFragments(tokenStream, text, true, maxNumFragments);
        List<string> stringList = new List<string>();
        for (int index = 0; index < bestTextFragments.Length; ++index)
        {
            if (bestTextFragments[index] != null && (double) bestTextFragments[index].Score > 0.0)
                stringList.Add(bestTextFragments[index].ToString());
        }
        return stringList.ToArray();
    }
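For long fields it can be useful to return several short snippets instead of one. The sketch below calls the GetBestFragments overload shown above together with a smaller fragment size; it assumes the Lucene.Net, Lucene.Net.Analysis.Common and Lucene.Net.Highlighter 4.8 packages, and the class MultiFragmentHighlight and the field name "content" are hypothetical.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Search.Highlight;
using Lucene.Net.Util;

// Hypothetical helper, not part of the article's project.
public static class MultiFragmentHighlight
{
    public static string[] BestFragments(string text, string keyword, int maxFragments, int fragmentSize)
    {
        const string field = "content"; // assumed field name
        using (var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48))
        {
            // StandardAnalyzer lowercases tokens, so the query term is lowercased too.
            Query query = new TermQuery(new Term(field, keyword.ToLowerInvariant()));
            var scorer = new QueryScorer(query, field);
            var highlighter = new Highlighter(scorer)
            {
                // Smaller fragments than the default of 100.
                TextFragmenter = new SimpleSpanFragmenter(scorer, fragmentSize)
            };

            // The highlighter resets and closes the stream itself.
            TokenStream stream = analyzer.GetTokenStream(field, text);

            // Only fragments with a score above zero are returned.
            return highlighter.GetBestFragments(stream, text, maxFragments);
        }
    }
}
```

The returned snippets can then be joined with a separator such as "..." before they are assigned to the result item.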
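Note: one thing to watch out for in this process is that the Analyzer (or, more precisely, the tokenizer it uses) must be the same at index time and at query time. If the index is built with one tokenizer and queried with another, the keywords may fail to match the results, and because the tokenization differs, the keyword positions will also be offset.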
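The matching half of this problem can be seen without Lucene.NET at all. The sketch below uses two toy tokenizers, one emitting single characters and one emitting overlapping character pairs; both are hypothetical stand-ins and not real Lucene.NET analyzers.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; these are not Lucene.NET analyzers.
public static class AnalyzerMismatchDemo
{
    // One token per character.
    public static List<string> Unigrams(string text)
    {
        var tokens = new List<string>();
        foreach (char c in text) tokens.Add(c.ToString());
        return tokens;
    }

    // Overlapping two-character tokens; a shorter input becomes a single token.
    public static List<string> Bigrams(string text)
    {
        var tokens = new List<string>();
        if (text.Length == 0) return tokens;
        if (text.Length < 2) { tokens.Add(text); return tokens; }
        for (int i = 0; i + 1 < text.Length; i++) tokens.Add(text.Substring(i, 2));
        return tokens;
    }

    // A query matches only if every query term exists among the indexed terms.
    public static bool Matches(List<string> indexedTerms, List<string> queryTerms)
    {
        var index = new HashSet<string>(indexedTerms);
        foreach (string term in queryTerms)
        {
            if (!index.Contains(term)) return false;
        }
        return true;
    }
}
```

With the text "软件设计" indexed as bigrams, the bigram query "设计" matches. If the same text is indexed as unigrams, the term "设计" does not exist in the index and the query finds nothing, even though the characters are all there.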