For more information on past and future Lucene versions, please see: http://s.apache.org/luceneversions
If you instead want to apply different similarities (e.g. ones with
different parameter values or different algorithms entirely) to different
fields, implement PerFieldSimilarityWrapper with your per-field logic.
(David Mark Nemeskey via Robert Muir)
Some existing analysis components changed packages:
All analyzers in contrib/analyzers and contrib/icu were moved to the
analysis/ module. The 'smartcn' and 'stempel' components now depend on 'common'.
(Chris Male, Robert Muir)
writer.doXYZ()
to writer.get().doXYZ()
(it is also advisable to add an assert writer != null;
before you
access the wrapped IndexWriter.)
In addition, MergePolicy only exposes a default constructor, and the one that
took IndexWriter as argument has been removed from all MergePolicy extensions.
(Shai Erera via Mike McCandless)
In addition, Directory.copyTo* were removed in favor of copy which takes the
target Directory, source and target files as arguments, and copies the source
file to the target Directory under the target file name.
(Shai Erera)
ClassicTokenizer/Analyzer retains the old (pre-Lucene 3.1) StandardTokenizer/ Analyzer implementation and behavior. Only the Unicode Basic Multilingual Plane (code points from U+0000 to U+FFFF) is covered.
UAX29URLEmailTokenizer tokenizes URLs and E-mail addresses according to the
relevant RFCs, in addition to implementing the UAX#29 Word Break rules.
(Steven Rowe, Robert Muir, Uwe Schindler)
Alternatively, use Searchable.search(Weight, Filter, Collector) and pass in a TopFieldCollector instance, using the following code sample:
TopFieldCollector tfc = TopFieldCollector.create(sort, numHits, fillFields,
true /* trackDocScores */,
true /* trackMaxScore */,
false /* docsInOrder */);
searcher.search(query, tfc);
TopDocs results = tfc.topDocs();>
Note that your Sort object cannot use SortField.AUTO when you directly instantiate TopFieldCollector.
Also, the method search(Weight, Filter, Collector) was added to the Searchable interface and the Searcher abstract class to replace the deprecated HitCollector versions. If you either implement Searchable or extend Searcher, you should change your code to implement this method. If you already extend IndexSearcher, no further changes are needed to use Collector.
Finally, the values Float.NaN and Float.NEGATIVE_INFINITY are not
valid scores. Lucene uses these values internally in certain
places, so if you have hits with such scores, it will cause
problems.
(Shai Erera via Mike McCandless)
The interface changes are only notable for users implementing the interfaces,
which was unlikely done, because there is no possibility to change
Lucene's FieldCache implementation.
(Grant Ingersoll, Uwe Schindler)
Going forward Searchable will be kept for convenience only and may
be changed between minor releases without any deprecation
process. It is not recommended that you implement it, but rather extend
Searcher.
(Shai Erera, Chris Hostetter, Martin Ruckli, Mark Miller via Mike McCandless)
TopDocsCollector tdc = new TopScoreDocCollector(10);
Collector c = new PositiveScoresOnlyCollector(tdc);
searcher.search(query, c);
TopDocs hits = tdc.topDocs();
...>
NOTE: users of ParallelReader must change back all of these
defaults in order to ensure the docIDs "align" across all parallel
indices.
(Mike McCandless)
NOTE: this also fixes an "under-merging" bug whereby it is
possible to get far too many segments in your index (which will
drastically slow down search, risks exhausting file descriptor
limit, etc.). This can happen when the number of buffered docs
at close, plus the number of docs in the last non-ram segment is
greater than mergeFactor.
(Ning Li, Yonik Seeley)
These changes are back compatible. The new code can read old
indexes. But old code will not be able read new indexes.
(cutting)
Note: This changes the encoding of an indexed value. Indexes
should be re-created from scratch in order for search scores to
be correct. With the new code and an old index, searches will
yield very large scores for shorter fields, and very small scores
for longer fields. Once the index is re-created, scores will be
as before.
(cutting)
This permits, for the purpose of phrase searching, placing multiple terms in a single position. This is useful with stemmers that produce multiple possible stems for a word.
This also permits the introduction of gaps between terms, so that terms which are adjacent in a token stream will not be matched by and exact phrase query. This makes it possible, e.g., to build an analyzer where phrases are not matched over stop words which have been removed.
Finally, repeating a token with an increment of zero can also be
used to boost scores of matches on that token.
(cutting)
This could be used, for example, with a RangeQuery on a formatted
date field to implement date filtering. One could re-use a
single QueryFilter that matches, e.g., only documents modified
within the last week. The QueryFilter and RangeQuery would only
need to be reconstructed once per day.
(cutting)
a. Queries are no longer modified during a search. This makes it possible, e.g., to reuse the same query instance with multiple indexes from multiple threads.
b. Term-expanding queries (e.g. PrefixQuery, WildcardQuery, etc.) now work correctly with MultiSearcher, fixing bugs 12619 [LUCENE-56] and 12667 [LUCENE-57].
c. Boosting BooleanQuery's now works, and is supported by the query parser (problem reported by Lee Mallabone). Thus a query like "(+foo +bar)^2 +baz" is now supported and equivalent to "(+foo^2 +bar^2) +baz".
d. New method: Query.rewrite(IndexReader). This permits a query to re-write itself as an alternate, more primitive query. Most of the term-expanding query classes (PrefixQuery, WildcardQuery, etc.) are now implemented using this method.
e. New method: Searchable.explain(Query q, int doc). This returns an Explanation instance that describes how a particular document is scored against a query. An explanation can be displayed as either plain text, with the toString() method, or as HTML, with the toHtml() method. Note that computing an explanation is as expensive as executing the query over the entire index. This is intended to be used in developing Similarity implementations, and, for good performance, should not be displayed with every hit.
f. Scorer and Weight are public, not package protected. It now possible for someone to write a Scorer implementation that is not in the org.apache.lucene.search package. This is still fairly advanced programming, and I don't expect anyone to do this anytime soon, but at least now it is possible.
g. Added public accessors to the primitive query classes (TermQuery, PhraseQuery and BooleanQuery), permitting access to their terms and clauses.
Caution: These are extensive changes and they have not yet been
tested extensively. Bug reports are appreciated.
(cutting)