---
# Copyright Vespa.ai. All rights reserved.
title: "Query Rewriting"
redirect_from:
- /en/query-rewriting.html
---
<p>
A search application can improve result quality by interpreting the intended meaning of user queries.
Once the meaning is guessed,
the query can be rewritten to one that will satisfy the user better than the raw query.
Vespa includes a query rewriting language which makes it easy to use query
rewriting to understand and act upon the query semantics.
</p><p>
These query rewriting techniques can be combined to improve the search experience:
</p>
<ul>
<li>Query focusing: Decide a field to search for a term</li>
<li>Query enhancing: Add terms which improve the query</li>
<li>Stopwords: Remove terms which hurt recall or precision -
<a href="https://github.com/vespa-cloud/cord-19-search/blob/main/src/main/java/ai/vespa/example/cord19/searcher/BoldingSearcher.java">
example</a></li>
<li>Synonyms: Replace terms or phrases by others</li>
</ul>
<p>
Query rewriting is done by <em>semantic rules</em> or <em>searchers</em>.
Semantic rules are written in a simple production rule language that operates on queries.
For more complex query rewriting logic which cannot be handled by simple rules,
create a rewriting searcher which makes use of the query rewriting framework.
</p>
<h2 id="equiv">EQUIV</h2>
<p>
EQUIV is a query operator that can be used to add synonyms
for words where the various synonyms should be equivalent - example:
</p>
<ul>
<li>The user query is <code>(used AND automobile)</code></li>
<li><em>automobile</em> is a synonym for <em>car</em> (from a dictionary)</li>
<li>Rewrite the query to <code>(used AND (automobile EQUIV car))</code></li>
<li><em>automobile</em> and <em>car</em> are here equivalent -
the query shall behave as if all occurrences of <em>car</em> in the document corpus
had been replaced by <em>automobile</em></li>
</ul>
<p>
See the <a href="../reference/querying/yql.html#equiv">reference</a>
for differences between OR and EQUIV.
In many cases it might be better to use OR instead of EQUIV.
Example <em>Snoop</em> Dogg:
</p>
<pre>
"Snoop" EQUIV "Snoop Doggy Dogg" EQUIV "Snoop Lion" EQUIV "Calvin Broadus" EQUIV "Calvin Cordozar Broadus Junior"
</pre>
<p>
However, <em>Snoop</em> is used by other people -
so matching that alone is not a sure hit for the correct entity,
and finding more than one of the synonyms in the same text gives better confidence.
This is exactly what OR does:
</p>
<pre>
"Snoop"!20 OR "Snoop Doggy Dogg"!90 OR "Snoop Lion"!75 OR "Calvin Broadus"!60 OR "Calvin Cordozar Broadus Junior"!100
</pre>
<p>
Use lower weights on the alternatives with less confidence.
If it looks like the many words and phrases inside the OR
overwhelm other words in the query, giving even lower weights may be useful,
for example making the sum of weights 100 - the default weight for just one alternative.
</p><p>
The decision to use EQUIV must be made by the application, based on its own dictionary or linguistics.
This can be done using <a href="../reference/querying/yql.html#equiv">YQL</a>
or from a container plugin (example
<a href="https://github.com/vespa-engine/sample-apps/blob/master/album-recommendation-java/src/main/java/ai/vespa/example/album/EquivSearcher.java">
EquivSearcher.java</a>) where the query object is manipulated as follows:
</p>
<ol>
<li>Find a word item in the query</li>
<li>Check that an EQUIV can be used in that place
(see <a href="../reference/querying/yql.html#equiv">limitations</a>)</li>
<li>Find the synonyms in the dictionary</li>
<li>Make an <code>EquivItem</code> with the synonyms (and the original word) as children</li>
<li>Replace the original <code>WordItem</code> with the new <code>EquivItem</code></li>
</ol>
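<p>
The steps above can be sketched outside Vespa as follows.
This is a minimal Python illustration on a toy query tree, not the container API:
<code>Word</code>, <code>And</code>, <code>Equiv</code>, <code>SYNONYMS</code> and <code>add_equivs</code>
are hypothetical names, and the check in step 2 is left out by assuming EQUIV is allowed wherever a word occurs.
</p>

```python
from dataclasses import dataclass

# Hypothetical synonym dictionary; a real application would load its own.
SYNONYMS = {"automobile": ["car"]}

@dataclass
class Word:
    text: str

@dataclass
class Equiv:
    children: list

@dataclass
class And:
    children: list

def add_equivs(item):
    """Return a tree where each word with synonyms is replaced by an Equiv node."""
    if isinstance(item, Word):
        synonyms = SYNONYMS.get(item.text, [])
        if not synonyms:
            return item
        # The original word stays as the first child, followed by its synonyms.
        return Equiv([item] + [Word(s) for s in synonyms])
    if isinstance(item, And):
        return And([add_equivs(child) for child in item.children])
    return item  # other node types, including existing Equiv nodes, are left as-is

def render(item):
    """Render the toy tree in the notation used in the list above."""
    if isinstance(item, Word):
        return item.text
    joiner = " EQUIV " if isinstance(item, Equiv) else " AND "
    return "(" + joiner.join(render(child) for child in item.children) + ")"

query = And([Word("used"), Word("automobile")])
rewritten = add_equivs(query)
print(render(rewritten))  # (used AND (automobile EQUIV car))
```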
<h2 id="rules">Rules</h2>
<p>
A simple semantic rule looks like:
</p>
<pre>
lotr -> lord of the rings;
</pre>
<p>
This means that whenever the term <em>lotr</em> is encountered in a query,
replace it by the terms <em>lord of the rings</em>.
Rules can also refer to conditions, and the produced terms can be a
modified version of whatever is matched instead of a concrete term:
</p>
<pre>
[brand] -> company:[brand];
[brand] :- sony, dell, ibm, hp;
</pre>
<p>
This rule says that, whenever the condition named <em>brand</em> is matched,
replace the matched term(s) by <em>the same term(s)</em> searching the <em>company</em> field.
In addition, the <em>brand</em> condition is defined to match any of a list of brands.
Note how <code>-></code> means a replacing production rule,
<code>:-</code> means a condition and <code>,</code> separates alternatives.
</p><p>
It is also possible to do grouping using parentheses,
list multiple terms which must be matched in sequence,
and to write <em>adding</em> production rules using <code>+></code> instead of <code>-></code>.
Terms are by default added using the query default (as if they were written in the search box),
but it is also possible to force them to be AND, OR, NOT or RANK using respectively
<code>+</code>, <code>?</code>, <code>-</code> and <code>$</code>.
Here is a more complex rule illustrating this:
</p>
<pre>
[destination] (in, by, at, on) [place] +> $name:[destination];
</pre>
<p>
This rule matches a destination followed by a preposition and a place,
and boosts documents where the destination matches the <em>name</em> field
(the definitions of the <em>destination</em> and <em>place</em> conditions are not shown).
This is achieved by adding a RANK term -
a term which does not impact whether a document is matched or not,
but which adds a relevancy boost if it is.
</p><p>
The complete syntax of this language is found in the
<a href="../reference/querying/semantic-rules.html">semantic rules reference</a>.
</p>
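<p>
To make the difference between replacing and adding productions concrete,
here is a minimal Python sketch that mimics the <em>brand</em> rule on a list of single-word terms.
It does not use Vespa; <code>BRANDS</code>, <code>focus_brands</code> and <code>boost_brands</code> are hypothetical names,
and the adding variant <code>[brand] +> $company:[brand];</code> is made up here for illustration.
</p>

```python
# Hypothetical stand-in for the condition "[brand] :- sony, dell, ibm, hp;"
BRANDS = {"sony", "dell", "ibm", "hp"}

def focus_brands(terms, field="company"):
    """Mimic the replacing rule "[brand] -> company:[brand];"."""
    return [f"{field}:{t}" if t in BRANDS else t for t in terms]

def boost_brands(terms, field="company"):
    """Mimic an adding rule "[brand] +> $company:[brand];".

    The original terms are kept; the extra RANK terms are returned separately,
    since they only affect relevance, not which documents match.
    """
    rank_terms = [f"{field}:{t}" for t in terms if t in BRANDS]
    return terms, rank_terms

print(focus_brands(["sony", "laptop"]))  # ['company:sony', 'laptop']
print(boost_brands(["sony", "laptop"]))  # (['sony', 'laptop'], ['company:sony'])
```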
<h2 id="rule-bases">Rule bases</h2>
<p>
Rules used together are collected in a <em>rule base</em> -
a text file containing rules and conditions, with file suffix <code>.sr</code> (for semantic rules).
Example:
</p>
<pre>
# Replacements
lotr -> lord of the rings;
colour -> color;
audi -> skoda;
# Stopwords
[stopword] -> ; # (Replace them by nothing)
[stopword] :- and, or, the, be;
# Focus brands to the brand field. If we think the <em>brand</em>
# field has high quality data, we can replace. We use the same name
# for the condition and the field, but this is not necessary.
[brand] -> brand:[brand];
[brand] :- sony, dell, ibm, hp;
# Boost recognized categories
[category] +> $category:[category];
[category] :- laptop, digital camera, camera;
</pre>
<p>
The rules in a rule base are evaluated in order from the top down.
A rule will be matched as many times as possible before evaluation moves on to the next rule.
So the query <em>colour colour</em> will be rewritten to <em>color color</em>
before moving on to the next rule.
</p>
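<p>
The evaluation order can be illustrated with a minimal Python sketch.
It is not Vespa's implementation: <code>RULES</code>, <code>apply_rule</code> and <code>rewrite</code> are hypothetical names,
only plain replacement rules are covered, and "as many times as possible" is assumed to mean
one left-to-right pass that replaces every occurrence.
</p>

```python
# Hypothetical, simplified rule base: (terms to match, replacement terms), in file order.
RULES = [
    (["lotr"], ["lord", "of", "the", "rings"]),
    (["colour"], ["color"]),
]

def apply_rule(terms, match, replacement):
    """Replace every occurrence of the term sequence `match`, scanning left to right."""
    out, i, n = [], 0, len(match)
    while i < len(terms):
        if terms[i:i + n] == match:
            out.extend(replacement)
            i += n
        else:
            out.append(terms[i])
            i += 1
    return out

def rewrite(query):
    """Apply each rule fully before moving on to the next one."""
    terms = query.lower().split()
    for match, replacement in RULES:
        terms = apply_rule(terms, match, replacement)
    return " ".join(terms)

print(rewrite("colour colour"))  # color color
print(rewrite("lotr colour"))    # lord of the rings color
```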
<h2 id="configuration">Configuration</h2>
<p>
A rule base file is placed in the <code>rules/</code> directory under
the <a href="../reference/applications/application-packages.html">application package</a>,
and the rule base is named after the file, excluding the <code>.sr</code> suffix.
E.g. if the rules above are saved to <code>[my-application]/rules/example.sr</code>,
the rule base is available under the name <code>example</code>.
</p><p>
To make a rule base be used by default in queries,
add <code>@default</code> on a separate line to the rule base.
To deactivate the default rules,
add <a href="../reference/api/query.html#rules.off">rules.off</a> to the query.
</p><p>
The rules can safely be updated at any time by running <code>vespa prepare</code> again.
If there are errors in the rule bases, they will not be updated,
and the errors will be reported on the command line.
</p><p>
To trace what the rules are doing,
add <a href="../reference/api/query.html#tracelevel.rules">tracelevel.rules=[number]</a> to the query.
</p>
<h2 id="using-multiple-rule-bases">Using multiple rule bases</h2>
<p>
It is possible to place multiple rule bases in <code>[my-application]/rules/</code>
and choose between them in the query.
Rule bases may also include each other.
This is useful to organize larger sets of rules,
to experiment with variants of the rule set in new bases which include the standard base,
or to use different sets of rules for different use cases.
</p><p>
To include one rule base in another,
add <code>@include(rulebasename)</code> on a separate line,
where <em>rulebasename</em> is the file name (with or without the <em>.sr</em>).
The result will be the same as if the included rule base were copied in
to the location of the <code>include</code> line.
If a condition is defined in both bases, the one from the <em>including</em> base will be used.
It is also possible to refer to the same-named condition in an included rule base
using the <code>@super</code> directive as a condition.
For example, this rule base adds some more categories to the <em>category</em> definition
in the <code>example.sr</code> above:
</p>
<pre>
@include(example)
# Category becomes laptop, digital camera, camera, palmtop, phone
[category] :- @super, palmtop, phone;
</pre>
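<p>
The way <code>@super</code> combines the two definitions can be sketched in Python.
This only models the lookup described above, it is not how Vespa resolves conditions internally;
<code>resolve_condition</code> and the two dictionaries are hypothetical.
</p>

```python
def resolve_condition(name, including, included):
    """Resolve a condition for a rule base that includes another one.

    A definition in the including base wins; "@super" inside it expands to
    the same-named condition from the included base.
    """
    if name not in including:
        return list(included.get(name, []))
    resolved = []
    for alternative in including[name]:
        if alternative == "@super":
            resolved.extend(included.get(name, []))
        else:
            resolved.append(alternative)
    return resolved

# Conditions from example.sr and from the rule base that includes it.
example = {
    "category": ["laptop", "digital camera", "camera"],
    "brand": ["sony", "dell", "ibm", "hp"],
}
extended = {"category": ["@super", "palmtop", "phone"]}

print(resolve_condition("category", extended, example))
# ['laptop', 'digital camera', 'camera', 'palmtop', 'phone']
print(resolve_condition("brand", extended, example))
# ['sony', 'dell', 'ibm', 'hp']
```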
<p>
Multiple rule bases can be included, and included rule bases can themselves have included rule bases.
All the rule bases included in the application package will be available when making queries.
One of the rule bases can be made the default by adding <code>@default</code> on a separate line in the rule base.
To use another rule base,
add <a href="../reference/api/query.html#rules.rulebase">rules.rulebase=[rulebasename]</a> to the query.
</p>
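<p>
As a small illustration, the query parameters mentioned in this guide can be put on a request URL like this.
The host and port, the helper <code>search_url</code>, the value <code>true</code> for <code>rules.off</code>
and the trace level <code>3</code> are assumptions made for the example.
</p>

```python
from urllib.parse import urlencode

BASE = "http://localhost:8080/search/"  # assumed host and port

def search_url(query, params=None):
    """Build a search URL with optional rule-related query parameters."""
    return BASE + "?" + urlencode({"query": query, **(params or {})})

# Use the rule base named "example" instead of the default one.
print(search_url("lotr", {"rules.rulebase": "example"}))
# http://localhost:8080/search/?query=lotr&rules.rulebase=example

# Turn the default rules off (the value "true" is an assumption).
print(search_url("lotr", {"rules.off": "true"}))

# Trace what the rules are doing (the level 3 is an arbitrary choice).
print(search_url("lotr", {"tracelevel.rules": 3}))
```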
<h2 id="using-a-finite-state-automaton">Using a finite state automaton</h2>
<p>
<em>Finite state automata</em> (FSA) are efficient at storing and looking up entries in large string lists.
A rule base can be compiled into an FSA to increase performance.
An automaton is created from a text file which lists the condition terms to match
and the condition names separated by a tab (by default).
The name of the condition can be followed by a semicolon and additional data which will be ignored.
</p><p>
This automaton source file defines the same as the
<em>stopword</em> and <em>brand</em> conditions in the example rule base:
</p>
<pre>
and stopword
or stopword
be stopword
the stopword
sony brand
dell brand
ibm brand; This text is ignored
hp brand
</pre>
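<p>
Since the separator must be a tab, it can be safer to generate the source file than to type it by hand.
A minimal Python sketch, where <code>automaton_source</code> and the <code>conditions</code> dictionary are hypothetical:
</p>

```python
def automaton_source(conditions):
    """Build automaton source text: one "term<TAB>condition" line per term."""
    lines = []
    for condition, terms in conditions.items():
        for term in terms:
            lines.append(f"{term}\t{condition}")
    return "\n".join(lines) + "\n"

# Same conditions as in the example rule base.
conditions = {
    "stopword": ["and", "or", "be", "the"],
    "brand": ["sony", "dell", "ibm", "hp"],
}

text = automaton_source(conditions)
print(text, end="")
```

<p>
Redirecting the output to a file gives a source file for the automaton.
</p>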
<p>Use <a href="../reference/operations/tools.html#vespa-makefsa">vespa-makefsa</a> to compile the automaton file:</p>
<pre>
$ vespa-makefsa -t sourcefile.txt targetfile.fsa
</pre>
<p>
The target file is used from a rule base by adding <em>@automata(automatonfile)</em>
on a separate line in the rule base file (the file path is relative to <em>$VESPA_HOME</em>).
Automata-files must be stored on all container nodes.
</p><p>
Note that automata are not carried over by <code>@include</code>,
so a rule base which includes another rule base that uses an automaton
must also declare the same automaton
(or an automaton containing any changes from the automaton of the included base).
</p>
<h2 id="query-phrasing">Query phrasing</h2>
<p>
Users search for phrases like <em>New York</em>, <em>Rolling Stones</em>,
<em>The Who</em>, or <em>daily horoscopes</em>.
Considering the last of these, most of the time the query string will look like this:
</p>
<pre>
/search/?query=daily horoscopes&…
</pre>
<p>
This is actually a search for documents where both <em>daily</em> and <em>horoscopes</em> match,
but not necessarily documents with the exact phrase <em>"daily horoscopes"</em>.
PhrasingSearcher is a Searcher that compares queries with a list of common phrases,
and replaces two search terms with a phrase.
If <em>"daily horoscopes"</em> is a common phrase, the above query becomes:
</p>
<pre>
/search/?query="daily horoscopes"&…
</pre>
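<p>
The effect can be sketched in a few lines of Python.
This is not the PhrasingSearcher implementation: <code>PHRASES</code> and <code>phrase_terms</code> are hypothetical,
a plain set stands in for the phrase automaton, and only two-word phrases are handled.
</p>

```python
# Hypothetical phrase list standing in for the compiled phrase automaton.
PHRASES = {"daily horoscopes", "new york", "rolling stones", "the who"}

def phrase_terms(terms):
    """Join adjacent term pairs found in PHRASES into one quoted phrase."""
    out, i = [], 0
    while i < len(terms):
        pair = " ".join(terms[i:i + 2]).lower()
        if i + 1 < len(terms) and pair in PHRASES:
            out.append(f'"{pair}"')
            i += 2  # both terms are consumed by the phrase
        else:
            out.append(terms[i])
            i += 1
    return out

print(phrase_terms(["daily", "horoscopes"]))
# ['"daily horoscopes"']
print(phrase_terms(["cheap", "new", "york", "hotels"]))
# ['cheap', '"new york"', 'hotels']
```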
<p>
The PhrasingSearcher must be configured with a list of common phrases,
compiled into a <em>finite state automaton</em> (FSA). The phrase list must be:
</p>
<ul>
<li>all lowercase</li>
<li>sorted alphabetically</li>
</ul>
<p>Example:</p>
<pre>
$ perl -ne 'print lc' listofphrasestextfile.unsorted.mixedcase | sort > listofphrasestextfile
</pre>
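<p>
Where the phrases contain non-ASCII characters, a Unicode-aware script is one alternative.
The Python sketch below lowercases with <code>str.lower()</code> and sorts by code point,
which for UTF-8 text is the same as byte order.
That this is the order <code>vespa-makefsa</code> expects is an assumption,
and <code>normalize_phrases</code> and <code>normalize_file</code> are hypothetical helper names.
</p>

```python
def normalize_phrases(lines):
    """Lowercase and strip each non-empty line, then sort by code point."""
    phrases = [line.strip().lower() for line in lines if line.strip()]
    return sorted(phrases)

def normalize_file(src, dst):
    """Read a UTF-8 phrase file and write the normalized list to dst."""
    with open(src, encoding="utf-8") as f:
        phrases = normalize_phrases(f)
    with open(dst, "w", encoding="utf-8") as f:
        f.writelines(p + "\n" for p in phrases)

print(normalize_phrases(["New York\n", "Ærø Island\n", "daily horoscopes\n"]))
# ['daily horoscopes', 'new york', 'ærø island']
```

<p>
Calling <code>normalize_file("listofphrasestextfile.unsorted.mixedcase", "listofphrasestextfile")</code>
would then produce the same file name as the shell example.
</p>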
<p>
Note that the Perl command to convert the text file to lowercase does
not handle non-ASCII characters well (this is just an example).
If the list of phrases is e.g. UTF-8 encoded and/or contains non-English characters,
double-check that the resulting file is correct.
</p><p>
Use <a href="../reference/operations/tools.html#vespa-makefsa">vespa-makefsa</a>
to compile the list into an FSA file:
</p>
<pre>
$ vespa-makefsa listofphrasestextfile phrasefsa
</pre>
<p>
Put the file on all container nodes, configure the location
and <a href="../basics/applications.html">deploy</a>:
</p>
<pre>
<container id="default" version="1.0">
    <config name="container.qr-searchers">
        <com>
            <yahoo>
                <prelude>
                    <querytransform>
                        <PhrasingSearcher>
                            <automatonfile><span class="pre-hilite">/path/phrasefsa</span></automatonfile>
                        </PhrasingSearcher>
                    </querytransform>
                </prelude>
            </yahoo>
        </com>
    </config>
</container>
</pre>