<!-- $XConsortium: dtsrqery.sgm 1996 -->
<!-- (c) Copyright 1995 Digital Equipment Corporation. -->
<!-- (c) Copyright 1995 Hewlett-Packard Company. -->
<!-- (c) Copyright 1995 International Business Machines Corp. -->
<!-- (c) Copyright 1995 Sun Microsystems, Inc. -->
<!-- (c) Copyright 1995 Novell, Inc. -->
<!-- (c) Copyright 1995 FUJITSU LIMITED. -->
<!-- (c) Copyright 1995 Hitachi. -->
<![ %CDE.C.CDE; [<refentry id="CDE.SEARCH.DtSearchQuery">]]>
<refmeta><refentrytitle>DtSearchQuery</refentrytitle>
<manvolnum>library call</manvolnum>
</refmeta>
<refnamediv>
<refname><function>DtSearchQuery</function></refname>
<refpurpose>Perform a DtSearch database search for a specified query
</refpurpose>
</refnamediv>
<refsynopsisdiv>
<funcsynopsis>
<funcsynopsisinfo>#include &lt;Dt/Search.h></funcsynopsisinfo>
<funcdef>int <function>DtSearchQuery</function></funcdef>
<paramdef>void <parameter>*qry</parameter></paramdef>
<paramdef>char <parameter>*dbname</parameter></paramdef>
<paramdef>int <parameter>search_type</parameter></paramdef>
<paramdef>char <parameter>*date1</parameter></paramdef>
<paramdef>char <parameter>*date2</parameter></paramdef>
<paramdef>DtSrResult <parameter>**results</parameter></paramdef>
<paramdef>long <parameter>*resultscount</parameter></paramdef>
<paramdef>char <parameter>*stems</parameter></paramdef>
<paramdef>int <parameter>*stemscount</parameter></paramdef>
</funcsynopsis>
</refsynopsisdiv>
<refsect1>
<title>DESCRIPTION</title>
<para><function>DtSearchQuery</function> is the DtSearch API search function.
</para>
<para><function>DtSearchQuery</function> is passed a query string and some
search options, performs the requested search, and if successful returns a
linked list of <structname>DtSrResult</structname> structures representing
the documents satisfying the search.
</para>
<para>The results list contains information about the documents that can be
used for subsequent retrievals, as well as information suitable for
display to an end user.
</para>
<refsect2>
<title>Search Types</title>
<para><function>DtSearchQuery</function> supports three types of searches:
<Literal>P</Literal>, <Literal>W</Literal>, and <Literal>S</Literal>.
</para>
<refsect3>
<title>Type <Literal>P</Literal> Search Query Strings</title>
<para>Query strings for search type <Literal>P</Literal> have the simplest
syntax: a sequence of words separated by ASCII whitespace. Punctuation and
invalid words are silently discarded by the search engine. The only possible
syntax error occurs when every query word is invalid in the language of the
database.
</para>
<para>Search type <Literal>P</Literal> is often used to implement a limited
Query-by-Example (QBE) search paradigm. In this scenario, users
typically paste document text from any source into a query string
text field. Their expectation is that the search engine will return the
documents in the database that are "most similar" to the text of the
query string, and the statistical sort of the results list usually
satisfies that expectation.
</para>
<para>Note that although search type <Literal>P</Literal> does not use boolean
syntax, it is actually implemented as a stemmed search (type
<Literal>S</Literal> search) with implied boolean ORs between words.
</para>
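<para>The equivalence between a type <Literal>P</Literal> query and an OR
expression can be pictured with a short C sketch. The helper name
<function>p_query_to_or</function> is hypothetical and is not part of the
DtSearch API. The sketch assumes ASCII input, treats every byte that is not a
letter or digit as a separator, and does no stemming or stop list checking,
all of which the real language module handles.
</para>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the DtSearch API: rewrite a type P
 * query as the equivalent OR expression.  ASCII letters and digits form
 * words; every other byte is treated as a separator and dropped.
 * Returns the number of words written, or -1 if out is too small.
 */
int p_query_to_or(const char *query, char *out, size_t outsz) {
    size_t used = 0;
    int words = 0;
    const char *p = query;

    assert(query != NULL && out != NULL && outsz > 0);
    out[0] = '\0';

    while (*p != '\0') {
        const char *start;
        size_t len, need;

        /* Skip whitespace and punctuation between words. */
        while (*p != '\0' && !isalnum((unsigned char)*p))
            p++;
        start = p;
        while (isalnum((unsigned char)*p))
            p++;
        len = (size_t)(p - start);
        if (len == 0)
            break;                      /* only separators were left */

        /* " | " goes in front of every word except the first. */
        need = len + (words > 0 ? 3 : 0);
        if (used + need + 1 > outsz)
            return -1;
        if (words > 0) {
            memcpy(out + used, " | ", 3);
            used += 3;
        }
        memcpy(out + used, start, len);
        used += len;
        out[used] = '\0';
        words++;
    }
    return words;
}
```

<para>Given the query <literal>ice cream, sundae!</literal> the sketch writes
<literal>ice | cream | sundae</literal> and returns 3.
</para>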
</refsect3>
<refsect3>
<title>Types <Literal>S</Literal> and <Literal>W</Literal> Boolean Query Strings</title>
<para>Query strings for search types <Literal>S</Literal> (stemmed boolean)
and <Literal>W</Literal> (exact word boolean) must be syntactically
valid boolean expressions as described below. Any string that does not
match a valid expression rule is invalid and will fail with an error
message.
</para>
<para>Query words for all search types may be entered in any codeset for a
supported DtSearch language, including multibyte languages. The language
module of the database may identify a word as invalid for a number of
reasons, typically because the word would not have been indexed: it is
too short, too long, on the stop list, and so on. With one
exception, linguistically invalid words result in a syntax error. The
exception is an "all ANDs" query, where invalid words and
valid words that happen not to be in the database are silently erased
from the query string.
</para>
<para>The boolean query operators are the ASCII metacharacters: '&amp;' for
AND, '|' for OR, '~' for NOT, '(' and ')' for open and close parentheses
respectively, and '@<Literal>nnn</Literal>' for collocation expressions.
</para>
<para>All expression tokens are separated by ASCII whitespace. Typically this
is one or more space or tab characters. Omitting whitespace separators is
legal if it can be done unambiguously. For example, "word1&amp;word2" is
a legal expression but "word1word2" would be interpreted as a single
word token.
</para>
<para>The ASCII "at" sign (@) marks a special boolean <emphasis>collocation
operator</emphasis>. The collocation operator has the syntax "@n...",
the ASCII "at" sign followed by one or more ASCII numeric digits,
representing an integer with value greater than zero. Collocation is a
variation of the AND search where a user can specify the maximum
distance in bytes between any two words. In most languages a byte is
equivalent to a character position. For example, to find "ice" and
"cream" separated by no more than five characters, the search query "ice
@5 cream" may be used. Unlike other boolean operators, the collocation
operator can apply only to naked word tokens, not other expressions.
Searches including collocation operators are slower than searches
without them, and can be much slower for common words.
</para>
<para>A query may contain at most 8 distinct word tokens. Collocation
operators count toward the 8. There is no limit on the number of other
operators, as long as the query matches the syntax rules.
</para>
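<para>The token limit can be checked before a query is submitted. The
following C sketch is an assumption about how such a check might look, not
the engine's own tokenizer: the names <function>count_query_slots</function>,
<function>query_fits_slot_limit</function>, and
<literal>SKETCH_MAX_SLOTS</literal> are hypothetical. It counts every
occurrence of a word rather than distinct words, so it gives an upper bound,
and it does not consult the language module about word validity.
</para>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define SKETCH_MAX_SLOTS 8   /* the limit of 8 stated in the text */

/*
 * Hypothetical helper, not part of the DtSearch API.  Counts the tokens
 * in a boolean query that occupy word-table slots: each word and each
 * collocation operator.  Repeated words are counted every time, so the
 * result is an upper bound on the number of distinct tokens.
 * Returns -1 for a malformed collocation operator such as "@" or "@0".
 */
int count_query_slots(const char *query) {
    const char *p = query;
    int slots = 0;

    assert(query != NULL);
    while (*p != '\0') {
        if (isspace((unsigned char)*p) || strchr("&|~()", *p) != NULL) {
            p++;                        /* separators and operators use no slot */
        } else if (*p == '@') {
            long distance = 0;

            p++;
            if (!isdigit((unsigned char)*p))
                return -1;              /* '@' must be followed by digits */
            while (isdigit((unsigned char)*p)) {
                if (distance < 100000)  /* saturate instead of overflowing */
                    distance = distance * 10 + (*p - '0');
                p++;
            }
            if (distance == 0)
                return -1;              /* distance must be greater than zero */
            slots++;
        } else {
            /* A word runs until whitespace or the next metacharacter. */
            while (*p != '\0' && !isspace((unsigned char)*p)
                   && strchr("&|~()@", *p) == NULL)
                p++;
            slots++;
        }
    }
    return slots;
}

/* Returns 1 if the query is well formed and within the slot limit. */
int query_fits_slot_limit(const char *query) {
    int slots = count_query_slots(query);
    return slots >= 0 && slots <= SKETCH_MAX_SLOTS;
}
```

<para>For the query <literal>ice @5 cream</literal> the sketch counts 3
slots: two words and one collocation operator.
</para>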
<note>
<para>
Collocation operators are only supported for "Austext flavor" databases.
The default flavor of database created by <command>dtsrcreate</command> is
"Dtinfo flavor," which does not support collocation.
</para>
</note>
</refsect3>
</refsect2>
<refsect2>
<title>Boolean Query Syntax Rules</title>
<para>There are only six syntax rules, and the rules are recursive. Ambiguity
is resolved by precedence and associativity rules.
</para>
<orderedlist>
<listitem>
<para><emphasis>valid_expression</emphasis> := <emphasis>word_token</emphasis>
</para>
<para>A valid expression can be just a valid naked word token. Semantically,
the expression returns all documents containing the specified word. The
<emphasis>word_token</emphasis> must be a valid word in the language of
the database being searched.
</para>
</listitem>
<listitem>
<para><emphasis>valid_expression</emphasis> := <emphasis>valid_expression</emphasis> '&amp;' <emphasis>valid_expression</emphasis>
</para>
<para>The ASCII ampersand character is the AND character. Semantically, it
returns all documents satisfying both the first and second expressions
(boolean intersection). AND is also the "implied" boolean operator in
the following sense: the query parser will insert an ampersand between
words or expressions that otherwise would be separated only by
whitespace. For example "word1 word2" becomes "word1 &amp; word2".
</para>
</listitem>
<listitem>
<para><emphasis>valid_expression</emphasis> := <emphasis>valid_expression</emphasis> '|' <emphasis>valid_expression</emphasis>
</para>
<para>The ASCII vertical bar character is the OR character. It
returns all documents satisfying either the first or the second
expression (boolean union).
</para>
</listitem>
<listitem>
<para><emphasis>valid_expression</emphasis> := '(' <emphasis>valid_expression</emphasis> ')'
</para>
<para>Valid expressions may be recursively nested in ASCII open and close
parentheses characters. The query parser "forgives" two common human errors.
It will automatically discard excessive close parentheses characters, and
it will automatically generate close parentheses characters if necessary at
the end of a query. For example, "aaa | (bbb &amp; ccc)))))) ddd" becomes
"aaa | ( bbb &amp; ccc) &amp; ddd", and "aaa ((bbb" becomes "aaa ( ( bbb
) )".
</para>
</listitem>
<listitem>
<para><emphasis>valid_expression</emphasis> := '~' <emphasis>valid_expression</emphasis>
</para>
<para>The ASCII tilde character is the unary NOT operator. It returns every
document in the database that is not in the set satisfying the expression.
</para>
</listitem>
<listitem>
<para><emphasis>valid_expression</emphasis> := <emphasis>word_token</emphasis>
<emphasis>collocation_operator</emphasis> <emphasis>word_token</emphasis>
</para>
<para>Collocation operators are permitted only between words, not expressions.
Each word token and the collocation operator itself occupies a slot
in the table of at most 8 word tokens.
</para>
</listitem>
</orderedlist>
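<para>The two parenthesis repairs described in rule 4 can be sketched in C as
follows. The function <function>repair_parens</function> is hypothetical and
only illustrates the behavior; it is not the query parser's code. It works on
the raw string, does not insert the implied AND, and does not add the spacing
shown in the examples for rule 4.
</para>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the DtSearch API.  Copies query to
 * out, dropping any ')' that has no matching '(' and appending one ')'
 * for every '(' still open at the end.  Returns 0 on success, -1 if out
 * is too small.
 */
int repair_parens(const char *query, char *out, size_t outsz) {
    size_t used = 0;
    size_t depth = 0;           /* number of currently open '(' */
    const char *p;

    assert(query != NULL && out != NULL);
    for (p = query; *p != '\0'; p++) {
        if (*p == ')') {
            if (depth == 0)
                continue;       /* excess close parenthesis: discard */
            depth--;
        } else if (*p == '(') {
            depth++;
        }
        if (used + 1 >= outsz)
            return -1;
        out[used++] = *p;
    }
    /* Generate the close parentheses missing at the end of the query. */
    while (depth > 0) {
        if (used + 1 >= outsz)
            return -1;
        out[used++] = ')';
        depth--;
    }
    if (used >= outsz)
        return -1;
    out[used] = '\0';
    return 0;
}
```

<para>With this sketch, <literal>aaa ((bbb</literal> is repaired to
<literal>aaa ((bbb))</literal>, and any run of surplus close parentheses
after a balanced group is dropped.
</para>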
</refsect2>
<refsect2>
<title>Boolean Associativity and Precedence Table</title>
<para>In order from highest precedence to lowest:
</para>
<informaltable>
<tgroup cols="3" colsep="0" rowsep="0">
<colspec align="left" colwidth="114*">
<colspec align="left" colwidth="105*">
<colspec align="left" colwidth="3.51in">
<thead>
<row>
<entry align="left" valign="bottom"><para>Associativity</para></entry>
<entry align="left" valign="bottom"><para>Operator</para></entry>
<entry align="left" valign="bottom"><para>Example</para></entry></row></thead>
<tbody>
<row>
<entry align="left" valign="top"><para>(none)</para></entry>
<entry align="left" valign="top"><para>COLLOC</para></entry>
<entry align="left" valign="top"><para></para></entry></row>
<row>
<entry align="left" valign="top"><para>right</para></entry>
<entry align="left" valign="top"><para>NOT</para></entry>
<entry align="left" valign="top"><para>"aaa~bbb" resolved as "aaa &amp; (~bbb)"
</para></entry></row>
<row>
<entry align="left" valign="top"><para>left</para></entry>
<entry align="left" valign="top"><para>AND</para></entry>
<entry align="left" valign="top"><para>"aaa bbb ccc" resolved
as "(aaa &amp; bbb) &amp; ccc"</para></entry></row>
<row>
<entry align="left" valign="top"><para>left</para></entry>
<entry align="left" valign="top"><para>OR</para></entry>
<entry align="left" valign="top"><para>"aaa|bbb|ccc"
resolved as "(aaa | bbb) | ccc"</para></entry></row>
<row>
<entry align="left" valign="top"><para>(none)</para></entry>
<entry align="left" valign="top"><para>naked word</para></entry>
<entry align="left" valign="top"><para></para></entry></row></tbody></tgroup>
</informaltable>
</refsect2>
<refsect2>
<title>Example Boolean Queries</title>
<programlisting>
aaa bbb ccc
</programlisting>
<para>Returns all records that contain at least one occurrence of each of the
three words.
</para>
<programlisting>
aaa | (bbb ~ccc)
</programlisting>
<para>Returns all records containing "aaa", together with all records that
contain "bbb" but not "ccc".
</para>
<programlisting>
aaa ~(aaa @1 bbb)
</programlisting>
<para>Returns all records containing "aaa" but omits those
where "aaa" is within one character of "bbb".
</para>
<para>It is possible to formulate a query that requires retrieving all records
in the database that contain none of the query words (for example,
<literal>~aaa</literal>). Users should be warned that in
a large database such a search can take a very long time.
</para>
<para>Using the implied associativity and precedence rules, the ambiguous
query string <literal>aaa ~bbb | ccc ~ddd @10 eee</literal>
is disambiguated as <literal>(aaa &amp; (~bbb))
| (ccc &amp; (~(ddd @10 eee)))</literal>.
</para>
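<para>The disambiguation above can be reproduced with a small recursive
descent parser that follows the precedence table: COLLOC binds tightest, then
NOT, then AND (explicit or implied), then OR. The C sketch below is only an
illustration under those rules. All names in it
(<function>disambiguate</function>, <literal>EXPR_MAX</literal>, and the
static helpers) are hypothetical, it keeps its position in a static variable
and is therefore not reentrant, it does not repair parentheses, and it
silently truncates renderings longer than <literal>EXPR_MAX</literal> bytes.
</para>

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define EXPR_MAX 256          /* sketch limit on any rendered expression */
static const char *cur;       /* current parse position in the query */
static int parse_binary(char *out, int level);
static void skip_ws(void) { while (isspace((unsigned char)*cur)) cur++; }

/* Copy the next word or "@nnn" token into tok (left empty if none). */
static void take_token(char *tok) {
    size_t n = 0;
    skip_ws();
    if (*cur == '@')
        tok[n++] = *cur++;
    while (*cur != '\0' && !isspace((unsigned char)*cur)
           && strchr("&|~()@", *cur) == NULL && n < EXPR_MAX - 1)
        tok[n++] = *cur++;
    tok[n] = '\0';
}

/* primary := '(' expr ')' | word [ "@nnn" word ]; COLLOC binds tightest. */
static int parse_primary(char *out) {
    char a[EXPR_MAX], op[EXPR_MAX], b[EXPR_MAX];
    if (*cur == '(') {
        cur++;
        int wrapped = parse_binary(out, 0);
        skip_ws();
        if (wrapped < 0 || *cur != ')')
            return -1;
        cur++;
        return wrapped;
    }
    take_token(a);
    if (a[0] == '\0' || a[0] == '@')
        return -1;                    /* expected a word */
    skip_ws();
    if (*cur != '@') {
        snprintf(out, EXPR_MAX, "%s", a);
        return 0;
    }
    take_token(op);
    take_token(b);
    if (!isdigit((unsigned char)op[1]) || b[0] == '\0' || b[0] == '@')
        return -1;                    /* malformed collocation */
    snprintf(out, EXPR_MAX, "(%s %s %s)", a, op, b);
    return 1;
}

/* not := '~' not | primary; right associative. */
static int parse_not(char *out) {
    char inner[EXPR_MAX];
    skip_ws();
    if (*cur != '~')
        return parse_primary(out);
    cur++;
    if (parse_not(inner) < 0)
        return -1;
    snprintf(out, EXPR_MAX, "(~%s)", inner);
    return 1;
}

/* level 0 is OR, level 1 is AND (possibly implied); left associative. */
static int parse_binary(char *out, int level) {
    char left[EXPR_MAX], right[EXPR_MAX];
    char op = level == 0 ? '|' : '&';
    int wrapped = level == 0 ? parse_binary(left, 1) : parse_not(left);
    while (wrapped >= 0) {
        skip_ws();
        if (*cur == op)
            cur++;
        else if (level == 0 || *cur == '\0' || *cur == '|' || *cur == ')')
            break;                    /* nothing more at this level */
        if ((level == 0 ? parse_binary(right, 1) : parse_not(right)) < 0)
            return -1;
        snprintf(out, EXPR_MAX, "(%s %c %s)", left, op, right);
        snprintf(left, EXPR_MAX, "%s", out);
        wrapped = 1;
    }
    if (wrapped >= 0)
        snprintf(out, EXPR_MAX, "%s", left);
    return wrapped;
}

/* Write the fully parenthesized form of query to out (EXPR_MAX bytes). */
int disambiguate(const char *query, char *out) {
    char full[EXPR_MAX];
    assert(query != NULL && out != NULL);
    cur = query;
    int wrapped = parse_binary(full, 0);
    skip_ws();
    if (wrapped < 0 || *cur != '\0')
        return -1;
    if (wrapped)
        full[strlen(full) - 1] = '\0';    /* drop the root's parentheses */
    snprintf(out, EXPR_MAX, "%s", full + wrapped);
    return 0;
}
```

<para>The caller supplies an output buffer of at least
<literal>EXPR_MAX</literal> bytes. The function returns 0 on success and -1
when the string does not match the syntax rules, for example when an operator
has no right operand.
</para>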
</refsect2>
</refsect1><refsect1>
<title>ARGUMENTS</title>
<variablelist>
<varlistentry><term><symbol role="Variable">search_type</symbol></term>
<listitem>
<para>Specifies the type of search to perform. Valid values are
<Literal>P</Literal>, <Literal>W</Literal>, and <Literal>S</Literal>.
</para>
<para>Search type <Literal>P</Literal> indicates that the query string is a
sequence of words separated by ASCII whitespace.
It requests that the words be stemmed prior to searching, that all
documents containing any of the words be returned, that the results list
be statistically sorted, and that no more than the top
<symbol role="Variable">MaxResults</symbol> list items be returned, where
<symbol role="Variable">MaxResults</symbol> is the current value
returned from <function>DtSearchGetMaxResults</function>. Note that a
type <Literal>P</Literal> search is identical to a type
<Literal>S</Literal> boolean search with an implied boolean OR between
words.
</para>
<para>Search types <Literal>W</Literal> and <Literal>S</Literal> are boolean
query searches. They indicate that the query string is a sequence of
words and boolean operators matching the syntax described under "Types
<Literal>S</Literal> and <Literal>W</Literal> Boolean Query Strings"
above.
</para>
<para>Type <Literal>S</Literal> requests that words be stemmed prior to
searching. Type <Literal>W</Literal> requests that words be left unstemmed.
Both types request that all documents containing the combinations of query
words specified by the boolean operations be returned, that the results list
be statistically sorted if possible, and that no more than the top
<symbol role="Variable">MaxResults</symbol> list items be returned,
where <symbol role="Variable">MaxResults</symbol> is the current value
returned from <function>DtSearchGetMaxResults</function>.
</para>
</listitem>
</varlistentry>
<varlistentry><term><symbol role="Variable">dbname</symbol></term>
<listitem>
<para>Specifies which database is to be searched. It is any one of the
database name strings returned from <function>DtSearchInit</function> or
<function>DtSearchReinit</function>. If
<symbol role="Variable">dbname</symbol> is NULL, the first database name string
is used.
</para>
<para>Within the specified database, searches will be restricted to those
documents whose <symbol role="Variable">DtSrKeytype.is_selected</symbol>
field is nonzero.
</para>
</listitem>
</varlistentry>
<varlistentry><term><symbol role="Variable">date1</symbol> and
<symbol role="Variable">date2</symbol></term>
<listitem>
<para>Specify a range of document dates to use for the search. Only documents
within the specified range will be returned on the results list.
</para>
<para><symbol role="Variable">date1</symbol> is the older end of the range and,
if not NULL, requests DtSearch to return only those records younger than
(that is, after) the specified date.
</para>
<para><symbol role="Variable">date2</symbol> is the younger end of the range
and, if not NULL, requests DtSearch to return only those records older
than (that is, before) the specified date.
</para>
<para>It is valid to specify just one of the arguments.
</para>
<para>Undated documents always qualify for a results list regardless of search
date strings. The format of a valid date string is described in
&cdeman.DtSearchValidDateString;.
</para>
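<para>The filtering rule described for the two date arguments can be
summarized in a few lines of C. This is a sketch of the rule as worded here,
not the engine's implementation: the function
<function>date_qualifies</function> and the marker
<literal>SKETCH_UNDATED</literal> are hypothetical, dates are modeled as
plain numbers that grow with time rather than as DtSearch date strings, and
the range ends are treated as exclusive because the text says "after" and
"before".
</para>

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_UNDATED 0L   /* hypothetical marker: document has no date */

/*
 * Hypothetical helper, not part of the DtSearch API.  Dates are plain
 * numbers that grow with time; date1 and date2 may each be NULL to leave
 * that end of the range open.  Returns 1 if the document qualifies.
 */
int date_qualifies(long doc_date, const long *date1, const long *date2) {
    /* When both ends are given, the older end must not exceed the younger. */
    assert(date1 == NULL || date2 == NULL || *date1 <= *date2);

    if (doc_date == SKETCH_UNDATED)
        return 1;                     /* undated documents always qualify */
    if (date1 != NULL && doc_date <= *date1)
        return 0;                     /* not after the older limit */
    if (date2 != NULL && doc_date >= *date2)
        return 0;                     /* not before the younger limit */
    return 1;
}
```

<para>Passing NULL for one of the pointers leaves that end of the range open,
which corresponds to specifying just one of the two arguments.
</para>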
</listitem>
</varlistentry>
<varlistentry><term><symbol role="Variable">stems</symbol> and
<symbol role="Variable">stemscount</symbol></term>
<listitem>
<para>Specify a character buffer to hold parsed and stemmed words and a
variable to receive the number of stored words.
<symbol role="Variable">stems</symbol> and
<symbol role="Variable">stemscount</symbol> are optional; they can be NULL.
However, if either is specified, they must both be specified.
</para>
<para>If specified, <symbol role="Variable">stems</symbol> must point to a
character buffer large enough to hold
<symbol role="Variable">DtSrMAX_STEMCOUNT</symbol> by
<symbol role="Variable">DtSrMAXWIDTH_HWORD</symbol> bytes. An array of parsed
and stemmed query words will be stored here by the API for use by a
later call to <function>DtSearchHighlight</function>.
</para>
<para>The size of the array will be stored in
<symbol role="Variable">stemscount</symbol>.
</para>
</listitem>
</varlistentry>
<varlistentry><term><symbol role="Variable">results</symbol> and
<symbol role="Variable">resultscount</symbol></term>
<listitem>
<para>Specify where a pointer to the results list will be stored and a
variable to receive the number of items on the list.
</para>
<para>Results lists can be manipulated with several utility functions.
</para>
<para>In <function>DtSearch</function>, frequency of occurrence information is
maintained for words across the whole database and within documents. For
most queries, results lists are sorted by this statistical information
and presented to the user as a "proximity" number for each document on
the list. Proximity is meant to appear to a user as a distance, or a
measure of the nearness of the query to the document. Conceptually, the
smaller the proximity the "closer" the document is to the query and the
more likely it will be valuable to the user.
</para>
<para>DtSearch searches only one database at a time and returns only results
lists for that single database. However, browsers often provide the
illusion of simultaneous searches in multiple databases, merging the
results lists by proximity when completed. Since the domain of knowledge
and density of words and records may vary from database to database, the
value of proximity numbers may similarly vary, and some databases may be
underrepresented on merged results lists.
</para>
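<para>A browser that merges per-database results by proximity could do so
roughly as in the following C sketch. The structure
<structname>sketch_hit</structname> is a hypothetical stand-in and does not
reflect the real <structname>DtSrResult</structname> layout, and the sketch
assumes each input list is already sorted by ascending proximity.
</para>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a results-list item; not the real DtSrResult. */
struct sketch_hit {
    struct sketch_hit *next;
    int proximity;            /* smaller means closer to the query */
    const char *dbname;       /* database the hit came from */
};

/* Returns 1 if the list is in ascending proximity order. */
static int is_sorted(const struct sketch_hit *p) {
    for (; p != NULL && p->next != NULL; p = p->next)
        if (p->next->proximity < p->proximity)
            return 0;
    return 1;
}

/*
 * Merge two lists that are each already sorted by ascending proximity
 * into one list in the same order.  Nodes are relinked, not copied.  On
 * ties the item from list a comes first.
 */
struct sketch_hit *merge_by_proximity(struct sketch_hit *a,
                                      struct sketch_hit *b) {
    struct sketch_hit head;
    struct sketch_hit *tail = &head;

    assert(is_sorted(a) && is_sorted(b));
    head.next = NULL;
    while (a != NULL && b != NULL) {
        if (b->proximity < a->proximity) {
            tail->next = b;
            b = b->next;
        } else {
            tail->next = a;
            a = a->next;
        }
        tail = tail->next;
    }
    tail->next = (a != NULL) ? a : b;   /* append whichever list remains */
    return head.next;
}
```

<para>Because proximity scales can differ between databases, a plain merge
like this one shows exactly the underrepresentation effect described above: a
database whose proximity numbers run larger will tend to sink toward the end
of the merged list.
</para>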
</listitem>
</varlistentry>
</variablelist>
</refsect1>
<refsect1>
<title>RETURN VALUE</title>
<para>This function has three common return codes.
</para>
<para><systemitem class="constant">DtSrOK</systemitem> is returned, as well
as a results list and stems array, when the search was completely successful.
</para>
<para><systemitem class="constant">DtSrNOTAVAIL</systemitem> is returned when
the query was valid but the search was unsuccessful (that is, no set of
documents matched the query). There are usually no messages with
<systemitem class="constant">DtSrNOTAVAIL</systemitem>.
</para>
<para><systemitem class="constant">DtSrFAIL</systemitem> is returned when the
search was unsuccessful, usually because of an invalid query, and user
messages on the MessageList explain why.
</para>
<para>Any API function can also return
<systemitem class="constant">DtSrREINIT</systemitem> and the return codes for
fatal engine errors at any time.
</para>
</refsect1><refsect1>
<title>SEE ALSO</title>
<para>&cdeman.DtSrAPI;,
&cdeman.DtSearchReinit;,
&cdeman.DtSearchGetMaxResults;,
&cdeman.DtSearchSetMaxResults;,
&cdeman.DtSearchGetKeytypes;,
&cdeman.DtSearchValidDateString;,
&cdeman.DtSearchSortResults;,
&cdeman.DtSearchFreeResults;,
&cdeman.DtSearchHighlight;</para>
</refsect1></refentry>