1. User guide

This user guide introduces how to search text with Amberfish. The af tool creates indexes of text documents and then allows efficient searching of the documents using the indexes.

1.1. Creating an index

A typical command for creating a new index looks something like:

af -i -d mydb -C -v *.txt

The -i option is used for building indexes. The files to be added to an index are listed at the end of the command line, in this case *.txt.

The -d option specifies a database name. The index will be written to a set of files that begin with this name. In this example, files are created with names such as mydb.db, mydb.dt, mydb.fd, etc.

The -C option creates a new database, overwriting any existing database with the same name. This option should not be used when adding to an existing index.

The -v option prints names of the files as they are processed.

Another option, -m (not shown above), can be used to specify how much memory in megabytes will be used. Increasing memory usage can significantly reduce indexing time.
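
The caveat about -C matters when files arrive in several batches. The following is a minimal sketch, not part of Amberfish: it only prints the commands it would run (a dry run) and uses only the options described above. The function name index_batches_dry_run and the directory names batch1 and batch2 are hypothetical.

```shell
# Dry run: print the af commands for indexing several batches into one database.
# Only the first command uses -C; later ones omit it so that the existing index
# is extended rather than overwritten. Names here are illustrative only.
index_batches_dry_run() {
    db=$1
    shift
    create_flag=" -C"            # create the database on the first batch only
    for batch in "$@"; do
        printf 'af -i -d %s%s -v %s/*.txt\n' "$db" "$create_flag" "$batch"
        create_flag=""
    done
}

index_batches_dry_run mydb batch1 batch2
```

Once the printed commands look right, they can be run by hand or piped to a shell.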

1.2. Searching

Once the index has been created, we can use it to run queries:

af -s -d mydb -q 'cat dog mouse'

The -s option is used for searching.

The -q option specifies a free-text search query, which is effectively a list of words to search for.

The search command above is roughly equivalent to this Boolean search:

af -s -d mydb -Q 'cat or dog or mouse'

The -Q option specifies a Boolean search query, in which "or" acts as a Boolean operator. This query searches for all documents that contain "cat" or "dog" or "mouse".

A more interesting Boolean query might be:

af -s -d mydb -Q 'cat and (dog or mouse)'

This searches for all documents that contain "cat" and also contain "dog" or "mouse".
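
To make the meaning of this query concrete, here is a rough equivalent written with grep. It is a conceptual illustration only: af answers queries from its index rather than by scanning files, and the function name and sample files are made up for the demonstration. It assumes a grep that supports -w (GNU and BSD grep do).

```shell
# Mimic 'cat and (dog or mouse)' on whole words, ignoring case.
# A file is printed if it contains "cat" and also contains "dog" or "mouse".
cat_and_dog_or_mouse() {
    for f in "$@"; do
        if grep -q -i -w -e 'cat' -- "$f" &&
           grep -q -i -w -e 'dog' -e 'mouse' -- "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Hypothetical sample files.
tmp=$(mktemp -d)
printf 'The cat chased a mouse.\n' > "$tmp/a.txt"
printf 'A cat sat alone.\n'        > "$tmp/b.txt"
printf 'A dog and a mouse.\n'      > "$tmp/c.txt"
printf 'One CAT, one Dog.\n'       > "$tmp/d.txt"

cat_and_dog_or_mouse "$tmp"/*.txt   # prints the paths of a.txt and d.txt
```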

Both free-text and Boolean search queries are case-insensitive, meaning that uppercase and lowercase characters are interchangeable.

The output of these searches is a list of documents taking the form:

<score> <dbname> <docid> <parent> <filename> <begin> <end>

where <score> is an estimate of the document’s relevance to the query, <dbname> is the database name, <docid> is a unique number identifying the document within the database, <parent> is the docid of a document that contains this document (or 0 if no such relationship exists), <filename> is the name of the file where the document is located, and <begin> and <end> are byte offsets of the beginning and end of the document within the file.
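
Because each result is a single line of fields, the output is easy to post-process in a shell script. The sketch below assumes that fields are separated by whitespace and that file names contain no spaces; the function name summarize_hits and the sample line are invented for illustration.

```shell
# Split each result line into the seven fields described above and
# print a one-line summary per document.
summarize_hits() {
    while read -r score dbname docid parent filename begin end; do
        printf '%s: doc %s in %s, bytes %s-%s (score %s)\n' \
            "$dbname" "$docid" "$filename" "$begin" "$end" "$score"
    done
}

# Made-up result line; real scores and offsets will differ.
printf '0.75 mydb 3 0 notes.txt 120 480\n' | summarize_hits
# prints: mydb: doc 3 in notes.txt, bytes 120-480 (score 0.75)
```

With a real index, the function would be used as: af -s -d mydb -q 'cat dog mouse' | summarize_hits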

1.3. Wildcard searching

A word in a query can end with an asterisk (*), which acts as a wildcard:

af -s -d mydb -q 'car*'

In this example, car* finds all documents containing the word car, cars, carpet, or any other word that begins with the prefix car.

1.4. Phrases

Phrase searching finds a sequence of words:

af -s -d mydb -q '"John Quincy Adams"'

The phrase is surrounded by quotation marks ("). Individual words in a phrase can end with an asterisk for wildcard searching.

Phrase searching must be enabled when the database is created by using the --phrase option.

1.5. Multiple documents in a file

A document can consist of an entire file or a portion of a file. Documents are identified in an index by their file name and beginning and ending byte offsets. By default a file is considered to be a single document.

The --split index option is a basic way of dividing files into multiple documents, for example:

af -i -d mydb -C --split '====' -v *.txt

In the above example, any occurrences of the string ==== are interpreted as the beginning of a new document.
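
Before indexing, it can be useful to preview where a separator string occurs in a file. The sketch below prints the byte offset of each occurrence using grep -b -o, which GNU grep provides (it is not required by POSIX). Whether these offsets coincide exactly with the <begin> values that af records is an assumption, not something stated in this guide; the function name delimiter_offsets is hypothetical.

```shell
# Print the 0-based byte offset of every occurrence of a fixed string in a file.
delimiter_offsets() {
    grep -F -b -o -e "$1" -- "$2" | cut -d: -f1
}

# Small sample file with two separators.
sample=$(mktemp)
printf 'alpha\n====\nbeta\n====\ngamma\n' > "$sample"
delimiter_offsets '====' "$sample"    # prints 6 and 16, one per line
```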

The list of documents in an index can be viewed with, for example:

af -l -d mydb

A document can be extracted using the --fetch option:

af --fetch <filename> <begin> <end>

where <filename>, <begin>, and <end> are taken from the output of af -s or af -l.
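
Searching and fetching can be chained in a script. The following sketch converts af -s result lines into the corresponding af --fetch commands and prints them without running anything. It assumes whitespace-separated fields and file names without spaces; the function name fetch_commands and the sample line are made up.

```shell
# Turn af -s result lines into af --fetch commands (printed, not executed).
# Field order follows the search output:
#   <score> <dbname> <docid> <parent> <filename> <begin> <end>
fetch_commands() {
    while read -r score dbname docid parent filename begin end; do
        printf 'af --fetch %s %s %s\n' "$filename" "$begin" "$end"
    done
}

# Made-up result line for illustration.
printf '0.75 mydb 3 0 notes.txt 120 480\n' | fetch_commands
# prints: af --fetch notes.txt 120 480
```

For af -l output the field order is different, so the variable list after read would need to be adjusted accordingly.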

The --split option does not work with the xml document type (described below), which uses a different method of dividing files into documents.

1.6. Fields

Many documents contain fields, such as Title, Author, Subject, etc., which add structure to the text. For XML files (and potentially other file types in the future), Amberfish queries can be restricted to specific fields as needed. This is enabled when creating an index by specifying -t xml, for example:

af -i -d mydb -C -t xml -v *.xml

where xml is the "document type" for XML files. The default document type is text, which does not support field search.

An example of querying within a "Title" field:

af -s -d mydb -q 'Title/cat'

This searches for documents that contain "cat" in the "Title" field. Note that field names may be case-sensitive, depending on the document type.

1.7. More on searching XML

1.7.1. Field paths

Suppose we add a file called jones.xml to an index:

<Document>
   <Author>
      <Name>
         <FirstName> Tom </FirstName>
         <LastName> Jones </LastName>
      </Name>
   </Author>
</Document>

This might be done using the command:

af -i -d mydb -t xml jones.xml

The index will store the words "Tom" and "Jones" as being located at a field path within the document:

/Document/_c/Author/_c/Name/_c/FirstName/_c/Tom
/Document/_c/Author/_c/Name/_c/LastName/_c/Jones

The "_c" is a special field that denotes the content of the XML element, as opposed to an attribute, which is written as "_a". So the search:

af -s -d mydb -q '/Document/_c/Author/_c/Name/_c/LastName/_c/Jones'

will return jones.xml as matching the query. Other queries that will also match:

af -s -d mydb -q '/.../Document/_c/Author/_c/Name/_c/LastName/_c/Jones'
af -s -d mydb -q '/.../_c/Author/_c/Name/_c/LastName/_c/Jones'
af -s -d mydb -q '/.../Author/_c/Name/_c/LastName/_c/Jones'
af -s -d mydb -q '/.../_c/Name/_c/LastName/_c/Jones'
af -s -d mydb -q '/.../Name/_c/LastName/_c/Jones'
af -s -d mydb -q '/.../_c/LastName/_c/Jones'
af -s -d mydb -q '/.../LastName/_c/Jones'
af -s -d mydb -q '/.../_c/Jones'
af -s -d mydb -q '/.../Jones'
af -s -d mydb -q 'Jones'

The "..." means "a sequence of zero or more fields". These queries are equivalent:

af -s -d mydb -q '/.../LastName/_c/Jones'
af -s -d mydb -q 'LastName/_c/Jones'
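
The matching rules can be modeled in a few lines of shell. This is a simplified sketch of the behavior described here, not Amberfish's implementation: it rewrites a query path into a regular expression and tests it against a stored field path. It assumes that field names and words contain only letters, digits, and underscores, and it compares case-sensitively. The function name path_matches is hypothetical.

```shell
# Model of field-path matching: "..." stands for zero or more fields.
path_matches() {
    pattern=$1
    path=$2
    # A pattern without a leading slash behaves like one prefixed with /.../
    case $pattern in
        /*) ;;
        *)  pattern="/.../$pattern" ;;
    esac
    # Rewrite each "/..." as "zero or more /field segments".
    regex=$(printf '%s\n' "$pattern" | sed 's#/\.\.\.#(/[^/]+)*#g')
    printf '%s\n' "$path" | grep -E -q -e "^${regex}\$"
}

stored='/Document/_c/Author/_c/Name/_c/LastName/_c/Jones'
for q in '/.../LastName/_c/Jones' 'LastName/_c/Jones' 'Jones' \
         '/.../FirstName/_c/Jones'; do
    if path_matches "$q" "$stored"; then
        echo "match:    $q"
    else
        echo "no match: $q"
    fi
done
```

The first three queries match the stored path; the last does not, because "Jones" is not stored under FirstName.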

These queries match jones.xml:

af -s -d mydb -q '/Document/_c/Author/_c/Name/.../Jones'
af -s -d mydb -q 'Name/.../LastName/.../Jones'

The first of the two examples above will match "Jones" anywhere within the author’s name, not necessarily only his last name. The second matches only a last name of Jones, but it need not be the author; for example, it would match a document containing the following fragment:

<Bibliography>
   <Reference Type="book">
      <Title> Text searching the old fashioned way. </Title>
      <Name>
         <FirstName> Indiana </FirstName>
         <LastName> Jones </LastName>
      </Name>
   </Reference>
</Bibliography>

Other queries that would match the above fragment:

af -s -d mydb -q 'Reference/_a/Type/book'
af -s -d mydb -q 'Reference/_a/.../book'
af -s -d mydb -q 'Reference/.../book'

Examples of phrase and Boolean searching with fields:

af -s -d mydb -q 'Title/.../"text searching"'
af -s -d mydb -Q 'Name/.../Indiana and Name/.../Jones'

1.7.2. Hierarchical documents

XML tags can be parsed into nested documents, which allows more specific search results. This is controlled using the --dlevel option, which limits the number of levels of nesting.

For example:

af -i -d mydb -C -t xml --dlevel 2 medline.xml

The setting --dlevel 1 is the default and results in one document per file, while --dlevel 2 adds one level of nested documents within the outermost XML element. Note that large values for --dlevel can lead to a significant increase in processing time and disk usage.

With --dlevel set to a value greater than 1, search results will show the most specific (innermost) documents. To include the ancestors of these documents, use the option --style=lineage:

af -s -d mydb -q 'nutrition' --style=lineage

This causes the output to show inner documents indented under their parent (enclosing) documents.

1.8. Searching multiple indexes

We can search across multiple indexes:

af -s -d patents1978 -d patents1979 -d patents1980 -q 'mousetrap'

Each database is queried and the results are merged into a single result set.
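
When databases follow a naming pattern, the repeated -d options can be generated by a script. The sketch below only prints the resulting command (a dry run); the function name search_years_dry_run is hypothetical, and the query is assumed not to contain single quotes.

```shell
# Print an af search command that names one database per year.
# -d is simply repeated once for each database.
search_years_dry_run() {
    prefix=$1
    first=$2
    last=$3
    query=$4
    cmd="af -s"
    year=$first
    while [ "$year" -le "$last" ]; do
        cmd="$cmd -d $prefix$year"
        year=$((year + 1))
    done
    printf "%s -q '%s'\n" "$cmd" "$query"
}

search_years_dry_run patents 1978 1980 mousetrap
# prints: af -s -d patents1978 -d patents1979 -d patents1980 -q 'mousetrap'
```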

1.9. Listing documents

To list documents that have been added to an index:

af -l -d mydb

The output looks like:

<docid> <parent> <filename> <begin> <end> <doctype>

where <docid> is a unique number identifying the document within the database, <parent> is the docid of a document that contains this document (or 0 if no such relationship exists), <filename> is the name of the file where the document is located, <begin> and <end> are byte offsets of the beginning and end of the document within the file, and <doctype> is the name of the document type associated with the document.
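
The listing is convenient for quick checks, such as counting how many documents each file was divided into. The sketch below assumes whitespace-separated fields and file names without spaces; the function name docs_per_file and the sample listing are invented for illustration.

```shell
# Count documents per file from af -l style lines:
#   <docid> <parent> <filename> <begin> <end> <doctype>
docs_per_file() {
    awk '{ count[$3]++ } END { for (f in count) print count[f], f }' | sort -k2
}

# Made-up listing for illustration.
printf '%s\n' \
    '1 0 a.txt 0 120 text' \
    '2 0 a.txt 120 480 text' \
    '3 0 b.txt 0 300 text' | docs_per_file
```

With a real index, the function would be used as: af -l -d mydb | docs_per_file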