1. User guide
This user guide introduces how to search text with Amberfish. The
af
tool creates indexes of text documents and then allows efficient
searching of the documents using the indexes.
1.1. Creating an index
A typical command for creating a new index looks something like:
af -i -d mydb -C -v *.txt
The -i
option is used for building indexes. The files to be added
to an index are listed at the end of the command line, in this case
*.txt
.
The -d
option specifies a database name. The index will be written
to a set of files that begin with this name. In this example, files
are created with names such as mydb.db
, mydb.dt
, mydb.fd
, etc.
The -C
option creates a new database, overwriting any existing
database with the same name. This option should not be used when
adding to an existing index.
The -v
option prints names of the files as they are processed.
Another option, -m
(not shown above), can be used to specify how
much memory in megabytes will be used. Increasing memory usage can
significantly reduce indexing time.
1.2. Searching
Once the index has been created, we can use it to run queries:
af -s -d mydb -q 'cat dog mouse'
The -s
option is used for searching.
The -q
option specifies a free-text search query, which is
effectively a list of words to search for.
The search command above is roughly equivalent to this Boolean search:
af -s -d mydb -Q 'cat or dog or mouse'
The -Q
option specifies a Boolean search query, in which "or" acts
as a Boolean operator. This query searches for all documents that
contain "cat" or "dog" or "mouse".
A more interesting Boolean query might be:
af -s -d mydb -Q 'cat and (dog or mouse)'
This searches for all documents that contain "cat" and also contain "dog" or "mouse".
Both free-text and Boolean search queries are case-insensitive, meaning that uppercase and lowercase characters are interchangeable.
The output of these searches is a list of documents taking the form:
<score> <dbname> <docid> <parent> <filename> <begin> <end>
where <score> is an estimate of the document’s relevance to the query, <dbname> is the database name, <docid> is an unique number identifying the document within the database, <parent> is the docid of a document that contains this document (or 0 if no such relationship exists), <filename> is the name of the file where the document is located, and <begin> and <end> are byte offsets of the beginning and ending of the document within the file.
1.3. Wildcard search
The words in a query can end with an asterisk (*
):
af -s -d mydb -q 'car*'
In this example, car*
finds all documents containing the word car
,
cars
, carpet
, or any other word that begins with the prefix car
.
1.4. Phrases
Phrase searching finds a sequence of words:
af -s -d mydb -q '"John Quincy Adams"'
The phrase is surrounded by quotation marks ("
). Individual words
in a phrase can end with an asterisk for wildcard searching.
Phrase searching must be enabled when the database is created by using
the --phrase
option.
1.5. Multiple documents in a file
A document can consist of an entire file or a portion of a file. Documents are identified in an index by their file name and beginning and ending byte offsets. By default a file is considered to be a single document.
The --split
index option is a basic way of dividing files into
multiple documents, for example:
af -i -d mydb -C --split '====' -v *.txt
In the above example, any occurrences of the string ====
are
interpreted as the beginning of a new document.
The list of documents in an index can be viewed with, for example:
af -l -d mydb
A document can be extracted using the --fetch
option:
af --fetch <filename> <begin> <end>
where <filename>, <begin>, and <end> are taken from the output of af
-s
or af -l
.
The --split
option does not work with the xml
document type
(described below), which uses a different method of dividing files
into documents.
1.6. Field search
Many documents contain fields, such as Title, Author, Subject, etc.,
which add structure to the text. For XML files (and potentially other
file types in the future), Amberfish queries can be restricted to
specific fields as needed. This is enabled when creating an index by
specifying -t xml
, for example:
af -i -d mydb -C -t xml -v *.xml
where xml
is the "document type" for XML files. The default
document type is text
which does not support field search.
An example of querying within a "Title" field:
af -s -d mydb -q 'Title/cat'
This searches for documents that contain "cat" in the "Title" field. Note that field names may be case-sensitive, depending on the document type.
1.7. More on searching XML
1.7.1. Field paths
Suppose we add a file called jones.xml
to an index:
<Document> <Author> <Name> <FirstName> Tom </FirstName> <LastName> Jones </LastName> </Name> </Author> </Document>
This might be done using the command:
af -i -d mydb -t xml jones.xml
The index will store the words "Tom" and "Jones" as being located at a field path within the document:
/Document/_c/Author/_c/Name/_c/FirstName/_c/Tom /Document/_c/Author/_c/Name/_c/LastName/_c/Jones
The “_c” is a special field that means the "content" of the XML element, as opposed to the "attribute" which is written as “_a”. So the search:
af -s -d mydb -1 '/Document/_c/Author/_c/Name/_c/LastName/_c/Jones'
will return jones.xml
as matching the query. Other queries that
will also match:
af -s -d mydb -q '/.../Document/_c/Author/_c/Name/_c/LastName/_c/Jones' af -s -d mydb -q '/.../_c/Author/_c/Name/_c/LastName/_c/Jones' af -s -d mydb -q '/.../Author/_c/Name/_c/LastName/_c/Jones' af -s -d mydb -q '/.../_c/Name/_c/LastName/_c/Jones' af -s -d mydb -q '/.../Name/_c/LastName/_c/Jones' af -s -d mydb -q '/.../_c/LastName/_c/Jones' af -s -d mydb -q '/.../LastName/_c/Jones' af -s -d mydb -q '/.../_c/Jones' af -s -d mydb -q '/.../Jones' af -s -d mydb -q 'Jones'
The “…” means "a sequence of any 0 or more fields". These queries are equivalent:
af -s -d mydb -q '/.../LastName/_c/Jones' af -s -d mydb -q 'LastName/_c/Jones'
These queries match jones.xml
:
af -s -d mydb -q '/Document/_c/Author/_c/Name/.../Jones' af -s -d mydb -q 'Name/.../LastName/.../Jones'
The first of the two examples above will match @samp{Jones} anywhere within the author’s name, not necessarily only his last name. The second matches only a last name of Jones, but it need not be the author; for example, it would match a document containing the following fragment:
<Bibliography> <Reference Type="book"> <Title> Text searching the old fashioned way. </Title> <Name> <FirstName> Indiana </FirstName> <LastName> Jones </LastName> </Name> </Reference> </Bibliography>
Other queries that would match the above fragment:
af -s -d mydb -q 'Reference/_a/Type/book' af -s -d mydb -q 'Reference/_a/.../book' af -s -d mydb -q 'Reference/.../book'
Examples of phrase searching with fields:
af -s -d mydb -q 'Title/.../"text searching"' af -s -d mydb -Q 'Name/.../Indiana and Name/.../Jones'
1.7.2. Hierarchical documents
XML tags can be parsed into nested documents, which allows more
specific search results. This is controlled using the --dlevel
option, which limits the number of levels of nesting.
For example:
af -i -d mydb -C -t xml --dlevel 2 medline.xml
The setting --dlevel 1
is the default and results in one document
per file, while --dlevel 2
adds one level of nested documents within
the outermost XML element. Note that large values for --dlevel
can
lead to a significant increase in processing time and disk usage.
With --dlevel
defined larger than 1
, search results will show the
most specific (innermost) documents. To include the ancestors of
these documents, use the option --style=lineage
:
af -s -d mydb -q 'nutrition' --style=lineage
This causes the output to show inner documents indented under their parent (enclosing) documents.
1.8. Searching multiple indexes
We can search across multiple indexes:
af -s -d patents1978 -d patents1979 -d patents1980 -q 'mousetrap'
Each database is queried and the results are merged into a single result set.
1.9. Listing documents
To list documents that have been added to an index:
af -l -d mydb
The output looks like:
<docid> <parent> <filename> <begin> <end> <doctype>
where <docid> is an unique number identifying the document within the database, <parent> is the docid of a document that contains this document (or 0 if no such relationship exists), <filename> is the name of the file where the document is located, <begin> and <end> are byte offsets of the beginning and ending of the document within the file, and <doctype> is the name of the document type associated with the document.