4.6 FIELD SELECTION TABLE (FST)
This is perhaps the most difficult of the four forms to understand.
CDS/ISIS has two ways of finding information in the database, which can be
compared with the two ways of finding information in a book. Suppose we have a
book on architecture and we want to find any mention of cathedrals. One method
is to start at page 1 and scan each page in turn to see whether ‘cathedrals’
occurs on that page. This is known as a ‘serial’ or ‘sequential’ search,
because we are searching through the pages in sequence. It would be quite a
reliable method (provided we could keep up the concentration) but it would take
a long time if the book had several hundred pages.
A much quicker method is to make use of the index (provided that the book has
one). We look under C, find ‘cathedrals’, and then see an entry something like:
cathedrals 30, 212, 360
Now we can go straight to those page numbers and read what is said about
cathedrals. This method might not be quite so reliable, since it depends on the
skills of the indexer. He or she might have considered some mentions of
‘cathedrals’ to be too insignificant to index.
CDS/ISIS allows both these approaches to information retrieval. The first
method, scanning through the records sequentially and examining the text
contained in each record, is known as free-text searching. It is likely to be a
slow process when the database contains more than a few hundred records. The
second method, using an index, is the normal way of searching. CDS/ISIS allows
you to set up the index automatically and refers to it as the index or inverted
file. (The list of terms in the index without the details of their occurrences
is also referred to as the terms dictionary.)
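The difference between the two approaches can be sketched in a few lines of
Python. This is only an illustration of the idea, not of how CDS/ISIS stores
its files: the three records, the record numbers and the function names are
all invented for the example.

```python
from collections import defaultdict

# Hypothetical toy database: record number -> text of the record.
records = {
    1: "Gothic cathedrals of northern France",
    2: "Timber-framed houses in England",
    3: "Cathedrals and abbeys of England",
}

def sequential_search(term):
    """Free-text style: read every record in turn and test its text."""
    term = term.lower()
    return [num for num, text in sorted(records.items())
            if term in text.lower()]

def build_inverted_file():
    """Index style: map each word to the records in which it occurs."""
    index = defaultdict(list)
    for num, text in sorted(records.items()):
        for word in text.lower().split():
            if num not in index[word]:
                index[word].append(num)
    return index

inverted_file = build_inverted_file()

# The terms alone, without their occurrences, play the part of the
# terms dictionary.
terms_dictionary = sorted(inverted_file)

print(sequential_search("cathedrals"))   # scans all three records
print(inverted_file["cathedrals"])       # one look-up in the index
```

Both calls find the same records, but the first has to read the whole database
each time, while the second reads only the index entry built in advance.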
The selection of terms from the database records to go on to the index file is
controlled by the Field Selection Table. It is not possible for the computer to
select terms according to their significance. Instead the selection depends
upon three rules:
i. Which fields from the record are to be indexed (e.g. you probably want
authors indexed but not the publisher or the number of pages).
ii. How the index terms are to be constructed from the data in these fields
(called the indexing technique). For example, do you want the title ‘Good
secretarial practice’ as a whole field under ‘G’, or do you want it split up
into separate words so that ‘secretarial’ can be searched under ‘s’?
iii. You can specify a list of stopwords which are not to be used on their own
as index terms, e.g. ‘in’, ‘of’ and ‘the’.
CDS/ISIS allows much flexibility in specifying each of these three rules. It is
important to consider them carefully, since they determine what searches will
be possible on the database. For instance, if you index authors as separate
words, then ‘Walpole, Horace’ will appear under ‘Horace’ and under ‘Walpole’:
you cannot search for him as ‘Walpole, Horace’. If you index titles as whole
fields, then ‘The Concise Oxford Dictionary of Quotations’ cannot be searched
under ‘Dictionary’ or under ‘Quotations’. It is, in fact, possible in CDS/ISIS
to index the same field in more than one way.
If you have divided the field into subfields, you can index different subfields
by different techniques (or some subfields but not others).
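The consequences of this choice can be tried out with a rough Python sketch.
It only imitates the two approaches on the examples above: the function names
are invented, the punctuation handling is a simplification, and whatever
further processing CDS/ISIS applies to a term before storing it is not
modelled.

```python
# The example stopwords from rule iii above.
STOPWORDS = {"in", "of", "the"}

def whole_field_term(value):
    """Whole-field indexing: the complete field value is one term."""
    return [value]

def word_terms(value, stopwords=STOPWORDS):
    """Word indexing: one term per word, leaving out the stopwords."""
    words = [w.strip(",.;:") for w in value.split()]
    return [w for w in words if w and w.lower() not in stopwords]

author = "Walpole, Horace"
title = "The Concise Oxford Dictionary of Quotations"

print(word_terms(author))        # two separate terms, no combined one
print(whole_field_term(title))   # one term, filed under 'The Concise...'
print(word_terms(title))         # each significant word is searchable
```

Indexing the same field by both functions corresponds to indexing it in more
than one way: the whole title and its individual words then all become
searchable.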
Each line of the Field Selection Table comprises three elements: the Tag or
Name, the Technique and the Format. You need to make an entry in the table for
each field you want to index (i.e. to make searchable) and if the same field is
indexed in two ways you need two entries for it.
Again, if you are unsure about writing FSTs, it would be a good idea to engage
the services of the Dictionary Assistant. This will give you a dialog box like
the one in Figure 4.3.
Figure 4.3 Dictionary Assistant dialog box
All you need do is to choose which technique to apply and which fields to
index. The listbox on the right shows the techniques available. The two most
commonly used are 0 – by line and 4 – by word.
0 means that the whole field contents will be indexed as a single term.
1 means index each subfield separately and so is relevant only if the field is
divided into subfields.
2 means index only words or phrases which have been entered between angle
brackets, e.g. <inflation rate>. This technique can be used to select
particular terms from a lengthy piece of text such as an abstract. Some
CDS/ISIS users like to enter descriptors this way and use technique 2 to index
them.
3 is similar to 2 but indexes terms entered between slashes, e.g. /Windward
Islands/
4 signifies that each word in the field will be indexed separately (except
stopwords – see Section 4.7). If the field is divided into subfields, you must
specify mode mhl or mdl in the extraction format – see Section 5.2.
Other values are also available and are explained in the Reference manual. If
you choose one of the values 5 to 8 you will have to edit the format manually
to put in the required prefix. For help on choosing the right technique please
see Section 4.8.
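Techniques 2 and 3 are the least obvious of these, so here is a small Python
sketch of the kind of selection they make. It is an approximation written for
this explanation, not the CDS/ISIS extraction code; the function names and the
sample sentences are made up.

```python
import re

def terms_technique_2(text):
    """Pick out the words or phrases entered between angle brackets."""
    return re.findall(r"<([^<>]+)>", text)

def terms_technique_3(text):
    """Pick out the words or phrases entered between slashes."""
    return re.findall(r"/([^/]+)/", text)

# Hypothetical abstract with two marked descriptors.
abstract = ("The study relates the <inflation rate> to "
            "<interest rates> in small economies.")
print(terms_technique_2(abstract))

# Hypothetical text with the terms marked by slashes instead.
note = "Trade among the /Windward Islands/ and /Barbados/"
print(terms_technique_3(note))
```

Everything outside the markers is ignored, which is why these techniques suit
long fields in which only a few chosen terms should reach the index.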
Now click the check boxes against the fields you want to be indexed (i.e.
searchable) and finally click OK. The FST is then displayed and you can edit it
if necessary. The Dictionary Assistant indexes all the selected fields by the
same technique: if you want to apply different techniques to different fields,
you will need to make changes here.
Each entry in the FST has three parts. In the top part of the dialog box the
entry being edited is shown in three separate boxes. In the Entries box each
entry is shown on one line with spaces between the three parts.
The first value, which was called the ID in the DOS version of CDS/ISIS, is
normally the same as the tag of the field from which the terms come. (It does
not have to be, but this usually makes searching easier.) It can be used to
specify the type of term when searching, as we shall see in Chapter 7. If you
choose a number that corresponds to a field tag, Winisis will show the field
name in the Tag/Name box when you are editing it. If you choose a number that
does not correspond to a field tag, it will be shown as the number followed by
“FST Tag”.
The second value, the indexing technique, specifies how the index terms are to
be extracted, as explained above.
The third value, the format, shows which field in the record the terms are to
come from. As in the display format, fields are specified with v in front of
their tags.
So, if the title field has a tag 200 and we want to index each individual word,
the entries would be:
Tag/Name: 200 Title    Technique: 4    Format: v200
and if the author field is 100 and we want to index the author name as a whole:
Tag/Name: 100 Author    Technique: 0    Format: v100
If we want to index only subfield a of field 100 we could specify:
Tag/Name: 100 Author    Technique: 0    Format: v100^a
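Since each entry appears in the Entries box as one line with spaces between
its three parts, the three examples above would read 200 4 v200, 100 0 v100
and 100 0 v100^a. As a rough sketch, such lines can be split back into their
parts in Python; the class and function names here are invented for the
illustration and are not part of Winisis.

```python
from typing import NamedTuple

class FstEntry(NamedTuple):
    ident: int       # first value: the ID, normally the field tag
    technique: int   # second value: the indexing technique
    fmt: str         # third value: the data extraction format

def parse_fst_line(line):
    """Split one entry line into its three parts.

    Only the first two spaces separate the parts, so a format that
    itself contains spaces is kept whole.
    """
    ident, technique, fmt = line.split(None, 2)
    return FstEntry(int(ident), int(technique), fmt.strip())

# The three example entries from the text, one per line.
fst_text = """\
200 4 v200
100 0 v100
100 0 v100^a
"""

entries = [parse_fst_line(line)
           for line in fst_text.splitlines() if line.strip()]
for entry in entries:
    print(entry)
```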
This dialog box works in a similar way to the one for the FDT. When you have
entered the data for each field, the focus will be on the Add button. Either
click on the button or press {Enter} to add the field to the table (displayed
in the Entries box). If you need to correct the details for any entry, just
click on that entry in the Entries box and the details will be copied into the
boxes used for editing. If you need to remove an entry, highlight it and click
the Delete button. An example of an FST is shown in Figure 4.4.
Figure 4.4 Example of Field Selection Table (FST)
For more information on writing the data extraction format, please see Chapter
5, especially Section 5.2 for dealing with subfield markers and Section 5.5 for
dealing with repeated fields.
Again, do not be too concerned to get the Field Selection Table right the first
time. It is best to try it out on a few sample records and look at the index
terms produced. If they are not what you want, edit the FST and then regenerate
the inverted file.
When you have completed your entries in the Field Selection Table, click the
Terminate button. You are then asked to confirm that you want the database to
be created. Click Yes and your wish should be granted. You are then invited to
select a database to work on: you can choose the one you have just created or a
previous one.