Spidering
or, FILESYSTEM vs. HTTP
SWISH-E has been enhanced to support different file access methods.
The current version supports access via a FILESYSTEM or HTTP. The
method is chosen at indexing time. The index format is identical: an
index created by an executable compiled with one method can be read by
an executable compiled with another.
How to Choose an Access Method
The FILESYSTEM access method is chosen by default. You can pick a
different method by specifying the -S option during indexing: "-S fs"
for the filesystem and "-S http" for spidering.
Excluding a method during compilation
If you would like to exclude either method from compilation, you can do so
by unsetting the appropriate ALLOW_XXX_INDEXING_DATA_SOURCE variable
in the config.h file.
Required Common Directives
IndexDir requires a value appropriate to the access method: use filenames
or directories for the FILESYSTEM method and URLs for the HTTP method.
FILESYSTEM only directives
The following directives are now available only with the FILESYSTEM
access method:
HTTP only directives
The HTTP access method implements the following directives.
- MaxDepth: (default 5)
  This defines how many links the spider should follow before stopping.
  A value of 0 configures the spider to traverse all links.
- Delay: (default 60)
  The number of seconds to wait between issuing requests to a server.
- TmpDir: (default /var/tmp)
  The location of a writeable temp directory on your system. The HTTP
  access method tells the Perl helper to place its files there.
- SpiderDirectory: (default ./)
  The location of the Perl helper script. Remember, if you use a
  relative directory, it is relative to your current directory when you
  run SWISH-E, not to the directory that SWISH-E is in.
- EquivalentServer: (default nothing)
  This allows you to deal with servers that respond to multiple DNS
  names. Each line should have a list of all the method/names that
  should be considered equivalent. If you have multiple directives,
  each one defines its own set of equivalent servers.
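As an illustration, the HTTP directives might be combined in a
configuration file along the following lines. The host names are
hypothetical, and the EquivalentServer value syntax (method/name entries
separated by spaces) is an assumption based on the description of that
directive; the remaining values simply restate the defaults.

```
# Hypothetical starting point; replace with your own site.
IndexDir http://www.example.com/

# Defaults, written out explicitly.
MaxDepth 5
Delay 60
TmpDir /var/tmp
SpiderDirectory ./

# Assumed syntax: treat both names as the same server.
EquivalentServer http://www.example.com http://example.com
```

An index built from such a file would be created with the "-S http"
option described under How to Choose an Access Method.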
Writing your own File Access Method
SWISH-E has been rearchitected to allow new file access methods to be
implemented without requiring any changes to the central engine. To
implement a new method, the following functions need to be
implemented:
int parseconfline(char *line)
This function gives your code a chance to define its own configuration
file directives. The function will never get called with comments or
blank lines. Return 0 if the directive is unrecognized.
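A minimal sketch of such a function follows. The directive name
ExampleTimeout and the variable example_timeout are hypothetical and
exist only for illustration. Returning 1 for a recognized directive is
an assumption, since only the 0 return value is specified here, and the
name is matched case-sensitively to keep the sketch short.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical setting owned by this access method (not a real SWISH-E directive). */
static int example_timeout = 30;

/*
 * Sketch of parseconfline() for a custom access method.
 * Returns 0 when the line is not a directive this method understands.
 * Returning 1 on success is an assumption made for this example.
 */
int parseconfline(char *line)
{
    static const char name[] = "ExampleTimeout";   /* hypothetical directive */
    size_t len = sizeof(name) - 1;

    /* Skip leading whitespace. */
    while (isspace((unsigned char)*line))
        line++;

    /* The name must match exactly and be followed by whitespace. */
    if (strncmp(line, name, len) != 0 || !isspace((unsigned char)line[len]))
        return 0;

    /* atoi() skips the whitespace between the name and the value. */
    example_timeout = atoi(line + len);
    return 1;
}
```

With this sketch, a line such as "ExampleTimeout 45" updates the setting,
while a line holding any other directive is left for other code to handle.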
void indexpath(char *startpoint)
This function is called once for each starting point defined via the
IndexDir directive. This function must call countwords() for each
entity it wants indexed.
int vgetc(void *vp)
int vsize(void *vp)
These functions return the next character (or EOF) and the size of the
entity being indexed, respectively. The pointer is the first argument
passed to countwords().
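As a rough sketch, the two functions could be written as follows for an
access method that keeps each entity in memory. The struct mem_entity
handle type is hypothetical; the engine treats the pointer as opaque, so
a real method is free to use any handle it likes.

```c
#include <stddef.h>
#include <stdio.h>   /* for EOF */

/* Hypothetical handle type: an in-memory entity plus a read cursor. */
struct mem_entity {
    const char *data;   /* entity contents */
    size_t size;        /* number of bytes in data */
    size_t pos;         /* index of the next byte to return */
};

/* Returns the next character of the entity, or EOF once it is exhausted. */
int vgetc(void *vp)
{
    struct mem_entity *e = vp;

    if (e->pos >= e->size)
        return EOF;
    /* Cast so that bytes above 127 are never mistaken for EOF. */
    return (unsigned char)e->data[e->pos++];
}

/* Returns the total size of the entity in bytes. */
int vsize(void *vp)
{
    struct mem_entity *e = vp;

    return (int)e->size;
}
```

The same pointer that the access method passes to countwords() is what
arrives here as vp, which is why the cursor lives inside the handle.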
The following function from the core engine of SWISH-E is one that you
will need to use:
int countwords(void *vp, char *location, char *title, int indextitleonly)
This function must be called once for each entity you want indexed.
The first argument (vp) is an opaque handle used for accessing the
entity's data. It will be supplied in the call to vgetc() and
vsize(). The second argument (location) is the index-specific
location of the file to be stored in the index after being modified by
ReplaceRules. The third argument (title) is the descriptive title of
the entity. The fourth argument (indextitleonly) should be true if
the contents of the file should not be indexed.
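The pieces can be put together roughly as shown below. This is a sketch
under several assumptions: the handle is a plain FILE pointer, each
starting point is treated as a single file, and the countwords() shown
is only a stand-in that counts whitespace-separated words so that the
example can be compiled on its own. The real countwords() is supplied by
SWISH-E, and its return value is not described here. The entities_seen
and words_seen counters are hypothetical.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical tallies kept by the stand-in engine for inspection. */
static int entities_seen = 0;
static int words_seen = 0;

/* Access-method side: the opaque handle is a FILE * in this sketch. */
int vgetc(void *vp)
{
    return fgetc((FILE *)vp);
}

int vsize(void *vp)
{
    FILE *fp = vp;
    long here = ftell(fp);
    long end;

    if (here < 0 || fseek(fp, 0L, SEEK_END) != 0)
        return 0;
    end = ftell(fp);
    fseek(fp, here, SEEK_SET);   /* restore the read position */
    return end < 0 ? 0 : (int)end;
}

/* Stand-in for the engine's countwords(); it only counts words. */
int countwords(void *vp, char *location, char *title, int indextitleonly)
{
    int c, inword = 0, n = 0;

    (void)location;   /* unused by the stand-in */
    (void)title;
    entities_seen++;
    if (indextitleonly)
        return 0;     /* contents are not indexed */

    while ((c = vgetc(vp)) != EOF) {
        if (isspace(c)) {
            inword = 0;
        } else if (!inword) {
            inword = 1;
            n++;
        }
    }
    words_seen += n;
    return n;
}

/* Treats the starting point as one file and hands it to the engine. */
void indexpath(char *startpoint)
{
    FILE *fp = fopen(startpoint, "rb");

    if (fp == NULL)
        return;       /* nothing to index */
    /* The file name doubles as the title in this sketch. */
    countwords(fp, startpoint, startpoint, 0);
    fclose(fp);
}
```

A method that walks directories or follows links would call countwords()
once per entity found from within indexpath(), rather than once per
starting point as in this sketch.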
Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000 Hewlett-Packard Company
Originally by Kevin Hughes, kev@kevcom.com, March 11, 1994.
SWISH-E is distributed with no warranty under the terms of the GNU Public License,
Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA
Public questions may be posted to
the SWISH-E Discussion.
Document maintained at http://sunsite.berkeley.edu/SWISH-E/Manual/spidering.html
by the SunSITE Manager.
Last update December 16, 1998. SunSITE Manager:
manager@sunsite.berkeley.edu