accents_db: | ${database_base}.uml.db |
accept_language: | en-us en it |
add_anchors_to_excerpt: | no |
allow_double_slash: | true |
<SELECT NAME="search_algorithm"> |
allow_in_form: | search_algorithm search_results_header |
allow_numbers: | true |
allow_space_in_url: | true |
allow_virtual_hosts: | false |
anchor_target: | body |
any_keywords: | yes |
author_factor: | 1 |
authorization: | myusername:mypassword |
backlink_factor: | 501.1 |
bad_extensions: | .foo .bar .bad |
No example provided |
bad_querystr: | forum=private section=topsecret&passwd=required |
contrib/examples directory.
bad_word_list: | ${common_dir}/badwords.txt |
The default value of this attribute is determined at compile time.
bin_dir: | /usr/local/bin |
boolean. See also the boolean_syntax_errors attribute.
boolean_keywords: | et ou non |
boolean. They are used in conjunction with the boolean_keywords attribute, and
comprise all English-specific parts of these error messages. The order in which
the strings are put together may not be ideal, or even grammatically
correct, for all languages, but they can be used to make fairly
intelligible messages in many languages.
boolean_syntax_errors: | Attendait "un mot" "à la fin" "au lieu de" "fin d'expression" "guillemet" |
build_select_lists: |
MATCH_LIST matchesperpage matches_per_page_list \ 1 1 1 matches_per_page "Previous Amount" \ RESTRICT_LIST,multiple restrict restrict_names 2 1 2 restrict "" \ FORMAT_LIST,radio format template_map 3 2 1 template_name "" |
caps_factor: | 1 |
case_sensitive: | false |
check_unique_date: | false |
check_unique_md5: | false |
collection_names: | htdig_docs htdig_bugs |
common_dir: | /tmp |
common_url_parts: |
http://www.htdig.org/ml/ \ .html \ https://github.com/solbu/hldig \ https://solbu.github.io/htdig/ |
compression_level: | 0 |
No example provided |
The default value of this attribute is determined at compile time.
config_dir: | /var/htdig/conf |
file:// URL from its extension, this program is used to determine the type.
The program is called with one argument, the name of (possibly a
temporary copy of) the file.
See also mime_types.
content_classifier: | file -i -b |
cookies_input_file: | ${common_dir}/cookies.txt |
sort -u to get a unique list.
create_image_list: | yes |
sort -u to get a unique list.
create_url_list: | yes |
database_base: | ${database_dir}/sales |
The default value of this attribute is determined at compile time.
database_dir: | /var/htdig |
date_factor: | 0.35 |
date_format: | %Y-%m-%d |
description_factor: | 350 |
description_meta_tag_names: | "description htdig-description" |
disable_cookies: | true |
doc_db: | ${database_base}documents.db |
doc_excerpt: | ${database_base}excerpts.db |
doc_index: | documents.index.db |
doc_list: | /tmp/documents.text |
endday: | 31 |
end_ellipses: | ... |
end_highlight: | </font> |
endings_affix_file: | /var/htdig/affix_rules |
endings_dictionary: | /var/htdig/dictionary |
endings_root2word_db: | /var/htdig/r2w.db |
endings_word2root_db: | /var/htdig/w2r.bm |
endmonth: | 12 |
endyear: | 2002 |
excerpt_length: | 500 |
excerpt_show_top: | yes |
exclude: | myhost.com/mailarchive/ |
exclude_urls: | students.html cgi-bin |
The parser program takes four command-line
parameters, not counting any parameters already
given in the command string:
infile content-type URL configuration-file
Parameter | Description | Example |
---|---|---|
infile | A temporary file with the contents to be parsed. | /var/tmp/htdext.14242 |
content-type | The MIME-type of the contents. | text/html |
URL | The URL of the contents. | http://www.htdig.org/attrs.html |
configuration-file | The configuration-file in effect. | /etc/htdig/htdig.conf |
The external parser writes information for
htdig on its standard output. Unless it is an
external converter, which outputs a document
of a different content-type, its output must
follow the format described here.
The output consists of records, each record terminated
with a newline. Each record is a series of tab-separated
fields, which must be non-empty unless expressly
allowed to be empty. The first field is a single character
that specifies the record type. The rest of the fields
are determined by the record type.
Record type | Fields | Description |
---|---|---|
w | word | A word that was found in the document. |
 | location | A number indicating the normalized location of the word within the document. The number has to fall in the range 0-1000, where 0 means the top of the document. |
 | heading level | A heading level that is used to compute the weight of the word depending on its context in the document itself. The level is in the range 0-11, and the levels are defined as follows: |
u | document URL | A hyperlink to another document that is referenced by the current document. It must be complete and non-relative, using the URL parameter to resolve any relative references found in the document. |
 | hyperlink description | For HTML documents, this would be the text between the <a href...> and </a> tags. |
t | title | The title of the document. |
h | head | The top of the document itself. This is used to build the excerpt. This should only contain normal ASCII text. |
a | anchor | The label that identifies an anchor that can be used as a target in a URL. This really only makes sense for HTML documents. |
i | image URL | A URL that points at an image that is part of the document. |
m | http-equiv | The HTTP-EQUIV attribute of a META tag. May be empty. |
 | name | The NAME attribute of this META tag. May be empty. |
 | contents | The CONTENT attribute of this META tag. May be empty. |
external_parsers: |
text/html /usr/local/bin/htmlparser \ application/pdf /usr/local/bin/parse_doc.pl \ application/msword->text/plain "/usr/local/bin/mswordtotxt -w" \ application/x-gunzip->user-defined /usr/local/bin/ungzipper |
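The record format described above can be illustrated with a minimal sketch of an external parser for plain text. This is not a parser shipped with htdig: the function name `build_records`, the use of the URL as the title, the 200-character head, and the choice of heading level 0 for every word are all assumptions made for illustration.

```python
import re
import sys


def build_records(text, title="Untitled"):
    """Return external-parser records (one per list item) for plain text."""
    records = ["t\t" + title]
    words = re.findall(r"[A-Za-z0-9]+", text)
    total = len(words)
    for index, word in enumerate(words):
        # Normalized location in the range 0-1000; 0 is the top of the document.
        location = (index * 1000) // total
        # Assumption: every word is reported with heading level 0.
        records.append("w\t%s\t%d\t0" % (word.lower(), location))
    # The 'h' record feeds the excerpt. Collapse whitespace so the record
    # stays on one line, and replace non-ASCII characters.
    head = " ".join(text.split())[:200]
    head = head.encode("ascii", "replace").decode("ascii")
    if head:
        records.append("h\t" + head)
    return records


def main(argv):
    # Arguments: infile content-type URL configuration-file
    infile, content_type, url, config_file = argv[1:5]
    with open(infile, encoding="latin-1") as handle:
        text = handle.read()
    # Assumption: the URL stands in for a title, since plain text has none.
    sys.stdout.write("\n".join(build_records(text, title=url)) + "\n")


if __name__ == "__main__" and len(sys.argv) == 5:
    main(sys.argv)
```

Such a script would be listed in external_parsers next to the content-type it handles, in the same way as the entries in the example above.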
Parameter | Description | Example |
---|---|---|
protocol | The URL scheme to be used. | https |
URL | The URL to be retrieved. | https://www.htdig.org:8008/attrs.html |
configuration-file | The configuration-file in effect. | /etc/htdig/htdig.conf |
The external protocol script writes information for htdig on its standard output. The output must follow the form described here: a header, followed by a blank line, followed by the contents of the document. Each record in the header is terminated with a newline. Each record is a series of tab-separated fields, which must be non-empty unless expressly allowed to be empty. The first field is a single character that specifies the record type. The rest of the fields are determined by the record type.
Record type | Fields | Description |
---|---|---|
s | status code | An HTTP-style status code, e.g. 200, 404. Typical codes include: |
r | reason | A text string describing the status code, e.g. "Redirect" or "Not Found". |
m | modification time | The modification time of this document. While the code is fairly flexible about the time/date formats it accepts, it is recommended to use something standard, like RFC1123: Sun, 06 Nov 1994 08:49:37 GMT, or ISO-8601: 1994-11-06 08:49:37 GMT. |
t | content-type | A valid MIME type for the document, like text/html or text/plain. |
l | content-length | The length of the document on the server, which may not necessarily be the length of the buffer returned. |
u | url | The URL of the document, or in the case of a redirect, the URL that should be indexed as a result of the redirect. |
external_protocols: |
https /usr/local/bin/handler.pl \ ftp /usr/local/bin/ftp-handler.pl |
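The header-plus-body layout described above can be sketched as a small helper that a handler script might use to format its output. The function name `format_response`, the canned response in `main`, and the use of the UTF-8 byte count for the content-length are assumptions for illustration; a real handler would retrieve the document for the given protocol and URL.

```python
import sys


def format_response(status, reason, content_type, body, url=None):
    """Build the header records, a blank line, then the document contents."""
    header = [
        "s\t%d" % status,           # HTTP-style status code
        "r\t" + reason,             # text describing the status code
        "t\t" + content_type,       # MIME type of the document
        # Assumption: length is the byte count of the UTF-8 encoded body.
        "l\t%d" % len(body.encode("utf-8")),
    ]
    if url is not None:
        header.append("u\t" + url)  # URL to index (e.g. after a redirect)
    # Each header record ends with a newline; a blank line separates the body.
    return "\n".join(header) + "\n\n" + body


def main(argv):
    # Arguments: protocol URL configuration-file
    protocol, url, config_file = argv[1:4]
    # Hypothetical canned document instead of a real network fetch.
    body = "Retrieved via %s\n" % protocol
    sys.stdout.write(format_response(200, "OK", "text/plain", body, url=url))


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv)
```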
extra_word_characters: | _ |
head_before_get: | false |
heading_factor: | 20 |
hlnotify_prefix_file: | ${common_dir}/notify_prefix.txt |
hlnotify_replyto: | design-group@foo.com |
hlnotify_sender: | bigboss@yourcompany.com |
hlnotify_suffix_file: | ${common_dir}/notify_suffix.txt |
hlnotify_webmaster: | Notification Service |
http_proxy environment variable, but it currently cannot.
The use of a proxy server can greatly improve the performance
of the indexing process.
http_proxy: | http://proxy.bigbucks.com:3128 |
http_proxy_authorization: | myusername:mypassword |
http_proxy_exclude: | http://intranet.foo.com/ |
ignore_alt_text: | true |
ignore_dead_servers: | false |
sort -u on the file to eliminate duplicates from the file.
image_list: | allimages |
The default value of this attribute is determined at compile time.
image_url_prefix: | /images/htdig |
include: | ${config_dir}/htdig.conf |
iso_8601: | true |
keywords_factor: | 12 |
<META name="somename" content="somevalue">
keywords_meta_tag_names: | keywords description |
limit_normalized: | http://www.mydomain.com |
http:// if none is specified).
limit_urls_to: | .sdsu.edu kpbs [.*\.html] |
local_default_doc: | default.html default.htm index.html index.htm |
local_urls: | http://www.foo.com/=/usr/www/htdocs/ |
file:// URLs are not retrieved, except through the local_urls mechanism.
local_urls_only: | true |
local_user_urls: | http://www.my.org/=/home/,/www/ |
locale: | en_US |
logging: | true |
maintainer: | ben.dover@uptight.com |
match_method: | boolean |
matches_per_page: | 999 |
max_connection_requests: | 100 |
max_description_length: | 40 |
max_descriptions: | 1 |
max_doc_size: | 5000000 |
max_excerpts: | 10 |
max_head_length: | 50000 |
max_hop_count: | 4 |
max_keywords: | 10 |
max_meta_description_length: | 1000 |
max_prefix_matches: | 100 |
max_retries: | 6 |
max_stars: | 6 |
maximum_page_buttons: | 20 |
maximum_pages: | 20 |
maximum_word_length: | 15 |
md5_db: | ${database_base}.md5.db |
meta_description_factor: | 20 |
metaphone_db: | ${database_base}.mp.db |
method_names: | or Or and And |
See also content_classifier.
mime_types: | /etc/mime.types |
minimum_prefix_length: | 2 |
minimum_speling_length: | 3 |
minimum_word_length: | 2 |
multimatch_factor: | 1000 |
next_page_text: | <img src="/htdig/buttonr.gif"> |
no_excerpt_show_top: | yes |
no_excerpt_text: |
no_next_page_text: |
no_page_list_header: | <hr noshade size=2>All results on this page.<br> |
no_page_number_text: |
<strong>1</strong> <strong>2</strong> \ <strong>3</strong> <strong>4</strong> \ <strong>5</strong> <strong>6</strong> \ <strong>7</strong> <strong>8</strong> \ <strong>9</strong> <strong>10</strong> |
no_prev_page_text: |
no_title_text: | "No Title Found" |
foosomethingbar" matches the word "foobar", not the phrase "foo bar". White space
following noindex_end is counted as white space. See also
noindex_start.
noindex_end: | </SCRIPT> |
noindex_start: | <SCRIPT |
HTML text to display when no matches were found.
The file should contain a complete HTML document.
nothing_found_file: | /www/searching/nothing.html |
nph: | true |
page_list_header: |
page_number_separator: | "</td> <td>" |
page_number_text: |
<em>1</em> <em>2</em> \ <em>3</em> <em>4</em> \ <em>5</em> <em>6</em> \ <em>7</em> <em>8</em> \ <em>9</em> <em>10</em> |
persistent_connections: | false |
plural_suffix: | en |
prefix_match_character: | ing |
prev_page_text: | <img src="/htdig/buttonl.gif"> |
regex_max_words: | 10 |
remove_bad_urls: | true |
remove_default_doc: | default.html default.htm index.html index.htm |
remove_unretrieved_urls: | true |
restrict: | http://www.acme.com/widgets/ |
robotstxt_name: | myhtdig |
contrib/scriptname directory for a small example. Note that this
attribute also affects the value of the CGI variable
used in hlsearch templates.
script_name: | /search/results.shtml |
search_algorithm: | exact:1 soundex:0.3 |
search_results_contenttype: | text/xml |
search_results_footer: | /usr/local/etc/ht/end-stuff.html |
search_results_header: | /usr/local/etc/ht/start-stuff.html |
search_results_order: | /docs/|faq.html * /maillist/ /testresults/ |
search_results_wrapper: | ${common_dir}/wrapper.html |
search_rewrite_rules: |
http://(.*)\\.mydomain\\.org/([^/]*) http://\\2.\\1.com \ http://www\\.myschool\\.edu/myorgs/([^/]*) http://\\1.org |
server_aliases: |
foo.mydomain.com:80=www.mydomain.com:80 \ bar.mydomain.com:80=www.mydomain.com:80 |
server_max_docs: | 50 |
server_wait_time: | 20 |
sort: | revtime |
sort_names: |
score 'Best Match' time Newest title A-Z \ revscore 'Worst Match' revtime Oldest revtitle Z-A |
soundex_db: | ${database_base}.snd.db |
star_blank: | http://www.somewhere.org/icons/noelephant.gif |
star_image: | http://www.somewhere.org/icons/elephant.gif |
star_patterns: |
http://www.sdsu.edu /sdsu.gif \ http://www.ucsd.edu /ucsd.gif |
startday: | 1 |
start_ellipses: | ... |
start_highlight: | <font color="#FF0000"> |
startmonth: | 1 |
start_url: | http://www.somewhere.org/alldata/index.html |
startyear: | 2001 |
No example provided |
substring_max_words: | 100 |
synonym_db: | ${database_base}.syn.db |
synonym_dictionary: | /usr/dict/synonyms |
syntax_error_file: | ${common_dir}/synerror.html |
template_map: |
Short short ${common_dir}/short.html \ Normal normal builtin-long \ Detailed detail ${common_dir}/detail.html |
template_name: | long |
template_patterns: |
http://www.sdsu.edu ${common_dir}/sdsu.html \ http://www.ucsd.edu ${common_dir}/ucsd.html |
text_factor: | 0 |
timeout: | 42 |
title_factor: | 12 |
translate_latin1: | false |
url_list: | /tmp/urls |
url_log: | /tmp/htdig.progress |
url_part_aliases: |
http://search.example.com/~htdig *site \ http://www.htdig.org/this/ *1 \ .html *2 |
url_part_aliases: |
http://www.htdig.org/ *site \ http://www.htdig.org/that/ *1 \ .htm *2 |
url_rewrite_rules: |
(.*)\\?JServSessionIdroot=.* \\1 \ (.*)\\&JServSessionIdroot=.* \\1 \ (.*)&context=.* \\1 |
url_seed_score: |
/mailinglist/ *.5-1e6 /docs/|/news/ *1.5 /testresults/ "*.7 -200" /faq-area/ *2+10000 |
url_text_factor: | 1 |
use_doc_date: | true |
use_meta_description: | true |
use_star_image: | no |
user_agent: | htdig-digger |
valid_extensions: | .html .htm .shtml |
half-hearted the digger will see this as the three words half, hearted and halfhearted.
valid_punctuation: | -' |
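The half-hearted example above can be reproduced with a short sketch. This is not the digger's actual code: the function name `expand_word` is hypothetical, and the logic is only an assumption that matches the one case described (the parts on either side of the punctuation, plus the word with the punctuation removed).

```python
def expand_word(token, valid_punctuation="-'"):
    """Split a token on valid punctuation and add the joined form."""
    parts = []
    current = ""
    for ch in token:
        if ch in valid_punctuation:
            # Punctuation ends the current part without becoming part of a word.
            if current:
                parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    if len(parts) > 1:
        # Also index the token with the punctuation stripped out.
        return parts + ["".join(parts)]
    return parts


print(expand_word("half-hearted"))
```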
version: | 3.2.0 |
word_db: | ${database_base}.allwords.db |
word_dump: | /tmp/words.txt |
wordlist_cache_inserts: | true |
wordlist_cache_size: | 40000000 |
wordlist_compress: | false |
wordlist_compress_zlib: | false |
wordlist_monitor: | true |
wordlist_monitor_period: | .1 |
wordlist_monitor_output: | myfile |
wordlist_page_size: | 8192 |
wordlist_verbose: | true |
No example provided |
No example provided |