First step of making mandocdb a true makewhatis/mandb replacement:
accept a set of directories on the command line ("manpaths") that are
recursed for files. The databases are created in each manpath root.
This temporarily removes OP_UPDATE and OP_DELETE functionality, which
will be added back in.
Move parts of mandocdb that "do stuff" to the databases into their own
functions. This will make it easier to call repeatedly (for different
directoreis) as must be done with the new interfaces being developed.
Rename makewhatis [back] into mandocdb. This is to maintain consistency
with OpenBSD, which is sandboxing the code for merge. It makes sense
because it doesn't really make a `makewhatis' file in the traditional
sense, so it may be confusing.
Fairly straightforward patch adding basic update (-u) and remove (-r)
functionality to makewhatis. This is somewhat expensive (requiring the
index file to be trawled multiple times), but it's a good start.
Make sure that `br' and `sp' don't emit space before the initial `SH' in
-man. This actually seems to be what groff does. Sort-of. Anyway,
it's required to get perlpod pages rendered nicely, so until perlpod
stops producing shit, do it. Ok schwarze@.
Fix two issues: the first, where a `.\}' wasn't being interpreted as a
proper macro in some conditions, resulting in strange parse errors. The
second, where `\}' was being re-written as `\&'. Instead, we re-write
this as two spaces OR nothing at all, if at the end of line. This isn't
exactly what groff does (who knows...) but is a much safer and better
way than how I was doing it before.
Clean up how -man -T[x]html handles TP, IP, and HP (dd lists and
indented paragraph macros, respectively). This cleans up code and also
cleans up the output quite a lot.
Fix a bug in the -man parser where deleting nodes (such as `PP' or `LP'
in certain situations) caused the next macros to be assigned as siblings
instead of child nodes to the original parent. Noticed and ok by
schwarze@.
The bufcat() function in -T[x]html was eating one byte off the end of its
concatenated string. This for some reason hasn't been found before now... ?
Anyway, fixed, and make the IDs created again be correctly prefixed by a
letter as per the HTML spec.
Fix a TODO noted by schwarze@, originally by Christian Weisgerber:
literal mode (`nf') is ended by SH (and, it turns out, SS as well).
Noted the updated behaviour in man.7 as well.
Make scan for text tokens in a line recursive. This is really only for
the benefit of `Nd', which is the only [to date] node that can consist
of sub-nodes.
Ouch: predefined strings moved into roff.c weren't being reinitalised
after the first parse. Do this, but note there are more efficient ways
just waiting for a table of macros.
First fix how `sp 1' doesn't imply `1v' (it now does) and that 1
followed by non-digits, e.g. `1g', really means `1'. Next, fix some
spacing issues where `sp' was invoked in -man after sections or
subsections. Make sure this behaviour is mirrored in -Thtml.
Try again to get the transfer from hash to btree working. This time
just closing and re-opening the database, as deleting records with
(*hash->del) either in the scan loop or after it causes uncertain
behaviour (left-over keys, mystery keys, etc.). This finally does the
Right Thing (tm).
Big change to makewhatis: use an in-memory hashtable to collapse
multiple types of the same name (e.g., "foo" being a manual name,
utility name, etc.) into a single bitmask'd region. This considerably
reduces the size of the keyword database.
Allow RS/RE blocks to nest. This requires first the syntax tree to
accomodate for the fix, then for the front-ends. -T[x]html accepted the
syntax tree natively, but -Tascii had to use relative offsets. It's
quite a simple fix.
Add back in a check that the leading `-' exists for arguments. This
mysteriously disappeared in 1.14. No idea why. While here, remove an
unnecessary header and order the function prototypes.
Fix an assertion failure raised by the following interesting scenario: a
auto-opened `It' (i.e., a column list with a free-text first line) with
leading spaces in the line triggering assertion when searching for
arguments.
This led to a fix giving a nice performance speed-ups (a few percent,
with some quick trials): the search for flags immediately exits if the
macro has no flags, instead of having to first parse the leading word
then look it up. I also cleaned up the argv parsing stuff a little bit
and added more documentation.
Fix some bad bits in the mandoc manual: `Xr' instead of `Sx', unescaped
stuff that should be escaped, and a style matter or two. Pointed out by
Jason McIntyre, thanks!
preconv is now on encoding-recognition parity with groff. This last
commit adds parsing of "File Variables" in the first two lines in order
to grok the encoding. This completes groff's recognition sequence (-e,
BOM, File variables, -D, default). I've also cleaned up the manual to
indicate this and for some general readability.
preconv is now compiled by default in the Makefile.
Significantly improve preconv. Allow it to recode UTF-8 characters into
the \[uNNNN] strings (taking into account big-endian archs). Also allow
it to determine from the BOM whether it's a UTF-8 file. Also add the
initial manual. This has been tested over a random selection of UTF-8
documents, as
% preconv -e utf-8 foo.1 | ./mandoc -Tlocale
where -Tlocale is allowed (-DUSE_WCHAR).
Note that we're still missing the "type" indicator that preconv accepts.
If a predefined string is missing, emit a warning and make it an empty
string instead of passing it along to libmdoc/libman (where it'll be
printed verbatim, now). This is what groff seems to do, too (of course
without a warning).
It's annoying that we don't have preconv, so throw together a quick
version and let it grow in-tree. Right now, this only supports the
Latin-1 and US-ASCII encoding. I'll do UTF-8 next. It's
call-compatible with GNU's preconv although I don't do fancy stuff like
BOM or header check. This will come. I used read.c's file-grokking
code.
Use the correct Unicode value for the zero-width space, which means that
spec2cp never needs to fall through to spec2str. Then clean out html.c
of its unnecessary print_res() function.
Remove predefined strings from the chars.in file, as they're now local
to predefs.in. This also makes "BOTH" entries directly into CHAR. The
res2str and spec2str are now effectively the same function.
Most important move in getting predefined strings entirely contained
within roff.c. These are now grokked from a table in the roff
allocation routine and rest in the newly-created predefs.in (for
consistency with chars.in). This is a first implementation and will
likely be optimised along with the ds/de lookup table itself.
This allows mandoc-defined predefined strings to be correctly removed or
whatnot; earlier they couldn't. What will follow is the stripping-away
of all predefined-string crud in the other parts of the system.
Have conditional closure for both text and macro lines call through to
ccond(). Fix the text handler to behave like the macro handler
regarding escaped \}. Make \} actually become a zero-width space, too,
and clean up the documentation in this regard.
Fix a TODO to the effect that `.if n \{\ foo .br \}' was failing due to
the `\}' not being directly after the `.br'. Now we check for `\}' in
arbitrary parts of the line, and account for if it's escaped in funny
ways.
This behaviour diverges somewhat from groff in that the text at and
following the `\}' is lost, while groff keeps it (sort-of). I'll add a
COMPATIBILITY note to this effect.
Flip on -Tutf8 backend support. This forces the UTF-8 LC_CTYPE and does
little else. Also remove the check for __STDC_ISO_10646__. It turns
out that very few systems---even those that support it---actually
declare this and it's just causing problems instead of being useful.
Allow non-ASCII terminal encodings to accept unicode values for the
special characters, if possible. This is broken into a separate switch
statement for clarity.
It seems that __STDC_ISO_10646__ isn't defined even where it can be
defined, so remove the check for it and leave it up to people compiling
the software (DOWNSTREAM) to take care of this. This will eventually
need to be fixed up with a proper non-10646 converter and so on, but
this is a simple start. While here, strengthen then language in the
Makefile to this effect.
Make any un-recognised font be considered a call for the Roman font.
This makes sequences of \f[unknown] \fP not completely puke. From a
TODO by schwarze@.
Locale support. I'm checking this in to clean up fall-out in-tree, but
it looks pretty good. Basically, the -Tlocale option propogates into
term_ascii.c, where we set locale-specific console call-backs IFF (1)
setlocale() works; (2) locale support is compiled in (see Makefile for
-DUSE_WCHAR); (3) the internal structure of wchar_t maps directly to
Unicode codepoints as defined by __STDC_ISO_10646__; and (4) the console
supports multi-byte characters.
To date, this configuration only supports GNU/Linux. OpenBSD doesn't
export __STDC_ISO_10646__ although I'm told by stsp@openbsd.org that it
should (it has the correct map). Apparently FreeBSD is the same way.
NetBSD? Don't know. Apple also supports this, but doesn't define the
macro. Special-casing!
Benchmark: -Tlocale incurs less than 0.2 factor overhead when run
through several thousand manuals when UTF8 output is enabled. Native
mode (whether directly -Tascii or through no locale or whatever) is
UNCHANGED: the function callbacks are the same as before.
Note. If the underlying system does NOT support STDC_ISO_10646, there
is a "slow" version possible with iconv or other means of flipping from
a Unicode codepoint to a wchar_t.
Add mode for -Tlocale. This mode, with this commit, behaves exactly
like -Tascii. While adding this, inline term_alloc() (was a one-liner),
remove some switches around the terminal encoding for the symbol table
(unnecessary), and split out ascii_alloc() into ascii_init(), which is
also called from locale_init().
In tbl layouts, we puked if a space didn't followed a vertical bar
(found by Yuri Pankov). This was due to looking for modifiers for the
vertical bar. This has been fixed, along with other special-key layout
types.