Significantly improve preconv. Allow it to recode UTF-8 characters into
the \[uNNNN] strings (taking into account big-endian archs). Also allow
it to determine from the BOM whether it's a UTF-8 file. Also add the
initial manual. This has been tested over a random selection of UTF-8
documents, as
% preconv -e utf-8 foo.1 | ./mandoc -Tlocale
where -Tlocale is allowed (-DUSE_WCHAR).
Note that we're still missing the "type" indicator that preconv accepts.
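
A rough sketch of the recoding step described above, not the actual
preconv code: skip_bom(), utf8_decode() and recode() are hypothetical
names. It assumes a complete in-memory buffer and does not reject
overlong encodings or surrogates. Working on decoded values rather than
raw byte layouts keeps it independent of host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 3 if buf starts with the UTF-8 BOM, else 0. */
static size_t
skip_bom(const unsigned char *buf, size_t sz)
{
	return (sz >= 3 && buf[0] == 0xEF &&
	    buf[1] == 0xBB && buf[2] == 0xBF) ? 3 : 0;
}

/*
 * Decode one UTF-8 sequence into *cp.
 * Returns the bytes consumed, or 0 on malformed or truncated input.
 */
static size_t
utf8_decode(const unsigned char *buf, size_t sz, unsigned int *cp)
{
	size_t i, len;
	unsigned int c;

	assert(buf != NULL && cp != NULL);
	if (sz == 0)
		return 0;
	if (buf[0] < 0x80) {
		*cp = buf[0];
		return 1;
	} else if ((buf[0] & 0xE0) == 0xC0) {
		len = 2; c = buf[0] & 0x1F;
	} else if ((buf[0] & 0xF0) == 0xE0) {
		len = 3; c = buf[0] & 0x0F;
	} else if ((buf[0] & 0xF8) == 0xF0) {
		len = 4; c = buf[0] & 0x07;
	} else
		return 0;
	if (sz < len)
		return 0;
	for (i = 1; i < len; i++) {
		if ((buf[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (buf[i] & 0x3F);
	}
	*cp = c;
	return len;
}

/*
 * Recode in[0..insz) into out: ASCII passes through, everything else
 * becomes \[uXXXX].  Returns 0 on success, -1 on bad input or if out
 * is too small.
 */
static int
recode(const unsigned char *in, size_t insz, char *out, size_t outsz)
{
	size_t i, n, used = 0;
	unsigned int cp;
	int w;

	assert(out != NULL && outsz > 0);
	out[0] = '\0';
	for (i = skip_bom(in, insz); i < insz; i += n) {
		if ((n = utf8_decode(in + i, insz - i, &cp)) == 0)
			return -1;
		if (cp < 0x80)
			w = snprintf(out + used, outsz - used, "%c", (int)cp);
		else
			w = snprintf(out + used, outsz - used, "\\[u%.4X]", cp);
		if (w < 0 || (size_t)w >= outsz - used)
			return -1;
		used += (size_t)w;
	}
	return 0;
}
```

With this sketch, the bytes for "naïve" preceded by a BOM would come
out as na\[u00EF]ve, with the BOM dropped.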
If a predefined string is missing, emit a warning and make it an empty
string instead of passing it along to libmdoc/libman (where it'll be
printed verbatim, now). This is what groff seems to do, too (of course
without a warning).
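
The fallback might look roughly like the following. This is a sketch
only: struct predef, predef_lookup() and the "warned" out-parameter are
made up for illustration, and the two table entries are examples rather
than the real table contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct predef {
	const char	*name;	/* name as written after \*( */
	const char	*str;	/* replacement text */
};

/* Illustrative entries only, not the full table. */
static const struct predef predefs[] = {
	{ "Am", "&" },
	{ "Ba", "|" },
};

/*
 * Look up a predefined string.  A missing name yields a warning and
 * the empty string instead of being passed along verbatim.
 */
static const char *
predef_lookup(const char *name, int *warned)
{
	size_t i;

	assert(name != NULL && warned != NULL);
	for (i = 0; i < sizeof(predefs) / sizeof(predefs[0]); i++)
		if (strcmp(predefs[i].name, name) == 0)
			return predefs[i].str;
	fprintf(stderr, "warning: undefined string: %s\n", name);
	*warned = 1;
	return "";
}
```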
It's annoying that we don't have preconv, so throw together a quick
version and let it grow in-tree. Right now, this only supports the
Latin-1 and US-ASCII encodings. I'll do UTF-8 next. It's
call-compatible with GNU's preconv although I don't do fancy stuff like
BOM or header check. This will come. I used read.c's file-grokking
code.
Use the correct Unicode value for the zero-width space, which means that
spec2cp never needs to fall through to spec2str. Then clean out html.c
of its unnecessary print_res() function.
Remove predefined strings from the chars.in file, as they're now local
to predefs.in. This also turns "BOTH" entries directly into CHAR. The
res2str and spec2str are now effectively the same function.
Most important move in getting predefined strings entirely contained
within roff.c. These are now grokked from a table in the roff
allocation routine and rest in the newly-created predefs.in (for
consistency with chars.in). This is a first implementation and will
likely be optimised along with the ds/de lookup table itself.
This allows mandoc-defined predefined strings to be correctly removed or
whatnot; earlier they couldn't. What will follow is the stripping-away
of all predefined-string crud in the other parts of the system.
Have conditional closure for both text and macro lines call through to
ccond(). Fix the text handler to behave like the macro handler
regarding escaped \}. Make \} actually become a zero-width space, too,
and clean up the documentation in this regard.
Fix a TODO to the effect that `.if n \{\ foo .br \}' was failing due to
the `\}' not being directly after the `.br'. Now we check for `\}' in
arbitrary parts of the line, and account for if it's escaped in funny
ways.
This behaviour diverges somewhat from groff in that the text at and
following the `\}' is lost, while groff keeps it (sort-of). I'll add a
COMPATIBILITY note to this effect.
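
A minimal sketch of the scan, assuming a nul-terminated input line;
find_close() is a hypothetical name, not the function in roff.c. The
point is that escapes are consumed in pairs, so `\\}' (an escaped
backslash followed by a brace) is not mistaken for a closure.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return a pointer to the backslash of the first unescaped \} in
 * line, or NULL if there is none.
 */
static const char *
find_close(const char *line)
{
	const char *p;

	assert(line != NULL);
	for (p = line; *p != '\0'; p++) {
		if (*p != '\\')
			continue;
		if (p[1] == '}')
			return p;
		if (p[1] != '\0')
			p++;	/* skip the character being escaped */
	}
	return NULL;
}
```

A caller could truncate the line at the returned pointer, which matches
the behaviour noted above of dropping the text at and after the `\}'.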
Flip on -Tutf8 backend support. This forces the UTF-8 LC_CTYPE and does
little else. Also remove the check for __STDC_ISO_10646__. It turns
out that very few systems---even those that support it---actually
declare this and it's just causing problems instead of being useful.
Allow non-ASCII terminal encodings to accept unicode values for the
special characters, if possible. This is broken into a separate switch
statement for clarity.
It seems that __STDC_ISO_10646__ isn't defined even where it can be
defined, so remove the check for it and leave it up to people compiling
the software (DOWNSTREAM) to take care of this. This will eventually
need to be fixed up with a proper non-10646 converter and so on, but
this is a simple start. While here, strengthen the language in the
Makefile to this effect.
Make any un-recognised font be considered a call for the Roman font.
This makes sequences of \f[unknown] \fP not completely puke. From a
TODO by schwarze@.
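
In sketch form, with a made-up enum and function name (font_parse() is
not the real interface), the fallback amounts to a default case:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical font identifiers for illustration. */
enum font {
	FONT_ROMAN,
	FONT_BOLD,
	FONT_ITALIC,
	FONT_PREV
};

/* Map a font name from \f[...] to a font; unknown names mean Roman. */
static enum font
font_parse(const char *name)
{
	assert(name != NULL);
	if (strcmp(name, "B") == 0 || strcmp(name, "3") == 0)
		return FONT_BOLD;
	if (strcmp(name, "I") == 0 || strcmp(name, "2") == 0)
		return FONT_ITALIC;
	if (strcmp(name, "P") == 0)
		return FONT_PREV;
	/* "R", "1", and anything unrecognised. */
	return FONT_ROMAN;
}
```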
Locale support. I'm checking this in to clean up fall-out in-tree, but
it looks pretty good. Basically, the -Tlocale option propagates into
term_ascii.c, where we set locale-specific console call-backs IFF (1)
setlocale() works; (2) locale support is compiled in (see Makefile for
-DUSE_WCHAR); (3) the internal structure of wchar_t maps directly to
Unicode codepoints as defined by __STDC_ISO_10646__; and (4) the console
supports multi-byte characters.
To date, this configuration only supports GNU/Linux. OpenBSD doesn't
export __STDC_ISO_10646__ although I'm told by stsp@openbsd.org that it
should (it has the correct map). Apparently FreeBSD is the same way.
NetBSD? Don't know. Apple also supports this, but doesn't define the
macro. Special-casing!
Benchmark: -Tlocale incurs less than 0.2 factor overhead when run
through several thousand manuals when UTF8 output is enabled. Native
mode (whether directly -Tascii or through no locale or whatever) is
UNCHANGED: the function callbacks are the same as before.
Note. If the underlying system does NOT support __STDC_ISO_10646__, there
is a "slow" version possible with iconv or other means of flipping from
a Unicode codepoint to a wchar_t.
Add mode for -Tlocale. This mode, with this commit, behaves exactly
like -Tascii. While adding this, inline term_alloc() (was a one-liner),
remove some switches around the terminal encoding for the symbol table
(unnecessary), and split out ascii_alloc() into ascii_init(), which is
also called from locale_init().
In tbl layouts, we puked if a space didn't follow a vertical bar
(found by Yuri Pankov). This was due to looking for modifiers for the
vertical bar. This has been fixed, along with other special-key layout
types.
Documentation: note COMPATIBILITY of -Tascii `?' printing in mandoc.1
and remove some long-fixed notes in the same section. Also, add an
`Lb' for the mandoc library to mandoc.3 (noted by Sascha Wildner).
Flip on printing `?' at Unicode codepoints in -Tascii, -Tpdf, and -Tps.
The reasoning behind printing SOMETHING at a Unicode codepoint is
because the input is not "wrong" (we suppress printing of "wrong"
things). It's just that ASCII can't handle it.
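
As a sketch of the idea (ascii_cell() and ascii_render() are
hypothetical names, and the output cells are assumed to hold code
points as ints): anything ASCII passes through, everything else prints
as `?'.

```c
#include <assert.h>
#include <stddef.h>

/* One output cell: ASCII passes through, anything else becomes '?'. */
static char
ascii_cell(int cp)
{
	return (cp >= 0 && cp < 0x80) ? (char)cp : '?';
}

/*
 * Render n code points into out as a nul-terminated ASCII string.
 * The caller must provide room for n + 1 bytes.
 */
static size_t
ascii_render(const int *cells, size_t n, char *out, size_t outsz)
{
	size_t i;

	assert(cells != NULL && out != NULL && outsz > n);
	for (i = 0; i < n; i++)
		out[i] = ascii_cell(cells[i]);
	out[n] = '\0';
	return n;
}
```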
Cleanups in -T[x]html: make html_idcat() use the buffer and be called
bufcat_id(), then collapse it into a little function without so much
crap. Next, make bufinit() only be called when we really need to do so,
and not simply before pre/post calls.
Clean-ups in -T[x]html: inline print_num(), as it was just a single
conditional; same for print_xmltype() and print_doctype(), same reason;
make bufncat() be static, as it was only being called from html.c;
have bufcat() simply call through to strlcat(). Finally, assert()
whenever we truncate.
Also rename buffmt() -> bufcat_fmt() to differentiate from buffmt_man et
al., which do not concatenate.
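
The assert-on-truncation pattern, sketched with snprintf() and
vsnprintf() instead of strlcat() since the latter is not available
everywhere. The signatures here take an explicit buffer and size, which
is an assumption for the sake of a standalone example and not how
html.c passes its buffer around.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/*
 * Append formatted text to the nul-terminated string in buf.
 * Truncation is treated as a programming error.
 * Returns the new string length.
 */
static size_t
bufcat_fmt(char *buf, size_t bufsz, const char *fmt, ...)
{
	va_list ap;
	size_t used;
	int n;

	used = strlen(buf);
	assert(used < bufsz);
	va_start(ap, fmt);
	n = vsnprintf(buf + used, bufsz - used, fmt, ap);
	va_end(ap);
	assert(n >= 0 && (size_t)n < bufsz - used);
	return used + (size_t)n;
}

/* Append a plain string; same truncation policy as above. */
static size_t
bufcat(char *buf, size_t bufsz, const char *src)
{
	return bufcat_fmt(buf, bufsz, "%s", src);
}
```

Note that assert() compiles away under NDEBUG, so a build with
assertions disabled would silently truncate in this sketch.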
Clean up -T[x]html by using a table instead of a switch statement for
the roff units. Also remove a comment about CSS and number types (they
all accept decimal numbers).
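
Roughly, the switch collapses into an array indexed by the unit. The
enum below is a cut-down, hypothetical stand-in for the real set of
roff units, and print_scale() is an illustrative name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical subset of roff scaling units. */
enum scale {
	SCALE_CM,
	SCALE_IN,
	SCALE_PC,
	SCALE_PT,
	SCALE_EM,
	SCALE_MAX
};

/* CSS unit suffix for each roff unit, indexed by enum scale. */
static const char *const scale_css[SCALE_MAX] = {
	"cm",	/* SCALE_CM */
	"in",	/* SCALE_IN */
	"pc",	/* SCALE_PC */
	"pt",	/* SCALE_PT */
	"em"	/* SCALE_EM */
};

/* Format a scaled value as a CSS length; returns snprintf()'s result. */
static int
print_scale(char *buf, size_t sz, double v, enum scale unit)
{
	assert(buf != NULL && unit < SCALE_MAX);
	return snprintf(buf, sz, "%.2f%s", v, scale_css[unit]);
}
```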
Remove function calls to res() and so forth in term_word(). These were
only used once and simply bloated the binary. Also fix mchars_num2char
to correctly render the character instead of using atoi(). This makes
the conversion more strict, but it's more correct.
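
To illustrate what "more strict" can mean here: atoi("65x") quietly
yields 65, while a strtol()-based check can reject trailing junk,
leading whitespace, and out-of-range values. The sketch below uses a
hypothetical num2char() that takes a nul-terminated string and accepts
only printable ASCII; the real function's signature and accepted range
may differ.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Convert a decimal character number to a printable ASCII character.
 * Returns '\0' if s is not a clean decimal number in range.
 */
static char
num2char(const char *s)
{
	char *ep;
	long v;

	assert(s != NULL);
	/* Reject empty input, leading whitespace, and signs up front. */
	if (!isdigit((unsigned char)s[0]))
		return '\0';
	errno = 0;
	v = strtol(s, &ep, 10);
	if (*ep != '\0' || errno == ERANGE)
		return '\0';	/* trailing junk or overflow */
	if (v < 0x20 || v > 0x7E)
		return '\0';	/* not printable ASCII */
	return (char)v;
}
```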
Move struct termp_ps into term_ps.c; remove the engine union in struct termp,
which only held one entry; finally (as per the first), make "ps" member into a
pointer managed by term_ps.c. This frees up a nice chunk of memory during
run-time and in the binary.
Make character engine (-Tascii, -Tpdf, -Tps) ready for Unicode: make buffer
consist of type "int". This will take more work (especially in encode and
friends), but this is a strong start. This commit also consists of some
harmless lint fixes.
Give -Thtml and -Txhtml the gift of recognising escapes when calculating
widths (e.g., `Bl -tag -width "\s[blahblah]bar"'). This has long since
been done for -Tascii but escaped notice with -T[x]html.
Add configurations (`Cd') to mandoc-db mining.
Also put some notes into index.sgml to the effect that mandoc-db exists,
but is not linked to the build.
Back out stripping of non-predef and non-special escape sequences from
input (this is not yet possible with mandoc_escape(), which depends on
nil-terminated strings).
Move "chars" interface out of out.h and into mandoc.h. This doesn't
change any code but for renaming functions and types to be consistent
with other mandoc.h stuff. The reason for moving into libmandoc is that
the rendering of special characters is part of mandoc itself---not an
external part. From mandoc(1)'s perspective, this changes nothing, but
for other utilities, it's important to have these part of libmandoc.
Note this isn't documented [yet] in mandoc.3 because there are some
parts I'd like to change around beforehand.
Closing delimiters only suppress spacing when they follow something.
Fixing a regression introduced in rev. 1.105.
ok and prodding for comments kristaps@.
Make the `Nm' -Thtml attribute be min-width instead of width. This is a
quick fix for, say, rc.d(8) in OpenBSD, which has nested macros on the
`Nm' SYNOPSIS line that were skipped over by the length calculator. This
should [maybe?] be a recursive length check, but still it'd need to be
a min-width to accommodate for (say) `Qq' and the like printing excess
characters post-length-calculation.
Add \*(Ai (ANSI) and \*(Px (POSIX) predefined strings, which are part of
groff's tmac.doc package. Originally noted by Matthew Dempsky.
Feedback by Jason McIntyre, joerg@, and schwarze@. Also add some
documentation about predefined strings, tweaked by schwarze@.
Clean up parsing of delimiters in -mdoc. First, remove the "dowarn"
variable from mandoc_getarg() so that it prints the warning every time.
Then, remove the warning from args_checkpunct(). This way, warnings
are being posted at the correct time. This makes the flag argument to
mdoc_zargs() superfluous, so make it be zero when it's invoked. Finally,
move the args() flags into mdoc_argv.c and make them enums.
The semantics of .Bk was described incorrectly
for the case of multiple sibling macros on a single input line.
Issue found investigating a question from sobrado@.
"I like this diff" kristaps@
Check in fix to roff conditional if/else stack running out of space.
This transforms the stack pop to occur prior to body execution, instead
of afterward. Floated to tech@ without response, but it makes sense
that this is alright and doesn't cause problems during extensive
testing.
Remove the warning for empty bodies of `Sh', `Ss', `SH', and `SS'. This
was prompted by a TODO by schwarze@, originally from Gleydson Soares, that
an empty `SS' was raising an error (it hasn't for some time). It makes
sense these shouldn't warn, as omitting their contents doesn't change
anything in the structure of the document (groff and mandoc specifically
account for the whitespace between empty sections).
This doesn't change any manuals, which only refer to the line arguments
(or possibly next-line, in the case of man(7) syntax).
Have mandoc-db accumulate manual page descriptions (`Nd' in -mdoc parlance)
in the index. This allows, with both the btree and index, full emulation
of apropos(1) and other goodies.
\# Everything up to and including the next newline is
ignored. This is interpreted in copy mode. This is like \"
except that the terminating newline is ignored as well.
Use dbt_xxxx functions to stash both filename and manual section in the
value part of the index. This is the actual manual section---before,
mandoc.cgi was relying on the file suffix, but this can be (e.g.) .man or
whatnot. This is The Correct Way (tm).