.\" $Id: mandoc_escape.3,v 1.2 2014/10/28 14:06:31 schwarze Exp $
.\"
.\" Copyright (c) 2014 Ingo Schwarze <schwarze@openbsd.org>
.\"
.\" Permission to use, copy, modify, and distribute this software for any
.\" purpose with or without fee is hereby granted, provided that the above
.\" copyright notice and this permission notice appear in all copies.
.\"
.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
.\"
.Dd $Mdocdate: October 28 2014 $
.Dt MANDOC_ESCAPE 3
.Os
.Sh NAME
.Nm mandoc_escape
.Nd parse roff escape sequences
.Sh LIBRARY
.Lb libmandoc
.Sh SYNOPSIS
.In sys/types.h
.In mandoc.h
.Ft "enum mandoc_esc"
.Fo mandoc_escape
.Fa "const char **end"
.Fa "const char **start"
.Fa "int *sz"
.Fc
.Sh DESCRIPTION
This function scans a
.Xr roff 7
escape sequence.
.Pp
An escape sequence consists of
.Bl -dash -compact -width 2n
.It
an initial backslash character
.Pq Sq \e ,
.It
a single ASCII character called the escape sequence identifier,
.It
and, with only a few exceptions, an argument.
.El
.Pp
Arguments can be given in the following forms; some escape sequence
identifiers only accept some of these forms as specified below.
The first three forms are called the standard forms.
.Bl -tag -width 2n
.It \&In brackets: Ic \&[ Ns Ar argument Ns Ic \&]
The argument starts after the initial
.Sq \&[ ,
ends before the final
.Sq \&] ,
and the escape sequence ends with the final
.Sq \&] .
.It Two-character argument short form: Ic \&( Ns Ar ar
This form can only be used for arguments
consisting of exactly two characters.
It has the same effect as
.Ic \&[ Ns Ar ar Ns Ic \&] .
.It One-character argument short form: Ar a
This form can only be used for arguments
consisting of exactly one character.
It has the same effect as
.Ic \&[ Ns Ar a Ns Ic \&] .
.It Delimited form: Ar C Ns Ar argument Ns Ar C
The argument starts after the initial delimiter character
.Ar C ,
ends before the next occurrence of the delimiter character
.Ar C ,
and the escape sequence ends with that second
.Ar C .
Some escape sequences allow arbitrary characters
.Ar C
as quoting characters, some restrict the range of characters
that can be used as quoting characters.
.El
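.Pp
As a rough illustration of the three standard forms, the following C
sketch extracts an argument from the text following an escape sequence
identifier.
The helper
.Fn scan_standard_arg
is hypothetical and not part of
.Lb libmandoc ;
it ignores the delimited form and nested escape sequences,
so it should not be read as the actual implementation.
.Bd -literal -offset indent
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libmandoc.
 * p points just after the escape sequence identifier.
 * On success, store the argument in *start and *sz and return
 * a pointer to the character after the end of the sequence.
 * Return NULL if the input ends before the argument does.
 */
static const char *
scan_standard_arg(const char *p, const char **start, size_t *sz)
{
	const char *close;

	switch (*p) {
	case '[':	/* bracketed form: [argument] */
		close = strchr(p + 1, ']');
		if (close == NULL)
			return NULL;
		*start = p + 1;
		*sz = (size_t)(close - *start);
		return close + 1;
	case '(':	/* two-character short form: (ar */
		if (p[1] == 0 || p[2] == 0)
			return NULL;
		*start = p + 1;
		*sz = 2;
		return p + 3;
	case 0:		/* end of input, no argument */
		return NULL;
	default:	/* one-character short form: a */
		*start = p;
		*sz = 1;
		return p + 1;
	}
}
.Ed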
.Pp
Upon function entry,
.Fa end
is expected to point to the escape sequence identifier.
The values passed in as
.Fa start
and
.Fa sz
are ignored and overwritten.
.Pp
By design, this function cannot handle those
.Xr roff 7
escape sequences that require in-place expansion, in particular
user-defined strings
.Ic \e* ,
number registers
.Ic \en ,
width measurements
.Ic \ew ,
and numerical expression control
.Ic \eB .
These are handled by
.Fn roff_res ,
a private preprocessor function called from
.Fn roff_parseln ,
see the file
.Pa roff.c .
.Pp
The function
.Fn mandoc_escape
is used
.Bl -dash -compact -width 2n
.It
recursively by itself, because some escape sequence arguments can
in turn contain other escape sequences,
.It
for error detection internally by the
.Xr roff 7
parser part of the
.Lb libmandoc ,
see the file
.Pa roff.c ,
.It
above all externally by the
.Xr mandoc 1
formatting modules, in particular
.Fl Tascii
and
.Fl Thtml ,
for formatting purposes, see the files
.Pa term.c
and
.Pa html.c ,
.It
and rarely externally by high-level utilities using the mandoc library,
for example
.Xr makewhatis 8 ,
to purge escape sequences from text.
.El
.Sh RETURN VALUES
Upon function return, the pointer
.Fa end
is set to the character after the end of the escape sequence,
such that the calling higher-level parser can easily continue.
.Pp
For escape sequences taking an argument, the pointer
.Fa start
is set to the beginning of the argument and
.Fa sz
is set to the length of the argument.
For escape sequences not taking an argument,
.Fa start
is set to the character after the end of the sequence and
.Fa sz
is set to 0.
Both
.Fa start
and
.Fa sz
may be
.Dv NULL ;
in that case, the argument and the length are not returned.
.Pp
For sequences taking an argument, the function
.Fn mandoc_escape
returns one of the following values:
.Bl -tag -width 2n
.It Dv ESCAPE_FONT
The escape sequence
.Ic \ef
taking an argument in standard form:
.Ic \ef[ , \ef( , \ef Ns Ar a .
Two-character arguments starting with the character
.Sq C
are reduced to one-character arguments by skipping the
.Sq C .
More specific values are returned for the most commonly used arguments:
.Bl -column "argument" "ESCAPE_FONTITALIC"
.It argument Ta return value
.It Cm R No or Cm 1 Ta Dv ESCAPE_FONTROMAN
.It Cm I No or Cm 2 Ta Dv ESCAPE_FONTITALIC
.It Cm B No or Cm 3 Ta Dv ESCAPE_FONTBOLD
.It Cm P Ta Dv ESCAPE_FONTPREV
.It Cm BI Ta Dv ESCAPE_FONTBI
.El
.It Dv ESCAPE_SPECIAL
The escape sequence
.Ic \eC
taking an argument delimited with the single quote character
and, as a special exception, the escape sequences
.Em not
having an identifier, that is, those where the argument, in standard
form, directly follows the initial backslash:
.Ic \eC' , \e[ , \e( , \e Ns Ar a .
Note that the one-character argument short form can only be used for
argument characters that do not clash with escape sequence identifiers.
.Pp
If the argument matches one of the forms described below under
.Dv ESCAPE_UNICODE ,
that value is returned instead.
.Pp
The
.Dv ESCAPE_SPECIAL
special character escape sequences can be rendered using the functions
.Fn mchars_spec2cp
and
.Fn mchars_spec2str
described in the
.Xr mchars_alloc 3
manual.
.It Dv ESCAPE_UNICODE
Escape sequences of the same format as described above under
.Dv ESCAPE_SPECIAL ,
but with an argument of the forms
.Ic u Ns Ar XXXX ,
.Ic u Ns Ar YXXXX ,
or
.Ic u10 Ns Ar XXXX
where
.Ar X
and
.Ar Y
are hexadecimal digits and
.Ar Y
is not zero:
.Ic \eC'u , \e[u .
As a special exception,
.Fa start
is set to the character after the
.Ic u ,
and the
.Fa sz
return value does not include the
.Ic u
either.
.Pp
Such Unicode character escape sequences can be rendered using the function
.Fn mchars_num2uc
described in the
.Xr mchars_alloc 3
manual.
.It Dv ESCAPE_NUMBERED
The escape sequence
.Ic \eN
followed by a delimited argument.
The delimiter character is arbitrary except that digits cannot be used.
If a digit is encountered instead of the opening delimiter, that
digit is considered to be the argument and the end of the sequence, and
.Dv ESCAPE_IGNORE
is returned.
.Pp
Such ASCII character escape sequences can be rendered using the function
.Fn mchars_num2char
described in the
.Xr mchars_alloc 3
manual.
.It Dv ESCAPE_IGNORE
.Bl -bullet -width 2n
.It
The escape sequence
.Ic \es
followed by an argument in standard form or by an argument delimited
by the single quote character:
.Ic \es' , \es[ , \es( , \es Ns Ar a .
As a special exception, an optional
.Sq +
or
.Sq \-
character is allowed after the
.Sq s
for all forms.
.It
The escape sequences
.Ic \eF ,
.Ic \eg ,
.Ic \ek ,
.Ic \eM ,
.Ic \em ,
.Ic \en ,
.Ic \eV ,
and
.Ic \eY
followed by an argument in standard form.
.It
The escape sequences
.Ic \eA ,
.Ic \eb ,
.Ic \eD ,
.Ic \eo ,
.Ic \eR ,
.Ic \eX ,
and
.Ic \eZ
followed by an argument delimited by an arbitrary character.
.It
The escape sequences
.Ic \eH ,
.Ic \eh ,
.Ic \eL ,
.Ic \el ,
.Ic \eS ,
.Ic \ev ,
and
.Ic \ex
followed by an argument delimited by a character that cannot occur
in numerical expressions.
However, if any character that can occur in numerical expressions
is found instead of a delimiter, the sequence is considered to end
with that character, and
.Dv ESCAPE_ERROR
is returned.
.El
.It Dv ESCAPE_ERROR
Escape sequences taking an argument but not matching any of the above patterns.
In particular, that happens if the end of the logical input line
is reached before the end of the argument.
.El
.Pp
For sequences that do not take an argument, the function
.Fn mandoc_escape
returns one of the following values:
.Bl -tag -width 2n
.It Dv ESCAPE_SKIPCHAR
The escape sequence
.Qq \ez .
.It Dv ESCAPE_NOSPACE
The escape sequence
.Qq \ec .
.It Dv ESCAPE_IGNORE
The escape sequences
.Qq \ed
and
.Qq \eu .
.El
.Sh FILES
This function is implemented in
.Pa mandoc.c .
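.Sh EXAMPLES
The argument forms listed under
.Dv ESCAPE_UNICODE
can be checked roughly as in the following C sketch.
The helper
.Fn is_unicode_name
is hypothetical and not part of
.Lb libmandoc ;
it assumes that both uppercase and lowercase hexadecimal digits
are acceptable, and it expects
.Fa name
to include the leading
.Ic u .
It only checks length and digits and makes no claim to match
the checks actually performed in
.Pa mandoc.c .
.Bd -literal -offset indent
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libmandoc:
 * return 1 if the sz bytes at name match uXXXX, uYXXXX
 * (Y not zero), or u10XXXX, and 0 otherwise.
 */
static int
is_unicode_name(const char *name, size_t sz)
{
	static const char hex[] = "0123456789ABCDEFabcdef";
	size_t i;

	/* One 'u' followed by four to six digits. */
	if (sz < 5 || sz > 7 || name[0] != 'u')
		return 0;
	/* Five digits: the first one must not be zero. */
	if (sz == 6 && name[1] == '0')
		return 0;
	/* Six digits: the first two must be "10". */
	if (sz == 7 && (name[1] != '1' || name[2] != '0'))
		return 0;
	/* Everything after the 'u' must be a hexadecimal digit. */
	for (i = 1; i < sz; i++)
		if (memchr(hex, name[i], sizeof(hex) - 1) == NULL)
			return 0;
	return 1;
}
.Ed
.Pp
Under these assumptions,
.Qq u2014
and
.Qq u10FFFF
would be accepted, whereas
.Qq u02014
and
.Qq u110000
would be rejected.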
.Sh SEE ALSO
.Xr mchars_alloc 3 ,
.Xr mandoc_char 7 ,
.Xr roff 7
.Sh HISTORY
This function has been available since mandoc 1.11.2.
.Sh AUTHORS
.An Kristaps Dzonsons Aq Mt kristaps@bsd.lv
.An Ingo Schwarze Aq Mt schwarze@openbsd.org
.Sh BUGS
The function doesn't cleanly distinguish between sequences that are
valid and supported, valid and ignored, valid and unsupported,
syntactically invalid, or undefined.
For sequences that are ignored or unsupported, it doesn't tell
whether that deficiency is likely to cause major formatting problems
and/or loss of document content.
The function is already rather complicated and still parses some
sequences incorrectly.
.
.ig
For these sequences, the list given below specifies a starting string
and either the length of the argument or an ending character.
The argument starts after the starting string.
In the former case, the sequence ends with the end of the argument.
In the latter case, the argument ends before the ending character,
and the sequence ends with the ending character.
..