Coda to the discussion on converting the HTML s6 documentation
Hi all,
i've received an email offlist asking some clarifying questions
about automating the conversion of the current HTML s6
documentation, and i thought it might be useful to post some of
the things i noted in my reply.
The issue isn't that the HTML is unparseable (it's not). A tool
like `pandoc` can be used to convert the pages into other formats,
including roff. Over at Void, we recently tried to make use of
`pandoc` to create a man page for Érico's neat `void-docs` script,
which allows viewing the Void Handbook locally in a number of
formats. What i found is that pandoc produced roff that was fine
visually, but which relied on presentational markup rather than
semantic markup. i'll return to this issue below.
The issue is twofold:
* Things like bare "<em>" tags (i.e. without a 'class' attribute
describing their contents) are used in the HTML to convey
multiple types of information that mdoc/roff
distinguishes. Sometimes an "<em>" is used for an argument (Ar
in mdoc), sometimes it's simply used for emphasis (Em in
mdoc). Similarly, bare "<tt>" tags are used for a path (Pa in
mdoc), function types (Ft in mdoc),
functions (Fn in mdoc), libraries (which could have a man page
that should be cross-referenced with an Xr macro), and so on. A
human is needed to decide the semantics involved (e.g. for
Casper's putative IL), based on context.
* Many things /simply aren't marked up at all/. The example i gave
in my earlier post was environment variables: again, a human is
needed to decide whether something in ALLCAPS is an env var, a
cpp macro, or something else altogether (like a reference to the
'TAI64' concept).
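
To make the first point concrete, here is a minimal Python sketch. It
assumes a made-up HTML fragment (the names 'frob', '/run/frob' and
'frob_init()' are hypothetical, not taken from the s6 pages) and simply
lists what a parser can see in bare "<em>" and "<tt>" tags:

```python
from html.parser import HTMLParser


class BareTagCollector(HTMLParser):
    """Collect the text of <em> and <tt> elements that have no class attribute."""

    def __init__(self):
        super().__init__()
        self.found = []    # list of (tag, text) pairs, in document order
        self._open = None  # name of the tag currently being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._open is None and tag in ("em", "tt") and "class" not in dict(attrs):
            self._open = tag
            self._text = []

    def handle_data(self, data):
        if self._open is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._open:
            self.found.append((tag, "".join(self._text)))
            self._open = None


# Hypothetical fragment, written for illustration only.
sample = (
    "<p><tt>frob</tt> reads <em>dir</em> and <em>never</em> writes to "
    "<tt>/run/frob</tt>; see <tt>frob_init()</tt>.</p>"
)

collector = BareTagCollector()
collector.feed(sample)
collector.close()
for tag, text in collector.found:
    print(tag, repr(text))
```

The parser has no trouble finding the elements, but the tag name is all
it gets: nothing tells it that 'dir' is an argument (Ar) while 'never'
is plain emphasis (Em), or that the three "<tt>" items are a command, a
path (Pa) and a function (Fn). That mapping still has to come from a
person reading the context.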
The question might be asked: "Well, who cares? Why care about
semantic markup? As long as the visual output is the same, what's
the issue?" Two things:
* Having the documentation source use semantic markup as much as
possible facilitates conversion between formats. `mandoc(1)`
doesn't only output man pages from mdoc source: it can also
produce HTML (used on man.voidlinux.org, with some custom CSS
for Void theming), PDF, PostScript, Markdown and plain ASCII. So
if things like flags, arguments, paths, environment variables,
variable types, variables, function types, functions etc. are
marked up in the mdoc source, a PDF (for example) can be styled
appropriately for each case.
* Additionally, extensive semantic markup has a direct benefit to
end-users: the ability to use `apropos` (as shipped with mandoc)
to find relevant content. For example, say one wished to find
all uses of the 'GID' env var in the s6 man pages. One could use
`apropos 'Ev=GID' | grep s6-`. (This sort of use-case is part of
why i've made sure the names of all the man pages i'm creating
are prefixed with "s6-".) Similarly, one could search for all
mentions of the 'notification-fd' file with `apropos
'Pa~.*notification-fd'`, with the '~' indicating an extended
regular expression. However, this won't work without the
relevant markup in the sources.
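
As a rough illustration of why the markup matters for searching, here
is a toy Python model of a keyed search. It is not how mandoc's
database works internally; the page names, the index contents and the
`search` helper are all hypothetical, and Python's `re` syntax stands in
for extended regular expressions:

```python
import re

# Hypothetical index: page name -> (macro, text) pairs taken from its
# mdoc source. Page names and contents are made up for illustration.
index = {
    "s6-example-a": [("Ev", "GID"), ("Pa", "/run/service/notification-fd")],
    "s6-example-b": [("Ev", "UID")],
    "s6-example-c": [],  # mentions GID in its prose, but nothing is marked up
}


def search(index, query):
    """Return sorted page names matching 'Key=substring' or 'Key~regex'."""
    m = re.fullmatch(r"(\w+)([=~])(.*)", query)
    if m is None:
        raise ValueError(f"unsupported query: {query!r}")
    key, op, pattern = m.groups()
    if op == "=":
        # Simplification: plain case-sensitive substring test.
        def matches(text):
            return pattern in text
    else:
        rx = re.compile(pattern)

        def matches(text):
            return rx.search(text) is not None
    return sorted(
        page
        for page, entries in index.items()
        if any(macro == key and matches(text) for macro, text in entries)
    )


print(search(index, "Ev=GID"))
print(search(index, "Pa~.*notification-fd"))
```

The third page never shows up in either result: with no Ev or Pa
entries recorded for it, there is nothing for a keyed query to match,
whatever its prose says.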
Fwiw, my suggestion, for those interested in converting the
documentation to the One True Format as decided by Laurent, would be
to leverage my efforts to use semantic markup extensively in the
man pages. Once the s6-man-pages repo is ready, use `mandoc -T
html` to convert the pages to HTML, which will contain consistent
semantic markup (e.g. '<h1 class="Sh" id="DESCRIPTION">'). That
HTML can then be parsed and converted to the One True Format, an
authoritative source from which man pages and HTML can be
produced.
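
A minimal Python sketch of that last step, assuming the HTML carries
the mdoc macro name in each element's 'class' attribute, as in the "Sh"
example above. The sample fragment is hand-written rather than real
`mandoc -T html` output, the element names used for Ev and Pa are
guesses, and the (macro, text) tuples are only a stand-in for whatever
the One True Format turns out to be:

```python
from html.parser import HTMLParser

# mdoc macros to pull out; extend as needed.
WANTED = {"Sh", "Ev", "Pa"}


class MacroCollector(HTMLParser):
    """Collect (macro, text) pairs from elements whose class names an mdoc macro."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._tag = None   # tag name of the element being collected
        self._cls = None   # its class, i.e. the mdoc macro name
        self._text = []

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if self._tag is None and cls in WANTED:
            self._tag, self._cls, self._text = tag, cls, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.items.append((self._cls, "".join(self._text).strip()))
            self._tag = None


# Hand-written sample; only the class attributes matter to the collector.
sample = """
<h1 class="Sh" id="ENVIRONMENT"><a class="permalink" href="#ENVIRONMENT">ENVIRONMENT</a></h1>
<p>If <code class="Ev">GID</code> is set, <span class="Pa">notification-fd</span> is read.</p>
"""

collector = MacroCollector()
collector.feed(sample)
collector.close()
for macro, text in collector.items:
    print(macro, text)
```

Because the semantics are carried by the class attribute, the converter
never has to guess what a given piece of text is; it only has to map
each macro name to the corresponding construct in the target format.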
Alexis.
Received on Wed Sep 02 2020 - 09:59:10 UTC