3. The Coding Scheme of the Lampeter Corpus
The aim of the mark-up employed in the Lampeter Corpus is to make as much as possible of the original layout features (cf. 3.1.) and background (cf. 3.2.) of the texts retrievable for the corpus user. Needless to say, such a procedure has certain limits, either because not all the desirable information could actually be found, or because the effort required for tagging exceeded its potential value. Thus, very regular or purely typographical layout features with little or no meaning were not given mark-up. Features ignored for mark-up purposes include:
- long S (rendered as normal modern "s" in all cases),
- ct-ligatures (rendered "ct"),
- line breaks in normal running prose text (as printing techniques in the 17th century were sufficiently advanced not to require orthographical variation to meet line length requirements), with the exception of end-of-line separation of words: these cases were marked by means of a special entity reference (&rehy;),
- font size (although this would admittedly have captured titlepage layout more effectively),
- catchwords,
- indentation of text,
- centering of text,
- spaces in the form of empty lines,
- quotation marks (of whatever function) at the beginning of successive lines of running text.
The mark-up system employed here is SGML, following the TEI (Text Encoding Initiative) guidelines (cf. Burnard & Sperberg-McQueen). A special form of it, a document type definition (DTD), was created by Lou Burnard to suit the particular requirements of the Lampeter Corpus.
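For users who wish to search for whole words regardless of the original line division, the end-of-line entity reference mentioned above could be resolved before searching. The following is only a minimal sketch in Python, not part of the corpus distribution. It assumes that &rehy; stands exactly at the point where a word was divided and may be followed by whitespace or a line break in the transcription; the function names are hypothetical.

```python
import re

# Assumption: "&rehy;" sits at the point where the word was divided,
# optionally followed by whitespace or a newline in the transcription.
REHY = re.compile(r"&rehy;\s*")


def rejoin_line_end_words(text: str) -> str:
    """Delete every &rehy; marker (and any whitespace after it),
    so that the two halves of a divided word are joined."""
    return REHY.sub("", text)


def count_line_end_splits(text: str) -> int:
    """Count how many words were divided at a line end."""
    return len(REHY.findall(text))


# Invented sample text, for illustration only.
sample = "the com&rehy;\nmon good of the king&rehy;dom"
print(rejoin_line_end_words(sample))
print(count_line_end_splits(sample))
```

Deleting the marker rather than replacing it with a hyphen is a design choice made for word searches; a user interested in the original line divisions would keep the entity in place and count it instead.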