Automated re-typesetting, indexing and content enhancement for scanned marriage registers

Brailsford, David F.

Authors

David F. Brailsford



Abstract

For much of England and Wales, marriage registers began to be kept in 1537. The marriage details were recorded locally, in longhand, until 1st July 1837, when central records began. All registers were kept in the local parish church.
Between 1896 and 1922 the Phillimore company of London, using volunteer help, attempted to transcribe marriage registers for as many English parishes as possible and to have them printed.
This paper describes an experiment in the automated re-typesetting of Volume 2 of the 15-volume Phillimore series relating to the county of Derbyshire. The source material was plain text derived from running Optical Character Recognition (OCR) on a set of page scans taken from the original printed volume.
The aim of the experiment was to avoid labour-intensive page-by-page rebuilding with tools such as Acrobat Capture. Instead, it proved possible to capitalise on the regular, tabular structure of the Register pages to automate the re-typesetting process, using UNIX troff software and its tbl preprocessor. A series of simple software tools helped to bring about the OCR-to-troff transformation.
However, the re-typesetting of the text was not just an end in itself but also a step towards content enhancement and content repurposing. This included the indexing of the marriage entries and their potential transformation into XML and GEDCOM notations. The experiment has shown that, for highly regular material, the efforts of one programmer, with suitable low-level tools, can be far more effective than attempting to recreate the printed material using WYSIWYG software.

Conference Name ACM Symposium on Document Engineering (DocEng '09)
End Date Sep 18, 2009
Publication Date Sep 1, 2009
Deposit Date Feb 24, 2015
Publicly Available Date Feb 24, 2015
Peer Reviewed Peer Reviewed
Keywords Re-typesetting, GEDCOM, OCR, troff, genealogy, hyperlinking, indexing.
Public URL https://nottingham-repository.worktribe.com/output/1013594
Publisher URL http://dl.acm.org/citation.cfm?doid=1600193.1600202
Additional Information Published in: DocEng '09: proceedings of the 9th ACM Symposium on Document Engineering. New York : ACM, 2009, ISBN: 978-1-60558-575-8. pp. 29-38, doi: 10.1145/1600193.1600202
