Generating summary documents for a variable-quality PDF document collection
Hughes, Jacob; Brailsford, David F.; Bagley, Steven R.; Adams, Clive E.
Authors
Jacob Hughes
David F. Brailsford
Steven R. Bagley
Clive E. Adams
Abstract
The Cochrane Schizophrenia Group’s Register of studies details all aspects of the effects of treating people with schizophrenia. It has been gathered over the last 20 years and consists of around 20,000 documents, overwhelmingly in PDF. Document collections of this sort – on a given theme but gathered from a wide range of sources – will generally have huge variability in the quality of the PDF, particularly with respect to the key property of text searchability.
Summarising the results from the best of these papers, to allow evidence-based health care decision making, has so far been done by manually creating a summary document, starting from a visual inspection of the relevant PDF file. This labour-intensive process has resulted, to date, in only 4,000 of the papers being summarised – with enormous duplication of effort and with many issues around the validity and reliability of the data extraction.
This paper describes a pilot project to provide a computer-assisted framework in which any of the PDF documents could be searched for the occurrence of some 8,000 keywords and key phrases. Once keyword tagging has been completed, the framework assists in the generation of a standard summary document, thereby greatly speeding up the production of these summaries. Early examples of the framework are described and its capabilities illustrated.
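The abstract does not say how the keyword search is implemented. As a rough illustration only, the tagging step might look like the Python sketch below. It assumes the text has already been extracted from the PDF (by a text layer or OCR), and the function names `build_matcher`, `tag_text` and `summary_lines`, as well as the sample keywords in the usage note, are hypothetical and not taken from the paper.

```python
import re
from collections import Counter


def build_matcher(keywords):
    """Compile one case-insensitive pattern matching any keyword or key phrase.

    Hypothetical helper: the paper does not describe its matching strategy.
    """
    # Normalise and de-duplicate; put longer phrases first so that
    # "tardive dyskinesia" is preferred over "dyskinesia" at the same position.
    ordered = sorted(
        {" ".join(k.lower().split()) for k in keywords if k.strip()},
        key=len,
        reverse=True,
    )
    if not ordered:
        raise ValueError("at least one keyword is required")
    # Allow any whitespace (including line breaks left by text extraction)
    # between the words of a phrase.
    alternatives = [r"\s+".join(re.escape(w) for w in k.split()) for k in ordered]
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def tag_text(text, matcher):
    """Count keyword occurrences in text already extracted from a PDF."""
    counts = Counter()
    for match in matcher.finditer(text):
        # Collapse internal whitespace so phrase variants share one key.
        counts[" ".join(match.group(0).lower().split())] += 1
    return counts


def summary_lines(doc_id, counts):
    """Render the tag counts as plain lines for a summary document."""
    lines = [f"Document: {doc_id}"]
    for keyword, n in sorted(counts.items()):
        lines.append(f"  {keyword}: {n}")
    return lines
```

A possible use, with made-up keywords, would be `tag_text(extracted_text, build_matcher(["haloperidol", "placebo", "tardive dyskinesia"]))`. A single alternation pattern is simple but may be slow for a list of around 8,000 entries; a multi-pattern algorithm such as Aho-Corasick would be one alternative.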
Citation
Hughes, J., Brailsford, D. F., Bagley, S. R., & Adams, C. E. (2014, September). Generating summary documents for a variable-quality PDF document collection. Presented at ACM Symposium on Document Engineering (DocEng '14).
Conference Name | ACM Symposium on Document Engineering (DocEng '14) |
---|---|
End Date | Sep 19, 2014 |
Publication Date | Sep 1, 2014 |
Deposit Date | Mar 17, 2015 |
Publicly Available Date | Mar 17, 2015 |
Peer Reviewed | Peer Reviewed |
Keywords | Schizophrenia; PDF; OCR; document collections |
Public URL | https://nottingham-repository.worktribe.com/output/994508 |
Publisher URL | http://dx.doi.org/10.1145/2644866.2644892 |
Additional Information | Published in: DocEng '14: proceedings of the 14th ACM Symposium on Document Engineering. New York : ACM, 2014, ISBN: 978-1-4503-2949-1. pp. 49-52, doi: 10.1145/2644866.2644892 |
Files
eprinthughes13.pdf
(2.2 Mb)
PDF