O. Ibrahim
Term frequency with average term occurrences for textual information retrieval
Ibrahim, O.; Landa-Silva, Dario
Abstract
In the context of Information Retrieval (IR) from text documents, the term-weighting scheme (TWS) is a key component of the matching mechanism when using the vector space model (VSM). In this paper we propose a new TWS that is based on computing the average occurrences of terms in documents and that also uses a discriminative approach, based on the document centroid vector, to remove less significant weights from the documents. We call our approach Term Frequency With Average Term Occurrence (TF-ATO). An analysis of commonly used document collections shows that test collections are not fully judged, because achieving full judgement is expensive and may be infeasible for large collections. A document collection being fully judged means that every document in the collection acts as a relevant document to a specific query or a group of queries. The discriminative approach used in our proposal is a heuristic method for improving IR effectiveness and performance, and it has the advantage of not requiring previous knowledge about relevance judgements. We compare the performance of the proposed TF-ATO to the well-known TF-IDF approach and show that using TF-ATO results in better effectiveness in both static and dynamic document collections. In addition, this paper investigates the impact that stop-words removal and our discriminative approach have on TF-IDF and TF-ATO. The results show that both stop-words removal and the discriminative approach have a positive effect on both term-weighting schemes. More importantly, it is shown that the proposed discriminative approach improves IR effectiveness and performance without using any relevance-judgement information for the collection.
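The abstract names the two ingredients of TF-ATO but does not give formulas, so the following Python sketch is only one plausible reading and not necessarily the authors' exact definition. It assumes that a term's weight is its raw frequency divided by the document's average term occurrence (total term occurrences divided by the number of distinct terms), and that the discriminative step drops any weight that does not exceed the corresponding entry of the centroid vector (the mean weight of that term over all documents). The function name `tf_ato_weights` and the parameter `use_centroid_filter` are hypothetical.

```python
from collections import Counter


def tf_ato_weights(docs, use_centroid_filter=True):
    """Sketch of TF-ATO weighting under the assumptions stated above.

    docs: list of documents, each a list of (already tokenised) terms.
    Returns one dict per document mapping term -> weight.
    """
    weighted = []
    for tokens in docs:
        counts = Counter(tokens)
        if not counts:
            weighted.append({})
            continue
        # Assumed average term occurrence (ATO) of the document:
        # total occurrences divided by the number of distinct terms.
        ato = sum(counts.values()) / len(counts)
        weighted.append({term: tf / ato for term, tf in counts.items()})

    if not use_centroid_filter or not docs:
        return weighted

    # Assumed centroid vector: mean weight of each term over all documents,
    # counting a weight of 0 for documents that do not contain the term.
    centroid = Counter()
    for weights in weighted:
        for term, w in weights.items():
            centroid[term] += w / len(docs)

    # Assumed discriminative step: keep only weights above the centroid entry.
    return [
        {term: w for term, w in weights.items() if w > centroid[term]}
        for weights in weighted
    ]


if __name__ == "__main__":
    toy_docs = [["a", "a", "a", "b"], ["a", "b", "b", "c"]]
    for weights in tf_ato_weights(toy_docs):
        print(weights)
```

Note that neither step consults relevance judgements, which matches the property the abstract emphasises; the thresholding rule itself should be checked against the full paper before relying on it.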
Citation
Ibrahim, O., & Landa-Silva, D. (2016). Term frequency with average term occurrences for textual information retrieval. Soft Computing, 20(8), 3045-3061. https://doi.org/10.1007/s00500-015-1935-7
Journal Article Type | Article |
---|---|
Acceptance Date | Oct 30, 2015 |
Online Publication Date | Nov 28, 2015 |
Publication Date | Aug 1, 2016 |
Deposit Date | Jan 21, 2016 |
Publicly Available Date | Jan 21, 2016 |
Journal | Soft Computing |
Print ISSN | 1432-7643 |
Electronic ISSN | 1433-7479 |
Publisher | Springer Verlag |
Peer Reviewed | Peer Reviewed |
Volume | 20 |
Issue | 8 |
Pages | 3045-3061 |
DOI | https://doi.org/10.1007/s00500-015-1935-7 |
Keywords | Heuristic term-weighting scheme, Random term weights, Textual information retrieval, Discriminative approach, Stop-words removal |
Public URL | https://nottingham-repository.worktribe.com/output/975510 |
Publisher URL | http://link.springer.com/article/10.1007/s00500-015-1935-7 |
Additional Information | The final publication is available at Springer via http://dx.doi.org/10.1007/s00500-015-1935-7 |
Contract Date | Jan 21, 2016 |
Files
dls_soco2015.pdf
(594 Kb)
PDF